7 min read

Production RAG Patterns: Beyond the Tutorial

RAG · AI · Vector DB · Gen AI · Production · ChromaDB

Retrieval-Augmented Generation looks simple in tutorials. Embed your documents, store in a vector DB, retrieve top-K, prepend to prompt. Done. Then you try it in production and accuracy is 60%. Latency is 4 seconds. Costs are unpredictable. The model hallucinates things that aren't in the retrieved context.

After building RAG into FieldFix and several other systems, here's what actually works.

The Tutorial Pattern (And Why It Fails)

The standard tutorial:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

store = Chroma.from_documents(chunks, OpenAIEmbeddings())
results = store.similarity_search(query, k=5)

prompt = f"Context: {results}\n\nQuestion: {query}"
answer = llm.generate(prompt)

This pattern fails in production because:

1. Chunk boundaries cut concepts in half. A 1000-character chunk might split a definition from its explanation.
2. Cosine similarity is not relevance. A semantically similar chunk might not actually answer the question.
3. No reranking. Top-5 by embedding similarity is rarely top-5 by usefulness.
4. No retrieval evaluation. You don't know if your retrieval is actually finding the right chunks.
5. No structured fallback. If retrieval finds nothing useful, the LLM still tries to answer.

Chunk Strategy: Semantic, Not Character-Based

Fixed-size chunks split mid-sentence, mid-concept, mid-table. The fix is semantic chunking — splitting at natural boundaries.

from llama_index.core.node_parser import MarkdownNodeParser, SentenceWindowNodeParser

# For markdown docs: split on headers, preserve hierarchy
md_parser = MarkdownNodeParser(
    include_metadata=True,
    include_prev_next_rel=True,
)

# For prose: window of N sentences centered on each sentence
sentence_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)

Better: chunk by document structure (headers, sections, paragraphs), not character count. Each chunk should be a self-contained semantic unit.

For FieldFix's repair docs, I chunked by procedure step — each step is one chunk, with metadata linking back to the parent procedure. Retrieval returns a single relevant step plus its procedure context.
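A minimal sketch of that step-level chunking idea. The `StepChunk` type and field names here are hypothetical, not FieldFix's actual schema — the point is that each step carries metadata linking back to its parent procedure:

```python
from dataclasses import dataclass, field

@dataclass
class StepChunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_procedure(procedure_id: str, title: str, steps: list[str]) -> list[StepChunk]:
    """One chunk per procedure step, each carrying parent-procedure metadata."""
    return [
        StepChunk(
            text=step,
            metadata={
                "procedure_id": procedure_id,
                "procedure_title": title,
                "step_number": i,
                "total_steps": len(steps),
            },
        )
        for i, step in enumerate(steps, start=1)
    ]
```

At query time, a hit on any step can pull in its siblings via `procedure_id` to restore procedure context.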

Hybrid Retrieval: BM25 + Embeddings

Embeddings capture semantic similarity. Keyword search captures exact term matches. Both signals matter — neither is sufficient alone.

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self, chunks: list[str], embedder, vector_store):
        # BM25 keyword index
        tokenized = [c.lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)
        self.chunks = chunks

        # Vector index (existing)
        self.vector_store = vector_store
        self.embedder = embedder

    def retrieve(self, query: str, k: int = 10) -> list[dict]:
        # Vector retrieval
        query_emb = self.embedder.encode(query)
        vec_results = self.vector_store.query(query_emb, n_results=k)

        # BM25 retrieval
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_top_idx = sorted(
            range(len(bm25_scores)),
            key=lambda i: bm25_scores[i],
            reverse=True,
        )[:k]

        # Reciprocal Rank Fusion
        return self._fuse(vec_results, bm25_top_idx, k=k)

    def _fuse(self, vec_results, bm25_indices, k=5):
        # RRF: score = sum(1 / (60 + rank_i)) across all ranking systems
        scores = {}
        for rank, idx in enumerate(vec_results.indices):
            scores[idx] = scores.get(idx, 0) + 1 / (60 + rank)
        for rank, idx in enumerate(bm25_indices):
            scores[idx] = scores.get(idx, 0) + 1 / (60 + rank)

        top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
        return [{"index": i, "score": s, "text": self.chunks[i]} for i, s in top]

The hybrid approach catches cases where embeddings miss exact terminology (e.g. specific part numbers) and where keyword search misses semantic similarity (e.g. paraphrases).
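The fusion step is worth seeing in isolation. Here is Reciprocal Rank Fusion over plain ranked ID lists, using the same constant (60) as the class above:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 5, c: int = 60) -> list[str]:
    """RRF: each document scores sum(1 / (c + rank)) across all ranking systems."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first
    return [doc_id for doc_id, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]]
```

Note the behavior this buys you: a document ranked moderately by both systems outscores one ranked first by only a single system, which is exactly the agreement signal you want from hybrid retrieval.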

Reranking: The Single Highest-Leverage Optimization

After initial retrieval, rerank the top-K with a cross-encoder. The cross-encoder evaluates query-document pairs explicitly rather than using pre-computed embeddings.

from sentence_transformers import CrossEncoder

class Reranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(self, query: str, candidates: list[dict], top_n: int = 3) -> list[dict]:
        if not candidates:
            return []
        pairs = [(query, c["text"]) for c in candidates]
        scores = self.model.predict(pairs)
        scored = list(zip(candidates, scores))
        scored.sort(key=lambda x: x[1], reverse=True)
        # Spread the candidate first so the reranker score overrides any
        # retrieval-stage "score" key, not the other way around
        return [{**c, "score": float(s)} for c, s in scored[:top_n]]

The full pipeline:
1. Hybrid retrieve top 20 candidates (broad recall)
2. Rerank to top 3-5 (precision)
3. Pass to LLM

In FieldFix testing, adding a reranker improved answer accuracy from 68% to 87% on the validation set. Single biggest quality win.
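The two-stage shape is simple enough to capture as glue code. This is a sketch, with the retriever and reranker passed in as callables rather than the concrete classes above:

```python
def rag_retrieve(query: str, retrieve_fn, rerank_fn,
                 broad_k: int = 20, final_n: int = 5) -> list[dict]:
    """Two-stage retrieval: broad hybrid recall, then cross-encoder precision."""
    candidates = retrieve_fn(query, k=broad_k)        # stage 1: recall
    return rerank_fn(query, candidates, top_n=final_n)  # stage 2: precision
```

The asymmetry in `broad_k` vs `final_n` is the whole design: let the cheap stage over-fetch, let the expensive stage be picky.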

Retrieval Evaluation: Without It You're Flying Blind

The core failure mode of RAG systems: you don't know if retrieval is working. The LLM produces fluent answers regardless. You only notice when accuracy is bad.

Build retrieval evaluation into the pipeline:

from dataclasses import dataclass

@dataclass
class RetrievalEval:
    query: str
    expected_chunk_ids: set[str]
    retrieved_chunk_ids: list[str]

    @property
    def recall_at_k(self) -> float:
        if not self.expected_chunk_ids:
            return 1.0
        found = sum(
            1 for cid in self.retrieved_chunk_ids
            if cid in self.expected_chunk_ids
        )
        return found / len(self.expected_chunk_ids)

    @property
    def mrr(self) -> float:
        """Mean Reciprocal Rank"""
        for rank, cid in enumerate(self.retrieved_chunk_ids, start=1):
            if cid in self.expected_chunk_ids:
                return 1.0 / rank
        return 0.0

Build a small eval set (50-100 queries with expected relevant chunks) and run it on every retrieval pipeline change. If recall@5 drops, something broke.

This is the most important thing to set up early. Everything else is guessing without it.
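A harness over such an eval set can be a few lines. This sketch assumes each eval case records a query and the set of expected chunk IDs, and that the retriever returns dicts with a `chunk_id` key — adapt the field names to your pipeline:

```python
def recall_at_k(expected: set[str], retrieved: list[str]) -> float:
    if not expected:
        return 1.0
    return sum(1 for cid in retrieved if cid in expected) / len(expected)

def mrr(expected: set[str], retrieved: list[str]) -> float:
    for rank, cid in enumerate(retrieved, start=1):
        if cid in expected:
            return 1.0 / rank
    return 0.0

def run_eval(eval_set: list[dict], retrieve_fn, k: int = 5) -> dict:
    """Run every eval query through the retriever and average the metrics."""
    recalls, mrrs = [], []
    for case in eval_set:
        retrieved = [r["chunk_id"] for r in retrieve_fn(case["query"], k=k)]
        recalls.append(recall_at_k(case["expected"], retrieved))
        mrrs.append(mrr(case["expected"], retrieved))
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "mrr": sum(mrrs) / len(mrrs),
    }
```

Run it in CI: fail the build if recall@5 drops below your current baseline.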

Prompt Engineering for RAG

The retrieved context should be structured, not dumped:

PROMPT_TEMPLATE = """You are a repair diagnostics assistant. Use ONLY the provided context to answer.
If the context doesn't contain the answer, say "I don't have information about that specific issue."

Context:
{context_blocks}

Question: {query}

Instructions:
- Cite the specific source (e.g., [Source 1]) for each claim
- If multiple sources conflict, note the conflict
- If safety is mentioned in any source, prioritize that"""

def build_context_blocks(retrieved: list[dict]) -> str:
    blocks = []
    for i, r in enumerate(retrieved, start=1):
        source = r.get("metadata", {}).get("source", "unknown")
        blocks.append(f"[Source {i} — {source}]\n{r['text']}\n")
    return "\n".join(blocks)

Key elements:
1. Explicit grounding instruction — "use ONLY the provided context"
2. Citation requirement — forces the model to anchor to specific sources
3. Fallback instruction — explicit "I don't know" path
4. Numbered sources — makes citations parseable
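Because the sources are numbered, citations become machine-checkable. A minimal parser — the exact `[Source N]` pattern matches the template above; anything fancier is an assumption about your model's output:

```python
import re

def extract_citations(answer: str, num_sources: int) -> tuple[set[int], set[int]]:
    """Parse [Source N] citations; return (valid cited, invalid) source numbers."""
    cited = {int(m) for m in re.findall(r"\[Source (\d+)\]", answer)}
    valid = set(range(1, num_sources + 1))
    return cited & valid, cited - valid
```

An answer citing a source number you never provided is a strong hallucination signal — cheap to flag in production.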

Operational Realities

Embedding Drift

If you upgrade your embedding model, every embedding in your store is now wrong. Plan for this:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class IndexedChunk:
    text: str
    embedding: list[float]
    embedding_model: str  # ← track this
    embedding_version: str  # ← and this
    indexed_at: datetime

When you switch models, you need to re-embed everything. Have a versioning strategy.
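With those fields tracked, finding chunks that need re-embedding is a filter. A sketch over plain dicts — the "current" model and version constants here are illustrative placeholders:

```python
CURRENT_MODEL = "all-MiniLM-L6-v2"  # placeholder: whatever you embed with today
CURRENT_VERSION = "2024-06"          # placeholder version tag

def find_stale(chunks: list[dict]) -> list[dict]:
    """Chunks embedded with any other model or version must be re-embedded."""
    return [
        c for c in chunks
        if (c["embedding_model"], c["embedding_version"]) != (CURRENT_MODEL, CURRENT_VERSION)
    ]
```

Run this as a migration check after any model change: mixing embeddings from two models in one index silently breaks similarity search, since the spaces are not comparable.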

Cost Modeling

For self-hosted embeddings (like all-MiniLM-L6-v2):
- Embedding generation: ~free at inference time
- Storage: proportional to chunk count × embedding dim

For API embeddings (like text-embedding-3-small):
- Generation: $0.02 per 1M tokens (small)
- One-time cost for initial indexing
- Recurring cost for new docs and re-embedding

For a 295-chunk knowledge base (FieldFix scale), local embeddings won. For a 10M-chunk enterprise corpus, API embeddings with caching are the better fit.
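The arithmetic, using the $0.02 per 1M token price above and an assumed ~200 tokens per chunk:

```python
def embedding_cost_usd(num_chunks: int, avg_tokens_per_chunk: int,
                       price_per_million: float = 0.02) -> float:
    """One-pass embedding cost at a per-million-token API price."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million

small = embedding_cost_usd(295, 200)         # FieldFix scale: fractions of a cent
large = embedding_cost_usd(10_000_000, 200)  # 10M-chunk corpus: ~$40 one-time
```

Under these assumptions even the 10M-chunk initial index is cheap; what dominates over time is the recurring cost of re-embedding on every model upgrade and document churn, which is where caching pays off.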

Latency Budget

End-to-end RAG latency breakdown:


Query embedding              : 5ms (local) / 100ms (API)
Vector retrieval (top-20)    : 50-200ms
BM25 retrieval               : 10ms
Reranking (cross-encoder)    : 200-400ms
LLM generation (4B local)    : 1500-3000ms
LLM generation (API)         : 800-2500ms
─────────────────────────────────────────
Total                        : 2-4s typical

If you need <1s latency, you're in trouble. The model generation is the bottleneck. Options:
- Smaller model (Phi-3 mini, Gemma 2B)
- Streaming output (perceived latency drops)
- Speculative decoding
- Caching common queries

When NOT to Use RAG

RAG is hyped. It's not always the right answer:

- Small, static knowledge: just put it in the system prompt. RAG adds infrastructure for no benefit.
- Frequently changing facts: RAG needs re-indexing. Consider a structured DB + SQL retrieval instead.
- Tasks requiring reasoning across many documents: standard RAG retrieves K chunks but can't reason about the union. You need agentic retrieval or graph-based approaches.
- Single-document Q&A: just include the full document in context. Modern context windows are 100K+ tokens.

The Stack That Works

After several production deployments, this stack reliably works:


Embeddings: all-MiniLM-L6-v2 (local) or text-embedding-3-small (API)
Vector store: ChromaDB (local, embedded) or Pinecone (cloud)
Hybrid: + BM25 via rank_bm25
Reranker: ms-marco-MiniLM-L-6-v2 (cross-encoder)
Eval: custom recall@k + MRR on hand-built eval set
LLM: depends on use case (Gemma 4B local, GPT-4o or Claude API)

Nothing exotic. The differentiation isn't in the stack — it's in the chunk strategy, the eval discipline, and the reranking.

The Real Lesson

RAG that works in production isn't built with a better vector database. It's built with:

1. Evaluation infrastructure so you know what's broken
2. Hybrid retrieval because no single signal is enough
3. Reranking because initial retrieval is imprecise
4. Semantic chunking because character chunks destroy meaning
5. Structured prompting that forces grounding

Skip any of these and you'll spend months tuning embedding models that aren't the problem. Include all five and the system actually works.
