How to Build a RAG Pipeline from Scratch (With Code)

Retrieval-Augmented Generation (RAG) is behind many of the AI features at product companies that answer questions about private or fast-changing data. Instead of hoping a language model memorised your data during training, RAG lets you inject the right context at query time. It's the difference between "the model might know" and "the model knows, because we told it."

This guide walks through building a RAG pipeline from scratch — what each step does, why it matters, and code you can actually run.

What Problem Does RAG Solve?

Large language models are trained on a snapshot of the internet. They don't know:

Your product documentation
Your company's internal policies
Anything that happened after their training cutoff
Anything that was never public

When users ask about these things, a bare LLM either hallucinates or says it doesn't know. RAG fixes this by retrieving relevant documents from your data and passing them into the model's context window before it answers.

The loop is: Query → Retrieve relevant docs → Stuff them into the prompt → Generate a grounded answer.

The Five Components of a RAG System

User Query
    │
    ▼
[1] Embed Query  →  Vector (e.g. 384 dimensions)
    │
    ▼
[2] Vector Search  →  Top-K similar document chunks
    │
    ▼
[3] Re-rank (optional)  →  Re-ordered chunks by relevance
    │
    ▼
[4] Prompt Assembly  →  System prompt + retrieved chunks + user query
    │
    ▼
[5] LLM Generation  →  Grounded answer

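The diagram shows what happens at query time. The steps below also cover the one-off indexing work that has to happen first: chunking, embedding, and storing your documents. Re-ranking is optional and comes up again at the end.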

Step 1: Chunk Your Documents

The LLM has a finite context window. You can't stuff an entire knowledge base in — you chunk your documents into pieces that are small enough to be useful but large enough to carry meaning.

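def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        # A step of zero or less would never advance through the text.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # advance by less than a full chunk so context isn't lost at boundaries
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
    return chunks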

Chunking decisions that actually matter:

Decision	Why it matters
Chunk size	Too small = no context. Too large = irrelevant content dilutes the good stuff.
Overlap	Without overlap, a sentence split across chunks loses meaning at the boundary.
Chunk by token vs word	Token-based is more precise for context window budgeting.
Chunk by structure	Split on `\n\n` or headings for structured docs — often beats fixed-size.

A good default starting point: 400–600 words with 50-word overlap, or split on paragraphs if your docs are structured.
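If your docs have clear paragraph breaks, a structure-based splitter could look roughly like the sketch below. It is an illustration rather than part of the pipeline code: the name chunk_by_paragraph and its max_words parameter are made up for this example, and it budgets by words, not tokens.

```python
def chunk_by_paragraph(text: str, max_words: int = 500) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized chunk;
    a fuller version might fall back to fixed-size splitting for that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine"
print(chunk_by_paragraph(sample, max_words=5))
# ['one two three\n\nfour five', 'six seven eight nine']
```

Because paragraphs are never split, every chunk starts and ends on a natural boundary, at the price of uneven chunk sizes.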

Step 2: Embed Your Chunks

An embedding converts text into a vector of numbers — a point in a high-dimensional space — such that semantically similar texts are closer together.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # fast, free, good enough

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # encode() returns a NumPy array by default; tolist() turns it into plain Python lists
    return model.encode(chunks).tolist()

Which embedding model should you use?

all-MiniLM-L6-v2: Fast, 384 dimensions, great for most use cases. Free via sentence-transformers.
text-embedding-3-small (OpenAI): Better quality, especially for English. ~$0.02/1M tokens.
text-embedding-3-large (OpenAI): Best quality of the three, at roughly 6.5× the cost (~$0.13/1M tokens at launch pricing).

For a prototype, start local with all-MiniLM-L6-v2. Swap to OpenAI embeddings when quality matters more than cost.

Step 3: Store Embeddings in a Vector Database

Once embedded, store your chunks and their vectors so you can query them efficiently.

import chromadb

chroma_client = chromadb.Client()  # in-memory client; named so it doesn't clash with the OpenAI client in Step 5
collection = chroma_client.create_collection("my_docs")

def index_documents(chunks: list[str], embeddings: list[list[float]]):
    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=[f"chunk_{i}" for i in range(len(chunks))]
    )

Vector store options:

Store	Best for
ChromaDB	Local development, prototypes. Zero setup.
pgvector	Already on Postgres? Add a vector column. Production-ready.
Pinecone	Managed cloud, scales effortlessly. Pay-as-you-go.
Qdrant	Open source, Docker-friendly, fast filtering.

For most production systems on a budget: pgvector if you're already on Postgres, Qdrant if you need a dedicated vector DB.
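To see what any of these stores does at its core, here is a brute-force stand-in using only the standard library. TinyVectorStore is a hypothetical class written for this illustration. Real vector databases add approximate indexes, persistence, and filtering on top of the same idea, so treat this as a mental model rather than something to deploy.

```python
import math


class TinyVectorStore:
    """Brute-force in-memory store: keeps (id, text, vector) rows and scans all of them."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str, list[float]]] = []

    def add(self, ids: list[str], documents: list[str], embeddings: list[list[float]]) -> None:
        for row in zip(ids, documents, embeddings):
            self._rows.append(row)

    def query(self, query_embedding: list[float], n_results: int = 5) -> list[str]:
        # Rank every stored vector by Euclidean distance to the query (smaller = closer).
        ranked = sorted(self._rows, key=lambda row: math.dist(query_embedding, row[2]))
        return [doc for _, doc, _ in ranked[:n_results]]


store = TinyVectorStore()
store.add(
    ids=["a", "b", "c"],
    documents=["refund policy", "password reset", "shipping times"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],  # toy 2-D vectors, not real embeddings
)
print(store.query([0.9, 0.1], n_results=2))
# ['refund policy', 'shipping times']
```

A full scan like this costs time proportional to the number of stored chunks on every query, which is the main reason dedicated vector indexes exist.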

Step 4: Retrieve Relevant Chunks at Query Time

When a user asks a question, embed it the same way you embedded your documents, then find the closest chunks.

def retrieve(query: str, top_k: int = 5) -> list[str]:
    query_embedding = model.encode([query])[0].tolist()  # 1-D NumPy vector -> plain list of floats
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    return results["documents"][0]  # list of matching chunks

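Cosine similarity is the most common similarity metric for text embeddings. It measures the angle between two vectors and ignores their magnitude, so two vectors pointing in the same direction score 1.0 no matter how long either one is. One caveat: Chroma's default index ranks by squared L2 distance unless you configure the collection for cosine. For unit-length embeddings, which all-MiniLM-L6-v2 produces, both metrics give the same ranking.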
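To make the "angle, not magnitude" point concrete, here is a small pure-Python version of the formula. The function name and the toy vectors are illustrative. In practice your vector store or NumPy computes this for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]    # same direction as v, 10x the magnitude
orthogonal = [3.0, 0.0, -1.0]  # dot product with v is 0

print(round(cosine_similarity(v, scaled), 6))      # 1.0
print(round(cosine_similarity(v, orthogonal), 6))  # 0.0
```

Scaling a vector leaves the score unchanged, while an orthogonal vector scores zero.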

Step 5: Assemble the Prompt and Generate

Now stuff the retrieved chunks into your prompt and ask the LLM to answer using only what you gave it.

from openai import OpenAI

client = OpenAI()

def answer(query: str) -> str:
    chunks = retrieve(query)
    context = "\n\n---\n\n".join(chunks)

    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer the user's question using ONLY "
                "the context provided below. If the answer is not in the context, "
                "say you don't know — do not make things up.\n\n"
                f"CONTEXT:\n{context}"
            ),
        },
        {"role": "user", "content": query},
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.2,  # lower temp = more deterministic output, less drift from the context
    )
    return response.choices[0].message.content

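The key instruction in the system prompt: answer only from the context, and say you don't know if it's not there. Without this, the model freely blends retrieved context with its training data, and you lose the groundedness that RAG is supposed to provide. The instruction makes blending far less likely, but it is not a hard guarantee, so still spot-check answers against the retrieved chunks.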
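One practical detail when assembling the context is keeping it within a size budget. The helper below is a sketch under assumptions: build_context and max_words are names invented here, the budget counts words rather than tokens, and the numbered [1], [2] labels are an optional convention that makes it easier to see which chunk an answer drew on.

```python
def build_context(chunks: list[str], max_words: int = 1500) -> str:
    """Join chunks in ranked order, stopping at the first one that would exceed the budget."""
    parts: list[str] = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        n = len(chunk.split())
        if used + n > max_words:
            break  # keep rank order intact rather than skipping ahead to smaller chunks
        parts.append(f"[{i}] {chunk}")
        used += n
    return "\n\n---\n\n".join(parts)


print(build_context(["alpha beta", "gamma delta epsilon", "zeta"], max_words=4))
# [1] alpha beta
```

For tighter control you could count tokens with the tokenizer that matches your model instead of splitting on whitespace.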

Putting It All Together

# 1. Load and chunk your docs
with open("my_knowledge_base.txt", encoding="utf-8") as f:  # close the file once it's read
    raw_text = f.read()
chunks = chunk_text(raw_text)

# 2. Embed and index
embeddings = embed_chunks(chunks)
index_documents(chunks, embeddings)

# 3. Answer questions
print(answer("What is our refund policy?"))
print(answer("How do I reset my password?"))

Common RAG Failure Modes (and How to Fix Them)

The retrieved chunks are wrong. The model answers confidently but with the wrong information. Cause: the embedding model isn't finding the right chunks. Fix: inspect what's being retrieved with print(retrieve(query)). Often the chunking strategy is too coarse or the query needs preprocessing.
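Printing results works for one query. For a handful of known questions, a tiny hit-rate check is a natural next step. The sketch below is one possible shape for it: hit_rate_at_k is a made-up helper, the example questions and snippets are invented test data, and fake_retrieve only stands in for a real retriever so the block runs on its own.

```python
from typing import Callable


def hit_rate_at_k(
    examples: list[tuple[str, str]],
    retrieve_fn: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected snippet appears in any of the top-k chunks."""
    hits = 0
    for question, expected_snippet in examples:
        chunks = retrieve_fn(question, k)
        if any(expected_snippet.lower() in c.lower() for c in chunks):
            hits += 1
    return hits / len(examples) if examples else 0.0


# Stand-in retriever for demonstration only; swap in your real retrieve function.
def fake_retrieve(query: str, top_k: int) -> list[str]:
    corpus = {
        "refund": ["Refunds are issued within 14 days of purchase."],
        "password": ["Shipping takes 3 to 5 business days."],  # deliberately the wrong chunk
    }
    for keyword, chunks in corpus.items():
        if keyword in query.lower():
            return chunks[:top_k]
    return []


examples = [
    ("What is our refund policy?", "within 14 days"),
    ("How do I reset my password?", "reset link"),
]
print(hit_rate_at_k(examples, fake_retrieve, k=5))
# 0.5
```

Tracking this number while you change chunk size or the embedding model tells you whether retrieval improved, independent of what the LLM does with the chunks.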

The answer ignores the retrieved context. The model just responds from its training data. Cause: the grounding instruction in the system prompt is too weak. Fix: be explicit — "Only use the CONTEXT below. Do not use prior knowledge."

Chunks cut off mid-sentence. The context looks garbled. Cause: fixed-size chunking cuts wherever the word count runs out, and overlap only softens that. Fix: use paragraph-based or sentence-aware chunking.

Cost blowup at scale. Embedding thousands of documents gets expensive. Fix: batch embed offline, cache embeddings, only re-embed on document changes.
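The caching idea can be sketched with a content hash as the cache key, so an unchanged chunk is never embedded twice. Everything here is illustrative: EmbeddingCache is a hypothetical wrapper, the cache lives in memory only, and fake_embed stands in for a real embedding call so you can see how many texts reach the model.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Wraps an embedding function and skips chunks whose text was already embedded."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, chunks: list[str]) -> list[list[float]]:
        # Collect unseen chunks (deduplicated by hash) and embed them in one batch.
        missing: dict[str, str] = {}
        for chunk in chunks:
            key = self._key(chunk)
            if key not in self._cache:
                missing[key] = chunk
        if missing:
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._cache[key] = vector
        return [self._cache[self._key(chunk)] for chunk in chunks]


calls: list[int] = []  # records how many texts each real embedding call received


def fake_embed(texts: list[str]) -> list[list[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed)
cache.embed(["alpha", "beta"])
cache.embed(["alpha", "beta", "gamma"])
print(calls)
# [2, 1]
```

The second call only embeds the one new chunk. A production version would persist the cache, for example next to the vectors in your store, so it survives restarts.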

What's Next After a Basic RAG?

Once your basic pipeline works, these are the improvements that move the needle most:

Re-ranking: Use a cross-encoder model to re-order the top-K results by relevance to the query (not just vector similarity). Libraries: sentence-transformers cross-encoders, Cohere Rerank.
Hybrid search: Combine vector search (semantic) with BM25 keyword search. Catches cases where exact terminology matters.
Metadata filtering: Add document source, date, or category as metadata and filter before retrieval. Keeps out chunks that look semantically similar but come from the wrong source or time period.
Query expansion: Rewrite the user's query with the LLM before embedding to improve retrieval quality.
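For hybrid search, the two result lists have to be merged somehow. One common option is reciprocal rank fusion (RRF), sketched below. RRF is a choice made for this example, not the only way to combine the lists. The document ids are hypothetical, and k=60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids; each appearance adds 1 / (k + rank) to the score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic search order
keyword_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 keyword order
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top. Because RRF only uses rank positions, it does not need the vector and keyword scores to be on the same scale.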

The PM Perspective

If you're a product manager reading this: RAG is how you turn "we have all this data in a database/docs/Notion" into "our AI product actually knows our business."

The quality ceiling of your RAG system is determined by three things:

Chunk quality — does each chunk contain a coherent, complete piece of information?
Embedding quality — is the model good at encoding the kind of text you have?
Grounding instruction — how strictly are you telling the model to stick to what it retrieved?

Engineers decide all three, but as a PM, you define the quality bar that tells them whether they got it right.
