RAG Explained: How Retrieval-Augmented Generation Actually Works
A technical deep-dive into RAG for engineers who don't have an ML background. Covers chunking, embeddings, vector search, context assembly, and LLM generation — with real code.
Retrieval-Augmented Generation (RAG) is the dominant architecture for building AI systems that answer questions from your own data. Every serious product knowledge system, internal wiki assistant, and customer support AI is built on some version of this pattern. But most explanations of RAG fall into one of two traps: they're either hand-wavy ("the AI looks things up and writes an answer") or they assume you already know what an embedding is.
This is a practical, ground-up explanation for software engineers who've shipped production systems but haven't worked with ML pipelines before. By the end, you'll understand every component well enough to evaluate implementations, debug failures, and make architecture decisions.
The Core Problem RAG Solves
Large language models (LLMs) like GPT-4 or Claude are trained on enormous text corpora and compress that training data into model weights. When you ask the model a question, it generates a response based on patterns in those weights — essentially, what the model "remembers" from training.
This creates two hard problems for product knowledge systems:
- Knowledge cutoff: The model was trained at a specific point in time. Your product catalog, updated last Tuesday, isn't in its training data.
- Hallucination: When the model doesn't know something, it often generates plausible-sounding but incorrect answers. For business-critical product queries — "What's the maximum working pressure of this fitting?" — a wrong answer is worse than no answer.
RAG solves both problems by retrieving relevant information at query time and providing it to the LLM as context. The LLM doesn't need to remember your product catalog; it just needs to read the relevant excerpts that your retrieval system surfaces.
The pipeline has six stages: ingest → chunk → embed → index → retrieve → generate. Let's walk through each.
Stage 1: Ingestion — Getting Data In
Before you can retrieve anything, you need to import your source material. In a product knowledge context, this typically means:
- Structured data: CSV or JSON product feeds from your PIM or ERP system
- PDFs: Technical datasheets, manuals, compliance documents
- Web pages: Product pages, knowledge base articles
- APIs: Real-time inventory systems, specification databases
The ingestion stage handles format conversion. For PDFs, you need to extract text while preserving structure — page numbers, section headers, table contents. Libraries like pdf-parse (Node.js) and pypdf (Python), or commercial services like AWS Textract, handle this.
For structured data, ingestion is simpler: map your product fields (title, description, specifications, part numbers) into a standardized document format.
The output of ingestion is a collection of raw text documents with associated metadata (source file, product ID, category, date).
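As a sketch of that mapping (the `Document` shape and the field names here are illustrative assumptions, not a fixed schema), a structured product record might become:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Standardized output of ingestion: raw text plus metadata."""
    content: str
    metadata: dict = field(default_factory=dict)

def product_to_document(product: dict) -> Document:
    """Flatten a PIM/ERP product record into one retrievable text document."""
    spec_lines = [f"{k}: {v}" for k, v in product.get("specifications", {}).items()]
    content = "\n".join([product["title"], product.get("description", ""), *spec_lines])
    return Document(
        content=content,
        metadata={
            "sourceId": product["id"],
            "sourceType": "product",
            "partNumber": product.get("partNumber"),
            "category": product.get("category"),
        },
    )

doc = product_to_document({
    "id": "P-3200",
    "title": "Model 3200 Pressure Relief Valve",
    "description": "Brass relief valve for compressed-air systems.",
    "partNumber": "3200-B",
    "category": "valves",
    "specifications": {"Opening pressure": "150 PSI", "Connection": "1/2 in NPT"},
})
```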
Stage 2: Chunking — Splitting Documents into Retrievable Pieces
Here's where most tutorials gloss over the hardest part of RAG. LLMs have a context window limit — you can't stuff an entire 200-page product manual into a prompt. You need to split documents into smaller pieces called chunks, and retrieve only the chunks relevant to a given query.
The chunking strategy profoundly affects retrieval quality. The wrong chunking strategy will cause your system to miss relevant information or retrieve irrelevant noise.
Fixed-Size Chunking
The simplest approach: split text every N characters, with an overlap of M characters.
```python
def fixed_chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```

The overlap prevents you from splitting a critical sentence right at the boundary. The problem: fixed-size chunking is completely ignorant of document structure. It might cut a technical specification table in half, separating a property from its value.
Semantic/Recursive Chunking
A better approach: split on natural document boundaries first (paragraphs, sections, sentences), then further split chunks that are too large.
```python
def recursive_chunk(text: str, separators: list[str], max_size: int) -> list[str]:
    # Try to split on the first separator that appears in the text
    for sep in separators:
        if sep in text:
            chunks = []
            for part in text.split(sep):
                if not part:
                    continue  # skip empty fragments from leading/trailing separators
                if len(part) <= max_size:
                    chunks.append(part)
                else:
                    # Recursively split this part with the remaining separators
                    chunks.extend(recursive_chunk(part, separators[1:], max_size))
            return chunks
    # If no separator works, fall back to character splitting
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

separators = ["\n\n", "\n", ". ", " "]
chunks = recursive_chunk(document_text, separators, max_size=512)
```

For product data specifically, consider chunking by semantic units: one chunk per product, one chunk per specification section, one chunk per FAQ answer. See our deep-dive on chunking strategies → for advanced patterns.
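A minimal sketch of that per-product strategy. The input shape (a `spec_sections` field, and so on) is a hypothetical example, not a fixed schema:

```python
def chunk_product(product: dict) -> list[dict]:
    """One chunk for the product overview, plus one per specification section."""
    chunks = [{
        "content": f"{product['title']}\n{product.get('description', '')}",
        "metadata": {"sourceId": product["id"], "section": "overview"},
    }]
    for section, specs in product.get("spec_sections", {}).items():
        body = "\n".join(f"{k}: {v}" for k, v in specs.items())
        # Repeat the product title in each chunk so spec chunks stay self-describing
        chunks.append({
            "content": f"{product['title']} / {section}\n{body}",
            "metadata": {"sourceId": product["id"], "section": section},
        })
    return chunks

chunks = chunk_product({
    "id": "P-3200",
    "title": "Model 3200 Pressure Relief Valve",
    "description": "Brass relief valve for compressed-air systems.",
    "spec_sections": {
        "Performance": {"Opening pressure": "150 PSI"},
        "Dimensions": {"Connection": "1/2 in NPT"},
    },
})
# One overview chunk + two spec-section chunks
```

Repeating the title in every chunk is a small trick that pays off at retrieval time: a spec chunk retrieved on its own still says which product it belongs to.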
Metadata Preservation
Every chunk must carry its metadata forward:
```typescript
interface Chunk {
  id: string
  content: string
  metadata: {
    sourceId: string        // Original document/product ID
    sourceType: 'product' | 'manual' | 'faq'
    title: string
    pageNumber?: number     // For PDFs
    sectionHeader?: string  // For structured docs
    productCategory?: string
    partNumber?: string
  }
}
```

Without metadata, retrieved chunks are anonymous text blobs. With metadata, your system can cite sources, filter by category, and provide traceable answers.
Stage 3: Embedding — Turning Text into Vectors
An embedding is a numerical representation of meaning. An embedding model takes a piece of text and outputs a vector of floating-point numbers — typically 768 to 3,072 dimensions for the models in common use. The key property: text with similar meaning produces vectors that are close together in this high-dimensional space.
You don't need to understand the mathematics. What matters is the interface:
```typescript
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1,536 dimensions, $0.02/1M tokens
    input: text,
  })
  return response.data[0].embedding
}

// Embed a chunk
const vector = await embed("The Model 3200 pressure relief valve opens at 150 PSI")
// Returns: [0.023, -0.147, 0.892, ...] — 1,536 numbers
```

You embed every chunk during ingestion, storing the vector alongside the chunk text. This is a one-time (per chunk) operation that can be done in batch.
At query time, you embed the user's query with the same model:
```typescript
const queryVector = await embed("What pressure does the 3200 valve open at?")
```

The query vector will be close in embedding space to the chunk vector above, because they refer to the same concept, even though they use different words.
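That closeness is measurable. Here is a minimal cosine similarity over plain Python lists; the toy 3-dimensional vectors stand in for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative values, not real model output):
chunk_vec = [0.9, 0.1, 0.3]    # the valve chunk
query_vec = [0.8, 0.2, 0.35]   # a semantically close query
other_vec = [-0.5, 0.9, -0.1]  # unrelated text

print(cosine_similarity(chunk_vec, query_vec) > cosine_similarity(chunk_vec, other_vec))
# True: the related pair scores higher
```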
Choosing an Embedding Model
| Model | Dimensions | Context | Cost | Notes |
|---|---|---|---|---|
| text-embedding-3-small | 1,536 | 8,191 tokens | $0.02/1M | Good balance |
| text-embedding-3-large | 3,072 | 8,191 tokens | $0.13/1M | Best quality |
| text-embedding-ada-002 | 1,536 | 8,191 tokens | $0.10/1M | Legacy |
| cohere-embed-v3-english | 1,024 | 512 tokens | $0.10/1M | Strong retrieval |
| nomic-embed-text | 768 | 8,192 tokens | Free (local) | Good open source |
For product catalogs, text-embedding-3-small hits the sweet spot: strong performance on technical text, 1,536 dimensions (enough nuance for complex queries), and cheap enough to re-embed your entire catalog as needed.
Stage 4: Indexing — Storing and Searching Vectors
Vector databases are purpose-built for one operation: given a query vector, find the K nearest vectors in a collection of millions. This is the approximate nearest-neighbor (ANN) problem.
```typescript
// Store a chunk with its embedding
await vectorDB.upsert({
  id: chunk.id,
  vector: embedding,
  metadata: chunk.metadata,
  content: chunk.content,
})

// Query: find 5 most similar chunks
const results = await vectorDB.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
})
```

The distance metric is typically cosine similarity (how aligned are the two vectors?) or dot product (equivalent to cosine similarity when vectors are normalized to unit length). Cosine similarity ranges from -1 to 1 (in practice, scores for text embeddings mostly fall between 0 and 1), which gives you an interpretable relevance score.
The ANN algorithms (HNSW, IVF, ScaNN) trade a small accuracy loss for massive speed gains. Exact search is a linear scan that scores every stored vector on every query, which gets expensive at millions of vectors; with an HNSW index, you can search 10 million vectors in under 10 ms.
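For intuition, this is the exact brute-force baseline that ANN indexes approximate, sketched in plain Python (a production system would use the index, never this scan):

```python
import math

def exact_top_k(query: list[float], vectors: list[list[float]], k: int = 5):
    """Brute-force nearest neighbors: score every stored vector, sort, keep k.
    O(N * D) per query; this linear scan is the work an ANN index avoids."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scored = [(i, cos(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Toy 2-dimensional store: index 0 matches the query exactly
store = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = exact_top_k([1.0, 0.0], store, k=2)
```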
See our vector database comparison → for a practical guide to pgvector, Pinecone, and Weaviate.
Stage 5: Context Assembly — Building the Prompt
Once you have your top-K retrieved chunks, you assemble them into a context block that gets passed to the LLM. The assembly step is more subtle than it appears.
```typescript
async function assembleContext(
  query: string,
  chunks: RetrievedChunk[]
): Promise<string> {
  const contextBlock = chunks
    .map((chunk, i) => {
      return `--- Source ${i + 1}: ${chunk.metadata.title} ---\n${chunk.content}`
    })
    .join('\n\n')

  return `You are a product knowledge assistant. Answer the following question using only the provided context. If the context doesn't contain enough information to answer, say so — do not guess.

CONTEXT:
${contextBlock}

QUESTION: ${query}

ANSWER:`
}
```

Key decisions in context assembly:
Deduplication: If you retrieve 10 chunks and 3 of them are from the same product's datasheet (just different sections), merging them reduces token usage and improves coherence.
Re-ranking: The top-5 results by cosine similarity aren't always the best 5 for answering the query. Cross-encoder re-rankers (models that score query+chunk pairs) can significantly improve relevance. Cohere's /rerank endpoint and ms-marco-MiniLM are popular options.
Context window management: If your chunks exceed the LLM's context window, you need to trim. Prioritize by similarity score, then by recency or authority (your canonical product descriptions over third-party content).
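The deduplication and trimming decisions above can be sketched together. The chunk shape and the 4-characters-per-token estimate are illustrative assumptions:

```python
def dedupe_and_trim(chunks: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Merge chunks that share a source document, then trim to a token budget."""
    # 1. Merge by sourceId, visiting best-scoring chunks first so each merged
    #    entry keeps its highest similarity score
    merged: dict[str, dict] = {}
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        src = c["metadata"]["sourceId"]
        if src in merged:
            merged[src]["content"] += "\n" + c["content"]
        else:
            merged[src] = {**c}  # shallow copy so the input list is untouched
    # 2. Trim to the budget, best-scoring sources first (dicts preserve
    #    insertion order, which is already score-descending here)
    kept, used = [], 0
    for c in merged.values():
        cost = len(c["content"]) // 4  # rough chars-to-tokens heuristic
        if used + cost > max_tokens:
            break
        kept.append(c)
        used += cost
    return kept

retrieved = [
    {"content": "A" * 400, "score": 0.9, "metadata": {"sourceId": "p1"}},
    {"content": "B" * 400, "score": 0.8, "metadata": {"sourceId": "p1"}},
    {"content": "C" * 400, "score": 0.7, "metadata": {"sourceId": "p2"}},
]
context = dedupe_and_trim(retrieved, max_tokens=300)
# Two entries: the merged p1 chunk first, then p2
```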
Stage 6: Generation — The LLM Turn
With the assembled prompt, you call your LLM to generate the answer:
```typescript
async function generateAnswer(prompt: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.1, // Low temperature for factual consistency
    max_tokens: 800,
  })
  return completion.choices[0].message.content ?? ''
}
```

Temperature is critical for product knowledge systems. Low temperature (0.0–0.2) produces more deterministic, factual answers. High temperature produces more creative but potentially inaccurate answers. For product specs and technical queries, you want determinism.
The LLM's job in RAG is reading comprehension, not knowledge recall. It reads the context you provided and extracts or synthesizes the answer. This is why RAG systems are far less prone to hallucination than bare LLM queries — the model is constrained to what you give it.
Putting It All Together
Here's a simplified end-to-end flow:
```typescript
async function askProductKB(query: string): Promise<Answer> {
  // 1. Embed the query
  const queryVector = await embed(query)

  // 2. Retrieve top-K chunks
  const chunks = await vectorDB.query({ vector: queryVector, topK: 8 })

  // 3. (Optional) Re-rank
  const reranked = await rerank(query, chunks)

  // 4. Assemble context
  const prompt = await assembleContext(query, reranked.slice(0, 5))

  // 5. Generate answer
  const answer = await generateAnswer(prompt)

  // 6. Return with sources for citation
  return {
    answer,
    sources: reranked.slice(0, 5).map(c => ({
      title: c.metadata.title,
      type: c.metadata.sourceType,
      score: c.score,
    })),
  }
}
```

Where RAG Goes Wrong
Understanding RAG failure modes helps you debug and improve the system:
Poor chunking: The answer is in your documents, but it was split across two chunks and neither chunk alone has enough context to answer the query. Fix: increase chunk overlap, or use semantic chunking that keeps related content together.
Retrieval misses: The right chunk exists but doesn't make the top-K because its embedding is semantically distant from the query. Fix: hybrid retrieval (combine BM25 + vector search), or add a cross-encoder re-ranker.
Context overflow: Too many chunks, and the LLM loses track of the relevant information (the "lost in the middle" effect). Fix: reduce topK, improve re-ranking, or use a longer-context model.
Prompt hallucination: Despite the context, the LLM still makes up information. Fix: lower temperature, add explicit instructions ("do not add information not present in the context"), use a more instruction-following model.
Stale index: Products updated in your catalog but not re-embedded. Fix: real-time or near-real-time ingestion triggers on catalog changes.
The RAG Landscape in Production
A production RAG system for a B2B catalog adds several layers on top of the basic pipeline:
- Hybrid retrieval (BM25 + vector, merged with Reciprocal Rank Fusion)
- Metadata filtering (only search within the relevant product category)
- Query expansion (reformulate ambiguous queries before embedding)
- Answer grounding (cite sources with chunk-level attribution)
- Confidence scoring (surface the similarity scores alongside answers)
- Feedback loop (use thumbs up/down to identify poor retrievals)
These aren't optional features — they're what separates a prototype from a system your customers can trust.
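For a flavor of one of these layers: Reciprocal Rank Fusion merges the ranked lists from BM25 and vector search by scoring each document 1/(k + rank) for every list it appears in. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one. Each document earns
    1/(k + rank) per list; k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]    # keyword search ranking
vector_results = ["doc_b", "doc_d", "doc_a"]  # vector search ranking
fused = reciprocal_rank_fusion([bm25_results, vector_results])
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (like doc_b here) float to the top, which is exactly the behavior you want from hybrid retrieval.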
Axoverna handles all of this out of the box. Bring your product data; we handle the ingestion, chunking, embedding, indexing, retrieval, and generation. The result: an AI knowledge base that answers product questions accurately, cites its sources, and never hallucinates your specifications.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.