RAG Explained: How Retrieval-Augmented Generation Actually Works
A technical deep-dive into RAG for engineers who don't have an ML background. Covers chunking, embeddings, vector search, context assembly, and LLM generation — with real code.
Retrieval-Augmented Generation (RAG) is the dominant architecture for building AI systems that answer questions from your own data. Every serious product knowledge system, internal wiki assistant, and customer support AI is built on some version of this pattern. But most explanations of RAG fall into one of two traps: they're either hand-wavy ("the AI looks things up and writes an answer") or they assume you already know what an embedding is.
This is a practical, ground-up explanation for software engineers who've shipped production systems but haven't worked with ML pipelines before. By the end, you'll understand every component well enough to evaluate implementations, debug failures, and make architecture decisions.
The Core Problem RAG Solves
Large language models (LLMs) like GPT-4 or Claude are trained on enormous text corpora and compress that training data into model weights. When you ask the model a question, it generates a response based on patterns in those weights — essentially, what the model "remembers" from training.
This creates two hard problems for product knowledge systems:
- Knowledge cutoff: The model was trained at a specific point in time. Your product catalog, updated last Tuesday, isn't in its training data.
- Hallucination: When the model doesn't know something, it often generates plausible-sounding but incorrect answers. For business-critical product queries — "What's the maximum working pressure of this fitting?" — a wrong answer is worse than no answer.
RAG solves both problems by retrieving relevant information at query time and providing it to the LLM as context. The LLM doesn't need to remember your product catalog; it just needs to read the relevant excerpts that your retrieval system surfaces.
The pipeline has six stages: ingest → chunk → embed → index → retrieve → generate. Let's walk through each.
Stage 1: Ingestion — Getting Data In
Before you can retrieve anything, you need to import your source material. In a product knowledge context, this typically means:
- Structured data: CSV or JSON product feeds from your PIM or ERP system
- PDFs: Technical datasheets, manuals, compliance documents
- Web pages: Product pages, knowledge base articles
- APIs: Real-time inventory systems, specification databases
The ingestion stage handles format conversion. For PDFs, you need to extract text while preserving structure — page numbers, section headers, table contents. Libraries like pdf-parse (Node.js) and pypdf (Python), or commercial services like AWS Textract, handle this.
For structured data, ingestion is simpler: map your product fields (title, description, specifications, part numbers) into a standardized document format.
The output of ingestion is a collection of raw text documents with associated metadata (source file, product ID, category, date).
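As a sketch of that mapping (the `Document` shape and the field names here are illustrative assumptions, not a fixed schema), a structured product record might become:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Standardized output of ingestion: raw text plus metadata."""
    content: str
    metadata: dict = field(default_factory=dict)

def product_to_document(product: dict) -> Document:
    """Flatten a PIM/ERP product record into one retrievable text document."""
    spec_lines = [f"{k}: {v}" for k, v in product.get("specifications", {}).items()]
    content = "\n".join([product["title"], product.get("description", ""), *spec_lines])
    return Document(
        content=content,
        metadata={
            "sourceId": product["id"],
            "sourceType": "product",
            "partNumber": product.get("partNumber"),
            "category": product.get("category"),
        },
    )

doc = product_to_document({
    "id": "P-3200",
    "title": "Model 3200 Pressure Relief Valve",
    "description": "Brass relief valve for compressed-air systems.",
    "partNumber": "3200-B",
    "category": "valves",
    "specifications": {"Opening pressure": "150 PSI", "Connection": "1/2 in NPT"},
})
```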
Stage 2: Chunking — Splitting Documents into Retrievable Pieces
Here's where most tutorials gloss over the hardest part of RAG. LLMs have a context window limit — you can't stuff an entire 200-page product manual into a prompt. You need to split documents into smaller pieces called chunks, and retrieve only the chunks relevant to a given query.
The chunking strategy profoundly affects retrieval quality. The wrong chunking strategy will cause your system to miss relevant information or retrieve irrelevant noise.
Fixed-Size Chunking
The simplest approach: split text every N characters, with an overlap of M characters.
```python
def fixed_chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```

The overlap prevents you from splitting a critical sentence right at the boundary. The problem: fixed-size chunking is completely ignorant of document structure. It might cut a technical specification table in half, separating a property from its value.
Semantic/Recursive Chunking
A better approach: split on natural document boundaries first (paragraphs, sections, sentences), then further split chunks that are too large.
```python
def recursive_chunk(text: str, separators: list[str], max_size: int) -> list[str]:
    # Try to split on the first separator that appears in the text
    for sep in separators:
        if sep in text:
            chunks = []
            for part in text.split(sep):
                if not part:
                    continue  # skip empty fragments from leading/trailing separators
                if len(part) <= max_size:
                    chunks.append(part)
                else:
                    # Recursively split this part with the remaining separators
                    chunks.extend(recursive_chunk(part, separators[1:], max_size))
            return chunks
    # If no separator works, fall back to character splitting
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

separators = ["\n\n", "\n", ". ", " "]
chunks = recursive_chunk(document_text, separators, max_size=512)
```

For product data specifically, consider chunking by semantic units: one chunk per product, one chunk per specification section, one chunk per FAQ answer. See our deep-dive on chunking strategies → for advanced patterns.
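A minimal sketch of that per-product strategy. The input shape (a `spec_sections` field, and so on) is a hypothetical example, not a fixed schema:

```python
def chunk_product(product: dict) -> list[dict]:
    """One chunk for the product overview, plus one per specification section."""
    chunks = [{
        "content": f"{product['title']}\n{product.get('description', '')}",
        "metadata": {"sourceId": product["id"], "section": "overview"},
    }]
    for section, specs in product.get("spec_sections", {}).items():
        body = "\n".join(f"{k}: {v}" for k, v in specs.items())
        # Repeat the product title in each chunk so spec chunks stay self-describing
        chunks.append({
            "content": f"{product['title']} / {section}\n{body}",
            "metadata": {"sourceId": product["id"], "section": section},
        })
    return chunks

chunks = chunk_product({
    "id": "P-3200",
    "title": "Model 3200 Pressure Relief Valve",
    "description": "Brass relief valve for compressed-air systems.",
    "spec_sections": {
        "Performance": {"Opening pressure": "150 PSI"},
        "Dimensions": {"Connection": "1/2 in NPT"},
    },
})
# One overview chunk + two spec-section chunks
```

Repeating the title in every chunk is a small trick that pays off at retrieval time: a spec chunk retrieved on its own still says which product it belongs to.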
Metadata Preservation
Every chunk must carry its metadata forward:
```typescript
interface Chunk {
  id: string
  content: string
  metadata: {
    sourceId: string        // Original document/product ID
    sourceType: 'product' | 'manual' | 'faq'
    title: string
    pageNumber?: number     // For PDFs
    sectionHeader?: string  // For structured docs
    productCategory?: string
    partNumber?: string
  }
}
```

Without metadata, retrieved chunks are anonymous text blobs. With metadata, your system can cite sources, filter by category, and provide traceable answers.
Stage 3: Embedding — Turning Text into Vectors
An embedding is a numerical representation of meaning. An embedding model takes a piece of text and outputs a vector of floating-point numbers — typically 768 to 3,072 dimensions for the models in common use. The key property: text with similar meaning produces vectors that are close together in this high-dimensional space.
You don't need to understand the mathematics. What matters is the interface:
```typescript
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1,536 dimensions, $0.02/1M tokens
    input: text,
  })
  return response.data[0].embedding
}

// Embed a chunk
const vector = await embed("The Model 3200 pressure relief valve opens at 150 PSI")
// Returns: [0.023, -0.147, 0.892, ...] — 1,536 numbers
```

You embed every chunk during ingestion, storing the vector alongside the chunk text. This is a one-time (per chunk) operation that can be done in batch.
At query time, you embed the user's query with the same model:
```typescript
const queryVector = await embed("What pressure does the 3200 valve open at?")
```

The query vector will be close in embedding space to the chunk vector above, because they refer to the same concept, even though they use different words.
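That closeness is measurable. Here is a minimal cosine similarity over plain Python lists; the toy 3-dimensional vectors stand in for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (illustrative values, not real model output):
chunk_vec = [0.9, 0.1, 0.3]    # the valve chunk
query_vec = [0.8, 0.2, 0.35]   # a semantically close query
other_vec = [-0.5, 0.9, -0.1]  # unrelated text

print(cosine_similarity(chunk_vec, query_vec) > cosine_similarity(chunk_vec, other_vec))
# True: the related pair scores higher
```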
Choosing an Embedding Model
| Model | Dimensions | Context | Cost | Notes |
|---|---|---|---|---|
| text-embedding-3-small | 1,536 | 8,191 tokens | $0.02/1M | Good balance |
| text-embedding-3-large | 3,072 | 8,191 tokens | $0.13/1M | Best quality |
| text-embedding-ada-002 | 1,536 | 8,191 tokens | $0.10/1M | Legacy |
| cohere-embed-v3-english | 1,024 | 512 tokens | $0.10/1M | Strong retrieval |
| nomic-embed-text | 768 | 8,192 tokens | Free (local) | Good open source |
For product catalogs, text-embedding-3-small hits the sweet spot: strong performance on technical text, 1,536 dimensions (enough nuance for complex queries), and cheap enough to re-embed your entire catalog as needed.
Stage 4: Indexing — Storing and Searching Vectors
Vector databases are purpose-built for one operation: given a query vector, find the K nearest vectors in a collection of millions. This is the approximate nearest-neighbor (ANN) problem.
```typescript
// Store a chunk with its embedding
await vectorDB.upsert({
  id: chunk.id,
  vector: embedding,
  metadata: chunk.metadata,
  content: chunk.content,
})

// Query: find 5 most similar chunks
const results = await vectorDB.query({
  vector: queryVector,
  topK: 5,
  includeMetadata: true,
})
```

The distance metric is typically cosine similarity (how aligned are the two vectors?) or dot product (equivalent to cosine similarity when vectors are normalized to unit length). Cosine similarity ranges from -1 to 1 (in practice, scores for text embeddings mostly fall between 0 and 1), which gives you an interpretable relevance score.
The ANN algorithms (HNSW, IVF, ScaNN) trade a small accuracy loss for massive speed gains. Exact search is a linear scan that scores every stored vector on every query, which gets expensive at millions of vectors; with an HNSW index, you can search 10 million vectors in under 10 ms.
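For intuition, this is the exact brute-force baseline that ANN indexes approximate, sketched in plain Python (a production system would use the index, never this scan):

```python
import math

def exact_top_k(query: list[float], vectors: list[list[float]], k: int = 5):
    """Brute-force nearest neighbors: score every stored vector, sort, keep k.
    O(N * D) per query; this linear scan is the work an ANN index avoids."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scored = [(i, cos(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Toy 2-dimensional store: index 0 matches the query exactly
store = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
top = exact_top_k([1.0, 0.0], store, k=2)
```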
See our vector database comparison → for a practical guide to pgvector, Pinecone, and Weaviate.
Stage 5: Context Assembly — Building the Prompt
Once you have your top-K retrieved chunks, you assemble them into a context block that gets passed to the LLM. The assembly step is more subtle than it appears.
```typescript
async function assembleContext(
  query: string,
  chunks: RetrievedChunk[]
): Promise<string> {
  const contextBlock = chunks
    .map((chunk, i) => {
      return `--- Source ${i + 1}: ${chunk.metadata.title} ---\n${chunk.content}`
    })
    .join('\n\n')

  return `You are a product knowledge assistant. Answer the following question using only the provided context. If the context doesn't contain enough information to answer, say so — do not guess.

CONTEXT:
${contextBlock}

QUESTION: ${query}

ANSWER:`
}
```

Key decisions in context assembly:
Deduplication: If you retrieve 10 chunks and 3 of them are from the same product's datasheet (just different sections), merging them reduces token usage and improves coherence.
Re-ranking: The top-5 results by cosine similarity aren't always the best 5 for answering the query. Cross-encoder re-rankers (models that score query+chunk pairs) can significantly improve relevance. Cohere's /rerank endpoint and ms-marco-MiniLM are popular options.
Context window management: If your chunks exceed the LLM's context window, you need to trim. Prioritize by similarity score, then by recency or authority (your canonical product descriptions over third-party content).
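The deduplication and trimming decisions above can be sketched together. The chunk shape and the 4-characters-per-token estimate are illustrative assumptions:

```python
def dedupe_and_trim(chunks: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Merge chunks that share a source document, then trim to a token budget."""
    # 1. Merge by sourceId, visiting best-scoring chunks first so each merged
    #    entry keeps its highest similarity score
    merged: dict[str, dict] = {}
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        src = c["metadata"]["sourceId"]
        if src in merged:
            merged[src]["content"] += "\n" + c["content"]
        else:
            merged[src] = {**c}  # shallow copy so the input list is untouched
    # 2. Trim to the budget, best-scoring sources first (dicts preserve
    #    insertion order, which is already score-descending here)
    kept, used = [], 0
    for c in merged.values():
        cost = len(c["content"]) // 4  # rough chars-to-tokens heuristic
        if used + cost > max_tokens:
            break
        kept.append(c)
        used += cost
    return kept

retrieved = [
    {"content": "A" * 400, "score": 0.9, "metadata": {"sourceId": "p1"}},
    {"content": "B" * 400, "score": 0.8, "metadata": {"sourceId": "p1"}},
    {"content": "C" * 400, "score": 0.7, "metadata": {"sourceId": "p2"}},
]
context = dedupe_and_trim(retrieved, max_tokens=300)
# Two entries: the merged p1 chunk first, then p2
```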
Stage 6: Generation — The LLM Turn
With the assembled prompt, you call your LLM to generate the answer:
```typescript
async function generateAnswer(prompt: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    temperature: 0.1, // Low temperature for factual consistency
    max_tokens: 800,
  })
  return completion.choices[0].message.content ?? ''
}
```

Temperature is critical for product knowledge systems. Low temperature (0.0–0.2) produces more deterministic, factual answers. High temperature produces more creative but potentially inaccurate answers. For product specs and technical queries, you want determinism.
The LLM's job in RAG is reading comprehension, not knowledge recall. It reads the context you provided and extracts or synthesizes the answer. This is why RAG systems are far less prone to hallucination than bare LLM queries — the model is constrained to what you give it.
Putting It All Together
Here's a simplified end-to-end flow:
```typescript
async function askProductKB(query: string): Promise<Answer> {
  // 1. Embed the query
  const queryVector = await embed(query)

  // 2. Retrieve top-K chunks
  const chunks = await vectorDB.query({ vector: queryVector, topK: 8 })

  // 3. (Optional) Re-rank
  const reranked = await rerank(query, chunks)

  // 4. Assemble context
  const prompt = await assembleContext(query, reranked.slice(0, 5))

  // 5. Generate answer
  const answer = await generateAnswer(prompt)

  // 6. Return with sources for citation
  return {
    answer,
    sources: reranked.slice(0, 5).map(c => ({
      title: c.metadata.title,
      type: c.metadata.sourceType,
      score: c.score,
    })),
  }
}
```

Where RAG Goes Wrong
Understanding RAG failure modes helps you debug and improve the system:
Poor chunking: The answer is in your documents, but it was split across two chunks and neither chunk alone has enough context to answer the query. Fix: increase chunk overlap, or use semantic chunking that keeps related content together.
Retrieval misses: The right chunk exists but doesn't make the top-K because its embedding is semantically distant from the query. Fix: hybrid retrieval (combine BM25 + vector search), or add a cross-encoder re-ranker.
Context overflow: Too many chunks, and the LLM loses track of the relevant information (the "lost in the middle" effect). Fix: reduce topK, improve re-ranking, or use a longer-context model.
Prompt hallucination: Despite the context, the LLM still makes up information. Fix: lower temperature, add explicit instructions ("do not add information not present in the context"), use a more instruction-following model.
Stale index: Products updated in your catalog but not re-embedded. Fix: real-time or near-real-time ingestion triggers on catalog changes.
The RAG Landscape in Production
A production RAG system for a B2B catalog adds several layers on top of the basic pipeline:
- Hybrid retrieval (BM25 + vector, merged with Reciprocal Rank Fusion)
- Metadata filtering (only search within the relevant product category)
- Query expansion (reformulate ambiguous queries before embedding)
- Answer grounding (cite sources with chunk-level attribution)
- Confidence scoring (surface the similarity scores alongside answers)
- Feedback loop (use thumbs up/down to identify poor retrievals)
These aren't optional features — they're what separates a prototype from a system your customers can trust.
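For a flavor of one of these layers: Reciprocal Rank Fusion merges the ranked lists from BM25 and vector search by scoring each document 1/(k + rank) for every list it appears in. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one. Each document earns
    1/(k + rank) per list; k=60 is the constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]    # keyword search ranking
vector_results = ["doc_b", "doc_d", "doc_a"]  # vector search ranking
fused = reciprocal_rank_fusion([bm25_results, vector_results])
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (like doc_b here) float to the top, which is exactly the behavior you want from hybrid retrieval.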
Axoverna handles all of this out of the box. Bring your product data; we handle the ingestion, chunking, embedding, indexing, retrieval, and generation. The result: an AI knowledge base that answers product questions accurately, cites its sources, and never hallucinates your specifications.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.