The Complete Guide to Document Chunking for RAG

Chunking is the most underestimated lever in RAG system performance. Deep dive into fixed, semantic, and recursive chunking with code examples and when to use each.

Axoverna Team
10 min read

Chunking is the process of splitting documents into smaller pieces so they can be embedded, indexed, and retrieved as discrete units. It's also one of the most underestimated factors in RAG system quality.

Bad chunking breaks your retrieval. Good chunking is invisible because the system just works. Excellent chunking is a competitive advantage. Here's how to get it right.

Why Chunking Matters

An LLM has a context window limit — typically 128K tokens for modern models, but you don't want to use all of it. You retrieve N documents and assemble them into a prompt that leaves room for the LLM's response (and for your instructions about how to answer).
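A back-of-the-envelope sketch of that budgeting (every number below is an invented assumption, not a recommendation):

```python
context_window = 128_000      # model hard limit (unused headroom in practice)
practical_budget = 16_000     # tokens you actually want to spend per request
response_reserve = 2_000      # room for the model's answer
instruction_reserve = 1_000   # system prompt and answering instructions
chunk_size = 512              # tokens per retrieved chunk

# What's left after reserves is what retrieval can fill
retrieval_budget = practical_budget - response_reserve - instruction_reserve
max_chunks = retrieval_budget // chunk_size
```

With these assumed numbers, roughly two dozen 512-token chunks fit, which is why chunk size directly caps how many distinct facts you can put in front of the model.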

If your chunks are too small, you lose context. A single sentence about "maximum operating pressure: 150 PSI" is missing the surrounding context (what product? what temperature?). When that chunk is retrieved and passed to the LLM, the model can't tell whether the figure applies to the customer's question.

If your chunks are too large, you lose precision. A 5,000-token chunk containing 12 different product specifications gets retrieved for a query about one specific spec, and the embedding-based retrieval system has to guess which specification the query is actually about.

Bad chunking also undermines hybrid retrieval. If you're doing BM25 + semantic search and your chunking splits semantically related content across multiple chunks, the BM25 index ends up with fragmented term frequencies, diluting the keyword score of any single chunk.

The chunking strategy shapes everything downstream: embedding quality, retrieval precision, index size, inference cost, and ultimately answer quality.

The Chunking Trade-Off Space

Chunking strategy sits at the intersection of three variables:

  • Chunk size: How many tokens per chunk? (50–500 is typical for product data)
  • Chunk overlap: How much do consecutive chunks overlap? (0–50% is common)
  • Chunking logic: What are the boundaries? (character count, semantic units, document structure)

There's no universal optimal choice. The right strategy depends on:

  • Your document types (product specs, PDFs, web pages, FAQs all chunk differently)
  • Your query patterns (specific queries benefit from small chunks; exploratory queries benefit from larger context)
  • Your retrieval architecture (hybrid vs. semantic-only; with or without reranking)

Fixed-Size Chunking

The simplest approach: split documents every N characters (or tokens), with overlap.

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """
    Split text into fixed-size chunks with overlap.
    
    Args:
        text: Raw text to chunk
        chunk_size: Tokens per chunk (approximate)
        overlap: Overlap size in tokens
    """
    # Approximate token count: 1 token ≈ 4 characters
    char_size = chunk_size * 4
    char_overlap = overlap * 4
    
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + char_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # Done; stepping back by the overlap would re-emit the tail forever
        start = end - char_overlap
    
    return chunks

Advantages:

  • Simple, no external dependencies
  • Consistent chunk size makes indexing predictable
  • Overlap prevents splitting important concepts

Disadvantages:

  • No awareness of document structure (might split a sentence mid-way or break a table)
  • No semantic coherence (unrelated content might be in the same chunk)
  • Requires careful tuning for different document types

When to use: Quick prototypes, simple documents without complex structure, when you don't have time to implement semantic chunking.

Product data example: Fixed-size chunking works poorly for product catalogs because specifications are semantically related but might be spread across a document. "Model 3200" (in a heading) and "150 PSI maximum operating pressure" (in a spec table) are critical for each other's context but will likely be split into different chunks.
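A toy illustration of this failure mode (the mini-document and sizes are invented for the demo): a plain character splitter with no overlap separates the model name from its spec.

```python
def naive_fixed_chunk(text: str, size: int) -> list[str]:
    """Split text every `size` characters: no overlap, no structure awareness."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Invented mini-document: heading, filler prose, then a spec line
doc = (
    "Model 3200\n"
    + "General installation notes. " * 12
    + "\nMaximum operating pressure: 150 PSI"
)

chunks = naive_fixed_chunk(doc, size=120)

# No single chunk contains both the model name and the pressure spec,
# so retrieval can surface "150 PSI" without knowing which product it belongs to.
orphaned = not any("Model 3200" in c and "150 PSI" in c for c in chunks)
```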

Semantic Chunking

Semantic chunking respects document structure. Instead of splitting every N characters, you split on natural boundaries (paragraphs, sections, sentences) and only further subdivide chunks that exceed your size limit.

def semantic_chunk(
    text: str,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
    max_chunk_size: int = 512
) -> list[str]:
    """
    Recursively split on natural boundaries until chunks fit within max_chunk_size.
    """
    
    def _split(text: str, separator_index: int = 0) -> list[str]:
        if len(text) < max_chunk_size:
            return [text]
        
        if separator_index >= len(separators):
            # No separator left; fall back to character split
            return [text[i:i+max_chunk_size] for i in range(0, len(text), max_chunk_size)]
        
        separator = separators[separator_index]
        if separator not in text:
            # This separator isn't in the text; try the next one
            return _split(text, separator_index + 1)
        
        # Split on separator
        splits = text.split(separator)
        good_splits = []
        current = ""
        
        for split in splits:
            if len(current) + len(split) < max_chunk_size:
                current += split + separator
            else:
                if current:
                    good_splits.append(current)
                current = split + separator
        
        if current:
            good_splits.append(current)
        
        # If we still have chunks that are too large, try the next separator
        final_splits = []
        for chunk in good_splits:
            if len(chunk) >= max_chunk_size:
                final_splits.extend(_split(chunk, separator_index + 1))
            else:
                final_splits.append(chunk)
        
        return final_splits
    
    return _split(text)

Advantages:

  • Respects document structure (paragraphs, sections stay together)
  • Better semantic coherence than fixed-size
  • Still simple to implement

Disadvantages:

  • Tuning the separator list and max_chunk_size takes manual testing
  • Different document types need different separator lists
  • Doesn't handle complex structures (tables, nested lists, code blocks)

When to use: Most text documents, FAQs, blog posts, general documentation.
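One practical consequence of the separator-tuning disadvantage: keep separator presets per document type rather than one global list. The presets below are illustrative assumptions, not a prescription:

```python
# Illustrative separator presets per document type (assumptions; tune per corpus)
SEPARATOR_PRESETS: dict[str, list[str]] = {
    "markdown_docs": ["\n## ", "\n\n", "\n", ". ", " "],  # split on headings first
    "plain_text": ["\n\n", "\n", ". ", " "],              # paragraphs, then sentences
    "faq_pages": ["\nQ:", "\n\n", "\n", ". ", " "],       # keep each Q&A pair intact
}

def separators_for(doc_type: str) -> list[str]:
    """Fall back to the plain-text preset for unknown document types."""
    return SEPARATOR_PRESETS.get(doc_type, SEPARATOR_PRESETS["plain_text"])
```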

Document-Aware Chunking

For product catalogs and technical documents, understanding document structure is crucial. Different sections mean different things:

  • Product header (name, model number, SKU): Critical context for all other chunks
  • Specifications table: Each row is related to other rows; don't split a spec in half
  • Description: May contain use cases, compatibility info, warnings — keep together
  • Troubleshooting: Each Q&A pair is independent
  • Compliance info: Each certification or standard is independent but valuable

class ProductDataframeChunker:
    """Chunk product data with awareness of product structure."""
    
    def chunk_product(self, product: dict) -> list[dict]:
        """
        Chunk a product record into semantic units.
        
        Expected product structure:
        {
            "id": "SKU-123",
            "name": "Model 3200 Pressure Valve",
            "specifications": {...},
            "description": "...",
            "faqs": [...],
            "certifications": [...]
        }
        """
        chunks = []
        
        # Chunk 1: Product Identity (always included in every retrieval)
        identity = f"Product: {product['name']} (SKU: {product['id']})"
        chunks.append({
            "content": identity,
            "metadata": {"type": "product_identity", "product_id": product["id"]},
            "priority": "high",  # Always retrieve this
        })
        
        # Chunk 2: Specifications (each spec is a separate chunk)
        if "specifications" in product:
            specs = product["specifications"]
            for spec_name, spec_value in specs.items():
                spec_text = f"{product['name']}: {spec_name} = {spec_value}"
                chunks.append({
                    "content": spec_text,
                    "metadata": {
                        "type": "specification",
                        "product_id": product["id"],
                        "spec_name": spec_name,
                        "spec_value": spec_value,
                    },
                })
        
        # Chunk 3: Description (keep as single chunk for context)
        if "description" in product:
            chunks.append({
                "content": f"About {product['name']}: {product['description']}",
                "metadata": {
                    "type": "description",
                    "product_id": product["id"],
                },
            })
        
        # Chunk 4: FAQs (one chunk per Q&A pair)
        if "faqs" in product:
            for faq in product["faqs"]:
                faq_text = f"Q: {faq['question']}\nA: {faq['answer']}"
                chunks.append({
                    "content": faq_text,
                    "metadata": {
                        "type": "faq",
                        "product_id": product["id"],
                        "question": faq["question"],
                    },
                })
        
        # Chunk 5: Certifications (one per chunk)
        if "certifications" in product:
            for cert in product["certifications"]:
                chunks.append({
                    "content": f"{product['name']} is certified: {cert['name']} (standard: {cert['standard']})",
                    "metadata": {
                        "type": "certification",
                        "product_id": product["id"],
                        "standard": cert["standard"],
                    },
                })
        
        return chunks

This approach creates chunks that map directly to semantic units (a specification, an FAQ answer, a certification) rather than arbitrary text boundaries.
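A condensed, self-contained version of the specification step above (the sample product record is invented):

```python
# Invented sample product record
product = {
    "id": "SKU-3200",
    "name": "Model 3200 Pressure Valve",
    "specifications": {
        "maximum_operating_pressure": "150 PSI",
        "maximum_temperature": "400°F",
    },
}

# One chunk per specification: each is a complete fact with queryable metadata
spec_chunks = [
    {
        "content": f"{product['name']}: {name} = {value}",
        "metadata": {
            "type": "specification",
            "product_id": product["id"],
            "spec_name": name,
        },
    }
    for name, value in product["specifications"].items()
]
```

Because the product name is repeated in every chunk's content, even a single retrieved spec arrives with its identifying context attached.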

Advantages:

  • Semantically coherent chunks
  • Metadata is structured and queryable
  • Each chunk represents a complete "fact" that can stand alone
  • Excellent for hybrid retrieval (BM25 can find specific spec names)

Disadvantages:

  • Requires custom logic for each document type
  • More engineering upfront
  • Harder to generalize across different product catalogs

When to use: This is the gold standard for product data and structured documents.

Chunking with Metadata Preservation

Every chunk needs rich metadata so you can:

  1. Filter at retrieval time ("only show specs for products in category X")
  2. Cite sources ("this answer comes from the Model 3200 datasheet, section 4.2")
  3. Debug retrieval failures ("why wasn't this chunk retrieved?")

from dataclasses import dataclass
 
@dataclass
class Chunk:
    id: str                    # Unique ID (product_id + chunk_type + index)
    content: str               # The actual text
    metadata: dict            # Rich metadata
    
    def to_embedding_input(self) -> str:
        """Generate the text to embed, including metadata for context."""
        # Include metadata in the embedding so semantic search understands product context
        title = self.metadata.get("product_name", "")
        chunk_type = self.metadata.get("chunk_type", "")
        
        if title and chunk_type:
            return f"[{title}] {chunk_type}: {self.content}"
        return self.content
 
# Example usage
chunk = Chunk(
    id="SKU-3200-spec-pressure",
    content="Maximum operating pressure: 150 PSI at 200°F",
    metadata={
        "product_id": "SKU-3200",
        "product_name": "Model 3200 Pressure Valve",
        "product_category": "valves",
        "chunk_type": "specification",
        "specification_name": "maximum_operating_pressure",
        "source_document": "3200-datasheet-v2.pdf",
        "source_page": 2,
    }
)

Notice to_embedding_input() — the metadata gets included in what's being embedded. For the example chunk above, it produces "[Model 3200 Pressure Valve] specification: Maximum operating pressure: 150 PSI at 200°F". The embedding model therefore encodes the context (what product, what section) alongside the content, improving semantic search precision.

Overlap: The Critical Parameter

Overlap prevents your system from breaking important concepts. Consider:

Chunk 1 (no overlap): "The Model 3200 valve operates at a maximum pressure of 150 PSI..."
Chunk 2 (fresh start): "...and a maximum temperature of 400°F. The body is constructed of..."

If a query asks "What's the operating envelope of the 3200?", Chunk 1 alone has no temperature information and Chunk 2 alone has no pressure information. With overlap (say, the last few hundred characters of Chunk 1 repeated at the start of Chunk 2), Chunk 2 would open with the pressure figure and carry both values together.

For semantic chunking, use 10–20% overlap. For fixed-size chunking, use 20–30%. For document-aware chunking with rich metadata, you often don't need overlap — the metadata carries the context.
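A minimal sketch of the effect (padded toy text and sizes are invented): without overlap the two adjacent facts land in different chunks; with overlap one chunk carries both.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Character-based sliding window; `overlap` characters repeat between chunks."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

# Toy document: padding, two adjacent facts, more padding
text = "x" * 100 + "maximum pressure 150 PSI. maximum temperature 400F. " + "y" * 100

def has_both(chunk: str) -> bool:
    return "150 PSI" in chunk and "400F" in chunk

no_overlap = sliding_chunks(text, size=130, overlap=0)
with_overlap = sliding_chunks(text, size=130, overlap=40)
```

With overlap=0, the boundary at character 130 separates the pressure and temperature facts; with overlap=40, the second window starts early enough to contain both.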

Token-Based vs Character-Based Sizing

Most guidance says "chunk size: 256–512 tokens," but implementing that requires a tokenizer. The pragmatic approximation:

  • 1 token ≈ 4 characters for English text
  • 1 token ≈ 2 characters for code or highly technical content

For safety, tokenize a sample of your content and calibrate.

import tiktoken
 
def chunk_by_tokens(text: str, max_tokens: int = 512, overlap_tokens: int = 64):
    """Split text using token-based boundaries."""
    enc = tiktoken.encoding_for_model("gpt-4")
    tokens = enc.encode(text)
    
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        if end == len(tokens):
            break  # Done; stepping back by the overlap would re-emit the tail forever
        start = end - overlap_tokens
    
    return chunks

Token-based sizing is more accurate and worth the extra code if you're handling varied content types.

Practical Recommendation for Product Catalogs

Use document-aware semantic chunking:

  1. Extract product-level metadata (name, SKU, category, URL)
  2. For each product, identify semantic units:
    • Core specifications (one chunk per spec or one chunk per spec table)
    • Description (as single chunk, split only if > 1000 tokens)
    • FAQs (one chunk per Q&A pair)
    • Compatibility info (one chunk per compatible product or standard)
    • Use cases (one chunk per use case)
  3. For each chunk, attach rich metadata (source, product, section, confidence)
  4. When embedding, include metadata in the embedding input for semantic context

Expected chunk distribution for a well-chunked product catalog:

  • 40% specification chunks
  • 25% description/use case chunks
  • 20% FAQ chunks
  • 10% compatibility chunks
  • 5% other (certifications, warnings, etc.)

This distribution maximizes the probability that relevant chunks are retrieved for both specific queries ("what's the max pressure?") and exploratory queries ("what's a good valve for this application?").

The difference between mediocre chunking and excellent chunking is the difference between a system that works 60% of the time and one that works 85% of the time. And that directly translates to user satisfaction, deflection rates, and ROI.

See production chunking in action with Axoverna → Semantic search with structured product data

Ready to get started?

Turn your product catalog into an AI knowledge base

Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.