The Complete Guide to Document Chunking for RAG

Chunking is the most underestimated lever in RAG system performance. Deep dive into fixed, semantic, and recursive chunking with code examples and when to use each.

Axoverna Team
10 min read

Chunking is the process of splitting documents into smaller pieces so they can be embedded, indexed, and retrieved as discrete units. It's also one of the most underestimated factors in RAG system quality.

Bad chunking breaks your retrieval. Good chunking is invisible because the system just works. Excellent chunking is a competitive advantage. Here's how to get it right.

Why Chunking Matters

An LLM has a context window limit — typically 128K tokens for modern models, but you don't want to use all of it. You retrieve N documents and assemble them into a prompt that leaves room for the LLM's response (and for your instructions about how to answer).
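A back-of-the-envelope sketch of that budgeting (every number below is an invented assumption, not a recommendation):

```python
context_window = 128_000      # model hard limit (unused headroom in practice)
practical_budget = 16_000     # tokens you actually want to spend per request
response_reserve = 2_000      # room for the model's answer
instruction_reserve = 1_000   # system prompt and answering instructions
chunk_size = 512              # tokens per retrieved chunk

# What's left after reserves is what retrieval can fill
retrieval_budget = practical_budget - response_reserve - instruction_reserve
max_chunks = retrieval_budget // chunk_size
```

With these assumed numbers, roughly two dozen 512-token chunks fit, which is why chunk size directly caps how many distinct facts you can put in front of the model.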

If your chunks are too small, you lose context. A single sentence about "maximum operating pressure: 150 PSI" is missing the surrounding context (what product? what temperature?). When that chunk is retrieved and passed to the LLM, the model can't tell whether the figure applies to the customer's question.

If your chunks are too large, you lose precision. A 5,000-token chunk containing 12 different product specifications gets retrieved for a query about one specific spec, and the embedding-based retrieval system has to guess which specification the query is actually about.

Bad chunking also undermines hybrid retrieval. If you're doing BM25 + semantic search and your chunking splits semantically related content across multiple chunks, the BM25 index ends up with fragmented term frequencies, diluting the keyword score of any single chunk.

The chunking strategy shapes everything downstream: embedding quality, retrieval precision, index size, inference cost, and ultimately answer quality.

The Chunking Trade-Off Space

Chunking strategy sits at the intersection of three variables:

  • Chunk size: How many tokens per chunk? (50–500 is typical for product data)
  • Chunk overlap: How much do consecutive chunks overlap? (0–50% is common)
  • Chunking logic: What are the boundaries? (character count, semantic units, document structure)

There's no universal optimal choice. The right strategy depends on:

  • Your document types (product specs, PDFs, web pages, FAQs all chunk differently)
  • Your query patterns (specific queries benefit from small chunks; exploratory queries benefit from larger context)
  • Your retrieval architecture (hybrid vs. semantic-only; with or without reranking)

Fixed-Size Chunking

The simplest approach: split documents every N characters (or tokens), with overlap.

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """
    Split text into fixed-size chunks with overlap.
    
    Args:
        text: Raw text to chunk
        chunk_size: Tokens per chunk (approximate)
        overlap: Overlap size in tokens
    """
    # Approximate token count: 1 token ≈ 4 characters
    char_size = chunk_size * 4
    char_overlap = overlap * 4
    
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + char_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break  # Done; stepping back by the overlap would re-emit the tail forever
        start = end - char_overlap
    
    return chunks

Advantages:

  • Simple, no external dependencies
  • Consistent chunk size makes indexing predictable
  • Overlap prevents splitting important concepts

Disadvantages:

  • No awareness of document structure (might split a sentence mid-way or break a table)
  • No semantic coherence (unrelated content might be in the same chunk)
  • Requires careful tuning for different document types

When to use: Quick prototypes, simple documents without complex structure, when you don't have time to implement semantic chunking.

Product data example: Fixed-size chunking works poorly for product catalogs because specifications are semantically related but might be spread across a document. "Model 3200" (in a heading) and "150 PSI maximum operating pressure" (in a spec table) are critical for each other's context but will likely be split into different chunks.
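A toy illustration of this failure mode (the mini-document and sizes are invented for the demo): a plain character splitter with no overlap separates the model name from its spec.

```python
def naive_fixed_chunk(text: str, size: int) -> list[str]:
    """Split text every `size` characters: no overlap, no structure awareness."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Invented mini-document: heading, filler prose, then a spec line
doc = (
    "Model 3200\n"
    + "General installation notes. " * 12
    + "\nMaximum operating pressure: 150 PSI"
)

chunks = naive_fixed_chunk(doc, size=120)

# No single chunk contains both the model name and the pressure spec,
# so retrieval can surface "150 PSI" without knowing which product it belongs to.
orphaned = not any("Model 3200" in c and "150 PSI" in c for c in chunks)
```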

Semantic Chunking

Semantic chunking respects document structure. Instead of splitting every N characters, you split on natural boundaries (paragraphs, sections, sentences) and only further subdivide chunks that exceed your size limit.

def semantic_chunk(
    text: str,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
    max_chunk_size: int = 512
) -> list[str]:
    """
    Recursively split on natural boundaries until chunks fit within max_chunk_size.
    """
    
    def _split(text: str, separator_index: int = 0) -> list[str]:
        if len(text) < max_chunk_size:
            return [text]
        
        if separator_index >= len(separators):
            # No separator left; fall back to character split
            return [text[i:i+max_chunk_size] for i in range(0, len(text), max_chunk_size)]
        
        separator = separators[separator_index]
        if separator not in text:
            # This separator isn't in the text; try the next one
            return _split(text, separator_index + 1)
        
        # Split on separator
        splits = text.split(separator)
        good_splits = []
        current = ""
        
        for split in splits:
            if len(current) + len(split) < max_chunk_size:
                current += split + separator
            else:
                if current:
                    good_splits.append(current)
                current = split + separator
        
        if current:
            good_splits.append(current)
        
        # If we still have chunks that are too large, try the next separator
        final_splits = []
        for chunk in good_splits:
            if len(chunk) >= max_chunk_size:
                final_splits.extend(_split(chunk, separator_index + 1))
            else:
                final_splits.append(chunk)
        
        return final_splits
    
    return _split(text)

Advantages:

  • Respects document structure (paragraphs, sections stay together)
  • Better semantic coherence than fixed-size
  • Still simple to implement

Disadvantages:

  • Tuning the separator list and max_chunk_size takes manual testing
  • Different document types need different separator lists
  • Doesn't handle complex structures (tables, nested lists, code blocks)

When to use: Most text documents, FAQs, blog posts, general documentation.
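One practical consequence of the separator-tuning disadvantage: keep separator presets per document type rather than one global list. The presets below are illustrative assumptions, not a prescription:

```python
# Illustrative separator presets per document type (assumptions; tune per corpus)
SEPARATOR_PRESETS: dict[str, list[str]] = {
    "markdown_docs": ["\n## ", "\n\n", "\n", ". ", " "],  # split on headings first
    "plain_text": ["\n\n", "\n", ". ", " "],              # paragraphs, then sentences
    "faq_pages": ["\nQ:", "\n\n", "\n", ". ", " "],       # keep each Q&A pair intact
}

def separators_for(doc_type: str) -> list[str]:
    """Fall back to the plain-text preset for unknown document types."""
    return SEPARATOR_PRESETS.get(doc_type, SEPARATOR_PRESETS["plain_text"])
```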

Document-Aware Chunking

For product catalogs and technical documents, understanding document structure is crucial. Different sections mean different things:

  • Product header (name, model number, SKU): Critical context for all other chunks
  • Specifications table: Each row is related to other rows; don't split a spec in half
  • Description: May contain use cases, compatibility info, warnings — keep together
  • Troubleshooting: Each Q&A pair is independent
  • Compliance info: Each certification or standard is independent but valuable

class ProductDataframeChunker:
    """Chunk product data with awareness of product structure."""
    
    def chunk_product(self, product: dict) -> list[dict]:
        """
        Chunk a product record into semantic units.
        
        Expected product structure:
        {
            "id": "SKU-123",
            "name": "Model 3200 Pressure Valve",
            "specifications": {...},
            "description": "...",
            "faqs": [...],
            "certifications": [...]
        }
        """
        chunks = []
        
        # Chunk 1: Product Identity (always included in every retrieval)
        identity = f"Product: {product['name']} (SKU: {product['id']})"
        chunks.append({
            "content": identity,
            "metadata": {"type": "product_identity", "product_id": product["id"]},
            "priority": "high",  # Always retrieve this
        })
        
        # Chunk 2: Specifications (each spec is a separate chunk)
        if "specifications" in product:
            specs = product["specifications"]
            for spec_name, spec_value in specs.items():
                spec_text = f"{product['name']}: {spec_name} = {spec_value}"
                chunks.append({
                    "content": spec_text,
                    "metadata": {
                        "type": "specification",
                        "product_id": product["id"],
                        "spec_name": spec_name,
                        "spec_value": spec_value,
                    },
                })
        
        # Chunk 3: Description (keep as single chunk for context)
        if "description" in product:
            chunks.append({
                "content": f"About {product['name']}: {product['description']}",
                "metadata": {
                    "type": "description",
                    "product_id": product["id"],
                },
            })
        
        # Chunk 4: FAQs (one chunk per Q&A pair)
        if "faqs" in product:
            for faq in product["faqs"]:
                faq_text = f"Q: {faq['question']}\nA: {faq['answer']}"
                chunks.append({
                    "content": faq_text,
                    "metadata": {
                        "type": "faq",
                        "product_id": product["id"],
                        "question": faq["question"],
                    },
                })
        
        # Chunk 5: Certifications (one per chunk)
        if "certifications" in product:
            for cert in product["certifications"]:
                chunks.append({
                    "content": f"{product['name']} is certified: {cert['name']} (standard: {cert['standard']})",
                    "metadata": {
                        "type": "certification",
                        "product_id": product["id"],
                        "standard": cert["standard"],
                    },
                })
        
        return chunks

This approach creates chunks that map directly to semantic units (a specification, an FAQ answer, a certification) rather than arbitrary text boundaries.
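A condensed, self-contained version of the specification step above (the sample product record is invented):

```python
# Invented sample product record
product = {
    "id": "SKU-3200",
    "name": "Model 3200 Pressure Valve",
    "specifications": {
        "maximum_operating_pressure": "150 PSI",
        "maximum_temperature": "400°F",
    },
}

# One chunk per specification: each is a complete fact with queryable metadata
spec_chunks = [
    {
        "content": f"{product['name']}: {name} = {value}",
        "metadata": {
            "type": "specification",
            "product_id": product["id"],
            "spec_name": name,
        },
    }
    for name, value in product["specifications"].items()
]
```

Because the product name is repeated in every chunk's content, even a single retrieved spec arrives with its identifying context attached.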

Advantages:

  • Semantically coherent chunks
  • Metadata is structured and queryable
  • Each chunk represents a complete "fact" that can stand alone
  • Excellent for hybrid retrieval (BM25 can find specific spec names)

Disadvantages:

  • Requires custom logic for each document type
  • More engineering upfront
  • Harder to generalize across different product catalogs

When to use: This is the gold standard for product data and structured documents.

Chunking with Metadata Preservation

Every chunk needs rich metadata so you can:

  1. Filter at retrieval time ("only show specs for products in category X")
  2. Cite sources ("this answer comes from the Model 3200 datasheet, section 4.2")
  3. Debug retrieval failures ("why wasn't this chunk retrieved?")

from dataclasses import dataclass
 
@dataclass
class Chunk:
    id: str                    # Unique ID (product_id + chunk_type + index)
    content: str               # The actual text
    metadata: dict            # Rich metadata
    
    def to_embedding_input(self) -> str:
        """Generate the text to embed, including metadata for context."""
        # Include metadata in the embedding so semantic search understands product context
        title = self.metadata.get("product_name", "")
        chunk_type = self.metadata.get("chunk_type", "")
        
        if title and chunk_type:
            return f"[{title}] {chunk_type}: {self.content}"
        return self.content
 
# Example usage
chunk = Chunk(
    id="SKU-3200-spec-pressure",
    content="Maximum operating pressure: 150 PSI at 200°F",
    metadata={
        "product_id": "SKU-3200",
        "product_name": "Model 3200 Pressure Valve",
        "product_category": "valves",
        "chunk_type": "specification",
        "specification_name": "maximum_operating_pressure",
        "source_document": "3200-datasheet-v2.pdf",
        "source_page": 2,
    }
)

Notice to_embedding_input() — the metadata gets included in what's being embedded. For the example chunk above, it produces "[Model 3200 Pressure Valve] specification: Maximum operating pressure: 150 PSI at 200°F". The embedding model therefore encodes the context (what product, what section) alongside the content, improving semantic search precision.

Overlap: The Critical Parameter

Overlap prevents your system from breaking important concepts. Consider:

Chunk 1 (no overlap): "The Model 3200 valve operates at a maximum pressure of 150 PSI..."
Chunk 2 (fresh start): "...and a maximum temperature of 400°F. The body is constructed of..."

If a query asks "What's the operating envelope of the 3200?", Chunk 1 alone has no temperature information and Chunk 2 alone has no pressure information. With overlap (say, the last few hundred characters of Chunk 1 repeated at the start of Chunk 2), Chunk 2 would open with the pressure figure and carry both values together.

For semantic chunking, use 10–20% overlap. For fixed-size chunking, use 20–30%. For document-aware chunking with rich metadata, you often don't need overlap — the metadata carries the context.
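A minimal sketch of the effect (padded toy text and sizes are invented): without overlap the two adjacent facts land in different chunks; with overlap one chunk carries both.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Character-based sliding window; `overlap` characters repeat between chunks."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks

# Toy document: padding, two adjacent facts, more padding
text = "x" * 100 + "maximum pressure 150 PSI. maximum temperature 400F. " + "y" * 100

def has_both(chunk: str) -> bool:
    return "150 PSI" in chunk and "400F" in chunk

no_overlap = sliding_chunks(text, size=130, overlap=0)
with_overlap = sliding_chunks(text, size=130, overlap=40)
```

With overlap=0, the boundary at character 130 separates the pressure and temperature facts; with overlap=40, the second window starts early enough to contain both.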

Token-Based vs Character-Based Sizing

Most guidance says "chunk size: 256–512 tokens," but implementing that requires a tokenizer. The pragmatic approximation:

  • 1 token ≈ 4 characters for English text
  • 1 token ≈ 2 characters for code or highly technical content

For safety, tokenize a sample of your content and calibrate.

import tiktoken
 
def chunk_by_tokens(text: str, max_tokens: int = 512, overlap_tokens: int = 64):
    """Split text using token-based boundaries."""
    enc = tiktoken.encoding_for_model("gpt-4")
    tokens = enc.encode(text)
    
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        if end == len(tokens):
            break  # Done; stepping back by the overlap would re-emit the tail forever
        start = end - overlap_tokens
    
    return chunks

Token-based sizing is more accurate and worth the extra code if you're handling varied content types.

Practical Recommendation for Product Catalogs

Use document-aware semantic chunking:

  1. Extract product-level metadata (name, SKU, category, URL)
  2. For each product, identify semantic units:
    • Core specifications (one chunk per spec or one chunk per spec table)
    • Description (as single chunk, split only if > 1000 tokens)
    • FAQs (one chunk per Q&A pair)
    • Compatibility info (one chunk per compatible product or standard)
    • Use cases (one chunk per use case)
  3. For each chunk, attach rich metadata (source, product, section, confidence)
  4. When embedding, include metadata in the embedding input for semantic context

Expected chunk distribution for a well-chunked product catalog:

  • 40% specification chunks
  • 25% description/use case chunks
  • 20% FAQ chunks
  • 10% compatibility chunks
  • 5% other (certifications, warnings, etc.)

This distribution maximizes the probability that relevant chunks are retrieved for both specific queries ("what's the max pressure?") and exploratory queries ("what's a good valve for this application?").

The difference between mediocre chunking and excellent chunking is the difference between a system that works 60% of the time and one that works 85% of the time. And that directly translates to user satisfaction, deflection rates, and ROI.

See production chunking in action with Axoverna → Semantic search with structured product data

Ready to get started?

Turn your product catalog into an AI knowledge base

Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.