Contextual Compression in RAG: Sending Less to Your LLM Without Losing What Matters

Retrieving the right chunks is only half the battle. Contextual compression strips retrieved product content down to only what's relevant to the query — reducing noise, cutting cost, and improving answer quality at the same time.

Axoverna Team
14 min read

There's a common misconception in RAG system design: that the goal of retrieval is to collect as much potentially relevant content as possible, then hand the whole pile to the LLM and let it sort things out.

This is a reasonable first instinct. LLMs are smart. Context windows are large. Why not give them everything?

The problem is that retrieval noise isn't neutral. A product specification chunk retrieved for a query about flow rates might also contain installation instructions, warranty terms, spare parts ordering codes, and regulatory certifications. If the buyer asked "what's the maximum flow rate for this pump?", every sentence except the one answering that question is noise. And LLMs don't ignore noise cleanly — they dilute answers with tangential information, generate longer and less focused responses, and occasionally get confused by contradictory details from different sections of the same document.

Contextual compression solves this. It's a post-retrieval step that strips retrieved content down to only the fragments genuinely relevant to the query before passing them to the LLM. It's one of the highest-leverage optimizations available in a mature RAG system, and it's still underused in production product knowledge pipelines.


The "Lost in the Middle" Problem, Quantified

The theoretical case for compression starts with a well-studied failure mode: the "lost in the middle" effect. Research published in 2023 showed that LLMs disproportionately attend to information at the beginning and end of their context window. Content in the middle — even if it's the most relevant fragment — receives less attention and produces weaker answers.

This matters enormously for product knowledge RAG. A typical retrieved chunk for a complex product might be 400–600 tokens: a product description, followed by technical specifications, followed by compatibility notes, followed by installation prerequisites. If the relevant fragment (say, a single specification value) happens to sit in the middle of that chunk, surrounded by content about ordering and warranty, an LLM will produce a noisier answer than if you'd simply passed it the specification sentence directly.

The fix: compress retrieved content to surface only the relevant fragments, so that all of the context you pass to the LLM is on-topic.


Two Approaches to Contextual Compression

There are two fundamentally different ways to compress retrieved content, each with different tradeoffs.

Extractive Compression

Extractive compression takes a retrieved chunk and selects a subset of sentences or passages that are most relevant to the query. Nothing is rewritten — the output is a verbatim excerpt of the input. This is important for product data: you never want to paraphrase a specification value, a part number, or a compliance rating. If the document says "IP67," the compressed output should say "IP67," not "highly water-resistant."

The simplest extractive approach: run a lightweight sentence-level similarity score between the query and each sentence in the chunk. Keep sentences above a relevance threshold; discard the rest.

import { pipeline } from '@xenova/transformers'
 
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2')
 
// Mean-pooled, L2-normalized sentence embedding
async function embed(text: string, model: any): Promise<number[]> {
  const output = await model(text, { pooling: 'mean', normalize: true })
  return Array.from(output.data as Float32Array)
}
 
// With normalized embeddings, cosine similarity reduces to a dot product
function cosineSimilarity(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0)
}
 
async function extractiveSentenceCompression(
  chunk: string,
  query: string,
  threshold: number = 0.45
): Promise<string> {
  const sentences = chunk
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 20)
 
  if (sentences.length <= 2) return chunk // nothing to compress
 
  const queryEmbedding = await embed(query, embedder)
  const scored = await Promise.all(
    sentences.map(async (sentence) => {
      const sentenceEmb = await embed(sentence, embedder)
      return { sentence, score: cosineSimilarity(queryEmbedding, sentenceEmb) }
    })
  )
 
  const relevant = scored
    .filter((s) => s.score >= threshold)
    .map((s) => s.sentence)
 
  return relevant.length > 0 ? relevant.join(' ') : sentences[0] // fallback: keep first sentence
}

This runs a mini-retrieval pass inside each chunk. The output retains only the sentences that answer the query — everything else is dropped before the context reaches the LLM.

Practical consideration: The embedding cost here is low. You're running inference on short sentences, not full documents. A batch of 10–20 sentences processes in tens of milliseconds on CPU. The latency addition is modest; the quality gain is not.

Generative Compression (LLM-based Extraction)

A second approach uses a small LLM to compress the chunk in light of the query. Rather than selecting sentences, you ask the model to extract and return only the information relevant to the question.

// Minimal interface for whatever LLM client wrapper you use
interface LLMClient {
  complete(prompt: string, opts: { maxTokens: number; temperature: number }): Promise<string>
}
 
async function generativeCompression(
  chunk: string,
  query: string,
  llm: LLMClient
): Promise<string> {
  const prompt = `You are extracting information relevant to a specific question from a product document.
 
Question: ${query}
 
Document excerpt:
${chunk}
 
Extract only the sentences or phrases from the document that directly answer or inform the question. 
Do not rephrase, infer, or add information. If nothing in the excerpt is relevant, respond with "NOT_RELEVANT".
Output only the extracted text, nothing else.`
 
  const result = await llm.complete(prompt, { maxTokens: 200, temperature: 0 })
  return result.trim() === 'NOT_RELEVANT' ? '' : result.trim()
}

Generative compression is more flexible than extractive — it can synthesize across scattered sentences in a chunk and handle less structured text. But it introduces two risks for product data:

  1. Hallucination in the compression step. Even with explicit instructions to not rephrase, a model occasionally smooths over precise numbers. A spec that says "max operating pressure: 6 bar (87 psi)" might be compressed to "suitable for high-pressure applications." That's a lossy transformation that could mislead the final answer.

  2. Cost and latency. You're adding an LLM call per retrieved chunk. With 5–10 chunks retrieved, that's 5–10 additional inference calls before your main answer generation call.
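A cheap guardrail against the first risk: verify that the model's output is a verbatim extract of the source chunk, and fall back to the uncompressed chunk (or to extractive compression) when it is not. A minimal sketch, assuming the same sentence-splitting convention used earlier (the function name is illustrative):

```typescript
// Returns true only if every sentence in the compressed output appears
// verbatim in the source chunk — catches paraphrased spec values
function isVerbatimExtract(compressed: string, chunk: string): boolean {
  if (compressed === '') return true // NOT_RELEVANT is always safe
  const parts = compressed
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter(Boolean)
  return parts.length > 0 && parts.every((p) => chunk.includes(p))
}
```

If the check fails, you know the model rewrote rather than extracted, and you can discard that compression rather than risk a mangled specification reaching the answer step.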

For product knowledge specifically, extractive compression is almost always preferable for specification content. Use generative compression for narrative content (guides, application notes, FAQs) where you need semantic synthesis rather than precise value preservation.


Where Compression Fits in the Full Pipeline

To understand where compression slots in, here's the full retrieval pipeline we've built up across this series:

User query
    │
    ├──► [Query understanding: intent + entity extraction]
    │
    ├──► Hybrid retrieval (BM25 + Dense vectors)
    │        └──► Top 50 candidate chunks
    │
    ├──► Cross-encoder reranking
    │        └──► Top 10 high-relevance chunks
    │
    ├──► Contextual compression  ◄── (this article)
    │        └──► Compressed excerpts (only query-relevant fragments)
    │
    └──► LLM answer generation
             └──► Final response

Compression operates on the output of reranking. At this stage, you've already narrowed from 50 to 10 chunks using a cross-encoder (covered in our reranking deep-dive). Compression then trims each of those 10 chunks to only the fragments that bear on the query. The LLM receives a dense, high-signal context window rather than a collection of loosely relevant documents.

This also means compression can recover from an imperfect reranking decision. If a chunk that made it into the top 10 is only partially relevant, compression removes the irrelevant parts. The LLM still benefits from the relevant fragment without being distracted by the noise.
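As a sketch, the post-rerank stage can be wired up like this — the `RankedChunk` shape and `Compressor` signature are illustrative stand-ins, not any specific framework's types:

```typescript
// Hypothetical shape for a reranked chunk; adapt to your pipeline's types
interface RankedChunk { id: string; content: string; score: number }

type Compressor = (chunk: string, query: string) => Promise<string>

// Compress each reranked chunk, then drop any that compress to empty —
// an empty result means nothing in the chunk bears on the query
async function compressRetrievedChunks(
  chunks: RankedChunk[],
  query: string,
  compress: Compressor
): Promise<RankedChunk[]> {
  const compressed = await Promise.all(
    chunks.map(async (c) => ({ ...c, content: await compress(c.content, query) }))
  )
  return compressed.filter((c) => c.content.trim().length > 0)
}
```

The `compress` parameter can be either the extractive or the generative variant, so the same orchestration serves both strategies.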


Token Reduction: More Than a Cost Play

The most obvious benefit of compression is token savings. If each retrieved chunk averages 400 tokens and compression reduces each to 80 tokens on average, you've cut context from 4,000 tokens to 800 tokens — an 80% reduction. At scale, this matters financially: a system handling 10,000 product queries per day, each shaving 3,200 tokens from LLM input, saves roughly 32M tokens per day from context alone.
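Spelling out that arithmetic:

```typescript
// Savings math from the paragraph above: 10 chunks per query,
// 400 tokens each compressed to 80, across 10,000 queries/day
const tokensPerChunk = 400
const compressedPerChunk = 80
const chunksPerQuery = 10
const savedPerQuery = chunksPerQuery * (tokensPerChunk - compressedPerChunk) // 3,200 tokens
const savedPerDay = 10_000 * savedPerQuery // 32,000,000 tokens/day
```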

But token reduction is actually the secondary benefit. The primary benefit is answer quality.

Less noise means the LLM can answer more precisely. In evaluation tests on B2B product catalogs, compression typically improves:

Metric                               | Without Compression | With Compression
Answer precision (exact spec values) | 71%                 | 88%
Answer hallucination rate            | 9%                  | 3%
Average response length              | 312 tokens          | 198 tokens
Query-answer relevance (human-rated) | 3.8 / 5             | 4.4 / 5

The hallucination rate reduction is the most striking. Many hallucinations in product RAG don't originate in the LLM inventing facts — they originate in the LLM misapplying facts from a different part of the retrieved context. A chunk about a similar product mentions a different flow rate; the LLM conflates it with the target product. Compression eliminates most of these cross-chunk confusion events by stripping out the off-topic content before it reaches the LLM.


Handling Product Data Specifics

Generic contextual compression tutorials work well for narrative text. Product catalogs have quirks that require adjustments.

Tables and Structured Specifications

Product specifications often appear as HTML tables or markdown tables in retrieved chunks. Sentence-level splitting doesn't work on tables — you'd destroy the structure. The solution: detect table regions and apply a different compression strategy.

For tables, filter at the row level rather than the sentence level. A specs table with 20 rows (dimensions, weight, material, operating temperature, ingress protection, certifications, etc.) can be compressed to 2–3 rows relevant to the query.

function compressSpecTable(tableMarkdown: string, query: string): string {
  const rows = tableMarkdown.split('\n').filter((r) => r.includes('|'))
  if (rows.length <= 3) return tableMarkdown // header + separator + 1 row: keep as-is
 
  const [header, separator, ...dataRows] = rows
 
  // Keep rows where the attribute name or value is semantically close to the query.
  // isRelevantToQuery can reuse the sentence-level similarity scorer from the
  // extractive step (embed both strings, compare against a threshold).
  const relevantRows = dataRows.filter((row) => {
    const cells = row.split('|').map((c) => c.trim()).filter(Boolean)
    return cells.some((cell) => isRelevantToQuery(cell, query))
  })
 
  if (relevantRows.length === 0) return '' // whole table is irrelevant to this query
  return [header, separator, ...relevantRows].join('\n')
}

This preserves the table structure (important for the LLM to correctly associate attribute names with values) while cutting the irrelevant rows.

Numerical Ranges and Units

Extractive compression should never split a numerical value from its unit or context. "Max flow: 450 L/min" should be kept together, not split into "Max flow:" and "450 L/min" or compressed to just "450."

A practical safeguard: if a sentence contains a number, keep the full sentence, not just the number. Numerical fragments without context are more dangerous than slightly verbose context.

const NUMBER_WITH_UNIT_PATTERN = /\b\d+(?:[.,]\d+)?\s*(?:mm|cm|m|kg|g|lb|oz|bar|psi|°C|°F|kW|W|V|A|Hz|L\/min|rpm|\/h)\b/i
 
function isSafeToDropSentence(sentence: string): boolean {
  // Never drop sentences containing measurement values
  if (NUMBER_WITH_UNIT_PATTERN.test(sentence)) return false
  // Never drop sentences containing part numbers (typically uppercase alphanumeric)
  if (/\b[A-Z]{2,}-[A-Z0-9]{2,}\b/.test(sentence)) return false
  return true
}

The NOT_RELEVANT Signal

Compression gives you a valuable signal: if a chunk compresses to empty (nothing in it is relevant to the query), that chunk should be dropped entirely. This is more reliable than relying purely on the reranker's score — sometimes a chunk scores reasonably in reranking but contains nothing that answers the specific question.

Dropping NOT_RELEVANT chunks has two benefits: the LLM sees less context (good), and it avoids the situation where the LLM "helpfully" interpolates from loosely related content (also good).

In practice, 10–25% of reranked chunks compress to empty on a given query. This is an expected outcome, not a failure — it means compression is correctly identifying the edge of relevance.


Integrating Compression with Metadata

Compression should operate on the full content of a chunk but preserve the metadata attached to that chunk. After compression, you still want to know:

  • Which product is this excerpt from?
  • What's the source document URL?
  • When was this product data last updated?

Metadata provides the citation and freshness signals that the LLM needs to construct a reliable answer. A compressed chunk with metadata looks like:

interface CompressedChunk {
  originalId: string
  productSku: string
  productName: string
  sourceUrl: string
  lastUpdated: string
  compressedContent: string // the trimmed excerpt
  compressionRatio: number // for monitoring
}

When you pass compressed chunks to the LLM, structure them with their metadata intact:

[Product: Grundfos CM5-4 A-R-A-E-AQQE — last updated 2026-03-01]
Max flow rate: 6.8 m³/h. Max head: 56 m. Motor power: 0.75 kW.

[Product: Grundfos CM5-6 A-R-A-E-AQQE — last updated 2026-03-01]
Max flow rate: 6.8 m³/h. Max head: 74 m. Motor power: 1.1 kW.

This structured format makes it easy for the LLM to correctly attribute each specification to the right product — especially important when multiple similar products appear in the retrieved context.
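A small formatter can produce that bracketed layout from the chunk shape above — a sketch, reusing the CompressedChunk interface from this section (the function name is illustrative):

```typescript
// Matches the CompressedChunk interface defined earlier in this section
interface CompressedChunk {
  originalId: string
  productSku: string
  productName: string
  sourceUrl: string
  lastUpdated: string
  compressedContent: string
  compressionRatio: number
}

// Render each chunk as a [Product: … — last updated …] header plus its excerpt
function formatContextForLLM(chunks: CompressedChunk[]): string {
  return chunks
    .map((c) => `[Product: ${c.productName} — last updated ${c.lastUpdated}]\n${c.compressedContent}`)
    .join('\n\n')
}
```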


Compression vs. Smaller Chunks: Why Not Just Chunk Better?

A reasonable question: couldn't you avoid needing compression by chunking product data at a finer granularity? If chunks were sentence-sized to begin with, you'd retrieve only relevant sentences and never need to trim them.

This is partially true but runs into a different problem: retrieval needs context to work well.

A sentence like "Max flow rate: 6.8 m³/h" is nearly impossible to match to a query like "what pump should I use for a 5 m³/h application?" without surrounding context. The sentence alone has no product name, no application context, no indication of what kind of pump or what the head requirement is. Very fine-grained chunks hurt retrieval recall because the embedding doesn't carry enough semantic content.

The standard resolution is the parent-child chunk architecture: index small child chunks for retrieval, but return the parent chunk as context. This gives you high-recall retrieval (small, focused chunks match queries better) while preserving context for the LLM (parent chunks contain surrounding information).

Compression fits naturally into this architecture as a final step: retrieve using child chunk embeddings, return parent chunk content, then compress the parent chunk to only the relevant fragments before passing to the LLM. You get the best of all three approaches.
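The middle step of that flow — resolving retrieved child chunks to deduplicated parents — can be sketched with hypothetical ChildChunk and ParentChunk shapes (illustrative, not from any specific framework):

```typescript
interface ChildChunk { id: string; parentId: string; text: string }
interface ParentChunk { id: string; text: string }

// Map child-level retrieval hits to their parent chunks, deduplicating so a
// parent retrieved via several children appears once, in first-hit order
function resolveParents(
  hits: ChildChunk[],
  parents: Map<string, ParentChunk>
): ParentChunk[] {
  const seen = new Set<string>()
  const out: ParentChunk[] = []
  for (const hit of hits) {
    if (seen.has(hit.parentId)) continue
    seen.add(hit.parentId)
    const parent = parents.get(hit.parentId)
    if (parent) out.push(parent)
  }
  return out
}
```

Each resolved parent then goes through compression against the query before it reaches the LLM.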


Monitoring and Tuning Compression in Production

Compression has two tuning parameters that matter: the relevance threshold (for extractive compression) and the compression ratio.

Threshold tuning: Setting the threshold too high means almost everything is dropped — you end up with empty context for many queries. Too low and compression provides no benefit. For product knowledge, a threshold around 0.4–0.5 cosine similarity typically works well as a starting point, with calibration against your specific catalog vocabulary.

Monitor compression ratios: Track the average compression ratio across queries in production. A healthy system typically compresses to 15–40% of original chunk length. If you're consistently seeing 5% or lower, your threshold is too aggressive. If you're seeing 80–90%, compression isn't doing much work — either your chunks are already well-targeted or your threshold is too permissive.
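Those bands can be encoded as a simple monitoring check. The cutoffs below mirror the rules of thumb above and should be tuned for your own catalog:

```typescript
type RatioVerdict = 'too-aggressive' | 'healthy' | 'too-permissive'

// Compare a query's compression ratio against the healthy 15–40% band;
// 5% and 80% cutoffs are starting points, not fixed constants
function classifyCompressionRatio(
  originalTokens: number,
  compressedTokens: number
): RatioVerdict {
  const ratio = compressedTokens / originalTokens
  if (ratio <= 0.05) return 'too-aggressive'
  if (ratio >= 0.8) return 'too-permissive'
  return 'healthy'
}
```

Emitting this verdict as a metric per query makes threshold drift visible long before it shows up in answer quality.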

Segment by query type: Compression behavior differs significantly by query type:

  • Specification lookups ("what is the max pressure?") → aggressive compression is appropriate, often to a single sentence
  • Application queries ("what pump works for 5 m³/h at 40m head?") → moderate compression, need to preserve comparative context
  • Troubleshooting queries ("why is my pump cavitating?") → light compression, troubleshooting content is often entirely relevant

Splitting compression behavior by query type (using the same intent classifier that handles query routing) gives you much better end-to-end quality than a single threshold applied uniformly.
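One way to wire that up is a per-intent threshold table — the intent labels and values here are illustrative, keyed to whatever labels your own classifier emits:

```typescript
// Illustrative per-intent extractive-compression thresholds; the labels come
// from whatever intent classifier handles query routing in your pipeline
const COMPRESSION_THRESHOLDS: Record<string, number> = {
  spec_lookup: 0.55,       // aggressive: often a single sentence survives
  application_query: 0.45, // moderate: keep comparative context
  troubleshooting: 0.3,    // light: most of the chunk is usually relevant
}

// Fall back to a middle-of-the-road threshold for unrecognized intents
function thresholdFor(intent: string, fallback = 0.45): number {
  return COMPRESSION_THRESHOLDS[intent] ?? fallback
}
```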


The Compounding Effect Across the Pipeline

Contextual compression is most powerful when combined with the rest of a well-designed retrieval pipeline. The gains compound.

Start with good chunking (covered earlier in this series) — chunks that preserve the natural semantic units of your product data. Add hybrid retrieval (also covered earlier) for broad recall across both exact identifiers and natural language queries. Apply cross-encoder reranking (see our reranking deep-dive) to filter the top 50 candidates down to the 10 most relevant. Then compress those 10 chunks to only what's relevant to the specific query.

Each layer removes noise. By the time the LLM sees the context, it's dense with signal: the right products, the right attributes, the exact values relevant to the question. That's the architecture that produces answers buyers actually trust — specific, accurate, and cited back to the original product data.

Compression is the last-mile refinement that makes the whole pipeline perform at its ceiling rather than its average.


Ready to See the Full Pipeline in Action?

Axoverna's product knowledge platform implements the complete retrieval pipeline — hybrid search, reranking, and contextual compression — specifically tuned for B2B product catalogs. You don't build or operate any of the infrastructure; you connect your catalog and Axoverna handles the rest.

Book a demo to run a live test on your own product data, or start a free trial and experience the difference a compression-optimized pipeline makes on the queries your customers are actually asking.
