Reranking in RAG: Why Two-Stage Retrieval Dramatically Improves Answer Quality

First-pass vector search is fast but imprecise. Learn how cross-encoder reranking transforms mediocre retrieval into highly accurate results — and why it matters for product knowledge systems.

Axoverna Team
10 min read

If you've read our primer on how RAG works and our deep-dive on vector databases for product search, you already know that embedding-based retrieval is powerful. You also know it isn't perfect.

A buyer types: "What's the torque rating for the M16 stainless flange bolt at 200°C?"

Your vector index returns the top 10 chunks — some are relevant, some are close, some are noise. The LLM then generates an answer from whatever you hand it. If the best chunk is ranked 8th and you only pass in the top 5, you've already lost.

This is the retrieval precision problem, and reranking is the standard solution. It's one of the highest-leverage improvements you can make to a RAG pipeline that's already working but not quite right. This article explains what reranking is, how it works technically, and how to implement it in a product knowledge system.


Why First-Pass Retrieval Has a Precision Ceiling

Vector search works by embedding your query and your documents into the same high-dimensional space, then finding the nearest neighbors by cosine similarity. It's fast — you can search millions of chunks in milliseconds — and it handles semantic similarity well (finding torque limit when you asked about maximum force).

But embeddings compress meaning. A single embedding vector must represent the entire semantic content of a chunk, and that compression loses nuance. Two chunks about M16 bolts might be almost identical in embedding space, even if one is about stainless steel in high-temperature applications and the other is about zinc-plated bolts for furniture assembly.

The fundamental trade-off of bi-encoder (embedding) search:

Property           Bi-Encoder / Vector Search
Speed              Very fast (pre-computed, ANN index)
Scalability        Millions of docs, milliseconds
Semantic coverage  Good — finds related concepts
Precision          Moderate — compressed representations lose nuance

This is why serious RAG systems use a two-stage approach:

  1. Stage 1 — Retrieve: Use fast vector search to pull back a candidate set (e.g. top 50 chunks)
  2. Stage 2 — Rerank: Use a slower but more accurate model to score and re-order those candidates

The reranker doesn't need to scan your entire corpus — it only scores the small candidate set. You get the speed of vector search and the precision of a full pairwise comparison.


Cross-Encoders: The Engine of Reranking

The key to understanding reranking is the difference between bi-encoders and cross-encoders.

Bi-encoders (used in standard vector search) encode the query and document independently. You can pre-compute document embeddings offline. At query time, you only need to embed the query — then compare it against pre-stored vectors.

Cross-encoders encode the query and document together in a single forward pass. The model can directly attend to the relationship between them, catching nuances that bi-encoders miss.

Bi-encoder:
  embed(query)  ──┐
                  ├──> cosine_similarity ──> score
  embed(doc)    ──┘

Cross-encoder:
  concat(query + doc) ──> transformer ──> score

Because cross-encoders process query and document jointly, they're significantly more accurate at judging relevance. They're also significantly slower — you can't pre-compute scores offline because the query isn't known in advance. That's why they're only practical for reranking a small candidate set, not for searching a full corpus.
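The bi-encoder side of that diagram is plain vector math. As a minimal, self-contained sketch (the function below is illustrative, not taken from any particular library), cosine similarity over two embedding vectors looks like this; a cross-encoder score, by contrast, requires a model forward pass per query-document pair and can't be reduced to arithmetic over pre-computed vectors:

```typescript
// Bi-encoder relevance: documents are embedded offline, the query at
// request time, and relevance is approximated by cosine similarity.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

Identical vectors score 1 and orthogonal vectors score 0; a cross-encoder replaces this fixed formula with a learned scoring function over the concatenated pair.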


Implementing Two-Stage Retrieval

Here's a practical implementation in TypeScript for a product knowledge context. We'll use a vector store for first-pass retrieval and a cross-encoder API for reranking.

Step 1: First-Pass Retrieval

Retrieve a larger-than-usual candidate set. If you normally pass 5 chunks to your LLM, retrieve 25–50 for reranking.

async function firstPassRetrieve(
  query: string,
  vectorStore: VectorStore,
  candidateCount: number = 40
): Promise<Chunk[]> {
  const queryEmbedding = await embed(query)
  return vectorStore.similaritySearch(queryEmbedding, candidateCount)
}

The exact number depends on your corpus and latency budget. More candidates give the reranker more to work with, at the cost of added reranking time. In practice, 20–50 candidates work well for most product catalogs.

Step 2: Reranking

Popular options for the reranking model:

  • Cohere Rerank — managed API, easy to integrate, strong performance on technical text
  • cross-encoder/ms-marco-MiniLM-L-6-v2 — open-source via HuggingFace, good speed/accuracy trade-off
  • BAAI/bge-reranker-v2-m3 — state-of-the-art open-source reranker as of early 2026
  • Jina Reranker — managed API with good multilingual support, useful for international catalogs

Here's an example using Cohere's managed reranker:

import { CohereClient } from 'cohere-ai'
 
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY })
 
async function rerank(
  query: string,
  candidates: Chunk[],
  topN: number = 5
): Promise<Chunk[]> {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: candidates.map((c) => c.text),
    topN,
  })
 
  return response.results.map((result) => ({
    ...candidates[result.index],
    rerankScore: result.relevanceScore,
  }))
}

And a self-hosted version, shown here as a Python sidecar using sentence-transformers (transformers.js is an option if you'd rather stay in Node):

from sentence_transformers import CrossEncoder
 
model = CrossEncoder('BAAI/bge-reranker-v2-m3')
 
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    pairs = [[query, doc] for doc in candidates]
    scores = model.predict(pairs)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]

Step 3: Generate with Reranked Context

After reranking, pass only the top-N chunks to the LLM:

async function answerWithReranking(query: string): Promise<string> {
  // Stage 1: broad retrieval
  const candidates = await firstPassRetrieve(query, vectorStore, 40)
 
  // Stage 2: precise reranking
  const reranked = await rerank(query, candidates, 5)
 
  // Build context from top-ranked chunks
  const context = reranked
    .map((chunk) => `[${chunk.source}]\n${chunk.text}`)
    .join('\n\n---\n\n')
 
  // Generate
  return llm.complete({
    system: PRODUCT_SYSTEM_PROMPT,
    user: `Context:\n${context}\n\nQuestion: ${query}`,
  })
}

What Changes in a Product Catalog Context

General-purpose rerankers trained on web data (like MS MARCO) work reasonably well out of the box, but product catalogs have specific characteristics worth accounting for.

Exact Specification Matching Matters More

A buyer asking "M12 × 1.75 thread pitch stainless A2-70 bolt" is giving you a precise technical specification. Generic rerankers may score a loosely related fastener chunk highly because the text sounds similar. In product knowledge contexts, exact attribute matching often trumps semantic similarity.

One approach: combine the reranker score with an exact-match boost for key attributes.

// extractAttributes (not shown) is assumed to pull key specs such as
// "M12", "1.75", and "A2-70" out of the query.
function hybridScore(chunk: Chunk, query: string, rerankScore: number): number {
  const exactMatchBoost = extractAttributes(query)
    .filter((attr) => chunk.text.includes(attr))
    .length * 0.1  // +0.1 per exact-matched attribute

  return rerankScore + exactMatchBoost
}

Part Numbers Are Not Semantic

Part numbers like SS-M12-175-A270-HEX-50 contain no semantic meaning for an embedding model. Ensure your chunking strategy (covered in our document chunking guide) preserves part numbers in full and includes surrounding context.

A dedicated part number lookup layer — exact string match before you even hit the vector index — handles the most common query type in industrial catalogs. Let the semantic layer handle everything that can't be resolved by exact match.
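A sketch of that fast path, assuming part numbers follow a hyphenated uppercase convention like the example above (the regex and the index shape are illustrative assumptions, not a universal format):

```typescript
// Illustrative part-number pattern: an uppercase prefix followed by two or
// more hyphenated alphanumeric segments, e.g. SS-M12-175-A270-HEX-50.
const PART_NUMBER_PATTERN = /\b[A-Z]{2,}(?:-[A-Z0-9]+){2,}\b/

// Try an exact part-number lookup before touching the vector index.
// `index` maps part numbers to chunk IDs; null signals fallback to
// semantic retrieval.
function lookupPartNumber(query: string, index: Map<string, string>): string | null {
  const match = query.toUpperCase().match(PART_NUMBER_PATTERN)
  if (!match) return null
  return index.get(match[0]) ?? null
}
```

Queries without a recognizable part number fall through to the semantic pipeline unchanged.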

Multilingual Catalogs

If your product catalog exists in multiple languages (German, French, Dutch for EU distributors), prefer multilingual rerankers. BAAI/bge-reranker-v2-m3 and Jina's reranker handle cross-lingual reranking well — a French query can rerank German document chunks, which is powerful for catalogs that haven't been fully translated.


Measuring Retrieval Quality Before and After Reranking

Improving retrieval without measuring it is guesswork. Build a small evaluation set:

  1. Collect 50–100 real queries from your support logs or sales team
  2. For each query, manually identify the 1–3 chunks that should be in the context (the "golden" relevant chunks)
  3. Run both pipelines and measure whether those chunks appear in the top-5 results

Key metrics:

Recall@K: Of all relevant chunks, what fraction appears in the top-K results?

recall@5 = (relevant chunks in top 5) / (total relevant chunks)

Mean Reciprocal Rank (MRR): Measures how highly the first relevant result is ranked.

MRR = (1/N) * Σ (1 / rank_of_first_relevant_result)
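Both metrics are a few lines of code. A minimal sketch over an eval set, where the `EvalCase` shape is an assumption for illustration: `golden` holds the IDs of the manually identified relevant chunks, `ranked` the retrieved chunk IDs in result order:

```typescript
interface EvalCase {
  golden: string[]  // IDs of the "golden" relevant chunks for this query
  ranked: string[]  // retrieved chunk IDs, best-ranked first
}

// Recall@K: fraction of all relevant chunks that appear in the top K.
function recallAtK(cases: EvalCase[], k: number): number {
  let hit = 0
  let total = 0
  for (const c of cases) {
    const topK = new Set(c.ranked.slice(0, k))
    hit += c.golden.filter((id) => topK.has(id)).length
    total += c.golden.length
  }
  return hit / total
}

// MRR: average of 1 / rank of the first relevant result per query
// (0 when no relevant chunk was retrieved at all).
function meanReciprocalRank(cases: EvalCase[]): number {
  let sum = 0
  for (const c of cases) {
    const rank = c.ranked.findIndex((id) => c.golden.includes(id))
    sum += rank === -1 ? 0 : 1 / (rank + 1)
  }
  return sum / cases.length
}
```

Run both functions against the candidate list before reranking and the reordered list after it, and the uplift is directly comparable.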

In our experience with industrial product catalogs, adding a cross-encoder reranking step typically improves Recall@5 by 15–30 percentage points over vanilla vector search, with the biggest gains on queries that involve specific technical attributes.


Latency Trade-offs

The obvious concern with two-stage retrieval is latency. Here's what to expect:

Stage                              Typical Latency
Embedding query                    20–50ms
Vector search (top 40)             5–30ms
Cross-encoder rerank (40→5)        80–300ms (managed API)
Cross-encoder rerank (40→5)        20–80ms (self-hosted, GPU)
LLM generation                     500–2000ms

Reranking adds 80–300ms to a pipeline whose bottleneck is already LLM generation at 500ms+. For interactive chat interfaces, this is negligible. For bulk operations or real-time autocomplete, optimize by reducing candidate count or using a faster (smaller) reranker model.

Managed API latency varies with load. If consistency matters, self-hosting a quantized reranker model is worth it.


When Reranking Isn't the Right Fix

Reranking helps with precision — surfacing the most relevant chunks from a good candidate set. It won't fix:

  • Missing data: If the answer isn't in your corpus, no reranker will find it. Gaps in product data need to be filled at the source.
  • Bad chunking: If relevant information is split across chunks in ways that lose context, reranking can't reconstruct it. Fix chunking strategy first (see our chunking deep-dive).
  • Embedding model mismatch: If your embedding model systematically fails to retrieve a relevant chunk (it doesn't appear in the top 40 candidates), reranking can't help. You need a better embedding model or a hybrid retrieval strategy that combines semantic and full-text search.
  • Ambiguous queries: If the query genuinely has multiple valid interpretations, reranking will pick one. The right solution is query clarification or multi-path retrieval.

Think of reranking as the final polish on your retrieval pipeline, not a substitute for solid foundations.


A Practical Rollout Plan

If you're adding reranking to an existing RAG system:

  1. Baseline first: Capture your current Recall@5 and MRR on a representative query set before changing anything.
  2. Start with a managed API: Cohere Rerank or Jina Reranker let you validate the improvement without infrastructure investment. Measure the uplift.
  3. Tune candidate count: Try retrieving 20, 40, and 60 candidates. More candidates generally helps up to a point, then the quality of fringe candidates dilutes the reranker's signal.
  4. Profile latency end-to-end: Measure P50 and P99 latency with reranking enabled. Make sure it's acceptable for your interface.
  5. If it works, consider self-hosting: For production systems handling hundreds of queries per day, a self-hosted reranker (e.g. running bge-reranker-v2-m3 on a GPU instance) cuts API costs and latency. For lower-volume deployments, managed is fine.
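Step 3 is easy to automate. A hypothetical sketch, where `evaluatePipeline` is an assumed callback that runs your two-stage pipeline over the eval set at a given candidate count and returns Recall@5:

```typescript
// Sweep candidate counts and collect the resulting retrieval quality,
// so the count/quality trade-off can be read off directly.
async function sweepCandidateCounts(
  counts: number[],
  evaluatePipeline: (candidateCount: number) => Promise<number>
): Promise<{ count: number; recall: number }[]> {
  const results: { count: number; recall: number }[] = []
  for (const count of counts) {
    results.push({ count, recall: await evaluatePipeline(count) })
  }
  return results
}
```

Plot the returned pairs and pick the smallest count past which recall flattens out.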

The Bigger Picture: Retrieval Is Your Product Quality Floor

Every AI answer your system produces is only as good as the context you retrieve. The LLM is a reasoning engine — it works with what you give it. If you give it the right product data, it produces accurate, useful answers. If retrieval misses the mark, hallucination fills the gap.

Reranking is one of the most cost-effective ways to raise the quality floor of your product knowledge system. The implementation is straightforward, the gains are measurable, and it works without changes to your data model, chunking strategy, or LLM prompt.

If you're seeing confident-sounding but occasionally wrong answers from your AI, retrieval precision is usually the first place to look. Two-stage retrieval with a cross-encoder reranker is the fix most teams reach for — and it consistently delivers.


Want to See This in Action?

Axoverna's product knowledge platform includes two-stage retrieval out of the box, tuned specifically for B2B product catalogs. No ML infrastructure to stand up — just connect your catalog and start improving answer accuracy from day one.

Book a demo to see how it performs on your product data, or start a free trial and run a benchmark against your existing search.
