Reranking in RAG: Why Two-Stage Retrieval Dramatically Improves Answer Quality
First-pass vector search is fast but imprecise. Learn how cross-encoder reranking transforms mediocre retrieval into highly accurate results — and why it matters for product knowledge systems.
If you've read our primer on how RAG works and our deep-dive on vector databases for product search, you already know that embedding-based retrieval is powerful. You also know it isn't perfect.
A buyer types: "What's the torque rating for the M16 stainless flange bolt at 200°C?"
Your vector index returns the top 10 chunks — some are relevant, some are close, some are noise. The LLM then generates an answer from whatever you hand it. If the best chunk is ranked 8th and you only pass in the top 5, you've already lost.
This is the retrieval precision problem, and reranking is the standard solution. It's one of the highest-leverage improvements you can make to a RAG pipeline that's already working but not quite right. This article explains what reranking is, how it works technically, and how to implement it in a product knowledge system.
Why First-Pass Retrieval Has a Precision Ceiling
Vector search works by embedding your query and your documents into the same high-dimensional space, then finding the nearest neighbors by cosine similarity. It's fast — you can search millions of chunks in milliseconds — and it handles semantic similarity well (finding "torque limit" when you asked about "maximum force").
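For intuition, cosine similarity is just a normalized dot product between two vectors. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings (which have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Normalized dot product: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
query = [0.9, 0.1, 0.0]         # "maximum force for M16 bolt"
doc_torque = [0.8, 0.2, 0.1]    # chunk about torque limits
doc_shipping = [0.0, 0.1, 0.9]  # chunk about shipping policy

# The torque chunk scores far closer to the query than the shipping chunk
print(cosine_similarity(query, doc_torque) > cosine_similarity(query, doc_shipping))  # True
```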
But embeddings compress meaning. A single embedding vector must represent the entire semantic content of a chunk, and that compression loses nuance. Two chunks about M16 bolts might be almost identical in embedding space, even if one is about stainless steel in high-temperature applications and the other is about zinc-plated bolts for furniture assembly.
The fundamental trade-off of bi-encoder (embedding) search:
| Property | Bi-Encoder / Vector Search |
|---|---|
| Speed | Very fast (pre-computed, ANN index) |
| Scalability | Millions of docs, milliseconds |
| Semantic coverage | Good — finds related concepts |
| Precision | Moderate — compressed representations lose nuance |
This is why serious RAG systems use a two-stage approach:
- Stage 1 — Retrieve: Use fast vector search to pull back a candidate set (e.g. top 50 chunks)
- Stage 2 — Rerank: Use a slower but more accurate model to score and re-order those candidates
The reranker doesn't need to scan your entire corpus — it only scores the small candidate set. You get the speed of vector search and the precision of a full pairwise comparison.
Cross-Encoders: The Engine of Reranking
The key to understanding reranking is the difference between bi-encoders and cross-encoders.
Bi-encoders (used in standard vector search) encode the query and document independently. You can pre-compute document embeddings offline. At query time, you only need to embed the query — then compare it against pre-stored vectors.
Cross-encoders encode the query and document together in a single forward pass. The model can directly attend to the relationship between them, catching nuances that bi-encoders miss.
Bi-encoder:

```
embed(query) ──┐
               ├──> cosine_similarity ──> score
embed(doc)   ──┘
```

Cross-encoder:

```
concat(query + doc) ──> transformer ──> score
```
Because cross-encoders process query and document jointly, they're significantly more accurate at judging relevance. They're also significantly slower — you can't pre-compute scores offline because the query isn't known in advance. That's why they're only practical for reranking a small candidate set, not for searching a full corpus.
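The cost asymmetry is worth making concrete. In this toy sketch, `encode` and `score_jointly` are fake stand-ins (not real models) that just count forward passes: the bi-encoder's 1,000 document encodes happen once, offline, while the cross-encoder needs one joint forward pass per (query, document) pair at query time:

```python
# Fake stand-ins for real models: count forward passes instead of doing inference.
forward_passes = {"bi": 0, "cross": 0}

def encode(text: str) -> list[float]:
    forward_passes["bi"] += 1
    return [float(len(text) % 7), float(len(text) % 3)]  # fake embedding

def score_jointly(query: str, doc: str) -> float:
    forward_passes["cross"] += 1
    return float(len(set(query.split()) & set(doc.split())))  # fake relevance

corpus = [f"doc {i} about bolts" for i in range(1000)]

# Bi-encoder: 1000 document encodes happen ONCE, offline
doc_vecs = [encode(d) for d in corpus]

# Query time, bi-encoder: one encode plus cheap dot products
q = encode("M16 bolt torque")
bi_scores = [sum(a * b for a, b in zip(q, v)) for v in doc_vecs]

# Query time, cross-encoder: one full forward pass PER document
cross_scores = [score_jointly("M16 bolt torque", d) for d in corpus]

print(forward_passes)  # {'bi': 1001, 'cross': 1000} — but the 1000 bi encodes were offline
```

This is exactly why the cross-encoder only sees the candidate set: 40 forward passes per query is affordable, a million is not.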
Implementing Two-Stage Retrieval
Here's a practical implementation in TypeScript for a product knowledge context. We'll use a vector store for first-pass retrieval and a cross-encoder API for reranking.
Step 1: First-Pass Retrieval
Retrieve a larger-than-usual candidate set. If you normally pass 5 chunks to your LLM, retrieve 25–50 for reranking.
```typescript
async function firstPassRetrieve(
  query: string,
  vectorStore: VectorStore,
  candidateCount: number = 40
): Promise<Chunk[]> {
  const queryEmbedding = await embed(query)
  return vectorStore.similaritySearch(queryEmbedding, candidateCount)
}
```

The exact number depends on your corpus and latency budget. More candidates give the reranker more to work with, at the cost of reranking time. In practice, 20–50 candidates work well for most product catalogs.
Step 2: Reranking
Popular options for the reranking model:
- Cohere Rerank — managed API, easy to integrate, strong performance on technical text
- `cross-encoder/ms-marco-MiniLM-L-6-v2` — open-source via HuggingFace, good speed/accuracy trade-off
- `BAAI/bge-reranker-v2-m3` — state-of-the-art open-source reranker as of early 2026
- Jina Reranker — managed API with good multilingual support, useful for international catalogs
Here's an example using Cohere's managed reranker:
```typescript
import { CohereClient } from 'cohere-ai'

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY })

async function rerank(
  query: string,
  candidates: Chunk[],
  topN: number = 5
): Promise<Chunk[]> {
  const response = await cohere.rerank({
    model: 'rerank-english-v3.0',
    query,
    documents: candidates.map((c) => c.text),
    topN,
  })
  return response.results.map((result) => ({
    ...candidates[result.index],
    rerankScore: result.relevanceScore,
  }))
}
```

And the self-hosted version, using transformers.js or a Python sidecar:
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('BAAI/bge-reranker-v2-m3')

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    pairs = [[query, doc] for doc in candidates]
    scores = model.predict(pairs)
    ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_n]
```

Step 3: Generate with Reranked Context
After reranking, pass only the top-N chunks to the LLM:
```typescript
async function answerWithReranking(query: string): Promise<string> {
  // Stage 1: broad retrieval
  const candidates = await firstPassRetrieve(query, vectorStore, 40)

  // Stage 2: precise reranking
  const reranked = await rerank(query, candidates, 5)

  // Build context from top-ranked chunks
  const context = reranked
    .map((chunk) => `[${chunk.source}]\n${chunk.text}`)
    .join('\n\n---\n\n')

  // Generate
  return llm.complete({
    system: PRODUCT_SYSTEM_PROMPT,
    user: `Context:\n${context}\n\nQuestion: ${query}`,
  })
}
```

What Changes in a Product Catalog Context
General-purpose rerankers trained on web data (like MS MARCO) work reasonably well out of the box, but product catalogs have specific characteristics worth accounting for.
Exact Specification Matching Matters More
A buyer asking "M12 × 1.75 thread pitch stainless A2-70 bolt" is giving you a precise technical specification. Generic rerankers may score a loosely related fastener chunk highly because the text sounds similar. In product knowledge contexts, exact attribute matching often trumps semantic similarity.
One approach: combine the reranker score with an exact-match boost for key attributes.
```typescript
function hybridScore(chunk: Chunk, query: string, rerankScore: number): number {
  // extractedAttributes() is assumed to parse specs like "M12" or "A2-70" from the query
  const exactMatchBoost =
    extractedAttributes(query).filter((attr) => chunk.text.includes(attr)).length * 0.1 // +0.1 per exact-matched attribute
  return rerankScore + exactMatchBoost
}
```

Part Numbers Are Not Semantic
Part numbers like `SS-M12-175-A270-HEX-50` contain no semantic meaning for an embedding model. Ensure your chunking strategy (covered in our document chunking guide) preserves part numbers in full and includes surrounding context.
A dedicated part number lookup layer — exact string match before you even hit the vector index — handles the most common query type in industrial catalogs. Let the semantic layer handle everything that can't be resolved by exact match.
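A minimal sketch of that routing layer, assuming a hypothetical `semantic_search` fallback and a simple regex for part-number-shaped tokens (tune the pattern and the index to your own numbering scheme):

```python
import re

# Matches part-number-shaped tokens like SS-M12-175-A270-HEX-50:
# uppercase/digit segments joined by at least two hyphens. Adjust for your scheme.
PART_NUMBER_RE = re.compile(r"\b[A-Z0-9]{2,}(?:-[A-Z0-9]{2,}){2,}\b")

# Hypothetical exact-match index, keyed by normalized part number
part_index = {
    "SS-M12-175-A270-HEX-50": "M12 x 1.75 A2-70 stainless hex bolt, box of 50",
}

def route_query(query: str, semantic_search) -> str:
    """Try exact part-number lookup first; fall back to semantic retrieval."""
    for token in PART_NUMBER_RE.findall(query.upper()):
        if token in part_index:
            return part_index[token]
    return semantic_search(query)

print(route_query("price for ss-m12-175-a270-hex-50?", lambda q: "semantic"))
print(route_query("stainless bolt for flanges", lambda q: "semantic"))  # falls through
```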
Multilingual Catalogs
If your product catalog exists in multiple languages (German, French, Dutch for EU distributors), prefer multilingual rerankers. `BAAI/bge-reranker-v2-m3` and Jina's reranker handle cross-lingual reranking well — a French query can rerank German document chunks, which is powerful for catalogs that haven't been fully translated.
Measuring Retrieval Quality Before and After Reranking
Improving retrieval without measuring it is guesswork. Build a small evaluation set:
- Collect 50–100 real queries from your support logs or sales team
- For each query, manually identify the 1–3 chunks that should be in the context (the "golden" relevant chunks)
- Run both pipelines and measure whether those chunks appear in the top-5 results
Key metrics:
Recall@K: Of all relevant chunks, what fraction appears in the top-K results?
recall@5 = (relevant chunks in top 5) / (total relevant chunks)
Mean Reciprocal Rank (MRR): Measures how highly the first relevant result is ranked.
MRR = (1/N) * Σ (1 / rank_of_first_relevant_result)
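Both metrics take only a few lines to implement — a sketch assuming results and golden chunks are identified by ID:

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average over queries of 1 / rank of first relevant result (0 if none found)."""
    total = 0.0
    for ranked_ids, relevant_ids in runs:
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(runs)

# Two toy queries: first relevant hit at rank 1 and rank 3 respectively
runs = [
    (["c1", "c7", "c2"], {"c1", "c2"}),
    (["c9", "c4", "c3"], {"c3"}),
]
print(recall_at_k(["c1", "c7", "c2"], {"c1", "c2"}, k=2))  # 0.5 — c2 is ranked 3rd
print(mean_reciprocal_rank(runs))                          # (1/1 + 1/3) / 2 ≈ 0.667
```

Run both pipelines (with and without reranking) over the same evaluation set and compare these numbers directly.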
In our experience with industrial product catalogs, adding a cross-encoder reranking step typically improves Recall@5 by 15–30 percentage points over vanilla vector search, with the biggest gains on queries that involve specific technical attributes.
Latency Trade-offs
The obvious concern with two-stage retrieval is latency. Here's what to expect:
| Stage | Typical Latency |
|---|---|
| Embedding query | 20–50ms |
| Vector search (top 40) | 5–30ms |
| Cross-encoder rerank (40→5) | 80–300ms (managed API) |
| Cross-encoder rerank (40→5) | 20–80ms (self-hosted, GPU) |
| LLM generation | 500–2000ms |
Reranking adds 80–300ms to a pipeline whose bottleneck is already LLM generation at 500ms+. For interactive chat interfaces, this is negligible. For bulk operations or real-time autocomplete, optimize by reducing candidate count or using a faster (smaller) reranker model.
Managed API latency varies with load. If consistency matters, self-hosting a quantized reranker model is worth it.
When Reranking Isn't the Right Fix
Reranking helps with precision — surfacing the most relevant chunks from a good candidate set. It won't fix:
- Missing data: If the answer isn't in your corpus, no reranker will find it. Gaps in product data need to be filled at the source.
- Bad chunking: If relevant information is split across chunks in ways that lose context, reranking can't reconstruct it. Fix chunking strategy first (see our chunking deep-dive).
- Embedding model mismatch: If your embedding model systematically fails to retrieve a relevant chunk (it doesn't appear in the top 40 candidates), reranking can't help. You need a better embedding model or a hybrid retrieval strategy that combines semantic and full-text search.
- Ambiguous queries: If the query genuinely has multiple valid interpretations, reranking will pick one. The right solution is query clarification or multi-path retrieval.
Think of reranking as the final polish on your retrieval pipeline, not a substitute for solid foundations.
A Practical Rollout Plan
If you're adding reranking to an existing RAG system:
- Baseline first: Capture your current Recall@5 and MRR on a representative query set before changing anything.
- Start with a managed API: Cohere Rerank or Jina Reranker let you validate the improvement without infrastructure investment. Measure the uplift.
- Tune candidate count: Try retrieving 20, 40, and 60 candidates. More candidates generally helps up to a point, then the quality of fringe candidates dilutes the reranker's signal.
- Profile latency end-to-end: Measure P50 and P99 latency with reranking enabled. Make sure it's acceptable for your interface.
- If it works, consider self-hosting: For production systems handling hundreds of queries per day, a self-hosted reranker (e.g. running `bge-reranker-v2-m3` on a GPU instance) cuts API costs and latency. For lower-volume deployments, managed is fine.
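The candidate-count tuning step is easy to script. A sketch of the sweep loop, where `retrieve_and_rerank` is a hypothetical stand-in for your own pipeline (the toy version below exists only to make the sketch runnable):

```python
def sweep_candidate_counts(eval_set, retrieve_and_rerank, counts=(20, 40, 60)):
    """For each candidate count, average recall@5 over the evaluation set."""
    results = {}
    for k in counts:
        total = 0.0
        for query, relevant_ids in eval_set:
            top5 = retrieve_and_rerank(query, candidate_count=k, top_n=5)
            total += len(set(top5) & relevant_ids) / len(relevant_ids)
        results[k] = total / len(eval_set)
    return results

# Deterministic toy pipeline so the sweep runs; plug in your real two-stage pipeline.
def toy_pipeline(query, candidate_count, top_n):
    pool = [f"chunk-{i}" for i in range(candidate_count)]
    return pool[:top_n]

eval_set = [("torque m16", {"chunk-1", "chunk-30"}), ("flange bolt", {"chunk-2"})]
print(sweep_candidate_counts(eval_set, toy_pipeline))
```

Plot or tabulate the results per candidate count and pick the smallest count whose recall is within noise of the best.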
The Bigger Picture: Retrieval Is Your Product Quality Floor
Every AI answer your system produces is only as good as the context you retrieve. The LLM is a reasoning engine — it works with what you give it. If you give it the right product data, it produces accurate, useful answers. If retrieval misses the mark, hallucination fills the gap.
Reranking is one of the most cost-effective ways to raise the quality floor of your product knowledge system. The implementation is straightforward, the gains are measurable, and it works without changes to your data model, chunking strategy, or LLM prompt.
If you're seeing confident-sounding but occasionally wrong answers from your AI, retrieval precision is usually the first place to look. Two-stage retrieval with a cross-encoder reranker is the fix most teams reach for — and it consistently delivers.
Want to See This in Action?
Axoverna's product knowledge platform includes two-stage retrieval out of the box, tuned specifically for B2B product catalogs. No ML infrastructure to stand up — just connect your catalog and start improving answer accuracy from day one.
Book a demo to see how it performs on your product data, or start a free trial and run a benchmark against your existing search.
Turn your product catalog into an AI knowledge base
Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.