Hybrid Search in Practice: Combining BM25 and Dense Vectors for B2B Product Catalogs
Neither keyword search nor vector search alone handles the full range of B2B product queries. Hybrid search — fusing BM25 and dense retrieval — is how serious product AI systems solve both halves of the problem.
If you've been following along with our series on B2B product retrieval, you know the story so far. Keyword search fails on complex B2B catalogs because it can't handle synonyms, intent, or natural language phrasing. Semantic search and vector embeddings solve the semantic gap — but introduce a new failure mode: they struggle with exact identifiers, part numbers, and precise technical attributes.
Every production RAG system we've seen at scale eventually reaches the same conclusion: you need both.
Hybrid search — combining traditional BM25 keyword scoring with dense vector retrieval — is the architecture that handles the full spectrum of queries a B2B buyer actually asks. This article is a practical guide to how it works, how to implement it, and how to tune it for a product catalog specifically.
The Two Failure Modes You're Trying to Solve
Before diving into the solution, it's worth being precise about the problem.
Where pure vector search breaks down
Dense retrieval is excellent at matching meaning across surface variations. "What's the load rating for your structural channel?" matches "maximum allowable load on C-channels" even though no words overlap. That's real value.
But embeddings are terrible at exact-match retrieval. A query for part number 304-SS-HEX-M10-1.5-A2 is a precise identifier with no semantic content. To an embedding model, it looks like a string of tokens with no meaningful semantic relationship — and it will confidently retrieve a chunk about a superficially similar part number that isn't the right product at all.
The same failure applies to model codes, SKUs, CAS numbers in chemical products, UPC codes, and any domain-specific identifier your industry uses. If a buyer knows exactly what they want, vector search often gets in the way.
Where pure BM25 breaks down
BM25 is a refined TF-IDF scoring function. It's fast, deterministic, and explainable. It's also brittle. It matches tokens — not intent.
A buyer typing "what can I use to connect 3/4 inch copper pipe in a wet environment" will get poor results if your catalog describes the same product as "push-fit plumbing fitting, 19mm, corrosion-resistant." The tokens don't overlap. BM25 scores that chunk near zero even though it's the perfect answer.
Synonyms, unit conversions (inches to millimeters), technical jargon, and natural language phrasing all kill keyword retrieval. We covered the specifics in our analysis of why keyword search fails in B2B — the short version is that buyers don't speak catalog.
The hybrid premise
The insight behind hybrid search is simple: these two failure modes are almost perfectly complementary. When BM25 fails (semantic mismatch), vector search succeeds. When vector search fails (exact identifiers, token-specific queries), BM25 succeeds.
A hybrid system runs both retrieval paths in parallel and fuses the results. Done right, it achieves higher recall than either approach alone — without sacrificing the precision of exact-match retrieval.
A Quick Primer on BM25
BM25 (Best Match 25) is the ranking function behind most traditional search engines, including Elasticsearch and OpenSearch at their core. Understanding the formula helps when you need to tune it.
For a query term q and a document D, the BM25 score is approximately:
score(D, q) = IDF(q) × (tf(q,D) × (k1 + 1)) / (tf(q,D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- tf(q,D) = term frequency of q in document D
- IDF(q) = inverse document frequency (rarer terms score higher)
- |D| = document length
- avgdl = average document length across the corpus
- k1 (default ≈ 1.2) = term frequency saturation: controls how much repeated terms boost the score
- b (default ≈ 0.75) = length normalization: penalizes longer documents
For product catalogs, you'll often want to tune b downward (toward 0.3–0.5). Product descriptions vary dramatically in length — a one-line fastener entry versus a multi-page pump specification — and heavy length normalization penalizes the longer entries unfairly. In a catalog, a longer product description is usually just a more complete one, not bloat.
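The per-term score is small enough to check in code. Here is a minimal TypeScript sketch of the formula above (bm25TermScore is an illustrative name, and idf is assumed precomputed), showing how lowering b lifts a long document's score:

```typescript
// Minimal sketch of the BM25 per-term score from the formula above.
// Defaults match the classic values (k1 = 1.2, b = 0.75); idf is precomputed.
function bm25TermScore(
  tf: number,        // term frequency of q in document D
  idf: number,       // inverse document frequency of q
  docLen: number,    // |D|
  avgDocLen: number, // avgdl
  k1 = 1.2,
  b = 0.75
): number {
  return (idf * (tf * (k1 + 1))) / (tf + k1 * (1 - b + b * (docLen / avgDocLen)))
}

// A document 3x the average length, same term stats:
const withDefaultB = bm25TermScore(3, 2.0, 900, 300)           // b = 0.75
const withLowerB = bm25TermScore(3, 2.0, 900, 300, 1.2, 0.4)   // b = 0.4
// Lower b reduces the length penalty, so withLowerB > withDefaultB
```

With the numbers above, b = 0.75 yields a score of 2.2 while b = 0.4 yields roughly 2.56 — the long, complete product description is no longer pushed down the ranking just for being long.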
Dense Retrieval: The Counterpart
Dense retrieval, covered in depth in our vector databases for product search article, works by embedding both documents and queries into a shared semantic vector space. Retrieval is Approximate Nearest Neighbor (ANN) search over pre-computed embeddings.
The key difference from BM25:
| Property | BM25 | Dense Vector |
|---|---|---|
| Query-doc relationship | Token overlap | Semantic proximity |
| Handles synonyms | No | Yes |
| Handles part numbers | Yes (exact) | Poor |
| Handles natural language | Limited | Yes |
| Infrastructure | Inverted index | Vector index (HNSW, etc.) |
| Pre-computation | Inverted index at ingest | Embeddings at ingest |
| Query time | Milliseconds | Milliseconds (ANN) |
Both are fast at query time — the latency difference in practice is negligible. Both require pre-computation at ingest. The choice isn't either/or; it's both.
Fusion Strategies: How to Combine Two Result Lists
You now have two ranked lists of candidate chunks — one from BM25, one from dense retrieval. How do you merge them into a single ranked list to pass to the reranker or directly to the LLM?
Option 1: Reciprocal Rank Fusion (RRF)
RRF is the most robust fusion method, and the one we recommend as a starting point. It doesn't require you to normalize scores across the two systems (which is tricky since BM25 scores and cosine similarity scores operate on completely different scales). Instead, it uses only the rank of each result.
RRF_score(chunk) = Σ 1 / (k + rank_in_list_i)
Where k is a constant (typically 60) and the sum is over each retrieval list the chunk appears in.
In practice:
function reciprocalRankFusion(
bm25Results: RankedChunk[],
vectorResults: RankedChunk[],
k: number = 60
): RankedChunk[] {
const scores = new Map<string, number>()
const addScores = (results: RankedChunk[]) => {
results.forEach((chunk, index) => {
const rank = index + 1
const current = scores.get(chunk.id) ?? 0
scores.set(chunk.id, current + 1 / (k + rank))
})
}
addScores(bm25Results)
addScores(vectorResults)
// Merge unique chunks and sort by fused score
const allChunks = new Map<string, RankedChunk>()
;[...bm25Results, ...vectorResults].forEach((c) => allChunks.set(c.id, c))
return [...allChunks.values()]
.map((chunk) => ({ ...chunk, rrfScore: scores.get(chunk.id) ?? 0 }))
.sort((a, b) => b.rrfScore - a.rrfScore)
}

RRF is remarkably forgiving. Even when one retrieval system performs poorly on a given query, the fusion naturally down-weights its contribution because its results appear lower in the combined ranking.
Option 2: Weighted Linear Combination
If you want more control, you can normalize both score distributions (e.g., min-max normalization to [0,1]) and combine them with weights:
hybrid_score = α × normalized_bm25_score + (1 - α) × normalized_vector_score
Where α controls the balance. Higher α = more weight on exact-match retrieval; lower α = more weight on semantic retrieval.
The challenge: normalizing BM25 scores is non-trivial. BM25 scores depend on the corpus size and term distribution, so a raw score of 12.4 means very different things across different catalogs or after catalog updates. You'd need to normalize against the score distribution of the current result set, not a fixed scale.
For this reason, most production systems start with RRF and only move to linear combination if they have strong evidence that one retrieval signal is systematically more reliable for their corpus.
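If you do go down this road, normalization has to happen per result set, as noted above. A minimal sketch of that approach (the ScoredChunk shape and weightedFusion name are illustrative, not from any specific library):

```typescript
interface ScoredChunk {
  id: string
  score: number // raw BM25 score or raw cosine similarity
}

// Min-max normalize scores within the CURRENT result set to [0, 1],
// since raw BM25 scores have no fixed scale across corpora.
function minMaxNormalize(results: ScoredChunk[]): Map<string, number> {
  const scores = results.map((r) => r.score)
  const min = Math.min(...scores)
  const max = Math.max(...scores)
  const range = max - min || 1 // avoid division by zero on uniform scores
  return new Map(results.map((r) => [r.id, (r.score - min) / range] as [string, number]))
}

// Weighted linear fusion: alpha weights the BM25 (exact-match) signal.
function weightedFusion(
  bm25Results: ScoredChunk[],
  vectorResults: ScoredChunk[],
  alpha = 0.5
): ScoredChunk[] {
  const bm25Norm = minMaxNormalize(bm25Results)
  const vecNorm = minMaxNormalize(vectorResults)
  const ids = new Set([...bm25Norm.keys(), ...vecNorm.keys()])
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (bm25Norm.get(id) ?? 0) + (1 - alpha) * (vecNorm.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score)
}
```

Note that a chunk missing from one list gets a normalized score of 0 from that side, which is itself a tuning decision you inherit with this method.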
Option 3: Cascaded Hybrid
A third approach: use BM25 as a hard pre-filter before dense retrieval. If a part number or exact model code matches, return those results immediately without touching the vector index. Otherwise, fall through to dense retrieval.
This is appropriate when exact-match queries dominate your traffic (common in spare parts and aftermarket catalogs) and you want to optimize latency for the common case.
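The cascade itself is a few lines of control flow. A sketch under stated assumptions: makeCascadedSearch and the two injected search functions are hypothetical, standing in for your own keyword-index and vector-index clients.

```typescript
type Chunk = { id: string; text: string }
type SearchFn = (query: string) => Promise<Chunk[]>

// Cascaded hybrid: run the exact-match path first, and only fall through
// to dense retrieval when it returns nothing. The two search functions are
// injected, so the cascade stays agnostic about the underlying indexes.
function makeCascadedSearch(exactSearch: SearchFn, denseSearch: SearchFn): SearchFn {
  return async (query: string) => {
    const exact = await exactSearch(query)
    if (exact.length > 0) return exact // short-circuit: the vector index is never touched
    return denseSearch(query)
  }
}
```

The trade-off to keep in mind: if the exact-match stage fires on a false positive (a query that merely contains something shaped like a model code), the semantic path never runs, so the pre-filter pattern needs to be conservative.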
Full Pipeline Architecture
Here's how a complete hybrid search pipeline looks for a product knowledge RAG system:
User query
│
├──► BM25 index ──► top-50 candidates (BM25 scores)
│
├──► Vector index ──► top-50 candidates (cosine similarity)
│
└──► Fusion (RRF) ──► merged top-80 candidates
│
└──► Cross-encoder reranker ──► top-5 chunks
│
└──► LLM (with context) ──► answer
Note the reranking step at the end. As we covered in our reranking deep-dive, a cross-encoder reranker is a highly effective final-stage filter. In a hybrid pipeline, the reranker has an even richer candidate set to work from — chunks that scored well on semantic grounds, chunks that scored well on keyword grounds, and chunks that scored moderately on both. The reranker can sort through that diversity much better than either retrieval signal alone.
Product Catalog Specifics
Generic hybrid search implementations handle most query types well. B2B product catalogs have a few specific challenges worth addressing explicitly.
Part Numbers Deserve Their Own Lookup Layer
Before hybrid search even runs, add a dedicated part number lookup stage. If the query contains a token that matches a part number exactly (or closely after normalization), retrieve that product directly via database lookup and inject it at the top of the context.
async function productKnowledgeRetrieval(query: string): Promise<Chunk[]> {
// Stage 0: exact part number match
const partNumberMatch = await lookupByPartNumber(extractPartNumbers(query))
if (partNumberMatch.length > 0) {
return [...partNumberMatch, ...(await hybridSearch(query, 10))]
}
// Stage 1: hybrid search
return hybridSearch(query, 50)
}

This ensures part number queries are never degraded by retrieval noise — you get the exact match guaranteed, supplemented by semantically related context.
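The extractPartNumbers helper carries most of the weight in that stage. One possible sketch, assuming identifiers in your catalog look like dash- or dot-separated alphanumeric codes; the regex is a heuristic you would tune to your own numbering scheme:

```typescript
// Heuristic part number extraction: alphanumeric tokens joined by dashes or
// dots that contain at least one digit, e.g. "304-SS-HEX-M10-1.5-A2".
// Adjust the pattern to match your catalog's actual numbering scheme.
function extractPartNumbers(query: string): string[] {
  const pattern = /\b[A-Z0-9]+(?:[-.][A-Z0-9]+)+\b/gi
  const candidates = query.match(pattern) ?? []
  // Require at least one digit to avoid matching hyphenated words like "push-fit"
  return candidates.filter((c) => /\d/.test(c))
}
```

Normalizing case and stripping separator variations (spaces, slashes) before the lookup makes this stage tolerant of how buyers actually type part numbers.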
Handling Measurement Unit Variation
B2B buyers switch between unit systems constantly. "3/4 inch" and "19mm" refer to the same dimension, but neither BM25 nor vector search handles this well out of the box.
A practical solution: at ingest time, normalize measurements and add both imperial and metric representations to the indexed text. If your product description says "19mm," add "¾ inch (19mm)" to the searchable text. This is content enrichment at the data layer — it makes both BM25 and vector search smarter without changes to the retrieval logic.
function enrichProductText(text: string): string {
  return text
    // 19mm -> 19mm (0.748"); the lookahead skips values already inside parentheses
    .replace(/(\d+(?:\.\d+)?)\s*mm\b(?!\))/g, (match, mm) => {
      const inches = (parseFloat(mm) / 25.4).toFixed(3)
      return `${mm}mm (${inches}")`
    })
    // 0.75" -> 0.75" (19.1mm); the lookahead skips the inch values inserted
    // by the previous pass, so they aren't converted back into mm again
    .replace(/(\d+(?:\.\d+)?)\s*(?:"|inch(?:es)?\b)(?!\))/gi, (match, inches) => {
      const mm = (parseFloat(inches) * 25.4).toFixed(1)
      return `${inches}" (${mm}mm)`
    })
}

This is a simple example — production implementations add richer unit normalization for weight, pressure, temperature, voltage, and whatever units dominate your specific catalog.
Structured Attribute Fields Benefit from BM25
Dense retrieval treats product descriptions as blobs of text. But product data has structure: a pump has a flow rate, a pressure rating, a fluid compatibility list, a connection size. These structured attributes are perfect candidates for BM25 — they're the fields buyers search with precision.
If your vector store supports metadata filtering, combine it with hybrid search:
async function attributeAwareSearch(query: string, filters: AttributeFilters) {
const bm25Candidates = await bm25Search(query, {
fields: ['description', 'specifications', 'part_number'],
filters,
})
const vectorCandidates = await vectorSearch(query, {
metadataFilter: filters,
limit: 50,
})
return reciprocalRankFusion(bm25Candidates, vectorCandidates)
}

Pre-filtering by structured attributes (category, material, connection standard) before running retrieval reduces noise dramatically, especially for large catalogs with hundreds of thousands of SKUs.
Evaluation: Measuring the Hybrid Uplift
How do you know if hybrid search is actually better than either component alone? Build a retrieval evaluation harness.
Step 1: Collect representative queries. Pull 100–200 real queries from support tickets, sales rep logs, or web analytics. Include a mix of:
- Natural language descriptions ("quiet ceiling fan for a large office")
- Partial part numbers or model codes
- Technical specification lookups ("IP67 rated connector, 6 pole, M23")
- Comparison queries ("difference between Type A and Type B seals")
Step 2: Label relevant chunks manually. For each query, identify which chunks from your corpus are genuinely relevant. This is the ground truth.
Step 3: Run three retrieval configurations — BM25 only, dense only, hybrid — and measure Recall@10 and MRR for each.
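Both metrics are straightforward to compute once you have labeled runs. A minimal sketch (the evaluate shape and names are illustrative):

```typescript
// One evaluation run: the ranked chunk ids a retrieval configuration
// returned for a query, plus the manually labeled relevant chunk ids.
interface EvalRun {
  retrieved: string[]
  relevant: Set<string>
}

// Recall@k: fraction of the relevant chunks found in the top k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length
  return relevant.size === 0 ? 0 : hits / relevant.size
}

// Reciprocal rank: 1 / position of the first relevant result (0 if none).
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id))
  return index === -1 ? 0 : 1 / (index + 1)
}

// Aggregate across the whole query set: mean Recall@10 and MRR.
function evaluate(runs: EvalRun[]): { recallAt10: number; mrr: number } {
  const n = runs.length
  return {
    recallAt10: runs.reduce((s, r) => s + recallAtK(r.retrieved, r.relevant, 10), 0) / n,
    mrr: runs.reduce((s, r) => s + reciprocalRank(r.retrieved, r.relevant), 0) / n,
  }
}
```

Run the same labeled query set through each of the three retrieval configurations and compare the two numbers side by side.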
A typical result pattern on B2B product catalogs:
| Retrieval Strategy | Recall@10 | MRR |
|---|---|---|
| BM25 only | 0.61 | 0.48 |
| Dense vector only | 0.67 | 0.54 |
| Hybrid (RRF) | 0.79 | 0.63 |
The gains are not uniform. Queries with exact identifiers see the biggest lift from BM25's contribution. Queries with natural language descriptions see the biggest lift from dense retrieval. Hybrid captures both — which is exactly why the aggregate metrics improve so substantially.
Infrastructure Choices
If you're building this from scratch, several infrastructure options natively support hybrid search:
Elasticsearch / OpenSearch: Both have native support for BM25 (the default) and dense vector fields (kNN search). You can run both queries in a single API call and fuse results server-side. Mature, battle-tested, good for teams already running these stacks.
Weaviate: Purpose-built vector database with native BM25 support and built-in hybrid search with RRF fusion. Simpler to operate than Elasticsearch for teams that don't need the broader feature set.
Qdrant + BM25 sidecar: Qdrant is a high-performance vector database that doesn't include BM25 natively. For hybrid, you run a lightweight BM25 sidecar (e.g. using rank-bm25 in Python or Tantivy in Rust) and handle fusion in your application layer. More moving parts, but gives you independent control over each component.
pgvector + PostgreSQL full-text search: If you're already on PostgreSQL, pgvector handles dense retrieval and Postgres's built-in full-text search (ts_rank) handles keyword scoring — not true BM25, since ts_rank lacks corpus-level IDF weighting, but serviceable. Fusion happens in SQL. Surprisingly capable for medium-scale catalogs (up to ~5M chunks) and keeps your operational footprint minimal.
The right choice depends on your existing infrastructure, team expertise, and scale requirements — not on which has the most impressive benchmarks on synthetic datasets.
When Hybrid Search Makes the Most Difference
Hybrid search is most impactful when your query traffic is mixed — some users know exactly what they want (part numbers, model codes), others are exploring or troubleshooting (natural language, application descriptions). This is the normal distribution for a B2B distributor or wholesaler: sales reps running exact lookups, end customers browsing by application.
If your traffic is predominantly exact identifier lookups, a well-tuned BM25 with a part number lookup layer might be sufficient. If it's predominantly conversational queries, pure vector search gets you most of the way there.
But for the typical B2B product knowledge use case, hybrid search is the standard architecture — not an optimization for later. Start with it; the implementation complexity delta over pure vector search is modest, and the retrieval quality uplift is consistent.
Putting It All Together
The retrieval architecture we've described across this series now has three interlocking layers:
- Hybrid retrieval (this article): Run BM25 and dense vector in parallel, fuse with RRF. Broad recall, captures both exact-match and semantic queries.
- Cross-encoder reranking (covered here): Take the fused top-50 candidates and re-score them with a pairwise relevance model. Precision uplift.
- Chunking strategy (covered here): Ensure the right information is in the chunks that retrieval can find. Foundation that everything else depends on.
These layers compound. A well-chunked corpus, retrieved with hybrid search, and reranked with a cross-encoder, reliably outperforms any single-component approach by a wide margin. The gains are measurable, not theoretical.
This is the architecture Axoverna is built on — specifically tuned for the characteristics of B2B product data: structured attributes, technical terminology, exact identifiers, and the full mix of user queries that come with a live product catalog.
Ready to See It Working on Your Catalog?
Axoverna handles the full hybrid retrieval stack — BM25, dense vectors, RRF fusion, and reranking — without requiring you to build or operate any of the underlying infrastructure. Connect your product catalog, and you get production-grade retrieval out of the box.
Book a demo to run a live retrieval benchmark against your own product data, or start a free trial and see the difference hybrid search makes on the queries your customers are actually asking.
Turn your product catalog into an AI knowledge base
Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.