Hybrid Search in Practice: Combining BM25 and Dense Vectors for B2B Product Catalogs
Neither keyword search nor vector search alone handles the full range of B2B product queries. Hybrid search — fusing BM25 and dense retrieval — is how serious product AI systems solve both halves of the problem.
If you've been following along with our series on B2B product retrieval, you know the story so far. Keyword search fails on complex B2B catalogs because it can't handle synonyms, intent, or natural language phrasing. Semantic search and vector embeddings solve the semantic gap — but introduce a new failure mode: they struggle with exact identifiers, part numbers, and precise technical attributes.
Every production RAG system we've seen at scale eventually reaches the same conclusion: you need both.
Hybrid search — combining traditional BM25 keyword scoring with dense vector retrieval — is the architecture that handles the full spectrum of queries a B2B buyer actually asks. This article is a practical guide to how it works, how to implement it, and how to tune it for a product catalog specifically.
The Two Failure Modes You're Trying to Solve
Before diving into the solution, it's worth being precise about the problem.
Where pure vector search breaks down
Dense retrieval is excellent at matching meaning across surface variations. "What's the load rating for your structural channel?" matches "maximum allowable load on C-channels" even though no words overlap. That's real value.
But embeddings are terrible at exact-match retrieval. A query for part number 304-SS-HEX-M10-1.5-A2 is a precise identifier with no semantic content. To an embedding model, it looks like a string of tokens with no meaningful semantic relationship — and it will confidently retrieve a chunk about a superficially similar part number that isn't the right product at all.
The same failure applies to model codes, SKUs, CAS numbers in chemical products, UPC codes, and any domain-specific identifier your industry uses. If a buyer knows exactly what they want, vector search often gets in the way.
Where pure BM25 breaks down
BM25 is a refined TF-IDF scoring function. It's fast, deterministic, and explainable. It's also brittle. It matches tokens — not intent.
A buyer typing "what can I use to connect 3/4 inch copper pipe in a wet environment" will get poor results if your catalog describes the same product as "push-fit plumbing fitting, 19mm, corrosion-resistant." The tokens don't overlap. BM25 scores that chunk near zero even though it's the perfect answer.
Synonyms, unit conversions (inches to millimeters), technical jargon, and natural language phrasing all kill keyword retrieval. We covered the specifics in our analysis of why keyword search fails in B2B — the short version is that buyers don't speak catalog.
The hybrid premise
The insight behind hybrid search is simple: these two failure modes are almost perfectly complementary. When BM25 fails (semantic mismatch), vector search succeeds. When vector search fails (exact identifiers, token-specific queries), BM25 succeeds.
A hybrid system runs both retrieval paths in parallel and fuses the results. Done right, it achieves higher recall than either approach alone — without sacrificing the precision of exact-match retrieval.
A Quick Primer on BM25
BM25 (Best Match 25) is the ranking function behind most traditional search engines, including Elasticsearch and OpenSearch at their core. Understanding the formula helps when you need to tune it.
For a query term q and a document D, the BM25 score is approximately:
score(D, q) = IDF(q) × (tf(q,D) × (k1 + 1)) / (tf(q,D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- tf(q,D) = term frequency of q in document D
- IDF(q) = inverse document frequency (rarer terms score higher)
- |D| = document length
- avgdl = average document length across the corpus
- k1 (default ≈ 1.2) = term frequency saturation: controls how much repeated terms boost the score
- b (default ≈ 0.75) = length normalization: penalizes longer documents
For product catalogs, you'll often want to tune b downward (toward 0.3–0.5). Product descriptions vary dramatically in length — a one-line fastener entry versus a multi-page pump specification — and heavy length normalization penalizes the longer entries unfairly. In a catalog, a longer product description is usually just a more complete one, not bloat.
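The per-term score is small enough to check in code. Here is a minimal TypeScript sketch of the formula above (bm25TermScore is an illustrative name, and idf is assumed precomputed), showing how lowering b lifts a long document's score:

```typescript
// Minimal sketch of the BM25 per-term score from the formula above.
// Defaults match the classic values (k1 = 1.2, b = 0.75); idf is precomputed.
function bm25TermScore(
  tf: number,        // term frequency of q in document D
  idf: number,       // inverse document frequency of q
  docLen: number,    // |D|
  avgDocLen: number, // avgdl
  k1 = 1.2,
  b = 0.75
): number {
  return (idf * (tf * (k1 + 1))) / (tf + k1 * (1 - b + b * (docLen / avgDocLen)))
}

// A document 3x the average length, same term stats:
const withDefaultB = bm25TermScore(3, 2.0, 900, 300)           // b = 0.75
const withLowerB = bm25TermScore(3, 2.0, 900, 300, 1.2, 0.4)   // b = 0.4
// Lower b reduces the length penalty, so withLowerB > withDefaultB
```

With the numbers above, b = 0.75 yields a score of 2.2 while b = 0.4 yields roughly 2.56 — the long, complete product description is no longer pushed down the ranking just for being long.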
Dense Retrieval: The Counterpart
Dense retrieval, covered in depth in our vector databases for product search article, works by embedding both documents and queries into a shared semantic vector space. Retrieval is Approximate Nearest Neighbor (ANN) search over pre-computed embeddings.
The key difference from BM25:
| Property | BM25 | Dense Vector |
|---|---|---|
| Query-doc relationship | Token overlap | Semantic proximity |
| Handles synonyms | No | Yes |
| Handles part numbers | Yes (exact) | Poor |
| Handles natural language | Limited | Yes |
| Infrastructure | Inverted index | Vector index (HNSW, etc.) |
| Pre-computation | Inverted index at ingest | Embeddings at ingest |
| Query time | Milliseconds | Milliseconds (ANN) |
Both are fast at query time — the latency difference in practice is negligible. Both require pre-computation at ingest. The choice isn't either/or; it's both.
Fusion Strategies: How to Combine Two Result Lists
You now have two ranked lists of candidate chunks — one from BM25, one from dense retrieval. How do you merge them into a single ranked list to pass to the reranker or directly to the LLM?
Option 1: Reciprocal Rank Fusion (RRF)
RRF is the most robust fusion method, and the one we recommend as a starting point. It doesn't require you to normalize scores across the two systems (which is tricky since BM25 scores and cosine similarity scores operate on completely different scales). Instead, it uses only the rank of each result.
RRF_score(chunk) = Σ 1 / (k + rank_in_list_i)
Where k is a constant (typically 60) and the sum is over each retrieval list the chunk appears in.
In practice:
function reciprocalRankFusion(
bm25Results: RankedChunk[],
vectorResults: RankedChunk[],
k: number = 60
): RankedChunk[] {
const scores = new Map<string, number>()
const addScores = (results: RankedChunk[]) => {
results.forEach((chunk, index) => {
const rank = index + 1
const current = scores.get(chunk.id) ?? 0
scores.set(chunk.id, current + 1 / (k + rank))
})
}
addScores(bm25Results)
addScores(vectorResults)
// Merge unique chunks and sort by fused score
const allChunks = new Map<string, RankedChunk>()
;[...bm25Results, ...vectorResults].forEach((c) => allChunks.set(c.id, c))
return [...allChunks.values()]
.map((chunk) => ({ ...chunk, rrfScore: scores.get(chunk.id) ?? 0 }))
.sort((a, b) => b.rrfScore - a.rrfScore)
}

RRF is remarkably forgiving. Even when one retrieval system performs poorly on a given query, the fusion naturally down-weights its contribution because its results appear lower in the combined ranking.
Option 2: Weighted Linear Combination
If you want more control, you can normalize both score distributions (e.g., min-max normalization to [0,1]) and combine them with weights:
hybrid_score = α × normalized_bm25_score + (1 - α) × normalized_vector_score
Where α controls the balance. Higher α = more weight on exact-match retrieval; lower α = more weight on semantic retrieval.
The challenge: normalizing BM25 scores is non-trivial. BM25 scores depend on the corpus size and term distribution, so a raw score of 12.4 means very different things across different catalogs or after catalog updates. You'd need to normalize against the score distribution of the current result set, not a fixed scale.
For this reason, most production systems start with RRF and only move to linear combination if they have strong evidence that one retrieval signal is systematically more reliable for their corpus.
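If you do go down this road, normalization has to happen per result set, as noted above. A minimal sketch of that approach (the ScoredChunk shape and weightedFusion name are illustrative, not from any specific library):

```typescript
interface ScoredChunk {
  id: string
  score: number // raw BM25 score or raw cosine similarity
}

// Min-max normalize scores within the CURRENT result set to [0, 1],
// since raw BM25 scores have no fixed scale across corpora.
function minMaxNormalize(results: ScoredChunk[]): Map<string, number> {
  const scores = results.map((r) => r.score)
  const min = Math.min(...scores)
  const max = Math.max(...scores)
  const range = max - min || 1 // avoid division by zero on uniform scores
  return new Map(results.map((r) => [r.id, (r.score - min) / range] as [string, number]))
}

// Weighted linear fusion: alpha weights the BM25 (exact-match) signal.
function weightedFusion(
  bm25Results: ScoredChunk[],
  vectorResults: ScoredChunk[],
  alpha = 0.5
): ScoredChunk[] {
  const bm25Norm = minMaxNormalize(bm25Results)
  const vecNorm = minMaxNormalize(vectorResults)
  const ids = new Set([...bm25Norm.keys(), ...vecNorm.keys()])
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (bm25Norm.get(id) ?? 0) + (1 - alpha) * (vecNorm.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score)
}
```

Note that a chunk missing from one list gets a normalized score of 0 from that side, which is itself a tuning decision you inherit with this method.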
Option 3: Cascaded Hybrid
A third approach: use BM25 as a hard pre-filter before dense retrieval. If a part number or exact model code matches, return those results immediately without touching the vector index. Otherwise, fall through to dense retrieval.
This is appropriate when exact-match queries dominate your traffic (common in spare parts and aftermarket catalogs) and you want to optimize latency for the common case.
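The cascade itself is a few lines of control flow. A sketch under stated assumptions: makeCascadedSearch and the two injected search functions are hypothetical, standing in for your own keyword-index and vector-index clients.

```typescript
type Chunk = { id: string; text: string }
type SearchFn = (query: string) => Promise<Chunk[]>

// Cascaded hybrid: run the exact-match path first, and only fall through
// to dense retrieval when it returns nothing. The two search functions are
// injected, so the cascade stays agnostic about the underlying indexes.
function makeCascadedSearch(exactSearch: SearchFn, denseSearch: SearchFn): SearchFn {
  return async (query: string) => {
    const exact = await exactSearch(query)
    if (exact.length > 0) return exact // short-circuit: the vector index is never touched
    return denseSearch(query)
  }
}
```

The trade-off to keep in mind: if the exact-match stage fires on a false positive (a query that merely contains something shaped like a model code), the semantic path never runs, so the pre-filter pattern needs to be conservative.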
Full Pipeline Architecture
Here's how a complete hybrid search pipeline looks for a product knowledge RAG system:
User query
│
├──► BM25 index ──► top-50 candidates (BM25 scores)
│
├──► Vector index ──► top-50 candidates (cosine similarity)
│
└──► Fusion (RRF) ──► merged top-80 candidates
│
└──► Cross-encoder reranker ──► top-5 chunks
│
└──► LLM (with context) ──► answer
Note the reranking step at the end. As we covered in our reranking deep-dive, a cross-encoder reranker is a highly effective final-stage filter. In a hybrid pipeline, the reranker has an even richer candidate set to work from — chunks that scored well on semantic grounds, chunks that scored well on keyword grounds, and chunks that scored moderately on both. The reranker can sort through that diversity much better than either retrieval signal alone.
Product Catalog Specifics
Generic hybrid search implementations handle most query types well. B2B product catalogs have a few specific challenges worth addressing explicitly.
Part Numbers Deserve Their Own Lookup Layer
Before hybrid search even runs, add a dedicated part number lookup stage. If the query contains a token that matches a part number exactly (or closely after normalization), retrieve that product directly via database lookup and inject it at the top of the context.
async function productKnowledgeRetrieval(query: string): Promise<Chunk[]> {
// Stage 0: exact part number match
const partNumberMatch = await lookupByPartNumber(extractPartNumbers(query))
if (partNumberMatch.length > 0) {
return [...partNumberMatch, ...(await hybridSearch(query, 10))]
}
// Stage 1: hybrid search
return hybridSearch(query, 50)
}

This ensures part number queries are never degraded by retrieval noise — you get the exact match guaranteed, supplemented by semantically related context.
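The extractPartNumbers helper carries most of the weight in that stage. One possible sketch, assuming identifiers in your catalog look like dash- or dot-separated alphanumeric codes; the regex is a heuristic you would tune to your own numbering scheme:

```typescript
// Heuristic part number extraction: alphanumeric tokens joined by dashes or
// dots that contain at least one digit, e.g. "304-SS-HEX-M10-1.5-A2".
// Adjust the pattern to match your catalog's actual numbering scheme.
function extractPartNumbers(query: string): string[] {
  const pattern = /\b[A-Z0-9]+(?:[-.][A-Z0-9]+)+\b/gi
  const candidates = query.match(pattern) ?? []
  // Require at least one digit to avoid matching hyphenated words like "push-fit"
  return candidates.filter((c) => /\d/.test(c))
}
```

Normalizing case and stripping separator variations (spaces, slashes) before the lookup makes this stage tolerant of how buyers actually type part numbers.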
Handling Measurement Unit Variation
B2B buyers switch between unit systems constantly. "3/4 inch" and "19mm" refer to the same dimension, but neither BM25 nor vector search handles this well out of the box.
A practical solution: at ingest time, normalize measurements and add both imperial and metric representations to the indexed text. If your product description says "19mm," add "¾ inch (19mm)" to the searchable text. This is content enrichment at the data layer — it makes both BM25 and vector search smarter without changes to the retrieval logic.
function enrichProductText(text: string): string {
  return text
    // 19mm -> 19mm (0.748"); the lookahead skips values already inside parentheses
    .replace(/(\d+(?:\.\d+)?)\s*mm\b(?!\))/g, (match, mm) => {
      const inches = (parseFloat(mm) / 25.4).toFixed(3)
      return `${mm}mm (${inches}")`
    })
    // 0.75" -> 0.75" (19.1mm); the lookahead skips the inch values inserted
    // by the previous pass, so they aren't converted back into mm again
    .replace(/(\d+(?:\.\d+)?)\s*(?:"|inch(?:es)?\b)(?!\))/gi, (match, inches) => {
      const mm = (parseFloat(inches) * 25.4).toFixed(1)
      return `${inches}" (${mm}mm)`
    })
}

This is a simple example — production implementations add richer unit normalization for weight, pressure, temperature, voltage, and whatever units dominate your specific catalog.
Structured Attribute Fields Benefit from BM25
Dense retrieval treats product descriptions as blobs of text. But product data has structure: a pump has a flow rate, a pressure rating, a fluid compatibility list, a connection size. These structured attributes are perfect candidates for BM25 — they're the fields buyers search with precision.
If your vector store supports metadata filtering, combine it with hybrid search:
async function attributeAwareSearch(query: string, filters: AttributeFilters) {
const bm25Candidates = await bm25Search(query, {
fields: ['description', 'specifications', 'part_number'],
filters,
})
const vectorCandidates = await vectorSearch(query, {
metadataFilter: filters,
limit: 50,
})
return reciprocalRankFusion(bm25Candidates, vectorCandidates)
}

Pre-filtering by structured attributes (category, material, connection standard) before running retrieval reduces noise dramatically, especially for large catalogs with hundreds of thousands of SKUs.
Evaluation: Measuring the Hybrid Uplift
How do you know if hybrid search is actually better than either component alone? Build a retrieval evaluation harness.
Step 1: Collect representative queries. Pull 100–200 real queries from support tickets, sales rep logs, or web analytics. Include a mix of:
- Natural language descriptions ("quiet ceiling fan for a large office")
- Partial part numbers or model codes
- Technical specification lookups ("IP67 rated connector, 6 pole, M23")
- Comparison queries ("difference between Type A and Type B seals")
Step 2: Label relevant chunks manually. For each query, identify which chunks from your corpus are genuinely relevant. This is the ground truth.
Step 3: Run three retrieval configurations — BM25 only, dense only, hybrid — and measure Recall@10 and MRR for each.
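Both metrics are straightforward to compute once you have labeled runs. A minimal sketch (the evaluate shape and names are illustrative):

```typescript
// One evaluation run: the ranked chunk ids a retrieval configuration
// returned for a query, plus the manually labeled relevant chunk ids.
interface EvalRun {
  retrieved: string[]
  relevant: Set<string>
}

// Recall@k: fraction of the relevant chunks found in the top k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length
  return relevant.size === 0 ? 0 : hits / relevant.size
}

// Reciprocal rank: 1 / position of the first relevant result (0 if none).
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id))
  return index === -1 ? 0 : 1 / (index + 1)
}

// Aggregate across the whole query set: mean Recall@10 and MRR.
function evaluate(runs: EvalRun[]): { recallAt10: number; mrr: number } {
  const n = runs.length
  return {
    recallAt10: runs.reduce((s, r) => s + recallAtK(r.retrieved, r.relevant, 10), 0) / n,
    mrr: runs.reduce((s, r) => s + reciprocalRank(r.retrieved, r.relevant), 0) / n,
  }
}
```

Run the same labeled query set through each of the three retrieval configurations and compare the two numbers side by side.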
A typical result pattern on B2B product catalogs:
| Retrieval Strategy | Recall@10 | MRR |
|---|---|---|
| BM25 only | 0.61 | 0.48 |
| Dense vector only | 0.67 | 0.54 |
| Hybrid (RRF) | 0.79 | 0.63 |
The gains are not uniform. Queries with exact identifiers see the biggest lift from BM25's contribution. Queries with natural language descriptions see the biggest lift from dense retrieval. Hybrid captures both — which is exactly why the aggregate metrics improve so substantially.
Infrastructure Choices
If you're building this from scratch, several infrastructure options natively support hybrid search:
Elasticsearch / OpenSearch: Both have native support for BM25 (the default) and dense vector fields (kNN search). You can run both queries in a single API call and fuse results server-side. Mature, battle-tested, good for teams already running these stacks.
Weaviate: Purpose-built vector database with native BM25 support and built-in hybrid search with RRF fusion. Simpler to operate than Elasticsearch for teams that don't need the broader feature set.
Qdrant + BM25 sidecar: Qdrant is a high-performance vector database that doesn't include BM25 natively. For hybrid, you run a lightweight BM25 sidecar (e.g. using rank-bm25 in Python or Tantivy in Rust) and handle fusion in your application layer. More moving parts, but gives you independent control over each component.
pgvector + PostgreSQL full-text search: If you're already on PostgreSQL, pgvector handles dense retrieval and Postgres's built-in full-text search (ts_rank) handles keyword scoring — not true BM25, since ts_rank lacks corpus-level IDF weighting, but serviceable. Fusion happens in SQL. Surprisingly capable for medium-scale catalogs (up to ~5M chunks) and keeps your operational footprint minimal.
The right choice depends on your existing infrastructure, team expertise, and scale requirements — not on which has the most impressive benchmarks on synthetic datasets.
When Hybrid Search Makes the Most Difference
Hybrid search is most impactful when your query traffic is mixed — some users know exactly what they want (part numbers, model codes), others are exploring or troubleshooting (natural language, application descriptions). This is the normal distribution for a B2B distributor or wholesaler: sales reps running exact lookups, end customers browsing by application.
If your traffic is predominantly exact identifier lookups, a well-tuned BM25 with a part number lookup layer might be sufficient. If it's predominantly conversational queries, pure vector search gets you most of the way there.
But for the typical B2B product knowledge use case, hybrid search is the standard architecture — not an optimization for later. Start with it; the implementation complexity delta over pure vector search is modest, and the retrieval quality uplift is consistent.
Putting It All Together
The retrieval architecture we've described across this series now has three interlocking layers:
- Hybrid retrieval (this article): Run BM25 and dense vector in parallel, fuse with RRF. Broad recall, captures both exact-match and semantic queries.
- Cross-encoder reranking (covered here): Take the fused top-50 candidates and re-score them with a pairwise relevance model. Precision uplift.
- Chunking strategy (covered here): Ensure the right information is in the chunks that retrieval can find. Foundation that everything else depends on.
These layers compound. A well-chunked corpus, retrieved with hybrid search, and reranked with a cross-encoder, reliably outperforms any single-component approach by a wide margin. The gains are measurable, not theoretical.
This is the architecture Axoverna is built on — specifically tuned for the characteristics of B2B product data: structured attributes, technical terminology, exact identifiers, and the full mix of user queries that come with a live product catalog.
Ready to See It Working on Your Catalog?
Axoverna handles the full hybrid retrieval stack — BM25, dense vectors, RRF fusion, and reranking — without requiring you to build or operate any of the underlying infrastructure. Connect your product catalog, and you get production-grade retrieval out of the box.
Book a demo to run a live retrieval benchmark against your own product data, or start a free trial and see the difference hybrid search makes on the queries your customers are actually asking.
Turn your product catalog into an AI knowledge base
Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.