Semantic Search vs Full-Text Search: A Practical Comparison
When should you use semantic search, full-text search, or a hybrid of both? Real benchmarks, concrete trade-offs, and implementation guidance for production systems.
The debate between semantic search and full-text search is usually framed as a competition — as if you have to pick one. In practice, the question isn't "which is better?" but "which works for which query types, and how do you combine them effectively?"
This article gives you the practical answer: what each approach does, where each fails, how to measure which is better for your use case, and how to implement the hybrid that outperforms both.
What Full-Text Search Actually Does
Full-text search (FTS), in its modern form, typically means BM25 — the relevance function that powers Elasticsearch, Solr, and Lucene. BM25 scores documents based on:
- Term frequency (TF): How often does the query term appear in this document? (With diminishing returns for repeated terms)
- Inverse document frequency (IDF): How rare is this term across the entire corpus? Rarer terms get higher weight.
- Document length normalization: Longer documents don't automatically score higher just because they contain more words.
The BM25 score for a query term t in document d:
BM25(t, d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d| / avgdl))
Where k1 (typically 1.2–2.0) controls term saturation and b (typically 0.75) controls length normalization. These are tunable parameters, and tuning them for your specific corpus can meaningfully improve BM25 performance.
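To make the formula concrete, here is a minimal sketch of per-term BM25 scoring. The function name `bm25_term_score` and the Lucene-style smoothed IDF are illustrative assumptions, not the exact Elasticsearch implementation:

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int, doc_len: int,
                    avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term for one document using the BM25 formula above."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF, Lucene-style
    norm = k1 * (1 - b + b * doc_len / avgdl)             # length normalization term
    return idf * (tf * (k1 + 1)) / (tf + norm)

# A rare term (in 3 of 20,000 docs) appearing twice in a slightly long document:
print(bm25_term_score(tf=2, df=3, n_docs=20_000, doc_len=120, avgdl=100.0))
```

Note the saturation behavior: going from 2 to 10 occurrences of a term raises the score by far less than 5×, which is exactly what k1 controls.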
What BM25 is good at:
- Exact and near-exact term matching
- Part number lookup (the query "SKF-6205-2RS" should return exactly that product)
- Known-item search where users know the exact name
- Faceted search where users combine filter terms
What BM25 fails at:
- Synonyms and abbreviations (searches for "solenoid valve" miss "SOV" or "electromechanical valve")
- Paraphrased queries ("valve that resists high temperatures" vs. "high-temperature valve")
- Intent-based queries ("I need a fitting to connect NPT to BSP")
- Concept search where meaning matters more than words
What Semantic Search Actually Does
Semantic search uses dense vector embeddings (as described in detail in our RAG explainer →) to represent documents and queries as points in high-dimensional space. Similarity is computed as the cosine similarity between vectors.
The key property: the embedding model encodes meaning, learned from pre-training on large text corpora. "Valve that resists high temperatures" and "high-temperature valve" produce similar vectors, even though the phrasing and word order differ.
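The similarity computation itself is simple. A minimal sketch using toy vectors (standing in for real embedding model output, which typically has hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for a query and a document:
query_vec = np.array([0.9, 0.1, 0.3, 0.0])
doc_vec = np.array([0.8, 0.2, 0.4, 0.1])
print(round(cosine_similarity(query_vec, doc_vec), 3))
```

In production, the vectors come from an embedding model and the nearest-neighbor lookup is done by a vector index, but the ranking signal is this same cosine score.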
What semantic search is good at:
- Synonym and abbreviation handling (zero configuration required)
- Intent-based and natural language queries
- Cross-language retrieval (queries in English can find documents in French, if the embedding model supports it)
- Conceptual similarity (finding a product that solves the same problem, even if described differently)
What semantic search fails at:
- Exact string matching (a query for "Model 3200-1NPT" can occasionally retrieve close-but-wrong variants)
- Rare or domain-specific tokens not seen during pre-training
- Numbers and codes without semantic context (part number "A4B7-9923" means nothing to a general-purpose embedding model)
- Short, keyword-style queries where there's no semantic context to capture
Benchmarking on Product Catalog Data
To illustrate the difference concretely, here's a representative benchmark on a sample industrial product catalog (~20,000 products). We tested 200 realistic queries across five types:
| Query Type | Example | BM25 Recall@5 | Semantic Recall@5 | Hybrid Recall@5 |
|---|---|---|---|---|
| Exact part number | "SKF 6205-2RS" | 97% | 74% | 96% |
| Technical spec match | "150 PSI 1-inch NPT valve" | 71% | 78% | 89% |
| Synonym-heavy | "SOV for natural gas service" | 34% | 81% | 83% |
| Intent-based | "What fitting converts NPT to BSP?" | 22% | 78% | 80% |
| Mixed | "PTFE-seated butterfly valve, 6 inch" | 68% | 72% | 87% |
The pattern is clear:
- BM25 wins on exact strings
- Semantic wins on synonyms and intent
- Hybrid wins on mixed and spec-based queries
- Hybrid is never significantly worse than either individual method
Recall@5 measures whether the correct product appears in the top 5 results. For a B2B buyer, if the right product isn't in the first page of results, it might as well not be there.
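Recall@K is straightforward to compute once you have labeled queries. A minimal sketch (the helper name `recall_at_k` is ours, not from any particular library):

```python
def recall_at_k(results: list[list[str]], ground_truth: list[str], k: int = 5) -> float:
    """Fraction of queries whose correct document appears in the top-k results."""
    hits = sum(
        1
        for result_list, correct_id in zip(results, ground_truth)
        if correct_id in result_list[:k]
    )
    return hits / len(ground_truth)

# Two test queries: the first is answered at rank 2, the second is missed entirely.
print(recall_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "q"], k=5))  # → 0.5
```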
Implementing Hybrid Search
The standard approach for combining BM25 and semantic search is Reciprocal Rank Fusion (RRF). RRF merges ranked result lists by giving each document a score based on its rank in each list:
```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> dict[str, float]:
    """
    Merge multiple ranked lists using RRF.

    Args:
        ranked_lists: List of document ID lists, ordered by relevance
        k: Constant to prevent very small denominators (default: 60)

    Returns:
        Dict mapping document_id to RRF score
    """
    scores: dict[str, float] = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            if doc_id not in scores:
                scores[doc_id] = 0.0
            scores[doc_id] += 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

# Usage:
bm25_results = bm25_search(query, top_k=50)        # Returns list of doc IDs
semantic_results = vector_search(query, top_k=50)  # Returns list of doc IDs
merged = reciprocal_rank_fusion([bm25_results, semantic_results])
final_results = list(merged.keys())[:10]           # Take top 10
```

The constant k=60 is the standard choice from the original RRF paper and works well in practice. You can tune it, but the improvement from tuning is usually marginal.
Why RRF Beats Weighted Sum
An alternative to RRF is normalizing the scores from each system and summing them with weights. This is intuitive but problematic in practice: BM25 and semantic similarity scores use different scales and distributions. Normalizing them to the same range is non-trivial and changes as the corpus grows.
RRF is rank-based, so it's distribution-agnostic. A document that ranks 3rd in BM25 and 5th in semantic gets the same RRF contribution regardless of what the raw scores were. This makes it robust and parameter-free.
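The normalization fragility is easy to demonstrate. With min-max normalization (one common choice, shown here purely for illustration), adding a single high-scoring document rescales every other normalized score, even though the relative ranking never changed:

```python
def min_max(scores: list[float]) -> list[float]:
    """Min-max normalize raw scores into the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

bm25_scores = [12.4, 9.8, 9.1]  # arbitrary BM25 scale
print([round(s, 2) for s in min_max(bm25_scores)])          # → [1.0, 0.21, 0.0]

# One new outlier document compresses all the other normalized scores:
print([round(s, 2) for s in min_max([48.0] + bm25_scores)])  # → [1.0, 0.08, 0.02, 0.0]
```

Under a weighted sum, that compression silently shifts the balance between the two retrievers. RRF sees only ranks, so it is immune to this effect.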
Adding Metadata Filtering
In a product catalog, you often want to combine hybrid search with structured filters. The typical pattern: apply filters first (reducing the search space), then run hybrid retrieval within the filtered set.
```typescript
async function hybridProductSearch(
  query: string,
  filters: {
    category?: string
    minPressure?: number
    material?: string
  },
  topK: number = 10
): Promise<Product[]> {
  // 1. Get vector embedding of query
  const queryVector = await embed(query)

  // 2. BM25 search within filtered products
  const bm25Results = await bm25Search(query, {
    filter: buildFilter(filters),
    limit: 50,
  })

  // 3. Vector search within filtered products
  const vectorResults = await vectorSearch(queryVector, {
    filter: buildFilter(filters),
    limit: 50,
  })

  // 4. Merge with RRF
  const merged = reciprocalRankFusion([
    bm25Results.map(r => r.id),
    vectorResults.map(r => r.id),
  ])

  // 5. Fetch full product data and return
  return fetchProducts(Object.keys(merged).slice(0, topK))
}
```

When Full-Text Search Alone Is Acceptable
There are scenarios where BM25 alone is the right choice:
Admin and internal tools: Your warehouse staff searching for a product by SKU don't need semantic understanding — they need exact match.
Autocomplete: Prefix completion (suggesting "Model 320..." as the user types) is purely a string operation. Semantic search adds no value here.
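As an illustration of how purely lexical this is, prefix completion over a sorted list of product names needs nothing more than a binary search (a sketch; real systems typically use a dedicated suggester or trie):

```python
from bisect import bisect_left

def prefix_complete(sorted_names: list[str], prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` names starting with `prefix` from a sorted list."""
    start = bisect_left(sorted_names, prefix)  # first name >= prefix
    out = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break  # past the prefix range; sorted order guarantees no more matches
        out.append(name)
    return out

names = sorted(["Model 3200-1NPT", "Model 3200-2NPT", "Model 3300", "Valve X1"])
print(prefix_complete(names, "Model 32"))  # → ['Model 3200-1NPT', 'Model 3200-2NPT']
```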
Audit trails and compliance: Searching for an exact regulation number, document version, or SKU in a compliance context requires precision over recall.
Very small corpora: For a catalog under 500 products, BM25 with a well-maintained synonym file is probably good enough and significantly simpler to operate.
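A synonym file for a small catalog can be as simple as query-time expansion. A minimal sketch, with a hypothetical hand-maintained synonym map (search engines like Elasticsearch offer this as a built-in token filter, but the idea is the same):

```python
# Hypothetical hand-maintained synonym map for a small catalog:
SYNONYMS = {
    "sov": ["solenoid valve"],
    "ptfe": ["teflon"],
}

def expand_query(query: str) -> str:
    """Append known synonyms so BM25 can match either phrasing."""
    extra = [
        syn
        for token in query.lower().split()
        for syn in SYNONYMS.get(token, [])
    ]
    return query if not extra else query + " " + " ".join(extra)

print(expand_query("SOV for natural gas service"))
# → SOV for natural gas service solenoid valve
```

The obvious cost is maintenance: every new abbreviation your customers invent needs a manual entry, which is exactly the work an embedding model does for free.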
When Semantic Search Alone Is Acceptable
Conversational search: When users are querying via a chat interface with natural language, semantic retrieval (usually as part of a RAG pipeline) is the right tool. See how AI chat widgets are replacing FAQ pages →
Concept discovery: "Show me products similar to this one" is a pure embedding similarity task.
Cross-language scenarios: If your products are documented in multiple languages and your customers query in their native language, a multilingual embedding model handles this elegantly without BM25's language-specific complexity.
The Cross-Encoder Re-Ranker Layer
For high-stakes retrieval (where the best result really matters, not just a good result), both BM25 and semantic search can be improved by a third-stage re-ranker.
A cross-encoder takes a (query, document) pair as input and outputs a relevance score. Unlike bi-encoders (which embed query and document separately and compute similarity), cross-encoders can "attend" to both simultaneously — capturing nuanced relevance signals at the cost of being too slow to run on every document in your corpus.
The standard pattern: run BM25 + semantic retrieval to get the top 50 candidates (fast), then re-rank those 50 with a cross-encoder (still fast at 50, too slow at 50,000).
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[int]:
    # Score every (query, candidate) pair jointly, then sort by score
    pairs = [(query, candidate) for candidate in candidates]
    scores = reranker.predict(pairs)
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked_indices[:top_k]
```

Cross-encoder re-ranking on the top 50 candidates typically improves Recall@5 by another 5–10% on top of hybrid retrieval. It's the most impactful single improvement you can add to an already-good retrieval pipeline.
Measuring What Matters: Recall@K and MRR
Two metrics matter most for product search quality:
Recall@K: Given a known-correct answer, did it appear in the top K results? Recall@5 is the most practical for most UIs. Calculate by running your test queries, checking whether the ground-truth product appears in the results.
Mean Reciprocal Rank (MRR): How high in the ranked list did the correct result appear on average? MRR penalizes systems that return the right answer at position 5 vs. position 1.
```python
def mean_reciprocal_rank(results: list[list[str]], ground_truth: list[str]) -> float:
    """Calculate MRR across a set of queries."""
    reciprocal_ranks = []
    for result_list, correct_id in zip(results, ground_truth):
        if correct_id in result_list:
            rank = result_list.index(correct_id) + 1  # 1-indexed
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

If you don't have labeled test queries yet, start by collecting them from your support ticket history — any ticket where a customer asked for a specific product is a test case.
The Practical Recommendation
For a B2B product catalog with more than 1,000 products:
- Start with hybrid search (BM25 + semantic). The incremental engineering cost over either alone is small, and the performance gain is significant.
- Add metadata filtering for structured attributes (category, size, material, certification).
- Consider a cross-encoder re-ranker if your use case involves high-precision requirements (medical, industrial safety, regulatory).
- Skip pure BM25 unless you're building internal tooling or have a very small catalog.
The era of "just use Elasticsearch" for product search is over. Your buyers expect the search bar to understand what they mean, not just match the words they typed.
Axoverna implements hybrid search + RAG generation → See your catalog through a semantic lens
Turn your product catalog into an AI knowledge base
Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.