Semantic Search vs Full-Text Search: A Practical Comparison
When should you use semantic search, full-text search, or a hybrid of both? Real benchmarks, concrete trade-offs, and implementation guidance for production systems.
The debate between semantic search and full-text search is usually framed as a competition — as if you have to pick one. In practice, the question isn't "which is better?" but "which works for which query types, and how do you combine them effectively?"
This article gives you the practical answer: what each approach does, where each fails, how to measure which is better for your use case, and how to implement the hybrid that outperforms both.
What Full-Text Search Actually Does
Full-text search (FTS), in its modern form, typically means BM25 — the relevance function that powers Elasticsearch, Solr, and Lucene. BM25 scores documents based on:
- Term frequency (TF): How often does the query term appear in this document? (With diminishing returns for repeated terms)
- Inverse document frequency (IDF): How rare is this term across the entire corpus? Rarer terms get higher weight.
- Document length normalization: Longer documents don't automatically score higher just because they contain more words.
The BM25 score for a query term t in document d:
BM25(t, d) = IDF(t) × (TF(t,d) × (k1 + 1)) / (TF(t,d) + k1 × (1 - b + b × |d| / avgdl))
Where k1 (typically 1.2–2.0) controls term saturation and b (typically 0.75) controls length normalization. These are tunable parameters, and tuning them for your specific corpus can meaningfully improve BM25 performance.
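To make the formula concrete, here is a minimal sketch of per-term BM25 scoring. The function name `bm25_term_score` and the Lucene-style smoothed IDF are illustrative assumptions, not the exact Elasticsearch implementation:

```python
import math

def bm25_term_score(tf: int, df: int, n_docs: int, doc_len: int,
                    avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Score one query term for one document using the BM25 formula above."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF, Lucene-style
    norm = k1 * (1 - b + b * doc_len / avgdl)             # length normalization term
    return idf * (tf * (k1 + 1)) / (tf + norm)

# A rare term (in 3 of 20,000 docs) appearing twice in a slightly long document:
print(bm25_term_score(tf=2, df=3, n_docs=20_000, doc_len=120, avgdl=100.0))
```

Note the saturation behavior: going from 2 to 10 occurrences of a term raises the score by far less than 5×, which is exactly what k1 controls.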
What BM25 is good at:
- Exact and near-exact term matching
- Part number lookup (the query "SKF-6205-2RS" should return exactly that product)
- Known-item search where users know the exact name
- Faceted search where users combine filter terms
What BM25 fails at:
- Synonyms and abbreviations (searches for "solenoid valve" miss "SOV" or "electromechanical valve")
- Paraphrased queries ("valve that resists high temperatures" vs. "high-temperature valve")
- Intent-based queries ("I need a fitting to connect NPT to BSP")
- Concept search where meaning matters more than words
What Semantic Search Actually Does
Semantic search uses dense vector embeddings (as described in detail in our RAG explainer →) to represent documents and queries as points in high-dimensional space. Similarity is computed as the cosine similarity between vectors.
The key property: the embedding model encodes meaning, learned from pre-training on large text corpora. "Valve that resists high temperatures" and "high-temperature valve" produce similar vectors, even though the phrasing and word order differ.
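The similarity computation itself is simple. A minimal sketch using toy vectors (standing in for real embedding model output, which typically has hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for a query and a document:
query_vec = np.array([0.9, 0.1, 0.3, 0.0])
doc_vec = np.array([0.8, 0.2, 0.4, 0.1])
print(round(cosine_similarity(query_vec, doc_vec), 3))
```

In production, the vectors come from an embedding model and the nearest-neighbor lookup is done by a vector index, but the ranking signal is this same cosine score.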
What semantic search is good at:
- Synonym and abbreviation handling (zero configuration required)
- Intent-based and natural language queries
- Cross-language retrieval (queries in English can find documents in French, if the embedding model supports it)
- Conceptual similarity (finding a product that solves the same problem, even if described differently)
What semantic search fails at:
- Exact string matching (a query for "Model 3200-1NPT" can occasionally retrieve close-but-wrong variants)
- Rare or domain-specific tokens not seen during pre-training
- Numbers and codes without semantic context (part number "A4B7-9923" means nothing to a general-purpose embedding model)
- Short, keyword-style queries where there's no semantic context to capture
Benchmarking on Product Catalog Data
To illustrate the difference concretely, here's a representative benchmark on a sample industrial product catalog (~20,000 products). We tested 200 realistic queries across five types:
| Query Type | Example | BM25 Recall@5 | Semantic Recall@5 | Hybrid Recall@5 |
|---|---|---|---|---|
| Exact part number | "SKF 6205-2RS" | 97% | 74% | 96% |
| Technical spec match | "150 PSI 1-inch NPT valve" | 71% | 78% | 89% |
| Synonym-heavy | "SOV for natural gas service" | 34% | 81% | 83% |
| Intent-based | "What fitting converts NPT to BSP?" | 22% | 78% | 80% |
| Mixed | "PTFE-seated butterfly valve, 6 inch" | 68% | 72% | 87% |
The pattern is clear:
- BM25 wins on exact strings
- Semantic wins on synonyms and intent
- Hybrid wins on mixed and spec-based queries
- Hybrid is never significantly worse than either individual method
Recall@5 measures whether the correct product appears in the top 5 results. For a B2B buyer, if the right product isn't in the first page of results, it might as well not be there.
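Recall@K is straightforward to compute once you have labeled queries. A minimal sketch (the helper name `recall_at_k` is ours, not from any particular library):

```python
def recall_at_k(results: list[list[str]], ground_truth: list[str], k: int = 5) -> float:
    """Fraction of queries whose correct document appears in the top-k results."""
    hits = sum(
        1
        for result_list, correct_id in zip(results, ground_truth)
        if correct_id in result_list[:k]
    )
    return hits / len(ground_truth)

# Two test queries: the first is answered at rank 2, the second is missed entirely.
print(recall_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "q"], k=5))  # → 0.5
```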
Implementing Hybrid Search
The standard approach for combining BM25 and semantic search is Reciprocal Rank Fusion (RRF). RRF merges ranked result lists by giving each document a score based on its rank in each list:
```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> dict[str, float]:
    """
    Merge multiple ranked lists using RRF.

    Args:
        ranked_lists: List of document ID lists, ordered by relevance
        k: Constant to prevent very small denominators (default: 60)

    Returns:
        Dict mapping document_id to RRF score
    """
    scores: dict[str, float] = {}
    for ranked_list in ranked_lists:
        for rank, doc_id in enumerate(ranked_list, start=1):
            if doc_id not in scores:
                scores[doc_id] = 0.0
            scores[doc_id] += 1.0 / (k + rank)
    return dict(sorted(scores.items(), key=lambda x: x[1], reverse=True))

# Usage:
bm25_results = bm25_search(query, top_k=50)        # Returns list of doc IDs
semantic_results = vector_search(query, top_k=50)  # Returns list of doc IDs
merged = reciprocal_rank_fusion([bm25_results, semantic_results])
final_results = list(merged.keys())[:10]           # Take top 10
```

The constant k=60 is the standard choice from the original RRF paper and works well in practice. You can tune it, but the improvement from tuning is usually marginal.
Why RRF Beats Weighted Sum
An alternative to RRF is normalizing the scores from each system and summing them with weights. This is intuitive but problematic in practice: BM25 and semantic similarity scores use different scales and distributions. Normalizing them to the same range is non-trivial and changes as the corpus grows.
RRF is rank-based, so it's distribution-agnostic. A document that ranks 3rd in BM25 and 5th in semantic gets the same RRF contribution regardless of what the raw scores were. This makes it robust and parameter-free.
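The normalization fragility is easy to demonstrate. With min-max normalization (one common choice, shown here purely for illustration), adding a single high-scoring document rescales every other normalized score, even though the relative ranking never changed:

```python
def min_max(scores: list[float]) -> list[float]:
    """Min-max normalize raw scores into the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

bm25_scores = [12.4, 9.8, 9.1]  # arbitrary BM25 scale
print([round(s, 2) for s in min_max(bm25_scores)])          # → [1.0, 0.21, 0.0]

# One new outlier document compresses all the other normalized scores:
print([round(s, 2) for s in min_max([48.0] + bm25_scores)])  # → [1.0, 0.08, 0.02, 0.0]
```

Under a weighted sum, that compression silently shifts the balance between the two retrievers. RRF sees only ranks, so it is immune to this effect.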
Adding Metadata Filtering
In a product catalog, you often want to combine hybrid search with structured filters. The typical pattern: apply filters first (reducing the search space), then run hybrid retrieval within the filtered set.
```typescript
async function hybridProductSearch(
  query: string,
  filters: {
    category?: string
    minPressure?: number
    material?: string
  },
  topK: number = 10
): Promise<Product[]> {
  // 1. Get vector embedding of query
  const queryVector = await embed(query)

  // 2. BM25 search within filtered products
  const bm25Results = await bm25Search(query, {
    filter: buildFilter(filters),
    limit: 50,
  })

  // 3. Vector search within filtered products
  const vectorResults = await vectorSearch(queryVector, {
    filter: buildFilter(filters),
    limit: 50,
  })

  // 4. Merge with RRF
  const merged = reciprocalRankFusion([
    bm25Results.map(r => r.id),
    vectorResults.map(r => r.id),
  ])

  // 5. Fetch full product data and return
  return fetchProducts(Object.keys(merged).slice(0, topK))
}
```

When Full-Text Search Alone Is Acceptable
There are scenarios where BM25 alone is the right choice:
Admin and internal tools: Your warehouse staff searching for a product by SKU don't need semantic understanding — they need exact match.
Autocomplete: Prefix completion (suggesting "Model 320..." as the user types) is purely a string operation. Semantic search adds no value here.
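As an illustration of how purely lexical this is, prefix completion over a sorted list of product names needs nothing more than a binary search (a sketch; real systems typically use a dedicated suggester or trie):

```python
from bisect import bisect_left

def prefix_complete(sorted_names: list[str], prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` names starting with `prefix` from a sorted list."""
    start = bisect_left(sorted_names, prefix)  # first name >= prefix
    out = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break  # past the prefix range; sorted order guarantees no more matches
        out.append(name)
    return out

names = sorted(["Model 3200-1NPT", "Model 3200-2NPT", "Model 3300", "Valve X1"])
print(prefix_complete(names, "Model 32"))  # → ['Model 3200-1NPT', 'Model 3200-2NPT']
```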
Audit trails and compliance: Searching for an exact regulation number, document version, or SKU in a compliance context requires precision over recall.
Very small corpora: For a catalog under 500 products, BM25 with a well-maintained synonym file is probably good enough and significantly simpler to operate.
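A synonym file for a small catalog can be as simple as query-time expansion. A minimal sketch, with a hypothetical hand-maintained synonym map (search engines like Elasticsearch offer this as a built-in token filter, but the idea is the same):

```python
# Hypothetical hand-maintained synonym map for a small catalog:
SYNONYMS = {
    "sov": ["solenoid valve"],
    "ptfe": ["teflon"],
}

def expand_query(query: str) -> str:
    """Append known synonyms so BM25 can match either phrasing."""
    extra = [
        syn
        for token in query.lower().split()
        for syn in SYNONYMS.get(token, [])
    ]
    return query if not extra else query + " " + " ".join(extra)

print(expand_query("SOV for natural gas service"))
# → SOV for natural gas service solenoid valve
```

The obvious cost is maintenance: every new abbreviation your customers invent needs a manual entry, which is exactly the work an embedding model does for free.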
When Semantic Search Alone Is Acceptable
Conversational search: When users are querying via a chat interface with natural language, semantic retrieval (usually as part of a RAG pipeline) is the right tool. See how AI chat widgets are replacing FAQ pages →
Concept discovery: "Show me products similar to this one" is a pure embedding similarity task.
Cross-language scenarios: If your products are documented in multiple languages and your customers query in their native language, a multilingual embedding model handles this elegantly without BM25's language-specific complexity.
The Cross-Encoder Re-Ranker Layer
For high-stakes retrieval (where the best result really matters, not just a good result), both BM25 and semantic search can be improved by a third-stage re-ranker.
A cross-encoder takes a (query, document) pair as input and outputs a relevance score. Unlike bi-encoders (which embed query and document separately and compute similarity), cross-encoders can "attend" to both simultaneously — capturing nuanced relevance signals at the cost of being too slow to run on every document in your corpus.
The standard pattern: run BM25 + semantic retrieval to get the top 50 candidates (fast), then re-rank those 50 with a cross-encoder (still fast at 50, too slow at 50,000).
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[int]:
    # Score every (query, candidate) pair jointly, then sort by score
    pairs = [(query, candidate) for candidate in candidates]
    scores = reranker.predict(pairs)
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked_indices[:top_k]
```

Cross-encoder re-ranking on the top 50 candidates typically improves Recall@5 by another 5–10% on top of hybrid retrieval. It's the most impactful single improvement you can add to an already-good retrieval pipeline.
Measuring What Matters: Recall@K and MRR
Two metrics matter most for product search quality:
Recall@K: Given a known-correct answer, did it appear in the top K results? Recall@5 is the most practical for most UIs. Calculate by running your test queries, checking whether the ground-truth product appears in the results.
Mean Reciprocal Rank (MRR): How high in the ranked list did the correct result appear on average? MRR penalizes systems that return the right answer at position 5 vs. position 1.
```python
def mean_reciprocal_rank(results: list[list[str]], ground_truth: list[str]) -> float:
    """Calculate MRR across a set of queries."""
    reciprocal_ranks = []
    for result_list, correct_id in zip(results, ground_truth):
        if correct_id in result_list:
            rank = result_list.index(correct_id) + 1  # 1-indexed
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```

If you don't have labeled test queries yet, start by collecting them from your support ticket history — any ticket where a customer asked for a specific product is a test case.
The Practical Recommendation
For a B2B product catalog with more than 1,000 products:
- Start with hybrid search (BM25 + semantic). The incremental engineering cost over either alone is small, and the performance gain is significant.
- Add metadata filtering for structured attributes (category, size, material, certification).
- Consider a cross-encoder re-ranker if your use case involves high-precision requirements (medical, industrial safety, regulatory).
- Skip pure BM25 unless you're building internal tooling or have a very small catalog.
The era of "just use Elasticsearch" for product search is over. Your buyers expect the search bar to understand what they mean, not just match the words they typed.
Axoverna implements hybrid search + RAG generation → See your catalog through a semantic lens
Turn your product catalog into an AI knowledge base
Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.