How to Know If Your Product AI Actually Works: RAG Evaluation and Production Monitoring
Deploying a RAG-powered product AI is the easy part. Knowing whether it's answering correctly, catching drift before customers do, and systematically improving quality over time — that's where most teams struggle. Here's how to build a rigorous evaluation framework for B2B product knowledge AI.
Here's the uncomfortable truth about shipping a RAG-powered product AI: the demo looks great, the early queries impress everyone in the room, and then three months into production you discover that a significant percentage of answers are subtly wrong — citing discontinued specs, confusing product variants, or confidently hallucinating compatibility claims that don't exist anywhere in the catalog.
By then, trust has eroded. Sales reps have stopped using it. Some customers got bad information.
The root cause is almost never the model. It's almost always the lack of a measurement system. Traditional software has unit tests, integration tests, regression suites. You know immediately when something breaks. RAG systems have no equivalent discipline by default — which means problems accumulate silently until they surface in a customer complaint or a deal that went sideways.
This article is a practical guide to closing that gap: how to evaluate retrieval quality, measure answer quality, build a golden dataset that evolves with your catalog, and set up production monitoring that catches drift before your customers do.
Why Software Testing Doesn't Transfer Directly
The first instinct of most engineering teams is to write unit tests for the AI. Call the system with a known query, assert that the response contains the correct product name. If it passes, ship it.
This breaks down almost immediately for three reasons.
Outputs are non-deterministic. LLM responses vary across runs. Temperature, sampling, and even infrastructure-level differences mean your assertion-style test fails intermittently even when the system is working perfectly. You end up either setting assertions so loose they catch nothing, or spending cycles chasing test flakiness.
Correctness is multi-dimensional. A response can be factually correct but irrelevant (answered a different question). It can be relevant but unfaithful (the answer isn't actually supported by the retrieved context). It can be faithful and relevant but dangerously incomplete (correct but missing a critical safety warning). None of these failures looks like a "test failure" in the traditional sense.
The ground truth shifts. Your product catalog changes. New SKUs, updated specifications, discontinued models. A response that was correct in January may be incorrect by March. Static test suites go stale without continuous maintenance.
What you need instead is a measurement framework — one that quantifies quality across multiple dimensions, runs continuously against production traffic, and alerts you when quality degrades.
The Two Layers You Need to Measure
RAG systems have two distinct failure points, and they require different measurement approaches.
Layer 1: Retrieval Quality
Before the LLM generates an answer, the retrieval step assembles context from your product catalog. If the wrong chunks are retrieved, no amount of prompting or model capability can produce a correct answer. Garbage in, garbage out — reliably.
The retrieval layer answers the question: Did we find the right information from the catalog?
Layer 2: Generation Quality
Given retrieved context, the LLM generates an answer. Failures here include hallucinations (claims not supported by context), incomplete answers, poor formatting, or misinterpretation of the retrieved content.
The generation layer answers the question: Did we use the right information correctly?
Most teams focus almost entirely on generation quality because that's what users see. Retrieval quality is invisible to end users — they only see the final answer — but it's responsible for the majority of production failures. A disciplined evaluation program measures both layers explicitly.
Retrieval Metrics: What They Are and What They Tell You
To evaluate retrieval, you need a labeled dataset: a set of queries where you've manually identified which chunks from your corpus are the correct ones to retrieve. We'll discuss how to build this dataset in a later section; for now, assume it exists.
Recall@k
Recall@k measures the fraction of relevant chunks that appear in the top-k retrieved results.
Recall@k = (relevant chunks in top-k) / (total relevant chunks for query)
For a product AI, k=5 or k=10 are typical. If a query about a particular pump model has three relevant chunks (product sheet, spec table, installation note) and your system retrieves 2 of them in the top 5, Recall@5 = 0.67.
Recall tells you about coverage. Low recall means the right information isn't in the context window when the LLM generates its answer. This is the most common reason for incomplete or wrong answers.
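The formula translates directly into a few lines of Python. A minimal sketch, with illustrative chunk IDs:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# The pump example from above: 2 of 3 relevant chunks land in the top 5
retrieved = ["spec-table", "faq-1", "product-sheet", "blog-4", "case-study"]
relevant = ["product-sheet", "spec-table", "install-note"]
print(round(recall_at_k(retrieved, relevant, k=5), 2))  # 0.67
```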
Mean Reciprocal Rank (MRR)
MRR is the average of the reciprocal of the rank at which the first relevant chunk appears:
MRR = (1/|Q|) × Σ 1/rank_i
Where rank_i is the position of the first relevant result for query i. If the first relevant chunk is always in position 1, MRR = 1.0. If it's typically in position 3, MRR ≈ 0.33.
MRR is particularly important because most LLM prompts weight earlier context more heavily (and because contextual compression pipelines often truncate lower-ranked chunks before they reach the model). Getting the best chunk to the top of the list matters.
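MRR follows directly from the formula. A minimal sketch, assuming each query comes with its retrieved ranking and a set of relevant chunk IDs:

```python
def mean_reciprocal_rank(results_per_query):
    """results_per_query: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in results_per_query:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts toward MRR
    return total / len(results_per_query)

queries = [
    (["a", "b", "c"], {"a"}),  # first relevant at rank 1 -> contributes 1.0
    (["x", "y", "z"], {"z"}),  # first relevant at rank 3 -> contributes 1/3
]
print(round(mean_reciprocal_rank(queries), 2))  # 0.67
```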
Normalized Discounted Cumulative Gain (NDCG)
NDCG is a more nuanced metric that accounts for graded relevance (not just binary relevant/irrelevant) and the position of relevant results:
NDCG@k = DCG@k / IDCG@k
Where DCG discounts the relevance score of each result by its rank (log₂(rank+1)), and IDCG is the ideal DCG — the best possible ranking given the corpus. NDCG@k = 1.0 means perfect retrieval.
NDCG is most useful when you have relevance grades (this chunk is perfect, this one is partially relevant, this one is a stretch). For teams early in their evaluation journey, Recall@k and MRR are usually sufficient.
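For completeness, here is a simplified NDCG sketch. Note one simplification: the ideal ranking here is computed by re-sorting the retrieved results' own grades, whereas a strict IDCG would rank all known relevant chunks in the corpus:

```python
import math

def ndcg_at_k(relevance_scores, k=5):
    """relevance_scores: graded relevance of each retrieved result, in rank order."""
    def dcg(scores):
        # Discount each result's relevance by log2(rank + 1)
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(scores, start=1))

    actual = dcg(relevance_scores[:k])
    ideal = dcg(sorted(relevance_scores, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0], k=3))  # 1.0 -- best chunks already ranked first
```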
A healthy B2B product AI targeting medium-complexity queries should aim for:
| Metric | Minimum Acceptable | Target |
|---|---|---|
| Recall@5 | 0.65 | 0.80+ |
| Recall@10 | 0.75 | 0.88+ |
| MRR | 0.50 | 0.65+ |
These are rough benchmarks; what matters more is your trend over time and your delta versus a baseline.
Generation Quality Metrics
Retrieval metrics tell you nothing about what the LLM does with the context. Generation metrics fill that gap.
Faithfulness
Faithfulness measures whether every claim in the generated answer is actually supported by the retrieved context. An unfaithful response fabricates information that wasn't in the retrieved chunks — the classic hallucination pattern.
Faithfulness scoring breaks the answer into individual claims and checks each one:
faithfulness = (supported claims) / (total claims in answer)
For B2B product AI, faithfulness is non-negotiable. An answer claiming your industrial adhesive is rated for 200°C continuous exposure when the datasheet says 150°C is a product liability issue, not just a quality metric.
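The scoring loop itself is trivial; the hard part is the support check, which in practice is an LLM-as-judge or entailment model. In this sketch a naive substring predicate stands in for it, purely for illustration:

```python
def faithfulness_score(claims, context, is_supported):
    """Fraction of claims supported by the context.

    is_supported(claim, context) -> bool is a placeholder; in a real
    pipeline this would be an LLM-as-judge or NLI entailment check.
    """
    if not claims:
        return 0.0
    return sum(1 for c in claims if is_supported(c, context)) / len(claims)

# Toy example: one supported claim, one fabricated claim
context = "The X-200 adhesive is rated for 150°C continuous exposure."
claims = ["rated for 150°C continuous exposure", "rated for 200°C"]
score = faithfulness_score(claims, context, lambda c, ctx: c in ctx)
print(score)  # 0.5
```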
Answer Relevance
Answer relevance measures whether the generated answer actually addresses the question asked. A response can be entirely faithful to its retrieved context and still be irrelevant — for example, if the retrieval brought back chunks about a related but different product, and the LLM answered faithfully about the wrong thing.
Context Precision
Context precision asks: of the retrieved chunks passed to the LLM, how many were actually useful for answering the question? High context precision means your retrieval is tight — you're retrieving signal, not noise. Low precision wastes context window space and increases the chance the LLM gets confused by irrelevant content.
The RAGAS Framework
RAGAS (Retrieval-Augmented Generation Assessment) is an open-source framework that operationalizes these metrics into a consistent measurement pipeline. It handles:
- Automatic faithfulness scoring using LLM-as-judge
- Answer relevance scoring
- Context precision and recall
- Batch evaluation against a dataset of (question, answer, context) triples
For a B2B product AI, a minimal RAGAS evaluation run looks like:
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Your evaluation dataset: questions + ground truth answers + retrieved contexts
eval_data = Dataset.from_list([
    {
        "question": "What is the maximum operating pressure of the Vega Series 3 pump?",
        "answer": generated_answer,    # from your RAG pipeline
        "contexts": retrieved_chunks,  # what was passed to the LLM
        "ground_truth": "The Vega Series 3 pump operates at up to 16 bar maximum.",
    },
    # ... more examples
])

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```

Running this weekly (or on every significant catalog update) gives you a longitudinal view of quality trends.
LLM-as-Judge: Scaling Your Evaluation
Manual evaluation — a human reviewing each answer for correctness — is the gold standard. It's also expensive and doesn't scale to production traffic. LLM-as-judge bridges the gap: you use a capable LLM (often a different one from your production model) to grade responses automatically.
A practical judge prompt for product AI looks like:
```
You are evaluating a B2B product knowledge AI.

Question: {question}

Retrieved context:
{context}

Generated answer:
{answer}

Rate the answer on each dimension (1-5):
1. FAITHFULNESS: Is every claim in the answer directly supported by the context?
2. COMPLETENESS: Does the answer address all aspects of the question using available context?
3. ACCURACY: Are technical details (specs, part numbers, ratings) correctly stated?
4. RELEVANCE: Does the answer address what the user actually asked?

Respond with JSON: {"faithfulness": N, "completeness": N, "accuracy": N, "relevance": N, "explanation": "..."}
```
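Judge models don't always return valid JSON, so batch pipelines need defensive parsing. A minimal validator for the response format above (field names match the prompt; the function itself is an illustrative sketch):

```python
import json

REQUIRED = ("faithfulness", "completeness", "accuracy", "relevance")

def parse_judge_response(raw):
    """Validate a judge response; return a scores dict, or None if malformed.

    Defensive parsing keeps one bad judge output from crashing a batch run.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(isinstance(data.get(k), int) and 1 <= data[k] <= 5 for k in REQUIRED):
        return None
    return {k: data[k] for k in REQUIRED}

ok = parse_judge_response(
    '{"faithfulness": 5, "completeness": 4, "accuracy": 5, "relevance": 5, "explanation": "..."}'
)
bad = parse_judge_response("not json")
print(ok)   # {'faithfulness': 5, 'completeness': 4, 'accuracy': 5, 'relevance': 5}
print(bad)  # None
```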
LLM-as-judge is not perfect — it has known biases (favoring longer, more verbose answers) and blind spots — but calibration studies generally show strong agreement with human judges on factual recall tasks. For production monitoring at scale, it's the only practical approach.
Calibration tip: Before relying on LLM-as-judge scores, benchmark them against human ratings on 50–100 examples from your specific domain. Domain-specific technical content (industrial specifications, chemical properties, electrical ratings) is where LLM judges are most likely to miss errors that a domain expert would catch.
Building a Golden Dataset from Real Traffic
The evaluation framework above requires a labeled dataset. How do you build one without spending weeks on manual annotation?
Mine your support tickets. Every "why did the product AI say X?" complaint is a labeled example: you know the question, you know the answer was wrong, and you can determine what the correct answer should be. Support tickets are your highest-priority evaluation cases — real failures, real consequences.
Use sales rep feedback. Sales reps who use the product AI daily are your best signal source. A quick thumbs-up/thumbs-down feedback widget on the chat interface generates labeled data passively. Even 50 labeled examples from weekly sales team usage gives you a meaningful evaluation set within a month.
Mine query logs for high-frequency questions. Export your top-100 most common queries and annotate them manually. These high-frequency queries represent the highest-impact surface area — if they're working well, most of your users are having a good experience.
Synthetic augmentation for edge cases. Use an LLM to generate adversarial queries from your product catalog: "What happens if I use [Product A] in a [condition it's not rated for]?" These stress-test your guardrails and hallucination prevention under conditions that might not appear in your organic traffic.
A practical golden dataset for a mid-size B2B product catalog should have:
- 50–100 high-frequency queries with verified correct answers
- 20–30 complex, multi-step queries (compatibility checks, substitution requests)
- 20–30 adversarial/edge case queries
- 10–20 examples of known past failures
You don't need thousands of examples. You need representative examples that cover the query types your users actually run.
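One way to keep that composition honest is to tag each example with its category and check the mix on every dataset update. A sketch, with illustrative field names:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One labeled evaluation case; field names here are illustrative."""
    question: str
    ground_truth: str
    relevant_chunk_ids: list        # for retrieval metrics (Recall@k, MRR)
    category: str                   # "high_frequency" | "complex" | "adversarial" | "regression"

def coverage_report(dataset):
    """Count examples per category to check the target mix is maintained."""
    return Counter(ex.category for ex in dataset)

examples = [
    GoldenExample("Max pressure of Vega Series 3?", "16 bar", ["vega3-spec"], "high_frequency"),
    GoldenExample("Can product X substitute product Y?", "...", ["x-spec", "y-spec"], "complex"),
]
print(coverage_report(examples))
```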
Production Monitoring: Catching Drift Before Your Customers Do
Offline evaluation against a golden dataset tells you how good your system was at the time you last evaluated it. Production monitoring tells you how it's performing right now, including the effects of catalog changes, new query patterns, and model updates.
Query Distribution Drift
Plot the embedding distribution of incoming queries over time. If the centroid of query embeddings shifts significantly, your users are asking new types of questions that your system may not have been evaluated against. This is an early warning sign that your golden dataset needs updating.
```python
import numpy as np
from scipy.spatial.distance import cosine

def detect_query_drift(recent_embeddings, baseline_embeddings):
    recent_centroid = np.mean(recent_embeddings, axis=0)
    baseline_centroid = np.mean(baseline_embeddings, axis=0)
    drift = cosine(recent_centroid, baseline_centroid)
    return drift  # > 0.05 is worth investigating
```

Answer Quality Signals
Even without running full RAGAS on every query, you can monitor lightweight signals:
Answer length distribution. Unusually short answers often indicate retrieval failure (no relevant context found, so the LLM hedges with a short disclaimer). A sudden drop in average answer length suggests retrieval degradation.
Refusal rate. How often does your system say "I don't have information about that"? A baseline refusal rate of 5% is normal; a spike to 20% indicates either retrieval failure or a catalog gap that's suddenly being hit by new queries.
User feedback rate. If you're logging thumbs-up/thumbs-down, track the feedback ratio daily. A declining positive-feedback rate is the most direct user signal of quality degradation.
Latency outliers. P95 latency spikes often correlate with retrieval issues (falling back to full-corpus search when filters fail) or with LLM generation on unusually long contexts. Monitor latency as a proxy for system health.
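The first three signals are cheap to compute from interaction logs. A sketch, assuming a simple log schema with illustrative field names:

```python
import statistics

def answer_quality_signals(interactions):
    """interactions: dicts with 'answer' (str), 'refused' (bool), 'thumbs' (1, -1, or None).

    Field names are illustrative; adapt them to your logging schema.
    """
    lengths = [len(i["answer"]) for i in interactions]
    rated = [i["thumbs"] for i in interactions if i["thumbs"] is not None]
    return {
        "median_answer_length": statistics.median(lengths),
        "refusal_rate": sum(i["refused"] for i in interactions) / len(interactions),
        "positive_feedback_rate": (sum(t == 1 for t in rated) / len(rated)) if rated else None,
    }

logs = [
    {"answer": "The pump is rated to 16 bar maximum.", "refused": False, "thumbs": 1},
    {"answer": "I don't have information about that.", "refused": True, "thumbs": None},
]
print(answer_quality_signals(logs))
```

Compare these values against a rolling baseline rather than fixed thresholds, since normal levels vary by catalog and audience.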
Catalog Change Events as Triggers
Your product catalog is not static. New products, specification updates, discontinued lines — each catalog update is a potential quality regression for any queries that touch the updated content. Build a trigger: whenever a catalog sync event updates more than N products in a given category, automatically run your evaluation suite against the queries in that category.
This turns catalog updates from silent quality risks into explicit verification checkpoints.
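A minimal version of that trigger, with hypothetical names and a (product_id, category) sync payload; the returned categories are the ones whose evaluation queries should be re-run:

```python
from collections import Counter

def categories_to_reevaluate(updated_products, threshold=10):
    """updated_products: (product_id, category) pairs from a catalog sync event.

    Returns the categories whose update count crosses the threshold.
    Names and payload shape are illustrative.
    """
    counts = Counter(category for _, category in updated_products)
    return sorted(cat for cat, n in counts.items() if n >= threshold)

updates = [("p1", "pumps"), ("p2", "pumps"), ("p3", "valves")]
print(categories_to_reevaluate(updates, threshold=2))  # ['pumps']
```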
A Systematic Improvement Loop
Evaluation is only valuable if it drives improvement. Here's the loop that converts metrics into quality gains:
1. Measure baseline. Run your golden dataset through retrieval metrics and RAGAS. Record the numbers.
2. Diagnose failures. For low-scoring examples, determine which layer failed. Was relevant context missing from the retrieved set (retrieval failure)? Was the right context retrieved but the answer was wrong (generation failure)? Was context present and the answer technically correct but poorly formed (formatting/instruction failure)?
3. Route to the right fix:
- Retrieval failures → fix chunking strategy, embedding model choice, or hybrid search balance
- Generation failures → fix prompts, add more explicit instructions, consider a more capable model for high-stakes queries
- Coverage gaps → identify missing content in your knowledge base and add it
4. Implement one change at a time. RAG systems have many interacting components. Changing the chunking strategy, the reranker, and the system prompt simultaneously makes it impossible to know what drove a metric change. Isolate variables.
5. Re-run evaluation. Measure the delta. If the change helped, keep it. If it hurt or was neutral, revert and try the next hypothesis.
6. Update the golden dataset. Add any newly discovered failure patterns as labeled examples so they don't regress.
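The re-run in step 5 can be as simple as diffing two metric dicts and flagging regressions beyond a tolerance (metric names and tolerance here are illustrative):

```python
def metric_delta(baseline, candidate, tolerance=0.02):
    """Compare metric dicts from two evaluation runs; flag regressions beyond tolerance."""
    return {
        name: {
            "delta": round(candidate[name] - baseline[name], 4),
            "regressed": candidate[name] - baseline[name] < -tolerance,
        }
        for name in baseline
    }

before = {"recall_at_5": 0.72, "faithfulness": 0.91}
after = {"recall_at_5": 0.78, "faithfulness": 0.88}
print(metric_delta(before, after))
```

A change that improves one metric while quietly regressing another is exactly the case this guards against.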
This loop typically runs on a two-week cycle for a well-instrumented production system. The first iteration is always the most revealing — teams routinely discover that 20–30% of their high-frequency queries have quality issues they weren't aware of.
Connecting Evaluation to Business Metrics
Technical metrics matter, but the ultimate measure of a product AI is business impact. Bridge the gap by tying your evaluation framework to outcomes:
Query resolution rate: What fraction of product questions are fully answered without escalation to a sales rep or support ticket? Track this weekly. A rising resolution rate means the AI is handling more load effectively.
Time-to-answer: How long does it take a buyer to get the product information they need? Compare sessions with AI assistance versus without (if you have historical data). The ROI story for B2B product AI is often strongest here — we covered the financial case in detail in our hidden cost of unanswered product questions analysis.
Escalation quality: When the AI does escalate to a human, is it escalating the right queries? A well-calibrated system escalates confidently when it doesn't know, rather than hallucinating an answer. Building trust in AI responses starts with good calibration.
Sales influence: For pre-sales product discovery use cases, track whether customers who used the product AI converted at higher rates. This is harder to measure but often shows the strongest ROI signal.
The Evaluation Maturity Model
Most teams building product AI move through predictable stages:
Stage 1: Vibe-based. Quality assessment = "it seemed to work in the demo." No metrics. Problems discovered via customer complaints.
Stage 2: Ad-hoc. Manual spot-checking of queries before major releases. Better than nothing, but not scalable.
Stage 3: Golden dataset. 50–100 labeled examples, offline evaluation on releases. Catches regressions before deployment.
Stage 4: Automated monitoring. LLM-as-judge running on production samples daily. Drift detection, feedback loop integration. Problems caught within 24–48 hours.
Stage 5: Continuous improvement loop. Evaluation data drives systematic experimentation. Metric trends inform roadmap decisions. The system measurably improves quarter over quarter.
Most production deployments plateau at Stage 2 or 3. The teams that reach Stage 4 and 5 are the ones whose product AI becomes a durable competitive advantage — because they can confidently iterate while keeping quality high, while competitors are afraid to change anything in case they break something.
Where to Start
If you're currently at Stage 1 or 2, here's the minimum viable evaluation program to run this week:
- Export your top 50 queries from production logs or support tickets.
- Manually verify the answer quality for each one. Rate: correct, partially correct, incorrect.
- Run RAGAS offline on the same 50 queries to calibrate LLM-as-judge scores against your human ratings.
- Add a simple feedback widget (👍/👎) to your chat interface if you don't have one. This generates continuous labeled data.
- Set up a weekly review of the lowest-scoring queries. Twenty minutes per week, focused on the bottom 10.
This is not a large investment. It's the minimum to stop flying blind.
Your Product AI Is Only as Good as Your Ability to Improve It
The teams shipping product AI that their customers actually trust have a common trait: they measure relentlessly. Not because they're perfectionists, but because they understand that an unmeasured AI system is an uncontrolled system — and uncontrolled systems degrade silently until the damage is visible.
The evaluation framework in this article is how you maintain control: over quality, over regressions, over the impact of every catalog change and every model update. It's how you have the confidence to iterate quickly, because you can see exactly what your changes are doing.
It's also how you build the case internally for expanding the product AI investment — because you have the data to show that it works, which queries it handles well, and a clear picture of where the remaining gaps are.
Ready to Ship Product AI You Can Measure?
Axoverna builds evaluation instrumentation into the platform from day one. Retrieval metrics, answer quality monitoring, feedback collection, and drift detection are included — not bolted on after the fact. When your catalog changes, you'll know exactly what impact it had on retrieval quality before your customers do.
Book a demo to see how Axoverna's monitoring dashboard works, or start a free trial and connect your first product catalog today.