Structured Data in RAG: Making Product Specs, Tables, and Pricing Sheets Actually Retrievable

Most RAG pipelines are built for prose. B2B product catalogs are full of tables, spec sheets, and structured data. Here's how to make that content work instead of break your retrieval.

Axoverna Team
13 min read

Here is a quiet assumption baked into most RAG tutorials: your source documents are paragraphs of flowing prose. You chunk them, embed them, retrieve the right chunks, and feed them to the LLM. It works beautifully for knowledge bases written in sentences.

Then you try to apply that pipeline to an actual B2B product catalog.

The catalog has a 400-row spec table where each row is a different SKU with a dozen numeric attributes. It has a pricing sheet laid out in Excel. It has a technical datasheet where the meaningful content is in a two-column table of "parameter" and "value." It has a conformance matrix that maps products to certifications across a grid.

The standard chunking pipeline does not know what to do with this. And the failure is silent — the chunks go into your vector store, queries return something, and nobody realizes until a buyer asks "what's the max operating temperature of the 316-SS version?" and gets back a chunk from a completely different product family.

This article is about the gap between RAG-for-prose and RAG-for-product-data, and what you actually need to do to close it.


Why Tables Break Standard Chunking

The standard approach — split text on paragraph boundaries or token counts, embed each chunk, retrieve by cosine similarity — assumes that each chunk is a self-contained, semantically coherent piece of text. A well-written paragraph satisfies that assumption. A table row does not.

Consider a spec table for a range of industrial valves:

Model    Size (DN)   Pressure Rating (bar)   Temp Range (°C)   Body Material   Seal Material
BV-250   50          16                      -20 to 180        Carbon Steel    PTFE
BV-251   80          25                      -20 to 180        Carbon Steel    PTFE
BV-252   50          16                      -40 to 250        Stainless 316   Graphite

If you convert this to text naively — which most PDF parsers and Markdown converters do — you get something like:

Model Size (DN) Pressure Rating (bar) Temp Range (°C) Body Material Seal Material BV-250 50 16 -20 to 180 Carbon Steel PTFE BV-251 80 25...

The spatial structure that gives each number its meaning has been destroyed. The embedding of this chunk will be an average of all the models it contains, and retrieval for "BV-252 temperature rating" will compete against chunks about every other valve in the table.

Even if you chunk row by row, the row "BV-252 50 16 -40 to 250 Stainless 316 Graphite" has been stripped of its headers. The embedding model has no idea that 316 here refers to a steel grade, not a model number, and that -40 to 250 is a temperature range in Celsius.

This is the structural information problem: numbers and codes only have meaning relative to their column headers, and those headers are usually in a different chunk.


Four Patterns for Handling Structured Product Data

There is no single correct approach. The right choice depends on your data volume, update frequency, and query patterns. But there are four well-proven patterns, and most production systems use at least two of them.

1. Row-to-Prose Serialization

The simplest effective approach: convert each table row into a natural language sentence that re-embeds the column context.

Instead of indexing the raw row, you generate:

BV-252 is a ball valve with a DN50 bore (50mm), rated to 16 bar, suitable for operating temperatures from -40°C to 250°C. The body is 316 stainless steel with a graphite seal.

This prose chunk retrieves well because the embedding captures actual meaning. A query like "stainless valves for high temperature applications" matches phrases like "operating temperatures from -40°C to 250°C" and "316 stainless steel" semantically — even if the buyer doesn't use those exact words.

The tradeoff: you need a serialization step, and the output quality depends on how well your template maps column names to readable text. For catalogs with dozens of attribute types, you'll build a schema-aware serializer rather than a single fixed template.
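As a minimal sketch of such a serializer for the valve table above — the field names and the template are illustrative assumptions, not a fixed schema:

```python
# Row-to-prose serialization sketch. Field names and template wording
# are illustrative; a real catalog needs a schema-aware serializer.

def serialize_valve_row(row: dict) -> str:
    """Turn one spec-table row into a retrieval-friendly sentence."""
    return (
        f"{row['model']} is a ball valve with a DN{row['dn_size']} bore "
        f"({row['dn_size']}mm), rated to {row['pressure_bar']} bar, "
        f"suitable for operating temperatures from {row['temp_min_c']}°C "
        f"to {row['temp_max_c']}°C. The body is {row['body_material']} "
        f"with a {row['seal_material']} seal."
    )

row = {
    "model": "BV-252", "dn_size": 50, "pressure_bar": 16,
    "temp_min_c": -40, "temp_max_c": 250,
    "body_material": "316 stainless steel", "seal_material": "graphite",
}
print(serialize_valve_row(row))
```

The template re-embeds the column context ("rated to … bar", "operating temperatures from … to …") so the numbers carry their units and meaning into the embedding.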

The payoff is significant. In our experience, row-to-prose serialization reliably improves retrieval recall on structured product data by 30–50% over raw table chunking, with no change to the retrieval architecture.

2. Metadata-Augmented Embedding

Rather than stuffing all attributes into the prose, you embed a minimal description and store structured attributes as metadata filters. This works especially well when buyers often filter before querying.

The embedding might just be: "BV-252: DN50 stainless steel ball valve, graphite seal, rated for high-temperature and cryogenic service."

But in your vector store, the chunk also carries metadata:

{
  "model": "BV-252",
  "category": "ball-valves",
  "dn_size": 50,
  "pressure_bar": 16,
  "temp_min_c": -40,
  "temp_max_c": 250,
  "body_material": "316-stainless",
  "seal_material": "graphite"
}

At query time, structured filters pre-filter the candidate set before semantic ranking. A query like "DN50 valves rated above 200°C" becomes: filter where dn_size = 50 AND temp_max_c >= 200, then rank by semantic similarity among the results.
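In miniature, the filter-then-rank step looks like this — the candidate list and the similarity scores are illustrative assumptions standing in for a real vector store's response:

```python
# Sketch: pre-filter candidates on scalar metadata before semantic
# ranking. Candidates and scores are illustrative stand-ins for what a
# vector store with metadata filtering would return.

candidates = [
    {"model": "BV-250", "dn_size": 50, "temp_max_c": 180, "score": 0.91},
    {"model": "BV-251", "dn_size": 80, "temp_max_c": 180, "score": 0.88},
    {"model": "BV-252", "dn_size": 50, "temp_max_c": 250, "score": 0.84},
]

# "DN50 valves rated above 200°C": filter on scalars first, then rank
# by semantic similarity among the survivors.
filtered = [c for c in candidates if c["dn_size"] == 50 and c["temp_max_c"] >= 200]
ranked = sorted(filtered, key=lambda c: c["score"], reverse=True)
print([c["model"] for c in ranked])  # → ['BV-252']
```

Note that pure semantic ranking would have returned BV-250 first (highest score) despite it failing the temperature requirement — exactly the failure mode the metadata layer prevents.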

We covered the mechanics of this in depth in our article on metadata filtering for RAG product catalogs. The key insight here: structured data lives more naturally as metadata than as embedded text. Numbers that need to be compared or filtered are not well served by embedding — they need to be stored as queryable scalars.

3. Table-as-Unit Retrieval with Summary Embedding

For cases where the table itself is the answer — a full conformance matrix, a compatibility grid, a pricing tier table — you don't want to retrieve a fragment. You want to retrieve the whole table and let the LLM parse it.

The trick is embedding a summary of the table, not the table content itself. The embedding captures what the table is about, while the retrieved content includes the complete structured artifact.

Example: a compatibility matrix for cable glands across different conduit sizes and IP ratings gets a summary embedding like: "Compatibility matrix showing which cable gland models are certified for IP66 and IP68 in conduit sizes M12 through M63." When a buyer asks "which M25 glands are rated IP68?", the retrieval hits this summary, and the full matrix is returned to the LLM to answer precisely.

This pattern requires that you track table boundaries during ingestion — something most generic chunking libraries won't do without configuration. Libraries like unstructured.io have explicit table extraction modes. For PDFs, pdfplumber returns tables as structured objects you can serialize and re-embed independently.
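A stripped-down version of the artifact pattern, assuming a plain dict as the stored record — in production this would live in your vector store with the summary as the embedded text:

```python
# Sketch: embed a summary, store and return the whole table. The rows
# mirror the valve table from earlier; the artifact record is a plain
# dict standing in for a vector-store entry.

table = [
    ["Model", "Size (DN)", "Pressure Rating (bar)", "Temp Range (°C)"],
    ["BV-250", "50", "16", "-20 to 180"],
    ["BV-252", "50", "16", "-40 to 250"],
]

summary = ("Spec table for BV-series ball valves covering sizes, "
           "pressure ratings, and temperature ranges.")

# Only `embed_text` gets embedded; on a retrieval hit, the complete
# table artifact is handed to the LLM.
artifact = {"embed_text": summary, "artifact": table, "type": "table"}

def render_table(t: list[list[str]]) -> str:
    """Serialize the stored table for the LLM prompt."""
    return "\n".join(" | ".join(row) for row in t)

print(render_table(artifact["artifact"]))
```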

4. Structured Query Augmentation (SQL/Filter Routing)

For large, regularly updated structured datasets — pricing, stock levels, SKU attribute databases — the right answer is often not RAG at all. It's a lookup against a real database, triggered by the AI layer.

When a buyer asks "what's the list price for part 4871-B in quantities over 500?", that's not a retrieval problem. It's a parameterized query against your pricing table. An agentic RAG architecture can classify that intent and route it to a SQL executor or API call rather than vector search.

The challenge is query generation: translating natural language into correct structured queries requires knowing your schema. Techniques here include:

  • Schema-grounded prompting: giving the LLM a schema description and few-shot examples of NL→SQL for your specific tables
  • Entity extraction first: resolving "part 4871-B" against a known SKU list before constructing the query
  • Guardrails on generated queries: restricting to SELECT-only, parameterized queries to prevent injection and limit scope
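The guardrail step can be as simple as a pre-execution check — this is a minimal sketch, not a complete SQL injection defense:

```python
# Sketch: minimal guardrails on LLM-generated SQL — SELECT-only, single
# statement, no DML/DDL keywords. Illustrative, not a complete defense;
# run generated queries against a read-only connection regardless.
import re

SELECT_ONLY = re.compile(r"^\s*SELECT\b", re.IGNORECASE)

def is_safe_query(sql: str) -> bool:
    if not SELECT_ONLY.match(sql):
        return False                      # must start with SELECT
    if ";" in sql.rstrip().rstrip(";"):
        return False                      # single statement only
    forbidden = ("insert", "update", "delete", "drop", "alter", "--")
    return not any(tok in sql.lower() for tok in forbidden)

print(is_safe_query("SELECT price FROM pricing WHERE sku = ? AND qty >= ?"))  # True
print(is_safe_query("DROP TABLE pricing"))                                    # False
```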

This is more engineering than RAG configuration, but for catalogs with thousands of products and dynamic pricing, it's the only approach that scales. A vector search over a 50,000-row price list is neither accurate nor efficient.


The Ingestion Layer: Where Structured Data Is Usually Broken

Most of the failure with structured product data happens before any retrieval question arises. It happens at ingestion.

The default PDF-to-text pipeline discards table structure. A table that looks clean in a datasheet becomes a stream of tokens with no spatial information. The column headers are usually captured once at the top, then never again — so rows below the first page have no context.

What to do instead:

  1. Use table-aware parsers. For PDFs, pdfplumber extracts tables as row/column objects. unstructured.io has a hi_res strategy with explicit table detection. For Excel and CSV, parse as DataFrames, not text.

  2. Detect and branch on document type. A spec datasheet should go through a different processing path than a product description page or a technical manual. Classify documents at ingestion time, just as you classify query intent at query time.

  3. Re-attach headers to every row. When you split a table for indexing, every chunk needs its column headers. Either serialize each row with headers (row-to-prose), or store headers as chunk metadata so they're always available at render time.

  4. Track provenance. Every chunk should know which product(s) it describes. A spec table chunk with no product_id or sku reference becomes unreliable when your catalog grows — you can't tell which item the spec belongs to without reading it.
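Steps 3 and 4 can be combined in one ingestion helper — a sketch, assuming the first column holds the SKU (label your own key column explicitly in practice):

```python
# Sketch: re-attach headers to every row chunk and record provenance.
# Assumes the first column is the product identifier — an illustrative
# convention, not a rule.

def chunk_table(header: list[str], rows: list[list[str]], source: str) -> list[dict]:
    chunks = []
    for row in rows:
        text = "; ".join(f"{h}: {v}" for h, v in zip(header, row))
        chunks.append({
            "text": text,                 # headers travel with every row
            "metadata": {
                "product_id": row[0],     # assumption: first column = SKU
                "headers": header,
                "source": source,         # provenance: which document
            },
        })
    return chunks

header = ["Model", "Size (DN)", "Pressure Rating (bar)"]
rows = [["BV-252", "50", "16"]]
print(chunk_table(header, rows, "valve-datasheet.pdf")[0]["text"])
# → Model: BV-252; Size (DN): 50; Pressure Rating (bar): 16
```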


What to Do About Pricing Sheets

Pricing data deserves special attention because it's both highly structured and frequently updated.

The volatility problem: embedding pricing data means your index goes stale the moment prices change. For static products, this might happen annually. For commodities or dynamic pricing, it could change daily. Product catalog freshness is already challenging for product descriptions — for pricing data, real-time staleness is a serious liability.

The solution stack for pricing:

  • Don't embed raw prices. Store pricing data in a live database, not a vector index.
  • Embed the pricing structure and policy. Things like "volume discounts apply at 50, 100, and 500 units" or "contractor pricing is available with an approved account" are semantic descriptions that belong in RAG.
  • Route pricing queries to a live lookup. At query time, detect pricing intent and call a live API or database. Return the current price to the LLM as context, not from the vector store.

This architecture means your AI always returns accurate prices without re-ingestion, and your vector index carries only the structural product knowledge that changes slowly.
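The routing step above can be sketched as a simple intent gate — the keyword heuristic is an illustrative stand-in for a proper intent classifier:

```python
# Sketch: route pricing intent to a live lookup instead of vector
# search. The keyword regex is an illustrative stand-in for a trained
# intent classifier.
import re

PRICING_INTENT = re.compile(r"\b(price|pricing|cost|quote)\b", re.IGNORECASE)

def route(query: str) -> str:
    if PRICING_INTENT.search(query):
        return "pricing_lookup"           # live DB/API, never the index
    return "vector_search"

print(route("what's the list price for part 4871-B in quantities over 500?"))
# → pricing_lookup
print(route("which valves handle temperatures above 200°C?"))
# → vector_search
```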


Real Queries That Reveal Structured Data Gaps

A useful diagnostic is to identify the classes of buyer queries that your current pipeline handles poorly. Structured data problems surface in specific, predictable ways:

Comparison queries: "What's the difference between the BV-250 and BV-252?" — this requires knowing both products well enough to compare. If they're in separate chunks without overlap, the answer is often hallucinated or incomplete.

Threshold queries: "Which valves can handle temperatures above 200°C?" — this requires filtering by a numeric attribute. Without metadata filtering or structured routing, vector search will retrieve the most semantically similar chunks, which may not be the numerically correct ones.

Compatibility queries: "Does part X work with system Y?" — often answered by a compatibility matrix that's been chunked beyond usefulness. Table-as-unit retrieval is the fix here.

Completeness queries: "List all your DN80 fittings" — exhaustive queries are a poor fit for top-k retrieval altogether. This is a structured query problem, not a semantic one.

If your logs show high rates of these query types paired with low user satisfaction, your pipeline has a structured data gap.


Hybrid Architecture for Production Product Catalogs

In practice, a mature B2B product AI handles structured data through a layered architecture:

  1. Semantic layer (vector search): Product descriptions, use cases, application notes, technical manuals — content written in prose. Standard RAG with hybrid BM25 + dense retrieval handles this well.

  2. Attribute layer (metadata filters): Structured spec attributes stored alongside embeddings. Used to pre-filter before semantic ranking. Updated on product sync, not re-embedding.

  3. Structured data layer (database/API): Pricing, inventory, configurator rules. Queried live at runtime. Never embedded.

  4. Table artifact layer: Complete tables (compatibility matrices, conformance docs) stored as retrievable artifacts with summary embeddings. Retrieved whole, not as fragments.

Reranking at the final stage helps when multiple layers return results for the same query — letting a cross-encoder score candidates from different retrieval paths before the LLM synthesizes an answer.

This isn't complexity for its own sake. Each layer handles a real class of query that the other layers handle badly. The alternative — cramming everything through a single vector pipeline — is what produces the confident-but-wrong answers that erode buyer trust.


Getting Started: Auditing Your Structured Data Coverage

If you're not sure how much of your catalog is structured data, a quick audit helps:

  1. Pull 50 random chunks from your current vector index
  2. Count how many are table fragments (rows without headers, numbers without units, codes without context)
  3. Query your index with 10 known spec comparison questions and score the answers manually
  4. Check whether your top retrieved chunks actually contain the right product, or just a thematically similar one
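Step 2 can be partially automated with a crude fragment detector — the threshold is an illustrative guess, not a calibrated value:

```python
# Sketch: heuristic flag for "table fragment" chunks — runs of bare
# numbers without headers or units. The 0.4 threshold is an
# illustrative assumption; tune it against your own data.
import re

def looks_like_table_fragment(chunk: str) -> bool:
    tokens = chunk.split()
    if not tokens:
        return False
    numeric = sum(bool(re.fullmatch(r"-?\d+(\.\d+)?", t)) for t in tokens)
    return numeric / len(tokens) > 0.4    # mostly bare numbers → fragment

print(looks_like_table_fragment("BV-252 50 16 -40 250 316"))                      # True
print(looks_like_table_fragment("The BV-252 is a stainless steel ball valve."))   # False
```

Run this over the sampled chunks to get a first-pass count before scoring by hand.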

Most teams are surprised by the structured data percentage. In a typical B2B product catalog — spec sheets, technical datasheets, pricing documents, compatibility matrices — structured content often comprises 40–60% of the indexed data. If your pipeline treats it the same as prose, you're getting good retrieval on less than half your catalog.

Building trust with B2B buyers means consistently returning correct, complete answers to the queries they actually care about. As we've explored in our guide to building AI responses buyers trust, confidence calibration matters — but so does making sure the underlying retrieval actually surfaces the right data in the first place.

Structured data is where product AI systems most frequently fail. It's also where fixing things has the most direct impact on the queries that drive purchase decisions.


What This Looks Like in Practice

Axoverna handles structured data in product catalogs through a purpose-built ingestion pipeline that detects table content, serializes rows with full attribute context, and routes structured queries to the appropriate retrieval layer automatically. Customers bring their spec sheets, datasheets, and pricing structures — the platform handles the rest without custom pipeline engineering.

If your current product AI returns vague answers to specific technical questions, structured data handling is likely where the gap is.

Book a demo to see how Axoverna handles your specific catalog structure — bring a real datasheet and we'll show you exactly how it gets indexed and retrieved.


Related reading: Document Chunking Strategies for RAG · Metadata Filtering for Product Catalogs · Hybrid Search for B2B Product Catalogs · Agentic RAG for Product Discovery

Ready to get started?

Turn your product catalog into an AI knowledge base

Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.