From Feed to AI: Using CSV, XML, and JSON Product Feeds as RAG Knowledge Sources

Most B2B distributors already have structured product data in CSV, XML, or JSON feeds from their suppliers or PIM. Here's how to turn those feeds into a live, queryable product knowledge AI — without a full data engineering project.

Axoverna Team
15 min read

Every B2B distributor has a version of the same file graveyard.

A folder somewhere — on a shared drive, an FTP server, or an email inbox — full of supplier product files. CSV exports from the ERP. XML feeds from the manufacturer portal. JSON blobs from the last integration project. Each file contains the raw material for excellent product knowledge: specs, descriptions, part numbers, compatibility notes, pricing tiers.

And almost none of it is searchable by your customers or sales team in any useful way.

The traditional path from "product data file" to "customer-facing knowledge" runs through a PIM system, a content team, a web developer, and six weeks of project time. The modern path is shorter: ingest the feed, map the fields, embed the content, serve queries. Conversational product AI doesn't need perfectly formatted prose — it needs structured, semantically rich data. Which is exactly what product feeds contain.

This article is a practical guide to that modern path.


Why Product Feeds Are Actually Ideal for RAG

There's a misconception that RAG works best with long-form documentation — technical manuals, white papers, knowledge base articles. That's one good input type. But for B2B product knowledge specifically, structured feed data has properties that make it excellent for retrieval:

Density of facts. A single product row in a well-structured CSV might encode 30–50 distinct facts: dimensions, material grades, standards compliance, weight, minimum order quantity, available finishes, compatible accessories. That's enormously high information density per token of LLM context.

Consistent structure. Unlike documentation written by different authors in different styles, feed data has uniform schema. A column called tensile_strength_mpa means the same thing for every row. That consistency makes it easier to write reliable field mappings and produce predictable chunk quality.

Already in your possession. You don't need to crawl a website, extract text from PDFs, or transcribe technical videos. The data exists. It's sitting in a file or streaming from an endpoint. The question isn't "where do we get the data?" but "how do we turn it into something queryable?"

Relatively easy to keep fresh. Product documentation updates are events. Feed updates are continuous. Your supplier sends a new CSV every Monday. Your PIM generates a JSON export nightly. That cadence is exactly what a well-designed feed ingestion pipeline needs: a predictable refresh schedule and a diff mechanism to detect changes.

The architectural challenge isn't whether feeds work as RAG sources — they do. It's doing the transformation correctly: turning rows and fields into semantic chunks that retrieve well, without losing the structural precision that makes specs trustworthy.


The Three Feed Formats and Their Trade-offs

CSV is the most common format in practice because it's the path of least resistance: any ERP, PIM, or spreadsheet tool can export it. The trade-off is flatness. A CSV naturally represents one product per row with scalar field values. Nested data (multiple images, tiered pricing, variant attributes) gets awkward — either flattened into multi-valued fields with delimiter conventions (color: red|blue|green) or split across multiple rows requiring reassembly. For products with simple attribute sets, CSV is fine. For products with complex hierarchical data, it introduces preprocessing friction.
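Where a feed uses delimiter-packed fields like that, un-flattening them is a small preprocessing step. A minimal sketch in TypeScript; the `|` separator mirrors the example above and is an assumption, since real feeds use whatever convention the supplier picked:

```typescript
// Splits a delimiter-packed CSV field ("red|blue|green") back into an array.
// The separator is a parameter because conventions vary between suppliers.
function splitMultiValued(raw: string, separator = "|"): string[] {
  return raw
    .split(separator)
    .map(v => v.trim())
    .filter(v => v.length > 0)
}

const colors = splitMultiValued("red|blue|green")
// colors → ["red", "blue", "green"]
```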

XML is the legacy B2B standard, still dominant in manufacturing and industrial distribution. BMEcat, cXML, and custom supplier schemas all share XML's structural strengths: proper nesting, type declarations, namespace isolation. A single XML product record can naturally represent a product with its variants, its dimensional tables, its related accessories, and its compliance documents — all in one tree. The preprocessing challenge is the opposite of CSV: XML is often over-structured for RAG purposes, with content buried inside deeply nested elements that require careful XPath traversal to extract meaningful text.

JSON is increasingly common from modern supplier APIs, cloud PIMs, and e-commerce platforms (Shopify, BigCommerce, and their wholesale equivalents all export JSON product feeds). JSON sits between CSV and XML on the structure spectrum: more expressive than CSV, less verbose than XML. JSON feeds from modern platforms often include rich text fields (descriptions, usage notes) alongside structured attributes — a nice combination for RAG since you get both semantic content and precise structured data in one record.

The practical implication: your feed ingestion pipeline needs format-specific parsers, but the downstream steps (field mapping → chunk construction → embedding → indexing) can be format-agnostic once you've normalized to a common internal representation.
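One way to realize that common internal representation: a single normalized record type that every format-specific parser emits. The field names and the trivial JSON parser below are illustrative assumptions, not a prescribed schema:

```typescript
// Normalized product record shared by all parsers; downstream steps
// (field mapping, chunking, embedding) only ever see this shape.
interface NormalizedProduct {
  sku: string
  name: string
  attributes: Record<string, string>  // semantic content fields
  metadata: Record<string, string>    // filter fields
}

type FeedParser = (raw: string) => NormalizedProduct[]

// Sketch of the JSON parser; the CSV and XML parsers emit the same shape.
const parseJsonFeed: FeedParser = raw => {
  const rows = JSON.parse(raw) as Array<Record<string, unknown>>
  return rows.map(row => ({
    sku: String(row["sku"] ?? ""),
    name: String(row["name"] ?? ""),
    attributes: { material: String(row["material"] ?? "") },
    metadata: { category: String(row["category"] ?? "") },
  }))
}
```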


Field Mapping: The Step That Determines Quality

The most important — and most underestimated — step in feed-to-RAG conversion is field mapping. Which fields become part of the embedded content? Which become metadata filters? Which should be combined? Which should be ignored?

Get this wrong and you end up with either over-sparse chunks (the AI can't answer questions about attributes you didn't embed) or over-dense chunks where everything is concatenated into an unstructured blob that retrieves poorly.

A practical framework for categorizing fields:

Core semantic content → embed as the main chunk text. These are fields that carry meaning a buyer might ask about in natural language: product name, description, key technical attributes (dimensions, material, standard), application notes, compatibility notes. The goal is that a well-formed question ("what's the tensile strength of this grade of fastener?") will semantically match this content.

Structured filter metadata → store as vector index metadata, not in the embedded text. Category, product family, supplier code, unit of measure, stock status, price tier. These aren't what buyers query about — they're constraints that narrow down which products are relevant. Metadata filtering at retrieval time is far more precise than expecting the vector similarity search to handle categorical constraints.

Identity and reference fields → SKU, EAN/GTIN, supplier part number, internal ID. Store these both in metadata (for exact-match lookup) and in the chunk text (so a buyer who pastes a part number into a query gets the right product back). Part number search is fundamentally a keyword problem, not a semantic one — which is one more reason to use hybrid retrieval rather than pure vector search for product catalogs.

Exclude or normalize → pricing, stock quantities, lead times. As we covered in our article on live inventory integration, volatile operational data doesn't belong in your embedding index. Either exclude it entirely and retrieve it from a live cache, or store it as metadata that gets refreshed frequently without requiring re-embedding.

Here's what a concrete field mapping configuration looks like for a typical industrial component feed:

field_mapping:
  # Core semantic content — combined into the embedded chunk
  content_fields:
    - name
    - description
    - technical_summary
    - dimensions_text        # "M10 x 1.5mm pitch, 30mm length"
    - material_grade
    - standards_compliance   # "ISO 4014, DIN 931"
    - application_notes
    - compatible_with
 
  # Metadata for filtering — stored but not embedded
  metadata_fields:
    - category
    - subcategory
    - supplier_code
    - unit_of_measure
    - min_order_qty
    - product_family
    - surface_treatment
 
  # Identity fields — embedded AND stored as metadata
  identity_fields:
    - sku
    - ean
    - manufacturer_part_number
    - supplier_part_number
 
  # Excluded from RAG entirely
  exclude_fields:
    - stock_qty
    - price_eur
    - price_usd
    - last_updated_erp

Constructing Chunks: Don't Just Concatenate Fields

A common mistake is treating field mapping as a concatenation problem: take the content fields, join them with newlines or commas, embed the result. That works as a baseline. It does not work well at scale.

The problem is that concatenated field values don't read like natural text that would match a buyer's question. Consider this raw concatenation:

Hexagon Bolt | High-strength structural fastener for load-bearing steel connections.
M12 | 1.75mm | 60mm | Grade 10.9 | ISO 4014 | DIN 931 | Zinc-nickel plated |
Recommended torque: 88 Nm | Compatible with: ISO 4032 hex nut, DIN 125 washer

vs. a lightly structured prose reconstruction:

Hexagon Bolt (M12 × 1.75mm pitch, 60mm length) — Grade 10.9 structural fastener
designed for high-load steel connections. Complies with ISO 4014 and DIN 931.
Surface treatment: zinc-nickel plated. Recommended installation torque: 88 Nm.
Compatible fasteners: ISO 4032 hex nut, DIN 125 washer.

The second version embeds better. Not because prose is inherently superior to structured data, but because the embedding model was trained predominantly on natural language text. Phrases like "designed for high-load steel connections" and "complies with ISO 4014" are patterns the model has seen many times and can relate to buyer queries like "what bolt should I use for structural steel applications?"

The reconstruction step doesn't need to be perfect prose — it just needs to connect field names and values in ways that approximate how the product would be described in natural language. A simple template per product category works well:

// formatDimensions is an assumed per-family helper that renders the raw
// dimension fields as natural text (e.g. "M12 x 1.75mm pitch, 60mm length").
function reconstructProductChunk(product: ProductRecord, template: ChunkTemplate): string {
  return template.render({
    name: product.name,
    dimensions: formatDimensions(product),
    material: product.materialGrade,
    standards: product.standardsCompliance?.join(', '),
    surface: product.surfaceTreatment,
    torque: product.recommendedTorqueNm ? `${product.recommendedTorqueNm} Nm` : null,
    compatible: product.compatibleWith?.join(', '),
    notes: product.applicationNotes,
  })
}

For product families with very different attribute schemas (fasteners vs. cables vs. pneumatic components), maintain a separate template per family. Fifteen product-family templates is manageable. One generic template that produces mediocre chunks for all of them is not.
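The `ChunkTemplate` used above can be very simple. One way to implement it is a list of sentences where any sentence whose placeholder resolves to a missing value is dropped, so optional fields like torque never leave dangling fragments. The placeholder syntax and the dropping behavior are design choices of this sketch, not a fixed API:

```typescript
// Minimal sentence-based template: "{name}" is replaced by values["name"].
// A sentence is dropped entirely if any of its placeholders is null/missing.
class ChunkTemplate {
  constructor(private sentences: string[]) {}

  render(values: Record<string, string | null | undefined>): string {
    return this.sentences
      .map(s => {
        let missing = false
        const filled = s.replace(/\{(\w+)\}/g, (_, key) => {
          const v = values[key]
          if (v == null) missing = true
          return v ?? ""
        })
        return missing ? null : filled
      })
      .filter((s): s is string => s !== null)
      .join(" ")
  }
}

const fastenerTemplate = new ChunkTemplate([
  "{name} ({dimensions}) is a {material} structural fastener.",
  "Complies with {standards}.",
  "Recommended installation torque: {torque}.",
])
```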


Handling Variants and Product Families

Feeds from B2B suppliers often represent products as families with variants: one base product (M10 hex bolt, grade 8.8, zinc-plated) with dozens of length variants, all sharing most attributes but differing on a few dimensions.

Should each variant be a separate chunk? Should the whole family be one chunk? The answer depends on the query patterns you're optimizing for.

One chunk per variant is the right default for most product AI use cases. Buyers typically ask about specific dimensions ("do you have M10 × 40mm in grade 8.8?"), and a chunk per variant means you can retrieve exactly the right record and surface precise inventory data for that exact SKU. The cost is index size — a family with 50 length variants becomes 50 chunks — but that's manageable.

One chunk per family with a variants summary works better when variants differ only on a single dimension (say, length) and queries are rarely about specific lengths. A chunk that says "available in lengths from 16mm to 200mm, see product configurator for specific availability" is more useful than 30 separate "M10 × Xmm" chunks that are nearly identical and dilute each other's relevance.

Hierarchical chunks — one family-level chunk and N variant-level chunks — give you the best of both worlds at the cost of more sophisticated retrieval logic: surface the family chunk for broad questions ("tell me about grade 8.8 hex bolts") and the variant chunk for specific queries ("do you have M10 × 50mm in grade 8.8?"). This is worth the complexity for large catalogs.
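The routing logic for hierarchical chunks can start as a simple heuristic. This sketch assumes chunks carry a `level` field in their metadata and uses a regex to guess whether a query targets a specific variant; a production system would lean on the retriever's own metadata filters instead:

```typescript
interface ProductChunk {
  level: "family" | "variant"
  text: string
  score: number  // similarity score from the vector index
}

// Heuristic: concrete dimensions ("M10", "50mm") suggest a variant query.
function looksVariantSpecific(query: string): boolean {
  return /\b\d+\s*mm\b|\bM\d+\b/i.test(query)
}

// Keep only chunks at the level the query targets, best score first.
function routeChunks(query: string, candidates: ProductChunk[]): ProductChunk[] {
  const level = looksVariantSpecific(query) ? "variant" : "family"
  return candidates
    .filter(c => c.level === level)
    .sort((a, b) => b.score - a.score)
}
```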


Feed Sync: Keeping Your Index Fresh

A product feed ingested once and never updated is worse than no feed at all — it gives buyers wrong information with AI confidence. Feed sync is not optional.

The minimal sync strategy: scheduled full re-ingest. Pull the complete feed on a cadence that matches how often your supplier updates it (daily for most, weekly for slow-moving industrial catalogs). Diff against the previous version, re-embed only changed records. This is simple to implement and works fine for feeds with up to a few hundred thousand products.
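When the feed carries no timestamps, a content hash per record is the simplest diff mechanism. A sketch, assuming the previous run's hashes were persisted somewhere (for instance a key-value store) and that record serialization is stable between runs:

```typescript
import { createHash } from "node:crypto"

// Hash each record's serialized content and compare against the last run.
// Caveat: JSON.stringify key order must be stable between runs; normalize
// records before hashing in production.
function diffFeed(
  records: Array<{ sku: string }>,
  prevHashes: Map<string, string>,
): { changedSkus: string[]; newHashes: Map<string, string> } {
  const changedSkus: string[] = []
  const newHashes = new Map<string, string>()
  for (const r of records) {
    const hash = createHash("sha256").update(JSON.stringify(r)).digest("hex")
    newHashes.set(r.sku, hash)
    if (prevHashes.get(r.sku) !== hash) changedSkus.push(r.sku)  // new or modified
  }
  return { changedSkus, newHashes }
}
```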

The better sync strategy: incremental updates via change detection. Most structured feeds include a last_updated or modified_at timestamp per record. Use this to pull only records that changed since your last sync — massively reducing processing cost and latency for large catalogs.

async function syncFeedIncremental(
  feedUrl: string,
  lastSyncAt: Date,
  currentIndex: ProductIndex,
): Promise<SyncResult> {
  const feed = await fetchFeed(feedUrl)
  const records = await parseFeed(feed)

  // Only records touched since the last sync need re-embedding
  const changed = records.filter(r =>
    new Date(r.lastUpdatedAt) > lastSyncAt
  )

  // Records in the index but absent from the feed were removed upstream
  const deleted = await detectDeletedRecords(records, currentIndex)

  await Promise.all([
    upsertEmbeddings(changed),
    deleteEmbeddings(deleted.map(r => r.sku)),
  ])

  return {
    processed: changed.length,
    deleted: deleted.length,
    syncedAt: new Date(),
  }
}

The best sync strategy for mission-critical catalogs: webhook-triggered sync combined with scheduled fallback. If your supplier or PIM can send a webhook when products are updated, trigger a targeted sync immediately for those specific SKUs. Use scheduled full sync as a safety net in case webhook delivery fails. You get near-real-time freshness without polling overhead.
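A webhook handler for that pattern can stay tiny. The payload shape (`updated_skus`) and the injected `syncSkus` function are assumptions for this sketch; real PIM webhook formats vary widely:

```typescript
interface FeedWebhookPayload {
  updated_skus: string[]
}

// Re-ingest only the SKUs the webhook names; the scheduled full sync
// remains the safety net if webhook delivery ever fails.
async function handleFeedWebhook(
  payload: FeedWebhookPayload,
  syncSkus: (skus: string[]) => Promise<void>,
): Promise<number> {
  if (payload.updated_skus.length === 0) return 0
  await syncSkus(payload.updated_skus)
  return payload.updated_skus.length
}
```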

We covered feed sync architecture in more depth in our article on product catalog sync and RAG freshness.


Automatic Field Detection: Lowering the Barrier to Entry

One practical challenge with multi-supplier feed ingestion is schema heterogeneity. Supplier A sends a CSV with headers like product_name, material, dim_length_mm. Supplier B sends XML with <ProductTitle>, <MaterialGrade>, <NominalLength>. Supplier C sends JSON with name, spec.material, dimensions.length.

Manually writing field mappings for each supplier is tedious and error-prone. A better approach: use an LLM to auto-detect likely field mappings from a sample of records, then present the detected mapping for human review and confirmation.

async function detectFieldMapping(sampleRecords: Record<string, unknown>[]): Promise<FieldMapping> {
  const sampleJson = JSON.stringify(sampleRecords.slice(0, 5), null, 2)
 
  const response = await llm.complete({
    prompt: `Given these sample product records from a B2B product feed, 
    identify which fields correspond to: product name, description, 
    SKU/part number, dimensions, material, standards compliance, category, 
    and any other semantically meaningful product attributes.
    
    Records:
    ${sampleJson}
    
    Return a JSON field mapping.`,
  })
 
  // LLM responses often wrap JSON in markdown fences; strip and validate
  // against a schema before trusting the parsed mapping
  return JSON.parse(response.text)
}

This isn't a replacement for human review — auto-detected mappings will sometimes be wrong, especially for ambiguous column names. But it dramatically reduces the time to configure a new feed from "30 minutes of careful manual mapping" to "2 minutes of reviewing an auto-suggested mapping." For distributors with dozens of supplier feeds, that efficiency compounds quickly.


From Feed to Chat: What the User Experience Unlocks

When feed ingestion is working well, the visible effect is a product knowledge AI that behaves like an expert who actually knows your catalog.

A buyer asks: "I need a grade 10.9 hex bolt, M12, around 50–60mm length, zinc coated. What do you have?" Instead of full-text search returning a results page they have to sift through, the AI returns: "We have M12 × 50mm and M12 × 60mm in grade 10.9 with zinc-nickel coating, both in stock. The 50mm variant has 840 units available; the 60mm has 212. Both comply with ISO 4014 and DIN 931. Want me to add either to your quote?"

That answer pulls from three places: the embedded product knowledge (grade, coating, compliance), the metadata filter (M12, length range), and the live inventory cache (stock levels). The buyer gets an answer that a knowledgeable sales rep would give — not a search results page they have to interpret.

For sales reps using the same system internally, the benefit is faster onboarding and lower cognitive load. A new rep handling a product line they've never sold can answer technical questions fluently on day one because the product knowledge — structured, searchable, precise — is right there in the chat interface.


Common Pitfalls to Avoid

Embedding fields that should be metadata. Category names, supplier codes, price ranges — these don't add semantic meaning to an embedding. They inflate chunk size and dilute the signal. Use them as metadata filters instead.

Ignoring multi-lingual supplier feeds. If your supplier sends German product descriptions and your buyers query in Dutch or English, semantic search breaks down. Multilingual embedding models (or a translation step before ingestion) are necessary. See our multilingual RAG article for strategies.

Treating sync failure as silent. If your nightly feed sync fails, you don't get an error in the chat UI — you get subtly wrong information served with AI confidence. Monitoring feed sync health and alerting on failures is not optional.

Skipping field normalization. Supplier A calls it weight_kg, supplier B calls it gross_weight_lbs. Without normalization, a query about "product weight" will miss everything from supplier B. Build a normalization layer into your field mapping step.
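That normalization layer can be a small alias table per supplier. The aliases and the pounds-to-kilograms conversion below are illustrative:

```typescript
// Maps supplier-specific field names onto canonical ones, converting
// units where needed. Each supplier feed gets its own alias table.
const FIELD_ALIASES: Record<string, { canonical: string; convert?: (v: number) => number }> = {
  weight_kg:        { canonical: "weight_kg" },
  gross_weight_lbs: { canonical: "weight_kg", convert: lbs => lbs * 0.45359237 },
}

function normalizeField(name: string, value: number): { name: string; value: number } {
  const alias = FIELD_ALIASES[name]
  if (!alias) return { name, value }  // pass unknown fields through unchanged
  return {
    name: alias.canonical,
    value: alias.convert ? alias.convert(value) : value,
  }
}
```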

Over-trusting auto-generated descriptions. Some PIMs generate AI-written product descriptions. These can read naturally but contain errors. When the authoritative data lives in structured attribute fields (dimensions, material grade, standards), prefer those over auto-generated text for the embedded chunk content.


Getting Started

The conceptual complexity here is lower than it appears. You don't need a data engineering team and a six-month pipeline project. The core steps are:

  1. Get a sample feed file — a few hundred rows of your most representative products
  2. Identify your field types — what's semantic content, what's metadata, what's identity
  3. Build a simple chunk template for one product family
  4. Ingest, embed, and run test queries against a small index
  5. Iterate on the template until retrieval quality is good
  6. Automate the sync at the cadence that matches your feed update frequency

The first working prototype is usually a few hours of work. Getting to production quality — with multi-family templates, incremental sync, metadata filters, and live inventory enrichment — is a sprint, not a quarter.


Start With Your Catalog

Axoverna natively ingests CSV, XML, and JSON product feeds with automatic field detection, configurable chunk templates, and scheduled or webhook-triggered sync. Connect a feed from your ERP, PIM, or supplier portal, and your catalog becomes conversational in under an hour.

Book a demo to see feed ingestion on a catalog like yours, or start a free trial and connect your first feed today.

Ready to get started?

Turn your product catalog into an AI knowledge base

Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.