Keeping Your Product AI Fresh: Catalog Sync, Versioning, and Change Detection

A RAG system is only as good as the data it retrieves. As your product catalog evolves — new items, discontinued lines, updated specs — your AI can silently drift into giving stale answers. Here's how to build a catalog sync pipeline that keeps it current.

Axoverna Team
17 min read

Here's a scenario that plays out more often than it should: a B2B distributor spends weeks configuring a product knowledge AI. It launches well — customers love it, support tickets drop, sales reps stop fielding basic spec questions. Six months later, the same AI is confidently quoting a lead time that changed four months ago, recommending a product that was discontinued last quarter, and citing a specification sheet that has since been revised three times.

Nobody broke anything. The AI just fell behind.

This is the catalog drift problem, and it's the most underestimated operational challenge in deploying a RAG-based product knowledge system. The technical work of building retrieval-augmented generation is well-documented. What gets less attention is the ongoing work of keeping that system synchronized with a product catalog that never stops changing.

This article covers what catalog drift looks like in practice, how to detect it reliably, and the architectural patterns that keep your product AI current without requiring manual re-indexing every time something changes.


Why Product Catalogs Change Constantly

Before designing a sync strategy, it helps to understand the full surface area of change in a typical B2B product catalog.

Product lifecycle events are the obvious ones: new SKUs added, items discontinued, end-of-life notices. In a large wholesale catalog, this can mean hundreds of SKU changes per week.

Specification updates are more subtle and more dangerous. A manufacturer updates a torque specification, a chemical supplier revises a safety data sheet, a component changes its operating temperature range. These updates don't create new products — they silently change what your AI "knows" about existing ones.

Pricing and availability change constantly — sometimes daily. While product knowledge AI is typically not the right place to serve live pricing (that belongs in your ERP or pricing layer), availability status and lead time ranges often end up embedded in product descriptions and are thus indexed into the knowledge base.

Documentation revisions compound the problem. Product PDFs, installation guides, and spec sheets are versioned documents. When the document in your knowledge base is version 2 and the manufacturer has published version 5, every answer your AI draws from that document is potentially outdated.

Structural changes — product line reorganizations, category merges, brand consolidations — can make entire sections of your catalog index stale simultaneously.

A well-designed system needs to handle all of these change types without requiring a full re-index from scratch each time.


The Cost of Catalog Drift

Stale product AI has real business consequences that extend beyond customer frustration.

Liability exposure: In regulated industries — chemicals, electrical components, medical devices — an outdated safety specification cited with AI-generated confidence is not just embarrassing; it can be a legal and safety issue. A distributor confidently quoting an old pressure rating for a valve that has since been derated is giving customers incorrect safety information.

Erosion of AI trust: When users encounter one confidently wrong answer, they stop trusting all answers — including the correct ones. This is disproportionate: it takes many good answers to build trust, and one bad answer to destroy it. As we covered in our article on building trust in AI responses, trust is the foundation everything else depends on.

Support ticket boomerang: The sales pitch for product AI is that it reduces support burden. When the AI starts giving wrong answers about discontinued products or outdated specs, support tickets don't just return to baseline — they often spike above it, because now customers are confused by conflicting information from the AI and your human team.

Invisible errors are worse than visible ones: A system that says "I don't know" sends a clear signal to users to ask a human. A system that confidently gives a wrong answer doesn't. Catalog drift creates confident errors, which are significantly more harmful than acknowledged gaps.


Anatomy of a Catalog Sync Pipeline

An effective sync pipeline has four stages: change detection, targeted re-ingestion, removal handling, and post-sync validation.

Stage 1: Change Detection

You can't update what you don't know has changed. Change detection is where most ad-hoc sync implementations fail — they either poll too infrequently (batch re-index nightly, miss intraday changes) or re-process everything on every run (expensive and slow).

The right approach depends on your data sources:

Webhook-driven change feeds: If your PIM (Product Information Management system) or ERP supports outbound webhooks on record changes, this is the gold standard. Every product update triggers an immediate event that kicks off a targeted re-ingestion for just that product. Axoverna's PIM integration layer is designed to work exactly this way — connecting webhook feeds directly into the ingestion pipeline.

Change data capture (CDC): If your product data lives in a relational database, CDC tools (Debezium is the standard open-source option) monitor the database transaction log and emit an event for every INSERT, UPDATE, and DELETE. This gives you a real-time stream of changes without requiring polling or modifying application code.

Checksum-based polling: For simpler setups or when operating against a catalog export (CSV, XML, JSON) rather than a live system, a hash-based approach works well. At each sync interval, compute a hash of each product record and compare it against the stored hash from the last sync. Only records with changed hashes are re-processed.

import crypto from 'node:crypto'

interface ProductRecord {
  id: string
  data: Record<string, unknown>
}
 
async function detectChanges(
  currentProducts: ProductRecord[],
  hashStore: Map<string, string>
): Promise<{ added: string[]; changed: string[]; removed: string[] }> {
  const currentIds = new Set(currentProducts.map(p => p.id))
  const previousIds = new Set(hashStore.keys())
 
  const added: string[] = []
  const changed: string[] = []
 
  for (const product of currentProducts) {
    const currentHash = computeHash(product.data)
    const previousHash = hashStore.get(product.id)
 
    if (!previousHash) {
      added.push(product.id)
    } else if (currentHash !== previousHash) {
      changed.push(product.id)
    }
  }
 
  const removed = [...previousIds].filter(id => !currentIds.has(id))
 
  return { added, changed, removed }
}
 
function computeHash(data: Record<string, unknown>): string {
  // Recursively sort keys for a stable serialization. (Passing a key array as
  // JSON.stringify's replacer filters nested objects to the same top-level
  // keys, silently dropping nested fields.)
  const sortKeys = (v: unknown): unknown =>
    v && typeof v === 'object' && !Array.isArray(v)
      ? Object.fromEntries(
          Object.entries(v as Record<string, unknown>)
            .sort(([a], [b]) => (a < b ? -1 : 1))
            .map(([k, val]) => [k, sortKeys(val)])
        )
      : v
  return crypto.createHash('sha256').update(JSON.stringify(sortKeys(data))).digest('hex')
}

Document modification timestamps: For file-based sources (PDF spec sheets, Word documents), mtime comparison is a simple first-pass filter. Only documents newer than the last sync need to be re-processed.
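As a minimal sketch of this first-pass filter, the comparison can be kept as a pure function (so it is testable without touching the filesystem) with a thin wrapper around Node's fs module; the flat-directory layout and function names here are illustrative:

```typescript
import { readdirSync, statSync } from 'node:fs'
import { join } from 'node:path'

interface FileEntry {
  path: string
  mtime: Date
}

// Pure filter: keep only files modified after the last successful sync.
function modifiedSince(entries: FileEntry[], lastSync: Date): string[] {
  return entries.filter(e => e.mtime > lastSync).map(e => e.path)
}

// Filesystem wrapper: list a document directory and stat each file.
// Assumes a flat directory of source documents (PDFs, Word files, etc.).
function findModifiedDocuments(dir: string, lastSync: Date): string[] {
  const entries = readdirSync(dir).map(name => {
    const path = join(dir, name)
    return { path, mtime: statSync(path).mtime }
  })
  return modifiedSince(entries, lastSync)
}
```

Keep in mind that mtime is a coarse signal: it changes on any save, including ones that don't alter content, so pair it with a content hash before committing to a full re-parse.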

Stage 2: Targeted Re-ingestion

Once you have a list of changed records, re-ingestion should be surgical — not a full re-index. For each changed product:

  1. Retrieve the new product data (from PIM, ERP, or file source)
  2. Extract and clean text (strip HTML, normalize units, expand abbreviations — the same enrichment pipeline used at initial ingest)
  3. Chunk the document using the same chunking strategy as the original ingest (document chunking strategy is worth reading if you haven't designed yours yet)
  4. Generate new embeddings for each chunk
  5. Replace the old chunks in the vector store (don't append — this is where drift accumulates if you're not careful)

Step 5 deserves emphasis. A common mistake is to add new chunks without deleting the old ones. Vector databases don't have a concept of "the same product" — they just have chunks with metadata. If you re-ingest a product and add the new chunks without removing the old ones, retrieval will surface both old and new chunks, potentially mixing stale and current information in the same response context.

The clean solution: associate a product_id (or equivalent stable identifier) with every chunk at ingest time. When re-ingesting, first delete all chunks with that product_id, then insert the new chunks.

async function reingestProduct(
  productId: string,
  newData: ProductRecord,
  vectorStore: VectorStore,
  embedder: Embedder
): Promise<void> {
  // Delete all existing chunks for this product first. Note that delete-then-insert
  // is not atomic: if your vector store supports transactions or staged swaps,
  // use them to avoid a brief window where the product has no chunks at all.
  await vectorStore.deleteByFilter({ product_id: productId })
 
  // Process and chunk new data
  const enrichedText = enrichProductText(newData)
  const chunks = chunkDocument(enrichedText, { productId })
 
  // Generate embeddings
  const embeddings = await embedder.embed(chunks.map(c => c.text))
 
  // Insert new chunks
  await vectorStore.upsertBatch(
    chunks.map((chunk, i) => ({
      id: `${productId}-chunk-${i}`,
      text: chunk.text,
      embedding: embeddings[i],
      metadata: {
        product_id: productId,
        chunk_index: i,
        ingested_at: new Date().toISOString(),
        source_hash: computeHash(newData.data),
      },
    }))
  )
}

Stage 3: Handling Removals and Discontinuations

Removed products need special treatment — not just in the vector store, but in how the AI responds when asked about them.

The naive approach is to delete chunks for discontinued products and let the AI respond "I don't have information about that product." This is correct but unhelpful. A better approach:

Maintain a discontinuation index: A lightweight lookup (a simple database table or even a JSON file, depending on your catalog size) that records products that have been discontinued, with their discontinuation date and — if available — recommended replacement products.

Before retrieval runs, check if the query matches a product in the discontinuation index. If it does, inject a "this product has been discontinued" notice into the retrieved context, along with any replacement recommendation. The AI can then give a genuinely useful response: "Product X was discontinued as of [date]. The recommended replacement is Product Y, which has the following specifications..."

This is meaningfully better than silence. Buyers asking about discontinued parts are usually trying to replace them, and a RAG system that can bridge the gap between the product a buyer knows and the product that replaced it provides real value.

interface DiscontinuedProduct {
  id: string
  name: string
  discontinuedAt: string
  replacementId?: string
  replacementName?: string
  notes?: string
}
 
async function enrichContextWithDiscontinuations(
  query: string,
  retrievedChunks: Chunk[],
  discontinuationIndex: Map<string, DiscontinuedProduct>
): Promise<Chunk[]> {
  const mentionedProducts = extractProductReferences(query)
 
  for (const productRef of mentionedProducts) {
    const discontinued = discontinuationIndex.get(productRef)
    if (discontinued) {
      retrievedChunks.unshift({
        text: buildDiscontinuationNotice(discontinued),
        metadata: { type: 'discontinuation_notice', product_id: discontinued.id },
        score: 1.0, // Pin to top of context
      })
    }
  }
 
  return retrievedChunks
}
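The buildDiscontinuationNotice helper above is left undefined; a minimal sketch might look like the following (the wording is illustrative and should be tuned to your prompt conventions, and the interface is repeated here so the snippet stands alone):

```typescript
interface DiscontinuedProduct {
  id: string
  name: string
  discontinuedAt: string
  replacementId?: string
  replacementName?: string
  notes?: string
}

// Build the plain-text notice injected at the top of the retrieval context.
function buildDiscontinuationNotice(p: DiscontinuedProduct): string {
  const replacement = p.replacementName
    ? ` The recommended replacement is ${p.replacementName}.`
    : ''
  const notes = p.notes ? ` ${p.notes}` : ''
  return `IMPORTANT: ${p.name} (${p.id}) was discontinued on ${p.discontinuedAt}.${replacement}${notes}`
}
```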

Stage 4: Post-Sync Validation

After a sync run completes, validate that the operation succeeded and the knowledge base is internally consistent. At minimum, check:

  • Coverage: Every active product in the catalog has at least one chunk indexed. Zero-chunk products are a silent failure mode — the AI will just say it doesn't know anything about them.
  • Staleness: For each indexed chunk, the ingested_at timestamp should be recent relative to the corresponding catalog record's updated_at. A chunk indexed six months ago for a product last modified three months ago is stale.
  • Orphan detection: Chunks with product_id values that no longer exist in the catalog (e.g., from a failed deletion during a previous sync) should be flagged and removed.
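The three checks above can be sketched as a single pass over the catalog and the indexed chunks. This is a simplified model (the interfaces and field names are illustrative); it relies on the fact that ISO 8601 timestamps compare correctly as strings:

```typescript
interface CatalogRecord {
  id: string
  updatedAt: string // ISO timestamp of the record's last modification
}

interface IndexedChunk {
  productId: string
  ingestedAt: string // ISO timestamp written at ingest time
}

interface ValidationReport {
  missingProducts: string[]     // active products with zero indexed chunks
  staleProducts: string[]       // products whose chunks predate the catalog update
  orphanChunks: IndexedChunk[]  // chunks whose product no longer exists
}

function validateIndex(catalog: CatalogRecord[], chunks: IndexedChunk[]): ValidationReport {
  const chunksByProduct = new Map<string, IndexedChunk[]>()
  for (const c of chunks) {
    const list = chunksByProduct.get(c.productId) ?? []
    list.push(c)
    chunksByProduct.set(c.productId, list)
  }
  const catalogIds = new Set(catalog.map(r => r.id))

  const missingProducts = catalog
    .filter(r => !chunksByProduct.has(r.id))
    .map(r => r.id)
  // ISO timestamps sort lexicographically, so plain string comparison works.
  const staleProducts = catalog
    .filter(r => (chunksByProduct.get(r.id) ?? []).some(c => c.ingestedAt < r.updatedAt))
    .map(r => r.id)
  const orphanChunks = chunks.filter(c => !catalogIds.has(c.productId))

  return { missingProducts, staleProducts, orphanChunks }
}
```

Running this report after every sync, and alerting when any list is non-empty, catches the silent failure modes before users do.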

Versioning: Knowing What You Indexed

Beyond detecting what changed, it's useful to track what version of a product was indexed. This serves two purposes:

  1. Debugging: When a user reports an incorrect answer, you can trace exactly which version of the product data was active when the relevant chunks were created.
  2. Rollback: If a bad data update propagates to your knowledge base, you can identify affected chunks and — if you've retained previous versions of the source data — restore the previous state.

A lightweight versioning scheme attaches source version metadata to every chunk:

interface ChunkMetadata {
  product_id: string
  chunk_index: number
  ingested_at: string       // ISO timestamp of when this chunk was indexed
  source_hash: string       // Hash of the product record at ingest time
  source_version?: string   // PIM version number, if available
  source_url?: string       // URL or path of the source document
  document_mtime?: string   // Modification timestamp of source document
}

This metadata doesn't need to be retrievable by the AI — it's operational instrumentation. Store it in your vector database's metadata fields, where it can be queried for monitoring and debugging without affecting retrieval behavior.


Sync Frequency and Architecture Tradeoffs

There's no universal answer to "how often should we sync?" — it depends on the velocity of your catalog changes and the freshness SLA your business requires.

| Sync Strategy | Latency | Complexity | Best For |
| --- | --- | --- | --- |
| Webhook / real-time CDC | Seconds | High | High-velocity catalogs, regulated industries |
| Near-real-time (5–15 min polling) | Minutes | Medium | Standard B2B distributors |
| Daily batch | Hours | Low | Slow-moving catalogs, internal tools |
| On-demand (manual trigger) | N/A | Very low | Small catalogs, low change frequency |

For most B2B distributors, a near-real-time polling sync (every 5–15 minutes) with webhook capability for high-priority change types is a practical starting point. It's significantly simpler than full CDC, handles the vast majority of catalog change velocity, and keeps freshness within a window that's acceptable for most product knowledge use cases.

Reserve real-time webhook integration for changes where latency genuinely matters: safety data sheet revisions, product recalls, availability status for fast-moving items.
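The polling skeleton itself is small. In this sketch, runSync stands in for the detect-and-re-ingest pipeline described above, and the maxRuns parameter exists only to make the loop testable; note that the "since" watermark only advances on success, so a failed run is retried over the same window:

```typescript
// Run the sync on a fixed interval; failures are logged and retried next tick.
async function pollLoop(
  runSync: (since: Date) => Promise<void>,
  intervalMs: number,
  maxRuns = Infinity // bounded only for testing; production runs forever
): Promise<void> {
  let lastSuccessful = new Date(0)
  for (let run = 0; run < maxRuns; run++) {
    const started = new Date()
    try {
      await runSync(lastSuccessful)
      lastSuccessful = started // advance the watermark only on success
    } catch (err) {
      console.error('sync failed; will retry on the next interval', err)
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
}
```

A webhook handler for high-priority change types can then call the same runSync function directly, with a narrow "since" window, without duplicating pipeline logic.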


Monitoring Catalog Freshness in Production

A sync pipeline that runs silently and fails silently is almost as bad as no sync pipeline at all. Build observability in from the start.

Freshness metrics to track:

  • catalog_coverage_pct: percentage of active catalog products with at least one indexed chunk
  • stale_chunk_count: number of chunks where source_hash doesn't match the current catalog record hash
  • sync_lag_p95: 95th percentile delay between a catalog change and the corresponding chunk update
  • sync_error_rate: percentage of sync operations that fail (and the error categories)

Alerting thresholds (starting points to tune for your catalog):

  • Alert if catalog_coverage_pct drops below 98%
  • Alert if stale_chunk_count exceeds 0.5% of total chunk count
  • Alert if any sync run fails twice in a row
  • Alert if no successful sync has run in over 2× the expected sync interval

A simple Grafana dashboard with these four metrics gives you immediate visibility into knowledge base health. When a product manager asks "is our AI up to date?", you should be able to answer with a number, not a guess.
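The threshold checks above are simple enough to encode directly. This sketch uses the starting-point thresholds from the list; the metric field names follow the list above but are otherwise illustrative:

```typescript
interface FreshnessMetrics {
  catalog_coverage_pct: number
  stale_chunk_count: number
  total_chunk_count: number
  consecutive_sync_failures: number
  ms_since_last_success: number
  expected_interval_ms: number
}

// Evaluate the alerting thresholds; returns a human-readable alert per breach.
function evaluateAlerts(m: FreshnessMetrics): string[] {
  const alerts: string[] = []
  if (m.catalog_coverage_pct < 98) alerts.push('catalog coverage below 98%')
  if (m.stale_chunk_count > 0.005 * m.total_chunk_count) alerts.push('stale chunks above 0.5% of index')
  if (m.consecutive_sync_failures >= 2) alerts.push('two or more consecutive sync failures')
  if (m.ms_since_last_success > 2 * m.expected_interval_ms) alerts.push('no successful sync within 2x expected interval')
  return alerts
}
```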


Practical Patterns for Common Data Sources

Syncing from a PIM

If you're using a PIM system — Akeneo, Pimcore, Plytix, or similar — the cleanest approach is to use the PIM's native export or API as your source of truth, rather than syncing from downstream systems like your ERP or website CMS. PIMs are designed to be the authoritative product data record; sync from the source, not from copies.

Most enterprise PIMs have:

  • A changelog API (records modified since timestamp X)
  • Webhook support for real-time events
  • Export profiles that let you define exactly which attributes to export

Our article on the PIM integration pattern covers the specific attribute mapping and enrichment steps needed to turn structured PIM records into retrieval-quality text.

Syncing from Flat File Exports

Many B2B companies don't have a PIM — they export from their ERP to CSV or XML. The hash-based polling approach described earlier works well here. Keep the hash store as a simple key-value store (Redis, a SQLite database, or even a flat JSON file for smaller catalogs), and run the comparison on each scheduled poll.

The main risk: if the export schema changes (column renames, format changes), your hash-based change detection will flag every record as changed — because the hashes will differ even if the underlying data is the same. Guard against this by hashing field values by name, not by position, and by testing schema changes in a staging environment before promoting to production.
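Hashing by field name rather than position can be sketched as follows: pair each value with its column header, sort by name, then hash. The null-byte separator is a small guard against value concatenation ambiguity; the function name is illustrative:

```typescript
import { createHash } from 'node:crypto'

// Hash a CSV row keyed by column name, so reordering columns in the
// export does not change the hash (renaming or removing a column still will).
function hashRecordByName(header: string[], row: string[]): string {
  const pairs = header
    .map((name, i) => [name, row[i] ?? ''] as const)
    .sort(([a], [b]) => (a < b ? -1 : 1)) // stable order regardless of column position
  const canonical = pairs.map(([k, v]) => `${k}=${v}`).join('\u0000')
  return createHash('sha256').update(canonical).digest('hex')
}
```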

Syncing PDF Spec Sheets

PDF documents are the hardest catalog artifact to sync reliably. PDFs don't have structured IDs you can track easily, and content changes can be subtle (a single number in a table of specifications).

A practical approach:

  1. Assign each spec sheet a stable identifier (the product ID + document type is usually sufficient: PROD-1234-datasheet)
  2. Store the PDF's SHA-256 hash in your chunk metadata
  3. When re-ingesting, compare hashes first — only re-parse and re-chunk if the hash has changed
  4. For documents that change frequently, prefer structured product data (from PIM/ERP) over PDFs when both are available; the structured data is easier to sync reliably
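Steps 2 and 3 reduce to hashing the raw bytes and comparing against the hash stored in chunk metadata at the last ingest. A minimal sketch (the parse-and-chunk step itself is omitted):

```typescript
import { createHash } from 'node:crypto'

// Hash the raw document bytes; re-parse and re-chunk only when this changes.
function documentHash(bytes: Buffer): string {
  return createHash('sha256').update(bytes).digest('hex')
}

// storedHash comes from chunk metadata written at the last ingest;
// undefined means the document has never been indexed.
function shouldReingest(bytes: Buffer, storedHash: string | undefined): boolean {
  return documentHash(bytes) !== storedHash
}
```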

Building Incremental Re-embedding into Your Roadmap

One thing worth planning for: embedding model changes. The field moves fast. The model you chose for your initial ingest may be superseded by a significantly better model in 12–18 months.

When that happens, you'll need to re-embed your entire corpus with the new model — because embeddings from different models live in incompatible vector spaces and can't be mixed. If your catalog is large, this is a significant operation.

A few practices that make this manageable:

  • Track the embedding model version in chunk metadata: embedding_model: "text-embedding-3-large". This lets you identify which chunks were embedded with which model.
  • Design for full re-index as a batch operation: Your sync pipeline's targeted re-ingest logic should be composable into a full catalog re-index when needed — just run the same pipeline over all records instead of just changed ones.
  • Maintain the raw source data: Keep a copy of the processed text for each chunk (before embedding). Re-embedding then requires only running the embedder on stored text — not re-parsing and re-chunking from source documents. This significantly speeds up model migration.
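With model versions tracked in metadata and processed text retained, a migration reduces to selecting the chunks still on the old model and re-embedding them in batches. A sketch (the StoredChunk shape and batch size are illustrative; the actual embed and update calls are your embedder and vector store):

```typescript
interface StoredChunk {
  id: string
  text: string           // processed text retained at ingest time
  embeddingModel: string // e.g. "text-embedding-3-large"
}

// Select chunks still embedded with a model other than the target.
function chunksNeedingMigration(chunks: StoredChunk[], targetModel: string): StoredChunk[] {
  return chunks.filter(c => c.embeddingModel !== targetModel)
}

// Yield fixed-size batches so the embedder is called with bounded request sizes.
function* batches<T>(items: T[], size: number): Generator<T[]> {
  for (let i = 0; i < items.length; i += size) yield items.slice(i, i + size)
}
```

Because migration is idempotent (each chunk either has the target model version or it doesn't), an interrupted run can simply be restarted and will resume where it left off.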

The Freshness SLA Conversation

Before building a sync pipeline, have an explicit internal conversation about freshness requirements — and communicate them clearly to business stakeholders.

A product knowledge AI cannot be real-time for all data types simultaneously without significant investment. Prioritize:

  1. Critical safety-related data (SDS sheets, pressure/electrical ratings): Aim for real-time or near-real-time sync. Budget for the webhook/CDC complexity this requires.
  2. Core specifications and dimensions: Near-real-time (minutes) sync via polling is usually sufficient.
  3. Marketing copy and general product descriptions: Daily batch sync is typically acceptable.
  4. Product imagery and video: These shouldn't be in your RAG corpus at all — serve them directly from your CDN, not via retrieval.

Setting a clear freshness SLA per data type gives your engineering team a tractable problem and gives your business stakeholders a realistic expectation. "Our product specs are synced within 15 minutes of any change in our PIM" is a specific, testable, and valuable commitment.
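One way to make that commitment concrete is to encode the per-data-type SLA as configuration that both the sync scheduler and the freshness monitor read from. The data type names, staleness budgets, and strategy labels here are illustrative:

```typescript
type SyncStrategy = 'webhook' | 'poll' | 'daily_batch' | 'excluded'

interface FreshnessSla {
  dataType: string
  maxStalenessMinutes: number | null // null = not served from the RAG corpus at all
  strategy: SyncStrategy
}

const freshnessSlas: FreshnessSla[] = [
  { dataType: 'safety_data', maxStalenessMinutes: 5, strategy: 'webhook' },
  { dataType: 'specifications', maxStalenessMinutes: 15, strategy: 'poll' },
  { dataType: 'marketing_copy', maxStalenessMinutes: 24 * 60, strategy: 'daily_batch' },
  { dataType: 'media', maxStalenessMinutes: null, strategy: 'excluded' },
]

function slaFor(dataType: string): FreshnessSla | undefined {
  return freshnessSlas.find(s => s.dataType === dataType)
}
```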


The Foundation Everything Else Depends On

Building a retrieval system is the exciting part. Hybrid search, reranking, multi-turn conversations — these are the features that differentiate a good product AI from a bad one. But all of them depend on one unglamorous prerequisite: the data in the knowledge base being correct and current.

A world-class retrieval system running over stale data produces confidently wrong answers. A mediocre retrieval system running over current data produces occasionally imprecise but reliable answers. Freshness isn't optional infrastructure — it's the foundation.

The sync pipeline patterns described here aren't exotic engineering. They're applied versions of standard data engineering practice: change detection, incremental updates, observability, and versioning. What makes them specific to product knowledge AI is the care taken around discontinuations, chunk-level granularity, and the downstream impact of retrieval quality on LLM output.

Get this right, and your product AI stays accurate as your catalog evolves. Get it wrong, and you'll spend months debugging "why is the AI saying that?" — chasing down stale chunks that should have been updated months ago.


Keep Your Product AI Current — Without the Ops Burden

Building and maintaining a robust catalog sync pipeline is non-trivial, especially when you're managing PIM integrations, embedding versioning, discontinuation indexes, and freshness monitoring on top of everything else. At Axoverna, catalog sync is a first-class feature — not an afterthought. We handle ingestion, change detection, re-embedding, and freshness monitoring automatically, so your product AI stays current as your catalog evolves.

Book a demo to see how Axoverna keeps your product knowledge fresh in production, or start a free trial to connect your catalog and experience consistent, accurate AI answers — regardless of how often your product data changes.
