Beyond the Product Catalog: Building a Complete AI Knowledge Base with Technical Documents
Product catalog data alone leaves your AI unable to answer a huge class of buyer questions. Here's how to bring datasheets, installation manuals, SDS files, and application notes into your RAG pipeline — and why it changes everything.
Most B2B companies, when they first build a conversational product AI, make the same scoping decision: we'll use the product catalog as the knowledge source. SKUs, attributes, specs, descriptions — all the structured product data living in the ERP or PIM gets ingested, chunked, and embedded. The demo looks great. Buyers can ask about dimensions, materials, and compatibility. Leadership is excited.
Then the first wave of real questions arrives:
"How do I install the PV2200 inverter in a three-phase system?"
"What PPE is required when handling product XR-440?"
"The datasheet says max ambient temp is 50°C — does that apply to the fan-cooled version too?"
"Is the new 2024 revision of this cable compatible with the connectors I ordered last year?"
Your catalog AI answers zero of these correctly. It might hallucinate an answer, or it might say I don't have that information — which is technically honest but commercially frustrating. Either way, a buyer with a complex technical question ends up in a support queue.
The problem is architectural: a product catalog describes what a product is. Technical documents describe how it works, how to use it safely, and how it fits into a larger system. B2B buyers need both.
This article walks through how to build the second layer — incorporating datasheets, manuals, safety data sheets, application notes, and related documents into your RAG knowledge base — and the engineering decisions that determine whether it actually works.
What Lives in Technical Documents That Doesn't Live in Your Catalog
Before diving into implementation, it's worth mapping the knowledge gap precisely. Here's what product catalogs typically contain well, and where they fall short:
| Query Type | Product Catalog | Technical Documents |
|---|---|---|
| What is it? (material, dimensions, weight) | ✅ Excellent | ✅ Also covered |
| How does it compare to alternatives? | ⚠️ Partial | ✅ Application notes, comparison guides |
| How do I install / set up / configure it? | ❌ Missing | ✅ Installation manuals |
| What are the safety requirements? | ❌ Missing | ✅ Safety data sheets (SDS/MSDS) |
| What does the wiring diagram show? | ❌ Missing | ✅ Technical drawings in datasheets |
| What firmware version does this require? | ❌ Rarely tracked | ✅ Release notes, tech bulletins |
| What changed between revisions A and B? | ❌ Missing | ✅ Revision history in datasheets |
| What certifications does it carry and to what standard? | ⚠️ Often listed without detail | ✅ Detailed in compliance documents |
| What happens if I operate it outside rated range? | ❌ Missing | ✅ Covered in technical notes |
The gap isn't a data quality issue — it's a structural one. Catalog attributes are designed for filtering and purchasing. Technical documents are designed for engineering and operational decision-making. B2B buyers need both, often in the same conversation.
The Document Taxonomy: What to Prioritize
Not all documents are equally valuable for a product AI. Here's how to prioritize ingestion effort:
Tier 1: High Retrieval Value, Ingest First
Product datasheets are the single most important document type for B2B product AI. A well-written datasheet covers the complete performance specification, operating envelope, application guidance, and certification details in one place. For electrical, mechanical, and industrial components, datasheets answer the majority of pre-sales technical questions.
Safety Data Sheets (SDS / MSDS) are mandatory for products involving chemicals, hazardous materials, or regulated substances. When a buyer's EHS team asks about exposure limits, storage requirements, or disposal procedures, the answer comes from the SDS — not the product listing. Failure to surface this information accurately is a compliance risk.
Installation and commissioning guides answer the how do I deploy this? question that every technical buyer eventually asks. These documents contain procedure steps, torque specifications, clearance requirements, tool lists, and commissioning checklists that have no home in a product catalog.
Tier 2: High Value for Specific Product Categories
Application notes are short engineering documents that describe how to use a product or product family in a specific context — "Using the XR-440 in ATEX Zone 2 environments," for example. They're incredibly valuable for retrieval because they map product capabilities to real buyer scenarios. A buyer asking about a specific application will find an application note far more useful than a spec sheet.
Technical bulletins and product change notices document revisions, corrections, firmware changes, and product supersessions. These are often the difference between a buyer correctly understanding that product revision B has a different pinout than revision A — versus discovering it the hard way on the production line.
Integration and compatibility guides matter most for products that form part of a larger system — software libraries, protocol adapters, industrial communication modules, anything with an API or a bus interface. These documents answer the compatibility and interoperability questions that catalog attributes can't capture.
Tier 3: Ingest When Available
FAQs, service bulletins, and technical Q&A documents compiled from support cases. These are gold if they exist because they represent exactly the questions buyers ask, with verified answers. See our article on building a knowledge base that actually gets used for why this content type often outperforms polished documentation for retrieval quality.
Training materials and product brochures can supplement the knowledge base, but tend to be marketing-heavy and less precise than datasheets. Ingest them selectively.
Extracting Content from PDFs: Where Most Implementations Go Wrong
The majority of technical documents exist as PDFs. This seems straightforward — extract text, chunk it, embed it — but there are several common failure modes that degrade retrieval quality significantly.
Problem 1: Multi-column Layout Destroys Reading Order
Most datasheets use a two-column or three-column layout. Naive PDF extraction that reads left-to-right across the full page width will interleave content from the left column and right column, producing incoherent chunks. For example, a spec table on the left might be interspersed with a cautions section on the right:
Input voltage range: 85–264 VAC WARNING: Do not connect to DC supply
Operating temp: -40 to +70°C Failure to observe this warning may
Max input current: 10A result in equipment damage or personal
Use a PDF extraction library that understands column layout — PyMuPDF's block detection, pdfplumber's table-aware extraction, or a dedicated document parsing service. The extra effort pays off immediately in chunk coherence.
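If you are working with raw block output rather than a full parsing service, the core idea can be sketched in a few lines. This is a hypothetical illustration, assuming the parser yields text blocks with page coordinates (the block shape and function names here are invented for the sketch):

```typescript
// Minimal sketch: restore reading order for a two-column page, assuming
// the PDF parser yields text blocks with page coordinates. Block shape
// is illustrative, not any specific library's output format.
interface TextBlock {
  x: number // left edge of the block on the page
  y: number // top edge of the block on the page
  text: string
}

function orderTwoColumnBlocks(blocks: TextBlock[], pageWidth: number): string {
  const mid = pageWidth / 2
  const left = blocks.filter(b => b.x < mid).sort((a, b) => a.y - b.y)
  const right = blocks.filter(b => b.x >= mid).sort((a, b) => a.y - b.y)
  // Read the left column top-to-bottom, then the right column,
  // instead of interleaving lines across the full page width.
  return [...left, ...right].map(b => b.text).join('\n')
}
```

Real layouts need more care (column detection per page region, headers and footers, tables spanning both columns), but even this simple column split prevents the interleaving shown above.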
Problem 2: Tables Lose Structure When Converted to Text
Specification tables are a core content type in technical documents. A raw text extraction of a specifications table typically looks like:
Parameter Min Typ Max Unit
Input Voltage 85 — 264 VAC
Operating Temperature -40 — +70 °C
Efficiency — 91 — %
The column headers have been separated from the values, and it's not clear which row belongs to which parameter. A language model trying to answer "what is the maximum operating temperature?" may fail to match the "Max" column value to the "Operating Temperature" row.
The fix is to render tables as structured prose during ingestion:
function tableToProseChunk(table: ParsedTable, context: string): string {
  const rows = table.rows.map(row => {
    const param = row[0]
    const min = row[1] !== '—' ? `min ${row[1]}` : null
    const typ = row[2] !== '—' ? `typical ${row[2]}` : null
    const max = row[3] !== '—' ? `max ${row[3]}` : null
    const unit = row[4]
    const values = [min, typ, max]
      .filter(Boolean)
      .join(', ')
    return `${param}: ${values} ${unit}`.trim()
  })
  return `${context}\n\n${rows.join('\n')}`
}

Output: "Operating Temperature: min -40, max +70 °C" — now clearly answerable by a language model.
Problem 3: Figures and Diagrams Are Invisible
Datasheets are full of wiring diagrams, dimensional drawings, block diagrams, and performance curves that communicate critical information. Naive text extraction ignores all of this.
The pragmatic approach: extract figure captions and any surrounding descriptive text, and log which figures exist. For the most important figures (wiring diagrams, dimensional drawings), either:
- Use a multimodal vision model to generate a text description of the figure that gets included in the relevant chunk
- Tag the chunk with a reference to the figure so the chat interface can present the image when the chunk is retrieved
Full multimodal extraction is increasingly viable — see our guide on multimodal RAG and image search for implementation details.
Problem 4: Scanned PDFs with No Text Layer
Older technical documents, especially for legacy equipment, are often scanned images with no searchable text layer. Standard PDF text extraction returns nothing.
The solution is OCR at ingestion time. Tesseract (open source) and commercial APIs (Google Document AI, AWS Textract, Azure Document Intelligence) all handle this reliably. Cloud document intelligence APIs have an additional advantage: they detect table structure and column layout in scanned documents, which significantly improves chunk quality over raw OCR output.
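A cheap pre-check can decide which pages actually need the OCR pass. A minimal sketch, assuming per-page text from a standard extraction pass; the 50-character threshold is an illustrative assumption to tune per corpus:

```typescript
// Heuristic sketch: route pages to OCR when the embedded text layer is
// missing or too sparse to be usable. The 50-character default is an
// assumption to calibrate against your own documents.
interface ExtractedPage {
  pageNumber: number
  text: string // text from the PDF's text layer, '' if none
}

function pagesNeedingOcr(pages: ExtractedPage[], minChars = 50): number[] {
  return pages
    .filter(p => p.text.replace(/\s/g, '').length < minChars)
    .map(p => p.pageNumber)
}
```

Running OCR only on the pages this flags keeps ingestion cost down for mixed corpora where most documents already have a good text layer.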
Chunking Documents: Different Rules than Product Records
Product records chunk naturally — one product, one or a few chunks. Documents require a different approach because they're long, hierarchically structured, and cover multiple topics in sequence.
Section-Based Chunking
Technical documents have a built-in structure: section headings, subsections, and numbered procedures. Use this structure to create semantically coherent chunks rather than slicing at a fixed token count.
interface DocumentSection {
  heading: string
  level: number // 1 = H1, 2 = H2, etc.
  content: string
  pageRange: [number, number]
}

function chunkBySection(sections: DocumentSection[], doc: Document): Chunk[] {
  return sections.map(section => ({
    content: `${section.heading}\n\n${section.content}`,
    metadata: {
      documentId: doc.id,
      documentTitle: doc.title,
      documentType: doc.type, // "datasheet", "installation_guide", etc.
      relatedSkus: doc.relatedSkus, // Link back to product catalog
      section: section.heading,
      pages: section.pageRange,
      revision: doc.revision,
      revisionDate: doc.revisionDate,
    }
  }))
}

The heading becomes part of the chunk content, which helps the embedding model understand the topic context. Without the heading, a chunk containing "Torque the M8 bolts to 18 Nm in a star pattern" could be retrieved for almost any installation question. With the heading "Step 3: Mechanical Installation — Mounting the Control Unit", the chunk retrieves precisely for questions about mounting that specific control unit.
Handling Long Sections
Some document sections are very long — a detailed installation procedure might run 3,000 words. Split these into overlapping sub-chunks (for example, ~400-token chunks that carry the last paragraph forward as overlap) while preserving the section heading in each sub-chunk's content:
function splitLongSection(section: DocumentSection, maxTokens = 400): DocumentSection[] {
  if (estimateTokens(section.content) <= maxTokens) {
    return [section]
  }
  const paragraphs = section.content.split(/\n{2,}/)
  const subchunks: DocumentSection[] = []
  let buffer: string[] = []
  let bufferTokens = 0

  // Only continuation chunks get the "(continued)" marker; the first
  // sub-chunk keeps the plain section heading.
  const flush = () => {
    const suffix = subchunks.length > 0 ? ' (continued)' : ''
    subchunks.push({
      ...section,
      content: `${section.heading}${suffix}\n\n${buffer.join('\n\n')}`,
    })
  }

  for (const paragraph of paragraphs) {
    const pTokens = estimateTokens(paragraph)
    if (bufferTokens + pTokens > maxTokens && buffer.length > 0) {
      flush()
      // Overlap: keep the last paragraph for context continuity
      buffer = [buffer[buffer.length - 1], paragraph]
      bufferTokens = estimateTokens(buffer.join('\n\n'))
    } else {
      buffer.push(paragraph)
      bufferTokens += pTokens
    }
  }
  if (buffer.length > 0) {
    flush()
  }
  return subchunks
}

For more on chunking strategy trade-offs, see our deep dive on document chunking for RAG.
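The estimateTokens helper used in the chunking code above is left undefined. A common rough heuristic is about four characters per token for English text, which is close enough for chunk sizing; swap in a real tokenizer when exact budgets matter:

```typescript
// Rough token estimate for chunk sizing: ~4 characters per token is a
// common approximation for English text. This is a sizing heuristic,
// not an exact count; use a real tokenizer for strict budgets.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}
```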
The Document–Product Link: The Most Important Relationship in Your Knowledge Base
Here's where most document ingestion implementations miss a critical detail: the relationship between a document and the products it covers.
A datasheet for the PV2200 inverter covers that product — and probably also the PV2200-T (three-phase variant) and the PV2200-FAN (fan-cooled variant). An installation guide for a conduit fitting family might cover 40 SKUs. A safety data sheet for a product line might cover a hundred variants sharing the same chemical composition.
If you don't capture these relationships, two problems arise:
- False negatives in retrieval: A buyer asks about SKU PV2200-T and the system doesn't surface the installation guide because the guide is only linked to PV2200.
- Incorrect answer attribution: The AI uses a document chunk to answer a question but the document applies to a different product variant with different specifications.
The solution is a document–product mapping that you maintain as part of your knowledge base:
interface DocumentProductMap {
  documentId: string
  documentType: DocumentType
  revision: string
  coveredSkus: string[] // All SKUs this document applies to
  primarySku?: string // The "canonical" product if applicable
  applicabilityNote?: string // e.g., "Applies to all variants with firmware 2.x+"
}

Store coveredSkus as metadata on every chunk derived from that document. At query time, you can now:
- Filter document chunks to those covering a specific SKU: "find installation instructions for PV2200-FAN"
- When returning a product record and retrieving related documents, pull all docs where coveredSkus includes that SKU
This bidirectional linking is what enables a buyer to ask "how do I wire up the PV2200-T?" and get the correct installation guide section, even if the guide is titled simply "PV2200 Series Installation Manual."
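Resolving that lookup is simple once coveredSkus is stored on every chunk. A minimal sketch, with chunk shapes invented for illustration:

```typescript
// Sketch: resolve which ingested chunks apply to a given SKU via the
// coveredSkus metadata. The chunk shape here is illustrative.
interface DocChunk {
  content: string
  metadata: { documentId: string; coveredSkus: string[] }
}

function chunksForSku(chunks: DocChunk[], sku: string): DocChunk[] {
  return chunks.filter(c => c.metadata.coveredSkus.includes(sku))
}
```

In production this filter runs inside the vector store as a metadata predicate rather than in application code, but the mapping logic is the same.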
Versioning and Supersession
Technical documents have revisions. A datasheet updated to reflect a PCB revision that changed the pinout is not interchangeable with the previous version — and serving the wrong revision can cause real problems.
Your ingestion pipeline should:
- Track document revisions explicitly in metadata (revision: "C", revisionDate: "2025-11-01")
- Replace old chunks when a new revision is ingested — don't let revision A and revision C chunks coexist in the index
- Flag superseded documents — if a product has been replaced by a new model, mark the old product's documents as superseded and note the replacement
async function ingestDocumentRevision(doc: ParsedDocument): Promise<void> {
  // Delete all chunks from previous revisions of this document
  await vectorStore.deleteWhere({
    documentId: doc.id,
    revision: { $ne: doc.revision } // Delete all except current revision
  })

  // Ingest new chunks
  const chunks = generateChunks(doc)
  await vectorStore.upsert(chunks)

  // Update document registry
  await documentRegistry.upsert({
    id: doc.id,
    revision: doc.revision,
    revisionDate: doc.revisionDate,
    ingestedAt: new Date(),
  })
}

This is analogous to the catalog freshness challenge covered in our article on product catalog sync and RAG freshness — the same principles apply, with the added complication that document revisions have explicit version numbers that catalog records usually don't.
Querying Across Catalog and Documents Together
Once both your product catalog and technical documents are in the same vector store (with appropriate metadata), you need query-time logic that decides when to pull from which source.
For most B2B product AI deployments, the right approach is unified retrieval with source diversity:
- Run a single similarity search across all chunk types
- Ensure the top-K results include chunks from at least two different source types when available (catalog record, datasheet, installation guide, etc.)
- In the LLM prompt, cite the source type for each piece of information used
async function searchKnowledgeBase(query: string, sku?: string): Promise<RetrievedContext> {
  const filter = sku ? { coveredSkus: { $contains: sku } } : undefined

  // Retrieve candidates
  const candidates = await vectorStore.similaritySearch(query, {
    topK: 30,
    filter,
  })

  // Ensure source diversity — don't return 10 chunks all from the same datasheet
  const diversified = diversifyBySource(candidates, {
    maxPerDocument: 3,
    desiredSourceTypes: ['product_record', 'datasheet', 'installation_guide'],
    totalK: 8,
  })

  return buildContext(diversified)
}

In the system prompt, instruct the LLM to attribute answers to their source:
When the answer comes from a specific document (datasheet, installation manual, etc.), mention the document name and revision. For example: "According to the PV2200 Installation Manual (Rev C), the recommended torque is 18 Nm."
This attribution serves two purposes: it helps buyers verify the information, and it builds trust in the AI's answers by making the reasoning traceable.
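The diversifyBySource helper in the retrieval code above is left undefined. A minimal version, assuming candidates arrive sorted best-first and omitting the desiredSourceTypes preference for brevity:

```typescript
// Minimal sketch of source diversification: walk similarity-ranked
// candidates and cap how many chunks any single document contributes.
// Candidate shape is illustrative; the desiredSourceTypes preference
// from the article's signature is omitted here for brevity.
interface Candidate {
  documentId: string
  sourceType: string
  content: string
}

function diversifyBySource(
  candidates: Candidate[], // assumed sorted best-first by similarity
  opts: { maxPerDocument: number; totalK: number }
): Candidate[] {
  const perDoc = new Map<string, number>()
  const picked: Candidate[] = []
  for (const c of candidates) {
    const used = perDoc.get(c.documentId) ?? 0
    if (used >= opts.maxPerDocument) continue
    perDoc.set(c.documentId, used + 1)
    picked.push(c)
    if (picked.length === opts.totalK) break
  }
  return picked
}
```

A fuller version would additionally reserve slots for each desired source type so a single strong datasheet can't crowd out the product record entirely.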
A Note on Safety-Critical Information
If your product range includes chemicals, electrical equipment, pressure vessels, medical devices, or anything else where incorrect information poses a safety risk — your document ingestion pipeline has an extra responsibility.
For SDS/MSDS content in particular:
- Always retrieve from the authoritative, current revision — don't serve chunks from an outdated SDS that may have different exposure limits
- Never let the LLM paraphrase safety-critical values — if the SDS says the TWA is 25 ppm, the AI should quote that number exactly, not interpret it
- Add a disclaimer for safety-related responses that directs the buyer to the full SDS document for authoritative guidance
- Suppress low-confidence answers — if the retrieval score for an SDS query is below your confidence threshold, fail over to a human rather than guessing (see building trust in AI responses)
The regulatory exposure from a product AI that confidently gives incorrect safety information is significant. Conservative handling of SDS content is not optional.
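The low-confidence suppression rule can be made concrete with a small gate. A sketch, where the 0.75 threshold is an illustrative assumption to calibrate against your own index and score distribution:

```typescript
// Sketch of conservative gating for safety-critical (SDS) queries:
// answer only when the best retrieval score clears a threshold,
// otherwise escalate to a human. The 0.75 default is illustrative.
type SafetyDecision = { action: 'answer' } | { action: 'escalate'; reason: string }

function gateSafetyAnswer(topScore: number, threshold = 0.75): SafetyDecision {
  if (topScore >= threshold) return { action: 'answer' }
  return {
    action: 'escalate',
    reason: `Retrieval confidence ${topScore.toFixed(2)} is below the safety threshold`,
  }
}
```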
Measuring Document Knowledge Quality
Once your document layer is live, track these signals to understand whether it's actually helping:
Document coverage rate: What percentage of active SKUs have at least one associated technical document ingested? A coverage rate below 60% means a large class of product questions will be unanswerable.
Document-retrieval rate in live sessions: Of all chat sessions, what fraction retrieve at least one document chunk (vs. relying only on catalog data)? This tells you whether the document layer is being reached.
Query types now resolved without escalation: Track the before/after on question categories like installation, safety, and compatibility. If your escalation rate on these drops after document ingestion, the layer is working.
SDS / safety query accuracy: For any regulated products, run a periodic evaluation set against authoritative SDS values. This is a non-negotiable quality check.
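The coverage-rate metric is straightforward to compute from the document-product map described earlier. A sketch with illustrative shapes:

```typescript
// Sketch: document coverage rate across active SKUs, computed from the
// document-product map. Input shapes are illustrative.
function documentCoverageRate(
  activeSkus: string[],
  docMaps: { coveredSkus: string[] }[]
): number {
  const covered = new Set(docMaps.flatMap(d => d.coveredSkus))
  const withDocs = activeSkus.filter(sku => covered.has(sku)).length
  return activeSkus.length === 0 ? 0 : withDocs / activeSkus.length
}
```

Tracking this per document type (datasheet coverage vs. SDS coverage vs. installation-guide coverage) is more actionable than a single blended number.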
Getting Started: A Practical Rollout Plan
If you're starting from a product catalog and adding documents, here's a pragmatic sequence:
Week 1–2: Inventory and prioritize. Audit what documents exist for your top 20% of SKUs (by sales volume or support ticket frequency). These are the products where document knowledge has the highest ROI.
Week 2–3: Ingest datasheets first. Datasheets are the highest-value, most consistently available document type. Get them ingested with proper table-to-prose conversion and SKU linkage before tackling anything else.
Week 3–4: Add SDS/MSDS for regulated products. If you have any products with compliance requirements, this is non-negotiable and should happen early.
Month 2: Expand to installation guides and application notes. Once the ingestion pipeline is working reliably for datasheets, extending it to other document types is mostly a configuration change.
Ongoing: Automate revision tracking. As products evolve, set up a process to detect when manufacturer documents have been updated and re-ingest. Manual processes decay; automation sustains.
The Compound Effect
The value of combining product catalog data with technical documents isn't additive — it's multiplicative. Buyers don't ask questions that fit neatly into one category. They ask questions like:
"I need to install the XR-440 in a Zone 2 explosive atmosphere — is that possible, and if so, what's the procedure?"
Answering this correctly requires: the product's ATEX certification (from the catalog or datasheet), the Zone 2 applicability confirmation (from an application note), and the relevant installation procedure (from the installation manual). No single source contains all of it. Only a knowledge base with multiple document types can synthesize the complete answer.
That's the competitive moat: an AI that can reason across the full depth of your product knowledge, not just the top layer of catalog attributes. Buyers who get answers like that don't go looking elsewhere.
Start Building the Complete Knowledge Layer
Axoverna's document library supports ingestion of PDFs, datasheets, installation guides, SDS files, and application notes alongside your product catalog — with automatic SKU linkage, revision management, and table-aware extraction.
Start a free trial and upload your first documents in minutes, or book a demo to see how a unified catalog + document knowledge base answers the questions your catalog AI can't.
Related articles
Clarifying Questions in B2B Product AI: How to Reduce Zero-Context Queries Without Adding Friction
Many high-intent B2B buyers ask vague product questions like 'Do you have this in stainless?' or 'What's the replacement for the old one?'. The best product AI does not guess. It asks the minimum useful clarifying question, grounded in catalog data, to guide buyers to the right answer faster.
When Product AI Should Hand Off to a Human: Designing Escalation That Actually Helps B2B Buyers
A strong product AI should not try to answer everything. In B2B commerce, the best systems know when to keep helping, when to ask clarifying questions, and when to route the conversation to a human with the right context.
Catalog Coverage Analysis for Product AI: How to Find the Blind Spots Before Your Users Do
Most product AI failures are not hallucinations, but coverage failures. Before launch, B2B teams should measure which products, attributes, documents, and query types their knowledge layer can actually answer well, and where it cannot.