Beyond the Product Catalog: Building a Complete AI Knowledge Base with Technical Documents
Product catalog data alone leaves your AI unable to answer a huge class of buyer questions. Here's how to bring datasheets, installation manuals, SDS files, and application notes into your RAG pipeline — and why it changes everything.
Most B2B companies, when they first build a conversational product AI, make the same scoping decision: we'll use the product catalog as the knowledge source. SKUs, attributes, specs, descriptions — all the structured product data living in the ERP or PIM gets ingested, chunked, and embedded. The demo looks great. Buyers can ask about dimensions, materials, and compatibility. Leadership is excited.
Then the first wave of real questions arrives:
"How do I install the PV2200 inverter in a three-phase system?"
"What PPE is required when handling product XR-440?"
"The datasheet says max ambient temp is 50°C — does that apply to the fan-cooled version too?"
"Is the new 2024 revision of this cable compatible with the connectors I ordered last year?"
Your catalog AI answers zero of these correctly. It might hallucinate an answer, or it might say I don't have that information — which is technically honest but commercially frustrating. Either way, a buyer with a complex technical question ends up in a support queue.
The problem is architectural: a product catalog describes what a product is. Technical documents describe how it works, how to use it safely, and how it fits into a larger system. B2B buyers need both.
This article walks through how to build the second layer — incorporating datasheets, manuals, safety data sheets, application notes, and related documents into your RAG knowledge base — and the engineering decisions that determine whether it actually works.
What Lives in Technical Documents That Doesn't Live in Your Catalog
Before diving into implementation, it's worth mapping the knowledge gap precisely. Here's what product catalogs typically contain well, and where they fall short:
| Query Type | Product Catalog | Technical Documents |
|---|---|---|
| What is it? (material, dimensions, weight) | ✅ Excellent | ✅ Also covered |
| How does it compare to alternatives? | ⚠️ Partial | ✅ Application notes, comparison guides |
| How do I install / set up / configure it? | ❌ Missing | ✅ Installation manuals |
| What are the safety requirements? | ❌ Missing | ✅ Safety data sheets (SDS/MSDS) |
| What does the wiring diagram show? | ❌ Missing | ✅ Technical drawings in datasheets |
| What firmware version does this require? | ❌ Rarely tracked | ✅ Release notes, tech bulletins |
| What changed between revisions A and B? | ❌ Missing | ✅ Revision history in datasheets |
| What certifications does it carry and to what standard? | ⚠️ Often listed without detail | ✅ Detailed in compliance documents |
| What happens if I operate it outside rated range? | ❌ Missing | ✅ Covered in technical notes |
The gap isn't a data quality issue — it's a structural one. Catalog attributes are designed for filtering and purchasing. Technical documents are designed for engineering and operational decision-making. B2B buyers need both, often in the same conversation.
The Document Taxonomy: What to Prioritize
Not all documents are equally valuable for a product AI. Here's how to prioritize ingestion effort:
Tier 1: High Retrieval Value, Ingest First
Product datasheets are the single most important document type for B2B product AI. A well-written datasheet covers the complete performance specification, operating envelope, application guidance, and certification details in one place. For electrical, mechanical, and industrial components, datasheets answer the majority of pre-sales technical questions.
Safety Data Sheets (SDS / MSDS) are mandatory for products involving chemicals, hazardous materials, or regulated substances. When a buyer's EHS team asks about exposure limits, storage requirements, or disposal procedures, the answer comes from the SDS — not the product listing. Failure to surface this information accurately is a compliance risk.
Installation and commissioning guides answer the how do I deploy this? question that every technical buyer eventually asks. These documents contain procedure steps, torque specifications, clearance requirements, tool lists, and commissioning checklists that have no home in a product catalog.
Tier 2: High Value for Specific Product Categories
Application notes are short engineering documents that describe how to use a product or product family in a specific context — "Using the XR-440 in ATEX Zone 2 environments," for example. They're incredibly valuable for retrieval because they map product capabilities to real buyer scenarios. A buyer asking about a specific application will find an application note far more useful than a spec sheet.
Technical bulletins and product change notices document revisions, corrections, firmware changes, and product supersessions. These are often the difference between a buyer correctly understanding that product revision B has a different pinout than revision A — versus discovering it the hard way on the production line.
Integration and compatibility guides matter most for products that form part of a larger system — software libraries, protocol adapters, industrial communication modules, anything with an API or a bus interface. These documents answer the compatibility and interoperability questions that catalog attributes can't capture.
Tier 3: Ingest When Available
FAQs, service bulletins, and technical Q&A documents compiled from support cases. These are gold if they exist because they represent exactly the questions buyers ask, with verified answers. See our article on building a knowledge base that actually gets used for why this content type often outperforms polished documentation for retrieval quality.
Training materials and product brochures can supplement the knowledge base, but tend to be marketing-heavy and less precise than datasheets. Ingest them selectively.
Extracting Content from PDFs: Where Most Implementations Go Wrong
The majority of technical documents exist as PDFs. This seems straightforward — extract text, chunk it, embed it — but there are several common failure modes that degrade retrieval quality significantly.
Problem 1: Multi-column Layout Destroys Reading Order
Most datasheets use a two-column or three-column layout. Naive PDF extraction that reads left-to-right across the full page width will interleave content from the left column and right column, producing incoherent chunks. For example, a spec table on the left might be interspersed with a cautions section on the right:
Input voltage range: 85–264 VAC WARNING: Do not connect to DC supply
Operating temp: -40 to +70°C Failure to observe this warning may
Max input current: 10A result in equipment damage or personal
Use a PDF extraction library that understands column layout — PyMuPDF's block detection, pdfplumber's table-aware extraction, or a dedicated document parsing service. The extra effort pays off immediately in chunk coherence.
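If you are working with raw block output rather than a full parsing service, the core idea can be sketched in a few lines. This is a hypothetical illustration, assuming the parser yields text blocks with page coordinates (the block shape and function names here are invented for the sketch):

```typescript
// Minimal sketch: restore reading order for a two-column page, assuming
// the PDF parser yields text blocks with page coordinates. Block shape
// is illustrative, not any specific library's output format.
interface TextBlock {
  x: number // left edge of the block on the page
  y: number // top edge of the block on the page
  text: string
}

function orderTwoColumnBlocks(blocks: TextBlock[], pageWidth: number): string {
  const mid = pageWidth / 2
  const left = blocks.filter(b => b.x < mid).sort((a, b) => a.y - b.y)
  const right = blocks.filter(b => b.x >= mid).sort((a, b) => a.y - b.y)
  // Read the left column top-to-bottom, then the right column,
  // instead of interleaving lines across the full page width.
  return [...left, ...right].map(b => b.text).join('\n')
}
```

Real layouts need more care (column detection per page region, headers and footers, tables spanning both columns), but even this simple column split prevents the interleaving shown above.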
Problem 2: Tables Lose Structure When Converted to Text
Specification tables are a core content type in technical documents. A raw text extraction of a specifications table typically looks like:
Parameter Min Typ Max Unit
Input Voltage 85 — 264 VAC
Operating Temperature -40 — +70 °C
Efficiency — 91 — %
The column headers have been separated from the values, and it's not clear which row belongs to which parameter. A language model trying to answer "what is the maximum operating temperature?" may fail to match the "Max" column value to the "Operating Temperature" row.
The fix is to render tables as structured prose during ingestion:
function tableToProseChunk(table: ParsedTable, context: string): string {
  const rows = table.rows.map(row => {
    const param = row[0]
    const min = row[1] !== '—' ? `min ${row[1]}` : null
    const typ = row[2] !== '—' ? `typical ${row[2]}` : null
    const max = row[3] !== '—' ? `max ${row[3]}` : null
    const unit = row[4]
    const values = [min, typ, max]
      .filter(Boolean)
      .join(', ')
    return `${param}: ${values} ${unit}`.trim()
  })
  return `${context}\n\n${rows.join('\n')}`
}

Output: "Operating Temperature: min -40, max +70 °C" — now clearly answerable by a language model.
Problem 3: Figures and Diagrams Are Invisible
Datasheets are full of wiring diagrams, dimensional drawings, block diagrams, and performance curves that communicate critical information. Naive text extraction ignores all of this.
The pragmatic approach: extract figure captions and any surrounding descriptive text, and log which figures exist. For the most important figures (wiring diagrams, dimensional drawings), either:
- Use a multimodal vision model to generate a text description of the figure that gets included in the relevant chunk
- Tag the chunk with a reference to the figure so the chat interface can present the image when the chunk is retrieved
Full multimodal extraction is increasingly viable — see our guide on multimodal RAG and image search for implementation details.
Problem 4: Scanned PDFs with No Text Layer
Older technical documents, especially for legacy equipment, are often scanned images with no searchable text layer. Standard PDF text extraction returns nothing.
The solution is OCR at ingestion time. Tesseract (open source) and commercial APIs (Google Document AI, AWS Textract, Azure Document Intelligence) all handle this reliably. Cloud document intelligence APIs have an additional advantage: they detect table structure and column layout in scanned documents, which significantly improves chunk quality over raw OCR output.
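A cheap pre-check can decide which pages actually need the OCR pass. A minimal sketch, assuming per-page text from a standard extraction pass; the 50-character threshold is an illustrative assumption to tune per corpus:

```typescript
// Heuristic sketch: route pages to OCR when the embedded text layer is
// missing or too sparse to be usable. The 50-character default is an
// assumption to calibrate against your own documents.
interface ExtractedPage {
  pageNumber: number
  text: string // text from the PDF's text layer, '' if none
}

function pagesNeedingOcr(pages: ExtractedPage[], minChars = 50): number[] {
  return pages
    .filter(p => p.text.replace(/\s/g, '').length < minChars)
    .map(p => p.pageNumber)
}
```

Running OCR only on the pages this flags keeps ingestion cost down for mixed corpora where most documents already have a good text layer.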
Chunking Documents: Different Rules than Product Records
Product records chunk naturally — one product, one or a few chunks. Documents require a different approach because they're long, hierarchically structured, and cover multiple topics in sequence.
Section-Based Chunking
Technical documents have a built-in structure: section headings, subsections, and numbered procedures. Use this structure to create semantically coherent chunks rather than slicing at a fixed token count.
interface DocumentSection {
  heading: string
  level: number // 1 = H1, 2 = H2, etc.
  content: string
  pageRange: [number, number]
}

function chunkBySection(sections: DocumentSection[], doc: Document): Chunk[] {
  return sections.map(section => ({
    content: `${section.heading}\n\n${section.content}`,
    metadata: {
      documentId: doc.id,
      documentTitle: doc.title,
      documentType: doc.type, // "datasheet", "installation_guide", etc.
      relatedSkus: doc.relatedSkus, // Link back to product catalog
      section: section.heading,
      pages: section.pageRange,
      revision: doc.revision,
      revisionDate: doc.revisionDate,
    }
  }))
}

The heading becomes part of the chunk content, which helps the embedding model understand the topic context. Without the heading, a chunk containing "Torque the M8 bolts to 18 Nm in a star pattern" could be retrieved for almost any installation question. With the heading "Step 3: Mechanical Installation — Mounting the Control Unit", the chunk retrieves precisely for questions about mounting that specific control unit.
Handling Long Sections
Some document sections are very long — a detailed installation procedure might run 3,000 words. Split these into overlapping sub-chunks (for example, ~400-token chunks that carry the last paragraph forward as overlap) while preserving the section heading in each sub-chunk's content:
function splitLongSection(section: DocumentSection, maxTokens = 400): DocumentSection[] {
  if (estimateTokens(section.content) <= maxTokens) {
    return [section]
  }
  const paragraphs = section.content.split(/\n{2,}/)
  const subchunks: DocumentSection[] = []
  let buffer: string[] = []
  let bufferTokens = 0

  // Only continuation chunks get the "(continued)" marker; the first
  // sub-chunk keeps the plain section heading.
  const flush = () => {
    const suffix = subchunks.length > 0 ? ' (continued)' : ''
    subchunks.push({
      ...section,
      content: `${section.heading}${suffix}\n\n${buffer.join('\n\n')}`,
    })
  }

  for (const paragraph of paragraphs) {
    const pTokens = estimateTokens(paragraph)
    if (bufferTokens + pTokens > maxTokens && buffer.length > 0) {
      flush()
      // Overlap: keep the last paragraph for context continuity
      buffer = [buffer[buffer.length - 1], paragraph]
      bufferTokens = estimateTokens(buffer.join('\n\n'))
    } else {
      buffer.push(paragraph)
      bufferTokens += pTokens
    }
  }
  if (buffer.length > 0) {
    flush()
  }
  return subchunks
}

For more on chunking strategy trade-offs, see our deep dive on document chunking for RAG.
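The estimateTokens helper used in the chunking code above is left undefined. A common rough heuristic is about four characters per token for English text, which is close enough for chunk sizing; swap in a real tokenizer when exact budgets matter:

```typescript
// Rough token estimate for chunk sizing: ~4 characters per token is a
// common approximation for English text. This is a sizing heuristic,
// not an exact count; use a real tokenizer for strict budgets.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}
```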
The Document–Product Link: The Most Important Relationship in Your Knowledge Base
Here's where most document ingestion implementations miss a critical detail: the relationship between a document and the products it covers.
A datasheet for the PV2200 inverter covers that product — and probably also the PV2200-T (three-phase variant) and the PV2200-FAN (fan-cooled variant). An installation guide for a conduit fitting family might cover 40 SKUs. A safety data sheet for a product line might cover a hundred variants sharing the same chemical composition.
If you don't capture these relationships, two problems arise:
- False negatives in retrieval: A buyer asks about SKU PV2200-T and the system doesn't surface the installation guide because the guide is only linked to PV2200.
- Incorrect answer attribution: The AI uses a document chunk to answer a question but the document applies to a different product variant with different specifications.
The solution is a document–product mapping that you maintain as part of your knowledge base:
interface DocumentProductMap {
  documentId: string
  documentType: DocumentType
  revision: string
  coveredSkus: string[] // All SKUs this document applies to
  primarySku?: string // The "canonical" product if applicable
  applicabilityNote?: string // e.g., "Applies to all variants with firmware 2.x+"
}

Store coveredSkus as metadata on every chunk derived from that document. At query time, you can now:
- Filter document chunks to those covering a specific SKU: "find installation instructions for PV2200-FAN"
- When returning a product record and retrieving related documents, pull all docs where coveredSkus includes that SKU
This bidirectional linking is what enables a buyer to ask "how do I wire up the PV2200-T?" and get the correct installation guide section, even if the guide is titled simply "PV2200 Series Installation Manual."
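Resolving that lookup is simple once coveredSkus is stored on every chunk. A minimal sketch, with chunk shapes invented for illustration:

```typescript
// Sketch: resolve which ingested chunks apply to a given SKU via the
// coveredSkus metadata. The chunk shape here is illustrative.
interface DocChunk {
  content: string
  metadata: { documentId: string; coveredSkus: string[] }
}

function chunksForSku(chunks: DocChunk[], sku: string): DocChunk[] {
  return chunks.filter(c => c.metadata.coveredSkus.includes(sku))
}
```

In production this filter runs inside the vector store as a metadata predicate rather than in application code, but the mapping logic is the same.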
Versioning and Supersession
Technical documents have revisions. A datasheet updated to reflect a PCB revision that changed the pinout is not interchangeable with the previous version — and serving the wrong revision can cause real problems.
Your ingestion pipeline should:
- Track document revisions explicitly in metadata (revision: "C", revisionDate: "2025-11-01")
- Replace old chunks when a new revision is ingested — don't let revision A and revision C chunks coexist in the index
- Flag superseded documents — if a product has been replaced by a new model, mark the old product's documents as superseded and note the replacement
async function ingestDocumentRevision(doc: ParsedDocument): Promise<void> {
  // Delete all chunks from previous revisions of this document
  await vectorStore.deleteWhere({
    documentId: doc.id,
    revision: { $ne: doc.revision } // Delete all except current revision
  })

  // Ingest new chunks
  const chunks = generateChunks(doc)
  await vectorStore.upsert(chunks)

  // Update document registry
  await documentRegistry.upsert({
    id: doc.id,
    revision: doc.revision,
    revisionDate: doc.revisionDate,
    ingestedAt: new Date(),
  })
}

This is analogous to the catalog freshness challenge covered in our article on product catalog sync and RAG freshness — the same principles apply, with the added complication that document revisions have explicit version numbers that catalog records usually don't.
Querying Across Catalog and Documents Together
Once both your product catalog and technical documents are in the same vector store (with appropriate metadata), you need query-time logic that decides when to pull from which source.
For most B2B product AI deployments, the right approach is unified retrieval with source diversity:
- Run a single similarity search across all chunk types
- Ensure the top-K results include chunks from at least two different source types when available (catalog record, datasheet, installation guide, etc.)
- In the LLM prompt, cite the source type for each piece of information used
async function searchKnowledgeBase(query: string, sku?: string): Promise<RetrievedContext> {
  const filter = sku ? { coveredSkus: { $contains: sku } } : undefined

  // Retrieve candidates
  const candidates = await vectorStore.similaritySearch(query, {
    topK: 30,
    filter,
  })

  // Ensure source diversity — don't return 10 chunks all from the same datasheet
  const diversified = diversifyBySource(candidates, {
    maxPerDocument: 3,
    desiredSourceTypes: ['product_record', 'datasheet', 'installation_guide'],
    totalK: 8,
  })

  return buildContext(diversified)
}

In the system prompt, instruct the LLM to attribute answers to their source:
When the answer comes from a specific document (datasheet, installation manual, etc.), mention the document name and revision. For example: "According to the PV2200 Installation Manual (Rev C), the recommended torque is 18 Nm."
This attribution serves two purposes: it helps buyers verify the information, and it builds trust in the AI's answers by making the reasoning traceable.
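The diversifyBySource helper in the retrieval code above is left undefined. A minimal version, assuming candidates arrive sorted best-first and omitting the desiredSourceTypes preference for brevity:

```typescript
// Minimal sketch of source diversification: walk similarity-ranked
// candidates and cap how many chunks any single document contributes.
// Candidate shape is illustrative; the desiredSourceTypes preference
// from the article's signature is omitted here for brevity.
interface Candidate {
  documentId: string
  sourceType: string
  content: string
}

function diversifyBySource(
  candidates: Candidate[], // assumed sorted best-first by similarity
  opts: { maxPerDocument: number; totalK: number }
): Candidate[] {
  const perDoc = new Map<string, number>()
  const picked: Candidate[] = []
  for (const c of candidates) {
    const used = perDoc.get(c.documentId) ?? 0
    if (used >= opts.maxPerDocument) continue
    perDoc.set(c.documentId, used + 1)
    picked.push(c)
    if (picked.length === opts.totalK) break
  }
  return picked
}
```

A fuller version would additionally reserve slots for each desired source type so a single strong datasheet can't crowd out the product record entirely.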
A Note on Safety-Critical Information
If your product range includes chemicals, electrical equipment, pressure vessels, medical devices, or anything else where incorrect information poses a safety risk — your document ingestion pipeline has an extra responsibility.
For SDS/MSDS content in particular:
- Always retrieve from the authoritative, current revision — don't serve chunks from an outdated SDS that may have different exposure limits
- Never let the LLM paraphrase safety-critical values — if the SDS says the TWA is 25 ppm, the AI should quote that number exactly, not interpret it
- Add a disclaimer for safety-related responses that directs the buyer to the full SDS document for authoritative guidance
- Suppress low-confidence answers — if the retrieval score for an SDS query is below your confidence threshold, fail over to a human rather than guessing (see building trust in AI responses)
The regulatory exposure from a product AI that confidently gives incorrect safety information is significant. Conservative handling of SDS content is not optional.
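The low-confidence suppression rule can be made concrete with a small gate. A sketch, where the 0.75 threshold is an illustrative assumption to calibrate against your own index and score distribution:

```typescript
// Sketch of conservative gating for safety-critical (SDS) queries:
// answer only when the best retrieval score clears a threshold,
// otherwise escalate to a human. The 0.75 default is illustrative.
type SafetyDecision = { action: 'answer' } | { action: 'escalate'; reason: string }

function gateSafetyAnswer(topScore: number, threshold = 0.75): SafetyDecision {
  if (topScore >= threshold) return { action: 'answer' }
  return {
    action: 'escalate',
    reason: `Retrieval confidence ${topScore.toFixed(2)} is below the safety threshold`,
  }
}
```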
Measuring Document Knowledge Quality
Once your document layer is live, track these signals to understand whether it's actually helping:
Document coverage rate: What percentage of active SKUs have at least one associated technical document ingested? A coverage rate below 60% means a large class of product questions will be unanswerable.
Document-retrieval rate in live sessions: Of all chat sessions, what fraction retrieve at least one document chunk (vs. relying only on catalog data)? This tells you whether the document layer is being reached.
Query types now resolved without escalation: Track the before/after on question categories like installation, safety, and compatibility. If your escalation rate on these drops after document ingestion, the layer is working.
SDS / safety query accuracy: For any regulated products, run a periodic evaluation set against authoritative SDS values. This is a non-negotiable quality check.
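The coverage-rate metric is straightforward to compute from the document-product map described earlier. A sketch with illustrative shapes:

```typescript
// Sketch: document coverage rate across active SKUs, computed from the
// document-product map. Input shapes are illustrative.
function documentCoverageRate(
  activeSkus: string[],
  docMaps: { coveredSkus: string[] }[]
): number {
  const covered = new Set(docMaps.flatMap(d => d.coveredSkus))
  const withDocs = activeSkus.filter(sku => covered.has(sku)).length
  return activeSkus.length === 0 ? 0 : withDocs / activeSkus.length
}
```

Tracking this per document type (datasheet coverage vs. SDS coverage vs. installation-guide coverage) is more actionable than a single blended number.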
Getting Started: A Practical Rollout Plan
If you're starting from a product catalog and adding documents, here's a pragmatic sequence:
Week 1–2: Inventory and prioritize. Audit what documents exist for your top 20% of SKUs (by sales volume or support ticket frequency). These are the products where document knowledge has the highest ROI.
Week 2–3: Ingest datasheets first. Datasheets are the highest-value, most consistently available document type. Get them ingested with proper table-to-prose conversion and SKU linkage before tackling anything else.
Week 3–4: Add SDS/MSDS for regulated products. If you have any products with compliance requirements, this is non-negotiable and should happen early.
Month 2: Expand to installation guides and application notes. Once the ingestion pipeline is working reliably for datasheets, extending it to other document types is mostly a configuration change.
Ongoing: Automate revision tracking. As products evolve, set up a process to detect when manufacturer documents have been updated and re-ingest. Manual processes decay; automation sustains.
The Compound Effect
The value of combining product catalog data with technical documents isn't additive — it's multiplicative. Buyers don't ask questions that fit neatly into one category. They ask questions like:
"I need to install the XR-440 in a Zone 2 explosive atmosphere — is that possible, and if so, what's the procedure?"
Answering this correctly requires: the product's ATEX certification (from the catalog or datasheet), the Zone 2 applicability confirmation (from an application note), and the relevant installation procedure (from the installation manual). No single source contains all of it. Only a knowledge base with multiple document types can synthesize the complete answer.
That's the competitive moat: an AI that can reason across the full depth of your product knowledge, not just the top layer of catalog attributes. Buyers who get answers like that don't go looking elsewhere.
Start Building the Complete Knowledge Layer
Axoverna's document library supports ingestion of PDFs, datasheets, installation guides, SDS files, and application notes alongside your product catalog — with automatic SKU linkage, revision management, and table-aware extraction.
Start a free trial and upload your first documents in minutes, or book a demo to see how a unified catalog + document knowledge base answers the questions your catalog AI can't.
Related articles
Clarifying Questions in B2B Product AI: How to Reduce Zero-Context Queries Without Adding Friction
Many high-intent B2B buyers ask vague product questions like 'Do you have this in stainless?' or 'What's the replacement for the old one?'. The best product AI does not guess. It asks the minimum useful clarifying question, grounded in catalog data, to guide buyers to the right answer faster.
When Product AI Should Hand Off to a Human: Designing Escalation That Actually Helps B2B Buyers
A strong product AI should not try to answer everything. In B2B commerce, the best systems know when to keep helping, when to ask clarifying questions, and when to route the conversation to a human with the right context.
Catalog Coverage Analysis for Product AI: How to Find the Blind Spots Before Your Users Do
Most product AI failures are not hallucinations, but coverage failures. Before launch, B2B teams should measure which products, attributes, documents, and query types their knowledge layer can actually answer well, and where it cannot.