Embedding Your Product Knowledge: PDF, CSV, and API Ingestion Patterns
How to get product data into an AI system. Practical patterns for ingesting from PDFs, spreadsheets, APIs, and web pages — with error handling and data quality checks.
The hardest part of building an AI product knowledge system isn't the embeddings or the vector search. It's getting your product data into the system in a form that can be searched effectively. Your product information lives in multiple places: CSVs, PDFs, your ERP system, product pages, internal wikis. Each has different structure, different quality, different update cadence.
This guide covers the practical patterns for ingesting each, with production-grade error handling.
Data Sources and Their Characteristics
CSVs and Spreadsheets
Advantages: Structured, machine-readable, easy to export from any system. Challenges: Limited context (a spreadsheet row is often incomplete without header info), data quality issues, no embedded media.
Typical structure:
SKU,Name,Category,Description,Price,Stock,Specs_JSON
SKF-6205-2RS,Deep groove ball bearing,Bearings,"High-precision bearing...",€12.50,1200,"{ ""bore"": ""25mm"" }"
PDFs
Advantages: Official documents (datasheets, manuals), rich formatting, structured metadata. Challenges: Text extraction is fragile, complex layouts, unstructured content, images lose data.
Typical PDFs: Product datasheets, installation manuals, compliance documentation, technical bulletins.
APIs
Advantages: Real-time, structured, programmatically queryable. Challenges: Rate limits, authentication, downstream dependencies, versioning.
Typical APIs: Your PIM (Product Information Management), ERP inventory, distributor catalog feeds.
Web Pages
Advantages: Latest content, already optimized for reading. Challenges: Noisy extraction (sidebars, ads, irrelevant content), JavaScript rendering, link structure.
Internal Wikis / Confluence
Advantages: Structured knowledge, ownership clarity, version history. Challenges: Often incomplete or inconsistent, different teams format things differently.
Ingestion Pattern 1: CSV Products
CSVs are the easiest starting point. The data is structured, and you can ingest it directly.
```python
import csv
import json
from datetime import datetime
from pathlib import Path


class CSVProductIngester:
    def __init__(self, csv_path: str):
        self.csv_path = Path(csv_path)

    def ingest(self) -> list[dict]:
        """
        Read CSV and convert to product documents.
        Expected columns: sku, name, category, description, specs_json, tags
        """
        products = []
        with open(self.csv_path, newline="") as f:
            reader = csv.DictReader(f)
            for row_num, row in enumerate(reader, start=2):  # Row 1 is the header
                try:
                    product = self._parse_row(row)
                    products.append(product)
                except Exception as e:
                    print(f"Warning: Row {row_num} failed to parse: {e}")
                    continue  # Skip malformed rows, don't crash
        return products

    def _parse_row(self, row: dict) -> dict:
        """Parse a CSV row into a product document."""
        # Validate required fields
        required = ["sku", "name", "category"]
        for field in required:
            if not row.get(field, "").strip():
                raise ValueError(f"Missing required field: {field}")

        # Parse specs (often JSON in a CSV column)
        specs = {}
        if row.get("specs_json"):
            try:
                specs = json.loads(row["specs_json"])
            except json.JSONDecodeError:
                print(f"Warning: Invalid JSON in specs for {row['sku']}")

        # Build product document
        return {
            "external_id": row["sku"].strip(),
            "type": "product",
            "title": row["name"].strip(),
            "content": self._build_content(row),
            "metadata": {
                "category": row["category"].strip(),
                "price": row.get("price", "").strip(),
                "stock_level": row.get("stock", "0"),
                "tags": [t.strip() for t in row.get("tags", "").split(",") if t.strip()],
                "specifications": specs,
                "ingestion_source": "csv",
                "ingestion_timestamp": datetime.now().isoformat(),
            },
        }

    def _build_content(self, row: dict) -> str:
        """Combine multiple fields into searchable content."""
        parts = []

        # Product identity
        parts.append(f"{row['name']} ({row['sku']})")

        # Category context
        if row.get("category"):
            parts.append(f"Category: {row['category']}")

        # Description
        if row.get("description"):
            parts.append(f"Description: {row['description']}")

        # Specifications as text
        if row.get("specs_json"):
            try:
                specs = json.loads(row["specs_json"])
                spec_text = "; ".join(f"{k}: {v}" for k, v in specs.items())
                parts.append(f"Specifications: {spec_text}")
            except json.JSONDecodeError:
                pass  # Invalid specs were already reported in _parse_row

        # Tags
        if row.get("tags"):
            parts.append(f"Tags: {row['tags']}")

        return "\n".join(parts)


# Usage
ingester = CSVProductIngester("products.csv")
products = ingester.ingest()
```

Key points:
- Handle missing/malformed rows gracefully (skip, don't crash)
- Parse nested data (JSON specs) explicitly
- Combine multiple fields into a single "content" field for embedding
- Preserve structured metadata separately so you can filter on it
Ingestion Pattern 2: PDFs
PDF extraction is notoriously fragile. Different layouts, different fonts, different page structures.
```python
import pypdf
from pathlib import Path


class PDFProductIngester:
    def __init__(self, pdf_path: str, product_id: str | None = None):
        """
        Ingest a single PDF as product documentation.

        Args:
            pdf_path: Path to PDF file
            product_id: Optional product ID (inferred from filename if not provided)
        """
        self.pdf_path = Path(pdf_path)
        self.product_id = product_id or self.pdf_path.stem.upper()

    def ingest(self) -> list[dict]:
        """Extract PDF text and split it into chunks."""
        chunks = []
        try:
            with open(self.pdf_path, "rb") as f:
                reader = pypdf.PdfReader(f)
                for page_num, page in enumerate(reader.pages, start=1):
                    text = page.extract_text()
                    if not text.strip():
                        print(f"Warning: Page {page_num} of {self.pdf_path.name} has no extractable text")
                        continue

                    # Split page into logical chunks (paragraphs)
                    paragraphs = text.split("\n\n")
                    for para_idx, para in enumerate(paragraphs):
                        if len(para.strip()) < 50:  # Skip tiny fragments
                            continue

                        chunks.append({
                            "external_id": f"{self.product_id}-page{page_num}-chunk{para_idx}",
                            "type": "manual",
                            "title": f"{self.product_id} - Page {page_num}",
                            "content": para.strip(),
                            "metadata": {
                                "source_file": self.pdf_path.name,
                                "source_product": self.product_id,
                                "source_page": page_num,
                                "source_type": "pdf",
                                "total_pages": len(reader.pages),
                            },
                        })
        except Exception as e:
            print(f"Error processing {self.pdf_path}: {e}")
            return []

        return chunks


# Usage
ingester = PDFProductIngester("Model_3200_Datasheet.pdf", product_id="3200-DS-v2")
chunks = ingester.ingest()
```

Limitations and workarounds:
- Complex layouts: pypdf struggles with multi-column layouts, forms, and tables. Consider commercial services (AWS Textract, Google Document AI) for complex PDFs.
- Scanned PDFs: If the PDF is an image, use OCR (Tesseract, AWS Textract).
- Table extraction: Dedicated libraries like camelot or tabula can extract tables if the structure is regular.
For production systems, the practical approach:
- Try pypdf for simple PDFs (text-based, linear content)
- Use AWS Textract or Google Document AI for complex/scanned PDFs
- Manually review the first 10–20 ingested PDFs to catch structural issues
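One way to route PDFs between the pypdf path and the OCR path is a cheap text-density heuristic on the extracted output: pages with almost no text, or mostly non-letter noise, are likely scanned or garbled. A sketch, with thresholds that are assumptions you should tune against your own documents:

```python
def likely_needs_ocr(extracted_text: str,
                     min_chars: int = 200,
                     min_alpha_ratio: float = 0.5) -> bool:
    """Heuristic: a page with almost no extractable text, or mostly
    non-letter noise, is probably a scanned image and should go to OCR."""
    stripped = extracted_text.strip()
    if len(stripped) < min_chars:
        return True  # Nearly empty extraction: likely an image-only page
    alpha = sum(c.isalpha() for c in stripped)
    return alpha / len(stripped) < min_alpha_ratio
```

Pages flagged here would be routed to Tesseract or Textract; the rest stay on the plain pypdf path.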
Ingestion Pattern 3: API Feeds
If your product data lives in a live system (PIM, ERP, distributor feed), ingest via API. This enables continuous updates.
```python
import hashlib
from datetime import datetime, timedelta

import requests


class APIProductIngester:
    def __init__(self, api_base: str, api_key: str, batch_size: int = 100):
        self.api_base = api_base
        self.api_key = api_key
        self.batch_size = batch_size

    def ingest(self, last_sync: datetime | None = None) -> list[dict]:
        """
        Fetch products from the API.
        Optionally filter to products modified since last_sync.
        """
        products = []
        page = 1
        while True:
            try:
                response = requests.get(
                    f"{self.api_base}/products",
                    headers={"Authorization": f"Bearer {self.api_key}"},
                    params={
                        "page": page,
                        "limit": self.batch_size,
                        "modified_since": last_sync.isoformat() if last_sync else None,
                    },
                    timeout=30,
                )
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"API error on page {page}: {e}")
                break  # Stop gracefully, don't crash the entire ingestion

            data = response.json()
            page_products = data.get("products", [])
            if not page_products:
                break  # No more pages

            for api_product in page_products:
                product = self._map_api_product(api_product)
                # Track a content hash for change detection
                product["_content_hash"] = self._hash_content(product["content"])
                products.append(product)

            page += 1

        return products

    def _map_api_product(self, api_product: dict) -> dict:
        """Map an API response item to the standard product format."""
        return {
            "external_id": api_product["id"],
            "type": "product",
            "title": api_product["name"],
            "content": f"{api_product['name']}\n{api_product.get('description', '')}",
            "metadata": {
                "category": api_product.get("category"),
                "price": api_product.get("price"),
                "stock": api_product.get("stock_level", 0),
                "sku": api_product.get("sku"),
                "url": api_product.get("url"),
                "tags": api_product.get("tags", []),
                "specifications": api_product.get("specs", {}),
                "last_modified": api_product.get("updated_at"),
                "source": "api",
            },
        }

    def _hash_content(self, content: str) -> str:
        """Hash content for change detection (not security, so MD5 is acceptable)."""
        return hashlib.md5(content.encode()).hexdigest()


# Usage: incremental sync of products modified in the last hour
ingester = APIProductIngester(
    api_base="https://your-pim.example.com/api",
    api_key="your_api_key",
)
last_sync = datetime.now() - timedelta(hours=1)
products = ingester.ingest(last_sync=last_sync)
```

Key patterns:
- Pagination handling (loop until empty result)
- Graceful error handling (skip failed requests, continue ingestion)
- Content hashing for change detection (only re-embed changed products)
- Incremental sync (fetch only recently modified products)
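The change-detection pattern can be sketched as a comparison between freshly computed hashes and those stored from the previous sync; only mismatches get re-embedded. The `products_to_reembed` helper and sample data below are illustrative, under the assumption that you persist hashes keyed by `external_id`:

```python
import hashlib


def hash_content(content: str) -> str:
    """Same MD5 change-detection hash as the ingester above."""
    return hashlib.md5(content.encode()).hexdigest()


def products_to_reembed(products: list[dict], stored_hashes: dict[str, str]) -> list[dict]:
    """Return only products that are new or whose content changed since last sync."""
    changed = []
    for p in products:
        h = p.get("_content_hash") or hash_content(p["content"])
        if stored_hashes.get(p["external_id"]) != h:
            changed.append(p)
    return changed


stored = {"A-100": hash_content("Widget A\nOld description")}
incoming = [
    {"external_id": "A-100", "content": "Widget A\nNew description"},  # changed
    {"external_id": "B-200", "content": "Widget B"},                   # brand new
]
to_embed = products_to_reembed(incoming, stored)  # both items qualify
```

After embedding, you would write the new hashes back to `stored` so the next sync skips anything unchanged. This is what keeps embedding costs proportional to change volume rather than catalog size.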
Ingestion Pattern 4: Web Scraping
Less desirable than APIs, but sometimes necessary.
```python
import time
from datetime import datetime

import requests
from bs4 import BeautifulSoup


class WebScraperIngester:
    def __init__(self, base_url: str, delay: float = 1.0):
        self.base_url = base_url
        self.delay = delay  # Respectful delay between requests
        self.session = requests.Session()
        self.session.headers.update({"User-Agent": "Product Knowledge Bot"})

    def ingest_product_pages(self, product_urls: list[str]) -> list[dict]:
        """Scrape a list of product URLs."""
        products = []
        for url in product_urls:
            time.sleep(self.delay)  # Rate limiting
            try:
                product = self._scrape_product(url)
                if product:
                    products.append(product)
            except Exception as e:
                print(f"Error scraping {url}: {e}")
                continue
        return products

    def _scrape_product(self, url: str) -> dict | None:
        """Scrape a single product page."""
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract data (CSS selectors depend on your site structure)
        title_elem = soup.select_one("h1.product-title")
        if not title_elem:
            return None  # Not a valid product page

        return {
            "external_id": url,
            "type": "product",
            "title": title_elem.get_text(strip=True),
            "content": self._extract_text(soup),
            "metadata": {
                "source_url": url,
                "source_type": "web",
                "scraped_at": datetime.now().isoformat(),
            },
        }

    def _extract_text(self, soup: BeautifulSoup) -> str:
        """Extract all meaningful text from the page."""
        # Remove script/style tags
        for tag in soup(["script", "style"]):
            tag.decompose()

        text = soup.get_text(separator="\n")
        # Clean up whitespace and truncate to the first 500 non-empty lines
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        return "\n".join(lines[:500])
```

Important: Check the website's robots.txt and terms of service before scraping. Respect rate limits, use reasonable delays, and identify your bot in the User-Agent header.
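The robots.txt check can be automated with the standard library's `urllib.robotparser`. A minimal sketch; in practice you would load the live file with `RobotFileParser.set_url(...)` plus `.read()` rather than pass the rules inline:

```python
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules before scraping it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """\
User-agent: *
Disallow: /admin/
"""
allowed_to_fetch(rules, "Product Knowledge Bot", "https://example.com/products/widget")  # True
allowed_to_fetch(rules, "Product Knowledge Bot", "https://example.com/admin/panel")      # False
```

Calling this before each `_scrape_product` request (and honoring any `Crawl-delay` directive via `parser.crawl_delay(user_agent)`) keeps the scraper on the right side of the site's published rules.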
Data Quality and Deduplication
Before ingestion, validate and deduplicate:
```python
class IngestionValidator:
    @staticmethod
    def deduplicate(products: list[dict]) -> list[dict]:
        """Remove duplicates by external_id; later entries overwrite earlier ones."""
        seen = {}
        for product in products:
            seen[product.get("external_id")] = product
        return list(seen.values())

    @staticmethod
    def validate(product: dict) -> bool:
        """Check whether a product is complete enough to ingest."""
        required_fields = ["external_id", "type", "title", "content"]
        for field in required_fields:
            if not str(product.get(field) or "").strip():
                return False
        # Content should have a minimum length
        if len(product["content"]) < 50:
            return False
        return True

    @staticmethod
    def clean_content(content: str) -> str:
        """Normalize whitespace and remove non-printable characters."""
        # Collapse runs of whitespace
        content = " ".join(content.split())
        # Remove non-printable characters
        content = "".join(c for c in content if c.isprintable() or c in "\n\t")
        return content.strip()


# Usage
validator = IngestionValidator()
products = [p for p in products if validator.validate(p)]
products = validator.deduplicate(products)
products = [
    {**p, "content": validator.clean_content(p["content"])}
    for p in products
]
```

Putting It Together: Orchestrated Ingestion
In production, you'll likely ingest from multiple sources and want to combine them:
```python
from pathlib import Path

# Assumes the CSVProductIngester, APIProductIngester, PDFProductIngester,
# and IngestionValidator classes from the sections above are in scope.


class MultiSourceIngester:
    def __init__(self):
        self.csv_ingester = CSVProductIngester("products.csv")
        self.api_ingester = APIProductIngester(
            api_base="https://pim.example.com/api",
            api_key="key",
        )
        self.pdf_ingester = PDFProductIngester
        self.validator = IngestionValidator()

    def ingest_all(self) -> list[dict]:
        """Ingest from all sources and combine."""
        all_products = []

        # CSV products (base data)
        print("Ingesting CSV...")
        all_products.extend(self.csv_ingester.ingest())

        # API products (latest data)
        print("Ingesting from API...")
        all_products.extend(self.api_ingester.ingest())

        # PDFs (supplementary documentation)
        print("Ingesting PDFs...")
        for pdf_path in Path("datasheets/").glob("*.pdf"):
            all_products.extend(self.pdf_ingester(str(pdf_path)).ingest())

        # Validate and clean
        print(f"Validating {len(all_products)} items...")
        all_products = [p for p in all_products if self.validator.validate(p)]

        # Deduplicate (API overwrites CSV)
        print("Deduplicating...")
        all_products = self.validator.deduplicate(all_products)

        print(f"Ingestion complete: {len(all_products)} products")
        return all_products


# Usage
ingester = MultiSourceIngester()
products = ingester.ingest_all()
```

The ingestion phase is where data quality is set. Invest in careful handling, validation, and deduplication here, and your entire RAG system will be more reliable downstream.