Embedding Your Product Knowledge: PDF, CSV, and API Ingestion Patterns

How to get product data into an AI system. Practical patterns for ingesting from PDFs, spreadsheets, APIs, and web pages — with error handling and data quality checks.

Axoverna Team
9 min read

The hardest part of building an AI product knowledge system isn't the embeddings or the vector search. It's getting your product data into the system in a form that can be searched effectively. Your product information lives in multiple places: CSVs, PDFs, your ERP system, product pages, internal wikis. Each has different structure, different quality, different update cadence.

This guide covers the practical patterns for ingesting each, with production-grade error handling.

Data Sources and Their Characteristics

CSVs and Spreadsheets

Advantages: Structured, machine-readable, easy to export from any system. Challenges: Limited context (a spreadsheet row is often incomplete without header info), data quality issues, no embedded media.

Typical structure:

SKU,Name,Category,Description,Price,Stock,Specs_JSON
SKF-6205-2RS,Deep groove ball bearing,Bearings,"High-precision bearing...",€12.50,1200,"{ ""bore"": ""25mm"" }"

PDFs

Advantages: Official documents (datasheets, manuals), rich formatting, structured metadata. Challenges: Fragile text extraction, complex layouts, unstructured content, and data trapped in images.

Typical PDFs: Product datasheets, installation manuals, compliance documentation, technical bulletins.

APIs

Advantages: Real-time, structured, programmatically queryable. Challenges: Rate limits, authentication, downstream dependencies, versioning.

Typical APIs: Your PIM (Product Information Management), ERP inventory, distributor catalog feeds.

Web Pages

Advantages: Latest content, already optimized for reading. Challenges: Noisy extraction (sidebars, ads, irrelevant content), JavaScript rendering (content may be absent from the raw HTML), brittle link structures.

Internal Wikis / Confluence

Advantages: Structured knowledge, ownership clarity, version history. Challenges: Often incomplete or inconsistent, different teams format things differently.

Ingestion Pattern 1: CSV Products

CSVs are the easiest starting point. The data is structured, and you can ingest it directly.

import csv
import json
from datetime import datetime
from pathlib import Path
 
class CSVProductIngester:
    def __init__(self, csv_path: str):
        self.csv_path = Path(csv_path)
    
    def ingest(self) -> list[dict]:
        """
        Read CSV and convert to product documents.
        
        Expected columns: sku, name, category, description, specs_json, tags
        """
        products = []
        
        with open(self.csv_path) as f:
            reader = csv.DictReader(f)
            for row_num, row in enumerate(reader, start=2):  # Start at 2 (skip header)
                try:
                    product = self._parse_row(row)
                    products.append(product)
                except Exception as e:
                    print(f"Warning: Row {row_num} failed to parse: {e}")
                    continue  # Skip malformed rows, don't crash
        
        return products
    
    def _parse_row(self, row: dict) -> dict:
        """Parse a CSV row into a product document."""
        # Validate required fields
        required = ["sku", "name", "category"]
        for field in required:
            if not row.get(field, "").strip():
                raise ValueError(f"Missing required field: {field}")
        
        # Parse specs (often JSON in a CSV column)
        specs = {}
        if row.get("specs_json"):
            try:
                specs = json.loads(row["specs_json"])
            except json.JSONDecodeError:
                print(f"Warning: Invalid JSON in specs for {row['sku']}")
        
        # Build product document
        return {
            "external_id": row["sku"].strip(),
            "type": "product",
            "title": row["name"].strip(),
            "content": self._build_content(row),
            "metadata": {
                "category": row["category"].strip(),
                "price": row.get("price", "").strip(),
                "stock_level": row.get("stock", "0"),
                "tags": [t.strip() for t in row.get("tags", "").split(",") if t.strip()],
                "specifications": specs,
                "ingestion_source": "csv",
                "ingestion_timestamp": str(datetime.now()),
            }
        }
    
    def _build_content(self, row: dict) -> str:
        """Combine multiple fields into searchable content."""
        parts = []
        
        # Product identity
        parts.append(f"{row['name']} ({row['sku']})")
        
        # Category context
        if row.get("category"):
            parts.append(f"Category: {row['category']}")
        
        # Description
        if row.get("description"):
            parts.append(f"Description: {row['description']}")
        
        # Specifications as text
        if row.get("specs_json"):
            try:
                specs = json.loads(row["specs_json"])
                spec_text = "; ".join([f"{k}: {v}" for k, v in specs.items()])
                parts.append(f"Specifications: {spec_text}")
            except json.JSONDecodeError:
                pass
        
        # Tags
        if row.get("tags"):
            parts.append(f"Tags: {row['tags']}")
        
        return "\n".join(parts)
 
# Usage
ingester = CSVProductIngester("products.csv")
products = ingester.ingest()

Key points:

  • Handle missing/malformed rows gracefully (skip, don't crash)
  • Parse nested data (JSON specs) explicitly
  • Combine multiple fields into a single "content" field for embedding
  • Preserve structured metadata separately so you can filter on it

Ingestion Pattern 2: PDFs

PDF extraction is notoriously fragile. Different layouts, different fonts, different page structures.

import pypdf
from pathlib import Path
 
class PDFProductIngester:
    def __init__(self, pdf_path: str, product_id: str = None):
        """
        Ingest a single PDF as product documentation.
        
        Args:
            pdf_path: Path to PDF file
            product_id: Optional product ID (inferred from filename if not provided)
        """
        self.pdf_path = Path(pdf_path)
        self.product_id = product_id or self.pdf_path.stem.upper()
    
    def ingest(self) -> list[dict]:
        """Extract PDF and split into chunks."""
        chunks = []
        
        try:
            with open(self.pdf_path, "rb") as f:
                reader = pypdf.PdfReader(f)
                
                for page_num, page in enumerate(reader.pages, start=1):
                    text = page.extract_text()
                    
                    if not text.strip():
                        print(f"Warning: Page {page_num} of {self.pdf_path.name} has no extractable text")
                        continue
                    
                    # Split page into logical chunks (paragraphs)
                    paragraphs = text.split("\n\n")
                    
                    for para_idx, para in enumerate(paragraphs):
                        if len(para.strip()) < 50:  # Skip tiny fragments
                            continue
                        
                        chunk = {
                            "external_id": f"{self.product_id}-page{page_num}-chunk{para_idx}",
                            "type": "manual",
                            "title": f"{self.product_id} - Page {page_num}",
                            "content": para.strip(),
                            "metadata": {
                                "source_file": self.pdf_path.name,
                                "source_product": self.product_id,
                                "source_page": page_num,
                                "source_type": "pdf",
                                "total_pages": len(reader.pages),
                            }
                        }
                        chunks.append(chunk)
        
        except Exception as e:
            print(f"Error processing {self.pdf_path}: {e}")
            return []
        
        return chunks
 
# Usage
ingester = PDFProductIngester("Model_3200_Datasheet.pdf", product_id="3200-DS-v2")
chunks = ingester.ingest()

Limitations and workarounds:

  • Complex layouts: pypdf struggles with multi-column layouts, forms, and tables. Consider commercial services (AWS Textract, Google Document AI) for complex PDFs.
  • Scanned PDFs: If the PDF is an image, use OCR (Tesseract, AWS Textract).
  • Table extraction: Dedicated libraries like camelot or tabula can extract tables if structure is regular.

For production systems, the practical approach:

  1. Try pypdf for simple PDFs (text-based, linear content)
  2. Use AWS Textract or Google Document AI for complex/scanned PDFs
  3. Manually review the first 10–20 ingested PDFs to catch structural issues
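The first two steps can be combined into a triage pass: extract with pypdf, and route any page that yields little or no text to the OCR pipeline instead. The `triage_pages` helper below is a hypothetical sketch (the threshold of 25 characters is an assumption you should tune against your own documents):

```python
def triage_pages(pages: list[tuple[int, str]], min_chars: int = 25) -> dict:
    """Partition extracted pages into text-based pages and pages that
    likely need OCR (scanned images yield empty or near-empty text)."""
    result = {"text": [], "needs_ocr": []}
    for page_num, text in pages:
        if len(text.strip()) >= min_chars:
            result["text"].append(page_num)
        else:
            result["needs_ocr"].append(page_num)
    return result

# Feeding it from pypdf (extract_text() can return None, hence the "or"):
# reader = pypdf.PdfReader("datasheet.pdf")
# pages = [(i, p.extract_text() or "") for i, p in enumerate(reader.pages, 1)]
# routing = triage_pages(pages)
```

Pages in `needs_ocr` then go to Tesseract or a document AI service, while the rest stay in the cheap pypdf path.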

Ingestion Pattern 3: API Feeds

If your product data lives in a live system (PIM, ERP, distributor feed), ingest via API. This enables continuous updates.

import requests
from datetime import datetime
import hashlib
 
class APIProductIngester:
    def __init__(self, api_base: str, api_key: str, batch_size: int = 100):
        self.api_base = api_base
        self.api_key = api_key
        self.batch_size = batch_size
    
    def ingest(self, last_sync: datetime = None) -> list[dict]:
        """
        Fetch products from API.
        
        Optionally filter to products modified since last_sync.
        """
        products = []
        page = 1
        
        while True:
            try:
                response = requests.get(
                    f"{self.api_base}/products",
                    headers={"Authorization": f"Bearer {self.api_key}"},
                    params={
                        "page": page,
                        "limit": self.batch_size,
                        "modified_since": last_sync.isoformat() if last_sync else None,
                    },
                    timeout=30
                )
                response.raise_for_status()
            except requests.exceptions.RequestException as e:
                print(f"API error on page {page}: {e}")
                break  # Stop gracefully, don't crash entire ingestion
            
            data = response.json()
            page_products = data.get("products", [])
            
            if not page_products:
                break  # No more pages
            
            for api_product in page_products:
                product = self._map_api_product(api_product)
                products.append(product)
                
                # Track product hash for change detection
                product["_content_hash"] = self._hash_content(product["content"])
            
            page += 1
        
        return products
    
    def _map_api_product(self, api_product: dict) -> dict:
        """Map API response to standard product format."""
        return {
            "external_id": api_product["id"],
            "type": "product",
            "title": api_product["name"],
            "content": f"{api_product['name']}\n{api_product.get('description', '')}",
            "metadata": {
                "category": api_product.get("category"),
                "price": api_product.get("price"),
                "stock": api_product.get("stock_level", 0),
                "sku": api_product.get("sku"),
                "url": api_product.get("url"),
                "tags": api_product.get("tags", []),
                "specifications": api_product.get("specs", {}),
                "last_modified": api_product.get("updated_at"),
                "source": "api",
            }
        }
    
    def _hash_content(self, content: str) -> str:
        """Hash content for change detection."""
        return hashlib.md5(content.encode()).hexdigest()
 
# Usage: Incremental sync
from datetime import datetime, timedelta
 
ingester = APIProductIngester(
    api_base="https://your-pim.example.com/api",
    api_key="your_api_key"
)
 
# Sync products modified in the last hour
last_sync = datetime.now() - timedelta(hours=1)
products = ingester.ingest(last_sync=last_sync)

Key patterns:

  • Pagination handling (loop until empty result)
  • Graceful error handling (skip failed requests, continue ingestion)
  • Content hashing for change detection (only re-embed changed products)
  • Incremental sync (fetch only recently modified products)
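The change-detection idea above can be sketched as a diff against stored hashes: only products whose content hash differs from the previous sync are re-embedded. This standalone sketch assumes you persist a `{external_id: hash}` map between runs (the `diff_products` helper is hypothetical, not part of the ingester class):

```python
import hashlib

def content_hash(content: str) -> str:
    # md5 is fine here: this is change detection, not a security boundary
    return hashlib.md5(content.encode()).hexdigest()

def diff_products(new_products: list[dict],
                  stored_hashes: dict[str, str]) -> list[dict]:
    """Return only the products whose content changed since the last sync."""
    changed = []
    for product in new_products:
        h = content_hash(product["content"])
        if stored_hashes.get(product["external_id"]) != h:
            changed.append(product)
    return changed
```

Run this between ingestion and embedding; on a catalog where most products are stable, it cuts embedding cost to a small fraction of a full re-index.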

Ingestion Pattern 4: Web Scraping

Less desirable than APIs, but sometimes necessary.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from datetime import datetime
import time
 
class WebScraperIngester:
    def __init__(self, base_url: str, delay: float = 1.0):
        self.base_url = base_url
        self.delay = delay  # Respectful delay between requests
        self.session = requests.Session()
        self.session.headers.update({"User-Agent": "Product Knowledge Bot"})
    
    def ingest_product_pages(self, product_urls: list[str]) -> list[dict]:
        """Scrape a list of product URLs."""
        products = []
        
        for url in product_urls:
            time.sleep(self.delay)  # Rate limiting
            
            try:
                product = self._scrape_product(url)
                if product:
                    products.append(product)
            except Exception as e:
                print(f"Error scraping {url}: {e}")
                continue
        
        return products
    
    def _scrape_product(self, url: str) -> dict | None:
        """Scrape a single product page."""
        response = self.session.get(url, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Extract data (CSS selectors depend on your site structure)
        title_elem = soup.select_one("h1.product-title")
        desc_elem = soup.select_one("div.product-description")
        specs_elem = soup.select_one("div.specifications")
        
        if not title_elem:
            return None  # Not a valid product page
        
        return {
            "external_id": url,
            "type": "product",
            "title": title_elem.get_text(strip=True),
            "content": self._extract_text(soup),
            "metadata": {
                "source_url": url,
                "source_type": "web",
                "scraped_at": datetime.now().isoformat(),
            }
        }
    
    def _extract_text(self, soup: BeautifulSoup) -> str:
        """Extract all meaningful text from page."""
        # Remove script/style tags
        for tag in soup(["script", "style"]):
            tag.decompose()
        
        text = soup.get_text(separator="\n")
        # Clean up whitespace
        lines = [line.strip() for line in text.split("\n") if line.strip()]
        return "\n".join(lines[:500])  # Truncate to first 500 lines

Important: Check the website's robots.txt and terms of service before scraping. Respect rate limits, use reasonable delays, and identify your bot in the User-Agent header.
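The robots.txt check can be automated with the standard library's `urllib.robotparser`, so disallowed URLs are filtered out before the scraper ever touches them. A minimal sketch (the `allowed_paths` helper and the `ProductKnowledgeBot` agent string are illustrative):

```python
from urllib.robotparser import RobotFileParser

def allowed_paths(robots_txt: str, paths: list[str],
                  user_agent: str = "ProductKnowledgeBot") -> list[str]:
    """Filter paths down to those robots.txt permits for this user agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [p for p in paths if rp.can_fetch(user_agent, p)]
```

In production you would fetch `robots.txt` once per host (e.g. via `RobotFileParser.set_url(...)` and `.read()`) and cache the parsed rules, rather than re-parsing per request.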

Data Quality and Deduplication

Before ingestion, validate and deduplicate:

class IngestionValidator:
    @staticmethod
    def deduplicate(products: list[dict]) -> list[dict]:
        """Remove duplicate products by external_id, keeping the LAST
        occurrence so later sources (e.g. API) overwrite earlier ones
        (e.g. CSV)."""
        seen = {}
        for product in products:
            seen[product.get("external_id")] = product
        return list(seen.values())
    
    @staticmethod
    def validate(product: dict) -> bool:
        """Check if product is complete enough to ingest."""
        required_fields = ["external_id", "type", "title", "content"]
        for field in required_fields:
            if not product.get(field, "").strip():
                return False
        
        # Content should have minimum length
        if len(product["content"]) < 50:
            return False
        
        return True
    
    @staticmethod
    def clean_content(content: str) -> str:
        """Normalize whitespace and remove invalid characters."""
        # Remove multiple spaces
        content = " ".join(content.split())
        # Remove non-printable characters
        content = "".join(c for c in content if c.isprintable() or c in "\n\t")
        return content.strip()
 
# Usage
validator = IngestionValidator()
products = [p for p in products if validator.validate(p)]
products = validator.deduplicate(products)
products = [
    {**p, "content": validator.clean_content(p["content"])}
    for p in products
]

Putting It Together: Orchestrated Ingestion

In production, you'll likely ingest from multiple sources and want to combine them:

class MultiSourceIngester:
    def __init__(self):
        self.csv_ingester = CSVProductIngester("products.csv")
        self.api_ingester = APIProductIngester(
            api_base="https://pim.example.com/api",
            api_key="key"
        )
        self.pdf_ingester = PDFProductIngester
        self.validator = IngestionValidator()
    
    def ingest_all(self) -> list[dict]:
        """Ingest from all sources and combine."""
        all_products = []
        
        # CSV products (base data)
        print("Ingesting CSV...")
        csv_products = self.csv_ingester.ingest()
        all_products.extend(csv_products)
        
        # API products (latest data, overwrites CSV)
        print("Ingesting from API...")
        api_products = self.api_ingester.ingest()
        all_products.extend(api_products)
        
        # PDFs (supplementary documentation)
        print("Ingesting PDFs...")
        pdf_dir = Path("datasheets/")
        for pdf_path in pdf_dir.glob("*.pdf"):
            ingester = self.pdf_ingester(str(pdf_path))
            all_products.extend(ingester.ingest())
        
        # Validate and clean
        print(f"Validating {len(all_products)} items...")
        all_products = [
            p for p in all_products 
            if self.validator.validate(p)
        ]
        
        # Deduplicate (API overwrites CSV)
        print("Deduplicating...")
        all_products = self.validator.deduplicate(all_products)
        
        print(f"Ingestion complete: {len(all_products)} products")
        return all_products
 
# Usage
ingester = MultiSourceIngester()
products = ingester.ingest_all()

The ingestion phase is where data quality is set. Invest in careful handling, validation, and deduplication here, and your entire RAG system will be more reliable downstream.

Axoverna handles all ingestion patterns automatically → Upload your data and search immediately

Ready to get started?

Turn your product catalog into an AI knowledge base

Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.