Your Website Is Already a Knowledge Base: How Web Crawling Powers Live Product AI

Most B2B companies have product knowledge scattered across websites, docs portals, and support pages. Web crawling turns that existing content into a continuously synced RAG knowledge source — no manual export required.

Axoverna Team
15 min read

There's a question that comes up on nearly every first call with a B2B company exploring product AI: "How do we get our content into the system?"

The assumption buried in that question is usually that it requires an export: pull a CSV from the PIM, dump a PDF from the ERP, compile a spreadsheet from the team. And for structured catalog data, that's often the right path — we've covered how PIM integration works in practice and how to keep that data fresh.

But there's a second category of product knowledge that almost every B2B company already has, fully written, publicly accessible, and continuously updated: their website.

Product pages. Technical documentation portals. Application notes. FAQ sections. Support knowledge bases. Blog articles with installation guides. Specification PDFs linked from product pages. This content exists. It's usually well-maintained because it's customer-facing. And until recently, it was invisible to product AI systems that only knew how to ingest structured catalog exports.

Web crawling changes that. This article explains how it works, what it's actually good at, what it isn't, and how to build a crawling pipeline that stays in sync as your site evolves.


What Web Crawling Actually Does for RAG

When people hear "web crawler," they think of Googlebot — a massive infrastructure project that indexes the entire public internet. That's not what we're doing here. Product knowledge crawling is a targeted crawl of a known domain: your own site, your documentation portal, your support knowledge base.

The pipeline has four stages:

  1. Discovery: Starting from a seed URL (e.g. https://products.yourcompany.com), follow internal links to build a map of pages within the allowed domain(s).
  2. Extraction: For each discovered URL, fetch the HTML and extract clean text — stripping navigation, headers, footers, cookie banners, and other chrome that isn't product knowledge.
  3. Chunking and embedding: Split extracted text into retrieval-sized chunks, embed them, and store them in the vector index alongside chunks from other sources.
  4. Sync: Periodically re-crawl pages to detect changes. When content updates, re-embed and replace the affected chunks. When pages disappear, remove their chunks.

The result is that your website content participates in retrieval alongside your structured catalog data. A buyer asking about a product application gets context from both the structured specifications and the application note your engineering team published last quarter.


The Anatomy of a Production Crawl

Let's get into how each stage actually works in a B2B product context, because the implementation decisions matter a lot.

Discovery: Scoping the Crawl

A naive crawler fetches every URL it finds. A product knowledge crawler needs to be smarter about scope.

Domain restriction is the most important constraint. If you start crawling products.yourcompany.com and follow external links, you'll end up ingesting competitor sites, Wikipedia, ISO standards bodies, and the rest of the internet. Domain restriction ensures you stay within the intended knowledge boundaries.

For most B2B setups, this means crawling one or a few subdomains:

  • products.yourcompany.com — the product catalog web front-end
  • docs.yourcompany.com — technical documentation
  • support.yourcompany.com — FAQ and knowledge base articles
  • yourcompany.com/resources — application notes, whitepapers, guides

Different subdomains often have different content priorities. A documentation portal might be dense with reference content worth crawling deeply. A marketing subdomain might have mostly brand content that adds noise. Being selective about which domains to include — and which to exclude — dramatically improves the signal-to-noise ratio of your knowledge base.

Depth limits and politeness are practical concerns. Crawling tens of thousands of pages synchronously will hammer your web server and take hours. A well-behaved crawler:

  • Respects robots.txt exclusions
  • Adds configurable delays between requests
  • Crawls in breadth-first order, so the highest-value pages (those closest to the seed) are processed first
  • Sets a sensible depth limit — most product catalogs don't need to go more than 4–5 links deep from the seed
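
Those behaviors can be sketched as a breadth-first traversal with injected fetch and link-extraction callables, so the queue logic stays testable without a network. robots.txt checking is noted but omitted, and all domain names here are illustrative:

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(seed, fetch, extract_links, max_depth=4, max_pages=1000, delay=0.0):
    """Breadth-first crawl from a seed URL, staying on the seed's host.

    fetch(url) -> page and extract_links(page, base_url) -> iterable of hrefs
    are injected callables. A real crawler would also consult robots.txt
    before each fetch.
    """
    allowed_host = urlparse(seed).netloc
    seen = {seed}
    queue = deque([(seed, 0)])        # BFS: pages closest to the seed first
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if delay:
            time.sleep(delay)          # politeness gap between requests
        page = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue                   # don't follow links past the depth limit
        for href in extract_links(page, url):
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == allowed_host and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

Note how the external link to other.com would simply fail the host check and never enter the queue — domain restriction falls out of one comparison.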

URL normalization prevents duplicate processing. Tracking query parameters (?utm_source=email), #section anchors, trailing slashes, and HTTP/HTTPS variants all produce URL strings that point to the same content. A production crawler normalizes URLs before queuing them to avoid crawling the same page multiple times.
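
A minimal normalizer along these lines, using only the standard library (the tracking-parameter list is an illustrative subset):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative subset of analytics parameters that never change page content
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}

def normalize_url(url):
    """Canonicalize a URL so near-duplicate variants map to one queue entry."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters; sort the rest for a stable key
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    return urlunsplit((scheme, netloc, path, query, ""))  # fragment dropped
```

With this in place, the scheme, case, trailing-slash, anchor, and tracking variants of a product page all hash to the same queue key.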

Extraction: Clean Text from Messy HTML

HTML is not clean text. Between the navigation menus, JavaScript bundles, cookie consent banners, "related products" carousels, and footer links, a typical product page might have 200 words of genuine product knowledge embedded in 50KB of markup.

Extraction is the step where you isolate the signal. The goal is to produce text that reads like a document, not like HTML. That means:

Semantic element selection: Prefer content inside <main>, <article>, or [role="main"] elements. Discard <nav>, <header>, <footer>, <aside>, and <script> blocks. These structural hints exist precisely to identify the primary content region.

Structured content preservation: Product pages often contain specification tables, ordered steps in installation guides, and bullet-point feature lists. These structures carry meaning. A reductive "strip all tags" approach loses that structure — a 5×3 HTML table becomes an undifferentiated blob of numbers with no column headers. Better to convert tables to markdown-style representations and preserve list structure as line-broken items.
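
A small helper showing the table-to-markdown idea, assuming the header and body cells have already been parsed out of the HTML:

```python
def table_to_markdown(headers, rows):
    """Render a parsed HTML table as a markdown pipe table so column
    headers stay attached to their values after text extraction."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)
```

An embedding of "| Size | Pressure | Material |" followed by its row keeps the semantic link between 600 PSI and the Pressure column that a tag-stripped blob would lose.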

PDF handling for linked documents: Product pages frequently link to PDF datasheets and specification documents. A complete crawl pipeline follows those links too, extracting text from PDFs and indexing them as part of the same knowledge base as the product page. This matters enormously in B2B — often the authoritative technical specification lives in a PDF that the product page links to, not on the product page itself.

Metadata extraction: The page title, meta description, canonical URL, breadcrumb trail, and structured data markup (Schema.org Product, TechArticle, etc.) all provide metadata that enriches the chunk store. When retrieval surfaces a chunk from a crawled page, that metadata tells the LLM what page it came from, what category it's in, and how authoritative it is. This is metadata filtering in practice — applied at the source level, not just the chunk level.

Chunking Crawled Web Content

Web content presents chunking challenges that differ from catalog exports. A product page is not a technical datasheet. It might have a 60-word product summary followed by 400 words of marketing copy followed by a full specification table followed by 200 words of application notes. Treating that as a single chunk is wrong. Treating each paragraph as its own chunk loses the structural context.

The chunking strategy that works best for web content is hierarchical: split at HTML section boundaries first (h2, h3 headings), then apply a secondary token-count limit within sections.

This approach:

  • Keeps specification tables together with their section heading
  • Keeps step-by-step instructions intact within their procedure section
  • Respects the author's own structural choices (where a heading falls is a deliberate topic boundary)
  • Avoids the mid-sentence splits that plague naive character-count chunking

Each chunk also inherits the page-level metadata: URL, page title, section heading, crawl timestamp, and content hash. The content hash is key for incremental sync.
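
The hierarchical strategy can be sketched like this, assuming extraction has already converted <h2>/<h3> elements to markdown-style headings (the word-count cap stands in for a real token limit):

```python
import re

def chunk_by_sections(text, max_words=200):
    """Split at heading boundaries first, then enforce a size cap inside
    each section; every chunk carries its section heading for context."""
    chunks = []
    # Split on markdown-style h2/h3 headings, keeping the headings
    sections = re.split(r"(?m)^(#{2,3} .+)$", text)
    heading = ""
    for part in sections:
        if re.match(r"^#{2,3} ", part):
            heading = part               # remember the current section heading
            continue
        words = part.split()
        for i in range(0, len(words), max_words):
            piece = " ".join(words[i : i + max_words])
            if piece:
                chunks.append({"heading": heading, "text": piece})
    return chunks
```

Even when a long section is split by the size cap, every resulting chunk still carries its heading, so "Installation" context is never lost mid-procedure.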


The Live Sync Problem

Static ingestion is easy. Point the crawler at a domain, extract everything, embed it, and you're done. The hard part is keeping the knowledge base in sync as the website changes.

Product catalogs update constantly. Specs get revised. New products launch. Old products get discontinued. Support articles get updated when a known issue is resolved. Documentation is corrected when a user reports an error. If your knowledge base is a snapshot from six months ago, your AI is giving buyers stale information — exactly the problem we analyzed in our article on RAG freshness.

Change Detection Without Crawling Everything

Recrawling an entire domain every day is expensive and slow. The efficient approach is to crawl everything initially, then use incremental change detection on subsequent syncs.

Two complementary techniques:

HTTP conditional requests: The If-Modified-Since and If-None-Match headers let your crawler ask the web server "has this page changed since I last fetched it?" If the server supports these headers (most modern web servers do), it responds with 304 Not Modified for unchanged pages — no content transfer, near-instant check. You can scan thousands of pages in minutes with conditional requests, then only fully re-fetch and re-embed the ones that actually changed.
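
A sketch of the header construction and the skip decision — the record field names are assumptions of this example, standing in for whatever your crawl database stores from the previous fetch:

```python
def conditional_headers(record):
    """Build If-None-Match / If-Modified-Since request headers from the
    ETag and Last-Modified values the server sent on the previous fetch."""
    headers = {}
    if record.get("etag"):
        headers["If-None-Match"] = record["etag"]
    if record.get("last_modified"):
        headers["If-Modified-Since"] = record["last_modified"]
    return headers

def needs_reembedding(status_code):
    """304 Not Modified means the page is unchanged: skip extraction
    and embedding entirely for this URL."""
    return status_code != 304
```

The actual fetch is one HTTP GET with these headers attached; everything interesting happens in whether the server answers 200 or 304.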

Content hashing: For servers that don't support conditional requests, store a hash of the extracted text content for each page. On re-crawl, fetch the page, extract text, hash it, and compare to the stored hash. If they match, skip re-embedding. Only new or changed content goes through the expensive embedding pipeline.
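
A minimal version of that hash-and-compare loop, whitespace-normalized so markup-only churn doesn't trigger a re-embed:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of extracted text, insensitive to whitespace."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed(url, text, hash_store):
    """Return True (and update the store) when a page's extracted content
    differs from the hash recorded on the last crawl."""
    digest = content_hash(text)
    if hash_store.get(url) == digest:
        return False          # unchanged: skip the embedding pipeline
    hash_store[url] = digest
    return True
```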

Sitemap integration: Most well-maintained websites publish a sitemap.xml that lists all pages with their last-modification timestamps. A sync process can parse the sitemap, compare modification timestamps against the last-crawl database, and build a targeted re-crawl list of pages that have changed. This avoids even fetching pages that haven't changed — useful for large catalogs where a full conditional-request scan would still take significant time.
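
Parsing the sitemap and building the targeted re-crawl list can look like this, assuming lastmod values are ISO dates as the sitemap protocol specifies:

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def pages_changed_since(sitemap_xml, last_crawl):
    """Return the URLs whose <lastmod> is newer than the previous crawl —
    the targeted re-crawl list. Entries without a lastmod are skipped here;
    a cautious implementation might re-crawl them instead."""
    stale = []
    for url in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod and datetime.fromisoformat(lastmod) > last_crawl:
            stale.append(loc)
    return stale
```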

Handling Page Deletion

What happens when a product is discontinued and its page disappears? Without deletion handling, your knowledge base accumulates ghost chunks — content from pages that no longer exist. Buyers then get answers citing discontinued products as if they were still available.

Deletion detection requires maintaining a page inventory. Every crawl, track which URLs were discovered. Pages present in the previous crawl's inventory but absent from the current crawl's discovery pass should be scheduled for removal: their chunks deleted from the vector store, their embeddings dropped.

This is trickier than it sounds because pages can temporarily disappear (server errors, deployment windows, CDN issues) and come back. A robust implementation distinguishes between a 404 (likely permanent) and a 5xx (likely transient), and applies a short grace period before purging chunks from a 404 URL.
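
One way to sketch the grace-period bookkeeping — the seven-day window and the data shapes are illustrative choices, not a prescription:

```python
from datetime import datetime, timedelta

GRACE = timedelta(days=7)   # illustrative grace period before purging

def plan_removals(inventory, crawl_results, now):
    """Decide which URLs' chunks to purge from the vector store.

    inventory: {url: datetime the URL first went missing, or None} — mutated.
    crawl_results: {url: http_status} from the current crawl pass.
    A 404 starts a grace-period clock; a 5xx is treated as transient.
    """
    to_purge = []
    for url, status in crawl_results.items():
        if status == 404:
            first_missing = inventory.get(url) or now
            inventory[url] = first_missing
            if now - first_missing >= GRACE:
                to_purge.append(url)   # gone long enough: drop its chunks
        elif 200 <= status < 300:
            inventory[url] = None      # page is back (or fine): reset the clock
        # 5xx: leave state untouched — likely a transient server problem
    return to_purge
```

A page that 404s once during a deployment window survives; one that stays gone past the window is purged on the next sync.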


Domain Restriction Is a Feature, Not a Limitation

It's worth spending a moment on domain restriction because it's often treated as a technical detail when it's actually a core product design decision.

When a buyer asks your product AI a question, you want the answer to come from your knowledge: your specifications, your application guidance, your support articles. You don't want the AI synthesizing answers from competitor product pages that your site happens to link to, or from generic Wikipedia articles about the underlying technology.

Domain restriction enforces this. It's the crawling equivalent of guardrails and hallucination prevention at the data layer — you're constraining what the system can know, not just what it can say.

Practical applications:

Multi-domain allow-listing: You might want to include your main product site, your documentation subdomain, and a third-party distributor's portal that carries your products. Allow-listing specific domains gives you precise control.

Path-level restrictions: Even within an allowed domain, some paths don't contain product knowledge. /blog/press-releases/, /careers/, /investor-relations/ — these sections might be on the same domain as your product content but add noise when indexed. Path-level exclusions keep the knowledge base focused.
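
Both restrictions reduce to one small scope check per discovered URL; the hosts and path prefixes below are placeholders for your own configuration:

```python
from urllib.parse import urlparse

# Placeholder configuration — substitute your own domains and exclusions
ALLOWED_HOSTS = {"products.example.com", "docs.example.com"}
EXCLUDED_PATHS = ("/careers/", "/investor-relations/", "/blog/press-releases/")

def in_scope(url):
    """Crawl only allow-listed hosts, minus noisy path prefixes."""
    parts = urlparse(url)
    if parts.netloc not in ALLOWED_HOSTS:
        return False
    return not parts.path.startswith(EXCLUDED_PATHS)
```

Run every candidate URL through this check before it enters the crawl queue, and scope drift never gets a chance to start.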

Respecting private areas: If your product site requires login to access dealer pricing or distributor-only documentation, the crawler should operate with appropriate credentials — or, if those areas contain confidential information not intended for general AI consumption, explicitly exclude them.


Web Content Alongside Structured Catalog Data

One of the most powerful patterns in product knowledge AI is combining web-crawled content with structured catalog ingestion. Neither source alone is complete.

Structured catalog data (from PIM, ERP, or spreadsheet exports) gives you precise, structured attributes: part numbers, dimensions, material specifications, weights, certifications. This data is authoritative and machine-readable, making it excellent for exact-match retrieval and metadata filtering.

Web-crawled content gives you the narrative layer: the application guidance ("suitable for outdoor installation in environments rated to IP67"), the installation instructions, the compatibility notes ("not compatible with copper fittings — use with brass or stainless only"), the troubleshooting context. This is the tacit knowledge that exists in your content but doesn't fit into a PIM attribute field.

When both sources are indexed together, hybrid search can draw from both in a single retrieval pass. A query like "can I use your 3/4 inch ball valve outdoors in marine environments?" pulls the rated IP or NEMA classification from structured data and the application guidance about corrosion resistance from the product page. The answer is richer than either source alone would produce.

In the embedding and ingestion layer, this means tagging chunks with their source type (catalog, web_page, pdf_document) so the LLM can weight them appropriately and cite sources accurately. Structured catalog attributes are authoritative for technical specifications. Web content is authoritative for application guidance and compatibility notes. Both are relevant; neither should dominate.
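
The tagging itself can be as simple as attaching a validated source_type field to each chunk — the field names here are illustrative, not a fixed schema:

```python
VALID_SOURCES = {"catalog", "web_page", "pdf_document"}

def tag_chunk(text, source_type, **metadata):
    """Attach source provenance to a chunk so the retrieval layer can
    filter, weight, and cite results by origin."""
    if source_type not in VALID_SOURCES:
        raise ValueError(f"unknown source type: {source_type}")
    return {"text": text, "source_type": source_type, **metadata}
```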


Practical Considerations for B2B Web Crawling

A few things that matter in production that often get underestimated:

JavaScript-rendered content: An increasing number of product catalog sites render content client-side via React, Vue, or Angular. A basic HTTP fetcher that reads raw HTML will return empty <div id="app"> shells instead of product content. Handling this requires a headless browser (Playwright, Puppeteer) that executes JavaScript and captures the rendered DOM. This is significantly more expensive than basic HTTP fetching — plan for 3–10x the compute cost per page.

Rate limiting and respectful crawling: Crawling your own site doesn't mean you can ignore its capacity. A crawler hitting 50 concurrent requests against a product site with normal traffic can cause real degradation for real users. Configure concurrency limits and per-domain request delays. Most production crawlers operate at 1–5 requests per second per domain unless you've explicitly confirmed the infrastructure can handle more.

Crawl scope drift: A product site links to a distributor. The distributor links to another distributor. Without careful domain restriction and depth limits, a crawler can find itself hundreds of hops from the starting URL and deep into unrelated content. Regular auditing of what's actually being indexed catches scope drift before it degrades retrieval quality.

Authentication for gated content: Technical documentation portals and dealer-facing knowledge bases often require login. The crawler needs to authenticate — typically via session cookies, API tokens, or HTTP Basic Auth — and refresh credentials when they expire. This is worth building properly upfront; retrofitting auth into a crawler mid-project is painful.


How Axoverna Handles Web Source Ingestion

Axoverna's URL source ingestion is designed specifically for the B2B product knowledge use case. When you add a URL source:

  • We crawl the domain with configurable depth and domain restriction
  • We handle both static HTML and JavaScript-rendered content
  • We follow and extract linked PDFs automatically
  • We preserve structural context (headings, tables, lists) in chunk metadata
  • We run incremental sync on a configurable schedule, detecting and propagating changes without full recrawls
  • We combine crawled content with your structured catalog data in the same retrieval index, so hybrid search works across both

The result is that your product website — maintained by your team for your customers — becomes a live, continuously updated component of your product AI's knowledge. No export workflow. No manual file uploads. Just point us at your domain and the knowledge stays current.


When Web Crawling Is the Right Starting Point

Not every team is ready to tackle PIM integration on day one. Connecting an ERP, transforming catalog data formats, and negotiating API access takes time and organizational coordination.

Web crawling is often the fastest path to a working, useful product AI — because the content already exists and is already maintained. Your product pages are already written. Your documentation portal is already up to date. Your support articles are already there.

For many B2B companies, a crawl-based knowledge source deployed in a week produces a product AI that handles 60–70% of inbound product questions with genuine accuracy. That's a real result you can show stakeholders while the structured catalog integration work continues in parallel.

It's also worth noting that web content often covers different questions than catalog data does. A spec sheet tells you dimensions and ratings. A product page tells you what problems the product solves and what it's used for. Both matter. Starting with web content means starting with the buyer-facing knowledge your team already curates.


The Knowledge Source You Already Have

The question isn't whether your company has product knowledge worth indexing for AI. It does — in a PIM, in a docs portal, in support articles, in technical PDFs, across product pages. The question is how efficiently you can get it into the retrieval layer.

Web crawling is the answer for content that lives on your website. It's not a workaround or a compromise — it's often the richest source of buyer-facing product knowledge you have, presented in the natural language and context that buyers actually need.

Combined with structured catalog ingestion and kept fresh with incremental sync, a crawl-based knowledge source gives your product AI both the precision of structured data and the richness of narrative content. That combination is what produces answers that actually help buyers decide, configure, and buy.


Ready to Connect Your Product Website?

Axoverna's URL source ingestion makes it straightforward to turn your existing product web content into a live AI knowledge source. Point us at your product domain, set your crawl scope and sync schedule, and your website becomes part of your product AI — automatically staying current as your content evolves.

Start a free trial to connect your first URL source, or book a demo to see how web-crawled content combines with your catalog data for richer, more accurate product answers.

Ready to get started?

Turn your product catalog into an AI knowledge base

Axoverna ingests your product data, builds a semantic search index, and gives you an embeddable chat widget — in minutes, not months.