Multimodal RAG: Adding Visual Search to Your Product Knowledge AI
Text embeddings alone can't answer 'do you have this part?' when a buyer holds up a photo. Learn how multimodal RAG pipelines handle image queries in B2B product catalogs — and when visual search delivers the biggest ROI.
A technician on a factory floor photographs a worn bearing. A procurement manager snaps a picture of a competitor's product. A field engineer holds their phone up to an unlabeled connector and asks: "Do you carry this?"
Text-based product search has no answer for any of these. Neither does a traditional keyword search bar, a PIM system, or a static FAQ page. These are visual queries — and they're far more common in B2B purchasing than most product teams expect.
Multimodal RAG changes that. By combining visual embeddings with the retrieval-augmented generation pipelines that already power text-based product knowledge systems, you can answer image-based questions with the same accuracy and confidence as text ones. This article explains how the pipeline works, where it delivers the most value, and what the implementation looks like in practice.
Why Visual Queries Matter in B2B Procurement
Before getting into the mechanics, it's worth understanding why visual search is particularly valuable in industrial and wholesale contexts — because the business case drives the architectural decisions.
The Unknown Part Problem
In maintenance, repair, and operations (MRO) purchasing, buyers frequently encounter parts with no visible label, no barcode, or no documentation. The original equipment manufacturer may no longer exist. The part was installed decades ago. The product number has worn off.
In these situations, the buyer's only option is visual identification. They have a physical object and need to find its specification, its compatible replacements, or its current equivalent. Without visual search, this process involves phone calls, email threads with engineers, and sometimes hours of manual catalog browsing.
A multimodal AI that can take a photograph and return "This appears to be an SKF 6203-2RS1 deep groove ball bearing, 17mm bore × 40mm OD × 12mm width. Compatible replacements available include..." collapses that workflow into seconds.
New Product Verification
Procurement teams often need to verify that an incoming delivery matches the specified product. Visual AI can cross-reference a photo of a received item against catalog images and flag discrepancies — wrong product, wrong variant, counterfeit detection.
Competitor Product Matching
Distributors and wholesalers frequently encounter customers holding competitors' products and asking "do you have something equivalent?" Historically this required a sales rep to know the catalog deeply. A visual-first RAG system can match a photo against your catalog and surface compatible alternatives automatically.
How Multimodal Embedding Works
The core of a multimodal RAG pipeline is a joint embedding space — a model that can embed both images and text into the same vector space, so that a photo of a hex bolt and the text "M12 hex head bolt stainless steel" land near each other as vectors.
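Concretely, "land near each other" means the two vectors have high cosine similarity. A minimal sketch with toy values (these are hypothetical three-dimensional vectors, not real CLIP outputs, which typically have 512+ dimensions):

```typescript
// Cosine similarity between two embedding vectors in the joint space.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// In a well-trained joint space, the embedding of a hex-bolt photo and of
// the text "M12 hex head bolt" score close to 1; unrelated products score lower.
const imageVec = [0.9, 0.1, 0.2]    // hypothetical image embedding
const textVec = [0.85, 0.15, 0.25]  // hypothetical matching text embedding
const unrelated = [0.1, 0.9, -0.3]  // hypothetical unrelated embedding

console.log(cosineSimilarity(imageVec, textVec) > cosineSimilarity(imageVec, unrelated)) // true
```

Retrieval over such a space is simply nearest-neighbor search by this similarity, regardless of which modality produced the query vector.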
The two dominant model families for this are:
CLIP (Contrastive Language-Image Pretraining) and its successors: OpenAI CLIP, OpenCLIP, SigLIP (Google), ALIGN (Google). These models were trained on hundreds of millions of image-text pairs. They learn that the concept "stainless steel hex bolt" and a photograph of one are the same thing in different modalities.
Vision-Language Models (VLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro: These are full LLMs that can both describe an image and reason about it. For RAG specifically, you can use a VLM to generate a text description of an incoming image and then run your standard text retrieval pipeline over that description.
The choice between these approaches involves real trade-offs:
| Approach | Latency | Accuracy | Cost | Complexity |
|---|---|---|---|---|
| CLIP-style joint embeddings | Low (5–20ms) | High for visual similarity | Low (self-hostable) | Medium |
| VLM image-to-text + text RAG | Medium (500–2000ms) | High for product knowledge | High (API costs) | Low |
| Hybrid: CLIP retrieval + VLM reranking | Medium | Highest | Medium | High |
For production B2B product search, the hybrid approach generally wins: use CLIP-family embeddings for fast first-pass retrieval, then a VLM to generate a rich description and rerank results.
The Two Indexing Strategies
When building a multimodal product knowledge index, you have two strategies. They're not mutually exclusive, and mature systems often combine them.
Strategy 1: Image Embeddings in the Vector Index
Index product images directly as embedding vectors alongside your text chunks. At query time, embed the incoming image and find the nearest neighbors — regardless of whether the query is visual or textual, because they share the same embedding space.
```typescript
import {
  AutoTokenizer,
  AutoProcessor,
  CLIPTextModelWithProjection,
  CLIPVisionModelWithProjection,
  RawImage,
} from '@xenova/transformers'

// Load the text and vision towers of a CLIP model. Both project into the
// same joint embedding space, so the resulting vectors are directly comparable.
const modelId = 'Xenova/clip-vit-base-patch32'
const tokenizer = await AutoTokenizer.from_pretrained(modelId)
const textModel = await CLIPTextModelWithProjection.from_pretrained(modelId)
const processor = await AutoProcessor.from_pretrained(modelId)
const visionModel = await CLIPVisionModelWithProjection.from_pretrained(modelId)

// Normalize to unit length so dot product equals cosine similarity
function l2Normalize(data: Float32Array): number[] {
  let norm = 0
  for (const v of data) norm += v * v
  norm = Math.sqrt(norm)
  return Array.from(data, v => v / norm)
}

// Embed a product image
async function embedImage(imageUrl: string): Promise<number[]> {
  const image = await RawImage.read(imageUrl)
  const { image_embeds } = await visionModel(await processor(image))
  return l2Normalize(image_embeds.data)
}

// Embed a text query (same model, same space)
async function embedText(text: string): Promise<number[]> {
  const inputs = tokenizer([text], { padding: true, truncation: true })
  const { text_embeds } = await textModel(inputs)
  return l2Normalize(text_embeds.data)
}

// At index time: store image embeddings with product metadata
await vectorStore.insert({
  id: `product-${sku}-image-${imageIndex}`,
  vector: await embedImage(imageUrl),
  metadata: {
    type: 'product_image',
    sku,
    productTitle,
    imageUrl,
    altText,
  },
})
```

The advantage: extremely fast retrieval, no per-query VLM call required. The limitation: CLIP embeddings capture visual similarity but not fine-grained product attributes. Two similar-looking bolts might be in the same embedding neighborhood even if their thread pitch differs.
Strategy 2: VLM-Generated Descriptions as Text
When ingesting product images, call a VLM to generate a rich text description of each image — capturing visual attributes, shape, color, material, form factor — and store that description as a text chunk.
```typescript
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function describeProductImage(imageUrl: string, productContext: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'image_url',
            image_url: { url: imageUrl },
          },
          {
            type: 'text',
            text: `You are indexing product images for a B2B catalog search system.

Product context: ${productContext}

Describe this product image in precise technical detail. Include:
- Visual form factor and shape
- Visible materials and finishes
- Any visible markings, text, or specifications
- Color and surface texture
- Dimensional cues (relative sizes, proportions)
- Any distinguishing visual features

Be specific and technical. This description will be searched by engineers and procurement managers.`,
          },
        ],
      },
    ],
    max_tokens: 300,
  })

  return response.choices[0].message.content ?? ''
}
```

This description gets stored as a regular text chunk, embedded with your standard text embedding model, and retrieved by standard text queries. The VLM cost happens once at index time, not at query time.
Handling Image Queries at Runtime
At query time, you receive an image (or image + text) from the buyer. Here's a practical pipeline:
Step 1: Generate a Query Description
If using Strategy 2 (or the hybrid), call a VLM to describe the query image and extract key product attributes.
```typescript
async function processImageQuery(
  imageUrl: string,
  userText?: string
): Promise<{ description: string; extractedAttributes: Record<string, string> }> {
  const prompt = userText
    ? `A user is asking: "${userText}"\n\nDescribe the product in this image in detail.`
    : `Describe the product in this image. Extract any visible specifications, model numbers, or identifiers.`

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: imageUrl } },
          { type: 'text', text: prompt },
        ],
      },
    ],
    max_tokens: 400,
  })

  const description = response.choices[0].message.content ?? ''

  // Also extract structured attributes for metadata filtering
  const attributes = await extractProductAttributes(description)

  return { description, extractedAttributes: attributes }
}
```

Step 2: Retrieve with Both Text and Image Embeddings
Run retrieval in parallel using both the text description and the raw image embedding, then merge and deduplicate candidates.
```typescript
async function multimodalRetrieve(
  imageUrl: string,
  queryDescription: string,
  topK: number = 30
): Promise<Chunk[]> {
  const [imageEmbedding, textEmbedding] = await Promise.all([
    embedImage(imageUrl),
    embedText(queryDescription),
  ])

  const [imageResults, textResults] = await Promise.all([
    vectorStore.similaritySearch(imageEmbedding, topK),
    vectorStore.similaritySearch(textEmbedding, topK),
  ])

  // Merge and deduplicate by product SKU, keeping highest score
  const merged = new Map<string, Chunk>()
  for (const chunk of [...imageResults, ...textResults]) {
    const existing = merged.get(chunk.metadata.sku)
    if (!existing || chunk.score > existing.score) {
      merged.set(chunk.metadata.sku, chunk)
    }
  }

  return Array.from(merged.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
```

This fusion approach — sometimes called late fusion — is more robust than relying on either modality alone. Text retrieval catches products with matching specifications; image retrieval catches visually similar form factors that text might miss.
Step 3: Rerank and Generate
Pass the merged candidates to a reranker (see our deep-dive on two-stage retrieval and reranking), then generate a structured product identification response.
For image queries, the LLM generation prompt needs slight adaptation — you want it to lead with product identification, not just an answer to a question.
```typescript
const systemPrompt = `You are a B2B product identification assistant.
The user has provided an image of a product. Based on the catalog context retrieved,
identify the product, provide the SKU, and list compatible alternatives if exact match is uncertain.
If you cannot make a confident identification, say so clearly — do not guess.
Always cite the specific catalog entries you're drawing from.`
```

When Visual Search Delivers the Highest ROI
Not every B2B context benefits equally from multimodal capabilities. The return is highest where:
High SKU density with visual differentiation
When you carry thousands of SKUs that look similar but have critical specification differences — fasteners, connectors, seals, bearings, filters — visual search combined with specification extraction from the image is enormously valuable. A buyer photographing a bearing doesn't need to know the exact model number; the system can work from visual attributes plus dimensional cues.
Field service and maintenance scenarios
Technicians in the field are your most time-pressured buyers. They can't stop to look up part numbers. A visual-first mobile experience changes how they interact with your catalog — a photo and a question, rather than a form with SKU fields.
High inbound sample / equivalent request volume
If your sales team spends significant time on "do you carry something like this?" emails with attached photos, multimodal search directly deflects that workload to a self-service channel. Measurable ROI shows up in sales team bandwidth almost immediately.
International catalogs without consistent part number standards
Cross-border equivalence matching is hard to do with part numbers — standards differ by country. Visual similarity often works where text-based cross-referencing fails.
What to Index: Product Images vs. Application Images
One underappreciated nuance: for B2B products, you often have two types of images — product images (white background studio shots, isolation shots) and application images (the product installed in context, usage scenarios).
Both are worth indexing for different query types:
| Image type | Good for |
|---|---|
| Studio / isolated product shot | Visual identification ("what is this?") |
| Dimensioned drawing or diagram | Specification extraction, dimensional matching |
| Application / in-context shot | Usage matching ("I need something for this application") |
| Exploded diagram | Assembly context, replacement part identification |
If your PIM exports all image types, index them separately with imageType as a metadata field. Queries asking "what is this part?" should weight product shots; queries asking "what works in this assembly?" should weight application images.
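One way to apply that weighting is a per-intent score multiplier at retrieval time. A sketch, assuming the `imageType` metadata values above; the weight table and intent labels are illustrative assumptions you'd tune against your own query logs:

```typescript
interface ImageChunk {
  score: number
  metadata: { sku: string; imageType: 'studio' | 'drawing' | 'application' | 'exploded' }
}

type QueryIntent = 'identify' | 'application'

// Per-intent multipliers: identification queries favor studio shots and
// drawings; application queries favor in-context and exploded images.
const WEIGHTS: Record<QueryIntent, Record<ImageChunk['metadata']['imageType'], number>> = {
  identify: { studio: 1.0, drawing: 0.9, application: 0.6, exploded: 0.7 },
  application: { studio: 0.6, drawing: 0.7, application: 1.0, exploded: 0.9 },
}

// Rescale raw similarity scores by image type, then re-sort.
function weightByImageType(results: ImageChunk[], intent: QueryIntent): ImageChunk[] {
  return results
    .map(r => ({ ...r, score: r.score * WEIGHTS[intent][r.metadata.imageType] }))
    .sort((a, b) => b.score - a.score)
}
```

The intent itself can come from a cheap classification of the user's text ("what is this?" vs. "what fits here?") before retrieval runs.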
The Freshness Problem: Keeping Visual Indexes Current
Image indexes have the same staleness problem as text indexes — when you update a product's images (rebranding, new variant photos, updated diagrams), your index needs to reflect that. This is an instance of the broader product catalog sync and RAG freshness problem.
The additional complexity for image indexes: re-embedding images is more expensive than re-embedding text, especially if you're using a VLM to generate descriptions at index time. Build your update pipeline with this in mind:
- Incremental updates: Only re-embed images that have changed (compare checksums or use PIM webhooks)
- Staggered re-embedding: For large catalogs, queue image re-embedding and process async to avoid blocking updates
- Separate text and image update cadences: Text product data changes more frequently than product images; you can often refresh text chunks on a daily cadence and images weekly
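The incremental-update check in the first bullet can be sketched with a content hash, assuming a `storedChecksums` map persisted from the previous sync (the helper names are our own):

```typescript
import { createHash } from 'node:crypto'

// Checksum the raw image bytes; if it matches the stored value,
// the image is unchanged and re-embedding can be skipped.
function imageChecksum(bytes: Buffer): string {
  return createHash('sha256').update(bytes).digest('hex')
}

function needsReembedding(
  imageId: string,
  bytes: Buffer,
  storedChecksums: Map<string, string>
): boolean {
  const current = imageChecksum(bytes)
  const previous = storedChecksums.get(imageId)
  if (previous === current) return false // unchanged: skip the expensive work
  storedChecksums.set(imageId, current)  // record for the next sync
  return true
}
```

A PIM webhook can feed image IDs into this check; only the ones that return `true` go onto the re-embedding queue.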
Practical Rollout Approach
Multimodal RAG doesn't have to replace your existing text-based pipeline. The cleanest rollout is additive:
Phase 1 — Image descriptions as text: Generate VLM descriptions for your product images and add them as text chunks to your existing index. This immediately improves text search for visual characteristics ("blue anodized," "flanged hex head") without any change to query-time infrastructure. Cost: indexing time only.
Phase 2 — Image query input: Add an image upload option to your chat widget. When an image is received, call a VLM to describe it and feed the description into your existing text retrieval pipeline. No change to the retrieval or generation layers.
Phase 3 — Joint embedding retrieval: Add CLIP-family image embeddings for direct image-to-image retrieval, enabling the late-fusion approach described above. This is where you see the biggest gains for visually distinctive product queries.
The phased approach lets you validate ROI at each step before investing in the next layer of infrastructure.
Limitations to Know Before You Build
CLIP models struggle with fine-grained differences: Two bolts that differ only in thread pitch look nearly identical to CLIP. For specification-critical retrieval, you still need text extraction — CLIP gets you to the right product family; text gets you to the right variant.
VLM description quality varies with image quality: Blurry photos, poor lighting, and partial views produce vague descriptions. Set client-side image quality guidance and handle low-confidence identifications explicitly rather than forcing a confident-sounding answer.
Catalog images often lack context: Studio product shots on white backgrounds are great for isolation, but VLMs struggle with scale and material without context cues. Supplement with detailed alt text and structured metadata in your product descriptions. This ties back to the fundamentals of effective product knowledge ingestion.
Inference costs at scale: If you're running VLM descriptions at query time for every image query, costs add up. Profile your query volume and consider caching common image queries or building a fast visual-similarity tier that avoids VLM calls for high-confidence matches.
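One inexpensive mitigation is to cache VLM descriptions keyed by a hash of the image bytes, so repeat queries for the same photo skip the API call entirely. A sketch (the in-memory map stands in for whatever cache store you use; a production system would likely want Redis or similar with a TTL):

```typescript
import { createHash } from 'node:crypto'

// In-memory cache of VLM descriptions keyed by a hash of the image bytes.
const descriptionCache = new Map<string, string>()

async function describeWithCache(
  imageBytes: Buffer,
  describe: (bytes: Buffer) => Promise<string>
): Promise<string> {
  const key = createHash('sha256').update(imageBytes).digest('hex')
  const cached = descriptionCache.get(key)
  if (cached !== undefined) return cached // cache hit: no VLM call
  const description = await describe(imageBytes)
  descriptionCache.set(key, description)
  return description
}
```

Note that an exact byte hash only catches literal re-uploads; two photos of the same part from slightly different angles will miss. A CLIP-similarity tier in front of the VLM, as described in the limitation above, covers that case.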
The Bigger Picture: Moving Toward True Product Understanding
Multimodal RAG is a step toward what the best B2B product AI ultimately needs: a genuine understanding of products that matches how buyers actually experience them — visually, tactilely, contextually, not just textually.
Text-only product knowledge systems are powerful, but they hit a ceiling when buyers arrive with visual context. A worn bearing in someone's hand. A photo from a field service call. An engineer's sketch. These are normal inputs in B2B purchasing, and they've been invisible to digital product experiences until now.
The systems that handle these inputs smoothly — combining visual retrieval with specification knowledge and conversational AI — create a qualitatively different buyer experience. Not just faster search, but a system that meets buyers where they are, with the context they actually have.
That's the trajectory the best B2B product knowledge platforms are on. The underlying technology is mature enough today to build it in production.
Ready to Add Visual Search to Your Product Catalog?
Axoverna's product knowledge platform supports image-based queries, combining visual retrieval with deep B2B catalog context. Whether you're handling unknown part identification, competitor product matching, or field service requests, our multimodal pipeline is built for the specifics of industrial and wholesale catalogs.
Book a demo to see visual product search on your catalog, or start a free trial and explore what multimodal RAG can do for your buyers.