Multimodal RAG: Adding Visual Search to Your Product Knowledge AI
Text embeddings alone can't answer 'do you have this part?' when a buyer holds up a photo. Learn how multimodal RAG pipelines handle image queries in B2B product catalogs — and when visual search delivers the biggest ROI.
A technician on a factory floor photographs a worn bearing. A procurement manager snaps a picture of a competitor's product. A field engineer holds their phone up to an unlabeled connector and asks: "Do you carry this?"
Text-based product search has no answer for any of these. Neither does a traditional keyword search bar, a PIM system, or a static FAQ page. These are visual queries — and they're far more common in B2B purchasing than most product teams expect.
Multimodal RAG changes that. By combining visual embeddings with the retrieval-augmented generation pipelines that already power text-based product knowledge systems, you can answer image-based questions with the same accuracy and confidence as text ones. This article explains how the pipeline works, where it delivers the most value, and what the implementation looks like in practice.
Why Visual Queries Matter in B2B Procurement
Before getting into the mechanics, it's worth understanding why visual search is particularly valuable in industrial and wholesale contexts — because the business case drives the architectural decisions.
The Unknown Part Problem
In maintenance, repair, and operations (MRO) purchasing, buyers frequently encounter parts with no visible label, no barcode, or no documentation. The original equipment manufacturer may no longer exist. The part was installed decades ago. The product number has worn off.
In these situations, the buyer's only option is visual identification. They have a physical object and need to find its specification, its compatible replacements, or its current equivalent. Without visual search, this process involves phone calls, email threads with engineers, and sometimes hours of manual catalog browsing.
A multimodal AI that can take a photograph and return "This appears to be an SKF 6203-2RS1 deep groove ball bearing, 17mm bore × 40mm OD × 12mm width. Compatible replacements available include..." collapses that workflow into seconds.
New Product Verification
Procurement teams often need to verify that an incoming delivery matches the specified product. Visual AI can cross-reference a photo of a received item against catalog images and flag discrepancies — wrong product, wrong variant, counterfeit detection.
Competitor Product Matching
Distributors and wholesalers frequently encounter customers holding competitors' products and asking "do you have something equivalent?" Historically this required a sales rep to know the catalog deeply. A visual-first RAG system can match a photo against your catalog and surface compatible alternatives automatically.
How Multimodal Embedding Works
The core of a multimodal RAG pipeline is a joint embedding space — a model that can embed both images and text into the same vector space, so that a photo of a hex bolt and the text "M12 hex head bolt stainless steel" land near each other as vectors.
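Concretely, "land near each other" means the two vectors have high cosine similarity. A minimal sketch with toy values (these are hypothetical three-dimensional vectors, not real CLIP outputs, which typically have 512+ dimensions):

```typescript
// Cosine similarity between two embedding vectors in the joint space.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// In a well-trained joint space, the embedding of a hex-bolt photo and of
// the text "M12 hex head bolt" score close to 1; unrelated products score lower.
const imageVec = [0.9, 0.1, 0.2]    // hypothetical image embedding
const textVec = [0.85, 0.15, 0.25]  // hypothetical matching text embedding
const unrelated = [0.1, 0.9, -0.3]  // hypothetical unrelated embedding

console.log(cosineSimilarity(imageVec, textVec) > cosineSimilarity(imageVec, unrelated)) // true
```

Retrieval over such a space is simply nearest-neighbor search by this similarity, regardless of which modality produced the query vector.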
The two dominant model families for this are:
CLIP (Contrastive Language-Image Pretraining) and its successors: OpenAI CLIP, OpenCLIP, SigLIP (Google), ALIGN (Google). These models were trained on hundreds of millions of image-text pairs. They learn that the concept "stainless steel hex bolt" and a photograph of one are the same thing in different modalities.
Vision-Language Models (VLMs) like GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro: These are full LLMs that can both describe an image and reason about it. For RAG specifically, you can use a VLM to generate a text description of an incoming image and then run your standard text retrieval pipeline over that description.
The choice between these approaches involves real trade-offs:
| Approach | Latency | Accuracy | Cost | Complexity |
|---|---|---|---|---|
| CLIP-style joint embeddings | Low (5–20ms) | High for visual similarity | Low (self-hostable) | Medium |
| VLM image-to-text + text RAG | Medium (500–2000ms) | High for product knowledge | High (API costs) | Low |
| Hybrid: CLIP retrieval + VLM reranking | Medium | Highest | Medium | High |
For production B2B product search, the hybrid approach generally wins: use CLIP-family embeddings for fast first-pass retrieval, then a VLM to generate a rich description and rerank results.
The Two Indexing Strategies
When building a multimodal product knowledge index, you have two strategies. They're not mutually exclusive, and mature systems often combine them.
Strategy 1: Image Embeddings in the Vector Index
Index product images directly as embedding vectors alongside your text chunks. At query time, embed the incoming image and find the nearest neighbors — regardless of whether the query is visual or textual, because they share the same embedding space.
```typescript
import {
  AutoTokenizer,
  AutoProcessor,
  CLIPTextModelWithProjection,
  CLIPVisionModelWithProjection,
  RawImage,
} from '@xenova/transformers'

// Load the text and vision towers of a CLIP model. Both project into the
// same joint embedding space, so the resulting vectors are directly comparable.
const modelId = 'Xenova/clip-vit-base-patch32'
const tokenizer = await AutoTokenizer.from_pretrained(modelId)
const textModel = await CLIPTextModelWithProjection.from_pretrained(modelId)
const processor = await AutoProcessor.from_pretrained(modelId)
const visionModel = await CLIPVisionModelWithProjection.from_pretrained(modelId)

// Normalize to unit length so dot product equals cosine similarity
function l2Normalize(data: Float32Array): number[] {
  let norm = 0
  for (const v of data) norm += v * v
  norm = Math.sqrt(norm)
  return Array.from(data, v => v / norm)
}

// Embed a product image
async function embedImage(imageUrl: string): Promise<number[]> {
  const image = await RawImage.read(imageUrl)
  const { image_embeds } = await visionModel(await processor(image))
  return l2Normalize(image_embeds.data)
}

// Embed a text query (same model, same space)
async function embedText(text: string): Promise<number[]> {
  const inputs = tokenizer([text], { padding: true, truncation: true })
  const { text_embeds } = await textModel(inputs)
  return l2Normalize(text_embeds.data)
}

// At index time: store image embeddings with product metadata
await vectorStore.insert({
  id: `product-${sku}-image-${imageIndex}`,
  vector: await embedImage(imageUrl),
  metadata: {
    type: 'product_image',
    sku,
    productTitle,
    imageUrl,
    altText,
  },
})
```

The advantage: extremely fast retrieval, no per-query VLM call required. The limitation: CLIP embeddings capture visual similarity but not fine-grained product attributes. Two similar-looking bolts might be in the same embedding neighborhood even if their thread pitch differs.
Strategy 2: VLM-Generated Descriptions as Text
When ingesting product images, call a VLM to generate a rich text description of each image — capturing visual attributes, shape, color, material, form factor — and store that description as a text chunk.
```typescript
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function describeProductImage(imageUrl: string, productContext: string): Promise<string> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'image_url',
            image_url: { url: imageUrl },
          },
          {
            type: 'text',
            text: `You are indexing product images for a B2B catalog search system.

Product context: ${productContext}

Describe this product image in precise technical detail. Include:
- Visual form factor and shape
- Visible materials and finishes
- Any visible markings, text, or specifications
- Color and surface texture
- Dimensional cues (relative sizes, proportions)
- Any distinguishing visual features

Be specific and technical. This description will be searched by engineers and procurement managers.`,
          },
        ],
      },
    ],
    max_tokens: 300,
  })

  return response.choices[0].message.content ?? ''
}
```

This description gets stored as a regular text chunk, embedded with your standard text embedding model, and retrieved by standard text queries. The VLM cost happens once at index time, not at query time.
Handling Image Queries at Runtime
At query time, you receive an image (or image + text) from the buyer. Here's a practical pipeline:
Step 1: Generate a Query Description
If using Strategy 2 (or the hybrid), call a VLM to describe the query image and extract key product attributes.
```typescript
async function processImageQuery(
  imageUrl: string,
  userText?: string
): Promise<{ description: string; extractedAttributes: Record<string, string> }> {
  const prompt = userText
    ? `A user is asking: "${userText}"\n\nDescribe the product in this image in detail.`
    : `Describe the product in this image. Extract any visible specifications, model numbers, or identifiers.`

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'image_url', image_url: { url: imageUrl } },
          { type: 'text', text: prompt },
        ],
      },
    ],
    max_tokens: 400,
  })

  const description = response.choices[0].message.content ?? ''

  // Also extract structured attributes for metadata filtering
  const attributes = await extractProductAttributes(description)

  return { description, extractedAttributes: attributes }
}
```

Step 2: Retrieve with Both Text and Image Embeddings
Run retrieval in parallel using both the text description and the raw image embedding, then merge and deduplicate candidates.
```typescript
async function multimodalRetrieve(
  imageUrl: string,
  queryDescription: string,
  topK: number = 30
): Promise<Chunk[]> {
  const [imageEmbedding, textEmbedding] = await Promise.all([
    embedImage(imageUrl),
    embedText(queryDescription),
  ])

  const [imageResults, textResults] = await Promise.all([
    vectorStore.similaritySearch(imageEmbedding, topK),
    vectorStore.similaritySearch(textEmbedding, topK),
  ])

  // Merge and deduplicate by product SKU, keeping highest score
  const merged = new Map<string, Chunk>()
  for (const chunk of [...imageResults, ...textResults]) {
    const existing = merged.get(chunk.metadata.sku)
    if (!existing || chunk.score > existing.score) {
      merged.set(chunk.metadata.sku, chunk)
    }
  }

  return Array.from(merged.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
}
```

This fusion approach — sometimes called late fusion — is more robust than relying on either modality alone. Text retrieval catches products with matching specifications; image retrieval catches visually similar form factors that text might miss.
Step 3: Rerank and Generate
Pass the merged candidates to a reranker (see our deep-dive on two-stage retrieval and reranking), then generate a structured product identification response.
For image queries, the LLM generation prompt needs slight adaptation — you want it to lead with product identification, not just an answer to a question.
```typescript
const systemPrompt = `You are a B2B product identification assistant.
The user has provided an image of a product. Based on the catalog context retrieved,
identify the product, provide the SKU, and list compatible alternatives if exact match is uncertain.
If you cannot make a confident identification, say so clearly — do not guess.
Always cite the specific catalog entries you're drawing from.`
```

When Visual Search Delivers the Highest ROI
Not every B2B context benefits equally from multimodal capabilities. The return is highest where:
High SKU density with visual differentiation
When you carry thousands of SKUs that look similar but have critical specification differences — fasteners, connectors, seals, bearings, filters — visual search combined with specification extraction from the image is enormously valuable. A buyer photographing a bearing doesn't need to know the exact model number; the system can work from visual attributes plus dimensional cues.
Field service and maintenance scenarios
Technicians in the field are your most time-pressured buyers. They can't stop to look up part numbers. A visual-first mobile experience changes how they interact with your catalog — a photo and a question, rather than a form with SKU fields.
High inbound sample / equivalent request volume
If your sales team spends significant time on "do you carry something like this?" emails with attached photos, multimodal search directly deflects that workload to a self-service channel. Measurable ROI shows up in sales team bandwidth almost immediately.
International catalogs without consistent part number standards
Cross-border equivalence matching is hard to do with part numbers — standards differ by country. Visual similarity often works where text-based cross-referencing fails.
What to Index: Product Images vs. Application Images
One underappreciated nuance: for B2B products, you often have two types of images — product images (white background studio shots, isolation shots) and application images (the product installed in context, usage scenarios).
Both are worth indexing for different query types:
| Image type | Good for |
|---|---|
| Studio / isolated product shot | Visual identification ("what is this?") |
| Dimensioned drawing or diagram | Specification extraction, dimensional matching |
| Application / in-context shot | Usage matching ("I need something for this application") |
| Exploded diagram | Assembly context, replacement part identification |
If your PIM exports all image types, index them separately with imageType as a metadata field. Queries asking "what is this part?" should weight product shots; queries asking "what works in this assembly?" should weight application images.
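One way to apply that weighting is a per-intent score multiplier at retrieval time. A sketch, assuming the `imageType` metadata values above; the weight table and intent labels are illustrative assumptions you'd tune against your own query logs:

```typescript
interface ImageChunk {
  score: number
  metadata: { sku: string; imageType: 'studio' | 'drawing' | 'application' | 'exploded' }
}

type QueryIntent = 'identify' | 'application'

// Per-intent multipliers: identification queries favor studio shots and
// drawings; application queries favor in-context and exploded images.
const WEIGHTS: Record<QueryIntent, Record<ImageChunk['metadata']['imageType'], number>> = {
  identify: { studio: 1.0, drawing: 0.9, application: 0.6, exploded: 0.7 },
  application: { studio: 0.6, drawing: 0.7, application: 1.0, exploded: 0.9 },
}

// Rescale raw similarity scores by image type, then re-sort.
function weightByImageType(results: ImageChunk[], intent: QueryIntent): ImageChunk[] {
  return results
    .map(r => ({ ...r, score: r.score * WEIGHTS[intent][r.metadata.imageType] }))
    .sort((a, b) => b.score - a.score)
}
```

The intent itself can come from a cheap classification of the user's text ("what is this?" vs. "what fits here?") before retrieval runs.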
The Freshness Problem: Keeping Visual Indexes Current
Image indexes have the same staleness problem as text indexes — when you update a product's images (rebranding, new variant photos, updated diagrams), your index needs to reflect that. This is an instance of the broader product catalog sync and RAG freshness problem.
The additional complexity for image indexes: re-embedding images is more expensive than re-embedding text, especially if you're using a VLM to generate descriptions at index time. Build your update pipeline with this in mind:
- Incremental updates: Only re-embed images that have changed (compare checksums or use PIM webhooks)
- Staggered re-embedding: For large catalogs, queue image re-embedding and process async to avoid blocking updates
- Separate text and image update cadences: Text product data changes more frequently than product images; you can often refresh text chunks on a daily cadence and images weekly
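The incremental-update check in the first bullet can be sketched with a content hash, assuming a `storedChecksums` map persisted from the previous sync (the helper names are our own):

```typescript
import { createHash } from 'node:crypto'

// Checksum the raw image bytes; if it matches the stored value,
// the image is unchanged and re-embedding can be skipped.
function imageChecksum(bytes: Buffer): string {
  return createHash('sha256').update(bytes).digest('hex')
}

function needsReembedding(
  imageId: string,
  bytes: Buffer,
  storedChecksums: Map<string, string>
): boolean {
  const current = imageChecksum(bytes)
  const previous = storedChecksums.get(imageId)
  if (previous === current) return false // unchanged: skip the expensive work
  storedChecksums.set(imageId, current)  // record for the next sync
  return true
}
```

A PIM webhook can feed image IDs into this check; only the ones that return `true` go onto the re-embedding queue.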
Practical Rollout Approach
Multimodal RAG doesn't have to replace your existing text-based pipeline. The cleanest rollout is additive:
Phase 1 — Image descriptions as text: Generate VLM descriptions for your product images and add them as text chunks to your existing index. This immediately improves text search for visual characteristics ("blue anodized," "flanged hex head") without any change to query-time infrastructure. Cost: indexing time only.
Phase 2 — Image query input: Add an image upload option to your chat widget. When an image is received, call a VLM to describe it and feed the description into your existing text retrieval pipeline. No change to the retrieval or generation layers.
Phase 3 — Joint embedding retrieval: Add CLIP-family image embeddings for direct image-to-image retrieval, enabling the late-fusion approach described above. This is where you see the biggest gains for visually distinctive product queries.
The phased approach lets you validate ROI at each step before investing in the next layer of infrastructure.
Limitations to Know Before You Build
CLIP models struggle with fine-grained differences: Two bolts that differ only in thread pitch look nearly identical to CLIP. For specification-critical retrieval, you still need text extraction — CLIP gets you to the right product family; text gets you to the right variant.
VLM description quality varies with image quality: Blurry photos, poor lighting, and partial views produce vague descriptions. Set client-side image quality guidance and handle low-confidence identifications explicitly rather than forcing a confident-sounding answer.
Catalog images often lack context: Studio product shots on white backgrounds are great for isolation, but VLMs struggle with scale and material without context cues. Supplement with detailed alt text and structured metadata in your product descriptions. This ties back to the fundamentals of effective product knowledge ingestion.
Inference costs at scale: If you're running VLM descriptions at query time for every image query, costs add up. Profile your query volume and consider caching common image queries or building a fast visual-similarity tier that avoids VLM calls for high-confidence matches.
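One inexpensive mitigation is to cache VLM descriptions keyed by a hash of the image bytes, so repeat queries for the same photo skip the API call entirely. A sketch (the in-memory map stands in for whatever cache store you use; a production system would likely want Redis or similar with a TTL):

```typescript
import { createHash } from 'node:crypto'

// In-memory cache of VLM descriptions keyed by a hash of the image bytes.
const descriptionCache = new Map<string, string>()

async function describeWithCache(
  imageBytes: Buffer,
  describe: (bytes: Buffer) => Promise<string>
): Promise<string> {
  const key = createHash('sha256').update(imageBytes).digest('hex')
  const cached = descriptionCache.get(key)
  if (cached !== undefined) return cached // cache hit: no VLM call
  const description = await describe(imageBytes)
  descriptionCache.set(key, description)
  return description
}
```

Note that an exact byte hash only catches literal re-uploads; two photos of the same part from slightly different angles will miss. A CLIP-similarity tier in front of the VLM, as described in the limitation above, covers that case.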
The Bigger Picture: Moving Toward True Product Understanding
Multimodal RAG is a step toward what the best B2B product AI ultimately needs: a genuine understanding of products that matches how buyers actually experience them — visually, tactilely, contextually, not just textually.
Text-only product knowledge systems are powerful, but they hit a ceiling when buyers arrive with visual context. A worn bearing in someone's hand. A photo from a field service call. An engineer's sketch. These are normal inputs in B2B purchasing, and they've been invisible to digital product experiences until now.
The systems that handle these inputs smoothly — combining visual retrieval with specification knowledge and conversational AI — create a qualitatively different buyer experience. Not just faster search, but a system that meets buyers where they are, with the context they actually have.
That's the trajectory the best B2B product knowledge platforms are on. The underlying technology is mature enough today to build it in production.
Ready to Add Visual Search to Your Product Catalog?
Axoverna's product knowledge platform supports image-based queries, combining visual retrieval with deep B2B catalog context. Whether you're handling unknown part identification, competitor product matching, or field service requests, our multimodal pipeline is built for the specifics of industrial and wholesale catalogs.
Book a demo to see visual product search on your catalog, or start a free trial and explore what multimodal RAG can do for your buyers.