Entity Resolution for B2B Product AI: Matching Duplicates, Supplier Codes, and Product Synonyms
A product AI assistant is only as reliable as its ability to recognize when different records describe the same thing. Here's how B2B teams can solve entity resolution across supplier feeds, ERP data, PDFs, and product synonyms.
In B2B product catalogs, the same product rarely appears exactly once.
It appears as a manufacturer part number in one feed, an internal SKU in the ERP, a shortened alias in the webshop, an outdated code in a PDF datasheet, and a colloquial nickname used by sales reps who have been selling it for ten years.
Humans learn to work around this mess. They know that “AB-4100”, “4100 series actuator”, and “old pneumatic drive” all point to the same product family. A retrieval system does not know that unless you teach it.
That teaching problem is entity resolution.
If your product AI cannot reliably determine when two records refer to the same real-world item, every downstream capability gets weaker. Search results fragment across duplicate records. RAG retrieves conflicting chunks. Substitution logic becomes unreliable. Analytics undercount demand because the same product is split across multiple identities.
Entity resolution is not glamorous, but in B2B product knowledge it is one of the highest-leverage problems you can solve.
What Entity Resolution Actually Means in Product AI
Entity resolution is the process of deciding whether multiple records represent the same entity.
In consumer AI examples, that often means matching customer profiles. In B2B product AI, it usually means matching product identities across systems and sources:
- supplier feed records
- ERP items
- PIM entries
- webshop products
- technical PDF references
- accessory and compatibility mappings
- historical or superseded codes
The key detail is that “same entity” does not always mean “exact same SKU string.” It often means one of three things:
- Exact identity match, where two systems refer to the same sellable item.
- Variant-family match, where two records belong to the same product family but not the same exact configuration.
- Relationship match, where one record supersedes, replaces, complements, or substitutes another.
A good product AI system needs all three. If you collapse them together, the assistant starts answering with the wrong level of precision.
For example, if a buyer asks for a 24V actuator and your system merges it with the 230V variant because the names look similar, retrieval may surface plausible but incorrect specifications. That is not a minor quality issue. It is a trust-breaking one.
Why This Problem Shows Up Everywhere in B2B Catalogs
Entity resolution becomes painful in B2B because product data comes from operational systems that were never designed to speak a common language.
A typical distributor might have:
- manufacturer feeds with official part numbers
- ERP records with internal item codes
- PIM-enriched product names for e-commerce
- legacy aliases from previous suppliers
- PDFs that mention obsolete references still used by customers
- manually maintained compatibility spreadsheets
Each source is locally reasonable. Together, they create identity fragmentation.
This is also why many teams underestimate the issue. Search still works some of the time. Sales reps can still answer many questions manually. The damage shows up in second-order effects:
- buyers get duplicate or near-duplicate results
- RAG retrieves contradictory descriptions for “the same” product
- analytics treat one product family as multiple demand signals
- substitution suggestions miss obvious alternatives
- zero-result queries rise because users search with non-canonical terms
If you have already invested in product data governance, entity resolution becomes much easier because source authority is clearer. If you have not, the matching layer ends up compensating for governance gaps it should not have to own.
The Four Identity Signals That Matter Most
In practice, strong entity resolution does not come from one clever model. It comes from combining several imperfect signals.
1. Deterministic identifiers
These are the easiest wins:
- manufacturer part number
- GTIN/EAN/UPC
- exact supplier article code
- normalized internal SKU mappings
If these fields are complete and trustworthy, use them first. Exact identifier matches should outrank every fuzzy heuristic.
The catch is that these identifiers are often dirty in production data. Hyphens disappear. Prefixes change. Leading zeros get dropped. One system stores “AB-04100”, another stores “AB04100”, and a PDF says “Type 4100”.
That means exact-match logic needs a normalization layer before comparison.
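As a minimal sketch of what that normalization layer might look like, the function below collapses the drift patterns mentioned above: case, separators, and leading zeros. The specific rules are illustrative; production rules should come from your own suppliers' formatting habits.

```python
import re

def normalize_part_number(raw: str) -> str:
    """Collapse common part-number drift before exact comparison.
    Rules here are a sketch, not an exhaustive set."""
    s = raw.strip().upper()
    # Drop separators such as hyphens, dots, slashes, and spaces.
    s = re.sub(r"[\s\-./_]", "", s)
    # Strip leading zeros in numeric runs: AB04100 -> AB4100.
    s = re.sub(r"(?<=\D)0+(?=\d)", "", s)
    s = re.sub(r"^0+(?=\d)", "", s)
    return s

# "AB-04100", "AB04100", and "ab 04100" now compare equal.
```

With this in place, the deterministic-identifier check becomes a comparison of normalized values, with the raw values preserved for audit.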
2. Attribute similarity
When deterministic IDs fail, product attributes become the next signal:
- brand or manufacturer
- dimensions
- voltage or pressure rating
- material grade
- thread or connector type
- standards and certifications
- pack size or unit of measure
This is where structured data matters. Matching “same product” from unstructured marketing copy is hard. Matching records that share the same manufacturer, diameter, thread pitch, material, and compliance standard is much easier.
This is one reason structured ingestion from CSV, XML, and JSON product feeds is so valuable. It gives you fields you can compare directly instead of forcing the model to infer everything from prose.
3. Textual similarity
Names, descriptions, and application notes still matter, especially when structured attributes are incomplete.
But textual similarity works best as a supporting signal, not the sole decision maker. B2B product names are messy:
- word order changes
- abbreviations vary by supplier
- one source includes dimensions, another does not
- marketing labels obscure the base product identity
A semantic model can recognize that “stainless hygienic butterfly valve” and “butterfly valve, SS hygienic series” are closely related, but you still need attribute and identifier checks to determine whether they are actually the same sellable product.
4. Graph relationships and historical aliases
This is the underused signal.
Over time, catalogs accumulate identity breadcrumbs:
- “replaces part X”
- “formerly sold as Y”
- “compatible with series Z”
- “same core unit as item Q, different housing”
These relationships are exactly the kind of knowledge that disappears when teams only think in terms of flat rows. If you have read our piece on GraphRAG for product relationship queries, this is where graph structure starts paying off. Identity is often relational, not just lexical.
A Practical Matching Pipeline
The most reliable approach is staged, not monolithic.
Stage 1, normalize everything
Before you compare records, normalize the fields that commonly drift:
- uppercase/lowercase
- punctuation and separators
- whitespace
- unit formatting
- decimal notation
- common abbreviations
- manufacturer name variants
For example:
- SS → stainless steel
- Ø → diameter
- mm. → mm
- Siemens AG, SIEMENS, Siemens → Siemens
This sounds basic because it is basic. It is also where a surprising amount of matching quality comes from. Teams often jump straight to embeddings when they still have cheap deterministic cleanup available.
Stage 2, generate candidate pairs
Do not compare every product against every other product. That does not scale.
Instead, use blocking rules to create plausible candidate sets:
- same manufacturer
- same normalized family token
- same base part-number stem
- same category plus similar core dimensions
This is similar in spirit to metadata filtering: constrain the search space before you do expensive semantic work.
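The blocking idea can be sketched in a few lines. This example uses one illustrative rule, same manufacturer plus the alphabetic stem of the part number; the field names and the rule itself are assumptions to adapt to your catalog.

```python
import re
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> tuple:
    """Illustrative blocking rule: manufacturer + leading alpha
    stem of the part number. Field names are assumptions."""
    m = re.match(r"[A-Za-z]+", record.get("part_number", ""))
    stem = m.group(0).upper() if m else ""
    return (record.get("manufacturer", "").upper(), stem)

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    # Only compare records that share a block, never all-vs-all.
    for members in blocks.values():
        yield from combinations(members, 2)

records = [
    {"id": 1, "manufacturer": "Siemens", "part_number": "AB-4100"},
    {"id": 2, "manufacturer": "Siemens", "part_number": "AB4100-24V"},
    {"id": 3, "manufacturer": "Festo",   "part_number": "AB4100"},
]
pairs = list(candidate_pairs(records))
# Records 1 and 2 share a block; record 3 is never compared against them.
```

The payoff is complexity: scoring runs on a few plausible pairs per block instead of every pair in the catalog.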
Stage 3, score candidates across multiple signals
For each candidate pair, compute a composite score:
- identifier match score
- attribute overlap score
- semantic name similarity
- exact numeric spec agreement
- packaging or unit mismatch penalty
- source-authority weighting
The source-authority term matters more than many teams expect. If an ERP record and a reviewed manufacturer feed disagree, the system should not treat them as equally trustworthy. This is another place where governance and matching need to reinforce each other.
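A composite scorer can be as simple as a weighted sum over those signals. The signal names and weights below are placeholders; in practice they should be tuned against a set of human-reviewed match decisions.

```python
def composite_score(pair_signals: dict, weights: dict) -> float:
    """Weighted blend of imperfect signals, each in [0, 1].
    Weights are illustrative, not recommended values."""
    return sum(weights[name] * pair_signals.get(name, 0.0) for name in weights)

weights = {
    "identifier_match": 0.45,       # exact normalized IDs dominate
    "attribute_overlap": 0.25,
    "name_similarity": 0.15,
    "numeric_spec_agreement": 0.10,
    "source_authority": 0.05,
}
signals = {
    "identifier_match": 1.0,
    "attribute_overlap": 0.8,
    "name_similarity": 0.7,
    "numeric_spec_agreement": 1.0,
    "source_authority": 0.6,
}
score = composite_score(signals, weights)
# A packaging or unit mismatch works better as a subtraction
# than as one more averaged term, so it cannot be diluted away.
```

Keeping the penalty terms separate from the blended terms makes it impossible for strong name similarity to paper over a pack-size disagreement.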
Stage 4, classify the relationship
Do not stop at “match” or “no match.” Classify the relationship type:
- exact same SKU
- same family, different variant
- superseded by
- accessory of
- substitute candidate
- likely duplicate, needs review
That classification is what makes the result useful to downstream AI. A search assistant, a quote assistant, and a support bot do not all need the same identity abstraction.
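A minimal classifier over the composite score might look like the following. The thresholds and the two evidence flags are hypothetical; real systems usually derive the supersession flag from explicit graph edges and the variant check from the attribute fingerprint.

```python
def classify_relationship(score: float, same_variant_attrs: bool,
                          supersedes_edge: bool) -> str:
    """Turn a match score plus structural evidence into a
    relationship label, not a bare yes/no. Thresholds are placeholders."""
    if supersedes_edge:
        return "superseded_by"
    if score >= 0.9 and same_variant_attrs:
        return "exact_same_sku"
    if score >= 0.9:
        return "same_family_different_variant"
    if score >= 0.6:
        return "likely_duplicate_needs_review"
    return "no_match"

# A 24V and a 230V record can score 0.93 on names and attributes
# yet disagree on the variant-defining attribute:
label = classify_relationship(0.93, same_variant_attrs=False,
                              supersedes_edge=False)
```

Note how the 24V/230V case lands in `same_family_different_variant` rather than being merged, which is exactly the precision distinction the section above calls for.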
Stage 5, keep a human review lane
Some ambiguous pairs should go to review.
This is not a failure. In technical catalogs, the cost of confidently merging the wrong items is higher than the cost of escalating uncertain matches. A good system learns where automation should stop.
Where Teams Usually Go Wrong
There are a few recurring mistakes.
Mistake 1, using embeddings as the whole solution
Embeddings are useful for recall. They are not enough for product identity.
Two products can be semantically similar but commercially distinct. A 12V and 24V version of the same device will often embed very closely because their descriptions overlap. If your match decision depends mostly on semantic similarity, you will create dangerous false positives.
Use embeddings to find candidates, not to finalize identity on their own.
Mistake 2, ignoring units and packaging
One record says “box of 10”, another says “single unit”. One says “3/4 inch”, another says “DN20”. One stores pressure in bar, another in psi.
If units and packaging are not normalized, your match layer will confuse equivalence with approximation.
This is especially risky in industrial distribution, where the difference between “same specification” and “same orderable item” is often hidden in packaging, certification, or market-specific compliance.
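A small illustration of what unit and packaging normalization means in practice: convert specs to a canonical unit and compare packaging per unit, not per orderable item. The DN-to-inch table uses the standard nominal pipe-size correspondence; the function and field names are otherwise assumptions.

```python
PSI_PER_BAR = 14.5038  # standard conversion factor

# Standard nominal pipe-size correspondence (DN vs inch).
DN_TO_INCH = {"DN15": "1/2", "DN20": "3/4", "DN25": "1"}

def to_bar(value: float, unit: str) -> float:
    """Canonicalize pressure to bar before comparing records."""
    unit = unit.strip().lower()
    if unit == "bar":
        return value
    if unit == "psi":
        return value / PSI_PER_BAR
    raise ValueError(f"unknown pressure unit: {unit}")

def per_unit_price(price: float, pack_size: int) -> float:
    """Compare packaging on a per-unit basis."""
    return price / pack_size

# 43.5 psi and 3 bar describe the same rating once canonicalized.
```

Without this layer, the "box of 10" record and the single-unit record in the example above look like different products with wildly different prices.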
Mistake 3, collapsing variants too aggressively
Catalog teams often want deduplication, but what they really need is identity structure.
Merging all near-duplicate records into one canonical node may simplify the database while making the buyer experience worse. A better pattern is:
- canonical product family
- linked sellable variants
- linked aliases and historical references
- explicit substitution and supersession edges
That preserves precision while still giving the AI one coherent identity graph to work with.
Mistake 4, treating identity as a one-time cleanup project
Catalog identity is not a migration chore you finish once.
New suppliers arrive. Manufacturers rename product lines. Legacy part numbers stay alive in customer emails for years. Feed quality drifts. If matching logic is not part of your ongoing ingestion and monitoring process, entropy comes back fast.
The right mental model is not “dedupe the catalog.” It is “maintain catalog identity continuously.”
How Better Entity Resolution Improves RAG
This is where the payoff becomes obvious.
Cleaner retrieval
When duplicate records are linked correctly, retrieval can aggregate evidence from the same product identity instead of scattering relevance across near-duplicate chunks. That improves both recall and precision.
Better answer grounding
If a product has one canonical identity with linked aliases, the assistant can answer a question asked with an old part number while grounding the response in the current approved record.
That is much more useful than returning “no results” or answering from a superseded PDF.
Fewer contradictions in context
A lot of RAG “hallucinations” are really context collisions. The model receives two records that look related but disagree on critical details. Better entity resolution reduces those collisions before they ever reach the answer stage.
Stronger zero-result recovery
Many zero-result searches are not true misses. They are identity misses.
The buyer searched with a supplier nickname, a historical code, or a shorthand product family label that is absent from the canonical catalog record. If your identity layer stores aliases and synonym edges, the system can recover gracefully. That pairs naturally with the techniques we covered in zero-result search and query expansion with HyDE.
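The recovery step can be sketched as an alias lookup that runs before semantic search. The index structure and names here are assumptions; the alias index would be built from the identity layer's stored aliases and historical codes.

```python
def normalize(s: str) -> str:
    """Cheap term normalization: lowercase, alphanumerics only."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def resolve_query_term(term: str, alias_index: dict):
    """Map a non-canonical search term (old code, nickname,
    supplier alias) to a canonical product ID, or None."""
    return alias_index.get(normalize(term))

# Hypothetical index built from the identity layer's alias edges.
alias_index = {
    normalize("AB-4100"): "PROD-0042",
    normalize("old pneumatic drive"): "PROD-0042",
}

# "ab 4100" resolves even though it never appears verbatim
# in the canonical catalog record.
```

If the lookup misses, the query falls through to semantic retrieval as usual; the alias layer only ever adds recall, never removes it.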
A Minimal Data Model That Works
You do not need a huge master data program to get started. A practical model can be quite small.
For each canonical product identity, store:
- canonical product ID
- source records linked to it
- exact identifiers, normalized and raw
- alias list and historical codes
- manufacturer and brand normalization
- family membership
- critical attribute fingerprint
- relationship edges, such as supersedes, substitute_for, accessory_of
- confidence score and review status
That gives both your retrieval system and your operators something usable.
One helpful pattern is to compute an attribute fingerprint from the specs that matter most in your category, for example:
manufacturer + product_family + voltage + diameter + material + certification
It will not uniquely identify every product, but it creates a compact comparison surface for candidate generation and anomaly detection.
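One way to implement the fingerprint, sketched below with the example spec list from above. Keeping missing fields as empty slots means sparse records still align positionally instead of producing shifted, incomparable strings.

```python
def attribute_fingerprint(record: dict, keys: list) -> str:
    """Compact comparison surface from the specs that matter
    in a category. Missing fields become empty slots."""
    parts = [str(record.get(k, "")).strip().lower() for k in keys]
    return "|".join(parts)

# Spec list from the example above; adjust per category.
KEYS = ["manufacturer", "product_family", "voltage", "diameter",
        "material", "certification"]

fp = attribute_fingerprint(
    {"manufacturer": "Siemens", "product_family": "4100",
     "voltage": "24V", "material": "stainless"},
    KEYS,
)
# fp == "siemens|4100|24v||stainless|"
```

Two records whose fingerprints differ only in empty slots are candidates for enrichment; two that differ in a filled slot like voltage are variants, not duplicates.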
What Good Looks Like in Production
A mature entity-resolution layer should make three things happen consistently:
- A user can search with old or informal terms and still reach the current product.
- The AI can distinguish between same-family similarity and exact-item equivalence.
- Retrieval can combine all relevant knowledge around one product identity without blending incompatible variants.
That is the difference between a product chatbot that feels clever in demos and a product knowledge system that sales teams, buyers, and support staff actually trust.
Entity resolution rarely gets top billing in AI roadmaps because it sits below the shiny layer. But if you are building conversational product discovery, technical Q&A, substitution workflows, or guided selling, it is foundational.
The system cannot answer confidently about “the same product” until it knows what “the same product” means.
If your catalog has duplicate records, supplier aliases, and conflicting product identities across systems, Axoverna can help you turn that chaos into one searchable product knowledge layer.
We help B2B teams unify product data, improve retrieval quality, and power trustworthy AI experiences across search, chat, and sales workflows. Talk to Axoverna to see what a cleaner identity layer could do for your catalog.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.