Entity Resolution for B2B Product AI: Matching Duplicates, Supplier Codes, and Product Synonyms
A product AI assistant is only as reliable as its ability to recognize when different records describe the same thing. Here's how B2B teams can solve entity resolution across supplier feeds, ERP data, PDFs, and product synonyms.
In B2B product catalogs, the same product rarely appears exactly once.
It appears as a manufacturer part number in one feed, an internal SKU in the ERP, a shortened alias in the webshop, an outdated code in a PDF datasheet, and a colloquial nickname used by sales reps who have been selling it for ten years.
Humans learn to work around this mess. They know that “AB-4100”, “4100 series actuator”, and “old pneumatic drive” all point to the same product family. A retrieval system does not know that unless you teach it.
That teaching problem is entity resolution.
If your product AI cannot reliably determine when two records refer to the same real-world item, every downstream capability gets weaker. Search results fragment across duplicate records. RAG retrieves conflicting chunks. Substitution logic becomes unreliable. Analytics undercount demand because the same product is split across multiple identities.
Entity resolution is not glamorous, but in B2B product knowledge it is one of the highest-leverage problems you can solve.
What Entity Resolution Actually Means in Product AI
Entity resolution is the process of deciding whether multiple records represent the same entity.
In consumer AI examples, that often means matching customer profiles. In B2B product AI, it usually means matching product identities across systems and sources:
- supplier feed records
- ERP items
- PIM entries
- webshop products
- technical PDF references
- accessory and compatibility mappings
- historical or superseded codes
The key detail is that “same entity” does not always mean “exact same SKU string.” It often means one of three things:
- Exact identity match, where two systems refer to the same sellable item.
- Variant-family match, where two records belong to the same product family but not the same exact configuration.
- Relationship match, where one record supersedes, replaces, complements, or substitutes another.
A good product AI system needs all three. If you collapse them together, the assistant starts answering with the wrong level of precision.
For example, if a buyer asks for a 24V actuator and your system merges it with the 230V variant because the names look similar, retrieval may surface plausible but incorrect specifications. That is not a minor quality issue. It is a trust-breaking one.
Why This Problem Shows Up Everywhere in B2B Catalogs
Entity resolution becomes painful in B2B because product data comes from operational systems that were never designed to speak a common language.
A typical distributor might have:
- manufacturer feeds with official part numbers
- ERP records with internal item codes
- PIM-enriched product names for e-commerce
- legacy aliases from previous suppliers
- PDFs that mention obsolete references still used by customers
- manually maintained compatibility spreadsheets
Each source is locally reasonable. Together, they create identity fragmentation.
This is also why many teams underestimate the issue. Search still works some of the time. Sales reps can still answer many questions manually. The damage shows up in second-order effects:
- buyers get duplicate or near-duplicate results
- RAG retrieves contradictory descriptions for “the same” product
- analytics treat one product family as multiple demand signals
- substitution suggestions miss obvious alternatives
- zero-result queries rise because users search with non-canonical terms
If you have already invested in product data governance, entity resolution becomes much easier because source authority is clearer. If you have not, the matching layer ends up compensating for governance gaps it should not have to own.
The Four Identity Signals That Matter Most
In practice, strong entity resolution does not come from one clever model. It comes from combining several imperfect signals.
1. Deterministic identifiers
These are the easiest wins:
- manufacturer part number
- GTIN/EAN/UPC
- exact supplier article code
- normalized internal SKU mappings
If these fields are complete and trustworthy, use them first. Exact identifier matches should outrank every fuzzy heuristic.
The catch is that these identifiers are often dirty in production data. Hyphens disappear. Prefixes change. Leading zeros get dropped. One system stores “AB-04100”, another stores “AB04100”, and a PDF says “Type 4100”.
That means exact-match logic needs a normalization layer before comparison.
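As a minimal sketch of what that normalization layer might look like, the function below collapses the drift patterns mentioned above: case, separators, and leading zeros. The specific rules are illustrative; production rules should come from your own suppliers' formatting habits.

```python
import re

def normalize_part_number(raw: str) -> str:
    """Collapse common part-number drift before exact comparison.
    Rules here are a sketch, not an exhaustive set."""
    s = raw.strip().upper()
    # Drop separators such as hyphens, dots, slashes, and spaces.
    s = re.sub(r"[\s\-./_]", "", s)
    # Strip leading zeros in numeric runs: AB04100 -> AB4100.
    s = re.sub(r"(?<=\D)0+(?=\d)", "", s)
    s = re.sub(r"^0+(?=\d)", "", s)
    return s

# "AB-04100", "AB04100", and "ab 04100" now compare equal.
```

With this in place, the deterministic-identifier check becomes a comparison of normalized values, with the raw values preserved for audit.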
2. Attribute similarity
When deterministic IDs fail, product attributes become the next signal:
- brand or manufacturer
- dimensions
- voltage or pressure rating
- material grade
- thread or connector type
- standards and certifications
- pack size or unit of measure
This is where structured data matters. Matching “same product” from unstructured marketing copy is hard. Matching records that share the same manufacturer, diameter, thread pitch, material, and compliance standard is much easier.
This is one reason structured ingestion from CSV, XML, and JSON product feeds is so valuable. It gives you fields you can compare directly instead of forcing the model to infer everything from prose.
3. Textual similarity
Names, descriptions, and application notes still matter, especially when structured attributes are incomplete.
But textual similarity works best as a supporting signal, not the sole decision maker. B2B product names are messy:
- word order changes
- abbreviations vary by supplier
- one source includes dimensions, another does not
- marketing labels obscure the base product identity
A semantic model can recognize that “stainless hygienic butterfly valve” and “butterfly valve, SS hygienic series” are closely related, but you still need attribute and identifier checks to determine whether they are actually the same sellable product.
4. Graph relationships and historical aliases
This is the underused signal.
Over time, catalogs accumulate identity breadcrumbs:
- “replaces part X”
- “formerly sold as Y”
- “compatible with series Z”
- “same core unit as item Q, different housing”
These relationships are exactly the kind of knowledge that disappears when teams only think in terms of flat rows. If you have read our piece on GraphRAG for product relationship queries, this is where graph structure starts paying off. Identity is often relational, not just lexical.
A Practical Matching Pipeline
The most reliable approach is staged, not monolithic.
Stage 1, normalize everything
Before you compare records, normalize the fields that commonly drift:
- uppercase/lowercase
- punctuation and separators
- whitespace
- unit formatting
- decimal notation
- common abbreviations
- manufacturer name variants
For example:
- SS → stainless steel
- Ø → diameter
- mm. → mm
- Siemens AG, SIEMENS, Siemens → Siemens
This sounds basic because it is basic. It is also where a surprising amount of matching quality comes from. Teams often jump straight to embeddings when they still have cheap deterministic cleanup available.
Stage 2, generate candidate pairs
Do not compare every product against every other product. That does not scale.
Instead, use blocking rules to create plausible candidate sets:
- same manufacturer
- same normalized family token
- same base part-number stem
- same category plus similar core dimensions
This is similar in spirit to metadata filtering: constrain the search space before you do expensive semantic work.
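The blocking idea can be sketched in a few lines. This example uses one illustrative rule, same manufacturer plus the alphabetic stem of the part number; the field names and the rule itself are assumptions to adapt to your catalog.

```python
import re
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> tuple:
    """Illustrative blocking rule: manufacturer + leading alpha
    stem of the part number. Field names are assumptions."""
    m = re.match(r"[A-Za-z]+", record.get("part_number", ""))
    stem = m.group(0).upper() if m else ""
    return (record.get("manufacturer", "").upper(), stem)

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    # Only compare records that share a block, never all-vs-all.
    for members in blocks.values():
        yield from combinations(members, 2)

records = [
    {"id": 1, "manufacturer": "Siemens", "part_number": "AB-4100"},
    {"id": 2, "manufacturer": "Siemens", "part_number": "AB4100-24V"},
    {"id": 3, "manufacturer": "Festo",   "part_number": "AB4100"},
]
pairs = list(candidate_pairs(records))
# Records 1 and 2 share a block; record 3 is never compared against them.
```

The payoff is complexity: scoring runs on a few plausible pairs per block instead of every pair in the catalog.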
Stage 3, score candidates across multiple signals
For each candidate pair, compute a composite score:
- identifier match score
- attribute overlap score
- semantic name similarity
- exact numeric spec agreement
- packaging or unit mismatch penalty
- source-authority weighting
The source-authority term matters more than many teams expect. If an ERP record and a reviewed manufacturer feed disagree, the system should not treat them as equally trustworthy. This is another place where governance and matching need to reinforce each other.
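A composite scorer can be as simple as a weighted sum over those signals. The signal names and weights below are placeholders; in practice they should be tuned against a set of human-reviewed match decisions.

```python
def composite_score(pair_signals: dict, weights: dict) -> float:
    """Weighted blend of imperfect signals, each in [0, 1].
    Weights are illustrative, not recommended values."""
    return sum(weights[name] * pair_signals.get(name, 0.0) for name in weights)

weights = {
    "identifier_match": 0.45,       # exact normalized IDs dominate
    "attribute_overlap": 0.25,
    "name_similarity": 0.15,
    "numeric_spec_agreement": 0.10,
    "source_authority": 0.05,
}
signals = {
    "identifier_match": 1.0,
    "attribute_overlap": 0.8,
    "name_similarity": 0.7,
    "numeric_spec_agreement": 1.0,
    "source_authority": 0.6,
}
score = composite_score(signals, weights)
# A packaging or unit mismatch works better as a subtraction
# than as one more averaged term, so it cannot be diluted away.
```

Keeping the penalty terms separate from the blended terms makes it impossible for strong name similarity to paper over a pack-size disagreement.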
Stage 4, classify the relationship
Do not stop at “match” or “no match.” Classify the relationship type:
- exact same SKU
- same family, different variant
- superseded by
- accessory of
- substitute candidate
- likely duplicate, needs review
That classification is what makes the result useful to downstream AI. A search assistant, a quote assistant, and a support bot do not all need the same identity abstraction.
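A minimal classifier over the composite score might look like the following. The thresholds and the two evidence flags are hypothetical; real systems usually derive the supersession flag from explicit graph edges and the variant check from the attribute fingerprint.

```python
def classify_relationship(score: float, same_variant_attrs: bool,
                          supersedes_edge: bool) -> str:
    """Turn a match score plus structural evidence into a
    relationship label, not a bare yes/no. Thresholds are placeholders."""
    if supersedes_edge:
        return "superseded_by"
    if score >= 0.9 and same_variant_attrs:
        return "exact_same_sku"
    if score >= 0.9:
        return "same_family_different_variant"
    if score >= 0.6:
        return "likely_duplicate_needs_review"
    return "no_match"

# A 24V and a 230V record can score 0.93 on names and attributes
# yet disagree on the variant-defining attribute:
label = classify_relationship(0.93, same_variant_attrs=False,
                              supersedes_edge=False)
```

Note how the 24V/230V case lands in `same_family_different_variant` rather than being merged, which is exactly the precision distinction the section above calls for.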
Stage 5, keep a human review lane
Some ambiguous pairs should go to review.
This is not a failure. In technical catalogs, the cost of confidently merging the wrong items is higher than the cost of escalating uncertain matches. A good system learns where automation should stop.
Where Teams Usually Go Wrong
There are a few recurring mistakes.
Mistake 1, using embeddings as the whole solution
Embeddings are useful for recall. They are not enough for product identity.
Two products can be semantically similar but commercially distinct. A 12V and 24V version of the same device will often embed very closely because their descriptions overlap. If your match decision depends mostly on semantic similarity, you will create dangerous false positives.
Use embeddings to find candidates, not to finalize identity on their own.
Mistake 2, ignoring units and packaging
One record says “box of 10”, another says “single unit”. One says “3/4 inch”, another says “DN20”. One stores pressure in bar, another in psi.
If units and packaging are not normalized, your match layer will confuse equivalence with approximation.
This is especially risky in industrial distribution, where the difference between “same specification” and “same orderable item” is often hidden in packaging, certification, or market-specific compliance.
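A small illustration of what unit and packaging normalization means in practice: convert specs to a canonical unit and compare packaging per unit, not per orderable item. The DN-to-inch table uses the standard nominal pipe-size correspondence; the function and field names are otherwise assumptions.

```python
PSI_PER_BAR = 14.5038  # standard conversion factor

# Standard nominal pipe-size correspondence (DN vs inch).
DN_TO_INCH = {"DN15": "1/2", "DN20": "3/4", "DN25": "1"}

def to_bar(value: float, unit: str) -> float:
    """Canonicalize pressure to bar before comparing records."""
    unit = unit.strip().lower()
    if unit == "bar":
        return value
    if unit == "psi":
        return value / PSI_PER_BAR
    raise ValueError(f"unknown pressure unit: {unit}")

def per_unit_price(price: float, pack_size: int) -> float:
    """Compare packaging on a per-unit basis."""
    return price / pack_size

# 43.5 psi and 3 bar describe the same rating once canonicalized.
```

Without this layer, the "box of 10" record and the single-unit record in the example above look like different products with wildly different prices.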
Mistake 3, collapsing variants too aggressively
Catalog teams often want deduplication, but what they really need is identity structure.
Merging all near-duplicate records into one canonical node may simplify the database while making the buyer experience worse. A better pattern is:
- canonical product family
- linked sellable variants
- linked aliases and historical references
- explicit substitution and supersession edges
That preserves precision while still giving the AI one coherent identity graph to work with.
Mistake 4, treating identity as a one-time cleanup project
Catalog identity is not a migration chore you finish once.
New suppliers arrive. Manufacturers rename product lines. Legacy part numbers stay alive in customer emails for years. Feed quality drifts. If matching logic is not part of your ongoing ingestion and monitoring process, entropy comes back fast.
The right mental model is not “dedupe the catalog.” It is “maintain catalog identity continuously.”
How Better Entity Resolution Improves RAG
This is where the payoff becomes obvious.
Cleaner retrieval
When duplicate records are linked correctly, retrieval can aggregate evidence from the same product identity instead of scattering relevance across near-duplicate chunks. That improves both recall and precision.
Better answer grounding
If a product has one canonical identity with linked aliases, the assistant can answer a question asked with an old part number while grounding the response in the current approved record.
That is much more useful than returning “no results” or answering from a superseded PDF.
Fewer contradictions in context
A lot of RAG “hallucinations” are really context collisions. The model receives two records that look related but disagree on critical details. Better entity resolution reduces those collisions before they ever reach the answer stage.
Stronger zero-result recovery
Many zero-result searches are not true misses. They are identity misses.
The buyer searched with a supplier nickname, a historical code, or a shorthand product family label that is absent from the canonical catalog record. If your identity layer stores aliases and synonym edges, the system can recover gracefully. That pairs naturally with the techniques we covered in zero-result search and query expansion with HyDE.
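The recovery step can be sketched as an alias lookup that runs before semantic search. The index structure and names here are assumptions; the alias index would be built from the identity layer's stored aliases and historical codes.

```python
def normalize(s: str) -> str:
    """Cheap term normalization: lowercase, alphanumerics only."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def resolve_query_term(term: str, alias_index: dict):
    """Map a non-canonical search term (old code, nickname,
    supplier alias) to a canonical product ID, or None."""
    return alias_index.get(normalize(term))

# Hypothetical index built from the identity layer's alias edges.
alias_index = {
    normalize("AB-4100"): "PROD-0042",
    normalize("old pneumatic drive"): "PROD-0042",
}

# "ab 4100" resolves even though it never appears verbatim
# in the canonical catalog record.
```

If the lookup misses, the query falls through to semantic retrieval as usual; the alias layer only ever adds recall, never removes it.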
A Minimal Data Model That Works
You do not need a huge master data program to get started. A practical model can be quite small.
For each canonical product identity, store:
- canonical product ID
- source records linked to it
- exact identifiers, normalized and raw
- alias list and historical codes
- manufacturer and brand normalization
- family membership
- critical attribute fingerprint
- relationship edges, such as supersedes, substitute_for, accessory_of
- confidence score and review status
That gives both your retrieval system and your operators something usable.
One helpful pattern is to compute an attribute fingerprint from the specs that matter most in your category, for example:
manufacturer + product_family + voltage + diameter + material + certification
It will not uniquely identify every product, but it creates a compact comparison surface for candidate generation and anomaly detection.
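One way to implement the fingerprint, sketched below with the example spec list from above. Keeping missing fields as empty slots means sparse records still align positionally instead of producing shifted, incomparable strings.

```python
def attribute_fingerprint(record: dict, keys: list) -> str:
    """Compact comparison surface from the specs that matter
    in a category. Missing fields become empty slots."""
    parts = [str(record.get(k, "")).strip().lower() for k in keys]
    return "|".join(parts)

# Spec list from the example above; adjust per category.
KEYS = ["manufacturer", "product_family", "voltage", "diameter",
        "material", "certification"]

fp = attribute_fingerprint(
    {"manufacturer": "Siemens", "product_family": "4100",
     "voltage": "24V", "material": "stainless"},
    KEYS,
)
# fp == "siemens|4100|24v||stainless|"
```

Two records whose fingerprints differ only in empty slots are candidates for enrichment; two that differ in a filled slot like voltage are variants, not duplicates.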
What Good Looks Like in Production
A mature entity-resolution layer should make three things happen consistently:
- A user can search with old or informal terms and still reach the current product.
- The AI can distinguish between same-family similarity and exact-item equivalence.
- Retrieval can combine all relevant knowledge around one product identity without blending incompatible variants.
That is the difference between a product chatbot that feels clever in demos and a product knowledge system that sales teams, buyers, and support staff actually trust.
Entity resolution rarely gets top billing in AI roadmaps because it sits below the shiny layer. But if you are building conversational product discovery, technical Q&A, substitution workflows, or guided selling, it is foundational.
The system cannot answer confidently about “the same product” until it knows what “the same product” means.
If your catalog has duplicate records, supplier aliases, and conflicting product identities across systems, Axoverna can help you turn that chaos into one searchable product knowledge layer.
We help B2B teams unify product data, improve retrieval quality, and power trustworthy AI experiences across search, chat, and sales workflows. Talk to Axoverna to see what a cleaner identity layer could do for your catalog.
Related articles
Why Session Memory Matters for Repeat B2B Buyers, and How to Design It Without Breaking Trust
The strongest B2B product AI systems do not treat every conversation like a cold start. They use session memory to preserve buyer context, speed up repeat interactions, and improve recommendation quality, while staying grounded in live product data and clear trust boundaries.
Unit Normalization in B2B Product AI: Why 1/2 Inch, DN15, and 15 mm Should Mean the Same Thing
B2B product AI breaks fast when dimensions, thread sizes, pack quantities, and engineering units are stored in inconsistent formats. Here is how to design unit normalization that improves retrieval, filtering, substitutions, and answer accuracy.
Source-Aware RAG: How to Combine PIM, PDFs, ERP, and Policy Content Without Conflicting Answers
Most product AI failures are not caused by weak models, but by mixing sources with different authority levels. Here is how B2B teams design source-aware RAG that keeps specs, availability, pricing rules, and policy answers aligned.