The Sherlock Holmes for your master data: Detecting duplicates with a data science solution

Duplicates are a business risk – and we tackle it with data science: Duplicate records in SAP master data are more than just a cosmetic issue. They cause confusion in sales, flawed analyses, inefficient processes, and high costs. Our self-developed, data science-powered duplicate detection reliably identifies duplicate and even erroneous entries – even in large datasets.


Who hasn’t experienced it: a small typo – big confusion. When entering customer data, an “a” quickly becomes an “s,” and suddenly a seemingly new customer is created in the system – even though they already exist. The result: duplicates.

Why duplicates threaten your data quality

Duplicate records in SAP master data – whether for customers, suppliers, or materials – lead to inconsistencies, flawed analyses, and unnecessary extra effort. It becomes especially critical when different departments work with different versions of the same entity. This costs time, money, and in the worst case, leads to wrong decisions and loss of trust.

Our solution: Data science-powered duplicate detection for SAP systems

In the Business Unit Data Analytics & AI, we have developed an intelligent solution that automates the checking of SAP master data for duplicates. The key: the use of data science makes it possible to reliably detect even erroneous or slightly deviating entries.


How it works

The challenge in duplicate detection lies in the sheer number of possible comparison pairs: for N entries there are N(N−1)/2, i.e. roughly N²/2, combinations. Ten times the number of master data entries therefore means roughly one hundred times the number of comparison pairs – 100,000 entries already yield about five billion pairs. To reduce this complexity, we use a two-step approach:

  1. Vectorization of text, categories, and numerical values: All information in the form of text (name, description), categories (customer group, material type), and numerical values (dimensions, weight) is mapped into one consistent vector per record. This vectorized representation lets us use highly efficient linear algebra routines to compare each vector with every other vector – that is, each master data entry with every other.
  2. Preclustering: Where possible and necessary, we form clusters in advance within which we search for duplicates. This significantly reduces complexity and computational effort. Both steps are illustrated in the sketch below.
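
What these two steps can look like in code is shown in the following minimal Python sketch using pandas and scikit-learn. The field names (name, customer_group, weight_kg), the character-n-gram TF-IDF features, the use of the customer group as cluster key, and the 0.9 threshold are illustrative assumptions, not details of our productive solution:

    # Minimal sketch of both steps; all field names, features, and the
    # similarity threshold are illustrative assumptions.
    import pandas as pd
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    records = pd.DataFrame({
        "name":           ["Acme GmbH", "Acme GmbH.", "Beta AG", "Gamma SE"],
        "customer_group": ["A",         "A",          "B",       "B"],
        "weight_kg":      [10.0,        10.0,         25.0,      7.5],
    })

    # Step 1 – vectorization: text becomes character-n-gram TF-IDF features
    # (robust to single-letter typos), categories become one-hot columns,
    # numbers are standardized; everything is stacked into one vector per record.
    text_features = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
    X_text = text_features.fit_transform(records["name"])
    X_cat = OneHotEncoder().fit_transform(records[["customer_group"]])
    X_num = StandardScaler().fit_transform(records[["weight_kg"]])
    X = hstack([X_text, X_cat, X_num]).tocsr()

    # Step 2 – preclustering: compare only records that share a cluster key
    # (here simply the customer group), which shrinks the number of pairs.
    candidates = []
    for _, row_ids in records.groupby("customer_group").groups.items():
        row_ids = list(row_ids)
        sims = cosine_similarity(X[row_ids])   # one matrix operation per cluster
        for i in range(len(row_ids)):
            for j in range(i + 1, len(row_ids)):
                if sims[i, j] > 0.9:           # tunable similarity threshold
                    candidates.append((row_ids[i], row_ids[j], round(float(sims[i, j]), 3)))

    print(candidates)  # the near-identical "Acme GmbH" pair should surface here

This also shows why the vector representation pays off: the expensive all-pairs comparison collapses into a single cosine-similarity matrix call per cluster instead of a large number of individual string comparisons.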

Optionally, we can use Locality-Sensitive Hashing (LSH): a mathematical method that reduces the dimensionality of the vectorized information – which can quickly reach tens of thousands – while preserving the similarity relationships between any two vectors. This reduces computational effort by orders of magnitude without compromising the accuracy of duplicate detection.
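
To illustrate the idea, here is a minimal sketch of one common LSH family – random-hyperplane hashing, also known as SimHash – which preserves cosine similarity; whether our solution uses this exact variant, and with which parameters, is left open here:

    # Minimal sketch of random-hyperplane LSH (SimHash): each high-dimensional
    # vector is compressed into a short bit signature; the fraction of matching
    # bits approximates the cosine similarity of the original vectors.
    import numpy as np

    rng = np.random.default_rng(42)

    def lsh_signatures(X, n_bits=64):
        """Project the d-dimensional rows of X onto n_bits random hyperplanes."""
        planes = rng.standard_normal((X.shape[1], n_bits))  # shape: d x n_bits
        return (X @ planes) >= 0                            # boolean bit signatures

    def bit_similarity(a, b):
        """Fraction of matching bits: near 1.0 for near-duplicates, ~0.5 otherwise."""
        return float(np.mean(a == b))

    d = 10_000                                  # original dimensionality
    x = rng.standard_normal(d)
    y = x + 0.01 * rng.standard_normal(d)       # near-duplicate of x
    z = rng.standard_normal(d)                  # unrelated vector

    signatures = lsh_signatures(np.vstack([x, y, z]))    # 3 x 64 bits instead of 3 x 10,000 floats
    print(bit_similarity(signatures[0], signatures[1]))  # high, close to 1.0
    print(bit_similarity(signatures[0], signatures[2]))  # around 0.5

Because each record is now only a few dozen bits, candidate pairs can be screened with cheap bit comparisons, or bucketed by signature segments so that only entries with matching segments are compared at all – that is where the order-of-magnitude savings come from.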


The result: clean data, clear decisions

Our solution provides a list of potential duplicates that users can review and process.

Your benefits at a glance:

  • Higher data quality: Consistent and reliable master data
  • More efficient processes: Less manual rework
  • Better decisions: Based on clean data
  • Scalability: High performance even with large data volumes

Conclusion: Data science-powered duplicate detection – implement now and benefit

Duplicates are more than just a cosmetic issue – they are a real business risk. Our data science-powered solution remedies this and ensures clean, reliable master data. Because only those who have their data under control can make well-informed decisions.

Author of the article

Dr. Kris Holtgrewe
Consultant SAP Business Intelligence

Do you need advice or have questions?

Our pleasure – we are happy to help. Just write to us and let us know how you would prefer to be contacted.
