Duplicate Data Resolution Techniques

Imagine you call a prospect about a proposal, only to hear that a colleague has already called them. You end up with egg on your face, and your company looks disorganized. Investigating the problem reveals that the prospect was recorded under a different name, wasting your time and effort and damaging your brand image. The flip side can be even more severe: two separate entities are conflated on the basis of a single piece of information, disregarding other crucial facts. In one tragic case, a woman was saddled with someone else's debt based only on her first name and had to fight for years to get the debt expunged. This happened because of poor duplicate data resolution techniques in the system.

A piece of data that gets created more than once, or that appears similar or identical to another record, is duplicate data. For example:

  • Ninaad Joshi, Ninad Joshi, Joshi Ninaad, N Joshi, Joshi N

  • XYZ International LLC, XYZ International, XYZ Private Limited

  • 150 Shell Boulevard, 150 Shell Blvd

  • Sr. Vice President, Senior Vice President, SVP

We have been plagued by duplicate data ever since we first started keeping records, and the problem has only worsened with the advent of digital data. The crucial difference is that digital data is trivial to create, copy, and merge (often unintentionally) compared with older mediums, riddling the data with typos, inconsistent representations and formatting, and inadequate metadata, all of which create ambiguities that hinder automatic resolution. The push for automation and a renewed emphasis on ease of use have only exacerbated the problem. Shockingly, a data science report from 2016 highlights that “60% [of data scientists] said they spent most of their time cleaning and organizing data.” Duplicate data wastes users’ time and energy by introducing inconsistent updates and a constant need for manual resolution. Although data duplication is inevitable, it can be greatly reduced with a combination of processes and practices, software solutions and expert advice.

Attack the problem at the source by preventing duplication at data entry. Manual data entry is perhaps the entry point most culpable for data duplication. Guidelines, standardized templates and a strong review system as well as an alert informing data operators of possible duplicates existing in the system when a new entry is added will ameliorate some of the pain. Naturally, the next step is to provide a tool to check existing records for duplicate data. In the rest of this article, I will analyze the problems with solutions currently in use and propose my own improvements on these.

We must also deal with errors in names. Humans generally refer to things by names, not identifiers; even when a unique identifier is issued, names remain pivotal in organizing, searching and managing data. Matching similar names is complicated by variations in aliases, nicknames, abbreviations, spellings and syntactic representations. Every failed match creates a duplicate that robs us of a complete picture of that entity. Conversely, false duplicates introduce the opposite problem, where the data of multiple entities is lost or intermingled. Hence, when comparing person names, additional metadata such as email, social profile identifiers, educational history, professional history, and cities or countries lived in can be used for multi-field verification. It is therefore imperative to fill in as much information as possible, which will aid in resolving similar entities.
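The multi-field verification described above can be sketched as a weighted combination of per-field similarity scores. This is a minimal illustration, not a production matcher: the field names, weights, and the use of `difflib` as the fuzzy metric are all assumptions for the example.

```python
from difflib import SequenceMatcher

# Hypothetical field weights: an exact-match identifier like email
# carries more weight than fuzzy text fields such as name or city.
FIELD_WEIGHTS = {"email": 0.5, "name": 0.3, "city": 0.2}

def field_score(field, a, b):
    """Score one field: exact match for identifiers, fuzzy ratio for text."""
    if not a or not b:
        return 0.0  # a missing field contributes nothing to the score
    if field == "email":
        return 1.0 if a.lower() == b.lower() else 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def multi_field_score(rec_a, rec_b):
    """Weighted average of per-field scores; weights sum to 1.0."""
    return sum(
        w * field_score(f, rec_a.get(f), rec_b.get(f))
        for f, w in FIELD_WEIGHTS.items()
    )
```

Two records that share an email but spell the name slightly differently score close to 1.0, while unrelated records score near 0 — which is exactly why filling in more fields improves resolution.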

While a multitude of software techniques exist for resolving duplicate names, a hybrid approach is most apt given the complexity of names. Each entity needs to be split into individual components, each comprising one or more tokens with its own semantic and syntactic constructs. The importance of order and the presence of components vary from case to case.

  • Person names are made up of optional salutation, first name, middle name and last name components, each of which may be shortened to a nickname or initials. We must also consider patronymic and other naming practices, which differ by region.

  • A designation has an optional seniority prefix, a title and a job function, any or all of which may be abbreviated. Comparing job functions is more complex still, as every company labels its job functions differently; for example, a Hiring function is typically the same as Recruitment.

  • Addresses typically have a syntactic structure that is country specific. A US address has a street number, an optional street direction, a street name and type, optional unit information, a city, a state and a ZIP code. Components such as the street direction, street type, unit information, state and city names can be abbreviated according to regional guidelines.

  • Companies, especially multi-regional ones, may have subsidiaries in different countries, with each subsidiary complying with local incorporation standards and abbreviations. For instance, XYZ International LLC and XYZ Private Limited may be the same company operating in different countries. Moreover, XYZ Automotive and XYZ Motors may be equivalent.
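The component splitting described above can be sketched for person names. This is a simplified illustration assuming Western "First [Middle] Last" ordering; the salutation list is a stand-in for the entity-specific ontology a real system would consult.

```python
import re

# Illustrative salutation list; a production system would draw these
# from an entity-specific ontology rather than a hard-coded set.
SALUTATIONS = {"mr", "mrs", "ms", "dr", "prof"}

def tokenize_person_name(raw):
    """Split a raw name into salutation / first / middle / last components.

    Assumes "First [Middle] Last" ordering and handles a leading
    salutation and single-letter initials (with or without a period).
    """
    tokens = raw.replace(",", " ").split()
    parts = {"salutation": None, "first": None, "middle": None, "last": None}
    if tokens and tokens[0].rstrip(".").lower() in SALUTATIONS:
        parts["salutation"] = tokens.pop(0).rstrip(".")
    if tokens:
        parts["first"] = tokens[0]
        if len(tokens) >= 2:
            parts["last"] = tokens[-1]
        if len(tokens) >= 3:
            parts["middle"] = " ".join(tokens[1:-1])
        # Flag initials so a matcher can compare "N" against "Ninaad".
        parts["first_is_initial"] = bool(re.fullmatch(r"[A-Za-z]\.?", tokens[0]))
    return parts
```

With the name split this way, "N Joshi" and "Ninaad Joshi" can be compared component by component rather than as opaque strings.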

Natural Language Processing (NLP) techniques can be used to semantically tokenize a name into its component constructs. An entity-specific ontology needs to be developed to address these variations; an ontology represents formal naming conventions, definitions, properties, and the relationships between components and categories. Individual component classifiers then use different features to match the extracted components and derive a component match score. Depending on the component, various matching features can be used: distance metrics, phonetic similarity, semantic similarity, hyphens, commas, nicknames, abbreviations, aliases and syntactic rules. An overall entity match score can then be calculated as a weighted average of the component match scores.

Some of the techniques mentioned above are resource intensive and time consuming, and are best run offline. An online process analyses data inline, detecting duplicates as they first enter the system. An offline process executes at periodic intervals to analyse existing data for duplicates. An offline process does not affect data entry performance, but it requires more storage space, and duplicate data will persist until its next execution; the interval between executions therefore determines the impact of this lingering duplicate data. Choosing between online and offline processing is a trade-off between data entry performance and the impact of intermittent duplicate data.
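An offline pass can avoid comparing every record against every other record by first grouping records under a cheap blocking key and comparing only within groups. This is a sketch under simplifying assumptions: the blocking key here is just the first letter of the last name token (a stand-in for a phonetic key such as Soundex), and the pairwise comparison is a plain string ratio.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def blocking_key(record):
    """Cheap key so only plausible duplicates are compared pairwise.
    Here: first letter of the last name token (a stand-in for a
    phonetic key such as Soundex)."""
    name = record["name"].strip().lower()
    return name.split()[-1][0] if name else ""

def find_duplicate_pairs(records, threshold=0.85):
    """Offline batch pass: block records, then compare within each block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = records[members[i]], records[members[j]]
                score = SequenceMatcher(
                    None, a["name"].lower(), b["name"].lower()
                ).ratio()
                if score >= threshold:
                    pairs.append((members[i], members[j]))
    return pairs
```

Blocking trades a small risk of missed matches (records landing in different blocks) for a large reduction in pairwise comparisons, which is what makes periodic offline runs over a full dataset feasible.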

By applying these techniques you will be able to eliminate duplicate data to a great extent, increasing consumer confidence in your product and trust in your brand. Furthermore, you will be able to avoid blunders that cause you and your customers much pain and embarrassment.