Data matching refers to the process of identifying whether two records refer to the same real-world entity, for example whether the tuples (David Smith, JHU, 35) and (Dave Smith, John Hopkins, 37) describe the same person. This problem is also often referred to as record linkage, entity matching, entity resolution, reference reconciliation, or deduplication.
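To make the matching question concrete, here is a minimal sketch in Python that compares the two example tuples. It is an illustration only, not the method of any tool listed here: the helper names `field_similarity` and `is_probable_match`, the 0.8 name threshold, and the 3-year age tolerance are all assumptions chosen for this example, and the string similarity comes from the standard library's `difflib`.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_match(rec_a, rec_b, name_threshold=0.8, max_age_gap=3):
    """Hypothetical rule: names must be similar and ages must be close.

    Records are (name, affiliation, age) tuples. The affiliation is ignored
    here because "JHU" and "John Hopkins" share almost no characters; a real
    system would need an abbreviation table or a learned model for that field.
    """
    name_a, _, age_a = rec_a
    name_b, _, age_b = rec_b
    names_similar = field_similarity(name_a, name_b) >= name_threshold
    ages_close = abs(age_a - age_b) <= max_age_gap
    return names_similar and ages_close


a = ("David Smith", "JHU", 35)
b = ("Dave Smith", "John Hopkins", 37)
print(is_probable_match(a, b))
```

Hand-written thresholds like these are brittle, which is one reason several of the tools below learn the matching rule from labeled examples instead.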
Data merging refers to the process of combining data from different sources into a single format. For example, we may wish to merge two tuples (D. M. Smith, JHU, 35) and (Dave Smith, John Hopkins, 37) into a single tuple. Here, we face problems such as how to reconcile the conflicting age values of 35 and 37. This problem is also sometimes referred to as entity merging or data fusion, and it may leverage data cleaning tools to resolve inconsistencies. Data merging may also leverage schema mapping tools to generate a transformation script that migrates data from different sources into a single format before data cleaning is applied.
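The reconciliation step can be sketched as a set of per-field conflict-resolution policies. The policies below (keep the longer string, average the ages) and the function name `merge_records` are assumptions made for illustration; other policies, such as preferring the most recently updated or most trusted source, may be more appropriate in practice.

```python
from statistics import mean


def merge_records(rec_a, rec_b):
    """Merge two (name, affiliation, age) tuples already judged to be the same entity.

    Hypothetical policies, one per field:
      - name and affiliation: keep the longer string, assuming it is more complete
      - age: average the two values and round to the nearest integer
    """
    name = max(rec_a[0], rec_b[0], key=len)
    affiliation = max(rec_a[1], rec_b[1], key=len)
    age = round(mean([rec_a[2], rec_b[2]]))
    return (name, affiliation, age)


merged = merge_records(("D. M. Smith", "JHU", 35), ("Dave Smith", "John Hopkins", 37))
print(merged)
```

Keeping each policy in one place makes it easy to swap in a different rule for a single field without touching the others.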
We are not aware of publicly available tools built specifically for data merging. However, as mentioned, data merging may be implemented as a schema mapping or data transformation followed by data cleaning.
For data matching, we describe some tools below (the list is non-exhaustive and in no particular order):
- Magellan is a publicly available entity matching tool in Python (py_entitymatching package) developed by the University of Wisconsin. It supports matching two tables (or one table against itself) using supervised learning techniques. The website provides further documentation on the py_entitymatching package.
Both schema matching and data matching use string matching as a fundamental building block. Magellan also includes a py_stringmatching package for string matching. See the Magellan website for a discussion of available open-source string matching packages.
- dedupe, developed in Python by datamade, uses machine learning techniques to match and deduplicate entities over structured data.
- febrl (Freely Extensible Biomedical Record Linkage), developed in Python by the Australian National University, matches entities by standardizing and cleaning data before “fuzzily matching” the records.
- pydeduple is a deduplication tool written in Python, originally developed as an internal tool for linking a directory database. It first groups records based on some measures and then, within each group, compares each pair of records and classifies the pair as a match or a non-match. There is also a data matching package in R called recordlinkage.
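The group-then-compare procedure described for pydeduple is commonly known as blocking followed by pairwise comparison. The sketch below illustrates the general idea using only the Python standard library; it is not pydeduple's actual implementation or API. The blocking key (first letter of the last name token), the `difflib` similarity, and the 0.8 threshold are assumptions chosen for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: lowercased first letter of the last name token."""
    return record[0].split()[-1][0].lower()


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_matches(records, threshold=0.8):
    """Group records into blocks, then compare pairs only within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for group in blocks.values():
        # Pairs in different blocks are never compared, which is what
        # saves work relative to comparing every record with every other.
        for rec_a, rec_b in combinations(group, 2):
            if similarity(rec_a[0], rec_b[0]) >= threshold:
                matches.append((rec_a, rec_b))
    return matches


records = [
    ("David Smith", "JHU", 35),
    ("Dave Smith", "John Hopkins", 37),
    ("Sara Stone", "MIT", 41),
    ("Alice Jones", "UW", 29),
]
matches = find_matches(records)
print(matches)
```

The trade-off is that a poorly chosen blocking key can place true duplicates in different blocks, where they are never compared, so the key should be tolerant of the variations expected in the data.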
*Are there more tools to add to the list? Tell us about them.