A Hands-on Tutorial

BigGorilla in Action!

The main goal of this tutorial is to highlight the tools that BigGorilla provides for “entity matching” problems. The workflow presented here integrates two movie datasets acquired from different sources. The entity-matching step is discussed in the last part of the tutorial (Part 4), but we recommend reading Parts 1-3 first, where we showcase how existing Python packages can be used to prepare the data for the entity-matching task.

Part 1: Data Acquisition

We will start by using urllib, a popular python package for fetching data across the web, to download the datasets that we need for this tutorial.

Step 1: Downloading the “Kaggle 5000 Movie Dataset”

The desired dataset is a .csv file with a url that is specified in the code snippet below.

In [1]:
# Importing urllib
import urllib
import os

# Creating the data folder
if not os.path.exists('./data'):
    os.makedirs('./data')

# Obtaining the dataset using the url that hosts it
kaggle_url = 'https://github.com/sundeepblue/movie_rating_prediction/raw/master/movie_metadata.csv'
if not os.path.exists('./data/kaggle_dataset.csv'):     # avoid downloading if the file exists
    response = urllib.urlretrieve(kaggle_url, './data/kaggle_dataset.csv')
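
The cell above calls urllib.urlretrieve, which is where the function lives in Python 2. As a minimal sketch for readers on Python 3, where the same function is exposed as urllib.request.urlretrieve, the download-if-missing logic could be wrapped as follows. The helper name download_if_missing and its fetch parameter are hypothetical and only added so the logic can be tried without network access.

```python
import os
from urllib.request import urlretrieve  # Python 3 location of urlretrieve


def download_if_missing(url, path, fetch=urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download was attempted and False if the file was
    already present. `fetch` is a hypothetical hook for offline testing.
    """
    folder = os.path.dirname(path)
    if folder and not os.path.exists(folder):
        os.makedirs(folder)          # create the data folder on demand
    if os.path.exists(path):
        return False                 # avoid downloading if the file exists
    fetch(url, path)
    return True
```

With this helper, the Kaggle download would read download_if_missing(kaggle_url, './data/kaggle_dataset.csv').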

Step 2: Downloading the “IMDB Plain Text Data”

The IMDB Plain Text Data (see here) is a collection of files where each file describes one or a few attributes of a movie. We focus on a subset of movie attributes, so we are only interested in a few of these files, which are listed below:

  • genres.list.gz
  • ratings.list.gz

** Note: The total size of the files mentioned above is roughly 30 MB. Running the following code may take a few minutes.

In [2]:
import gzip

# Obtaining IMDB's text files
imdb_url_prefix = 'ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/'
imdb_files_list = ['genres.list.gz', 'ratings.list.gz']
for name in imdb_files_list:
    if not os.path.exists('./data/' + name):
        response = urllib.urlretrieve(imdb_url_prefix + name, './data/' + name)
        urllib.urlcleanup()   # urllib fails to download two files from a ftp source. This fixes the bug!
        with gzip.open('./data/' + name) as comp_file, open('./data/' + name[:-3], 'w') as reg_file:
            file_content = comp_file.read()
            reg_file.write(file_content)
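
The decompression step above reads the whole archive into memory and writes it to a file opened in text mode, which works in Python 2. In Python 3, gzip.open returns bytes by default, so the output file has to be opened in binary mode. The following is a small sketch under that assumption; the helper name gunzip is hypothetical, and streaming with shutil.copyfileobj is a design choice that avoids holding the full file in memory.

```python
import gzip
import shutil


def gunzip(src_path, dst_path=None):
    """Decompress a .gz file and return the path of the decompressed copy."""
    if dst_path is None:
        dst_path = src_path[:-3]     # drop the trailing '.gz', as in the cell above
    with gzip.open(src_path, 'rb') as comp_file, open(dst_path, 'wb') as reg_file:
        shutil.copyfileobj(comp_file, reg_file)   # copy in chunks
    return dst_path
```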

Step 3: Downloading the “IMDB Prepared Data”

During this tutorial, we discuss how the contents of genres.list.gz and ratings.list.gz files can be integrated. However, to make the tutorial more concise, we avoid including the same process for all the files in the “IMDB Plain Text Data”. The “IMDB Prepared Data” is the dataset that we obtained by integrating a number of files from the “IMDB Plain Text Data” which we will use during later stages of this tutorial. The following code snippet downloads this dataset.

In [3]:
imdb_url = 'https://anaconda.org/BigGorilla/datasets/1/download/imdb_dataset.csv'
if not os.path.exists('./data/imdb_dataset.csv'):     # avoid downloading if the file exists
    response = urllib.urlretrieve(imdb_url, './data/imdb_dataset.csv')

Part 2: Data Extraction

The “Kaggle 5000 Movie Dataset” is stored in a .csv file, which is already structured and ready to use. On the other hand, the “IMDB Plain Text Data” is a collection of semi-structured text files that need to be processed to extract the data. A quick look at the first few lines of each file shows that each file has a different format and has to be handled separately.

Content of “ratings.list” data file

In [4]:
with open("./data/ratings.list") as myfile:
    head = [next(myfile) for x in range(38)]
print (''.join(head[28:38]))   # skipping the first 28 lines as they are descriptive headers
      0000000125  1728818   9.2  The Shawshank Redemption (1994)
      0000000125  1181412   9.2  The Godfather (1972)
      0000000124  810055   9.0  The Godfather: Part II (1974)
      0000000124  1714042   8.9  The Dark Knight (2008)
      0000000133  461310   8.9  12 Angry Men (1957)
      0000000133  885509   8.9  Schindler's List (1993)
      0000000123  1354135   8.9  Pulp Fiction (1994)
      0000000124  1241908   8.9  The Lord of the Rings: The Return of the King (2003)
      0000000123  514540   8.9  Il buono, il brutto, il cattivo (1966)
      0000000133  1380148   8.8  Fight Club (1999)

Content of the “genres.list” data file

In [5]:
with open("./data/genres.list") as myfile:
    head = [next(myfile) for x in range(392)]
print (''.join(head[382:392]))   # skipping the first 382 lines as they are descriptive header
"!Next?" (1994)						Documentary
"#1 Single" (2006)					Reality-TV
"#15SecondScare" (2015)					Horror
"#15SecondScare" (2015)					Short
"#15SecondScare" (2015)					Thriller
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Drama
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Horror
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Short
"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}	Thriller
"#1MinuteNightmare" (2014)				Horror

Step 1: Extracting the information from “genres.list”

The goal of this step is to extract the movie titles, their production years, and their genres from “genres.list”, and store the extracted data in a dataframe. The dataframe (from the Python package pandas) is one of the key tools commonly used for data profiling and cleaning. To extract the desired information from the text, we rely on regular expressions, which are implemented in the Python package “re”.

In [6]:
import re
import pandas as pd

with open("./data/genres.list") as genres_file:
    raw_content = genres_file.readlines()
    genres_list = []
    content = raw_content[382:]
    for line in content:
        m = re.match(r'"?(.*[^"])"? \(((?:\d|\?){4})(?:/\w*)?\).*\s((?:\w|-)+)', line.strip())
        genres_list.append([m.group(1), m.group(2), m.group(3)])
    genres_data = pd.DataFrame(genres_list, columns=['movie', 'year', 'genre'])
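
To see what the three capture groups return, the sketch below applies the same pattern to two of the sample lines shown earlier. The function name parse_genre_line is hypothetical, and returning None for a non-matching line is an added guard that the cell above does not have.

```python
import re

# Same pattern as in the cell above, compiled once for reuse.
GENRE_LINE = re.compile(r'"?(.*[^"])"? \(((?:\d|\?){4})(?:/\w*)?\).*\s((?:\w|-)+)')


def parse_genre_line(line):
    """Return (title, year, genre), or None if the line does not match."""
    m = GENRE_LINE.match(line.strip())
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


series = parse_genre_line('"#1 Single" (2006)\t\t\t\t\tReality-TV')
episode = parse_genre_line(
    '"#15SecondScare" (2015) {Who Wants to Play with the Rabbit? (#1.2)}\tDrama')
print(series)    # ('#1 Single', '2006', 'Reality-TV')
print(episode)   # ('#15SecondScare', '2015', 'Drama')
```

Note that the episode name in braces is not captured, so an episode line yields the same title and year as its series.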

Step 2: Extracting the information from “ratings.list”

In [7]:
with open("./data/ratings.list") as ratings_file:
    raw_content = ratings_file.readlines()
    ratings_list = []
    content = raw_content[28:]
    for line in content:
        m = re.match(r'(?:\d|\.|\*){10}\s+\d+\s+(1?\d\.\d)\s"?(.*[^"])"? \(((?:\d|\?){4})(?:/\w*)?\)', line.strip())
        if m is None: continue
        ratings_list.append([m.group(2), m.group(3), m.group(1)])
    ratings_data = pd.DataFrame(ratings_list, columns=['movie', 'year', 'rating'])
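
As a quick check of the ratings pattern, the following sketch runs it on one line laid out like the sample output above. The name parse_rating_line is hypothetical. Two details are additions rather than part of the cell above: the captured title is stripped, because with two spaces between the rating and the title the title group starts with a space, and the rating is converted to a float.

```python
import re

# Same pattern as in the cell above.
RATING_LINE = re.compile(
    r'(?:\d|\.|\*){10}\s+\d+\s+(1?\d\.\d)\s"?(.*[^"])"? \(((?:\d|\?){4})(?:/\w*)?\)')


def parse_rating_line(line):
    """Return (title, year, rating), or None for header and malformed lines."""
    m = RATING_LINE.match(line.strip())
    if m is None:
        return None
    # strip() and float() are additions for illustration
    return m.group(2).strip(), m.group(3), float(m.group(1))


parsed = parse_rating_line(
    '      0000000125  1728818   9.2  The Shawshank Redemption (1994)')
print(parsed)   # ('The Shawshank Redemption', '1994', 9.2)
```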

Note that one has to repeat the information extraction procedure for the other data files as well if their content is of interest. For now (and to keep the tutorial simple), we assume that we are only interested in the genres and ratings of movies. The above code snippets store the extracted data on these two attributes in two dataframes (namely, genres_data and ratings_data).

Part 3: Data Profiling & Cleaning

The high-level goal in this stage of data preparation is to look into the data that we have acquired and extracted so far. This helps us get familiar with the data, understand in what ways it needs cleaning or transformation, and finally prepare it for the following steps of the data integration task.

Step 1: Loading the “Kaggle 5000 Movie Dataset”

For this step, we rely on dataframes (from the Python package pandas) as they are designed to assist users in data exploration and data profiling tasks. In Part 2 of the tutorial, we stored the data extracted from the “IMDB Plain Text Data” in dataframes. It is appropriate to load the “Kaggle 5000 Movie Dataset” into a dataframe as well and follow the same data profiling procedure for all datasets.

In [8]:
import pandas as pd

# Loading the Kaggle dataset from the .csv file (kaggle_dataset.csv)
kaggle_data = pd.read_csv('./data/kaggle_dataset.csv')

Step 2: Calculating Some Basic Statistics (Profiling)

Let’s start by finding out how many movies are listed in each dataframe.

In [9]:
print ('Number of movies in kaggle_data: {}'.format(kaggle_data.shape[0]))
print ('Number of movies in genres_data: {}'.format(genres_data.shape[0]))
print ('Number of movies in ratings_data: {}'.format(ratings_data.shape[0]))
Number of movies in kaggle_data: 5043
Number of movies in genres_data: 2384400
Number of movies in ratings_data: 691621

We can also check whether we have duplicates (i.e., a movie appearing more than once) in the data. We consider an entry a duplicate if we can find another entry with the same movie title and production year.

In [10]:
print ('Number of duplicates in kaggle_data: {}'.format(
    sum(kaggle_data.duplicated(subset=['movie_title', 'title_year'], keep=False))))
print ('Number of duplicates in genres_data: {}'.format(
    sum(genres_data.duplicated(subset=['movie', 'year'], keep=False))))
print ('Number of duplicates in ratings_data: {}'.format(
    sum(ratings_data.duplicated(subset=['movie', 'year'], keep=False))))
Number of duplicates in kaggle_data: 241
Number of duplicates in genres_data: 1807712
Number of duplicates in ratings_data: 286515

Step 3: Dealing with duplicates (cleaning)

There are many strategies for dealing with duplicates. Here, we use a simple one: keep only the first occurrence of a duplicated entry and remove the rest.

In [11]:
kaggle_data = kaggle_data.drop_duplicates(subset=['movie_title', 'title_year'], keep='first').copy()
genres_data = genres_data.drop_duplicates(subset=['movie', 'year'], keep='first').copy()
ratings_data = ratings_data.drop_duplicates(subset=['movie', 'year'], keep='first').copy()
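
The counting and cleaning steps use different keep settings, which is easy to overlook. The toy sketch below (made-up rows, not taken from the datasets) shows that duplicated(..., keep=False) flags every row of a duplicated group, while drop_duplicates(..., keep='first') retains one row per group.

```python
import pandas as pd

# Toy data: the first two rows share the same title and year.
toy = pd.DataFrame({
    'movie': ['Up', 'Up', 'Brave'],
    'year': ['2009', '2009', '2012'],
    'genre': ['Animation', 'Comedy', 'Animation'],
}, columns=['movie', 'year', 'genre'])

# keep=False marks both members of the duplicated pair
n_flagged = int(toy.duplicated(subset=['movie', 'year'], keep=False).sum())

# keep='first' retains the first row of each (movie, year) group
deduped = toy.drop_duplicates(subset=['movie', 'year'], keep='first').copy()

print(n_flagged)                  # 2
print(deduped['genre'].tolist())  # ['Animation', 'Animation']
```

In this toy example the second genre of “Up” is dropped. The same effect applies to any table that stores one row per genre, as the genres.list sample above appears to do.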

Step 4: Normalizing the text (cleaning)

The key attribute that we will use to integrate our movie datasets is the movie title, so it is important to normalize these titles. The following code snippet converts all movie titles to lower case, replaces commas with spaces, removes apostrophes and question marks, and replaces “&” with “and”.

In [12]:
def preprocess_title(title):
    title = title.lower()
    title = title.replace(',', ' ')
    title = title.replace("'", '')    
    title = title.replace('&', 'and')
    title = title.replace('?', '')
    title = title.decode('utf-8', 'ignore')
    return title.strip()

kaggle_data['norm_movie_title'] = kaggle_data['movie_title'].map(preprocess_title)
genres_data['norm_movie'] = genres_data['movie'].map(preprocess_title)
ratings_data['norm_movie'] = ratings_data['movie'].map(preprocess_title)
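
The call to title.decode in preprocess_title assumes Python 2 byte strings. In Python 3, str objects are already Unicode and have no decode method. A minimal Python 3 sketch of the same normalization could look as follows; the name preprocess_title_py3 is hypothetical, and decoding only when the input is bytes is an assumption about how one might keep both input types working.

```python
def preprocess_title_py3(title):
    """Python 3 variant of preprocess_title (hypothetical name)."""
    if isinstance(title, bytes):
        # only byte strings need decoding in Python 3
        title = title.decode('utf-8', 'ignore')
    title = title.lower()
    title = title.replace(',', ' ')
    title = title.replace("'", '')
    title = title.replace('&', 'and')
    title = title.replace('?', '')
    return title.strip()


print(preprocess_title_py3("Pirates of the Caribbean: At World's End "))
# pirates of the caribbean: at worlds end
```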

Step 5: Looking at a few samples

The goal here is to take a look at a few sample entries from each dataset for a quick sanity check. To keep the tutorial concise, we present this step only for the “Kaggle 5000 Movie Dataset”, which is stored in the kaggle_data dataframe.

In [13]:
kaggle_data.sample(3, random_state=0)
Out[13]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes norm_movie_title
4422 Color Simeon Rice 6.0 93.0 6.0 56.0 Lisa Brave 393.0 NaN Action|Horror|Thriller English USA R 1500000.0 2014.0 191.0 5.5 2.35 307 unsullied
1022 Color Doug Liman 214.0 108.0 218.0 405.0 Ty Burrell 6000.0 9528092.0 Biography|Drama|Thriller English USA PG-13 22000000.0 2010.0 3000.0 6.8 2.35 9000 fair game
3631 Color Jonathan Levine 147.0 99.0 129.0 362.0 Aaron Yoo 976.0 2077046.0 Comedy|Drama|Romance English USA R 6000000.0 2008.0 617.0 7.0 2.35 0 the wackness

3 rows × 29 columns

Looking at the data helps us decide in what ways we might want to clean it. For instance, the small sample shown above reveals that the title_year attribute is stored as floats (i.e., floating-point numbers). We can add another cleaning step that transforms title_year into strings and replaces the missing title years with the symbol “?”.

In [14]:
def preprocess_year(year):
    if pd.isnull(year):
        return '?'
    else:
        return str(int(year))

kaggle_data['norm_title_year'] = kaggle_data['title_year'].map(preprocess_year)
kaggle_data.head()
Out[14]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes norm_movie_title norm_title_year
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000 avatar 2009
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0 pirates of the caribbean: at worlds end 2007
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000 spectre 2015
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000 the dark knight rises 2012
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary NaN NaN NaN NaN 12.0 7.1 NaN 0 star wars: episode vii – the force awakens ?

5 rows × 30 columns

Part 4: Data Matching & Merging

The main goal in this part is to match the data that we have acquired from different sources to create a single rich dataset. Recall that in Part 3, we transformed all datasets into dataframes, which we used to clean the data. In this part, we continue using the same dataframes for the data that we have prepared so far.

Step 1: Integrating the “IMDB Plain Text Data” files

Note that both the ratings_data and genres_data dataframes contain data that come from the same source (i.e., the “IMDB Plain Text Data”). Thus, we assume that there are no inconsistencies between the data stored in these dataframes, and to combine them, all we need to do is match the entries that share the same title and production year. This “exact match” can be done directly with dataframes.

In [15]:
brief_imdb_data = pd.merge(ratings_data, genres_data, how='inner', on=['norm_movie', 'year'])
brief_imdb_data.head()
Out[15]:
movie_x year rating norm_movie movie_y genre
0 The Shawshank Redemption 1994 9.2 the shawshank redemption The Shawshank Redemption Crime
1 The Godfather 1972 9.2 the godfather The Godfather Crime
2 The Godfather: Part II 1974 9.0 the godfather: part ii The Godfather: Part II Crime
3 The Dark Knight 2008 8.9 the dark knight The Dark Knight Action
4 12 Angry Men 1957 8.9 12 angry men 12 Angry Men Crime
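
The movie_x and movie_y columns in the output come from how pandas handles non-key columns that exist in both inputs. The toy sketch below (made-up frames with the same column names as ratings_data and genres_data) reproduces this behavior on a single matching row.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    'movie': ['The Godfather', 'Fight Club'],
    'year': ['1972', '1999'],
    'rating': ['9.2', '8.8'],
    'norm_movie': ['the godfather', 'fight club'],
}, columns=['movie', 'year', 'rating', 'norm_movie'])

genres_toy = pd.DataFrame({
    'movie': ['The Godfather', 'Up'],
    'year': ['1972', '2009'],
    'genre': ['Crime', 'Animation'],
    'norm_movie': ['the godfather', 'up'],
}, columns=['movie', 'year', 'genre', 'norm_movie'])

# Inner join: only (norm_movie, year) pairs present in both frames survive.
merged = pd.merge(ratings_toy, genres_toy, how='inner', on=['norm_movie', 'year'])

print(list(merged.columns))
# ['movie_x', 'year', 'rating', 'norm_movie', 'movie_y', 'genre']
print(len(merged))   # 1
```

Because 'movie' is not part of the join key but appears in both frames, pandas keeps both copies and appends its default suffixes _x (left) and _y (right).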

We refer to the dataset created above as the brief_imdb_data since it only contains two attributes (namely, genre and rating). Henceforth, we are going to use a richer version of the IMDB dataset which we created by integrating a number of files from the “IMDB Plain Text Data”. If you have completed the first part of this tutorial, then this dataset is already downloaded and stored in “imdb_dataset.csv” under the “data” folder. The following code snippet loads this dataset, does preprocessing on the title and production year of movies, removes the duplicates as before, and prints the size of the dataset.

In [16]:
# reading the new IMDB dataset
imdb_data = pd.read_csv('./data/imdb_dataset.csv')
# let's normalize the title as we did in Part 3 of the tutorial
imdb_data['norm_title'] = imdb_data['title'].map(preprocess_title)
imdb_data['norm_year'] = imdb_data['year'].map(preprocess_year)
imdb_data = imdb_data.drop_duplicates(subset=['norm_title', 'norm_year'], keep='first').copy()
imdb_data.shape
Out[16]:
(869178, 27)

Step 2: Integrating the Kaggle and IMDB datasets

A simple approach to integrating the two datasets is to join entries that share the same movie title and year of production. The following code reveals that 4,248 matches are found using this approach.

In [17]:
data_attempt1 = pd.merge(imdb_data, kaggle_data, how='inner', left_on=['norm_title', 'norm_year'],
                         right_on=['norm_movie_title', 'norm_title_year'])
data_attempt1.shape
Out[17]:
(4248, 57)

But given that the IMDB and Kaggle datasets are collected from different sources, chances are that the name of a movie is slightly different in these datasets (e.g. “Wall.E” vs “WallE”). To find such matches, one can look at the similarity of movie titles and consider titles with high similarity to be the same entity. BigGorilla provides a Python package named py_stringsimjoin for doing similarity joins across two datasets. The following code snippet uses py_stringsimjoin to match all the titles that have an edit distance of one or less (i.e., at most one character needs to be changed/added/removed to make both titles identical). Once the similarity join is complete, it selects only the title pairs that are produced in the same year.

In [18]:
import py_stringsimjoin as ssj
import py_stringmatching as sm

imdb_data['id'] = range(imdb_data.shape[0])
kaggle_data['id'] = range(kaggle_data.shape[0])
similar_titles = ssj.edit_distance_join(imdb_data, kaggle_data, 'id', 'id', 'norm_title',
                                        'norm_movie_title', l_out_attrs=['norm_title', 'norm_year'],
                                         r_out_attrs=['norm_movie_title', 'norm_title_year'], threshold=1)
# selecting the entries that have the same production year
data_attempt2 = similar_titles[similar_titles.r_norm_title_year == similar_titles.l_norm_year]
data_attempt2.shape
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:02:46
Out[18]:
(4689, 8)

We can see that using the similarity join 4,689 titles were matched. Let’s look at some of the titles that are matched by the similarity join but are not identical.

In [19]:
data_attempt2[data_attempt2.l_norm_title != data_attempt2.r_norm_movie_title].head()
Out[19]:
_id l_id r_id l_norm_title l_norm_year r_norm_movie_title r_norm_title_year _sim_score
144 144 852736 46 world war v 2013 world war z 2013 1.0
162 162 281649 56 grave 2012 brave 2012 1.0
180 180 831490 58 walle 2008 wall·e 2008 1.0
236 236 816188 67 upe 2009 up 2009 1.0
243 243 817366 67 ut 2009 up 2009 1.0
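
To make the threshold of one concrete, here is a minimal pure-Python sketch of the edit (Levenshtein) distance, applied to pairs from the table above. It is meant as an illustration of the measure only; it is not how py_stringsimjoin computes the join, and the function name is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from '' to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute (or keep)
        prev = curr
    return prev[-1]


print(edit_distance('world war v', 'world war z'))  # 1 (substitution)
print(edit_distance('walle', 'wall\u00b7e'))        # 1 (insertion)
print(edit_distance('upe', 'up'))                   # 1 (deletion)
```

All three pairs are within the threshold, although only the “walle” pair looks like the same movie. This is one reason a title-only similarity join can produce false matches.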

Step 3: Using Magellan for Data Matching

Substep A: Finding a candidate set (Blocking)

The goal of this step is to limit the number of pairs that we consider as potential matches using a simple heuristic. For this task, we can create a new column in each dataset that combines the values of important attributes into a single string (which we call the mixture). Then, we can use the string similarity join as before to find a set of entities that have some overlap in the values of the important columns. Before doing that, we need to transform the columns that are part of the mixture to strings. The py_stringsimjoin package allows us to do so easily.

In [20]:
# transforming the "budget" column into string and creating a new **mixture** column
ssj.utils.converter.dataframe_column_to_str(imdb_data, 'budget', inplace=True)
imdb_data['mixture'] = imdb_data['norm_title'] + ' ' + imdb_data['norm_year'] + ' ' + imdb_data['budget']

# repeating the same thing for the Kaggle dataset
ssj.utils.converter.dataframe_column_to_str(kaggle_data, 'budget', inplace=True)
kaggle_data['mixture'] = kaggle_data['norm_movie_title'] + ' ' + kaggle_data['norm_title_year'] + \
                         ' ' + kaggle_data['budget']

Now, we can use the mixture columns to create a desired candidate set which we call C.

In [21]:
C = ssj.overlap_coefficient_join(kaggle_data, imdb_data, 'id', 'id', 'mixture', 'mixture', sm.WhitespaceTokenizer(), 
                                 l_out_attrs=['norm_movie_title', 'norm_title_year', 'duration',
                                              'budget', 'content_rating'],
                                 r_out_attrs=['norm_title', 'norm_year', 'length', 'budget', 'mpaa'],
                                 threshold=0.65)
C.shape
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:08
Out[21]:
(18317, 14)

We can see that by doing a similarity join, we already reduced the candidate set to 18,317 pairs.
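
To make the 0.65 threshold concrete, here is a minimal sketch of the overlap coefficient over whitespace tokens, assuming the usual definition (size of the token intersection divided by the size of the smaller token set). The helper name is hypothetical and this is not the py_stringsimjoin implementation. The example strings are mixtures built by hand from a pair that appears in the candidate set shown later in the tutorial.

```python
def overlap_coefficient(left, right):
    """|A & B| / min(|A|, |B|) over whitespace-separated tokens."""
    a, b = set(left.split()), set(right.split())
    if not a or not b:
        return 0.0                      # convention chosen for empty inputs
    return len(a & b) / min(len(a), len(b))


score = overlap_coefficient('dude wheres my dog! 2014 20000',
                            '#hacked 2014 20000')
print(round(score, 6))   # 0.666667
```

The two mixtures share only the year and the budget, yet two of the three tokens in the shorter string overlap, so the pair passes the 0.65 threshold. This illustrates why the candidate set is deliberately loose.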

Substep B: Specifying the keys

The next step is to tell the py_entitymatching package which columns correspond to the keys in each dataframe. We also need to specify which columns correspond to the foreign keys of the two dataframes in the candidate set.

In [22]:
import py_entitymatching as em
em.set_key(kaggle_data, 'id')   # specifying the key column in the kaggle dataset
em.set_key(imdb_data, 'id')     # specifying the key column in the imdb dataset
em.set_key(C, '_id')            # specifying the key in the candidate set
em.set_ltable(C, kaggle_data)   # specifying the left table 
em.set_rtable(C, imdb_data)     # specifying the right table
em.set_fk_rtable(C, 'r_id')     # specifying the column that matches the key in the right table 
em.set_fk_ltable(C, 'l_id')     # specifying the column that matches the key in the left table 
Out[22]:
True

Substep C: Debugging the blocker

Now, we need to make sure that the candidate set is loose enough to include pairs of movies that are not very close. If this is not the case, there is a chance that we have eliminated pairs that could potentially be matched. By looking at a few pairs from the candidate set, we can judge whether the blocking step has been too harsh or not.

Note: The py_entitymatching package provides some tools for debugging the blocker as well.

In [23]:
C[['l_norm_movie_title', 'r_norm_title', 'l_norm_title_year', 'r_norm_year',
   'l_budget', 'r_budget', 'l_content_rating', 'r_mpaa']].head()
Out[23]:
l_norm_movie_title r_norm_title l_norm_title_year r_norm_year l_budget r_budget l_content_rating r_mpaa
0 dude wheres my dog! #hacked 2014 2014 20000 20000 PG NaN
1 road hard #horror 2015 2015 1500000 1500000 NaN NaN
2 me you and five bucks #horror 2015 2015 1500000 1500000 NaN NaN
3 checkmate #horror 2015 2015 1500000 1500000 NaN NaN
4 #horror #horror 2015 2015 1500000 1500000 Not Rated NaN

Based on the above sample we can see that the blocking seems to be reasonable.

Substep D: Sampling from the candidate set

The goal of this step is to obtain a sample from the candidate set and manually label the sampled candidates; that is, to specify if the candidate pair is a correct match or not.

In [24]:
# Sampling 500 pairs and writing this sample into a .csv file
sampled = C.sample(500, random_state=0)
sampled.to_csv('./data/sampled.csv', encoding='utf-8')

To label the sampled data, we can create a new column in the .csv file (which we call label) and put the value 1 in that column if the pair is a correct match and 0 otherwise. To avoid overwriting the original file, let’s save the labeled version as labeled.csv.

In [25]:
# If you would like to skip labeling the pairs for now, you can download the labeled.csv file from
# BigGorilla using the following command (if you prefer to label them yourself, comment out the next statement)
response = urllib.urlretrieve('https://anaconda.org/BigGorilla/datasets/1/download/labeled.csv',
                              './data/labeled.csv')
labeled = em.read_csv_metadata('data/labeled.csv', ltable=kaggle_data, rtable=imdb_data,
                               fk_ltable='l_id', fk_rtable='r_id', key='_id')
labeled.head()
No handlers could be found for logger "py_entitymatching.io.parsers"
Out[25]:
Unnamed: 0 _id l_id r_id l_norm_movie_title l_norm_title_year l_duration l_budget l_content_rating r_norm_title r_norm_year r_length r_budget r_mpaa _sim_score label
0 4771 4771 2639 235925 eye of the beholder 1999 109.0 15000000 R eye of the beholder 1999 109.0 35000000 R 0.833333 1
1 11478 11478 2001 600301 rocky balboa 2006 139.0 24000000 PG rocky balboa 2006 139.0 24000000 PG 1.000000 1
2 13630 13630 4160 691766 from russia with love 1963 115.0 2000000 Approved the aeolians: from russia with love 2012 NaN 20000 NaN 0.666667 0
3 1972 1972 1248 101029 sex tape 2014 94.0 40000000 R blended 2014 117.0 40000000 PG-13 0.666667 0
4 15903 15903 722 758133 the scorch trials 2015 132.0 61000000 PG-13 the scorch trials 2015 132.0 61000000 PG-13 1.000000 1

Substep E: Training machine learning algorithms

Now we can use the sampled dataset to train various machine learning algorithms for our prediction task. To do so, we need to split our dataset into a training and a test set, and then select the desired machine learning techniques for our prediction task.

In [26]:
split = em.split_train_test(labeled, train_proportion=0.5, random_state=0)
train_data = split['train']
test_data = split['test']

dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

Before we can apply any machine learning technique, we need to extract a set of features. Fortunately, the py_entitymatching package can automatically extract a set of features once we specify which columns in the two datasets correspond to each other. The following code snippet starts by specifying the correspondence between the columns of the two datasets. Then, it uses the py_entitymatching package to determine the type of each column. By considering the types of columns in each dataset (stored in the variables l_attr_types and r_attr_types), and using the tokenizers and similarity functions suggested by the package, we obtain a set of instructions for extracting features. Note that the variable F is not the set of extracted features; rather, it encodes the instructions for computing the features.

In [27]:
attr_corres = em.get_attr_corres(kaggle_data, imdb_data)
attr_corres['corres'] = [('norm_movie_title', 'norm_title'),
                         ('norm_title_year', 'norm_year'),
                         ('content_rating', 'mpaa'),
                         ('budget', 'budget')]

l_attr_types = em.get_attr_types(kaggle_data)
r_attr_types = em.get_attr_types(imdb_data)

tok = em.get_tokenizers_for_matching()
sim = em.get_sim_funs_for_matching()

F = em.get_features(kaggle_data, imdb_data, l_attr_types, r_attr_types, attr_corres, tok, sim)

Given the set of desired features F, we can now calculate the feature values for our training data and also impute the missing values in our data. In this case, we choose to replace the missing values with the mean of the column.

In [28]:
train_features = em.extract_feature_vecs(train_data, feature_table=F, attrs_after='label', show_progress=False) 
train_features = em.impute_table(train_features,  exclude_attrs=['_id', 'l_id', 'r_id', 'label'], strategy='mean')
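
As an illustration of what a “mean” imputation strategy does to a single feature column, here is a small pure-Python sketch. It is not the py_entitymatching implementation; the helper name impute_mean and the treatment of both None and NaN as missing are assumptions made for the example.

```python
import math


def _is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    present = [v for v in values if not _is_missing(v)]
    if not present:
        return list(values)            # nothing to compute a mean from
    mean = sum(present) / len(present)
    return [mean if _is_missing(v) else v for v in values]


print(impute_mean([1.0, None, 3.0]))   # [1.0, 2.0, 3.0]
```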

Using the calculated features, we can evaluate the performance of different machine learning algorithms and select the best one for our matching task.

In [29]:
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=train_features, 
                           exclude_attrs=['_id', 'l_id', 'r_id', 'label'], k=5,
                           target_attr='label', metric='f1', random_state=0)
result['cv_stats']
Out[29]:
Name Matcher Num folds Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Mean score
0 DecisionTree <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x15d828090> 5 1.000000 0.967742 1.0 1.000000 1.000 0.993548
1 RF <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x15d828550> 5 1.000000 0.967742 1.0 1.000000 1.000 0.993548
2 SVM <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x15d8284d0> 5 0.956522 0.967742 1.0 1.000000 0.875 0.959853
3 LinReg <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x15d8560d0> 5 1.000000 0.967742 1.0 1.000000 1.000 0.993548
4 LogReg <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x15d8281d0> 5 1.000000 0.967742 1.0 0.956522 1.000 0.984853
5 NaiveBayes <py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x111b2c290> 5 1.000000 0.967742 1.0 1.000000 1.000 0.993548

Based on the reported F1 scores, the “random forest (RF)” matcher achieves the best mean score (0.9935), tied with the decision tree, linear regression, and naive Bayes matchers. We continue with the matcher that select_matcher reports as the best one.

Substep F: Evaluating the quality of our matching

It is important to evaluate the quality of our matching. We can now use the test set for this purpose and measure how well the selected matcher predicts the matches. The results show that we obtain high precision and recall on the test set as well.

In [30]:
best_model = result['selected_matcher']
best_model.fit(table=train_features, exclude_attrs=['_id', 'l_id', 'r_id', 'label'], target_attr='label')

test_features = em.extract_feature_vecs(test_data, feature_table=F, attrs_after='label', show_progress=False)
test_features = em.impute_table(test_features, exclude_attrs=['_id', 'l_id', 'r_id', 'label'], strategy='mean')

# Predict on the test data
predictions = best_model.predict(table=test_features, exclude_attrs=['_id', 'l_id', 'r_id', 'label'], 
                                 append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'label', 'predicted')
em.print_eval_summary(eval_result)
Precision : 94.44% (51/54)
Recall : 100.0% (51/51)
F1 : 97.14%
False positives : 3 (out of 54 positive predictions)
False negatives : 0 (out of 196 negative predictions)
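
The figures in this summary follow directly from the counts of true positives, false positives, and false negatives. The sketch below recomputes them with the standard definitions; the function name is hypothetical, and it assumes that none of the denominators is zero.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard definitions; assumes non-zero denominators."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the summary above: 51 correct matches,
# 3 false positives, 0 false negatives.
p, r, f1 = precision_recall_f1(true_pos=51, false_pos=3, false_neg=0)
print('{:.2%} {:.2%} {:.2%}'.format(p, r, f1))   # 94.44% 100.00% 97.14%
```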

Substep G: Using the trained model to match the datasets

Now, we can use the trained model to match the two tables as follows:

In [31]:
candset_features = em.extract_feature_vecs(C, feature_table=F, show_progress=True)
candset_features = em.impute_table(candset_features, exclude_attrs=['_id', 'l_id', 'r_id'], strategy='mean')
predictions = best_model.predict(table=candset_features, exclude_attrs=['_id', 'l_id', 'r_id'],
                                 append=True, target_attr='predicted', inplace=False)
matches = predictions[predictions.predicted == 1]
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:01:05

Note that the matches dataframe contains many columns storing the extracted features for both datasets. The following code snippet removes all the unnecessary columns and creates a nicely formatted dataframe that holds the resulting integrated dataset.

In [32]:
from py_entitymatching.catalog import catalog_manager as cm
matches = matches[['_id', 'l_id', 'r_id', 'predicted']]
matches.reset_index(drop=True, inplace=True)
cm.set_candset_properties(matches, '_id', 'l_id', 'r_id', kaggle_data, imdb_data)
matches = em.add_output_attributes(matches, l_output_attrs=['norm_movie_title', 'norm_title_year', 'budget', 'content_rating'],
                                   r_output_attrs=['norm_title', 'norm_year', 'budget', 'mpaa'],
                                   l_output_prefix='l_', r_output_prefix='r_',
                                   delete_from_catalog=False)
matches.drop('predicted', axis=1, inplace=True)
matches.head()
Out[32]:
_id l_id r_id l_norm_movie_title l_norm_title_year l_budget l_content_rating r_norm_title r_norm_year r_budget r_mpaa
0 4 4352 106 #horror 2015 1500000 Not Rated #horror 2015 1500000 NaN
1 8 2726 450 crocodile dundee ii 1988 15800000 PG crocodile dundee ii 1988 14000000 NaN
2 11 3406 838 500 days of summer 2009 7500000 PG-13 (500) days of summer 2009 7500000 PG-13
3 24 3631 1872 10 cloverfield lane 2016 15000000 PG-13 10 cloverfield lane 2016 15000000 PG-13
4 26 2965 1881 10 days in a madhouse 2015 12000000 R 10 days in delaware 2015 0 NaN