Data Integration and Data Preparation in Python

BigGorilla

“Data scientists, according to interviews and expert estimates, spend from 50-80% of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.”

— Steve Lohr, Aug 17, 2014, New York Times (For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights)

BigGorilla is an open-source data integration and data preparation ecosystem (powered by Python) that enables data scientists to integrate and analyze data. BigGorilla consolidates and documents the different steps that data scientists typically take to bring data from different sources into a single database for analysis. For each of these steps, we document existing technologies and also point to desired technologies that could be developed.

The different components of BigGorilla are freely available for download and use. Data scientists are encouraged to contribute code, datasets, or examples to BigGorilla. Our goal is also to promote education and training for aspiring data scientists through the development, documentation, and tools provided by BigGorilla. Jump-start with our BigGorilla introduction to data integration and data preparation.

  • OPEN-SOURCE COMPONENTS FOR DATA INTEGRATION AND DATA PREPARATION

  • CAN BE COMBINED AND REUSED IN DIFFERENT CONTEXTS AND IN DIFFERENT WAYS

  • TUTORIALS, DATASETS, AND EXAMPLES

OVERVIEW

Here is a hypothetical scenario that illustrates how the different components of BigGorilla (see the bottom of the page) may be used. Suppose a company wants to understand how its customers and potential customers feel about its own products and those of its competitors. Its goal is to obtain and prepare data from relevant tweets, blogs, and news articles before running its favorite sentiment analysis algorithm on that data. The following describes the steps one might take to prepare the data before running the sentiment analysis algorithm.

The first step is to acquire data from the different sources (in this case, relevant tweets, blogs, and news feeds). In our scenario, one source produces a JSON file while another produces a CSV file. The data scientist then applies two extractors to obtain, respectively: (1) tweet(company, item, {sentence, tweetid, date}) in JSON format, where the (sentence, tweetid, date) triples are grouped by company and item, and (2) about(company name, product, sentiment utterance, blogurl, date) in CSV format. The next step is data transformation, where the JSON file is converted into a CSV file by unnesting sentence, tweetid, and date with company and item. Next, the schemas of the two CSV files are matched against a goal schema final(company, product, utterance, provenance, date) designed by the user. The resulting matchings are used by the schema mapping component to generate a script that translates and combines both CSV files into data conforming to the goal schema. The final step is data matching, where the goal is to identify all utterances that belong to the same company-and-product pair. These steps can be coordinated manually or through a workflow management tool.
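
To make the transformation and schema-mapping steps concrete, here is a minimal pandas sketch under the scenario's assumptions; the file names (tweets.json, blogs.csv) and the "mentions" key holding the nested (sentence, tweetid, date) triples are hypothetical stand-ins, not part of BigGorilla itself.

```python
import json
import pandas as pd

# Hypothetical input: tweets.json holds records of the form
# {"company": ..., "item": ...,
#  "mentions": [{"sentence": ..., "tweetid": ..., "date": ...}, ...]}
with open("tweets.json") as f:
    records = json.load(f)

# Data transformation: unnest sentence, tweetid, and date with company and item.
rows = [
    {"company": r["company"], "item": r["item"], **m}
    for r in records
    for m in r["mentions"]
]
tweets = pd.DataFrame(rows)

# Schema mapping: rename columns to conform to the goal schema
# final(company, product, utterance, provenance, date).
tweets = tweets.rename(
    columns={"item": "product", "sentence": "utterance", "tweetid": "provenance"}
)
blogs = pd.read_csv("blogs.csv").rename(
    columns={
        "company name": "company",
        "sentiment utterance": "utterance",
        "blogurl": "provenance",
    }
)

# Combine both sources under the goal schema.
final = pd.concat([tweets, blogs], ignore_index=True)[
    ["company", "product", "utterance", "provenance", "date"]
]
final.to_csv("final.csv", index=False)
```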

DATA ACQUISITION

Use this component when you wish to acquire data from other sources (such as Web pages, tables from an RDBMS, Excel and PDF files from a local email system, and so on) or to generate synthetic data from sensitive data.
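
For a flavor of what acquisition often looks like in Python, here is a brief sketch using pandas and the standard library; the URL, database file, and table and column names are hypothetical.

```python
import sqlite3
import pandas as pd

# Fetch a CSV file published on the web (hypothetical URL).
web_df = pd.read_csv("https://example.com/products.csv")

# Pull a table from an RDBMS (here a local SQLite database for illustration).
conn = sqlite3.connect("sales.db")
db_df = pd.read_sql_query("SELECT company, product, revenue FROM sales", conn)
conn.close()

# Read a sheet from an Excel workbook (requires openpyxl for .xlsx files).
xl_df = pd.read_excel("reviews.xlsx", sheet_name=0)
```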

DATA EXTRACTION

Use this component when you wish to extract structured data from semi-structured data or natural-language text, such as extracting attribute-value pairs that describe a product, or extracting person and organization names.
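
As a toy illustration of the idea (not BigGorilla's own extractor), the regular-expression sketch below pulls attribute-value pairs out of a product description; real extractors typically rely on trained models or NLP libraries, and the input string here is made up.

```python
import re

description = "Brand: Acme; Color: black; Weight: 1.2 kg; Battery life: 10 hours"

# Capture "attribute: value" pairs separated by semicolons.
pattern = re.compile(r"([\w ]+):\s*([^;]+)")
pairs = {attr.strip(): value.strip() for attr, value in pattern.findall(description)}

print(pairs)
# {'Brand': 'Acme', 'Color': 'black', 'Weight': '1.2 kg', 'Battery life': '10 hours'}
```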

SCHEMA MATCHING & MERGING

Use this component when you wish to match attributes across two schemas, such as “address” in one schema with “location” in another, or merge two schemas into one.
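
Purely to convey the intuition, here is a standard-library sketch that matches attributes across two hypothetical schemas by name similarity; note that synonym pairs such as "address"/"location" require dictionaries or instance-level evidence that a syntactic matcher alone cannot provide.

```python
from difflib import SequenceMatcher

schema_a = ["company_name", "addr", "phone_no"]
schema_b = ["company", "address", "phone_number"]

def similarity(x: str, y: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()

# Pair each attribute in schema A with its closest name in schema B.
for a in schema_a:
    best = max(schema_b, key=lambda b: similarity(a, b))
    print(f"{a} -> {best} (score {similarity(a, best):.2f})")
```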

DATA MATCHING & MERGING

Use this component when you wish to identify whether two entities are the same, such as whether the tuples (David Smith, JHU, 35) and (Dave Smith, John Hopkins, 37) refer to the same real-world person, or when you wish to combine data from different sources, which may involve reconciling conflicting values.
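
A bare-bones sketch of the idea follows, using standard-library string similarity plus a hypothetical synonym table for abbreviations such as "JHU"; real data-matching components add blocking, richer features, and learned classifiers, and the threshold here is an illustrative assumption.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table mapping name variants to a canonical form.
SYNONYMS = {"jhu": "johns hopkins", "john hopkins": "johns hopkins"}

def canonical(s: str) -> str:
    s = s.lower()
    return SYNONYMS.get(s, s)

def sim(x: str, y: str) -> float:
    return SequenceMatcher(None, x, y).ratio()

def same_entity(t1, t2, threshold=0.8):
    """Average name and affiliation similarity against an assumed threshold."""
    name_sim = sim(t1[0].lower(), t2[0].lower())
    affil_sim = sim(canonical(t1[1]), canonical(t2[1]))
    # Ages (35 vs 37) may legitimately differ across sources, so ignore them here.
    return (name_sim + affil_sim) / 2 >= threshold

print(same_entity(("David Smith", "JHU", 35), ("Dave Smith", "John Hopkins", 37)))
# True
```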

DATA TRANSFORMATION

Use this component when you wish to convert data from one format into another, such as from CSV to JSON, or reshape or reorganize data into a desired form.
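
For instance, the CSV-to-JSON direction takes only a couple of lines with pandas; the file name and the nesting columns below are hypothetical, borrowed from the goal schema in the scenario above.

```python
import pandas as pd

# Hypothetical flat CSV with the goal-schema columns from the scenario above.
df = pd.read_csv("final.csv")

# CSV -> JSON in one call.
df.to_json("final.json", orient="records", indent=2)

# Reshaping example: nest utterances under each (company, product) pair.
nested = df.groupby(["company", "product"])[["utterance", "provenance", "date"]].apply(
    lambda g: g.to_dict(orient="records")
)
print(nested.head())
```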

SCHEMA MAPPING

Use this component when you wish to generate scripts (from schema matchings) that can be executed to combine data from different sources into data that conforms to a schema.
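
As a highly simplified illustration, the sketch below turns a set of attribute matchings into a small executable pandas script; real schema-mapping tools generate far more general transformations (joins, unnesting, value conversions), and the column names here come from the scenario above.

```python
# Matchings produced by the schema matching step: source column -> goal column.
matchings = {
    "company name": "company",
    "sentiment utterance": "utterance",
    "blogurl": "provenance",
}

# Generate a small, executable pandas script from the matchings.
script = (
    "import pandas as pd\n"
    "df = pd.read_csv('blogs.csv')\n"
    f"df = df.rename(columns={matchings!r})\n"
    "df.to_csv('conformed.csv', index=False)\n"
)
print(script)  # the script can be saved to a file and run as-is
```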

WORKFLOW MANAGEMENT

Use this component when you wish to build a pipeline of your data integration and data preparation tasks, automate the pipeline, and handle input/output data, failures, and so on.
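
Dedicated tools such as Airflow or Luigi are the usual choice; purely to convey the idea, here is a hand-rolled pipeline runner with placeholder step functions and basic failure handling.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

# Placeholder steps standing in for the scenario's pipeline stages.
def acquire(_):
    return "raw data"

def extract(data):
    return f"extracted({data})"

def transform(data):
    return f"transformed({data})"

def match(data):
    return f"matched({data})"

PIPELINE = [acquire, extract, transform, match]

def run(pipeline, data=None):
    """Run each step in order, passing outputs along and stopping on failure."""
    for step in pipeline:
        try:
            data = step(data)
            logging.info("step %s succeeded", step.__name__)
        except Exception:
            logging.exception("step %s failed; aborting pipeline", step.__name__)
            raise
    return data

print(run(PIPELINE))
```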