Use this component when you want to acquire data from other sources or extract structured data from text. Many of the tools in this component also include data cleaning functionality, for example to detect and/or correct inconsistent data.

Featured Packages

  • FrameIt Semantic Role Labeling

    https://github.com/biggorilla-gh/frameit

    FrameIt is a system for creating custom frames for text corpora. It is built on Python 3 and spaCy 2.

    Features:
    – Intent detection for individual sentences using a CNN model
    – Entity extraction paired with intents, using either CNN or heuristic models
    – An SRL system that can load multiple Frames for intent detection simultaneously, allowing similar domains to be differentiated
    – Easy to train and customize using Jupyter notebooks
    – Evaluation scripts for convenient experimental design and iteration
    – Support for all languages covered by spaCy 2 models

    Developed by Megagon Labs


  • Scrapy

    https://scrapy.org/

    Scrapy is a framework for extracting data from websites. It can be used to build crawlers (spiders) that visit multiple websites and retrieve selected data.
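
    A minimal spider sketch; the target site and CSS selectors below come from Scrapy's own tutorial (quotes.toscrape.com) and are illustrative:

        import scrapy

        class QuotesSpider(scrapy.Spider):
            """Crawl a site and yield one structured item per quote."""
            name = "quotes"
            start_urls = ["https://quotes.toscrape.com/"]

            def parse(self, response):
                # Extract fields from each quote block on the page.
                for quote in response.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }
                # Follow pagination links and parse them the same way.
                for href in response.css("li.next a::attr(href)"):
                    yield response.follow(href, callback=self.parse)

    Run it with `scrapy runspider quotes_spider.py -o quotes.json` (the script name is a placeholder).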


  • Usagi

    https://github.com/biggorilla-gh/usagi

    Usagi is an open source platform to build data discovery systems. Usagi crawls and extracts metadata about datasets and builds catalogs and indices to make datasets discoverable by search and browsing.


  • pandas

    http://pandas.pydata.org/

    pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.
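
    A minimal sketch of loading and inspecting tabular data (the file name and column name are placeholders):

        import pandas as pd

        # Read a CSV file into a DataFrame.
        df = pd.read_csv("data.csv")

        # Quick inspection: first rows and summary statistics.
        print(df.head())
        print(df.describe())

        # Simple cleaning step: drop rows missing a value in one column.
        df = df.dropna(subset=["price"])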


  • JSON

    https://docs.python.org/3/library/json.html

    The json library parses JSON strings into Python dictionaries and lists, and serializes those objects back to JSON strings.
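
    For example, round-tripping between JSON text and Python objects:

        import json

        raw = '{"name": "Alice", "tags": ["a", "b"], "age": 30}'

        # Parse JSON text: objects become dicts, arrays become lists.
        record = json.loads(raw)
        print(record["tags"][1])  # -> b

        # Serialize a Python object back to a JSON string.
        print(json.dumps(record, indent=2, sort_keys=True))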


  • CSV

    https://docs.python.org/3/library/csv.html

    The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats.
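
    For example, writing and then reading an Excel-style CSV file (the file name is a placeholder):

        import csv

        # Write rows using the default "excel" dialect.
        with open("out.csv", "w", newline="") as f:
            writer = csv.writer(f, dialect="excel")
            writer.writerow(["name", "price"])
            writer.writerow(["widget", 9.99])

        # Read the rows back; each row comes back as a list of strings.
        with open("out.csv", newline="") as f:
            for row in csv.reader(f):
                print(row)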


  • xlrd

    https://pypi.python.org/pypi/xlrd

    xlrd is a Python package that parses Excel data. It has accompanying packages for writing and formatting information in Excel format.
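
    A minimal sketch of reading cell values (the file name and sheet index are placeholders):

        import xlrd

        book = xlrd.open_workbook("report.xls")
        sheet = book.sheet_by_index(0)

        # Iterate over the grid and collect each row's cell values.
        for r in range(sheet.nrows):
            print([sheet.cell_value(r, c) for c in range(sheet.ncols)])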


  • PDFtables

    https://pypi.python.org/pypi/pdftables

    PDFtables parses PDF files and extracts what it believes to be tables.
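
    A minimal sketch; this assumes the get_tables entry point exposed by the pdftables package (the file name is a placeholder):

        from pdftables import get_tables

        with open("report.pdf", "rb") as f:
            tables = get_tables(f)

        # Each table is a list of rows; each row is a list of cell strings.
        for table in tables:
            for row in table:
                print(row)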


  • Slate

    https://pypi.python.org/pypi/slate

    Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.
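
    A minimal sketch, assuming slate's PDF wrapper around PDFMiner (the file name is a placeholder):

        import slate

        with open("example.pdf", "rb") as f:
            doc = slate.PDF(f)

        # doc behaves like a list of per-page text strings.
        for page_text in doc:
            print(page_text[:100])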


  • PDFminer

    https://pypi.python.org/pypi/pdfminer/

    PDFminer is a Python package for extracting information from PDF files as text.
    In addition to text, PDFminer includes a tool that can convert PDF files into HTML.
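
    A minimal sketch; this assumes the maintained pdfminer.six fork, which exposes a high-level extraction helper (the file name is a placeholder):

        from pdfminer.high_level import extract_text

        # Extract all text from a PDF as a single string.
        text = extract_text("example.pdf")
        print(text[:500])

    The bundled pdf2txt.py command-line tool handles format conversion, e.g. `pdf2txt.py -t html -o out.html example.pdf`.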


  • Stanford Open IE and the general NLP suite for named entity recognition, relation extraction, etc.

    http://nlp.stanford.edu/software/openie.html

    Stanford CoreNLP provides a set of human language technology tools. Among other things, it can:
    – give the base forms of words and their parts of speech
    – recognize whether tokens are names of companies, people, etc., and normalize dates, times, and numeric quantities
    – mark up the structure of sentences in terms of phrases and syntactic dependencies
    – indicate which noun phrases refer to the same entities
    – indicate sentiment
    – extract particular or open-class relations between entity mentions
    – extract quotes attributed to speakers
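
    A minimal sketch of calling a locally running CoreNLP server over HTTP; this assumes the server has been started on port 9000 (e.g. `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`):

        import json
        import requests

        props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
        resp = requests.post(
            "http://localhost:9000/",
            params={"properties": json.dumps(props)},
            data="Barack Obama was born in Hawaii.".encode("utf-8"),
        )

        # Print each token with its named-entity tag.
        for sentence in resp.json()["sentences"]:
            for token in sentence["tokens"]:
                print(token["word"], token["ner"])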


  • KOKO

    https://github.com/biggorilla-gh/koko

    Koko is an information extraction tool (developed in Python 3) that allows users to query a text corpus and extract the entities that are of interest to them.


  • SpaCy

    https://spacy.io/

    SpaCy is a library for advanced Natural Language Processing in Python and Cython.
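
    A minimal entity-extraction sketch; it assumes the small English model has been installed via `python -m spacy download en_core_web_sm`:

        import spacy

        nlp = spacy.load("en_core_web_sm")
        doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

        # Named entities with their labels (ORG, GPE, MONEY, ...).
        for ent in doc.ents:
            print(ent.text, ent.label_)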


  • Google Cloud Natural Language API

    https://cloud.google.com/natural-language/

    Google Cloud Natural Language API provides developers with access to Google-powered, machine learning-based text analysis components such as sentiment analysis, entity recognition, and syntax analysis.
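
    A minimal entity-analysis sketch against the v1 REST endpoint; the API key is a placeholder, and the official google-cloud-language client library is the usual alternative to raw HTTP:

        import requests

        API_KEY = "YOUR_API_KEY"  # placeholder; created in the Google Cloud console
        url = "https://language.googleapis.com/v1/documents:analyzeEntities"

        body = {
            "document": {"type": "PLAIN_TEXT",
                         "content": "Sundar Pichai is the CEO of Google."},
            "encodingType": "UTF8",
        }

        resp = requests.post(url, params={"key": API_KEY}, json=body)
        for entity in resp.json().get("entities", []):
            print(entity["name"], entity["type"])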


  • NLTK

    http://www.nltk.org/

    NLTK is an open-source platform for building Python programs to process human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK also provides wrappers for industrial-strength NLP libraries.
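
    A minimal tokenize-and-tag sketch; the named resources are fetched once with nltk.download:

        import nltk

        # One-time downloads of the tokenizer and tagger models.
        nltk.download("punkt")
        nltk.download("averaged_perceptron_tagger")

        tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit.")
        print(nltk.pos_tag(tokens))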


  • lxml

    http://lxml.de/

    lxml is a library for processing XML and HTML in the Python language.
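
    A minimal sketch of parsing HTML and querying it with XPath (the HTML snippet is illustrative):

        from lxml import html

        page = "<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
        tree = html.fromstring(page)

        # XPath query: every href attribute on an anchor tag.
        print(tree.xpath("//a/@href"))  # -> ['/a', '/b']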


  • Beautiful Soup

    https://www.crummy.com/software/BeautifulSoup/

    Beautiful Soup helps easily read and parse web pages. It is great for initial parsing and scraping.
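
    A minimal sketch, assuming the bs4 package and Python's built-in parser (the HTML snippet is illustrative):

        from bs4 import BeautifulSoup

        page = "<html><body><p class='title'>Hi</p><a href='/next'>next</a></body></html>"
        soup = BeautifulSoup(page, "html.parser")

        print(soup.find("p", class_="title").get_text())   # -> Hi
        print([a["href"] for a in soup.find_all("a")])     # -> ['/next']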


  • Apache Nutch

    http://nutch.apache.org/

    Apache Nutch is an extensible and scalable open source web crawler written in Java.


  • Data Synthesizer

    https://github.com/DataResponsibly/DataSynthesizer

    Data Synthesizer can generate a synthetic dataset from a sensitive one for release to the public.


  • Tweepy

    http://www.tweepy.org/

    Tweepy is a Python library for accessing the Twitter API to extract tweets.
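
    A minimal sketch, assuming Tweepy 3.x-style OAuth 1a authentication; all four credentials are placeholders obtained from Twitter's developer portal:

        import tweepy

        auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
        auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
        api = tweepy.API(auth)

        # Fetch the most recent tweets from a user's timeline.
        for tweet in api.user_timeline(screen_name="nasa", count=5):
            print(tweet.text)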


  • urllib2

    https://docs.python.org/2/library/urllib2.html

    urllib2 is part of the Python 2 standard library for making simple HTTP requests to visit web pages and get their content. In Python 3, its functionality lives in urllib.request and urllib.error.


  • urllib

    https://docs.python.org/2/library/urllib.html

    urllib is part of the Python standard library for making simple HTTP requests and for encoding and parsing URLs. In Python 3, the urllib package subsumes the functionality of the old urllib and urllib2 modules.
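
    A minimal fetch sketch using Python 3's urllib.request (in Python 2 the equivalent call is urllib2.urlopen):

        from urllib.request import urlopen

        # Fetch a page and read its raw bytes.
        with urlopen("https://example.com/") as resp:
            print(resp.getcode())  # HTTP status code
            body = resp.read()     # raw response body

        print(body[:200].decode("utf-8", errors="replace"))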


  • Requests

    http://docs.python-requests.org/

    Requests is an HTTP library for Python that provides the APIs needed to scrape websites. Requests can make complex requests to visit a page and get its content, such as those requiring additional headers, complex POST data, or authentication credentials.
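
    For example (the URLs, header, and credentials are placeholders):

        import requests

        # GET with a custom header and basic authentication.
        resp = requests.get(
            "https://example.com/api/items",
            headers={"User-Agent": "my-crawler/0.1"},
            auth=("user", "pass"),
            timeout=10,
        )
        print(resp.status_code)
        data = resp.json()  # parse a JSON response body

        # POST with form-encoded data.
        requests.post("https://example.com/login", data={"user": "u", "pw": "p"})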


Registered Packages