Use this component when you want to acquire data from other sources or extract structured data from text. Many of the tools in this component also include data cleaning functionality, for example to detect and/or correct inconsistent data.

Featured Packages

  • FrameIt Semantic Role Labeling

    https://github.com/biggorilla-gh/frameit

    FrameIt is a system for creating custom frames for text corpora. It is built on Python 3 and spaCy 2.

    Features:
    – Intent detection for individual sentences using a CNN model
    – Entity extraction paired with intents, using either CNN or heuristic models
    – An SRL system that can load multiple Frames for intent detection simultaneously, allowing similar domains to be differentiated
    – Easy to train and customize using Jupyter notebooks
    – Evaluation scripts for convenient experimental design and iteration
    – Support for all languages covered by spaCy 2 models

    Developed by Megagon Labs


  • Scrapy

    https://scrapy.org/

    Scrapy is a framework for extracting data from websites. It can be used to build crawlers (spiders) that visit multiple websites and retrieve selected data.
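
    A minimal spider sketch; the target site and CSS selectors below come from Scrapy's own tutorial (quotes.toscrape.com) and are illustrative:

        import scrapy

        class QuotesSpider(scrapy.Spider):
            """Crawl a site and yield one structured item per quote."""
            name = "quotes"
            start_urls = ["https://quotes.toscrape.com/"]

            def parse(self, response):
                # Extract fields from each quote block on the page.
                for quote in response.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }
                # Follow pagination links and parse them the same way.
                for href in response.css("li.next a::attr(href)"):
                    yield response.follow(href, callback=self.parse)

    Run it with `scrapy runspider quotes_spider.py -o quotes.json` (the script name is a placeholder).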


  • Usagi

    https://github.com/biggorilla-gh/usagi

    Usagi is an open source platform to build data discovery systems. Usagi crawls and extracts metadata about datasets and builds catalogs and indices to make datasets discoverable by search and browsing.


  • pandas

    http://pandas.pydata.org/

    pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.
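
    A minimal sketch of loading and inspecting tabular data (the file name and column name are placeholders):

        import pandas as pd

        # Read a CSV file into a DataFrame.
        df = pd.read_csv("data.csv")

        # Quick inspection: first rows and summary statistics.
        print(df.head())
        print(df.describe())

        # Simple cleaning step: drop rows missing a value in one column.
        df = df.dropna(subset=["price"])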


  • JSON

    https://docs.python.org/3/library/json.html

    The json library parses JSON strings into Python dictionaries and lists, and serializes those objects back to JSON strings.
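
    For example, round-tripping between JSON text and Python objects:

        import json

        raw = '{"name": "Alice", "tags": ["a", "b"], "age": 30}'

        # Parse JSON text: objects become dicts, arrays become lists.
        record = json.loads(raw)
        print(record["tags"][1])  # -> b

        # Serialize a Python object back to a JSON string.
        print(json.dumps(record, indent=2, sort_keys=True))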


  • CSV

    https://docs.python.org/3/library/csv.html

    The csv module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats.
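
    For example, writing and then reading an Excel-style CSV file (the file name is a placeholder):

        import csv

        # Write rows using the default "excel" dialect.
        with open("out.csv", "w", newline="") as f:
            writer = csv.writer(f, dialect="excel")
            writer.writerow(["name", "price"])
            writer.writerow(["widget", 9.99])

        # Read the rows back; each row comes back as a list of strings.
        with open("out.csv", newline="") as f:
            for row in csv.reader(f):
                print(row)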


  • xlrd

    https://pypi.python.org/pypi/xlrd

    xlrd is a Python package that parses Excel data. It has accompanying packages for writing and formatting information in Excel format.
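
    A minimal sketch of reading cell values (the file name and sheet index are placeholders):

        import xlrd

        book = xlrd.open_workbook("report.xls")
        sheet = book.sheet_by_index(0)

        # Iterate over the grid and collect each row's cell values.
        for r in range(sheet.nrows):
            print([sheet.cell_value(r, c) for c in range(sheet.ncols)])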


  • PDFtables

    https://pypi.python.org/pypi/pdftables

    PDFtables parses PDF files and extracts what it believes to be tables.
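
    A minimal sketch; this assumes the get_tables entry point exposed by the pdftables package (the file name is a placeholder):

        from pdftables import get_tables

        with open("report.pdf", "rb") as f:
            tables = get_tables(f)

        # Each table is a list of rows; each row is a list of cell strings.
        for table in tables:
            for row in table:
                print(row)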


  • Slate

    https://pypi.python.org/pypi/slate

    Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.
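
    A minimal sketch, assuming slate's PDF wrapper around PDFMiner (the file name is a placeholder):

        import slate

        with open("example.pdf", "rb") as f:
            doc = slate.PDF(f)

        # doc behaves like a list of per-page text strings.
        for page_text in doc:
            print(page_text[:100])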


  • PDFminer

    https://pypi.python.org/pypi/pdfminer/

    PDFminer is a Python package for extracting information from PDF files as text.
    In addition to text, PDFminer includes a tool that can convert PDF files into HTML.
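
    A minimal sketch; this assumes the maintained pdfminer.six fork, which exposes a high-level extraction helper (the file name is a placeholder):

        from pdfminer.high_level import extract_text

        # Extract all text from a PDF as a single string.
        text = extract_text("example.pdf")
        print(text[:500])

    The bundled pdf2txt.py command-line tool handles format conversion, e.g. `pdf2txt.py -t html -o out.html example.pdf`.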


  • Stanford Open IE and the general NLP suite for named entity recognition, relation extraction, etc.

    http://nlp.stanford.edu/software/openie.html

    Stanford CoreNLP provides a set of human language technology tools. Among other things, it can:
    – give the base forms of words and their parts of speech
    – recognize whether tokens are names of companies, people, etc., and normalize dates, times, and numeric quantities
    – mark up the structure of sentences in terms of phrases and syntactic dependencies
    – indicate which noun phrases refer to the same entities
    – indicate sentiment
    – extract particular or open-class relations between entity mentions
    – extract quotes attributed to speakers
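
    A minimal sketch of calling a locally running CoreNLP server over HTTP; this assumes the server has been started on port 9000 (e.g. `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`):

        import json
        import requests

        props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
        resp = requests.post(
            "http://localhost:9000/",
            params={"properties": json.dumps(props)},
            data="Barack Obama was born in Hawaii.".encode("utf-8"),
        )

        # Print each token with its named-entity tag.
        for sentence in resp.json()["sentences"]:
            for token in sentence["tokens"]:
                print(token["word"], token["ner"])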


  • KOKO

    https://github.com/biggorilla-gh/koko

    Koko is an information extraction tool (developed in Python 3) that allows users to query a text corpus and extract the entities that are of interest to them.


  • SpaCy

    https://spacy.io/

    SpaCy is a library for advanced Natural Language Processing in Python and Cython.
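
    A minimal entity-extraction sketch; it assumes the small English model has been installed via `python -m spacy download en_core_web_sm`:

        import spacy

        nlp = spacy.load("en_core_web_sm")
        doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

        # Named entities with their labels (ORG, GPE, MONEY, ...).
        for ent in doc.ents:
            print(ent.text, ent.label_)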


  • Google Cloud Natural Language API

    https://cloud.google.com/natural-language/

    Google Cloud Natural Language API provides developers with access to Google-powered, machine learning-based text analysis components such as sentiment analysis, entity recognition, and syntax analysis.
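
    A minimal entity-analysis sketch against the v1 REST endpoint; the API key is a placeholder, and the official google-cloud-language client library is the usual alternative to raw HTTP:

        import requests

        API_KEY = "YOUR_API_KEY"  # placeholder; created in the Google Cloud console
        url = "https://language.googleapis.com/v1/documents:analyzeEntities"

        body = {
            "document": {"type": "PLAIN_TEXT",
                         "content": "Sundar Pichai is the CEO of Google."},
            "encodingType": "UTF8",
        }

        resp = requests.post(url, params={"key": API_KEY}, json=body)
        for entity in resp.json().get("entities", []):
            print(entity["name"], entity["type"])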


  • NLTK

    http://www.nltk.org/

    NLTK is an open-source platform for building Python programs to process human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK also provides wrappers for industrial-strength NLP libraries.
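
    A minimal tokenize-and-tag sketch; the named resources are fetched once with nltk.download:

        import nltk

        # One-time downloads of the tokenizer and tagger models.
        nltk.download("punkt")
        nltk.download("averaged_perceptron_tagger")

        tokens = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit.")
        print(nltk.pos_tag(tokens))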


  • lxml

    http://lxml.de/

    lxml is a library for processing XML and HTML in the Python language.
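
    A minimal sketch of parsing HTML and querying it with XPath (the HTML snippet is illustrative):

        from lxml import html

        page = "<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
        tree = html.fromstring(page)

        # XPath query: every href attribute on an anchor tag.
        print(tree.xpath("//a/@href"))  # -> ['/a', '/b']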


  • Beautiful Soup

    https://www.crummy.com/software/BeautifulSoup/

    Beautiful Soup helps easily read and parse web pages. It is great for initial parsing and scraping.
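
    A minimal sketch, assuming the bs4 package and Python's built-in parser (the HTML snippet is illustrative):

        from bs4 import BeautifulSoup

        page = "<html><body><p class='title'>Hi</p><a href='/next'>next</a></body></html>"
        soup = BeautifulSoup(page, "html.parser")

        print(soup.find("p", class_="title").get_text())   # -> Hi
        print([a["href"] for a in soup.find_all("a")])     # -> ['/next']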


  • Apache Nutch

    http://nutch.apache.org/

    Apache Nutch is an extensible and scalable open source web crawler written in Java.


  • Data Synthesizer

    https://github.com/DataResponsibly/DataSynthesizer

    Data Synthesizer can generate a synthetic dataset from a sensitive one for release to the public.


  • Tweepy

    http://www.tweepy.org/

    Tweepy is a Python library for accessing the Twitter API to extract tweets.
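
    A minimal sketch, assuming Tweepy 3.x-style OAuth 1a authentication; all four credentials are placeholders obtained from Twitter's developer portal:

        import tweepy

        auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
        auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
        api = tweepy.API(auth)

        # Fetch the most recent tweets from a user's timeline.
        for tweet in api.user_timeline(screen_name="nasa", count=5):
            print(tweet.text)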


  • urllib2

    https://docs.python.org/2/library/urllib2.html

    urllib2 is part of the Python 2 standard library for making simple HTTP requests to visit web pages and get their content. In Python 3, its functionality lives in urllib.request and urllib.error.


  • urllib

    https://docs.python.org/2/library/urllib.html

    urllib is part of the Python standard library for making simple HTTP requests and for encoding and parsing URLs. In Python 3, the urllib package subsumes the functionality of the old urllib and urllib2 modules.
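
    A minimal fetch sketch using Python 3's urllib.request (in Python 2 the equivalent call is urllib2.urlopen):

        from urllib.request import urlopen

        # Fetch a page and read its raw bytes.
        with urlopen("https://example.com/") as resp:
            print(resp.getcode())  # HTTP status code
            body = resp.read()     # raw response body

        print(body[:200].decode("utf-8", errors="replace"))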


  • Requests

    http://docs.python-requests.org/

    Requests is an HTTP library for Python that provides the APIs needed to scrape websites. Requests can make complex requests to visit a page and get its content, such as those requiring additional headers, complex POST data, or authentication credentials.
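
    For example (the URLs, header, and credentials are placeholders):

        import requests

        # GET with a custom header and basic authentication.
        resp = requests.get(
            "https://example.com/api/items",
            headers={"User-Agent": "my-crawler/0.1"},
            auth=("user", "pass"),
            timeout=10,
        )
        print(resp.status_code)
        data = resp.json()  # parse a JSON response body

        # POST with form-encoded data.
        requests.post("https://example.com/login", data={"user": "u", "pw": "p"})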


Registered Packages