Use this component when you want to acquire data from external sources (such as web pages, tables in an RDBMS, Excel and PDF files on a local file system, and so on) or to generate synthetic data from sensitive datasets. Note that many content providers offer direct download links for their data, so look for those first before using these tools to crawl and scrape their sites.
Here is a non-exhaustive list of tools, in no particular order, that can be used to acquire and generate data.
- Scrapy is a framework for extracting data from websites. Scrapy can be used to build a crawler or spider to crawl multiple websites and retrieve selected data.
- Requests is an HTTP library for Python that provides the APIs needed to scrape websites. Requests can make complex requests to visit a page and get its content, such as those requiring additional headers, complex POST data, or authentication credentials.
- urllib (and, in Python 2, urllib2) is part of the Python standard library for making simple HTTP requests to visit web pages and get their content.
- Tweepy is a Python library for accessing the Twitter API to extract tweets.
- DataSynthesizer can generate a synthetic dataset from a sensitive one for public release.
- There are other tools as well, such as Apache Nutch and Norconex HTTP Collector (both written in Java), and more…
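As a minimal sketch of the standard-library route above, the snippet below builds a query URL with `urllib.parse` and prepares a request with a custom `User-Agent` header via `urllib.request`. The endpoint (`example.com/search`) and the parameter names are placeholders for illustration, not a real API; calling `urlopen(req)` is what would actually fetch the page.

```python
from urllib.parse import urlencode, urlunparse
from urllib.request import Request  # urlopen(req) would perform the actual fetch

# Build a query URL for a hypothetical search endpoint (example.com is a placeholder).
params = urlencode({"q": "open data", "page": 1})
url = urlunparse(("https", "example.com", "/search", "", params, ""))
print(url)  # https://example.com/search?q=open+data&page=1

# Attach a descriptive User-Agent so site operators can identify the crawler;
# passing this Request object to urlopen() would retrieve the page content.
req = Request(url, headers={"User-Agent": "data-acquisition-demo/0.1"})
print(req.get_header("User-agent"))
```

For anything beyond a one-off fetch (retries, sessions, cookies), Requests is usually the more ergonomic choice, and for crawling many pages Scrapy handles scheduling and politeness for you.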
*Are there more cool tools to add to the list? Tell us about them.*