Use this component when you wish to extract structured data from semi-structured or natural-language text, for example attribute-value pairs that describe a product, or person and organization names. There are generally two types of extraction techniques.
Wrapper-based extraction (also known as template-based extraction) operates over HTML pages that conform to a specified template (such as Amazon product pages). Such pages are typically generated by a script in response to a user query. Wrapper-based extractors examine the HTML pages to discover the template, then use the template to extract the data embedded in the pages. The extraction program is referred to as a wrapper.
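To make the idea of a wrapper concrete, here is a minimal sketch in Python using only the standard library. It assumes a hypothetical template in which every product attribute appears as a span with class "attr-name" followed by a span with class "attr-value"; these class names, the ProductWrapper class and the sample page are invented for illustration. A real wrapper-based tool would typically discover the template from example pages rather than have it written by hand.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Hand-written wrapper for a hypothetical product-page template.

    Assumed template: each attribute is rendered as
    <span class="attr-name">NAME</span> ... <span class="attr-value">VALUE</span>
    """

    def __init__(self):
        super().__init__()
        self._field = None   # class of the span we are currently inside, if any
        self._name = None    # most recent attribute name awaiting its value
        self.record = {}     # extracted attribute-value pairs

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("attr-name", "attr-value"):
            self._field = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return  # text outside the template slots is ignored
        if self._field == "attr-name":
            self._name = text
        elif self._name is not None:
            self.record[self._name] = text
            self._name = None


def extract_product(html):
    """Run the wrapper over one page and return its attribute-value pairs."""
    wrapper = ProductWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.record


page = """
<html><body>
  <h1>Example Phone</h1>
  <div><span class="attr-name">Brand</span>: <span class="attr-value">Acme</span></div>
  <div><span class="attr-name">Weight</span>: <span class="attr-value">150 g</span></div>
</body></html>
"""
print(extract_product(page))  # {'Brand': 'Acme', 'Weight': '150 g'}
```

Because the wrapper only looks at the template slots, the same code works on any page generated from that template, and breaks as soon as the template changes.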
Information extraction from text obtains structured data, such as Company(Apple Inc.) or CEO(Tim Cook, Apple Inc.), from plain text such as news articles, blogs, and emails.
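As a rough illustration of what such an extractor produces, the sketch below uses a single hand-written regular expression to pull Company and CEO facts out of a sentence. The pattern, the function name extract_facts and the sample sentence are assumptions made for this example; practical information extraction tools generally rely on trained models or much larger rule sets.

```python
import re

# Hypothetical pattern for sentences of the form
# "<Person>, (the) CEO of <Company>", where names are capitalized words.
CEO_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+), (?:the )?CEO of "
    r"(?P<company>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*\.?)*)"
)


def extract_facts(text):
    """Return facts in the Company(...) / CEO(...) notation used above."""
    facts = []
    for match in CEO_PATTERN.finditer(text):
        person = match.group("person")
        company = match.group("company")
        facts.append(f"Company({company})")
        facts.append(f"CEO({person}, {company})")
    return facts


text = "Tim Cook, the CEO of Apple Inc., spoke at the event on Monday."
print(extract_facts(text))
# ['Company(Apple Inc.)', 'CEO(Tim Cook, Apple Inc.)']
```

A single pattern like this misses most phrasings ("Apple's chief executive Tim Cook", for instance), which is exactly why free text is harder to extract from than templated pages.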
A key difference between information extraction and wrapper-based extraction is that for information extraction, there is no template governing where entity mentions or relations appear in the text. When the data is embedded in HTML pages, wrapper-based extraction tools often need to be applied first, before information extraction can begin.
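The two-stage order described above can be sketched as follows. This is a simplification: the first stage merely strips markup with the standard-library HTML parser instead of applying a real wrapper, and the second stage is a single hypothetical pattern. The names TextOnly, html_to_text and CEO_SENTENCE, as well as the sample page, are invented for the example.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Stage 1: collect the text of a page, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip = 0     # depth inside script/style elements
        self.chunks = []   # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Stage 2: a hypothetical pattern applied to the plain text from stage 1.
CEO_SENTENCE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) is the CEO of "
    r"(?P<company>(?:[A-Z][a-z]+ )+Inc\.)"
)

html = (
    "<html><head><style>p {color: red}</style></head><body>"
    '<p><b>Tim Cook</b> is the CEO of <a href="/apple">Apple Inc.</a></p>'
    "</body></html>"
)

plain = html_to_text(html)
match = CEO_SENTENCE.search(plain)
print(plain)
if match:
    print(match.group("person"), "|", match.group("company"))
```

Running the pattern directly on the raw HTML would fail here, because the tags split the sentence apart; recovering the text first is what makes the second stage possible.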
Here is a non-exhaustive list of tools, in no particular order:
Even with all these tools available, it is still not easy to detect and extract tables from Excel, plain text, PDF, HTML, and other formats. A tool that can semi-automatically extract tables from files of different formats, by interacting with users, would be highly desirable.
Know of more cool tools to add to the list? Tell us about them.