Machine learning of information extraction procedures
Automatic fact retrieval from text documents is becoming one of the key technologies of the Information Age. One category of Intelligent Information Systems aims to support the user in searching for and retrieving valuable information from data resources such as intranets or the World Wide Web, which contains billions of web pages and linked documents. Until now, most existing systems have been restricted to document retrieval tasks, and only a few hand-tailored systems allow the user to query and retrieve facts from the vast amount of information available online. In the last decade, several approaches have been developed in the Information Extraction (IE) research area that automatically construct (learn) extraction procedures, so-called wrappers. Wrappers allow documents to be interpreted and accessed like relational databases. They form one of the core components of future Intelligent Information Systems, since they allow the user to query, compare and combine information from various textual information sources. This thesis presents a Logic Programming and Inductive Logic Programming (ILP) framework for supervised learning of wrappers from positive examples only. In contrast to existing systems, which adapt some methods from the Artificial Intelligence subfield of Inductive Logic Programming, the machine learning approach presented here is a purely logical bottom-up approach under a new IE-ILP semantics. The learning approach for multi-slot extraction programs is independent of the chosen wrapper model and document view. Three classes of ILP algorithms are presented: two one-step learning algorithms, a set of iterative learning algorithms, and one algorithm that combines clustering techniques with an iterative ILP algorithm. Several extraction tasks are investigated and a formal definition of wrapper classes is given.
Based on these wrapper classes, three wrapper models are presented using two different document representations: a sequential token representation and a DOM-related representation. The learning algorithms and wrapper models introduced are evaluated on standard test cases and compared with related methods and with machine-learning-based information extraction systems. For some of the single-slot extraction tasks, the implemented methods yield better results than the best state-of-the-art systems. Learned wrappers for multi-slot extraction tasks show promising quality scores that are competitive with those of the leading extraction systems.
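
To make the notion of a wrapper learned from positive examples more concrete, the following minimal sketch induces a single-slot wrapper as a pair of left and right delimiters. It is not the ILP framework of the thesis: it swaps in a much simpler delimiter-based induction and works on raw characters rather than on a token or DOM representation. The function names `learn_lr_wrapper` and `apply_wrapper` and the sample documents are hypothetical and chosen only for illustration.

```python
from os.path import commonprefix


def learn_lr_wrapper(examples):
    """Induce (left, right) delimiters from positive examples only.

    examples: list of (document, target) pairs, where target is the
    text to extract and occurs in document. Generalization is
    bottom-up: start from the full context of each example and keep
    only what all examples share.
    """
    lefts, rights = [], []
    for doc, target in examples:
        start = doc.index(target)          # first occurrence of the target
        end = start + len(target)
        lefts.append(doc[:start])          # everything before the target
        rights.append(doc[end:])           # everything after the target
    # Longest common suffix of the left contexts (via reversed prefixes).
    left = commonprefix([s[::-1] for s in lefts])[::-1]
    # Longest common prefix of the right contexts.
    right = commonprefix(rights)
    return left, right


def apply_wrapper(wrapper, doc):
    """Return every substring of doc enclosed by the learned delimiters."""
    left, right = wrapper
    if not left or not right:
        return []                          # no usable delimiters were learned
    results, pos = [], 0
    while True:
        start = doc.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = doc.find(right, start)
        if end == -1:
            break
        results.append(doc[start:end])
        pos = end + len(right)
    return results


# Two positive examples: a document and the fact to extract from it.
examples = [
    ("<li><b>Ada</b>, 1815</li>", "Ada"),
    ("<p><b>Alan</b>: 1912</p>", "Alan"),
]
wrapper = learn_lr_wrapper(examples)
names = apply_wrapper(wrapper, "<td><b>Grace</b> 1906</td><td><b>Edsger</b></td>")
```

Applied to the unseen document, the learned delimiters return one value per match, which is the sense in which a wrapper lets a document be read like a relation. A multi-slot wrapper would learn one delimiter pair per attribute and return tuples instead of single values.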