Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: Information Extraction

Information Extraction (IE) is a shallow form of text understanding that extracts substrings about prespecified types of entities or relationships from documents and web pages. Our work has focused on machine learning methods that induce information extractors from manually labeled training examples. Our recent work has focussed on IE for bioinformatics.

The RISE web site is a useful general information resource on IE.

  1. University of Texas at Austin KBP 2013 Slot Filling System: Bayesian Logic Programs for Textual Inference
    [Details] [PDF]
    Yinon Bentor and Amelia Harrison and Shruti Bhosale and Raymond Mooney
    In Proceedings of the Sixth Text Analysis Conference (TAC 2013), 2013.
    This document describes the University of Texas at Austin 2013 system for the Knowledge Base Population (KBP) English Slot Filling (SF) task. The UT Austin system builds upon the output of an existing relation extractor by augmenting relations that are explicitly stated in the text with ones that are inferred from the stated relations using probabilistic rules that encode commonsense world knowledge. Such rules are learned from linked open data and are encoded in the form of Bayesian Logic Programs (BLPs), a statistical relational learning framework based on directed graphical models. In this document, we describe our methods for learning these rules, estimating their associated weights, and performing probabilistic and logical inference to infer unseen relations. In the KBP SF task, our system was able to infer several unextracted relations, but its performance was limited by the base level extractor.
    ML ID: 299
  2. Online Inference-Rule Learning from Natural-Language Extractions
    [Details] [PDF] [Poster]
    Sindhu Raghavan and Raymond J. Mooney
    In Proceedings of the 3rd Statistical Relational AI (StaRAI-13) workshop at AAAI '13, July 2013.
    In this paper, we consider the problem of learning commonsense knowledge in the form of first-order rules from incomplete and noisy natural-language extractions produced by an off-the-shelf information extraction (IE) system. Much of the information conveyed in text must be inferred from what is explicitly stated since easily inferable facts are rarely mentioned. The proposed rule learner accounts for this phenomenon by learning rules in which the body of the rule contains relations that are usually explicitly stated, while the head employs a less-frequently mentioned relation that is easily inferred. The rule learner processes training examples in an online manner to allow it to scale to large text corpora. Furthermore, we propose a novel approach to weighting rules using a curated lexical ontology like WordNet. The learned rules along with their parameters are then used to infer implicit information using a Bayesian Logic Program. Experimental evaluation on a machine reading testbed demonstrates the efficacy of the proposed methods.
    ML ID: 287
  3. Bayesian Logic Programs for Plan Recognition and Machine Reading
    [Details] [PDF] [Slides]
    Sindhu Raghavan
    PhD Thesis, Department of Computer Science, University of Texas at Austin, December 2012. 170.
    Several real world tasks involve data that is uncertain and relational in nature. Traditional approaches like first-order logic and probabilistic models either deal with structured data or uncertainty, but not both. To address these limitations, statistical relational learning (SRL), a new area in machine learning integrating both first-order logic and probabilistic graphical models, has emerged in the recent past. The advantage of SRL models is that they can handle both uncertainty and structured/relational data. As a result, they are widely used in domains like social network analysis, biological data analysis, and natural language processing. Bayesian Logic Programs (BLPs), which integrate both first-order logic and Bayesian networks are a powerful SRL formalism developed in the recent past. In this dissertation, we develop approaches using BLPs to solve two real world tasks -- plan recognition and machine reading.

    Plan recognition is the task of predicting an agent's top-level plans based on its observed actions. It is an abductive reasoning task that involves inferring cause from effect. In the first part of the dissertation, we develop an approach to abductive plan recognition using BLPs. Since BLPs employ logical deduction to construct the networks, they cannot be used effectively for abductive plan recognition as is. Therefore, we extend BLPs to use logical abduction to construct Bayesian networks and call the resulting model Bayesian Abductive Logic Programs (BALPs).

    In the second part of the dissertation, we apply BLPs to the task of machine reading, which involves automatic extraction of knowledge from natural language text. Most information extraction (IE) systems identify facts that are explicitly stated in text. However, much of the information conveyed in text must be inferred from what is explicitly stated since easily inferable facts are rarely mentioned. Human readers naturally use common sense knowledge and "read between the lines" to infer such implicit information from the explicitly stated facts. Since IE systems do not have access to common sense knowledge, they cannot perform deeper reasoning to infer implicitly stated facts. Here, we first develop an approach using BLPs to infer implicitly stated facts from natural language text. It involves learning uncertain common sense knowledge in the form of probabilistic first-order rules by mining a large corpus of automatically extracted facts using an existing rule learner. These rules are then used to derive additional facts from extracted information using BLP inference. We then develop an online rule learner that handles the concise, incomplete nature of natural-language text and learns first-order rules from noisy IE extractions. Finally, we develop a novel approach to calculate the weights of the rules using a curated lexical ontology like WordNet.

    Both tasks described above involve inference and learning from partially observed or incomplete data. In plan recognition, the underlying cause or the top-level plan that resulted in the observed actions is not known or observed. Further, only a subset of the executed actions can be observed by the plan recognition system resulting in partially observed data. Similarly, in machine reading, since some information is implicitly stated, they are rarely observed in the data. In this dissertation, we demonstrate the efficacy of BLPs for inference and learning from incomplete data. Experimental comparison on various benchmark data sets on both tasks demonstrate the superior performance of BLPs over state-of-the-art methods.

    ML ID: 280
  4. Learning to "Read Between the Lines" using Bayesian Logic Programs
    [Details] [PDF] [Slides]
    Sindhu Raghavan and Raymond J. Mooney and Hyeonseo Ku
    In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012), 349--358, July 2012.
    Most information extraction (IE) systems identify facts that are explicitly stated in text. However, in natural language, some facts are implicit, and identifying them requires "reading between the lines". Human readers naturally use common sense knowledge to infer such implicit information from the explicitly stated facts. We propose an approach that uses Bayesian Logic Programs (BLPs), a statistical relational model combining first-order logic and Bayesian networks, to infer additional implicit information from extracted facts. It involves learning uncertain commonsense knowledge (in the form of probabilistic first-order rules) from natural language text by mining a large corpus of automatically extracted facts. These rules are then used to derive additional facts from extracted information using BLP inference. Experimental evaluation on a benchmark data set for machine reading demonstrates the efficacy of our approach.
    ML ID: 270
  5. Extending Bayesian Logic Programs for Plan Recognition and Machine Reading
    [Details] [PDF] [Slides]
    Sindhu V. Raghavan
    Technical Report, PhD proposal, Department of Computer Science, The University of Texas at Austin, May 2011.

    Statistical relational learning (SRL) is the area of machine learning that integrates both first-order logic and probabilistic graphical models. The advantage of these formalisms is that they can handle both uncertainty and structured/relational data. As a result, they are widely used in domains like social network analysis, biological data analysis, and natural language processing. Bayesian Logic Programs (BLPs), which integrate both first-order logic and Bayesian networks are a powerful SRL formalism developed in the recent past. In this proposal, we focus on applying BLPs to two real worlds tasks -- plan recognition and machine reading.

    Plan recognition is the task of predicting an agent's top-level plans based on its observed actions. It is an abductive reasoning task that involves inferring cause from effect. In the first part of the proposal, we develop an approach to abductive plan recognition using BLPs. Since BLPs employ logical deduction to construct the networks, they cannot be used effectively for plan recognition as is. Therefore, we extend BLPs to use logical abduction to construct Bayesian networks and call the resulting model Bayesian Abductive Logic Programs (BALPs). Experimental evaluation on three benchmark data sets demonstrate that BALPs outperform the existing state-of-art methods like Markov Logic Networks (MLNs) for plan recognition.

    For future work, we propose to apply BLPs to the task of machine reading, which involves automatic extraction of knowledge from natural language text. Present day information extraction (IE) systems that are trained for machine reading are limited by their ability to extract only factual information that is stated explicitly in the text. We propose to improve the performance of an off-the-shelf IE system by inducing general knowledge rules about the domain using the facts already extracted by the IE system. We then use these rules to infer additional facts using BLPs, thereby improving the recall of the underlying IE system. Here again, the standard inference used in BLPs cannot be used to construct the networks. So, we extend BLPs to perform forward inference on all facts extracted by the IE system and then construct the ground Bayesian networks. We initially use an existing inductive logic programming (ILP) based rule learner to learn the rules. In the longer term, we would like to develop a rule/structure learner that is capable of learning an even better set of first-order rules for BLPs.

    ML ID: 258
  6. Joint Entity and Relation Extraction using Card-Pyramid Parsing
    [Details] [PDF] [Slides]
    Rohit J. Kate and Raymond J. Mooney
    In Proceedings of the Fourteenth Conference on Computational Natural Language Learning (CoNLL-2010), 203--212, Uppsala, Sweden, July 2010.
    Both entity and relation extraction can benefit from being performed jointly, allowing each task to correct the errors of the other. We present a new method for joint entity and relation extraction using a graph we call a "card-pyramid". This graph compactly encodes all possible entities and relations in a sentence, reducing the task of their joint extraction to jointly labeling its nodes. We give an efficient labeling algorithm that is analogous to parsing using dynamic programming. Experimental results show improved results for our joint extraction method compared to a pipelined approach.
    ML ID: 247
  7. Learning for Information Extraction: From Named Entity Recognition and Disambiguation To Relation Extraction
    [Details] [PDF]
    Razvan Constantin Bunescu
    PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2007. 150 pages. Also as Technical Report AI07-345, Artificial Intelligence Lab, University of Texas at Austin, August 2007.
    Information Extraction, the task of locating textual mentions of specific types of entities and their relationships, aims at representing the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, or the semantic web. The goal of our research is to design information extraction models that obtain improved performance by exploiting types of evidence that have not been explored in previous approaches. Since designing an extraction system through introspection by a domain expert is a laborious and time consuming process, the focus of this thesis will be on methods that automatically induce an extraction model by training on a dataset of manually labeled examples.

    Named Entity Recognition is an information extraction task that is concerned with finding textual mentions of entities that belong to a predefined set of categories. We approach this task as a phrase classification problem, in which candidate phrases from the same document are collectively classified. Global correlations between candidate entities are captured in a model built using the expressive framework of Relational Markov Networks. Additionally, we propose a novel tractable approach to phrase classification for named entity recognition based on a special Junction Tree representation.

    Classifying entity mentions into a predefined set of categories achieves only a partial disambiguation of the names. This is further refined in the task of Named Entity Disambiguation, where names need to be linked to their actual denotations. In our research, we use Wikipedia as a repository of named entities and propose a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy.

    Relation Extraction refers to finding relevant relationships between entities mentioned in text documents. Our approaches to this information extraction task differ in the type and the amount of supervision required. We first propose two relation extraction methods that are trained on documents in which sentences are manually annotated for the required relationships. In the first method, the extraction patterns correspond to sequences of words and word classes anchored at two entity names occurring in the same sentence. These are used as implicit features in a generalized subsequence kernel, with weights computed through training of Support Vector Machines. In the second approach, the implicit extraction features are focused on the shortest path between the two entities in the word-word dependency graph of the sentence. Finally, in a significant departure from previous learning approaches to relation extraction, we propose reducing the amount of required supervision to only a handful of pairs of entities known to exhibit or not exhibit the desired relationship. Each pair is associated with a bag of sentences extracted automatically from a very large corpus. We extend the subsequence kernel to handle this weaker form of supervision, and describe a method for weighting features in order to focus on those correlated with the target relation rather than with the individual entities. The resulting Multiple Instance Learning approach offers a competitive alternative to previous relation extraction methods, at a significantly reduced cost in human supervision.

    ML ID: 213
  8. Learning to Extract Relations from the Web using Minimal Supervision
    [Details] [PDF]
    Razvan C. Bunescu and Raymond J. Mooney
    In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic, June 2007.
    We present a new approach to relation extraction that requires only a handful of training examples. Given a few pairs of named entities known to exhibit or not exhibit a particular relation, bags of sentences containing the pairs are extracted from the web. We extend an existing relation extraction method to handle this weaker form of supervision, and present experimental results demonstrating that our approach can reliably extract relations from web documents.
    ML ID: 204
  9. Extracting Relations from Text: From Word Sequences to Dependency Paths
    [Details] [PDF]
    Razvan C. Bunescu and Raymond J. Mooney
    In A. Kao and S. Poteet, editors, Natural Language Processing and Text Mining, 29-44, Berlin, 2007. Springer Verlag.
    ML ID: 186
  10. Statistical Relational Learning for Natural Language Information Extraction
    [Details] [PDF]
    Razvan Bunescu and Raymond J. Mooney
    In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning, 535-552, Cambridge, MA, 2007. MIT Press.
    Understanding natural language presents many challenging problems that lend themselves to statistical relational learning (SRL). Historically, both logical and probabilistic methods have found wide application in natural language processing (NLP). NLP inevitably involves reasoning about an arbitrary number of entities (people, places, and things) that have an unbounded set of complex relationships between them. Representing and reasoning about unbounded sets of entities and relations has generally been considered a strength of predicate logic. However, NLP also requires integrating uncertain evidence from a variety of sources in order to resolve numerous syntactic and semantic ambiguities. Effectively integrating multiple sources of uncertain evidence has generally been considered a strength of Bayesian probabilistic methods and graphical models. Consequently, NLP problems are particularly suited for SRL methods that combine the strengths of first-order predicate logic and probabilistic graphical models. In this article, we review our recent work on using Relational Markov Networks (RMNs) for information extraction, the problem of identifying phrases in natural language text that refer to specific types of entities. We use the expressive power of RMNs to represent and reason about several specific relationships between candidate entities and thereby collectively identify the appropriate set of phrases to extract. We present experiments on learning to extract protein names from biomedical text, which demonstrate the advantage of this approach over existing IE methods.
    ML ID: 165
  11. Learnable Similarity Functions and Their Application to Record Linkage and Clustering
    [Details] [PDF]
    Mikhail Bilenko
    PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2006. 136 pages.
    Many machine learning and data mining tasks depend on functions that estimate similarity between instances. Similarity computations are particularly important in clustering and information integration applications, where pairwise distances play a central role in many algorithms. Typically, algorithms for these tasks rely on pre-defined similarity measures, such as edit distance or cosine similarity for strings, or Euclidean distance for vector-space data. However, standard distance functions are frequently suboptimal as they do not capture the appropriate notion of similarity for a particular domain, dataset, or application.

    In this thesis, we present several approaches for addressing this problem by employing learnable similarity functions. Given supervision in the form of similar or dissimilar pairs of instances, learnable similarity functions can be trained to provide accurate estimates for the domain and task at hand. We study the problem of adapting similarity functions in the context of several tasks: record linkage, clustering, and blocking. For each of these tasks, we present learnable similarity functions and training algorithms that lead to improved performance.

    In record linkage, also known as duplicate detection and entity matching, the goal is to identify database records referring to the same underlying entity. This requires estimating similarity between corresponding field values of records, as well as overall similarity between records. For computing field-level similarity between strings, we describe two learnable variants of edit distance that lead to improvements in linkage accuracy. For learning record-level similarity functions, we employ Support Vector Machines to combine similarities of individual record fields in proportion to their relative importance, yielding a high-accuracy linkage system. We also investigate strategies for efficient collection of training data which can be scarce due to the pairwise nature of the record linkage task.

    In clustering, similarity functions are essential as they determine the grouping of instances that is the goal of clustering. We describe a framework for integrating learnable similarity functions within a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs). The framework accommodates learning various distance measures, including those based on Bregman divergences (e.g., parameterized Mahalanobis distance and parameterized KL-divergence), as well as directional measures (e.g., cosine similarity). Thus, it is applicable to a wide range of domains and data representations. Similarity functions are learned within the HMRF-KMeans algorithm derived from the framework, leading to significant improvements in clustering accuracy.

    The third application we consider, blocking, is critical in making record linkage and clustering algorithms scalable to large datasets, as it facilitates efficient selection of approximately similar instance pairs without explicitly considering all possible pairs. Previously proposed blocking methods require manually constructing a similarity function or a set of similarity predicates, followed by hand-tuning of parameters. We propose learning blocking functions automatically from linkage and semi-supervised clustering supervision, which allows automatic construction of blocking methods that are efficient and accurate. This approach yields computationally cheap learnable similarity functions that can be used for scaling up in a variety of tasks that rely on pairwise distance computations, including record linkage and clustering.

    ML ID: 193
  12. Integrating Co-occurrence Statistics with Information Extraction for Robust Retrieval of Protein Interactions from Medline
    [Details] [PDF]
    Razvan Bunescu, Raymond Mooney, Arun Ramani and Edward Marcotte
    In Proceedings of the HLT-NAACL Workshop on Linking Natural Language Processing and Biology (BioNLP'06), 49-56, New York, NY, June 2006.
    The task of mining relations from collections of documents is usually approached in two different ways. One type of systems do relation extraction from individual sentences, followed by an aggregation of the results over the entire collection. Other systems follow an entirely different approach, in which co-occurrence counts are used to determine whether the mentioning together of two entities is due to more than simple chance. We show that increased extraction performance can be obtained by combining the two approaches into an integrated relation extraction model.
    ML ID: 188
  13. Using Encyclopedic Knowledge for Named Entity Disambiguation
    [Details] [PDF]
    Razvan Bunescu and Marius Pasca
    In Proceesings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06), 9-16, Trento, Italy, 2006.
    We present a new method for detecting and disambiguating named entities in open domain text. A disambiguation SVM kernel is trained to exploit the high coverage and rich structure of the knowledge encoded in an online encyclopedia. The resulting model significantly outperforms a less informed baseline.
    ML ID: 185
  14. Subsequence Kernels for Relation Extraction
    [Details] [PDF]
    Razvan Bunescu and Raymond J. Mooney
    In Submitted to the Ninth Conference on Natural Language Learning (CoNLL-2005), Ann Arbor, MI, July 2006. Available at url{http://www.cs.utexas.edu/users/ml/publication/ie.html}.
    We present a new kernel method for extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels. This kernel uses three types of subsequence patterns that are typically employed in natural language to assert relationships between two entities. Experiments on extracting protein interactions from biomedical corpora and top-level relations from newspaper corpora demonstrate the advantages of this approach.
    ML ID: 169
  15. A Shortest Path Dependency Kernel for Relation Extraction
    [Details] [PDF]
    R. C. Bunescu, and Raymond J. Mooney
    In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP-05), 724-731, Vancouver, BC, October 2005.
    We present a novel approach to relation extraction, based on the observation that the information required to assert a relationship between two named entities in the same sentence is typically captured by the shortest path between the two entities in the dependency graph. Experiments on extracting top-level relations from the ACE (Automated Content Extraction) newspaper corpus show that the new shortest path dependency kernel outperforms a recent approach based on dependency tree kernels.
    ML ID: 175
  16. Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome
    [Details] [PDF]
    A.K. Ramani, R.C. Bunescu, Raymond J. Mooney and E.M. Marcotte
    Genome Biology, 6(5):r40, 2005.
    Background

    Extensive protein interaction maps are being constructed for yeast, worm, and fly to ask how the proteins organize into pathways and systems, but no such genome-wide interaction map yet exists for the set of human proteins. To prepare for studies in humans, we wished to establish tests for the accuracy of future interaction assays and to consolidate the known interactions among human proteins.

    Results

    We established two tests of the accuracy of human protein interaction datasets and measured the relative accuracy of the available data. We then developed and applied natural language processing and literature-mining algorithms to recover from Medline abstracts 6,580 interactions among 3,737 human proteins. A three-part algorithm was used: first, human protein names were identified in Medline abstracts using a discriminator based on conditional random fields, then interactions were identified by the co-occurrence of protein names across the set of Medline abstracts, filtering the interactions with a Bayesian classifier to enrich for legitimate physical interactions. These mined interactions were combined with existing interaction data to obtain a network of 31,609 interactions among 7,748 human proteins, accurate to the same degree as the existing datasets.

    Conclusion

    These interactions and the accuracy benchmarks will aid interpretation of current functional genomics data and provide a basis for determining the quality of future large-scale human protein interaction assays. Projecting from the approximately 15 interactions per protein in the best-sampled interaction set to the estimated 25,000 human genes implies more than 375,000 interactions in the complete human protein interaction network. This set therefore represents no more than 10% of the complete network.
    ML ID: 172
  17. Mining Knowledge from Text Using Information Extraction
    [Details] [PDF]
    Raymond J. Mooney and R. Bunescu
    SIGKDD Explorations (special issue on Text Mining and Natural Language Processing), 7(1):3-10, 2005.
    An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
    ML ID: 170
  18. Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions
    [Details] [PDF]
    A. Ramani, E. Marcotte, R. Bunescu and Raymond J. Mooney
    In Proceedings of the ISMB/ACL-05 Workshop of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.
    This paper presents the results of a large-scale effort to construct a comprehensive database of known human protein interactions by combining and linking known interactions from existing databases and then adding to them by automatically mining additional interactions from 750,000 Medline abstracts. The end result is a network of 31,609 interactions amongst 7,748 proteins. The text mining system first identifies protein names in the text using a trained Conditional Random Field (CRF) and then identifies interactions through a filtered co-citation analysis. We also report two new strategies for mining interactions, either by finding explicit statements of interactions in the text using learned pattern-based rules or a Support-Vector Machine using a string kernel. Using information in existing ontologies, the automatically extracted data is shown to be of equivalent accuracy to manually curated data sets.
    ML ID: 164
  19. Learning for Collective Information Extraction
    [Details] [PDF]
    Razvan C. Bunescu
    Technical Report TR-05-02, Department of Computer Sciences, University of Texas at Austin, October 2005. Ph.D. proposal.
    An Information Extraction (IE) system analyses a set of documents with the aim of identifying certain types of entities and relations between them. Most IE systems treat separate potential extractions as independent. However, in many cases, considering influences between different candidate extractions could improve overall accuracy. For example, phrase repetitions inside a document are usually associated with the same entity type, the same being true for acronyms and their corresponding long form. One of our goals in this thesis is to show how these and potentially other types of correlations can be captured by a particular type of undirected probabilistic graphical model. Inference and learning using this graphical model allows for collective information extraction in a way that exploits the mutual influence between possible extractions. Preliminary experiments on learning to extract named entities from biomedical and newspaper text demonstrate the advantages of our approach.
    The benefit of doing collective classification comes however at a cost: in the general case, exact inference in the resulting graphical model has an exponential time complexity. The standard solution, which is also the one that we used in our initial work, is to resort to approximate inference. In this proposal we show that by considering only a selected subset of mutual influences between candidate extractions, exact inference can be done in linear time. Consequently, a short term goal is to run comparative experiments that would help us choose between the two approaches: exact inference with a restricted subset of mutual influences or approximate inference with the full set of influences.
    The set of issues that we intend to investigate in future work is two fold. One direction refers to applying the already developed framework to other natural language tasks that may benefit from the same types of influences, such as word sense disambiguation and part-of-speech tagging. Another direction concerns the design of a sufficiently general framework that would allow a seamless integration of cues from a variety of knowledge sources. We contemplate using generic sources such as external dictionaries, or web statistics on discriminative textual patterns. We also intend to alleviate the modeling problems due to the intrinsic local nature of entity features by exploiting syntactic information. All these generic features will be input to a feature selection algorithm, so that in the end we obtain a model which is both compact and accurate.
    ML ID: 155
  20. Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
    [Details] [PDF]
    Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong
    Artificial Intelligence in Medicine (special issue on Summarization and Information Extraction from Medical Documents)(2):139-155, 2005.
    Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and hidden Markov models are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions more accurately than manually-developed rules.
    ML ID: 137
  21. Text Mining with Information Extraction
    [Details] [PDF]
    Un Yong Nahm
    PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2004. 217 pages. Also appears as Technical Report UT-AI-TR-04-311.
    The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this dissertation, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems.
    We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other sources. We introduce the notion of discovering ``soft-matching'' rules from text and present two new learning algorithms. TextRISE is an inductive method for learning soft-matching prediction rules that integrates rule-based and instance-based learning methods. Simple, interpretable rules are discovered using rule induction, while a nearest-neighbor algorithm provides soft matching. SoftApriori is a text-mining algorithm for discovering association rules from texts that uses a similarity measure to allow flexible matching to variable database items. We present experimental results on inducing prediction and association rules from natural-language texts demonstrating that TextRISE and SoftApriori learn more accurate rules than previous methods for these tasks. We also present an approach to using rules mined from extracted data to improve the accuracy of information extraction. Experimental results demonstate that such discovered patterns can be used to effectively improve the underlying IE method.
    ML ID: 153
  22. Collective Information Extraction with Relational Markov Networks
    [Details] [PDF]
    Razvan Bunescu and Raymond J. Mooney
    In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 439-446, Barcelona, Spain, July 2004.
    Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks (a generalization of CRFs), which can represent arbitrary dependencies between extractions. This allows for ``collective information extraction'' that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
    ML ID: 152
  23. Using Soft-Matching Mined Rules to Improve Information Extraction
    [Details] [PDF]
    Un Yong Nahm and Raymond J. Mooney
    In Proceedings of the AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM-2004), 27-32, San Jose, CA, July 2004.
    By discovering predictive relationships between different pieces of extracted data, data-mining algorithms can be used to improve the accuracy of information extraction. However, textual variation due to typos, abbreviations, and other sources can prevent the productive discovery and utilization of hard-matching rules. Recent methods for inducing soft-matching rules from extracted data can more effectively find and exploit predictive relationships in textual data. This paper presents techniques for using mined soft-matching association rules to increase the accuracy of information extraction. Experimental results on a corpus of computer-science job postings demonstrate that soft-matching rules improve information extraction more effectively than hard-matching rules.
    ML ID: 150
  24. Relational Markov Networks for Collective Information Extraction
    [Details] [PDF]
    Razvan Bunescu and Raymond J. Mooney
    In Proceedings of the ICML-04 Workshop on Statistical Relational Learning and its Connections to Other Fields, Banff, Alberta, July 2004.
    Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks, which can represent arbitrary dependencies between extractions. This allows for ``collective information extraction'' that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
    ML ID: 145
  25. Learning to Extract Proteins and their Interactions from Medline Abstracts
    [Details] [PDF]
    Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Raymond J. Mooney, Yuk Wah Wong, Edward M. Marcotte, and Arun Kumar Ramani
    In Proceedings of the ICML-03 Workshop on Machine Learning in Bioinformatics, 46-53, Washington, DC, August 2003.
    We present results from a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and hidden Markov models are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
    ML ID: 126
  26. Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction
    [Details] [PDF]
    Mary Elaine Califf and Raymond J. Mooney
    Journal of Machine Learning Research:177-210, 2003.
    Information Extraction is a form of shallow text processing that locates a specified set of relevant items in a natural-language document. Systems for this task require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for machine learning. We present a aystem, RAPIER, that uses pairs of sample documents and filled templates to induce pattern-match rules that directly extract fillers for the slots in the template. RAPIER employs a bottom-up learning algorithm which incorporates techniques from several inductive logic programming systems and acquires unbounded patterns that include constraints on the words, part-of-speech tags, and semantic classes present in the filler and the surrounding text. We present encouraging experimental results on two domains.
    ML ID: 124
  27. Property-Based Feature Engineering and Selection
    [Details] [PDF]
    Noppadon Kamolvilassatian
    Masters Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, December 2002. 85 pages.
    Example representation is a fundamental problem in machine learning. In particular, the decision on what features are extracted and selected to be included in the learning process significantly affects learning performance.
    This work proposes a novel framework for feature representation based on feature properties and applies it to the domain of textual information extraction. Our framework enables knowledge on feature engineering and selection to be explicitly learned and applied. The application of this knowledge can improve learning performance within the domain from which it is learned and in other domains with similar representational bias.
    We conducted several experiments comparing the performance of feature engineering and selection methods based on our framework with other approaches in the Information Extraction task. Results suggested that our approach performs either competitively or better than the best heuristic-based feature selection approach used. Moreover, our general framework can potentially be combined with other feature selection approaches to yield even better results.
    ML ID: 118
  28. Extracting Gene and Protein Names from Biomedical Abstracts
    [Details] [PDF]
    Razvan Bunescu, Ruifang Ge, Raymond J. Mooney, Edward Marcotte, and Arun Kumar Ramani
    March 2002. Unpublished Technical Note.
    Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer accessible form. We are investigating the use of information extraction techniques for processing biomedical text. Currently, we have focused on the initial stage of identifying information on interacting proteins, specifically the problem of recognizin protein and gene names with high precision. We present preliminary results on extracting protein names from Medline abstracts.
    ML ID: 111
  29. ELIXIR: A Library for Writing Wrappers in Java
    [Details] [PDF]
    Edward Wild
    December 2001. Undergraduate Honor Thesis, Department of Computer Sciences, University of Texas at Austin.
    ELIXIR is a library for writing wrappers in Java. ELIXIR provides a way to combine text extraction and spidering in wrappers. Since wrappers using ELIXIR are Java programs, they are eays to integrate with other Java program. The user can also extend the functionality of ELIXIR by implement new ItemExtractors. In an experiment, a wrapper written using ELIXIR showed an 89% reduction in non-comment source statements from a wrapper written using a prototype of ELIXIR. In another experiemnt, a wrapper written using ELIXIR showed a 90% reduction in non-comment source statements from a wrapper written using SPHINX, a Java toolkit for writing spiders.
    ML ID: 109
  30. A Mutually Beneficial Integration of Data Mining and Information Extraction
    [Details] [PDF]
    Un Yong Nahm and Raymond J. Mooney
    In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), 627-632, Austin, TX, July 2000.
    Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DiscoTEX, that combines IE and data mining methodologies to perform text mining as well as improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE. Encouraging results are presented on applying these techniques to a corpus of computer job postings from an Internet newsgroup.
    ML ID: 100
  31. Relational Learning of Pattern-Match Rules for Information Extraction
    [Details] [PDF]
    Mary Elaine Califf and Raymond J. Mooney
    In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99), 328-334, Orlando, FL, July 1999.
    Information extraction is a form of shallow text processing that locates a specified set of relevant items in a natural-language document. Systems for this task require significant domain-specific knowledge and are time-consuming and difficult to build by hand, making them a good application for machine learning. This paper presents a system, Rapier, that takes pairs of sample documents and filled templates and induces pattern-match rules that directly extract fillers for the slots in the template. Rapier employs a bottom-up learning algorithm which incorporates techniques from several inductive logic programming systems and acquires unbounded patterns that include constraints on the words, part-of-speech tags, and semantic classes present in the filler and the surrounding text. We present encouraging experimental results on two domains.
    ML ID: 94
  32. Active Learning for Natural Language Parsing and Information Extraction
    [Details] [PDF]
    Cynthia A. Thompson, Mary Elaine Califf and Raymond J. Mooney
    In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), 406-414, Bled, Slovenia, June 1999.
    In natural language acquisition, it is difficult to gather the annotated data needed for supervised learning; however, unannotated data is fairly plentiful. Active learning methods attempt to select for annotation and training only the most informative examples, and therefore are potentially very useful in natural language applications. However, existing results for active learning have only considered standard classification tasks. To reduce annotation effort while maintaining accuracy, we apply active learning to two non-classification tasks in natural language processing: semantic parsing and information extraction. We show that active learning can significantly reduce the number of annotated examples required to achieve a given level of performance for these complex tasks.
    ML ID: 92
  33. Relational Learning Techniques for Natural Language Information Extraction
    [Details] [PDF]
    Mary Elaine Califf
    PhD Thesis, Department of Computer Sciences, University of Texas, Austin, TX, August 1998. 142 pages. Also appears as Artificial Intelligence Laboratory Technical Report AI 98-276.
    The recent growth of online information available in the form of natural language documents creates a greater need for computing systems with the ability to process those documents to simplify access to the information. One type of processing appropriate for many tasks is information extraction, a type of text skimming that retrieves specific types of information from text. Although information extraction systems have existed for two decades, these systems have generally been built by hand and contain domain specific information, making them difficult to port to other domains. A few researchers have begun to apply machine learning to information extraction tasks, but most of this work has involved applying learning to pieces of a much larger system. This dissertation presents a novel rule representation specific to natural language and a relational learning system, Rapier, which learns information extraction rules. Rapier takes pairs of documents and filled templates indicating the information to be extracted and learns pattern-matching rules to extract fillers for the slots in the template. The system is tested on several domains, showing its ability to learn rules for different tasks. Rapier's performance is compared to a propositional learning system for information extraction, demonstrating the superiority of relational learning for some information extraction tasks. Because one difficulty in using machine learning to develop natural language processing systems is the necessity of providing annotated examples to supervised learning systems, this dissertation also describes an attempt to reduce the number of examples Rapier requires by employing a form of active learning. Experimental results show that the number of examples required to achieve a given level of performance can be significantly reduced by this method.
    ML ID: 88
  34. Relational Learning of Pattern-Match Rules for Information Extraction
    [Details] [PDF]
    Mary Elaine Califf and Raymond J. Mooney
    In Proceedings of AAAI Spring Symposium on Applying Machine Learning to Discourse Processing, 6-11, Standford, CA, March 1998.
    Information extraction is a form of shallow text processing which locates a specified set of relevant items in natural language documents. Such systems can be useful, but require domain-specific knowledge and rules, and are time-consuming and difficult to build by hand, making infomation extraction a good testbed for the application of machine learning techniques to natural language processing. This paper presents a system, RAPIER, that takes pairs of documents and filled templates and induces pattern-match rules that directly extract fillers for the slots in the template. The learning algorithm incorporates techniques from several inductive logic programming systems and learns unbounded patterns that include constraints on the words and part-of-speech tags surrounding the filler. Encouraging results are presented on learning to extract information from computer job postings from the newsgroup misc.jobs.offered.
    ML ID: 80
  35. Relational Learning Techniques for Natural Language Information Extraction
    [Details] [PDF]
    Mary Elaine Califf
    1997. Ph.D. proposal, Department of Computer Sciences, University of Texas at Austin.
    The recent growth of online information available in the form of natural language documents creates a greater need for computing systems with the ability to process those documents to simplify access to the information. One type of processing appropriate for many tasks is information extraction, a type of text skimming that retrieves specific types of information from text. Although information extraction systems have existed for two decades, these systems have generally been built by hand and contain domain specific information, making them difficult to port to other domains. A few researchers have begun to apply machine learning to information extraction tasks, but most of this work has involved applying learning to pieces of a much larger system. This paper presents a novel rule representation specific to natural language and a learning system, RAPIER, which learns information extraction rules. RAPIER takes pairs of documents and filled templates indicating the information to be extracted and learns patterns to extract fillers for the slots in the template. This proposal presents initial results on a small corpus of computer-related job postings with a preliminary version of RAPIER. Future research will involve several enhancements to RAPIER as well as more thorough testing on several domains and extension to additional natural language processing tasks. We intend to extend the rule representation and algorithm to allow for more types of constraints than are currently supported. We also plan to incorporate active learning, or sample selection, methods, specifically query by committee, into RAPIER. These methods have the potential to substantially reduce the amount of annotation required. We will explore the issue of distinguishing relevant and irrelevant messages, since currently RAPIER only extracts from the any messages given to it, assuming that all are relevant. We also intend to run much larger tests with RAPIER on multiple domains including the terrorism domain from the third and fourth Message Uncderstanding Conferences, which will allow comparison against other systems. Finally, we plan to demonstrate the generality of RAPIER`s representation and algorithm by applying it to other natural language processing tasks such as word sense disambiguation.
    ML ID: 78
  36. Applying ILP-based Techniques to Natural Language Information Extraction: An Experiment in Relational Learning
    [Details] [PDF]
    Mary Elaine Califf and Raymond J. Mooney
    In Workshop Notes of the IJCAI-97 Workshop on Frontiers of Inductive Logic Programming, 7--11, Nagoya, Japan, August 1997.
    Information extraction systems process natural language documents and locate a specific set of relevant items. Given the recent success of empirical or corpus-based approaches in other areas of natural language processing, machine learning has the potential to significantly aid the development of these knowledge-intensive systems. This paper presents a system, RAPIER, that takes pairs of documents and filled templates and induces pattern-match rules that directly extract fillers for the slots in the template. The learning algorithm incorporates techniques from several inductive logic programming systems and learns unbounded patterns that include constraints on the words and part-of-speech tags surrounding the filler. Encouraging results are presented on learning to extract information from computer job postings from the newsgroup misc.jobs.offered.
    ML ID: 76
  37. Relational Learning of Pattern-Match Rules for Information Extraction
    [Details] [PDF]
    Mary Elaine Califf and Raymond J. Mooney
    In Proceedings of the ACL Workshop on Natural Language Learning, 9-15, Madrid, Spain, July 1997.
    Information extraction systems process natural language documents and locate a specific set of relevant items. Given the recent success of empirical or corpus-based approaches in other areas of natural language processing, machine learning has the potential to significantly aid the development of these knowledge-intensive systems. This paper presents a system, RAPIER, that takes pairs of documents and filled templates and induces pattern-match rules that directly extract fillers for the slots in the template. The learning algorithm incorporates techniques from several inductive logic programming systems and learns unbounded patterns that include constraints on the words and part-of-speech tags surrounding the filler. Encouraging results are presented on learning to extract information from computer job postings from the newsgroup misc.jobs.offered.
    ML ID: 74