Machine Learning Research Group | University of Texas

Publications: Bioinformatics

Bioinformatics concerns the development of computer databases and algorithms for learning, managing and processing biological information. Currently we are focusing on extracting structured information such as protein names and relationships from biological documents using natural language learning for information extraction.

By mining over 750,000 Medline abstracts for human protein interactions and integrating the results with existing databases, we have developed a fairly comprehensive database of 31,609 known human protein interactions. The resulting database is accessible through a web interface at Human Gene ID-SERVE.

Sparse Meets Dense: A Hybrid Approach to Enhance Scientific Document Retrieval
[Details] [PDF] [Slides (PPT)]
Priyanka Mandikal, Raymond Mooney
In The 4th Workshop on Scientific Document Understanding, AAAI, February 2024.
Traditional information retrieval is based on sparse bag-of-words vector representations of documents and queries. More recent deep-learning approaches have used dense embeddings learned using a transformer-based large language model. We show that, on a classic benchmark for scientific document retrieval in the medical domain of cystic fibrosis, both of these models perform roughly equivalently. Notably, dense vectors from the state-of-the-art SPECTER2 model do not significantly enhance performance. However, a hybrid model that we propose, combining these methods, yields significantly better results, underscoring the merits of integrating classical and contemporary deep learning techniques in information retrieval in the domain of specialized scientific documents.
ML ID: 425
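One common way to combine sparse and dense retrieval, as described in the abstract above, is to interpolate the two score lists after normalizing them. The sketch below is illustrative only: the abstract does not say how the hybrid model merges its signals, so the min-max normalization, the weight `alpha`, and the function names `min_max` and `hybrid_rank` are all assumptions.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; map everything to 0.0 if all scores are equal."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float],
                dense: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank documents by alpha * sparse + (1 - alpha) * dense (assumed scheme).

    `sparse` and `dense` map document ids to raw retrieval scores for one
    query, e.g. bag-of-words similarity and embedding cosine similarity.
    A document missing from one list contributes 0.0 for that component.
    """
    sparse_n, dense_n = min_max(sparse), min_max(dense)
    docs = set(sparse) | set(dense)
    combined = {
        d: alpha * sparse_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
        for d in docs
    }
    # Highest combined score first; ties broken by document id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))
```

Setting `alpha=1.0` reduces this to the sparse ranking and `alpha=0.0` to the dense ranking, so the weight can be tuned on held-out queries.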
Discriminative Structure and Parameter Learning for Markov Logic Networks
[Details] [PDF] [Slides (PPT)]
Tuyen N. Huynh and Raymond J. Mooney
In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, July 2008.
Markov logic networks (MLNs) are an expressive representation for statistical relational learning that generalizes both first-order logic and graphical models. Existing methods for learning the logical structure of an MLN are not discriminative; however, many relational learning problems involve specific target predicates that must be inferred from given background information. We found that existing MLN methods perform very poorly on several such ILP benchmark problems, and we present improved discriminative methods for learning MLN clauses and weights that outperform existing MLN and traditional ILP methods.
ML ID: 220
Integrating Co-occurrence Statistics with Information Extraction for Robust Retrieval of Protein Interactions from Medline
[Details] [PDF]
Razvan Bunescu, Raymond Mooney, Arun Ramani and Edward Marcotte
In Proceedings of the HLT-NAACL Workshop on Linking Natural Language Processing and Biology (BioNLP'06), 49-56, New York, NY, June 2006.
The task of mining relations from collections of documents is usually approached in two different ways. One type of system performs relation extraction from individual sentences, followed by an aggregation of the results over the entire collection. Other systems follow an entirely different approach, in which co-occurrence counts are used to determine whether two entities are mentioned together more often than would be expected by chance. We show that increased extraction performance can be obtained by combining the two approaches into an integrated relation extraction model.
ML ID: 188
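To make the co-occurrence idea in the abstract above concrete, the sketch below scores each protein pair by pointwise mutual information (PMI) over a collection of abstracts. PMI is used here only as a familiar stand-in: the abstract does not name the statistic the paper uses, and the function name `pmi_scores` and the input layout (one set of protein names per abstract) are assumptions.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def pmi_scores(abstracts: Iterable[Iterable[str]]) -> Dict[FrozenSet[str], float]:
    """Return PMI for every protein pair that co-occurs in at least one abstract.

    PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
    as the fraction of abstracts that mention the protein(s). A positive value
    means the pair co-occurs more often than independence would predict.
    """
    docs: List[Set[str]] = [set(names) for names in abstracts]
    n = len(docs)
    single: Counter = Counter()
    pair: Counter = Counter()
    for names in docs:
        single.update(names)
        pair.update(frozenset(p) for p in combinations(sorted(names), 2))

    scores: Dict[FrozenSet[str], float] = {}
    for key, count in pair.items():
        a, b = sorted(key)
        p_ab = count / n
        p_a, p_b = single[a] / n, single[b] / n
        scores[key] = math.log(p_ab / (p_a * p_b))
    return scores
```

In an integrated model of the kind the abstract describes, a score like this would be one input alongside sentence-level extraction confidence rather than a decision rule on its own.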
Consolidating the Set of Known Human Protein-Protein Interactions in Preparation for Large-Scale Mapping of the Human Interactome
[Details] [PDF]
A.K. Ramani, R.C. Bunescu, Raymond J. Mooney and E.M. Marcotte
Genome Biology, 6(5):r40, 2005.
Background

Extensive protein interaction maps are being constructed for yeast, worm, and fly to ask how the proteins organize into pathways and systems, but no such genome-wide interaction map yet exists for the set of human proteins. To prepare for studies in humans, we wished to establish tests for the accuracy of future interaction assays and to consolidate the known interactions among human proteins.

Results

We established two tests of the accuracy of human protein interaction datasets and measured the relative accuracy of the available data. We then developed and applied natural language processing and literature-mining algorithms to recover from Medline abstracts 6,580 interactions among 3,737 human proteins. A three-part algorithm was used: first, human protein names were identified in Medline abstracts using a discriminator based on conditional random fields, then interactions were identified by the co-occurrence of protein names across the set of Medline abstracts, filtering the interactions with a Bayesian classifier to enrich for legitimate physical interactions. These mined interactions were combined with existing interaction data to obtain a network of 31,609 interactions among 7,748 human proteins, accurate to the same degree as the existing datasets.

Conclusion

These interactions and the accuracy benchmarks will aid interpretation of current functional genomics data and provide a basis for determining the quality of future large-scale human protein interaction assays. Projecting from the approximately 15 interactions per protein in the best-sampled interaction set to the estimated 25,000 human genes implies more than 375,000 interactions in the complete human protein interaction network. This set therefore represents no more than 10% of the complete network.
ML ID: 172
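The projection in the Conclusion above is a simple product, and it can be checked directly. The snippet below just reproduces the abstract's own figures; the variable names are illustrative, and the calculation follows the abstract in treating the gene count as the protein count.

```python
# Figures taken from the abstract above.
interactions_per_protein = 15      # best-sampled interaction set (approximate)
estimated_human_genes = 25_000     # estimated number of human genes
known_interactions = 31_609        # size of the consolidated network

# Projected size of the complete network, as computed in the abstract.
projected_interactions = interactions_per_protein * estimated_human_genes

# Share of the projected network covered by the consolidated set.
coverage = known_interactions / projected_interactions

print(f"projected interactions: {projected_interactions:,}")
print(f"coverage: {coverage:.1%}")
```

The coverage comes out a little above 8%, consistent with the abstract's statement that the set represents no more than 10% of the complete network.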
Using Biomedical Literature Mining to Consolidate the Set of Known Human Protein-Protein Interactions
[Details] [PDF]
A. Ramani, E. Marcotte, R. Bunescu and Raymond J. Mooney
In Proceedings of the ISMB/ACL-05 Workshop of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.
This paper presents the results of a large-scale effort to construct a comprehensive database of known human protein interactions by combining and linking known interactions from existing databases and then adding to them by automatically mining additional interactions from 750,000 Medline abstracts. The end result is a network of 31,609 interactions amongst 7,748 proteins. The text mining system first identifies protein names in the text using a trained Conditional Random Field (CRF) and then identifies interactions through a filtered co-citation analysis. We also report two new strategies for mining interactions, either by finding explicit statements of interactions in the text using learned pattern-based rules or a Support-Vector Machine using a string kernel. Using information in existing ontologies, the automatically extracted data is shown to be of equivalent accuracy to manually curated data sets.
ML ID: 164
Comparative Experiments on Learning Information Extractors for Proteins and their Interactions
[Details] [PDF]
Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Edward M. Marcotte, Raymond J. Mooney, Arun Kumar Ramani, and Yuk Wah Wong
Artificial Intelligence in Medicine (special issue on Summarization and Information Extraction from Medical Documents)(2):139-155, 2005.
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and hidden Markov models are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions more accurately than manually-developed rules.
ML ID: 137
Collective Information Extraction with Relational Markov Networks
[Details] [PDF]
Razvan Bunescu and Raymond J. Mooney
In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 439-446, Barcelona, Spain, July 2004.
Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks (a generalization of CRFs), which can represent arbitrary dependencies between extractions. This allows for "collective information extraction" that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
ML ID: 152
Relational Markov Networks for Collective Information Extraction
[Details] [PDF]
Razvan Bunescu and Raymond J. Mooney
In Proceedings of the ICML-04 Workshop on Statistical Relational Learning and its Connections to Other Fields, Banff, Alberta, July 2004.
Most information extraction (IE) systems treat separate potential extractions as independent. However, in many cases, considering influences between different potential extractions could improve overall accuracy. Statistical methods based on undirected graphical models, such as conditional random fields (CRFs), have been shown to be an effective approach to learning accurate IE systems. We present a new IE method that employs Relational Markov Networks, which can represent arbitrary dependencies between extractions. This allows for "collective information extraction" that exploits the mutual influence between possible extractions. Experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach.
ML ID: 145
Learning to Extract Proteins and their Interactions from Medline Abstracts
[Details] [PDF]
Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Raymond J. Mooney, Yuk Wah Wong, Edward M. Marcotte, and Arun Kumar Ramani
In Proceedings of the ICML-03 Workshop on Machine Learning in Bioinformatics, 46-53, Washington, DC, August 2003.
We present results from a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting interactions between the proteins. We demonstrate that machine learning approaches using support vector machines and hidden Markov models are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.
ML ID: 126
Extracting Gene and Protein Names from Biomedical Abstracts
[Details] [PDF]
Razvan Bunescu, Ruifang Ge, Raymond J. Mooney, Edward Marcotte, and Arun Kumar Ramani
March 2002. Unpublished Technical Note.
Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. We are investigating the use of information extraction techniques for processing biomedical text. Currently, we have focused on the initial stage of identifying information on interacting proteins, specifically the problem of recognizing protein and gene names with high precision. We present preliminary results on extracting protein names from Medline abstracts.
ML ID: 111