A. Ramani, E. Marcotte, R. Bunescu and Raymond J. Mooney
This paper presents the results of a large-scale effort to construct a comprehensive database of known human protein interactions. We combine and link known interactions from existing databases and then add to them by automatically mining additional interactions from 750,000 Medline abstracts. The end result is a network of 31,609 interactions among 7,748 proteins. The text mining system first identifies protein names in the text using a trained Conditional Random Field (CRF) and then identifies interactions through a filtered co-citation analysis. We also report two new strategies for mining interactions, both of which find explicit statements of interactions in the text: one uses learned pattern-based rules and the other a Support Vector Machine (SVM) with a string kernel. Using information in existing ontologies, we show that the automatically extracted data are of equivalent accuracy to manually curated data sets.
In Proceedings of the ISMB/ACL-05 Workshop of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology, Detroit, MI, June 2005.
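
The co-citation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes protein names have already been tagged in each abstract, and it uses a simple minimum co-occurrence count as the filter. The function name `cocitation_pairs`, the `min_count` threshold, and the toy data are all hypothetical; the abstract does not specify the filter the authors actually used.

```python
from collections import Counter
from itertools import combinations


def cocitation_pairs(abstract_proteins, min_count=2):
    """Count how often two proteins are named in the same abstract.

    abstract_proteins: iterable of collections of protein names, one per
        abstract (assumed to come from an upstream name tagger).
    min_count: hypothetical filter threshold, not taken from the paper.
    """
    counts = Counter()
    for names in abstract_proteins:
        # Sort the distinct names so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    # Keep only pairs that co-occur in at least min_count abstracts.
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy input: made-up protein mentions for three abstracts.
toy = [
    {"BRCA1", "BARD1", "TP53"},
    {"BRCA1", "BARD1"},
    {"TP53", "MDM2"},
]
print(cocitation_pairs(toy))  # {('BARD1', 'BRCA1'): 2}
```

A raw count threshold is the simplest possible filter; a statistical test of co-occurrence against chance would be a natural alternative, but which one the paper uses is not stated here.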
