Machine Learning Research Group | University of Texas

Publications: Text Data Mining

Text data mining concerns the application of data mining (knowledge discovery in databases, KDD) to unstructured textual data. Our work focuses on using information extraction to first extract a structured database from a corpus of natural language texts and then discovering patterns in the resulting database using traditional KDD tools. It also concerns record linkage, a form of data-cleaning that identifies equivalent but textually distinct items in the extracted data prior to mining. It is also related to our research on natural language learning. Our recent work has focused on text mining for bioinformatics.

This research was formerly supported by the National Science Foundation through grant IIS-0117308 from the "Information and Data Management" Program.

Hide abstracts

Review Quality Aware Collaborative Filtering
[Details] [PDF]
Sindhu Raghavan and Suriya Ganasekar and Joydeep Ghosh
In Sixth ACM Conference on Recommender Systems (RecSys 2012), 123--130, September 2012.
Probabilistic matrix factorization (PMF) and other popular approaches to collaborative filtering assume that the ratings given by users for products are genuine, and hence they give equal importance to all available ratings. However, this is not always true due to several reasons including the presence of opinion spam in product reviews. In this paper, the possibility of performing collaborative filtering while attaching weights or quality scores to the ratings is explored. The quality scores, which are determined from the corresponding review data are used to ``up--weight'' or ``down--weight'' the importance given to the individual rating while performing collaborative filtering, thereby improving the accuracy of the predictions. First, the measure used to capture the quality of the ratings is described. Different approaches for estimating the quality score based on the available review information are examined. Subsequently, a mathematical formulation to incorporate quality scores as weights for the ratings in the basic PMF framework is derived. Experimental evaluation on two product categories of a benchmark data set from Amazon.com demonstrates the efficacy of our approach.
ML ID: 281
Improving Video Activity Recognition using Object Recognition and Text Mining
[Details] [PDF] [Slides (PPT)]
Tanvi S. Motwani and Raymond J. Mooney
In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), 600--605, August 2012.
Recognizing activities in real-world videos is a challenging AI problem. We present a novel combination of standard activity classification, object recognition, and text mining to learn effective activity recognizers without ever explicitly labeling training videos. We cluster verbs used to describe videos to automatically discover classes of activities and produce a labeled training set. This labeled data is then used to train an activity classifier based on spatio-temporal features. Next, text mining is employed to learn the correlations between these verbs and related objects. This knowledge is then used together with the outputs of an off-the-shelf object recognizer and the trained activity classifier to produce an improved activity recognizer. Experiments on a corpus of YouTube videos demonstrate the effectiveness of the overall approach.
ML ID: 274
Extending Bayesian Logic Programs for Plan Recognition and Machine Reading
[Details] [PDF] [Slides (PPT)]
Sindhu V. Raghavan
Technical Report, PhD proposal, Department of Computer Science, The University of Texas at Austin, May 2011.
Statistical relational learning (SRL) is the area of machine learning that integrates both first-order logic and probabilistic graphical models. The advantage of these formalisms is that they can handle both uncertainty and structured/relational data. As a result, they are widely used in domains like social network analysis, biological data analysis, and natural language processing. Bayesian Logic Programs (BLPs), which integrate both first-order logic and Bayesian networks are a powerful SRL formalism developed in the recent past. In this proposal, we focus on applying BLPs to two real worlds tasks -- plan recognition and machine reading.
Plan recognition is the task of predicting an agent's top-level plans based on its observed actions. It is an abductive reasoning task that involves inferring cause from effect. In the first part of the proposal, we develop an approach to abductive plan recognition using BLPs. Since BLPs employ logical deduction to construct the networks, they cannot be used effectively for plan recognition as is. Therefore, we extend BLPs to use logical abduction to construct Bayesian networks and call the resulting model Bayesian Abductive Logic Programs (BALPs). Experimental evaluation on three benchmark data sets demonstrate that BALPs outperform the existing state-of-art methods like Markov Logic Networks (MLNs) for plan recognition.
For future work, we propose to apply BLPs to the task of machine reading, which involves automatic extraction of knowledge from natural language text. Present day information extraction (IE) systems that are trained for machine reading are limited by their ability to extract only factual information that is stated explicitly in the text. We propose to improve the performance of an off-the-shelf IE system by inducing general knowledge rules about the domain using the facts already extracted by the IE system. We then use these rules to infer additional facts using BLPs, thereby improving the recall of the underlying IE system. Here again, the standard inference used in BLPs cannot be used to construct the networks. So, we extend BLPs to perform forward inference on all facts extracted by the IE system and then construct the ground Bayesian networks. We initially use an existing inductive logic programming (ILP) based rule learner to learn the rules. In the longer term, we would like to develop a rule/structure learner that is capable of learning an even better set of first-order rules for BLPs.
ML ID: 258
Spherical Topic Models
[Details] [PDF] [Slides (PDF)]
Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond J. Mooney
In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 2010.
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model for arbitrary L2 normalized data. SAM maintains the same hierarchical structure as Latent Dirichlet Allocation (LDA), but models documents as points on a high-dimensional spherical manifold, allowing a natural likelihood parameterization in terms of cosine distance. Furthermore, SAM can model word absence/presence at the document level, and unlike previous models can assign explicit negative weight to topic terms. Performance is evaluated empirically, both through human ratings of topic quality and through diverse classification tasks from natural language processing and computer vision. In these experiments, SAM consistently outperforms existing models.
ML ID: 248
Spherical Topic Models
[Details] [PDF]
Joseph Reisinger, Austin Waters, Bryan Silverthorn, and Raymond Mooney
In NIPS'09 workshop: Applications for Topic Models: Text and Beyond, 2009.
We introduce the Spherical Admixture Model (SAM), a Bayesian topic model over arbitrary L2 normalized data. SAM models documents as points on a high- dimensional spherical manifold, and is capable of representing negative word- topic correlations and word presence/absence, unlike models with multinomial document likelihood, such as LDA. In this paper, we evaluate SAM as a topic browser, focusing on its ability to model “negative” topic features, and also as a dimensionality reduction method, using topic proportions as features for difficult classification tasks in natural language processing and computer vision.
ML ID: 237
Learning for Information Extraction: From Named Entity Recognition and Disambiguation To Relation Extraction
[Details] [PDF]
Razvan Constantin Bunescu
PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2007. 150 pages. Also as Technical Report AI07-345, Artificial Intelligence Lab, University of Texas at Austin, August 2007.
Information Extraction, the task of locating textual mentions of specific types of entities and their relationships, aims at representing the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, or the semantic web. The goal of our research is to design information extraction models that obtain improved performance by exploiting types of evidence that have not been explored in previous approaches. Since designing an extraction system through introspection by a domain expert is a laborious and time consuming process, the focus of this thesis will be on methods that automatically induce an extraction model by training on a dataset of manually labeled examples.
Named Entity Recognition is an information extraction task that is concerned with finding textual mentions of entities that belong to a predefined set of categories. We approach this task as a phrase classification problem, in which candidate phrases from the same document are collectively classified. Global correlations between candidate entities are captured in a model built using the expressive framework of Relational Markov Networks. Additionally, we propose a novel tractable approach to phrase classification for named entity recognition based on a special Junction Tree representation.
Classifying entity mentions into a predefined set of categories achieves only a partial disambiguation of the names. This is further refined in the task of Named Entity Disambiguation, where names need to be linked to their actual denotations. In our research, we use Wikipedia as a repository of named entities and propose a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy.
Relation Extraction refers to finding relevant relationships between entities mentioned in text documents. Our approaches to this information extraction task differ in the type and the amount of supervision required. We first propose two relation extraction methods that are trained on documents in which sentences are manually annotated for the required relationships. In the first method, the extraction patterns correspond to sequences of words and word classes anchored at two entity names occurring in the same sentence. These are used as implicit features in a generalized subsequence kernel, with weights computed through training of Support Vector Machines. In the second approach, the implicit extraction features are focused on the shortest path between the two entities in the word-word dependency graph of the sentence. Finally, in a significant departure from previous learning approaches to relation extraction, we propose reducing the amount of required supervision to only a handful of pairs of entities known to exhibit or not exhibit the desired relationship. Each pair is associated with a bag of sentences extracted automatically from a very large corpus. We extend the subsequence kernel to handle this weaker form of supervision, and describe a method for weighting features in order to focus on those correlated with the target relation rather than with the individual entities. The resulting Multiple Instance Learning approach offers a competitive alternative to previous relation extraction methods, at a significantly reduced cost in human supervision.
ML ID: 213
Learning to Extract Relations from the Web using Minimal Supervision
[Details] [PDF]
Razvan C. Bunescu and Raymond J. Mooney
In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL'07), Prague, Czech Republic, June 2007.
We present a new approach to relation extraction that requires only a handful of training examples. Given a few pairs of named entities known to exhibit or not exhibit a particular relation, bags of sentences containing the pairs are extracted from the web. We extend an existing relation extraction method to handle this weaker form of supervision, and present experimental results demonstrating that our approach can reliably extract relations from web documents.
ML ID: 204
Extracting Relations from Text: From Word Sequences to Dependency Paths
[Details] [PDF]
Razvan C. Bunescu and Raymond J. Mooney
In A. Kao and S. Poteet, editors, Natural Language Processing and Text Mining, 29-44, Berlin, 2007. Springer Verlag.
ML ID: 186
Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping
[Details] [PDF]
Mikhail Bilenko, Sugato Basu, and Mehran Sahami
In Proceedings of the 5th International Conference on Data Mining (ICDM-2005), 58--65, Houston, TX, November 2005.
The problem of record linkage focuses on determining whether two object descriptions refer to the same underlying entity. Addressing this problem effectively has many practical applications, e.g., elimination of duplicate records in databases and citation matching for scholarly articles. In this paper, we consider a new domain where the record linkage problem is manifested: Internet comparison shopping. We address the resulting linkage setting that requires learning a similarity function between record pairs from streaming data. The learned similarity function is subsequently used in clustering to determine which records are co-referent and should be linked. We present an online machine learning method for addressing this problem, where a composite similarity function based on a linear combination of basis functions is learned incrementally. We illustrate the efficacy of this approach on several real-world datasets from an Internet comparison shopping site, and show that our method is able to effectively learn various distance functions for product data with differing characteristics. We also provide experimental results that show the importance of considering multiple performance measures in record linkage evaluation.
ML ID: 178
Alignments and String Similarity in Information Integration: A Random Field Approach
[Details] [PDF]
Mikhail Bilenko and Raymond J. Mooney
In Proceedings of the 2005 Dagstuhl Seminar on Machine Learning for the Semantic Web, Dagstuhl, Germany, February 2005.
Several problems central to information integration, such as ontology mapping and object matching, can be viewed as alignment tasks where the goal is to find an optimal correspondence between two structured objects and to compute the associated similarity score. The diversity of data sources and domains in the Semantic Web requires solutions to these problems to be highly adaptive, which can be achieved by employing probabilistic machine learning approaches. We present one such approach, Alignment Conditional Random Fields (ACRFs), a new framework for constructing and scoring sequence alignments using undirected graphical models. ACRFs allow incorporating arbitrary features into string edit distance computation, yielding a learnable string similarity function for use in tasks where approximate string matching is needed. We outline possible applications of ACRFs in information integration tasks and describe directions for future work.
ML ID: 177
Mining Knowledge from Text Using Information Extraction
[Details] [PDF]
Raymond J. Mooney and R. Bunescu
SIGKDD Explorations (special issue on Text Mining and Natural Language Processing), 7(1):3-10, 2005.
An important approach to text mining involves the use of natural-language information extraction. Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extricate abstract knowledge from a text corpus, or to extract concrete data from a set of documents which can then be further analyzed with traditional data-mining techniques to discover more general patterns. We discuss methods and implemented systems for both of these approaches and summarize results on mining real text corpora of biomedical abstracts, job announcements, and product descriptions. We also discuss challenges that arise when employing current information extraction technology to discover knowledge in text.
ML ID: 170
Learning for Collective Information Extraction
[Details] [PDF]
Razvan C. Bunescu
Technical Report TR-05-02, Department of Computer Sciences, University of Texas at Austin, October 2005. Ph.D. proposal.
An Information Extraction (IE) system analyses a set of documents with the aim of identifying certain types of entities and relations between them. Most IE systems treat separate potential extractions as independent. However, in many cases, considering influences between different candidate extractions could improve overall accuracy. For example, phrase repetitions inside a document are usually associated with the same entity type, the same being true for acronyms and their corresponding long form. One of our goals in this thesis is to show how these and potentially other types of correlations can be captured by a particular type of undirected probabilistic graphical model. Inference and learning using this graphical model allows for collective information extraction in a way that exploits the mutual influence between possible extractions. Preliminary experiments on learning to extract named entities from biomedical and newspaper text demonstrate the advantages of our approach.
The benefit of doing collective classification comes however at a cost: in the general case, exact inference in the resulting graphical model has an exponential time complexity. The standard solution, which is also the one that we used in our initial work, is to resort to approximate inference. In this proposal we show that by considering only a selected subset of mutual influences between candidate extractions, exact inference can be done in linear time. Consequently, a short term goal is to run comparative experiments that would help us choose between the two approaches: exact inference with a restricted subset of mutual influences or approximate inference with the full set of influences.
The set of issues that we intend to investigate in future work is two fold. One direction refers to applying the already developed framework to other natural language tasks that may benefit from the same types of influences, such as word sense disambiguation and part-of-speech tagging. Another direction concerns the design of a sufficiently general framework that would allow a seamless integration of cues from a variety of knowledge sources. We contemplate using generic sources such as external dictionaries, or web statistics on discriminative textual patterns. We also intend to alleviate the modeling problems due to the intrinsic local nature of entity features by exploiting syntactic information. All these generic features will be input to a feature selection algorithm, so that in the end we obtain a model which is both compact and accurate.
ML ID: 155
Text Mining with Information Extraction
[Details] [PDF]
Un Yong Nahm
PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2004. 217 pages. Also appears as Technical Report UT-AI-TR-04-311.
The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this dissertation, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases or KDD). By utilizing existing IE and KDD techniques, text-mining systems can be developed relatively rapidly and evaluated on existing text corpora for testing IE systems.
We present a general text-mining framework called DiscoTEX which employs an IE module for transforming natural-language documents into structured data and a KDD module for discovering prediction rules from the extracted data. When discovering patterns in extracted text, strict matching of strings is inadequate because textual database entries generally exhibit variations due to typographical errors, misspellings, abbreviations, and other sources. We introduce the notion of discovering ``soft-matching'' rules from text and present two new learning algorithms. TextRISE is an inductive method for learning soft-matching prediction rules that integrates rule-based and instance-based learning methods. Simple, interpretable rules are discovered using rule induction, while a nearest-neighbor algorithm provides soft matching. SoftApriori is a text-mining algorithm for discovering association rules from texts that uses a similarity measure to allow flexible matching to variable database items. We present experimental results on inducing prediction and association rules from natural-language texts demonstrating that TextRISE and SoftApriori learn more accurate rules than previous methods for these tasks. We also present an approach to using rules mined from extracted data to improve the accuracy of information extraction. Experimental results demonstate that such discovered patterns can be used to effectively improve the underlying IE method.
ML ID: 153
Using Soft-Matching Mined Rules to Improve Information Extraction
[Details] [PDF]
Un Yong Nahm and Raymond J. Mooney
In Proceedings of the AAAI-2004 Workshop on Adaptive Text Extraction and Mining (ATEM-2004), 27-32, San Jose, CA, July 2004.
By discovering predictive relationships between different pieces of extracted data, data-mining algorithms can be used to improve the accuracy of information extraction. However, textual variation due to typos, abbreviations, and other sources can prevent the productive discovery and utilization of hard-matching rules. Recent methods for inducing soft-matching rules from extracted data can more effectively find and exploit predictive relationships in textual data. This paper presents techniques for using mined soft-matching association rules to increase the accuracy of information extraction. Experimental results on a corpus of computer-science job postings demonstrate that soft-matching rules improve information extraction more effectively than hard-matching rules.
ML ID: 150
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
[Details] [PDF]
Mikhail Bilenko
In Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, 981--982, San Jose, CA, July 2004.

ML ID: 148
Text Mining with Information Extraction
[Details] [PDF]
Raymond J. Mooney and Un Yong Nahm
In W. Daelemans and T. du Plessis and C. Snyman and L. Teck, editors, Multilingualism and Electronic Language Management: Proceedings of the 4th International MIDP Colloquium, 141-160, Bloemfontein, South Africa, September 2003. Van Schaik: South Africa.
Text mining concerns looking for patterns in unstructured text. The related task of Information Extraction (IE) is about locating specific items in natural-language documents. This paper presents a framework for text mining, called DiscoTEX (Discovery from Text EXtraction), using a learned information extraction system to transform text into more structured data which is then mined for interesting relationships. The initial version of DiscoTEX integrates an IE module acquired by an IE learning system, and a standard rule induction module. In addition, rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of the underlying extraction system. Encouraging results are presented on applying these techniques to a corpus of computer job announcement postings from an Internet newsgroup.
ML ID: 159
Learnable Similarity Functions and Their Applications to Record Linkage and Clustering
[Details] [PDF]
Mikhail Bilenko
2003. Doctoral Dissertation Proposal, University of Texas at Austin.
Many machine learning tasks require similarity functions that estimate likeness between observations. Similarity computations are particularly important for clustering and record linkage algorithms that depend on accurate estimates of the distance between datapoints. However, standard measures such as string edit distance and Euclidean distance often fail to capture an appropriate notion of similarity for a particular domain or dataset. This problem can be alleviated by employing learnable similarity functions that adapt using training data. In this proposal, we introduce two adaptive string similarity measures: (1) Learnable Edit Distance with Affine Gaps, and (2) Learnable Vector-Space Similarity Based on Pairwise Classification. These similarity functions can be trained using a corpus of labeled pairs of equivalent and non-equivalent strings. We illustrate the accuracy improvements obtained with these measures using MARLIN, our system for record linkage in databases that learns to combine adaptive and static string similarity functions in a two-level learning framework.
Obtaining useful training examples for learnable similarity functions can be problematic due to scarcity of informative similar and dissimilar object pairs. We propose two strategies, Static-Active Selection and Weakly-Labeled Negatives, that facilitate efficient training data collection for record linkage. These strategies significantly outperform random selection on real datasets without the computational cost of traditional active learning methods. Additionally, we describe a method for combining seeding with Euclidean distance learning for semi-supervised k-means clustering. Experimental evaluation demonstrates that our method outperforms unsupervised clustering and semi-supervised clustering that employs seeding or metric learning separately.
In future research, we intend to pursue several directions in developing accurate learnable similarity functions and applying them to record linkage and clustering problems. This work will involve improving the proposed string similarity functions as well as introducing several novel approaches to adaptive string distance computation. We also plan to extend our initial work on learnable similarity functions for clustering, particularly for high-dimensional data. Finally, we will investigate the utility of various active learning strategies for learning similarity functions, as well as extend the preliminary work on static-active selection of training pairs.
ML ID: 133
Adaptive Name-Matching in Information Integration
[Details] [PDF]
Mikhail Bilenko, William W. Cohen, Stephen Fienberg, Raymond J. Mooney, and Pradeep Ravikumar
IEEE Intelligent Systems, 18(5):16-23, 2003.
Identifying approximately duplicate database records that refer to the same entity is essential for information integration. The authors compare and describe methods for combining and learning textual similarity measures for name matching.
ML ID: 131
On Evaluation and Training-Set Construction for Duplicate Detection
[Details] [PDF]
Mikhail Bilenko and Raymond J. Mooney
In Proceedings of the KDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 7-12, Washington, DC, August 2003.
A variety of experimental methodologies have been used to evaluate the accuracy of duplicate-detection systems. We advocate presenting precision-recall curves as the most informative evaluation methodology. We also discuss a number of issues that arise when evaluating and assembling training data for adaptive systems that use machine learning to tune themselves to specific applications. We consider several different application scenarios and experimentally examine the effectiveness of alternative methods of collecting training data under each scenario. We propose two new approaches to collecting training data called static-active learning and weakly-labeled non-duplicates, and present experimental results on their effectiveness.
ML ID: 129
Adaptive Duplicate Detection Using Learnable String Similarity Measures
[Details] [PDF]
Mikhail Bilenko and Raymond J. Mooney
In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), 39-48, Washington, DC, August 2003.
The problem of identifying approximately duplicate records in databases is an essential step for data cleaning and data integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each database field, and show that such measures are capable of adapting to the specific notion of similarity that is appropriate for the field's domain. We present two learnable text similarity measures suitable for this task: an extended variant of learnable string edit distance, and a novel vector-space based measure that employs a Support Vector Machine (SVM) for training. Experimental results on a range of datasets show that our framework can improve duplicate detection accuracy over traditional techniques.
ML ID: 127
Employing Trainable String Similarity Metrics for Information Integration
[Details] [PDF]
Mikhail Bilenko and Raymond J. Mooney
In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, 67-72, Acapulco, Mexico, August 2003.
The problem of identifying approximately duplicate records in databases is an essential step for the information integration processes. Most existing approaches have relied on generic or manually tuned distance metrics for estimating the similarity of potential duplicates. In this paper, we present a framework for improving duplicate detection using trainable measures of textual similarity. We propose to employ learnable text distance functions for each data field, and introduce an extended variant of learnable string edit distance based on an Expectation-Maximization(EM) training algorithm. Experimental results on a range of datasets show that this similarity metric is capable of adapting to the specific notions of similarity that are appropriate for different domains. Our overall system, MARLIN utilizes support vector machines to combine multiple similarity metrics, which are shown to perform better than ensembles of decision trees, which were employed for this task previously.
ML ID: 123
Mining Soft-Matching Association Rules
[Details] [PDF]
Un Yong Nahm and Raymond J. Mooney
In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM-2002), 681-683, McLean, VA, November 2002.
Variation and noise in database entries can prevent data mining algorithms, such as association rule mining, from discovering important regularities. In particular, textual fields can exhibit variation due to typographical errors, mispellings, abbreviations, etc.. By allowing partial or "soft matching" of items based on a similarity metric such as edit-distance or cosine similarity, additional important patterns can be detected. This paper introduces an algorithm, SoftApriori that discovers soft-matching association rules given a user-supplied similarity metric for each field. Experimental results on several "noisy" datasets extracted from text demonstrate that SoftApriori discovers additional relationships that more accurately reflect regularities in the data.
ML ID: 117
Two Approaches to Handling Noisy Variation in Text Mining
[Details] [PDF]
Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney
In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, 18-27, Sydney, Australia, July 2002.
Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to ``hardening'' noisy databases by identifying duplicate records, and (2) mining ``soft'' association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.
ML ID: 115
Text Mining with Information Extraction
[Details] [PDF]
Un Yong Nahm and Raymond J. Mooney
In Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 60-67, Stanford, CA, March 2002.
Text mining concerns looking for patterns in unstructured text. The related task of Information Extraction (IE) is about locating specific items in natural-language documents. This paper presents a framework for text mining, called DiscoTEX (Discovery from Text EXtraction), using a learned information extraction system to transform text into more structured data which is then mined for interesting relationships. The initial version of DiscoTEX integrates an IE module acquired by an IE learning system, and a standard rule induction module. However, this approach has problems when the same extracted entity or feature is represented by similar but not identical strings in different documents. Consequently, we also develop an alternate rule induction system called TextRISE, that allows for partial matching of textual items. Encouraging preliminary results are presented on applying these techniques to a corpus of Internet documents.
ML ID: 112
Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases
[Details] [PDF]
Mikhail Bilenko and Raymond J. Mooney
Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, February 2002.
The problem of identifying approximately duplicate records in databases has previously been studied as record linkage, the merge/purge problem, hardening soft databases, and field matching. Most existing approaches have focused on efficient algorithms for locating potential duplicates rather than precise similarity metrics for comparing records. In this paper, we present a domain-independent method for improving duplicate detection accuracy using machine learning. First, trainable distance metrics are learned for each field, adapting to the specific notion of similarity that is appropriate for the field's domain. Second, a classifier is employed that uses several diverse metrics for each field as distance features and classifies pairs of records as duplicates or non-duplicates. We also propose an extended model of learnable string distance which improves over an existing approach. Experimental results on real and synthetic datasets show that our method outperforms traditional techniques.
ML ID: 110
Evaluating the Novelty of Text-Mined Rules using Lexical Knowledge
[Details] [PDF]
Sugato Basu, Raymond J. Mooney, Krupakar V. Pasupuleti, and Joydeep Ghosh
In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2001), 233-239, San Francisco, CA, 2001.
In this paper, we present a new method of estimating the novelty of rules discovered by data-mining methods using WordNet, a lexical knowledge-base of English words. We assess the novelty of a rule by the average semantic distance in a knowledge hierarchy between the words in the antecedent and the consequent of the rule -- the more the average distance, more is the novelty of the rule. The novelty of rules extracted by the DiscoTEX text-mining system on Amazon.com book descriptions were evaluated by both human subjects and by our algorithm. By computing correlation coefficients between pairs of human ratings and between human and automatic ratings, we found that the automatic scoring of rules based on our novelty measure correlates with human judgments about as well as human judgments correlate with one another.
ML ID: 106
Mining Soft-Matching Rules from Textual Data
[Details] [PDF]
Un Yong Nahm and Raymond J. Mooney
In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 2001.
Text mining concerns the discovery of knowledge from unstructured textual data. One important task is the discovery of rules that relate specific words and phrases. Although existing methods for this task learn traditional logical rules, soft-matching methods that utilize word-frequency information generally work better for textual data. This paper presents a rule induction system, TextRISE, that allows for partial matching of text-valued features by combining rule-based and instance-based learning. We present initial experiments applying TextRISE to corpora of book descriptions and patent documents retrieved from the web and compare its results to those of traditional rule and instance based methods.
ML ID: 105
Using Lexical Knowlege to Evaluate the Novelty of Rules Mined from Text
[Details] [PDF]
Sugato Basu, Raymond J. Mooney, Krupakar V. Pasupuleti, and Joydeep Ghosh
In Proceedings of NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, 144--149, Pittsburg, PA, June 2001.
We present a novel application of WordNet to estimating the interestingness of rules discovered by data-mining methods. We estimate the novelty of text-mined rules using semantic distance measures based on WordNet. In our experiments, we found that the automatic scoring of rules based on our novelty measure correlates with human judgments about as well as human judgments correlate with each other.
ML ID: 104
Text Mining with Information Extraction
[Details] [PDF]
Un Yong Nahm
February 2001. Ph.D. proposal, Department of Computer Sciences, University of Texas at Austin.
Text mining is a relatively new research area at the intersection of data mining, natural-language processing, machine learning, and information retrieval. The goal of text mining is to discover knowledge in unstructured text. The related task of Information Extraction (IE) concerns locating specific items of data in natural-language documents, thereby transforming unstructured text into a structured database. Although handmade IE systems have existed for a while, automatic construction of information extraction systems using machine learning is more recent. This proposal presents a new framework for text mining, called DiscoTEX (Discovery from Text EXtraction), which uses a learned information extraction system to transform text into more structured data which is then mined for interesting relationships.
DiscoTEX combines IE and standard data mining methods to perform text mining as well as improve the performance of the underlying IE system. It discovers prediction rules from natural-language corpora, and these rules are used to predict additional information to extract from future documents, thereby improving the recall of IE. The initial version of DiscoTex integrates an IE module acquired by the Rapier learning system, and a standard rule induction module such as C4.5rules or Ripper. Encouraging initial results are presented on applying these techniques to a corpus of computer job announcements posted on an Internet newsgroup. However, this approach has problems when the same extracted entity or feature is represented by similar but not identical strings in different documents. Consequently, we are also developing an alternate rule induction system for DiscoTex called, TextRISE, that allows for partial matching of string-valued features. We also present initial results applying the TextRISE rule learner to corpora of book descriptions and patent documents retrieved from the World Wide Web (WWW). Future research will involve thorough testing on several domains, further development of this approach, and extensions of the proposed framework (currently limited to prediction rule discovery) to additional text mining tasks.
ML ID: 103
Using Information Extraction to Aid the Discovery of Prediction Rules from Text
[Details] [PDF]
Un Yong Nahm and Raymond J. Mooney
In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000) Workshop on Text Mining, 51--58, Boston, MA, August 2000.
Text mining and Information Extraction(IE) are both topics of significant recent interest. Text mining concerns applying data mining, a.k.a. knowledge discovery from databases (KDD) techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DiscoTEX, that combines IE and KDD methods to perform a text mining task, discovering prediction rules from natural-language corpora. An initial version of DiscoTEX is constructed by integrating an IE module based on Rapier and a rule-learning module, Ripper. We present encouraging results on applying these techniques to a corpus of computer job postings from an Internet newsgroup.
ML ID: 101
A Mutually Beneficial Integration of Data Mining and Information Extraction
[Details] [PDF]
Un Yong Nahm and Raymond J. Mooney
In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-00), 627-632, Austin, TX, July 2000.
Text mining concerns applying data mining techniques to unstructured text. Information extraction (IE) is a form of shallow text understanding that locates specific pieces of data in natural language documents, transforming unstructured text into a structured database. This paper describes a system called DiscoTEX, that combines IE and data mining methodologies to perform text mining as well as improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE. Encouraging results are presented on applying these techniques to a corpus of computer job postings from an Internet newsgroup.
ML ID: 100