ELIXIR is a library for writing wrappers in Java. ELIXIR provides a way to combine text extraction and spidering in wrappers. Since wrappers using ELIXIR are Java programs, they are eays to integrate with other Java program. The user can also extend the functionality of ELIXIR by implement new ItemExtractors. In an experiment, a wrapper written using ELIXIR showed an 89% reduction in non-comment source statements from a wrapper written using a prototype of ELIXIR. In another experiemnt, a wrapper written using ELIXIR showed a 90% reduction in non-comment source statements from a wrapper written using SPHINX, a Java toolkit for writing spiders.
ML ID: 109
Most recommender systems use Collaborative Filtering or Content-based methods to predict new items of interest for a user. While both methods have their own advantages, individually they fail to provide good recommendattions in many situations. Incorporating components from both methods, a hybrid recommender system can overcome these shortcomings. In this paper, we present an elegant and effective framework for combining content and collaboration. Our approach uses a content-based predictor to enhance existing user data, and then provides personalized suggestions through collaborative filtering. We present experimental results that show how this approach, Content-Boosted Collaborative Filtering, performs better than a pure content-based predictor, pure collaborative filter, and a naive hybrid approach. We also discuss methods to improve the performance of our hybrid system.
ML ID: 108
In this paper, we explored a learning approach which combines different learning methods in inductive logic programming (ILP) to allow a learner to produce more expressive hypothese than that of each individual learner. Such a learning approach may be useful when the performance of the task depends on solving a large amount of classification problems and each has its own characteristics which may or may not fit a particular learning method. The task of sematnic parser acquisition in two different domains was attempted and preliminary results demonstrated that such an approach is promising.
ML ID: 107
In this paper, we present a new method of estimating the novelty of rules discovered by data-mining methods using WordNet, a lexical knowledge-base of English words. We assess the novelty of a rule by the average semantic distance in a knowledge hierarchy between the words in the antecedent and the consequent of the rule -- the more the average distance, more is the novelty of the rule. The novelty of rules extracted by the DiscoTEX text-mining system on Amazon.com book descriptions were evaluated by both human subjects and by our algorithm. By computing correlation coefficients between pairs of human ratings and between human and automatic ratings, we found that the automatic scoring of rules based on our novelty measure correlates with human judgments about as well as human judgments correlate with one another.
ML ID: 106
Text mining concerns the discovery of knowledge from unstructured textual data. One important task is the discovery of rules that relate specific words and phrases. Although existing methods for this task learn traditional logical rules, soft-matching methods that utilize word-frequency information generally work better for textual data. This paper presents a rule induction system, TextRISE, that allows for partial matching of text-valued features by combining rule-based and instance-based learning. We present initial experiments applying TextRISE to corpora of book descriptions and patent documents retrieved from the web and compare its results to those of traditional rule and instance based methods.
ML ID: 105
We present a novel application of WordNet to estimating the interestingness of rules discovered by data-mining methods. We estimate the novelty of text-mined rules using semantic distance measures based on WordNet. In our experiments, we found that the automatic scoring of rules based on our novelty measure correlates with human judgments about as well as human judgments correlate with each other.
ML ID: 104
Text mining is a relatively new research area at the intersection of data mining, natural-language processing, machine learning, and information retrieval. The goal of text mining is to discover knowledge in unstructured text. The related task of Information Extraction (IE) concerns locating specific items of data in natural-language documents, thereby transforming unstructured text into a structured database. Although handmade IE systems have existed for a while, automatic construction of information extraction systems using machine learning is more recent. This proposal presents a new framework for text mining, called DiscoTEX (Discovery from Text EXtraction), which uses a learned information extraction system to transform text into more structured data which is then mined for interesting relationships.
DiscoTEX combines IE and standard data mining methods to perform text mining as well as improve the performance of the underlying IE system. It discovers prediction rules from natural-language corpora, and these rules are used to predict additional information to extract from future documents, thereby improving the recall of IE. The initial version of DiscoTex integrates an IE module acquired by the Rapier learning system, and a standard rule induction module such as C4.5rules or Ripper. Encouraging initial results are presented on applying these techniques to a corpus of computer job announcements posted on an Internet newsgroup. However, this approach has problems when the same extracted entity or feature is represented by similar but not identical strings in different documents. Consequently, we are also developing an alternate rule induction system for DiscoTex called, TextRISE, that allows for partial matching of string-valued features. We also present initial results applying the TextRISE rule learner to corpora of book descriptions and patent documents retrieved from the World Wide Web (WWW). Future research will involve thorough testing on several domains, further development of this approach, and extensions of the proposed framework (currently limited to prediction rule discovery) to additional text mining tasks.
ML ID: 103