In order to respond to increasing demand for natural language interfaces—and provide meaningful insight into user query intent—fast, scalable lexical semantic models with flexible representations are needed. Human concept organization is a rich phenomenon that has yet to be accounted for by a single coherent psychological framework: Concept generalization is captured by a mixture of prototype and exemplar models, and local taxonomic information is available through multiple overlapping organizational systems. Previous work in computational linguistics on extracting lexical semantic information from unannotated corpora does not provide adequate representational flexibility and hence fails to capture the full extent of human conceptual knowledge. In this thesis I outline a family of probabilistic models capable of capturing important aspects of the rich organizational structure found in human language that can predict contextual variation, selectional preference and feature-saliency norms to a much higher degree of accuracy than previous approaches. These models account for cross-cutting structure of concept organization—i.e. selective attention, or the notion that humans make use of different categorization systems for different kinds of generalization tasks—and can be applied to Web-scale corpora. Using these models, natural language systems will be able to infer a more comprehensive semantic relations, which in turn may yield improved systems for question answering, text classification, machine translation, and information retrieval.
ML ID: 309
Probabilistic matrix factorization (PMF) and other popular approaches to collaborative filtering assume that the ratings given by users for products are genuine, and hence they give equal importance to all available ratings. However, this is not always true due to several reasons including the presence of opinion spam in product reviews. In this paper, the possibility of performing collaborative filtering while attaching weights or quality scores to the ratings is explored. The quality scores, which are determined from the corresponding review data are used to ``up--weight'' or ``down--weight'' the importance given to the individual rating while performing collaborative filtering, thereby improving the accuracy of the predictions. First, the measure used to capture the quality of the ratings is described. Different approaches for estimating the quality score based on the available review information are examined. Subsequently, a mathematical formulation to incorporate quality scores as weights for the ratings in the basic PMF framework is derived. Experimental evaluation on two product categories of a benchmark data set from Amazon.com demonstrates the efficacy of our approach.
ML ID: 281
Several real world tasks involve data that is uncertain and relational in nature. Traditional approaches like first-order logic and probabilistic models either deal with structured data or uncertainty, but not both. To address these limitations, statistical relational learning (SRL), a new area in machine learning integrating both first-order logic and probabilistic graphical models, has emerged in the recent past. The advantage of SRL models is that they can handle both uncertainty and structured/relational data. As a result, they are widely used in domains like social network analysis, biological data analysis, and natural language processing. Bayesian Logic Programs (BLPs), which integrate both first-order logic and Bayesian networks are a powerful SRL formalism developed in the recent past. In this dissertation, we develop approaches using BLPs to solve two real world tasks -- plan recognition and machine reading. Plan recognition is the task of predicting an agent's top-level plans based on its observed actions. It is an abductive reasoning task that involves inferring cause from effect. In the first part of the dissertation, we develop an approach to abductive plan recognition using BLPs. Since BLPs employ logical deduction to construct the networks, they cannot be used effectively for abductive plan recognition as is. Therefore, we extend BLPs to use logical abduction to construct Bayesian networks and call the resulting model Bayesian Abductive Logic Programs (BALPs). In the second part of the dissertation, we apply BLPs to the task of machine reading, which involves automatic extraction of knowledge from natural language text. Most information extraction (IE) systems identify facts that are explicitly stated in text. However, much of the information conveyed in text must be inferred from what is explicitly stated since easily inferable facts are rarely mentioned. Human readers naturally use common sense knowledge and "read between the lines" to infer such implicit information from the explicitly stated facts. Since IE systems do not have access to common sense knowledge, they cannot perform deeper reasoning to infer implicitly stated facts. Here, we first develop an approach using BLPs to infer implicitly stated facts from natural language text. It involves learning uncertain common sense knowledge in the form of probabilistic first-order rules by mining a large corpus of automatically extracted facts using an existing rule learner. These rules are then used to derive additional facts from extracted information using BLP inference. We then develop an online rule learner that handles the concise, incomplete nature of natural-language text and learns first-order rules from noisy IE extractions. Finally, we develop a novel approach to calculate the weights of the rules using a curated lexical ontology like WordNet. Both tasks described above involve inference and learning from partially observed or incomplete data. In plan recognition, the underlying cause or the top-level plan that resulted in the observed actions is not known or observed. Further, only a subset of the executed actions can be observed by the plan recognition system resulting in partially observed data. Similarly, in machine reading, since some information is implicitly stated, they are rarely observed in the data. In this dissertation, we demonstrate the efficacy of BLPs for inference and learning from incomplete data. Experimental comparison on various benchmark data sets on both tasks demonstrate the superior performance of BLPs over state-of-the-art methods.
ML ID: 280
Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
ML ID: 279
Recognizing activities in real-world videos is a challenging AI problem. We present a novel combination of standard activity classification, object recognition, and text mining to learn effective activity recognizers without ever explicitly labeling training videos. We cluster verbs used to describe videos to automatically discover classes of activities and produce a labeled training set. This labeled data is then used to train an activity classifier based on spatio-temporal features. Next, text mining is employed to learn the correlations between these verbs and related objects. This knowledge is then used together with the outputs of an off-the-shelf object recognizer and the trained activity classifier to produce an improved activity recognizer. Experiments on a corpus of YouTube videos demonstrate the effectiveness of the overall approach.
ML ID: 274
"Grounded" language learning is the process of learning the semantics of natural language with respect to relevant perceptual inputs. Toward this goal, computational systems are trained with data in the form of natural language sentences paired with relevant but ambiguous perceptual contexts. With such ambiguous supervision, it is required to resolve the ambiguity between a natural language (NL) sentence and a corresponding set of possible logical meaning representations (MR). My research focuses on devising effective models for simultaneously disambiguating such supervision and learning the underlying semantics of language to map NL sentences into proper logical forms. Specifically, I will present two probabilistic generative models for learning such correspondences. The models are applied to two publicly available datasets in two different domains, sportscasting and navigation, and compared with previous work on the same data. I will first present a probabilistic generative model that learns the mappings from NL sentences into logical forms where the true meaning of each NL sentence is one of a handful of candidate logical MRs. It simultaneously disambiguates the meaning of each sentence in the training data and learns to probabilistically map a NL sentence to its MR form depicted in a single tree structure. Evaluations are performed on the RoboCup sportscasting corpous, which show that it outperforms previous methods. Next, I present a PCFG induction model for grounded language learning that extends the model of Borschinger, Jones, and Johnson (2011) by utilizing a semantic lexicon. Borschinger et al.'s approach works well when there is limited ambiguity such as in the sportscasting task, but it does not scale well to highly ambiguous situations when there are large sets of potential meaning possibilities for each sentence, such as in the navigation instruction following task studied by Chen and Mooney (2011). Our model overcomes such limitations by employing a semantic lexicon as the basic building block for PCFG rule generation. Our model also allows for novel combination of MR outputs when parsing novel test sentences. For future work, I propose to extend our PCFG induction model in several ways: improving the lexicon learning algorithm, discriminative re-ranking of top-k parses, and integrating the meaning representation language (MRL) grammar for extra structural information. The longer-term agenda includes applying our approach to summarized machine translation, using real perception data such as robot sensorimeter and images/videos, and joint learning with other natural language processing tasks.
ML ID: 273
"Grounded" language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts. Borschinger et al. (2011) introduced an approach to grounded language learning based on unsupervised PCFG induction. Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task. However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (2011). This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision. Experimental results on the navigation task demonstrates the effectiveness of our approach.
ML ID: 272
Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon's Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
ML ID: 271
Most information extraction (IE) systems identify facts that are explicitly stated in text. However, in natural language, some facts are implicit, and identifying them requires "reading between the lines". Human readers naturally use common sense knowledge to infer such implicit information from the explicitly stated facts. We propose an approach that uses Bayesian Logic Programs (BLPs), a statistical relational model combining first-order logic and Bayesian networks, to infer additional implicit information from extracted facts. It involves learning uncertain commonsense knowledge (in the form of probabilistic first-order rules) from natural language text by mining a large corpus of automatically extracted facts. These rules are then used to derive additional facts from extracted information using BLP inference. Experimental evaluation on a benchmark data set for machine reading demonstrates the efficacy of our approach.
ML ID: 270
Building a computer system that can understand human languages has been one of the long-standing goals of artificial intelligence. Currently, most state-of-the-art natural language processing (NLP) systems use statistical machine learning methods to extract linguistic knowledge from large, annotated corpora. However, constructing such corpora can be expensive and time-consuming due to the expertise it requires to annotate such data. In this thesis, we explore alternative ways of learning which do not rely on direct human supervision. In particular, we draw our inspirations from the fact that humans are able to learn language through exposure to linguistic inputs in the context of a rich, relevant, perceptual environment.
We first present a system that learned to sportscast for RoboCup simulation games by observing how humans commentate a game. Using the simple assumption that people generally talk about events that have just occurred, we pair each textual comment with a set of events that it could be referring to. By applying an EM-like algorithm, the system simultaneously learns a grounded language model and aligns each description to the corresponding event. The system does not use any prior language knowledge and was able to learn to sportscast in both English and Korean. Human evaluations of the generated commentaries indicate they are of reasonable quality and in some cases even on par with those produced by humans.
For the sportscasting task, while each comment could be aligned to one of several events, the level of ambiguity was low enough that we could enumerate all the possible alignments. However, it is not always possible to restrict the set of possible alignments to such limited numbers. Thus, we present another system that allows each sentence to be aligned to one of exponentially many connected subgraphs without explicitly enumerating them. The system first learns a lexicon and uses it to prune the nodes in the graph that are unrelated to the words in the sentence. By only observing how humans follow navigation instructions, the system was able to infer the corresponding hidden navigation plans and parse previously unseen instructions in new environments for both English and Chinese data. With the rise in popularity of crowdsourcing, we also present results on collecting additional training data using Amazon’s Mechanical Turk. Since our system only needs supervision in the form of language being used in relevant contexts, it is easy for virtually anyone to contribute to the training data.
ML ID: 269