Machine Learning Research Group
Department of Computer Science and Artificial Intelligence Lab, The University of Texas at Austin

Publications: Connecting Language and Perception

To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Ideally, an AI system would be able to learn language like a human child, by being exposed to utterances in a rich perceptual environment. The perceptual context would provide the necessary supervisory information, and learning the connection between language and perception would ground the system's semantic representations in its perception of the world. As a step in this direction, our research is developing systems that learn semantic parsers and language generators from sentences paired only with their perceptual context. It is part of our research on natural language learning. Our research on this topic is supported by the National Science Foundation through grants IIS-0712097 and IIS-1016312.
  • Grounded Language Learning [Video Lecture]. Raymond J. Mooney, Invited Talk, AAAI, 2013.
  • Learning Language from its Perceptual Context [Video Lecture]. Raymond J. Mooney, Invited Talk, ECML-PKDD, 2008.
  1. Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
    [Details] [PDF] [Poster]
    Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney
    In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), 1218--1227, Dublin, Ireland, August 2014.
    This paper integrates techniques in natural language processing and computer vision to improve recognition and description of entities and activities in real-world videos. We propose a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics. We use state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video. Our factor graph model combines these detection confidences with probabilistic knowledge mined from text corpora to estimate the most likely subject, verb, object, and place. Results on YouTube videos show that our approach improves both the joint detection of these latent, diverse sentence components and the detection of some individual components when compared to using the vision system alone, as well as over a previous n-gram language-modeling approach. The joint detection allows us to automatically generate more accurate, richer sentential descriptions of videos with a wide array of possible content.
    ML ID: 304
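
    The entry above combines visual detection confidences with statistics mined from text corpora to choose the most likely subject, verb, object, and place. The sketch below illustrates only the general flavor of that combination, not the paper's factor graph model: it scores candidate subject-verb-object triplets by multiplying made-up detector confidences with made-up corpus-derived conditional probabilities.

```python
# Toy illustration: rank subject-verb-object triplets by combining
# (invented) visual detector confidences with (invented) corpus-derived
# language statistics. This is a naive product, not the paper's factor graph.
from itertools import product

# Detector confidences for entities and activities found in a video (made up).
subject_conf = {"person": 0.9, "dog": 0.4}
verb_conf = {"ride": 0.5, "walk": 0.6}
object_conf = {"bicycle": 0.7, "leash": 0.3}

# Corpus statistics: P(verb | subject) and P(object | verb), also made up.
p_verb_given_subject = {("person", "ride"): 0.2, ("person", "walk"): 0.3,
                        ("dog", "ride"): 0.01, ("dog", "walk"): 0.4}
p_object_given_verb = {("ride", "bicycle"): 0.5, ("ride", "leash"): 0.01,
                       ("walk", "bicycle"): 0.05, ("walk", "leash"): 0.3}

def score(s, v, o):
    """Combine vision confidence with language statistics."""
    vision = subject_conf[s] * verb_conf[v] * object_conf[o]
    language = (p_verb_given_subject.get((s, v), 1e-6)
                * p_object_given_verb.get((v, o), 1e-6))
    return vision * language

best = max(product(subject_conf, verb_conf, object_conf),
           key=lambda svo: score(*svo))
print("best triplet:", best)
```

    With these invented numbers the vision scores alone would prefer "walk" (its detector confidence is higher), but the corpus statistic P(bicycle | ride) pulls the combined score toward ("person", "ride", "bicycle"), which is the kind of correction that text-mined language statistics provide.
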
  2. Integrating Visual and Linguistic Information to Describe Properties of Objects
    [Details] [PDF]
    Calvin MacKenzie
    Undergraduate Honors Thesis, Department of Computer Science, University of Texas at Austin, 2014.
    Generating sentences from images has historically been performed with standalone Computer Vision systems. The idea of combining visual and linguistic information has been gaining traction in the Computer Vision and Natural Language Processing communities over the past several years. The motivation for a combined system is to generate richer linguistic descriptions of images, which standalone vision systems are typically unable to produce. Our approach leverages abundant available language data to clean up noisy results from standalone vision systems.

    This thesis investigates the performance of several models which integrate information from language and vision systems in order to describe certain attributes of objects. The attributes used were split into two categories: color attributes and other attributes. Our proposed model was found to be statistically significantly more accurate than the vision system alone for both sets of attributes.

    ML ID: 302
  3. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-shot Recognition
    [Details] [PDF] [Poster]
    Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
    In Proceedings of the 14th International Conference on Computer Vision (ICCV-2013), 2712--2719, Sydney, Australia, December 2013.
    Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities "in-the-wild". We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action, and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to "fill in" novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
    ML ID: 295
  4. A Multimodal LDA Model Integrating Textual, Cognitive and Visual Modalities
    [Details] [PDF]
    Stephen Roller and Sabine Schulte im Walde
    In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), 1146--1157, Seattle, WA, October 2013.
    Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
    ML ID: 294
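
    The model above integrates visual information into LDA through unsupervised image clusters. As a loose, much simpler stand-in for that idea (not the bimodal model of Andrews et al. (2009) or the paper's extensions), the sketch below appends discrete image-cluster IDs to each text document as extra pseudo-words and fits ordinary LDA with scikit-learn; the documents and cluster IDs are invented.

```python
# Minimal stand-in for multimodal topic modeling: treat image-cluster IDs as
# extra "words" appended to each document, then fit ordinary LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Made-up documents; IMGCLUSTER_k tokens stand for unsupervised image clusters.
docs = [
    "dog runs park leash IMGCLUSTER_3 IMGCLUSTER_3",
    "cat sleeps sofa IMGCLUSTER_7",
    "dog fetches ball grass IMGCLUSTER_3",
    "cat chases mouse kitchen IMGCLUSTER_7",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top tokens (textual or visual) per topic.
vocab = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {t}: {top}")
```

    Whether the mixed-vocabulary topics actually group visual clusters with related words depends entirely on the data; the point is only that discrete visual tokens drop into a standard topic model with no changes to the model itself.
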
  5. Grounded Language Learning Models for Ambiguous Supervision
    [Details] [PDF] [Slides]
    Joo Hyun Kim
    PhD Thesis, Department of Computer Science, University of Texas at Austin, December 2013.
    Communicating through natural language interfaces is a long-standing goal for artificial intelligence (AI) agents. One core issue toward this goal is "grounded" language learning, the process of learning the semantics of natural language with respect to relevant perceptual inputs. To ground the meanings of language in real-world situations, computational systems are trained with data in the form of natural language sentences paired with relevant but ambiguous perceptual contexts. Such ambiguous supervision requires resolving the ambiguity between a natural language (NL) sentence and a corresponding set of possible logical meaning representations (MR).

    In this thesis, we focus on devising effective models for simultaneously disambiguating such supervision and learning the underlying semantics of language to map NL sentences into proper logical MRs. We present probabilistic generative models for learning such correspondences along with a reranking model to improve the performance further.

    First, we present a probabilistic generative model that learns the mappings from NL sentences into logical forms where the true meaning of each NL sentence is one of a handful of candidate logical MRs. It simultaneously disambiguates the meaning of each sentence in the training data and learns to probabilistically map an NL sentence to its corresponding MR form depicted in a single tree structure. We perform evaluations on the RoboCup sportscasting corpus, demonstrating that our model is more effective than those proposed by previous researchers.

    Next, we describe two PCFG induction models for grounded language learning that extend the previous grounded language learning model of Borschinger, Jones, and Johnson (2011). Borschinger et al.'s approach works well in situations of limited ambiguity, such as in the sportscasting task. However, it does not scale well to highly ambiguous situations when there are large sets of potential meaning possibilities for each sentence, such as in the navigation instruction following task first studied by Chen and Mooney (2011). The two models we present overcome such limitations by employing a learned semantic lexicon as a basic correspondence unit between NL and MR for PCFG rule generation.

    Finally, we present a method of adapting discriminative reranking to grounded language learning in order to improve the performance of our proposed generative models. Although such generative models are easy to implement and are intuitive, it is not always the case that generative models perform best, since they are maximizing the joint probability of data and model, rather than directly maximizing conditional probability. Because we do not have gold-standard references for training a secondary conditional reranker, we incorporate weak supervision of evaluations against the perceptual world during the process of improving model performance.

    All these approaches are evaluated on two publicly available domains that have been actively used in many other grounded language learning studies. Our methods demonstrate consistently improved performance over previous studies on these domains across different languages, indicating that they are language-independent and can be generally applied to other grounded learning problems as well. Further possible applications of the presented approaches include summarized machine translation tasks and learning from real perception data assisted by computer vision and robotics.

    ML ID: 291
  6. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
    [Details] [PDF] [Slides]
    Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama
    In Proceedings of the NAACL HLT Workshop on Vision and Language (WVL '13), 10--19, Atlanta, Georgia, July 2013.
    We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61% of the time.
    ML ID: 290
  7. Adapting Discriminative Reranking to Grounded Language Learning
    [Details] [PDF] [Slides]
    Joohyun Kim and Raymond J. Mooney
    In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013), 218--227, Sofia, Bulgaria, August 2013.
    We adapt discriminative reranking to improve the performance of grounded language acquisition, specifically the task of learning to follow navigation instructions from observation. Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. Instead, we show how the weak supervision of response feedback (e.g. successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees.
    ML ID: 286
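
    The paper above replaces the gold-standard parse trees normally used to train a discriminative reranker with weak response feedback such as successful task completion. The sketch below shows one simple way such feedback could drive a perceptron-style reranker; the feature dictionaries and the executes_correctly callback are hypothetical stand-ins, and this is not the paper's exact training procedure.

```python
# Toy perceptron-style reranker trained from response feedback rather than
# gold parses: among each sentence's candidate parses, the "pseudo-gold" is
# the highest-scoring candidate whose execution succeeds in the world.
# The feature dicts and executes_correctly callback are hypothetical.
from collections import defaultdict

def dot(weights, features):
    return sum(weights[f] * v for f, v in features.items())

def rerank_train(training_data, executes_correctly, epochs=10):
    """training_data: list of candidate lists; each candidate is a dict
    mapping feature name -> value for one parse of one sentence."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for candidates in training_data:
            good = [c for c in candidates if executes_correctly(c)]
            if not good:
                continue                      # no usable feedback; skip
            pseudo_gold = max(good, key=lambda c: dot(weights, c))
            predicted = max(candidates, key=lambda c: dot(weights, c))
            if predicted is not pseudo_gold:  # standard perceptron update
                for f, v in pseudo_gold.items():
                    weights[f] += v
                for f, v in predicted.items():
                    weights[f] -= v
    return weights

# Tiny demo with two fabricated candidate parses for one instruction.
cands = [{"len": 2.0}, {"uses_landmark": 1.0, "len": 3.0}]
w = rerank_train([cands], executes_correctly=lambda c: "uses_landmark" in c)
print(dict(w))   # the 'uses_landmark' feature should end up with positive weight
```

    Each update pulls the weights toward the best candidate that actually works in the world and away from the current top-scoring candidate, mirroring the usual perceptron reranking update with a response-selected pseudo-gold in place of a gold parse.
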
  8. Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
    [Details] [PDF] [Slides]
    Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond J. Mooney, Kate Saenko, Sergio Guadarrama
    In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013), 541--547, July 2013.
    We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61 percent of the time.
    ML ID: 282
  9. Improving Video Activity Recognition using Object Recognition and Text Mining
    [Details] [PDF] [Slides]
    Tanvi S. Motwani and Raymond J. Mooney
    In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), 600--605, August 2012.
    Recognizing activities in real-world videos is a challenging AI problem. We present a novel combination of standard activity classification, object recognition, and text mining to learn effective activity recognizers without ever explicitly labeling training videos. We cluster verbs used to describe videos to automatically discover classes of activities and produce a labeled training set. This labeled data is then used to train an activity classifier based on spatio-temporal features. Next, text mining is employed to learn the correlations between these verbs and related objects. This knowledge is then used together with the outputs of an off-the-shelf object recognizer and the trained activity classifier to produce an improved activity recognizer. Experiments on a corpus of YouTube videos demonstrate the effectiveness of the overall approach.
    ML ID: 274
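
    The system above discovers activity classes by clustering the verbs people use to describe videos. The sketch below shows a generic version of that step (not necessarily the authors' clustering method): each verb is represented by the objects it co-occurs with in a handful of invented descriptions, and the resulting vectors are clustered with scikit-learn's KMeans.

```python
# Toy activity discovery: represent each verb by the nouns it co-occurs with
# in (invented) video descriptions, then cluster verbs into activity classes.
import numpy as np
from sklearn.cluster import KMeans

pairs = [  # (verb, object) pairs pulled from hypothetical descriptions
    ("slice", "onion"), ("slice", "bread"), ("chop", "onion"), ("chop", "carrot"),
    ("ride", "bike"), ("ride", "horse"), ("drive", "car"), ("drive", "truck"),
]
verbs = sorted({v for v, _ in pairs})
nouns = sorted({n for _, n in pairs})

# Build a verb-by-noun co-occurrence matrix.
M = np.zeros((len(verbs), len(nouns)))
for v, n in pairs:
    M[verbs.index(v), nouns.index(n)] += 1

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(M)
for v, c in zip(verbs, labels):
    print(f"{v} -> activity cluster {c}")
```
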
  10. Generative Models of Grounded Language Learning with Ambiguous Supervision
    [Details] [PDF] [Slides]
    Joohyun Kim
    Technical Report, PhD proposal, Department of Computer Science, The University of Texas at Austin, June 2012.

    "Grounded" language learning is the process of learning the semantics of natural language with respect to relevant perceptual inputs. Toward this goal, computational systems are trained with data in the form of natural language sentences paired with relevant but ambiguous perceptual contexts. With such ambiguous supervision, it is required to resolve the ambiguity between a natural language (NL) sentence and a corresponding set of possible logical meaning representations (MR). My research focuses on devising effective models for simultaneously disambiguating such supervision and learning the underlying semantics of language to map NL sentences into proper logical forms. Specifically, I will present two probabilistic generative models for learning such correspondences. The models are applied to two publicly available datasets in two different domains, sportscasting and navigation, and compared with previous work on the same data.

    I will first present a probabilistic generative model that learns the mappings from NL sentences into logical forms where the true meaning of each NL sentence is one of a handful of candidate logical MRs. It simultaneously disambiguates the meaning of each sentence in the training data and learns to probabilistically map an NL sentence to its MR form depicted in a single tree structure. Evaluations are performed on the RoboCup sportscasting corpus and show that it outperforms previous methods.

    Next, I present a PCFG induction model for grounded language learning that extends the model of Borschinger, Jones, and Johnson (2011) by utilizing a semantic lexicon. Borschinger et al.'s approach works well when there is limited ambiguity such as in the sportscasting task, but it does not scale well to highly ambiguous situations when there are large sets of potential meaning possibilities for each sentence, such as in the navigation instruction following task studied by Chen and Mooney (2011). Our model overcomes such limitations by employing a semantic lexicon as the basic building block for PCFG rule generation. Our model also allows for novel combination of MR outputs when parsing novel test sentences.

    For future work, I propose to extend our PCFG induction model in several ways: improving the lexicon learning algorithm, discriminative reranking of top-k parses, and integrating the meaning representation language (MRL) grammar for extra structural information. The longer-term agenda includes applying our approach to summarized machine translation, using real perception data such as robot sensor readings and images/videos, and joint learning with other natural language processing tasks.

    ML ID: 273
  11. Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
    [Details] [PDF]
    Joohyun Kim and Raymond J. Mooney
    In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), 433--444, Jeju Island, Korea, July 2012.
    "Grounded" language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts. Borschinger et al. (2011) introduced an approach to grounded language learning based on unsupervised PCFG induction. Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task. However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (2011). This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision. Experimental results on the navigation task demonstrates the effectiveness of our approach.
    ML ID: 272
  12. Fast Online Lexicon Learning for Grounded Language Acquisition
    [Details] [PDF] [Slides]
    David L. Chen
    In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012), 430--439, July 2012.
    Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon's Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
    ML ID: 271
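
    Both entries above build on Chen and Mooney's idea of learning a lexicon by intersecting the contexts of the examples in which a word appears. The sketch below keeps only the flavor of that idea, using plain sets of atomic facts rather than the graph intersections used in the actual work; the sentences and facts are fabricated.

```python
# Toy lexicon learning by intersection: a word's candidate meaning is what
# the perceptual contexts of all sentences containing it have in common.
# Real systems intersect graph structures; flat sets of facts are used
# here purely for illustration.
from functools import reduce

examples = [  # (sentence, set of facts describing its ambiguous context)
    ("turn left at the chair",   {"TURN(LEFT)", "AT(CHAIR)", "WALL(BLUE)"}),
    ("turn left after the lamp", {"TURN(LEFT)", "AT(LAMP)", "FLOOR(BRICK)"}),
    ("go to the blue chair",     {"GO(FORWARD)", "AT(CHAIR)", "WALL(BLUE)"}),
]

def candidate_meaning(word):
    contexts = [facts for sent, facts in examples if word in sent.split()]
    if not contexts:
        return set()
    return reduce(set.intersection, contexts)

print(candidate_meaning("left"))   # {'TURN(LEFT)'}
print(candidate_meaning("chair"))  # {'AT(CHAIR)', 'WALL(BLUE)'} -- still ambiguous
```

    Words seen in more varied contexts get sharper meanings, while words like "chair" above remain ambiguous until more examples arrive, which is why scaling this step to large, streaming datasets matters.
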
  13. Learning Language from Ambiguous Perceptual Context
    [Details] [PDF] [Slides]
    David L. Chen
    PhD Thesis, Department of Computer Science, University of Texas at Austin, May 2012. 196 pages.

    Building a computer system that can understand human languages has been one of the long-standing goals of artificial intelligence. Currently, most state-of-the-art natural language processing (NLP) systems use statistical machine learning methods to extract linguistic knowledge from large, annotated corpora. However, constructing such corpora can be expensive and time-consuming due to the expertise it requires to annotate such data. In this thesis, we explore alternative ways of learning which do not rely on direct human supervision. In particular, we draw our inspiration from the fact that humans are able to learn language through exposure to linguistic inputs in the context of a rich, relevant, perceptual environment.

    We first present a system that learned to sportscast for RoboCup simulation games by observing how humans commentate a game. Using the simple assumption that people generally talk about events that have just occurred, we pair each textual comment with a set of events that it could be referring to. By applying an EM-like algorithm, the system simultaneously learns a grounded language model and aligns each description to the corresponding event. The system does not use any prior language knowledge and was able to learn to sportscast in both English and Korean. Human evaluations of the generated commentaries indicate they are of reasonable quality and in some cases even on par with those produced by humans.

    For the sportscasting task, while each comment could be aligned to one of several events, the level of ambiguity was low enough that we could enumerate all the possible alignments. However, it is not always possible to restrict the set of possible alignments to such limited numbers. Thus, we present another system that allows each sentence to be aligned to one of exponentially many connected subgraphs without explicitly enumerating them. The system first learns a lexicon and uses it to prune the nodes in the graph that are unrelated to the words in the sentence. By only observing how humans follow navigation instructions, the system was able to infer the corresponding hidden navigation plans and parse previously unseen instructions in new environments for both English and Chinese data. With the rise in popularity of crowdsourcing, we also present results on collecting additional training data using Amazon’s Mechanical Turk. Since our system only needs supervision in the form of language being used in relevant contexts, it is easy for virtually anyone to contribute to the training data.

    ML ID: 269
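
    The sportscasting system described in this thesis pairs each comment with the set of recent events it might describe and uses an EM-like procedure to align comments to events while learning a grounded language model. The sketch below is a heavily simplified, hypothetical version of that loop over invented data; it is not the thesis's actual model, which also learns to parse and generate.

```python
# Minimal EM-flavored alignment: each comment could describe any of several
# recent events; we alternate between (1) softly aligning comments to events
# with a word-given-event model and (2) re-estimating that model from the
# alignments. A toy stand-in over invented data, not the thesis's full model.
from collections import defaultdict

data = [  # (comment words, candidate event types that just occurred)
    (["purple7", "passes", "to", "purple10"], ["PASS", "TURNOVER"]),
    (["purple10", "passes", "back"],          ["PASS", "KICK"]),
    (["purple10", "shoots"],                  ["KICK", "PASS"]),
    (["pink9", "shoots", "and", "misses"],    ["KICK", "TURNOVER"]),
]

prob = defaultdict(lambda: 1.0)   # unnormalized P(word | event), uniform start

for _ in range(20):
    counts = defaultdict(float)
    for words, events in data:
        # E-step: how well does each candidate event explain the words?
        scores = {e: 1.0 for e in events}
        for e in events:
            for w in words:
                scores[e] *= prob[(w, e)]
        total = sum(scores.values())
        for e in events:                      # fractional co-occurrence counts
            for w in words:
                counts[(w, e)] += scores[e] / total
    # M-step: renormalize counts into P(word | event).
    event_totals = defaultdict(float)
    for (w, e), c in counts.items():
        event_totals[e] += c
    prob = defaultdict(lambda: 1e-6,
                       {(w, e): c / event_totals[e] for (w, e), c in counts.items()})

def event_score(words, e):
    s = 1.0
    for w in words:
        s *= prob[(w, e)]
    return s

for words, events in data:                    # report the preferred alignments
    best = max(events, key=lambda e: event_score(words, e))
    print(" ".join(words), "->", best)        # 'passes' comments should go to PASS
```
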
  14. Learning to Interpret Natural Language Navigation Instructions from Observations
    [Details] [PDF] [Slides]
    David L. Chen and Raymond J. Mooney
    In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), 859--865, August 2011.
    The ability to understand natural-language instructions is critical to building intelligent agents that interact with humans. We present a system that learns to transform natural-language navigation instructions into executable formal plans. Given no prior linguistic knowledge, the system learns by simply observing how humans follow navigation instructions. The system is evaluated in three complex virtual indoor environments with numerous objects and landmarks. A previously collected realistic corpus of complex English navigation instructions for these environments is used for training and testing data. By using a learned lexicon to refine inferred plans and a supervised learner to induce a semantic parser, the system is able to automatically learn to correctly interpret a reasonable fraction of the complex instructions in this corpus.
    ML ID: 264
  15. Panning for Gold: Finding Relevant Semantic Content for Grounded Language Learning
    [Details] [PDF] [Slides]
    David L. Chen and Raymond J. Mooney
    In Proceedings of the Symposium on Machine Learning in Speech and Language Processing (MLSLP 2011), June 2011.
    One of the key challenges in grounded language acquisition is resolving the intentions of the expressions. Typically the task involves identifying a subset of records from a list of candidates as the correct meaning of a sentence. While most current work assumes complete or partial independence between the records, we examine a scenario in which they are strongly related. By representing the set of potential meanings as a graph, we explicitly encode the relationships between the candidate meanings. We introduce a refinement algorithm that first learns a lexicon which is then used to remove parts of the graphs that are irrelevant. Experiments in a navigation domain show that the algorithm successfully recovered over three quarters of the correct semantic content.
    ML ID: 261
  16. Generative Alignment and Semantic Parsing for Learning from Ambiguous Supervision
    [Details] [PDF]
    Joohyun Kim and Raymond J. Mooney
    In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), 543--551, Beijing, China, August 2010.
    We present a probabilistic generative model for learning semantic parsers from ambiguous supervision. Our approach learns from natural language sentences paired with world states consisting of multiple potential logical meaning representations. It disambiguates the meaning of each sentence while simultaneously learning a semantic parser that maps sentences into logical form. Compared to a previous generative model for semantic alignment, it also supports full semantic parsing. Experimental results on the RoboCup sportscasting corpora in both English and Korean indicate that our approach produces more accurate semantic alignments than existing methods and also produces competitive semantic parsers and improved language generators.
    ML ID: 251
  17. Using Closed Captions as Supervision for Video Activity Recognition
    [Details] [PDF]
    Sonal Gupta, Raymond J. Mooney
    In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-2010), 1083--1088, Atlanta, GA, July 2010.
    Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, and rapid camera movements. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This paper explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting ‘labeled’ data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
    ML ID: 242
  18. Training a Multilingual Sportscaster: Using Perceptual Context to Learn Language
    [Details] [PDF]
    David L. Chen, Joohyun Kim, Raymond J. Mooney
    Journal of Artificial Intelligence Research, 37:397--435, 2010.
    We present a novel framework for learning to interpret and generate language using only perceptual context as supervision. We demonstrate its capabilities by developing a system that learns to sportscast simulated robot soccer games in both English and Korean without any language-specific prior knowledge. Training employs only ambiguous supervision consisting of a stream of descriptive textual comments and a sequence of events extracted from the simulation trace. The system simultaneously establishes correspondences between individual comments and the events that they describe while building a translation model that supports both parsing and generation. We also present a novel algorithm for learning which events are worth describing. Human evaluations of the generated commentaries indicate they are of reasonable quality and in some cases even on par with those produced by humans for our limited domain.
    ML ID: 240
  19. Learning Language from Perceptual Context
    [Details] [PDF] [Slides]
    David L. Chen
    Ph.D. proposal, Department of Computer Sciences, University of Texas at Austin, December 2009.
    Most current natural language processing (NLP) systems are built using statistical learning algorithms trained on large annotated corpora, which can be expensive and time-consuming to collect. In contrast, humans can learn language through exposure to linguistic input in the context of a rich, relevant, perceptual environment. If a machine learning system can acquire language in a similar manner without explicit human supervision, then it can leverage the large amount of available text that refers to observed world states (e.g. sportscasts, instruction manuals, weather forecasts, etc.). Thus, my research focuses on how to build systems that use both text and the perceptual context in which it is used in order to learn a language.

    I will first present a system we completed that can describe events in RoboCup 2D simulation games by learning only from sample language commentaries paired with traces of simulated activities, without any language-specific prior knowledge. By applying an EM-like algorithm, the system was able to simultaneously learn a grounded language model as well as align the ambiguous training data. Human evaluations of the generated commentaries indicate they are of reasonable quality and in some cases even on par with those produced by humans.

    For future work, I am proposing to solve the more complex task of learning how to give and receive navigation instructions in a virtual environment. In this setting, each instruction corresponds to a navigation plan that is not directly observable. Since an exponential number of plans can all lead to the same observed actions, we have to learn from compact representations of all valid plans rather than enumerating all possible meanings as we did in the sportscasting task. Initially, the system will passively observe a human giving instructions to another human and try to learn the correspondences between the instructions and the intended plan. After the system has a decent understanding of the language, it can then participate in the interactions to learn more directly by playing either the role of the instructor or the follower.

    ML ID: 239
  20. Activity Retrieval in Closed Captioned Videos
    [Details] [PDF]
    Sonal Gupta
    Master's Thesis, Department of Computer Sciences, University of Texas at Austin, August 2009. 64 pages.
    Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, occlusion and rapid camera movements. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This thesis explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting “labeled” data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
    ML ID: 236
  21. Using Closed Captions to Train Activity Recognizers that Improve Video Retrieval
    [Details] [PDF]
    Sonal Gupta and Raymond Mooney
    In Proceedings of the CVPR-09 Workshop on Visual and Contextual Learning from Annotated Images and Videos (VCL), Miami, FL, June 2009.
    Recognizing activities in real-world videos is a difficult problem exacerbated by background clutter, changes in camera angle and zoom, rapid camera movements, etc. Large corpora of labeled videos can be used to train automated activity recognition systems, but this requires expensive human labor and time. This paper explores how closed captions that naturally accompany many videos can act as weak supervision that allows automatically collecting labeled data for activity recognition. We show that such an approach can improve activity retrieval in soccer videos. Our system requires no manual labeling of video clips and needs minimal human supervision. We also present a novel caption classifier that uses additional linguistic information to determine whether a specific comment refers to an ongoing activity. We demonstrate that combining linguistic analysis and automatically trained activity recognizers can significantly improve the precision of video retrieval.
    ML ID: 226
  22. Watch, Listen & Learn: Co-training on Captioned Images and Videos
    [Details] [PDF]
    Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney
    In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 457--472, Antwerp, Belgium, September 2008.
    Recognizing visual scenes and activities is challenging: often visual cues alone are ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; and to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data our approach learns more accurate image and video classifiers than standard baseline algorithms.
    ML ID: 221
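
    The paper above applies co-training, with visual features and accompanying text as the two views. The sketch below runs a generic co-training loop on two synthetic feature views with scikit-learn classifiers; the data, the choice of Gaussian naive Bayes, and the one-example-per-round growth schedule are illustrative assumptions rather than the paper's setup.

```python
# Toy co-training loop over two synthetic "views" (stand-ins for visual
# features and caption-text features). Each view's classifier pseudo-labels
# the unlabeled example it is most confident about, growing the shared
# labeled pool that the other view's classifier also trains on.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 200
y_true = rng.integers(0, 2, n)
view1 = rng.normal(loc=y_true[:, None], scale=1.0, size=(n, 5))      # "visual" view
view2 = rng.normal(loc=2 * y_true[:, None], scale=1.5, size=(n, 5))  # "text" view

labeled = list(range(10))                  # only a handful of labeled examples
unlabeled = list(range(10, n))
labels = {i: int(y_true[i]) for i in labeled}  # labels available for training

clf1, clf2 = GaussianNB(), GaussianNB()
for _ in range(20):                        # a few co-training rounds
    train_y = [labels[i] for i in labeled]
    clf1.fit(view1[labeled], train_y)
    clf2.fit(view2[labeled], train_y)
    for clf, view in ((clf1, view1), (clf2, view2)):
        if not unlabeled:
            break
        probs = clf.predict_proba(view[unlabeled])
        pick = unlabeled[int(np.argmax(probs.max(axis=1)))]
        labels[pick] = int(clf.predict(view[[pick]])[0])   # pseudo-label
        labeled.append(pick)
        unlabeled.remove(pick)

acc = float(np.mean(clf1.predict(view1) == y_true))
print(f"view-1 classifier accuracy on the true labels: {acc:.2f}")
```
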
  23. Learning to Sportscast: A Test of Grounded Language Acquisition
    [Details] [PDF] [Slides] [Video]
    David L. Chen and Raymond J. Mooney
    In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, July 2008.
    We present a novel commentator system that learns language from sportscasts of simulated soccer games. The system learns to parse and generate commentaries without any engineered knowledge about the English language. Training is done using only ambiguous supervision in the form of textual human commentaries and simulation states of the soccer games. The system simultaneously tries to establish correspondences between the commentaries and the simulation states as well as build a translation model. We also present a novel algorithm, Iterative Generation Strategy Learning (IGSL), for deciding which events to comment on. Human evaluations of the generated commentaries indicate they are of reasonable quality compared to human commentaries.
    ML ID: 219
  24. Learning to Connect Language and Perception
    [Details] [PDF]
    Raymond J. Mooney
    In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), 1598--1601, Chicago, IL, July 2008. Senior Member Paper.
    To truly understand language, an intelligent system must be able to connect words, phrases, and sentences to its perception of objects and events in the world. Current natural language processing and computer vision systems make extensive use of machine learning to acquire the probabilistic knowledge needed to comprehend linguistic and visual input. However, to date, there has been relatively little work on learning the relationships between the two modalities. In this talk, I will review some of the existing work on learning to connect language and perception, discuss important directions for future research in this area, and argue that the time is now ripe to make a concerted effort to address this important, integrative AI problem.
    ML ID: 216
  25. Learning Language Semantics from Ambiguous Supervision
    [Details] [PDF]
    Rohit J. Kate and Raymond J. Mooney
    In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-2007), 895--900, Vancouver, Canada, July 2007.
    This paper presents a method for learning a semantic parser from ambiguous supervision. Training data consists of natural language sentences annotated with multiple potential meaning representations, only one of which is correct. Such ambiguous supervision models the type of supervision that can be more naturally available to language-learning systems. Given such weak supervision, our approach produces a semantic parser that maps sentences into meaning representations. An existing semantic parsing learning system that can only learn from unambiguous supervision is augmented to handle ambiguous supervision. Experimental results show that the resulting system is able to cope with ambiguities and learn accurate semantic parsers.
    ML ID: 200
  26. Learning Language from Perceptual Context: A Challenge Problem for AI
    [Details] [PDF]
    Raymond J. Mooney
    In Proceedings of the 2006 AAAI Fellows Symposium, Boston, MA, July 2006.
    We present the problem of learning to understand natural language from examples of utterances paired only with their relevant real-world context as an important challenge problem for AI. Machine learning has been adopted as the most effective way of developing natural-language processing systems; however, currently, complex annotated corpora are required for training. By learning language from perceptual context, the need for laborious annotation is removed and the system's resulting understanding is grounded in its perceptual experience.
    ML ID: 192