I am actively looking for prospective graduate students interested in machine learning applied to speech, audio, and natural language, especially within a multimodal context (e.g. in conjunction with vision).
If you are already a student at UT, please contact me directly.
If you are not currently a UT student, please apply to UTCS and mention your interest in working with my group in your statement of purpose. I am not able to respond to direct inquiries regarding admissions. Information about applying to the graduate program can be found here.
My research interests are in the area of machine learning for speech and language processing. The ultimate goal of my work is to discover the algorithmic mechanisms that would enable computers to learn and use spoken language the way that humans do. My approach emphasizes the multimodal and grounded nature of human language, and thus has a strong connection to other machine learning disciplines such as computer vision.
While modern machine learning techniques such as deep learning have made impressive progress across a variety of domains, it is doubtful that existing methods can fully capture the phenomenon of language. State-of-the-art deep learning models for tasks such as speech recognition are extremely data hungry, requiring many thousands of hours of speech recordings that have been painstakingly transcribed by humans. Even then, they are highly brittle outside of their training domain, breaking down when confronted with new vocabulary, accents, or environmental noise. Because of this reliance on massive training datasets, the technology we do have is completely out of reach for all but several dozen of the roughly 7,000 human languages spoken worldwide.
In contrast, human toddlers are able to grasp the meaning of new word forms from only a few spoken examples, and learn to carry a meaningful conversation long before they are able to read and write. There are critical aspects of language that are currently missing from our machine learning models. Human language is inherently multimodal; it is grounded in embodied experience; it holistically integrates information from all of our sensory organs into our rational capacity; and it is acquired via immersion and interaction, without the kind of heavy-handed supervision relied upon by most machine learning models. My research agenda revolves around finding ways to bring these aspects into the fold.
Prior to joining UT, I worked as a research scientist at MIT CSAIL from 2018 to 2020. I received my PhD in 2018 from the Spoken Language Systems Group at MIT CSAIL, under the supervision of Jim Glass.