Yi-Jen (Ian) Shih

Ph.D. Student, UT Austin CS

Email: yjshih [AT] utexas.edu

I'm a third-year Ph.D. student at UTCS, supervised by Prof. David Harwath. Before joining UT, I was supervised by Prof. Hung-yi Lee at NTU and Prof. Yi-Hsuan Yang at Academia Sinica.
I'm interested in Speech Foundation Models, Self-supervised Representation Learning, and Multimodal Representation Learning.


Recent Publications (* indicates equal contribution)
  • Can Speech LLMs Think while Listening?
    Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
    arXiv preprint 2025
    arXiv 

  • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
    Chien-yu Huang, ..., Yi-Jen Shih, et al.
    ICLR 2025
    arXiv

  • Unifying Model and Layer Fusion for Speech Foundation Models
    Yi-Jen Shih, David Harwath
    IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
    arXiv

  • Self-supervised Speech Models for Word-Level Stuttered Speech Detection
    Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath
    IEEE Spoken Language Technology Workshop (SLT) 2024
    arXiv

  • Measuring Sound Symbolism in Audio-Visual Models
    Wei-Cheng Tseng*, Yi-Jen Shih*, David Harwath, Raymond Mooney
    IEEE Spoken Language Technology Workshop (SLT) 2024
    arXiv

  • Interface Design for Self-Supervised Speech Models
    Yi-Jen Shih, David Harwath
    Interspeech 2024
    arXiv  code

  • SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
    Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
    ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
    arXiv

  • Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
    Hung-Chieh Fang*, Nai-Xuan Ye*, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
    ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
    arXiv

  • AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
    Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2024
    arXiv

  • M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval
    Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023
    arXiv

  • SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
    Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
    IEEE Spoken Language Technology Workshop (SLT) 2022
    arXiv  blog  code  present@JSALT22  poster

  • Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer
    Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang
    IEEE Transactions on Multimedia (TMM) 2022
    arXiv  blog  code  demo  slides@MILA  talk@MILA