Yi-Jen (Ian) Shih

Ph.D. Student, UT Austin CS

Email: yjshih [AT] utexas.edu

I'm a third-year Ph.D. student at UTCS, supervised by Prof. David Harwath. Before joining UT, I was supervised by Prof. Hung-yi Lee at NTU and Prof. Yi-Hsuan Yang at Academia Sinica.
I'm interested in Speech Foundation Models, Self-supervised Representation Learning, and Multimodal Representation Learning.


Recent Publications (* indicates equal contribution)
  • Can Speech LLMs Think while Listening?
    Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
    arXiv preprint 2025
    arXiv 

  • Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
    Chien-yu Huang, ..., Yi-Jen Shih, et al.
    ICLR 2025
    arXiv

  • Unifying Model and Layer Fusion for Speech Foundation Models
    Yi-Jen Shih, David Harwath
    IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2025
    arXiv

  • Self-supervised Speech Models for Word-Level Stuttered Speech Detection
    Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath
    IEEE Spoken Language Technology Workshop (SLT) 2024
    arXiv

  • Measuring Sound Symbolism in Audio-Visual Models
    Wei-Cheng Tseng*, Yi-Jen Shih*, David Harwath, Raymond Mooney
    IEEE Spoken Language Technology Workshop (SLT) 2024
    arXiv

  • Interface Design for Self-Supervised Speech Models
    Yi-Jen Shih, David Harwath
    Interspeech 2024
    arXiv  code

  • SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data
    Hsuan-Fu Wang, Yi-Jen Shih, Heng-Jui Chang, Layne Berry, Puyuan Peng, Hung-yi Lee, Hsin-Min Wang, David Harwath
    ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
    arXiv

  • Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model
    Hung-Chieh Fang*, Nai-Xuan Ye*, Yi-Jen Shih, Puyuan Peng, Hsuan-Fu Wang, Layne Berry, Hung-yi Lee, David Harwath
    ICASSP 2024 workshop on Self-supervision in Audio, Speech, and Beyond (SASB)
    arXiv

  • AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
    Yuan Tseng, Layne Berry*, Yi-Ting Chen*, I-Hsiang Chiu*, Hsuan-Hao Lin*, Max Liu*, Puyuan Peng*, Yi-Jen Shih*, Hung-Yu Wang*, Haibin Wu*, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2024
    arXiv

  • M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval
    Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-yi Lee, David Harwath
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023
    arXiv

  • SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model
    Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
    IEEE Spoken Language Technology Workshop (SLT) 2022
    arXiv  blog  code  present@JSALT22  poster

  • Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer
    Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang
    IEEE Transactions on Multimedia (TMM) 2022
    arXiv  blog  code  demo  slides@MILA  talk@MILA