Directly Optimizing Evaluation Metrics to Improve Text to Motion (2023)
There is a long-existing discrepancy between training and testing process of most generative models including both text-to-text models like machine translation (MT), and multi-modal models like image captioning and text-to-motion generation. These models are usually trained to optimize a specific objective like log-likelihood (MLE) in the Seq2Seq models or the KL-divergence in the variational autoencoder (VAE) models. However, they are tested using different evaluation metrics such as the BLEU score and Fr├ęchet Inception Distance (FID). Our paper aims to address such discrepancy in text-to-motion generation models by developing algorithms to directly optimize the target metric during training time. We explore three major techniques: reinforcement learning, contrastive learning methods, and differentiable metrics that are originally applied to natural language processing fields and adapt them to the language-and-motion domain.
Masters Thesis, Department of Computer Science, UT Austin.

Yili Wang Masters Student ywang98 [at] utexas edu