📝 Representative Publications

Multi-modal Large Language Model

  • Speech Pre-training: InstructSpeech (ICML, 2024), UniAudio (ICML, 2024)
  • Joint Understanding and Generation: Seamless Interaction (Technical Report, 2025), AudioGPT (AAAI, 2024)
  • Efficient Post-training: MVoice (ACL, 2024), VoiceTuner (ACM-MM, 2024)

Omni Audio Generative Models

  • Video-to-Audio Generation: Lumina-T2X (ICLR, 2025), Make-An-Audio (ICML, 2023)
  • Speech Generation: GenerSpeech (NeurIPS, 2022), FastDiff (IJCAI, 2022), ProDiff (ACM-MM, 2022), FastDiff 2 (ACL, 2023)
  • Music Generation: SingGAN (ACM-MM, 2022), Multi-Singer (ACM-MM, 2021)

Audio-Visual Language Processing

  • Speech Translation: TranSpeech (ICLR, 2023), AV-TranSpeech (ACL, 2023)
  • Self-Supervised Learning: Prosody-MAE (ACL, 2023)
Technical Report 2025

TLDR: Llama 4 with interleaved speech and text generates duplex audio, and a diffusion model generates dyadic motion gestures and facial expressions aligned with human speech.

We develop a suite of joint LLM and diffusion models (AVLM) to generate dyadic motion gestures and facial expressions aligned with human speech. The AVLM can understand and generate both speech and visual modalities. With 2D and 3D renderers, it brings us closer to interactive virtual agents. Our work has been featured by various media and forums, such as Meta AI, LinkedIn, and Twitter. The code is released on Hugging Face, with 30k+ downloads.

AAAI 2024
ICML 2023
ICLR 2023

Part of our continuing effort to reduce communication barriers, with follow-up works: Audio-Visual S2T (MixSpeech, ICCV 2023), Audio-Visual S2ST (AV-TranSpeech, ACL 2023), Multi-modal S2ST, Style-aware S2ST, Zero-shot S2ST. Code released.

NeurIPS 2022

The first zero-shot TTS model that generalizes to unseen speakers, emotions, and prosody! Media coverage: PaperWeekly, Speech Home. Code released.

IJCAI 2022

Part of our continuing effort in generative modeling, with follow-up works FastDiff 2 and ProDiff. We release a diffusion text-to-speech pipeline on Hugging Face using ProDiff and FastDiff. Our work has been featured by various media and forums, such as Tencent AI Lab, Speech Home, and Twitter, and was a trending project on both GitHub and Papers with Code.

Full Publication List

* denotes co-first authors, # denotes co-supervised

2025

2024

2023

2022

2021

2020 and Prior