📝 Representative Publications

Multi-modal Generative AI

  • Spoken Large Language Models: InstructSpeech (ICML 2024), UniAudio (ICML 2024), AudioGPT (AAAI demo 2024), Make-A-Voice (ACL 2024), HiFi-Codec
  • Text-to-Audio Synthesis: Make-An-Audio (ICML 2023)
  • Text-to-Speech Synthesis: GenerSpeech (NeurIPS 2022) for zero-shot text-to-speech, FastDiff (IJCAI 2022) / ProDiff (ACM-MM 2022a) for diffusion text-to-speech
  • Singing Voice Synthesis: SingGAN (ACM-MM 2022b) / Multi-Singer (ACM-MM 2021)

Multi-modal Language Processing

  • Audio-Visual Speech-to-Speech Translation: TranSpeech (ICLR 2023) / AV-TranSpeech (ACL 2023)
  • Self-Supervised Learning: Prosody-MAE (ACL 2023)
arXiv 2025 · AAAI 2024 · ICML 2023 · ICLR 2023

One of our continuing efforts to reduce communication barriers, with follow-up works: Audio-Visual S2T (MixSpeech, ICCV 2023), Audio-Visual S2ST (AV-TranSpeech, ACL 2023), Multi-modal S2ST, Style-aware S2ST, and Zero-shot S2ST. Code released.

NeurIPS 2022

The first zero-shot TTS model generalizable to unseen speakers, emotions, and prosody! Media coverage: PaperWeekly, Speech Home. Code released.

IJCAI 2022

One of our continuing efforts in generative modeling, with follow-up works: FastDiff 2 and ProDiff. We released a diffusion text-to-speech pipeline on Hugging Face built on ProDiff and FastDiff. Our work has been promoted by various media and forums, such as Tencent AI Lab, Speech Home, and Twitter, and was a trending project on both GitHub and Papers with Code.

Full Publication List

  • * denotes co-first authors, # denotes co-supervised

2025

2024

2023

2022

2021

2020 and Prior