Rongjie Huang (黄融杰) is a member of the Seamless Team at FAIR. Previously, I graduated from the College of Computer Science, Zhejiang University, supervised by Prof. Zhou Zhao, where I also obtained my Bachelor's degree. My research interests include Multi-modal Large Language Models, Video-Audio Generative Models, and Audio-Visual Language Processing. I have published first-author papers at top international AI conferences such as NeurIPS/ICLR/ICML/ACL/IJCAI. I have developed several well-known Speech/NLP algorithms, listed under Representative Publications below.

🔥 News

  • 2025.04: I am awarded the Best Thesis Award by the Electrical Engineering Association!
  • 2025.02: 4 papers are accepted by ICLR 2025!
  • 2025.01: 1 paper is accepted by AAAI 2025!
  • 2024.10: 6 papers are accepted by NeurIPS 2024!
  • 2024.05: 6 papers are accepted by ACL 2024 (main conference and findings)!
  • 2024.05: 3 papers are accepted by ICML 2024!
  • 2024.03: 1 paper is accepted by NAACL 2024 main conference!
  • 2024.01: 1 paper is accepted by ICLR 2024!
  • 2023.11: 2 papers are accepted by AAAI 2024 (main track and demo track)!
  • 2023.10: I am awarded the ByteDance Scholar Fellowship and the Chu Kochen Presidential Scholarship!
  • 2023.10: UniAudio released!
  • 2023.09: One paper is accepted by EMNLP 2023!
  • 2023.07: One paper is accepted by ACM-MM 2023!
  • 2023.06: One paper is accepted by ICCV 2023!
  • 2023.05: 8 papers are accepted by ACL 2023 (main conference and findings)! Thanks to my co-authors!
  • 2023.04: AudioGPT and HiFi-Codec released!
  • 2023.04: One paper is accepted by ICML 2023!
  • 2023.02: Make-An-Audio released! Media coverage: Heart of Machine, ByteDance, and Twitter.
  • 2023.01: One paper is accepted by ICLR 2023!
  • 2022.09: Two papers are accepted by NeurIPS 2022!

📝 Representative Publications

Multi-modal Large Language Model

  • Speech Pre-training: InstructSpeech (ICML, 2024), UniAudio (ICML, 2024)
  • Joint Understanding and Generation: Seamless Interaction (Technical Report, 2025), AudioGPT (AAAI, 2024)
  • Efficient Post-training: MVoice (ACL, 2024), VoiceTuner (ACM-MM, 2024)

Omni Audio Generative Models

  • Video-to-Audio Generation: Lumina-T2X (ICLR, 2025), Make-An-Audio (ICML, 2023)
  • Speech Generation: GenerSpeech (NeurIPS, 2022), FastDiff (IJCAI, 2022), ProDiff (ACM-MM, 2022), FastDiff 2 (ACL, 2023)
  • Music Generation: SingGAN (ACM-MM, 2022), Multi-Singer (ACM-MM, 2021)

Audio-Visual Language Processing

  • Speech Translation: TranSpeech (ICLR, 2023), AV-TranSpeech (ACL, 2023)
  • Self-Supervised Learning: Prosody-MAE (ACL, 2023)
Technical Report 2025

TL;DR: Llama 4 with interleaved speech-text to generate duplex audio, and a diffusion model to generate dyadic motion gestures and facial expressions aligned with human speech.

We develop a suite of joint LLM and diffusion models (AVLM) to generate dyadic motion gestures and facial expressions aligned with human speech. The AVLM can both understand and generate speech and visual modalities. Combined with 2D and 3D renderers, it brings us closer to interactive virtual agents. Our work has been promoted by various media and forums, such as Meta AI, LinkedIn, and Twitter. The code is released on Hugging Face, with 30k+ downloads.

AAAI 2024
ICML 2023
ICLR 2023

One of our continuous efforts to reduce communication barriers, with follow-up works: Audio-Visual S2T (MixSpeech, ICCV 2023), Audio-Visual S2ST (AV-TranSpeech, ACL 2023), Multi-modal S2ST, Style-aware S2ST, and Zero-shot S2ST. Code released: .

NeurIPS 2022

The first zero-shot TTS model that generalizes to unseen speakers, emotions, and prosody! Media coverage: PaperWeekly, Speech Home. Code released: .

IJCAI 2022

One of our continuous efforts in generative modeling, with follow-up works FastDiff 2 and ProDiff. We release a diffusion text-to-speech pipeline on Hugging Face using ProDiff and FastDiff. Our work has been promoted by various media and forums, such as Tencent AI Lab, Speech Home, and Twitter, and was a Trending Project on both GitHub and Papers with Code.

Full Publication List

* denotes co-first authors, # denotes co-supervision.

2025

2024

2023

2022

2021

2020 and Prior

Selected Honors and Awards

  • Best Thesis Award, National Electrical Engineering Association (2025).
  • Excellent Graduate, Zhejiang Province (2024).
  • Chu Kochen Presidential Scholarship (2023), the highest honor at Zhejiang University.
  • ByteDance Scholar Fellowship (2023, 100k RMB bonus), awarded to 10 students per year.
  • ICML/ICLR Grant Award
  • Outstanding Reviewer, ICML 2022. Top 10%.
  • National Scholarship (2022, 2023, Graduate student). Top 1%.
  • National Scholarship (2020, 2021, Undergraduate student). Top 1%.
  • Excellent Graduate, Zhejiang Province (2021).
  • Chu Kochen Presidential Scholarship Finalist (2021).
  • First Prize in American Mathematical Modeling Competition (2020).
  • First Prize of National Mathematical Modeling Competition in Zhejiang Province (2019).

Professional Services

  • Reviewer/Program Committee: ICML 2022, ACM-MM 2022, NeurIPS 2022, ARR 2022, ICML 2023, ARR 2023, ACL 2023, EMNLP 2023, ACM-MM 2023, NeurIPS 2023, ICLR 2023, Neurocomputing, IJCAI 2024, ACM-MM 2024, ACL 2024, TIP
  • Assistant Reviewer: KDD 2022, AAAI 2022, EMNLP 2022, PRCV 2021, TMM