I am Rongjie Huang (黄融杰). I completed my graduate study at the College of Computer Science and Software, Zhejiang University, supervised by Prof. Zhou Zhao, and obtained my Bachelor's degree at Zhejiang University as well. During my graduate study, I was fortunate to collaborate with the CMU speech team led by Prof. Shinji Watanabe and with the audio research team at Zhejiang University. I am grateful to have interned or collaborated at TikTok, Shanghai AI Lab (OpenGV Lab), Tencent Seattle Lab, and Alibaba DAMO Academy, working with Yi Ren, Jinglin Liu, Chunlei Zhang, and Dong Yu.
My research interests include Multi-Modal Generative AI, Multi-Modal Language Processing, and AI4Science. I have published first-author papers at top international AI conferences such as NeurIPS, ICLR, ICML, ACL, and IJCAI. I have developed several well-known Speech/NLP algorithms, including:
- AudioGPT, UniAudio: Multitask, Multilingual LLMs
- Make-An-Audio, GenerSpeech: Zero-shot text-guided synthesis
- FastDiff 1/2, ProDiff: AIGC diffusion models
- TranSpeech, and AV-TranSpeech: Multimodal Translation
In 2023, I led or participated in the following research topics:
- Speech/NLP: multimodal generation and translation
- Large Language Models (LLMs): Audio/Visual
- Diffusion models: Image/Audio/3D
🔥 News
- 2024.01: One paper is accepted by ICLR 2024!
- 2023.11: Two papers are accepted by AAAI 2024 (main track and demo track)!
- 2023.10: I am awarded the ByteDance Scholar Fellowship and the Chu Kochen Presidential Scholarship!
- 2023.10: UniAudio released!
- 2023.09: One paper is accepted by EMNLP 2023!
- 2023.07: One paper is accepted by ACM-MM 2023!
- 2023.06: One paper is accepted by ICCV 2023!
- 2023.05: Eight papers are accepted by ACL 2023 (main conference and Findings)! Thanks to my co-authors!
- 2023.04: AudioGPT and HiFi-Codec released!
- 2023.04: One paper is accepted by ICML 2023!
- 2023.02: Make-An-Audio released! Media coverage: Heart of Machine, ByteDance, and Twitter
- 2023.01: One paper is accepted by ICLR 2023!
- 2022.09: Two papers are accepted by NeurIPS 2022!
📝 Representative Publications
Multi-modal Generative AI
- Spoken Large Language Model: AudioGPT (AAAI demo 2024), UniAudio, Make-A-Voice, HiFi-Codec
- Text-to-Audio Synthesis: Make-An-Audio (ICML 2023)
- Text-to-Speech Synthesis: GenerSpeech (NeurIPS 2022) for zero-shot text-to-speech, FastDiff (IJCAI 2022) / ProDiff (ACM-MM 2022a) for diffusion text-to-speech
- Singing Voice Synthesis: SingGAN (ACM-MM 2022b) / Multi-Singer (ACM-MM 2021)
Multi-modal Language Processing
- Audio-Visual Speech-to-Speech Translation: TranSpeech (ICLR 2023) / AV-TranSpeech (ACL 2023)
- Self-Supervised Learning: Prosody-MAE (ACL 2023)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe. arXiv, 2023
  Academic / Industry Impact: Our work is promoted by various media and forums, such as Heart of Machine, New Intelligence, and Twitter. Code released.
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao. ICML, 2023. Hawaii, USA
  Academic / Industry Impact: Our work is promoted by various media and forums, such as Heart of Machine, ByteDance, and Twitter. Code is coming!
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation. Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, and Zhou Zhao. ICLR, 2023. Kigali, Rwanda
  One of our continuous efforts to reduce communication barriers, with follow-up works: Audio-Visual S2T (MixSpeech, ICCV 2023), Audio-Visual S2ST (AV-TranSpeech, ACL 2023), Multi-modal S2ST, Style-aware S2ST, and Zero-shot S2ST. Code released.
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech. Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. NeurIPS, 2022. New Orleans, USA
  The first zero-shot TTS generalizable to unseen speakers, emotions, and prosody! Media coverage: PaperWeekly, Speech Home. Code released.
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. Rongjie Huang, Max W.Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. IJCAI, 2022 (oral). Vienna, Austria
  One of our continuous efforts in generative modeling, with follow-up works: FastDiff 2 and ProDiff. We release a diffusion text-to-speech pipeline built on ProDiff and FastDiff. Our work is promoted by various media and forums, such as Tencent AI Lab, Speech Home, and Twitter, and is a trending project on both GitHub and Papers with Code.
Full Publication List
- \* denotes co-first authors, # denotes co-supervised students
2024
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head. Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe. AAAI demo, 2024
- StyleSinger: Style Transfer for Out-Of-Domain Singing Voice Synthesis. Yu Zhang#, Rongjie Huang, Ruiqi Li, Jinzheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao. AAAI, 2024
- Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis. Zhenhui Ye, Tianyun Zhong, Yi Ren, Jiaqi Yang, Weichuang Li, Jiawei Huang, Ziyue Jiang, Jinzheng He, Rongjie Huang, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao. ICLR, 2024
2023
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao. ICML, 2023. Hawaii, USA
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation. Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Shinji Watanabe, Helen Meng. arXiv, 2023
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias. Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao. arXiv
- Make-An-Audio 2: Improving Text-to-Audio with Dual Text Information Representation. Jiawei Huang#, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, Chen Zhang, Jinglin Liu, Xiang Yin, Zejun Ma, Zhou Zhao. arXiv, 2023
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation. Rongjie Huang, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, Jinzheng He, and Zhou Zhao. ICLR, 2023. Kigali, Rwanda
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation. Rongjie Huang, Huadai Liu, Xize Cheng, Yi Ren, Linjun Li, Zhenhui Ye, Jinzheng He, Lichao Zhang, Jinglin Liu, Xiang Yin and Zhou Zhao. ACL, 2023
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition. Xize Cheng, Linjun Li, Tao Jin, Rongjie Huang, Wang Lin, Zehan Wang, Huadai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao. ICCV, 2023
- CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-Training. Zhenhui Ye*, Rongjie Huang, Yi Ren, Ziyue Jiang, Jinglin Liu, Jinzheng He, Xiang Yin and Zhou Zhao. ACL, 2023
- UniSinger: Unified End-to-End Singing Voice Synthesis With Cross-Modality Information Matching. Zhiqing Hong#, Chenye Cui, Rongjie Huang, Lichao Zhang, Jinglin Liu, Jinzheng He, Zhou Zhao. ACM MM, 2023
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment. Ruiqi Li#, Rongjie Huang, Lichao Zhang, Jinglin Liu, Zhou Zhao. ACL Findings, 2023
- RMSSinger: Realistic-Music-Score based Singing Voice Synthesis. Jinzheng He, Jinglin Liu, Zhenhui Ye, Rongjie Huang, Chenye Cui, Huadai Liu, Zhou Zhao. ACL Findings, 2023
- FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models. Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, Zhou Zhao. ACL Findings, 2023
- Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation. Linjun Li, Tao Jin, Xize Cheng, Ye Wang, Wang Lin, Rongjie Huang, Zhou Zhao. ACL Findings, 2023
- ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer. Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao. EMNLP, 2023
2022
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech. Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. NeurIPS, 2022. New Orleans, USA
- Prosody-TTS: Self-Supervised Prosody Pretraining with Latent Diffusion For Text-to-Speech. Rongjie Huang, Chunlei Zhang, Yi Ren, Zhou Zhao, Dong Yu. ACL Findings, 2023
- FastDiff 2: Dually Incorporating GANs into Diffusion Models for High-Quality Speech Synthesis. Rongjie Huang, Yi Ren, Jinglin Liu, Luping Liu, Zhou Zhao. ACL Findings, 2023
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. Rongjie Huang, Max W.Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, and Zhou Zhao. IJCAI, 2022 (oral). Vienna, Austria
- ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech. Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, and Yi Ren. ACM MM, 2022. Lisbon, Portugal
- M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus. Lichao Zhang, Ruiqi Li, Shoutong Wang, Liqun Deng, Jinglin Liu, Yi Ren, Jinzheng He, Rongjie Huang, Jieming Zhu, Xiao Chen, and Zhou Zhao. NeurIPS, 2022. New Orleans, USA
- VarietySound: Timbre-Controllable Video to Sound Generation via Unsupervised Information Disentanglement. Chenye Cui, Yi Ren, Jinglin Liu, Rongjie Huang, Zhou Zhao. ICASSP, 2023
2021
- Multi-Singer: Fast Multi-Singer Singing Voice Vocoder with a Large-Scale Corpus. Rongjie Huang, Feiyang Chen, Yi Ren, Jinglin Liu, Chenye Cui, and Zhou Zhao. ACM MM, 2021 (oral). Chengdu, China | Project
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Mei Li, and Zhou Zhao. Interspeech, 2021
- Bilateral Denoising Diffusion Models. Max W.Y. Lam, Jun Wang, Rongjie Huang, Dan Su, Dong Yu. Preprint
2020 and Prior
- SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation. Rongjie Huang, Chenye Cui, Feiyang Chen, Yi Ren, Jinglin Liu, and Zhou Zhao. ACM MM, 2022. Lisbon, Portugal | Project
Selected Honors and Awards
- Excellent Graduate, Zhejiang Province (2024).
- Chu Kochen Presidential Scholarship (2023), highest honor at Zhejiang University
- ByteDance Scholar Fellowship (100k RMB bonus), awarded to 10 students per year
- ICML/ICLR Grant Award
- Outstanding Reviewer, ICML 2022. Top 10%.
- National Scholarship (2022, 2023, graduate student). Top 1%.
- National Scholarship (2020, 2021, undergraduate student). Top 1%.
- Excellent Graduate, Zhejiang Province (2021).
- Chu Kochen Presidential Scholarship Finalist (2021).
- First Prize, American Mathematical Modeling Competition (2020).
- First Prize, National Mathematical Modeling Competition, Zhejiang Province (2019).
Professional Services
- Conference/Journal Reviewer or Program Committee: ICML 2022, ACM-MM 2022, NeurIPS 2022, ARR 2022, ICML 2023, ARR 2023, ACL 2023, EMNLP 2023, ACM-MM 2023, NeurIPS 2023, ICLR 2023, Neurocomputing, IJCAI 2024, ACM-MM 2024, ACL 2024, TIP
- External Reviewer: KDD 2022, AAAI 2022, EMNLP 2022, PRCV 2021, TMM