A Single Prompt Will Have This AI Rapping and Dancing

Authors:
(1) Jiaben Chen, University of Massachusetts Amherst;
(2) Xin Yan, Wuhan University;
(3) Yihang Chen, Wuhan University;
(4) Siyuan Cen, University of Massachusetts Amherst;
(5) Qinwei Ma, Tsinghua University;
(6) Haoyu Zhen, Shanghai Jiao Tong University;
(7) Kaizhi Qian, MIT-IBM Watson AI Lab;
(8) Lie Lu, Dolby Laboratories;
(9) Chuang Gan, University of Massachusetts Amherst.
Table of Links
Abstract and 1. Introduction
2. Related Work
2.1 Text to Vocal Generation
2.2 Text to Motion Generation
2.3 Audio to Motion Generation
3. RapVerse Dataset
3.1 Rap-Vocal Subset
3.2 Rap-Motion Subset
4. Method
4.1 Problem Formulation
4.2 Motion VQ-VAE Tokenizer
4.3 Vocal2unit Audio Tokenizer
4.4 General Auto-regressive Modeling
5. Experiments
5.1 Experimental Setup
5.2 Main Results Analysis and 5.3 Ablation Study
Conclusion and References
A. Appendix
Abstract
In this work, we introduce a challenging task of simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect RapVerse, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens that preserve content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.
1 Introduction
In the evolving landscape of multi-modal content generation for sound and motion, significant strides have been made in individual modalities, including text-to-music [54, 1, 21], text-to-vocal [32], text-to-motion [13, 69, 4, 23, 34], and audio-to-motion [68, 15, 31] generation. These developments have paved the way for more dynamic and interactive digital content. Despite these advancements, existing works predominantly operate in silos, addressing each modality in isolation. However, there is strong psychological evidence that, for human beings, the generation of sound and motion is highly related and coupled [28]. A unified system for joint generation allows a more expressive and nuanced communication of emotions, intentions, and context, where the generation of one modality can guide and assist the other in a coherent and efficient way.
In this paper, we tackle a crucial problem: can a machine not only sing with emotional depth but also perform with human-like expressions and motions? We propose a novel task of simultaneously generating coherent singing vocals and whole-body human motions, including body motions, hand gestures, and facial expressions (see Fig. 1). This endeavor holds practical significance in fostering more immersive and naturalistic digital interactions, thereby elevating virtual performances, interactive gaming, and the realism of virtual avatars.
An important question naturally arises: what constitutes a good model for the unified generation of sound and motion? Firstly, we consider textual lyrics the proper form of input for the unified system, since text provides a highly expressive, interpretable, and flexible means for humans to convey information and can serve as a bridge between various modalities. Previous efforts explore scores [32], action commands [69, 4, 23], or audio signals [68] as inputs, which are inferior to textual inputs in terms of semantic richness, expressiveness, and flexible integration of different modalities.
Secondly, we argue that a joint generation system producing multi-modal outputs simultaneously is preferable to a cascaded system that performs single-modal generation sequentially. A cascaded system, combining a text-to-vocal module with a vocal-to-motion module, risks accumulating errors across each stage of generation. For instance, a misinterpretation in the text-to-vocal phase can lead to inaccurate motion generation, diluting the intended coherence of the output. Furthermore, cascaded architectures require multiple training and inference phases across different models, substantially increasing computational demands.
To build such a joint generation system, the primary challenges are: 1) the scarcity of datasets that provide lyrics, vocals, and 3D whole-body motion annotations simultaneously; and 2) the need for a unified architecture capable of coherently synthesizing vocals and motions from text. In response, we have curated RapVerse, a large-scale dataset featuring a comprehensive collection of lyrics, singing vocals, and 3D whole-body motions. Although datasets exist for text-to-vocal [32, 22, 8, 55], text-to-motion [44, 35, 13, 30], and audio-to-motion [3, 15, 12, 9, 5, 65] generation, the landscape lacks a unified dataset that encapsulates singing vocals, whole-body motion, and lyrics simultaneously. Most notably, large text-to-vocal datasets [22, 70] are predominantly in Chinese, limiting their applicability for English-language research, and contain no motion data. Text-to-motion datasets [44, 13, 30] typically pair textual descriptions of specific actions with corresponding motions, without audio data and often without whole-body movements. Moreover, audio-to-motion datasets [32, 33] focus primarily on speech rather than singing. A comparison of existing related datasets is presented in Table 1. The RapVerse dataset is divided into two parts to cater to a broad range of research needs: 1) a Rap-Vocal subset containing a large number of paired vocals and lyrics, and 2) a Rap-Motion subset encompassing vocals, lyrics, and human motions. The Rap-Vocal subset contains 108.44 hours of high-quality English singing voice in the rap genre without background music; paired lyrics and vocals from 32 singers are crawled from the Internet and carefully cleaned and post-processed. The Rap-Motion subset contains 26.8 hours of rap performance videos with 3D holistic body mesh annotations in SMPL-X parameters [42], obtained using the annotation pipeline of Motion-X [30], along with synchronous singing vocals and corresponding lyrics.
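To make the dataset's structure concrete, the sketch below illustrates what one synchronized RapVerse sample could look like. The field names, shapes, and rates are illustrative assumptions for exposition only, not the released schema.

```python
# Hypothetical sketch of a synchronized RapVerse sample; field names, dtypes,
# and rates are assumptions for illustration, not the dataset's actual schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class RapVerseSample:
    lyrics: str                  # textual lyrics for the clip
    vocal_wav: np.ndarray        # mono singing-vocal waveform (no background music)
    sample_rate: int             # audio sampling rate, e.g. 16_000
    smplx_params: np.ndarray     # (num_frames, param_dim) per-frame SMPL-X parameters
    motion_fps: float            # frame rate of the SMPL-X sequence
    singer_id: str               # performer identity label

def check_alignment(sample: RapVerseSample, tol: float = 0.1) -> bool:
    """Verify that the vocal track and the motion sequence cover roughly the same span."""
    audio_sec = len(sample.vocal_wav) / sample.sample_rate
    motion_sec = sample.smplx_params.shape[0] / sample.motion_fps
    return abs(audio_sec - motion_sec) <= tol
```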
With the RapVerse dataset, we explore how far we can go by simply scaling autoregressive multimodal transformers across language, audio, and motion for coherent and realistic generation of vocals and whole-body human motions. To this end, we unify the different modalities as token representations. Specifically, three VQ-VAEs [63] are utilized to compress whole-body motion sequences into separate streams of discrete tokens for the head, body, and hands, respectively. For vocal generation, previous works [37, 7, 32] share a common paradigm: producing mel-spectrograms of audio signals from input textual features and additional music score information, followed by a vocoder [40, 62, 67] to reconstruct the phase. We instead draw inspiration from the speech resynthesis domain [45] and learn a self-supervised discrete representation that quantizes the raw audio signal into discrete tokens while preserving vocal content and prosodic information. Then, with all inputs in discrete representations, we leverage a transformer to predict the discrete codes of audio and motion in an autoregressive fashion. Extensive experiments demonstrate that this straightforward unified generation framework not only produces realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems.
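To make the token-unification idea concrete, the following minimal PyTorch sketch shows the two ingredients described above: a nearest-neighbor vector quantizer of the kind used inside a VQ-VAE, and a decoder-only transformer that models text, audio, and motion tokens in one shared vocabulary with causal next-token prediction. The codebook sizes, vocabulary layout, and model dimensions are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of discrete-token unification and autoregressive modeling.
# All sizes, the unified-vocabulary layout, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbor codebook lookup, the discretization step inside a VQ-VAE."""
    def __init__(self, num_codes: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim) continuous encoder features
        # squared L2 distance from every frame feature to every codebook entry
        dists = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(-1))
        indices = dists.argmin(-1)                    # (batch, frames) discrete token ids
        z_q = self.codebook(indices)                  # quantized features
        z_q = z + (z_q - z).detach()                  # straight-through gradient estimator
        return z_q, indices

class JointTokenLM(nn.Module):
    """Decoder-only transformer over one vocabulary holding text, audio, and motion tokens."""
    def __init__(self, vocab_size: int = 4096, dim: int = 512, layers: int = 6, heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq) ids from the unified vocabulary, e.g. lyrics tokens
        # followed by interleaved audio-unit and motion tokens
        seq = tokens.size(1)
        causal = torch.triu(torch.full((seq, seq), float("-inf"), device=tokens.device),
                            diagonal=1)               # causal attention mask
        hidden = self.backbone(self.embed(tokens), mask=causal)
        return self.head(hidden)                      # next-token logits

# Usage sketch: quantize motion features, then feed ids to the joint language model.
motion_feats = torch.randn(2, 30, 256)                # fake per-frame motion features
_, motion_ids = VectorQuantizer()(motion_feats)
# In practice each modality's ids would be offset into the shared vocabulary.
logits = JointTokenLM()(motion_ids)                   # (2, 30, vocab_size)
```

Training then reduces to standard next-token cross-entropy over the unified token sequence, which is what lets a single transformer generate audio and motion codes jointly from lyrics.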
To sum up, this paper makes the following contributions:
• We release RapVerse, a large dataset featuring synchronous singing vocals, lyrics, and high-quality 3D holistic SMPL-X parameters.
• We design a simple but effective unified framework for the joint generation of singing vocals and human motions from text with a multi-modal transformer in an autoregressive fashion.
• To unify representations of different modalities, we employ a vocal-to-unit model to obtain quantized audio tokens and utilize compositional VQVAEs to get discrete motion tokens.
• Experimental results show that our framework rivals the performance of specialized single-modality generation systems, setting new benchmarks for joint generation of vocals and motion.