This AI Turns Lyrics Into Fully Synced Song and Dance Performances

Table of Links
Abstract and 1. Introduction
-
Related Work
2.1 Text to Vocal Generation
2.2 Text to Motion Generation
2.3 Audio to Motion Generation
-
RapVerse Dataset
3.1 Rap-Vocal Subset
3.2 Rap-Motion Subset
-
Method
4.1 Problem Formulation
4.2 Motion VQ-VAE Tokenizer
4.3 Vocal2unit Audio Tokenizer
4.4 General Auto-regressive Modeling
-
Experiments
5.1 Experimental Setup
5.2 Main Results Analysis and 5.3 Ablation Study
-
Conclusion and References
A. Appendix
In this section, we evaluate our proposed model on the benchmark we introduce for joint vocal and whole-body motion generation from textual inputs.
5.1 Experimental Setup
Metrics. To evaluate the quality of the generated singing vocals, we use the Mean Opinion Score (MOS) to gauge the naturalness of the synthesized vocals. For motion synthesis, we separately evaluate the quality of body and hand gestures and the realism of the face. Specifically, for gesture generation, we use the Fréchet Inception Distance (FID), computed with a feature extractor from [13], to measure the distance between the feature distributions of generated and real motions, and the Diversity (DIV) metric to assess motion diversity. For face generation, we compare the vertex MSE [66] and the vertex L1 difference (LVD) [68]. Finally, we adopt Beat Constancy (BC) [29] to measure the synchrony between the generated motion and the singing vocals.
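For concreteness, the sketch below illustrates one plausible way to compute these motion metrics, assuming precomputed motion feature vectors, facial vertex sequences, and beat timestamps. The function names, the Gaussian approximation for FID, and the Gaussian beat-alignment kernel are illustrative assumptions, not the exact implementations used in the paper.

```python
# Illustrative implementations of the motion metrics described above.
# Inputs are assumed to be precomputed NumPy arrays; names are placeholders.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to real and generated motion features."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))

def diversity(gen_feats: np.ndarray, num_pairs: int = 300) -> float:
    """Mean Euclidean distance between randomly paired generated samples."""
    idx_a = np.random.choice(len(gen_feats), num_pairs)
    idx_b = np.random.choice(len(gen_feats), num_pairs)
    return float(np.linalg.norm(gen_feats[idx_a] - gen_feats[idx_b], axis=1).mean())

def lvd(real_verts: np.ndarray, gen_verts: np.ndarray) -> float:
    """L1 difference between facial vertex velocities (frames x vertices x 3)."""
    vel_r = np.diff(real_verts, axis=0)
    vel_g = np.diff(gen_verts, axis=0)
    return float(np.abs(vel_r - vel_g).mean())

def beat_constancy(motion_beats: np.ndarray, audio_beats: np.ndarray, sigma: float = 0.1) -> float:
    """Average alignment between kinematic beats and vocal beats (times in seconds)."""
    dists = np.abs(motion_beats[:, None] - audio_beats[None, :]).min(axis=1)
    return float(np.exp(-(dists ** 2) / (2 * sigma ** 2)).mean())
```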
Baselines. We compare vocal generation quality with the state-of-the-art vocal generation method DiffSinger [32], and we also adapt the text-to-speech model FastSpeech2 [51] for vocal generation. For motion generation, we compare our method with both text-to-motion and audio-to-motion methods. For text-to-motion, since there is no existing open-sourced work on text to whole-body motion generation, we compare with the transformer-based T2M-GPT [69] and MLD [4] for body generation. For audio-to-motion generation, we compare with Habibie et al. [15] and the state-of-the-art model Talkshow [68]. We report all results on RapVerse with an 85%/7.5%/7.5% train/val/test split.
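As a point of reference, the snippet below shows one way to realize the reported 85%/7.5%/7.5% split over sample identifiers. The seeded shuffling scheme is an assumption for illustration; the dataset's official partition may be defined differently.

```python
# Illustrative 85% / 7.5% / 7.5% train/val/test split over RapVerse sample IDs.
# The seed and random shuffling are assumptions; the official split may differ.
import random

def split_dataset(sample_ids, seed: int = 0):
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(0.85 * n)
    n_val = int(0.075 * n)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```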
5.2 Main Results Analysis
Evaluations on joint vocal and whole-body motion generation. We compare with both text-driven and audio-driven motion generation baselines in Table 2 (a). Note that our setting differs from all existing methods in two ways. First, we use rap lyrics as the textual input instead of motion textual descriptions, which contain direct action prompt words such as "walk" and "jump". Second, we use text to jointly generate both audio and motion, instead of using audio to generate motion as audio-driven methods do. As demonstrated, our model rivals both text-to-motion and audio-to-motion methods on the metrics measuring body motion quality and face motion accuracy.
Furthermore, the cornerstone of our approach lies in the simultaneous generation of vocals and motion, aiming to achieve temporal alignment between the two modalities. This objective is substantiated by our competitive results on the BC metric, which assesses the synchrony between singing vocals and the corresponding motions, underscoring our success in closely synchronizing the generation of these two modalities. For the cascaded system, we integrate the text-to-vocal model DiffSinger with the audio-to-motion model Talkshow. Compared with the cascaded system, our joint-generation pipeline achieves superior results while also reducing computational demands during both training and inference. In cascaded architectures, errors accumulate through each stage: if the text-to-vocal module produces unclear vocals, it hampers the audio-to-motion model's ability to generate accurate facial expressions that align with the vocal content.
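The contrast between the two system designs can be summarized schematically as below. The model interfaces are hypothetical placeholders standing in for DiffSinger, Talkshow, and the unified auto-regressive model; the sketch only illustrates the data flow and where cascaded errors can propagate.

```python
# Schematic contrast between the cascaded baseline and joint generation.
# The callables below are hypothetical stand-ins, used only to show data flow.

def cascaded_generation(lyrics: str, text_to_vocal, audio_to_motion):
    """Two-stage pipeline: errors from the vocal stage propagate to the motion stage."""
    vocal = text_to_vocal(lyrics)      # e.g. a DiffSinger-style text-to-vocal model
    motion = audio_to_motion(vocal)    # e.g. a Talkshow-style audio-to-motion model
    return vocal, motion

def joint_generation(lyrics: str, unified_model):
    """Single auto-regressive pass emitting vocal and motion tokens together,
    keeping the two modalities temporally aligned by construction."""
    vocal_tokens, motion_tokens = unified_model.generate(lyrics)
    return vocal_tokens, motion_tokens
```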
Evaluations on vocal generation. We also compare our method against state-of-the-art text-to-vocal generation baselines in Table 2 (b). While our unified model is trained to simultaneously generate vocals and motion, a task considerably more complex than generating vocals alone, its vocal generation component still achieves results comparable to systems designed solely for vocal generation.
5.3 Ablation Study
We present the outcomes of our ablation study in Table 3. First, we explored integrating a pre-trained large language model [48] for multi-modality generation, akin to the approach in [23]. However, the efficacy of the pre-trained language model lags significantly behind our tailored design, underscoring that pre-training primarily on linguistic tokens does not facilitate effective prediction across modalities such as vocal and motion. Additionally, we study the impact of our compositional VQ-VAEs on motion generation by comparing against a baseline that employs a single VQ-VAE for the joint quantization of facial, body, and hand movements. This approach leads to a noticeable degradation in performance, most notably a 2.89 degradation in LVD. The decline can be attributed to the preponderance of facial movements in a singer's performance: using a single VQ-VAE for full-body dynamics compromises the detailed representation of facial expressions, which are crucial for realistic and coherent motion synthesis.
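To make the ablated design choice concrete, the following sketch contrasts the compositional tokenization (separate codebooks for face, body, and hands) with a single shared codebook. The dimensions, codebook sizes, and the simple nearest-neighbor quantizer are illustrative assumptions, not the paper's exact VQ-VAE architecture.

```python
# Sketch of the ablated design choice: separate VQ-VAE codebooks per body part
# (compositional) versus one codebook quantizing the whole body at once.
# Dimensions, codebook sizes, and the quantizer details are illustrative only.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Nearest-codebook-entry assignment (straight-through estimator omitted).
        dists = torch.cdist(z, self.codebook.weight)
        return dists.argmin(dim=-1)

class CompositionalTokenizer(nn.Module):
    """Separate quantizers keep facial detail from competing with body/hand motion."""
    def __init__(self, dim: int = 256, num_codes: int = 512):
        super().__init__()
        self.face_vq = VectorQuantizer(num_codes, dim)
        self.body_vq = VectorQuantizer(num_codes, dim)
        self.hand_vq = VectorQuantizer(num_codes, dim)

    def forward(self, face_feat, body_feat, hand_feat):
        return (self.face_vq(face_feat),
                self.body_vq(body_feat),
                self.hand_vq(hand_feat))

class SingleTokenizer(nn.Module):
    """Ablation baseline: one codebook jointly quantizes face, body, and hands."""
    def __init__(self, dim: int = 256, num_codes: int = 512):
        super().__init__()
        self.vq = VectorQuantizer(num_codes, dim)

    def forward(self, whole_body_feat):
        return self.vq(whole_body_feat)
```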
Authors:
(1) Jiaben Chen, University of Massachusetts Amherst;
(2) Xin Yan, Wuhan University;
(3) Yihang Chen, Wuhan University;
(4) Siyuan Cen, University of Massachusetts Amherst;
(5) Qinwei Ma, Tsinghua University;
(6) Haoyu Zhen, Shanghai Jiao Tong University;
(7) Kaizhi Qian, MIT-IBM Watson AI Lab;
(8) Lie Lu, Dolby Laboratories;
(9) Chuang Gan, University of Massachusetts Amherst.