Rhythmic Gesticulator

Abstract

Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which leads to difficulties in mining the clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics. For the rhythm, our system contains a robust rhythm-based segmentation pipeline to ensure the temporal coherence between the vocalization and gestures explicitly. For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Lastly, we build correspondence between the hierarchical embeddings of the speech and the motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.

System Overview

Our system is composed of three core components: (a) the data module preprocesses the input speech, segments it into normalized blocks based on the beats, and extracts speech features from these blocks; (b) the training module learns a gesture lexicon from the normalized motion blocks and trains the generator to synthesize gesture sequences, conditioned on the gesture lexemes, the style codes, and the features of previous motion blocks and adjacent speech blocks; and (c) the inference module employs interpreters to translate the speech features into gesture lexemes and style codes, which the learned generator then uses to predict future gestures.
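As a rough illustration of the beat-based segmentation in the data module, the sketch below splits a per-frame feature sequence at beat positions and resamples each block to a fixed length. The function name `segment_by_beats`, the `block_len` parameter, and the use of linear interpolation are assumptions made for this example; the page does not specify how blocks are normalized.

```python
import numpy as np

def segment_by_beats(features, beat_frames, block_len=8):
    """Split a (T, D) feature array at beat frames and resample every
    block to `block_len` frames (hypothetical normalization scheme)."""
    features = np.asarray(features, dtype=float)
    dims = features.shape[1]
    blocks = []
    # Each pair of consecutive beats delimits one block.
    for start, end in zip(beat_frames[:-1], beat_frames[1:]):
        segment = features[start:end]
        if len(segment) == 0:
            continue  # skip empty blocks caused by repeated beat indices
        # Map the block onto a common [0, 1] time axis and interpolate
        # each feature dimension linearly (an assumption, not the paper's method).
        src = np.linspace(0.0, 1.0, num=len(segment))
        dst = np.linspace(0.0, 1.0, num=block_len)
        block = np.stack(
            [np.interp(dst, src, segment[:, d]) for d in range(dims)], axis=1
        )
        blocks.append(block)
    if not blocks:
        return np.empty((0, block_len, dims))
    return np.stack(blocks)  # shape: (num_blocks, block_len, D)
```

Resampling to a common length gives every block the same shape regardless of how far apart the beats are, which is what lets blocks be compared or batched together.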

To evaluate the rhythmic performance, we propose a new objective metric, PMB, which measures the percentage of matched beats. Our method outperforms state-of-the-art systems both objectively and subjectively, as indicated by the MAJE, MAD, FGD, and PMB metrics and by human feedback. The cross-language synthesis experiment demonstrates the robustness of our system for rhythmic perception. In terms of application, we show our system's flexible and effective style editing ability, which allows several directorial styles of the generated gestures to be edited without manual annotation of the data. Lastly, we have conducted detailed ablation studies that justify the design choices of our system.
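To make the idea of a matched-beat percentage concrete, here is a minimal sketch. It assumes that a motion beat counts as matched when some audio beat lies within a fixed time tolerance; the matching rule, the function name `percentage_of_matched_beats`, and the default `tolerance` of 0.2 seconds are illustrative assumptions and may differ from the definition of PMB used in the paper.

```python
import numpy as np

def percentage_of_matched_beats(motion_beats, audio_beats, tolerance=0.2):
    """Fraction of motion beats that have an audio beat within
    `tolerance` seconds (assumed matching rule, for illustration only)."""
    motion = np.asarray(motion_beats, dtype=float)
    audio = np.asarray(audio_beats, dtype=float)
    if motion.size == 0 or audio.size == 0:
        return 0.0
    # Distance from every motion beat to its nearest audio beat.
    nearest = np.abs(motion[:, None] - audio[None, :]).min(axis=1)
    return float(np.mean(nearest <= tolerance))

# Example with made-up beat times in seconds.
score = percentage_of_matched_beats([0.5, 1.0, 2.0, 3.1], [0.45, 1.5, 3.0])
```

In this example two of the four motion beats (0.5 and 3.1) fall within the tolerance of an audio beat, so `score` is 0.5.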

Sample Results

Our system can synthesize realistic co-speech upper-body gestures that match a given speech context both temporally and semantically. It takes speech audio as input and generates gesture sequences accordingly. Here are some results:

Style Editing

Our system supports style editing: several directorial styles of the generated gestures can be edited without manual annotation of the data. We have synthesized three animations for each motion style. Each animation is generated with a constant desired feature value (low, mid, or high), as shown in the three videos below.

BibTeX

@article{Ao2022RhythmicGesticulator,
  author     = {Ao, Tenglong and Gao, Qingzhe and Lou, Yuke and Chen, Baoquan and Liu, Libin},
  title      = {Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings},
  year       = {2022},
  issue_date = {December 2022},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {41},
  number     = {6},
  issn       = {0730-0301},
  url        = {https://doi.org/10.1145/3550454.3555435},
  doi        = {10.1145/3550454.3555435},
  journal    = {ACM Trans. Graph.},
  month      = {nov},
  articleno  = {209},
  numpages   = {19}
}

Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings

Gesture results automatically synthesized by our system for a beat-rich TED talk clip.
