Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents

SIGGRAPH ASIA 2025
¹Peking University, China; ²Tencent, China (corresponding author)

Our system generates natural and context-aware dyadic nonverbal behaviors via LLM-guided interaction control and dual-person gesture synthesis.

Brief Introduction

Nonverbal behaviors are essential for natural and expressive human communication, conveying emotions, intentions, and social dynamics beyond words. However, modeling dyadic nonverbal interactions remains a major challenge due to their multiscale complexity — from subtle gaze and gestures to high-level social cues like mimicry and engagement. Current data-driven approaches often fail to capture sparse but crucial social signals that shape authentic interactions. To address this, we propose Social Agent, an LLM-powered agentic framework that integrates psychological knowledge and conversational reasoning into motion synthesis. Our system dynamically analyzes social context and generates context-aware, bidirectional nonverbal behaviors, bridging high-level intentions with low-level embodied motion for realistic dyadic interaction generation.



System Overview


Our goal is to synthesize full-body motion sequences for two interlocutors in a dyadic conversation, driven by their speech audio \((S^{\mathrm{I}}, S^{\mathrm{II}})\). The motion sequences, denoted \((M^{\mathrm{I}}, M^{\mathrm{II}})\), each consist of a sequence of frames \(M = [m_t]\), where each frame \(m_t \in \mathbb{R}^{J \times Q + G}\) encodes both joint-level and global pose information. Here, \(J\), \(Q\), and \(G\) denote the number of joints, the per-joint feature dimension, and the global root feature dimension, respectively.
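To make this representation concrete, the sketch below lays out the motion tensors for the two interlocutors in Python. The specific values of \(J\), \(Q\), and \(G\) are placeholders chosen for illustration only, not the configuration used in the paper.

import numpy as np

# Placeholder sizes for illustration; the paper defines J, Q, and G as the
# number of joints, the per-joint feature dimension, and the global root
# feature dimension, respectively.
J, Q, G = 55, 6, 7
T = 300  # number of frames in the clip

# One interlocutor's motion M = [m_t], with each frame m_t in R^(J*Q + G).
M_I = np.zeros((T, J * Q + G), dtype=np.float32)
M_II = np.zeros((T, J * Q + G), dtype=np.float32)

# Per frame, the first J*Q entries hold joint-level features and the last G
# entries hold the global root pose.
joint_features = M_I[:, : J * Q].reshape(T, J, Q)
root_features = M_I[:, J * Q:]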

Our approach consists of three key components. First, we present a dyadic motion generation model that synthesizes coordinated two-person motions from speech input. Next, we introduce an LLM-based Social Agent system that derives contextual interaction constraints between the two interlocutors from their speech and instruction prompts. Finally, we describe a training-free motion control mechanism that injects these constraints into the motion generation process, significantly enhancing the naturalness and interaction awareness of the generated dyadic nonverbal behaviors.
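The following is a minimal, runnable sketch of how these three components could fit together. All function names and bodies are illustrative stubs written for this page, not the paper's actual implementation or API.

def scene_designer(audio_I, audio_II, instruction):
    # Runs once before the first round: analyze the dialogue context and
    # return an initial proxemic setup (stub values for illustration).
    return {"interpersonal_distance_m": 1.2, "facing": "toward_each_other"}

def dynamic_controller(scene, audio_I, audio_II, round_index):
    # Runs at the start of each round: infer intentions and emit interaction
    # constraints for both interlocutors (stub values for illustration).
    return {"round": round_index, "gaze": "mutual", "approach_m": 0.1}

def generate_dyadic_motion(audio_I, audio_II, guidance):
    # Speech-driven dyadic motion generator, steered by the constraints via a
    # training-free control mechanism (placeholder outputs for illustration).
    return [guidance], [guidance]

def run_pipeline(audio_I, audio_II, instruction, rounds=3):
    scene = scene_designer(audio_I, audio_II, instruction)
    motions = []
    for r in range(rounds):
        constraints = dynamic_controller(scene, audio_I, audio_II, r)
        motions.append(generate_dyadic_motion(audio_I, audio_II, constraints))
    return motions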





LLM-based Social Agent System


Our approach leverages an LLM-based agentic system to derive contextual interaction constraints for nonverbal behavior generation in dyadic conversation scenarios. The system acts as a Director, providing high-level guidance for nonverbal behavior by analyzing multimodal inputs and instruction prompts. As shown in the figure, it comprises two main components: the Scene Designer Agent, which runs once before the first round to analyze the dialogue and determine the initial proxemic setup, and the Dynamic Controller Agent, which is activated at the beginning of each round to analyze the current state, interpret the interlocutors' intentions, and determine appropriate interactive behaviors for them. All modules in the agent system are implemented purely through prompt design, using carefully tailored prompts grounded in relevant linguistic and human-behavior research.
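As one illustration of this prompt-based design, a hypothetical prompt template for the Dynamic Controller Agent might look as follows. The wording, field names, and output format are assumptions made for this example, not the actual prompts used in the system.

DYNAMIC_CONTROLLER_PROMPT = """\
You are the Director of a dyadic conversation. Drawing on principles of
proxemics and nonverbal communication, analyze the current round.

Current spatial state of both characters:
{spatial_state}

Transcript of the current round:
{transcript}

For each interlocutor, decide the interactive behaviors for this round
(e.g. gaze target, gesture intent, movement direction and distance) and
return them as JSON with keys "character_I" and "character_II".
"""

def build_controller_prompt(spatial_state: str, transcript: str) -> str:
    # Fill the template with the current round's context before sending it
    # to the underlying LLM.
    return DYNAMIC_CONTROLLER_PROMPT.format(
        spatial_state=spatial_state, transcript=transcript
    )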



Dyadic Nonverbal Behavior Generation

Our system successfully synthesizes high-quality, realistic dyadic interactions, enhancing the naturalness and coherence of dialogue scenarios.

Sample Results

Comparison

Ablation Study

Intermediate Reasoning Outputs of the Social Agent System

Scene Designer intermediate outputs

We showcase the Scene Designer workflow, which extracts scene context and generates the initial proxemic setup. The blue character is Character I, and the green character is Character II. These examples demonstrate the framework's scene analysis and understanding capabilities, illustrating how it designs realistic and contextually appropriate initial proxemic setups for different scenarios. This facilitates subsequent interaction control by the Dynamic Controller Agent, ensuring more natural and context-aware interactions.
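For reference, an initial proxemic setup of this kind could be expressed as a small structured record like the one below. The field names and values are hypothetical, chosen only to illustrate what such a setup might contain.

# Hypothetical example of an initial proxemic setup; all fields and values
# are illustrative, not actual Scene Designer output.
initial_setup = {
    "scene": "casual chat between colleagues in an office",
    "character_I": {"position_xy": [0.0, 0.0], "orientation_deg": 0.0},
    "character_II": {"position_xy": [1.2, 0.0], "orientation_deg": 180.0},
    "interpersonal_distance_m": 1.2,  # within typical personal/social space
}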

Dynamic Controller reasoning outputs

We also illustrate the Dynamic Controller Agent's ability to perform complex spatial reasoning by interpreting textual inputs to generate fine-grained spatial predictions. In the input, red text highlights the current spatial state of both characters, while the accompanying 3D visualization on the right depicts the configuration but is not part of the model's input. In the output, blue text emphasizes the agent's reasoning process, such as the inferred direction and distance of Character I's movement. The displayed output is a condensed version of the agent's reasoning, retaining the most essential spatial information.
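As a small worked example of this kind of spatial prediction, the sketch below applies a predicted movement direction and distance to update a character's 2D position. The update rule and numbers are assumptions for illustration, not part of the system.

import math

def apply_movement(position_xy, direction_deg, distance_m):
    # Move a character by distance_m along direction_deg (0 deg = +x axis).
    rad = math.radians(direction_deg)
    return (position_xy[0] + distance_m * math.cos(rad),
            position_xy[1] + distance_m * math.sin(rad))

# e.g. Character I steps 0.3 m toward Character II, who stands along +x:
new_position = apply_movement((0.0, 0.0), direction_deg=0.0, distance_m=0.3)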

BibTeX


@inproceedings{10.1145/3757377.3763879,
  title     = {Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents},
  author    = {Zhang, Zeyi and Zhou, Yanju and Yao, Heyuan and Ao, Tenglong and Zhan, Xiaohang and Liu, Libin},
  year      = {2025},
  isbn      = {979-8-4007-2137-3/2025/12},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3757377.3763879},
  doi       = {10.1145/3757377.3763879},
  booktitle = {SIGGRAPH Asia 2025 Conference Papers},
  articleno = {71},
  numpages  = {10},
  location  = {Hong Kong, China},
  series    = {SA '25}
}