Nonverbal behaviors are essential for natural and expressive human communication, conveying emotions, intentions, and social dynamics beyond words. However, modeling dyadic nonverbal interactions remains a major challenge due to their multiscale complexity — from subtle gaze and gestures to high-level social cues like mimicry and engagement. Current data-driven approaches often fail to capture sparse but crucial social signals that shape authentic interactions. To address this, we propose Social Agent, an LLM-powered agentic framework that integrates psychological knowledge and conversational reasoning into motion synthesis. Our system dynamically analyzes social context and generates context-aware, bidirectional nonverbal behaviors, bridging high-level intentions with low-level embodied motion for realistic dyadic interaction generation.
Our goal is to synthesize full-body motion sequences for two interlocutors in a dyadic conversation, driven by their respective speech audio \((S^{\mathrm{I}}, S^{\mathrm{II}})\). The motion sequences, denoted \((M^{\mathrm{I}}, M^{\mathrm{II}})\), each consist of a sequence of frames \(M = [m_t]\), where each frame \(m_t \in \mathbb{R}^{J \times Q + G}\) encodes both joint-level and global pose information. Here, \(J\), \(Q\), and \(G\) denote the number of joints, the joint feature dimension, and the global root feature dimension, respectively.
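To make this representation concrete, here is a minimal sketch of how such a motion sequence could be stored. The array shapes follow the definitions above, but the specific values of \(J\), \(Q\), \(G\), and the sequence length are placeholders, since the paper's exact skeleton and feature dimensions are not given here.

```python
import numpy as np

# Hypothetical dimensions; the actual values depend on the skeleton
# and pose parameterization used by the model.
J, Q, G = 24, 6, 4   # joints, per-joint feature dim, global root feature dim
T = 120              # number of frames in the sequence

# One interlocutor's motion M = [m_t], with m_t in R^(J*Q + G):
# the first J*Q entries hold joint-level features, the last G the global root.
M_I = np.zeros((T, J * Q + G))
M_II = np.zeros((T, J * Q + G))

def split_frame(m_t: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a frame into (J, Q) joint features and (G,) global root features."""
    joints = m_t[: J * Q].reshape(J, Q)
    root = m_t[J * Q :]
    return joints, root
```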
Our approach consists of three key components. First, we present a dyadic motion generation model that synthesizes coordinated two-person motion from speech input. Second, we introduce our LLM-based Social Agent system, which derives contextual interaction constraints between the two interlocutors from speech and instruction inputs. Finally, we introduce a training-free motion control mechanism that integrates these constraints to guide motion generation, significantly enhancing the naturalness and social awareness of the resulting dyadic nonverbal behaviors.
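Read as a pipeline, the three components could be wired together as in the following sketch. All function names and return values are illustrative stubs, not the paper's actual API; real models would replace each stub.

```python
# Placeholder stubs standing in for the three components.
def dyadic_generator(speech_I, speech_II):
    return [0.0], [0.0]            # stub: coordinated motions for both speakers

def social_agent_system(speech_I, speech_II, instruction):
    return {"distance_m": 1.0}     # stub: derived interaction constraints

def apply_motion_control(motion_I, motion_II, constraints):
    return motion_I, motion_II     # stub: training-free guidance step

def generate_dyadic_motion(speech_I, speech_II, instruction):
    """Wire the three stages together in the order described above."""
    motion_I, motion_II = dyadic_generator(speech_I, speech_II)
    constraints = social_agent_system(speech_I, speech_II, instruction)
    return apply_motion_control(motion_I, motion_II, constraints)
```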
Our approach leverages an LLM-based agentic system to derive contextual interaction constraints for nonverbal behavior generation in dyadic conversation scenarios. The system is designed to act as a Director, providing high-level guidance for nonverbal behavior by analyzing multimodal inputs and instruction prompts. As shown in the figure, it comprises two main components: the Scene Designer Agent, which runs once before the first round to analyze the dialogue and determine the initial proxemic setup, and the Dynamic Controller Agent, which is activated at the beginning of each round to analyze the current state, interpret the interlocutors' intentions, and determine appropriate interactive behaviors for them. All modules in the agent system are realized through prompt design, using carefully tailored prompts grounded in relevant research on linguistics and human behavior.
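A minimal sketch of how such a prompt-driven, two-agent setup could be orchestrated appears below. The prompt wording, the `query_llm` helper, and the round structure are our assumptions for illustration, not the paper's actual prompts or interfaces.

```python
def run_director(dialogue_transcript, instruction, num_rounds, query_llm):
    """Hypothetical orchestration of the two prompt-based agents."""
    # Scene Designer Agent: runs once, before the first round, to set
    # the initial proxemic arrangement from the dialogue context.
    scene_prompt = (
        "You are a scene designer. Given the dialogue and instruction, "
        "describe the initial positions, orientations, and distance of "
        "the two interlocutors.\n"
        f"Dialogue: {dialogue_transcript}\nInstruction: {instruction}"
    )
    state = query_llm(scene_prompt)

    guidance_per_round = []
    for _ in range(num_rounds):
        # Dynamic Controller Agent: runs at the start of each round to
        # interpret intentions and choose interactive behaviors.
        control_prompt = (
            "You are a behavior director. Given the current spatial state, "
            "infer each interlocutor's intention and specify their "
            "nonverbal behaviors for this round.\n"
            f"Current state: {state}"
        )
        guidance = query_llm(control_prompt)
        guidance_per_round.append(guidance)
        state = guidance  # carry the updated state into the next round
    return guidance_per_round

# Example with a dummy LLM stand-in:
print(run_director("A: Hi! B: Hello!", "friendly chat", 2,
                   lambda p: "face-to-face, 1.2 m apart"))
```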
Our system synthesizes high-quality, realistic dyadic interactions, enhancing the naturalness and coherence of generated dialogue scenarios.
We showcase the Scene Designer workflow, which extracts scene context and generates the initial proxemic setup. The blue character is Character I, and the green character is Character II. These examples illustrate the framework's scene analysis and understanding capabilities, showing how it designs realistic, contextually appropriate initial proxemic setups for different scenarios. This setup in turn facilitates interaction control by the Dynamic Controller Agent, ensuring more natural and context-aware interactions.
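Purely as an illustration (the actual output schema of the Scene Designer is not specified here), an initial proxemic setup might be serialized along these lines, with positions in meters on a shared ground plane:

```python
# Hypothetical Scene Designer output; field names and units are
# assumptions for illustration, not the paper's actual schema.
initial_setup = {
    "scenario": "casual chat in a park",
    "character_I": {"position": (0.0, 0.0), "facing_deg": 0.0},
    "character_II": {"position": (1.2, 0.0), "facing_deg": 180.0},
    # 1.2 m sits near the far edge of Hall's "personal" zone (~0.46-1.2 m).
    "interpersonal_distance_m": 1.2,
}
```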
We also illustrate the Dynamic Controller Agent's ability to perform complex spatial reasoning by interpreting textual inputs and producing fine-grained spatial predictions. In the input, red text highlights the current spatial state of both characters; the accompanying 3D visualization on the right depicts this configuration but is not part of the model's input. In the output, blue text marks the agent's reasoning process, such as the inferred direction and distance of Character I's movement. The displayed output is a condensed version of the agent's reasoning, retaining the most essential spatial information.
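To make this kind of spatial prediction concrete, the sketch below applies an inferred "move toward the other character by d meters" instruction to a 2-D state. This is our own illustrative geometry, not the agent's actual output format.

```python
import math

def step_toward(pos_a, pos_b, distance):
    """Move pos_a `distance` meters along the direction toward pos_b (2-D)."""
    dx, dy = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
    norm = math.hypot(dx, dy)
    if norm < 1e-8:                # already coincident; nothing to do
        return pos_a
    step = min(distance, norm)     # don't overshoot the target
    return (pos_a[0] + dx / norm * step, pos_a[1] + dy / norm * step)

# e.g., Character I at (0, 0) approaches Character II at (2, 0) by 0.5 m:
print(step_toward((0.0, 0.0), (2.0, 0.0), 0.5))  # -> (0.5, 0.0)
```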
@inproceedings{10.1145/3757377.3763879,
  author    = {Zhang, Zeyi and Zhou, Yanju and Yao, Heyuan and Ao, Tenglong and Zhan, Xiaohang and Liu, Libin},
  title     = {Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents},
  year      = {2025},
  isbn      = {979-8-4007-2137-3},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3757377.3763879},
  doi       = {10.1145/3757377.3763879},
  booktitle = {SIGGRAPH Asia 2025 Conference Papers},
  articleno = {71},
  numpages  = {10},
  location  = {Hong Kong, China},
  series    = {SA '25}
}