When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning

Ritsumeikan University
(Under review)

This paper introduces a sparse facial motion structure that represents motion as keyframes plus interpolated transitions, improving non-verbal listening head motion generation. The approach enhances motion fidelity and diversity, outperforming dense-token methods on benchmark datasets.

Abstract

Effective modeling of non-verbal facial behavior is crucial for human-robot interaction, yet current token-based methods often produce low-fidelity motion due to dense and redundant representations. This paper introduces a sparse facial motion structure that identifies keyframes and reconstructs transitions to capture essential motion dynamics more efficiently. Key contributions include: (1) a novel unsupervised keyframe discovery method, (2) a sparse representation framework that improves reconstruction and token expressiveness, and (3) a Transformer-based predictor for listening head motion using these sparse tokens. The approach outperforms state-of-the-art methods in both fidelity and diversity across benchmark datasets.
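To make the sparse structure concrete, the sketch below reconstructs a dense motion sequence from a handful of keyframes via linear interpolation. It is a minimal illustration, not the paper's method: the greedy error-driven selector, the `max_err` threshold, and the linear transitions are all assumptions standing in for the learned, unsupervised keyframe discovery and reconstruction described above.

```python
# Illustrative sketch only: a hypothetical greedy, error-driven keyframe
# selector showing the core idea of a sparse keyframe structure with
# interpolated transitions. Not the paper's unsupervised method.
import numpy as np

def reconstruct(motion, key_idx):
    """Rebuild a dense motion sequence (T, D) by linearly
    interpolating each dimension between sorted keyframe indices."""
    T, D = motion.shape
    recon = np.empty_like(motion)
    for d in range(D):
        recon[:, d] = np.interp(np.arange(T), key_idx, motion[key_idx, d])
    return recon

def greedy_keyframes(motion, max_err=0.05):
    """Start from the two endpoints and repeatedly promote the
    worst-reconstructed frame to a keyframe until the per-frame
    reconstruction error falls below max_err."""
    T = motion.shape[0]
    key_idx = [0, T - 1]
    while True:
        recon = reconstruct(motion, sorted(key_idx))
        err = np.linalg.norm(recon - motion, axis=1)  # per-frame error
        worst = int(err.argmax())
        if err[worst] <= max_err or worst in key_idx:
            return sorted(key_idx)
        key_idx.append(worst)

# Toy usage: 120 frames of 6-D facial motion coefficients.
rng = np.random.default_rng(0)
motion = np.cumsum(rng.normal(scale=0.01, size=(120, 6)), axis=0)
keys = greedy_keyframes(motion)
print(f"{len(keys)} keyframes cover {motion.shape[0]} frames")
```

The premise this sketch demonstrates is the one the abstract relies on: a small set of well-chosen keyframes plus interpolated transitions can approximate the full sequence, so tokens spent on the dense in-between frames are largely redundant.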

Reconstruction

Listening Head Prediction: Appropriateness
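As a rough illustration of contribution (3) from the abstract, the sketch below wires a standard Transformer decoder to predict a listener's next sparse motion token from speaker features. The vocabulary size, layer dimensions, speaker feature shape, and the use of PyTorch's vanilla `nn.TransformerDecoder` are all hypothetical choices; the paper describes its predictor only as Transformer-based and operating on the sparse tokens.

```python
# Hypothetical sketch of a Transformer-based listening head predictor
# over sparse motion tokens; sizes and the plain nn.TransformerDecoder
# are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ListenerHeadPredictor(nn.Module):
    def __init__(self, n_tokens=512, d_model=256, n_heads=4, n_layers=4,
                 d_speaker=128):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, d_model)   # sparse motion tokens
        self.speaker_proj = nn.Linear(d_speaker, d_model)  # speaker audio/visual feats
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_tokens)           # next-token logits

    def forward(self, listener_tokens, speaker_feats):
        # listener_tokens: (B, T) token ids; speaker_feats: (B, S, d_speaker)
        tgt = self.token_emb(listener_tokens)
        mem = self.speaker_proj(speaker_feats)
        T = listener_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        out = self.decoder(tgt, mem, tgt_mask=causal)
        return self.head(out)  # predict the next sparse token per step

model = ListenerHeadPredictor()
tokens = torch.randint(0, 512, (2, 16))
speaker = torch.randn(2, 32, 128)
logits = model(tokens, speaker)
print(logits.shape)  # torch.Size([2, 16, 512])
```

Because the tokens index keyframes rather than every frame, the autoregressive rollout is short, which is the efficiency argument the abstract makes for the sparse representation.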