When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning

Ritsumeikan University
(Under review)

This paper introduces a sparse facial motion structure that represents motion as keyframes plus interpolated transitions, improving non-verbal listening head motion generation. The approach enhances motion fidelity and diversity, outperforming dense-token methods on benchmark datasets.

Abstract

Effective modeling of non-verbal facial behavior is crucial for human-robot interaction, yet current token-based methods often produce low-fidelity motion due to dense and redundant representations. This paper introduces a sparse facial motion structure that identifies keyframes and reconstructs transitions to capture essential motion dynamics more efficiently. Key contributions include: (1) a novel unsupervised keyframe discovery method, (2) a sparse representation framework that improves reconstruction and token expressiveness, and (3) a Transformer-based predictor for listening head motion using these sparse tokens. The approach outperforms state-of-the-art methods in both fidelity and diversity across benchmark datasets.
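To make the sparse structure concrete, the sketch below reconstructs a dense motion sequence from a handful of keyframes via linear interpolation. It is a minimal illustration, not the paper's method: the greedy error-driven selector, the `max_err` threshold, and the linear transitions are all assumptions standing in for the learned, unsupervised keyframe discovery and reconstruction described above.

```python
# Illustrative sketch only: a hypothetical greedy, error-driven keyframe
# selector showing the core idea of a sparse keyframe structure with
# interpolated transitions. Not the paper's unsupervised method.
import numpy as np

def reconstruct(motion, key_idx):
    """Rebuild a dense motion sequence (T, D) by linearly
    interpolating each dimension between sorted keyframe indices."""
    T, D = motion.shape
    recon = np.empty_like(motion)
    for d in range(D):
        recon[:, d] = np.interp(np.arange(T), key_idx, motion[key_idx, d])
    return recon

def greedy_keyframes(motion, max_err=0.05):
    """Start from the two endpoints and repeatedly promote the
    worst-reconstructed frame to a keyframe until the per-frame
    reconstruction error falls below max_err."""
    T = motion.shape[0]
    key_idx = [0, T - 1]
    while True:
        recon = reconstruct(motion, sorted(key_idx))
        err = np.linalg.norm(recon - motion, axis=1)  # per-frame error
        worst = int(err.argmax())
        if err[worst] <= max_err or worst in key_idx:
            return sorted(key_idx)
        key_idx.append(worst)

# Toy usage: 120 frames of 6-D facial motion coefficients.
rng = np.random.default_rng(0)
motion = np.cumsum(rng.normal(scale=0.01, size=(120, 6)), axis=0)
keys = greedy_keyframes(motion)
print(f"{len(keys)} keyframes cover {motion.shape[0]} frames")
```

The premise this sketch demonstrates is the one the abstract relies on: a small set of well-chosen keyframes plus interpolated transitions can approximate the full sequence, so tokens spent on the dense in-between frames are largely redundant.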

Reconstruction

Listening Head Prediction: Appropriateness
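As a rough illustration of contribution (3) from the abstract, the sketch below wires a standard Transformer decoder to predict a listener's next sparse motion token from speaker features. The vocabulary size, layer dimensions, speaker feature shape, and the use of PyTorch's vanilla `nn.TransformerDecoder` are all hypothetical choices; the paper describes its predictor only as Transformer-based and operating on the sparse tokens.

```python
# Hypothetical sketch of a Transformer-based listening head predictor
# over sparse motion tokens; sizes and the plain nn.TransformerDecoder
# are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ListenerHeadPredictor(nn.Module):
    def __init__(self, n_tokens=512, d_model=256, n_heads=4, n_layers=4,
                 d_speaker=128):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, d_model)   # sparse motion tokens
        self.speaker_proj = nn.Linear(d_speaker, d_model)  # speaker audio/visual feats
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_tokens)           # next-token logits

    def forward(self, listener_tokens, speaker_feats):
        # listener_tokens: (B, T) token ids; speaker_feats: (B, S, d_speaker)
        tgt = self.token_emb(listener_tokens)
        mem = self.speaker_proj(speaker_feats)
        T = listener_tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(T)
        out = self.decoder(tgt, mem, tgt_mask=causal)
        return self.head(out)  # predict the next sparse token per step

model = ListenerHeadPredictor()
tokens = torch.randint(0, 512, (2, 16))
speaker = torch.randn(2, 32, 128)
logits = model(tokens, speaker)
print(logits.shape)  # torch.Size([2, 16, 512])
```

Because the tokens index keyframes rather than every frame, the autoregressive rollout is short, which is the efficiency argument the abstract makes for the sparse representation.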