📖 Abstract
This work addresses the problem of interactive editing in human motion generation.
Previous motion diffusion models lack explicit modeling of word-level text-motion correspondence and offer limited explainability,
which restricts their fine-grained editing ability. To address this issue, we propose an attention-based motion diffusion model,
namely MotionCLR (/ˈmoʊʃn klɪr/), with CLeaR modeling of attention mechanisms.
Technically, MotionCLR models the in-modality and cross-modality interactions with self-attention and cross-attention, respectively.
More specifically, the self-attention mechanism measures the sequential similarity between frames and governs the ordering of motion features.
By contrast, the cross-attention mechanism works to find the fine-grained word-sequence correspondence and
activate the corresponding timesteps in the motion sequence.
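As a toy illustration of these two roles (a minimal sketch, not the authors' implementation; all shapes and variable names here are hypothetical), both mechanisms reduce to scaled dot-product attention, differing only in where the keys and values come from:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    Returns both the output and the attention map itself."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))   # 16 motion frames, feature dim 8 (hypothetical)
words = rng.normal(size=(5, 8))     # 5 text-token embeddings, dim 8 (hypothetical)

# Self-attention: frame-to-frame similarity over the motion sequence;
# the (16, 16) map relates every frame to every other frame.
self_out, self_map = attention(frames, frames, frames)

# Cross-attention: each frame attends to the text tokens; the (16, 5) map
# shows which timesteps each word activates.
cross_out, cross_map = attention(frames, words, words)
```

The attention maps returned here are exactly the objects the abstract proposes to inspect and manipulate: each row is a probability distribution over frames (self-attention) or words (cross-attention).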
Based on these key properties, we develop a versatile set of simple yet effective motion editing methods by manipulating attention maps,
including motion (de-)emphasizing, in-place motion replacement, and example-based motion generation.
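For intuition, motion (de-)emphasizing can be viewed as rescaling the cross-attention weights of a target word and renormalizing each frame's distribution; the sketch below illustrates this idea under assumed names and a simple multiplicative scaling scheme, not the paper's exact procedure:

```python
import numpy as np

def reweight_word(cross_map, word_idx, scale):
    """Scale one word's attention column, then renormalize each row
    so the per-frame attention weights still sum to 1."""
    edited = cross_map.copy()
    edited[:, word_idx] *= scale           # scale > 1 emphasizes, < 1 de-emphasizes
    edited /= edited.sum(axis=1, keepdims=True)
    return edited

# Hypothetical cross-attention map: 4 frames x 3 words, each row sums to 1.
cross_map = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7],
                      [0.4, 0.4, 0.2]])

# Emphasize word 1: its weight grows in every frame, at the expense of others.
emphasized = reweight_word(cross_map, word_idx=1, scale=2.0)
```

In-place replacement and example-based generation can be framed analogously, as substituting or transplanting attention maps rather than rescaling them.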
To further verify the explainability of the attention mechanism, we additionally explore its potential for action counting and
grounded motion generation via attention maps.
Our experimental results show that our method achieves strong generation and editing performance with good explainability.