HumanMAC: Masked Motion Completion for Human Motion Prediction

Ling-Hao Chen¹, Jiawei Zhang², Yewen Li³, Yiren Pang², Xiaobo Xia⁴, and Tongliang Liu⁴

¹Tsinghua University, ²Xidian University, ³Nanyang Technological University, ⁴The University of Sydney

ICCV 2023

arXiv Code Slides Poster

Abstract

Human motion prediction is a classical problem in computer vision and computer graphics, which has a wide range of practical applications. Previous effects achieve great empirical performance based on an encoding-decoding style. The methods of this style work by first encoding previous motions to latent representations and then decoding the latent representations into predicted motions. However, in practice, they are still unsatisfactory due to several issues, including complicated loss constraints, cumbersome training processes, and scarce switch of different categories of motions in prediction. In this paper, to address the above issues, we jump out of the foregoing style and propose a novel framework from a new perspective. Specifically, our framework works in a denoising diffusion style. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, with a denoising procedure, we make motion prediction conditioning on observed motions to output more continuous and controllable predictions. The proposed framework enjoys promising algorithmic properties, which only needs one loss in optimization and is trained in an end-to-end manner. Additionally, it accomplishes the switch of different categories of motions effectively, which is significant in realistic tasks, e.g., the animation task. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC.

Video

Comparison with Previous Methods

Comparison between the encoding-decoding fashion and masked motion completion. (a) The methods in the encoding-decoding fashion encode the observation into a latent explicitly and then decodes the latent into prediction results. (b) The proposed motion diffusion model generates motions from noise in the training stage. In the inference stage, it treats HMP as a masked completion task.

DCT-Completion in HumanMAC System

Comparison between the vanilla denoising procedure and the proposed DCT-Completion.

Examples of Prediction Results

Human3.6M HumanEva

Examples of Part-body Controllable Prediction

Note: The first column is the observation frame and the second column is the ground truth that provides fixed part of the body.

We fix the 4 joints at the torso and can see that the predicted results are biased towards determinism, with only slight variations in the arms and legs. This is consistent with the forward dynamics of the human skeleton.

When fixing the right arm, we can observe that the torso is clearly affected by the fixation of the right arm, and likewise the rest of the body. This makes some of the predictions distorted, but most of them are still valid.

When fixing the left leg, we see that the predicted results become diverse. An interesting phenomenon is that the response of the right leg naturally aligns with the left leg, despite controlling only the left leg.

The case of the fixing left leg is similar to fixing the right leg, where the predicted motion still becomes diverse within the context constraint.

Fixing the left arm is the same as when fixing the right arm.

When fixing the entire lower body, the upper body still produces diverse predictions. action.

When fixing the entire upper body, the prediction becomes deterministic.