MotionLLM: Understanding Human Behaviors from Human Motions and Videos

Ling-Hao Chen^{😎 1, 3}, Shunlin Lu^{😎 2,
3},
Ailing Zeng³, Hao Zhang^{3, 4}, Benyou Wang², Ruimao Zhang², Lei Zhang^{🤗 3}

^😎Co-first author. Listing order is random.

^🤗Corresponding author.

¹Tsinghua University
²School of Data Science, Shenzhen Research Institute of Big Data, CUHK (SZ)
³International Digital Economy Academy (IDEA)
⁴The Hong Kong University of Science and Technology

arXiv Code Demo Blogpost

📖 Abstract

This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs). Diverging from recent LLMs designed for video-only or motion-only understanding, we argue that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively. In light of this, we present MotionLLM, a straightforward yet effective framework for human motion understanding, captioning, and reasoning. Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights. Furthermore, we collect a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions. Additionally, we propose the MoVid-Bench, with carefully manual annotations, for better evaluation of human behavior understanding on video and motion. Extensive experiments show the superiority of MotionLLM in the caption, spatial-temporal comprehension, and reasoning ability.

🤩 Introducing MotionLLM

Introducing MotionLLM. (a) The input and output of MotionLLM. (b) MotionLLM has broad application scenarios, such as an intelligent fitness coach.

🤩 Interesting Results

Q: describe what the man is doing.

A: The man is performing a choreographed martial arts sequence on stage. He is demonstrating various martial arts techniques and movements while engaged in a conversation with the audience, showing his skills and artistry.

Q: why is the video funny?

A: The video is funny due to the man's exaggerated movements and gestures, which are meant to mimic the actions of a professional wrestler. The comedic effect comes from the contrast between the man's usual calm demeanor and the over-the-top movements being performed.

Q: The man in the video is playing a kind of musical instrument. Could you please guess what the kind of musical instrument is?

A: Yes, the man in the video is playing a small harmonica.

(* Actor in the video: Ling-Hao Chen)

✌️ More Features

Direction Coach Commentator Embodiment Reasoning

Q: In what shape does the man walk around? A circle or a square?

A: The man walks around in a circle.

Q: Assume you are a fitness coach, refer to the video of the professional athlete, please analyze specific action essentials in steps and give detailed instruction.

A: As a fitness coach, I would break the sequence into specific steps to instruct someone on how to perform it. First, the athlete would start in the squat position, engaging their core and lower body muscles. They would then lift the barbell up with controlled movement to shoulder height, keeping the barbell close to the body. At this point, the athlete would pause to ensure proper form and balance, and then continue lifting it overhead with the arms extended completely. The barbell should be held overhead with a slight lean backwards for stability. Once the barbell is overhead, the athlete would then lower it back down with control and engage their core and lower body muscles once more, returning to the squat position.

Q: Assume you are a commenter on weightlift events, please give me the script based on the video.

A: Commenting on the weightlift event, we can see a female lifter using a barbell. She starts with a light weight, and as she gains confidence, she picks up a heavier weight and performs a full lift. After the lift, she watches her own performance and pumps her arms in the air. Certainly, her focus and determination are evident as she strives to lift the barbell to the highest level. It is an impressive display of strength and athleticism, and her ability to handle both light and heavy weights is commendable. Overall, it is a truly inspiring moment, as we witness the power of a strong woman lifting heavy weights and celebrating her achievement.

Q: If you are a robot and you watch and obey the human's gesture command. What does the man's gesture mean in the video?

A: The man's gestures indicate that he is asking me to come.

(* Actor in the video: Shunlin Lu)

Q: what is the intention of the man?

A: The man is standing outside a door and possibly preparing to enter.

(* Actor in the video: Shunlin Lu)

🤝 Motion and Video Data Help with each other!

w/ moiton w/o motion

Against Hullciation

Q: describe the body movement of the woman

A: The woman's body movements are slow and controlled, with her lifting her upper body off the mat and her lower body following the movement. She maintains a straight back and neck throughout the movement.

Q: Why is the man running so fast?

A: The man is running at a high speed to push himself to the limit in his training. He may be trying to increase his endurance or to build up his ability to handle high-intensity activity.

Against Hullciation

Q: describe the body movement of the woman

A: The woman in the video performs a workout routine on a mat using her legs to support her body beneath her. She begins her routine by stretching her legs up the wall while lying on her back. She then proceeds to perform leg raises and leg sit-ups on her mat. After finishing her routine, the woman stops and stretches before walking away with a smile on her face.

Q: Why is the man running so fast?

A: The man is running very fast because he is trying to finish his indoor workout as quickly as possible. He might be feeling tired or in a rush, and running fast can help him get to the end of his workout more quickly.

🤗 Live Demo

👀 System Overview

System overview of MotionLLM. (a) MotionLLM takes videos or human motions as visual input V. It first processes the visual input with a vision encoder and translates the vision embeddings into linguistic space via a V-L translator. (b) MotionLLM is trained in two stages. In the first stage, we train the V-L translator to learn the modality translation. In the second stage, we fine-tune the LLM and the V-L translator via instruction tuning data.

🦾 Technical Detail Comparison

Technical comparisons with other VLLMs. (a) LLaVA takes the images as input only. (b) Video-LLaVA shares a unified V-L translator for images and videos due to the small modality gap between the two modalities. (c) To bridge the larger modality gap between motion and videos, we take two separated V-L translators for better modality translations.

🫡 Comparison with Video Understanding Baselines

Input Video

Comparisons with baselines

Videos comprehension of models. The results show good performance of MotionLLM on captioning, spatial-temporal comprehension, and reasoning. The comparison with Video-Chat and Video-LLaVA shows good sequentiality and direction comprehension of MotionLLM.
(* Actor in the video: Shunlin Lu)

💪 Comparison with Motion Captioning Baselines

Input Motion

Comparisons with baselines

Examples of motions comprehension. The results demonstrate the proficiency of MotionLLM in captioning, spatial-temporal understanding, and reasoning. Comparison with TM2T and MotionGPT underscores the effectiveness of MotionLLM in handling unseen motions.

🌹 Acknowledgement

The author team would like to deliver many thanks to many people. Qing Jiang helps a lot with some parts of manual annotation on MoVid Bench and resolves some ethics issues of MotionLLM. Jingcheng Hu provided some technical suggestions for efficient training. Shilong Liu and Bojia Zi provided some significant technical suggestions on LLM tuning. Jiale Liu, Wenhao Yang, and Chenlai Qian provided some significant suggestions for us to polish the paper. Hongyang Li helped us a lot with the figure design. Yiren Pang provided GPT API keys when our keys were temporarily out of quota.

Citation

@article{motionllm,
  title={MotionLLM: Understanding Human Behaviors from Human Motions and Videos},
  author={Chen, Ling-Hao and Lu, Shunlin and Zeng, Ailing and Zhang, Hao and Wang, Benyou and Zhang, Ruimao and Zhang, Lei},
  journal={arxiv:2405.20340},
  year={2024}
}

The website template was adapted from HumanMAC Project.