Motion4Motion: Motion Transfer Across Subjects at Inference

Abstract

Skeleton-free, training-free motion transfer

This work explores motion transfer from one video to another, a problem that is crucial for animating diverse characters. Prior work largely focuses on humans or human-like characters and relies on a predefined skeleton, which limits generalization to different species and restricts large-scale training due to the scarcity of labeled cross-topology data. We step out of the skeleton-based framework and propose Motion4Motion, a training-free motion transfer framework that models the motion flow of a subject in a video instead of its skeleton, making motion transfer across species straightforward. Extensive experiments and novel applications show Motion4Motion outperforms strong baselines.

Video

Watch the 2-minute overview

A walkthrough of Motion4Motion, including cross-species transfer and the surprising case of teaching a table how to walk.

Project video — YouTube

Motivation

Why skeleton-free?

Mainstream motion transfer uses skeletons to bridge the gap between a source and a target character. This works well for human-to-human, but breaks as soon as the morphologies diverge—there is no shared skeletal template between a human, a goose, and a panda, and labeled cross-topology data is scarce. Spatial alignment becomes ill-defined, and existing systems often collapse into stiff movement, identity drift, or sliding artifacts.

Motion4Motion throws away the skeleton. Instead of kinematics, it operates on dense pixel-level motion flow and injects that flow into a pre-trained video diffusion transformer through a simple mechanism we call TransPE—Transferring Positional Encoding.

Teaser figure (cross-species motion transfer): **Motion4Motion** transfers motion across disparate subjects (e.g., human → panda, human → goose) at inference, without a uniform skeleton.

Method

Motion flow in, motion flow out

At its core, Motion4Motion is a two-stage inference procedure on top of a frozen Diffusion Transformer (WAN-T2V). We extract a motion flow from the source video, retarget it to the target subject, and then reshape the self-attention of the denoiser so the target appearance follows that flow.

1. Motion flow extraction

From the first source frame, we sample anchor points on the subject using Grounded SAM-2. A semantic matcher built on diffusion features establishes cross-image correspondence to the target, while a point tracker yields the temporal trajectories of the source anchors across frames.
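The retargeting step that follows extraction can be illustrated with a minimal sketch. It assumes (this is not stated on the page) that each matched target point simply inherits the displacement of its source anchor relative to the first frame; the function name `retarget_flow` and the list-of-tuples layout are hypothetical, chosen only for illustration.

```python
def retarget_flow(source_tracks, target_points):
    """Move each matched target point by its source anchor's displacement.

    source_tracks[t][i] is the (x, y) position of anchor i at frame t
    (the point tracker's output); target_points[i] is the (x, y) position
    matched to anchor i in the target image.
    Assumption: displacement is copied as-is, with no scale normalization.
    """
    first_frame = source_tracks[0]
    retargeted = []
    for frame in source_tracks:
        moved = []
        for (x, y), (x0, y0), (tx, ty) in zip(frame, first_frame, target_points):
            # Offset of the source anchor from where it started, applied to the target point.
            moved.append((tx + (x - x0), ty + (y - y0)))
        retargeted.append(moved)
    return retargeted


# Two anchors tracked over two frames, and their matches in the target image.
source_tracks = [
    [(0, 0), (10, 0)],
    [(1, 2), (10, 1)],
]
target_points = [(5, 5), (20, 5)]
flow = retarget_flow(source_tracks, target_points)
```

In this toy case the first frame of `flow` equals the matched target points, and later frames follow the source displacements. A real system would likely also account for differences in subject size, which this sketch ignores.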

2. TransPE attention

During denoising, we cache the target subject's K/V, replicate them along time, and re-embed them with RoPE positions taken from the retargeted motion flow. Concatenating these position-aware features into the attention forces the DiT to synthesize the target at the coordinates dictated by the source flow—no training required.
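To see why re-embedding a cached key with a new RoPE position steers where it is attended to, here is a minimal sketch using a one-dimensional rotary encoding. It is a simplification under stated assumptions: a video DiT would use multi-axis positions, and the names `rope` and `reembed_along_time` are hypothetical helpers, not part of Motion4Motion's code.

```python
import math


def rope(vec, pos, base=10000.0):
    """Apply a 1-D rotary positional encoding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        # Rotate each consecutive pair of channels by a position-dependent angle.
        out.extend([vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reembed_along_time(cached_key, flow_positions):
    """Replicate one cached (un-rotated) key per frame and rotate each copy
    to the position the motion flow assigns to that frame."""
    return [rope(cached_key, pos) for pos in flow_positions]


cached_key = [1.0, 0.0, 0.5, 0.5]   # toy feature for one target token
flow_positions = [3, 5, 8]          # toy 1-D positions, one per frame
keys = reembed_along_time(cached_key, flow_positions)

# A query with the same content, placed at position 5, scores highest
# against the copy that was re-embedded at position 5.
query = rope(cached_key, 5)
scores = [dot(query, k) for k in keys]
best_frame = scores.index(max(scores))
```

Because rotary encodings make the query-key score depend only on the relative offset between positions, a key copy moved to a flow position is matched most strongly by queries at that same position. That is the property the re-embedding relies on.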

System overview of Motion4Motion. **Pipeline.** Anchor points are matched between the source first frame and the target image (cross-image correspondence); a point tracker extracts the source motion flow. TransPE re-embeds the target's cached features with the retargeted flow's positions inside self-attention.

Tracking

Source motion flow, visualized

The motion flow is a topology-agnostic representation of dynamics: a set of spatio-temporal trajectories that we later read through positional encoding rather than render directly. Below, anchor points are tracked across the source video.

Point tracking on the source video.

TransPE

How K and V are rewired

Given the cached target K/V from inversion, TransPE replicates them along the time axis and re-embeds them with positional encodings taken from the retargeted motion flow. The DiT's query then "looks for" the target's appearance at the new coordinates.

Illustrating the Q/K/V manipulation performed by TransPE.
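The concatenation of injected K/V into self-attention can be sketched as ordinary softmax attention over an extended key/value list. This is an illustrative toy, not the model's implementation: the `attend` helper, the two-dimensional features, and the specific numbers are all assumptions, and the injected key is taken as already position-encoded.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for a single query over lists of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Keys/values the denoiser already has for the frame being generated.
own_keys = [[1.0, 0.0], [0.0, 1.0]]
own_values = [[1.0, 0.0], [0.0, 1.0]]

# Cached target key/value, assumed already re-embedded at a flow position.
injected_keys = [[4.0, 4.0]]
injected_values = [[9.0, 9.0]]

query = [1.0, 1.0]
out, weights = attend(query, own_keys + injected_keys, own_values + injected_values)
```

Here the query aligns most strongly with the injected key, so nearly all attention mass goes to the injected value and the output is pulled toward the cached target feature. No weights are updated; only the set of keys and values seen by attention changes.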

Results

Cross-species results

On benchmarks for animal (33 pairs) and human (123 pairs) motion transfer, Motion4Motion ranks best across Textual Similarity, Motion Fidelity, Temporal Consistency, Appearance Consistency, and Pose Similarity, outperforming FlexiAct, MotionClone, MotionDirector, RoPECraft, Diffusion-As-Shader, and WAN-Move.

Cross-species motion transfer across different animal pairs.

More cross-species transfers.

Application

Teaching a table to walk

T2V models struggle with novel concept composition like "a desk coming to life, running rapidly along a muddy riverside"—they tend to produce a static or sliding desk. Using Motion4Motion with a light "bone-binding" trick (SAM-2 masks link the human legs to the table legs), we drive a table's gait from a walking human, entirely via flow-based attention manipulation.

Driving a still table with a walking human.

Cross-morphology motion transfer (table walking). **Cross-morphology transfer.** Human gait (A) drives a still table (B) via masked correspondence (C–D), producing a coherent walking table (H) that is temporally synced with the source (G).
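One way to picture the "bone-binding" trick is as correspondence restricted by part masks: a source anchor may only match target points that carry the same part label. The sketch below is hypothetical throughout. The part labels, the cosine-similarity matcher, and the function `match_within_parts` are assumptions made for illustration; the page only states that SAM-2 masks link the human legs to the table legs.

```python
import math


def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def match_within_parts(src_feats, src_parts, tgt_feats, tgt_parts):
    """For each source anchor, return the index of the most similar target point
    with the same part label, or None if the target has no point in that part."""
    matches = []
    for feat, part in zip(src_feats, src_parts):
        candidates = [j for j, p in enumerate(tgt_parts) if p == part]
        if not candidates:
            matches.append(None)
            continue
        matches.append(max(candidates, key=lambda j: cosine(feat, tgt_feats[j])))
    return matches


# Toy features: two anchors on the human legs, three points on the table legs.
src_feats = [[1.0, 0.0], [0.0, 1.0]]
src_parts = ["left_leg", "right_leg"]
tgt_feats = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9]]
tgt_parts = ["right_leg", "left_leg", "left_leg"]

matches = match_within_parts(src_feats, src_parts, tgt_feats, tgt_parts)
```

In this example the second source anchor is matched to target point 0 because it is the only `right_leg` candidate, even though point 2 is more similar in feature space. The mask constraint overrides pure feature similarity, which is what lets a human leg bind to a table leg that looks nothing like it.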

Cite

BibTeX

@inproceedings{chen2026motion4motion,
  title     = {Motion4Motion: Motion Transfer Across Subjects at Inference},
  author    = {Chen, Ling-Hao and Yin, Zixin and Wang, Duomin and Zeng, Xianfang and Yu, Gang},
  booktitle = {SIGGRAPH Conference Papers '26},
  year      = {2026},
  publisher = {ACM},
  doi       = {10.1145/3799902.3811062}
}