MotionMap: Representing Multimodality in Human Pose Forecasting



Reyhaneh Hosseininejad* and Megh Shukla*,
Saeed Saadatnejad, Mathieu Salzmann, Alexandre Alahi
Computer Vision and Pattern Recognition (CVPR) 2025

arXiv GitHub Presentation




Problem Statement

Predicting human motion from observed skeletal poses seems straightforward, yet beneath the surface lies an intriguing complexity: for any given pose sequence, an infinite number of possible future motions exist. This inherent multimodality challenges existing forecasting models, which typically rely on oversampling a large number of predictions. However, no matter how many futures are generated, these methods inevitably fail to cover all realistic possibilities, often missing critical or rare motions essential for real-world applications.

To overcome this limitation, it becomes essential to rethink human pose forecasting. Rather than aiming to capture endless hypothetical futures, a more practical approach is to explicitly learn and represent the realistic transitions observed in available data. By grounding multimodal predictions firmly in observed transitions, the forecasting task turns from an ill-posed into a well-posed one. But first, what does multimodality in human pose forecasting mean?

Multimodality in Human Pose Forecasting

Before diving deeper into our methodology, let’s first clarify what we mean by “multimodality” in human pose forecasting. In simple terms, multimodality refers to the presence of multiple distinct yet plausible future motions from a single observed sequence. Imagine seeing a person standing still—this single pose could naturally lead to multiple realistic futures, such as walking forward, turning around, or even sitting down. Each of these actions represents a different “mode.” Importantly, each mode comprises a set of likely and coherent motions logically connected to the observed pose. By recognizing these diverse yet realistic futures, multimodal forecasting models can offer richer and more informative predictions. Formally, we define multimodality as

Multimodality in human pose forecasting refers to a diverse yet realistic set of future actions with a logical transition from an observed pose sequence.


But how exactly can we efficiently encode multimodality? And how do we effectively distinguish between likely and unlikely future motions? Answering these questions could lead to more robust, realistic, and practical pose forecasting solutions.



MotionMap

MotionMap Idea

That’s exactly where the MotionMap representation comes in, offering a new take on deep regression. We can think of MotionMap as a kind of visual map: a heatmap that shows all the different paths human motion can realistically take from any observed pose. On this map, distinct peaks mark the most probable future movements, learned directly from the transitions observed in the training data. Unlike traditional methods that need countless random predictions, MotionMap naturally captures multiple scenarios at once, even highlighting the rare yet crucial motions we cannot afford to overlook. With this representation, we predict an explicit distribution over different future sequences for an observation.

Why is this important? First, we know the exact number of future motions per sample, allowing us to be much more efficient with our predictions. Second, we know which future is more likely, since the intensity of each peak can be treated as a measure of confidence. Third, MotionMap can be used in tandem with metadata such as action labels for controllable human pose forecasting: for instance, we can map different confidences to different actions, allowing the user to generate motion sequences based on the action and confidence. Fourth, MotionMap predicts the confidence of each mode as well as the uncertainty of the motion conditioned on the mode, yielding accurate uncertainty estimates that allow for safe deployment. Finally, our quantitative results show that MotionMap is robust and state-of-the-art in human pose forecasting.

But how do we learn MotionMap for each sample, and how exactly do we use MotionMap for predicting multimodality given an observation? We answer this in the next section.



Methodology

Overview

Let’s break down how MotionMap works. First, we use an autoencoder, a model that learns how to compress and then reconstruct human motion sequences. This autoencoder takes as input the observation and one of the many multimodal ground truths, with the goal of compressing and reconstructing the entire pose sequence (input + multimodal ground truth). During training, the autoencoder captures key patterns from past (observed) and future (to-be-predicted) skeletal poses. Essentially, it becomes good at recognizing realistic motion transitions by seeing many examples of how poses evolve.
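As a rough illustration, here is a minimal PyTorch sketch of such a sequence autoencoder. All names and dimensions (number of joints, sequence length, latent size) are hypothetical placeholders for exposition, not the architecture used in the paper.

import torch
import torch.nn as nn

class SequenceAutoencoder(nn.Module):
    # Toy sketch: compress a full pose sequence (observation + future)
    # into one latent vector, then reconstruct the sequence from it.
    def __init__(self, n_joints=17, seq_len=125, latent_dim=64):
        super().__init__()
        in_dim = n_joints * 3                      # flattened 3D pose per frame
        self.seq_len = seq_len
        self.encoder = nn.GRU(in_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, in_dim, batch_first=True)

    def encode(self, poses):                       # poses: (B, T, J*3)
        _, h = self.encoder(poses)                 # h: (1, B, latent_dim)
        return h.squeeze(0)                        # (B, latent_dim)

    def decode(self, z):                           # z: (B, latent_dim)
        z_seq = z.unsqueeze(1).repeat(1, self.seq_len, 1)
        out, _ = self.decoder(z_seq)               # (B, T, J*3)
        return out

model = SequenceAutoencoder()
poses = torch.randn(8, 125, 17 * 3)                # observation + future, flattened
loss = nn.functional.mse_loss(model.decode(model.encode(poses)), poses)

The key point is simply that the encoder maps an entire pose sequence to a single latent vector, and the decoder can reconstruct the sequence from that latent alone.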

The catch is that during actual prediction, we won’t know the future poses. So how do we still manage to predict realistic future movements? Our key intuition is that even if we don’t have the future pose sequence, what we need is a latent that describes it. This is where MotionMap comes to the rescue. With MotionMap, we represent an observation’s multimodal ground truth through heatmaps. We construct this by first encoding each multimodal ground-truth pose sequence into a vector using the autoencoder. Next, we use t-SNE, a popular dimensionality reduction technique, to embed each encoding into two dimensions. It is these two-dimensional encodings that we represent through local maxima in the MotionMap. We then train a model that predicts the MotionMap directly from the observed poses. On the predicted MotionMap, the maxima represent the likeliest future motions corresponding to a given observation. However, how do we go from a maximum to a predicted future pose sequence?
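Before answering that, here is a minimal sketch of how such heatmap targets could be constructed, using scikit-learn’s t-SNE. The grid size and Gaussian width are illustrative assumptions, not the paper’s settings.

import numpy as np
from sklearn.manifold import TSNE

def build_motionmap(latents, map_size=64, sigma=2.0):
    # Sketch: embed the sequence latents in 2D with t-SNE, then render
    # every embedding as a Gaussian peak on a shared heatmap grid.
    xy = TSNE(n_components=2).fit_transform(latents)         # (N, 2)
    # normalise the embeddings into heatmap pixel coordinates
    xy = (xy - xy.min(0)) / (np.ptp(xy, 0) + 1e-8) * (map_size - 1)

    heatmap = np.zeros((map_size, map_size))
    ys, xs = np.mgrid[0:map_size, 0:map_size]
    for cx, cy in xy:
        heatmap += np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return np.clip(heatmap, 0.0, 1.0), xy                    # peaks mark GT futures

The pairs of 2D coordinates and their original high-dimensional latents are exactly what the codebook, described next, stores.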

To convert these heatmaps into actual predicted movements, we use a codebook. The codebook is a dictionary that links each maximum on the heatmap to its corresponding embedding (the one reduced to 2D earlier). By looking up the heatmap’s peaks in this dictionary, we can quickly obtain the latent embedding for the pose sequence. As a result, we have the final piece of the jigsaw puzzle needed to make the autoencoder work at test time: the missing future latent, obtained from MotionMap.
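Putting the pieces together, a hypothetical test-time decoding loop might look as follows; the peak threshold and neighbourhood size are assumptions for illustration, and decode_fn stands in for the trained autoencoder decoder.

import numpy as np
from scipy.ndimage import maximum_filter

def decode_futures(pred_map, codebook_xy, codebook_latents, decode_fn, k=5):
    # Sketch: take the k strongest local maxima of the predicted MotionMap,
    # look up the nearest codebook entry for each, and decode its latent
    # into a future pose sequence.
    is_peak = (pred_map == maximum_filter(pred_map, size=3)) & (pred_map > 0.1)
    ys, xs = np.nonzero(is_peak)
    order = np.argsort(pred_map[ys, xs])[::-1][:k]           # strongest peaks first

    futures = []
    for y, x in zip(ys[order], xs[order]):
        # nearest 2D codebook coordinate -> stored sequence latent
        idx = np.argmin(np.linalg.norm(codebook_xy - np.array([x, y]), axis=1))
        futures.append(decode_fn(codebook_latents[idx]))
    return futures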



Experiments

We evaluate different state-of-the-art methods under our well-posed setup, which, as described, involves learning to translate different transitions in the training dataset to unseen test samples. To do this, we select a subset of test samples with multimodal ground truth in the training dataset. Furthermore, we focus on prediction efficiency and evaluate all methods with fewer than 10 samples. Our results are shown below.

Results

We make two key observations from our results. First, methods with high diversity are not necessarily accurate, because many of their predicted transitions are unrealistic given the observed pose sequence. Second, MotionMap achieves higher accuracy on the multimodal metrics, especially when compared to state-of-the-art diffusion-based methods. This is because MotionMap explicitly learns diverse transitions and is able to translate them to test samples. In fact, our additional results in the appendix show that this ability to recall transitions does not come at the cost of predicting generic motions for unseen samples. In short, heatmaps and codebooks outperform diffusion and repeated sampling.

Apart from the quantitative evaluation, we also provide extensive qualitative analysis centred around uncertainty, controllability, ranking, and diversity. Our first analysis compares different methods through the MotionMap representation.

Sampling Comparison

This figure compares the multimodal predictions of different methods through MotionMap. We do this by embedding the multimodal predictions for three different samples (a, b, c) onto their respective ground-truth MotionMaps. The comparison confirms our quantitative evaluation: methods that focus on high diversity do so at the cost of realism, predicting transitions that do not exist in the dataset. Although diffusion-based methods are much more realistic, they fail to capture all the different transitions, including some that are rare. In contrast, MotionMap successfully captures both diversity and realism.

Ranking Predictions by Confidence

MotionMap predicts the confidence of different modes through the local maxima in the heatmap: modes with higher confidence have peaks of higher intensity in the predicted MotionMap. This allows us to rank the different predictions by confidence, as shown. Predictions close to the ground-truth motion have higher confidence, while rare modes and potentially unnatural motions have lower-intensity peaks.
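As a small sketch, reusing the hypothetical names from the snippets above, turning peak intensities into a ranked and normalised confidence list could look like this:

import numpy as np

def rank_modes(pred_map, peak_coords):
    # Sketch: treat the heatmap intensity at each peak as a confidence
    # score and return the modes ranked from most to least likely.
    scores = np.array([pred_map[y, x] for y, x in peak_coords])
    conf = scores / scores.sum()                 # normalise into a distribution
    order = np.argsort(conf)[::-1]
    return [(tuple(peak_coords[i]), float(conf[i])) for i in order]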

Uncertainty Visualization

We also visualize the uncertainty learnt by MotionMap. Not only do we introduce heteroscedastic modelling; by introducing the concepts of mode and forecast, we also improve the semantics behind the learnt uncertainty. MotionMap predicts the confidence of each mode, while the uncertainty network predicts the uncertainty of the pose sequence conditioned on the mode. This conditioning is important: without it, previous methods counterintuitively found homoscedastic modelling to perform better than heteroscedastic modelling. Intuitively, since the predicted motion is multimodal, conditioning on a mode prevents the learnt uncertainty from averaging across all the modes.

We make two observations from this figure. First, joints with a higher degree of mobility have larger uncertainty estimates. This is expected, since mobile joints have a larger range of motion. Second, we highlight the role of the nose as a joint that indicates the orientation of the person: the uncertainty for the nose is higher for the pose sequence involving a faster turn than for the motion with a slower turn.
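To make the mode-conditioned modelling concrete, here is a minimal PyTorch sketch of a heteroscedastic head trained with the Gaussian negative log-likelihood; the network sizes and names are illustrative assumptions, not the paper’s implementation.

import torch
import torch.nn as nn

class ModeConditionedUncertainty(nn.Module):
    # Sketch: predict a per-joint log-variance for the forecast,
    # conditioned on both the observation latent and the mode latent.
    def __init__(self, obs_dim=64, mode_dim=64, out_dim=17 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + mode_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, obs_z, mode_z):
        return self.net(torch.cat([obs_z, mode_z], dim=-1))   # log-variance

def gaussian_nll(pred, target, log_var):
    # Heteroscedastic loss: a large predicted variance down-weights the
    # squared error, so hard-to-predict joints are not over-penalised.
    return (0.5 * (log_var + (pred - target) ** 2 / log_var.exp())).mean()

Because the head sees the mode latent, each mode gets its own variance estimate instead of one estimate averaged over all plausible futures.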

Controllability

MotionMap can be used in tandem with metadata to control the generated motion sequence. For datasets such as Human3.6M, which come with action labels, we can establish a duality between the 2D embeddings of pose sequences and their action labels. Predicting the MotionMap therefore also corresponds to predicting the likelihood of different actions as the forecasted motion for the observed pose sequence. Consequently, downstream tasks can control the generation of different futures by choosing among the likeliest actions, as we show in the figure above. This allows us to incorporate user preference into the pose forecasting pipeline.
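A hypothetical sketch of this control knob, assuming each codebook entry additionally stores an action label (codebook_actions is an illustrative name, not from the paper):

import numpy as np

def modes_for_action(pred_map, codebook_xy, codebook_actions, action):
    # Sketch: keep only the modes whose codebook entry carries the
    # requested action label (e.g. 'Walking'), ranked by confidence.
    mask = codebook_actions == action
    coords = np.round(codebook_xy[mask]).astype(int)
    scores = pred_map[coords[:, 1], coords[:, 0]]   # heatmap intensity per mode
    order = np.argsort(scores)[::-1]
    return coords[order], scores[order]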

Diversity

We conclude our qualitative study by visualizing the diversity of the predicted motions. By using the same decoder as state-of-the-art methods like BeLFusion, we note that MotionMap predicts diverse yet realistic future pose sequences.



Wrapping Up: A Smarter Approach to Human Pose Forecasting

In this work, we tackled the challenge of making human pose forecasting well-posed and more efficient. Our proposed representation, MotionMap, explicitly encodes multiple possible future motions while also quantifying their confidence. This allows us to distinguish more likely movements from less probable ones, offering a structured approach to uncertainty in motion prediction. By modeling the spread of future motions directly, MotionMap eliminates the need for excessive random sampling, making it both sample-efficient and robust. We demonstrated how this approach leads to more diverse yet realistic motion predictions, ultimately improving reliability in real-world applications. Beyond accuracy, MotionMap also paves the way for controllable pose forecasting and practical uncertainty estimation, two aspects that could be highly valuable in domains like robotics, animation, and autonomous systems. Through comprehensive analysis, we showcased the effectiveness of MotionMap in enhancing both the safety and reliability of human motion forecasting.

Our takeaway? MotionMap shows that heatmaps and codebooks can outperform diffusion and repeated sampling in human pose forecasting! Looking ahead, we believe this paradigm shift can inspire future research in making motion prediction more structured, interpretable, and useful in real-world scenarios.



Citation

If our work is useful, please consider citing the accompanying paper and starring our code on GitHub!

@inproceedings{hosseininejad2025motionmap,
  title     = {MotionMap: Representing Multimodality in Human Pose Forecasting},
  author    = {Reyhaneh Hosseininejad and Megh Shukla and Saeed Saadatnejad and Mathieu Salzmann and Alexandre Alahi},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}