JEPA world model for human motion

Early exploration, Nov 2025. Before narrowing to Physical AI infrastructure, we were mapping motion capture pipelines and human movement dynamics. This post is from that period.

We trained a Joint-Embedding Predictive Architecture (JEPA) that learns the dynamics of human movement from clinical motion capture data. Given a person’s current physical state, the model predicts their future state in a learned latent space without reconstructing pixel-level observations. This enables latent-space motion planning, physics-filtered simulation variants, and uncertainty-aware clinical assessment.

This post describes the architecture, training procedure, and how the world model fits into a 10-stage simulation pipeline that takes raw skeleton tracking from an iPad and produces physics-validated motion predictions.

Why a world model for clinical motion?

Clinical rehabilitation assessment requires understanding not just what a patient did, but what they could have done. A clinician watches a patient walk and mentally simulates: “If their knee flexion were 5 degrees better, would the gait cycle normalize?” This counterfactual reasoning is the gap between measurement and clinical insight.

We built a world model that learns the dynamics of clinical movement (gait, balance, sit-to-stand, squat) from real patient data captured via iPad body tracking. The model operates in a 64-dimensional latent space where:

Encoding compresses a high-dimensional physical state (joint positions, velocities, contact flags) into a compact representation
Prediction forecasts future states at arbitrary time horizons without autoregressive rollout
Uncertainty quantification identifies states where the model lacks confidence. These are often the clinically interesting moments: balance loss, gait asymmetry onset, movement compensation

The architecture

We implement a custom JEPA following the self-supervised paradigm introduced by LeCun (2022) [1]: learn representations by predicting in latent space rather than pixel space.

Three networks, one objective:

The online encoder maps a physical state vector x_t to a latent embedding z_t:

Encoder: x_t → Linear(feature_dim, 128) → ReLU → Dropout(0.1) → Linear(128, 64) → z_t

The predictor takes the current latent z_t and a time horizon Δh and predicts the future latent:

Predictor: [z_t; Δh] → Linear(65, 128) → ReLU → Dropout(0.1) → Linear(128, 64) → ẑ_{t+Δh}
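A minimal NumPy sketch of the two forward passes above, mainly to make the tensor shapes concrete. It is not the pipeline's code: the weights are random, dropout is omitted, `FEATURE_DIM = 60` is a placeholder (the post does not state the feature dimension), and the helper names `init_mlp` and `mlp` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

FEATURE_DIM = 60   # assumption: the real value depends on the tracked joints
HIDDEN_DIM = 128
LATENT_DIM = 64


def init_mlp(in_dim, hidden_dim, out_dim):
    """Random weights for Linear -> ReLU -> Linear (dropout omitted)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]


encoder = init_mlp(FEATURE_DIM, HIDDEN_DIM, LATENT_DIM)
predictor = init_mlp(LATENT_DIM + 1, HIDDEN_DIM, LATENT_DIM)  # 64 + 1 = 65 inputs

x_t = rng.normal(size=(8, FEATURE_DIM))   # batch of 8 physical states
delta_h = np.full((8, 1), 5.0)            # horizon of 5 frames for every sample

z_t = mlp(encoder, x_t)                                            # shape (8, 64)
z_pred = mlp(predictor, np.concatenate([z_t, delta_h], axis=1))    # shape (8, 64)
```

The predictor's input width of 65 is the 64-dimensional latent with the scalar horizon Δh appended.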

The target encoder is an exponential moving average (EMA) copy of the online encoder. It produces the training targets (the “correct” latent embeddings) without receiving any gradient signal:

z_target = TargetEncoder(x_{t+Δh})

After each gradient step:

θ_target ← τ · θ_target + (1 − τ) · θ_online     where τ = 0.996

With τ = 0.996, each update moves the target encoder only 0.4% of the way toward the online encoder, so it changes very slowly. This gives the predictor a stable, slowly evolving target and helps prevent representation collapse (the failure mode where the encoder maps every input to a constant).
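To get a feel for how slowly the target moves, here is a small sketch of the EMA rule applied to a single scalar parameter. The online value is held fixed at 1.0 purely for illustration (in real training it changes every step), and `ema_update` is a hypothetical helper name.

```python
def ema_update(target, online, tau=0.996):
    """Return tau * target + (1 - tau) * online for each named parameter."""
    return {name: tau * target[name] + (1.0 - tau) * online[name] for name in target}


target = {"w": 0.0}
online = {"w": 1.0}   # held fixed here to isolate the lag of the target

steps = 0
while target["w"] < 0.5:
    target = ema_update(target, online)
    steps += 1

print(steps)  # 173 updates to close half of the gap
```

After n updates the remaining gap is 0.996^n, so the target needs 173 steps to cover half the distance, and its averaging window is roughly 1 / (1 − τ) = 250 steps.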

Loss function:

L = MSE(ẑ_{t+Δh}, z_target)

The predictor is trained to match the target encoder’s representation of the actual future state. Because no gradients flow into the target encoder, the online encoder cannot lower the loss simply by dragging the targets toward a constant. Together with the slow EMA update, this discourages collapse to a trivial mapping in practice, although it is not a formal guarantee.


What goes into a “physical state”

The input feature vector x_t is not raw joint positions. It is a structured physical state built by the pipeline’s Physical State Builder:

Joint positions: 3D coordinates of tracked body joints (root, hips, knees, ankles, shoulders, elbows, wrists, spine, head) in world frame
Joint velocities: Finite-difference velocities computed from adjacent frames
Contact flags: Binary indicators for left/right foot ground contact, estimated from foot height (< 5 cm above floor) and velocity
Phase labels: Current movement phase (stance/swing for gait, descent/bottom/ascent for squat, static/unstable for balance)
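The velocity and contact features can be sketched as follows. This is an illustration under stated assumptions, not the Physical State Builder itself: the 0.3 m/s speed threshold and the 60 Hz frame rate are made up (the post gives only the 5 cm height threshold), and the function names are hypothetical.

```python
import math

FOOT_HEIGHT_MAX_M = 0.05    # from the text: < 5 cm above the floor
FOOT_SPEED_MAX_MPS = 0.3    # assumption: the post gives no velocity threshold


def finite_difference_velocity(prev_pos, curr_pos, dt):
    """Backward-difference velocity (m/s) between two adjacent frames."""
    return tuple((c - p) / dt for p, c in zip(prev_pos, curr_pos))


def foot_in_contact(foot_pos, foot_vel):
    """Contact if the foot is near the floor (Y is up) and nearly stationary."""
    speed = math.sqrt(sum(v * v for v in foot_vel))
    return foot_pos[1] < FOOT_HEIGHT_MAX_M and speed < FOOT_SPEED_MAX_MPS


dt = 1.0 / 60.0  # assumption: 60 Hz capture
left_foot = [(0.10, 0.02, 0.000), (0.10, 0.02, 0.001), (0.12, 0.09, 0.050)]

velocities = [finite_difference_velocity(a, b, dt) for a, b in zip(left_foot, left_foot[1:])]
contacts = [foot_in_contact(pos, vel) for pos, vel in zip(left_foot[1:], velocities)]
```

In this toy sequence the foot is planted in the second frame and lifted in the third, so `contacts` is `[True, False]`.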

The Physical State Builder normalizes all values to consistent units (meters, m/s, radians) and aligns them to a world frame with:

Origin at room center
Floor at Y = 0
Forward direction computed from first-to-last root joint displacement
Gravity vector: [0, −9.81, 0]

Training

Dataset construction: The JEPA Dataset Builder (Stage 5 of the pipeline) converts a sequence of physical states into training transitions:

For each pair of states (x_t, x_{t+Δh}) in a sequence of T frames, where Δh ranges from 1 to T − 1, a training sample (x_t, Δh, x_{t+Δh}) is created. Variable-horizon training is critical: it forces the model to learn dynamics at multiple timescales rather than memorizing single-step transitions.
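The pair enumeration can be sketched in a few lines. `build_transitions` and its `max_horizon` cap are hypothetical (the post describes using every horizon that fits in the sequence), and the strings stand in for state vectors.

```python
def build_transitions(states, max_horizon=None):
    """Enumerate (x_t, delta_h, x_{t+delta_h}) for every valid frame pair."""
    n = len(states)
    limit = n - 1 if max_horizon is None else min(max_horizon, n - 1)
    return [
        (states[t], h, states[t + h])
        for t in range(n)
        for h in range(1, min(limit, n - 1 - t) + 1)
    ]


states = [f"x{t}" for t in range(5)]
transitions = build_transitions(states)
print(len(transitions))  # 10
```

A sequence of T frames yields T(T − 1)/2 transitions, so a 200-frame assessment produces 19,900 samples and a 1000-frame assessment produces 499,500.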

Training parameters:

Parameter	Value
Latent dimension	64
Hidden dimension	128
Dropout	0.1
EMA τ	0.996
Optimizer	AdamW
Learning rate	1e-3
Batch size	64
Epochs	50

Training on a single clinical assessment (200–1000 frames) takes under 60 seconds on a single GPU. The model checkpoint stores encoder_state_dict, predictor_state_dict, target_encoder_state_dict, training_config, and loss_curve.

Uncertainty quantification via MC-Dropout

At inference time, we estimate prediction uncertainty using Monte Carlo Dropout:

Keep dropout enabled (both encoder and predictor in .train() mode)
Run 10 forward passes with different dropout masks
Compute per-timestep standard deviation across the 10 predictions
Report mean standard deviation as the uncertainty estimate
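A NumPy sketch of the four steps above. The weights are random stand-ins for a trained predictor, dropout is applied to the hidden layer only, and averaging the standard deviation over latent dimensions to get one number per timestep is an assumption about how the reduction is done.

```python
import numpy as np

rng = np.random.default_rng(0)
DROPOUT_P = 0.1
N_PASSES = 10

# Stand-in weights; a real run would load the trained predictor instead.
w1 = rng.normal(0.0, 0.1, (65, 128))
w2 = rng.normal(0.0, 0.1, (128, 64))


def predict_with_dropout(inputs):
    """One stochastic forward pass with inverted dropout on the hidden layer."""
    hidden = np.maximum(inputs @ w1, 0.0)
    mask = rng.random(hidden.shape) >= DROPOUT_P      # fresh mask on every call
    hidden = hidden * mask / (1.0 - DROPOUT_P)
    return hidden @ w2


inputs = rng.normal(size=(20, 65))   # 20 timesteps of [z_t; delta_h]
passes = np.stack([predict_with_dropout(inputs) for _ in range(N_PASSES)])  # (10, 20, 64)

# Spread across the 10 passes, averaged over latent dimensions: one value per timestep.
std_per_timestep = passes.std(axis=0).mean(axis=1)    # shape (20,)
uncertainty = float(std_per_timestep.mean())          # single reported scalar
```

Timesteps whose entry in `std_per_timestep` is well above the mean are the ones to inspect first.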

High uncertainty indicates states where the model has seen limited training data. These are often clinically meaningful moments:

Transition between movement phases (stance → swing in gait)
Balance recovery after perturbation
Asymmetric movements (favoring one leg)
Novel movement patterns not seen in training

The 10-stage pipeline

The world model is Stage 6 of a 10-stage simulation pipeline that transforms raw iPad body tracking into physics-validated motion predictions.

Pipeline overview

Stage	Name	Input	Output
0	Adapter	Raw skeleton data	Standardized states + metadata
1	Alignment	States + environment scan	World-frame aligned states
2	Physical State	Aligned states	Explicit physical states (positions, velocities, contacts)
3	Physics Labels	Adapter metadata	Normalized failure/phase labels
4	Sim Replay	States + environment	Replay validation log
5	JEPA Dataset	Physical states + labels	NPZ binary training dataset
6	JEPA World Model	Training dataset	Latent embeddings + model checkpoint
7	Planning	Explicit states + JEPA model	Motion variants (planned trajectories)
8	Sim Execution	Accepted variants	Outcome labels + corrections
9	Task Pack	All artifacts	Knowledge extraction for downstream use

Each stage produces artifacts that are persisted and auditable. A final audit pass (Stage 10), run after the ten processing stages (0–9), verifies hash consistency across all artifacts.
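One way such a hash-consistency check could work is sketched below. The post does not say which hash function or manifest format the audit uses, so SHA-256, the in-memory manifest, and the artifact names are all assumptions.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def audit(artifacts, manifest):
    """Return the names whose current hash differs from the recorded one."""
    return sorted(
        name
        for name, recorded in manifest.items()
        if name not in artifacts or sha256_hex(artifacts[name]) != recorded
    )


# Hypothetical artifact names and contents, recorded when each stage finishes.
artifacts = {"stage5_dataset.npz": b"dataset-bytes", "stage6_checkpoint.pt": b"model-bytes"}
manifest = {name: sha256_hex(data) for name, data in artifacts.items()}

clean = audit(artifacts, manifest)               # [] when nothing has changed
artifacts["stage6_checkpoint.pt"] = b"tampered"
mismatched = audit(artifacts, manifest)          # ["stage6_checkpoint.pt"]
```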

Stage 0 — Adapter

Three adapter types extract standardized state representations from different data sources:

VideoPose3DAdapter: Processes 3D skeleton sequences from the VideoPose3D pipeline (MediaPipe 2D → temporal 3D lifting → LiDAR depth fusion)
RoboticsAdapter: Processes skeleton sequences from the robotics pipeline (cleaned, normalized, phase-detected)
AssessmentAdapter: Processes assessment scores and metrics directly

All adapters output an AdapterResult containing:

states: List of per-frame state dictionaries
metadata: Failures, phases, contacts, assessment type
source_info: Provenance tracking
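As a sketch, the result container could be a small dataclass. Only the three field names come from the description above; the field types and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AdapterResult:
    states: list[dict[str, Any]]                                # one dict per frame
    metadata: dict[str, Any] = field(default_factory=dict)      # failures, phases, contacts, assessment type
    source_info: dict[str, Any] = field(default_factory=dict)   # provenance tracking


result = AdapterResult(
    states=[{"frame": 0, "root": (0.0, 0.9, 0.0)}],
    metadata={"assessment_type": "gait", "phases": ["stance"], "failures": [], "contacts": [(True, False)]},
    source_info={"adapter": "VideoPose3DAdapter"},
)
```

Using one shared container means the downstream stages do not need to know which adapter produced the data.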

Stage 1 — Alignment

The Alignment Builder registers the motion data to the physical environment:

Floor plane detection scores candidate planes from the environment scan:

+2.0 for horizontal alignment
+3.0 for “floor” or “ground” classification
Penalty proportional to center height (prefer the lowest horizontal plane)
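The scoring rule can be sketched as below. The penalty coefficient of 1.0 per meter is an assumption (the post only says the penalty is proportional to height), and the plane dictionaries with a precomputed `is_horizontal` flag are a hypothetical input format.

```python
HEIGHT_PENALTY_PER_M = 1.0   # assumption: the post only says "proportional"


def score_plane(plane):
    score = 0.0
    if plane["is_horizontal"]:
        score += 2.0
    if plane["label"] in ("floor", "ground"):
        score += 3.0
    score -= HEIGHT_PENALTY_PER_M * plane["center_height_m"]
    return score


planes = [
    {"name": "table", "is_horizontal": True, "label": "table", "center_height_m": 0.75},
    {"name": "floor", "is_horizontal": True, "label": "floor", "center_height_m": 0.0},
    {"name": "wall", "is_horizontal": False, "label": "wall", "center_height_m": 1.2},
]
best = max(planes, key=score_plane)
```

With these toy planes the floor scores 5.0, the table 1.25, and the wall −1.2, so the floor is selected.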

World frame construction:

Origin at room center or floor plane intersection
Up axis: fixed [0, 1, 0]
Forward axis: computed from first-to-last root joint displacement, projected onto the horizontal plane
Gravity: [0, −9.81, 0]
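A sketch of the forward-axis computation, assuming Y is up as stated above. The fallback direction for a subject who barely moves (for example during a balance test) is an assumption; the post does not say how that case is handled.

```python
import math

UP = (0.0, 1.0, 0.0)
GRAVITY = (0.0, -9.81, 0.0)


def forward_axis(root_first, root_last, eps=1e-6):
    """Unit forward vector from root displacement, projected onto the XZ plane."""
    dx = root_last[0] - root_first[0]
    dz = root_last[2] - root_first[2]   # the Y component is dropped by the projection
    norm = math.hypot(dx, dz)
    if norm < eps:
        return (0.0, 0.0, 1.0)          # assumption: fallback when the root barely moves
    return (dx / norm, 0.0, dz / norm)


forward = forward_axis((0.0, 0.95, 0.0), (3.0, 0.90, 4.0))
print(forward)  # (0.6, 0.0, 0.8)
```

Dropping the vertical component before normalizing keeps the forward axis orthogonal to the fixed up axis.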

Quality checks:

Feet not below the floor (clearance ≥ −3 cm, so up to 3 cm of penetration is tolerated)
Gravity vector not inverted
Scale consistent with meters

Stage 2 — Physical State

The Physical State Builder computes explicit state representations:

Per-joint 3D positions in world frame
Per-joint velocities via finite differences
Center of mass estimation
Ground contact flags from foot height + velocity thresholds
Normalization to consistent units

Stage 5 — JEPA Dataset

The Dataset Builder creates variable-horizon transitions:

For each frame pair (t, t+Δh), encodes (x_t, Δh, x_{t+Δh})
Δh ranges from 1 frame up to the longest horizon that fits in the sequence (T − 1 for a sequence of T frames)
Saves as NPZ binary (numpy compressed) for efficient loading
Includes metadata: feature dimensions, transition count, horizon distribution
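Writing such a dataset with NumPy might look like the sketch below. The array key names (`x_t`, `delta_h`, `x_future`, and the metadata keys) are assumptions, as are the toy sizes; only the NPZ format and the kinds of metadata come from the description above.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 4))   # 6 frames, 4 features (toy sizes)

# Every (t, delta_h) pair whose future frame still lies inside the sequence.
pairs = [(t, h) for t in range(len(states)) for h in range(1, len(states) - t)]
t_idx = np.array([t for t, _ in pairs])
horizons = np.array([h for _, h in pairs])

path = os.path.join(tempfile.mkdtemp(), "jepa_dataset.npz")
np.savez_compressed(
    path,
    x_t=states[t_idx],
    delta_h=horizons,
    x_future=states[t_idx + horizons],
    feature_dim=states.shape[1],
    transition_count=len(pairs),
    horizon_counts=np.bincount(horizons),   # index h holds the number of samples with that horizon
)

with np.load(path) as data:
    loaded_count = int(data["transition_count"])
    loaded_shape = data["x_t"].shape
```

Short horizons dominate the horizon distribution because more frame pairs fit them: here there are five samples with Δh = 1 and only one with Δh = 5.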

Stage 7 — Planning

After the world model is trained, the Motion Planner operates in latent space:

Phase detection from root velocity:

if velocity < mean − std  →  "static"
if velocity > mean + std  →  "dynamic"
else                      →  "transition"
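A runnable version of the rule above, as a sketch. It assumes the mean and standard deviation are taken over the whole sequence and that the population standard deviation is used; the post specifies neither.

```python
import statistics


def label_phases(speeds):
    """Label each frame by comparing root speed to the mean minus/plus one std."""
    mean = statistics.fmean(speeds)
    std = statistics.pstdev(speeds)   # assumption: population standard deviation
    labels = []
    for speed in speeds:
        if speed < mean - std:
            labels.append("static")
        elif speed > mean + std:
            labels.append("dynamic")
        else:
            labels.append("transition")
    return labels


speeds = [0.0, 0.0, 0.5, 1.0, 1.0]   # root speed in m/s, toy values
phases = label_phases(speeds)
print(phases)  # ['static', 'static', 'transition', 'dynamic', 'dynamic']
```

Because the thresholds are relative to each sequence, a recording with constant speed has zero standard deviation and is labeled "transition" throughout.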

Key frame identification:

Frame 0 and final frame (always included)
Phase transition frames (where phase label changes)
Fallback: 5 evenly-spaced keyframes if no transitions detected
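The keyframe rules can be sketched as follows. `find_keyframes` is a hypothetical name, and having the five fallback keyframes include the first and last frame is an assumption.

```python
def find_keyframes(phases, fallback_count=5):
    """Return sorted frame indices: endpoints plus every phase change."""
    if not phases:
        return []
    last = len(phases) - 1
    transitions = [i for i in range(1, len(phases)) if phases[i] != phases[i - 1]]
    if transitions:
        return sorted({0, last, *transitions})
    # Fallback: evenly spaced indices, which already include frame 0 and the last frame.
    return sorted({round(k * last / (fallback_count - 1)) for k in range(fallback_count)})


with_changes = find_keyframes(["static", "static", "transition", "dynamic", "dynamic"])
no_changes = find_keyframes(["static"] * 21)
print(with_changes)  # [0, 2, 3, 4]
print(no_changes)    # [0, 5, 10, 15, 20]
```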

The planner generates motion variants by perturbing latent states at keyframes and rolling forward through the predictor. A physics filter validates each variant against physical constraints (velocity limits, ground penetration, acceleration bounds) before acceptance.
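A physics filter of this kind could be sketched as below for a single joint trajectory. All three limits are placeholders: the post names the constraint types but not their values, and reusing the −3 cm floor tolerance from the alignment checks is an assumption.

```python
import math

MAX_SPEED_MPS = 10.0     # assumption: placeholder velocity limit
MAX_ACCEL_MPS2 = 50.0    # assumption: placeholder acceleration bound
MIN_HEIGHT_M = -0.03     # assumption: reuses the floor tolerance from Stage 1


def _differences(samples, dt):
    """Finite differences between consecutive 3D samples."""
    return [tuple((b[i] - a[i]) / dt for i in range(3)) for a, b in zip(samples, samples[1:])]


def passes_physics_filter(positions, dt):
    """positions: list of (x, y, z) in meters for one joint over time."""
    if any(p[1] < MIN_HEIGHT_M for p in positions):
        return False                                   # ground penetration
    velocities = _differences(positions, dt)
    if any(math.hypot(*v) > MAX_SPEED_MPS for v in velocities):
        return False                                   # velocity limit
    accelerations = _differences(velocities, dt)
    return all(math.hypot(*a) <= MAX_ACCEL_MPS2 for a in accelerations)


dt = 0.1
smooth = [(0.0, 0.1, 0.0), (0.1, 0.1, 0.0), (0.2, 0.1, 0.0)]
sunk = [(0.0, -0.10, 0.0), (0.1, -0.10, 0.0), (0.2, -0.10, 0.0)]
teleport = [(0.0, 0.1, 0.0), (5.0, 0.1, 0.0), (5.0, 0.1, 0.0)]

accepted = [passes_physics_filter(traj, dt) for traj in (smooth, sunk, teleport)]
print(accepted)  # [True, False, False]
```

A full filter would run the same checks per joint on the decoded or simulated state, since the constraints are defined in physical space rather than in the latent space.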

Stage 8 — Sim Execution

Accepted variants are executed and auto-labeled:

Outcome labeling: Success/failure classification based on task-specific criteria
Correction generation: For failed variants, compute the minimal latent correction that would produce a successful outcome
Failure variant analysis: Identify which physical parameters (joint angles, velocities, timing) most strongly predict failure

Evaluation

Latent prediction accuracy (Stage 6 output):

Per-sample MSE between predicted and target latent embeddings
Horizon-bucketed MSE: error as a function of prediction horizon Δh
Rollout L2 error averaged over held-out sequences
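Horizon-bucketed MSE can be computed with a short helper like the one below. The function name and the sample format are hypothetical, and the numbers are toy values chosen to make the arithmetic easy to check, not results from the model.

```python
from collections import defaultdict


def horizon_bucketed_mse(samples):
    """samples: iterable of (delta_h, predicted_latent, target_latent)."""
    buckets = defaultdict(list)
    for delta_h, predicted, target in samples:
        mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
        buckets[delta_h].append(mse)
    return {h: sum(errors) / len(errors) for h, errors in sorted(buckets.items())}


samples = [
    (1, [0.0, 0.0], [0.0, 0.0]),
    (1, [1.0, 1.0], [0.0, 0.0]),
    (5, [2.0, 0.0], [0.0, 0.0]),
]
by_horizon = horizon_bucketed_mse(samples)
print(by_horizon)  # {1: 0.5, 5: 2.0}
```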

We evaluate at three horizons: Δh = 1 (next frame), Δh = 5 (short-term), Δh = 10 (medium-term). Error increases monotonically with horizon, as expected. The model captures local dynamics well but accumulates drift over longer predictions.

Uncertainty calibration:

MC-Dropout uncertainty correlates with actual prediction error (higher uncertainty → higher MSE)
Uncertainty is elevated during phase transitions and asymmetric movements
Low uncertainty during steady-state locomotion

Limitations

Single-subject models: Currently trained per-assessment, not across patients. Cross-subject generalization requires larger datasets and subject-conditioning.
Latent space interpretability: The 64-dimensional latent space captures dynamics but individual dimensions are not clinically interpretable without post-hoc analysis.
Planning is passthrough: The current planner passes states through without active goal-seeking. Full goal-conditioned planning in latent space is designed but not yet validated.
Limited input modalities: The model receives joint positions and velocities but not muscle activations, EMG, or force plate data. Clinical-grade predictions may require these additional signals.
Short training sequences: Clinical assessments are 10–60 seconds. The world model sees limited diversity per training session.

Future work

Cross-subject pretraining: Train a foundation model on aggregated clinical data, fine-tune per patient
Goal-conditioned planning: Specify target poses or functional outcomes, plan trajectories in latent space
Hierarchical JEPA: Multi-scale latent representations for whole-body dynamics (coarse) and per-joint dynamics (fine)
Integration with reinforcement learning: Use the world model as a simulator for rehabilitation policy learning
Real-time prediction: On-device inference for live feedback during assessments

References

[1] LeCun, Y. “A Path Towards Autonomous Machine Intelligence.” OpenReview, 2022.
[2] Assran, M. et al. “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.” CVPR 2023.
[3] Grill, J.-B. et al. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” NeurIPS 2020.
[4] Loper, M. et al. “SMPL: A Skinned Multi-Person Linear Model.” SIGGRAPH Asia 2015.
[5] Pavllo, D. et al. “3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training.” CVPR 2019.
[6] Lugaresi, C. et al. “MediaPipe: A Framework for Building Perception Pipelines.” CVPR Workshop, 2019.

If you are working on motion capture for Physical AI and want access to this pipeline, join the beta.