
November 2025 · Research

JEPA world model for human motion

Early exploration, Nov 2025. Before narrowing to Physical AI infrastructure, we were mapping motion capture pipelines and human movement dynamics. This post is from that period.

We trained a Joint-Embedding Predictive Architecture (JEPA) that learns the dynamics of human movement from clinical motion capture data. Given a person’s current physical state, the model predicts their future state in a learned latent space rather than reconstructing the raw observations. This enables latent-space motion planning, physics-filtered simulation variants, and uncertainty-aware clinical assessment.

This post describes the architecture, training procedure, and how the world model fits into a 10-stage simulation pipeline that takes raw skeleton tracking from an iPad and produces physics-validated motion predictions.


Why a world model for clinical motion?

Clinical rehabilitation assessment requires understanding not just what a patient did, but what they could have done. A clinician watches a patient walk and mentally simulates: “If their knee flexion were 5 degrees better, would the gait cycle normalize?” This counterfactual reasoning is the gap between measurement and clinical insight.

We built a world model that learns the dynamics of clinical movement (gait, balance, sit-to-stand, squat) from real patient data captured via iPad body tracking. The model operates in a 64-dimensional latent space where:

  • Encoding compresses a high-dimensional physical state (joint positions, velocities, contact flags) into a compact representation
  • Prediction forecasts future states at arbitrary time horizons without autoregressive rollout
  • Uncertainty quantification identifies states where the model lacks confidence. These are often the clinically interesting moments: balance loss, gait asymmetry onset, movement compensation

The architecture

We implement a custom JEPA following the self-supervised paradigm introduced by LeCun (2022)[1]: learn representations by predicting in latent space rather than pixel space.

Three networks, one objective:

The online encoder maps a physical state vector x_t to a latent embedding z_t:

Encoder: x_t → Linear(feature_dim, 128) → ReLU → Dropout(0.1) → Linear(128, 64) → z_t

The predictor takes the current latent z_t and a time horizon Δh and predicts the future latent:

Predictor: [z_t; Δh] → Linear(65, 128) → ReLU → Dropout(0.1) → Linear(128, 64) → ẑ_{t+Δh}
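The two layer stacks above can be sketched in PyTorch as follows. This is a minimal illustration, not the original implementation: the class names `Encoder` and `Predictor`, the value `feature_dim = 75`, and the choice to pass Δh as a single float column are all assumptions.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64
HIDDEN_DIM = 128


class Encoder(nn.Module):
    """Maps a physical state x_t to a latent z_t with the layer sizes listed above."""

    def __init__(self, feature_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, HIDDEN_DIM),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(HIDDEN_DIM, LATENT_DIM),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class Predictor(nn.Module):
    """Maps [z_t; dh] to a predicted future latent. Input width is 64 + 1 = 65."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 1, HIDDEN_DIM),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(HIDDEN_DIM, LATENT_DIM),
        )

    def forward(self, z: torch.Tensor, dh: torch.Tensor) -> torch.Tensor:
        # dh has shape (batch, 1); concatenate it along the feature axis.
        return self.net(torch.cat([z, dh], dim=-1))


feature_dim = 75  # hypothetical size of the physical state vector
encoder, predictor = Encoder(feature_dim), Predictor()
x_t = torch.randn(8, feature_dim)   # batch of 8 random stand-in states
dh = torch.full((8, 1), 5.0)        # predict 5 frames ahead
z_t = encoder(x_t)
z_pred = predictor(z_t, dh)
```

Because the horizon is an input, one forward pass yields a prediction for any Δh, which is what allows forecasting without stepping through intermediate frames.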

The target encoder is an exponential moving average (EMA) copy of the online encoder. It produces the training targets (the “correct” latent embeddings) without receiving any gradient signal:

z_target = TargetEncoder(x_{t+Δh})

After each gradient step:

θ_target ← τ · θ_target + (1 − τ) · θ_online     where τ = 0.996

The high τ value (0.996) means the target encoder changes very slowly. It provides a stable, slowly evolving prediction target, which helps guard against representation collapse (the failure mode where the encoder maps everything to a constant).

Loss function:

L = MSE(ẑ_{t+Δh}, z_target)

The predictor is trained to match the target encoder’s representation of the actual future state. Because no gradient reaches the target encoder, the online encoder cannot lower the loss by dragging the targets toward a trivial constant mapping; combined with the slow EMA update, this pushes it toward representations that carry real information about the dynamics.
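One training step, combining the MSE loss and the EMA update, might look like the sketch below. The helper `mlp`, the function `train_step`, the value `feature_dim = 75`, and the decision to run the target encoder in eval mode (dropout off) are assumptions; the source only specifies the layer sizes, AdamW, the learning rate, and τ.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim: int, out_dim: int = 64, hidden: int = 128) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Dropout(0.1), nn.Linear(hidden, out_dim))


feature_dim = 75                      # hypothetical state size
online = mlp(feature_dim)             # online encoder
predictor = mlp(64 + 1)               # takes [z_t; dh]
target = copy.deepcopy(online)        # EMA copy, starts identical
for p in target.parameters():
    p.requires_grad_(False)           # never receives gradients
target.eval()                         # assumption: deterministic targets

TAU = 0.996
opt = torch.optim.AdamW(
    list(online.parameters()) + list(predictor.parameters()), lr=1e-3)


def train_step(x_t, dh, x_future) -> float:
    z_pred = predictor(torch.cat([online(x_t), dh], dim=-1))
    with torch.no_grad():
        z_target = target(x_future)
    loss = F.mse_loss(z_pred, z_target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # EMA update: theta_target <- tau * theta_target + (1 - tau) * theta_online
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(TAU).add_(p_o, alpha=1.0 - TAU)
    return loss.item()


# Random stand-in batch (batch size 64, horizons 1..10), not real data.
x_t = torch.randn(64, feature_dim)
x_future = torch.randn(64, feature_dim)
dh = torch.randint(1, 11, (64, 1)).float()
loss_value = train_step(x_t, dh, x_future)
```

With τ = 0.996 the target moves only 0.4% of the way toward the online weights per step, so the regression targets stay nearly fixed from one batch to the next.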

[Figure: x_t → Encoder (online) → z_t; (z_t, Δh) → Predictor → ẑ_{t+Δh}; x_{t+Δh} → Target Encoder (EMA copy, no gradient) → z_target; MSE loss between ẑ_{t+Δh} and z_target. Gradient flows only through the Encoder and Predictor (solid boxes).]
The predictor learns to match the target encoder’s representation of the actual future state.

What goes into a “physical state”

The input feature vector x_t is not raw joint positions. It is a structured physical state built by the pipeline’s Physical State Builder:

  • Joint positions: 3D coordinates of tracked body joints (root, hips, knees, ankles, shoulders, elbows, wrists, spine, head) in world frame
  • Joint velocities: Finite-difference velocities computed from adjacent frames
  • Contact flags: Binary indicators for left/right foot ground contact, estimated from foot height (< 5 cm above floor) and foot velocity
  • Phase labels: Current movement phase (stance/swing for gait, descent/bottom/ascent for squat, static/unstable for balance)

The Physical State Builder normalizes all values to consistent units (meters, m/s, radians) and aligns them to a world frame with:

  • Origin at room center
  • Floor at Y = 0
  • Forward direction computed from first-to-last root joint displacement
  • Gravity vector: [0, −9.81, 0]
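To make the feature layout concrete, here is a rough NumPy sketch of how positions, finite-difference velocities, and contact flags could be packed into one vector per frame. The function names, the 60 fps frame rate, and the 0.3 m/s foot-speed threshold are assumptions (the source gives only the 5 cm height threshold), and phase labels are left out for brevity.

```python
import numpy as np


def finite_difference_velocity(positions: np.ndarray, fps: float) -> np.ndarray:
    """positions: (T, J, 3) in meters. Returns velocities (T, J, 3) in m/s."""
    vel = np.zeros_like(positions)
    vel[1:] = (positions[1:] - positions[:-1]) * fps  # backward difference
    vel[0] = vel[1] if len(positions) > 1 else 0.0    # pad the first frame
    return vel


def contact_flags(foot_pos, foot_vel, height_thresh=0.05, speed_thresh=0.3):
    """foot_pos, foot_vel: (T, 3). The floor is the plane Y = 0."""
    low = foot_pos[:, 1] < height_thresh
    slow = np.linalg.norm(foot_vel, axis=1) < speed_thresh  # assumed threshold
    return (low & slow).astype(np.float32)


def build_state(positions, fps, left_foot, right_foot):
    """Concatenate positions, velocities, and two contact flags per frame."""
    T = positions.shape[0]
    vel = finite_difference_velocity(positions, fps)
    contacts = np.stack(
        [contact_flags(positions[:, left_foot], vel[:, left_foot]),
         contact_flags(positions[:, right_foot], vel[:, right_foot])], axis=1)
    return np.concatenate(
        [positions.reshape(T, -1), vel.reshape(T, -1), contacts], axis=1)


# Toy input: 4 frames, 3 joints; joint 1 rests near the floor, joint 2 is lifted.
positions = np.zeros((4, 3, 3))
positions[:, 1, 1] = 0.02
positions[:, 2, 1] = 0.20
state = build_state(positions, fps=60.0, left_foot=1, right_foot=2)
```

Each row of `state` holds 3 joints × 3 coordinates for positions, the same again for velocities, and two contact flags, for 20 features in this toy case.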

Training

Dataset construction: The JEPA Dataset Builder (Stage 5 of the pipeline) converts a sequence of physical states into training transitions:

For each pair of states (x_t, x_{t+Δh}) within a sequence, a training sample (x_t, Δh, x_{t+Δh}) is created, with Δh ranging from 1 frame up to the number of frames remaining after t. Variable-horizon training is critical: it forces the model to learn dynamics at multiple timescales rather than memorizing single-step transitions.
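The pairing scheme can be sketched in NumPy as below. The function name `build_transitions` and the optional `max_horizon` cap are assumptions added for illustration; the source does not say whether the real builder caps or subsamples horizons.

```python
from typing import Optional

import numpy as np


def build_transitions(states: np.ndarray, max_horizon: Optional[int] = None):
    """states: (T, F). Returns (x_t, dh, x_future) over all valid (t, dh) pairs."""
    T = states.shape[0]
    if T < 2:
        raise ValueError("need at least two frames to form a transition")
    H = T - 1 if max_horizon is None else min(max_horizon, T - 1)
    src, horizon = [], []
    for dh in range(1, H + 1):
        src.append(np.arange(T - dh))        # every t with t + dh < T
        horizon.append(np.full(T - dh, dh))
    src = np.concatenate(src)
    horizon = np.concatenate(horizon)
    return states[src], horizon.astype(np.float32), states[src + horizon]


# Toy sequence of 5 frames whose first feature equals 2 * frame index.
states = np.arange(10, dtype=np.float32).reshape(5, 2)
x_t, dh, x_future = build_transitions(states)
```

Without a cap, a sequence of T frames yields T(T − 1)/2 transitions: 10 for this toy case, 19,900 for 200 frames, and 499,500 for 1000 frames.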

Training parameters:

Parameter | Value
Latent dimension | 64
Hidden dimension | 128
Dropout | 0.1
EMA τ | 0.996
Optimizer | AdamW
Learning rate | 1e-3
Batch size | 64
Epochs | 50

Training on a single clinical assessment (200–1000 frames) takes under 60 seconds on a single GPU. The model checkpoint stores encoder_state_dict, predictor_state_dict, target_encoder_state_dict, training_config, and loss_curve.

Uncertainty quantification via MC-Dropout

At inference time, we estimate prediction uncertainty using Monte Carlo Dropout:

  1. Keep dropout enabled (both encoder and predictor in .train() mode)
  2. Run 10 forward passes with different dropout masks
  3. Compute per-timestep standard deviation across the 10 predictions
  4. Report mean standard deviation as the uncertainty estimate
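The four steps above can be sketched as follows. This is one reading of the procedure: the standard deviation is taken across passes for each latent dimension, averaged over dimensions to get a per-timestep value, then averaged over timesteps. The helper names and `feature_dim = 75` are assumptions.

```python
import torch
import torch.nn as nn


def mlp(in_dim: int, out_dim: int = 64, hidden: int = 128) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Dropout(0.1), nn.Linear(hidden, out_dim))


@torch.no_grad()
def mc_dropout_predict(encoder, predictor, x, dh, n_passes: int = 10):
    # Step 1: train mode keeps dropout active (modules stay in train mode after).
    encoder.train()
    predictor.train()
    # Step 2: repeated stochastic forward passes, each with a fresh dropout mask.
    samples = torch.stack([
        predictor(torch.cat([encoder(x), dh], dim=-1)) for _ in range(n_passes)
    ])                                            # (n_passes, T, 64)
    mean_pred = samples.mean(dim=0)               # (T, 64)
    # Step 3: spread across passes, averaged over latent dimensions per timestep.
    per_timestep = samples.std(dim=0).mean(dim=-1)  # (T,)
    # Step 4: a single scalar for the whole sequence.
    return mean_pred, per_timestep, per_timestep.mean().item()


feature_dim = 75                        # hypothetical state size
encoder, predictor = mlp(feature_dim), mlp(64 + 1)
x = torch.randn(20, feature_dim)        # 20 random stand-in frames
dh = torch.full((20, 1), 5.0)
mean_pred, per_timestep, uncertainty = mc_dropout_predict(encoder, predictor, x, dh)
```

Keeping `per_timestep` alongside the scalar is useful in practice, since it shows which frames drive the overall uncertainty.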

High uncertainty tends to flag states that are poorly covered by the training data. These are often clinically meaningful moments:

  • Transition between movement phases (stance → swing in gait)
  • Balance recovery after perturbation
  • Asymmetric movements (favoring one leg)
  • Novel movement patterns not seen in training

The 10-stage pipeline

The world model is Stage 6 of a 10-stage simulation pipeline that transforms raw iPad body tracking into physics-validated motion predictions.

Pipeline overview

Stage | Name | Input | Output
0 | Adapter | Raw skeleton data | Standardized states + metadata
1 | Alignment | States + environment scan | World-frame aligned states
2 | Physical State | Aligned states | Explicit physical states (positions, velocities, contacts)
3 | Physics Labels | Adapter metadata | Normalized failure/phase labels
4 | Sim Replay | States + environment | Replay validation log
5 | JEPA Dataset | Physical states + labels | NPZ binary training dataset
6 | JEPA World Model | Training dataset | Latent embeddings + model checkpoint
7 | Planning | Explicit states + JEPA model | Motion variants (planned trajectories)
8 | Sim Execution | Accepted variants | Outcome labels + corrections
9 | Task Pack | All artifacts | Knowledge extraction for downstream use

Each stage produces artifacts that are persisted and auditable. After the ten stages (0–9) complete, a separate audit pass, labeled Stage 10, verifies hash consistency across all artifacts.
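A hash-consistency audit of this kind can be approximated with a manifest of content digests. The sketch below is an assumption about how such a check could work, not a description of the actual audit: SHA-256, the function names, and the artifact file names are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large artifacts need little memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(artifact_dir: Path) -> dict:
    """Record one digest per artifact file at the time it is persisted."""
    return {p.name: sha256_of(p)
            for p in sorted(artifact_dir.iterdir()) if p.is_file()}


def audit(artifact_dir: Path, manifest: dict) -> list:
    """Return names of artifacts that are missing or whose content changed."""
    bad = []
    for name, expected in manifest.items():
        p = artifact_dir / name
        if not p.exists() or sha256_of(p) != expected:
            bad.append(name)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "states.json").write_text('{"frames": 3}')    # hypothetical artifact
    (root / "dataset.npz").write_bytes(b"\x00\x01\x02")   # hypothetical artifact
    manifest = build_manifest(root)
    clean = audit(root, manifest)                          # nothing changed yet
    (root / "states.json").write_text('{"frames": 4}')    # simulate tampering
    tampered = audit(root, manifest)
```

Here `clean` comes back empty and `tampered` lists only the modified file, which is the property an audit pass needs.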

Stage 0 — Adapter

Three adapter types extract standardized state representations from different data sources:

  • VideoPose3DAdapter: Processes 3D skeleton sequences from the VideoPose3D pipeline (MediaPipe 2D → temporal 3D lifting → LiDAR depth fusion)
  • RoboticsAdapter: Processes skeleton sequences from the robotics pipeline (cleaned, normalized, phase-detected)
  • AssessmentAdapter: Processes assessment scores and metrics directly

All adapters output an AdapterResult containing:

  • states: List of per-frame state dictionaries
  • metadata: Failures, phases, contacts, assessment type
  • source_info: Provenance tracking
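As a rough picture of this container, the three fields could be expressed as a dataclass. Only the field names come from the description above; the types and every key inside the example dictionaries are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AdapterResult:
    # One dictionary per frame; the keys inside are adapter-specific.
    states: List[Dict[str, Any]]
    # Failures, phases, contacts, assessment type.
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Provenance: where the data came from and how it was produced.
    source_info: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example instance; all keys and values are illustrative only.
result = AdapterResult(
    states=[{"frame": 0, "root": [0.0, 0.9, 0.0]}],
    metadata={"assessment_type": "gait", "phases": ["stance"]},
    source_info={"adapter": "VideoPose3DAdapter"},
)
```

A shared result type like this lets later stages stay agnostic about which of the three adapters produced the data.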

Stage 1 — Alignment

The Alignment Builder registers the motion data to the physical environment:

Floor plane detection scores candidate planes from the environment scan:

  • +2.0 for horizontal alignment
  • +3.0 for “floor” or “ground” classification
  • Penalty proportional to center height (prefer the lowest horizontal plane)
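The scoring rule can be illustrated with a short sketch. The +2.0 and +3.0 bonuses come from the list above; the `Plane` type, the cosine threshold used to decide "horizontal", and the penalty coefficient of 1.0 per meter are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Plane:
    normal: Tuple[float, float, float]  # unit normal in the scan frame
    center_height: float                # height of the plane center, meters
    label: str                          # classification from the scan


def score_plane(plane: Plane, horizontal_cos: float = 0.95,
                height_penalty: float = 1.0) -> float:
    score = 0.0
    # Horizontal if the normal is nearly parallel to the up axis (Y).
    if abs(plane.normal[1]) >= horizontal_cos:
        score += 2.0
    if plane.label.lower() in {"floor", "ground"}:
        score += 3.0
    # Higher planes lose more, so the lowest horizontal plane is preferred.
    score -= height_penalty * plane.center_height
    return score


# Hypothetical candidates from a room scan.
candidates = [
    Plane((0.0, 1.0, 0.0), 0.0, "floor"),
    Plane((0.0, 1.0, 0.0), 0.75, "table"),
    Plane((1.0, 0.0, 0.0), 1.2, "wall"),
]
best = max(candidates, key=score_plane)
```

In this example the floor scores 5.0, the table 1.25, and the wall −1.2, so the floor is selected even though the table is also horizontal.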

World frame construction:

  • Origin at room center or floor plane intersection
  • Up axis: fixed [0, 1, 0]
  • Forward axis: computed from first-to-last root joint displacement, projected onto the horizontal plane
  • Gravity: [0, −9.81, 0]
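The forward-axis computation can be sketched as a projection followed by normalization. The function names, the fallback direction for a subject who barely moves, and the right-handed convention (right = up × forward) are assumptions.

```python
import numpy as np

UP = np.array([0.0, 1.0, 0.0])
GRAVITY = np.array([0.0, -9.81, 0.0])


def forward_axis(root_positions: np.ndarray) -> np.ndarray:
    """root_positions: (T, 3). First-to-last displacement, flattened onto the floor."""
    d = root_positions[-1] - root_positions[0]
    d = d - np.dot(d, UP) * UP            # remove the vertical component
    norm = np.linalg.norm(d)
    if norm < 1e-6:
        # Assumed fallback when the subject barely moves (e.g. a balance test).
        return np.array([0.0, 0.0, 1.0])
    return d / norm


def world_rotation(root_positions: np.ndarray) -> np.ndarray:
    """Rows are the (right, up, forward) axes; use R @ (p - origin) to align a point."""
    fwd = forward_axis(root_positions)
    right = np.cross(UP, fwd)             # assumed right-handed frame
    return np.stack([right, UP, fwd])


# Toy root trajectory: walking along +X while the hips rise slightly.
root = np.array([[0.0, 0.90, 0.0], [1.0, 0.95, 0.0], [2.0, 1.00, 0.0]])
R = world_rotation(root)
```

Projecting before normalizing matters: without it, any vertical drift of the root joint would tilt the forward axis out of the floor plane.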

Quality checks:

  • Feet above floor (clearance ≥ −3 cm)
  • Gravity vector not inverted
  • Scale consistent with meters

Stage 2 — Physical State

The Physical State Builder computes explicit state representations:

  • Per-joint 3D positions in world frame
  • Per-joint velocities via finite differences
  • Center of mass estimation
  • Ground contact flags from foot height + velocity thresholds
  • Normalization to consistent units

Stage 5 — JEPA Dataset

The Dataset Builder creates variable-horizon transitions:

  • For each frame pair (t, t+Δh), encodes (x_t, Δh, x_{t+Δh})
  • Δh ranges from 1 frame up to the number of frames remaining after t (at most the sequence length minus one)
  • Saves as NPZ binary (numpy compressed) for efficient loading
  • Includes metadata: feature dimensions, transition count, horizon distribution
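Persisting the transitions with their metadata could look like the sketch below, which uses NumPy's compressed NPZ writer. The array key names and the function `save_dataset` are assumptions, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np


def save_dataset(file, x_t: np.ndarray, dh: np.ndarray, x_future: np.ndarray) -> None:
    """Write transitions plus summary metadata into one compressed NPZ archive."""
    horizons, counts = np.unique(dh, return_counts=True)
    np.savez_compressed(
        file,
        x_t=x_t, dh=dh, x_future=x_future,
        feature_dim=np.int64(x_t.shape[1]),
        transition_count=np.int64(len(dh)),
        horizon_values=horizons,      # horizon distribution: distinct horizons...
        horizon_counts=counts,        # ...and how many samples each one has
    )


# Toy transitions: 3 samples with 4 features each, horizons 1, 1, and 2.
x_t = np.zeros((3, 4), dtype=np.float32)
x_future = np.ones((3, 4), dtype=np.float32)
dh = np.array([1, 1, 2], dtype=np.float32)

buf = io.BytesIO()
save_dataset(buf, x_t, dh, x_future)
buf.seek(0)
data = np.load(buf)
transition_count = int(data["transition_count"])
horizon_counts = data["horizon_counts"].tolist()
feature_dim = int(data["feature_dim"])
```

Storing the horizon histogram next to the arrays makes it cheap to check, before training, whether long horizons are underrepresented.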

Stage 7 — Planning

After the world model is trained, the Motion Planner operates in latent space:

Phase detection from root velocity:

if velocity < mean − std  →  "static"
if velocity > mean + std  →  "dynamic"
else                      →  "transition"
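A runnable version of the rule above, using only the standard library. Whether the original uses the population or the sample standard deviation is not stated; this sketch assumes the population form, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def label_phases(root_speed: List[float]) -> List[str]:
    """Label each frame by comparing root speed to mean ± one standard deviation."""
    mu = mean(root_speed)
    sigma = pstdev(root_speed)  # assumption: population standard deviation
    labels = []
    for v in root_speed:
        if v < mu - sigma:
            labels.append("static")
        elif v > mu + sigma:
            labels.append("dynamic")
        else:
            labels.append("transition")
    return labels


# Toy speeds in m/s: standing, accelerating, then walking.
speeds = [0.0, 0.0, 0.5, 1.0, 1.0]
phases = label_phases(speeds)
```

For these values the mean is 0.5 and the standard deviation is about 0.447, so the two stationary frames are "static", the middle frame is "transition", and the last two are "dynamic". Because the thresholds are relative to each recording, a sequence with constant speed labels every frame "transition".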

Key frame identification:

  • Frame 0 and final frame (always included)
  • Phase transition frames (where phase label changes)
  • Fallback: 5 evenly-spaced keyframes if no transitions detected
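The selection rules above can be sketched as a small function. It assumes the five fallback keyframes include the first and last frames; the source does not say, and the function name is hypothetical.

```python
from typing import List


def select_keyframes(phases: List[str], n_fallback: int = 5) -> List[int]:
    n = len(phases)
    if n == 0:
        return []
    # Frames where the phase label differs from the previous frame.
    transitions = [i for i in range(1, n) if phases[i] != phases[i - 1]]
    if transitions:
        return sorted({0, n - 1, *transitions})
    # Fallback: evenly spaced indices, endpoints included (assumption).
    k = min(n_fallback, n)
    if k == 1:
        return [0]
    return sorted({round(i * (n - 1) / (k - 1)) for i in range(k)})


with_transitions = select_keyframes(
    ["static", "static", "transition", "dynamic", "dynamic"])
no_transitions = select_keyframes(["static"] * 9)
```

The first call returns `[0, 2, 3, 4]` (both endpoints plus the two label changes); the second has no label changes and falls back to `[0, 2, 4, 6, 8]`.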

The planner generates motion variants by perturbing latent states at keyframes and rolling forward through the predictor. A physics filter validates each variant against physical constraints (velocity limits, ground penetration, acceleration bounds) before acceptance.
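A physics filter of this shape can be sketched as three threshold checks on a decoded trajectory. All numeric limits below are placeholder assumptions, as is the function name; the penetration tolerance simply reuses the −3 cm clearance figure from the Stage 1 quality checks.

```python
import numpy as np


def passes_physics_filter(positions: np.ndarray, fps: float,
                          max_speed: float = 12.0,       # m/s, assumed limit
                          max_accel: float = 100.0,      # m/s^2, assumed limit
                          max_penetration: float = 0.03  # meters below Y = 0
                          ) -> bool:
    """positions: (T, J, 3) in meters, world frame with the floor at Y = 0."""
    vel = np.diff(positions, axis=0) * fps   # (T-1, J, 3)
    acc = np.diff(vel, axis=0) * fps         # (T-2, J, 3)
    speed_ok = np.linalg.norm(vel, axis=-1).max(initial=0.0) <= max_speed
    accel_ok = np.linalg.norm(acc, axis=-1).max(initial=0.0) <= max_accel
    ground_ok = positions[..., 1].min() >= -max_penetration
    return bool(speed_ok and accel_ok and ground_ok)


# Toy variants: 5 frames, 2 joints, standing still at 1 m height.
still = np.zeros((5, 2, 3))
still[..., 1] = 1.0

sunk = still.copy()
sunk[:, 0, 1] = -0.10          # one joint 10 cm under the floor

jump = still.copy()
jump[3:, 0, 0] = 1.0           # teleports 1 m in a single frame

results = [passes_physics_filter(v, fps=60.0) for v in (still, sunk, jump)]
```

Only the stationary variant is accepted: the second violates the ground constraint and the third implies a 60 m/s joint speed.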

Stage 8 — Sim Execution

Accepted variants are executed and auto-labeled:

  • Outcome labeling: Success/failure classification based on task-specific criteria
  • Correction generation: For failed variants, compute the minimal latent correction that would produce a successful outcome
  • Failure variant analysis: Identify which physical parameters (joint angles, velocities, timing) most strongly predict failure

Evaluation

Latent prediction accuracy (Stage 6 output):

  • Per-sample MSE between predicted and target latent embeddings
  • Horizon-bucketed MSE: error as a function of prediction horizon Δh
  • Rollout L2 error averaged over held-out sequences

We evaluate at three horizons: Δh = 1 (next frame), Δh = 5 (short-term), Δh = 10 (medium-term). Error increases monotonically with horizon, as expected. The model captures local dynamics well but accumulates drift over longer predictions.
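Horizon-bucketed MSE can be computed by grouping per-sample errors by their Δh value. The sketch below uses made-up arrays purely to exercise the function; they are not measurements from the model, and the function name is an assumption.

```python
import numpy as np


def horizon_bucketed_mse(z_pred: np.ndarray, z_target: np.ndarray,
                         dh: np.ndarray, horizons=(1, 5, 10)) -> dict:
    """Mean per-sample latent MSE for each requested horizon present in dh."""
    per_sample = ((z_pred - z_target) ** 2).mean(axis=1)   # (N,)
    return {h: float(per_sample[dh == h].mean())
            for h in horizons if np.any(dh == h)}


# Synthetic check: predictions offset from the target by 1, 2, and 3 units.
dh = np.array([1, 1, 5, 5, 10, 10])
z_target = np.zeros((6, 64))
offsets = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])[:, None]
z_pred = z_target + offsets
buckets = horizon_bucketed_mse(z_pred, z_target, dh)
```

With these synthetic offsets the buckets come out as 1.0, 4.0, and 9.0, confirming that each horizon is averaged separately.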

Uncertainty calibration:

  • MC-Dropout uncertainty correlates with actual prediction error (higher uncertainty → higher MSE)
  • Uncertainty is elevated during phase transitions and asymmetric movements
  • Low uncertainty during steady-state locomotion

Limitations

  1. Single-subject models: Currently trained per-assessment, not across patients. Cross-subject generalization requires larger datasets and subject-conditioning.
  2. Latent space interpretability: The 64-dimensional latent space captures dynamics but individual dimensions are not clinically interpretable without post-hoc analysis.
  3. Planning is passthrough: The current planner passes states through without active goal-seeking. Full goal-conditioned planning in latent space is designed but not yet validated.
  4. Limited input modalities: The model receives joint positions and velocities but not muscle activations, EMG, or force plate data. Clinical-grade predictions may require these additional signals.
  5. Short training sequences: Clinical assessments are 10–60 seconds. The world model sees limited diversity per training session.

Future work

  • Cross-subject pretraining: Train a foundation model on aggregated clinical data, fine-tune per patient
  • Goal-conditioned planning: Specify target poses or functional outcomes, plan trajectories in latent space
  • Hierarchical JEPA: Multi-scale latent representations for whole-body dynamics (coarse) and per-joint dynamics (fine)
  • Integration with reinforcement learning: Use the world model as a simulator for rehabilitation policy learning
  • Real-time prediction: On-device inference for live feedback during assessments

References

  1. LeCun, Y. “A Path Towards Autonomous Machine Intelligence.” OpenReview, 2022.
  2. Assran, M. et al. “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.” CVPR 2023.
  3. Grill, J.B. et al. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” NeurIPS 2020.
  4. Loper, M. et al. “SMPL: A Skinned Multi-Person Linear Model.” SIGGRAPH Asia 2015.
  5. Pavllo, D. et al. “3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training.” CVPR 2019.
  6. Lugaresi, C. et al. “MediaPipe: A Framework for Building Perception Pipelines.” CVPR Workshop, 2019.

If you are working on motion capture for Physical AI and want access to this pipeline, join the beta.