June 2026 · Engineering

Sim-to-real is a calibration problem

Domain randomisation makes the sim-to-real gap survivable. Calibration makes it small. Nobody has built the infrastructure for the second one.

Domain randomisation works. Training a policy against randomised friction, mass, and compliance produces a policy that is robust to variance in real-world physics. The technique is well-validated and widely used for good reasons.

But robustness and accuracy are different outcomes. A policy trained against friction randomised over [0.5, 5.0] will survive on the bench. It will not tell you that the specific asset’s friction is 6.3, not 3.2, and that correcting that one value is all it takes to go from zero successes to ten. When a policy fails in deployment, the wrong parameter stays invisible, because the policy was trained against a range rather than a value. The failure becomes a Slack message.

Calibration is a different strategy. It matches simulation to the specific physical asset rather than making the policy tolerant of a wide range. When it fails, the responsible parameter is identified, the discrepancy is measured, and the patch is a single structured write. The reason calibration has not been the default is that nobody has built the infrastructure for it: no standard way to record a real-world deployment, score faithfulness by parameter, and emit a typed correction. That is the infrastructure we are building.
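To make "a typed correction" concrete, here is a minimal sketch of what a single structured write could look like. The class name `ParameterPatch`, its fields, and the parameter names are assumptions made for illustration, not the Veron schema; only the 3.2 to 6.31 friction change comes from the result described in this post.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterPatch:
    """One typed correction to one simulation parameter (hypothetical schema)."""

    asset_id: str
    parameter: str
    old_value: float
    new_value: float
    observation_id: str  # the deployment record that justified the change

    def apply(self, params: dict[str, float]) -> dict[str, float]:
        """Return a patched copy; refuse if the sim no longer holds old_value."""
        current = params[self.parameter]
        if current != self.old_value:
            raise ValueError(
                f"{self.parameter} is {current}, expected {self.old_value}"
            )
        return {**params, self.parameter: self.new_value}


# "lid_mass" and its value are illustrative placeholders, not measured data.
sim_params = {"trigger_friction": 3.2, "lid_mass": 0.18}
patch = ParameterPatch("coffee_maker", "trigger_friction", 3.2, 6.31, "run-0001")
patched = patch.apply(sim_params)
print(patched["trigger_friction"])
```

Because the patch carries the value it expects to replace, applying it twice (or against a simulation that has drifted) fails loudly instead of silently overwriting.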


The proof

We ran a coffee maker through Veron with a deliberate friction mismatch: trigger friction set to twice the calibrated value, simulating the kind of divergence that accumulates between asset creation and real deployment. Baseline result: zero from ten trials completed the task. The policy failed every time, for the same reason, in a way that would usually be diagnosed as “sim-to-real gap” and addressed with more training data or more randomisation.

Instead, Veron routed the failure deterministically to the responsible parameter, patched friction from 3.2 to 6.31, and ran the evaluation again. Post-patch: ten from ten. Faithfulness score moved from 0.00 to 0.97. The policy did not change. The physics did.

More training data does not fix a wrong friction value. Domain randomisation makes it survivable, at the cost of masking the exact failure that mattered. Calibration fixes it.


Why this matters now

Two trends are converging that make structured calibration urgent rather than merely useful.

Generative world models, including Cosmos and GR00T-Dreams, are getting genuinely good at producing physically plausible simulation from a single image or description. The output looks right. The question these models cannot answer on their own is whether it is right. A world model that generates a lid that opens freely when the real lid requires 6.3 N of trigger force will train a robot to fail. What those models need is a structured oracle: calibrated physics parameters that can verify whether a generated simulation matches physical reality. Without that layer, world models generate unverifiable dreams.

On the other side, robot fleets are scaling. Deployments are generating enormous volumes of failure data. That data sits in object storage, unstructured, and the diagnoses live in Slack messages. The missing layer is not more data; it is the routing logic that turns “the robot failed” into “trigger friction was wrong” and the calibration logic that turns that diagnosis into a patch.

We are not a data plane. We are not a world-model trainer. We are the substrate that connects them: the layer where simulation fidelity is validated, patched, and made auditable.


A few design decisions worth naming

The router that produced the coffee maker result is deterministic. It scores four layers of faithfulness (asset, robot, policy, scene) against what simulation predicted, and routes to the layer that dropped below threshold. There is no learned model in that path. We made this choice because a learned router cannot be audited, and auditability is the point. Every patch is reversible. Every decision traces back to the observation that produced it.
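A deterministic router of this kind can be sketched in a few lines. The threshold of 0.8, the tie-breaking rule (lowest score first, then fixed layer order), and the example scores are assumptions for illustration; the post only specifies the four layers and that routing goes to the layer that dropped below threshold.

```python
from __future__ import annotations

# The four layers named in the post, in a fixed order used for tie-breaking.
LAYERS = ("asset", "robot", "policy", "scene")


def route(scores: dict[str, float], threshold: float = 0.8) -> str | None:
    """Return the failing layer, or None if every layer meets the threshold.

    No learned component: the same scores always produce the same route.
    """
    failing = [
        (scores[layer], index, layer)
        for index, layer in enumerate(LAYERS)
        if scores[layer] < threshold
    ]
    if not failing:
        return None
    # Lowest score wins; equal scores fall back to the fixed layer order.
    return min(failing)[2]


# Illustrative scores only, not measured values.
before_patch = {"asset": 0.00, "robot": 0.95, "policy": 0.90, "scene": 0.92}
after_patch = {"asset": 0.97, "robot": 0.95, "policy": 0.90, "scene": 0.92}
print(route(before_patch))
print(route(after_patch))
```

The whole decision is a comparison and a sort, so any routing outcome can be replayed from the recorded scores alone.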

We also patch one parameter at a time, from the highest-confidence observation. Covariance between friction and compliance is real; patching both simultaneously from a single deployment run is speculation dressed as inference. One probe, one patch, re-evaluate.
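The one-patch rule can be sketched as a single step: take the highest-confidence observation, change only that parameter, and leave everything else for the next probe. The `Observation` type, its fields, and the compliance numbers are hypothetical and exist only to illustrate the selection logic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """An estimate of one parameter from a deployment run (hypothetical type)."""

    parameter: str
    estimated_value: float
    confidence: float  # assumed to lie in [0, 1]


def calibration_step(
    params: dict[str, float], observations: list[Observation]
) -> tuple[dict[str, float], Observation | None]:
    """Patch exactly one parameter, chosen by highest confidence."""
    if not observations:
        return dict(params), None
    best = max(observations, key=lambda obs: obs.confidence)
    # Lower-confidence estimates are ignored until the next evaluation.
    return {**params, best.parameter: best.estimated_value}, best


# Compliance values and both confidences are illustrative placeholders.
params = {"trigger_friction": 3.2, "trigger_compliance": 0.05}
observations = [
    Observation("trigger_friction", 6.31, confidence=0.9),
    Observation("trigger_compliance", 0.08, confidence=0.4),
]
new_params, applied = calibration_step(params, observations)
print(new_params)
```

After the step, the evaluation would be re-run before any further observation is considered, so a correlated change in compliance is never inferred from the same run that justified the friction patch.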

The LLM in the system translates tickets into plain English. It does not make routing decisions. That separation is non-negotiable.


The technical specification (SDK, four-layer DNA schema, FailureTicket format, and current build status) is documented on the Veron product page.

If you are running sim-to-real and your failures are currently Slack messages, get in touch.