Capturing failure as a first-class artifact

Most asset pipelines treat failure as an absence: a test that didn’t pass, a bundle that didn’t ship. We started treating failure as a presence: a typed record of what went wrong, where, and how to correct it. The result was an asset refinement loop that converged in a mean of 2.5 iterations, where random jitter used nearly all twelve it was allowed, and an asset pipeline that started telling us what it needed.

This post walks through the experiment.

The setup

Refinement loops in the assetRefinery pipeline traditionally operated on spec deltas: small parameter jitters applied at random until the evaluator stopped complaining. The pattern is familiar from any optimization-by-perturbation system. It works. It is also slow, and it spends most of its iterations editing parameters that are already correct.

We had a hypothesis from the physically-embodied Gaussian splatting research brief: if the refinement loop had access to structured failure signals instead of pass/fail flags, it could route each correction to the parameter that actually mattered.

The hypothesis was small enough to test in an afternoon.

The schema

We defined a FailureSignals dataclass with five fields:

geometry_mismatch_mm: signed displacement of a part from its target
joint_clearance_fail: boolean, joint cannot articulate without collision
contact_equilibrium_fail: boolean, surfaces overlap at rest
reachability_fail: boolean, manipulation point is unreachable
grasp_instability: boolean, grasp annotation doesn’t survive a small perturbation

This is not the only schema we considered. It is the smallest one that covers the four scenarios from the brief. Every field maps to a measurement an existing GA2.5 or family-evaluator pass already computes. The work wasn’t in defining new measurements; the work was in publishing the existing measurements as a typed artifact.

We resisted the urge to make this schema bigger. Five fields, all named after a physical failure mode, all derivable from existing checks.
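
As a rough illustration, the five fields listed above could be written as a standard-library dataclass. This is a minimal sketch, not the schema as shipped: the defaults, the frozen flag, and the passed() helper are assumptions added for illustration, and the 0.5 mm tolerance is borrowed from the threshold the corrective policy uses.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FailureSignals:
    """Typed record of what an evaluator pass found wrong with one asset."""

    geometry_mismatch_mm: float = 0.0       # signed displacement from target
    joint_clearance_fail: bool = False      # joint collides when articulated
    contact_equilibrium_fail: bool = False  # surfaces overlap at rest
    reachability_fail: bool = False         # manipulation point unreachable
    grasp_instability: bool = False         # grasp fails under perturbation

    def passed(self, tolerance_mm: float = 0.5) -> bool:
        """Hypothetical helper: True when no failure mode is active."""
        return abs(self.geometry_mismatch_mm) <= tolerance_mm and not (
            self.joint_clearance_fail
            or self.contact_equilibrium_fail
            or self.reachability_fail
            or self.grasp_instability
        )


# One record with two active failure modes, serialized as a plain dict.
signals = FailureSignals(geometry_mismatch_mm=-3.0, joint_clearance_fail=True)
record = asdict(signals)
```

Serializing through asdict keeps the record a flat mapping of field name to value, which is the shape a downstream tool would index.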

The corrective policy

For each named failure mode, we wrote one bounded parameter edit. Not a learned policy, not a search procedure, not a controller. Just a deterministic mapping:

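from dataclasses import asdict

def corrective_step(p: AssetParams, s: FailureSignals) -> AssetParams:
    # Work on a copy so the caller's parameters are never mutated.
    out = AssetParams(**asdict(p))
    # Geometry mismatch beyond 0.5 mm: keep 30% of the trigger offset.
    if abs(s.geometry_mismatch_mm) > 0.5:
        out.trigger_offset_mm = p.trigger_offset_mm * 0.3
        # Blend an out-of-range trigger size 30% of the way toward 8 mm.
        if not (6.0 <= p.trigger_size_mm <= 10.0):
            out.trigger_size_mm = 0.7 * p.trigger_size_mm + 0.3 * 8.0
    # Joint collision: enforce a 0.5 mm floor, then add 0.3 mm.
    if s.joint_clearance_fail:
        out.safety_lock_clearance_mm = max(p.safety_lock_clearance_mm, 0.5) + 0.3
    # Surfaces overlap at rest: remove the overlap entirely.
    if s.contact_equilibrium_fail:
        out.top_bridge_overlap_mm = 0.0
    # Unstable grasp: cut placement error by 0.8 mm, never below zero.
    if s.grasp_instability:
        out.part_placement_error_mm = max(0.0, p.part_placement_error_mm - 0.8)
    return out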

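A geometry mismatch pulls the trigger offset 70% toward zero and, when the trigger size falls outside 6 to 10 mm, blends it 30% of the way toward 8 mm. A joint clearance failure sets the safety-lock clearance 0.3 mm above the larger of its current value and a 0.5 mm floor. A contact equilibrium failure zeroes out the overlap. A grasp instability shrinks the part placement error by 0.8 mm, clamped at zero. The policy has no branch for reachability_fail.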

Every edit is bounded. None of them overshoot. The reason is exactly the brief’s recommendation: map named failure modes to bounded parameter edits, then see how far that gets you.
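
To make "bounded" concrete, the geometry edit can be traced by hand. The sketch below assumes, purely for illustration, that the measured geometry_mismatch_mm equals the remaining trigger offset; the function name and that equivalence are not from the harness. The 0.3 factor and the 0.5 mm threshold are the ones in corrective_step.

```python
def offset_trace(offset_mm: float, tolerance_mm: float = 0.5,
                 keep_fraction: float = 0.3, max_edits: int = 12) -> list[float]:
    """Offsets produced by repeating the geometry edit until within tolerance."""
    trace = [offset_mm]
    # Each edit keeps keep_fraction of the previous offset, so the magnitude
    # shrinks monotonically and the sign never flips (no overshoot).
    while abs(trace[-1]) > tolerance_mm and len(trace) <= max_edits:
        trace.append(trace[-1] * keep_fraction)
    return trace


# Start from the 3 mm offset of the first scenario.
trace = offset_trace(3.0)
edits_applied = len(trace) - 1
```

Under these assumptions a 3 mm offset falls inside the 0.5 mm tolerance after two edits (3 mm, then roughly 0.9 mm, then roughly 0.27 mm), and it approaches zero from one side only.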

The trial

We ran four scenarios from the research brief:

Lighter trigger misplacement (offset 3 mm, size 5 mm)
Lighter safety-lock interference (clearance 0 mm)
Controller top_bridge overlap (overlap 2 mm)
Hybrid placement error (placement error 2.5 mm)

Each scenario was run twenty times: random seeds for the baseline jitter loop, deterministic seeds for the corrective loop. Maximum iterations: twelve.
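
The trial loop itself is small. The following is a hedged sketch of its shape, not the harness code: refine, TrialResult, and the toy evaluator are hypothetical names, the "asset" is reduced to a single offset, iterations are counted as edits applied, and the jitter magnitude is an arbitrary choice.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TrialResult:
    params: Any
    final_signals: Any
    iterations: int  # number of edits applied before stopping
    passed: bool


def refine(params: Any, evaluate: Callable, is_pass: Callable,
           step: Callable, max_iterations: int = 12) -> TrialResult:
    """Evaluate, edit, and re-evaluate until the asset passes or the cap is hit."""
    signals = evaluate(params)
    iterations = 0
    while not is_pass(signals) and iterations < max_iterations:
        params = step(params, signals)
        signals = evaluate(params)
        iterations += 1
    return TrialResult(params, signals, iterations, is_pass(signals))


# Toy stand-ins: the signal is simply the remaining offset in millimetres.
def evaluate(offset_mm: float) -> float:
    return offset_mm


def is_pass(mismatch_mm: float) -> bool:
    return abs(mismatch_mm) <= 0.5


# Corrective loop: the signal tells the step which way to move.
corrective = refine(3.0, evaluate, is_pass, step=lambda p, s: p * 0.3)

# Baseline loop: a seeded random jitter that ignores the signal.
rng = random.Random(0)
baseline = refine(3.0, evaluate, is_pass,
                  step=lambda p, s: p + rng.uniform(-0.5, 0.5))
```

Both loops share the same stopping rule and iteration cap, so the only difference between them is whether the step function reads the signal.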

The aggregate result:

Metric	Baseline (jitter)	Corrective (failure-signal)
Mean iterations to pass	11.84	2.50
Final pass rate	9%	100%

The full results live in results/2026-05-01-corrective-refinement-results.md.

What we learned

The corrective loop dominates jitter when the signal is structured. This is not surprising. It is also not the interesting finding.

The interesting finding is that the bottleneck is signal coverage, not signal strength. Every scenario maps cleanly to one parameter as long as the signal is captured at all. The risk in production is not that the corrective policy is wrong; it is that GA2.5 and the family evaluators don’t currently emit the signals in a structured form. They emit a packaged report. The structured form is one Pydantic dataclass away.

Bounded edits prevent overshoot, the recurring failure of optimization loops. None of our corrective edits ever overshot. The 100% pass rate on synthetic scenarios is partly an artifact of deterministic faults; every scenario is solvable in one edit. On real GA2.5 data, expect coupled failures, where fixing trigger size introduces a new joint clearance failure. The harness logs final_signals per trial precisely so those couplings are inspectable.
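
One way to inspect such couplings is to compare the failure modes active at the start of a trial with those in its final signals. The sketch below assumes signals are available as plain dicts of field name to value; the function names and the 0.5 mm tolerance are illustrative choices, not part of the harness.

```python
def active_failures(signals: dict, tolerance_mm: float = 0.5) -> set[str]:
    """Names of the failure modes that are active in one signals record."""
    active = set()
    for name, value in signals.items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            if value:
                active.add(name)
        elif abs(value) > tolerance_mm:  # numeric field, e.g. a mismatch in mm
            active.add(name)
    return active


def introduced_failures(initial: dict, final: dict) -> set[str]:
    """Failure modes absent at the start of a trial but present at the end."""
    return active_failures(final) - active_failures(initial)


# Hypothetical coupled trial: the geometry fix exposes a clearance failure.
initial = {"geometry_mismatch_mm": 1.2, "joint_clearance_fail": False}
final = {"geometry_mismatch_mm": 0.2, "joint_clearance_fail": True}
coupled = introduced_failures(initial, final)
```

A non-empty result flags a trial where correcting one parameter surfaced a different failure mode, which is the case the synthetic scenarios never produce.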

The synthetic 100% number is not a forecast. We say so in the results post. The metric framing is intentional: we wanted a deterministic upper bound to confirm the policy wasn’t overshooting. Real performance will be lower.

The cleanest finding from this experiment was not “the policy works.” It was “the policy is small once the signal exists.” The real work is upstream: instrumenting the existing checks to publish typed failure rows.

Why this matters beyond refinement

The same pattern (capture failure as a structured artifact) generalizes to two other places in the pipeline.

Pipeline gaps. Every individual run surfaces a small list of issues about the pipeline that built it: a missing API, a transcription error, a unit confusion. These currently live in commit messages and Slack threads. They should live in a typed pipeline_gaps.json artifact bundled with the neural object.

Transfer report failures. When the transfer solver returns “no” for a (task, robot) pair, the reason is currently a string. It should be a typed record that downstream tools can index and aggregate.
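
A typed transfer failure could be as small as the following sketch. Every name here is hypothetical, including the reason categories and the example tasks and robots; the point is only that an enum-valued reason can be counted and indexed where a free-form string cannot.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TransferFailureReason(Enum):
    # Hypothetical categories; a real solver would define its own.
    UNREACHABLE = "unreachable"
    GRASP_UNSTABLE = "grasp_unstable"
    COLLISION = "collision"


@dataclass(frozen=True)
class TransferFailure:
    task: str
    robot: str
    reason: TransferFailureReason
    detail: str = ""  # optional free text, kept alongside the typed reason


def count_by_reason(failures: list[TransferFailure]) -> Counter:
    """Aggregate failures by typed reason across (task, robot) pairs."""
    return Counter(f.reason for f in failures)


failures = [
    TransferFailure("press_trigger", "arm_a", TransferFailureReason.UNREACHABLE),
    TransferFailure("press_trigger", "arm_b", TransferFailureReason.UNREACHABLE),
    TransferFailure("lift_lid", "arm_a", TransferFailureReason.COLLISION),
]
counts = count_by_reason(failures)
```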

Failure as artifact is a small idea with broad reach. Once you start producing it, you stop losing the lessons.
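
For the pipeline-gaps case, a typed pipeline_gaps.json might be produced along these lines. The field names, the schema_version key, and the example entries are assumptions made for this sketch; only the three gap kinds come from the examples in the text.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PipelineGap:
    kind: str         # e.g. "missing_api", "transcription_error", "unit_confusion"
    description: str  # what went wrong, in one sentence
    stage: str = ""   # hypothetical: pipeline stage where the gap surfaced


def dump_pipeline_gaps(gaps: list[PipelineGap]) -> str:
    """Serialize gaps to the JSON text that would be bundled with the asset."""
    payload = {"schema_version": 1, "gaps": [asdict(g) for g in gaps]}
    return json.dumps(payload, indent=2)


gaps = [
    PipelineGap("unit_confusion", "clearance given in cm, expected mm", "import"),
    PipelineGap("missing_api", "no endpoint to query joint limits"),
]
text = dump_pipeline_gaps(gaps)
round_trip = json.loads(text)
```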

What we’re not promising

We haven’t yet wired structured signals into production GA2.5. The harness ran in an offline experiment block under services/assetRefinery/generateAssets/experiments/. The block is removable without touching the pipeline.
We haven’t measured behavior on real coupled failures. The synthetic scenarios are deliberately one-edit-solvable.
We haven’t built a learned corrective policy. We don’t plan to until the deterministic version is exhausted. The brief’s “implementation notes” recommended exactly this scope.

We will not deploy a learned controller until we’ve shipped the deterministic one and watched it fail on real data. That is the order in which lessons accumulate.

What we want from the field

The hypothesis we want tested: if you publish your own evaluator’s failures as a typed schema and route them to bounded corrections, your refinement loop will converge faster than spec-only jitter. The schema and the policy are five hundred lines together. The hard part is the institutional decision to treat failure as a first-class artifact.

If your team has done this in any form (robotics asset pipeline, scene-graph construction, sim authoring), we want to compare schemas.

Next: Three trials we ran, and what they actually told us: a meta-post on the experiment harness behind this series.