Three trials we ran, and what they actually told us

We pulled three recent papers off arXiv and asked, for each: what happens if we run the smallest faithful version of this idea against our pipeline today? The papers were Physically Embodied Gaussian Splatting (corrective refinement), REArtGS (articulated reconstruction), and Splat-MOVER (affordance-aware assets). The trials took an afternoon each. The findings, in two cases, surprised us.

This post is the meta-view. The individual trial writeups live in experiments/results/.

The shape of the experiment harness

The harness is intentionally small. Three additive Python scripts under services/assetRefinery/generateAssets/experiments/, none of them wired into the production pipeline. Every script takes an existing asset bundle as input, computes a deterministic metric, and writes a structured result.

Deleting the experiments folder has no effect on the running pipeline. This was a deliberate constraint. We wanted the experiments to be cheap to abandon if the findings were null.

services/assetRefinery/generateAssets/experiments/
├── __init__.py
├── experiment_corrective_refinement.py
├── experiment_articulated_recon.py
└── experiment_affordance_aware.py

We didn’t want any of these trials to leave a permanent footprint until they earned one. The folder is the hedge.
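The contract described above (existing bundle in, deterministic metric computed, structured result out) can be sketched roughly as follows. This is a minimal illustration, not one of the three scripts: the `ExperimentResult` fields, the `run_experiment` name, and the placeholder file-count metric are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict, field
from pathlib import Path


@dataclass
class ExperimentResult:
    # Hypothetical result schema; the field names are illustrative only.
    experiment: str
    bundle: str
    metric_name: str
    metric_value: float
    caveats: list[str] = field(default_factory=list)


def run_experiment(bundle_dir: Path, out_dir: Path) -> ExperimentResult:
    """Read an existing bundle, compute a deterministic metric, write a result."""
    # Placeholder metric: number of files in the bundle. It uses no randomness
    # and never writes into the bundle, so reruns give the same answer.
    files = sorted(p for p in bundle_dir.rglob("*") if p.is_file())
    result = ExperimentResult(
        experiment="example",
        bundle=bundle_dir.name,
        metric_name="file_count",
        metric_value=float(len(files)),
        caveats=["placeholder metric, not one of the three trials"],
    )
    # The only side effect is one JSON file under out_dir.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{result.experiment}_{result.bundle}.json"
    out_path.write_text(json.dumps(asdict(result), indent=2, sort_keys=True))
    return result
```

Because the only side effect is a result file in a separate directory, a script shaped like this can be deleted without touching anything the pipeline reads.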

Trial 1 — Corrective refinement

The hypothesis was that a refinement loop fed structured failure signals would converge faster than a spec-only jitter loop. The result was strong: mean iterations to pass dropped from 11.84 to 2.50; final pass rate rose from 9% to 100% on synthetic scenarios.

The interesting finding was upstream of the hypothesis. The bottleneck is not the corrective policy; that part is small. The bottleneck is the institutional decision to publish existing evaluator measurements as a typed artifact instead of as a packaged report.

Full writeup. Standalone post: Failure as a first-class artifact.

The headline number is real. It is also overdetermined by deterministic synthetic faults. The lesson lives in the schema, not the metric.
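To make “typed artifact” concrete, here is a rough sketch of a failure signal and a corrective loop that consumes it. Only the name `FailureSignals` comes from the trial; its fields, the toy `evaluate` function, and the 5% tolerance are assumptions for illustration. The toy evaluator knows the target values, so every fault is fixed in one edit, which is the same shortcut that makes the headline number overdetermined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureSignals:
    # Hypothetical fields; the real schema lives in the trial writeup.
    check: str        # which evaluator check failed
    parameter: str    # spec parameter the evaluator points at
    measured: float
    expected: float


def evaluate(spec: dict, target: dict, tol: float = 0.05) -> Optional[FailureSignals]:
    """Toy evaluator: return the first out-of-tolerance parameter, or None on pass."""
    for name in sorted(target):
        measured = spec.get(name, 0.0)
        expected = target[name]
        if abs(measured - expected) > tol * abs(expected):
            return FailureSignals("dimension", name, measured, expected)
    return None


def corrective_step(spec: dict, signal: FailureSignals) -> dict:
    """Apply one targeted edit to the parameter named by the signal."""
    updated = dict(spec)
    updated[signal.parameter] = signal.expected
    return updated


def refine(spec: dict, target: dict, max_iters: int = 20):
    """Loop until the evaluator passes; return the final spec and iteration count."""
    iterations = 0
    signal = evaluate(spec, target)
    while signal is not None and iterations < max_iters:
        spec = corrective_step(spec, signal)
        iterations += 1
        signal = evaluate(spec, target)
    return spec, iterations
```

The corrective policy here is a single line. Everything it needs arrives in the signal, which is the point: the work is in getting the evaluator to emit that structure at all.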

Trial 2 — REArtGS-inspired articulated reconstruction

The hypothesis was that an articulated reconstruction branch using multi-state evidence would beat both the procedural family-builder baseline and the current SF3D hybrid.

The aggregate result, on a four-scenario set:

Source	Dimension hit rate
`builder_only` (family template)	0.92
`hybrid` (current SF3D pipeline)	0.75
`reartgs_mode` (proxy)	0.83

Builder-only beat hybrid on aggregate, which was not the expected ordering. Three of the four scenarios were synthetic family stand-ins where the family template happened to match the spec: a free win for the builder. The interesting comparison is on the lighter, the one real bundle.

On the lighter alone, the hybrid pipeline scored 0.0: zero of three axes within 25% of spec. The reason was a silent failure mode: SF3D had reconstructed the lighter head and dropped the long handle. The bundle had passed every other check we had at the time. The test that caught it was parsing spec dimensions out of tier1_asset_spec.color_material_hints and comparing the reconstruction against them.
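The per-bundle score can be sketched as a fraction of axes within tolerance. The function name, the axis keys, and the dimensions below are made up for illustration (they are not the lighter's actual spec). Only the 25% tolerance and the three-axis count come from the trial.

```python
def dimension_hit_rate(measured: dict, spec: dict, tolerance: float = 0.25) -> float:
    """Fraction of spec axes whose measured size is within `tolerance` of spec."""
    if not spec:
        raise ValueError("spec must name at least one axis")
    hits = 0
    for axis, target in spec.items():
        value = measured.get(axis)
        # A missing axis counts as a miss rather than raising.
        if value is not None and abs(value - target) <= tolerance * abs(target):
            hits += 1
    return hits / len(spec)


# Illustrative numbers only: a long-handled object whose reconstruction
# kept the head and lost the handle, so every axis misses the spec.
SPEC_M = {"x": 0.03, "y": 0.04, "z": 0.25}
HEAD_ONLY_M = {"x": 0.05, "y": 0.06, "z": 0.06}

print(dimension_hit_rate(HEAD_ONLY_M, SPEC_M))  # 0.0
print(dimension_hit_rate(SPEC_M, SPEC_M))       # 1.0
```

A check like this needs only the spec and the mesh bounds, so it is cheap to run on every bundle.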

This finding became qualification test T5. The trial was, in effect, a way of generating a regression test we didn’t previously have.

The trial’s stated success criterion was “REArtGS beats baselines.” The trial’s actual gift was a previously-uncaught silent failure mode.

Full writeup. Standalone post: What we actually need from REArtGS-style reconstruction.

Trial 3 — Splat-MOVER-inspired affordance scoring

The hypothesis was that affordance-aware grasp scoring would raise the hit rate on existing grasp annotations by ≥15 percentage points.

The result was the opposite. The baseline scoring counted both grasps in the lighter bundle as hits. The affordance scoring rejected one of them: the top_down grasp, which lands in the nozzle / flame zone, a region listed in the lighter family’s forbidden_contact_regions. Hit rate fell from 1.0 to 0.5.

The trial’s framing of the metric was wrong. Affordance scoring is a gate, not a proposer. It rejects bad grasps; it doesn’t surface new good ones. To raise hit rate, you need a grasp generator producing more candidates and an affordance gate filtering them. With two grasps to score and a strict filter, the metric collapses.

The valuable finding from the trial was the forbidden-region detection itself. The top_down grasp is technically stable (confidence 0.6, medium stability) but it would burn the gripper. No baseline scorer in our pipeline could have caught this. The affordance gate did, deterministically, on the first run.
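The gate's behavior, and why the metric collapses with two candidates, can be shown in a few lines. The `Grasp` fields, the region labels, and the second grasp (`side_handle`, confidence 0.8) are hypothetical. Only the top_down grasp, its 0.6 confidence, and the forbidden nozzle / flame zone come from the trial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grasp:
    # Hypothetical annotation shape; the field names are assumptions.
    name: str
    contact_region: str
    confidence: float


def affordance_gate(grasps, forbidden_contact_regions):
    """Keep only grasps whose contact region is not forbidden.

    This filters existing candidates and never proposes new ones.
    """
    return [g for g in grasps if g.contact_region not in forbidden_contact_regions]


def hit_rate(accepted, candidates):
    """Accepted grasps as a fraction of all candidates scored."""
    return len(accepted) / len(candidates)


FORBIDDEN = {"nozzle_flame_zone"}
GRASPS = [
    Grasp("top_down", "nozzle_flame_zone", 0.6),
    Grasp("side_handle", "handle", 0.8),  # hypothetical second grasp
]

baseline = hit_rate(GRASPS, GRASPS)                          # 1.0
gated = hit_rate(affordance_gate(GRASPS, FORBIDDEN), GRASPS)  # 0.5
print(baseline, gated)
```

With a fixed candidate list, a filter can only hold the hit rate or lower it. Raising it needs more candidates upstream of the gate.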

The metric framing inverted, and the trial was still useful. We learned what to measure next time.

Full writeup.

What the three trials have in common

Each trial took a paper and asked the smallest faithful version of its question. The question was not “can we reproduce REArtGS?” but “if REArtGS-style multi-state evidence existed, would the metric we already track move?” The answer is allowed to be no, and twice, in different ways, the answer was no.

Each trial produced a regression test or a schema change. The corrective-refinement trial produced the FailureSignals schema. The articulated-reconstruction trial produced T5. The affordance trial produced the forbidden-region check. None of these were the trial’s stated objective. All of them were durable wins.

Each trial revealed a metric framing problem. The corrective trial’s 100% pass rate is a deterministic ceiling, not a forecast. The articulated trial’s hit rate is dominated by family-template lookups. The affordance trial’s hit rate inverts because the gate is not a proposer. In every case, the trial taught us how to measure the next time.

The most reliable output of an experiment is not the result. It is the corrected metric framing for the experiment after this one.

What this means for how we do experiments

Three lessons we’re codifying.

Run the smallest faithful version first. A four-hour experiment that produces a regression test is worth more than a four-week experiment that fails to reproduce a paper.

Preserve removability. The experiments folder is additive. Deleting it changes nothing. This made it cheap to run trials we expected to be null.
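One way to keep removability checkable rather than assumed is a small test that fails if anything outside the folder imports from it. The sketch below is an assumption about how such a check could look, not something the pipeline currently runs. The function names are hypothetical, and it only detects static import statements.

```python
import ast
from pathlib import Path


def imports_experiments(source: str, package: str = "experiments") -> bool:
    """Return True if the source has an import that mentions `package`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Covers both `from a.experiments import x` and `from . import experiments`.
            names = [node.module or ""] + [alias.name for alias in node.names]
        else:
            continue
        if any(package in name.split(".") for name in names):
            return True
    return False


def find_violations(root: Path, package: str = "experiments") -> list:
    """List Python files outside the experiments folder that import from it."""
    violations = []
    for path in sorted(root.rglob("*.py")):
        if package in path.relative_to(root).parts:
            continue  # the experiments folder may import itself
        if imports_experiments(path.read_text(encoding="utf-8"), package):
            violations.append(path)
    return violations
```

Run against the service root, an empty list from `find_violations` is the machine-checkable form of “deleting it changes nothing,” at least for static imports.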

Write the result honestly. Every trial writeup carries a “caveats” section that names the synthetic shortcuts, the metric framing problems, and the comparisons we’re not making. Honesty in the writeup is what makes the trial reusable.

We will keep doing this. The experiments folder is the standing pattern, not these three trials.

What is next

A multi-state benchmark for the lighter and one additional articulated family. The articulated trial’s caveat (we do not have two-state evidence on disk) is a real prerequisite for any deeper REArtGS comparison.
A grasp-generator-plus-gate evaluation. The affordance trial’s metric inverted because we only had two grasps to score. A generator producing 50 candidates per asset would let us measure the gate honestly.
A coupled-failures version of the corrective trial. The synthetic scenarios were one-edit-solvable. The real GA2.5 distribution is not.

Next: What we actually need from REArtGS-style articulated reconstruction: a sharper read of the second trial, with the benchmark we want the field to build.