What we actually need from REArtGS-style articulated reconstruction

REArtGS is the kind of paper that the assetRefinery pipeline was built to absorb. Multi-state evidence, motion-aware constraints, joint inference grounded in geometry rather than priors; the techniques map directly onto the gaps we know we have. We ran a small trial, took it as far as the data on disk would allow, and came away with a sharper understanding of what we’re missing.

The unmet need is not a better reconstruction algorithm. It is a benchmark.

This post argues for one.

What we measured

The trial scored three sources of body geometry against measured spec dimensions parsed from the asset’s tier1_asset_spec.color_material_hints:

builder_only: the family template’s nominal extents
hybrid: the current SF3D-plus-procedural-overlay output
reartgs_mode: a proxy that rescaled the hybrid output toward the spec, standing in for “what multi-state evidence would tell us if we had it”

The aggregate hit rates were 0.92 for builder_only, 0.75 for hybrid, and 0.83 for reartgs_mode. The proxy’s number is meaningful as an upper bound, not as a measurement of REArtGS itself. We say so explicitly in the trial writeup.
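To make the metric concrete, here is a minimal sketch of how a dimension hit rate of this kind can be computed. It counts a dimension as a hit when the predicted extent falls within a relative tolerance of the spec value. The function name, the 10% tolerance, and the example numbers are assumptions for illustration; they are not the trial's actual settings or data.

```python
def dimension_hit_rate(predicted_cm, spec_cm, rel_tol=0.10):
    """Fraction of spec dimensions matched within a relative tolerance.

    predicted_cm, spec_cm: dicts mapping an axis name to an extent in cm.
    rel_tol is a hypothetical threshold, not the trial's actual setting.
    """
    if not spec_cm:
        raise ValueError("spec_cm must contain at least one dimension")
    hits = 0
    for axis, target in spec_cm.items():
        value = predicted_cm.get(axis)
        # A dimension with no prediction counts as a miss.
        if value is not None and abs(value - target) <= rel_tol * abs(target):
            hits += 1
    return hits / len(spec_cm)


# Made-up numbers, for illustration only.
spec = {"x": 2.5, "y": 1.2, "z": 8.0}
hybrid = {"x": 2.6, "y": 1.5, "z": 7.9}
rate = dimension_hit_rate(hybrid, spec)  # x and z hit, y misses
```

Note that nothing in this computation looks at a joint: a source of geometry can score well here with an arbitrarily wrong axis.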

The interesting result was on the joint side. Joint type accuracy was 100% across the test set, but only because every scenario was annotated with its family, and the family priors do the work. Without family hints, principal-axis PCA on the lighter mesh predicts the wrong joint type. Joint axis alignment, by contrast, was off by a mean of 18.4°, and the lighter alone contributed all of that error.

Family priors solve joint type for free. Joint axis is where the real reconstruction work lives.
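For readers unfamiliar with the principal-axis step, the sketch below extracts the dominant PCA direction of a 3D point set using power iteration on the covariance matrix. It is a stdlib-only illustration under our own assumptions (function name, iteration count, toy points), and it covers only the axis extraction; how that axis is mapped to a joint type is not shown here.

```python
import math


def principal_axis(points, iterations=100):
    """Dominant PCA direction of a 3D point set, via power iteration.

    Illustrative only: the fixed start vector would fail in the rare case
    where it is exactly orthogonal to the dominant direction.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate point set")
        v = [x / norm for x in w]
    return v


# Toy point set elongated along z (made-up values).
toy_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 1.0), (0.0, 0.1, 2.0),
              (-0.1, 0.0, 3.0), (0.0, -0.1, 4.0)]
axis = principal_axis(toy_points)
```

The dominant direction describes the overall shape of the body, which need not coincide with the axis a part actually moves about. That is one way to read why shape alone is a weak signal here.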

The unmet need is not the algorithm

Two findings sharpen this claim.

Finding 1. The lighter bundle didn’t contain two-state evidence. Three input images, all of resting state. The trial couldn’t run REArtGS in the form the paper specifies (multi-state observations of an articulated object) because the data didn’t exist.

Finding 2. Even if it had, our scoring metric (dimension hit rate against spec) doesn’t depend on the joint axis. It depends on the body extents. The trial’s reartgs_mode proxy could have nailed the body and still left the joint axis at 18° error, because the body is what the metric is sensitive to.

The right experiment is not “run REArtGS on a single-state lighter.” It is “give the field a two-state benchmark with measured joint axes, and see whether REArtGS-style methods recover them.”

That benchmark doesn’t exist.

We do not need a smarter reconstruction algorithm next. We need ten objects with two observed states each, with measured joint axes labelled.

What a useful benchmark would contain

The minimum viable shape:

Ten articulated objects across four families. Lighters (trigger), drawers (slide), cabinet doors (hinge), simple appliance lids (hinge). Mix small and large. Mix slow and fast articulation.
Two observed states per object. Resting plus partially-actuated. RGB-D where possible, RGB-only as a fallback. Five to ten frames per state.
Measured joint axes. Calibrated from CAD or measured by hand. A three-component direction vector per joint, with an uncertainty estimate.
Measured dimensions. Per-axis extents in centimeters, measured by hand, not parsed from a product page.
Per-state segmentation masks. Family-aware foreground masks for each state, generated by rembg or equivalent.
A typed schema. A small JSON manifest binding the above together. We would propose v0.1 and ask the field to refine it.
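To give a feel for what such a manifest could look like, here is a rough sketch using Python dataclasses that serialize to JSON. Every field name, the validation rules, and the example values are hypothetical; this is a conversation starter under our own assumptions, not the proposed v0.1.

```python
import json
import math
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Joint:
    joint_type: str              # e.g. "trigger", "slide", "hinge"
    axis: List[float]            # three-component unit direction vector
    axis_uncertainty_deg: float  # uncertainty of the measured axis


@dataclass
class ObservedState:
    name: str                    # e.g. "resting", "partially_actuated"
    frames: List[str]            # image paths, RGB-D or RGB
    masks: List[str]             # one foreground mask per frame
    has_depth: bool


@dataclass
class BenchmarkObject:
    object_id: str
    family: str
    extents_cm: List[float]      # hand-measured per-axis extents
    joints: List[Joint]
    states: List[ObservedState]

    def problems(self):
        """Return a list of schema violations (empty if none were found)."""
        found = []
        if len(self.extents_cm) != 3:
            found.append("expected three per-axis extents")
        if len(self.states) != 2:
            found.append("expected exactly two observed states")
        for state in self.states:
            if not 5 <= len(state.frames) <= 10:
                found.append(f"{state.name}: expected 5 to 10 frames")
            if len(state.masks) != len(state.frames):
                found.append(f"{state.name}: need one mask per frame")
        for joint in self.joints:
            if abs(math.hypot(*joint.axis) - 1.0) > 1e-6:
                found.append("joint axis is not unit length")
        return found


def make_state(name, count=5):
    """Build a placeholder state with made-up file paths."""
    frames = [f"{name}/frame_{i:02d}.png" for i in range(count)]
    masks = [f"{name}/mask_{i:02d}.png" for i in range(count)]
    return ObservedState(name, frames, masks, has_depth=True)


# Placeholder object with made-up measurements.
example = BenchmarkObject(
    object_id="drawer_001",
    family="drawer",
    extents_cm=[40.0, 35.0, 12.0],
    joints=[Joint("slide", [0.0, 1.0, 0.0], 2.0)],
    states=[make_state("resting"), make_state("partially_actuated")],
)
manifest_json = json.dumps(asdict(example), indent=2)
```

Returning a list of problems rather than raising on the first one is a deliberate choice in this sketch: a contributor fixing a capture wants to see everything that is wrong at once.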

This is not a research project. It is two weeks of capture, one week of measurement, and one week of authoring the schema.

Benchmarks are how a community concentrates effort. Single-paper reproductions are how it disperses effort.

What we would do with it

If the benchmark existed today, the next experiment would be tightly scoped.

For each object, run three reconstruction paths against both states:

builder_only: family-template extents and joint axis priors. Our deterministic baseline.
hybrid_single_state: SF3D body from state 1, family-builder overlay, single-state joint axis from PCA + family hints. Our current production path.
reartgs_mode: multi-state body from both observed states, joint axis estimated from the geometric difference between states. The honest version of the trial we ran in proxy.
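To illustrate what "joint axis estimated from the geometric difference between states" can mean in the simplest case, here is a sketch for a hinge-like joint. It is a plain geometric stand-in, not the method from the REArtGS paper. It assumes we already have corresponding points on the moving part in both states and that the motion is a pure rotation: every point's displacement is then perpendicular to the axis, so cross products of displacement pairs point along it. The function names and toy data are ours.

```python
import math


def cross(a, b):
    """Cross product of two 3D vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def revolute_axis_direction(points_a, points_b):
    """Estimate a hinge axis direction from matched points in two states.

    Assumes known correspondences and pure rotation. Returns a unit vector
    whose sign is arbitrary. Recovers only the direction, not the pivot.
    """
    disp = [[q[k] - p[k] for k in range(3)]
            for p, q in zip(points_a, points_b)]
    total = [0.0, 0.0, 0.0]
    # O(n^2) over displacement pairs; fine for a handful of points.
    for i in range(len(disp)):
        for j in range(i + 1, len(disp)):
            c = cross(disp[i], disp[j])
            # Flip so every contribution agrees in sign with the running sum.
            if sum(c[k] * total[k] for k in range(3)) < 0:
                c = [-x for x in c]
            total = [total[k] + c[k] for k in range(3)]
    norm = math.sqrt(sum(x * x for x in total))
    if norm < 1e-12:
        raise ValueError("displacements are parallel or zero; not a rotation")
    return [x / norm for x in total]


def rotate_z(p, deg):
    """Rotate a point about the z axis (used to fabricate toy data)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]


# Toy data: three points rotated 30 degrees about z between states.
resting = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5], [2.0, 1.0, -1.0]]
actuated = [rotate_z(p, 30.0) for p in resting]
estimated_axis = revolute_axis_direction(resting, actuated)
```

A slide joint behaves the opposite way: all displacements are parallel to the axis, which is why the function rejects that case instead of guessing. None of this is possible from a single state, which is the point of asking for two.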

Score on three axes:

Body dimension fidelity against measured spec.
Joint type accuracy against family ground truth.
Joint axis alignment against measured axis. This is the new metric the benchmark unlocks. It is also the metric we expect REArtGS-style methods to win on, because it is the metric they are designed for.

The metric is the deliverable. Without the third axis, the benchmark is a duplicate of what we already have.
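As a sketch of how the third axis could be scored, the function below computes the angle between an estimated and a measured joint axis. It treats an axis as an undirected line, so a vector and its negation count as the same axis and the error lies between 0° and 90°. That convention and the function names are assumptions on our part; a benchmark could equally fix a sign convention and report errors up to 180°.

```python
import math


def axis_error_deg(estimated, measured):
    """Angle in degrees between two joint axes, ignoring direction sign."""
    dot = sum(e * m for e, m in zip(estimated, measured))
    norm_e = math.sqrt(sum(e * e for e in estimated))
    norm_m = math.sqrt(sum(m * m for m in measured))
    if norm_e == 0.0 or norm_m == 0.0:
        raise ValueError("axis vectors must be non-zero")
    # abs() folds opposite directions together; min() guards against
    # rounding pushing the cosine slightly above 1.
    cos_angle = min(1.0, abs(dot) / (norm_e * norm_m))
    return math.degrees(math.acos(cos_angle))


def mean_axis_error_deg(pairs):
    """Mean alignment error over (estimated, measured) axis pairs."""
    errors = [axis_error_deg(e, m) for e, m in pairs]
    return sum(errors) / len(errors)


# Made-up example: an estimate tilted away from a measured z axis.
tilt = math.radians(18.4)
tilted = [0.0, math.sin(tilt), math.cos(tilt)]
error = axis_error_deg(tilted, [0.0, 0.0, 1.0])
```

Unlike the dimension hit rate, this number says nothing about body extents, so the two metrics can disagree on the same asset.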

What the broader pipeline gains from this

Joint axis alignment is not an academic metric. It cascades into three production failures we already know about.

Physics validation pass rate. A joint authored with the wrong axis fails GA2.5 validation when the asset is articulated under load. The drawer slides at an angle. The trigger rotates around a point that is not the hinge. Validation catches this; refinement is then expensive.

Manipulation policy transfer. A policy trained on a wrong-axis asset learns to compensate. The compensation doesn’t transfer to the real object. The transfer report would say “yes” with margin, and the real-world result would disagree.

Affordance region placement. The lighter’s forbidden_contact_regions is anchored to the trigger’s joint axis. A wrong axis places the forbidden region in the wrong location, and the affordance gate stops catching the grasps it should reject.

The right joint axis is upstream of three different downstream failures we already track. The benchmark would let us measure how much each downstream failure depends on it.

What we ask of the field

We will publish the benchmark schema if no comparable one exists by Q3. We are asking, first, whether one is in the works that we haven’t noticed. If you are aware of a candidate, write to us. The duplication cost of two competing benchmarks is higher than the cost of waiting a quarter for the better one.

If no candidate exists, we will publish a v0.1 schema and seed it with the lighter and the drawer from our own pipeline. Two objects is not a benchmark. It is an invitation.

We do not want the credit for the benchmark. We want the data the benchmark unlocks.

What we will not do until the benchmark exists

Implement REArtGS as a production GA1 branch. The cost is non-trivial and the win is unmeasurable without the third metric axis. The work would be premature.
Train a learned joint-axis estimator. Same logic. We don’t yet know how much axis error is tolerable downstream. A learned estimator with no metric to optimize against would be retrofitted to whatever evaluation set we happen to have.
Promise multi-state input on the bundle schema. v0.2 of the neural object schema has typed stubs for body_state_meshes/<state>.ply. The stubs remain empty until the benchmark forces them to be populated.

Next: Cross-robot transfer reports that actually say why. We close this series with the most consequential design choice in the bundle format.