From LiDAR scan to walkable 3D world

We turned an iPad Pro into a room scanner that produces a first-person walkable 3D environment, textured with real photographs rather than generated imagery. A single 60-second walk-around with a consumer tablet creates a navigable digital twin of an indoor space.

This post describes the full pipeline: LiDAR capture on iOS, volumetric depth fusion on the backend, photo-texture baking, and real-time first-person rendering back on the iPad.

The problem

Existing room scanning apps produce point clouds (hard to navigate), untextured meshes (gray blobs), or photogrammetry reconstructions (slow, fragile, and dependent on hundreds of photos). None of them lets you walk through the result in first person, on the same device that captured it, with the actual appearance of the room.

We wanted: scan a room in 60 seconds, wait 2 minutes, then walk through it like a video game. See the real walls, floor, furniture, and objects exactly as they appear.

What we built

A four-stage pipeline that runs across the iPad and a Python backend:

Stage 1: Capture (iPad, 60 seconds)

The iOS app runs an ARWorldTrackingConfiguration with mesh and scene depth enabled. For each accepted frame, it records:

Smoothed LiDAR depth map (256×192, float32 meters)
RGB image (1920×1440)
Per-pixel confidence map (0=low, 1=medium, 2=high)
6-DOF camera pose (position + quaternion)
Camera intrinsics (fx, fy, cx, cy)
Exposure and motion metadata

A real-time frame acceptance filter rejects frames with motion blur (gyroscope × exposure × focal length > 50), low depth confidence ratio (<25% high-confidence pixels), or insufficient baseline from the previous frame (<30mm). A typical scan accepts 400–950 frames from a 60-second walkthrough.
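The three rejection rules can be sketched as a small predicate. This is a minimal illustration, not the app's code: the on-device filter would be written in Swift, and Python is used here only to match the backend. The `FrameStats` type and its field names are hypothetical, and the units of the blur product are an assumption, since the thresholds above are given without units.

```python
from dataclasses import dataclass

@dataclass
class FrameStats:
    """Hypothetical per-frame measurements; names and units are assumptions."""
    gyro_rate: float        # angular speed (assumed rad/s)
    exposure_s: float       # exposure duration (assumed seconds)
    focal_px: float         # focal length (assumed pixels)
    high_conf_ratio: float  # fraction of depth pixels with confidence 2
    baseline_m: float       # distance moved since the last accepted frame (meters)

BLUR_LIMIT = 50.0       # threshold from the text, treated as a unitless score
MIN_HIGH_CONF = 0.25    # at least 25% high-confidence pixels
MIN_BASELINE_M = 0.030  # at least 30 mm of camera travel

def accept_frame(f: FrameStats) -> bool:
    """Return True if the frame passes all three acceptance checks."""
    blur_score = f.gyro_rate * f.exposure_s * f.focal_px
    if blur_score > BLUR_LIMIT:
        return False  # too much rotation during the exposure
    if f.high_conf_ratio < MIN_HIGH_CONF:
        return False  # depth map is mostly unreliable
    if f.baseline_m < MIN_BASELINE_M:
        return False  # camera has barely moved; frame is redundant
    return True

steady = FrameStats(0.5, 0.01, 1500.0, 0.9, 0.05)
blurry = FrameStats(5.0, 0.01, 1500.0, 0.9, 0.05)
```

Here `steady` scores 7.5 and passes, while `blurry` scores 75 and is rejected.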

The accepted frames are packaged into a compressed bundle and uploaded to the backend.

Stage 2: Volumetric Fusion (Backend, ~30 seconds)

The backend integrates all depth frames into a single coherent surface using Open3D’s ScalableTSDFVolume:

Each depth frame is filtered by confidence (reject low-confidence pixels) and clamped to 4.5m max range
Camera poses are converted from ARKit’s OpenGL convention (−Z forward, +Y up) to OpenCV convention (+Z forward, +Y down) via a diag(1, −1, −1, 1) right-multiplication on the camera-to-world matrix
Each frame is integrated into the TSDF volume at 1.5–2cm voxel resolution with 4× voxel SDF truncation distance
Marching cubes extracts a triangle mesh from the volume

The raw mesh typically contains 1–2.5 million faces with per-vertex RGB colors from the fused depth+color integration.

Stage 3: Mesh Cleanup + Texture Baking (Backend, ~2 minutes)

The raw TSDF mesh has artifacts: floating fragments, double-layer surfaces from reflective objects, degenerate triangles. A multi-pass cleanup removes these:

Connected component filtering: remove clusters smaller than 1% of the largest component
Observation counting: project every face centroid into every camera; remove faces visible from zero viewpoints (typically 10–18% of raw faces)
Thin sheet detection: find adjacent face pairs with nearly opposite normals (dot product < −0.8) and remove both
Second-pass component filtering: remove newly disconnected fragments
Quadric decimation: reduce to 150K–200K faces for iPad rendering performance
Laplacian smoothing: one iteration at λ=0.5 to reduce voxel staircase artifacts

After cleanup, the mesh is UV-unwrapped using xatlas parametrization, then texture-baked:

For each face in the mesh:

Project the face centroid into every camera frame
Score each frame: score = visibility × center_bonus / distance, with edge rejection (projections that land within 5% of the image border are discarded)
Select the highest-scoring frame
Compute an affine transform from the projected image triangle to the UV atlas triangle
Warp the source image pixels into the atlas via OpenCV

Unassigned faces (not visible from any scored frame) are filled via BFS propagation from assigned neighbors, running up to 30 iterations.

The atlas image is vertically flipped before embedding in the USDZ package, because the atlas is rendered in image coordinates (origin top-left) while USD texture sampling uses OpenGL-style UV coordinates (origin bottom-left).

Stage 4: Viewer (iPad, real-time)

The textured USDZ is loaded into a RealityKit scene with .nonAR camera mode. A PerspectiveCamera entity drives the viewport. The navigation system provides:

First-person movement via virtual joystick (left side of screen)
Free-look via pan gesture (right side of screen, restricted via UIGestureRecognizerDelegate to avoid conflict with joystick)
Camera orientation: yaw + π rotation to align RealityKit’s −Z camera forward with the navigation controller’s +Z forward direction
Pitch clamped to ±60 degrees
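The yaw + π offset and the pitch clamp can be checked with a few lines of trigonometry. This is a sketch in Python rather than the viewer's Swift code. It assumes yaw is a rotation about +Y and that the navigation controller's forward at yaw = 0 is +Z. All function names are hypothetical.

```python
import math

MAX_PITCH = math.radians(60.0)

def clamp_pitch(pitch: float) -> float:
    """Limit pitch to the ±60 degree range used by the viewer."""
    return max(-MAX_PITCH, min(MAX_PITCH, pitch))

def nav_forward(yaw: float):
    """Navigation controller's forward vector (+Z at yaw = 0, by assumption)."""
    return (math.sin(yaw), 0.0, math.cos(yaw))

def camera_look_dir(yaw: float):
    """Direction a -Z-forward camera looks after rotating by yaw + pi about Y."""
    theta = yaw + math.pi
    # A rotation about Y by theta maps the local forward (0, 0, -1)
    # to (-sin(theta), 0, -cos(theta)).
    return (-math.sin(theta), 0.0, -math.cos(theta))
```

For any yaw, `camera_look_dir(yaw)` equals `nav_forward(yaw)`, which is the alignment the π offset is meant to achieve.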

The collision and walkable-area data are generated server-side from the mesh’s surface classification (floor/wall/ceiling/obstacle, based on face normals and height above the floor plane).

Key engineering challenges

The coordinate system war

The single hardest bug in the entire pipeline was getting three different coordinate conventions to agree:

System	Forward	Up	Handedness
ARKit (OpenGL)	−Z	+Y	Right
Open3D (OpenCV)	+Z	−Y	Right
RealityKit (OpenGL)	−Z	+Y	Right

ARKit provides camera poses in OpenGL convention. Open3D’s TSDF integration expects OpenCV convention. The conversion is a right-multiplication by diag(1, −1, −1, 1) on the camera-to-world matrix. This flips the camera’s Y and Z axes without changing the camera’s world position.

Without this conversion, each frame’s depth unprojects backwards, producing a mesh of chaotic overlapping fragments instead of a coherent room. The geometry appears roughly correct at a distance (same bounding box) but is completely unusable for navigation.

The same conversion must be applied consistently in the texture baker’s projection. Mesh vertices exist in the TSDF world frame (which was built with converted poses), so projecting back to image coordinates requires the same converted poses.
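A minimal NumPy sketch of the conversion, assuming poses arrive as 4×4 camera-to-world matrices. The function names are illustrative, not the pipeline's own.

```python
import numpy as np

# Flips the camera's local Y and Z axes (OpenGL -> OpenCV camera convention).
GL_TO_CV = np.diag([1.0, -1.0, -1.0, 1.0])

def arkit_to_opencv(cam_to_world_gl: np.ndarray) -> np.ndarray:
    """Right-multiply so only the camera's local axes change, not its position."""
    return cam_to_world_gl @ GL_TO_CV

def world_to_camera(cam_to_world_cv: np.ndarray) -> np.ndarray:
    """Extrinsic matrix (world -> camera), as used for integration and projection."""
    return np.linalg.inv(cam_to_world_cv)

# Example: a camera at (1, 2, 3) with identity rotation in ARKit's convention.
pose_gl = np.eye(4)
pose_gl[:3, 3] = [1.0, 2.0, 3.0]
pose_cv = arkit_to_opencv(pose_gl)

# A world point one meter along the ARKit camera's viewing direction (-Z).
point_world = np.array([1.0, 2.0, 2.0, 1.0])
point_cam = world_to_camera(pose_cv) @ point_world
```

The translation column of `pose_cv` is unchanged, and `point_cam` has z = +1, so the point lies in front of the camera in the OpenCV convention.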

A second coordinate issue arises in the USDZ texture: the texture atlas is rendered in image coordinates (y=0 at top) but USD’s UsdUVTexture samples with OpenGL UV coordinates (v=0 at bottom). A vertical flip of the atlas image before USDZ packaging resolves this.
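The effect of the flip can be shown on a toy atlas. The sampler below is a hypothetical nearest-texel lookup that mimics the v = 0 at bottom convention. It is not USD or RealityKit code.

```python
import numpy as np

def flip_atlas_for_usd(atlas: np.ndarray) -> np.ndarray:
    """Flip rows so that image row 0 (top) becomes the last row."""
    return atlas[::-1].copy()

def sample_bottom_origin(image: np.ndarray, u: float, v: float):
    """Nearest-texel lookup where v = 0 refers to the bottom row of the image."""
    h, w = image.shape[:2]
    col = min(int(u * w), w - 1)
    row_from_bottom = min(int(v * h), h - 1)
    return image[h - 1 - row_from_bottom, col]

# A 4x2 atlas whose texel values encode their position (row-major 0..7).
atlas = np.arange(8).reshape(4, 2)

# The baker wrote a texel at row 0, col 1 (value 1) and stored its UV with
# v measured from the top: (u, v) = (0.75, 0.125).
u, v = 0.75, 0.125
wrong = sample_bottom_origin(atlas, u, v)                      # reads the bottom row
right = sample_bottom_origin(flip_atlas_for_usd(atlas), u, v)  # reads the intended texel
```

Without the flip the lookup returns 7 (bottom row). With it, the lookup returns the intended value 1.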

Depth confidence as a quality signal

iPad Pro’s LiDAR provides a per-pixel confidence level (0, 1, 2) with each depth frame. We found that filtering by confidence dramatically affects mesh quality:

Confidence filter	Valid pixels/frame	Mesh quality
None (all pixels)	100%	Noisy, double surfaces on reflective objects
≥ Medium (1)	~95%	Clean walls/floors, some noise on ceramics
High only (2)	~95%	Cleanest, but sparse on dark/glossy surfaces

We use ≥ Medium as the default. The 5% of low-confidence pixels typically correspond to reflective surfaces (toilets, sinks, mirrors, windows) where the LiDAR pulse scatters. Removing them eliminates the worst double-surface artifacts.
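A sketch of the per-frame depth filter, assuming depth and confidence arrive as same-shaped NumPy arrays. Rejected pixels are set to 0 because Open3D treats zero depth as invalid. The function name and the choice to drop, rather than clip, out-of-range pixels are assumptions.

```python
import numpy as np

def filter_depth(depth_m, confidence, min_conf=1, min_depth=0.1, max_depth=4.5):
    """Zero out pixels that are low-confidence or outside the usable depth range."""
    depth_m = np.asarray(depth_m, dtype=np.float32)
    confidence = np.asarray(confidence)
    valid = (confidence >= min_conf) & (depth_m >= min_depth) & (depth_m <= max_depth)
    return np.where(valid, depth_m, 0.0).astype(np.float32)

depth = np.array([[0.05, 1.0], [3.0, 5.0]], dtype=np.float32)
conf = np.array([[2, 0], [1, 2]], dtype=np.uint8)
filtered = filter_depth(depth, conf)
```

In this example only the 3.0 m pixel survives: the others are too close, low-confidence, or beyond 4.5 m.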

Texture projection accuracy

The texture baker must project 3D mesh vertices back into 2D camera images with sub-pixel accuracy. Any systematic error produces visible texture misalignment: curtain textures appearing on ceilings, floor textures on walls.

Two conditions must hold:

Mesh vertices must be in the same world frame as the camera poses used for projection
The projection must happen before any post-processing that modifies vertex positions (like floor alignment)

We discovered that floor alignment (mesh.vertices[:, 1] -= floor_y) applied before texture baking shifts all vertices relative to the camera poses, causing every face to sample from the wrong image location. Moving texture baking before floor alignment fixed the misalignment completely.
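To see how large the error can be, here is a pinhole projection sketch with hypothetical numbers: an identity camera pose in OpenCV convention, intrinsics roughly matching a 1920×1440 image, and a made-up floor offset.

```python
import numpy as np

def project(point_world, cam_to_world_cv, fx, fy, cx, cy):
    """Project a world point to pixel coordinates with a pinhole model."""
    p = np.linalg.inv(cam_to_world_cv) @ np.append(point_world, 1.0)
    x, y, z = p[:3]
    return np.array([fx * x / z + cx, fy * y / z + cy])

pose = np.eye(4)                      # hypothetical camera at the origin
fx = fy = 1500.0                      # hypothetical intrinsics
cx, cy = 960.0, 720.0

vertex = np.array([0.0, 0.5, 2.0])
floor_y = -1.2                        # hypothetical floor offset

before = project(vertex, pose, fx, fy, cx, cy)
shifted = vertex.copy()
shifted[1] -= floor_y                 # same operation as mesh.vertices[:, 1] -= floor_y
after = project(shifted, pose, fx, fy, cx, cy)
```

With these numbers the projected row moves from 1095 to 1995, a 900-pixel error that also pushes the sample outside the 1440-row image.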

The observation gap

In a typical room scan, 10–18% of TSDF mesh faces have zero camera observations. They form during depth integration, but no camera ever pointed directly at them. These faces cannot be textured and appear as gray patches.

The causes:

Back-facing surfaces (the camera approached a wall from one side but the TSDF volume extended slightly past it)
Occluded geometry (behind furniture, inside corners)
Surfaces only seen at extreme oblique angles (below the scoring threshold)

Our BFS gap-filling propagates the nearest assigned frame’s texture to unassigned neighbors. This covers most small gaps. Large unobserved regions (entire walls the camera never faced) remain untextured. This is a capture coverage limitation, not a reconstruction limitation.

Technical details

TSDF volumetric fusion

We use Open3D’s ScalableTSDFVolume with TSDFVolumeColorType.RGB8 for joint geometry and color integration. The scalable variant uses a hash map of voxel blocks rather than a dense grid, allowing efficient fusion of room-scale environments without pre-allocating memory for the entire volume.

Parameters:

Parameter	Value	Rationale
Voxel length	0.015–0.02 m	Balance between surface detail and memory/compute
SDF truncation	4× voxel length	Standard TSDF truncation band
Max depth	4.5 m	LiDAR noise increases significantly beyond 4m
Min depth	0.1 m	Reject noise floor
Color type	RGB8	Per-voxel color averaging

Integration of 400–950 frames takes 15–45 seconds on a single CPU core. The resulting mesh contains 600K–2.5M faces before decimation.
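The parameters in the table map onto Open3D's integration API roughly as follows. This is a sketch under stated assumptions, not the production code: `fuse_frames` and `tsdf_params` are illustrative names, each frame's color image is assumed to be already resized to the depth resolution with intrinsics scaled to match (Open3D's RGBD image expects equal sizes), and poses are assumed to be already converted to OpenCV convention.

```python
def tsdf_params(voxel_length=0.02):
    """Derive fusion parameters from the voxel size (values from the table)."""
    return {
        "voxel_length": voxel_length,
        "sdf_trunc": 4.0 * voxel_length,  # truncation band of 4 voxels
        "depth_trunc": 4.5,               # max depth in meters
    }

def fuse_frames(frames, width, height, voxel_length=0.02):
    """Integrate frames into a ScalableTSDFVolume and extract a mesh.

    frames: iterable of (color_u8 HxWx3, depth_f32 HxW in meters,
            (fx, fy, cx, cy), cam_to_world_cv 4x4), all at depth resolution.
    """
    import numpy as np
    import open3d as o3d  # imported lazily so tsdf_params works without Open3D

    p = tsdf_params(voxel_length)
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=p["voxel_length"],
        sdf_trunc=p["sdf_trunc"],
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8,
    )
    for color, depth, (fx, fy, cx, cy), cam_to_world in frames:
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(np.ascontiguousarray(color)),
            o3d.geometry.Image(np.ascontiguousarray(depth)),
            depth_scale=1.0,                 # depth is already in meters
            depth_trunc=p["depth_trunc"],
            convert_rgb_to_intensity=False,  # keep RGB for color integration
        )
        intrinsic = o3d.camera.PinholeCameraIntrinsic(width, height, fx, fy, cx, cy)
        # integrate() expects the world-to-camera extrinsic.
        volume.integrate(rgbd, intrinsic, np.linalg.inv(cam_to_world))
    return volume.extract_triangle_mesh()
```

At a 1.5 cm voxel length the truncation distance works out to 6 cm, and at 2 cm it is 8 cm.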

Mesh cleanup pipeline

Connected component analysis uses Open3D’s cluster_connected_triangles(), which returns per-triangle cluster IDs along with the triangle count and surface area of each cluster. We keep clusters with ≥1% of the largest cluster’s triangle count. This removes thousands of floating fragments (typically 5K–50K small clusters) while preserving the main room structure and any large separate objects.
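Given the cluster IDs and triangle counts, the 1% rule reduces to a boolean mask. The sketch below uses plain NumPy on arrays shaped like Open3D's outputs. The function name is illustrative.

```python
import numpy as np

def component_keep_mask(triangle_clusters, cluster_n_triangles, min_fraction=0.01):
    """Per-triangle mask: True where the triangle's cluster is large enough to keep."""
    triangle_clusters = np.asarray(triangle_clusters)
    cluster_n_triangles = np.asarray(cluster_n_triangles)
    threshold = min_fraction * cluster_n_triangles.max()
    keep_cluster = cluster_n_triangles >= threshold
    return keep_cluster[triangle_clusters]

# Three clusters of 1000, 10, and 9 triangles; four sample triangles.
mask = component_keep_mask([0, 0, 1, 2], [1000, 10, 9])
```

With an Open3D mesh, the inverted mask can be passed to `mesh.remove_triangles_by_mask(~mask)`, followed by `mesh.remove_unreferenced_vertices()`.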

Observation counting iterates over sampled camera poses (every 2nd–3rd frame for speed) and for each face:

Computes the view vector (camera position − face centroid)
Checks visibility: dot(face_normal, view_direction) > 0.05
Projects the centroid to image coordinates and checks bounds
Increments the observation counter if all checks pass

Faces with zero observations are removed. This eliminates ghost surfaces that formed in the TSDF from indirect depth integration but were never directly confirmed by any camera.
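The checks listed above can be vectorized over all faces per camera. This is a minimal sketch, assuming unit-length face normals, camera-to-world poses already in OpenCV convention, and a single set of intrinsics. It adds an explicit in-front-of-camera test, which the list above does not mention.

```python
import numpy as np

def count_observations(centroids, normals, cam_to_world_list,
                       fx, fy, cx, cy, width, height, min_cos=0.05):
    """Count, per face, how many cameras see its centroid front-on and in frame."""
    centroids = np.asarray(centroids, dtype=float)
    normals = np.asarray(normals, dtype=float)
    counts = np.zeros(len(centroids), dtype=int)
    homog = np.hstack([centroids, np.ones((len(centroids), 1))])
    for cam_to_world in cam_to_world_list:
        # View vector from face centroid to camera, normalized.
        view = cam_to_world[:3, 3] - centroids
        view /= np.linalg.norm(view, axis=1, keepdims=True)
        facing = np.einsum("ij,ij->i", normals, view) > min_cos
        # Project centroids into the image.
        p_cam = (np.linalg.inv(cam_to_world) @ homog.T).T
        z = p_cam[:, 2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negatives
        u = fx * p_cam[:, 0] / safe_z + cx
        v = fy * p_cam[:, 1] / safe_z + cy
        in_image = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        counts += (facing & in_front & in_image)
    return counts

# One camera at the origin looking along +Z; three test faces.
centroids = [[0, 0, 2], [0, 0, 2], [0, 0, -2]]
normals = [[0, 0, -1], [0, 0, 1], [0, 0, 1]]
obs = count_observations(centroids, normals, [np.eye(4)],
                         1500.0, 1500.0, 960.0, 720.0, 1920, 1440)
```

Only the first face is counted: the second faces away from the camera and the third lies behind it.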

Thin sheet detection builds an edge-to-face adjacency map and checks face normal consistency across shared edges. Adjacent faces with dot(n1, n2) < −0.8 (nearly opposite normals) indicate a thin sheet (two surface layers separated by less than one voxel). Both faces are removed.
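A compact sketch of the edge-adjacency test in pure Python, assuming faces are vertex-index triples and normals are unit vectors. The function name is hypothetical.

```python
from collections import defaultdict

def find_thin_sheet_faces(faces, normals, max_dot=-0.8):
    """Return indices of faces that share an edge with a nearly opposite-facing face."""
    # Map each undirected edge to the faces that use it.
    edge_to_faces = defaultdict(list)
    for fi, (a, b, c) in enumerate(faces):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_to_faces[(min(u, v), max(u, v))].append(fi)

    to_remove = set()
    for shared in edge_to_faces.values():
        for i in range(len(shared)):
            for j in range(i + 1, len(shared)):
                n1, n2 = normals[shared[i]], normals[shared[j]]
                if sum(x * y for x, y in zip(n1, n2)) < max_dot:
                    to_remove.update((shared[i], shared[j]))
    return to_remove

# Faces 0 and 1 share edge (0, 1) with opposite normals; face 2 agrees with face 0.
faces = [(0, 1, 2), (1, 0, 3), (1, 4, 2)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0), (0.0, 0.0, 1.0)]
thin = find_thin_sheet_faces(faces, normals)
```

Here `thin` contains faces 0 and 1, and face 2 is kept.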

Quadric decimation uses Open3D’s simplify_quadric_decimation(), which minimizes the quadric error metric (QEM) at each edge collapse. We target 150K–200K faces, which provides sufficient detail for iPad rendering at 30–60 FPS while keeping the USDZ file at roughly 25 MB or less.

Texture atlas baking

UV unwrapping uses xatlas, which performs automatic chart generation and atlas packing. For a 200K-face mesh, xatlas produces ~165K UV vertices (some mesh vertices are split at UV seams) in approximately 90 seconds.

Frame scoring for each face evaluates all sampled camera poses:

score = visibility × center_bonus / distance

Where:

visibility = dot(face_normal, normalize(cam_pos − face_centroid)) (how directly the face is pointed at the camera)
center_bonus = 1.0 − 0.5 × (dist_from_image_center / max_possible_distance) (prefer pixels near the image center; less lens distortion, typically sharper)
distance = Euclidean distance from camera to face centroid

Faces whose centroid projects within 5% of the image border are rejected (edge distortion penalty). The visibility threshold is 0.02, so nearly edge-on faces are still accepted if no better view exists.
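Putting the three terms and the two rejection rules together gives a scoring function along these lines. It is a sketch with assumptions: the face normal is unit length, the 5% margin applies to each side of the image, and max_possible_distance is taken to be half the image diagonal.

```python
import math

def score_frame(face_centroid, face_normal, cam_pos, pixel, image_size,
                edge_margin=0.05, min_visibility=0.02):
    """Score one camera for one face, or return None if the camera is rejected."""
    to_cam = [c - f for c, f in zip(cam_pos, face_centroid)]
    distance = math.sqrt(sum(d * d for d in to_cam))
    if distance == 0.0:
        return None
    # Cosine between the face normal and the direction to the camera.
    visibility = sum(n * d / distance for n, d in zip(face_normal, to_cam))
    if visibility < min_visibility:
        return None
    (u, v), (w, h) = pixel, image_size
    inside = (edge_margin * w <= u <= (1 - edge_margin) * w and
              edge_margin * h <= v <= (1 - edge_margin) * h)
    if not inside:
        return None  # too close to the image border
    dist_from_center = math.hypot(u - w / 2, v - h / 2)
    max_dist = math.hypot(w / 2, h / 2)  # assumed: half the image diagonal
    center_bonus = 1.0 - 0.5 * dist_from_center / max_dist
    return visibility * center_bonus / distance

# A face 2 m in front of the camera, seen head-on at the image center.
best = score_frame((0, 0, 2), (0, 0, -1), (0, 0, 0), (960, 720), (1920, 1440))
```

For this head-on view, visibility and center_bonus are both 1 and the distance is 2 m, so the score is 0.5.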

Affine warping uses OpenCV’s getAffineTransform() to compute the 2×3 matrix mapping the projected image triangle to the UV atlas triangle. The warp is applied to a cropped region of the atlas (bounding box of the UV triangle + 1px margin) rather than the full atlas, which is approximately 1000× faster than warping the full 4096×4096 image per face.
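The 2×3 matrix is fully determined by the three vertex correspondences. The sketch below solves for it with NumPy as a stand-in for getAffineTransform(), and computes the crop box around the destination triangle. Function names are illustrative.

```python
import numpy as np

def affine_from_triangles(src_tri, dst_tri):
    """Solve for the 2x3 matrix M such that M @ [x, y, 1] maps src to dst."""
    src = np.asarray(src_tri, dtype=float)
    dst = np.asarray(dst_tri, dtype=float)
    A = np.hstack([src, np.ones((3, 1))])  # rows are [x, y, 1]
    return np.linalg.solve(A, dst).T       # A @ M.T = dst

def crop_box(dst_tri, margin=1):
    """Integer bounding box (x0, y0, x1, y1) of the atlas triangle plus a margin."""
    dst = np.asarray(dst_tri, dtype=float)
    x0, y0 = np.floor(dst.min(axis=0)).astype(int) - margin
    x1, y1 = np.ceil(dst.max(axis=0)).astype(int) + margin
    return int(x0), int(y0), int(x1), int(y1)

src = [(100.0, 200.0), (140.0, 205.0), (110.0, 250.0)]  # projected image triangle
dst = [(10.2, 20.7), (14.9, 20.1), (12.0, 25.3)]        # UV atlas triangle (pixels)
M = affine_from_triangles(src, dst)
box = crop_box(dst)
```

With OpenCV available, `cv2.getAffineTransform` on float32 copies of the same point arrays yields the same matrix. Subtracting the crop origin from the destination points before solving lets `cv2.warpAffine` write into the small crop instead of the full atlas.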

Gap filling via BFS: for each unassigned face, check all face-adjacent neighbors (via trimesh.face_adjacency). If any neighbor is assigned, copy its frame assignment. Repeat for up to 30 iterations. This propagates texture from well-observed regions into small gaps and crevices.
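The propagation loop can be sketched in pure Python. Assignments are a list with None for unassigned faces, and adjacency is a list of face-index pairs in the same shape that trimesh.face_adjacency provides. Updates are applied at the end of each pass, so each iteration grows the filled region by one ring of faces.

```python
def fill_gaps(assignment, adjacency, max_iters=30):
    """Propagate frame assignments to unassigned faces, one ring per iteration."""
    neighbors = {}
    for a, b in adjacency:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    result = list(assignment)
    for _ in range(max_iters):
        updates = {}
        for face, frame in enumerate(result):
            if frame is not None:
                continue
            for nb in neighbors.get(face, ()):
                if result[nb] is not None:
                    updates[face] = result[nb]  # copy the neighbor's source frame
                    break
        if not updates:
            break  # nothing left that can be reached
        for face, frame in updates.items():
            result[face] = frame
    return result

# A strip of five faces where only the first is assigned, plus one isolated face.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
filled = fill_gaps([7, None, None, None, None, None], chain)
partial = fill_gaps([7, None, None, None, None, None], chain, max_iters=2)
```

The strip fills completely within four passes, while the isolated face stays unassigned. With `max_iters=2` only the two nearest faces are filled.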

USDZ packaging

The final asset is a USDZ file (zipped USDC + texture PNG) containing:

UsdGeomMesh with vertex positions, face indices, and vertex normals
faceVarying UV coordinates with explicit indices (one UV index per face-vertex)
UsdPreviewSurface material with UsdUVTexture reading from the embedded texture.png
UsdPrimvarReader_float2 connecting the st primvar to the texture sampler

The UV primvar must use faceVarying interpolation with explicit indices because xatlas produces more UV vertices than mesh vertices (seam splits). Without explicit indices, RealityKit cannot look up the correct UV coordinate per face-vertex, resulting in a blank or incorrectly textured mesh.
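The index layout can be illustrated on a two-triangle quad whose UV chart is split along the diagonal. The helper below is hypothetical and only flattens the per-face indices into the order a faceVarying primvar expects. It does not write USD.

```python
def face_varying_layout(mesh_faces, uv_faces):
    """Flatten per-face vertex and UV indices into parallel face-vertex arrays."""
    face_vertex_indices = [v for face in mesh_faces for v in face]
    st_indices = [t for face in uv_faces for t in face]
    if len(face_vertex_indices) != len(st_indices):
        raise ValueError("need exactly one UV index per face-vertex")
    return face_vertex_indices, st_indices

# A quad: 4 mesh vertices, but 6 UV vertices because the two triangles
# were placed in separate charts (a seam along the shared diagonal).
mesh_faces = [(0, 1, 2), (0, 2, 3)]
uv_faces = [(0, 1, 2), (3, 4, 5)]
fv_idx, st_idx = face_varying_layout(mesh_faces, uv_faces)
```

Mesh vertices 0 and 2 each map to two different UV vertices here, so a per-vertex UV array cannot represent the atlas. The six-entry `st_idx` array gives each face-vertex its own lookup.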

Results

Scan	Room size	Frames	TSDF time	Texture time	USDZ size	Textured faces
Room 13 (bedroom)	8.6×8.9×3.2m	952	16s	57s	7.7 MB	99.7%
Bath (bathroom)	4.4×5.9×3.0m	417	14s	120s	25.1 MB	99.7%

Texture coverage exceeds 99% on both scans after gap filling. The remaining <1% untextured faces are on surfaces completely occluded from all camera viewpoints.

Wall and floor surfaces texture cleanly with minimal seams. Reflective surfaces (ceramic, glass) show geometry artifacts from LiDAR multipath but correct texture. Thin objects (chair legs, plant stems) are lost to TSDF voxelization at 1.5–2cm resolution.

First-person walkthrough of the reconstructed room: textured geometry, real-time navigation.

Limitations

Voxel resolution vs thin objects: 1.5cm voxels cannot represent objects thinner than ~3cm. Chair legs, cables, and plant stems dissolve into the floor.
Reflective surfaces: LiDAR multipath on ceramic, glass, and water creates double-layer surfaces. Confidence filtering removes most artifacts but some remain.
Texture seams: Adjacent faces assigned to different source frames show visible color discontinuities at triangle edges. Multi-frame blending would smooth these.
Processing time: The full pipeline (download + fusion + cleanup + texture + upload) takes 2–4 minutes. Real-time on-device processing is not yet possible.
Coverage dependency: Surfaces not scanned (behind closed doors, above camera height) cannot be reconstructed. The system makes no attempt to hallucinate unseen geometry.

Future work

Per-texel multi-frame blending: Replace per-face hard assignment with per-texel weighted averaging of top 2–4 frames, with exposure normalization
Depth-consistent occlusion testing: Before accepting a frame for a face, verify via depth map that the face is not occluded by closer geometry
Plane regularization: Detect dominant planes (walls, floor, ceiling) and snap nearby mesh faces to planar surfaces for cleaner architecture
On-device TSDF: Apple’s Metal Performance Shaders could enable real-time volumetric fusion on iPad’s GPU, eliminating the backend dependency
Streaming large environments: LOD mesh streaming for multi-room environments that exceed iPad memory limits

References

Curless, B. and Levoy, M. “A Volumetric Method for Building Complex Models from Range Images.” SIGGRAPH 1996.
Lorensen, W.E. and Cline, H.E. “Marching Cubes: A High Resolution 3D Surface Construction Algorithm.” SIGGRAPH 1987.
Garland, M. and Heckbert, P.S. “Surface Simplification Using Quadric Error Metrics.” SIGGRAPH 1997.
Young, J. “xatlas: Mesh parameterization / UV unwrapping library.” GitHub, 2020.
Zhou, Q.Y., Park, J., and Koltun, V. “Open3D: A Modern Library for 3D Data Processing.” arXiv:1801.09847, 2018.
Pixar. “Universal Scene Description (USD) Specification.” Pixar Animation Studios, 2016–2024.

If you are working on room-scale capture for Physical AI and want to try this pipeline, join the beta.