DEEP DIVE

How Binary Diffs at Scale Actually Work

2026-04-14  ·  9 min read

Ask any build engineer who has spent time debugging a Perforce or Git LFS depot what a diff of two versions of CityHub.umap actually shows, and the answer is almost always the same: nothing useful. You get "binary files differ" and a byte count. The entire 800 MB of scene data is treated as a single opaque blob, diffed with the same logic used to compare two JPEG thumbnails.

We built Diversion from the assumption that this is fixable — not with a clever wrapper around diff, but by understanding what's actually inside a .uasset or .umap before you touch it. Here's how our binary diff pipeline works, where it gets complicated, and what it still can't do.

Why Naive Delta Compression Fails on .uasset Files

A .uasset file is not a raw binary blob. Unreal serializes each asset as a structured stream: a header block, an export table, then the export data for each object in the asset (meshes, materials, Blueprints, etc.). The export data for most large assets is LZ4-compressed, which is the first thing that defeats naive delta compression.

LZ4 is designed for speed, not for diff-stable output. If you change a single float in a material parameter and reserialize the asset, the LZ4-compressed block for that export may shift by hundreds of bytes, and the bytes after the change can look completely different even though the logical change was tiny. An rsync-style rolling-checksum diff, the approach underlying many binary delta tools, will find few matching blocks in the affected data and can produce a "diff" that's nearly the size of the whole file.

We're not saying LZ4 is bad — it's the right call for Unreal's runtime performance goals. We're saying it makes content-aware diffing mandatory if you want meaningful deltas. Diffing the compressed layer directly is noise.
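A minimal sketch of the effect, not a measurement of real assets. It uses Python's standard-library zlib as a stand-in for LZ4 (an assumption: LZ4 is not in the standard library, and its output differs in detail), and the payload is a hypothetical array of quantized floats. The helper names are ours, for illustration only.

```python
import random
import struct
import zlib

BLOCK = 4096  # fixed comparison block size, chosen arbitrarily for this sketch


def matching_block_fraction(a: bytes, b: bytes, block: int = BLOCK) -> float:
    """Fraction of aligned fixed-size blocks of `a` that are byte-identical in `b`."""
    offsets = range(0, len(a), block)
    same = sum(1 for off in offsets if a[off:off + block] == b[off:off + block])
    return same / max(1, len(offsets))


def make_payload(seed: int = 0, count: int = 16384) -> bytes:
    """Hypothetical export payload: quantized float32 values (64 KB in total)."""
    rng = random.Random(seed)
    values = [round(rng.uniform(0.0, 100.0), 1) for _ in range(count)]
    return struct.pack(f"<{count}f", *values)


old_raw = make_payload()
# Change a single float at the start of the payload; everything else is identical.
new_raw = struct.pack("<f", 2912.5) + old_raw[4:]

# zlib here stands in for LZ4: the point is only that compression runs first.
old_packed = zlib.compress(old_raw)
new_packed = zlib.compress(new_raw)

raw_fraction = matching_block_fraction(old_raw, new_raw)
packed_fraction = matching_block_fraction(old_packed, new_packed)
print(f"decompressed: {raw_fraction:.1%} of blocks match")
print(f"compressed:   {packed_fraction:.1%} of blocks match")
```

On the decompressed bytes, only the one block containing the edited float differs. On the compressed bytes, the match rate is lower and typically much lower, because a small change early in the stream alters the encoding of what follows.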

Block-Level Diffing: The Actual Approach

Diversion's diff pipeline decompresses the export data before diffing. We maintain a local parse of the asset's structure: header, name table, import table, export table, and per-export payload. Each export is hashed independently (we use xxHash64 for speed). On a subsequent version of the same asset, we reconstruct the same structural view and compare hashes at the export granularity.

This gives you diffs that look like:

CityHub_BP_StreetLight  →  export[3] (TransformComponent)
  Location.X: 2840.0 → 2912.5
  Location.Z: 0.0    → 118.0

CityHub_StaticMeshComponent_Road  →  export[7] (mesh_payload)
  mesh_payload changed (84 KB → 91 KB)
  [binary section, no semantic decode]

For Blueprint-based actors, most of the export payload is structured enough that we can surface property-level changes. For mesh payloads, we can tell you the block changed and its size delta, but we don't attempt to decode vertex data — that's below the threshold of meaningful editorial diff output.
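The export-level comparison can be sketched as follows. This is an illustration under stated assumptions, not Diversion's implementation: the `Export` tuple and the flat property mapping are hypothetical, and an 8-byte BLAKE2b digest from the standard library stands in for xxHash64.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory view of one export: the decompressed payload plus,
# when the export type is understood, a flat {property path: value} mapping.
Export = Tuple[bytes, Optional[Dict[str, object]]]


def export_digest(payload: bytes) -> bytes:
    # Stand-in for xxHash64 (not in the standard library): 8-byte BLAKE2b.
    return hashlib.blake2b(payload, digest_size=8).digest()


def describe_changes(old: Dict[str, Export], new: Dict[str, Export]) -> List[str]:
    """Report exports whose hash changed, at property level where possible."""
    lines: List[str] = []
    for name in sorted(old.keys() & new.keys()):
        old_payload, old_props = old[name]
        new_payload, new_props = new[name]
        if export_digest(old_payload) == export_digest(new_payload):
            continue  # unchanged export: produces no diff output
        if old_props is not None and new_props is not None:
            for key in sorted(old_props.keys() | new_props.keys()):
                before, after = old_props.get(key), new_props.get(key)
                if before != after:
                    lines.append(f"{name}.{key}: {before} -> {after}")
        else:
            # Opaque payload: report only that it changed, and its size delta.
            lines.append(
                f"{name} changed ({len(old_payload) // 1024} KB -> "
                f"{len(new_payload) // 1024} KB) [binary section, no semantic decode]"
            )
    for name in sorted(new.keys() - old.keys()):
        lines.append(f"{name} added")
    for name in sorted(old.keys() - new.keys()):
        lines.append(f"{name} removed")
    return lines


base = {
    "StreetLight": (b"loc=2840.0,0.0", {"Location.X": 2840.0, "Location.Z": 0.0}),
    "Road": (bytes(84 * 1024), None),
    "Sky": (b"unchanged", {"Intensity": 1.0}),
}
edited = {
    "StreetLight": (b"loc=2912.5,118.0", {"Location.X": 2912.5, "Location.Z": 118.0}),
    "Road": (bytes(91 * 1024), None),
    "Sky": (b"unchanged", {"Intensity": 1.0}),
}
report = describe_changes(base, edited)
print("\n".join(report))
```

The unchanged "Sky" export produces no output at all, which is the property that keeps a large map's diff short.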

The Rolling-Checksum Layer Underneath

Even with structural parsing, very large export payloads (high-polygon mesh data, texture mips stored inline) need a fallback. For exports above a configurable size threshold (default: 512 KB), we run a second pass using a content-defined chunker: Rabin fingerprinting over a rolling window, with a 4 KB average chunk size. It belongs to the same rolling-hash family as rsync and librsync, but those tools match fixed-size blocks, whereas a content-defined chunker places block boundaries based on the data itself.

The chunker produces a sequence of content-defined blocks. If an artist edited a mesh and added a new LOD, the early chunks of the mesh payload are likely identical to the base version; only the tail region diverges. The rolling chunker finds and credits those matching regions, which means the stored delta for a "LOD1 added" edit is proportional to the size of LOD1, not the size of the whole mesh file.
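Here is a rough sketch of content-defined chunking for the "LOD added at the tail" case. It is not our production chunker: it swaps in a gear-style rolling hash, a simpler relative of Rabin fingerprinting, and the table seed, size limits, and random test payloads are all assumptions made for the example.

```python
import hashlib
import random
from typing import List

# Gear table: one pseudo-random 64-bit value per byte value. The seed is fixed
# so that chunk boundaries are reproducible from run to run.
_rng = random.Random(1234)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1
BOUNDARY_BITS = 12   # a cut point fires with probability about 1/4096 per byte
MIN_CHUNK = 1024     # never cut before this many bytes
MAX_CHUNK = 16384    # always cut at this many bytes


def chunk(data: bytes) -> List[bytes]:
    """Split data at positions chosen by a rolling hash of the recent bytes."""
    chunks: List[bytes] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 64-bit state, so the hash depends only
        # on roughly the last 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        at_boundary = size >= MIN_CHUNK and (h >> (64 - BOUNDARY_BITS)) == 0
        if at_boundary or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def random_bytes(seed: int, size: int) -> bytes:
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


base_mesh = random_bytes(1, 200 * 1024)   # hypothetical existing mesh payload
lod1 = random_bytes(2, 50 * 1024)         # hypothetical new LOD appended to it
new_mesh = base_mesh + lod1

old_chunks = chunk(base_mesh)
new_chunks = chunk(new_mesh)
known = {digest(c) for c in old_chunks}
bytes_to_store = sum(len(c) for c in new_chunks if digest(c) not in known)
print(f"{len(new_chunks)} chunks, {bytes_to_store} new bytes for a {len(lod1)}-byte LOD")
```

Every chunk before the old payload's final partial chunk is reused, so the bytes to store are bounded by the LOD size plus at most one maximum-size chunk. With these limits the average chunk lands somewhat above 4 KB because of the minimum size.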

One important detail: we apply the rolling chunker on the decompressed export payload, then recompress when writing to the object store. The compression happens after the diff-stable chunking, not before. This sequence matters — inverting it would give you the noise problem described earlier.
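The ordering can be illustrated with a toy content-addressed store. This is a sketch under assumptions, not our object store: `ChunkStore` is a hypothetical class, SHA-256 is an arbitrary choice of chunk identity, and zlib stands in for whatever compressor is used at write time.

```python
import hashlib
import zlib
from typing import Dict, Iterable, List


class ChunkStore:
    """Toy store: chunk identity comes from the uncompressed bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()  # hash BEFORE compressing
        if key not in self._objects:
            # Compression happens only here, at write time, so it never
            # influences which chunks are considered identical.
            self._objects[key] = zlib.compress(chunk)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self._objects[key])

    def put_payload(self, chunks: Iterable[bytes]) -> List[str]:
        return [self.put(c) for c in chunks]

    def restore(self, keys: Iterable[str]) -> bytes:
        return b"".join(self.get(k) for k in keys)

    def __len__(self) -> int:
        return len(self._objects)


store = ChunkStore()
v1 = [b"a" * 5000, b"b" * 5000, b"c" * 5000]
v2 = [v1[0], b"B" * 5000, v1[2]]          # only the middle chunk was edited
v1_keys = store.put_payload(v1)
v2_keys = store.put_payload(v2)
print(f"{len(store)} stored objects for {len(v1) + len(v2)} logical chunks")
```

Two versions of three chunks each cost four stored objects, because the two untouched chunks hash to keys that already exist.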

A Concrete Example: A 340 MB Open-World Level

Consider a scenario common in mid-size Unreal teams: a persistent world map, call it Maps/Overworld_P.umap, that has grown to around 340 MB as the scene accumulates streaming sub-levels, dynamic lighting actors, and foliage placements. Two level designers are working on it in separate branches — one adjusting foliage density in the northern biome, one updating spawn logic for an event trigger in the southern district.

With a naive binary diff tool, both commits show as "binary files differ, 340 MB." With Diversion's structural parser, the foliage edit shows as changes to roughly 40 export entries in the InstancedStaticMeshComponent arrays. The spawn logic edit shows as changes to 3 Blueprint event graph exports. The rest of the 340 MB — static geometry, lighting data, unmodified actors — shows zero diff. Storage delta for the foliage commit: ~2.8 MB. Storage delta for the spawn logic commit: ~180 KB.

The practical implication isn't just storage savings. It's that your history is legible. When you're debugging why a streaming sub-level started failing its distance culling check three weeks ago, you can scan commit messages and diff output instead of binary-bisecting the level file.

What We Still Can't Diff Meaningfully

Structural parsing has limits. Unreal's asset serializer doesn't expose a stable, documented schema for every asset type — some internal asset formats (particularly render data cached inside UStaticMesh assets) are opaque to tooling outside the engine itself. For those sections, we fall back to block-level hashing with no semantic interpretation.

Texture assets (UTexture2D objects stored inside .uasset packages) store mip data in block-compressed formats like BC7 or ASTC. Diffing at the texel level would require decoding the mip chain of both versions, which is expensive and not currently on our roadmap. We show "texture payload changed" with a size delta, and that's the honest limit of what we surface for texture edits.

We should also be clear about heavily instanced Blueprints. When a single Blueprint class has hundreds of actor instances in a map, each with per-instance property overrides, the resulting export tables are legitimately large and slow to parse. If your depot has 50 sub-levels each containing thousands of foliage instance overrides, our parse time per asset will be higher than for simpler scenes. We tune parse timeouts and have a fast path for assets whose top-level hash hasn't changed, but it's worth knowing the cost model.
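The fast path amounts to checking one hash before doing any parsing. A minimal sketch, assuming a recorded whole-file hash is available from the previous version; the function names and the SHA-256 choice are hypothetical.

```python
import hashlib
from typing import Callable, List


def top_level_hash(asset_bytes: bytes) -> str:
    return hashlib.sha256(asset_bytes).hexdigest()


def diff_with_fast_path(
    recorded_hash: str,
    new_bytes: bytes,
    structural_diff: Callable[[bytes], List[str]],
) -> List[str]:
    """Skip the expensive structural parse when the whole file is unchanged."""
    if top_level_hash(new_bytes) == recorded_hash:
        return []  # identical file: nothing to parse, nothing to report
    return structural_diff(new_bytes)


slow_calls = 0


def expensive_structural_diff(asset_bytes: bytes) -> List[str]:
    # Placeholder for the real parse; it only counts how often it is invoked.
    global slow_calls
    slow_calls += 1
    return [f"parsed {len(asset_bytes)} bytes"]


previous = b"sub-level with thousands of foliage overrides"
recorded = top_level_hash(previous)

unchanged = diff_with_fast_path(recorded, previous, expensive_structural_diff)
changed = diff_with_fast_path(recorded, previous + b"!", expensive_structural_diff)
print(unchanged, changed, slow_calls)
```

Only the modified asset reaches the slow path, so in this model the parse cost scales with the number of changed assets rather than with depot size.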

Storage and Transfer Implications

The delta storage model compounds over time. A 300 MB .umap that receives incremental level design edits for 6 months — actors added, positions tweaked, lighting parameters adjusted — will accumulate significantly less than 300 MB × commit_count in your object store. In practice, for projects with regular incremental editing patterns, we see delta chains where the typical per-commit delta is 1–15% of the base asset size, depending on the edit intensity.
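A back-of-the-envelope model of that range. The commit count is a hypothetical figure, and the model deliberately ignores compression, resaves, and deltas of varying size.

```python
def full_copy_storage_mb(base_mb: float, commits: int) -> float:
    """Every commit stores a complete copy of the asset."""
    return base_mb * commits


def delta_chain_storage_mb(base_mb: float, commits: int, delta_fraction: float) -> float:
    """One base snapshot, then a delta of `delta_fraction` of the base per later commit."""
    return base_mb + base_mb * delta_fraction * (commits - 1)


BASE_MB = 300.0
COMMITS = 120  # hypothetical: roughly one edit per working day for 6 months

full = full_copy_storage_mb(BASE_MB, COMMITS)
low = delta_chain_storage_mb(BASE_MB, COMMITS, 0.01)
high = delta_chain_storage_mb(BASE_MB, COMMITS, 0.15)
print(f"full copies: {full:.0f} MB, delta chain: {low:.0f} to {high:.0f} MB")
```

Under these assumptions the delta chain occupies about 657 MB at 1% per commit and about 5,655 MB at 15%, against 36,000 MB for full copies.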

Transfer works the same way. When a developer syncs a branch they haven't touched in two weeks, Diversion computes the delta between their last-synced state and HEAD, transfers only the changed blocks, and reconstructs the full asset locally. This is the same property that makes rsync efficient for file synchronization — applied at the structured-export layer rather than the raw byte layer.
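In outline, a sync is a set difference over block identifiers followed by a reassembly. The sketch below is a simplification with hypothetical names; a manifest here is just an ordered list of block ids, and the "remote" is a dictionary.

```python
from typing import Dict, List, Set


def blocks_to_fetch(head_manifest: List[str], local_blocks: Set[str]) -> List[str]:
    """Block ids required by HEAD that are not already present locally."""
    missing: List[str] = []
    seen: Set[str] = set()
    for block_id in head_manifest:
        if block_id not in local_blocks and block_id not in seen:
            seen.add(block_id)
            missing.append(block_id)
    return missing


def reconstruct(head_manifest: List[str], blocks: Dict[str, bytes]) -> bytes:
    """Rebuild the full asset by concatenating blocks in manifest order."""
    return b"".join(blocks[block_id] for block_id in head_manifest)


# Blocks the developer already has from their last sync.
local = {"h1": b"header", "g1": b"geometry", "l1": b"lighting-v1"}
# What HEAD looks like now, and the blocks only the server has.
head_manifest = ["h1", "g1", "l2", "f1"]
remote = {"l2": b"lighting-v2", "f1": b"foliage"}

needed = blocks_to_fetch(head_manifest, set(local))
local.update({block_id: remote[block_id] for block_id in needed})
asset = reconstruct(head_manifest, local)
print(needed, len(asset))
```

Only the two blocks that changed since the last sync cross the network; the header and geometry blocks are reused from local storage.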

The place this story gets complicated is after major Unreal Engine version upgrades. Engine version bumps often trigger a full asset resave pass, which reserializes every asset in the depot and can invalidate most of your delta chain at the object store level — effectively requiring a new base snapshot. This is a known cost in game development with VCS tooling, not something specific to Diversion. Plan for a storage event around every major engine upgrade, and keep a resave commit isolated so its diff noise doesn't contaminate your normal history.