bcmr v0.6.0

Streaming Checkpoint Copy

bcmr implements a Streaming Checkpoint Copy (SCC) algorithm that unifies three capabilities no existing cp-class tool provides together: inline integrity hashing at zero extra I/O, crash-safe resumable state, and constant-time resume verification.

This document describes the algorithm design with formal analysis, presents ablation experiments across macOS and Linux validating each design decision, and compares with prior art.

Problem Statement

File copying with integrity verification faces a fundamental I/O trade-off. Let $S$ denote the source file of size $n$ bytes and $D$ the destination.

| Operation | I/O Passes | Total Bytes Read/Written |
|---|---|---|
| `cp S D` | 1R + 1W | $2n$ |
| `cp S D && sha256sum S D` | 3R + 1W | $4n$ |
| `rsync --checksum S D` | 2R + 1W | $3n$ (both sides hash) |

The verification tax is steep: confirming a copy requires at least one whole-file re-read. For a 100 GB file on a 500 MB/s drive, each verification pass adds 200 seconds of pure re-reading.

Resume after interruption is worse. Existing tools verify the written prefix by re-hashing it entirely:

$$T_{\text{resume-verify}} = \mathcal{O}(k) \quad \text{where } k = \text{bytes already written}$$

For a 90% complete 100 GB transfer, this means re-reading 90 GB just to confirm the prefix is intact.

Algorithm Design

Core Insight

BLAKE3 achieves 1--5 GB/s on modern hardware (NEON / AVX-512), which exceeds the throughput of most storage devices. Hashing is no longer the bottleneck --- disk I/O is. This means we can compute the source hash during the copy at effectively zero marginal cost.

Data Flow

SCC data flow

The two hashers run in lockstep: src_hasher accumulates a single BLAKE3 over the whole file (used for -V and for cross-run resume detection); block_hasher is reset every 4 MiB so the session file can record per-block hashes. See Experiment 10 for when the source hasher is skipped.
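The lockstep loop can be sketched as follows. This is an illustrative Python sketch, not bcmr's actual code: `hashlib.sha256` stands in for BLAKE3, and the function name is hypothetical.

```python
# Illustrative sketch of the dual-hasher copy loop.
# hashlib.sha256 stands in for BLAKE3; bcmr itself is not Python.
import hashlib

BLOCK = 4 * 1024 * 1024  # 4 MiB block size

def copy_with_inline_hashes(src: str, dst: str):
    """Copy src -> dst, returning (whole-file hash, per-block hashes)."""
    src_hasher = hashlib.sha256()   # accumulates over the entire file, never reset
    block_hashes = []               # one digest per 4 MiB block, for the session file
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(BLOCK)
            if not block:
                break
            fout.write(block)
            src_hasher.update(block)                              # whole-file hash
            block_hashes.append(hashlib.sha256(block).digest())   # fresh per block
    return src_hasher.hexdigest(), block_hashes
```

The key point the sketch shows: both hashers consume the same buffer the copy loop already holds, so no extra read pass is needed for either hash.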

Session File

The session file persists the copy state across crashes. It uses a compact binary format:

Session Layout

For a file of $n$ bytes with block size $b = 4\,\text{MB}$:

$$|\text{session}| = 256 + 32 \cdot \lceil n / b \rceil \quad \text{bytes}$$

| File Size | Blocks | Session Size | Overhead |
|---|---|---|---|
| 1 GB | 256 | 8.2 KB | $7.6 \times 10^{-6}$ |
| 10 GB | 2,560 | 80 KB | $7.6 \times 10^{-6}$ |
| 100 GB | 25,600 | 800 KB | $7.6 \times 10^{-6}$ |
| 1 TB | 262,144 | 8 MB | $7.6 \times 10^{-6}$ |

Session Overhead

The overhead converges to $32/b \approx 7.6 \times 10^{-6}$ as the fixed metadata becomes negligible.
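The size formula is easy to check numerically. A minimal sketch, with constants taken from the layout above (binary GiB/TiB block counts assumed, matching the table):

```python
import math

HEADER = 256              # fixed metadata bytes (session header)
HASH = 32                 # one 32-byte digest per block
BLOCK = 4 * 1024 * 1024   # 4 MiB block size

def session_size(n: int) -> int:
    """|session| = 256 + 32 * ceil(n / b) bytes."""
    return HEADER + HASH * math.ceil(n / BLOCK)

# 1 GiB file -> 256 blocks -> 256 + 32*256 = 8448 bytes (~8.2 KB)
print(session_size(1 << 30))
```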

Crash Safety Invariant

The write ordering ensures a strict invariant:

Crash Safety

Let $S$ be the session and $B_k$ the $k$-th block on disk. After each checkpoint:

$$\forall\, k < S.n: \quad B_k \text{ is durable on disk} \;\wedge\; H(B_k) = S.\text{hashes}[k]$$

The ordering is:

  1. write(dst, block_data) --- block in page cache
  2. fdatasync(dst) --- block durable on media
  3. session.save() via atomic write-fsync-rename --- session updated

If a crash occurs at any point:

  • Before step 2: Block not on disk. Session unchanged. Block is recopied.
  • Between steps 2 and 3: Block on disk, session old. Block is recopied (redundant but correct).
  • After step 3: Both durable. Resume from $B_{k+1}$.

No state can be reached where the session claims a block is complete but the block is not on disk.

Crash recovery state machine

The two recovery branches that recopy a block are wasteful but correct; the only forbidden state ("session says k is durable, but k isn't on disk") is unreachable.
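The ordering above can be sketched in a few lines. This is an illustrative Python sketch with hypothetical names: `os.fsync` stands in for `fdatasync`/`F_FULLFSYNC`, and JSON stands in for the real binary session format.

```python
# Sketch of the checkpoint write ordering (hypothetical names).
# os.fsync stands in for fdatasync/F_FULLFSYNC; JSON for the binary format.
import json
import os

def save_session_atomic(path: str, state: dict) -> None:
    """Step 3: persist session state via write -> fsync -> rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # session bytes durable before the rename
    os.replace(tmp, path)         # atomic: readers see old or new, never torn
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)             # make the directory entry durable too
    finally:
        os.close(dfd)

def checkpoint(dst_file, session_path: str, state: dict) -> None:
    dst_file.flush()
    os.fsync(dst_file.fileno())   # step 2: blocks durable on media first
    save_session_atomic(session_path, state)  # step 3: only now record them
```

Because the session is replaced only after the data sync returns, a crash at any point leaves either the old session (block gets recopied) or the new one (block is provably durable).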

Resume: Tail-Block Verification

On resume, we exploit the invariant. All blocks except possibly the last are guaranteed by the checkpoint ordering. We only verify the tail block --- the one that was being written when the crash occurred:

$$T_{\text{resume}} = T_{\text{hash}}(b) = \mathcal{O}(1) \quad \text{independent of } k$$

versus full prefix rehash:

$$T_{\text{old-resume}} = T_{\text{hash}}(k) = \mathcal{O}(k)$$
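Tail-block verification then reduces to hashing a single block at a known offset. An illustrative sketch (sha256 standing in for BLAKE3; names hypothetical):

```python
# O(1) resume verification sketch: only the last checkpointed block is
# re-hashed; earlier blocks are guaranteed by the write-ordering invariant.
import hashlib

BLOCK = 4 * 1024 * 1024  # 4 MiB

def verify_tail(dst_path: str, blocks_done: int, tail_hash: bytes) -> bool:
    """Compare the tail block on disk against the session's recorded hash."""
    if blocks_done == 0:
        return True                       # nothing written yet, nothing to verify
    offset = (blocks_done - 1) * BLOCK    # start of the last recorded block
    with open(dst_path, "rb") as f:
        f.seek(offset)
        tail = f.read(BLOCK)              # may be short if the block is partial
    return hashlib.sha256(tail).digest() == tail_hash
```

The work done is one seek plus one block hash, regardless of how many gigabytes precede the tail.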

2-Pass Verified Copy

Since the source hash is computed inline during the copy, the -V verification mode needs only one additional pass (re-read destination to hash it), not two:

| Mode | I/O Passes | Total Bytes |
|---|---|---|
| Old -V: copy, hash src, hash dst | 3R + 1W | $4n$ |
| New -V: copy+hash src, hash dst | 2R + 1W | $3n$ |
| Saving | 1 full read eliminated | $n$ bytes, 25% of total |
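The second pass of the new -V mode only re-reads the destination and compares it against the hash computed inline during the copy. A minimal sketch (sha256 standing in for BLAKE3; the function name is hypothetical):

```python
# Pass 2 of the 2-pass -V mode: re-read only the destination.
import hashlib

def verify_copy(dst_path: str, inline_src_hash: str, chunk: int = 1 << 20) -> bool:
    """Hash the destination and compare with the source hash obtained
    for free during the copy pass."""
    h = hashlib.sha256()
    with open(dst_path, "rb") as f:
        while data := f.read(chunk):
            h.update(data)
    return h.hexdigest() == inline_src_hash
```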

I/O Complexity

Durable Sync

On macOS, fsync() only flushes data from the OS buffer cache to the drive's write cache --- it does not issue a cache flush command to the drive controller. Data can be lost on power failure. F_FULLFSYNC via fcntl() issues a full barrier.

bcmr uses F_FULLFSYNC on macOS and fdatasync() on Linux (where data=ordered mode on ext4/XFS provides sufficient ordering guarantees). After every atomic rename, the parent directory is also fsynced to ensure the directory entry is durable.
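A portable durable-sync helper following the platform split described above might look like this (a sketch, not bcmr's actual code):

```python
# Durable sync sketch: F_FULLFSYNC on macOS, fdatasync elsewhere.
import os
import sys

def durable_sync(fd: int) -> None:
    """Flush a file descriptor through to stable storage."""
    if sys.platform == "darwin":
        import fcntl
        # fsync on macOS stops at the drive's write cache; F_FULLFSYNC
        # issues a full barrier to the media.
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    elif hasattr(os, "fdatasync"):
        os.fdatasync(fd)   # data + size metadata; skips atime/mtime flush
    else:
        os.fsync(fd)       # conservative fallback
```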

Page Cache Management

Large copies pollute the page cache, evicting unrelated cached data. On Linux, bcmr calls posix_fadvise(FADV_DONTNEED) at each checkpoint interval to evict already-copied pages from both source and destination file descriptors.
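A minimal sketch of the eviction call (Linux-only; a no-op where `posix_fadvise` is unavailable):

```python
# Evict already-copied pages so large copies don't flush the page cache.
import os

def drop_cached_range(fd: int, offset: int, length: int) -> None:
    """Advise the kernel that this byte range will not be needed again."""
    if hasattr(os, "posix_fadvise"):  # Linux; absent on macOS
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
```

In the copy loop this would be called at each checkpoint interval on both the source and destination descriptors, for the range just synced.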


Ablation Experiments

All experiments use median of 3--5 runs. File data is pseudo-random ((i*7+13) mod 256) to prevent compression and deduplication artifacts.

Test environments:

  • macOS: Apple Silicon, APFS SSD
  • Linux: Intel Xeon Gold 6238R (AVX-512), NVMe SSD (Samsung), ext4

Experiment 1: Inline BLAKE3 Hash Overhead

Hypothesis: BLAKE3 throughput exceeds storage I/O, so inline hashing adds negligible wall-clock time.

Method: Copy files of size $n \in \{16, 64, 256, 512, 1024\}$ MB in three modes: (A) copy only, (B) copy + hasher.update(), (C) copy + hash + hasher.clone() per block.

BLAKE3 Throughput

| Platform | BLAKE3 Throughput | Bottleneck |
|---|---|---|
| macOS (NEON) | ~1.0 GB/s | CPU-bound for fast SSD |
| Linux (AVX-512) | ~5.4 GB/s | Always I/O-bound |

On Linux, BLAKE3 at 5.4 GB/s exceeds NVMe peak (~3.5 GB/s). Inline hashing is truly free --- the CPU finishes hashing before the next disk read completes.

On macOS, BLAKE3 at ~1 GB/s is comparable to SSD speed, so warm-cache tests show 8--56% overhead. In cold-cache (real-world) scenarios, disk latency dominates and the overhead shrinks toward the Linux numbers.

Experiment 2: 2-Pass vs 3-Pass Verification

Hypothesis: Eliminating one full-file read cuts verified-copy I/O from $4n$ to $3n$ bytes: a theoretical $1.33\times$ speedup, i.e. 25% less total I/O.

| File Size | 3-pass | 2-pass | Speedup | Theoretical |
|---|---|---|---|---|
| 64 MB | 171 ms | 163 ms | 1.05x | 1.33x |
| 256 MB | 654 ms | 623 ms | 1.05x | 1.33x |
| 512 MB | 1426 ms | 1251 ms | 1.14x | 1.33x |

Warm-cache results show 5--14% savings (the page cache masks the eliminated read). With cold cache, the speedup converges toward the theoretical 1.33x.

Experiment 3: Tail-Block vs Full Prefix Rehash

Hypothesis: $T_{\text{tail}} = \mathcal{O}(1)$ while $T_{\text{full}} = \mathcal{O}(k)$.

Resume Verification

| Written | Full Rehash (macOS / Linux) | Tail-Block | Speedup |
|---|---|---|---|
| 48 MB | 50.7 / 15.3 ms | 4.4 / 1.6 ms | 11x / 10x |
| 192 MB | 198.2 / 62.5 ms | 4.5 / 1.8 ms | 44x / 34x |
| 384 MB | 396.2 / 122.2 ms | 5.1 / 1.8 ms | 78x / 66x |
| 768 MB | 817.3 / 240.0 ms | 5.6 / 1.8 ms | 145x / 131x |

Tail-block verification is constant at ~5 ms (macOS) / ~1.8 ms (Linux) regardless of file size. The speedup grows linearly with $k$ as predicted.

Experiment 4: Sync Interval Overhead

Hypothesis: There exists an interval $I$ where the fsync overhead is acceptable ($< 20\%$) and worst-case rework on crash is bounded.

Sync Interval

| Interval | macOS Overhead | Linux Overhead | Max Rework |
|---|---|---|---|
| 4 MB | +225% | +37% | 4 MB |
| 16 MB | +58% | +12.5% | 16 MB |
| 64 MB | +16% | +3.9% | 64 MB |
| 256 MB | +9% | +0.8% | 256 MB |

64 MB (16 blocks) is the chosen default: $\leq 16\%$ overhead on both platforms, at most 64 MB of rework (~0.1 s on NVMe).

Experiment 5: F_FULLFSYNC vs fsync on macOS

Hypothesis: F_FULLFSYNC costs negligibly more than fsync on Apple Silicon.

F_FULLFSYNC Comparison

| File Size | fsync | F_FULLFSYNC | Difference |
|---|---|---|---|
| 4 MB | 7.0 ms | 6.0 ms | -14% |
| 16 MB | 12.0 ms | 13.9 ms | +16% |
| 64 MB | 33.0 ms | 34.1 ms | +3% |
| 256 MB | 143.5 ms | 125.0 ms | -13% |

Differences are within noise. F_FULLFSYNC provides correct durability guarantees at no measurable performance cost. SQLite, RocksDB, and PostgreSQL all use F_FULLFSYNC on macOS.

Experiment 6: copy_file_range with Offset (Linux)

Hypothesis: The kernel fast path supports non-zero offsets for resume, avoiding userspace buffer copies.

copy_file_range Resume

| File Size | read/write | copy_file_range | Speedup |
|---|---|---|---|
| 64 MB | 52 ms | 42 ms | 1.24x |
| 256 MB | 185 ms | 171 ms | 1.08x |
| 512 MB | 356 ms | 323 ms | 1.10x |

8--24% faster on NVMe. The benefit would be larger on slower media or network filesystems where zero-copy matters more.
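Resuming at a byte offset with the kernel fast path can be sketched as follows (illustrative Python, not bcmr's Rust implementation; falls back to read/write where `os.copy_file_range` is unavailable):

```python
# Resume a copy from a byte offset, preferring the zero-copy kernel path.
import os

def resume_copy(src: str, dst: str, offset: int, chunk: int = 1 << 20) -> None:
    """Copy src[offset:] into dst at the same offset."""
    sfd = os.open(src, os.O_RDONLY)
    dfd = os.open(dst, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        size = os.fstat(sfd).st_size
        use_cfr = hasattr(os, "copy_file_range")  # Linux >= 4.5
        pos = offset
        while pos < size:
            if use_cfr:
                # Explicit offsets: the fds' own positions are untouched.
                n = os.copy_file_range(sfd, dfd, chunk, pos, pos)
            else:
                os.lseek(sfd, pos, os.SEEK_SET)
                buf = os.read(sfd, chunk)
                os.lseek(dfd, pos, os.SEEK_SET)
                n = os.write(dfd, buf)
            if n == 0:
                break
            pos += n
    finally:
        os.close(sfd)
        os.close(dfd)
```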


Comparison with Prior Art

Tool Comparison

| Capability | cp | rsync | curl -C | aria2 | bcmr (SCC) |
|---|---|---|---|---|---|
| Resume granularity | None | Block rolling | Byte offset | 16 KiB bitmap | 4 MB blocks |
| Resume verification | N/A | $\mathcal{O}(n)$ rolling | None | Piece hash (if available) | $\mathcal{O}(1)$ tail-block |
| State persistence | None | None | None | .aria2 control file | Binary session file |
| Crash safety | None | Partial file left | Partial file left | Good (bitmap) | fdatasync ordering invariant |
| Source change detection | None | mtime+size | None | If-Modified-Since | mtime+size+inode (session) |
| Inline hash | No | No | No | No | Always-on BLAKE3 |
| Verify I/O cost | N/A | $3n$ | N/A | $2n$ (with piece hashes) | $2n$ (inline src hash) |

Key differentiators:

  • Constant-time resume verification --- no other cp-class tool achieves $\mathcal{O}(1)$.
  • Always-on source hash --- verification is a byproduct of copying, not a separate pass.
  • Formal crash safety --- write ordering invariant with F_FULLFSYNC / fdatasync + directory fsync.

Summary

| Decision | Measured Cost | Measured Benefit |
|---|---|---|
| Always-on BLAKE3 | 0--15% CPU (hidden by I/O) | Free source hash |
| Session file | $< 8 \times 10^{-6}$ of file size | Crash-safe resume |
| 64 MB checkpoint | 4--16% overhead | $\leq$ 64 MB rework |
| Tail-block verify | 1.8--5.6 ms constant | 50--145x vs full rehash |
| 2-pass -V | 0% (saves I/O) | 25% less total I/O |
| F_FULLFSYNC | ~0% (single file) | Correct macOS durability |
| copy_file_range offset | 0% (saves I/O) | 8--24% faster resume |