bcmr v0.6.0

Internals

These pages document the design decisions inside bcmr that aren't visible from the CLI surface: the measurements behind the decisions we shipped, the work designed but not yet shipped, and the explicit non-goals we've decided against. The structure mirrors the layers of the tool:

  • Streaming Checkpoint Copy is the algorithm at the heart of every single-file copy: 4 MiB blocks, inline BLAKE3, a 64 MiB checkpoint interval with crash-safe write ordering, O(1) resume verification, and the kernel fast paths (copy_file_range, clonefile/reflink) we use when available.

  • Local Multi-File Performance covers the layers above that algorithm: per-file fsync gating, file-level --jobs concurrency, and skipping the whole-source hash when nothing reads it. These are the changes that took bcmr from "13× slower than cp on a 2100-file repo" to "1.5× faster than cp, 2× faster than rsync" without touching the per-file path.

  • Wire Protocol & Remote Transfers is the binary frame protocol that drives bcmr serve, the per-worker SSH connection design, the negotiated wire compression (LZ4 + Zstd-3 with auto-skip), and the content-addressed dedup for repeat PUTs.

  • Path B: Direct-TCP + AEAD Data Plane is the design of the optional direct-TCP data plane that bypasses SSH's single-stream crypto ceiling using AES-256-GCM over a rendezvous-established TCP connection.

  • Non-Goal: Rolling-Checksum Delta-Sync explains why bcmr will not implement rsync's signature-based delta transfer. Three pieces: the 1996 trade-off has flipped, the workloads that need byte-precise delta have already moved to CDC-based tools, and even after paying the engineering cost, bcmr still wouldn't be an rsync replacement.

  • Open Questions lists work that's designed but not shipped: splice(2) zero-copy, io_uring reads, CAS LRU eviction, and the failed pipelined-hashing experiment.
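
The first bullet packs the write-ordering guarantees into one sentence. Here is a minimal sketch of that inner loop, under stated assumptions: the names (`checkpoint_copy`, a text-only checkpoint file) are invented for illustration, and std's `DefaultHasher` stands in for BLAKE3 so the example needs no crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{Read, Write};

const BLOCK: usize = 4 * 1024 * 1024;           // 4 MiB read/write unit
const CHECKPOINT_EVERY: u64 = 64 * 1024 * 1024; // 64 MiB between checkpoints

/// Sketch: copy `src` to `dst`, hashing each block inline, and persist a
/// checkpoint with crash-safe ordering. Returns (bytes copied, running hash).
fn checkpoint_copy(src: &str, dst: &str, ckpt: &str) -> std::io::Result<(u64, u64)> {
    let mut input = File::open(src)?;
    let mut output = File::create(dst)?;
    // DefaultHasher is a stand-in for BLAKE3 in this sketch.
    let mut hasher = DefaultHasher::new();
    let mut buf = vec![0u8; BLOCK];
    let (mut copied, mut since_ckpt) = (0u64, 0u64);
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        buf[..n].hash(&mut hasher); // inline hash: no second read pass
        output.write_all(&buf[..n])?;
        copied += n as u64;
        since_ckpt += n as u64;
        if since_ckpt >= CHECKPOINT_EVERY {
            // Crash-safe ordering: (1) flush data, (2) write the checkpoint
            // to a temp file, (3) atomically rename it into place.
            output.sync_data()?;
            let tmp = format!("{ckpt}.tmp");
            fs::write(&tmp, copied.to_string())?;
            fs::rename(&tmp, ckpt)?;
            since_ckpt = 0;
        }
    }
    output.sync_data()?;
    Ok((copied, hasher.finish()))
}
```

The ordering is the crash-safety argument: data is flushed before the checkpoint is renamed into place, so a checkpoint never claims bytes that aren't durable. A real implementation would also persist the hasher state so resume can continue the inline hash.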

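The O(1) resume verification mentioned in the first bullet can be sketched the same way: on resume, only the final block before the checkpointed offset is rehashed and compared, which is what makes the check constant-work no matter how much has already been copied. Names and the `DefaultHasher` stand-in for BLAKE3 are illustrative, not bcmr's actual API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::{Hash, Hasher};
use std::io::{Read, Seek, SeekFrom};

const BLOCK: u64 = 4 * 1024 * 1024; // 4 MiB block size

// DefaultHasher stands in for BLAKE3 in this sketch.
fn hash_block(buf: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    buf.hash(&mut h);
    h.finish()
}

/// O(1) resume check: rehash only the final block before the checkpointed
/// offset and compare it to the hash recorded at checkpoint time, instead
/// of rehashing the entire prefix already written to `dst`.
fn tail_block_matches(dst: &str, ckpt_offset: u64, expected: u64) -> std::io::Result<bool> {
    let mut f = File::open(dst)?;
    let tail_start = ckpt_offset.saturating_sub(BLOCK);
    f.seek(SeekFrom::Start(tail_start))?;
    let mut buf = vec![0u8; (ckpt_offset - tail_start) as usize];
    f.read_exact(&mut buf)?;
    Ok(hash_block(&buf) == expected)
}
```

A full-prefix rehash costs work proportional to the bytes already copied; this check reads at most one block, which is where the 50–145× resume-verify speedups in the summary table come from.
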
See Also (design discussion in context)

Not every design rationale lives under Internals — some is in documents that serve a different primary audience:

  • README — What bcmr does that cp and scp don't is the user-facing capability narrative (integrity defaults, resume semantics, unified local/SSH CLI, JSON for agents) that this Internals section measures and justifies.
  • Remote Copy guide — How It Works is the two-paragraph summary of the serve vs legacy split, aimed at a user deciding which mode their transfer will pick.
  • SECURITY.md lists which internal components are in-scope for vulnerability reports (direct-TCP data plane, rendezvous auth, session file format, CAS). The threat-model discussion for AEAD framing and session keys lives on the Path B page above.

Cross-Cutting Summary

Most rows below have their own experiment with measured numbers; click through to the relevant page for the table. Every number here is from a specific workload — the benefit column notes the conditions so the row can't be lifted context-free. Rows marked non-goal are design decisions without a measurement (the reasoning lives on the linked page).

| Decision | Lives on | Measured benefit (workload) |
| --- | --- | --- |
| Always-on BLAKE3 (single hash) | SCC | Free on Linux AVX-512 (~5 GB/s > NVMe); 8–56 % overhead on macOS NEON, warm cache |
| Tail-block resume verify | SCC | 50–145× vs full prefix rehash, 48–768 MiB written, mac/Linux |
| 64 MiB checkpoint interval | SCC | ≤ 16 % overhead and ≤ 64 MiB rework on both platforms (single-file, warm cache) |
| copy_file_range with offset | SCC | 8–24 % faster resume on Linux NVMe (64–512 MiB) |
| Opt-in per-file fsync | Local Perf | 13× faster (9.9 s → 0.72 s) on 2100 × 4 KiB repo, mac APFS warm cache |
| --jobs parallel local copy | Local Perf | 1.5–1.67× vs -j1 on 10 000 × 64 KiB files, mac APFS warm cache |
| Skip src hash when unused | Local Perf | 28 % off (285 → 205 ms) on 32 MiB streaming no-verify copy, mac APFS |
| Single spawn_blocking copy loop | Local Perf | 2.3× (12.3 → 5.4 s) on 2 GiB streaming copy, Linux NVMe ext4 warm cache |
| Session + checkpoint gated on intent | Local Perf | ~2× (3.9 → 1.89 s) on 1 GiB streaming copy, mac APFS; lands within 1.65× of cp |
| Per-worker SSH connections | Wire | Up to ~6× parallel throughput (not re-measured on this tree; mscp's 8-conn 100 Gbps figure) |
| Wire compression (Zstd-3) | Wire | 2.48–5.59× vs uncompressed on 64 MiB source text, ~10 MB/s WAN; essentially no cost on incompressible blocks (auto-skip) |
| CAP_DEDUP repeat PUT | Wire | 32 % faster (18.9 → 12.9 s) on 64 MiB re-upload, ~10 MB/s WAN; savings match the bytes not sent |
| CAP_FAST GET | Wire | Mixed: 1.07× on WAN (network-bound); 0.78× (slower) on Linux loopback due to pipe-size + spawn_blocking issues documented in the experiment |
| CAS LRU cap | Wire | Holds CAS ≤ cap under 3× 24 MiB repeat uploads (unit-test-sized; proves the bound, not a speedup) |
| No rolling-checksum delta-sync | Non-Goal | Non-goal: reach for rsync / restic / borg / OCI when byte-precise delta is genuinely required |
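
The Wire rows above all ride on the binary frame protocol described on the Wire Protocol page; the actual frame layout is specified there. As a generic illustration of the length-prefixed framing idea only, this sketch assumes a hypothetical `[len: u32 LE][type: u8][payload]` layout:

```rust
use std::convert::TryInto;

/// Hypothetical frame layout for illustration; the real bcmr wire format
/// is specified on the Wire Protocol page.
/// [ payload_len: u32 LE ][ frame_type: u8 ][ payload ... ]
fn encode_frame(frame_type: u8, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(5 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.push(frame_type);
    out.extend_from_slice(payload);
    out
}

/// Returns None until a complete frame is buffered, so a caller can keep
/// reading from the socket and retry with more bytes.
fn decode_frame(buf: &[u8]) -> Option<(u8, &[u8])> {
    if buf.len() < 5 {
        return None; // header incomplete
    }
    let len = u32::from_le_bytes(buf[0..4].try_into().ok()?) as usize;
    if buf.len() < 5 + len {
        return None; // payload incomplete
    }
    Some((buf[4], &buf[5..5 + len]))
}
```

Length-prefixed frames are what make the negotiated compression and dedup capabilities composable: each feature can rewrite or skip a frame's payload without the peer needing to resynchronize the stream.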