Internals
These pages document the design decisions inside bcmr that aren't visible from the CLI surface — the measurements behind the decisions that shipped, the work that was designed but not shipped, and the explicit non-goals we decided against. The structure mirrors the layers of the tool:
- **Streaming Checkpoint Copy** is the algorithm at the heart of every single-file copy: 4 MiB blocks, inline BLAKE3, a 64 MiB checkpoint interval with crash-safe write ordering, resume verification, and the kernel fast paths (`copy_file_range`, `clonefile`/reflink) we take when available.
- **Local Multi-File Performance** covers the parts above that algorithm: per-file fsync gating, file-level `--jobs` concurrency, and skipping the whole-source hash when nothing reads it. These are the changes that took bcmr from "13× slower than `cp` on a 2100-file repo" to "1.5× faster than `cp`, 2× faster than `rsync`" without touching the per-file path.
- **Wire Protocol & Remote Transfers** is the binary frame protocol that drives `bcmr serve`, the per-worker SSH connection design, the negotiated wire compression (LZ4 + Zstd-3 with auto-skip), and the content-addressed dedup for repeat PUTs.
- **Path B: Direct-TCP + AEAD Data Plane** is the design of the optional direct-TCP data plane that bypasses SSH's single-stream crypto ceiling using AES-256-GCM over a rendezvous-established TCP connection.
- **Non-Goal: Rolling-Checksum Delta-Sync** explains why bcmr will not implement rsync's signature-based delta transfer. Three pieces: the 1996 trade-off has flipped, the workloads that need byte-precise delta have already moved to CDC-based tools, and the engineering cost wouldn't make bcmr an rsync replacement afterward.
- **Open Questions** lists work that's designed but not shipped: `splice(2)` zero-copy, `io_uring` reads, CAS LRU eviction, and the failed pipelined-hashing experiment.
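The streaming-checkpoint loop can be sketched as follows. This is an illustrative reconstruction, not bcmr's code: block size and checkpoint interval are parameters here (bcmr uses 4 MiB and 64 MiB), and std's `DefaultHasher` stands in for BLAKE3 so the sketch has no dependencies. What it does show accurately is the crash-safe ordering the SCC page describes: data is synced before the checkpoint that claims it, and the checkpoint is replaced atomically.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File, OpenOptions};
use std::hash::{Hash, Hasher};
use std::io::{Read, Write};

/// Sketch of a streaming copy with inline hashing and periodic
/// crash-safe checkpoints. Hypothetical helper, not bcmr's API.
fn copy_with_checkpoints(
    src: &str,
    dst: &str,
    ckpt: &str,
    block_size: usize,   // bcmr: 4 MiB
    ckpt_interval: u64,  // bcmr: 64 MiB
) -> std::io::Result<u64> {
    let mut reader = File::open(src)?;
    let mut writer = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(dst)?;
    let mut hasher = DefaultHasher::new(); // stand-in for BLAKE3
    let mut buf = vec![0u8; block_size];
    let mut written: u64 = 0;
    let mut since_ckpt: u64 = 0;

    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        writer.write_all(&buf[..n])?; // write the block...
        buf[..n].hash(&mut hasher);   // ...and hash it in the same pass
        written += n as u64;
        since_ckpt += n as u64;

        if since_ckpt >= ckpt_interval {
            // Crash-safe ordering: make the data durable *before*
            // writing the checkpoint that claims it exists, then
            // replace the checkpoint atomically via rename.
            writer.sync_data()?;
            let tmp = format!("{ckpt}.tmp");
            fs::write(&tmp, format!("{written} {}", hasher.finish()))?;
            fs::rename(&tmp, ckpt)?;
            since_ckpt = 0;
        }
    }
    writer.sync_all()?;
    let _ = fs::remove_file(ckpt); // copy complete: checkpoint no longer needed
    Ok(written)
}
```

On resume, a loop like this is why the tail-block verify matters: only the last block before the checkpointed offset needs re-reading to confirm the hash state, rather than rehashing the whole prefix.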
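For readers new to binary frame protocols like the one behind `bcmr serve`: the general shape is a small fixed header followed by a length-prefixed payload. The layout below (1 type byte, big-endian u32 length) is a generic hypothetical example, not bcmr's actual wire format — that lives on the Wire Protocol page.

```rust
use std::io::{self, Read, Write};

/// Hypothetical frame layout for illustration: one type byte,
/// a big-endian u32 payload length, then the payload bytes.
fn write_frame<W: Write>(w: &mut W, kind: u8, payload: &[u8]) -> io::Result<()> {
    w.write_all(&[kind])?;
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

/// Read one frame back: header first, then exactly `len` payload bytes.
fn read_frame<R: Read>(r: &mut R) -> io::Result<(u8, Vec<u8>)> {
    let mut hdr = [0u8; 5];
    r.read_exact(&mut hdr)?;
    let len = u32::from_be_bytes([hdr[1], hdr[2], hdr[3], hdr[4]]) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok((hdr[0], payload))
}
```

Length-prefixing is what lets one TCP (or SSH) stream carry interleaved control and data messages without ambiguity; compression and AEAD layers wrap the payload inside this kind of frame.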
See Also (design discussion in context)
Not every design rationale lives under Internals — some is in documents that serve a different primary audience:
- **README — What bcmr does that `cp` and `scp` don't** is the user-facing capability narrative (integrity defaults, resume semantics, unified local/SSH CLI, JSON for agents) that this Internals section measures and justifies.
- **Remote Copy guide — How It Works** is the two-paragraph summary of the serve vs. legacy split, aimed at a user deciding which mode their transfer will pick.
- SECURITY.md lists which internal components are in-scope for vulnerability reports (direct-TCP data plane, rendezvous auth, session file format, CAS). The threat-model discussion for AEAD framing and session keys lives on the Path B page above.
Cross-Cutting Summary
Most rows below have their own experiment with measured numbers; click through to the relevant page for the table. Every number here is from a specific workload — the benefit column notes the conditions so the row can't be lifted context-free. Rows marked non-goal are design decisions without a measurement (the reasoning lives on the linked page).
| Decision | Lives on | Measured benefit (workload) |
|---|---|---|
| Always-on BLAKE3 (single hash) | SCC | Free on Linux AVX-512 (~5 GB/s > NVMe); 8--56 % overhead on macOS NEON warm cache |
| Tail-block resume verify | SCC | 50--145× vs full prefix rehash, 48--768 MiB written, mac/Linux |
| 64 MiB checkpoint interval | SCC | ≤ 16 % overhead and ≤ 64 MiB rework on both platforms (single-file, warm cache) |
| `copy_file_range` with offset | SCC | 8--24 % faster resume on Linux NVMe (64--512 MiB) |
| Opt-in per-file fsync | Local Perf | 13× faster (9.9 s → 0.72 s) on 2100 × 4 KiB repo, mac APFS warm cache |
| `--jobs` parallel local copy | Local Perf | 1.5--1.67× vs `-j1` on 10 000 × 64 KiB files, mac APFS warm cache |
| Skip src hash when unused | Local Perf | 28 % off (285 → 205 ms) on 32 MiB streaming no-verify copy, mac APFS |
| Single `spawn_blocking` copy loop | Local Perf | 2.3× (12.3 → 5.4 s) on 2 GiB streaming copy, Linux NVMe ext4 warm cache |
| Session + checkpoint gated on intent | Local Perf | ~2× (3.9 → 1.89 s) on 1 GiB streaming copy, mac APFS; lands within 1.65× of `cp` |
| Per-worker SSH connections | Wire | Up to ~6× parallel throughput (not re-measured on this tree; mscp's 8-conn 100 Gbps figure) |
| Wire compression (Zstd-3) | Wire | 2.48--5.59× vs uncompressed on 64 MiB source-text, ~10 MB/s WAN; essentially no cost on incompressible blocks (auto-skip) |
| `CAP_DEDUP` repeat PUT | Wire | 32 % faster (18.9 → 12.9 s) on 64 MiB re-upload, ~10 MB/s WAN; savings match the bytes not sent |
| `CAP_FAST` GET | Wire | Mixed: 1.07× on WAN (network-bound); 0.78× (i.e. slower) on Linux loopback due to pipe-size + `spawn_blocking` issues documented in the experiment |
| CAS LRU cap | Wire | Holds CAS ≤ cap under 3× 24 MiB repeat uploads (unit-test-sized; intended to prove bound, not speedup) |
| No rolling-checksum delta-sync | Non-Goal | non-goal — reach for rsync / restic / borg / OCI when byte-precise delta is genuinely required |