bcmr v0.6.0

Open Questions

These are investigations that surfaced during the v0.5.7 / v0.5.8 work but were intentionally deferred. Each needs design before shipping; the notes here record the shape we have in mind so the follow-up doesn't start from scratch.

Zero-Copy Serve GET via splice(2)

CAP_FAST shipped in v0.5.9 with a splice(2) Linux path (see Wire Experiment 14), but the loopback measurement showed it's actually slower than the default buffered path because of two implementation problems:

  1. Pipe buffer: fcntl(F_SETPIPE_SZ, 4 MiB) needs the kernel's pipe-max-size knob lifted (default 1 MiB on Ubuntu) or root. When it silently fails, the pipe stays at its 64 KiB default, so each 4 MiB chunk takes ~64 splice rounds instead of the expected 1.
  2. spawn_blocking per chunk: the splice loop currently dispatches one tokio blocking task per chunk --- the exact anti-pattern Experiment 13 fixed for the local copy path.

Fix: move the entire splice loop into one spawn_blocking per file; either probe /proc/sys/fs/pipe-max-size and use the largest allowed value, or fall back to writing the frame header + plain read+write syscalls inside that one blocking task.
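The probe step above can be sketched with two pure helpers; the names (`max_pipe_size`, `splice_rounds`) are hypothetical, and the splice(2) loop itself (which needs libc and a real fd pair) is elided:

```rust
use std::fs;

/// Probe the largest pipe capacity the kernel will grant an unprivileged
/// process, clamped to our chunk size. Falls back to the 64 KiB pipe
/// default when /proc is unavailable (non-Linux, sandbox).
fn max_pipe_size(chunk: usize) -> usize {
    fs::read_to_string("/proc/sys/fs/pipe-max-size")
        .ok()
        .and_then(|s| s.trim().parse::<usize>().ok())
        .map(|max| max.min(chunk))
        .unwrap_or(64 * 1024)
}

/// Number of splice(2) rounds one `chunk`-sized transfer takes given the
/// pipe capacity actually granted (ceiling division).
fn splice_rounds(chunk: usize, pipe_size: usize) -> usize {
    (chunk + pipe_size - 1) / pipe_size
}
```

With the silent F_SETPIPE_SZ failure, `splice_rounds(4 << 20, 64 * 1024)` is 64 — the pathology described above; with the knob lifted to 4 MiB it drops to 1.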

Until that's done, --fast still has a use: it skips the server's BLAKE3 computation, which matters for low-spec servers even over WAN.

io_uring Read Path on Linux

Each read in the inner copy loop is a plain read(2) syscall since v0.5.9 (Local Perf Experiment 13 moved the loop into one spawn_blocking). Batching via io_uring would replace the read+write pair per chunk with a single submission queue entry, saving the round trip through the kernel.

Decision needed first: tokio-uring vs raw io-uring --- the former still requires its own runtime (tokio_uring::start()) that doesn't drive standard tokio futures, which is a structural problem for bcmr; the latter requires running outside tokio for the read loop and reattaching futures around it.

Estimated win after Experiment 13: single-digit-percent on sustained NVMe reads. The big multiplier that motivated this entry in v0.5.7's open list (10x slower than cp) turned out to be the spawn_blocking-per-chunk overhead, which Experiment 13 closed without io_uring.

CAS LRU / Cap

The dedup CAS at ~/.local/share/bcmr/cas grows monotonically. Cleanest design is an LRU with a configurable byte cap (default ~1 GiB), garbage-collected on the next dedup-enabled PUT.

Layout sketch:

  • Sidecar index file mapping hash -> (size, last_access_unix).
  • Before each PUT that uses dedup, sum the index sizes. If over cap, drop oldest entries until under.
  • last_access_unix updated whenever a block is read for a CAS hit.
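The layout sketch above reduces to one pure decision function over the sidecar index. A minimal sketch, assuming the index is loaded as `(hash, size, last_access_unix)` tuples; `plan_eviction` is a hypothetical name, not shipped code:

```rust
/// One sidecar-index entry: (hash, size_bytes, last_access_unix).
type Entry = ([u8; 32], u64, u64);

/// Return the hashes to evict so the CAS drops back under `cap_bytes`,
/// oldest last_access first. Pure function: the caller deletes the blocks
/// and rewrites the index.
fn plan_eviction(mut index: Vec<Entry>, cap_bytes: u64) -> Vec<[u8; 32]> {
    let mut total: u64 = index.iter().map(|e| e.1).sum();
    // LRU order: smallest last_access_unix (least recently used) first.
    index.sort_by_key(|e| e.2);
    let mut evict = Vec::new();
    for (hash, size, _) in index {
        if total <= cap_bytes {
            break;
        }
        total -= size;
        evict.push(hash);
    }
    evict
}
```

Keeping it pure makes the "GC on the next dedup-enabled PUT" hook trivial to test, and sidesteps ordering questions about when the index file gets rewritten.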

Edge case: concurrent PUTs on the same machine. Could share an advisory lock on the index, or just accept eventual consistency (the worst that happens is a recently-evicted block gets re-fetched from the wire on the next request).

Pipelined Hashing for the Streaming-Copy Hot Path

Tried in v0.5.8 (see Experiment 10) and the obvious version was slower than the synchronous double hash --- the per-chunk Vec<u8> clone for the channel send and the channel sync overhead together cost more than the parallelism saved. A useful version would need:

  • A buffer pool (e.g. bytes::Bytes with refcount) so the channel send is zero-copy.
  • Probably also blake3's update_rayon, so the hash itself parallelises across cores instead of running on one.

The wins on the existing serial path are small enough that "skip the hash entirely when it's not needed" was the better lever (Experiment 10 did this, gating on verify || session.is_some()). The leftover gap to cp on the streaming path comes from per-block hash, per-checkpoint posix_fadvise, and tokio I/O scheduling overhead --- those are separate experiments.
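The zero-copy-send requirement above can be illustrated with std alone: the channel moves a refcounted handle (a pointer bump), not the payload. `Arc<Vec<u8>>` stands in for `bytes::Bytes` here, and summing lengths stands in for the BLAKE3 worker; both substitutions are illustrative only:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

/// Ship each block to a hashing thread without copying the payload: the
/// channel send clones the Arc, never the bytes. In bcmr the handle would
/// be bytes::Bytes from a buffer pool.
fn pipelined_len_sum(blocks: Vec<Arc<Vec<u8>>>) -> usize {
    let (tx, rx) = mpsc::channel::<Arc<Vec<u8>>>();
    // Stand-in for the BLAKE3 worker: sums lengths instead of hashing.
    let hasher = thread::spawn(move || rx.iter().map(|b| b.len()).sum::<usize>());
    for b in blocks {
        // The writer side could keep its own Arc clone and write to disk
        // concurrently; only the refcount is touched by the send.
        tx.send(b).unwrap();
    }
    drop(tx); // close the channel so the worker's iterator ends
    hasher.join().unwrap()
}
```

The Experiment 10 version paid a full `Vec<u8>` clone per chunk at exactly this send; that clone is what the buffer pool removes.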

Recursive Tree Dedup

Dedup currently fires only on individual file PUTs. Extending it to directory copies (where the client first sends a manifest of all files + per-file block hashes, server probes the CAS in one round-trip, client streams only what's missing across the whole tree) would be the natural follow-up to Experiment 11. Saves N - 1 extra round-trips for N files.

Content-Defined Chunking for CAS (FastCDC)

CAP_DEDUP today chunks at fixed 4 MiB boundaries + BLAKE3 per block (see Experiment 11 and src/core/serve_client/ops.rs:319). That's optimal for "re-upload the exact same artifact" — the measured 32% win on that experiment — but a single-byte insertion partway through a file shifts every subsequent block boundary and evicts the whole tail from the CAS hit set.

FastCDC-style chunking replaces the fixed boundary with a lightweight gear rolling hash that picks boundaries based on content, targeting the same ~4 MiB average. A mid-file insert only displaces the chunks around it; everything after the next CDC boundary still matches the previous upload. This is not rsync-style byte-precise delta-sync (see Non-Goal: Rolling-Checksum Delta-Sync) — the wire unit stays a whole chunk of ~4 MiB, matched whole, hashed with BLAKE3. Only the boundary rule changes.
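A toy version of the boundary rule, to make "gear rolling hash" concrete. This is a minimal sketch, not FastCDC proper (the fastcdc crate additionally normalizes chunk-size distribution); the gear table here is filled by an inline xorshift instead of a precomputed random table:

```rust
/// Content-defined chunking: cut when the rolling gear hash masks to zero,
/// bounded by min/max chunk sizes. Returns chunk end offsets.
fn cdc_chunks(data: &[u8], min: usize, avg_mask: u64, max: usize) -> Vec<usize> {
    // Deterministic pseudo-random gear table (xorshift64), one u64 per byte value.
    let mut gear = [0u64; 256];
    let mut s = 0x9E3779B97F4A7C15u64;
    for g in gear.iter_mut() {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        *g = s;
    }
    let mut cuts = Vec::new();
    let (mut start, mut h) = (0usize, 0u64);
    for (i, &b) in data.iter().enumerate() {
        // Rolling hash: old bytes age out of the window via the shift.
        h = (h << 1).wrapping_add(gear[b as usize]);
        let len = i + 1 - start;
        // Cut on a content-derived boundary (past min), or force one at max.
        if (len >= min && h & avg_mask == 0) || len >= max {
            cuts.push(i + 1);
            start = i + 1;
            h = 0;
        }
    }
    if start < data.len() {
        cuts.push(data.len()); // tail chunk
    }
    cuts
}
```

Because a cut depends only on the bytes just before it, an insertion perturbs boundaries locally: once the scan re-synchronises at the next content-derived cut, every later boundary (and hash) matches the previous upload again.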

Protocol change required. The "wire format unchanged" intuition this proposal sometimes comes with is wrong:

  • Message::HaveBlocks { block_size: u32, hashes: Vec<[u8; 32]> } today carries a single block size for all hashes; CDC needs per-chunk length (and offset) on the wire. Either extend the message or introduce a variant under a new cap bit (CAP_DEDUP_CDC) so the fixed-block path stays interoperable with older peers.
  • Server reconstruction currently derives each block's file offset from idx * block_size; with variable chunks the offset has to come from the wire.
  • Post-upload CAS lookup by hash is unchanged (CAS is keyed by hash alone, not offset).
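One possible shape for the new message, sketched as plain structs; the names (ChunkRef, HaveChunks) and field set are hypothetical, not a committed wire format:

```rust
/// Hypothetical per-chunk reference for a CAP_DEDUP_CDC variant. Unlike
/// HaveBlocks, each chunk carries its own length and file offset, because
/// the server can no longer derive the offset as idx * block_size.
struct ChunkRef {
    offset: u64,
    len: u32,
    hash: [u8; 32],
}

/// Hypothetical replacement for HaveBlocks under the new cap bit; the
/// fixed-block message stays untouched for older peers.
struct HaveChunks {
    chunks: Vec<ChunkRef>,
}

/// The declared file size falls out of the chunk list rather than a
/// count * block_size product.
fn total_bytes(msg: &HaveChunks) -> u64 {
    msg.chunks.iter().map(|c| c.len as u64).sum()
}
```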

Measure before shipping. CAS's one measured workload is identical 64 MiB re-upload. CDC's additional win over fixed blocks materialises only on re-uploads where content was modified in place — SQL dumps patched mid-file, ML checkpoints with rewritten weights, VM images after an in-place patch. We have no data that these dominate bcmr users' actual traffic. Reasonable experiment shape:

  1. Feature-flag a gear-hash chunker (e.g. the fastcdc crate) behind an env var; emit the variable-chunk message when set.
  2. On a workload with real mid-file deltas (a repeated SQL dump across schema migrations, or a synthetic "insert 64 KiB at byte offset 512 MiB in a 2 GiB file"), compare CAS hit rate and wire bytes saved vs fixed blocks.
  3. If hit-rate gain is material on a representative workload, plan a protocol bump; otherwise park indefinitely.

Conservative implementation estimate: 1–2 weeks (chunker integration, protocol variant, server offset handling, test matrix for small files / all-zero inputs / boundary-aligned edits). Add a few days up front for the measurement gate.

Closing the 1 GiB Single-File Gap to scp

After Experiment 18 closed many-files to 1.18× scp, the single-1-GiB workload still runs at ~2× scp (--fast 4.61 s vs scp 2.41 s). The remaining overhead is structural:

  1. Per-block BLAKE3 (~1 GB/s NEON / AVX-512). For 1 GiB that's ~1 second of pure hash CPU; scp computes no integrity hash at all.
  2. Frame overhead. Each Data frame has a 9-byte header per 4 MiB chunk. Negligible bytes but each write_message crosses the tokio scheduler. ~256 frames per GiB.
  3. Tokio I/O scheduling. Per Experiment 13, tokio::fs reads do spawn_blocking per call; we partially fixed this for the local copy path but the serve send-loop still goes through protocol::write_message().await per frame.

Possible directions, ordered by cleanness:

  • A --no-hash mode that drops integrity entirely (stronger than --fast, which only skips the server side; this also skips client-side per-block hashing). Would close most of the ~1 CPU-second hashing gap.
  • A "bulk" wire mode that sends the body raw after a single Put header — no per-chunk framing during streaming. Server reads declared_size bytes; client writes them. Loses the ability to interleave Error mid-stream but matches scp's shape exactly.
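The bulk receive path above is essentially one bounded copy. A minimal sketch with std only; `recv_bulk_body` is a hypothetical name and all error handling beyond io::Result is elided:

```rust
use std::io::{self, Read, Write};

/// After the Put header declares the body size, read exactly
/// `declared_size` raw bytes with no per-chunk framing — the same shape
/// as scp's body phase. Returns the number of bytes actually copied.
fn recv_bulk_body<R: Read, W: Write>(
    conn: &mut R,
    out: &mut W,
    declared_size: u64,
) -> io::Result<u64> {
    // Take caps the read at the declared size so the connection can be
    // reused for the next message afterwards.
    io::copy(&mut conn.take(declared_size), out)
}
```

A short count (copied < declared_size) means the peer died mid-stream; since there is no framing, that is also the only place a mid-stream Error could be detected, which is the trade-off noted above.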

Single-file isn't the most common workload for bcmr serve (WAN deployments are wire-bound; many-files dominates LAN), so this is parked behind any user complaint that actually identifies single-file as their bottleneck.

Silent Fallback When Path Escapes Server Root

The --root jail (default $HOME) is correct security behavior, but when the server rejects a path the client falls back silently to a slower transport (legacy per-file SSH). Symptom from a user's perspective: "bcmr copy of /tmp/foo takes 30 s where scp takes 2 s". Root cause is invisible unless you strace the server or know to look for the path /... escapes server root line on stderr. Found while benchmarking Experiment 17.

Fix shape: when the server returns Error for a path-escape reason during the initial Stat/List, the client should print a clear stderr warning ("falling back to legacy SSH transport because remote rejected path") and either continue with the fallback (current behavior) or exit non-zero (opinionated; breaks scripts that didn't realize they were inside a jail).

Path B: Direct TCP After SSH Rendezvous

Experiment 19 shipped the "mscp-style" fix: open N independent SSH sessions, stripe files across them, get N× the crypto ceiling. That took us past scp -r on a contended box. The remaining single-stream ceiling that Path A can't remove is the ~500 MB/s-per-core cap that each SSH session's cipher imposes: the only way past that per-connection is to not encrypt at the SSH layer at all.

Shape of the fix:

  • Client opens one SSH connection to the remote and invokes bcmr serve --listen <port> (or a dedicated rendezvous subcommand). SSH is the authenticated channel — key exchange happens here, a shared session key gets derived.
  • Server binds a TCP listener on <port>, replies with the port + key over the SSH control channel.
  • Client opens a direct TCP connection to the listener, authenticates with the key, and talks the same bcmr serve wire protocol over that socket. SSH connection stays open as the control/watchdog channel.
  • On LAN, the direct TCP path can skip encryption entirely (user opts in via flag — the threat model is "we trust the link"), or use a fast AEAD (AES-GCM with a session key) that isn't constrained to OpenSSH's cipher negotiation.
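The rendezvous shape above can be exercised on loopback with std alone. In bcmr the port and key would travel over the SSH control channel and the comparison would be constant-time with the key bound to the SSH session; everything here, including the `rendezvous_roundtrip` name and the bare-token check, is illustrative:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Server binds an ephemeral port, "reports" it (stand-in for the SSH
/// control channel), then accepts one direct TCP connection and checks
/// the presented session key before speaking the serve protocol.
fn rendezvous_roundtrip(key: [u8; 16]) -> bool {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let port = listener.local_addr().unwrap().port(); // replied over SSH
    let server = thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        let mut presented = [0u8; 16];
        conn.read_exact(&mut presented).unwrap();
        // Real design: constant-time compare, key derived from and bound
        // to the authenticated SSH session so a racing attacker loses.
        presented == key
    });
    let mut client = TcpStream::connect(("127.0.0.1", port)).unwrap();
    client.write_all(&key).unwrap();
    server.join().unwrap()
}
```

Binding to port 0 and reading back local_addr() is what makes "port allocation" from the risk list below mostly an engineering detail: the kernel picks a free port and only the control channel needs to learn it.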

Risks / design questions that need actual work:

  • Trust model: now there's a second auth surface. Need to bind the derived key to the SSH session tightly so an attacker can't race to the listener. Review carefully.
  • Firewall friendliness: an extra port may not be reachable through an institutional firewall that only allowed SSH. Flag turns this on; otherwise fall back to Path A.
  • Listener lifecycle: port allocation, port collision, what happens if the client dies mid-batch. Watchdog via SSH control channel handles the last one; the first two are engineering.
  • Code organization: introduce trait Transport in src/core/transport/ with SshTransport (current) and DirectTcpTransport (feature-gated --features direct-transport). Same repo, not extracted to a new crate until a second consumer shows up.
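A possible shape for that trait, with an in-memory impl standing in for either real transport; the method set is a guess at the minimum the serve protocol needs, not a committed API:

```rust
use std::io;

/// Hypothetical trait for src/core/transport/. SshTransport would wrap
/// today's session; DirectTcpTransport lands behind --features
/// direct-transport with the same surface.
trait Transport {
    fn name(&self) -> &'static str;
    fn send(&mut self, frame: &[u8]) -> io::Result<()>;
    fn recv(&mut self) -> io::Result<Vec<u8>>;
}

/// In-memory loopback impl for tests: frames go onto a stack.
struct LoopbackTransport {
    queue: Vec<Vec<u8>>,
}

impl Transport for LoopbackTransport {
    fn name(&self) -> &'static str {
        "loopback"
    }
    fn send(&mut self, frame: &[u8]) -> io::Result<()> {
        self.queue.push(frame.to_vec());
        Ok(())
    }
    fn recv(&mut self) -> io::Result<Vec<u8>> {
        self.queue
            .pop()
            .ok_or_else(|| io::Error::new(io::ErrorKind::WouldBlock, "no frame queued"))
    }
}
```

Whether send/recv end up async (and whether frames stay owned Vec<u8> or become bytes::Bytes) is exactly the kind of decision the feature gate buys time for.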

Goal: single-stream saturate a 10 GbE NIC (~1.25 GB/s practical) or better on LAN. On WAN this helps less because the wire is already the bottleneck — Path A-style parallelism matters more there too, and the two are composable (N parallel direct-TCP streams).

Tracked on its own branch per decision; won't merge until it's demonstrably safe and has a clear user story.

xattr Cross-FS Edge Cases

Today's xattr preservation (see code under commands/copy.rs) is best-effort: ENOTSUP and EPERM are both swallowed silently. That's the right default for cross-FS copies, but we should track which attributes were dropped and surface that under -v --- silent dropping of security.selinux or Finder tags is a footgun for power users.
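The "track instead of swallow" change is small. A sketch with the setxattr call abstracted behind a closure (std has no xattr API); `preserve_xattrs` and the report shape are hypothetical:

```rust
use std::io;

/// Apply each xattr via `apply` (stand-in for the real setxattr call).
/// ENOTSUP/EPERM stay best-effort but are recorded for the -v report;
/// any other error is a real failure and propagates.
fn preserve_xattrs<F>(names: &[&str], mut apply: F) -> io::Result<Vec<String>>
where
    F: FnMut(&str) -> io::Result<()>,
{
    let mut dropped = Vec::new();
    for name in names {
        match apply(name) {
            Ok(()) => {}
            Err(e)
                if matches!(
                    e.kind(),
                    io::ErrorKind::Unsupported | io::ErrorKind::PermissionDenied
                ) =>
            {
                // Swallowed as before, but the name survives for -v output.
                dropped.push((*name).to_string());
            }
            Err(e) => return Err(e),
        }
    }
    Ok(dropped)
}
```

Under -v the caller would print one line per dropped name ("xattr security.selinux not preserved: unsupported on target fs"), which turns today's silent footgun into a visible, ignorable warning.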