# Wire Protocol & Remote Transfers
For remote transfers, bcmr implements a binary frame protocol
(`bcmr serve`) that replaces per-file SSH process spawning with a
persistent connection over stdin/stdout. The protocol uses
length-prefixed frames (`[4B length][1B type][payload]`) and supports
Stat, List, Hash, Get, Put, Mkdir, Resume, plus the
extensions covered on this page.
Key properties of the base design:
- Single connection: all operations multiplexed over one SSH session, eliminating process spawns for files.
- Server-side hashing: the remote bcmr computes BLAKE3 hashes locally, avoiding data round-trips for verification.
- Automatic fallback: if the remote does not have bcmr installed, transfers silently fall back to legacy SCP.
- Frame size limit: `read_message` rejects oversized frames to prevent memory exhaustion from malicious peers.
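The framing is small enough to sketch end to end. Everything below is illustrative, not bcmr's actual wire code: `MAX_FRAME` is a placeholder for the (unspecified) size cap, and the choice to count the type byte inside the length field is an assumption.

```rust
// Hypothetical stand-in for the frame size cap; the real limit is not
// stated here.
const MAX_FRAME: u32 = 64 * 1024 * 1024;

// [4B length][1B type][payload]; in this sketch the length covers the
// type byte plus the payload.
fn encode_frame(msg_type: u8, payload: &[u8]) -> Vec<u8> {
    let len = (payload.len() + 1) as u32;
    let mut out = Vec::with_capacity(4 + 1 + payload.len());
    out.extend_from_slice(&len.to_be_bytes());
    out.push(msg_type);
    out.extend_from_slice(payload);
    out
}

fn decode_frame(buf: &[u8]) -> Result<(u8, &[u8]), &'static str> {
    if buf.len() < 5 {
        return Err("short frame");
    }
    let len = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]);
    if len > MAX_FRAME {
        // The guard read_message applies: never allocate for a frame a
        // malicious peer claims is huge.
        return Err("frame too large");
    }
    if buf.len() < 4 + len as usize {
        return Err("truncated frame");
    }
    Ok((buf[4], &buf[5..4 + len as usize]))
}
```

Round-tripping a frame through these two functions recovers the type byte and payload unchanged.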
The Hello / Welcome handshake carries an optional trailing
capabilities byte (LZ4 = 0x01, Zstd = 0x02, Dedup = 0x04,
Fast = 0x08). Old decoders read version and stop; new decoders
read caps too. Talking to a peer that doesn't advertise a bit
just means the feature stays off, so no protocol version bump is
needed for backward-compatible additions.
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server
    Note over C,S: Both hold their own caps mask.<br/>The intersection drives behaviour.
    C->>S: Hello { version: 1, caps: client_caps }
    S->>S: effective = server_caps AND client_caps
    S->>C: Welcome { version: 1, caps: effective }
    C->>C: algo = negotiate(client_caps, effective)
    Note over C,S: For each Data frame:<br/>encode_block(algo, ...) emits<br/>Data (raw) or DataCompressed.
```
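The "old decoders read version and stop" trick can be sketched as a two-line parser. This is a hypothetical decoder shape, not bcmr's actual parsing code:

```rust
// The Hello/Welcome payload is [version][caps?]. A decoder that
// predates the caps byte reads the version and stops; a newer one
// treats a missing caps byte as "no extensions", so the two
// generations interoperate without a version bump.
fn parse_hello(payload: &[u8]) -> (u8, u8) {
    let version = payload[0];
    let caps = payload.get(1).copied().unwrap_or(0); // old peer => 0
    (version, caps)
}
```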
When both peers advertise `CAP_ZSTD`, Zstd wins; failing that, LZ4
is the fallback if both sides have it; otherwise frames go as raw
Data. The encoder also runs a per-block auto-skip: if
`compressed.len()` exceeds 95% of the original (random /
already-compressed bytes) the block goes raw to save the receiver's
decompress pass.
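Both rules fit in a few lines. The constants match the caps byte above; the `Algo`/`Frame` enums and function shapes are illustrative sketches, not bcmr's actual types:

```rust
const CAP_LZ4: u8 = 0x01;
const CAP_ZSTD: u8 = 0x02;

#[derive(Debug, PartialEq)]
enum Algo { Raw, Lz4, Zstd }

enum Frame { Data(Vec<u8>), DataCompressed(Vec<u8>) }

// Codec priority over the negotiated intersection: Zstd > LZ4 > raw.
fn negotiate(client_caps: u8, effective: u8) -> Algo {
    let both = client_caps & effective;
    if both & CAP_ZSTD != 0 { Algo::Zstd }
    else if both & CAP_LZ4 != 0 { Algo::Lz4 }
    else { Algo::Raw }
}

// Per-block auto-skip: ship the compressed candidate only when it is
// at most 95% of the original; otherwise send raw and spare the
// receiver a pointless decompress pass.
fn choose_frame(original: &[u8], compressed: Vec<u8>) -> Frame {
    if compressed.len() * 100 <= original.len() * 95 {
        Frame::DataCompressed(compressed)
    } else {
        Frame::Data(original.to_vec())
    }
}
```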
## Parallel SSH with Independent Connections
SSH's ControlMaster multiplexing serializes all channels through
one TCP connection and one encryption context. For parallel
workers, throughput is therefore bounded by a single core's
encryption speed regardless of worker count.

bcmr assigns each parallel worker its own `ControlPath`, creating
independent TCP connections.
The mscp project measured 5.98x speedup with 8 independent connections on a 100 Gbps link.
See the Remote Copy guide for end-user configuration.
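A minimal sketch of the per-worker argument construction. The socket path template (`~/.ssh/bcmr-N.sock`) and flag layout here are assumptions for illustration, not bcmr's real invocation:

```rust
// Each worker gets a distinct ControlPath, so ssh cannot fold the
// sessions into one multiplexed (single-cipher-stream) connection.
fn ssh_args(worker: usize, host: &str) -> Vec<String> {
    vec![
        "-o".into(),
        format!("ControlPath=~/.ssh/bcmr-{worker}.sock"), // unique per worker
        "-o".into(),
        "ControlMaster=auto".into(),
        host.into(),
    ]
}
```

Distinct `ControlPath` values mean distinct master connections, which is the whole point: N workers, N TCP connections, N encryption contexts.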
## Experiment 9: Wire Compression for Data Frames
Hypothesis: Per-block LZ4/Zstd encoding pays for itself whenever the network is slower than the codec. On modern CPUs LZ4 decodes at multiple GB/s, so the receiver is never compute-bound; the only question is ratio.
Method (Part A --- codec probe): encode then decode a single 4 MiB block three times (random, text-like, mixed) for each algorithm. Ratios and throughputs measured on Apple Silicon:
| Workload | Algo | Ratio | Enc MB/s | Dec MB/s |
|---|---|---|---|---|
| random | LZ4 | 1.004 | 3578.2 | 17330.5 |
| random | Zstd-1 | 1.000 | 4769.9 | 33692.0 |
| random | Zstd-3 | 1.000 | 4655.3 | 33635.0 |
| random | Zstd-9 | 1.000 | 2442.8 | 31432.3 |
| text | LZ4 | 0.390 | 472.8 | 1526.7 |
| text | Zstd-1 | 0.210 | 301.2 | 871.9 |
| text | Zstd-3 | 0.198 | 320.5 | 1012.1 |
| text | Zstd-9 | 0.180 | 47.2 | 1130.9 |
| mixed | LZ4 | 0.697 | 863.4 | 2875.6 |
| mixed | Zstd-3 | 0.599 | 457.2 | 1971.9 |
Interpretation:
- Random data. All three codecs return ratios indistinguishable from 1.0. Sending compressed is pure CPU waste, so the wire path must auto-skip when the encode output is within 5 % of the input.
- Text-like data. Zstd-3 reaches 5x reduction at 320 MB/s encode. For anything under ~2.5 Gbps of effective network throughput, compression is the bandwidth bottleneck, not the CPU.
- Zstd-9 is consistently worse than -3 for file content: encode drops by 7x (to 47 MB/s) for only a ~2 % ratio gain. Skip it.
Decision: Default to auto-negotiation advertising both LZ4 and Zstd. The handshake picks Zstd when both sides speak it (better ratio at acceptable encode cost), falls back to LZ4 when only one does, and to raw Data frames otherwise. Zstd level fixed at 3 --- the library's own default, and our measurement agrees.
Method (Part B --- auto-skip in vivo): a unit test encodes a 4 MiB
pseudo-random block through `encode_block(Lz4, ...)` and asserts the
emitted message type is `Data` (raw), not `DataCompressed`. It covers
the happy path where the codec's frame header + payload overshoots
the 0.95 × original threshold and the encoder falls back.
## Experiment 11: Content-Addressed Dedup for Repeat PUT
Hypothesis: For dev workflows where the same artifact is uploaded to a remote host repeatedly, the second-and-onward upload can avoid the wire entirely if the receiver remembers what it has seen. BLAKE3 is already computed per 4 MiB block, so a tiny pre-flight that exchanges hashes lets the server short-circuit to a local CAS read.
Design: Negotiate CAP_DEDUP in the Hello/Welcome caps byte.
When active and the file is at least 16 MiB:
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server
    participant CAS as Server CAS<br/>~/.local/share/bcmr/cas
    C->>C: hash file in 4 MiB blocks
    C->>S: Put { path, size }
    C->>S: HaveBlocks { block_size,<br/>hashes: [h0, h1, h2, ...] }
    loop for each hash
        S->>CAS: exists?
    end
    S->>C: MissingBlocks { bits: 0b...10110 }
    Note over C,S: 1 = block needed on wire,<br/>0 = server already has it
    loop for each block i
        alt bit i is 1
            C->>S: Data / DataCompressed (block i)
            S->>CAS: write block i (mtime = now)
            S->>S: write block i to dst
        else bit i is 0
            S->>CAS: read block i (mtime → now)
            S->>S: write block i to dst
        end
    end
    C->>S: Done
    S->>C: Ok { hash: <BLAKE3 of dst> }
```
The composite hash returned in Ok covers the full file regardless
of which blocks took which path. The 16 MiB threshold protects
small uploads from the round-trip cost of HaveBlocks/MissingBlocks
itself.
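The server side of the `HaveBlocks` → `MissingBlocks` exchange reduces to a set-membership pass. Sketch below, with a `HashSet` standing in for the real CAS lookup and the bitmask capped at 64 blocks purely to fit in one `u64`:

```rust
use std::collections::HashSet;

// Bit i set = client must send block i on the wire; bit i clear = the
// CAS already has it and the server reads it locally.
fn missing_bits(cas: &HashSet<[u8; 32]>, hashes: &[[u8; 32]]) -> u64 {
    assert!(hashes.len() <= 64); // sketch-only limit
    hashes
        .iter()
        .enumerate()
        .filter(|(_, h)| !cas.contains(*h))
        .fold(0u64, |bits, (i, _)| bits | (1 << i))
}
```

On a warm cache every bit comes back 0 and the PUT degenerates into CAS reads plus one `Done`/`Ok` round trip, which is where the Experiment 11 savings come from.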
Method: 64 MiB pseudo-random file uploaded twice from a macOS
laptop to host-L (Linux NVMe over the public internet, ~30 ms
RTT, ~10 MB/s effective WAN bandwidth). Cold cache via
rm -rf ~/.local/share/bcmr/cas between runs.
| Run | Wall (s) | Notes |
|---|---|---|
| 1 (cold cache) | 18.96 | full 64 MiB on the wire |
| 2 (warm cache) | 12.93 | every block a CAS hit; ~6 s saved |
The savings track the eliminated wire bytes: 64 MiB at ~10 MB/s ≈ 6 s, which matches the observed delta. The remaining 13 s is local hash + CAS read + dst write + protocol round trips, all on either side of the network. For higher-bandwidth links the relative win shrinks; for slower / metered ones (cellular tethering, transoceanic SSH) it grows.
Correctness check: SHA-256 of source matches both destinations across the two runs.
## CAP_FAST: Skip Server Hash + Linux Splice
--fast advertises CAP_FAST in the client's caps byte. When the
server also has it (it always does on supported platforms),
GET responses skip the inline BLAKE3 entirely and on Linux the
file → stdout payload moves through splice(2) with no userspace
buffer. The server's Ok carries hash: None; clients that need
end-to-end integrity get it via -V (re-hash dst client-side).
The 4 MiB pipe buffer (set via fcntl(F_SETPIPE_SZ)) means each
chunk needs one splice call from the file and one from the pipe
to stdout, with no copy ever touching userspace. Compression is
mutually exclusive with this path --- the encoder needs userspace
bytes --- so CAP_FAST only activates the splice variant when
--compress=none. CAP_FAST without splice (compression on, or
non-Linux) still wins from skipping the server's BLAKE3.
## Experiment 14: CAP_FAST Real Numbers
Hypothesis: Skipping the server-side BLAKE3 should always be
a small win (server CPU saved) and on Linux the additional
splice(2) zero-copy on the file → stdout payload should be a
larger win when the network isn't the bottleneck.
Method (WAN): 1 GiB random file pulled from host-L over a
~10 MB/s residential link. Default vs --fast, both with
--compress=none so the codec doesn't dominate.
| Mode | Mean (s) | Notes |
|---|---|---|
| default | 157.79 | full BLAKE3 + buffered frames |
| `--fast` | 147.35 | hash skipped, splice on Linux |
Result: 1.07x. The wire is the bottleneck; saving ~700 ms of server-side hash on top of ~150 s of network is noise.
Method (loopback): same file, but `bcmr copy localhost:src dst`
running on host-L itself. SSH-encrypted localhost peaks
around ~500 MB/s on AES-NI hardware, so this isolates server
behaviour from real-network jitter.
| Mode | Mean (s) | Throughput |
|---|---|---|
| default | 4.43 | ~230 MB/s |
| `--fast` | 5.69 | ~180 MB/s |
Result: --fast is slower. Two compounding causes:
- Pipe sizing falls back silently. `fcntl(F_SETPIPE_SZ, 4 MiB)` needs root or a raised `/proc/sys/fs/pipe-max-size`; the default cap on Ubuntu is 1 MiB. The call silently caps at the max and the actual buffer stays small (default 64 KiB), so each 4 MiB chunk needs ~64 paired splice rounds.
- `spawn_blocking` per chunk. The splice loop currently lives in its own `tokio::task::spawn_blocking` per chunk --- the same anti-pattern that Experiment 13 identified for the local copy path. With 256 chunks per 1 GiB that's 256 thread bounces, more than enough to wipe out the savings from skipping hash and memcpy.
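The pipe-sizing cost is just ceiling division: splice moves at most one pipe buffer per paired (file → pipe, pipe → stdout) round, so the round count per chunk follows directly:

```rust
// Paired splice rounds needed to push one chunk through a pipe of the
// given buffer size.
fn splice_rounds(chunk: usize, pipe_buf: usize) -> usize {
    (chunk + pipe_buf - 1) / pipe_buf // ceiling division
}
```

With the default 64 KiB pipe each 4 MiB chunk takes 64 rounds; at the intended 4 MiB buffer it takes one.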
Decision: Keep --fast for the hash-skip benefit. The splice
path stays in; the per-chunk spawn_blocking + pipe-sizing fixes
are tracked on the Open Questions
page and land in later releases.
## Experiment 15: CAS LRU Eviction Under Load
Hypothesis: A monotonically-growing CAS makes dedup unusable on disk-constrained boxes. A simple LRU-by-mtime scheme should keep the store at the configured cap while preserving the most recently hit blocks (which are also the most likely to recur).
Method: end-to-end integration test with three distinct
24 MiB files (each = 6 blocks) PUT in sequence to a local serve
process, with BCMR_CAS_CAP_MB=32 (=8 blocks). Cap is enforced
on the server side at the start of each PUT. After all three
uploads the CAS is walked and totals are checked.
| What | Expected | Measured |
|---|---|---|
| Cumulative bytes if no eviction | 72 MiB | --- |
| Cap | 32 MiB | --- |
| CAS bytes after 3rd PUT | 32 MiB | 32 MiB ✓ |
| CAS blob count after 3rd PUT | 8 | 8 ✓ |
The mtime touch on cas::write and cas::read means a block hit
during PUT N+1 stays warmer than untouched blocks from PUT N,
matching the dev-loop pattern (re-upload the same artifact).
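The eviction policy itself is a sort-and-drop loop. Sketched here over an in-memory block list instead of the real CAS directory walk; `(mtime, size)` tuples stand in for on-disk blobs:

```rust
// LRU-by-mtime: evict coldest blocks first until the store fits under
// the configured cap. Blocks are (mtime_secs, size_bytes).
fn evict_to_cap(blocks: &mut Vec<(u64, u64)>, cap: u64) {
    blocks.sort_by_key(|&(mtime, _)| mtime); // oldest first
    let mut total: u64 = blocks.iter().map(|&(_, s)| s).sum();
    while total > cap {
        let (_, size) = blocks.remove(0); // drop the coldest block
        total -= size;
    }
}
```

With 18 blocks of 4 MiB (72 MiB total) and a 32 MiB cap this leaves exactly 8 blocks, matching the measured blob count in the table above.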
Decision: ship the LRU as default. The hit rate degrades
gracefully as cap shrinks, and the BCMR_CAS_CAP_MB=0 escape
hatch restores the v0.5.8 unbounded behaviour for users who
explicitly want it.
## Experiment 12: Wire Compression Across Real Hosts
The earlier Experiment 9 measured codec ratios in isolation; this one re-runs the protocol over real SSH connections to confirm the prediction. 64 MiB of source-text-like content from host-N to three peers, three runs each.
| Peer | None (s) | LZ4 (s) | Zstd (s) | Zstd vs None |
|---|---|---|---|---|
| host-L (WAN, ~10 MB/s, kernel 6.x) | 18.18 | 10.22 | 3.25 | 5.59x |
| host-M (WAN, ~10 MB/s, kernel 5.x) | 8.14 | 4.36 | 3.28 | 2.48x |
| host-N (LAN, gigabit, macOS arm64) | 9.58 | 3.26 | 1.82 | 5.28x |
Zstd-3 wins on every link. LZ4 wins over None but loses to Zstd because the bandwidth saving from Zstd's extra ratio more than pays for the lower encode throughput. host-M's smaller relative win comes from the path's variance dominating the small absolute duration --- the absolute saving is similar to the others.
## Experiment 17: Per-File fsync as the Many-Files Tax
Hypothesis: bcmr serve was 7.86× slower than scp -r on a
10000 × 64 KiB benchmark on host-L loopback. --fast and
default were both at ~24 s while scp finished in 3 s. The CPU
breakdown (~3 s total CPU on a 24 s wall) ruled out compute.
Likely culprit: per-file fdatasync in both directions ---
server handle_put calls file.sync_all() per file, client
GET callback calls dst_file.sync_all() per file. At ~1-2 ms
per fdatasync × 10000 files, that's 10-20 s of pure barrier
time — matching the gap.
cp/rsync/scp don't fsync per file by default. The local
copy path in bcmr already gates fsync on --sync (see
Experiment 7).
The serve path was over-promising durability silently.
Method: 10000 × 64 KiB random files, host-L loopback ssh (Ubuntu 22.04, NVMe ext4, kernel 6.8). Warm cache, hyperfine, 2 runs each.
| Command | Mean (s) | vs scp |
|---|---|---|
| `scp -r` | 3.11 | 1.00x |
| `bcmr copy -r` (v0.5.13, before) | 24.75 | 7.96x slower |
| `bcmr copy -r` (after, default) | 6.35 | 2.04x slower |
| `bcmr copy -r --sync` (after) | 15.46 | 4.97x slower |
Implementation: New CAP_SYNC = 0x10 advertised by
server; client OR's it into Hello caps when --sync is set.
Both sides gate their per-file fsync on the negotiated bit.
Default is off, matching cp/scp/local-copy default behavior.
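The gate reduces to one branch per file. The `CAP_SYNC` value matches the text; the function shape is an illustrative sketch, not bcmr's actual handler:

```rust
const CAP_SYNC: u8 = 0x10;

// Per-file durability barrier, gated on the negotiated caps bit: only
// --sync users pay the fdatasync cost, everyone else matches cp/scp.
fn finish_file(file: &std::fs::File, negotiated_caps: u8) -> std::io::Result<()> {
    if negotiated_caps & CAP_SYNC != 0 {
        file.sync_all()?;
    }
    Ok(())
}
```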
Decision: Ship. Closes the many-files gap from 7.96x →
2.04x with no functionality loss — --sync users still get
exactly the durability they asked for (and pay the 5x cost
they implicitly opted into).
The single-file path also benefits — the GET fsync at end-of-file was ~0.5 s on a 1 GiB stream:
| Command | Before | After |
|---|---|---|
| `bcmr copy --fast localhost:src dst` | 5.35 s | 4.85 s |
| `bcmr copy localhost:src dst` (default) | 8.27 s | 5.13 s |
## Experiment 18: Client-Side Request Pipelining
Hypothesis: After Experiment 17 (CAP_SYNC) closed the per-file
fsync overhead, host-L still showed bcmr serve at 2.04× scp on
the 10000 × 64 KiB many-files bench. CPU breakdown ruled out
hash/encode work — the wall time was dominated by RTT-style
serialization. The single-file put() does
send Put → send Data* → send Done → await Ok, then the next
file. Server's dispatch loop is FIFO and would happily process
queued requests, but the client never queues more than one. SFTP
(what scp -r uses under the hood) keeps a window of ~64
in-flight requests; that's the gap.
Method: Same 10000 × 64 KiB host-L loopback bench, plus the mirror upload direction and a single-1-GiB regression check. Three runs each, hyperfine, warm cache.
| Workload | scp | bcmr (post-Phase-0) | bcmr (post-Phase-1) | bcmr ratio vs scp |
|---|---|---|---|---|
| 10000 × 64 KiB DOWNLOAD | 3.02 s | 6.35 s | 3.58 s | 1.18× |
| 10000 × 64 KiB UPLOAD | 3.10 s | ~6.4 s (est) | 3.67 s | 1.18× |
| 1 GiB single (`--fast`) | 2.41 s | 4.85 s | 4.61 s | 1.91× |
| 1 GiB single (default) | 2.41 s | 5.13 s | 5.05 s | 2.09× |
Implementation: Two new methods on ServeClient:
- `pipelined_put_files(files, on_complete)`: takes ownership of stdin, spawns a writer task that emits `Put / Data* / Done` for every file in order. The reader task (the caller's task) collects FIFO `Ok` hashes from stdout.
- `pipelined_get_files(files, sync, on_complete)`: mirror — the writer task sends every `Get` request up-front, the reader demuxes the `Data* / Ok` stream into per-file dst handles.
The OS pipe between the client process and the SSH child plus SSH's own send window provide natural backpressure when the server hasn't drained yet — no explicit channel needed.
Server unchanged: the dispatch loop in serve.rs:159-198
already processes requests in FIFO order. Pipelining is purely
a client-side win.
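The shape of the win can be modeled with plain threads and channels (the real code is async tokio; this toy stands in for it). The writer fires every request without awaiting, the "server" is a FIFO dispatch loop, and the channel's buffering plays the role of the OS pipe + SSH window backpressure:

```rust
use std::sync::mpsc;
use std::thread;

// Toy pipelining model: N requests in flight, responses collected in
// FIFO order. Request/response are u32s standing in for frames.
fn pipelined_roundtrip(requests: Vec<u32>) -> Vec<u32> {
    let (req_tx, req_rx) = mpsc::channel::<u32>();
    let (resp_tx, resp_rx) = mpsc::channel::<u32>();

    let writer = thread::spawn(move || {
        for r in requests {
            req_tx.send(r).unwrap(); // fire-and-forget, no per-file await
        }
        // req_tx drops here, closing the request stream.
    });
    let server = thread::spawn(move || {
        for r in req_rx {
            resp_tx.send(r + 1).unwrap(); // FIFO dispatch loop
        }
    });

    let out: Vec<u32> = resp_rx.iter().collect(); // reader demux
    writer.join().unwrap();
    server.join().unwrap();
    out
}
```

The serialize-per-request variant would pay one full round trip per file; here the round-trip cost is paid once for the whole batch.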
Failure path: if the reader bails mid-batch (server emits
Error for file K), both pipelined methods call writer.abort()
before writer.await to avoid a deadlock where the writer keeps
pushing into a stdin buffer nobody drains. The connection is left
indeterminate after a pipelined failure — caller drops, not
retries, on the same client.
Decision: Ship. Many-files closed from 7.86× scp to 1.18× in three steps (CAP_SYNC, GET pipelining, PUT pipelining). bcmr serve is now within 18 % of scp on multi-small-files while still offering resume, content-addressed dedup, inline integrity, and a progress UI scp doesn't have.
## Experiment 19: Parallel SSH Connections Break the Single-Stream Ceiling
Hypothesis: After Experiment 18 closed the per-file RTT gap
through client-side pipelining, bcmr serve ran at ~1.18× scp -r
on the happy path (light load) and, we discovered, fell back to
~3-5× slower under heavy contention. The remaining gap has a
structural cause: SSH gives you one cipher stream per TCP
connection. AES-NI tops out at ~500 MB/s/core and ChaCha20 at
~200-500 MB/s; a single SSH session is walled in by exactly one
crypto thread on each side. Opening N independent SSH connections
in parallel — what mscp does — multiplies the crypto ceiling
by N until NIC or disk takes over. No protocol change.
Method: 10000 × 64 KiB files, host-L loopback ssh, same
dataset as Experiment 17. Run under genuine production contention
(load average ~93 from unrelated minimap2 -t 30 jobs on the
box) so the crypto bottleneck actually materialises — Experiment
17's "load ~1" numbers hid the single-stream ceiling behind
abundant headroom. 2 runs per N, /usr/bin/time for wall + CPU
breakdown, scp -r as baseline.
| Command | Wall mean (s) | Speedup vs N=1 | Ratio vs scp |
|---|---|---|---|
| `bcmr copy -r --parallel 1` | 67.3 | 1.00× | 3.47× slower |
| `bcmr copy -r --parallel 2` | 40.8 | 1.65× | 2.10× slower |
| `bcmr copy -r --parallel 4` | 23.7 | 2.84× | 1.22× slower |
| `bcmr copy -r --parallel 8` | 14.7 | 4.58× | 1.32× faster |
| `scp -r` (baseline) | 19.4 | — | 1.00× |
Near-linear through N=4, diminishing past that (box already running 90+ threads of other users' CPU work — we're taking what we can). At N=8 bcmr beats scp by 24% on the same loaded box.
The scaling curve:
```mermaid
---
config:
  xyChart:
    width: 700
    height: 340
  themeVariables:
    xyChart:
      titleColor: "#222"
      plotColorPalette: "#7E6EAC, #CABBE9"
---
xychart-beta
  title "Exp 19: 10000 × 64 KiB on host-L under load ~93 (lower = faster)"
  x-axis "Pool size N" [1, 2, 4, 8]
  y-axis "Wall time (s)" 0 --> 80
  line [67.3, 40.8, 23.7, 14.7]
  line [19.4, 19.4, 19.4, 19.4]
```
The upper line is bcmr copy -r --parallel N, the flat line at
19.4 s is the scp -r baseline — bcmr crosses below it between
N=4 and N=8. The diminishing return past N=4 is the box telling
us the NIC and disk queue have started bounding us; the crypto
ceiling that limited N=1 is already gone by then.
Implementation: New ServeClientPool { clients: Vec<ServeClient> }
in src/core/serve_client.rs:
- `connect_with_caps(target, caps, n)` opens N connections concurrently via `futures::try_join_all` — total handshake latency ≈ one connection's, not N×, because they auth in parallel.
- `pipelined_put_files_striped` / `pipelined_get_files_striped` partition input files round-robin across the N clients and drive all buckets concurrently via another `try_join_all`. Each bucket runs the existing single-client pipelined method unchanged; the pool is purely a dispatch-and-scatter layer.
- PUT hashes come back in input-index order: each bucket saves its original indices, results are re-scattered into a size-N output vec at the end.
- One-shot protocol ops (`stat`, `list`, `mkdir`) stay on `pool.first_mut()` — no parallelism win for a single round trip, so don't waste the other N-1 connections.
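The scatter/gather bookkeeping is the only subtle part, and it fits in two small functions. Sketch under assumed shapes (the real methods operate on files and `ServeClient`s, not generic items):

```rust
// Stripe items round-robin over N buckets, remembering each item's
// original input index so results can be re-scattered later.
fn stripe<T: Clone>(items: &[T], n: usize) -> Vec<Vec<(usize, T)>> {
    let mut buckets = vec![Vec::new(); n];
    for (i, item) in items.iter().enumerate() {
        buckets[i % n].push((i, item.clone()));
    }
    buckets
}

// Re-scatter per-bucket results into input-index order, regardless of
// which bucket finished first.
fn gather<R: Clone + Default>(buckets: Vec<Vec<(usize, R)>>, len: usize) -> Vec<R> {
    let mut out = vec![R::default(); len];
    for bucket in buckets {
        for (i, r) in bucket {
            out[i] = r;
        }
    }
    out
}
```

Callers see results in input order even though the N buckets complete independently.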
Server unchanged: every connection talks the same protocol
it always did. The pool is a client-side construct; bcmr serve
on the other end sees N independent sessions exactly as if N
separate users had opted in concurrently.
UX: --parallel N on the CLI now applies to both transports.
For the serve fast path, default stays N=1 (no surprise
behavior change for small batches — opening 8 SSH connections
for a 3-file copy is pure handshake tax with no payoff).
Users who know they're moving many small files or need to
saturate a fat pipe opt in explicitly.
The part that's not free:
- N× SSH handshakes on connect (done concurrently so wall latency is ~1 handshake, but CPU + auth cost is N×).
- N× memory for per-client buffers, stdin/stdout pipes, tokio tasks. For N=8 that's ~16 MiB of pipe buffers on the box — trivial. Each `bcmr serve` subprocess adds a few MB of its own.
- Progress callbacks fire from multiple tasks simultaneously, so they need `Fn + Send + Sync + Clone + 'static` (our runner callbacks already are). `on_complete` no longer fires in input order — documented.
What doesn't help: OpenSSH ControlMaster=auto multiplexes
channels over one TCP connection, which means they share a
single cipher stream. Multi-channel ≠ multi-crypto. To lift the
crypto ceiling you genuinely need N TCP connections = N SSH
sessions.
Decision: Ship. First release where bcmr serve is faster
than scp -r on both many-small-files and a realistically
contended system. Single-connection baseline (--parallel 1)
unchanged; users who want the win opt in. For single large files,
--direct=direct lifts the stream ceiling instead (Exp 20).
## Experiment 20: Direct-TCP Data Plane (Path B) Beats scp 2.84× on a Single Large File
Hypothesis: OpenSSH serialises an entire session through one cipher stream, so single-file copies over SSH are walled in at a fraction of the link capacity even when the CPU has plenty of AES-GCM headroom. Carving the data plane onto a dedicated AEAD-framed TCP socket (SSH kept only for auth + rendezvous) should lift that ceiling to within noise of the raw TCP rate.
Method: 1 GiB random file, host-N → host-L GET, LAN link
(single-stream iperf3 -t 5 reports 41 MiB/s as the
physical ceiling). --compress=none --fast to keep the number
about transport cost, not Zstd or BLAKE3. 3 iterations per mode
(best-of-3 reported — WiFi variance is enough that a median
blurs the peak-achievable rate[^1]), fresh dst file each run.
| Command | best-of-3 | vs scp |
|---|---|---|
| `scp` | 9.4 MiB/s | — |
| `bcmr --direct=ssh` | 8.7 MiB/s | 0.93× |
| `bcmr --direct=direct` | 26.7 MiB/s | 2.84× |
| `bcmr --direct=ssh --parallel=4` | 8.4 MiB/s | 0.89× |
| `bcmr --direct=direct --parallel=4` | 27.8 MiB/s | 2.96× |
| iperf3 single-stream ceiling | 41 MiB/s | 4.36× |
Reading: SSH mode tracks scp — same OpenSSH stream, same
ceiling — and lands within noise of it. Direct-TCP is 2.84×
scp and reaches 65 % of the raw TCP rate; SSH mode stays at
23 %. --parallel=4 doesn't help a single 1 GiB file: no
per-file chunk routing exists on this branch, so the extra
sessions only add handshake cost. --parallel earns its keep on
multi-file batches (Exp 19), not on one big file.
Integrity: BLAKE3 of the received bytes matches the source on
every run (cross-checked with md5sum). AEAD framing is active
on every direct-TCP run — is_aead_negotiated() returns true,
enforced by the serve_direct_tcp_put_get_roundtrip assertion.
Decision: Ship. --direct=ssh stays the default (backwards
compat, no new sshd config on the server). Users on LAN-class
links flip --direct=direct and take the ~3× win on single
large files. AEAD is mandatory on direct-TCP so the CLI can't
silently drop into plaintext.
## Summary
Each row below states the specific workload behind the number. Don't lift this table out of context without the qualifiers.
| Decision | Measured Cost | Measured Benefit (workload) |
|---|---|---|
| Per-worker SSH | 0% (additive) | Up to ~6× parallel throughput (mscp's 8-connection, 100 Gbps figure; not re-measured on this tree) |
| Serve protocol | 0% (replaces SSH spawns) | Eliminates per-file process overhead (qualitative; measured at "~50 ms spawn vs ~0.1 ms frame" in Serve Protocol Benefits) |
| Auto-skip wire compression | Negligible (LZ4 ~4 GB/s encode on random 4 MiB blocks) | Applies to all Data frames; per-block auto-skip keeps incompressible blocks raw |
| Wire compression (Zstd-3) | ~320 MB/s encode, ~1 GB/s decode on Apple Silicon | 2.48--5.59× over uncompressed on 64 MiB source-text, ~10 MB/s WAN (Exp 12) |
| `CAP_DEDUP` repeat PUT | One file re-read client-side + hash all blocks | 32% faster (18.9 → 12.9 s) on 64 MiB re-upload, ~10 MB/s WAN — savings match eliminated wire bytes |
| `CAP_FAST` GET | One spawn_blocking + raw write(2) for headers (v0.5.13 fix) | 1.07× on WAN (network-bound); ~1.55× over default on host-L loopback after fix (was 0.78× regression in v0.5.10, see Exp 14) |
| CAS LRU cap | Walk + sort the CAS dir per PUT (cheap) | Holds store size ≤ cap under 3× 24 MiB repeated uploads (Exp 15) |
| `CAP_SYNC` per-file fsync gate | Negotiated bit; off by default (matches cp/scp) | 3.9× on 10000 × 64 KiB host-L loopback (24.75 → 6.35 s); ~10% on 1 GiB single (Exp 17) |
| Client-side request pipelining | Writer-task spawn per batch; `writer.abort()` on error path | 1.8× on 10000 × 64 KiB host-L loopback (6.35 → 3.58 s); lands at 1.18× of `scp -r` (Exp 18) |
| `ServeClientPool` parallel SSH | `--parallel N` opts into N concurrent SSH sessions; default N=1 preserves old behavior | 4.58× scaling N=1→N=8 on host-L under load ~93; bcmr N=8 beats `scp -r` by 24% (14.7 vs 19.4 s) on 10000 × 64 KiB (Exp 19) |
| `--direct=direct` data plane | SSH used only for auth + rendezvous; data over a dedicated AES-256-GCM-framed TCP socket (Path B) | 2.84× vs scp on 1 GiB GET host-N → host-L (best-of-3): scp 9.4 → bcmr-ssh 8.7 → bcmr-direct 26.7 MiB/s; 65% of iperf3's 41 MiB/s ceiling (Exp 20) |
## Footnotes

[^1]: Per-iteration numbers — scp: 8.8, 9.2, 9.4 MiB/s · bcmr-ssh: 8.7, 7.5, 8.4 · bcmr-direct: 16.3, 12.7, 26.7 · bcmr-ssh-4: 8.4, 7.5, 5.7 · bcmr-direct-4: 3.8, 14.3, 27.8. WiFi fluctuation accounts for most of the spread inside each mode. A 10 GbE testbed would turn the bottleneck from transport framing into single-core AES-GCM (~5 GB/s host-N, ~1.5 GB/s host-L per `crypto_probe.rs`) — still an order of magnitude above OpenSSH, but we don't have the hardware to re-measure here. mscp isn't in the table because it doesn't install on host-N; conceptually it overlaps `bcmr --parallel=N` plus per-file chunk routing, and this branch doesn't yet stripe a single file across streams.