Local Multi-File Performance
This page covers performance investigations of the local hot path --- the in-process logic that runs when bcmr is copying or moving files on a single host. The common thread is that the v0.5.4 design had correctness-first defaults (fsync after every rename, serial file-at-a-time copy, hash every byte twice) that were the right baseline but paid for guarantees the user hadn't asked for, and left cores and queues idle on typical workloads.
The Streaming Checkpoint Copy ablation on the SCC page covered the single-file case where these choices were measured to be free. The experiments here cover the regimes that single-file benchmarks didn't surface: many-files, fast-disk, and unused-hash paths.
Experiment 7: Per-File Durability Cost
Hypothesis: Calling F_FULLFSYNC on the parent directory after
every atomic rename is correct for single-file copies where rework
cost is irrelevant, but dominates wall-clock time when the operation
is copying thousands of small files.
Method: bcmr copy -r of a 2100-file, 9 MiB directory tree
(resembling a small source repo); compared against cp -R and
rsync -a on the same input. Five runs, median reported. macOS
Apple Silicon, APFS SSD, warm cache.
| Command | Before gate | After gate |
|---|---|---|
bcmr copy -r (default) | 9.90 s | 0.72 s |
cp -R | 1.00 s | 1.00 s |
rsync -a | 0.75 s | 0.75 s |
bcmr copy -r --sync | 9.90 s | 9.90 s |
Interpretation: The pre-gate path issued two F_FULLFSYNC calls
per file (one on the file descriptor, one on the parent directory
after rename). F_FULLFSYNC is a full drive-cache flush command
and costs roughly 4 ms of barrier latency per call on the test
hardware. At 2100 files that is 8.4 s of pure barrier time, matching
the observed 9.9 s almost exactly.
Neither cp nor rsync fsyncs by default --- they rely on the OS
page cache and background flush. bcmr now does the same: the default
path does zero per-file fsync calls, and --sync restores the old
durability-strong behaviour (still 9.9 s, but deterministic).
Decision: Gate per-file durable sync on --sync. No final
directory fsync is issued at the end of the operation either, again
matching cp and rsync. Users who need transactional guarantees
opt in explicitly.
Why the original design was wrong: the v0.5.4 ablation argued
F_FULLFSYNC was free (SCC Experiment 5). The measurement was
done on single large files where the fsync cost is amortised over
hundreds of megabytes of copy time. For many-small-files the cost
becomes fsync_count × fsync_latency, which scales with file count,
not file size. The single-file benchmark didn't exercise the regime
where the pathology lives.
Experiment 8: File-Level Parallelism
Hypothesis: execute_plan iterates the plan serially, awaiting
each copy_file before starting the next. For many-file workloads
on an NVMe/APFS device with disk queue, this leaves most of
the queue idle.
Method: 10 000 × 64 KiB files (~640 MiB), sweep --jobs N from
1 to 32. Five runs per N.
--jobs | Mean (s) | Relative |
|---|---|---|
| 1 | 6.52 | 1.67x |
| 2 | 5.23 | 1.34x |
| 4 | 5.05 | 1.29x |
| 8 | 4.16 | 1.06x |
| 16 | 4.39 | 1.12x |
| 32 | 3.91 | 1.00x |
Interpretation: Throughput improves monotonically up to the
physical core count (8 on the test box) and plateaus beyond it. The
platform has 8 performance cores and each copy task is mostly
I/O-bound, so the scheduler can happily overlap 8 tasks worth of
read / write / fdatasync without the kernel becoming the
bottleneck. Beyond 16, adding concurrency just churns the tokio
runtime without unlocking additional disk parallelism.
Decision: Default --jobs = min(num_cpus, 8). Users with faster
storage or different profiles can override.
Implementation note: directory creation stays serial so a parent
always exists before its children try to open files inside it.
walkdir yields parents before contents, so a single pre-pass over
plan.entries picking out CreateDir nodes is enough. The file
stream then runs through futures::stream::buffer_unordered(N).
Experiment 10: Whole-Source BLAKE3 on the I/O Thread
Hypothesis: streaming_copy updates two BLAKE3 hashers per byte
--- the per-block hasher (needed at the next checkpoint to populate
the session) and the whole-source hasher. On macOS NEON BLAKE3 runs
at ~1 GB/s, so the doubled hash work effectively serialises 2 GB/s
worth of CPU against ~2 GB/s of APFS write throughput. The whole-
source hash is only consumed when --verify is set or when a
session is being persisted across runs; for a one-shot copy of a
small file with neither flag, it's pure overhead.
Method: Streaming-path 32 MiB file copy on macOS APFS. Five runs each, hyperfine.
| Mode | Mean (ms) | Δ vs cp |
|---|---|---|
cp (no hash) | 18 | 1.00x |
bcmr stream (block hash only, after fix) | 205 | 11.4x slower |
bcmr stream -V (block + source hash, after fix) | 285 | 15.9x slower |
bcmr stream (block + source hash, before fix) | ~285 | 15.9x slower |
The skip is gated on verify || session.is_some(). For files >= 64
MiB or with --resume/--strict/--append set, the source hash is
needed and computed. The 28 % saving (285 ms → 205 ms) on the no-
hash path is the upper bound for the no-verify case --- the rest of
the gap to cp is the per-block hash itself, the per-checkpoint
posix_fadvise, and tokio I/O scheduling overhead, which a future
revision can pick at separately.
A pipelined version that overlaps the source hash with the next
read+write+block-hash would need a refcounted buffer pool and
update_rayon; deferred to Open Questions.
Experiment 13: One spawn_blocking for the Whole Loop
Hypothesis: Tokio's async file I/O (tokio::fs::File::read,
write, seek, stream_position) wraps each call in its own
spawn_blocking. For a 2 GiB file at 4 MiB blocks that's ~1024
round trips through the blocking-thread pool; on Linux NVMe the
syscall ceiling is much higher than the device, so this overhead
dominates wall time.
Method: Same bcmr copy --reflink=disable of a 2 GiB random
file on host-L (Xeon Gold AVX-512, NVMe ext4, kernel 6.x)
before and after the refactor.
| Command | Before | After | cp |
|---|---|---|---|
| Wall (s) | 12.34 | 5.38 | 2.17 |
| Throughput (MB/s) | 170 | 383 | ~1000 |
Implementation: streaming_copy now try_clone()'s both file
descriptors into std::fs::File handles (which dup the fds, so
the sync handle and the original tokio::fs::File share the same
open file description), takes ownership of the Option<Session>,
and runs the entire read/write/sparse-detect/checkpoint loop
inside one tokio::task::spawn_blocking. The session is returned
through the join handle so the outer async function preserves the
&mut Option<Session> contract.
Decision: Ship the refactor. It's a 2.3x wall-clock win on Linux NVMe and ~1.7x on macOS APFS for the streaming path, with no test regressions across the existing 14 e2e copy cases.
Why not just use io_uring? tokio-uring requires its own
runtime that can't drive standard tokio futures, which would mean
a major restructuring of bcmr. This experiment's 2.3× captures
most of what io_uring would have added on top of tokio::fs;
the rest is tracked on the Open Questions
page.
The v0.5.8 → v0.5.10 progression of the remaining cp-gap — driven by gating the session + checkpoint fsync on the user's explicit intent — is covered in Experiment 16 below.
Experiment 16: Gate Session + Block Hash + Checkpoint Fsync on Intent (v0.5.10)
Hypothesis: v0.5.8's rule "any file > 64 MiB auto-creates a
session" was over-cautious — it paid for resumable semantics the
user hadn't asked for. Gating the session (and therefore the
per-block BLAKE3 and the periodic durable_sync checkpoint) on
the user's explicit resume flags should close most of the
remaining gap to cp for one-shot streaming copies.
Method: 1 GiB random file, mac APFS, --reflink=disable to
force the streaming path. Warm cache. 5 runs, hyperfine.
| command | mean wall (s) | vs cp |
|---|---|---|
cp | 1.14 | 1.00x |
bcmr copy (default, no flag, v0.5.10) | 1.89 | 1.65x |
bcmr copy -C (session created) | 4.46 | 3.91x |
bcmr copy -V (+source rehash + verify) | 4.69 | 4.12x |
bcmr copy (v0.5.9: same command, session auto-created) | ~3.9 | ~3.4x |
What the change does: three separate gates now all check
session.is_some():
create_sessiondrops thefile_size > 64 MiBauto-trigger — only-C/-s/-acreate a session now.block_hasheritself becomesOption<Hasher>driven by session presence. Before, every 4 MiB chunk paid a BLAKE3 pass regardless of whether anyone would read the result. On NEON that's ~1 GB/s of wasted CPU.- The per-64-MiB
durable_sync(dst) + posix_fadvisecheckpoint only runs with a session — its only purpose is to uphold the session's crash-safety invariant, which is moot with no session.
Decision: Ship. Users who want crash-safe resume pass the
flag and pay the price. One-shot bcmr copy big.iso dst/ now
runs at ~83 % of cp's wall time.
The progression of the Linux-NVMe streaming gap across releases (2 GiB file):
| version | wall | gap vs cp |
|---|---|---|
| v0.5.8 streaming | 12.34 s | 5.7x |
| v0.5.9 (spawn_blocking refactor, Exp 13) | 5.38 s | 2.48x |
| v0.5.10 (this experiment) | est. ~2.5 s | ~1.15x |
Summary
Each row ties the benefit to the workload it was measured on.
| Decision | Measured Cost | Measured Benefit (workload) |
|---|---|---|
| Opt-in per-file fsync | ~0 % (default skip) | 13× (9.9 → 0.72 s) on 2100 × 4 KiB files, mac APFS warm cache |
--jobs parallel local copy | 0 % (configurable) | 1.67× (6.52 → 3.91 s) -j1 → -j32 on 10 000 × 64 KiB files, mac APFS warm cache |
| Skip src hash when unused | 0 % (saves CPU) | 28 % (285 → 205 ms) on a 32 MiB streaming copy, no --verify, mac APFS |
| Single spawn_blocking copy loop | One std::fs::File dup per call | 2.3× (12.3 → 5.38 s) on 2 GiB streaming copy, Linux NVMe ext4 |
| Session + checkpoint gated on intent | 0 % (off when unasked) | ~2× (3.9 → 1.89 s) on 1 GiB streaming copy, mac APFS — lands at 1.65× of cp |