Performance
str exists to delete allocations. A view-producing op (slice, trim, substring, …) is a couple of pointer moves and one small object — versus native String, which allocates a new string and copies the bytes every time. The scanning ops add SWAR/SIMD kernels, and replace / padStart / padEnd are built directly from the view in a single pass.
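To make the "couple of pointer moves and one small object" claim concrete, here is a minimal sketch of a view type, written in TypeScript. The `StrView` class, its fields, and the `clamp` and `isSpace` helpers are hypothetical names used for illustration, not the library's actual API, and `isSpace` covers only four ASCII whitespace characters.

```typescript
// Hypothetical view type: illustrates the idea, not the library's real class.
class StrView {
  readonly backing: string; // shared backing string, never copied by view ops
  readonly start: number;   // offset into backing, in UTF-16 code units
  readonly length: number;  // number of code units visible through the view

  constructor(backing: string, start: number, length: number) {
    this.backing = backing;
    this.start = start;
    this.length = length;
  }

  static from(s: string): StrView {
    return new StrView(s, 0, s.length);
  }

  // slice: clamp the bounds, then build a new view over the same backing string.
  slice(begin: number, end: number = this.length): StrView {
    const b = clamp(begin < 0 ? this.length + begin : begin, 0, this.length);
    const e = clamp(end < 0 ? this.length + end : end, 0, this.length);
    return new StrView(this.backing, this.start + b, Math.max(0, e - b));
  }

  // trim: move both ends inward past whitespace; still no copy.
  trim(): StrView {
    let lo = this.start;
    let hi = this.start + this.length;
    while (lo < hi && isSpace(this.backing.charCodeAt(lo))) lo++;
    while (hi > lo && isSpace(this.backing.charCodeAt(hi - 1))) hi--;
    return new StrView(this.backing, lo, hi - lo);
  }

  // Materialize only when a real string is needed; this is the one copy.
  toString(): string {
    return this.backing.substring(this.start, this.start + this.length);
  }
}

function clamp(x: number, lo: number, hi: number): number {
  return x < lo ? lo : x > hi ? hi : x;
}

// Simplified whitespace test (assumption): space, tab, LF, CR only.
function isSpace(unit: number): boolean {
  return unit === 0x20 || unit === 0x09 || unit === 0x0a || unit === 0x0d;
}
```

Chaining `StrView.from(s).trim().slice(6)` in this sketch creates two small objects and touches no string bytes until `toString()` is called.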
Figures are microbenchmarks via as-bench, all over one ~2 KB string, on wasmtime. Reproduce with `npm run charts:build`.
Per-Operation Speedup
Every native String operation vs its str counterpart — native (red) is the 1× baseline, str (blue) its speedup:
| Operation | vs native String |
|---|---|
| replace | ~12× faster |
| indexOf / includes | ~8.5× faster |
| replaceAll | ~3.7× faster |
| lastIndexOf | ~2.6× faster |
| padStart / padEnd | ~1.9× faster |
| trim / trimStart | ~1.4–1.5× faster |
| slice / substring | ~parity (no copy) |
| toUpperCase / toLowerCase | ~parity (defers to native) |
View ops sit at parity on a tiny slice, where the avoided copy is cheap, and pull ahead as the slice grows, since str never copies. replace / replaceAll are also correct in cases where this asc version's native String#replaceAll corrupts longer replacements. For that reason str fuzzes them against a trusted reference rather than against the native method.
Throughput
Native vs str SWAR vs str SIMD, in millions of ops/sec:
SWAR and SIMD
The scanning hot paths — indexOf, includes, lastIndexOf, and compare — are accelerated in three tiers, chosen at compile time:
- SIMD: 8 code units per step via `v128`, used when `--enable simd` is set (`ASC_FEATURE_SIMD`).
- SWAR: SIMD-Within-A-Register, 4 code units per step with ordinary `u64` math (a Mycroft zero-detect for the unit search). The default when SIMD is off.
- scalar: handles the short sub-block tail.
When SIMD is off the entire v128 branch is dead-code-eliminated, and vice versa, so you only pay for the tier you build. Wide loads are always bounded by the remaining length, so they never read past the backing string — no scratch padding is needed.
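The SWAR tier's zero-detect can be sketched as follows. TypeScript numbers only offer 32-bit integer operations, so this version packs 2 code units per word rather than the 4 that fit in a `u64`; the 64-bit form uses the same expression with the constants widened to `0x0001000100010001` and `0x8000800080008000`. The function name `indexOfUnit` is hypothetical. The word is assembled from two `charCodeAt` calls here, so the sketch shows the bit trick and the bounded wide loop, not the speedup a single wide load gives in Wasm.

```typescript
const ONES = 0x00010001;  // 1 in each 16-bit lane
const HIGHS = 0x80008000; // high bit of each 16-bit lane

// Hypothetical helper: find the first index >= from where s holds `unit`.
function indexOfUnit(s: string, unit: number, from: number = 0): number {
  const n = s.length;
  const pattern = unit | (unit << 16); // broadcast the unit into both lanes
  let i = Math.max(0, from);

  // Wide loop: runs only while a full 2-unit word fits in the remaining
  // length, so it never reads past the end of the string.
  for (; i + 2 <= n; i += 2) {
    const w = s.charCodeAt(i) | (s.charCodeAt(i + 1) << 16);
    const x = w ^ pattern;               // a matching lane becomes zero
    const hit = (x - ONES) & ~x & HIGHS; // Mycroft test: nonzero if any lane is zero
    if (hit !== 0) {
      // The lowest flagged lane is always a true match; a higher lane can be
      // a false positive only when a lower lane already matched.
      return (hit & 0x8000) !== 0 ? i : i + 1;
    }
  }

  // Scalar tail: at most one leftover code unit in this 2-lane version.
  for (; i < n; i++) {
    if (s.charCodeAt(i) === unit) return i;
  }
  return -1;
}
```

Checking the low lane first is what makes the result exact: a borrow out of a matching low lane can falsely flag the lane above it, but never the other way around.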
Copies and equality checks use the same idea: copyBytes and equalsBytes run a size-tiered manual loop (v128 / u64 / scalar tail) that beats the bulk-memory intrinsics on small/medium ranges, and fall back to memory.copy / memory.compare on large ones.
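The size-tiered shape described above might look roughly like this in TypeScript. This is a sketch under several assumptions: `BULK_THRESHOLD` is a made-up cutoff (the source does not state the real one), the inner 8-byte loop stands in for a single `u64` load and store, the `v128` tier is omitted, `Uint8Array.set` stands in for `memory.copy`, and the source and destination ranges are assumed not to overlap.

```typescript
// Hypothetical cutoff between the manual loop and the bulk fallback.
const BULK_THRESHOLD = 256;

function copyBytes(
  dst: Uint8Array, dstOff: number,
  src: Uint8Array, srcOff: number,
  len: number,
): void {
  // Large ranges: hand off to the bulk primitive (stand-in for memory.copy).
  if (len >= BULK_THRESHOLD) {
    dst.set(src.subarray(srcOff, srcOff + len), dstOff);
    return;
  }

  let i = 0;
  // Wide tier: 8 bytes per step while a full block fits in the remaining
  // length. In Wasm this would be one u64 load and one u64 store.
  for (; i + 8 <= len; i += 8) {
    for (let k = 0; k < 8; k++) {
      dst[dstOff + i + k] = src[srcOff + i + k];
    }
  }
  // Scalar tail: the last 0 to 7 bytes.
  for (; i < len; i++) {
    dst[dstOff + i] = src[srcOff + i];
  }
}
```

An `equalsBytes` counterpart would follow the same tiers, comparing instead of storing and returning early on the first difference.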
Running benchmarks locally
```sh
npm run bench          # microbenchmarks (as-bench)
npm run charts:build   # bench both builds and render charts to build/charts/
npm run charts         # build and serve the charts locally
```

Both the SIMD and SWAR builds are covered by the test suite (run under two as-test modes) and by differential fuzzing against the native String methods, so the accelerated paths stay byte-exact with the standard library.
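Differential fuzzing in this sense can be sketched in a few lines of TypeScript: generate random inputs from a seeded generator, run the implementation under test and the native method on the same input, and count disagreements. Everything named here (`makeRng`, `randomString`, `candidateIndexOf`, `fuzzIndexOf`) is hypothetical and is not the project's actual fuzz harness; `candidateIndexOf` is a naive stand-in for an accelerated routine.

```typescript
// Park-Miller style generator: small, seeded, and reproducible.
function makeRng(seed: number): () => number {
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return function (): number {
    state = (state * 48271) % 2147483647;
    return state;
  };
}

// A tiny alphabet makes real matches likely, which exercises the hit paths.
function randomString(rng: () => number, maxLen: number, alphabet: string): string {
  const len = rng() % (maxLen + 1);
  let out = "";
  for (let i = 0; i < len; i++) {
    out += alphabet.charAt(rng() % alphabet.length);
  }
  return out;
}

// Hypothetical stand-in for the implementation under test.
function candidateIndexOf(haystack: string, needle: string): number {
  if (needle.length === 0) return 0;
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    let j = 0;
    while (j < needle.length && haystack.charCodeAt(i + j) === needle.charCodeAt(j)) j++;
    if (j === needle.length) return i;
  }
  return -1;
}

// Returns the number of inputs on which the candidate and native disagree.
function fuzzIndexOf(iterations: number, seed: number): number {
  const rng = makeRng(seed);
  let mismatches = 0;
  for (let k = 0; k < iterations; k++) {
    const haystack = randomString(rng, 64, "ab ");
    const needle = randomString(rng, 4, "ab ");
    if (candidateIndexOf(haystack, needle) !== haystack.indexOf(needle)) mismatches++;
  }
  return mismatches;
}
```

Fixing the seed keeps a failing case reproducible, so a mismatch found in one build can be replayed against the other.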
