SWAR & SIMD Scanning
Under the generated fast path, the lowest-level hot loop is scanning — finding the next structurally-significant character: the closing " of a string, a \ escape, the end of a number. Doing that one character at a time is the slow path. SWAR and SIMD do it many characters at a time, and that's where most of json-as's throughput comes from. JSON_MODE picks which one is generated.
Everything is UTF-16
AssemblyScript strings are sequences of 16-bit code units, not bytes. So every scanner works over u16 lanes, not u8 bytes — a detail that shapes both the SWAR magic constants and the choice of SIMD ops below.
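To make the lane layout concrete, here is a minimal TypeScript sketch (not json-as code) that packs four code units into one 64-bit value the way a little-endian u64 load would see them. The helper name packFourUnits is hypothetical, and BigInt stands in for AssemblyScript's u64.

```typescript
// Pack four consecutive UTF-16 code units into one 64-bit word, lowest lane
// first (little-endian order, which is how WebAssembly linear memory reads).
// Assumes start + 3 is still inside the string.
function packFourUnits(s: string, start: number): bigint {
  let word = 0n;
  for (let lane = 0; lane < 4; lane++) {
    const unit = BigInt(s.charCodeAt(start + lane)); // one 16-bit code unit
    word |= unit << BigInt(16 * lane);
  }
  return word;
}

// 'a', '\', '"', 'b' fill four lanes; each ASCII character still takes 16 bits.
const packed = packFourUnits('a\\"b', 0);
console.log(packed.toString(16).padStart(16, "0")); // 00620022005c0061

// Length counts code units, not characters: U+1F600 needs two lanes.
console.log("\u{1F600}".length); // 2
```

Every ASCII character carries a zero high byte, which is why the broadcast constants used by the scanners have the shape 0x00XX repeated per lane.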
SWAR — 4 lanes per 64-bit word
SWAR ("SIMD Within A Register") finds a target code unit in a plain u64 using only arithmetic. To locate a backslash (0x5C) in the four UTF-16 lanes of a word, json-as XORs the word with a broadcast of the target — zeroing any lane that matches — then runs the classic "is there a zero lane?" borrow trick:
// 4 UTF-16 code units packed in one u64; find a lane == '\' (0x5C)
const b = block ^ 0x005c_005c_005c_005c; // matching lane becomes 0x0000
const mask = (b - 0x0001_0001_0001_0001) & ~b & 0x0080_0080_0080_0080;
// each matching lane now carries a 0x80 marker
Subtracting 1 from a lane turns bit 7 of its low byte from 0 to 1 only when that low byte is already zero; & ~b & 0x0080… isolates that bit per lane. Because the marker looks only at the low byte, a code unit such as U+015C (low byte 0x5C, non-zero high byte) would be flagged too, so a second mask rejects lanes whose high byte is non-zero and only a true 0x005C counts. No branches, four lanes checked at once.
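The trick can be tried outside AssemblyScript with BigInt standing in for u64. The following is a sketch under stated assumptions, not the library's implementation: the pack helper is hypothetical, and the high-byte guard shown is one possible formulation, since the exact guard json-as uses is not given here.

```typescript
const SPLAT_5C = 0x005c_005c_005c_005cn;  // '\' broadcast to four u16 lanes
const ONES = 0x0001_0001_0001_0001n;
const MARKERS = 0x0080_0080_0080_0080n;
const LOW_BYTES = 0x00ff_00ff_00ff_00ffn;
const SEVENS = 0x007f_007f_007f_007fn;

// Hypothetical helper: pack four code units into one word, lane 0 lowest.
function pack(units: number[]): bigint {
  let word = 0n;
  for (let lane = 0; lane < 4; lane++) {
    word |= BigInt(units[lane] ?? 0) << BigInt(16 * lane);
  }
  return word;
}

// Returns the lane (0..3) of the first code unit equal to 0x005C, or -1.
function firstBackslashLane(block: bigint): number {
  const b = block ^ SPLAT_5C; // a matching lane becomes 0x0000
  // asUintN emulates u64 wraparound for the subtraction.
  const lowZero = BigInt.asUintN(64, b - ONES) & ~b & MARKERS;
  // Assumed guard: mark lanes whose high byte is non-zero, then drop them.
  const hi = (block >> 8n) & LOW_BYTES; // high byte moved into the low byte
  const hiNonZero = ((hi + SEVENS) | hi) & MARKERS;
  const mask = lowZero & ~hiNonZero;
  // AssemblyScript could use ctz(mask) >> 4; BigInt has no ctz, so test each lane.
  for (let lane = 0; lane < 4; lane++) {
    if (((mask >> BigInt(16 * lane + 7)) & 1n) !== 0n) return lane;
  }
  return -1;
}

console.log(firstBackslashLane(pack([0x61, 0x5c, 0x63, 0x5c])));   // 1
console.log(firstBackslashLane(pack([0x015c, 0x61, 0x5c, 0x62]))); // 2 (U+015C rejected)
console.log(firstBackslashLane(pack([0x61, 0x62, 0x63, 0x64])));   // -1
```

Lanes above the first match can pick up a spurious marker from borrow propagation (for example `\` followed by `]`), which is harmless as long as only the lowest marker is used, as it is here.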
SIMD — 8 lanes per 128-bit vector
With --enable simd, the same idea uses real vector instructions over a v128 of eight u16 lanes. Compare-equal produces an all-ones lane per match, bitmask collapses that to one bit per lane, and ctz (count-trailing-zeros) gives the first match's position:
const SPLAT_5C = i16x8.splat(0x5c); // broadcast '\'
const block = load<v128>(srcStart); // 8 code units
const mask = i16x8.bitmask(i16x8.eq(block, SPLAT_5C));
if (mask != 0) {
const offset = ctz(mask) << 1; // first '\', ×2 (u16 -> bytes)
// ...handle the escape at srcStart + offset
}
Eight code units per iteration, with a branch only when something is actually found.
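For readers without a SIMD-enabled toolchain, the compare / bitmask / ctz pipeline can be emulated in scalar TypeScript. This is only an illustration of the data flow: eqBitmask, ctz32, findBackslash and toUnits are hypothetical names, and the inner loop stands in for what i16x8.eq and i16x8.bitmask do in a single instruction each.

```typescript
// Copy a string's UTF-16 code units into a typed array.
function toUnits(s: string): Uint16Array {
  const out = new Uint16Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i);
  return out;
}

// Emulates i16x8.eq followed by i16x8.bitmask: bit i is set when lane i matches.
function eqBitmask(units: Uint16Array, start: number, target: number): number {
  let mask = 0;
  for (let lane = 0; lane < 8; lane++) {
    if (units[start + lane] === target) mask |= 1 << lane;
  }
  return mask;
}

// Count trailing zeros of a non-zero 32-bit value.
function ctz32(x: number): number {
  return 31 - Math.clz32(x & -x);
}

// Scan whole 8-lane blocks, then finish the tail one code unit at a time.
// Returns the index (in code units) of the first '\', or -1.
function findBackslash(units: Uint16Array): number {
  let i = 0;
  for (; i + 8 <= units.length; i += 8) {
    const mask = eqBitmask(units, i, 0x5c);
    if (mask !== 0) return i + ctz32(mask); // lane index, not a byte offset
  }
  for (; i < units.length; i++) {
    if (units[i] === 0x5c) return i;
  }
  return -1;
}

console.log(findBackslash(toUnits("abc\\def\\ghi")));   // 3
console.log(findBackslash(toUnits("hello world\\n")));  // 11
```

The sketch returns a lane index; the AssemblyScript version shifts it left by one because it works with byte offsets into memory.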
The HYBRID string copy
Deserializing a string is "copy from the source into the field, decoding escapes." json-as fuses the scan and the copy with a strategy tuned for the common case (few or no escapes):
- Clean block (no \ in the 8 lanes) → store the whole v128 to the output and advance 16 bytes. If the clean run continues, find its end with the same vector scan and memory.copy the entire remainder in one call, so the cost is bandwidth-bound, not character-bound.
- Escape block → a single whole-v128 store copies the plain prefix for free (it overshoots past the escape, into reserved slack), then the escape itself is decoded via a lookup table (DESERIALIZE_ESCAPE_TABLE) rather than a branch ladder.
const mask = i16x8.bitmask(i16x8.eq(block, SPLAT_5C));
if (mask == 0) {
store<v128>(bs.offset, block); // stream a clean block
bs.offset += 16;
// ...then bulk memory.copy the rest of the clean run
} else {
store<v128>(bs.offset, block); // one store covers the plain prefix
const laneIdx = ctz(mask) << 1; // byte offset of the first '\' (lane index × 2)
// ...then decode the escape at that offset from a table
}
Those overshooting stores are why the string paths reserve a little extra room (ensureSize(len + 16)): a v128 write near the end of the output may spill up to 15 bytes past the logical cursor, and the slack absorbs it. Serialization uses the same streaming-store idea in reverse, writing the value out in v128-sized chunks.
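The overshoot-then-overwrite pattern is easier to see in a scalar model. The sketch below is a simplified TypeScript stand-in, not the json-as implementation: ESCAPE_TABLE is a hypothetical table covering only single-character escapes (\uXXXX is omitted), the block search is a plain loop instead of a vector compare, and the bulk memory.copy step for long clean runs is left out.

```typescript
const LANES = 8;
const BACKSLASH = 0x5c;

// Hypothetical stand-in for a decode table: unit after '\' -> decoded unit.
const ESCAPE_TABLE = new Map<number, number>([
  [0x22, 0x22], [0x5c, 0x5c], [0x2f, 0x2f], // \"  \\  \/
  [0x62, 0x08], [0x66, 0x0c], [0x6e, 0x0a], // \b  \f  \n
  [0x72, 0x0d], [0x74, 0x09],               // \r  \t
]);

function toUnits(s: string): Uint16Array {
  const out = new Uint16Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i);
  return out;
}

function fromUnits(units: Uint16Array): string {
  let s = "";
  units.forEach((u) => { s += String.fromCharCode(u); });
  return s;
}

function unescapeHybrid(src: Uint16Array): Uint16Array {
  // Slack for overshooting block stores; conservative in this model,
  // because decoded output is never longer than the input.
  const out = new Uint16Array(src.length + LANES);
  let si = 0; // source cursor, in code units
  let di = 0; // logical output cursor, in code units
  while (si < src.length) {
    const n = Math.min(LANES, src.length - si);
    let hit = -1;
    for (let lane = 0; lane < n; lane++) {
      if (src[si + lane] === BACKSLASH) { hit = lane; break; }
    }
    // One block store either way: a clean block, or a plain prefix plus
    // overshoot that later writes simply overwrite.
    out.set(src.subarray(si, si + n), di);
    if (hit < 0) { si += n; di += n; continue; }
    si += hit; // only the prefix counts as written
    di += hit;
    const decoded = ESCAPE_TABLE.get(src[si + 1] ?? -1);
    if (decoded === undefined) throw new Error("unsupported escape");
    out[di++] = decoded;
    si += 2; // skip '\' and the escape character
  }
  return out.slice(0, di);
}

console.log(JSON.stringify(fromUnits(unescapeHybrid(toUnits("ab\\ncd"))))); // "ab\ncd"
```

The point of the design is that the escape path pays for no extra copy of the prefix: the same store that would have handled a clean block is issued, and only the cursor advance differs.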
Same result, three speeds
NAIVE, SWAR, and SIMD produce bit-for-bit identical results; they differ only in how many lanes they process per step (1 / 4 / 8), and therefore in throughput and code size. That is why the mode is a build-time switch rather than three separate libraries.
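An equivalence claim like this is the kind of property a differential test can check. The sketch below is an illustration in TypeScript with hypothetical names (findNaive, findChunked, countMismatches): it compares a one-at-a-time scanner against 4-lane and 8-lane chunked scanners over deterministic pseudo-random inputs. It models the idea only and does not exercise json-as itself.

```typescript
// Reference scanner: one code unit per step.
function findNaive(units: Uint16Array, target: number): number {
  for (let i = 0; i < units.length; i++) {
    if (units[i] === target) return i;
  }
  return -1;
}

// Chunked scanner: `lanes` code units per step, scalar tail.
function findChunked(units: Uint16Array, target: number, lanes: number): number {
  let i = 0;
  for (; i + lanes <= units.length; i += lanes) {
    let mask = 0;
    for (let lane = 0; lane < lanes; lane++) {
      if (units[i + lane] === target) mask |= 1 << lane;
    }
    if (mask !== 0) return i + (31 - Math.clz32(mask & -mask)); // ctz
  }
  for (; i < units.length; i++) {
    if (units[i] === target) return i;
  }
  return -1;
}

// Runs `rounds` random inputs and counts disagreements with the reference.
function countMismatches(rounds: number): number {
  let seed = 12345;
  // Small linear congruential generator so every run sees the same inputs.
  const next = (): number => {
    seed = (Math.imul(seed, 1664525) + 1013904223) >>> 0;
    return seed >>> 16;
  };
  let mismatches = 0;
  for (let r = 0; r < rounds; r++) {
    const units = new Uint16Array(next() % 40);
    for (let i = 0; i < units.length; i++) {
      units[i] = next() % 8 === 0 ? 0x5c : 0x61 + (next() % 26);
    }
    const expected = findNaive(units, 0x5c);
    if (findChunked(units, 0x5c, 4) !== expected) mismatches++;
    if (findChunked(units, 0x5c, 8) !== expected) mismatches++;
  }
  return mismatches;
}

console.log(countMismatches(500)); // 0
```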
