Today, alongside our colleagues at Google and Mozilla, we announced JetStream 3.0, a major update to the cross-browser benchmark suite. While the shared announcement covers the breadth of the suite and the collaborative effort behind it, we wanted to take a moment to dive deeper into how the WebKit team approached these challenges and the engineering work in JavaScriptCore behind our improvements.
Benchmarks are among the best tools browser engine developers have to drive performance. The web is ever-evolving, though, and any benchmark eventually falls out of date as best practices change. Moreover, once the most accessible optimizations for a benchmark have landed, subsequent optimizations tend to become less general and more specific to that exact workload. JetStream 3 represents both a refresh and a fundamental shift in how we measure performance, particularly regarding WebAssembly and the scale of modern web applications.
The Evolution of WebAssembly Benchmarking
One of the most significant changes in JetStream 3 is how we measure WebAssembly (Wasm) workloads. To understand why we changed it, we have to look back at where Wasm started. When JetStream 2 was released, WebAssembly was in its infancy. The earliest adopters of Wasm were large C/C++ projects that previously compiled to an earlier technology, asm.js. We anticipated large C/C++ applications (like video games) where users would be more willing to tolerate a long, one-time startup cost in exchange for high-throughput performance afterward. Consequently, JetStream 2 scored Wasm in two distinct phases: Startup and Runtime.
An Infinity Problem
Over the years, browser engines have become remarkably efficient at instantiating WebAssembly modules. As startup times improved, the payoff from even micro-optimizations compounded. Shaving 0.1 ms off a 100 ms workload is indistinguishable from noise; however, once engines reduced that instantiation time to just 2 ms, the same 0.1 ms improvement suddenly represented a 5% performance gain. In WebKit, for instance, we optimized the startup path so aggressively that for certain smaller workloads, our measured startup time effectively reached zero. In JetStream 2, each iteration’s time was computed with Date.now(), which returns whole milliseconds, so any sub-millisecond time was reported as 0 ms.
This created a unique challenge for the benchmark. The scoring formula was Score = 5000 / Time, so when the time hit zero, the score became infinity. We eventually had to patch the benchmark harness in JetStream 2.2, clamping the score at 5000, so that a 0 ms sub-score didn’t render every other sub-score meaningless.
While getting an infinite score sounds like a victory, it was a clear sign that browser engines had outgrown JetStream 2’s Wasm subtests. On the modern web, Wasm is in the critical path for many page loads. It is used in libraries, image decoders, and UI frameworks. A “zero” startup time in a microbenchmark says nothing about how well the code runs immediately after instantiation, and that behavior was lost in the long, single Runtime score. JavaScript benchmarks in JetStream 2 measured the full execution lifecycle, so we concluded that WebAssembly scoring should work the same way.
In JetStream 3, we retired the separate single-iteration Startup/Runtime scoring for Wasm. Instead, we adopted the same scoring methodology used for JavaScript benchmarks: running the same code for many iterations and extracting:
- First Iteration: Captures compilation and initial setup (the former “startup” cost).
- Worst Case Iterations: Captures “jank,” such as garbage collection or tiering pauses.
- Average Case Iterations: Captures sustained throughput.
These scores are then geometrically averaged to compute the subtest’s overall score, which in turn feeds into the geometric mean of the full benchmark score. This shift forces engines to optimize the entire lifecycle of a WebAssembly instance, ensuring that Wasm integrates smoothly into the interactive web.
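In pseudocode, the scoring works roughly like this. This is an illustrative Python sketch of the methodology described above, not the actual harness code; the `5000` scale constant mirrors the JetStream 2 formula mentioned below, and the worst-case iteration count is an assumption:

```python
from math import prod

def component_score(time_ms, scale=5000.0):
    # JetStream-style scoring: lower time -> higher score.
    return scale / time_ms

def geomean(values):
    return prod(values) ** (1.0 / len(values))

def subtest_score(iteration_times_ms, worst_count=4):
    # First Iteration: compilation and initial setup.
    first = iteration_times_ms[0]
    # Worst Case: average of the slowest N iterations (N is an assumption).
    worst = sum(sorted(iteration_times_ms, reverse=True)[:worst_count]) / worst_count
    # Average Case: sustained throughput across all iterations.
    average = sum(iteration_times_ms) / len(iteration_times_ms)
    return geomean([component_score(t) for t in (first, worst, average)])

def benchmark_score(subtest_iteration_times):
    # The overall score is the geometric mean of the subtest scores.
    return geomean([subtest_score(times) for times in subtest_iteration_times])
```

Because every component feeds a geometric mean, an engine cannot win by making one phase infinitely fast while neglecting the others, which is exactly the failure mode of the old two-phase Wasm scoring.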
Escaping the Microbenchmark Trap
Another major goal for JetStream 3 was to move away from small workloads. In the early days of browser competition, small, tight loops, typically called “microbenchmarks,” were often used to prove one engine was faster than another. While useful for regression testing, microbenchmarks are dangerous for long-term optimization. They encourage engines to over-fit to specific patterns, creating overly specialized optimizations for a particular hot loop or function that might not translate to (or worse, might even harm) general performance.
JetStream 3 focuses on larger, longer-running workloads. By increasing code size and complexity, we dilute the impact of any single “hot” function. This forces the engine to be generally efficient, exercising a wide range of optimizations not only in the JIT compilers but also in standard library functionality.
These new workloads also ensure we are optimizing for the features developers are actually using today:
- JavaScript: Private/public class fields, Promises/async functions, and modern RegExp features.
- WebAssembly: WasmGC (Garbage Collection), SIMD (Single Instruction, Multiple Data), and Exception Handling.
Optimizing JavaScriptCore for JetStream 3
With the new benchmark in place, we turned to the engineering work needed to perform well on it. To that end, the WebKit team has made significant architectural improvements to JavaScriptCore over the last year.
WebAssembly GC
JetStream 3 includes WasmGC workloads compiled from Dart, Java, and Kotlin. These languages exercise different allocation patterns and usage strategies, which stress-tests the engine’s GC implementation broadly.
Inlining GC allocations
When a WasmGC program creates a struct or an array, the browser’s engine needs to allocate memory for that object on the garbage collected heap. In our initial implementation, every allocation went through a general-purpose C++ function call. This is similar to how calling malloc works in C: the program jumps to a shared allocation routine, which finds available memory, sets it up, and returns a pointer. While correct, this overhead adds up quickly in WasmGC workloads that create millions of small objects, which is typical of languages like Dart and Kotlin that compile to WasmGC.
We optimized this in two phases. First, we changed the memory layout of GC objects. Originally, each struct and array had a separate, out-of-line backing store for its fields. This meant that creating a struct required two allocations: one for the object header and one for the field data. It also meant that every field access required following an extra pointer. We changed both structs (291579@main) and arrays (291760@main) to store their field data directly after the object header in a single contiguous allocation, eliminating the second allocation and the pointer indirection. This single change was a roughly 40% improvement on the WasmGC subtests.
Second, we inlined the allocation fast path directly into the generated machine code. Rather than calling out to a C++ function for every allocation, the compiler now emits a short sequence of instructions that bumps a pointer, writes the object header, and returns, all without leaving the generated code. The slow path (when the current memory block is full and a new one needs to be obtained) still calls into C++, but the fast path handles the vast majority of allocations. We added this optimization to both of our WebAssembly compiler tiers: the Build Bytecode Quickly (BBQ) baseline compiler (292808@main, 292925@main) and the Optimized Machine-code Generator (OMG) optimizing compiler (298551@main). We also embedded runtime type information directly into each GC object (306226@main), so that type checks (such as checking whether a cast is valid) can be done with a simple pointer comparison rather than looking up type metadata through a separate table.
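The inlined fast path amounts to a bump allocator. Here is a minimal sketch of the idea in Python; it is illustrative only (names like `BumpAllocator` are invented for this example), and the real fast path is a handful of machine instructions emitted directly by BBQ and OMG:

```python
class BumpBlock:
    """A contiguous chunk of memory managed by bump allocation."""
    def __init__(self, size):
        self.memory = bytearray(size)
        self.cursor = 0          # next free offset
        self.limit = size

class BumpAllocator:
    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.block = BumpBlock(block_size)

    def allocate(self, size, type_info):
        # Fast path: bump the cursor and write the object header.
        # In JITed code this is just a few inlined instructions.
        offset = self.block.cursor
        if offset + size <= self.block.limit:
            self.block.cursor = offset + size
            self.write_header(offset, type_info)
            return offset
        return self.allocate_slow(size, type_info)

    def allocate_slow(self, size, type_info):
        # Slow path (a call into the C++ runtime): grab a fresh block.
        self.block = BumpBlock(max(self.block_size, size))
        return self.allocate(size, type_info)

    def write_header(self, offset, type_info):
        # Embedding runtime type information in the header lets type
        # checks reduce to a simple comparison (here, one toy byte).
        self.block.memory[offset] = type_info & 0xFF
```

With the single-allocation object layout described above, the struct’s fields live immediately after this header, so one bump covers the whole object.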
GC Type Reference Counting
In WebAssembly, every function and GC object has a concrete type defined in the module’s type section. The engine needs to keep these type definitions alive for as long as any object or function references them. Wasm type definitions may be shared across Workers, while JSC’s garbage collector is per-Worker and not global across the process. To solve this mismatch, we use reference counting. In our initial version of WasmGC, each object held a reference to its type definition, and when the object was destroyed, the reference was released.
One of the performance problems we saw initially was that every time the garbage collector destroyed a WasmGC object, it had to decrement the reference count on the object’s type. For workloads that create and destroy large numbers of short-lived objects (which is often common in garbage-collected languages), this meant that garbage collection was spending a large fraction of its time just updating these reference counts. On top of that, looking up type information required acquiring a global lock and searching a hash table; this created contention when multiple threads were compiling or running WebAssembly code simultaneously.
We attacked this from several angles. The most impactful change was eliminating the need for GC objects to have destructors at all (292257@main). Previously, each GC object had a destructor that released its type reference. By restructuring how type information is stored, so that the garbage collector can find type definitions through the object’s Structure (discussed in more detail in a previous blog post) rather than reference counting them per-object, we removed the destructor entirely. This was a roughly 40% improvement on the Dart-flute-wasm subtest, because the garbage collector could now sweep dead GC objects without doing any per-object cleanup work. Finally, we also moved type references directly into the type definition records themselves (300159@main), removing the need for global locks and hash table lookups when accessing a type’s runtime type information (e.g. for casting).
GC Type Checks
Languages that compile to Wasm with Wasm GC features rely heavily on type checks. Operations like downcasting an object to a more specific type (ref.cast), testing whether an object is an instance of a type (ref.test), and verifying function signatures at indirect call sites (call_indirect) all require the engine to determine at runtime whether one type is a subtype of another. In our initial implementation, each of these checks was a call into a C++ runtime function that walked the type hierarchy, which was expensive for workloads that perform millions of casts per second.
We addressed this by inlining the type checks directly into the generated machine code using a technique known as Cohen’s type display algorithm. The idea is that each type stores a fixed-size array, called a display, containing pointers to all of its ancestor types. To check whether object A is a subtype of type B, the engine only needs to look up a single entry in A’s display at B’s inheritance depth (i.e. the number of superclasses B has) and compare pointers, rather than walking a chain of parent pointers. We first inlined this into our optimizing compiler, OMG (298798@main), and then into our baseline compiler, BBQ (298842@main), so that even code that has not yet been fully optimized benefits. To make the display lookup as fast as possible, we embedded the first six display entries directly into each type’s runtime type information record (299056@main), so that the common case of shallow type hierarchies requires no extra pointer indirection and stays within a single cache line.
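A minimal sketch of the display-based subtype check (illustrative Python, not JavaScriptCore code):

```python
class TypeInfo:
    """Runtime type information carrying a Cohen-style display."""
    def __init__(self, name, parent=None):
        self.name = name
        self.depth = 0 if parent is None else parent.depth + 1
        # The display lists all ancestors plus the type itself,
        # indexed by inheritance depth.
        self.display = ([] if parent is None else list(parent.display)) + [self]

def is_subtype(a, b):
    # a <: b iff b appears in a's display at b's depth:
    # one bounds check and one pointer comparison, no loop.
    return b.depth < len(a.display) and a.display[b.depth] is b
```

Because the check is branch-light and loop-free, it is cheap enough to inline at every `ref.cast` and `ref.test` site, and embedding the first few display entries in the type record keeps the common shallow-hierarchy case within one cache line.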
We also found that many type checks could be eliminated entirely. The Wasm type system carries static type information through the binary, and in many cases the engine can prove at compile time that a value already satisfies the required type, in particular when inlining a generic leaf function. We propagated this static type information from the parser through to code generation (298875@main), so that when the compiler can prove a value is already a GC object, it skips the runtime cell and object checks that would otherwise be emitted (298961@main). Together, these changes keep type-checking overhead minimal across all execution tiers.
Better Inlining Heuristics with Callsite Feedback
When a compiler inlines a function, it replaces a function call with the body of the called function. This avoids the overhead of the call itself and, more importantly, gives the compiler the opportunity to optimize the caller and callee together. For example, if the caller always passes a constant to a parameter, the inlined body can be specialized for that constant.
However, inlining has a cost: it makes the compiled code larger. If the compiler inlines too aggressively, the generated code may end up duplicating the same logic repeatedly, which is worse for the CPU’s instruction cache. Secondly, the more code inlined, the longer compilations take, which in a JIT can have a significant performance impact. This means the compiler needs good heuristics to decide which call sites are worth inlining and which are not.
Our initial Wasm inlining implementation made these decisions one call site at a time. Each call was evaluated independently: if the callee was small enough and the call was hot enough, it would be inlined. The problem is that this local, greedy approach does not consider the overall picture. It might inline several less-important calls and then run out of its code size budget before reaching a call that would have been far more beneficial to inline.
We replaced this with a non-local inlining decision system (302215@main) that considers all the call sites in a function simultaneously. The algorithm assigns a priority to each candidate call site based on how frequently it is executed, how large the callee is, and how much optimization opportunity inlining would unlock. It then inlines call sites in priority order until the code size budget is exhausted. This is particularly important for WasmGC workloads, where source-language compilers for Dart, Kotlin, and Java tend to emit many small functions in order to keep download sizes small, expecting the browser’s compiler to inline the important ones.
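The decision procedure can be sketched as a priority queue drawing down a shared budget. This is a simplification: the priority function here (hotness per byte of callee) and the budget accounting are stand-ins for the real cost model:

```python
import heapq

def choose_inlinees(call_sites, size_budget):
    """Pick call sites to inline, highest priority first, within a budget.

    Each call site is (execution_count, callee_size, site_id). Hotness per
    byte of code is an invented priority heuristic for illustration."""
    heap = [(-count / size, size, site) for count, size, site in call_sites]
    heapq.heapify(heap)
    chosen = []
    while heap and size_budget > 0:
        _neg_priority, size, site = heapq.heappop(heap)
        if size <= size_budget:
            chosen.append(site)
            size_budget -= size
    return chosen
```

The key difference from the old greedy scheme is that every candidate is ranked before any budget is spent, so a very hot call site late in the function can no longer be crowded out by mediocre ones encountered earlier.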
Polymorphic Indirect Call Inlining
Languages that compile to WasmGC, such as Dart and Kotlin, rely heavily on virtual method dispatch. When you call a method on an object in these languages, the actual function that gets called depends on the runtime type of the object. In the generated WebAssembly, this is represented as an indirect call: instead of calling a specific function, the code loads a function pointer from a table and calls through it. Indirect calls are significantly more expensive than direct calls because the processor cannot predict where the call will go, and the compiler cannot inline a function if it does not know which function will be called.
To address this, we added profiling to our baseline compiler, BBQ, that records which function is actually called at each indirect call site as the program runs (299870@main). When our optimizing compiler, OMG, later recompiles the function, it uses this profile data. If a call site always calls the same function (monomorphic), OMG can generate a guard that checks “is the target still the expected function?” and then either takes a fast path that calls (or inlines) the known function directly, or falls back to the slow indirect call path. This guard-and-inline pattern turns what was an unpredictable indirect call into a predictable direct call in the common case.
We extended this approach to handle call sites that call a small number of different functions (301972@main). Rather than giving up when a call site has more than one target, we now profile up to three distinct callees. The optimizing compiler can then generate a sequence of checks, one for each known target, each followed by an inlined copy of that target’s code, with a final fallback for any unknown callee. The fallback is annotated as rare (308557@main), which tells later compiler passes to optimize for the expected case where one of the known targets matches. This optimization is not something a source language compiler can do statically, since it does not know which concrete types will appear at a given call site until the program actually runs.
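The guard-and-inline pattern can be sketched like this, with Python standing in for the machine code OMG emits; the helper names are invented for the example:

```python
def make_dispatcher(profiled_targets, slow_indirect_call):
    """Build a guarded dispatcher from profiled call targets.

    profiled_targets: up to a few (expected_callee, inlined_body) pairs
    recorded by the profiling tier. The chain of identity checks models
    the compare-and-branch sequence the optimizing compiler emits."""
    def dispatch(actual_callee, *args):
        for expected, inlined_body in profiled_targets:
            if actual_callee is expected:      # cheap pointer comparison
                return inlined_body(*args)     # "inlined" fast path
        # Rare fallback: the unpredictable indirect call.
        return slow_indirect_call(actual_callee, *args)
    return dispatch
```

When the guard passes, the processor sees a predictable branch and straight-line inlined code; only genuinely unknown callees pay for the table load and indirect jump.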
Abstract Heaps
We introduced a concept called “Abstract Heaps” to our Wasm compiler pipeline, which we already had in the JS pipeline (300394@main, 300472@main, 300499@main, 305726@main). Abstract Heaps let the compiler reason about the side effects of Wasm instructions more precisely based on types, enabling type-based alias analysis (TBAA), which describes the abstract heap accessed by each operation. By understanding which parts of the heap are affected by specific operations, we can eliminate redundant loads and stores in Wasm. This is particularly useful in combination with inlining, which can reveal more information about the objects being allocated. While the source language compiler can do some of this before shipping code to the browser, not all do, for a number of reasons. For example, code size matters a great deal, so a source language compiler may choose not to inline in order to reduce an application’s launch time. Additionally, as mentioned above, JavaScriptCore profiles indirect function calls and potentially inlines them, which isn’t feasible statically.
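A toy model shows how abstract-heap tags enable redundant load elimination. This is illustrative Python only; real TBAA tracks a hierarchy of heaps and value numbering rather than flat string tags:

```python
def eliminate_redundant_loads(ops):
    """Remove loads whose value is already known, using abstract heaps.

    ops is a list of ("load", heap, addr) / ("store", heap, addr) tuples,
    where heap is an abstract-heap tag (e.g. derived from a field's type).
    A store only invalidates cached loads from the same abstract heap,
    because differently-typed accesses provably cannot alias."""
    known = {}    # (heap, addr) -> value still available in a register
    kept = []
    for op in ops:
        kind, heap, addr = op
        if kind == "load":
            if (heap, addr) in known:
                continue              # redundant: value already available
            known[(heap, addr)] = True
            kept.append(op)
        else:  # store
            # Invalidate only loads that may alias this store's heap.
            known = {k: v for k, v in known.items() if k[0] != heap}
            kept.append(op)
    return kept
```

Without the type-based tags, every store would have to invalidate every cached load, which is why a coarse effect model leaves so many redundant memory accesses behind.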
Register Allocator Improvements
A register allocator is the part of a compiler that decides which program values live in CPU registers (fast) and which must be “spilled” to memory (slow). It is one of the most performance-critical compiler passes because nearly every instruction in the generated code is affected by its decisions. It is also one of the most compilation-time-critical passes, especially for WebAssembly where compilation speed directly impacts page load time.
We replaced our previous register allocator, which dynamically switched between a graph coloring algorithm (iterated register coalescing) and a linear scan algorithm, with a new greedy register allocator. The greedy algorithm works by processing each program value’s “live range” (the span of instructions during which the value is needed) in priority order. For each value, it attempts to assign a register. If no register is available, it considers whether evicting a lower-priority value from its register would be beneficial. In our measurements, greedy allocation produces similar code quality to graph coloring, while running much faster. It also produces better code than linear scan because it can make more globally informed decisions about which values deserve registers the most.
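A simplified sketch of priority-order allocation follows. It is illustrative Python, not the JSC allocator: because ranges are processed in priority order, every conflicting range here was placed at higher priority, so this toy spills the current range; the real allocator can also split ranges and evict when that is profitable:

```python
def greedy_allocate(live_ranges, registers):
    """Assign registers to live ranges in priority order.

    live_ranges: list of (priority, start, end, name), where priority is,
    say, estimated use frequency. Returns name -> register or "spill"."""
    assignment = {}
    placed = []   # (start, end, name, reg) currently holding registers

    def overlaps(s1, e1, s2, e2):
        return s1 < e2 and s2 < e1

    for _priority, start, end, name in sorted(live_ranges, reverse=True):
        conflicts = {reg for (s, e, _n, reg) in placed
                     if overlaps(start, end, s, e)}
        free = [r for r in registers if r not in conflicts]
        if free:
            reg = free[0]
            assignment[name] = reg
            placed.append((start, end, name, reg))
        else:
            # All conflicting ranges are higher priority: spill this one.
            assignment[name] = "spill"
    return assignment
```

Processing hot values first is what gives greedy allocation its edge over linear scan: the values that matter most claim registers before the long tail is even considered.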
The core data structure that tracks which registers are in use at which points in time was originally built on a balanced binary tree (std::set), which required many pointer-chasing operations across different cache lines. We replaced it with a B+ tree specifically designed for interval queries (299207@main), where each tree node is sized to fit in one or two cache lines. This made the register allocation phase nearly 2x faster for large functions, which matters because WebAssembly functions generated from languages like Dart and Kotlin can contain tens of thousands of values.
Beyond the core algorithm, we added several important features. Live range splitting (304261@main) allows the allocator to keep a spilled value in a register for the portion of code where it is most heavily used, even if it cannot stay in a register for its entire lifetime. Coalescing with pinned registers (304889@main) eliminates unnecessary copy instructions when WebAssembly runtime values, such as the pointer to the current instance, are already in a known register. Proactive spill slot coalescing (306945@main) reduces the size of the stack frame by sharing memory between spilled values that are never alive at the same time. And for values that are simple constants, the allocator can now rematerialize them (recompute them from scratch) at each use point rather than loading them from the stack (292654@main, 302864@main), which frees up a register for other values.
The improved compilation speed of the greedy allocator also had an indirect benefit: it allowed us to raise the maximum function size that our optimizing compiler will attempt to compile (292197@main). Previously, very large functions were left in the baseline tier because the optimizing compiler took too long. With the greedy allocator, we can now fully optimize larger functions without unacceptable compilation pauses. Once the greedy allocator had proven itself across all configurations, we removed the old linear scan allocator entirely (305981@main).
IPInt support for SIMD
JavaScriptCore has three execution tiers for WebAssembly: BBQ and OMG are the JIT tiers, as described above, but there is also an interpreter tier, IPInt. IPInt executes WebAssembly bytecode directly without any compilation, taking inspiration from the Wizard engine. This means IPInt can begin running code nearly instantly, only requiring a small amount of metadata. BBQ and OMG, being JIT compilers, take longer to generate code than IPInt and have a larger memory footprint but have higher throughput.
In JetStream 3, there are several workloads (transformersjs-bert-wasm and dotnet-aot-wasm) that use SIMD instructions. Previously, as IPInt did not support SIMD instructions, when a WebAssembly module used SIMD, the engine had to synchronously compile the function with BBQ before it could execute. This meant that any module using SIMD lost the near instant-start benefit of the interpreter tier and paid a mandatory compilation cost up front. It also meant that SIMD could not be used at all in JIT-less configurations, which is a requirement in some contexts for security reasons (e.g. when Lockdown Mode is enabled).
We implemented full SIMD support in IPInt, covering all of the standard WebAssembly SIMD instructions on both ARM64 and x86_64. Beyond the instructions themselves, we needed to teach the rest of the interpreter about the 128-bit v128 type: local and global variable access, function calls and tail calls, on-stack replacement to BBQ, exception handling, and WasmGC struct and array fields. We also ensured that SIMD works correctly when the JIT is completely disabled. With all of this in place, we enabled IPInt SIMD by default (301576@main), giving WebAssembly SIMD code the same near instant-start execution that non-SIMD code has always had.
JavaScript improvements
JetStream 3 includes workloads that exercise modern JavaScript features that were not yet standardized or widely adopted when JetStream 2 was released, including BigInt arithmetic, async/await-heavy control flow, and Promise combinators. Optimizing for these workloads led to improvements across several subsystems in JavaScriptCore.
BigInt Improvements
BigInt is a JavaScript feature that allows programs to work with arbitrarily large integers. Unlike regular JavaScript numbers, which are 64-bit floating point values and lose precision for integers beyond 2^53, BigInts can represent integers of any size. JetStream 3 includes workloads that exercise BigInt arithmetic heavily, which exposed several areas where our implementation could be improved.
We made improvements on two fronts: the algorithms used for arithmetic and the memory representation of BigInt values.
For multiplication, we implemented the Comba algorithm (304184@main), which is significantly faster for multiplying small, same-sized BigInts. The traditional “schoolbook” multiplication algorithm processes one digit at a time, writing partial results and carrying. Comba’s algorithm instead accumulates all the partial products for each column of the result at once, which allows the processor to pipeline the multiply-and-add operations more efficiently. We also added a fast path for single-digit multipliers (303956@main), which is the common case for operations like doubling a value.
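The column-oriented structure of Comba multiplication can be sketched as follows, using illustrative Python with 32-bit digits (JavaScriptCore’s digit size and fast paths differ):

```python
def comba_multiply(a_digits, b_digits, base=1 << 32):
    """Multiply two little-endian digit arrays column by column.

    Instead of the schoolbook row-by-row scan with per-digit carries,
    Comba accumulates every partial product belonging to one result
    column before emitting a digit, keeping the carry in one accumulator."""
    n, m = len(a_digits), len(b_digits)
    result = [0] * (n + m)
    acc = 0
    for col in range(n + m - 1):
        # Sum all partial products a[i] * b[col - i] for this column.
        lo = max(0, col - m + 1)
        hi = min(col, n - 1)
        for i in range(lo, hi + 1):
            acc += a_digits[i] * b_digits[col - i]
        result[col] = acc % base
        acc //= base          # carry into the next column
    result[n + m - 1] = acc
    return result
```

Because each column’s partial products are independent multiply-accumulates into one running sum, hardware can pipeline them without waiting on intermediate carry propagation.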
For division, we imported and adapted a modern division algorithm implementation (304114@main), and then further optimized it with the DIV2BY1 technique (304278@main). DIV2BY1 works by pre-computing the multiplicative inverse of the divisor. Then, instead of dividing, each step of the algorithm multiplies by this inverse, which is faster on modern CPUs where multiplication is much cheaper than division. This is especially effective when dividing repeatedly by the same divisor, which is common in operations like converting a BigInt to a decimal string. We also added a cache for remainder computation (308930@main) that remembers the inverse, so that repeated modulo operations with the same divisor avoid recomputing it.
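The underlying idea can be illustrated with a Barrett-style reciprocal, a simplification of the actual DIV2BY1 technique: pay for one real division up front, then replace each subsequent division by the same divisor with a multiply, a shift, and a one-step fixup.

```python
K = 128  # precision of the scaled reciprocal; enough for 64-bit numerators

def make_reciprocal(d):
    # The one real division, performed once per divisor.
    return (1 << K) // d

def divmod_by_reciprocal(u, d, m):
    """Divide u by d using the precomputed reciprocal m = floor(2^K / d).

    For u < 2**64 the estimate is exact or one too small, so a single
    conditional correction suffices: no hardware division on the hot path."""
    q = (u * m) >> K
    r = u - q * d
    if r >= d:
        q += 1
        r -= d
    return q, r
```

Converting a BigInt to a decimal string divides by the same power of ten over and over, which is exactly the pattern where amortizing the reciprocal pays off.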
On the memory representation side, BigInts in JavaScriptCore were historically stored as a small header object with a pointer to a separate heap-allocated array of digits. This meant every BigInt operation required following a pointer to get to the actual digit data, and creating a BigInt required two allocations. We changed this so that the digits are stored directly after the object header in a single contiguous allocation (307016@main), similar to how we improved WasmGC struct layout. We also shrank the minimum BigInt object size to just 16 bytes (308026@main) by storing the sign bit in an existing, unused per-object bitfield rather than using a dedicated byte for it. For arithmetic operations that need temporary working space, we switched from allocating temporary BigInt objects on the garbage-collected heap to using stack-allocated vectors (304337@main), which avoids triggering garbage collection during intermediate computation steps. Lastly, we improved how our JIT compilers speculate on BigInt types (308054@main), allowing the optimizing compiler to generate more efficient code when it can prove that a value is a BigInt.
MicrotaskQueue and Async function improvements
The microtask queue is the mechanism that JavaScript engines use to schedule small units of deferred work. Every time a Promise resolves, the callbacks attached to it are placed on the microtask queue. Every time an async function hits an await, the continuation of that function is scheduled as a microtask. In heavily asynchronous code, such as the doxbee-async benchmark in JetStream 3 that models a chain of database operations, the microtask queue processes millions of tasks per second. This made it a critical performance target.
Our microtask queue was originally implemented in WebCore, WebKit’s web platform layer, and JavaScriptCore called into it through a virtual function interface. Every time the engine needed to schedule or run a microtask, it crossed this boundary, which involved virtual dispatch, redundant safety checks, and prevented the compiler from optimizing across the boundary. We completely rewrote the microtask queue, first extracting it into its own subsystem (291566@main), then moving it into JavaScriptCore (291649@main), and finally moving the last remaining WebCore enqueue code into JSC as well (308483@main). The new implementation uses specialized entry points for calling microtask functions (305407@main) that skip setup work that is redundant when the engine is already running, and uses lightweight tagging techniques (308140@main) to dispatch microtasks without needing to read type metadata from memory.
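Conceptually, the queue itself is simple; the property that matters is that draining also runs tasks enqueued by tasks already running, which is what lets an await chain flow through a single checkpoint. A toy Python model (not the JSC implementation):

```python
from collections import deque

class MicrotaskQueue:
    """A toy model of the engine's microtask queue."""
    def __init__(self):
        self._tasks = deque()

    def enqueue(self, task):
        # Promise reactions and async-function continuations land here.
        self._tasks.append(task)

    def drain(self):
        # Run at a checkpoint, e.g. after the current script job.
        # Tasks enqueued while draining run in this same checkpoint.
        while self._tasks:
            task = self._tasks.popleft()
            task()
```

Since workloads like doxbee-async push millions of tasks per second through this loop, eliminating even one virtual call or redundant check per task was worth the rewrite.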
Alongside the microtask queue rewrite, we systematically moved all of our Promise implementation from JavaScript builtins to C++. In JavaScriptCore, “builtins” are internal functions implemented in JavaScript that ship with the engine and implement parts of the standard library. While convenient to write, they have significant startup costs because they need to be parsed and compiled, and the JIT compilers have limited ability to optimize across the boundary between builtin JavaScript code and the engine’s C++ internals. We rewrote Promise.all (304039@main), Promise.race (303893@main), Promise.allSettled (304081@main), Promise.any (304084@main), Promise.prototype.finally (305248@main), and Promise.resolve/reject (301423@main) in C++. This eliminated the parsing and compilation overhead and allowed us to hand-optimize the hot paths. We also added support in our optimizing JIT compilers for Promise.prototype.then (308636@main), so that when the compiler can prove the promise is a standard Promise (not a subclass with overridden behavior), it can generate the then operation inline without any function call overhead.
For async functions specifically, we made three significant improvements. First, we changed how async functions are resumed after an await. Previously, awaiting a value would create a promise reaction (a callback) that, when triggered, would schedule a microtask to resume the function. This double-dispatch meant each await required two trips through the microtask queue. We changed this so that the microtask queue drives async function resumption directly (303208@main), cutting the overhead in half. Second, for async functions that contain no await (which is more common than it might sound, since conditional code paths may not always reach an await), we inline the entire function body so that it executes synchronously without any generator or microtask overhead at all (304987@main). Third, after each await, the engine must check whether the awaited value is a “thenable” (an object with a then method), because thenables are treated specially by the Promise resolution algorithm. We optimized this check by tracking whether the then property has ever been added to common object types (304952@main), so that for plain objects and iterator results that do not have a then property, the check can be skipped entirely (304355@main).
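The thenable fast path in that third improvement can be modeled with a watchpoint-style flag. This is illustrative Python; the engine tracks this state per object type, and all names here are invented:

```python
class ShapeWatchpoint:
    """Tracks whether any object of a given shape ever gained a `then`.

    Stands in for the engine's watchpoint on common object types: while
    it has not fired, `await` can skip the thenable lookup entirely."""
    def __init__(self):
        self.then_ever_added = False

def resolve_awaited(value, watchpoint):
    if isinstance(value, dict):
        if not watchpoint.then_ever_added:
            return value            # fast path: provably not a thenable
        then = value.get("then")
        if callable(then):
            # Slow path: hand off to the promise resolution machinery.
            return ("thenable", then)
    return value

def define_property(obj, key, val, watchpoint):
    if key == "then":
        watchpoint.then_ever_added = True   # fire the watchpoint
    obj[key] = val
```

The cost model is asymmetric by design: programs that never attach a `then` to plain objects (the overwhelming majority) pay nothing, while the rare program that does simply falls back to the full lookup.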
Performance Results
The combination of these architectural changes resulted in roughly a 10% improvement from Safari 26.0 to Safari 26.4. Because JetStream 3 scores the full lifecycle of each workload, this improvement reflects gains that users experience directly: faster initial page loads from quicker WebAssembly compilation, and smoother interactions from more efficient JavaScript execution and reduced garbage collection pauses.

Conclusion
JetStream 3 reflects a shift in how we think about browser benchmarking: scoring the full lifecycle rather than isolated phases, measuring real-world workloads rather than microbenchmarks, and developing the suite collaboratively across browsers. Building it pushed us to make meaningful architectural improvements to JavaScriptCore, from WasmGC allocation and inlining to BigInt arithmetic and async function execution. We’re excited about the opportunities provided by the JetStream 3 benchmark to further improve the performance of applications on the web. As always, we will continue our efforts to make the fastest, most secure browser for Safari users.











