Abhinav Jangda, Bobby Powers, Emery D. Berger, and Arjun Guha
University of Massachusetts Amherst
Abstract
All major web browsers now support WebAssembly, a low-level bytecode intended to serve as a compilation target for code written in languages like C and C++. A key goal of WebAssembly is performance parity with native code; previous work reports near parity, with many applications compiled to WebAssembly running on average only 10% slower than native code. However, this evaluation was limited to a suite of scientific kernels, each consisting of roughly 100 lines of code. Running more substantial applications was not possible because compiling code to WebAssembly is only part of the puzzle: standard Unix APIs are not available in the web browser environment. To address this challenge, we build Browsix-Wasm, a significant extension to Browsix [29] that, for the first time, makes it possible to run unmodified WebAssembly-compiled Unix applications directly inside the browser. We then use Browsix-Wasm to conduct the first large-scale evaluation of the performance of WebAssembly vs. native. Across the SPEC CPU suite of benchmarks, we find a substantial performance gap: applications compiled to WebAssembly run slower by an average of 45% (Firefox) to 55% (Chrome), with peak slowdowns of 2.08× (Firefox) and 2.5× (Chrome). We identify the causes of this performance degradation, some of which are due to missing optimizations and code generation issues, while others are inherent to the WebAssembly platform.
1 Introduction
Web browsers have become the most popular platform for running user-facing applications, and until recently, JavaScript was the only programming language supported by all major web browsers. Beyond its many quirks and pitfalls from the perspective of programming language design, JavaScript is also notoriously difficult to compile efficiently [12, 17, 31, 30]. Applications written in or compiled to JavaScript typically run much slower than their native counterparts. To address this situation, a group of browser vendors jointly developed WebAssembly.
WebAssembly is a low-level, statically typed language that does not require garbage collection, and supports interoperability with JavaScript. The goal of WebAssembly is to serve as a universal compiler target that can run in a browser [16, 15, 18]. (The WebAssembly standard is undergoing active development, with ongoing efforts to extend WebAssembly with features ranging from SIMD primitives and threading to tail calls and garbage collection. This paper focuses on the initial and stable version of WebAssembly [18], which is supported by all major browsers.) Towards this end, WebAssembly is designed to be fast to compile and run, to be portable across browsers and architectures, and to provide formal guarantees of type and memory safety. Prior attempts at running code at native speed in the browser [13, 14, 4, 38], which we discuss in related work, do not satisfy all of these criteria.
WebAssembly is now supported by all major browsers [34, 8] and has been swiftly adopted by several programming languages. There are now backends for C, C++, C#, Go, and Rust [39, 24, 2, 1] that target WebAssembly. A curated list currently includes more than a dozen others [10]. Today, code written in these languages can be safely executed in browser sandboxes across any modern device once compiled to WebAssembly.
A major goal of WebAssembly is to be faster than JavaScript. For example, the paper that introduced WebAssembly [18] showed that when a C program is compiled to WebAssembly instead of JavaScript (asm.js), it runs 34% faster in Google Chrome. That paper also showed that the performance of WebAssembly is competitive with native code: of the 24 benchmarks evaluated, the running time of seven benchmarks using WebAssembly is within 10% of native code, and almost all of them are less than 2× slower than native code. Figure 1 shows that WebAssembly implementations have continuously improved with respect to these benchmarks. In 2017, only seven benchmarks performed within 1.1× of native, but by 2019, this number had increased to 13.
These results appear promising, but they raise the question: are these 24 benchmarks representative of WebAssembly’s intended use cases?
The Challenge of Benchmarking WebAssembly
The aforementioned suite of 24 benchmarks is the PolyBenchC benchmark suite [5], which is designed to measure the effect of polyhedral loop optimizations in compilers. All the benchmarks in the suite are small scientific computing kernels rather than full applications (e.g., matrix multiplication and LU decomposition); each is roughly 100 LOC. While WebAssembly is designed to accelerate scientific kernels on the Web, it is also explicitly designed for a much richer set of full applications.
The WebAssembly documentation highlights several intended use cases [7], including scientific kernels, image editing, video editing, image recognition, scientific visualization, simulations, programming language interpreters, virtual machines, and POSIX applications. Therefore, WebAssembly’s strong performance on the scientific kernels in PolyBenchC does not imply that it will perform well on a different kind of application.
We argue that a more comprehensive evaluation of WebAssembly should rely on an established benchmark suite of large programs, such as the SPEC CPU benchmark suites. In fact, the SPEC CPU 2006 and 2017 suites include several applications that fall under the intended use cases of WebAssembly: eight benchmarks are scientific applications (e.g., 433.milc, 444.namd, 447.dealII, 450.soplex, and 470.lbm), two benchmarks involve image and video processing (464.h264ref and 453.povray), and all of the benchmarks are POSIX applications.
Unfortunately, it is not possible to simply compile a sophisticated native program to WebAssembly. Native programs, including the programs in the SPEC CPU suites, require operating system services, such as a filesystem, synchronous I/O, and processes, which WebAssembly and the browser do not provide. The SPEC benchmarking harness itself requires a filesystem, a shell, the ability to spawn processes, and other Unix facilities. To overcome these limitations when porting native applications to the web, many programmers painstakingly modify their programs to avoid or mimic missing operating system services. Modifying well-known benchmarks, such as SPEC CPU, would not only be time-consuming but would also pose a serious threat to validity.
The standard approach to running these applications today is to use Emscripten, a toolchain for compiling C and C++ to WebAssembly [39]. Unfortunately, Emscripten supports only the most trivial system calls and does not scale to larger applications. For example, to enable applications to use synchronous I/O, the default Emscripten MEMFS filesystem loads the entire filesystem image into memory before the program begins executing. For SPEC, these files are too large to fit into memory.
A promising alternative is to use Browsix, a framework that enables running unmodified, full-featured Unix applications in the browser [28, 29]. Browsix implements a Unix-compatible kernel in JavaScript, with full support for processes, files, pipes, blocking I/O, and other Unix features. Moreover, it includes a C/C++ compiler (based on Emscripten) that allows programs to run in the browser unmodified. The Browsix case studies include complex applications, such as LaTeX, which runs entirely in the browser without any source code modifications.
Unfortunately, Browsix is a JavaScript-only solution, since it was built before the release of WebAssembly. Moreover, Browsix suffers from high performance overhead, which would be a significant confounder while benchmarking: with Browsix, it would be difficult to tease apart slowdowns inherent to WebAssembly from performance degradation introduced by Browsix itself.
Contributions
• Browsix-Wasm: We develop Browsix-Wasm, a significant extension to and enhancement of Browsix that allows us to compile Unix programs to WebAssembly and run them in the browser with no modifications. In addition to integrating functional extensions, Browsix-Wasm incorporates performance optimizations that drastically improve its performance, ensuring that CPU-intensive applications operate with virtually no overhead imposed by Browsix-Wasm (§2).

• Browsix-SPEC: We develop Browsix-SPEC, a harness that extends Browsix-Wasm to allow automated collection of detailed timing and hardware on-chip performance counter information in order to perform detailed measurements of application performance (§3).

• Performance Analysis of WebAssembly: Using Browsix-Wasm and Browsix-SPEC, we conduct the first comprehensive performance analysis of WebAssembly using the SPEC CPU benchmark suites (both 2006 and 2017). This evaluation confirms that WebAssembly does run faster than JavaScript (on average 1.3× faster across SPEC CPU). However, contrary to prior work, we find a substantial gap between WebAssembly and native performance: code compiled to WebAssembly runs on average 1.55× slower in Chrome and 1.45× slower in Firefox than native code (§4).

• Root Cause Analysis and Advice for Implementers: We conduct a forensic analysis with the aid of performance counter results to identify the root causes of this performance gap. We find the following results:

  1. The code generated from WebAssembly has more loads and stores than native code (2.02× more loads and 2.30× more stores in Chrome; 1.92× more loads and 2.16× more stores in Firefox). We attribute this to reduced availability of registers, a sub-optimal register allocator, and a failure to effectively exploit a wider range of x86 addressing modes.

  2. The code generated from WebAssembly has more branches, because WebAssembly requires several dynamic safety checks.

  3. Since WebAssembly generates more instructions, it leads to more L1 instruction cache misses.

  Based on these findings, we provide guidance to help WebAssembly implementers focus their optimization efforts in order to close the performance gap between WebAssembly and native code.
2 From Browsix to Browsix-Wasm
Browsix [29] mimics a Unix kernel within the browser and includes a compiler (based on Emscripten [39, 33]) that compiles native programs to JavaScript. Together, they allow native programs (in C, C++, and Go) to run in the browser and freely use operating system services, such as pipes, processes, and a filesystem. However, Browsix has two major limitations that we must overcome. First, Browsix compiles native code to JavaScript and not WebAssembly. Second, the Browsix kernel has significant performance issues. In particular, several common system calls have very high overhead in Browsix, which makes it hard to compare the performance of a program running in Browsix to that of a program running natively. We address these limitations by building a new in-browser kernel called Browsix-Wasm, which supports WebAssembly programs and eliminates the performance bottlenecks of Browsix.
Emscripten Runtime Modifications
Browsix modifies the Emscripten compiler to allow processes (which run in WebWorkers) to communicate with the Browsix kernel (which runs on the main thread of the page). Since Browsix compiles native programs to JavaScript, this is relatively straightforward: each process’s memory is a buffer that is shared with the kernel (a SharedArrayBuffer), so system calls can directly read and write process memory. However, this approach has two significant drawbacks. First, it precludes growing the heap on demand; the shared memory must be sized large enough to meet the high-water-mark heap size of the application for the entire life of the process. Second, JavaScript contexts (such as the main context and each web worker context) have a fixed limit on their heap sizes, currently approximately 2.2 GB in Google Chrome [6]. This cap imposes a serious limitation on running multiple processes: if each process reserves a 500 MB heap, Browsix can run at most four concurrent processes. A deeper problem is that WebAssembly memory cannot be shared across WebWorkers and the main JavaScript context, where the Browsix kernel runs, so this design cannot simply be carried over to WebAssembly processes.
Browsix-Wasm uses a different approach to process-kernel communication that is also faster than the Browsix approach. Browsix-Wasm modifies the Emscripten runtime system to create an auxiliary buffer (of 64 MB) for each process that is shared with the kernel, but is distinct from process memory. Since this auxiliary buffer is a SharedArrayBuffer, the Browsix-Wasm process and the kernel can use the Atomics API for communication. When a system call references strings or buffers in the process’s heap (e.g., writev or stat), the runtime system copies the data from process memory to the shared buffer and sends a message to the kernel with the locations of the copied data in auxiliary memory. Similarly, when a system call returns data in the auxiliary buffer (e.g., read), the runtime system copies the data from the shared buffer back to the location in process memory specified by the caller. Moreover, if a system call specifies a buffer in process memory for the kernel to write to (e.g., read), the runtime allocates a corresponding buffer in auxiliary memory and passes it to the kernel. If a system call reads or writes more than 64 MB of data, Browsix-Wasm splits it into multiple calls, each of which transfers at most 64 MB. The cost of these memory copy operations is dwarfed by the overall cost of a system call invocation, which involves sending a message between the process and kernel JavaScript contexts. We show in §4.2.1 that Browsix-Wasm has negligible overhead.
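To make this mechanism concrete, the following TypeScript sketch shows how a write-like system call might be marshaled through the auxiliary buffer. It is a minimal sketch, not the actual Browsix-Wasm implementation: the names (processHeap, sendSyscall), the system call number, and the buffer layout are placeholder assumptions; only the overall scheme (copy into a shared buffer, split transfers larger than 64 MB, coordinate via the Atomics API) follows the description above.

```typescript
// Hypothetical sketch of the auxiliary-buffer marshaling described above.
const AUX_SIZE = 64 * 1024 * 1024;             // 64 MB auxiliary buffer shared with the kernel
const aux = new SharedArrayBuffer(AUX_SIZE);
const auxBytes = new Uint8Array(aux);
const auxSignal = new Int32Array(aux, 0, 1);   // word the kernel uses to signal completion
const HEADER = 8;                              // bytes reserved for the signal word

// View over the process's WebAssembly linear memory (placeholder here).
const processHeap = new Uint8Array(new WebAssembly.Memory({ initial: 256 }).buffer);

// Placeholder for the postMessage-based request to the Browsix-Wasm kernel context.
function sendSyscall(num: number, args: number[]): void {
  /* postMessage({ num, args }) to the kernel running on the main thread */
}

// Issue write(fd, buf, len) on behalf of the WebAssembly process (runs in a WebWorker).
function sysWrite(fd: number, bufPtr: number, len: number): number {
  let written = 0;
  while (written < len) {
    // Transfers larger than the auxiliary buffer are split into chunks.
    const chunk = Math.min(len - written, AUX_SIZE - HEADER);
    // Copy the data out of process memory into the shared auxiliary buffer.
    auxBytes.set(processHeap.subarray(bufPtr + written, bufPtr + written + chunk), HEADER);
    Atomics.store(auxSignal, 0, 0);
    sendSyscall(/* hypothetical SYS_WRITE */ 1, [fd, HEADER, chunk]);
    // Block this worker until the kernel signals that it has handled the chunk.
    Atomics.wait(auxSignal, 0, 0);
    written += chunk;
  }
  return written;
}
```

The key points mirror the prose: data crosses between process memory and the shared buffer by copying, large transfers are chunked at 64 MB, and the worker blocks on Atomics until the kernel responds.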
Performance Optimization
While building Browsix-Wasm and doing our preliminary performance evaluation, we discovered several performance issues in parts of the Browsix kernel. Left unresolved, these issues would have threatened the validity of any performance comparison between WebAssembly and native code, so we made a series of improvements that reduce overhead throughout Browsix-Wasm. The most serious issue was in BrowserFS, the shared filesystem component included with Browsix/Browsix-Wasm. Originally, on each append operation on a file, BrowserFS would allocate a new, larger buffer and copy the previous and new contents into it; a sequence of small appends could therefore impose substantial performance degradation. Now, whenever a buffer backing a file requires additional space, BrowserFS grows the buffer by at least 4 KB. This change alone decreased the time the 464.h264ref benchmark spent in Browsix from 25 seconds to under 1.5 seconds. Similar, if less dramatic, improvements reduce the number of allocations and the amount of copying in the kernel implementation of pipes.
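As an illustration of the append-path fix described above, the TypeScript sketch below uses a hypothetical GrowableFileBuffer class (not the actual BrowserFS code) that grows a file's backing buffer by at least 4 KB whenever an append runs out of space, so a sequence of small appends no longer reallocates and copies the whole file each time.

```typescript
// Minimal sketch of amortized buffer growth for file appends, assuming a
// Uint8Array-backed file store as in BrowserFS-style in-memory filesystems.
const MIN_GROW = 4096; // grow the backing store by at least 4 KB at a time

class GrowableFileBuffer {
  private data = new Uint8Array(0); // backing store (capacity may exceed length)
  private length = 0;               // bytes actually used by the file

  append(chunk: Uint8Array): void {
    const needed = this.length + chunk.length;
    if (needed > this.data.length) {
      // Allocate extra headroom so subsequent small appends do not reallocate.
      const capacity = Math.max(needed, this.data.length + MIN_GROW);
      const grown = new Uint8Array(capacity);
      grown.set(this.data.subarray(0, this.length));
      this.data = grown;
    }
    this.data.set(chunk, this.length);
    this.length = needed;
  }

  contents(): Uint8Array {
    return this.data.subarray(0, this.length);
  }
}
```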
3 Browsix-SPEC
To reliably execute WebAssembly benchmarks while capturing performance counter data, we developed Browsix-SPEC. Browsix-SPEC works with Browsix-Wasm to manage spawning browser instances, serving benchmark assets (e.g., the compiled WebAssembly programs and test inputs), spawning perf processes to record performance counter data, and validating benchmark outputs.
We use Browsix-SPEC to run three benchmark suites to evaluate WebAssembly’s performance: SPEC CPU2006, SPEC CPU2017, and PolyBenchC. These benchmarks are compiled to native code using Clang 4.0 and to WebAssembly using Browsix-Wasm. We made no modifications to Chrome or Firefox, and the browsers run with their standard sandboxing and isolation features enabled. Browsix-Wasm is built on top of standard web platform features and requires no direct access to host resources; instead, benchmarks make standard HTTP requests to Browsix-SPEC.
3.1 Browsix-SPEC Benchmark Execution
Figure 2 illustrates the key pieces of Browsix-SPEC in play when running a benchmark, such as 401.bzip2, in Chrome. First (1), the Browsix-SPEC benchmark harness launches a new browser instance using the web browser automation tool Selenium (https://www.seleniumhq.org/). (2) The browser loads the page’s HTML, harness JS, and Browsix-Wasm kernel JS over HTTP from the benchmark harness. (3) The harness JS initializes the Browsix-Wasm kernel and starts a new Browsix-Wasm process executing the runspec shell script (not shown in Figure 2). runspec in turn spawns the standard specinvoke (not shown), compiled from the C sources provided with SPEC 2006. specinvoke reads the speccmds.cmd file from the Browsix-Wasm filesystem and starts 401.bzip2 with the appropriate arguments. (4) After the WebAssembly module has been instantiated but before the benchmark’s main function is invoked, the Browsix-Wasm userspace runtime issues an XHR request to Browsix-SPEC to begin recording performance counter statistics. (5) The benchmark harness finds the Chrome thread corresponding to the Web Worker running the 401.bzip2 process and attaches perf to it. (6) At the end of the benchmark, the Browsix-Wasm userspace runtime issues a final XHR to the benchmark harness to end the perf record process. When the runspec program exits (after potentially invoking the test binary several times), the harness JS POSTs (7) a tar archive of the SPEC results directory to Browsix-SPEC. After Browsix-SPEC receives the full results archive, it unpacks the results to a temporary directory and validates the output using the cmp tool provided with SPEC 2006. Finally, Browsix-SPEC kills the browser process and records the benchmark results.
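The TypeScript sketch below illustrates steps (4) and (6) from the userspace runtime's side: a blocking request to Browsix-SPEC before the benchmark's main function runs and another after it finishes. The endpoint paths and payload fields are hypothetical assumptions; only the use of XHR requests to the harness is described above.

```typescript
// Hypothetical sketch of steps (4) and (6): the Browsix-Wasm userspace runtime
// tells Browsix-SPEC to start and stop perf recording around the benchmark.
function notifyHarness(path: string, payload: object): void {
  // A synchronous XHR keeps the benchmark from running before perf is attached.
  const xhr = new XMLHttpRequest();
  xhr.open("POST", path, /* async */ false);
  xhr.setRequestHeader("Content-Type", "application/json");
  xhr.send(JSON.stringify(payload));
}

export function runBenchmark(main: () => void, benchmark: string): void {
  notifyHarness("/perf/start", { benchmark }); // (4) harness attaches perf to this worker
  try {
    main();                                    // run the compiled SPEC benchmark
  } finally {
    notifyHarness("/perf/stop", { benchmark }); // (6) harness ends the perf record process
  }
}
```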
4 Evaluation
We use Browsix-Wasm and Browsix-SPEC to evaluate the performance of WebAssembly using three benchmark suites: SPEC CPU2006, SPEC CPU2017, and PolyBenchC. We include the PolyBenchC benchmarks for comparison with the original WebAssembly paper [18], but argue that these benchmarks do not represent typical workloads. The SPEC benchmarks are representative and require Browsix-Wasm to run successfully. We run all benchmarks on a 6-core Intel Xeon E5-1650 v3 CPU with hyperthreading and 64 GB of RAM, running Ubuntu 16.04 with Linux kernel v4.4.0. We run all benchmarks using two state-of-the-art browsers: Google Chrome 74.0 and Mozilla Firefox 66.0. We compile benchmarks to native code using Clang 4.0 (with the flags -O2 -fno-strict-aliasing) and to WebAssembly using Browsix-Wasm, which is based on Emscripten with Clang 4.0 (Emscripten is run with the flags -O2 -s TOTAL_MEMORY=1073741824 -s ALLOW_MEMORY_GROWTH=1 -fno-strict-aliasing). Each benchmark was executed five times. We report the average of all running times and the standard error. The execution time measured is the difference between the wall-clock time when the program starts, i.e., after WebAssembly JIT compilation concludes, and when the program ends.
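For clarity, the reported statistics can be reproduced with the following minimal TypeScript sketch, assuming the standard definition of standard error (sample standard deviation divided by the square root of the number of runs); the example timings are made up.

```typescript
// Mean and standard error over repeated benchmark runs (five runs in our setup).
function meanAndStandardError(runsSeconds: number[]): { mean: number; stderr: number } {
  const n = runsSeconds.length;
  const mean = runsSeconds.reduce((a, b) => a + b, 0) / n;
  // Sample variance with Bessel's correction.
  const variance = runsSeconds.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1);
  return { mean, stderr: Math.sqrt(variance) / Math.sqrt(n) };
}

// Example: five hypothetical wall-clock times (in seconds) for one benchmark.
console.log(meanAndStandardError([864, 858, 871, 866, 861]));
```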
4.1 PolyBenchC Benchmarks
Haas et al. [18] used PolyBenchC to benchmark WebAssembly implementations because the PolyBenchC benchmarks do not make system calls. As we have already argued, the PolyBenchC benchmarks are small scientific kernels that are typically used to benchmark polyhedral optimization techniques, and they do not represent larger applications. Nevertheless, it is still valuable to run PolyBenchC with Browsix-Wasm, because doing so demonstrates that our system-call infrastructure imposes negligible overhead. Figure 3(a) shows the execution time of the PolyBenchC benchmarks in Browsix-Wasm and when run natively. We are able to reproduce the majority of the results from the original WebAssembly paper [18]. We find that Browsix-Wasm imposes very low overhead: an average of 0.2% and a maximum of 1.2%.
4.2 SPEC Benchmarks
We now evaluate Browsix-Wasm using the C/C++ benchmarks from SPEC CPU2006 and SPEC CPU2017 (the new C/C++ benchmarks and the speed benchmarks), which use system calls extensively. We exclude four data points that either do not compile to WebAssembly (400.perlbench, 403.gcc, 471.omnetpp, and 456.hmmer from SPEC CPU2006 do not compile with Emscripten) or allocate more memory than WebAssembly allows (the ref datasets of 638.imagick_s and 657.xz_s from SPEC CPU2017 require more than 4 GB of RAM, although these benchmarks do work with their test datasets). Table 1 shows the absolute execution times of the SPEC benchmarks when running with Browsix-Wasm in both Chrome and Firefox, and when running natively.
WebAssembly performs worse than native for all benchmarks except 429.mcf and 433.milc. In Chrome, WebAssembly’s maximum overhead is 2.5× over native, and 7 out of 15 benchmarks have a running time within 1.5× of native. In Firefox, WebAssembly is within 2.08× of native and performs within 1.5× of native for 7 out of 15 benchmarks. On average, WebAssembly is 1.55× slower than native in Chrome and 1.45× slower than native in Firefox. Table 2 shows the time required to compile the SPEC benchmarks using Clang and Chrome. (To the best of our knowledge, Firefox cannot report WebAssembly compile times.) In all cases, the compilation time is negligible compared to the execution time. However, the Clang compiler is orders of magnitude slower than the WebAssembly compiler. Finally, note that Clang compiles benchmarks from C++ source code, whereas Chrome compiles WebAssembly, which is a much simpler format than C++.
Table 1: Execution time (in seconds) of the SPEC benchmarks, reported as mean ± standard error over five runs.

| Benchmark | Native | Google Chrome | Mozilla Firefox |
|---|---|---|---|
| 401.bzip2 | 370 ± 0.6 | 864 ± 6.4 | 730 ± 1.3 |
| 429.mcf | 221 ± 0.1 | 180 ± 0.9 | 184 ± 0.6 |
| 433.milc | 375 ± 2.6 | 369 ± 0.5 | 378 ± 0.6 |
| 444.namd | 271 ± 0.8 | 369 ± 9.1 | 373 ± 1.8 |
| 445.gobmk | 352 ± 2.1 | 537 ± 0.8 | 549 ± 3.3 |
| 450.soplex | 179 ± 3.7 | 265 ± 1.2 | 238 ± 0.5 |
| 453.povray | 110 ± 1.9 | 275 ± 1.3 | 229 ± 1.5 |
| 458.sjeng | 358 ± 1.4 | 602 ± 2.5 | 580 ± 2.0 |
| 462.libquantum | 330 ± 0.8 | 444 |