May 26, 2016

Manuel Rego: CSS Grid Layout and positioned items

Igalia WebKit

As part of the work done by Igalia in the CSS Grid Layout implementation on Chromium/Blink and Safari/WebKit, we’ve been implementing support for positioned items. Yeah, absolute positioning inside a grid. 😅

Probably the first thought that comes to your mind is that you don’t want to use positioned grid items, but in some use cases they can be needed. The idea of this post is to explain how they work inside a grid container, as they have some particularities.

Actually there’s not such a big difference compared to regular grid items. When the grid container is the containing block of the positioned items (e.g. using position: relative; on the grid container) they’re placed almost the same as regular grid items. But there are a few differences:

  • Positioned items don't stretch by default.
  • They don't use the implicit grid. They don't create implicit tracks.
  • They don't occupy cells with regard to the auto-placement feature.
  • auto has a special meaning when referring to lines.

Let’s explain each of these features in more detail.

Positioned items shrink to fit

We’re used to regular items that stretch by default to fill their area. However, that’s not the case for positioned items: like any other absolutely positioned block, they shrink to fit.

This is pretty easy to get, but a simple example will make it crystal clear:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 100px;">
    <div style="grid-column: 1 / 3; grid-row: 1 / 3;"></div>
    <div style="position: absolute; grid-column: 1 / 3; grid-row: 1 / 3;"></div>
</div>

In this example we have a simple 2x2 grid. Both the regular item and the positioned one are placed with the same rules, taking the whole grid. This defines the area for those items, which spans the 1st & 2nd rows and the 1st & 2nd columns.

Positioned items shrink to fit

The regular item stretches by default both horizontally and vertically, so it takes the whole size of the grid area. However, the positioned item shrinks to fit and adapts its size to its contents.

For the examples in the next points I’m ignoring this difference, as I want to show the area that each positioned item takes. To get the same result as in the pictures, you’d need to set 100% width and height on the positioned items.

Positioned items and implicit grid

Positioned items don’t participate in the layout of the grid, nor do they affect how other items are placed.

You can place a regular item outside the explicit grid, and the grid will create the required tracks to accommodate the item. However, in the case of positioned items, you cannot even refer to lines in the implicit grid; they’ll be treated as auto. This means that you cannot place a positioned item in the implicit grid: positioned items cannot create implicit tracks, as they don’t participate in the layout of the grid.

Let’s use an example to understand this better:

<div style="display: grid; position: relative;
            grid-template-columns: 200px 100px; grid-template-rows: 100px 100px;
            grid-auto-columns: 100px; grid-auto-rows: 100px;
            width: 300px; height: 200px;">
    <div style="position: absolute; grid-area: 4 / 4;"></div>
</div>

The example defines a 2x2 grid, but the positioned item uses grid-area: 4 / 4; so it tries to go to the 4th row and 4th column. However, positioned items cannot create those implicit tracks, so it’s positioned as if it had auto, which in this case takes the whole explicit grid. auto has a special meaning for positioned items; it’ll be properly explained later.

Positioned items do not create implicit tracks

Imagine another example where regular items create implicit tracks:

<div style="display: grid; position: relative;
            grid-template-columns: 200px 100px; grid-template-rows: 100px 100px;
            grid-auto-columns: 100px; grid-auto-rows: 100px;
            width: 500px; height: 400px;">
    <div style="grid-area: 1 / 4;"></div>
    <div style="grid-area: 4 / 1;"></div>
    <div style="position: absolute; grid-area: 4 / 4;"></div>
</div>

In this case, the regular items create the implicit tracks, making a 4x4 grid in total. Now the positioned item can be placed on the 4th row and 4th column, even though those tracks are not part of the explicit grid.

Positioned items can be placed on the implicit grid

As you can see, this part of the post has been modified; thanks to @fantasai for notifying me about the mistake.

Positioned items and placement algorithm

Again, positioned items do not affect the position of other items, as they don’t participate in the placement algorithm.

So, if you have a positioned item and you’re using auto-placement for some regular items, it’s expected that the positioned one overlaps the others. The positioned items are completely ignored during auto-placement.

Here’s a simple example showing this behavior:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 300px; grid-template-rows: 200px 200px;">
    <div style="grid-column: auto; grid-row: auto;"></div>
    <div style="grid-column: auto; grid-row: auto;"></div>
    <div style="grid-column: auto; grid-row: auto;"></div>
    <div style="position: absolute;
                grid-column: 2 / 3; grid-row: 1 / 2;"></div>
</div>

Here we have again a 2x2 grid, with 3 auto-placed regular items and 1 absolutely positioned item. As you can see, the positioned item is placed on the 1st row and 2nd column, but there’s an auto-placed item in that cell too, below the positioned one. This shows that the grid container doesn’t care about positioned items; it just ignores them when it has to place regular items.

Positioned items and placement algorithm

If none of the children were positioned, the last one would be placed in the given position (1st row and 2nd column), and the rest of them (auto-placed) would take the other cells, without overlapping.

Positioned items and auto lines

This is probably the biggest difference compared to regular grid items. If you don’t specify a line, it’s considered that you’re using auto, but auto is not resolved as span 1 like in regular items. For positioned items, auto is resolved to the padding edge.

The specification introduces the concept of lines 0 and -0; despite how weird that may sound, it actually makes sense. The auto lines reference those 0 and -0 lines, which represent the padding edges of the grid container.

Again let’s use a few examples to explain this:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 200px;
            padding: 50px 100px; width: 500px; height: 400px;">
    <div style="position: absolute;
                grid-column: 1 / auto; grid-row: 2 / auto;"></div>
</div>

Here we have a 2x2 grid container, which has some padding. The positioned item will be placed in the 2nd row and 1st column, but its area will extend up to the padding edges (as the end line is auto in both axes).

Positioned items and auto lines

We could even place positioned grid items over the padding itself. For example, using grid-column: auto / 1; the item would be placed on the left padding.

Positioned items using auto lines to be placed on the left padding

Of course, if the grid is wider and we have some free space in the content box, the items will take that space too. For example:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 200px;
            padding: 50px 100px; width: 600px; height: 400px;">
    <div style="position: absolute;
                grid-column: 3 / auto; grid-row: 2 / 3;"></div>
</div>

Here the grid columns take 500px, but the grid container is 600px wide. This means we have 100px of free space in the grid content box. As you can see in the example, that space will also be used when the positioned items extend up to the padding edges.

Positioned items taking free space and right padding

Offsets

Of course, you can use offsets to place your positioned items (the left, right, top and bottom properties).

These offsets will apply inside the grid area defined for the positioned items, following the rules explained above.

Let’s use another example:

<div style="display: grid; position: relative;
            grid-template-columns: 300px 200px; grid-template-rows: 200px 200px;
            padding: 50px 100px; width: 500px; height: 400px;">
    <div style="position: absolute;
                grid-column: 1 / auto; grid-row: 2 / auto;
                left: 90px; right: 70px; top: 80px; bottom: 60px;"></div>
</div>

Again, a 2x2 grid container with some padding. The positioned item has some offsets, which are applied inside its grid area.

Positioned items and offsets

Wrap-up

I’m not completely sure how important support for positioned elements is for web authors using Grid Layout. You’ll be the ones to tell us whether you really find use cases that need this. I hope this post helps you understand positioned items better and make up your minds about real-life scenarios where they might be useful.

The good news is that you can already test this in the most recent versions of some major browsers: Chrome Canary, Safari Technology Preview and Firefox. We hope the 3 implementations are interoperable, but please let us know if you find any issues.

There’s one last thing missing: alignment support for positioned items. This hasn’t been implemented yet in any of the browsers, but the behavior will be pretty similar to the one you can already use with regular grid items. Hopefully, we’ll have time to add support for this in the coming months.

Igalia and Bloomberg working together to build a better web

Last but not least, thanks to Bloomberg for supporting Igalia in the CSS Grid Layout implementation on Blink and WebKit.

May 26, 2016 10:00 PM

May 25, 2016

Release Notes for Safari Technology Preview 5

WebKit Blog

Safari Technology Preview Release 5 is now available for download. If you already have Safari Technology Preview installed, you can update from the Mac App Store’s Updates tab. Release 5 of Safari Technology Preview covers WebKit revisions 200418–201083.

JavaScript

  • Corrected the entropy of Math.random() for the first two invocations (r201053)
  • Corrected sticky RegExp handling when backtracking alternatives with dissimilar match lengths (r200946)
  • Fixed ES6 site compatibility when Function.name is inferred from property names (r200423)
  • Sped up ES6 Array iterators by 4x to 6x (r200422)
  • Made the Object constructor be aware of new.target by storing the target’s prototype to the newly created object’s prototype (r200421)
  • Fixed calls to getters and setters on super being called with the wrong this object (r200586)
  • Improved error messages for accessing arguments.callee and similar getters in strict mode (r200694)
  • Made TypedArray.prototype.slice use the byteLength of passed array for memmove (r200667)

CSS

  • Fixed the cascading order for !important properties in ::slotted and ::host rules (r201073)
  • Added color-gamut media query support for wide gamut displays (r201065)
  • Made Web Fonts only download when characters are used in their unicode-range (r200601)
  • Restored legacy parsing of color attributes with 4 and 8 digits (r200501)
  • Corrected how transitions behave when auto values are used (r200622)
  • Unprefixed -webkit-cross-fade() (r200888)
  • Corrected cross-fade() rendering to match expectations (r200889)
  • Corrected how prefixed and unprefixed variants in CSSStyleDeclaration are handled (r200769)
  • Stopped attempting to compute min/max width for replaced elements with no intrinsic size (r200486)
  • Unprefixed CSS Grid Layout properties (r200510)
  • Fixed static position for positioned CSS Grid items (r200572)
  • Corrected parsing when just using span as a grid-line value (r200755)
  • Implemented CSS Grid auto-repeat computation (r200618)

Web APIs

  • Started blocking Geolocation API calls on pages served over insecure connections (r200686)
  • Made NodeList iterable (r200619)
  • Added support for title attribute tooltips inside Shadow DOM content (r200923)
  • Stopped retargeting event.target when an event bubbles up from an assigned node to its assigned slot (r200464)
  • Enabled IndexedDB in Web Worker scripts (r200697)
  • Fixed IndexedDB transactions so they can’t be committed or aborted twice (r200598)
  • Started propagating user gesture state across postMessage boundaries (r200908)
  • Aligned window.scroll(), scrollTo(), and scrollBy() with the CSSOM spec with support for the options argument (r200907)
  • Made all scrolling height and width values be rounded to integers (r200915)
  • Added support for ArrayBufferView in the CSS Font Loading API (r200921)

Web Inspector

  • Hooked up ShadowChicken in the Debugger tab to properly show tail deleted frames (r200981)
  • Made let and const work as expected in Console evaluations (r200533)
  • Improved organization of the Debugger tab sidebar (r200566, r200600)
  • Added Object Graph view to heap snapshots and removed the Summary view (r200474, r200517)
  • Fixed sites that relied on putting custom properties on console.prototype (r201022)
  • Improved performance of filtering large timeline recordings (r201047)
  • Made Inspect Element and element selection work with Shadow DOM nodes (r200539)
  • Fixed the start times in the Frames timeline data grid (r201082)
  • Started persisting breakpoints in scripts named via //# sourceURL (r201019)
  • Fixed the scrollbar covering the last column in data grids when always showing scrollbars (r200962)
  • Fixed Computed Style so it no longer shows both prefixed and unprefixed variants of properties (r200952)
  • Made the Call Trees view in the Timelines tab filterable and hide it from views that are not filterable (r200708, r200873)
  • Started showing an in-progress message in timeline views that do not show data until the recording finishes (r200573, r200594)
  • Improved performance of the Console when it tries to render thousands of messages at once (r200471)
  • Fixed filtering by duration in the Frames timeline view (r200809)
  • Fixed loading of //# sourceMappingURL with a relative URL (r200806)
  • Improved console.count() to better match other browsers
  • Improved performance of the Timelines tab by profiling Web Inspector with Web Inspector (r200740, r200745, r200773, r200779, r200949)

Media

  • Made media elements not pause right away when removed from the document (r200431)
  • Started returning a Promise from HTMLMediaElement.prototype.play() (r200638)
  • Stopped updating media duration at playback end while seeking (r200675)

Security

  • Fixed case matching against the path portion of CSP source expression URLs that ends in a forward-slash (r200445)
  • Corrected a CORS check that was sometimes incorrectly failing for media loads (r200493)

Networking

  • Stopped restarting a resource preload if there is already one pending for the same URL (r200630)

Accessibility

  • Media controls are now keyboard accessible along with other Shadow DOM elements (r200520)

Bug Fixes

  • Fixed large animated GIFs not animating until the last frame on slow networks (r200939)
  • Fixed Zoom In and Zoom Out on PDF documents (r200611)

By Timothy Hatcher at May 25, 2016 05:05 PM

May 23, 2016

ECMAScript 6 Proper Tail Calls in WebKit

WebKit Blog

Proper Tail Calls (PTC) is a new feature in the ECMAScript 6 language. This feature was added to facilitate recursive programming patterns, both for direct and indirect recursion. Various other design patterns can benefit from PTC as well, such as code that wraps some functionality where the wrapping code directly returns the result of what it wraps. Through the use of PTC, the amount of memory needed to run code is reduced. In deeply recursive code, PTC enables code to run that would otherwise throw a stack overflow exception.

What is a Proper Tail Call?

Typically when calling a function, stack space is allocated for the data associated with making a function call. This data includes the return address, prior stack pointer, arguments to the function, and space for the function’s local values. This space is called a stack frame. A call made to a function in tail position will reuse the stack space of the calling function. A function call is in tail position if the following criteria are met:

  • The calling function is in strict mode.
  • The calling function is either a normal function or an arrow function.
  • The calling function is not a generator function.
  • The return value of the called function is returned by the calling function.

When a function call is in tail position, ECMAScript 6 mandates that such a call must reuse the stack space of its own frame instead of pushing another frame onto the call stack. To emphasize, ECMAScript 6 requires that a call in tail position will reuse the caller’s stack space. The calling function’s frame is called a tail deleted frame as it is no longer on the stack once it makes a tail call. This means that the tail deleted function will not show up in a stack trace. It is important to note that PTC differs from Tail Call Optimization, which is a discretionary optimization that many optimizing compilers will make for various performance reasons.

Benefits of Proper Tail Calls

PTC was added to ECMAScript primarily to reuse stack space. The reuse of the stack memory allows for recursive and tail call coding patterns common to functional programming and other programming paradigms. Using PTC, a program could make an unbounded number of consecutive tail calls without unboundedly growing the stack.

PTC provides other benefits as well. Programs that utilize PTC can see a reduced memory footprint because the garbage collector will be more likely to collect certain local objects. Programs that utilize PTC can also see an improvement in speed because there is less processing when returning from a tail called function.

Stack Space

Reduced stack usage can provide benefits in other ways as well. Modern computing devices incorporate tiered memory caches to reduce latency in memory accesses. Although these caches are generous in size, they are still finite. Reducing stack usage through the use of PTC also reduces the amount of cache space needed, freeing up cache space for other memory accesses.

Locally Allocated Objects

Consider a function that allocates a local object, but that object is never made visible to other code. The only references to such a local object will be through a pointer in the function’s stack frame or in a register that the function is using. Should the JavaScript virtual machine need to garbage collect memory, it will find a reference to such a local object by scanning the stack and the contents of the CPU’s registers. If that function makes a call to another function and that call is not a tail call, then any local objects of the calling function will not be collected until the calling function itself returns. However, if a function makes a tail call to another function, all local objects of the calling function can be garbage collected because there are no more stack references to the object.

Returning from a Tail Called Function

Another benefit of PTC is that when a leaf function returns, it bypasses all intermediate tail called functions and returns directly to the first caller that didn’t make a tail call. This eliminates all of the return processing of those intermediate functions. The deeper the call chain of successive tail calls, the more performance benefit this provides. This works for both direct and mutual recursion.

Examples

There are many algorithms that are best written using recursion. Many of those algorithms naturally take advantage of PTC, while others may require some reworking. Consider writing a program to compute the greatest common divisor (GCD) function using Euclid’s algorithm. The translation of Euclid’s algorithm into a program that utilizes PTC is simple, elegant, and natural:

"use strict";
function gcd(m, n)
{
    if (!n)
        return m;
    return gcd(n, m % n);
}

The natural translation of other recursive mathematical functions can lead to recursive calls that are not in tail position. For example, a program that computes factorial (N!) is commonly written as:

"use strict";
function factorial(n)
{
    if (!n)
        return 1;
    return n * factorial(n - 1);
}

In this function, the recursive call to factorial() is not in tail position because the return statement computes and returns the product of n and the result of the recursive call. As a reminder, to be in tail position, the return value of the called function must be the only thing returned by the calling function. With a little modification, we can rewrite factorial to utilize PTC as follows:

"use strict";
function factorial(n, partialFactorial = 1)
{
    if (!n)
        return partialFactorial;
    return factorial(n - 1, n * partialFactorial);
}

This change puts the recursive call to factorial() in tail position, which allows the function to take advantage of PTC. The number of recursive calls and arithmetic operations is the same for both versions.

This next example involves the functions computeSquareRoot(), computePositiveSquareRoot() and iterativeSquareRoot(), which are used to calculate the square roots of numbers using Newton’s iterative method:

"use strict";
function computeSquareRoot(x)
{
    if (!x)
        return "0";

    let imaginaryPart = "";
    if (x < 0) {
        x = -x;
        imaginaryPart = "i";
    }

    return computePositiveSquareRoot(x).toString() + imaginaryPart;
}

function computePositiveSquareRoot(x)
{
    let targetEpsilon = x / 10000000000;
    return iterativeSquareRoot(x, x / 2, targetEpsilon);
}

function iterativeSquareRoot(x, estimate, targetEpsilon)
{
    let newEstimate = ((x / estimate) + estimate) / 2;
    let delta = Math.abs(estimate - newEstimate);

    if (delta <= targetEpsilon)
        return newEstimate;

    return iterativeSquareRoot(x, newEstimate, targetEpsilon);
}

The top function, computeSquareRoot(), determines if the result will be a real or imaginary number and then calls computePositiveSquareRoot(), which sets up the iterative square root process and returns the result of iterativeSquareRoot(). The call to computePositiveSquareRoot() in computeSquareRoot() is not in tail position since additional processing is done on the result of the call. All other function calls are in tail position.

Suppose computeSquareRoot() is called with 99 as the argument. It will call computePositiveSquareRoot(99), which will subsequently call iterativeSquareRoot(99, ...). Using Web Inspector, we observed that iterativeSquareRoot() calls back to itself 6 times before returning a result. That result is returned directly back to computeSquareRoot(), where it is converted to a string, saving the processing of 7 returns.

This last example shows the type of functional programming that PTC enables:

"use strict";
function newList(count)
{
    let head = { value: count, next: null };
    while (--count)
        head = { value: count, next: head };
    return head;
}

let count = 100000;
let list = newList(count);

function contains(list, x)
{
    if (!list)
        return false;
    if (list.value == x)
        return true;
    return contains(list.next, x);
}
...

if (contains(list, someValue))
   ...

The function contains() searches the list using tail recursion. For a list of 100,000 elements, as in this example, most browsers will run out of stack memory and throw an exception. In strict mode, where contains() can take advantage of PTC, the program runs just fine. It is also interesting to note that even with a list size small enough to allow this code to run without PTC, using PTC results in the code running 2.5x faster.

Things to Note

There are a couple of subtle but minor issues to be aware of when using PTC. Remember that PTC is only available in strict mode and only for calls made from tail position. The other notable change involves the generation of stack traces. Some non-standard ECMAScript features in JavaScript work differently in the presence of PTC. These include Error.stack and the Console object’s stack trace. For example, say a tail called function gonnaThrowAnError() throws an Error object; the function that catches that Error will not see the function that called gonnaThrowAnError() in the Error object’s stack trace. As a general rule, the Console object’s stack trace will not include a function that made a tail call to another function. We call such frames tail deleted frames because it’s as if they are deleted from the stack when making a call.

Debugging PTC with ShadowChicken

Because PTC places a strict resource guarantee on stack usage, JavaScriptCore cannot keep around information for all tail deleted frames. Keeping around any extra resources, no matter how small, for an unbounded number of tail deleted frames is not possible because it could use an unbounded amount of memory and eventually exhaust the memory limits of the program. Given that tail calls do not keep around any information in the program’s executing state, debugging tail calls can be challenging when using an interactive debugging tool. Without any added machinery, the debugger will not know if a tail call did or did not occur. Because we want to make debugging tail calls inside Web Inspector a great experience, we have implemented mechanisms inside JavaScriptCore to keep around a shadow stack that will display a finite number (currently 128) of tail deleted frames. This allows us to both provide strict guarantees on memory usage and to provide users of PTC the benefit of seeing the most important stack frames in their program inside Web Inspector.

We call our shadow stack implementation ShadowChicken. The name is an homage to the CHICKEN scheme interpreter. ShadowChicken uses a logging mechanism for constructing its shadow stack. The log works as follows:

  • On entry to a function, the function’s callee will log a prologue packet.
  • When making a tail call, the caller will log a tail packet just before the tail call occurs.
  • When an exception is thrown, ShadowChicken will log a throw packet.

To construct the shadow stack, ShadowChicken takes two inputs:

  1. The log filled with a sequence of prologue, tail, and throw packets.
  2. The current top of the machine stack (note that the machine stack does not contain any frames that are tail deleted).

Given these two inputs, ShadowChicken is able to construct a shadow stack that includes tail-deleted frames. It will reconcile the machine stack with its log. Because the log has tail packets for when tail calls occurred, it is able to use the log to insert tail-deleted stack frames into the shadow stack to represent frames that were only present on the machine stack before a tail call occurred. ShadowChicken uses a constant amount of memory on top of the memory your program already uses. The log is fixed in size. Whenever ShadowChicken runs out of space in the log, it will perform its reconciliation algorithm to update its internal data structure about the state of the shadow stack. The shadow stack will contain at most 128 tail deleted frames, a number we believe strikes a good balance between programs that intentionally take advantage of PTC and programs that just happen to use PTC without the explicit intention of doing so.

Hooking Up ShadowChicken to Web Inspector

Because JavaScriptCore has the machinery for constructing a shadow stack, Web Inspector can use JavaScriptCore’s shadow stack in its debugger. This allows users of Web Inspector to interact with tail deleted stack frames as if they are real stack frames that are on the current machine stack.

Let’s see some interactions with Web Inspector and the iterative square root code to compute the square root of 99. After setting a breakpoint in iterativeSquareRoot() and invoking computeSquareRoot(99), Web Inspector shows that we are paused, ready to return the result.

Breakpoint with tail deleted frames in the stack.

Web Inspector not only shows the frame we’re stopped in and the original caller, computeSquareRoot(), but also shows the seven tail deleted frames. These are highlighted in the above image. Tail deleted frames in Web Inspector show up with a gray ƒ icon to their left. As the next image shows, the variables in tail deleted frames can be examined as if the frame were a normal frame. The next image shows Web Inspector inspecting a tail deleted frame one level up from the leaf frame.

Selecting a tail deleted frame.

In this image, the local variables (circled) can be examined. The highlighted line in the program also shows where the tail deleted frame made its tail call.

The next image shows what happens when single stepping from the breakpoint.

Returning from tail deleted frames.

With one click of the step into button, Web Inspector now shows that we have returned directly to where computeSquareRoot() made the first tail call.

Summary

WebKit has ECMAScript 6 Proper Tail Calls and web developers are encouraged to take advantage of them to design programs in ways that were not possible before. Existing code can benefit as well. Web Inspector makes developing and debugging with PTC straightforward and intuitive.

The current Safari Technology Preview Release 4 has support for PTC. However, Web Inspector was only recently hooked up to work with ShadowChicken and therefore it is not in Safari Technology Preview Release 4. It is expected to be in the next Safari Technology Preview release. You can also try out ShadowChicken in Web Inspector by using a WebKit Nightly build.

Acknowledgements

This article was cowritten with Saam Barati from the JavaScriptCore team. We invite comments and feedback on WebKit’s PTC implementation. Feel free to get in touch with @msaboff, @saambarati, and @jonathandavis on Twitter.

By Michael Saboff at May 23, 2016 09:15 PM

Igalia Compilers Team: Awaiting the future of JavaScript in V8

Igalia WebKit

On the evening of Monday, May 16th, 2016, we made history: we landed the initial implementation of “Async Functions” in V8, the JavaScript engine used by Google Chrome and Node.js. We do these things not because they are easy, but because they are hard. Because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one we are willing to accept. It is very exciting to see this land after roughly 2 months of implementation, code review and standards finagling/discussion. It is truly an honour.

To introduce you to Async Functions, it’s first necessary to understand two things: the status quo of async programming in JavaScript, as well as Generators (previously implemented by fellow Igalian Andy).

Async programming in JavaScript has historically been implemented with callbacks, `window.setTimeout(function toExecuteLaterOnceTimeHasPassed() {}, …)` being the common example. Callbacks on their own are not scalable: when numerous nested asynchronous operations are needed, code becomes extremely difficult to read and reason about. Abstraction libraries have been tacked on to improve this, such as caolan’s async package and Promise libraries like Q. These abstractions simplify control flow management and data flow management, and are a massive improvement over plain callbacks. But we can do better! For a more detailed look at Promises, have a look at the fantastic MDN article. Some great resources on why and how callbacks can lead to utter non-scalable disaster exist too; check out http://callbackhell.com!

The second concept, Generators, allows a runtime to return from a function at an arbitrary line and later re-enter that function at the following instruction in order to continue execution. So right away you can imagine where this is going: we can continue execution of the same function, rather than writing a closure to continue execution in a new function. Async Functions rely on this same mechanism (and in fact, on the underlying Generators implementation) to achieve their goal, immensely simplifying non-trivial coordination of asynchronous operations.

As a simple example, let’s compare the following two approaches:
function deployApplication() {
  return cleanDirectory(__DEPLOYMENT_DIR__).
    then(fetchNpmDependencies).
    then(
      deps => Promise.all(
        deps.map(
          dep => moveToDeploymentSite(
            dep.files,
            `${__DEPLOYMENT_DIR__}/deps/${dep.name}`
        ))).
    then(() => compileSources(__SRC_DIR__,
                              __DEPLOYMENT_DIR__)).
    then(uploadToServer);
}

The Promise boilerplate makes this harder to read and follow than it could be. And what happens if an error occurs? Do we want to add catch handlers to each link in the Promise chain? That would only make it even more difficult to follow, with error handling interleaved in hard-to-read ways.

Let’s refactor this using async functions:
async function deployApplication() {
    await cleanDirectory(__DEPLOYMENT_DIR__);
    let dependencies = await fetchNpmDependencies();

    // *see below*
    for (let dep of dependencies) {
        await moveToDeploymentSite(
            dep.files,
            `${__DEPLOYMENT_DIR__}/deps/${dep.name}`);
    }

    await compileSources(__SRC_DIR__,
                         __DEPLOYMENT_DIR__);
    return uploadToServer();
}

You’ll notice that the “moveToDeploymentSite” step is slightly different in the async function version: it completes each operation in a serial pipeline, rather than running them all in parallel and continuing once they have finished. This is an unfortunate limitation of the async function specification, which will hopefully be improved on in the future.

In the meantime, it’s still possible to use the Promise API in async functions, as you can `await` any Promise and continue execution after it is resolved. This grants compatibility with numerous existing Web Platform APIs (such as `fetch()`), which is ultimately a good thing! Here’s an alternative implementation of this step, which performs the `moveToDeploymentSite()` bits in parallel, rather than serially:

await Promise.all(dependencies.map(
  dep => moveToDeploymentSite(
    dep.files,
    `${__DEPLOYMENT_DIR__}/deps/${dep.name}`
)));

Now, it’s clear from the `let dependencies = await fetchNpmDependencies();` line that Promises are unwrapped automatically. What happens if the promise is rejected with an error, rather than resolved with a value? With try-catch blocks, we can catch rejected promise errors inside async functions! And if they are not caught, they will automatically cause the async function to return a rejected Promise.
function throwsError() { throw new Error("oops"); }

async function foo() { throwsError(); }

// will print the Error thrown in `throwsError`.
foo().catch(console.error)

async function bar() {
  try {
    var value = await foo();
  } catch (error) {
    // Rejected Promise is unwrapped automatically, and
    // execution continues here, allowing us to recover
    // from the error! `error` is `new Error("oops")`
  }
}

There are also lots of convenient forms of async function declarations, which hopefully serve lots of interesting use-cases! You can concisely declare methods as asynchronous in Object literals and ES6 classes by preceding the method name with the `async` keyword (without a line terminator between `async` and the method name!)

class C {
  async doAsyncOperation() {
    // ...
  }
};

var obj = {
  async getFacebookProfileAsynchronously() {
    /* ... */
  }
};

These features allow us to write more idiomatic, easier to understand asynchronous control flow in our applications, and future extensions to the ECMAScript specification will enable even more idiomatic forms for writing complex algorithms, in a maintainable and readable fashion. We are very excited about this! There are numerous other resources on the web detailing async functions, their benefits, and perhaps ways they might be improved in the future. Some good ones include this piece from Google’s Jake Archibald (https://jakearchibald.com/2014/es7-async-functions/), so give that a read for more details. It’s a few years old, but it holds up nicely!

So, now that you’ve seen the overview of the feature, you might be wondering how you can try it out, and when it will be available for use. For the next few weeks, it’s still too experimental even for the “Experimental JavaScript” flag. But if you are adventurous, you can try it already! Fetch the latest Chrome Canary build, and start Chrome with the command-line flag `--js-flags="--harmony-async-await"`. We can’t make promises about the shipping timeline, but it could ship as early as Chrome 53 or Chrome 54, which will become stable in September or October.

We owe a shout out to Bloomberg, who have provided us with resources to improve the web platform that we love. Hopefully, we are providing their engineers with ways to write more maintainable, more performant, and more beautiful code. We hope to continue this working relationship in the future!

As well, shoutouts are owed to the Chromium team, who have assisted in reviewing the feature, verifying its stability, getting devtools integration working, and ultimately getting the code upstream. Terrific! In addition, the WebKit team has also been very helpful, and hopefully we will see the feature land in JavaScriptCore in the not too distant future.

By compilers at May 23, 2016 02:08 PM

May 11, 2016

Release Notes for Safari Technology Preview 4

WebKit Blog

Safari Technology Preview Release 4 is now available for download. If you already have Safari Technology Preview installed, you can update from the Mac App Store’s Updates tab. Release 4 of Safari Technology Preview covers WebKit revisions 199865–200417.

Networking

  • Allowed non-standard HTTP headers in WebSocket handshakes, which makes the 1Password extension work again (r200120, r200219)

Media

  • Fixed Netflix video playback (r200172)

JavaScript

  • Disabled Symbol.isConcatSpreadable due to performance concerns; it is expected to return in the next release (r200149)
  • Made super() available to object literals, not just ES6 classes (r199927)
  • Sped up calling bound functions with no bound arguments by 4x (r199946)
  • Implemented String.prototype.localeCompare from ECMA-402 (r199967)
  • Optimized JSON.parse for a 1–2.5% improvement in Kraken json-parse-financial (r199968)
  • Implemented RegExp.prototype.@@replace and use it for String.prototype.replace (r200117)
  • Implemented spec changes for String.prototype.padStart and String.prototype.padEnd (r200194, r200210)
  • Unified how Math.pow() is optimized across all JIT tiers (r200208)
  • Made Reflect.toString() be [object Object] not [object Reflect] (r200355)

CSS

  • Made -webkit-image-set work inside CSS variables (r199884)
  • Changed transitions to no longer animate to/from auto values (r200360)
  • Implemented proper handling of animation-delay with a negative delay (r200042)
  • Started parsing play-state as part of the animation shorthand (r200043)
  • Made toggling animation-play-state not restart a finished animation (r200047)
  • Fixed a regression which caused position: absolute pseudo elements to inherit text-decoration (r200302)
  • Moved CSS Grid behind a runtime switch that is currently enabled by default (r200215, r200389)
  • Started implementation of auto-fill and auto-fit for CSS Grid (r200182, r200368)
  • Fixed computed style of grid-template-columns and grid-template-rows properties (r199981)
  • Fixed a bug with positioned grid items in vertical writing mode (r199874)
  • Fixed alignment with CSS Grid content distribution (r200181)
  • Improved user agent styles for <math> elements (r199869)

Web APIs

  • Fixed wheel events so they fire with body, html { height: 100% } (r200247)
  • Marked IndexedDB constructors as hidden on the worker global object until it is supported (r199889)
  • Made ping attribute for anchor elements only work for http/https URLs (r199900)
  • Renamed Shadow DOM’s getAssignedNodes to assignedNodes and support flattened option (r200285)
  • Removed Shadow DOM’s Node.prototype.rootNode because it was not compatible with existing websites (r200297)
  • Made document.currentScript return null when executing a script inside a shadow tree (r200327)
  • Fixed clicks sometimes being ignored inside button elements when the mouse moves (r200414)

Web Inspector

  • Made console a namespace object (like Math and JSON), allowing functions to be called unbound (r200350, r200373)
  • Fixed an issue where scripts would not load due to Esprima.js not being found (r200229)
  • Started showing dynamically added <script> elements added to a frame as resources (r200065)
  • Made sourceURL and sourceMappingURL always work when using the Function constructor (r199939)
  • Restored filtering to the Timelines tab (r200067)
  • Added column number info to event listener locations (r199940)
  • Fixed profiles missing from records in JavaScript & Events timeline (r199979)
  • Fixed selecting a bar in the Frames timeline mode (r199972)
  • Made sorting by name or location columns work as expected (r199974)
  • Fixed the line error widget showing up on the wrong resource (r200064)
  • Clarified Retained Size in heap snapshots by hiding retained size of non-dominated children (r200086)
  • Made the debugger statements evaluated in the console properly show the source code (r199897)
  • Made jump to line work correctly the first time in pretty-printed JavaScript (r200262)
  • Improved scrolling performance in Timelines tab (r200270)
  • Improved performance of rendering many console messages (r200401)
  • Changed console.assert and console.trace to allow format specifiers (r200370)
  • Improved performance of console.assert by 10x when the assertion is true (r200371)
  • Changed console.time and console.timeEnd to use a default label when none is specified, and warn when attempting to start an already started timer (r200400)
  • Added CSS autocompletion suggestions for -webkit-user-select (r200154)

Rendering

  • Made non-accelerated CSS and SVG animations run at 60fps (r200164, r200171)
  • Made <select multiple> padding consistent with other browsers (r200265)
  • Fixed blur filter escaping an enclosing overflow: hidden (r200283)
  • Fixed a regression with min-content and box-sizing: border-box that affected Facebook’s messenger.com (r199895)

Accessibility

  • Made VoiceOver properly speak superscript content (r200214)
  • Fixed navigation around composed emoji characters and content with multiple whitespace sequences (r200258)
  • Made aria-label attribute work on <label> elements (r200290)
  • Made region a landmark and <section> elements have a role of region if there is an author-provided accessible name via the aria-label or aria-labelledby attributes (r200415)

Bug Fixes

  • Corrected how WebKit determines the user’s preferred region from the system language setting. (r200105)

You can file bugs or feature requests at the WebKit bug tracker, or you can submit feedback or bugs to Apple on Apple’s bug reporting website. For other questions or feedback, feel free to reach us on Twitter at @webkit or Jonathan Davis at @jonathandavis.

By Timothy Hatcher at May 11, 2016 05:05 PM

May 06, 2016

Locking in WebKit

WebKit Blog

Back in August 2015 we replaced all spinlocks and OS-provided mutexes in WebKit with the new WTF::Lock (WTF stands for Web Template Framework). We also replaced all OS-provided condition variables with WTF::Condition. These new primitives have some cool properties:

  1. WTF::Lock and WTF::Condition only require one byte of storage each. WTF::Lock only needs two bits in that byte. The small size encourages using huge numbers of very fine-grained locks. OS mutexes often require 64 bytes or more. The small size of WTF::Lock means that there’s rarely an excuse for not having one, or even multiple, fine-grained locks in any object that has things that need to be synchronized.
  2. WTF::Lock is super fast in the case that matters most: uncontended lock acquisition. Parallel algorithms tend to avoid contention by having many fine-grained locks. This means that a mature parallel algorithm will have many uncontended lock acquisitions – that is, calls to lock() when the lock is not held, and calls to unlock() when nobody is waiting in line. Similarly, WTF::Condition optimizes for the common case of calling notify when no threads are waiting.
  3. WTF::Lock is fast under microcontention. A microcontended lock is one that is contended and the critical section is short. This means that shortly after any failing lock attempt, the lock will become available again since no thread will hold the lock for long. This is the most common kind of contention in parallel code, since it’s common to go to great pains to do very little work while holding a lock.
  4. WTF::Lock doesn’t waste CPU cycles when a lock is held for a long time. WTF::Lock is adaptive: it changes its strategy for how to wait for the lock to become available based on how long it has been trying. If the lock doesn’t become available promptly, WTF::Lock will suspend the calling thread until the lock becomes available.

Compared to OS-provided locks like pthread_mutex, WTF::Lock is 64 times smaller and up to 180 times faster. Compared to OS-provided condition variables like pthread_cond, WTF::Condition is 64 times smaller. Using WTF::Lock instead of pthread_mutex means that WebKit is 10% faster on JetStream, 5% faster on Speedometer, and 5% faster on our page loading test.

Making WTF::Lock and WTF::Condition fit in one byte is not easy and the technique behind this has multiple layers. Lock and Condition offload all thread queueing and suspending functionality to WTF::ParkingLot, which manages thread queues keyed by the addresses of locks. The design of ParkingLot is inspired by futexes. ParkingLot is a portable user-level implementation of the core futex API. ParkingLot also needs fast locks, so we built it its own lock called WTF::WordLock.

This post starts by describing some background on locking. Then we describe the ParkingLot abstraction and how we use this to build WTF::Lock and WTF::Condition. This section also shows some alternate locking algorithms on top of ParkingLot. Then we describe how ParkingLot and WordLock are implemented, hopefully in enough detail to allow for meaningful scrutiny. The post concludes with some performance data, including comparisons to a bunch of lock algorithms.

Background

This section describes some background about locks. This includes the atomic operations that we use to implement locks as well as some classic algorithms like spinlocks and adaptive locks. This section also tries to give appropriate shout-outs to other lock implementations.

Atomic Operations

CPUs and programming languages provide few guarantees about the way that memory accesses interact with each other when executed concurrently, since the expectation is that programmers will prevent concurrent accesses to the same memory by guarding them with locks. But if you’re like me and you like to implement your own locks, you’ll want some lower-level primitives that do have strong guarantees.

C++ provides std::atomic for this purpose. You can use it to wrap a primitive type (like char, int, or a pointer type) with some atomicity guarantees. Provided you stick to the strongest memory ordering (seq_cst), you can be sure that the concurrent executions of std::atomic operations will behave as if they were executed sequentially. This makes it possible to consider whether an algorithm is sound even when run concurrently by envisioning all of the possible ways that its various atomic operations could interleave.

In WebKit, we use our own wrapper called WTF::Atomic. It’s a stripped-down version of what std::atomic provides. We’ll just consider three of its atomic methods: T load(), store(T), and bool compareExchangeWeak(T expected, T desired).

Load and store are self-explanatory. The interesting one is compareExchangeWeak, which implements the atomic CAS (compare-and-swap) operation. This is a CPU primitive that can be thought of as running the following pseudocode atomically:

bool CAS(T* pointer, T expected, T desired)
{
    if (*pointer != expected)
        return false;
    *pointer = desired;
    return true;
}

This method is implemented in terms of std::atomic::compare_exchange_weak, which is implemented in terms of the lock; cmpxchg instruction on x86 or in terms of ldrex and strex on ARM. This form of CAS is called weak because we allow it to spuriously do nothing and return false. The opposite is not true – if the CAS returns true, then it must be that during its atomic execution, it saw *pointer being equal to expected and then it changed it to desired.
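Since a weak CAS may fail spuriously, callers use it in a retry loop. As a minimal sketch using the WTF::Atomic methods described above (the function name is just for illustration), an atomic increment looks like this:

// Sketch: atomically increment a counter with the load/compareExchangeWeak
// API described above.
void atomicIncrement(Atomic<unsigned>& counter)
{
    for (;;) {
        unsigned oldValue = counter.load();
        // Try to install oldValue + 1. This fails if another thread changed
        // the value first, or if the weak CAS failed spuriously.
        if (counter.compareExchangeWeak(oldValue, oldValue + 1))
            return;
        // Reload and retry until the CAS succeeds.
    }
}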

Spinlocks

Armed with WTF::Atomic, we can implement the simplest kind of lock, called a spinlock. Here’s what it looks like:

class Spinlock {
public:
    Spinlock()
    {
        m_isLocked.store(false);
    }

    void lock()
    {
        while (!m_isLocked.compareExchangeWeak(false, true)) { }
    }

    void unlock()
    {
        m_isLocked.store(false);
    }
private:
    Atomic<bool> m_isLocked;
};

This works because the only way that lock() can succeed is if compareExchangeWeak returns true. If it returns true, then it must be that this thread observed m_isLocked being false and then instantly flipped it to true before any other thread could also see that it had been false. Therefore, no other thread could be holding a lock. Either no other thread is calling lock() at all, or their calls to lock() are still in the while loop because compareExchangeWeak is continuing to return false.
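To make the contract concrete, here is a minimal sketch of how this Spinlock would guard shared data (the names are illustrative):

// Illustrative only: a counter guarded by the Spinlock above.
Spinlock counterLock;
unsigned sharedCounter; // Protected by counterLock.

void incrementSharedCounter()
{
    counterLock.lock();   // Spins until the CAS from false to true succeeds.
    sharedCounter++;      // Critical section: at most one thread is here.
    counterLock.unlock(); // Storing false lets another thread's CAS succeed.
}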

Adaptive Locks

Adaptive locks spin only for a little bit and then suspend the current thread until some other thread calls unlock(). This guarantees that if a thread has to wait for a lock for a long time, it will do so quietly. This is a desirable guarantee for conserving CPU time and power.

By contrast, spinlocks can only handle contention by spinning. The simplest spinlocks will make contending threads appear to be very busy and so the OS will make sure to schedule them. This will waste tons of power and starve other threads that may really be able to do useful work. A simple solution would be to make the spinlock sleep – maybe for a millisecond – in between CAS attempts. This turns out to hurt performance on real code because it postpones progress when the lock becomes available during the sleep interval. Also, sleeping doesn’t completely solve the problem of inefficiency when spinning. WebKit has some locks that may be held for a long time. For example, the lock used to control the interleaving between compiler threads and the garbage collector is usually uncontended but may sometimes be held, and contended, for the entire time that it takes to collect the JS heap or the entire time that it takes to compile some function. In the most extreme case, a lock may protect blocking IO. That could happen unexpectedly, for example if a page fault on an innocent-looking load leads to swapping. Some critical sections can take a while and we don’t want contending threads to poll during that time, even if it’s only 1KHz.
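As a sketch of that anti-pattern, a sleeping spinlock would look something like this (purely illustrative, assuming <thread> and <chrono> are included):

// Sketch of the "sleeping spinlock" described above: sleep about a
// millisecond between CAS attempts. This caps the polling rate, but it
// postpones progress whenever the lock is released mid-sleep.
void lockWithSleep(Atomic<bool>& isLocked)
{
    while (!isLocked.compareExchangeWeak(false, true))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}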

There’s no good way to make spinning efficient. If we increase the delay between CAS attempts then we’re just increasing the delay between when the lock gets unlocked and when a contending thread can get it. If we decrease the delay, the lock becomes less efficient. We want a locking algorithm that ensures that if spinning doesn’t quickly give us the lock, our thread will quietly wait until exactly the moment when the lock is released. This is what adaptive locks try to do.

The kinds of adaptive locks that we will implement can be split into two separate data structures:

  1. Some small, atomic field in memory that summarizes the lock’s state. It can answer questions like, “does anyone hold the lock?” and “is anyone waiting to get the lock?” We will call this the atomic state.
  2. A queue of threads that are waiting to get the lock, and a mechanism for suspending and resuming those threads. The queue must have its own synchronization primitives and some way to be kept in sync with the atomic state. We say parking to mean enqueuing a thread and suspending it. We say unparking to mean dequeuing a thread and resuming it.

WTF::Lock is WebKit’s implementation of an adaptive lock, optimized for the things that we care about most: low space usage and a great fast path for uncontended lock acquisition. The lock object contains only the atomic state, while the queue is created on-demand inside the ParkingLot. This allows our locks to only require two bits. Like other adaptive locks, WTF::Lock provides a guarantee that if a thread has to wait a long time for a lock, it will do so quietly.
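For concreteness, here is a sketch of the kind of compact atomic state this implies. The bit names are illustrative; a sketch of a full lock algorithm built on ParkingLot appears after the ParkingLot API section below.

// Sketch: the entire per-lock storage of an adaptive lock can be one byte,
// of which only two bits are used. The queue of parked threads lives
// elsewhere (in ParkingLot), keyed by this byte's address.
class AdaptiveLockData {
    static const uint8_t isLockedBit = 1; // a thread holds the lock
    static const uint8_t isParkedBit = 2; // some thread may be parked

    Atomic<uint8_t> m_state; // the "atomic state"
};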

Related Work

If you know that you need an adaptive lock, you can be sure that the mutex implementation on modern OSes will safely adapt to contention and avoid spinning. Unfortunately, most of those OS mutexes will be slower and bigger than a spinlock because of artificial constraints that arise out of compatibility (a pthread_mutex_t is 64 bytes because of binary compatibility with programs that were compiled against ancient implementations that had to be 64 bytes) or the need to support features that you may not need (like recursive locking – even if you don’t use it, pthread_mutex_lock() may have to at least do a check to see if you asked for it).

WebKit’s locking infrastructure is most inspired by Linux’s excellent futex primitive. Futexes empower developers to write their own adaptive locks in just a few lines of code (Franke, Russell, and Kirkwood ’02). Like with futexes, we materialize most of the data for a lock only when that lock experiences contention, and we locate that data by hashing the address of the lock. Unlike futexes, our implementation does not rely on kernel support and so it will work on any OS. The ParkingLot API has some functionality that futexes lack, like invoking callbacks while holding internal ParkingLot locks.

The idea of using very few bits per adaptive lock is widespread, especially in Java virtual machines. For example, HotSpot usually only needs two or three bits for the state of an object’s lock. I’ve co-authored a paper on locking in another Java VM, which also compressed an adaptive lock into a handful of bits. We can trace some of the ideas about how to build locks that are small and fast to meta-locks (Agesen et al ’99) and tasuki locks (Onodera and Kawachiya ’99).

New proposals like std::synchronic and hardware transactional memory also seek to speed up locking. We will show that these techniques don’t exhibit the performance qualities that we want for WebKit.

Despite the basic techniques being well understood in certain communities, it’s hard to find a lock implementation for C++ that has the qualities we want. Spinlocks are widely available, and those are often optimized for using little memory and having great fast paths for uncontended lock acquisition and microcontention. But spinlocks will waste CPU time when the lock isn’t immediately available. OS mutexes know how to suspend threads if the lock is not available, so they are more efficient under contention – but they usually have a slower uncontended fast path, they don’t necessarily have the behavior we want under microcontention, and they require more memory. C++ provides access to OS mutexes with std::mutex. Prior to WTF::Lock, WebKit had a mix of spinlocks and OS mutexes, and we would try to pick which one to use based on guesses about whether they would benefit more from uncontended speed or efficiency under contention. If we needed both of those qualities, we would have no choice but to punt on one of them. For example, we had a spinlock in CodeBlock that should have been adaptive because it protected long critical sections, and we had an OS mutex in our parallel GC that accounted for 3% of our time in the Octane Splay benchmark because of shortcomings in fast path performance. These issues are resolved thanks to WTF::Lock. Also, there was no way to have a small lock (1 byte or less) that was also efficient under contention, since OS mutexes tend to be large. In the most extreme case we will have one lock per JavaScript object, so we care about the sizes of our locks.

Building Locks With ParkingLot

WTF::ParkingLot is a framework for building adaptive locks and other synchronization primitives. Both WTF::Lock and WTF::Condition use it for parking threads until the lock becomes available or the condition is notified. ParkingLot also gives us everything we’ll need to implement the synchronization schemes that are coming to Web standards like SharedArrayBuffer.

Adaptive locks need to be able to park and unpark threads. We believe that synchronization primitives shouldn’t have to maintain their own parking queues, but instead, a single global data structure should provide a way to access queues by using the address of the lock’s atomic state as a key. The concurrent hashtable of queues is called WTF::ParkingLot. Since each thread can be queued only once at any given time, ParkingLot‘s memory usage is bounded by the number of threads. This means that locks don’t have to pay the space cost of a queue, which makes a lot of sense for WebKit: it usually runs a small number of threads (about ten on my system) but can easily allocate millions of locks (in the worst case, one per JavaScript object).

ParkingLot takes care of queueing and thread suspension so that lock algorithms can focus on other things, like how long to spin for, what kind of delays to introduce into spinning, and which threads to favor when unlocking.

ParkingLot API

Parking refers to suspending the thread while simultaneously enqueuing it on a queue keyed by some address. Unparking refers to dequeuing a thread from a queue keyed by some address and resuming it. This kind of API must have a mechanism for resolving the suspend-resume race, where if a resume operation happens moments before the suspend, then the thread will suspend anyway. ParkingLot resolves this by exposing the fact that the queues are protected by locks. Parking invokes a client callback while the queue lock is held, and gives the client a chance to decide whether they want to proceed or not. Unparking invokes a client callback while the queue lock is held, and tells the client if a thread was dequeued and if there are any more threads left on the queue. The client can rely on this additional synchronization to ensure that racy deadlocks don’t happen.

The basic API of ParkingLot comprises parkConditionally, unparkOne, and unparkAll.

bool parkConditionally(address, validation, beforeSleep, timeout). This takes the const void* address and uses it as a key to find, and lock, that address’s queue. Calls the bool validation() callback (usually a C++ lambda) while the lock is held. If the validation returns false, the queue is unlocked and parkConditionally() returns false.

If the validation returns true, the current thread is placed on the queue and the queue lock is released. Once the queue lock is released, this calls the void beforeSleep() callback. This turns out to be useful for some synchronization primitives, but most locks will pass an empty thunk. At this point, the thread is suspended until some call to unparkOne() dequeues this thread and resumes it. The client can supply a timeout using a ParkingLot::Clock::time_point (ParkingLot::Clock is a typedef for std::chrono::steady_clock). The thread will not stay suspended past that time point.

void unparkOne(address, callback). This takes a const void* address and uses it as a key to find, and lock, that address’s queue. Then unparkOne tries to dequeue one thread. Once it does this, it calls the void callback(UnparkResult), passing a struct that reports if a thread was dequeued and whether the queue is now empty. Then it unlocks the queue lock. If it had dequeued a thread, it signals it now.

void unparkAll(address). Unparks all threads on the queue associated with the given address.

This API gives us everything we need to implement the locking primitives we need: WTF::Lock and WTF::Condition. It also allows us to build userland versions of FUTEX_WAIT/FUTEX_WAKE operations, which are required by SharedArrayBuffer atomics.
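
To make this concrete, here is a one-shot event built directly on this API, showing how the pieces compose. This is an illustrative sketch rather than a WebKit class; it assumes WTF’s Atomic<> with load/store and the parkConditionally() signature described above.

class OneShotEvent {
public:
    OneShotEvent()
    {
        m_isSet.store(false);
    }

    void wait()
    {
        while (!m_isSet.load()) {
            ParkingLot::parkConditionally(
                &m_isSet,
                [this] () -> bool {
                    // Runs with the queue lock held: only park if the event
                    // is still unset, so a concurrent set() cannot slip in
                    // between this check and the park.
                    return !m_isSet.load();
                },
                [] () { },
                ParkingLot::Clock::time_point::max());
        }
    }

    void set()
    {
        m_isSet.store(true);
        ParkingLot::unparkAll(&m_isSet);
    }

private:
    Atomic<bool> m_isSet;
};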

WTF::Lock

We have many locks, including locks inside very frequently allocated objects like JavaScriptCore’s Structure. We want a lock object that takes as little memory as possible, so we go to great lengths to make the lock fit in one byte. We do many such tricks in Structure since there is so much pressure to make that object small. We also want the core algorithm to leave open the possibility of having the lock embedded in bitfields, though Lock doesn’t support this because C++ requires objects to be at least one byte. As this section will show, ParkingLot makes it so easy to implement fast locking algorithms that if clients did need to embed a lock into a bitfield, it would be reasonable for them to have their own implementation of this algorithm.

Our goals are to have a lock that:

  1. Uses as few bits as possible.
  2. Requires only a CAS on the fast path for locking and unlocking to maximize uncontended throughput.
  3. Is adaptive.
  4. Maximizes throughput under contention.

Making the fast path require only a CAS means that WTF::Lock‘s atomic state must be able to tell us if there are any threads parked. Otherwise, the unlock() function would have to always call ParkingLot::unparkOne() in case there were threads parked. While such an implementation would be functional, it would be far from optimal. After all, ParkingLot::unparkOne() is obligated to do hashing, acquire some queue lock, and call a callback. This is a lot more work than we want in the common path of unlock().

This implies having two bits for the atomic state:

  • isLockedBit to indicate if the lock is locked.
  • hasParkedBit to indicate if there may be threads parked.

Locking is allowed to proceed any time the isLockedBit is clear even if the hasParkedBit is set. This property is called barging. We will dive into the implications of barging later.

If locking does not succeed, the algorithm chooses between trying again and parking the thread. Prior to parking, it sets the hasParkedBit. The validation callback it passes to parkConditionally checks that the lock still has both isLockedBit and hasParkedBit set. We don’t want to park if isLockedBit is clear since this means that the lock is available. We don’t want to park if hasParkedBit is clear since this means that the lock has forgotten that we are about to park.

If the hasParkedBit is clear, then unlocking just clears the isLockedBit. If the hasParkedBit is set, it calls unparkOne() passing a callback that really unlocks the lock. This callback will set the lock’s state to either hasParkedBit or 0, depending on whether the UnparkResult reports that there are still more threads on the queue.

We call this basic algorithm a barging lock, and a basic implementation might look like this:

class BargingLock {
public:
    BargingLock()
    {
        m_state.store(0);
    }

    void lock()
    {
        for (;;) {
            uint8_t currentState = m_state.load();

            // Fast path, which enables barging since we are happy to grab the
            // lock even if there are threads parked.
            if (!(currentState & isLockedBit)
                && m_state.compareExchangeWeak(currentState,
                                               currentState | isLockedBit))
                return;

            // Before we park we should make sure that the hasParkedBit is
            // set. Note that because compareAndPark will anyway check if the
            // state is still isLockedBit | hasParkedBit, we don't have to
            // worry too much about this CAS possibly failing spuriously.
            m_state.compareExchangeWeak(isLockedBit,
                                        isLockedBit | hasParkedBit);

            // Try to park so long as the lock's state is that both
            // isLockedBit and hasParkedBit are set.
            ParkingLot::parkConditionally(
                &m_state,
                [this] () -> bool {
                    return m_state.load() == (isLockedBit | hasParkedBit);
                });
        }
    }

    void unlock()
    {
        // We can unlock the fast way if the hasParkedBit was not set.
        if (m_state.compareExchangeWeak(isLockedBit, 0))
            return;

        // Fast unlocking failed, so unpark a thread.
        ParkingLot::unparkOne(
            &m_state,
            [this] (ParkingLot::UnparkResult result) {
                // This runs with the queue lock held. If mayHaveMoreThreads
                // is true, we unlock while leaving the hasParkedBit set.
                if (result.mayHaveMoreThreads)
                    m_state.store(hasParkedBit);
                else
                    m_state.store(0);
            });
    }

private:
    static const uint8_t isLockedBit = 1;
    static const uint8_t hasParkedBit = 2;

    Atomic<uint8_t> m_state;
};

WTF::Lock closely follows this algorithm, but has additional performance tweaks like spinning and inline fast paths.

Spinning

Adaptive locks combine parking and spinning. Spinning is great because it lets microcontention scenarios avoid parking altogether. Microcontention is when a thread fails the fast path lock acquisition because the lock is not available right now, but that lock will become available in less time than what it would take to park and then unpark. Before WTF::Lock::lock() parks a thread, it will spin 40 times, calling yield between spins. This turns out to be good enough across a wide range of platforms. The algorithm can be visualized as follows:

// Fast path:
if (m_word.compareExchangeWeak(0, isLockedBit))
    return;

// Spin 40 times:
for (unsigned i = 40; i--;) {
    // Do not spin if there is a queue.
    if (m_word.load() & hasParkedBit)
        break;
    // Try to get the lock.
    if (m_word.compareExchangeWeak(0, isLockedBit))
        return;
    // Hint that now would be a good time to context switch.
    sched_yield();
}

// Now do all of the queuing and parking.
// ...

This is a known-good approach, which we borrow from JikesRVM’s locks. We suspect that this algorithm, including the spin limit set at 40, is portable enough for our needs. JikesRVM experiments found it to be optimal on a 12-way POWER machine in 1999. I found that it was still optimal when I tried to optimize those locks further on Intel hardware with various CPU and memory topologies. Microbenchmarks that I ran for this post confirm that 40 is still optimal, and that there is a broad plateau of near-optimal settings between about 10 and 60 spins.

Fast Paths

WTF::Lock is structured around an inline fast path for lock() that just does a single lock attempt, and an inline fast path for unlock() that unlocks the lock if there is nobody parked. Having small inline fast paths means that most lock clients will only pay the price of a CAS on locking and unlocking.
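
The shape of this split might look something like the following sketch. The real fast paths live in wtf/Lock.h; here LIKELY and NEVER_INLINE stand in for WTF’s branch-hint and inlining macros, and lockSlow()/unlockSlow() are the out-of-line parts described above.

class Lock {
public:
    void lock()
    {
        // Inline fast path: a single CAS attempt from "unlocked" to "locked".
        if (LIKELY(m_state.compareExchangeWeak(0, isLockedBit)))
            return;
        lockSlow(); // Out-of-line: spinning, then parking.
    }

    void unlock()
    {
        // Inline fast path: only succeeds if nobody is parked.
        if (LIKELY(m_state.compareExchangeWeak(isLockedBit, 0)))
            return;
        unlockSlow(); // Out-of-line: unparks a thread via ParkingLot.
    }

private:
    NEVER_INLINE void lockSlow();
    NEVER_INLINE void unlockSlow();

    static const uint8_t isLockedBit = 1;
    static const uint8_t hasParkedBit = 2;

    Atomic<uint8_t> m_state;
};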

Summary of WTF::Lock

WTF::Lock is a high performance lock that fits in one byte. The underlying algorithm only needs two bits, so it would be suitable for cramming a lock into a bitfield. See wtf/Lock.h and wtf/Lock.cpp for the full implementation.

Barging and Fairness

WTF::Lock makes a particular choice about how to handle unlocking: it clears the isLockedBit, which makes the lock available to any thread, not just the one it unparks. This implies that the thread that has been waiting for the longest may have the lock stolen from it by some other thread, which may not have waited at all. A thread that suffers such defeat has no choice but to park again, which puts it at the end of the queue.

This shortcoming can be fixed by having unlock() unpark a thread without releasing the lock. This kind of protocol hands off ownership of the lock from the thread doing the unlocking to the thread that had waited the longest. If the lock also lacks an adaptive spin loop, then this protocol enforces perfect FIFO (first-in, first-out) discipline on threads contending for a lock. FIFO is an attractive property, and it ensures that no thread will get the lock stolen from it.
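
For illustration, the unlock half of such a handoff protocol might look like the sketch below, reusing the isLockedBit/hasParkedBit constants from BargingLock above. The lock() side (omitted) would treat being unparked as having been granted ownership, so it would not CAS on wakeup. This is a sketch, not the exact benchmark code.

void HandoffLock::unlock()
{
    // Fast path: if nobody is parked, just release the lock.
    if (m_state.compareExchangeWeak(isLockedBit, 0))
        return;

    ParkingLot::unparkOne(
        &m_state,
        [this] (ParkingLot::UnparkResult result) {
            // Runs with the queue lock held, so this state change cannot
            // race with a newly parking thread.
            if (result.didUnparkThread) {
                // Keep isLockedBit set: ownership passes directly to the
                // dequeued thread, so nobody can barge in.
                m_state.store(
                    isLockedBit
                    | (result.mayHaveMoreThreads ? hasParkedBit : 0));
            } else
                m_state.store(0);
        });
}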

However, allowing barging instead of enforcing FIFO allows for much higher throughput when a lock is heavily contended. Heavy contention in systems like WebKit that use very fine-grained locks implies that multiple threads are repeatedly locking and unlocking the same lock. In the worst case, a thread will make very little progress between two critical sections protected by the same lock. In a barging lock, if a thread unlocks a lock that had threads parked then it is still eligible to immediately reacquire it if it gets to the next critical section before the unparked thread gets scheduled. Barging permits threads engaged in microcontention to take turns acquiring the lock many times per turn. On the other hand, FIFO locks force contenders to form a convoy where they only get to hold the lock once per turn. This makes the program run much slower than with a barging lock because of the huge number of context switches – one per lock acquisition!

Futex Algorithms and ParkingLot

ParkingLot is very similar to futexes. Both primitives follow the principle that a lock should not have to maintain its own queue. Futexes get help from the kernel and have a richer API, which enables some locking protocols that would be impossible to implement with ParkingLot, like priority inheritance locks. However, ParkingLot is powerful enough to support the basic FUTEX_WAIT/FUTEX_WAKE operations that form the core of futexes.

FUTEX_WAIT can be implemented as follows:

enum Result {
    TimedOut, // Timeout was reached.
    TryAgain, // The comparison failed.
    Success   // The thread parked and then unparked.
};
Result wait(Atomic<int32_t>* futex, int32_t expected,
            Clock::time_point timeout)
{
    bool comparisonSucceeded = false;
    bool result = ParkingLot::parkConditionally(
        futex,
        [&] () -> bool {
            comparisonSucceeded = futex->load() == expected;
            return comparisonSucceeded;
        },
        [] () { },
        timeout);
    if (result)
        return Success;
    if (comparisonSucceeded)
        return TimedOut;
    return TryAgain;
}

ParkingLot abstracts a simple version of this behind an API called compareAndPark().
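
Morally, compareAndPark is just a wrapper like the following sketch (the exact argument types in WTF may differ):

bool compareAndPark(Atomic<int32_t>* address, int32_t expected)
{
    return ParkingLot::parkConditionally(
        address,
        [=] () -> bool {
            // Park only if the address still holds the expected value.
            return address->load() == expected;
        },
        [] () { },
        ParkingLot::Clock::time_point::max());
}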

FUTEX_WAKE that wakes one thread (the common case) can be implemented as a call to unparkOne:

bool wake(Atomic<int32_t>* futex)
{
    return ParkingLot::unparkOne(futex).didUnparkThread;
}

Being able to emulate core futex functionality means that we can implement various kinds of futex-based lock algorithms. We have done this for the purpose of benchmarking our lock implementations. Here are some of the lock algorithms that we have implemented:

  • ThunderLock: a simple lock algorithm that unparks all threads whenever any threads have been parked. This releases a thundering herd of threads that all try to grab the lock. All but one will have to park again. This algorithm is simpler than BargingLock and requires only three states. It’s easy to implement this with futexes, which support a variant of WAKE that wakes all threads. This is also a great algorithm to use if multiple locks share the same address. A sketch of ThunderLock appears after this list.
  • CascadeLock: an adaptive lock that is similar to glibc‘s lowlevellock algorithm used for pthread_mutex on Linux. This algorithm unparks at most one thread on unlock(). The hard part of an adaptive lock that unparks at most one thread is determining when the atomic state is allowed to forget that there are threads parked. The sooner the atomic state claims there are no threads parked, the sooner unlock() calls can take the fast path. But we don’t want to forget parked threads too soon, as this could lead to deadlock. WTF::Lock solved this problem by using the unparkOne() callback, but that’s not available to futexes. Instead, CascadeLock solves this problem by having any thread that parks acquire the lock in the LockedAndParked state. This conservatively ensures that we never forget about parked threads. It also means that as soon as a thread succeeds in acquiring the lock without parking and no other threads are contending, the lock will forget the parked state and future unlocks will be fast.
  • HandoffLock: This is a strict first-in, first-out lock that has unlock() hand off lock ownership to the thread it unparks. This lock is more deterministic than the other algorithms, but as we will show in our performance evaluation, it’s also a lot slower.

Additionally, we’ve also implemented a version of WTF::Lock in this same style so that it’s easy to compare to the other algorithms:

  • BargingLock: configurable version of WTF::Lock. This lock cannot be implemented using futexes because it requires a callback in unparkOne(), which only ParkingLot provides.
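
Here is the promised sketch of ThunderLock. It is illustrative rather than the exact benchmark code, but it shows the three states and the thundering-herd unlock:

class ThunderLock {
public:
    ThunderLock()
    {
        m_state.store(Unlocked);
    }

    void lock()
    {
        for (;;) {
            uint8_t currentState = m_state.load();

            // Fast path: grab the lock if it is not held.
            if (currentState == Unlocked
                && m_state.compareExchangeWeak(Unlocked, Locked))
                return;

            // Record that there may be parked threads.
            m_state.compareExchangeWeak(Locked, LockedAndParked);

            // Park so long as the state is still LockedAndParked.
            ParkingLot::parkConditionally(
                &m_state,
                [this] () -> bool {
                    return m_state.load() == LockedAndParked;
                },
                [] () { },
                ParkingLot::Clock::time_point::max());
        }
    }

    void unlock()
    {
        // Swap in Unlocked; if anyone may have parked, release the herd.
        for (;;) {
            uint8_t currentState = m_state.load();
            if (m_state.compareExchangeWeak(currentState, Unlocked)) {
                if (currentState == LockedAndParked)
                    ParkingLot::unparkAll(&m_state);
                return;
            }
        }
    }

private:
    static const uint8_t Unlocked = 0;
    static const uint8_t Locked = 1;
    static const uint8_t LockedAndParked = 2;

    Atomic<uint8_t> m_state;
};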

WTF::Condition

The park/unpark operations of ParkingLot align perfectly with the needs of condition variables. WTF::Condition supports lots of condition-variable primitives, like various kinds of waiting with a timeout. In this section we just consider the three most basic primitives, since the other ones are easy to build on top of these: wait, notifyOne, and notifyAll.

The hardest part of a condition variable is that it must appear to unlock the lock at the same time that the thread waits on the condition. Unlocking the lock and separately waiting on the condition would mean that notify operations could happen just after unlocking and just before waiting. We address this with the beforeSleep callback to parkConditionally. This callback runs just after the ParkingLot places the calling thread on a parking queue, but just before the thread is actually parked. This means that as soon as the lock is unlocked, any notify operations are guaranteed to release this thread from the condition variable.

This is a simple and precise algorithm, which ensures that wait will never return unless the condition was notified.

Interestingly, this algorithm means that WTF::Condition doesn’t actually need to place any data into its atomic state – it just uses it to access a queue in the ParkingLot, which then does all of the work. We exploit this to use the contents of the Condition to just record whether there are any waiters. We use the various other callbacks from ParkingLot to maintain this cache, and we use it to make notifyOne/notifyAll very fast when there isn’t anyone waiting: they just return without calling into ParkingLot.

The complete algorithm for the fundamental Condition operations is:

class Condition {
public:
    Condition()
    {
        m_hasWaiters.store(false);
    }

    void wait(Lock& lock)
    {
        ParkingLot::parkConditionally(
            &m_hasWaiters,
            [this] () -> bool {
                // This is the validation callback. It runs with the queue lock
                // held. We will return true, so we know that the queue will
                // have waiters before we release the queue lock.
                m_hasWaiters.store(true);
                return true;
            },
            [this, &lock] () {
                // This is the beforeSleep callback. It runs once our thread
                // is on the parking queue and so will definitely be notified -
                // and so won't actually sleep - if an unpark operation
                // happens. This runs after we have already unlocked the queue
                // lock, so it's safe to do whatever we like. We use this as an
                // opportunity to unlock the lock.
                lock.unlock();
            });
        lock.lock();
    }

    void notifyOne()
    {
        if (!m_hasWaiters.load())
            return;

        ParkingLot::unparkOne(
            &m_hasWaiters,
            [this] (ParkingLot::UnparkResult result) {
                m_hasWaiters.store(result.mayHaveMoreThreads);
            });
    }

    void notifyAll()
    {
        if (!m_hasWaiters.load())
            return;

        m_hasWaiters.store(false);
        ParkingLot::unparkAll(&m_hasWaiters);
    }

private:
    Atomic<bool> m_hasWaiters;
};

This case illustrates some differences from futexes. Supporting condition variables with futexes requires a bit more magic, since we have to unlock the lock before calling FUTEX_WAIT. That would allow a notify call to happen in between the unlocking and the waiting.

One way around this is to use the atomic state to indicate if there is currently any thread stuck in between unlocking and waiting. We would set it to true at the start of wait, and set it to false at the start of notify. Unfortunately, that would lead to spurious returns from wait: anytime a notify operation happens just before we get to FUTEX_WAIT, the wait will return even if the notify also woke up some other thread. This would be a valid implementation of wait according to pthreads and Java, since those allow for spurious wakeups.
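
As a sketch, such a futex-style condition variable could look like this, where wait and wake are the FUTEX_WAIT/FUTEX_WAKE emulations from the previous section. The class and its structure are illustrative, not real WebKit code.

class SpuriousCondition {
public:
    SpuriousCondition()
    {
        m_state.store(0);
    }

    void wait(Lock& lock)
    {
        m_state.store(1); // A thread is between unlock and wait.
        lock.unlock();
        // If notifyOne() stores 0 right here - after our unlock but before
        // our futex wait - then the comparison inside wait() fails and we
        // return even if the notify also woke another thread. That is the
        // spurious wakeup; callers must re-check their predicate in a loop.
        ::wait(&m_state, 1, Clock::time_point::max());
        lock.lock();
    }

    void notifyOne()
    {
        m_state.store(0);
        wake(&m_state);
    }

private:
    Atomic<int32_t> m_state;
};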

We like that ParkingLot allows us to avoid spurious wakeups. When debugging concurrent code, it’s great to be able to isolate what happened. Ensuring that wait only returns as a result of a notification is a welcome dose of determinism when trying to understand the behavior of a concurrent program.

WTF::Lock and WTF::Condition both take just one byte and implement all of the features you’d expect from such synchronization primitives. This is possible due to the flexibility of the ParkingLot API. ParkingLot is also powerful enough to support many futex-based algorithms, since ParkingLot::compareAndPark/unparkOne are intra-process equivalents of FUTEX_WAIT/FUTEX_WAKE.

Implementing WTF::ParkingLot

WTF::ParkingLot provides the primitives needed to build any kind of adaptive lock. ParkingLot is a collection of queues of parked threads. Queues are keyed by the address of their lock’s atomic state. ParkingLot is based on a concurrent hashtable to maximize parallelism – even if many threads are experiencing contention and need to do things to the queues, those threads will likely get to do their queue operations in parallel because the hashtable has no single bottleneck.

We use ParkingLot to save memory in locks. A risk with any side-table approach is that we are just shifting space consumption from the lock object to the ParkingLot. Fortunately, ParkingLot gives us a strong guarantee: the size of ParkingLot is bounded by the number of threads. It mostly relies on thread-local objects, which it allocates on-demand and destroys automatically when threads die. As we will show, all of ParkingLot‘s data structures obey the rule that their size is asymptotically bounded by the number of threads. This means that the number of locks and even the rate at which you contend on them has no impact on the hard O(threads) space bound of ParkingLot. In exchange for this fixed per-thread overhead, ParkingLot enables all of your locks to take only one byte each.

There are three fundamental operations: parkConditionally, unparkOne, and unparkAll. We’ll describe just the first two in this section, since unparkAll is trivial to implement using the same logic as unparkOne.

Concurrent Hashtable of Synchronized Queues

An easy way to implement ParkingLot is to have a single lock that guards a hashtable that maps addresses to queues. This would probably work great for programs that weren’t very parallel, but that lock will become a bottleneck in programs with lots of threads. ParkingLot avoids this bottleneck by using a concurrent hashtable.

The intuition behind concurrent hashtables is that different threads are unlikely to interfere with each other because they are likely to do accesses using different keys, which hash to different buckets. Therefore even concurrent writes are likely to proceed in parallel. The most sophisticated concurrent hashtable algorithms use lock-free data structures throughout. But a simpler approach is to just put a lock around each bucket. This is the approach we take in ParkingLot. The algorithm turns out to be fairly simple because we do not have to optimize resizing. We can take these shortcuts because:

  • There is only one concurrent hashtable. ParkingLot is not instantiable. All of its member functions are static. So there is only one of these concurrent hashtables in any process.
  • Its size is bounded by the number of threads. A thread takes a lot of memory already. This means that we don’t have to be worried about the space consumption of this hashtable, so long as it’s O(threads) and the per-thread overhead is significantly smaller than a typical thread stack.
  • We must acquire a lock associated with the queue once we find it in the hashtable. This means that it wouldn’t be too beneficial to make the hashtable itself lock-free. All users of it will grab a lock anyway. This motivates a solution that doesn’t involve a lock-free concurrent hashtable – just one that attempts to minimize lock contention.
  • Iterating over the whole table is uncommon and not very important. This means that iteration, like resizing, can be gross.

Our resizing algorithm will leak the old (smaller) hashtable. This is essential for making the algorithm sound. Because there is only one ParkingLot and its size is bounded by the number of threads, we can compute a hard bound on the amount of leaked memory.

The most important part of our resizing algorithm is that it makes resizing an extremely rare event. Resizing the table only happens when the following conditions arise:

  • a thread parks itself for the first time.
  • the thread count at that time is greater than one third of the hashtable’s size.

This ensures that resizing occurs only when the high watermark of threads increases. When we grow the table, we always make the new size be twice what we need. These rules combined ensure that if the max number of threads that were active at any time is N, then the number of resizes we have ever done is at most log(N). Since resizing is this rare, we can afford for the resize algorithm itself to be slow. We’ll first consider how to make the ParkingLot work in the absence of resizing.

Simplified Algorithm for a Fixed-Size Hashtable

Let’s assume that the hashtable size is fixed and all threads agree on a pointer to the hashtable. This allows us to consider a simpler version of the algorithm. We’ll worry about adding resizing later.

The basic algorithm we use is that each hashtable bucket is a queue. Each bucket has a lock (specifically, a WordLock, described in a later section) to protect itself. We use this lock as the queue lock for the purpose of the ParkingLot API. The hashtable only supports enqueue and dequeue, so collisions are handled by simply interleaving the collided queues. For example, if addresses 0x42 and 0x84 both hash to the bucket at index 7, and you perform a sequence of enqueue operations like:

  • enqueue(0x42, T1)
  • enqueue(0x42, T2)
  • enqueue(0x84, T3)
  • enqueue(0x84, T4)

Then the bucket at index 7 will point to a queue that looks like:

head -> {addr=0x42, thr=T1} -> {addr=0x42, thr=T2} -> {addr=0x84, thr=T3} -> {addr=0x84, thr=T4} <- tail

This design means that enqueuing doesn’t have to worry about collisions at all; it just records the actual desired address in the queue node (i.e. the ThreadData for the current thread). Dequeuing resolves collisions by finding the first element in the list, starting at head, that has the address we are dequeuing for.

After enqueuing a thread when parking, ParkingLot must suspend it until it is dequeued during unparking. ParkingLot uses a thread-local condition variable to suspend threads. Only large overheads matter on this code path, since its performance is dominated by the work that the OS has to do to make the thread not runnable anymore. Hence, it’s fine for ParkingLot to bottom out in OS condition variable code.
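
The data structures this implies look roughly like the following sketch. The real definitions live in wtf/ParkingLot.cpp; the exact fields differ, so treat these as illustrative.

struct ThreadData {
    const void* address;      // The address this thread is parked on.
    ThreadData* nextInQueue;  // Next entry in the bucket's queue.

    // The thread-local parking mechanism: ParkingLot bottoms out in an
    // ordinary OS condition variable.
    bool shouldWake { false };
    std::mutex parkingLock;
    std::condition_variable parkingCondition;
};

struct Bucket {
    WordLock lock;            // The queue lock for this bucket.
    ThreadData* queueHead { nullptr };
    ThreadData* queueTail { nullptr };
};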

In this design, ParkingLot::parkConditionally proceeds as follows:

  1. Hash the provided atomic state address to get the index into the hashtable. Better yet, this gives us a pointer to our bucket. From here on, we only worry about this bucket.
  2. Lock the bucket’s lock.
  3. Call the provided validation callback. The bucket’s lock is also the queue lock for the client’s atomic state address, so calling the validation callback here satisfies the contract of parkConditionally. If the validation fails, we release the bucket lock and return.
  4. If the validation succeeds, we enqueue the current thread by appending it to the linked list at the tail. The current thread’s ThreadData will contain the address that we are parking on.
  5. Unlock the bucket’s lock.
  6. Call the beforeSleep callback. Doing work at this point turns out to be great for condition variables; more on that later.
  7. Wait on the current thread’s parking condition variable.

Unparking a thread via ParkingLot::unparkOne proceeds as follows:

  1. Hash the provided atomic state address to get the bucket pointer.
  2. Lock the bucket’s lock.
  3. Search forward from head to find the first entry in the queue that matches our address, and then remove that entry. We may not find any such entry. The queue may even be completely empty.
  4. Call the provided callback, telling it if we dequeued any threads and if the queue has any more elements. Giving this information to the client while we hold the bucket’s lock turns out to be great for locks; more on that later.
  5. Unlock the bucket’s lock.
  6. If we had dequeued a thread, tell it that it can wake up now by signaling its parking condition.
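
Putting those steps together, a compact sketch of unparkOne for the fixed-size table might read as follows. It reuses the Bucket/ThreadData sketch from above; hash is the hash function from step 1, and dequeueFirstMatch and queueHasEntryFor are hypothetical helpers that walk the bucket’s linked list.

void unparkOne(const void* address,
               const std::function<void(UnparkResult)>& callback)
{
    Hashtable* hashtable = g_hashtable;
    Bucket* bucket = hashtable->buckets[hash(address) % hashtable->size];
    bucket->lock.lock();

    // Steps 3 and 4: dequeue the first match, then report to the client
    // while the queue lock is still held.
    ThreadData* thread = dequeueFirstMatch(bucket, address);
    UnparkResult result;
    result.didUnparkThread = !!thread;
    result.mayHaveMoreThreads = queueHasEntryFor(bucket, address);
    callback(result);

    bucket->lock.unlock();

    if (!thread)
        return;

    // Step 6: signal the dequeued thread's parking condition. The
    // per-thread lock resolves the race with a thread that is still on
    // its way to sleep.
    {
        std::lock_guard<std::mutex> locker(thread->parkingLock);
        thread->shouldWake = true;
    }
    thread->parkingCondition.notify_one();
}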

The other operations on ParkingLot are simple variations on these two. ParkingLot::compareAndPark is just a wrapper for parkConditionally, and unparkAll is almost like unparkOne except that it finds all of the entries matching the address rather than just the first one.

Resizing the Hashtable

We don’t want to make a guess about how many threads the process will have. WebKit contributors sometimes like to add threads, and we don’t want to discourage that. Web APIs can cause WebKit to start threads, and the number of threads can be controlled by the web page. Therefore, we don’t want to get into the business of guessing how many threads we will see. This implies that the hashtable must be resizable.

If we lock every bucket in the current hashtable, then we have exclusive access to the table and we can do with it as we wish. Any other thread wishing to access the table will be stuck trying to acquire the lock of some bucket, since the park/unpark operations from the previous section all start with locking some bucket’s lock. The intuition is that resizing can simply lock all of the old table’s buckets and then allocate a new hashtable and copy the old one’s contents into it. Then, while still holding the locks of all of the buckets, it can repoint the global hashtable pointer to the new table. Then we can release the locks on the old table’s buckets. This implies another change: the park/unpark algorithms will check if the global hashtable pointer is still the same after the bucket lock is locked. Without resizing, the park implementation might have looked like:

void ParkingLot::parkConditionally(...)
{
    Hashtable* hashtable = g_hashtable; // load global hashtable pointer.
    Bucket* bucket = hashtable->buckets[hash % hashtable->size];
    bucket->lock.lock();
    // The rest of the parking algorithm...
}

Resizing means that any hashtable operation begins like this instead:

void ParkingLot::parkConditionally(...)
{
    Bucket* bucket;
    for (;;) {
        Hashtable* hashtable = g_hashtable;
        bucket = hashtable->buckets[hash % hashtable->size];
        bucket->lock.lock();

        if (hashtable == g_hashtable)
            break;

        // The hashtable resized while we were waiting for the lock. Try
        // again.
        bucket->lock.unlock();
    }
    // At this point, we will have a bucket, it will be locked, and the
    // hashtable is not resizing.
    // The rest of the parking algorithm...
}

After resizing, we need to leak the old hashtable. We cannot be sure after unlocking all of its buckets how many threads are still stuck between having loaded the old hashtable pointer and attempting to lock a bucket. Threads may be stuck in between any two instructions for an indeterminate amount of time due to OS scheduling. Worse, a bucket’s lock may have any number of threads waiting on it, so we cannot delete the lock. Rather than try to ask the OS about the status of all threads in the system to detect when it’s safe to delete the old table, we just leak the old hashtables. This is fine because of exponential resizing. Let’s say that the hashtable started with a size of 1 and resized up to 64. Then we will have allocated hashtables of the following sizes:

1 + 2 + 4 + 8 + 16 + 32 + 64

This is a geometric series, which sums to 127 (i.e. 64 * 2 - 1). In general, the amount of memory we will waste due to leaking old tables is proportional to the amount of memory used by the current table. Somewhat humorously, the ParkingLot will record all “leaked” hashtables in a global vector to ensure that leak detector tools don’t bother us about these harmless and intentional leaks.

We optimize this a bit further, by having the buckets be separate heap-allocated objects. The hashtable just contains pointers to buckets, and we reuse buckets across resizings. This means that the only thing that we leak are the bucket pointer arrays, which are an order of magnitude smaller than the total size of all of the buckets. In our implementation, the leak is bounded (total amount of leaked memory is bounded by the amount of memory we are using) and very small (it’s bounded by the size of the pointer array, which is much smaller than the total amount of memory used for buckets, which in turn is bounded by the number of threads and is much smaller than the total amount of memory that threads use for other things like stacks).

Summary of WTF::ParkingLot

To summarize, ParkingLot provides parking queues keyed by the memory addresses of locks. The memory usage of ParkingLot has nothing to do with the number of locks – it’s bounded by the number of threads currently parked (which is bounded by the number of threads). Using some simple concurrency tricks, ParkingLot is able to provide parallelism when different threads are queueing on different addresses. See wtf/ParkingLot.h and wtf/ParkingLot.cpp for the full implementation.

WTF::WordLock

WTF::ParkingLot needs a lock implementation for protecting buckets. This shouldn’t be a spinlock because we don’t put a bound on the amount of code that may execute while the lock is held. ParkingLot will use this lock to synchronize the validation in parkConditionally() and the callback in unparkOne(). Even though those callbacks usually do very little work, we don’t want to place strict limits on them. We also need the lock to behave well under microcontention and to not take too much memory. This means that we need something like WTF::Lock.

Fortunately, it’s possible to implement that algorithm without a ParkingLot if we’re willing to use an entire pointer-sized word. This is what WTF::WordLock gives us. It’s less desirable than WTF::Lock, since it requires more memory, but it’s standalone so that we can use it for all of the locking needs of ParkingLot. A WordLock instance just has an Atomic<uintptr_t> inside it. There is no other overhead except for some small per-thread data structures that get created the first time that a thread contends for a lock and get destroyed automatically when the thread dies.

A lock needs three data structures: the atomic state, a queue of threads, and a lock to protect the queue. In our BargingLock algorithm, the atomic state comprises a bit that tells us if the lock is locked and a bit that tells us whether the queue is non-empty. WordLock adapts this algorithm by having the atomic state be a pointer to the head of the queue, with the two low-order bits of the pointer stolen to represent whether the lock is locked and whether the queue is locked. We interpret the atomic state as follows:

  • The lowest bit is the isLockedBit.
  • The second-lowest bit is the isQueueLockedBit.
  • The rest of the bits are a pointer to the head of the queue, or null if it is empty.
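
In code, this encoding amounts to something like the following sketch; the names follow the bullets above, and the real definitions are in wtf/WordLock.h and wtf/WordLock.cpp.

static const uintptr_t isLockedBit = 1;
static const uintptr_t isQueueLockedBit = 2;
static const uintptr_t queueHeadMask =
    ~static_cast<uintptr_t>(isLockedBit | isQueueLockedBit);

// Recovering the queue's head (or null) from the atomic state:
ThreadData* queueHead(uintptr_t state)
{
    return reinterpret_cast<ThreadData*>(state & queueHeadMask);
}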

The queue is represented using ThreadData objects. There is one such object per thread. It contains the pointers necessary to manage the queue, including a next pointer and a tail pointer. We use the convention that the head of the queue points to the tail, which obviates the need to allocate any other memory for storing a pointer to tail: the atomic state just points to head, which gives an O(1) way of finding the tail.

In all other regards, WTF::WordLock follows the BargingLock algorithm. Our experiments will show that except for space usage, WordLock performs just as well as Lock. See wtf/WordLock.h and wtf/WordLock.cpp for the full implementation.

Performance

We replaced all of the locks in WebKit with WTF::Lock because it was safer than spinlocks (no chance of a thread wasting time in a spin loop) and both faster and smaller than OS-provided mutexes. This section shows the performance implications of this change, including some exploration of locking protocols that WebKit does not use but that we either discovered by accident or that we’ve heard of others using.

This section first shows the performance of WTF::Lock when running WebKit benchmarks, and then shows some microbenchmark results using a bunch of different lock variants.

WebKit Performance With WTF::Lock

Prior to WTF::Lock, WebKit used a mix of OS-provided mutexes and spinlocks. We would guess how important the lock was for fast path performance and space and how long the critical section was going to be. We would always use OS-provided mutexes for critical sections that we thought might be long. We had data that suggested that we had picked incorrectly in at least some cases: some of those OS-provided mutexes were slowing us down and some of the spinlocks would cause sched_yield to show up in time profiles. The difficulty of guessing what kind of lock to use motivated us to implement WTF::Lock.

It’s now difficult to revert this change and return to a world where we pick different locks for different critical sections, and we suspect that using spinlocks is generally not a good idea. If some part of the code unexpectedly takes a long time, for example due to swapping, then the last thing we want is for other threads to start busy-waiting for the lock. We also knew from experience that trying to alleviate that problem by making spinlocks sometimes sleep would only degrade performance in the common case.

This section sets out to establish that if you know that you need an adaptive lock then WTF::Lock is what you want to use. We use three benchmarks: JetStream 1.1, Speedometer 1.0, and PLT3 (our internal page load time test). All of these benchmarks are run on a Mac Pro with two 2.7 GHz 6-Core Xeon E5 CPUs (with hyperthreading, so 24 logical CPUs) and 16 GB RAM running El Capitan. The “OS Mutex” results are from replacing WTF::Lock with a wrapper for pthread_mutex_t. The WTF::Lock results are the baseline. These numbers are gathered using WebKit r199680 with r199690 backported (since it affected performance on this machine).

JetStream Performance

This chart shows the JetStream score for both OS Mutex and WTF::Lock. Higher is better.

JetStream is a JavaScript benchmark that consists of small to medium-sized programs that stress various parts of our JavaScript engine. WebKit relies on locks heavily when running JavaScript programs. In the most extreme case, each object may have its own lock and this lock may be acquired on any property access. This is necessary to allow our concurrent compiler to inspect the heap. Without locks, those accesses would not be safe. These JetStream numbers show that it’s important to have fast locks when running JavaScript.

Speedometer Performance

This chart shows the Speedometer score for both OS Mutex and WTF::Lock. Higher is better.

Speedometer is a JavaScript and DOM benchmark comprised of web apps implemented in different web frameworks. It stresses the entire engine. We can see that for this realistic test, WTF::Lock gives a 5% speed-up.

PLT3 Performance

This chart shows PLT3 geometric mean page load times. Lower is better.

PLT3 speeds up by 5% if you switch to WTF::Lock. Since PLT3 is not entirely dominated by JavaScript, this suggests that there are many other locks in WebKit that benefit from being fast.

Summary of WebKit Lock Performance

WTF::Lock is always a speed-up over pthread_mutex_t. It’s also 64x smaller – it uses only one byte while pthread_mutex_t uses 64 bytes. Based on this data, we are confident that the right choice is to continue using WTF::Lock instead of pthread_mutex.

Microbenchmark Performance

This section explores the performance of various locks on a simple microbenchmark that can start any number of threads which repeatedly lock a lock and do some small amount of floating point math (each loop iteration does one double addition and multiplication). We can vary the locking protocol and some parameters of the locking protocol (like the amount of spinning it will do before parking). This compares WTF::Lock and WTF::WordLock to spinlocks and miscellaneous lock algorithms that use ParkingLot. This section also compares WTF::Lock to std::synchronic and hardware transactional memory.
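
The benchmark’s inner loop, per thread, is shaped roughly like this sketch (the variable names are illustrative, not the actual harness):

for (unsigned i = 0; i < iterationsPerThread; ++i) {
    lock.lock();
    for (unsigned j = 0; j < workPerCriticalSection; ++j) {
        // One double addition and one double multiplication per iteration.
        sharedValue = (sharedValue + 1.5) * 1.01;
    }
    lock.unlock();
    localAcquisitionCount++; // Summed across threads at the end.
}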

These benchmarks are run on a MacBook Pro with a 2.6 GHz Intel Core i7 with four cores and hyperthreading and 16 GB of RAM running El Capitan.

Microcontention for Various Thread Counts

This chart shows the number of successful lock acquisitions per second across all threads as a function of the number of threads. This uses a critical section that does one loop iteration while holding the lock. Higher is better.

We use six locking protocols: the OS-provided mutex (pthread_mutex_t), HandoffLock, WTF::Lock, WTF::WordLock, a spinlock that calls sched_yield when it fails to acquire the lock, and a spinlock that issues the x86 pause instruction instead.

This chart shows that as you scale up the number of threads, WTF::Lock can easily hold its own. It’s hard to tell from the chart just how slow the OS mutex and HandoffLock are. In fact, for 10 threads they are about 160x slower.

Notice that for a single thread, the fastest locks are always spinlocks. This is because spinlocks do not have to use CAS when unlocking. Using CAS when unlocking is necessary for locks that have a queue, since you need to check for parked threads at the moment that you unlock. Spinlocks don’t do this, so they can just store 0 – or whatever the “I’m not locked” value is – into the lock’s atomic state.

It’s also clear that depending on the number of threads contending, different locks have very different performance. It appears that WTF::Lock is not so great for two or three threads.

Finally, it’s clear that the x86 pause instruction is not useful for our spinlocks. Intel claims that it is a speed-up, but we cannot confirm their claim.

Optimizing the Spin Limit of WTF::Lock

This chart shows the number of successful lock acquisitions per second across all threads as a function of the spin limit. Higher is better. This test uses 4 threads, since for fewer threads the spin limit doesn’t matter much, and for more threads the chart doesn’t look much different than this. This uses a critical section that does one loop iteration while holding the lock.

We initially picked a spin limit of 40 based on ancient JikesRVM experiments. Surprisingly, this chart precisely confirms that 40 is still optimal.

Microcontention With Different Locks

This chart shows the number of successful lock acquisitions per second across all threads as a function of the number of threads. This uses a critical section that does one loop iteration while holding the lock. Higher is better.

This explores three algorithms:

  • ThunderLock. This unleashes a thundering herd every time it unparks threads.
  • CascadeLock. This is based on glibc’s lowlevellock algorithm.
  • BargingLock. This is like WTF::Lock, but more configurable.

We then run each one in two variants, one that is 8-bit and one that is 32-bit.

This plot shows that CascadeLock and ThunderLock exhibit wonderful performance for smaller numbers of threads. BargingLock and ThunderLock exhibit the best performance for many threads. This chart suggests that we might have additional performance improvements if we try to take the best of ThunderLock and CascadeLock and integrate them into the WTF::Lock algorithm. On the other hand, this microbenchmark is quite extreme and it doesn’t decisively favor any of these algorithms. Because of these results, we have a bug open about continuing to reconsider our WTF::Lock implementation.

Contention With Different Critical Section Lengths

This chart shows the number of successful lock acquisitions per second across all threads as a function of the number of loop iterations while the critical section is held. Higher is better. This uses 4 threads.

All previous microbenchmark charts used a very short critical section. This shows what happens when the critical section length is increased. Unsurprisingly, the performance gap between the OS mutex and WTF::Lock gets reduced, but even for long critical sections (1000 double multiplies and 1000 double adds), WTF::Lock is still almost 2x faster.

Lock Fairness

One advantage of OS mutexes is that they guarantee fairness: All threads waiting for a lock form a queue, and, when the lock is released, the thread at the head of the queue acquires it. It’s 100% deterministic. While this kind of behavior makes mutexes easier to reason about, it reduces throughput because it prevents a thread from reacquiring a mutex it just released. It’s common for WebKit threads to repeatedly acquire the same lock. This section attempts to evaluate the relative fairness of OS mutexes and WTF::Lock.

Our fairness benchmark starts ten threads while a lock is held. It waits a bit after starting them to maximize the likelihood that all of those threads pile up on the lock’s queue. Then we release the lock, and count how many times each thread got to acquire the lock during a 100 millisecond run. A FIFO lock will ensure that each thread got to acquire the lock the same number of times except for an off-by-one step: whenever the 100 millisecond test run finishes, some set of threads may have had a chance to do exactly one more lock acquisition because they happened to come first in the round-robin cycle.

The chart above shows the fairness results for the OS Mutex. As expected, it’s completely fair.

WTF::Lock is slightly less fair according to the chart above. However, the least lucky WTF::Lock thread still got to acquire the lock about 180x more times than any OS Mutex thread: thread 8 was the least lucky WTF::Lock thread with only 556797 acquisitions, 15% less than thread 10, which was the luckiest. But that’s a huge number of lock acquisitions compared to 3010, the best that the OS mutex threads could do.

This is a surprising result. It’s clear that the OS mutex is doing exactly what it set out to do: no thread gets to do any more work than any other thread. On the other hand, WTF::Lock does not guarantee fairness. Analysis of the algorithm shows that a thread that has been waiting the longest to get the lock may fail to get the lock and be forced to go to the end of the queue. But this chart shows that even without having a fairness guarantee, the unluckiest thread using WTF::Lock got better treatment than any thread using the guaranteed-fair mutex. It’s almost as if the OS mutex is not actually fair because while thread 1 is served equally to thread 2, all 10 threads are underserved relative to a hypothetical thread 11, which is using a different algorithm. Indeed, we can think of thread 11 as being the OS context switch handler.

Fair algorithms make sense in some contexts, like if all critical sections are long and it matters that the longest wait for any thread is bounded by the number of threads and the total length of their critical sections. But WebKit uses tiny critical sections and some of them become contended. The cost of ensuring fairness in small critical sections turns out to be too high to be practical.

We have to account for the possibility that the OS mutex is slower than WTF::Lock for some reason other than fairness. We can test this since we have also implemented HandoffLock, which is a completely fair lock implemented using ParkingLot.

The chart above shows the fairness results for HandoffLock. It performs almost exactly like the OS mutex. This result has some interesting implications. It shows that the OS mutex’s performance is likely to be due entirely to its deterministic fairness guarantee. It also implies that the extra overhead that ParkingLot introduces does not adversely affect the speed with which ParkingLot can hand off execution from one thread to another.

Comparison to Other Novel Locks

The C++ language has a proposed feature called std::synchronic that addresses some of the same problems as ParkingLot. It allows users to write their own locks, and those locks can fit into a small amount of memory. Lock algorithms focus a lot on how to handle contention so as to optimize throughput even when multiple threads want to hold the same lock. An approach for handling contention that is popular in scholarly computer science is transactional memory. If a transactional critical section is contended but the contending threads don’t have any races other than the race to get the lock (i.e. they access disjoint memory except for the lock itself) then these threads will get to run concurrently. If a race is detected, some threads will abort and retry, possibly reverting to a conventional lock algorithm. Modern x86 chips support transactional memory via Hardware Lock Elision (HLE). WebKit avoids using a single lock to protect unrelated data, since this is both awkward (it’s easiest to put a tiny WTF::Lock right next to whatever field it protects) and suboptimal (it causes pointless contention). In WebKit we add locks in order to protect against data races, so transactions are unlikely to help. This section evaluates the performance of these alternatives, with an emphasis on a WebKit-style critical section, where racing on the lock implies a race for the same underlying data.

This chart shows the number of successful lock acquisitions per second across all threads as a function of the number of threads. This uses a critical section that does one loop iteration while holding the lock. Higher is better.

To test std::synchronic we implement a lock that follows the “TTAS lock” algorithm in the synchronic test suite. To test HLE, we implement a basic spinlock wrapped with xacquire/xrelease. As this chart shows, WTF::Lock is always significantly faster than either of these kinds of locks. We suspect that std::synchronic performs relatively poorly because it requires the analog of ParkingLot::unparkOne() to run every time a lock is released, even if nobody is waiting. On the other hand, the same features that make std::synchronic a bit slower also make its API a lot easier to use. We suspect that HLE performs relatively poorly because the locks in this benchmark protect a data race. We only use locks in WebKit when there is a data race to protect against, so although this benchmark is unfair to the intended use case of HLE, we believe that it’s an appropriate benchmark for simulating how we use locks. We aren’t the first to observe that transactional memory isn’t great; others have observed that one problem with transactional memory is the lack of a killer app, and that the industry as a whole is missing a concurrency killer app. WebKit uses concurrency to dramatically speed up JIT compilation and it uses parallelism to dramatically speed up garbage collection, and both are possible thanks to fast locks.
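
For reference, a basic HLE spinlock of the kind described above can be written with the GCC/Clang atomic builtins’ HLE hints; this sketch follows the pattern in the GCC documentation, not our exact benchmark code.

#include <immintrin.h> // for _mm_pause

struct HLESpinLock {
    void lock()
    {
        // xacquire-prefixed exchange: starts an elided transaction if the
        // hardware supports HLE, otherwise behaves as a normal locked xchg.
        while (__atomic_exchange_n(&m_state, 1,
                                   __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
            _mm_pause(); // Spin politely until the lock looks free.
    }

    void unlock()
    {
        // xrelease-prefixed store: commits the elided transaction.
        __atomic_store_n(&m_state, 0,
                         __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
    }

    int m_state { 0 };
};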

Summary

We replaced all of WebKit’s locks with our own lock implementation, called WTF::Lock. We did this because we wanted to aggressively reduce the sizes of our locks while increasing overall performance. We also wanted the lock to be adaptive, so that threads would not spin when a lock was held for a long time. The new lock is implemented using a reusable abstraction for parking and queuing threads, and that abstraction will come in handy when implementing new web standards.

By Filip Pizlo at May 06, 2016 03:00 PM

April 27, 2016

Release Notes for Safari Technology Preview 3

WebKit Blog

Safari Technology Preview Release 3 is now available for download. If you already have Safari Technology Preview installed, you can update from the Mac App Store’s Updates tab. Release 3 of Safari Technology Preview covers WebKit revisions 199086–199865.

JavaScript

  • Added support for Symbol.isConcatSpreadable per the ES6 spec (r199397)
  • Made RegExp constructor get the Symbol.match property to decide if an object should be constructed like a RegExp object per the ES6 spec (r199106)
  • Changed String.match and String.search to use RegExp constructor per the ES6 spec (r199144)
  • Corrected how function declarations are hoisted per the ES6 spec (r199179)
  • Improved parsing of ES6 arrow functions (r199352)
  • Added RegExp.prototype[@@split] and made String.prototype.split use it per the ES6 spec (r199731)
  • Added RegExp.prototype[@@search] (r199748)
  • Updated the treatment of invoking methods on RegExp.prototype per the ES6 spec (r199545)
  • Made more test cases pass with ES6 RegExp unicode flag (r199523)
  • Added support for caching accesses to arguments.length for a performance speed-up (r199240)
  • Corrected the behavior of throw() for generators yielding to an inner generator per draft ECMAScript spec (r199652)

CSS

  • Implemented the functional :host() pseudo class (r199291)
  • Improved support for SVG cursor images (r199625)
  • Started using OpenType math fonts by default for MathML (r199773)
  • Fixed measurement of hanging punctuation (r199777)
  • Improved hyphenation when the last line in a paragraph only contains one syllable of a word (r199818)
  • Fixed a layout problem affecting CSS Grid items without a static inline position in RTL languages (r199098)
  • Fixed positioned items with gaps for CSS Grid (r199223)
  • Added support for CSS Grid grid-template-columns repeat(auto-fill, …) and repeat(auto-fit, …) (r199343)
  • Fixed positioned items with content alignment in CSS Grids (r199657)
  • Started using grid-template-areas to determine the explicit grid (r199661)
  • Corrected CSS Grid layout by using the margin box for non-auto minimum sizes (r199728)

Web APIs

  • Added support for setting and retrieving Blob values in IndexedDB (r199120, r199230, r199499, r199524, r199708, r199730)
  • Corrected MessageEvent.source result once window has been fully created (r199087)
  • Improved stability when the first child of a shadow root is a comment node (r199097)
  • Made CSS be a proper constructor on the window object with static functions (r199112)
  • Exposed the Crypto constructor on the window object (r199159)
  • Added support for display: contents on <slot> elements (r199151)
  • Fixed FontFace so it properly rejects the returned promise if Content Security Policy blocks all the URLs (r199611)
  • Made FontFaceSet handle null correctly (r199216)
  • Corrected DOMTokenList.contains() so it does not throw an exception (r199296)
  • Made Selection.deleteFromDocument not delete a character when the selection is a caret per the spec (r199585)
  • Improved the IndexedDB bindings to better match the spec (r199750, r199774)
  • Made AudioBufferSourceNode.buffer nullable (r199751)
  • Improved stability handling a wheel event that closes a frame (r199181)

Web Inspector

  • Made it possible to expand objects in the Instances heap snapshot view to see what it retains (r199379)
  • Improved performance dramatically in the Timelines tab when recording pages with a lot of rapid activity and for long recordings (r199747)
  • Improved JavaScript pretty printing performance by using Esprima and by no longer blocking the main thread (r199168, r199169)
  • Improved the profiler’s sampling rate to get closer to a 1ms sample frequency (r199092)
  • Improved filtering in Open Quickly dialog (r199143, r199226)
  • Made the Open Quickly dialog keep its resource list up to date (r199207)
  • Stopped trying to match color patterns in JavaScript source code to improve performance of large resources (r199095)
  • Changed take snapshot navigation button to a camera glyph (r199177)
  • Corrected source code location links in the JavaScript profile Call Trees view (r199201)
  • Made XHRs and Web Workers full-text searchable (r199263)
  • Improved the appearance of DOM nodes in object previews (r199322)
  • Improved the tab bar rendering when the tabs are small (r199325)
  • Corrected dock controls disappearing from the toolbar after leaving fullscreen (r199395)
  • Started remembering the zoom factor as a persistent setting across sessions (r199396)
  • Corrected sourceMappingURL not being used when sourceURL was also set (r199688)
  • Started localizing sizes and times by using Number.prototype.toLocaleString (r199635)
  • Made sourceMappingURL work more reliably across reloads (r199852)

Rendering

  • Improved the time to display for some pages — allowing a short page header to render immediately before other content populates later (r199155)
  • Fixed page tile layers disappearing when graphics acceleration is unavailable (r199130)
  • Made font-size: 0 render as 0 width when text-rendering: optimizeLegibility is used (r199150)
  • Corrected focus ring drawing at incorrect location on image map with a CSS transform (r199247)
  • Made negative letter-spacing affect the right edge of the content’s visual overflow (r199516)
  • Corrected compositing for WebGL based canvases after they changed size (r199536)
  • Started clearing the rendered icon on <input type=file> when an empty files list is set (r199540)
  • Improved performance of border-collapse: collapse on tables (r199552)
  • Improved rendering of select[multiple] to better match other browsers (r199553)
  • Fixed backdrop filter so it honors visibility: hidden (r199862)

Security

  • Made nested browsing contexts created for <object> or <embed> respect Content Security Policy’s object-src directive (r199527)
  • Started ignoring Content Security Policy meta tags that are not descendants of <head> per the spec (r199163)
  • Started ignoring report-only Content Security Policy directives delivered via meta tag per the spec (r199538)
  • Started ignoring paths in Content Security Policy URL matching after redirects per spec (r199612)
  • Removed support for X-Frame-Options in <meta> per the spec (r199696)

Networking

  • Stopped speculatively revalidating cached redirects (r199521)
  • Stopped caching responses with Content-Range headers to avoid serving incorrect results (r199090)
  • Fixed clearing the application cache when removing website data in Privacy preferences (r199204)

Accessibility

  • Changed the application role description to “web application” to avoid confusion with the top-level system application description (r199260)
  • Made presentation role be preferred over child <title> and <desc> elements in SVG content (r199588)

You can file bugs or feature requests at the WebKit bug tracker, or you can submit feedback or bugs to Apple on Apple’s bug reporting website. For other questions or feedback, feel free to reach us on Twitter at @webkit or Jonathan Davis at @jonathandavis.

By Timothy Hatcher at April 27, 2016 05:00 PM

April 26, 2016

Updating Our Prefixing Policy

WebKit Blog

When implementing new features for the Web, it’s important for us to be able to get them into the hands of developers early, so they can give new things a try. (Of course, this also helps us identify and fix bugs!) In the past, browsers did this by using vendor-prefixed names for features. This was intended to protect the Web from the churn of spec and implementation changes. Browsers would eventually implement the standard version with no prefix and drop support for the prefixed version.

Over time this strategy has turned out not to work so well. Many websites came to depend on prefixed properties. They often used every prefixed variant of a feature, which makes CSS less maintainable and JavaScript programs trickier to write. Sites frequently used just the prefixed version of a feature, which made it hard for browsers to drop support for the prefixed variant when adding support for the unprefixed, standard version. Ultimately, browsers felt pressured by compatibility concerns to implement each other’s prefixes.
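As a concrete illustration (this example is mine, not from the original post), a developer wanting a simple CSS transform often had to write every vendor’s variant by hand:

    .box {
        -webkit-transform: rotate(10deg); /* WebKit-based browsers */
        -moz-transform: rotate(10deg);    /* older Firefox */
        -ms-transform: rotate(10deg);     /* older Internet Explorer */
        -o-transform: rotate(10deg);      /* older Opera */
        transform: rotate(10deg);         /* the standard, unprefixed property */
    }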

The current consensus among browser implementors is that, on the whole, prefixed properties have hurt more than they’ve helped. So, WebKit’s new policy is to implement experimental features unprefixed, behind a runtime flag. Runtime flags allow us to continue to get experimental features into developers’ hands while avoiding the various problems vendor prefixes had. Runtime flags also make it easier for us to have different default settings between stable builds and preview builds such as Safari Technology Preview.

We’ll be applying our updated policy to new feature work going forward. Whether or not a runtime flag should be on or off on WebKit trunk (and thus in nightly builds) depends on the maturity of the feature, both in terms of its spec stability and implementation maturity.

What does this mean for Web developers?

Initially, developers shouldn’t notice anything different. In the longer term we hope this change will make it easier for you to try out upcoming features. As always, we encourage you to give in-progress features a try. Feedback and bug reports on experimental features are very welcome.

What about currently prefixed features?

We’ll be evaluating existing features on a case-by-case basis. We expect to significantly reduce the number of prefixed properties supported over time but Web compatibility will require us to keep around prefixed versions of some features.

We invite comments and feedback on the new policy from Web developers, educators, and our colleagues working on other browser engines. Feel free to reach out to me on Twitter (@hober), Jon Davis (@jonathandavis), @webkit, or email me directly at hober@apple.com.

By Edward O'Connor at April 26, 2016 02:00 PM

April 16, 2016

Frédéric Wang: OpenType MATH in HarfBuzz

Igalia WebKit

TL;DR:

  • Work is in progress to add OpenType MATH support to HarfBuzz; it will be instrumental for many math rendering engines relying on that library, including browsers.

  • For stretchy operators, an efficient way to determine the required number of glyphs and their overlaps has been implemented and is described here.

In the context of the Igalia browser team’s effort to implement MathML support using TeX rules and OpenType features, I have started implementing OpenType MATH support in HarfBuzz. This table from the OpenType standard is made of three subtables:

  • The MathConstants table, which contains layout constants. For example, the thickness of the fraction bar of $\frac{a}{b}$.

  • The MathGlyphInfo table, which contains glyph properties. For instance, the italic correction indicating how slanted an integral is, e.g. to properly place the subscript in $\displaystyle\int_{D}$.

  • The MathVariants table, which provides larger size variants for a base glyph or data to build a glyph assembly. For example, either a larger parenthesis or an assembly of U+239B, U+239C, U+239D to write something like:

      $\left(\frac{\frac{\frac{a}{b}}{\frac{c}{d}}}{\frac{\frac{e}{f}}{\frac{g}{h}}}\right.$

Code to parse this table was added to Gecko and WebKit two years ago. The existing code to build glyph assemblies in these Web engines was adapted to use the MathVariants data instead of only private tables. However, as we will see below, the MathVariants data to build glyph assemblies is more general, with an arbitrary number of glyphs or with additional constraints on glyph overlaps. There are also various fallback mechanisms for old fonts and other bugs that I think we could get rid of when we move to OpenType MATH fonts only.

In order to add MathML support in Blink, it is very easy to import the OpenType MATH parsing code from WebKit. However, after discussions with some Google developers, it seems that the best option is to directly add support for this table in HarfBuzz. Since this library is used by Gecko, by WebKit (at least the GTK port) and by many other applications such as Servo, XeTeX or LibreOffice, it makes sense to share the implementation to improve math rendering everywhere.

The idea for HarfBuzz is to add an API to:

  1. Expose data from the MathConstants and MathGlyphInfo.

  2. Shape stretchy operators to some target size with the help of the MathVariants.

It is then up to a higher-level math rendering engine (e.g. TeX or MathML rendering engines) to beautifully display mathematical formulas using this API. The design choice for exposing MathConstants and MathGlyphInfo is almost obvious from the reading of the MATH table specification. The choice for the shaping API is a bit more complex and discussion is still in progress. For example, because we want to accept stretching after glyph-level mirroring (e.g. to draw RTL clockwise integrals) we should accept any glyph and not just input Unicode strings, as is the case for other HarfBuzz shaping functions. This shaping also depends on a stretching direction (horizontal/vertical) or on a target size (and Gecko even currently has various ways to approximate that target size). Finally, we should also have a way to expose the italic correction for a glyph assembly or to approximate the preferred width for Web rendering engines.

As I mentioned at the beginning, the data and algorithm to build glyph assemblies is the most complex part of the OpenType MATH table and deserves special interest. The idea is that you have a list of $n\geq 1$ glyphs available to build the assembly. For each $0\leq i\leq n-1$, the glyph $g_{i}$ has advance $a_{i}$ in the stretch direction. Each $g_{i}$ has a straight connector part at its start (of length $s_{i}$) and at its end (of length $e_{i}$) so that we can align the glyphs on the stretch axis and glue them together. Also, some of the glyphs are “extenders”, which means that they can be repeated 0, 1 or more times to make the assembly as large as possible. Finally, the end/start connectors of consecutive glyphs must overlap by at least a fixed value $o_{\mathrm{min}}$ to avoid gaps at some resolutions, but of course without exceeding the length of the corresponding connectors. This gives some flexibility to adjust the size of the assembly and get closer to the target size $t$.

Figure 1: Two adjacent glyphs $g_{i}$ and $g_{i+1}$ in an assembly, with advances $a_{i}$ and $a_{i+1}$, start connectors $s_{i}$ and $s_{i+1}$, end connectors $e_{i}$ and $e_{i+1}$, and connector overlap $o_{i,i+1}$.

To ensure that the width/height is distributed equally and the symmetry of the shape is preserved, the MATH table specification suggests the following iterative algorithm to determine the number of extenders and the connector overlaps to reach a minimal target size $t$:

  1. Assemble all parts by overlapping connectors by maximum amount, and removing all extenders. This gives the smallest possible result.

  2. Determine how much extra width/height can be distributed into all connections between neighboring parts. If that is enough to achieve the size goal, extend each connection equally by changing overlaps of connectors to finish the job.

  3. If all connections have been extended to minimum overlap and further growth is needed, add one of each extender, and repeat the process from the first step.

We note that at each step, each extender is repeated the same number of times $r\geq 0$. So if $I_{\mathrm{Ext}}$ (respectively $I_{\mathrm{NonExt}}$) is the set of indices $0\leq i\leq n-1$ such that $g_{i}$ is an extender (respectively is not an extender), we have $r_{i}=r$ (respectively $r_{i}=1$). The size we can reach at step $r$ is at most the one obtained with the minimal connector overlap $o_{\mathrm{min}}$, that is

  \sum_{i=0}^{n-1}\left(\sum_{j=1}^{r_{i}}(a_{i}-o_{\mathrm{min}})\right)+o_{\mathrm{min}}=\left(\sum_{i\in I_{\mathrm{NonExt}}}(a_{i}-o_{\mathrm{min}})\right)+\left(\sum_{i\in I_{\mathrm{Ext}}}r(a_{i}-o_{\mathrm{min}})\right)+o_{\mathrm{min}}

We let $N_{\mathrm{Ext}}=|I_{\mathrm{Ext}}|$ and $N_{\mathrm{NonExt}}=|I_{\mathrm{NonExt}}|$ be the number of extenders and non-extenders. We also let $S_{\mathrm{Ext}}=\sum_{i\in I_{\mathrm{Ext}}}a_{i}$ and $S_{\mathrm{NonExt}}=\sum_{i\in I_{\mathrm{NonExt}}}a_{i}$ be the sums of advances for extenders and non-extenders. If we want the advance of the glyph assembly to reach the minimal size $t$ then

  S_{\mathrm{NonExt}}-o_{\mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)+r\left(S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}\right)\geq t

We can assume 0" class="ltx_Math" display="inline" id="p12.m1">SExt-omin⁢NExt>0S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}>0 or otherwise we would have the extreme case where the overlap takes at least the full advance of each extender. Then we obtain

  r\geq r_{\mathrm{min}}=\max\left(0,\left\lceil\frac{t-S_{\mathrm{NonExt}}+o_{\mathrm{min}}\left(N_{\mathrm{NonExt}}-1\right)}{S_{\mathrm{Ext}}-o_{\mathrm{min}}N_{\mathrm{Ext}}}\right\rceil\right)

This provides a first simplification of the algorithm sketched in the MATH table specification: directly start the iteration at step $r_{\mathrm{min}}$. Note that at each step we start at possibly different maximum overlaps and decrease all of them by the same value. It is not clear what to do when one of the overlaps reaches $o_{\mathrm{min}}$ while others can still be decreased. However, the sketched algorithm says all the connectors should reach minimum overlap before the next increment of $r$, which means the target size will indeed be reached at step $r_{\mathrm{min}}$.

One possible interpretation is to stop decreasing the overlap for the adjacent connectors that reached minimum overlap and to continue uniform decreasing for the others until all the connectors reach minimum overlap. In that case we may lose equal distribution or symmetry. In practice, this should probably not matter much. So we propose instead the dual option, which should behave more or less the same in most cases: start with all overlaps set to $o_{\mathrm{min}}$ and increase them evenly to reach the same value $o$. By the same reasoning as above we want the inequality

  S_{\mathrm{NonExt}}-o\left(N_{\mathrm{NonExt}}-1\right)+r_{\mathrm{min}}\left(S_{\mathrm{Ext}}-oN_{\mathrm{Ext}}\right)\geq t

which can be rewritten

  S_{\mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-o\left(N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}-1\right)\geq t

We note that $N=N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}$ is just the exact number of glyphs used in the assembly. If there is only a single glyph, then the overlap value is irrelevant, so we can assume $N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}-1=N-1\geq 1$. This provides the greatest theoretical value for the overlap $o$:

  o_{\mathrm{min}}\leq o\leq o_{\mathrm{max}}^{\mathrm{theoretical}}=\frac{S_{\mathrm{NonExt}}+r_{\mathrm{min}}S_{\mathrm{Ext}}-t}{N_{\mathrm{NonExt}}+r_{\mathrm{min}}N_{\mathrm{Ext}}-1}

Of course, we also have to take into account the limit imposed by the start and end connector lengths. So $o_{\mathrm{max}}$ must also be at most $\min(e_{i},s_{i+1})$ for $0\leq i\leq n-2$. But if $r_{\mathrm{min}}\geq 2$ then extender copies are connected, and so $o_{\mathrm{max}}$ must also be at most $\min(e_{i},s_{i})$ for $i\in I_{\mathrm{Ext}}$. To summarize, $o_{\mathrm{max}}$ is the minimum of $o_{\mathrm{max}}^{\mathrm{theoretical}}$, of $e_{i}$ for $0\leq i\leq n-2$, of $s_{i}$ for $1\leq i\leq n-1$, and possibly of $e_{0}$ (if $0\in I_{\mathrm{Ext}}$) and of $s_{n-1}$ (if $n-1\in I_{\mathrm{Ext}}$).
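To make this concrete, here is a small worked example with invented values (they are not taken from any actual font). Assume two non-extenders and one extender ($N_{\mathrm{NonExt}}=2$, $N_{\mathrm{Ext}}=1$), each glyph with advance $a_{i}=1000$ font units, with $o_{\mathrm{min}}=100$ and target size $t=5000$, so that $S_{\mathrm{NonExt}}=2000$ and $S_{\mathrm{Ext}}=1000$. Then

  r_{\mathrm{min}}=\max\left(0,\left\lceil\frac{5000-2000+100\times 1}{1000-100\times 1}\right\rceil\right)=\left\lceil\frac{3100}{900}\right\rceil=4

so the assembly uses $N=2+4\times 1=6$ glyphs, and

  o_{\mathrm{max}}^{\mathrm{theoretical}}=\frac{2000+4\times 1000-5000}{6-1}=200

Indeed, with $o=200$ the assembly advance is exactly $6000-200\times 5=5000=t$, while one fewer repetition ($r=3$, i.e. $N=5$) would reach at most $5000-100\times 4=4600<t$ even at minimal overlap.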

With the algorithm described above, $N_{\mathrm{Ext}}$, $N_{\mathrm{NonExt}}$, $S_{\mathrm{Ext}}$, $S_{\mathrm{NonExt}}$, $r_{\mathrm{min}}$ and $o_{\mathrm{max}}$ can all be obtained using simple loops on the glyphs $g_{i}$, and so the complexity is $O(n)$. In practice $n$ is small: for existing fonts, assemblies are made of at most three non-extenders and two extenders, that is $n\leq 5$ (incidentally, Gecko and WebKit do not currently support larger values of $n$). This means that all the operations described above can be considered to have constant complexity. This is much better than a naive implementation of the iterative algorithm sketched in the OpenType MATH table specification, which seems to require at worst

  \sum_{r=0}^{r_{\mathrm{min}}-1}\left(N_{\mathrm{NonExt}}+rN_{\mathrm{Ext}}\right)=N_{\mathrm{NonExt}}r_{\mathrm{min}}+\frac{r_{\mathrm{min}}\left(r_{\mathrm{min}}-1\right)}{2}N_{\mathrm{Ext}}=O(n\times r_{\mathrm{min}}^{2})

and at least $\Omega(r_{\mathrm{min}})$.

One issue is that the number of extender repetitions $r_{\mathrm{min}}$ and the number of glyphs in the assembly $N$ can become arbitrarily large, since the target size $t$ can take large values, e.g. if one writes \underbrace{\hspace{65535em}} in LaTeX. The improvement proposed here does not solve that issue, since setting the coordinates of each glyph in the assembly and painting them require $\Theta(N)$ operations, as well as (in the case of HarfBuzz) a glyph buffer of size $N$. However, such large stretchy operators do not happen in real-life mathematical formulas. Hence, to avoid possible hangs in Web engines, a solution is to impose a maximum limit $N_{\mathrm{max}}$ for the number of glyphs in the assembly so that the complexity is limited by the size of the DOM tree. Currently, the proposal for HarfBuzz is $N_{\mathrm{max}}=128$. This means that if each assembly glyph is 1em large you won’t be able to draw stretchy operators of size more than 128em, which sounds like a quite reasonable bound. With the above proposal, $r_{\mathrm{min}}$ and so $N$ can be determined very quickly and the cases $N\geq N_{\mathrm{max}}$ rejected, so that we avoid losing time with such edge cases…

Finally, because in our proposal we use the same overlap $o$ everywhere, an alternative for HarfBuzz would be to set the output buffer size to $n$ (i.e. ignore $r-1$ copies of each extender and only keep the first one). This will leave gaps that the client can fix by repeating extenders, as long as $o$ is also provided. Then HarfBuzz math shaping can be done with a complexity in time and space of just $O(n)$, and it will be up to the client to optimize or limit the painting of extenders for large values of $N$…

By fredw at April 16, 2016 09:23 PM

March 31, 2016

Michael Catanzaro: Positive progress on WebKitGTK+ security updates

Igalia WebKit

I previously reported that, although WebKitGTK+ releases regular upstream security updates, most Linux distributions are not taking the updates. At the time, only Arch Linux and Fedora were reliably releasing our security updates. So I’m quite pleased that openSUSE recently released a WebKitGTK+ security update, and then Mageia did too. Gentoo currently has an update in the works. It remains to be seen if these distros regularly follow up on updates (expect a follow-up post on this in a few months), but, optimistically, you now have several independent distros to choose from to get an updated version of WebKitGTK+, plus any distros that regularly receive updates directly from these distros.

Unfortunately, not all is well yet. It’s still not safe to use WebKitGTK+ on the latest releases of Debian or Ubuntu, or on derivatives like Linux Mint, elementary OS, or Raspbian. (Raspbian is notable because it uses an ancient, insecure version of Epiphany as its default web browser, and Raspberry Pis are kind of popular.)

And of course, no distribution has been able to get rid of old, insecure WebKitGTK+ 2.4 compatibility packages, so many applications on distributions that do provide security updates for modern WebKitGTK+ will still be insecure. (Don’t be fooled by the recent WebKitGTK+ 2.4.10 update; it contains only a few security fixes that were easy to backport, and was spurred by the need to add GTK+ 3.20 compatibility. It is still not safe to use.) Nor have distributions managed to remove QtWebKit, which is also old and insecure. You still need to check individual applications to see if they are running safe versions of WebKit.

But at least there are now several distros providing WebKitGTK+ security updates. That’s good.

Special thanks to Apple and to my colleagues at Igalia for their work on the security advisories that motivate these updates.

By Michael Catanzaro at March 31, 2016 03:00 AM

Michael Catanzaro: Epiphany 3.20

Igalia WebKit

So, what’s new in Epiphany 3.20?

First off: overlay scrollbars. Because web sites have the ability to style their scrollbars (which you’ve probably noticed on Google sites), WebKit embedders cannot use a normal GtkScrolledWindow to display content; instead, WebKit has to paint the scrollbars itself. Hence, when overlay scrollbars appeared in GTK+ 3.16, WebKit applications were left out. Carlos García Campos spent some time to work on this, and the result speaks for itself (if you fullscreen this video to see it properly):

Overlay scrollbars did not actually require any changes in Epiphany itself — all applications using an up-to-date version of WebKit will immediately benefit — but I mention it here as it’s one of the most noticeable changes. Read about other WebKit improvements, like the new Faster Than Light (FTL/B3) JavaScript compilation tier, on Carlos’s blog.

Next up, there is a new downloads manager, also by Carlos García Campos. This replaces the old downloads bar that used to appear at the bottom of the screen:

Screenshot of the new downloads manager in Epiphany 3.20.

I flipped the switch in Epiphany to enable WebGL:

If you watched that video in fullscreen, you might have noticed that page is marked as insecure, even though it doesn’t use HTTPS. Like most browsers, we used to have several confusing security states. Pages with mixed content received a security warning that all users ignored, but pages with no security at all received no such warning. That’s pretty dumb, which is why Firefox and Chrome have been talking about changing this for a year or so now. I went ahead and implemented it. We now have exactly two security states: secure and insecure. If your page loads any content not over HTTPS, it will be marked as insecure. The vast majority of pages will be displayed as insecure, but it’s no less than such sites deserve. I’m not concerned at all about “warning fatigue,” because users are not generally expected to take any action on seeing these warnings. In the future, we will take this further, and use the insecure indicator for sites that use SHA-1 certificates.

Moving on. By popular request, I exposed the previously-hidden setting to disable session restore in the preferences dialog, as “Remember previous tabs on startup:”

Screenshot of the preferences dialog, with the new “Remember previous tabs on startup” setting.

Meanwhile, Carlos worked in both WebKit and Epiphany to greatly improve session restoration. Previously, Epiphany would save the URLs of the pages loaded in each tab, and when started it would load each URL in a new tab, but you wouldn’t have any history for those tabs, for example, and the state of the tab would otherwise be lost. Carlos worked on serializing the WebKit session state and exposing it in the WebKitGTK+ API, allowing us to restore full back/forward history for each tab, plus details like your scroll position on each tab. Thanks to Carlos, we also now make use of this functionality when reopening closed tabs, so your reopened tab will have a full back/forward list of history, and also when opening new tabs, so the new tab will inherit the history of the tab it was opened from (a feature that we had in the past, but lost when we switched to WebKit2).

Interestingly, we found the session restoration was at first too good: it would restore the page exactly as you last viewed it, without refreshing the content at all. This means that if, for example, you were viewing a page in Bugzilla, then when starting the browser you would miss any new comments posted since the last time you loaded the page, until you refreshed the page manually. This is actually the current behavior in Safari; it’s desirable on iOS to make the browser launch instantly, but questionable for desktop Safari. Carlos decided to always refresh the page content when restoring the session for WebKitGTK+.

Last, and perhaps least, there’s a new empty state displayed for new users, developed by Lorenzo Tilve and polished up by me, so that we don’t greet new users with a completely empty overview (where your most-visited sites are normally displayed):

Empty State

That, plus a bundle of the usual bugfixes, significant code cleanups, and internal architectural improvements (e.g. I converted the communication between the UI process and the web process extension to use private D-Bus connections instead of the session bus). The best things have not changed: it still starts up about 5-20 times faster than Firefox in my unscientific testing; I expect you’ll find similar results.

Enjoy!

By Michael Catanzaro at March 31, 2016 01:54 AM

December 15, 2014

Web Engines Hackfest 2014

Gustavo Noronha

For the 6th year in a row, Igalia has organized a hackfest focused on web engines. The 5 years before this one were actually focused on the GTK+ port of WebKit, but the number of web engines that matter to us as Free Software developers and consultancies has grown, and so has the scope of the hackfest.

It was a very productive and exciting event. It has already been covered by Manuel Rego, Philippe Normand, Sebastian Dröge and Andy Wingo! I am sure more blog posts will pop up. We had Martin Robinson telling us about the new Servo engine that Mozilla has been developing as a proof of concept for both Rust as a language for building big, complex products and for doing layout in parallel. Andy gave us a very good summary of where JS engines are in terms of performance and features. We had talks about CSS grid layouts, TyGL – a GL-powered implementation of the 2D painting backend in WebKit, the new Wayland port, announced by Zan Dobersek, and a lot more.

With help from my colleague ChangSeok OH, I presented a description of how a team at Collabora led by Marco Barisione made the combination of WebKitGTK+ and GNOME’s web browser a pretty good experience for the Raspberry Pi. It took a not so small amount of both pragmatic limitations and hacks to get to a multi-tab browser that can play youtube videos and be quite responsive, but we were very happy with how well WebKitGTK+ worked as a base for that.

One of my main goals for the hackfest was to help drive features that were lingering in the bug tracker for WebKitGTK+. I picked up a patch that had gone through a number of iterations and rewrites: the HTML5 notifications support, and with help from Carlos Garcia, managed to finish it and land it at the last day of the hackfest! It provides new signals that can be used to authorize notifications, show and close them.

To make notifications work in the best case scenario, the only thing that the API user needs to do is handle the permission request, since we provide a default implementation for the show and close signals that uses libnotify if it is available when building WebKitGTK+. Originally our intention was to use GNotification for the default implementation of those signals in WebKitGTK+, but it turned out to be a pain to use for our purposes.

GNotification is tied to GApplication. This allows for some interesting features, like notifications being persistent and able to reactivate the application, but those make no sense in our current use case, although that may change once service workers become a thing. It can also be a bit problematic given we are a library and thus have no GApplication of our own. That was easily overcome by using the default GApplication of the process for notifications, though.

The show stopper for us using GNotification was the way GNOME Shell currently deals with notifications sent using this mechanism. It will look for a .desktop file named after the application ID used to initialize the GApplication instance and reject the notification if it cannot find that. Besides making this a pain to test – our test browser would need a .desktop file to be installed, that would not work for our main API user! The application ID used for all Web instances is org.gnome.Epiphany at the moment, and that is not the same as any of the desktop files used either by the main browser or by the web apps created with it.

For the future we will probably move Epiphany towards this new era, and all users of the WebKitGTK+ API as well, but the strictness of GNOME Shell would hurt the usefulness of our default implementation right now, so we decided to stick to libnotify for the time being.

Other than that, I managed to review a bunch of patches during the hackfest, and took part in many interesting discussions regarding the next steps for GNOME Web and the GTK+ and Wayland ports of WebKit, such as the potential introduction of a threaded compositor, which is pretty exciting. We also tried to have Bastien Nocera as a guest participant for one of our sessions, but it turns out that requires more than a notebook on top of a bench hooked up to a TV to work well. We could think of something next time ;D.

I’d like to thank Igalia for organizing and sponsoring the event, Collabora for sponsoring and sending ChangSeok and myself over to Spain from far away Brazil and South Korea, and Adobe for also sponsoring the event! Hope to see you all next year!

Web Engines Hackfest 2014 sponsors: Adobe, Collabora and Igalia

Web Engines Hackfest 2014 sponsors: Adobe, Collabora and Igalia

By kov at December 15, 2014 11:20 PM

December 08, 2014

How to build TyGL

University of Szeged

This is a follow-up blog post of our announcement of TyGL - the 2D-accelerated GPU rendering port of WebKit.

We have received lots of feedback about TyGL and we would like to thank you for all the questions, suggestions and comments. As we promised, let’s get into some technical details.

read more

By szilard.ledan at December 08, 2014 12:47 PM

November 12, 2014

Announcing the TyGL-WebKit port to accelerate 2D web rendering with GPU

University of Szeged

We are proud to announce the TyGL port (link: http://github.com/szeged/TyGL) on top of EFL-WebKit. TyGL (pronounced as tigel) is part of WebKit and provides 2D-accelerated GPU rendering on embedded systems. The engine is purely GPU based. It has been developed on and tested against an ARM Mali GPU, but it is designed to work on any GPU conforming to OpenGL ES 2.0 or higher.

GPU involvement in future graphics is inevitable considering the pixel growth rate of displays, but harnessing GPU power requires a different approach than CPU-based optimizations.

read more

By zoltan.herczeg at November 12, 2014 02:18 PM

October 22, 2014

Fuzzinator reloaded

University of Szeged

It's been a while since I last (and actually first) posted about Fuzzinator. Now I think that I have enough new experiences worth sharing.

More than a year ago, when I started fuzzing, I was mostly focusing on mutation-based fuzzer technologies since they were easy to build and pretty effective. Having a nice error-prone test suite (e.g. LayoutTests) was the warrant for fresh new bugs. At least for a while.

read more

By renata.hodovan at October 22, 2014 10:38 PM

September 25, 2014

Measuring ASM.JS performance

University of Szeged

What is ASM.JS?

Now that mobile computers and cloud services have become part of our lives, more and more developers see the potential of the web and online applications. ASM.JS, a strict subset of JavaScript, is a technology that provides a way to achieve near-native speed in browsers, without the need for any plugin or extension. It is also possible to cross-compile C/C++ programs to it and run them directly in your browser.

In this post we will compare the JavaScript and ASM.JS performance in different browsers, trying out various kinds of web applications and benchmarks.

read more

By matyas.mustoha at September 25, 2014 10:40 AM

August 28, 2014

CSS Shapes now available in Chrome 37 release

Adobe Web Platform

Support for CSS Shapes is now available in the latest Google Chrome 37 release.

What can I do with CSS Shapes?

CSS Shapes lets you think out of the box! It gives you the ability to wrap content around any shape. Shapes can be defined by geometric shapes, images, and even gradients. Using Shapes as part of your website design takes a visitor’s visual and reading experience to the next level. If you want to start with some tutorials, please go visit Sarah Soueidan’s article about Shapes.

Demo

The following shapes use case is from the Good Looking Shapes Gallery blog post.

Without CSS Shapes
the_11_guesses_no_shapes
With CSS Shapes
the_11_guesses_shapes

In the first picture, we don’t use CSS Shapes. The text wraps around the rectangular image container, which leads to a lot of empty space between the text and the visible part of the image.

In the second picture, we use CSS Shapes. You can see the wrapping behavior around the image. In this case the white parts of the image are transparent, so the browser can automatically wrap the content around the visible part, which leads to this nice, clean, and visually more appealing wrapping behavior.
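For reference, this kind of image-based wrapping needs only a couple of declarations. Here is a minimal sketch (the class name and image URL are placeholders, not taken from the actual demo):

    .story-image {
        float: left;
        shape-outside: url("the-11-guesses.png");
        shape-image-threshold: 0.0; /* any pixel with alpha above 0 counts as part of the shape */
        shape-margin: 1em;          /* keep some space between the text and the shape */
    }

The browser extracts the shape from the image’s alpha channel, so no polygon has to be written by hand.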

How do I get CSS Shapes?

Just update your Chrome browser to the latest version from the Chrome/About Google Chrome menu, or download the latest stable version from https://www.google.com/chrome/browser/.

I’d like to thank the collaboration of WebKit and Blink engineers, and everyone else in the community who has contributed to this feature. The fact that Shapes is shipping in two production browsers — Chrome 37 now and Safari 8 later this year — is the upshot of the open source collaboration between the people who believe in a better, more expressive web. Although Shapes will be available in these browsers, you’ll need another solution for the other browsers. The CSS Shapes Polyfill is one method of achieving consistent behavior across browsers.

Where should I start?

For more info about CSS Shapes, please check out the following links:

Let us know your thoughts or if you have nice demos, here or on Twitter: @AdobeWeb and @ZoltanWebKit.

By Zoltan Horvath at August 28, 2014 05:12 PM

May 13, 2014

Good-Looking Shapes Gallery

Adobe Web Platform

As a modern consumer of media, you rarely crack open a magazine or a pamphlet or anything that would be characterized as “printed”. Let me suggest that you take a walk on the wild side. The next time you are in a doctor’s office, or a supermarket checkout lane, or a library, thumb through a magazine. Most of the layouts you’ll find inside can also be found on the web, but not all of them. Layouts where content hugs the boundaries of illustrations are common in print and rare on the web. One of the reasons non-rectangular contour-hugging layouts are uncommon on the web is that they are difficult to produce.

They are not difficult to produce anymore.

The CSS Shapes specification is now in the final stages of standardization. This feature enables flowing content around geometric shapes (like circles and polygons), as well as around shapes defined by an image’s alpha channel. Shapes make it easy to produce the kinds of layouts you can find in print today, with all the added flexibility and power that modern online media affords. You can use CSS Shapes right now with the latest builds of WebKit and Blink based browsers, like Safari and Chrome.

Development of CSS Shapes has been underway for about two years, and we’ve been regularly heralding its progress here. Many of those reports have focused on the evolution of the spec and implementations, and they’ve included examples that emphasized basics over beauty. This article is an attempt to tilt the balance back towards good-looking. Listed below are simple shapes demos that we think look pretty good. Everyone on Adobe’s CSS Shapes engineering team contributed at least one.

There’s a live CodePen.io version of each demo in the gallery. Click on the demo screenshot or one of the handy links to take a look. You’ll want to view the demos with a browser that supports Shapes and you’ll need to enable CSS Shapes in that browser. For example you can use a nightly build of the Safari browser or you can enable shapes in Chrome or Chrome Canary like this:

  1. Copy and paste chrome://flags/#enable-experimental-web-platform-features into the address bar, then press enter.
  2. Click the ‘Enable’ link within that section.
  3. Click the ‘Relaunch Now’ button at the bottom of the browser window.

A few of the demos use the new Shapes Polyfill and will work in most browsers.

And now, without further ado, please have a look through our good-looking shapes gallery.


Ozma of Oz

ozma-demo-screenshot

This demo reproduces the layout style that opens many of the chapters of the L. Frank Baum books, including Ozma of Oz. The first page is often dominated by an illustration on the left or right. The chapter’s text conforms to the illustration, but not too tightly. The books were published over 100 years ago and they still look good in print. With CSS Shapes they can still look good on the web.

Top Cap

topcap-demo-screenshot

The conventional “drop-cap” opens a paragraph by enlarging and highlighting the first letter, word or phrase. The drop-cap’s goal is to draw your attention to where you can start reading. This demo delivers the same effect by crowning the entire opening paragraph with a “top cap” that funnels your attention into the article. In both cases, what’s going on is a segue from a graphic element to the text.

Violator

monsters-demo-screenshot

A violator is a small element that “violates” rectangular text layout by encroaching on a corner or a small part of an edge. This layout idiom is common in short-form magazines and product packaging. That “new and improved” banner which blazes through the corner of thousands of consumer products (whether or not they are new or improved) – it’s a violator.

Column Interest

columns-demo-screenshot

When a print magazine feels the need to incorporate some column layout melodrama, they often reach for this idiom. The shape spans a pair of columns, which creates visual interest in the middle of the page. Without it you’d be faced with a wall of attention sapping text and more than likely turn the page.

Caption

Screenshot of the wine jug caption demo.

The old-school approach for including a caption with an image is to put the caption text alongside or below the image. Putting a caption on top of an image requires a little more finesse, since you have to ensure that the text doesn’t obscure anything important and that the text is rendered in a way that preserves readability.  The result can be relatively attractive.

This photograph was taken by Zoltan Horvath who has pointed out that I’ve combined a quote about tea with a picture of a ceremonial wine jug.  I apologize for briefly breaching that beverage boundary. It’s just a demo.

Paging

Screenshot of the paging demo.

With a layout like this, one could simply let the content wrap around the shape on the right and then expand into the usual rectangle. In this demo the content is served up a paragraph at a time, in response to the left and right arrow keys.

Note also: yes in fact the mate gourd is perched on exactly the same windowsill as the previous demo. Zoltan and Pope Francis are among the many fans of yerba mate tea.

Ersatz shape-inside

Screenshot of the ersatz shape-inside demo.

Originally the CSS Shapes spec included shape-inside as well as shape-outside. Sadly, shape-inside was promoted to “Level 2” of the spec and isn’t available in the current implementations. Fortunately for shape insiders everywhere, it’s still sometimes possible to mimic shape-inside with an adjacent pair of carefully designed shape-outside floats, as sketched below. This demo is a nice example of that, where the text appears inside a bowl of oatmeal.
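A minimal sketch of the trick (the markup and polygon coordinates here are invented for illustration; the real demo uses shapes designed to match the bowl):

    <div class="left-edge"></div>
    <div class="right-edge"></div>
    <p>Text that appears to sit inside the bowl…</p>

    .left-edge, .right-edge { height: 300px; width: 50%; }
    .left-edge {
        float: left;
        shape-outside: polygon(0 0, 100% 0, 40% 100%, 0 100%);
    }
    .right-edge {
        float: right;
        shape-outside: polygon(0 0, 100% 0, 100% 100%, 60% 100%);
    }

The two floats fill both sides of the container, and the text flows into whatever space their shapes leave free, so it appears to be “inside” a shape even though only shape-outside is used.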

Animation

animation-demo-screeenshot

This is an animated demo, so to appreciate it you’ll really need to take a look at the live version. It is an example of using an animated shape to draw the user’s attention to a particular message.  Of course one must use this approach with restraint, since an animated loop on a web page doesn’t just gently tug at the user’s attention. It drags at their attention like a tractor beam.

Performance

performance-demo-screenshot

Advertisements are intended to grab the user’s attention and a second or two of animation will do that. In this demo a series of transition motions have been strung together into a tiny performance that will temporarily get the reader’s attention. The highlight of the performance is – of course – the text snapping into the robot’s contour for the finale. Try and imagine a soundtrack that punctuates the action with some whirring and clanking noises, it’s even better that way.

By hmuller at May 13, 2014 05:38 PM

April 24, 2014

Adobe Web Platform Goes to the 2014 WebKit Contributors’ Meeting

Adobe Web Platform

Last week, Apple hosted the 2014 WebKit Contributors’ Meeting at their campus in Cupertino. As usual it was an unconference-style event, with session scheduling happening on the morning of the first day. While much of the session content was very specific to WebKit implementation, there were topics covered that are interesting to the wider web community. This post is a roundup of some of these topics from the sessions that Adobe Web Platform Team members attended.

CSS Custom Properties for Cascading Variables

Alan Stearns suggested a session on planning a new implementation of CSS Custom Properties for Cascading Variables. While implementations of this spec have been attempted in WebKit in the past, they never got past the experimental stage. Despite this, there is still much interest in implementing this feature. In addition, the current version of the spec has addressed many of the issues that WebKit contributors had previously expressed. We talked about a possible issue with using variables in custom property values, which Alan is investigating. More detail is available in the notes from the Custom Properties session.
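For context, the feature lets authors define their own cascading properties and reference them with var(), optionally with a fallback. A minimal example, per the spec:

    :root {
        --main-color: #06c;
    }
    h1 {
        color: var(--main-color);
        /* the second argument is a fallback used when the custom property is not set */
        border-color: var(--accent-color, currentColor);
    }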

CSS Regions

Andrei Bucur presented the current state of the CSS Regions implementation in WebKit. The presentation was well received and well attended. Notably, this was one of the few sessions with enough interest that it had a time slot all to itself.

While CSS Regions shipped last year in iOS 7 and Safari 6.1 and 7, the implementation in WebKit hasn’t been standing still. Andrei mentioned the following short list of changes in WebKit since the last Safari release:

  • correct painting of fragments and overflow
  • scrollable regions
  • accelerated content inside regions
  • position: fixed elements
  • the regionoversetchange event
  • better selection
  • better WebInspector integration
  • and more…

Andrei’s slides outlining the state of CSS Regions also contain a roadmap for the feature’s future in WebKit as well as a nice demo of the fix to fragment and overflow handling. If you are following the progress of CSS Regions in WebKit, the slides are definitely worth a look. (As of this writing, the Regions demo in the slides only works in Safari and WebKit Nightly.)

CSS Shapes

Zoltan Horvath, Bear Travis, and I covered the current state of CSS Shapes in WebKit. We are almost done implementing the functionality in Level 1 of the CSS Shapes Specification (which is itself a Candidate Recommendation, the last step before becoming an official W3C standard). The discussion in this session was very positive. We received good feedback on use cases for shape-outside and even talked a bit about the possibilities for when shape-inside is revisited as part of CSS Shapes Level 2. While I don’t have any slides or demos to share at the moment, we will soon be publishing a blog post to bring everyone up to date on the latest in CSS Shapes. So watch this space for more!

Subpixel Layout

This session was mostly about implementation. However, Zalan Bujtas drew an interesting distinction between subpixel layout and subpixel painting. Subpixel layout allows for better space utilization when laying out elements on the page, as boxes can be sized and positioned more precisely using fractional units. Subpixel painting allows for better utilization of high DPI displays by actually drawing elements on the screen using fractional CSS pixels (For example: on a 2x “Retina” display, half of a CSS pixel is one device pixel). Subpixel painting allows for much cleaner lines and smoother animations on high DPI displays when combined with subpixel layout. While subpixel layout is currently implemented in WebKit, subpixel painting is currently a work in progress.
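As a concrete illustration (my example, not from the session), subpixel layout and painting together are what make a hairline rule like this possible:

.hairline { border-bottom: 0.5px solid #888; } /* one device pixel on a 2x display; rounds to a whole CSS pixel otherwise */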

Web Inspector

The Web Inspector is full of shiny new features. The front-end continues to shift to a new design, while the back-end gets cleaned up to remove cruft. The architecture for custom visual property editors is in place and will hopefully enable quick and intuitive editing of gradients, transforms, and animations in the future. Other goodies include new breakpoint actions (like value logging), a redesigned timeline, and IndexedDB debugging support. The Web Inspector still has room for new features, and you can always check out the #webkit-inspector channel on freenode IRC for the latest and greatest.

Web Components

The Web Components set of features continues to gather interest from the browser community. Web Components is made up of four different features: Custom Elements, HTML Imports, Shadow DOM, and HTML Templates. The general gist of the talk was that the Web Components concepts are desirable, but there are concerns that the features’ complexity may make implementation difficult. The main concerns seemed to center around performance and encapsulation with Shadow DOM, and will hopefully be addressed with a prototype implementation of the feature (in the works). You can also take a look at the slides from the Web Components session.

CSS Grid Layout

The WebKit implementation of the CSS Grid Layout specification is relatively advanced. After learning in this session that the only way to test out Grid Layout in WebKit was to make a custom build with it enabled, session attendees concluded that it should be turned on by default in the WebKit Nightlies. So in the near future, experimenting with Grid Layout in WebKit should be as easy as installing a nightly build.

More?

As I mentioned earlier, this was just a high-level overview of a few of the topics at this year’s WebKit Contributors’ Meeting. Notes and slides for some of the topics not mentioned here are available on the 2014 WebKit Meeting page in the wiki. The WebKit project is always welcoming new contributors, so if you happen to see a topic on that wiki page that interests you, feel free to get in touch with the community and see how you can get involved.

Acknowledgements

This post would not have been possible without the notes and editing assistance of my colleagues on the Adobe Web Platform Team that attended the meeting along with me: Alan Stearns, Andrei Bucur, Bear Travis, and Zoltan Horvath.

By Bem Jones-Bey at April 24, 2014 05:23 PM

March 18, 2014

QtWebKit is no more, what now?

Gustavo Noronha

Driven by the technical choices of some of our early clients, QtWebKit was one of the first web engines Collabora worked on, building the initial support for NPAPI plugins and more. Since then we had kept in touch with the project from time to time when helping clients with specific issues, hardware or software integration, and particularly GStreamer-related work.

With Google forking Blink off WebKit, a decision had to be made by all vendors of browsers and platform APIs based on WebKit on whether to stay or follow Google instead. After quite a bit of consideration and prototyping, the Qt team decided to take the second option and build the QtWebEngine library to replace QtWebKit.

The main advantage of WebKit over Blink for engine vendors is the ability to implement custom platform support. That meant QtWebKit was able to use Qt graphics and networking APIs and other Qt technologies for all of the platform-integration needs. It also enjoyed the great flexibility of using GStreamer to implement HTML5 media. GStreamer brings hardware-acceleration capabilities, support for several media formats and the ability to expand that support without having to change the engine itself.

People who are using QtWebKit because it is GStreamer-powered will probably be better served by switching to one of the remaining GStreamer-based ports, such as WebKitGTK+. Those who don’t care about the underlying technologies but really need or want to use Qt APIs will be better served by porting to the new QtWebEngine.

It’s important to note though that QtWebEngine drops support for Android and iOS as well as several features that allowed tight integration with the Qt platform, such as DOM manipulation through the QWebElement APIs, making QObject instances available to web applications, and the ability to set the QNetworkAccessManager used for downloading resources, which allowed for fine-grained control of the requests and sharing of cookies and cache.

It might also make sense to go the Chromium/Blink route, either by using the Chrome Content API or by switching to one of its siblings (QtWebEngine included), if the goal is to make a browser that needs no integration with existing toolkits or environments. You will be limited to the formats supported by Chrome and the hardware platforms targeted by Google. Blink does not allow multiple implementations of the platform support layer, so you are stuck with what upstream decides to ship, or with a fork to maintain.

Chromium is a good alternative when Android itself is the main target, since it is the technology used to build Android’s main browser. The main advantage here is that you get to follow Chrome’s fast-paced development and get great support for the targeted hardware out of the box. If you need to support custom hardware or to be flexible about the kinds of media you would like to support, then WebKit still makes more sense in the long run, since that support can be maintained upstream.

At Collabora we’ve dealt with several WebKit ports over the years, and still actively maintain the custom WebKit Clutter port out of tree for clients. We have also done quite a bit of work on Chromium-powered projects. Some of the decisions you have to make are not easy and we believe we can help. Not sure what to do next? If you have that on your plate, get in touch!

By kov at March 18, 2014 07:44 PM

February 25, 2014

Improving your site’s visual details: CSS3 text-align-last

Adobe Web Platform

In this post, I want to give a status report regarding the text-align-last CSS3 property. If you are interested in taking control of the small visual details of your site with CSS, I encourage you to keep reading.

The problem

First, let’s talk about why we need this property. You’ve probably already seen many text blocks on pages that don’t quite seem visually correct, because the last line isn’t justified with the previous lines. Check out the example paragraph below:

Example of the CSS3 text-align-last property

In the first column, the last line isn’t justified. This is the expected behavior when you apply the ‘text-align: justify’ CSS property on a container. On the other hand, in the second column, the content is entirely justified, including the last line.

The solution

This magic is the ‘text-align-last’ CSS3 property, which is set to justify on the second container. The text-align-last property is part of the CSS Text Module Level 3 specification, which is currently a working draft. The text-align-last property describes how the last line of a block or a line right before a forced line break is aligned when ‘text-align’ is ‘justify’, which means you gain full control over the alignment of the last line of a block. The property allows several more options, which you can read about on WebPlatform.org docs, or the CSS Text Module Level 3 W3C Specification.
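In code, the second column boils down to something like this (a minimal sketch; depending on your build you may need a vendor prefix):

.fully-justified {
    text-align: justify;
    text-align-last: justify; /* justify the final line as well */
}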

A possible use case (Added April 2014)

After looking at the previous example (which focused on the functionality of the property), let’s move on to a more realistic use case. The feature is perfect for making our multi-line captions look better. Check out the centered and the justified image caption examples below.

(Image: a centered caption compared with a simply justified caption)

And now, compare them with a justified, multi-line caption, where the last line has been centered by text-align-last: center.
(Image: a justified caption whose last line is centered)

I think the proper alignment of the last line gives the caption a better overall look.
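The caption above can be achieved with a rule along these lines (the selector is mine):

figcaption {
    text-align: justify;
    text-align-last: center; /* center only the last line */
}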

Browser Support

I recently added rendering support for the property in WebKit (Safari) based on the latest specification. Dongwoo Joshua Im from Samsung added rendering support in Blink (Chrome). If you’d like to try it out in WebKit, you’ll need to make a custom developer build with the CSS3 text support build flag (--css3-text).

The property is already included in Blink’s developer nightlies by default, so after launching your latest Chrome Canary, you only need to enable ‘Enable experimental Web Platform features’ under chrome://flags, and enjoy the full control over your last lines.

Developer note

Please keep in mind that both the W3C specification and the implementations are under experimental status. I’ll keep blogging about the feature and let you know if anything changes, including when the feature ships for production use!

By Zoltan Horvath at February 25, 2014 04:58 PM

December 11, 2013

WebKitGTK+ hackfest 5.0 (2013)!

Gustavo Noronha

For the fifth year in a row the fearless WebKitGTK+ hackers have gathered in A Coruña to bring GNOME and the web closer. Igalia has organized and hosted it as usual, welcoming a record 30 people to its office. The GNOME Foundation has sponsored my trip, allowing me to fly the cool 18-seat propeller airplane from Lisbon to A Coruña, which is a nice adventure, and have pulpo a feira for dinner, which I simply love! That in addition to enjoying the company of so many great hackers.

Web with wider tabs and the new prefs dialog

Web with wider tabs and the new prefs dialog

The goals for the hackfest have been ambitious, as usual, but we made good headway on them. Web the browser (AKA Epiphany) has seen a ton of little improvements, with Carlos splitting the shell search provider into a separate binary, which allowed us to remove some hacks from the session management code in the browser. It also makes testing changes to Web more convenient again. Jon McCann has been pounding at Web’s UI making it more sleek, with tabs that expand to make better use of available horizontal space in the tab bar, and new dialogs for preferences, cookies and password handling. I have made my tiny contribution by making it discard tabs that were created just for what turned out to be a download. For this last day of the hackfest I plan to also fix an issue with text encoding detection and help track down a hang that happens upon page load.

Martin Robinson and Dan Winship hack

Martin Robinson and Dan Winship hack

Martin Robinson and myself have as usual dived into the more disgusting and wide-reaching maintainership tasks that we have lots of trouble pushing forward in our day-to-day lives. Porting our build system to CMake has been one of these long-term goals, not because we love CMake (we don’t) or because we hate autotools (we do), but because it should make people’s lives easier when adding new files to the build, and should also make our build less hacky and quicker – it is sad to see how slow our build can be when compared to something like Chromium, and we think a big part of the problem lies in how complex and dumb autotools and make can be. We have picked up a few of our old branches, brought them up to date and landed them, which now lets us build the main WebKit2GTK+ library through CMake in trunk. This is an important first step, but there’s plenty to do.

Hackers take advantage of the icecream network for faster builds

Hackers take advantage of the icecream network for faster builds

Under the hood, Dan Winship has been pushing HTTP2 support for libsoup forward, with a dead-tree version of the spec by his side. He is refactoring libsoup internals to accommodate the new code paths. Still on the HTTP front, I have been updating soup’s MIME type sniffing support to match the newest living specification, which adds several new types and a new security feature introduced by Internet Explorer and later adopted by other browsers. The huge task of preparing the ground for one process per tab (or other kinds of process separation; this will still be a topic for discussion for a while) has been pushed forward by several hackers, with Carlos Garcia and Andy Wingo leading the charge.

Jon and Guillaume battling code

Jon and Guillaume battling code

Other than that I have been putting in some more work on improving the integration of the new Web Inspector with WebKitGTK+. Carlos has reviewed the patch to allow attaching the inspector to the right side of the window, but we have decided to split it in two: one part providing the functionality, and one the API that will allow browsers to customize how that is done. There’s a lot of work to be done here; I plan to land at least this first patch during the hackfest. I have also fought one more battle in the never-ending User-Agent sniffing war, which it looks like we cannot win.

Hackers chillin' at A Coruña

Hackers chillin’ at A Coruña

I am very happy to be here for the fifth year in a row, and I hope we will be meeting here for many more years to come! Thanks a lot to Igalia for sponsoring and hosting the hackfest, and to the GNOME foundation for making it possible for me to attend! See you in 2014!


By kov at December 11, 2013 09:47 AM

August 27, 2013

HTML Alchemy – Combining CSS Shapes with CSS Regions

Adobe Web Platform

Note: Support for shape-inside is only available up to the following nightly builds: WebKit r166290 (2014-03-26); Chromium 260092 (2014-03-28).


I have been working on rendering for almost a year now. Now that I’ve landed the initial implementation of Shapes on Regions in both Blink and WebKit, I’m incredibly excited to talk a little bit about these features and how you can combine them.


Don’t know what CSS Regions and Shapes are? Start here!

The first ingredient in my HTML alchemy kitchen is CSS Regions. With CSS Regions, you can flow content into multiple styled containers, which gives you enormous creative power to make magazine style layouts. The second ingredient is CSS Shapes, which gives you the ability to wrap content inside or outside any shape. In this post I’ll talk about the “shape-inside” CSS property, which allows us to wrap content inside an arbitrary shape.

Let’s grab a bowl and mix these two features, CSS Regions and CSS Shapes, together to produce some really interesting layouts!

In the latest Chrome Canary and Safari WebKit Nightly, after enabling the required experimental features, you can flow content continuously through multiple kinds of shapes. This rocks! You can step out from the rectangular text flow world and break up text into multiple, non-rectangular shapes.

Demo

If you already have the latest Chrome Canary/Safari WebKit Nightly, you can just go ahead and try a simple example on codepen.io. If you are too lazy, or if you want to extend your mouse button life by saving a few button clicks, you can continue reading.


(Image: the “Lorem ipsum” story flowing through the demo’s four colorful regions)

In the picture above we see that the “Lorem ipsum” story flows through 4 different, colorful regions. There is a circle shape on each of the first two fixed size regions. Check out the code below to see how we apply the shape to the region. It’s pretty straightforward, right?
#region1, #region2 {
    -webkit-flow-from: flow;
    background-color: yellow;
    width: 200px;
    height: 200px;
    -webkit-shape-inside: circle(50%, 50%, 50%);
}
The content flows into the third (percentage-sized) region, which represents a heart (drawn by me, all rights reserved). I defined the heart’s coordinates in percentages, so the heart will stretch as you resize the window.
#region3 {
    -webkit-flow-from: flow;
    width: 50%;
    height: 400px;
    background-color: #EE99bb;
    -webkit-shape-inside: polygon(11.17% 10.25%,2.50% 30.56%,3.92% 55.34%,12.33% 68.87%,26.67% 82.62%,49.33% 101.25%,73.50% 76.82%,85.17% 65.63%,91.63% 55.51%,97.10% 31.32%,85.79% 10.21%,72.47% 5.35%,55.53% 14.12%,48.58% 27.88%,41.79% 13.72%,27.50% 5.57%);
}

The content that doesn’t fit in the first three regions flows into the fourth region. The fourth region (see the retro-blue background color) has its CSS width and height set to auto, so it grows to fit the remaining content.
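For completeness, the fourth region’s style is simply this (following the naming pattern above; the color value is illustrative):

#region4 {
    -webkit-flow-from: flow;
    width: auto;   /* grows to fit the remaining content */
    height: auto;
    background-color: #6699cc;
}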

Real world examples

After trying the demo and checking out the links above, I’m sure you’ll see the opportunities for using shape-inside with regions in your next design. If you have some thoughts on this topic, don’t hesitate to comment. Please keep in mind that these features are under development, and you might run into bugs. If you do, you should report them on WebKit’s Bugzilla for Safari or Chromium’s issue tracker for Chrome. Thanks for reading!

By Zoltan Horvath at August 27, 2013 04:00 PM

August 06, 2013

WebGL, at last!

Brent Fulgham

It's been a long time since I've written an update -- but my lack of blog posting is not an indication of a lack of progress in WebKit or the WinCairo port. Since I left my former employer (who *still* hasn't gotten around to updating the build machine I set up there), we've:

  • Migrated from Visual Studio 2005 to Visual Studio 2010 (and soon, VS2012)
  • Enabled New-run-webkit-tests
  • Updated the WinCairo Support Libraries to support 64-bit builds
  • Integrated a ton of cURL improvements and extensions thanks to the TideSDK guys 
  • and ...
... thanks to the hard work of Alex Christensen, brought up WebGL on the WinCairo port. This is a little exciting for me, because it marks the first time (that I can recall) that the WinCairo port actually gained a feature that was not already part of the core Apple Windows port.



The changes needed to see these circa-1992 graphics in all their three-dimensional glory are already landed in the WebKit tree.  You just need to:

  1. Enable the libEGL, libGLESv2, translator_common, translator_glsl, and translator_hlsl projects for the WinCairo build (they are currently turned off).
  2. Make the following change to Source/WTF/wtf/FeatureDefines.h: 

Brent Fulgham@WIN7-VM ~/WebKit/Source/WTF/wtf
$ svn diff
Index: FeatureDefines.h
===================================================================
--- FeatureDefines.h    (revision 153733)
+++ FeatureDefines.h    (working copy)
@@ -245,6 +245,13 @@
 #define ENABLE_VIEW_MODE_CSS_MEDIA 0
 #endif

+#define ENABLE_WEBGL 1
+#define WTF_USE_3D_GRAPHICS 1
+#define WTF_USE_OPENGL 1
+#define WTF_USE_OPENGL_ES_2 1
+#define WTF_USE_EGL 1
+#define WTF_USE_GRAPHICS_SURFACE 1
+
 #endif /* PLATFORM(WIN_CAIRO) */

 /* --------- EFL port (Unix) --------- */

Performance is a little ragged, but we hope to improve that in the near future.

We have plenty more plans for the future, including full 64-bit support (soon), and hopefully some improvements to the WinLauncher application to make it a little more useful.

As always, if you would like to help out, please get in touch!

By Brent Fulgham (noreply@blogger.com) at August 06, 2013 05:53 AM

May 15, 2013

CSS Level 3 Text Decoration on WebKit and Blink – status

Bruno de Oliveira Abinader

It’s been a while since I wrote the last post about progress on implementing CSS Level 3 Text Decoration features in WebKit. I’ve been involved with other projects but now I can finally resume the work in cooperation with my colleague from basysKom, Lamarque Souza. So far we have implemented:

  • text-decoration-line (link)
  • text-decoration-style (link)
  • text-decoration-color (link)
  • text-underline-position (link)

These properties are currently available under the -webkit- prefix on WebKit, guarded by a feature flag (CSS3_TEXT) which is enabled by default on both the EFL and GTK ports. On Blink, plans are to get these properties unprefixed and under a runtime flag, which can be activated by enabling the “Experimental WebKit Features” flag (renamed “Experimental Web Platform Features” in latest builds) in chrome://flags inside Google Chrome/Chromium. There are still some Skia-related issues to fix on Blink before proper dashed and dotted text decoration styles can be displayed. In the near future, we shall also have the text-decoration shorthand as specified in the CSS Level 3 specification.
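In practice, using the prefixed longhands in a build with the flag enabled looks something like this (the selector and values are mine):

.fancy-underline {
    -webkit-text-decoration-line: underline;
    -webkit-text-decoration-style: wavy;
    -webkit-text-decoration-color: crimson;
}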

See below a summary of things I plan to finish in the near future:

  • [webkit] Property text-decoration-line now accepts blink as valid value
  • [blink] Fix implementation of dashed and dotted styles on Skia
  • [blink] Fix an issue where previous Skia stroke styles were used when rendering paint decorations
  • [blink] Implement CSS3_TEXT as a runtime flag
  • [blink] Property text-decoration-line now accepts blink as valid value
  • [blink] Implement support for text-decoration shorthand
  • [webkit] Implement support for text-decoration shorthand

Note: Please do not confuse text-decoration‘s blink value with the Blink project :)

Stay tuned for further updates!

By Bruno Abinader at May 15, 2013 04:52 PM

March 27, 2013

Freeing the Floats of the Future From the Tyranny of the Rectangle

Adobe Web Platform

With modern web layout you can have your content laid out in whatever shape you want as long as it’s a rectangle. Designers in other media have long been able to have text and other content lay out inside and around arbitrarily complex shapes. The CSS Exclusions, CSS Shapes Level 1, and CSS Shapes Level 2 specifications aim to bring this capability to the web.

While these features aren’t widely available yet, implementation is progressing and it’s already possible to try out some of the features yourself. Internet Explorer 10 has an implementation of the exclusions processing model, so you can try out exclusions in IE 10 today.

At Adobe we have been focusing on implementing the shapes specification. We began with an implementation of shape-inside and now have a working implementation of the shape-outside property on floats. We have been building our implementation in WebKit, so the easiest way to try it out yourself is to download a copy of Chrome Canary. Once you have Canary, enable Experimental Web Platform Features and go wild!

What is shape-outside?

“Now hold up there,” you may be thinking, “I don’t even know what a shape-outside is and you want me to read this crazy incomprehensible specification thing to know what it is!?!”

Well you’ll be happy to know that it really isn’t that complex, especially in the case of floats. When an element is floated, inline content avoids the floated element. Content flows around the margin box of the element as defined by the CSS box model. The shape-outside CSS property allows you to tell the browser to use a specified shape instead of the margin box when wrapping content around the floating element.

CSS Exclusions

The current implementation allows for rectangles, rounded rectangles, circles, ellipses, and polygons. While this gives a lot of flexibility, eventually you will be able to use a SVG path or the alpha channel of an image to make it easier to create complex shapes.

How do I use it?

First, you need to get a copy of Chrome Canary and then enable Experimental Web Platform features. Once you have that, load up this post in Chrome Canary so that you can click on the images below to see a live example of the code. Even better, the examples are on Codepen, so you can and should play with them yourself and see what interesting things you can come up with.

Note that in this post and the examples I use the unprefixed shape-outside property. If you want to test these examples outside of my Codepen, you will need to use the prefixed -webkit-shape-outside property or use -prefix-free (which is a built-in option in Codepen).

We’ll start with an HTML document with some content and a float. Currently shape-outside only works on floating elements, so those are the ones to concentrate on. For example: (click on the image to see the code)

HTML without shape-outside

You can now add the shape-outside property to the style for your floats.

.float {
  shape-outside: circle(50%, 50%, 50%);
}

A circle is much more interesting than a standard rectangle, don’t you think? This circle is centered in the middle of the float and has a radius that is half the width of the float. The effect on the layout is something like this:

shape-outside circle

While percentages were used for this circle, you can use any CSS unit you like to specify the shape. All of the relative units are relative to the dimensions of the element where the shape-outside is specified.

Supported shapes

Circles are cool and all, but I promised you other shapes, and I will deliver. There are four types of shapes that are supported by the current shape-outside implementation: rectangle, circle, ellipse, and polygon.

rectangle

You have the ability to specify a shape-outside that is a fairly standard rectangle:

shape-outside: rectangle(x, y, width, height);

The x and y parameters specify the coordinates of the top-left corner of the rectangle. This coordinate is in relation to the top-left corner of the floating element’s content box. Because of the way this interacts with the rules of float positioning, setting these to anything other than 0 causes an effect that is similar to relatively positioning the float’s content. (Explaining this is beyond the scope of this post.)

The width and height parameters should be self-explanatory: they are the width and height of the resulting rectangle.

Where things get interesting is with the six-argument form of rectangle:

shape-outside: rectangle(x, y, width, height, rx, ry);

The first four arguments are the same as explained above, but the last two specify corner radii in the horizontal (rx) and vertical (ry) directions. This not only allows the creation of rounded rectangles, you can create circles and ellipses as well. (Just like with border-radius.)

Here’s an example of a rectangle, a rounded rectangle, a circle, and an ellipse using just rectangle syntax:

shape-outside rectangle
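If you can’t load the live example, the four calls look roughly like this (class names and radii are mine; rx = ry = 50% gives a circle on a square float and an ellipse on a non-square one):

.plain   { shape-outside: rectangle(0, 0, 100%, 100%); }
.rounded { shape-outside: rectangle(0, 0, 100%, 100%, 20px, 20px); }
.circle  { shape-outside: rectangle(0, 0, 100%, 100%, 50%, 50%); } /* square float */
.ellipse { shape-outside: rectangle(0, 0, 100%, 100%, 50%, 50%); } /* wider float */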

If you’re reading this in Chrome Canary with exclusions turned on, play around with this demo and see what other things you can do with the rectangles.

circle

I already showed you a simple circle demo and you’ll be happy to know that’s pretty much all there is to know about circles:

shape-outside: circle(cx, cy, radius);

The cx and cy parameters specify the coordinates of the center of the circle. In most situations you’ll want to put them at the center of your box. Just like with rectangles moving this around can be useful, but it behaves similarly to relatively positioning the float’s content with respect to the shape.

The radius parameter is the radius of the resulting circle.

In case you’d like to see it again, here’s what a circle looks like:

shape-outside circle

While it is possible to create circles with rounded rectangles as described above, having a dedicated circle shape is much more convenient.

ellipse

Sometimes, you need to squish your circles and that’s where the ellipse comes in handy.

shape-outside: ellipse(cx, cy, rx, ry);

Just like a circle, an ellipse has cx and cy to specify the coordinates of its center and you will likely want to have them at the center of your float. And just like all the previous shapes, changing these around will cause the float’s content to position relative to your shape.

The rx and ry parameters will look familiar from the rounded rectangle case and they are exactly what you would expect: the horizontal and vertical radii of the ellipse.

Ellipses can be used to create circles (rx = ry) and rounded rectangles can be used to create ellipses, but it’s best to use the shape that directly suits your purpose. It’s much easier to read and maintain that way.

Here’s an example of using an ellipse shape:

shape-outside ellipse

polygon

Now here’s where things get really interesting. The polygon `shape-outside` allows you to specify an arbitrary polygonal shape for your float:

shape-outside: polygon(x1 y1, x2 y2, ... , xn yn);

The parameters of the polygon are the x and y coordinates of each vertex of the shape. You can have as many vertices as you would like.

Here’s an example of a simple polygon:

shape-outside triangle
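A triangle like the one pictured can be written in a single line (the coordinates are mine):

.triangle { shape-outside: polygon(0 0, 100% 100%, 0 100%); }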

Feel free to play with this and see what happens if you create more interesting shapes!

Putting content in the float

The previous examples all had divs without any content just to make it easier to read and understand the code, but a big motivation for shape-outside is to wrap around other content. Interesting layouts often involve wrapping text around images as this final example shows:

shape-outside with images

As usual, you should take a look and play with the code for this example of text wrapping around floated images. This is just the beginning of the possibilities, as you can put a shape outside on any floating element with any content you want inside.

Next steps

We are still hard at work on fixing bugs in the current implementation and implementing the rest of the features in the CSS Shapes Level 1 specification. We welcome your feedback on what is already implemented and also on the spec itself. If you are interested in becoming part of the process, you can raise issues with the current WebKit implementation by filing bugs in the WebKit bugzilla. If you have issues with the spec, those are best raised on the www-style mailing list. And of course, you can leave your feedback as comments on this post.

I hope that you enjoy experimenting with shape-outside and the other features we are currently working on.

By Bem Jones-Bey at March 27, 2013 05:10 PM

August 01, 2012

WebKit CSS3 text-decoration properties (preview)

Bruno de Oliveira Abinader

WebKit currently supports the CSS Text Level 2.1 version of the text-decoration property (link). This version covers only the decoration line types (underline, overline, line-through and blink – the latter is not supported on WebKit).

The draft version of CSS Text Level 3 upgrades the text-decoration property (link) into a shorthand for 3 newly added properties, named text-decoration-line (link), text-decoration-style (link) and text-decoration-color (link), and also adds the text-decoration-skip (link) property.
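To illustrate the difference (the selector and values are mine; prefixes omitted for readability):

em { text-decoration: underline; }          /* CSS 2.1: line type only */
em { text-decoration: underline wavy red; } /* CSS3 shorthand: line, style and color */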

Among the other WebKit stuff I’ve been doing lately, this feature is one of the coolest ones I’m enjoying implementing. I’ve grabbed the task of implementing all of these CSS3 text-decoration* properties on WebKit, and the results are great so far!

As you can see below, these are the new text decoration styles (solid, double, dotted, dashed and wavy – the latter still requires platform support) available:

Text decoration style layout test results on Qt platform

And also specific text decoration colors can be set:

Text decoration color layout test results on Qt platform

These features (with the exception of the text-decoration-skip property) are already implemented in Firefox, which makes it easier to compare results between different web engines. It is important to notice that, since the CSS3 specification is still in development, all these properties have a -webkit- prefix (i.e. -webkit-text-decoration), so text-decoration itself still maintains the CSS 2.1 specification requirements. The patches are being reviewed and will hopefully land upstream soon!

By Bruno Abinader at August 01, 2012 09:58 PM

April 11, 2012

A guide for Qt5/WebKit2 development setup for Nokia N9 on Ubuntu Linux

Bruno de Oliveira Abinader

As part of my daily activities at basysKom on QtWebKit maintenance and development for Nokia devices, it is interesting to keep track of the latest developments around QtWebKit. There is currently a promising project of a Qt5/WebKit2-based browser called Snowshoe, mainly developed by my fellow friends from INdT, which is completely open source. This browser requires the latest Qt5 and QtWebKit binaries and thus requires us to have a functional build system environment. There is a guide available on WebKit’s wiki (link) which is very helpful but lacks some information about compilation issues found when following the setup steps. So I am basing this guide on that wiki page, and I hope that it gets updated soon :)

On this guide it is assumed the following:

  • All commands are issued on a Linux console. I am not aware of how this guide would work on other systems.
  • All commands are supposed to be issued inside the base directory, unless expressly stated otherwise (i.e. cd <QT5_DIR>).
  • You might want to check if you have git and rsync packages installed in your system.

1. Install Qt SDK

In order to build Qt5 and QtWebKit for Nokia N9, you need to set up a cross-compiler. Thankfully, Qt SDK already comes with a working setup. Please download the online installer from Qt Downloads section (link).

NOTE: The offline installer comes with an outdated version of the MADDE target, which can be updated by running the script below and choosing “Update components” when asked:

$ ~/QtSDK/SDKMaintenanceTool

2. Directory setup

It is suggested (and actually required by some build scripts) to have a base directory which holds Qt5, Qt Components and WebKit project sources. The suggested base directory can be created by running:

$ mkdir -p ~/swork

NOTE: You can actually choose another directory name, but so far it is required by some scripts to have at least a symbolic link pointing to <HOME_DIR>/swork.

3. Download convenience scripts

3.1. browser-scripts

$ git clone https://github.com/resworb/scripts.git browser-scripts

3.2. rsync-scripts

$ wget http://trac.webkit.org/attachment/wiki/SettingUpDevelopmentEnvironmentForN9/rsync-scripts.tar.gz?format=raw
$ tar xzf rsync-scripts.tar.gz

4. Download required sources

4.1. testfonts

$ git clone git://gitorious.org/qtwebkit/testfonts.git

4.2. Qt5, QtComponents and WebKit

The script below when successfully run will create ~/swork/qt5, ~/swork/qtcomponents and ~/swork/webkit directories:

$ browser-scripts/clone-sources.sh --no-ssh

NOTE: You can also manually download sources, but remember to stick with the directory names described above.

5. Pre-build hacks

5.1. Qt5 translations

Qt5 translations are not being properly handled by the cross-platform toolchain. This happens mainly because the lrelease application is called to generate Qt message files, but since it is an ARMEL binary your system is probably not capable of running it natively (unless you have a misc_runner kernel module properly set up, in which case you can safely skip this step). In this case, you can use lrelease from your system’s Qt binaries without any worries.

If you have a Scratchbox environment set, it is suggested for you to stop its service first:

$ sudo service scratchbox-core stop

Now you can manually generate Qt message files by running this:

$ cd ~/swork/qt5/qttranslations/translations
$ for file in `ls *ts`; do lrelease $file -qm `echo "$file" | sed 's/ts$/qm/'`; done

5.2. Disable jsondb-client tool

QtJsonDB module from Qt5 contains a tool called jsondb-client, which depends on libedit (not available on MADDE target). It is safe to disable its compilation for now:

$ sed -i 's/jsondb-client//' ~/swork/qt5/qtjsondb/tools/tools.pro

5.3. Create missing symbolic links

Unfortunately the Qt5 build system is not robust enough to support our cross-compilation environment, so some symbolic links are required on MADDE to avoid compilation errors (where <USER> is your system user name):

$ ln -s ~/swork/qt5/qtbase/include ~/QtSDK/Madde/sysroots/harmattan_sysroot_10.2011.34-1_slim/home/<USER>/swork/qt5/qtbase
$ ln -s ~/swork/qt5/qtbase/mkspecs ~/QtSDK/Madde/sysroots/harmattan_sysroot_10.2011.34-1_slim/home/<USER>/swork/qt5/mkspecs

6. Build sources

You can execute the script that will build all sources using cross-compilation setup:

$ browser-scripts/build-sources.sh --cross-compile

If everything went well, you now have the most up-to-date binaries for Qt5/WebKit2 development for the Nokia N9. Please have a look at WebKit’s wiki for more information about how to update sources after a previous build and how to keep files in sync with the device. The guide assumes PR1.1 firmware for the N9 device, which is already outdated, so I might come up next with updated instructions on how to safely sync files to your PR1.2-enabled device.

That’s all for now, I appreciate your comments and feedback!

By Bruno Abinader at April 11, 2012 07:18 AM

March 10, 2012

WebKitGTK+ Debian packaging repository changes

Gustavo Noronha

For a while now the git repository used for packaging WebKitGTK+ has been broken. Broken as in nobody was able to clone it. In addition to that, the packaging workflow had been changing over time, from a track-upstream-git/patches-applied one to an import-orig-only/patches-not-applied one.

After spending some more time trying to unbreak the repository for the third time, I decided it might be a good time for a clean-up. I created a new repository and imported all upstream versions for series 1.2.x (which is in squeeze), 1.6.x (unstable), and 1.7.x (experimental). I also imported packaging-related commits for those versions using git format-patch and black magic.

One of the good things about this move, which should make hacking the WebKitGTK+ Debian package more pleasant and accessible, can be seen here:


kov@goiaba ~/s/debian-webkit> du -sh webkit/.git webkit.old/.git
27M webkit/.git
1.6G webkit.old/.git

If you care about the old repository, it’s on git.debian.org still, named old-webkit.git. Enjoy!

By kov at March 10, 2012 05:32 PM

December 07, 2011

WebKitGTK+ hackfest \o/

Gustavo Noronha

It’s been a couple days since I returned from this year’s WebKitGTK+ hackfest in A Coruña, Spain. The weather was very nice, not too cold and not too rainy, we had great food, great drinks and I got to meet new people, and hang out with old friends, which is always great!

Hackfest black board, photo by Mario

I think this was a very productive hackfest, and as usual a very well organized one! Thanks to the GNOME Foundation for the travel sponsorship, to our friends at Igalia for doing an awesome job at making it happen, and to Collabora for sponsoring it and granting me the time to go there! We got a lot done, and although, as usual, our goals list had many items not crossed, we did cross a few very important ones. I took part in discussions about the new WebKit2 APIs, got to know the new design for GNOME’s Web application, which looks great, discussed Accelerated Compositing along with Joone, Alex, Nayan and Martin Robinson, hacked libsoup a bit to port the multipart/x-mixed-replace patch I wrote to the awesome gio-based infrastructure Dan Winship is building, and some random misc.

The biggest chunk of time, though, ended up being devoted to a very uninteresting (to outsiders, at least), but very important task: making it possible to more easily reproduce our test results. TL;DR? We made our bots’ and development builds use jhbuild to automatically install dependencies; if you’re using tarballs, don’t worry, your usual autogen/configure/make/make install have not been touched. Now to the more verbose version!

The need

Our three build slaves reporting a few failures

For a couple years now we have supported an increasingly complex and very demanding automated testing infrastructure. We have three buildbot slaves, one provided by Collabora (which I maintain), and two provided by Igalia (maintained by their WebKitGTK+ folks). Those bots build as many check-ins as possible with 3 different configurations: 32-bit release, 64-bit release, and 64-bit debug.

In addition to those, we have another bot called the EWS, or Early Warning System. There are two of those at this moment: one VM provided by Collabora and my desktop, provided by myself. These bots build every patch uploaded to the bugzilla, and report build failures or passes (you can see the green bubbles). They are very important to our development process because if the patch causes a build failure for our port people can often know that before landing, and try fixes by uploading them to bugzilla instead of doing additional commits. And people are usually very receptive to waiting for EWS output and acting on it, except when they take way too long. You can have an idea of what the life of an EWS bot looks like by looking at the recent status for the WebKitGTK+ bots.

Maintaining all of those bots is at times a rather daunting task. The tests require a very specific set of packages, fonts, themes and icons to always report the same size for objects in a render. Upgrades, for instance, had to be synchronized, and usually involve generating new baselines for a large number of tests. You can see in these instructions, for instance, how strict the environment requirements are – yes, we need specific versions of fonts, because they often cause layouts to change in size! At one point we had tests fail after a compiler upgrade, which made rounding act a bit differently!

So stability was a very important aspect of maintaining these bots. All of them have the same version of Debian, and most of the packages are pinned to the same version. On the other hand, and in direct contradiction to the stability requirement, we often require bleeding-edge versions of some libraries we rely on, such as libsoup. Since we started pushing WebKitGTK+ to be libsoup-only, its own progress has been pretty much driven by WebKitGTK+’s requirements, and Dan Winship has made it possible to make our soup backend much, much simpler and way more featureful. That meant, though, requiring very recent versions of soup.

To top it off, for anyone not running Debian testing and tracking the exact same versions of packages as the bots it was virtually impossible to get the tests to pass, which made it very difficult for even ourselves to make sure all patches were still passing before committing something. Wow, what a mess.

The explosion^Wsolution

So a few weeks back Martin Robinson came up with a proposed solution, which, as he says, is the “nuclear bomb” solution. We would have a jhbuild environment which would build and install all of the dependencies necessary for reproducing the test expectations the bots have. So over the first three days of the hackfest Martin and myself hacked away building scripts, buildmaster integration, a jhbuild configuration, a jhbuild modules file, setting up tarballs, and wiring it all up in a way that makes it convenient for the contributors to get along with. You’ll notice that our buildslaves now have a step just before compiling called “updated gtk dependencies” (gtk is the name we use for our port in the context of WebKit), which runs jhbuild to install any new dependencies or version bumps we added. You can also see that those instructions I mentioned above became a tad simpler.

It took us way more time than we thought for the dust to settle, but it eventually began to. The great thing of doing it during the hackfest was that we could find and fix issues with weird configurations on the spot! Oh, you build with AR_FLAGS=cruT and something doesn’t like it? OK, we fix it so that the jhbuild modules are not affected by that variable. Oh, turns out we missed a dependency, no problem, we add it to the modules file or install them on the bots, and then document the dependency. I set up a very clean chroot which we could use for trying out changes so as to not disrupt the tree too much for the other hackfest participants, and I think overall we did good.

The aftermath

By the time we were done, our colleagues who run other distributions such as Fedora were already able to get substantial improvements in the number of tests passing, and so did we! Also, the ability to seamlessly upgrade all the bots with a simple commit made it possible for us to very easily land a change that required a very recent (as in unreleased) version of soup, which made our networking backend way simpler. All that red looks great, doesn’t it? And we aren’t done yet, we’ll certainly be making more tweaks to this infrastructure to make it more transparent and more helpful to the users (contributors and other people interested in running the tests).

If you’ve been hit by the instability we caused, sorry about that, poke mrobinson or myself in the #webkitgtk+ IRC channel on FreeNode, and we’ll help you out or fix any issues. If you haven’t, we hope you enjoy all the goodness that a reproducible testing suite has to offer! That’s it for now, folks, I’ll have more to report on follow-up work started at the hackfest soon enough, hopefully =).

By kov at December 07, 2011 11:34 PM

November 29, 2011

Accelerated Compositing in webkit-clutter

Gustavo Noronha

For a while now my fellow Collaboran Joone Hur has been working on implementing the Accelerated Compositing infrastructure available in WebKit in webkit-clutter, so that we can use Clutter’s powers for compositing separate layers and performing animations. This work is being done by Collabora and is sponsored by BOSCH, whom I’d like to thank! What does all this mean, you ask? Let me tell you a bit about it.

The way animations usually work in WebKit is by repainting parts of the page every few milliseconds. What that means in technical terms is that an area of the page gets invalidated, and since the whole page is one big image, all of the pieces that are in that part of the page have to be repainted: the background, any divs, images, text that are at that part of the page.

What the accelerated compositing code paths allow is the creation of separate pieces to represent some of the layers, allowing the composition to happen on the GPU, removing the need to perform lots of cairo paint operations per second in many cases. So if we have a semi-transparent video moving around the page, we can have that video be a separate texture that is layered on top of the page, made transparent and animated by the GPU. In webkit-clutter’s case this is done by having separate actors for each of the layers.
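As a rough illustration (my sketch, not the actual demo code), a style like this is the kind of thing that earns an element its own layer – and in webkit-clutter’s case, its own actor:

video.overlay {
    opacity: 0.5;                                   /* semi-transparent */
    -webkit-transform: translate3d(100px, 50px, 0); /* composited on the GPU */
    -webkit-transition: -webkit-transform 2s;       /* animated without repainting the page */
}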

I have been looking at this code on and off, and recently joined Joone in the implementation of some of the pieces. The accelerated compositing infrastructure was originally built by Apple and, for that reason, works in a way that is very similar to Core Animation. The code is still a bit all over the place as we work on figuring out how to best translate the concepts into Clutter concepts and there are several bugs, but some cool demos are already possible! Below you have one of the CSS3 demos that were made by Apple to demo this new functionality running on our MxLauncher test browser.

You can also see that the non-Accelerated version is unable to represent the 3D space correctly. Also, can you guess which of the two MxLauncher instances is spending less CPU? ;) In this second video I show the debug borders being painted around the actors that were created to represent layers.

The code, should you like to peek or test is available in the ac2 branch of our webkit-clutter repository: http://gitorious.org/webkit-clutter/webkit-clutter/commits/ac2

We still have plenty of work to do, so expect to hear more about it. During our annual hackfest in A Coruña we plan to discuss how this work could be integrated also in the WebKitGTK+ port, perhaps by taking advantage of clutter-gtk, which would benefit both ports by sharing code and maintenance, and provide this great functionality to Epiphany users. Stay tuned!

By kov at November 29, 2011 05:55 PM

October 09, 2011

Tests Active

Brent Fulgham


Looking back over this blog, I see that it was around a year ago that I got the initial WinCairo buildbot running. I'm very pleased to announce that I have gotten ahold of a much more powerful machine, and am now able to run a full build and tests in slightly under an hour -- a huge improvement over the old hardware which took over two hours just to build the software!

This is a big step, because we can now track regressions and gauge correctness compared to the other platforms. Up to now, testing has largely consisted of periodic manual runs of the test suite, and a separate set of high-level tests run as part of a larger application. This was not ideal, because it was easy for low-level functions in WebKit that I rarely use to be broken and missed.

All is not perfect, of course. Although over 12,000 tests now run (successfully) with each build, that is effectively only two-thirds of the full test suite. Most of the tests I have disabled are due to small differences in the output layout. I'm trying to understand why these differences exist, but I suspect many of them simply reflect small differences in Cairo compared to the CoreGraphics rendering layer.

If any of you lurkers are interested in helping out, trying out some of the tests I have disabled and figuring out why they fail would be a huge help!

By Brent Fulgham (noreply@blogger.com) at October 09, 2011 02:43 AM

July 14, 2011

An Unseasonable Snowfall

Brent Fulgham

A year or two ago I ported the Cocoa "CallJS" application to MFC for use with WebKit. The only feedback I ever got on the topic was a complaint that it would not build under the Visual Studio Express software many people used.

After seeing another few requests on the webkit-help mailing list for information on calling JavaScript from C++ (and vice-versa), I decided to dust off the old program and convert it to pure WINAPI calls so that VS Express would work with it.

Since my beloved Layered Window patches finally landed in WebKit, I also incorporated a transparent WebKit view floating over the main application window. Because I suck at art, I appropriated the Let It Snow animation example to give the transparent layer something to do.

Want to see what it looks like?

By Brent Fulgham (noreply@blogger.com) at July 14, 2011 06:34 PM

July 10, 2011

Updated WebKit SDK (@r89864)

Brent Fulgham

I have updated the WebKit SDK to correspond to SVN revision r89864.

Major changes in this revision:
* JavaScript engine improvements.
* Rendering improvements.
* New 'Transparent Web View' support.
* General performance and memory use improvements.

This ZIP file also contains updated versions of Zlib, OpenSSL, cURL, and OpenCFLite.

Note that I have stopped statically linking Cairo; I'm starting to integrate some more recent Cairo updates (working towards some new rendering features), and wanted to be able to update it incrementally as changes are made.

This package contains the same Cairo library (in DLL form) as used in previous versions.

As usual, please let me know if you encounter any problems with this build.

[Update] I forgot to include zlib1.dll! Fixed in the revised zip file.

By Brent Fulgham (noreply@blogger.com) at July 10, 2011 04:24 AM

July 05, 2011

WinCairoRequirements Sources Archive

Brent Fulgham

I've posted the 80 MB source archive of the requirements needed to build the WinCairo port of WebKit.

Note that you do NOT need these sources unless you plan on building them yourself or wish to archive the source code for these modules. The binaries are always present in the WinCairoRequirements.zip file, which is downloaded and unzipped to the proper place when you execute the update-webkit --wincairo command.

By Brent Fulgham (noreply@blogger.com) at July 05, 2011 07:39 PM

June 28, 2011

Towards a Simpler WinCairo Build

Brent Fulgham


For the past couple of years, anyone interested in trying to build the WinCairo port of WebKit had to track down a number of support libraries, place them in their development environment's include (and link search) paths, and then cross their fingers and hope everything built.

To make things a little easier, I wrapped up the libraries and headers I use for building and posted them as a zip file on my .Mac account. This made things a little easier, but you still had to figure out where to drop the files and figure out if I had secretly updated my 'requirements.zip' file without telling anyone. Not ideal.

A couple of days ago, while trolling through the open review queue, I ran across a Bug filed by Carl Lobo, which automated the task of downloading the requirements file when running build-webkit --wincairo. This was a huge improvement!

Today, I hijacked Carl's changes and railroaded the patch through the review process (making a few modifications along the way):

  • I renamed my requirements file WinCairoRequirements.zip.

  • I added a timestamp file, so that build-webkit --wincairo can check to see if the file changed, and download it if necessary.

  • I propagated Carl's changes to update-webkit, so that now by adding the --wincairo argument it will update the WinCairoRequirements file.


I'm really excited about this update. If you've been wanting to try out the WinCairo port of WebKit, this would be a great time to try it out. I'd love to hear your experiences!

By Brent Fulgham (noreply@blogger.com) at June 28, 2011 04:42 AM

June 14, 2011

Benchmarking Javascript engines for EFL

Lucas De Marchi

The Enlightenment Foundation Libraries have several bindings for other languages in order to ease the creation of end-user applications, speeding up their development. Among them, there’s a binding for Javascript using the Spidermonkey engine. The questions are: is it fast enough? Does it slow down your application? Is Spidermonkey the best JS engine to use?

To answer these questions Gustavo Barbieri created some C, JS and Python benchmarks to compare the performance of EFL using each of these languages. The JS benchmarks were using Spidermonkey as the engine since elixir was already done for EFL. I then created new bindings (with only the necessary functions) to also compare other well-known JS engines: V8 from Google and JSC (or Nitro) from WebKit.

Libraries setup

For all benchmarks EFL revision 58186 was used. Following the setup of each engine:

  • Spidermonkey: I’ve used version 1.8.1-rc1 with the already available bindings on EFL repository, elixir;
  • V8: version 3.2.5.1, using a simple binding I created for EFL. I named this binding ev8;
  • JSC: WebKit’s sources are needed to compile JSC. I’ve used revision 83063. Compiling with CMake, I chose the EFL port and enabled the option SHARED_CORE in order to have a separate library for Javascript;

Benchmarks

Startup time: This benchmark measures the startup time by executing a simple application that imports evas, ecore, ecore-evas and edje, brings in some symbols and then iterates the main loop once before exiting. I measured the startup time for both hot and cold cache cases. In the former the application is executed several times in sequence; the latter includes a call to drop all caches so we have to load the library again from disk.

Runtime – Stress: This benchmark executes as many frames per second as possible of a render-intensive operation. The application is not so heavy, but it does some loops, math and interacts with EFL. Usually a common application would do far fewer operations every frame because many operations are done in EFL itself, in C, such as list scrolling that is done entirely in elm_genlist. This benchmark is made of 4 phases:

  • Phase 0 (P0): Un-scaled blend of the same image 16 times;
  • Phase 1 (P1): Same as P0, with additional 50% alpha;
  • Phase 2 (P2): Same as P0, with additional red coloring;
  • Phase 3 (P3): Same as P0, with additional 50% alpha and red coloring;

The C and Elixir versions are available in the EFL repository.

Runtime – animation: usually an application doesn’t need “as many FPS as possible”, but would instead like to limit itself to a certain amount of frames per second. E.g.: the iPhone’s browser tries to keep a constant 60 FPS. This is the value I used in this benchmark. The same application as in the previous benchmark is executed, but it tries to keep always the same frame rate.

Results

The first computer I used to test these benchmarks on was my laptop. It’s a Dell Vostro 1320, Intel Core 2 Duo with 4 GB of RAM and a standard 5400 RPM disk. The results are below.

Benchmarks on Dell 1320 laptop

First thing to notice is there are no results for the “Runtime – animation” benchmark. This is because all the engines kept a constant 60fps and hence there were no interesting results to show. The first benchmark shows that V8’s startup time is the shortest one when we have to load the application and libraries from disk. JSC was the slowest and Spidermonkey was in between.

With hot caches, however, we have a completely different scenario, with JSC being almost as fast as the native C application. V8 follows with a slightly larger delay, and Spidermonkey is the slowest one.

The runtime-stress benchmark shows that all the engines perform well when there’s some considerable load in the application, i.e. leaving P0 out of this scenario. JSC was always at the same speed as native code; Spidermonkey and V8 had an impact only when considering P0 alone.


The next computer used to execute these benchmarks was a Pandaboard, so we can see how well the engines perform on an embedded platform. The Pandaboard has an ARM Cortex-A9 processor with 1 GB of RAM, and the partition containing the benchmarks is on an external flash storage drive. The results for each benchmark follow:


Benchmarks on Pandaboard

Once again, runtime-animation is not shown since it had the same results for all engines. For the startup tests, Spidermonkey was now much faster than the others, followed by V8 and JSC, in both the hot and cold cache cases. In the runtime-stress benchmark all the engines performed well, as on the first computer, but now JSC was the clear winner.


There are several points to consider when choosing an engine to use as a binding for a library such as EFL. The raw performance and startup time seem to be very near those achieved with native code. Recently there were some discussions on the EFL mailing list regarding which engine to choose, so I think it is good to share the numbers above. It’s also important to note that these bindings take a similar approach to elixir, mapping each function call in Javascript to the corresponding native function. I did this to be fair in the comparison among them, but depending on the use case it would be good to have a JS binding similar to what the Python one does, embedding the function calls in real Python objects.

By Lucas De Marchi at June 14, 2011 05:25 PM

April 29, 2011

Collection of WebKit ports

Holger Freyther

WebKit is a very successful project, in many ways. The code produced seems to be very fast, the code is nice to work on, the people are great, and the parties involved collaborate with each other in the interest of the project. The project is also very successful in the mobile/smartphone space: all the major smartphone platforms but Windows 7 are using WebKit. This all looks great, a big success, but there is one thing that stands out.

Of all the smartphone platforms, not one has fully upstreamed its port. There might be many reasons for that, and I think the most commonly heard one is the time needed to get it upstreamed. It is especially difficult in a field that is moving as fast as the mobile industry. And then again there is absolutely no legal obligation to work upstream.

For most of today I collected the ports I am aware of, put them into one git repository, tried to find the points where they branched, and rebased their changes. The goal is to make it easier to find interesting things and move them back to upstream. One can find the combined git tree with the tags here. I started with WebOS, moved to iOS, then to Bada and stopped at Android, as I would have had to pick the source code for each Android release for each phone from each vendor. I think I will just be happy with the Android git tree for now. At this point I would like to share some of my observations, in the order I did the import.

Palm


Palm's release process is manual. In the last two releases they called the file .tgz but forgot to gzip it, and in 2.0.0 the tarball name was in camel case. The thing that is very nice about Palm is that they provide their base and their changes (as a patch) separately. From looking at the 2.1.0 release it looks like they want to implement complex font rendering for the desktop version. Earlier versions (maybe it is still the case) lacked support for animated GIFs.

iOS


Apple's release process seems to be very structured. The source can be downloaded here. Worth noting is that the release tarball contains some implementations in WebCore only as .o files, and that Apple stopped releasing the WebKit source code beginning with iOS 4.3.0.

Bada


This port is probably not known to many. The release process seems to be manual as well, the names of directories changed a lot between releases, they come with a WML Script engine, and they ship something they should not ship.

I really hope that this combined tree is useful for porters that want to see the tricks used in the various ports and don't want to spend the time looking for each port separately.

By zecke (noreply@blogger.com) at April 29, 2011 07:20 PM

February 13, 2011

How to make the GNU Smalltalk Interpreter slower

Holger Freyther

This is another post about a modern Linux based performance measurement utility. It is called perf, it is included in the Linux kernel sources and it entered the kernel in v2.6.31-rc1. In many ways it is obsoleting OProfile; in fact, for many architectures OProfile is just a wrapper around the perf support in the kernel. perf comes with a few nice applications: perf top provides statistics about which symbols in user and in kernel space are called, perf record records an application (or starts an application to record it) and perf report browses the recording with a very simple CLI utility. There are also tools to bundle the recording and the application into an archive, and a diff utility.

For the last year I have been playing a lot with GNU Smalltalk, and someone posted the results of a very simplistic VM benchmark run across many different Smalltalk implementations. In one of the benchmarks GNU Smalltalk scores last among the interpreters, and I wanted to understand why it is slower. In many ways the JavaScriptCore interpreter is a lot like the GNU Smalltalk one: a simple direct-threaded bytecode interpreter that uses computed goto (it is even compiled with -fno-gcse as indicated by the online help, not that it changed anything for JSC) and heavily inlines many functions.

There are also some differences: the GNU Smalltalk implementation is a lot older and written in C. The first notable difference is that it is a stack machine and not register based, with global pointers for the SP and the IP. Some magic makes sure that in the hot loop the IP/SP are 'local' in a register and, depending on the available registers, the current argument is kept in one too. The interpreter definition is in a special file format but is mostly similar to how Interpreter::privateExecute looks. The global state mostly comes from the fact that it needs to support switching processes: some event during the run might require access to the IP to store it and resume the old process later. But in general the implementation is already optimized, there is little low-hanging fruit, and most experiments result in a slowdown.
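For readers unfamiliar with that dispatch style, here is a toy direct-threaded loop using GCC's computed goto (my own sketch, not code from either project):

    /* Toy direct-threaded interpreter: each opcode handler jumps straight
     * to the next handler via computed goto, avoiding a central switch. */
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_HALT };

    static int run(const unsigned char *ip)
    {
        static const void *dispatch[] = { &&op_push1, &&op_add, &&op_halt };
        int stack[16], *sp = stack; /* tiny operand stack */

        goto *dispatch[*ip++];

    op_push1:
        *sp++ = 1;
        goto *dispatch[*ip++];
    op_add:
        sp--;
        sp[-1] += sp[0];
        goto *dispatch[*ip++];
    op_halt:
        return sp[-1];
    }

    int main(void)
    {
        const unsigned char code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
        printf("%d\n", run(code)); /* prints 2 */
        return 0;
    }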

The two important things are again: having a stable benchmark, and having a tool that helps you know where to look. In my case the important tools are perf stat, perf record, perf report and perf annotate. I have put a copy of the output at the end of this blog post. The stat utility provides the number of instructions executed, branches, branch misses (e.g. badly predicted ones), and L1/L2 cache hits and cache misses.

The stable benchmark helps me judge if a change is good, bad or neutral for performance within the margin of error of the test. E.g. if I attempt to reduce the code size, the number of instructions executed should decrease; if I start putting __builtin_expect.. into my code, the number of branch misses should go down as well. The other useful utility is perf report, which allows one to browse the recorded data. This can help to identify the methods one wants to start optimizing; it also allows annotating these functions inside the simple TUI interface, but does not support searching in it.

Because the codebase is already highly optimized, any of my attempts should either decrease the code size (and the pressure on the i-cache) or the data size (d-cache), remove stores or loads from memory (e.g. by reordering instructions), or fix branch predictions. The sad truth is that most of my changes were either slowdowns or neutral to the performance, and it is really important to undo these changes and not have false pride (unless the change was also a code cleanup or such).
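As an illustration of the branch-hint experiments mentioned below, here is a small sketch in the style of the kernel's likely/unlikely macros (my own example, not GNU Smalltalk code):

    /* Hypothetical use of __builtin_expect: tell GCC which way a branch
     * usually goes, so the expected path becomes the fall-through. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int interpret_opcode(int opcode)
    {
        if (unlikely(opcode < 0))
            return -1;          /* error path, assumed to be rare */
        return opcode * 2;      /* hot path, stand-in for real work */
    }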

So after about 14 hours of toying with it, the speed-ups I managed to make come from inlining a method that unwinds a context (callframe), reordering some compares on the GC path and disabling the __builtin_expect branch hints, as they were mostly wrong (something the kernel people found to be true in 2010 as well). I will just try harder, or try to work on the optimizer, or attempt something more radical...



$ perf stat gst -f Bench.st
219037433 bytecodes/sec; 6025895 sends/sec

Performance counter stats for 'gst -f Bench.st':

   17280.101683  task-clock-msecs   #    0.969 CPUs
           2076  context-switches   #    0.000 M/sec
            123  CPU-migrations     #    0.000 M/sec
           3925  page-faults        #    0.000 M/sec
    22215005506  cycles             # 1285.583 M/sec  (scaled from 70.02%)
    40593277297  instructions       #    1.827 IPC    (scaled from 80.00%)
     5063469832  branches           #  293.023 M/sec  (scaled from 79.98%)
       70691940  branch-misses      #    1.396 %      (scaled from 79.98%)
       27844326  cache-references   #    1.611 M/sec  (scaled from 20.02%)
         134229  cache-misses       #    0.008 M/sec  (scaled from 20.03%)

   17.838888599  seconds time elapsed


PS: The perf support probably works best on Intel based platforms, and the biggest other problem is that perf annotate has some issues when the code is included from other C files.

By zecke (noreply@blogger.com) at February 13, 2011 08:56 PM

January 17, 2011

Using systemtap userspace tracing...

Holger Freyther

At the 27C3 we were running a GSM network, and during the preparation I noticed a strange performance problem coming from the database library we are using. I filled our database with some dummy data, created a file with the queries we normally run, and executed time cat queries | sqlite3 file as a mini benchmark. I also hacked this code into our main routine and ran it with time as well. For some reason the code running through the database library was five times slower.

I was a bit puzzled and decided to use systemtap to explore this, both to build a hypothesis and to have the tools to test it. I wanted to find out if it is slow because our database library is doing some heavy work in the implementation, or because we execute a lot more queries behind the scenes. I created the probe below:


probe process("/usr/lib/libsqlite3.so.0.8.6").function("sqlite3_get_table")
{
    a = user_string($zSql);
    printf("sqlite3_get_table called '%s'\n", a);
}


This probe is executed whenever the sqlite3_get_table function of the mentioned library is called. $zSql is a parameter passed to the sqlite3_get_table function and contains the query to be executed. I convert the pointer to a local string and can then print it. Using this simple probe helped me see which queries were executed by the database library, which led to an easy optimisation.

In general it could be very useful to build a set of probes (I think such a set is called a tapset) that check for API misuse, e.g. calling functions with certain parameters where something else might be better. E.g. in GLib, use truncate instead of assigning "" to a GString, or check for calls to QString::fromUtf16 coming from Qt code itself. On second thought, this might be better as a GCC plugin, or both.

By zecke (noreply@blogger.com) at January 17, 2011 12:41 PM

December 17, 2010

In the name of performance

Holger Freyther

I tend to see people doing weird things and then claiming that the change improves performance. This can be re-ordering instructions to help the compiler, attempting to use multiple cores of the system, or writing a memfill in assembly. On the one hand people can be right and the change does make things faster; on the other hand they could be using assembly to make things look very complicated and justify their pay, and you might feel awkward questioning whether it makes any sense.

In the last couple of weeks I have stumbled on some of those things. For some reason I found this bug report about GLIBC changing the memcpy routine for SSE and breaking the Flash plugin (because it uses memcpy in the wrong way). The breakage was justified on the grounds that the new memcpy is optimized and faster. As Linus points out with his benchmark, the performance improvement is mostly just wishful thinking.
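The misuse boils down to calling memcpy on overlapping buffers, which is undefined behaviour and merely happened to work until the SSE version started copying backwards. A small sketch (my own example, not the Flash code):

    /* memcpy on overlapping regions is undefined behaviour; memmove is the
     * correct call for this pattern. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "abcdef";

        /* shift the buffer left by one: source and destination overlap */
        memmove(buf, buf + 1, strlen(buf + 1) + 1); /* well-defined */
        /* memcpy(buf, buf + 1, strlen(buf + 1) + 1) would be undefined */

        printf("%s\n", buf); /* prints "bcdef" */
        return 0;
    }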

Another case was someone providing MIPS optimized pixman code to speed-up all drawing which turned out to be wishful thinking as well...

The conclusion: if someone claims that things are faster with their patch, do not simply trust them. Make sure they refer to their benchmark and provide numbers from before and after, and maybe even try to run it yourself. If they cannot provide this, you should wonder how they measured the speed-up! There should be no place for wishful thinking in benchmarking. This is one of the areas where Apple's WebKit team constantly impresses me.

By zecke (noreply@blogger.com) at December 17, 2010 01:48 PM

December 16, 2010

Benchmarking QtWebKit-V8 on Linux

University of Szeged

For some time it has been possible to build and run QtWebKit on Linux using Google's V8 JavaScript engine instead of the default JavaScriptCore. I thought it would be good to see some numbers comparing the runtime performance of the two engines in the same environment and also measuring the performance of the browser bindings.

read more

By andras.becsi at December 16, 2010 01:04 PM

October 23, 2010

Easily embedding WebKit into your EFL application

Lucas De Marchi

This is the first of a series of posts that I’m planning to write using basic examples in EFL, the Enlightenment Foundation Libraries. You may have heard that EFL is reaching its 1.0 release. Instead of starting from the very beginning with the basic functions of these libraries, I decided to go the opposite way, showing the fun stuff that is possible to do. Since I’m also a WebKit developer, let’s put the best of both projects together and have a basic window rendering a webpage.

Before starting off, just some remarks:

  1. I’m using here the basic EFL + WebKit-EFL (sometimes called ewebkit). Developing an EFL application can be much simpler, particularly if you use an additional library with pre-made widgets like Elementary. However, it’s good to know how the underlying stuff works, so I’m providing this example.
  2. This could well have been the last post in a series about EFL, since it uses at least three libraries. Don’t be afraid if you don’t understand what a certain function is for, or if you can’t get all of EFL and WebKit running right now. Use the comment section below and I’ll do my best to help you.

Getting EFL and WebKit

In order to be able to compile the example here, you will need to compile two libraries from source: EFL and WebKit. For both libraries, you can either get the latest version from svn or use the latest snapshots provided.

  • EFL:

Grab a snapshot from the download page. How to check out the latest version from svn is detailed here, as well as some instructions on how to compile it.

  • WebKit-EFL:

A very detailed explanation on how to get WebKit-EFL up and running is available on trac. Recently, though, WebKit-EFL started to be released too. It’s not detailed in the wiki yet, but you can grab a snapshot instead of checking out from svn.

hellobrowser!

In the spirit of “hello world” examples, our goal here is to make a window showing a webpage rendered by WebKit. For the sake of simplicity, we will use a default start page and put a WebKit-EFL “widget” covering the entire window. See the screenshot below:

hellobrowser - WebKit + EFL

The code for this example is available here. Pay attention to a comment at the beginning of this file that explains how to compile it:

gcc -o hellobrowser hellobrowser.c \
     -DEWK_DATADIR="\"$(pkg-config --variable=datadir ewebkit)\"" \
     $(pkg-config --cflags --libs ecore ecore-evas evas ewebkit)

The things worth noting here are the dependencies and a variable. We depend directly on ecore and evas from EFL and on WebKit. We define a variable, EWK_DATADIR, using pkg-config, so our browser can use the default theme for web widgets defined in WebKit. Ecore handles events like mouse and keyboard input, timers, etc., whilst Evas is the library responsible for drawing. In a later post I’ll detail them a bit more. For now, you can read more about them on their official site.

The main function is really simple. Let’s divide it by pieces:

    // Init all EFL stuff we use
    evas_init();
    ecore_init();
    ecore_evas_init();
    ewk_init();

Before you use a library from EFL, remember to initialize it. All of them use their own namespace, so it’s easy to know which library you have to initialize: for example, if you call a function starting with “ecore_”, you know you first have to call “ecore_init()”. The last initialization function is WebKit’s, which uses the “ewk_” namespace.

    window = ecore_evas_new(NULL, 0, 0, 800, 600, NULL);
    if (!window) {
        fprintf(stderr, "something went wrong... :(\n");
        return 1;
    }

Ecore-evas is then used to create a new window of size 800×600. The other options are not relevant for an introduction to the libraries; you can find the complete documentation here.

    // Get the canvas off just-created window
    evas = ecore_evas_get(window);

From the Ecore_Evas object we just created, we grab a pointer to the evas, which is the space in which we can draw by adding Evas_Objects. Basically, an Evas_Object is an object that you draw somewhere, i.e. in the evas. We want to add only one object to our window: the one where WebKit will render the webpages. Then we have to ask WebKit to create this object:

    // Add a View object into this canvas. A View object is where WebKit will
    // render stuff.
    browser = ewk_view_single_add(evas);

Below I demonstrate a few Evas functions that you can use to manipulate any Evas_Object. Here we are manipulating the just-created WebKit object: moving it to the desired position, resizing it to 780x580px and then telling Evas to show it. Finally, we tell Evas to show the window we created, too. This way we have a window with a WebKit object inside and a little border.

    // Make a 10px border, resize and show
    evas_object_move(browser, 10, 10);
    evas_object_resize(browser, 780, 580);
    evas_object_show(browser);
    ecore_evas_show(window);

We need to set up a few more things before we have a working application. The first is to give focus to the Evas_Object we are interested in, so it will receive keyboard events. Then we connect a function that will be called when the window is closed, so we can properly exit our application.

    // Focus it so it will receive pressed keys
    evas_object_focus_set(browser, 1);
 
    // Add a callback so clicks on "X" on top of window will call
    // main_signal_exit() function
    ecore_event_handler_add(ECORE_EVENT_SIGNAL_EXIT, main_signal_exit, window);

After this, we are ready to show our application, so we start the mainloop. This function will only return when the application is closed:

    ecore_main_loop_begin();

The function called when the application is closed just tells Ecore to quit the main loop, so the function above returns and the application can shut down. See its implementation below:

static Eina_Bool
main_signal_exit(void *data, int ev_type, void *ev)
{
    ecore_evas_free(data);
    ecore_main_loop_quit();
    return EINA_TRUE;
}

Before the application exits, we shut down all the libraries that were initialized, in the opposite order:

    // Destroy all the stuff we have used
    ewk_shutdown();
    ecore_evas_shutdown();
    ecore_shutdown();
    evas_shutdown();

This is a basic working browser with which you can navigate through pages, but it has no entry to set the current URL, no “go back” and “go forward” buttons, etc. All you have to do is add more Evas_Objects to your Evas and connect them to the object we just created, as in the sketch below. For a still basic example, but with more stuff implemented, refer to the EWebLauncher that we ship with the WebKit source code. You can see it in the “WebKitTools/EWebLauncher/” folder or online at WebKit’s trac. Eve is another browser, with a lot more features, that uses Elementary in addition to EFL and WebKit. See a blog post about it with some nice pictures.
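As a sketch of how such controls hook into the view, the callbacks of a URL entry and of back/forward buttons come down to calls like the following (assuming the ewk_view API of this era; treat the exact names as illustrative):

    /* Hypothetical helpers wiring navigation onto the "browser" object
     * created above. */
    static void open_url(Evas_Object *browser, const char *url)
    {
        ewk_view_uri_set(browser, url);
    }

    static void go_back(Evas_Object *browser)
    {
        if (ewk_view_back_possible(browser))
            ewk_view_back(browser);
    }

    static void go_forward(Evas_Object *browser)
    {
        if (ewk_view_forward_possible(browser))
            ewk_view_forward(browser);
    }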

Now, let’s do something funny with our browser. With a few more lines of code you can turn your browser upside down. Not really useful, but it’s funny. All you have to do is rotate the Evas_Object WebKit is rendering on. This is implemented by the following function:

// Rotate an evas object by 180 degrees
static void
_rotate_obj(Evas_Object *obj)
{
    Evas_Map *map = evas_map_new(4);
 
    evas_map_util_points_populate_from_object(map, obj);
    evas_map_util_rotate(map, 180.0, 400, 300);
    evas_map_alpha_set(map, 0);
    evas_map_smooth_set(map, 1);
    evas_object_map_set(obj, map);
    evas_object_map_enable_set(obj, 1);
 
    evas_map_free(map);
}

See the screenshot below and get the complete source code.

EFL + WebKit doing Politreco upside down

By Lucas De Marchi at October 23, 2010 09:53 PM

October 02, 2010

Deploying WebKit, common issues

Holger Freyther

From my exposure to people deploying QtWebKit or WebKit/GTK+, there are some issues that keep re-appearing, and I would like to discuss them here.

  • Weird compile error in JavaScript?
  • It is most likely failing in JavaScriptCore, as that is the first thing built. Most likely the person that provided you with the toolchain has placed a config.h into it. There are some resolutions: one would be to remove the config.h from the toolchain (many things will break), another to use -isystem instead of -I for system includes.
    The best way to find out if you suffer from this problem is to use -E instead of -c to only pre-process the code and see where the various includes are coming from. It is a strategy that is known to work very well.

  • No pages are loaded.
  • Most likely you do not have a DNS server set, or no networking, or the system your board is connected to is not forwarding the data. Make sure you can ping a website that is supposed to work, e.g. ping www.yahoo.com; the next thing would be to use nc to execute a simple HTTP/1.1 GET on the site and see if it is working. In most cases you simply lack network connectivity.

  • HTTPS does not work
  • It might be either an issue with Qt or an issue with your system time. SSL certificates have at least two dates (expiration and creation), and if your system time is after the expiration or before the creation you will have issues. The easiest fix is to add ntpd to your root filesystem to make sure you have the right time.

    The possible issue with Qt is a bit more complex. You can build Qt without OpenSSL support, make it link to OpenSSL, or make it dlopen OpenSSL at runtime. If SSL does not work, it is most likely that you have either built it without SSL support, or with runtime support but have failed to install the OpenSSL library.

    Depending on your skills it might be best to go back to ./configure and make Qt link to OpenSSL to avoid the runtime issue. strings is a very good tool to find out if your libQtNetwork.so contains SSL support; together with objdump -x and a search for NEEDED entries you will find out which configuration you have.

  • Local pages are not loaded
  • This is a pretty common issue for WebKit/GTK+. In WebKit/GTK+ we use GIO for local files, and to determine the filetype it uses the freedesktop.org shared-mime-info. Make sure you have that installed.

  • The page only displays blank
  • This is another issue that comes up from time to time. It only appears on WebKit/GTK+ with the DirectFB backend, but sadly people never report back whether and how they have solved it. You could make a difference and contribute back to the WebKit project.


In general most of these issues can be avoided by using a pre-packaged Embedded Linux Distribution like Ångström (or even Debian). The biggest benefit of that approach is that someone else made sure that when you install WebKit, all dependencies will be installed as well and it will just work for your ARM/MIPS/PPC system. It will save you a lot of time.

By zecke (noreply@blogger.com) at October 02, 2010 06:12 AM

August 28, 2010

WebKit

Lucas De Marchi

After some time working on the EFL port of WebKit, I’ve been nominated as an official WebKit developer. Now I have super powers in the official repository :-), but I swear I intend to use them with caution and responsibility. I’ll not forget Uncle Ben’s advice: “with great power comes great responsibility”.

I’m preparing a post to talk about WebKit, EFL, eve (a new web browser based on WebKit + EFL) and how to easily embed a browser in your application. Stay tuned.

By Lucas De Marchi at August 28, 2010 03:15 AM

August 10, 2010

Coscup2010/GNOME.Asia with strong web focus

Holger Freyther

On the coming weekend the Coscup 2010/GNOME.Asia is taking place in Taipei. The organizers have decided to have a strong focus on the Web, as can be seen in the program.

On Saturday there is a keynote and various talks about HTML5 and node.js. Sunday will see three talks touching WebKit/GTK+. There is one about building a tablet OS with WebKit/GTK+, one by Xan Lopez on how to build hybrid applications (a topic I have devoted moiji-mobile.com to) and a talk by me using gdb to explain how WebKit/GTK+ works and how the porting layer interacts with the rest of the code.

I hope the audience will enjoy the presentations and I am looking forward to attending the conference; there is also a strong presence of the ex-Openmoko Taiwan Engineering team. See you on Saturday/Sunday and drop me an email if you want to talk about WebKit or GSM...

By zecke (noreply@blogger.com) at August 10, 2010 04:32 PM

July 16, 2010

Cross-compiling QtWebKit for Windows on Linux using MinGW

University of Szeged

In this post I'll show you how to configure and compile a MinGW toolchain for cross-compilation on Linux, then how to build Qt using this toolchain and finally compile the Qt port of WebKit from trunk.

read more

By andras.becsi at July 16, 2010 09:31 AM

September 06, 2008

Skia graphics library in Chrome: First impressions

Alp Toker

With the release of the WebKit-based Chrome browser, Google also introduced a handful of new backends for the browser engine, including a new HTTP stack and the Skia graphics library. Google’s Android WebKit code drops have previously featured Skia for rendering, though this is the first time the sources have been made freely available. The code is apparently derived from Google’s 2005 acquisition of North Carolina-based software firm Skia and is now provided under the open source Apache License 2.0.

The library weighs in at some 80,000 lines of C++ code (to Cairo’s 90,000 as a ballpark reference). Some of the differentiating features include:

  • Optimised software-based rasteriser (module sgl/)
  • Optional GL-based acceleration of certain graphics operations including shader support and textures (module gl/)
  • Animation capabilities (module animator/)
  • Some built-in SVG support (module svg/)
  • Built-in image decoders: PNG, JPEG, GIF, BMP, WBMP, ICO (module images/)
  • Text capabilities (no built-in support for complex scripts)
  • Some awareness of higher-level UI toolkit constructs (platform windows, platform events): Mac, Unix (sic. X11, incomplete), Windows, wxWidgets
  • Performance features
    • Copy-on-write for images and certain other data types
    • Extensive use of the stack, both internally and for API consumers, to avoid needless allocations and memory fragmentation
    • Thread-safety to enable parallelisation

The library is portable and has (optional) platform-specific backends:

  • Fonts: Android / Ascender, FreeType, Windows (GDI)
  • Threading: pthread, Windows
  • XML: expat, tinyxml
  • Android shared memory (ashmem) for inter-process image data references

Skia Hello World

In this simple example we draw a few rectangles to a memory-based image buffer. This also demonstrates how one might integrate with the platform graphics system to get something on screen, though in this case we’re using Cairo to save the resulting image to disk:

    #include "SkBitmap.h"
    #include "SkCanvas.h"
    #include "SkDevice.h"
    #include "SkPaint.h"
    #include "SkRect.h"
    #include <cairo.h>

    int main()
    {
        SkBitmap bitmap;
        bitmap.setConfig(SkBitmap::kARGB_8888_Config, 100, 100);
        bitmap.allocPixels();
        SkDevice device(bitmap);
        SkCanvas canvas(&device);
        SkPaint paint;
        SkRect r;

        paint.setARGB(255, 255, 255, 255);
        r.set(10, 10, 20, 20);
        canvas.drawRect(r, paint);

        paint.setARGB(255, 255, 0, 0);
        r.offset(5, 5);
        canvas.drawRect(r, paint);

        paint.setARGB(255, 0, 0, 255);
        r.offset(5, 5);
        canvas.drawRect(r, paint);

        {
            SkAutoLockPixels image_lock(bitmap);
            cairo_surface_t* surface = cairo_image_surface_create_for_data(
                (unsigned char*)bitmap.getPixels(), CAIRO_FORMAT_ARGB32,
                bitmap.width(), bitmap.height(), bitmap.rowBytes());
            cairo_surface_write_to_png(surface, "snapshot.png");
            cairo_surface_destroy(surface);
        }

        return 0;
    }

You can build this example yourself, linking statically against the libskia.a archive generated during the Chrome build process on Linux.

Not just for Google Chrome

The Skia backend in WebKit, the first parts of which are already hitting SVN (r35852, r36074), isn’t limited to use in the Chrome/Windows configuration, and some work has already been done to get it up and running on Linux/GTK+ as part of the ongoing porting effort.

The post Skia graphics library in Chrome: First impressions appeared first on Alp Toker.

By alp at September 06, 2008 12:11 AM

June 12, 2008

WebKit Meta: A new standard for in-game web content

Alp Toker

Over the last few months, our browser team at Nuanti Ltd. has been developing Meta, a brand new WebKit port suited to embedding in OpenGL and 3D applications. The work is being driven by Linden Lab, who are eagerly investigating WebKit for use in Second Life.

While producing Meta we’ve paid great attention to resolving the technical and practical limitations encountered with other web content engines.

uBrowser running with the WebKit Meta engine

High performance, low resource usage

Meta is built around WebKit, the same engine used in web browsers like Safari and Epiphany, and features some of the fastest content rendering around as well as nippy JavaScript execution with the state-of-the-art SquirrelFish VM. The JavaScript SDK is available independently of the web renderer for sandboxed client-side game scripting and automation.

It’s also highly scalable. Some applications may need only a single browser context, but virtual worlds often need to support hundreds of web views or more, each with active content. To optimize for this use case, we’ve cut down resource usage to an absolute minimum and tuned performance across the board.

Stable, easy to use cross-platform SDK

Meta features a single, rock-solid API that works identically on all supported platforms including Windows, OS X and Linux. The SDK is tailored specifically to embedding and allows tight integration (shared main loop or operation in a separate rendering thread, for example) and hooks to permit seamless visual integration and extension. There is no global setup or initialization and the number of views can be adjusted dynamically to meet resource constraints.

Minimal dependencies

Meta doesn’t need to use a conventional UI toolkit and doesn’t need any access to the underlying windowing system or the user’s filesystem to do its job, so we’ve done away with these concepts almost entirely. It adds only a few megabytes to the overall redistributable application’s installed footprint and won’t interfere with any pre-installed web browsers on the user’s machine.

Nuanti will be offering commercial and community support and is anticipating involvement from the gaming industry and homebrew programmers.

In the mid term, we aim to submit components of Meta to the WebKit Open Source project, where our developers are already actively involved in maintaining various subsystems.

Find out more

Today we’re launching meta.nuanti.com and two mailing lists to get developers talking. We’re looking to make this site a focal point for embedders, chock-full of technical details, code samples and other resources.

The post WebKit Meta: A new standard for in-game web content appeared first on Alp Toker.

By alp at June 12, 2008 09:35 AM

April 21, 2008

Acid3 final touches

Alp Toker

Recently we’ve been working to finish off and land the last couple of fixes to get a perfect pixel-for-pixel match against the reference Acid3 rendering in WebKit/GTK+. I believe we’re the first project to achieve this on Linux — congratulations to everyone on the team!

Epiphany using WebKit r32284

We also recently announced our plans to align more closely with the GNOME desktop and mobile platform. To this end we’re making a few technology and organisational changes that I hope to discuss in an upcoming post.

The post Acid3 final touches appeared first on Alp Toker.

By alp at April 21, 2008 02:38 AM

April 06, 2008

WebKit Summer of Code Projects

Alp Toker

With the revised deadline for Google Summer of Code ’08 student applications looming, we’ve been getting a lot of interest in browser-related student projects. I’ve put together a list of some of my favourite ideas.

If in doubt, now’s the time to submit proposals. Already-listed ideas are the most likely to get mentored but students are free to propose their own ideas as well. Proposals for incremental improvements will tend to be favoured over ideas for completely new applications, but a proof of concept and/or roadmap can help when submitting plans for larger projects.

Update: There’s no need to keep asking about the status of an application on IRC/private mail etc. It’s a busy time for the upstream developers but they’ll get back in touch as soon as possible.

The post WebKit Summer of Code Projects appeared first on Alp Toker.

By alp at April 06, 2008 08:40 PM

March 27, 2008

WebKit gets 100% on Acid3

Alp Toker

Today we reached a milestone with WebKit/GTK+ as it became the first browser engine on Linux/X11 to get a full score on Acid3, shortly after the Acid3 pass by WebKit for Safari/Mac.

Acid3
Epiphany using WebKit r31371

There is actually still a little work to be done before we can claim a flawless Acid3 pass. Two of the most visible remaining issues in the GTK+ port are :visited (causing the “LINKTEST FAILED” notice in the screenshot) and the lack of CSS text shadow support in the Cairo/text backend, which is needed to match the reference rendering.

It’s amazing to see how far we’ve come in the last few months, and great to see the WebKit GTK+ team now playing an active role in the direction of WebCore as WebKit continues to build momentum amongst developers.

Update: We now also match the reference rendering.

The post WebKit gets 100% on Acid3 appeared first on Alp Toker.

By alp at March 27, 2008 09:06 PM

March 15, 2008

Bossa Conf ’08

Alp Toker

I’m here in the LHR lounge. In a couple of hours, we take off for the INdT Bossa Conference, Pernambuco, Brazil via Lisbon. Bumped into Pippin, who will be presenting Clutter. Also looking forward to Lennart’s PulseAudio talk amongst others.

If you happen to be going, drop by my WebKit Mobile presentation, 14:00 Room 01 this Monday. We have a small surprise waiting for Maemo developers.

WebKit Mobile

The post Bossa Conf ’08 appeared first on Alp Toker.

By alp at March 15, 2008 03:29 AM