I'm looking forward to doing careful benchmarking, as this renderer absolutely l...

I'm looking forward to doing careful benchmarking, as this renderer absolutely looks like it will be competitive. It turns out that is really hard to do, if you want meaningful results.

My initial take is that performance will be pretty dependent on hardware, in particular support for pixel local storage[1]. From what I've seen so far, Apple Silicon is the sweet spot, as there is hardware support for binning and sorting to tiles, and then asking for fragment shader execution to be serialized within a tile works well. On other hardware, I expect the cost of serializing those invocations to be much higher.

One reason we haven't done deep benchmarking on the Vello side is that our performance story is far from done. We know one current issue is the use of device atomics for aggregating bounding boxes. We have a prototype implementation [2] that uses monoids for segmented reduction, and that looks like a nice improvement. Additionally, we plan to do f16 math (which should be a major win especially on mobile), as well as using subgroups for various prefix sum steps (subgroups are in the process of landing in WebGPU[3]).

Overall, I'm thrilled to see this released as open source, and that there's so much activity in fast GPU vector graphics rendering. I'd love to see a future in which CPU path rendering is seen as being behind the times, and this moves us closer to that future.

[1]: https://dawn.googlesource.com/dawn/+/refs/heads/main/docs/da...

[2]: https://github.com/linebender/vello/issues/259

[3]: https://github.com/gpuweb/gpuweb/issues/4306