A Regex Barometer

A Regex Barometer

by burntsushi

dsemi

Would be interested to see how cl-ppcre compares as I've often heard people praise its performance.

burntsushi

Me too! I ran out steam adding engines. If anyone wants to take a crack at this, instructions are here: https://github.com/BurntSushi/rebar/blob/master/CONTRIBUTING...

Basically, you "just" need to build a Lisp program that accepts a simple format on stdin, and then repeatedly executes the benchmark given. Then it just needs to print out the time and result of each sample collected. Finally, you need to tell rebar how to build and run the program.

Joker_vD

Not to be confused with Rebar3 [0] which is a de-facto package manager and build tool for Erlang.

[0] https://github.com/erlang/rebar3

Aachen

I don't understand, what are the columns? Is this ratio thing a factor of how much slower one library is than some other unspecified thing? And why aren't all benchmarks run on all engines if the results are averaged anyway; does the benchmark count column mean something different?

burntsushi

Does this paragraph above the summary table answer your question?

> The summary statistic used is the geometric mean of the speed ratios for each regex engine across all benchmarks that include it. The ratios within each benchmark are computed from the median of all timing samples taken, and dividing it by the best median of the regex engines that participated in the benchmark. For example, given two regex engines A and B with results 35 ns and 25 ns on a single benchmark, A has a speed ratio of 1.4 and B has a speed ratio of 1.0. The geometric mean reported here is then the "average" speed ratio for that regex engine across all benchmarks.

So 1.0 represents the fastest possible ratio, where the engine ranked first in all benchmarks.

> And why aren't all benchmarks run on all engines

Because they can't be. Some benchmarks, for example, test the speed that a regex engine reports capture group offsets. But not all regex engines (like Hyperscan) support that functionality. In other cases, it's because of various limits that are hit, e.g., "regex is too big."

If you look at the individual benchmarks, more information about each can be seen by expanding the details. And if there are engines missing, it should tell you why. Look at the noseyparker benchmark for example[1]. It's absolutely brutal and most engines just time out.

[1]: https://github.com/BurntSushi/rebar#noseyparker

kzrdude

That was interesting. And the quadratic benchmark listed after that highlights how hyperscan wins in an unfair way - it just implements different semantics, so it's not comparing the same feature. (I see now, after posting, you've mentioned it in another comment.)

Modified3019

Yeah, definitely needs at least a “higher/lower is better” comment for people that don’t want to sink time and effort into a random github project trying to figure out what these values are saying.

Comment was deleted :(

pionar

Are there any surprises here? For me, it's how well the .NET regex lib does in compiled mode.

burntsushi

I agree. I did not expect .NET to do so well.

Another surprise is perhaps the quadratic benchmark[1]. Even for automata oriented engines, it provokes a case that is unavoidably O(m * n^2). Basically, the automata oriented engines (rust/regex, go/regexp and re2) all have worst case O(m * n) time complexity for an individual search. But when you move to "iterate over all matches in a haystack," that worst case gets bumped up to O(m * n^2) for certain classes of regexes.

All regex engines in the barometer, except for Hyperscan, suffer from this problem. That's not because of any special secret in Hyperscan, but rather, that it implements different match semantics than every other engine.

I'm otherwise finding it hard to say what is "surprising" haha. Perhaps others from the outside can comment.

[1]: https://github.com/BurntSushi/rebar#quadratic

ygra

There has been quite a bit of effort in improving regex performance in .NET in the last few years. Stephen Toub has a few interesting (and long) articles:

• https://devblogs.microsoft.com/dotnet/regex-performance-impr...

• https://devblogs.microsoft.com/dotnet/regular-expression-imp...

From what I remember, there are a bunch of distinct optimizations:

• transforming the regex into a functionally equivalent one with atomic groups to prevent backtracking when it's statically known that backtracking cannot change the result (but the runtime)

• diverting most parts of the actual search to string.IndexOf, IndexOfAny, etc. methods that have been heavily vectorized

• trying to find clever ways in using said methods in finding good starting our continuation points for the match and matching either backwards or forwards as needed to retain most of the vectorized searching benefits with as little unnecessary backtracking as possible (e.g. /a+b/ will likely generate code to search for b and then match backwards instead of trying to find every a and continuing forwards from there)

I guess most of those ideas are neither new nor uncommon (especially in faster engines where I think they take inspiration from), but compared to a rather similar language .NET has quite an edge by now.

Dowwie

Could the barometer be run against the very first release of regex?

burntsushi

It can. I did it here: https://github.com/BurntSushi/rebar/tree/ag/regexinit/record...

And the PR: https://github.com/BurntSushi/rebar/pull/3

Although I probably won't merge it. Overall results:

    $ rebar rank record/curated/2023-07-05/*.csv -M compile --intersection
    Engine          Version  Geometric mean of speed ratios  Benchmark count
    ------          -------  ------------------------------  ---------------
    rust/regex      1.9.0    1.03                            34
    rust/regexold   1.7.3    3.12                            34
    rust/regexinit  1.0.0    4.35                            34

You can explore individual benchmarks in either the README I linked above, or by checkout out that PR and running this:

    rebar cmp record/curated/2023-07-05/*.csv -M compile --intersection

(You'll need to build rebar first of course though.)

Dowwie

More than 2x improvement :). It started off fast and just kept getting faster. Congrats.

No value for PHP?

burntsushi

PHP uses PCRE2, which is included in this barometer. I mentioned PHP (among others) here: https://github.com/BurntSushi/rebar/blob/master/WANTED.md

If someone has a compelling argument for why PHP specifically (and probably others) should be included, then please open an issue and make your case. :-)

KerrAvon

notes on other WANTED stuff: NSRegularExpression uses ICU regex underneath. However, I think Swift's new native regex support is a from-scratch new engine; it pretty much has to be to work well with Swift strings and the nifty extended RegEx literal builder mode.

Crafted by Rajat

Source Code

hckrnws

A Regex Barometer