hckrnws
Would be interested to see how cl-ppcre compares as I've often heard people praise its performance.
Me too! I ran out steam adding engines. If anyone wants to take a crack at this, instructions are here: https://github.com/BurntSushi/rebar/blob/master/CONTRIBUTING...
Basically, you "just" need to build a Lisp program that accepts a simple format on stdin, and then repeatedly executes the benchmark given. Then it just needs to print out the time and result of each sample collected. Finally, you need to tell rebar how to build and run the program.
Not to be confused with Rebar3 [0] which is a de-facto package manager and build tool for Erlang.
I don't understand, what are the columns? Is this ratio thing a factor of how much slower one library is than some other unspecified thing? And why aren't all benchmarks run on all engines if the results are averaged anyway; does the benchmark count column mean something different?
Does this paragraph above the summary table answer your question?
> The summary statistic used is the geometric mean of the speed ratios for each regex engine across all benchmarks that include it. The ratios within each benchmark are computed from the median of all timing samples taken, and dividing it by the best median of the regex engines that participated in the benchmark. For example, given two regex engines A and B with results 35 ns and 25 ns on a single benchmark, A has a speed ratio of 1.4 and B has a speed ratio of 1.0. The geometric mean reported here is then the "average" speed ratio for that regex engine across all benchmarks.
So 1.0 represents the fastest possible ratio, where the engine ranked first in all benchmarks.
> And why aren't all benchmarks run on all engines
Because they can't be. Some benchmarks, for example, test the speed that a regex engine reports capture group offsets. But not all regex engines (like Hyperscan) support that functionality. In other cases, it's because of various limits that are hit, e.g., "regex is too big."
If you look at the individual benchmarks, more information about each can be seen by expanding the details. And if there are engines missing, it should tell you why. Look at the noseyparker benchmark for example[1]. It's absolutely brutal and most engines just time out.
That was interesting. And the quadratic benchmark listed after that highlights how hyperscan wins in an unfair way - it just implements different semantics, so it's not comparing the same feature. (I see now, after posting, you've mentioned it in another comment.)
Yeah, definitely needs at least a “higher/lower is better” comment for people that don’t want to sink time and effort into a random github project trying to figure out what these values are saying.
Comment was deleted :(
Are there any surprises here? For me, it's how well the .NET regex lib does in compiled mode.
I agree. I did not expect .NET to do so well.
Another surprise is perhaps the quadratic benchmark[1]. Even for automata oriented engines, it provokes a case that is unavoidably O(m * n^2). Basically, the automata oriented engines (rust/regex, go/regexp and re2) all have worst case O(m * n) time complexity for an individual search. But when you move to "iterate over all matches in a haystack," that worst case gets bumped up to O(m * n^2) for certain classes of regexes.
All regex engines in the barometer, except for Hyperscan, suffer from this problem. That's not because of any special secret in Hyperscan, but rather, that it implements different match semantics than every other engine.
I'm otherwise finding it hard to say what is "surprising" haha. Perhaps others from the outside can comment.
There has been quite a bit of effort in improving regex performance in .NET in the last few years. Stephen Toub has a few interesting (and long) articles:
• https://devblogs.microsoft.com/dotnet/regex-performance-impr...
• https://devblogs.microsoft.com/dotnet/regular-expression-imp...
From what I remember, there are a bunch of distinct optimizations:
• transforming the regex into a functionally equivalent one with atomic groups to prevent backtracking when it's statically known that backtracking cannot change the result (but the runtime)
• diverting most parts of the actual search to string.IndexOf, IndexOfAny, etc. methods that have been heavily vectorized
• trying to find clever ways in using said methods in finding good starting our continuation points for the match and matching either backwards or forwards as needed to retain most of the vectorized searching benefits with as little unnecessary backtracking as possible (e.g. /a+b/ will likely generate code to search for b and then match backwards instead of trying to find every a and continuing forwards from there)
I guess most of those ideas are neither new nor uncommon (especially in faster engines where I think they take inspiration from), but compared to a rather similar language .NET has quite an edge by now.
Could the barometer be run against the very first release of regex?
It can. I did it here: https://github.com/BurntSushi/rebar/tree/ag/regexinit/record...
And the PR: https://github.com/BurntSushi/rebar/pull/3
Although I probably won't merge it. Overall results:
$ rebar rank record/curated/2023-07-05/*.csv -M compile --intersection
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/regex 1.9.0 1.03 34
rust/regexold 1.7.3 3.12 34
rust/regexinit 1.0.0 4.35 34
You can explore individual benchmarks in either the README I linked above, or by checkout out that PR and running this: rebar cmp record/curated/2023-07-05/*.csv -M compile --intersection
(You'll need to build rebar first of course though.)More than 2x improvement :). It started off fast and just kept getting faster. Congrats.
No value for PHP?
PHP uses PCRE2, which is included in this barometer. I mentioned PHP (among others) here: https://github.com/BurntSushi/rebar/blob/master/WANTED.md
If someone has a compelling argument for why PHP specifically (and probably others) should be included, then please open an issue and make your case. :-)
notes on other WANTED stuff: NSRegularExpression uses ICU regex underneath. However, I think Swift's new native regex support is a from-scratch new engine; it pretty much has to be to work well with Swift strings and the nifty extended RegEx literal builder mode.
Crafted by Rajat
Source Code