Theory vs. Practice
Diagnosis is not the end, but the beginning of practice.
Critical wrk and wrk2 bugs: all wrk/wrk2 benchmarks since 2012 are bogus
Nowadays, benchmarking is not a walk in the park. By yet another coincidence, wrk and wrk2 were created in 2012 to complement weighttp (2006) and IBM Apache Benchmark (1996). The World Wide Web, totally flat, was much simpler before G-WAN and its first 2010-2013 benchmarks brought new, previously unknown heights to this otherwise boring, infinitely self-complacent industry.
In 2023-2024, I first used wrk, which takes forever to complete benchmarks with a fast server because wrk attempts to count all the server replies: if the server takes 10 seconds to complete the test and wrk is 500 times slower than the server, then wrk needs 500 * 10 seconds = 5,000 seconds = 1 hour 23 minutes to complete the test. In late 2024, an engineer suggested wrk2 because it is "slower but more reliable"... and it stops at the specified time (instead of taking forever).
In April 2025, I published new [1k-40k users] benchmarks (G-WAN reaching 242m RPS at 10k users). But a few months later, I discovered that wrk2, installed on new machines, was crashing at... 10k users.
This was odd because 10k users is the concurrency at which G-WAN (at 242m RPS) is vaporizing NGINX and the others (which top out below 1m RPS at 1k users). But I did not have time to fix wrk2, and I thought that writing a G-WAN-based benchmark would be a much better value proposition than fixing the slow, obscure and large code of wrk2 (5,316 lines of code).
Near September 2025, I noticed that an OS update had slowed G-WAN down from 242m RPS to 8m RPS (so I wrote the G-WAN cache to bypass a suddenly 'faulty' Linux kernel syscall, restoring G-WAN performance to 281m RPS at 10k users).
I thought I was safe from this point on. But in April 2026 I was told that creating many threads could take so much time that wrk2 could leave no time for the actual benchmark. This patch was provided, where stop_at is computed after start (wrk2 was computing stop_at before start and before the creation of the threads, completely ignoring the thread calibration time!):
--- a/src/wrk.c
+++ b/src/wrk.c
@@ -122,7 +122,8 @@
     uint64_t connections = cfg.connections / cfg.threads;
     double throughput = (double)cfg.rate / cfg.threads;
-    uint64_t stop_at = time_us() + (cfg.duration * 1000000);
+    uint64_t start = time_us();
+    uint64_t stop_at = start + (cfg.duration * 1000000);
     for (uint64_t i = 0; i < cfg.threads; i++) {
         thread *t = &threads[i];
@@ -163,7 +164,6 @@
     printf("  %"PRIu64" threads and %"PRIu64" connections\n",
            cfg.threads, cfg.connections);
-    uint64_t start = time_us();
     uint64_t complete = 0;
     uint64_t bytes = 0;
     errors errors = { 0 };
wrk2, despite being well-promoted and widely praised, is not exactly what I would call a champion:
At 40-50k users, the kernel OOM kill-switch "Terminates" wrk2 for using 190+ GB on my 192 GB RAM machine, while G-WAN, which is doing many more things, consumes around 700 MB of RAM (this fact alone reveals how much expertise and care the best-funded "scalability and benchmark experts" dedicate to benchmark tools).
That's why I felt the need to make my own benchmark tool, which will be integrated and published with G-WAN. With it, it will be possible to benchmark high concurrencies on mini-PCs with 4 GB of RAM. A welcome change for the unfunded crowds.
Nevertheless, I had promised to investigate further, and I discovered that the situation was much worse than presented, as the proposed patch would not fix the main issue:
(1) wrk2's thread calibration takes as much time as the benchmark itself (default for both: 10 seconds). The benchmark duration can be specified on the command line... but the calibration duration is silently extended: calibrate_delay = 10_seconds + (thread->connections * 5), total nonsense at high concurrencies, carefully hidden with the use of MACROS!
(2) wrk2's main() sets up a stop_at time before creating the threads and a start time after creating and calibrating the threads, so benchmark_effective_duration = benchmark_specified_duration - calibration_duration
(what could possibly go wrong in wonderland, right?).
(3) wrk2's main() computes the RPS as req_per_s = complete / runtime_s, and dividing by a runtime below 1 second effectively turns the division into a multiplication (leading to bogus, inflated values) when the actual benchmark time (which defaults to 10 seconds) is reduced by the calibration time to less than 1 second (the first parallelization bug).
This deadly issue happens most of the time because the calibration time and the actual benchmarking time are nearly identical!
The obvious fix was to do this in wrk.c, not in main() but in the threads' function:
thread->start = time_us();
thread->stop_at = thread->start + (cfg.duration * 1000000); // <= THE FIX
aeMain(loop); // => the actual benchmark starts here, after thread calibration was done
With this single line, we guarantee that every single thread will execute for (at least) the user-specified time. wrk2 benchmarks will last longer than before because the thread calibration time will no longer be subtracted from the thread benchmarking time (the two will be added). And, most probably, as in real life, not all client threads will start and end at the same time, making benchmarks last even longer (than the default duration, or the one specified on the command line).
But since the starting time and execution time differ for each thread, we can't calculate the RPS in main() the way Gil Tene has done it in wrk2 since 2012: by taking the start of the first thread and the end of the last one (the second parallelization bug).
Doing so is necessarily wrong (due to OS task and thread scheduling, background processes, etc.). That's basic parallelism synchronization, a discipline publicly normalized by the 1995 POSIX threads publication. In 2026, 30 years later, there is no excuse for getting it wrong by design to such an extent... in a tool supposedly benchmarking high-performance multi-threaded servers!
Instead, the RPS must be accounted for in each thread, which in turn makes the reported final server performance (in RPS) more accurate, since every thread's execution duration now closely matches the specified benchmark time.
wrk2 was first published in 2012 by Gil Tene. In 2026, these two major by-design flaws are 14 years old, in something presented as "A constant throughput, correct latency recording variant of wrk". wrk, created by Will Glozer, also miscalculates the RPS; its calibration/benchmark time flaw is less severe only because it allocates much less time to thread calibration.
It would be very interesting to hear why Gil Tene felt the need to extend the calibration time so much in wrk2, to the point where it completely defeats the purpose of benchmarking... while claiming that wrk2 is "more exact" than wrk!
I am saying this because, after examining the wrk2 source code, I found very (very) strange things: many features implemented but never used, redundant slow function calls, and... purposely misleading messages like "Initialised %d threads in %.3f ms", where the timing measured event-loop creation (thread creation was neither timed nor reported; it smells as if the great "Art of Deception" was at work, again).
For a so-called "high-performance multi-threaded benchmark tool", the source code of wrk2 stinks, quite a lot, and would deserve a complete rewrite (if it had not been badly designed in the first place). Its only purpose seems to be to be as slow and inefficient as possible.
Using event-queues works for high-latency networks, low concurrencies and mostly-idle clients, but this model quickly shows its limits on localhost (or fast networks), generates VERY HIGH latencies ("ready" queued connections are starved while only one is processed at a time) and hits the small wrk2 timeouts:
#define SOCKET_TIMEOUT_MS   60000 // GPG: 1 minute, was  2000 ( 2 seconds)
#define CALIBRATE_DELAY_MS   1000 // GPG: 1 second, was 10000 (10 seconds)
#define TIMEOUT_INTERVAL_MS 60000 // GPG: 1 minute, was  2000 ( 2 seconds)
In the same spirit, calibrate_delay = 10_seconds + (thread->connections * 5); is absolute nonsense, especially at high concurrencies (and has disastrous consequences when subtracted from the actual benchmark time, as wrk2 does). Either this "widely praised scalability expert" is not familiar with the concept of arithmetic overflow, or he knew what he was doing. In both cases, his source code is not trustworthy, and the fact that nobody felt the need to correct it tells how serious the whole cohort is.
So I quickly corrected a few things here and there, added some useful messages, added pretty thousands separators for the readability of RPS and timings, etc., but I don't see the point of wasting more time on the outrageously amateurish wrk2 codebase. Stating "amateurish" is much nicer than "criminal" because there are many hints that all this mediocrity and these bad design choices were a plan rather than mere utter incompetence.
SO, SINCE 2012 MOST WRK AND WRK2 BENCHMARKS ARE BOGUS – AND NOBODY HAS EVER NOTICED... IN 14 YEARS!
After fixing wrk2's latest available source code and recompiling it, I quickly tested it and... it crashed at 10k users. Wow, nobody seemed to have addressed the bug I experienced 12 months ago.
I re-downloaded wrk2 from several sources to compare with the version I had downloaded in October 2024. In that 2024 source code, the RPS flaws were already there... but at least this 2024 version of wrk2 (published before the April 2025 G-WAN benchmarks) had no problem testing up to 40k users without crashing.
In the newest versions of wrk2 available on GitHub, in the Ubuntu repositories, etc., the Makefile has also been heavily rewritten (so they have time for this, but not to make better tools) and the resulting executable file is now 10 times smaller than before (it no longer embeds the libraries it relies on, so the executable will fail if copied to another machine, due to GNU incompatibilities and shared-library versioning)... and all these new versions are crashing at... 10k users!
If someone wanted to sabotage the tool that allows G-WAN to shine (and that reveals the defects of NGINX and all the other servers), this is exactly what would have been done.
I am sure some people will claim that all this is "accidental", but I hardly see why and how wrk2 crashing at 10k+ users is a necessary feature for a multicore benchmark tool widely considered and celebrated as the "best of its class".
If there's no outright fraud here, I can't understand why it is so difficult to find a reasonably designed, performant and reliable benchmark tool for servers: all the others, including the recent Go and Rust ones, are even slower and less capable than wrk2... so ever-degraded quality and spiraling budgets are presented as "the inescapable march of progress"!
I have named this redesigned version of wrk2 wrk3: it removes the bogus RPS by-design flaws, doesn't crash at 10k users... and is much easier to compile since (1) it comes with all its dependencies and (2) its Makefile uses them.
As a bonus, wrk3 is much, much faster than wrk2: G-WAN now tops at 469m RPS at 10k users on the same machine where the same (relatively old) version of G-WAN topped at 281m RPS.
The latest G-WAN is now much, much faster, but that will be for another blog post.
I share wrk3 with the world, both to let people test their own work and G-WAN (HTTP(S) server, Web applications, and caching reverse proxy)... because we all need and deserve better tools than the ones provided by the best-funded "experts" of the BigTech industry.