ApacheBench & HTTPerf
People's beliefs and convictions are almost always gotten at second hand, and without examination.
HTTP benchmark tools: Apache's AB, Lighttpd's Weighttp, HTTPerf, and Nginx's "wrk"
If you believe in the merits of forming your own opinion and want to test a server or a Web application, then this page may help you: we wish we could have found such a resource when we were facing a blank page back in 2009. Very few benchmarks bother to document what they do – and even fewer explain why they do things one way rather than another.
This information is for Linux. We left the Windows world (after 30 years of monoculture) when we discovered in 2009 how much better G-WAN performs on Linux.
What to avoid – and why
The web is used by more and more people, and many webmasters "copy and paste" scripts from blogs without understanding what they do. Things get worse when the copying is done by a startup or a consultant without the time or resources to do the necessary homework. The example below is real:
Let's start with the most obvious:
Here, two high-end, high-performance 2014 CPUs serve fewer than 15k requests per second... while G-WAN is 10x faster on a single 2011 laptop CPU (an Intel Core i3)... with a 10x larger static file!
The first test was done anonymously on a blog. But the test showing 200x higher performance was done by an academic expert working in the "distributed systems" laboratory of a prestigious university.
How can a test be wrong by more than a factor of 200 (2x10x10)? This is what we explain on this page.
Know what you test
The HTTP Server
Most Web server downloads come with a tiny "it works" static HTML page. That's not the case with G-WAN: since G-WAN is an application server, it supports 17 programming languages and demonstrates them all, along with a rich API, in the archive you have downloaded.
As a result, G-WAN loads the Java Virtual Machine, the C# Virtual Machine, the PH7 Virtual Machine, the C runtime, the C++ runtime, the Objective-C runtime, and so on. Together they create a memory footprint of tens of megabytes – to which are added, by default, all the G-WAN example scripts (init script, maintenance script, servlets, connection handlers, content-type handlers, and protocol handlers) to let people quickly try G-WAN's features.
So, before you run a benchmark, or attempt to check G-WAN's memory footprint, make sure you disable everything you don't want to use in your test. Enable or disable log files consistently – for all the servers you compare. And before you conclude that the tests done on this site are irrelevant, drop us a line so we can help you check that you are doing it right.
This is what we have done for all the tests published on this site: we sent our test procedures and our results, and asked the authors of each server to make suggestions. Sometimes, as with ORACLE GlassFish, our suggestions to slightly tune their code allowed them to multiply the performance of their server by more than a factor of two.
And, please, before all that, read this web page. You will most probably learn useful things – the kind of technical insights that made it possible for us to write G-WAN.
The Network
Testing web servers via the Internet is a common mistake. People believe that "it's a real-life test", but this is wrong: the network is the bottleneck, so you are not testing the web server. Instead, you are testing:
- the network latency
- the network bandwidth
- the network quality of service.
An ADSL link (usually ranging from 5 to 100 Mbps) will not saturate a web server like G-WAN, which can fully use a 5,000 Mbps fiber link with a single multicore CPU.
OK, but with 50 ADSL links @ 100 Mbps, a test would really saturate G-WAN, right?
Wrong.
Latency (the delay for a packet to reach the other end of the network) is the bottleneck:
Even a 1 Gigabit LAN would make G-WAN look almost as slow as Nginx. You need a 40/100 Gigabit LAN (with tuned OS kernels, optimized drivers, high-end switches, etc.) to really saturate a web server like G-WAN.
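A back-of-the-envelope calculation – using figures that appear later on this page (roughly 377 bytes on the wire per small-file response, and G-WAN's 850,000 requests per second) – shows why bandwidth runs out first:

#include <stdio.h>

int main(void)
{
   /* figures taken from the tests later on this page: serving a
      100-byte file produces ~377 bytes on the wire (headers included) */
   double bytes_per_reply = 377.0;
   double rps = 850000.0;       /* G-WAN, TCP handshakes included */
   double gbps = bytes_per_reply * rps * 8.0 / 1e9;
   printf("%.1f Gbit/s needed\n", gbps);   /* prints: 2.6 Gbit/s */
   return 0;
}

About 2.6 Gbit/s are needed to sustain that rate: a 1 Gigabit link caps the test at roughly a third of what the server can deliver.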
For those who do not have access to such a test bed, there's a cheap, "optimal" solution available on every single computer. It offers the best bandwidth and latency that your web server, OS and CPU can provide (if properly configured). It's called localhost.
Some will (too) quickly conclude that there is then no advantage in using a fast web server if the network is the limit. This is not true, because not all web servers can scale – especially when generating dynamic contents. Here, the difference between a bad and a good server is a matter of life and death.
The Payload
Keep in mind that Web servers do NOT receive or send data: the OS kernel does it for them.
So, when you are serving a large file (a file that requires many TCP packets, each packet being 1,500 bytes in size), you are testing the OS kernel rather than the Web server.
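To see why, consider how a static file is typically served on Linux: once the request is parsed, the transfer itself is delegated to the kernel. A minimal sketch (error handling trimmed; serve_file() is a hypothetical helper, not G-WAN code):

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/sendfile.h>

/* after parsing the request and writing the response headers, a typical
   server hands the body transfer to the kernel with sendfile(): */
int serve_file(int client_fd, const char *path)
{
   int fd = open(path, O_RDONLY);
   if (fd < 0)
      return -1;
   struct stat st;
   fstat(fd, &st);
   off_t off = 0;
   /* the kernel copies file pages straight to the socket:
      the larger the file, the less the user-mode code matters */
   while (off < st.st_size)
      if (sendfile(client_fd, fd, &off, st.st_size - off) <= 0)
         break;
   close(fd);
   return 0;
}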
Nginx speed tests rely on an empty HTML file like 0.html. As Nginx caches file metadata (file size, date/time, etc.), such a test gives it an advantage over other servers. Note that, because of a better userland program architecture and implementation, G-WAN is faster than Nginx even with 0.html – and the G-WAN cache does not store empty files, so it doesn't matter whether it's enabled or not (by default, caching is disabled in G-WAN v4.10+, see gwan/init.c to enable it).
To be more relevant, benchmarks aimed at measuring the efficiency of a Web server should use an embedded resource (like the Web 'beacon' of Nginx or G-WAN: a 43-byte transparent GIF pixel) to let each server show how good it is at parsing client requests and building a reply (otherwise you are testing the file system and the speed of your disk – or the Nginx cache).
To make the G-WAN /nop.gif URI work with Nginx add this to nginx.conf: location = /nop.gif { empty_gif; }
But the kernel is also caching files in memory, right? That's true, but compared to G-WAN's caching, the kernel is more than twice as slow at the task (for small files) – hence the value of benchmarking resources embedded in the HTTP server.
Is it right to test such a corner case? After all, web servers also have to deal with large static files.
First, as an application server, G-WAN generates dynamic contents in-memory. That's far from being a corner case: it is the primary purpose of G-WAN – and many served contents are personalized, even for mere Web servers.
Second, the G-WAN cache turns disk-based resources into embedded resources. This, too, can be done for many small files. The G-WAN Key/Value store can also be used as a dynamic cache to accelerate database applications, or other compute-intensive calculations like loan amortization.
Third, G-WAN is also faster than Nginx when serving large files. Here, we are just explaining how much faster G-WAN will be for small files and embedded resources because, heck, the purpose of these benchmarks is to compare the web servers (rather than different OS kernels, file systems, types of disks, network interface cards, switches, etc. which, by careful selection, may be used either to serve as a bottleneck or to demonstrate the superiority of a given web server).
HTTP options
For the same reason, HTTP Keep-Alives should be used to test Web servers: establishing new TCP connections is very slow – and this is done (again) by the OS kernel rather than by the Web server. When you create many new connections per second, you test the OS kernel, not the user-mode server application.
Further, modern Web applications heavily rely on HTTP Keep-Alives, making them more than relevant on today's Web 2.0 with AJAX, Comet, online Games, video streaming, Big Data, HPC, HF Trading, etc.
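To make the point concrete, here is a minimal (naive) keep-alive client sketch – an illustration, not how AB or Weighttp are implemented: one TCP handshake (kernel work), then a hundred requests that only the server's user-mode code has to parse and answer.

#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
   int fd = socket(AF_INET, SOCK_STREAM, 0);
   struct sockaddr_in sa = { .sin_family = AF_INET,
                             .sin_port   = htons(8080) };
   inet_pton(AF_INET, "127.0.0.1", &sa.sin_addr);
   if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0)
      return 1;               /* one handshake, done by the kernel */

   const char req[] = "GET /100.html HTTP/1.1\r\n"
                      "Host: 127.0.0.1\r\n"
                      "Connection: keep-alive\r\n\r\n";
   char buf[4096];
   for (int i = 0; i < 100; i++)       /* 100 requests, 1 connection */
   {
      write(fd, req, sizeof req - 1);
      read(fd, buf, sizeof buf);       /* naive: assumes one read
                                          fetches one whole reply */
   }
   close(fd);
   return 0;
}

Without keep-alives, the handshake (and the TIME_WAIT state discussed later) is repeated 100 times – and none of that measures the server's own code.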
HTTP Protocol version
HTTP/1.1 brings HTTP Keep-Alives to avoid establishing a new TCP connection for each HTTP request. This is a useful evolution over HTTP/0.9, which lacked this and many other features.
Comparing servers that use different versions is increasingly irrelevant because HTTP/2 offers even more performance-oriented features like HTTP header compression, the support of bundled resources, multiplexing, request prioritization, native (un-encoded) binary contents, and streaming.
Know what you compare. Make sure you are using the same protocol version in all the servers.
OS kernel limits
In our G-WAN-based private Clouds, we are using a recompiled kernel whose critical parts have been rewritten to deliver much higher performance (the largest datacenters have most likely done similar things to cut their hardware, electricity, floor space, and cooling costs).
With G-WAN, this custom kernel is 30x-50x faster than the official kernel. But other HTTP servers don't get such a boost because their user-mode code, rather than the OS kernel, is the bottleneck.
Make sure you are using the same OS version and configuration for all the HTTP servers you compare.
Hardware limits
Finally, even with a small static 100-byte file and HTTP Keep-Alives, most of the time is consumed by CPU address bus saturation due to broadcast snoops. That, as Intel R&D recognized, is the next bottleneck to address.
Future multi-Core CPUs will only make things better for G-WAN and worse for all others because G-WAN has been designed to scale vertically before scaling horizontally.
Multi-Core CPUs
In 2000, Intel shipped the last single-Core (mainstream) CPU, the Pentium 4. All its successors have been multi-Core CPUs, making single-Core CPUs obsolete (in servers, desktops, tablets, smartphones, routers, etc.).
In the past, CPUs became faster by using a higher clock frequency. But around 4 GHz, vendors found that heat dissipation problems became unmanageable at reasonable cost. To continue delivering more power, CPUs started to embed several smaller CPUs (the CPU 'Cores') printed at a lower scale.
Programs that do not exploit the new CPU Cores will not run much faster on new CPUs. Established software vendors face a serious challenge because their product lines were designed at a time when parallelism was not a concern on mass-market PCs and servers.
Around 2020, Moore's law will collapse as transistors reach the size of an atom, making it impossible to pack more Cores into CPUs. Then, making more powerful CPUs will require breaking the laws of today's known physics. In the meantime, writing more efficient software (and enlarging CPU caches) is the only way to make computers run faster.
To test SMP (Symmetric Multi-Processing, called "hyperthreading" for Intel) or CMP (Chip Multi-Processing, called "multicore") server software, that is, software that takes advantage of logical processors able to run threads in parallel, you need to use CPUs with many Cores (and use as many workers on the client and server sides).
On Intel platforms, SMP can lead to a 30% performance gain (in the relatively rare cases where memory latencies can be avoided by interleaving the two threads' instructions in the pipeline) while CMP theoretically delivers 100% of the performance (moderated by factors like cache coherency and access to system memory, which is much slower than CPU caches).
SMP/CMP Web servers can either use several processes (like Nginx) or one process and several threads (like G-WAN).
Our tests show that using one single process and several threads saves both CPU and RAM – and delivers better performance – because this architecture spares the need for redundant plumbing protected by locks and duplicated resources.
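The pattern looks like this – a minimal sketch of the one-process, one-thread-per-Core architecture (an illustration, not G-WAN's actual code):

#include <pthread.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int listen_fd;   /* one socket, shared by all worker threads */

static void *worker(void *arg)
{
   (void)arg;
   for (;;)
   {
      int client = accept(listen_fd, 0, 0); /* the kernel distributes   */
      if (client < 0)                       /* connections; there is no */
         continue;                          /* duplicated per-process   */
      /* ...parse request, build reply, write(), keep-alive loop... */
      close(client);                        /* state to synchronize     */
   }
   return 0;
}

int main(void)
{
   listen_fd = socket(AF_INET, SOCK_STREAM, 0);
   struct sockaddr_in sa = { .sin_family = AF_INET,
                             .sin_port   = htons(8080),
                             .sin_addr.s_addr = INADDR_ANY };
   bind(listen_fd, (struct sockaddr *)&sa, sizeof sa);
   listen(listen_fd, SOMAXCONN);

   long cores = sysconf(_SC_NPROCESSORS_ONLN);
   for (long i = 1; i < cores; i++)    /* the main thread is worker #0 */
   {
      pthread_t t;
      pthread_create(&t, 0, worker, 0);
   }
   worker(0);   /* never returns */
   return 0;
}

All the state (cache, sessions, statistics) is directly visible to every worker; a multi-process design has to duplicate it or share it through explicit (and slower) plumbing.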
Not all 3.0 GHz CPUs are Equal
All our single-socket 6-Core tests are made with this Mac Pro CPU (identified as follows in the gwan.log file):
Intel(R) Xeon(R) CPU W3680 @ 3.33GHz (6 Cores/CPU, 2 threads/Core)
We use two such machines for the LAN-based tests.
But (extensive) third-party CPU tests show that many same-frequency CPUs are not as fast (in 2012, some same-frequency CPUs were 5 times slower and a few others 1.5 times faster). You can identify your CPU here.
For example, this 2011 8-Core AMD CPU @ 4.2GHz is slower than our 2010 6-Core Xeon W3680 @ 3.33GHz.
You need this 2011 16-Core AMD CPU @ 2.1GHz to be as powerful as our 2010 6-Core Xeon W3680 @ 3.33GHz.
But even this 2012 4-Core Intel CPU @ 1.8GHz is 3x slower than our 2010 6-Core Xeon W3680 @ 3.33GHz.
Moral of the story: our "3.33 GHz Xeon CPU" is 20 times faster than a "3.06 GHz Xeon CPU". Neither the number of CPU Cores nor the clock frequency always reflects the real power of a CPU. The exact CPU reference (stored by G-WAN in the gwan.log file) is needed to really identify the test platform.
And the same kinds of approximations (or omissions) in the testbed environment (OS type and configuration, hardware, drivers, network devices) lead to similar inaccuracies, making it impossible to validate or understand undocumented results.
Linux Distribution and Release
Use a 64-bit Linux distribution. Even when G-WAN runs as a 32-bit process, a 64-bit kernel works twice as fast as a 32-bit kernel. This is what our tests have shown over the last two years, and this is probably the easiest way for you to save on hardware and energy consumption – if you really need that much performance.
Note that Ubuntu 10.10 LTS 64-bit is faster and more scalable than Ubuntu 12.04 LTS 64-bit (which uses twice as much kernel CPU time while serving fewer requests, as is easily seen with our ab.c benchmarking tool described below).
This means that newer kernel and LIBC releases are not necessarily better in terms of performance. There's no substitute for tests to gain insights that you can rely on.
Also note that, on 64-bit Linux, G-WAN 32-bit is slightly faster than G-WAN 64-bit – proof that the advice above, while valid for the Linux OS kernel, is not an absolute rule valid for all software. You may want to use G-WAN 32-bit if you do not need more than 2 GiB of RAM (the theoretical limit is around 2.5 GiB, as the kernel reserves some of the virtual address space for its own needs).
Firewalls, Packet Statistics or Packet Filtering (iptables, PCAP, etc.)
iptables rules are very expensive in terms of CPU because many rules are applied to each packet traveling on the network. Even stateful firewalls have to keep track of established connections – and this takes time – delaying network traffic.
As a typical iptables configuration will divide G-WAN's performance by a factor of two, performance tests cannot seriously be done without either disclosing that iptables is used (and how), or – even better – disabling this bottleneck for the duration of the tests, and for all the compared Web servers.
System configuration
If you use a default OS installation then not all the resources of your hardware can be used: the default settings may be designed for client needs, or to save resources, but they set a limit to what can be done. Therefore, you must do some tuning to let applications and the OS kernel fully use the hardware.
The first issue is the lack of file descriptors (the default is only 1,024 files per process, resulting in very poor performance).
The second issue is the lack of TCP port numbers. As it takes time to fully close connections lingering in the TIME_WAIT state, the number of available ports will quickly decrease and establishing new connections will not be possible until they are released by the system. The default client port range is [1,024 - 5,000] and must be extended to the whole [1,024 - 65,535] ephemeral port range.
If you don't, you will quickly hit the TIME_WAIT wall and the AB, HTTPerf or Weighttp tools will produce errors like:
"error: connect() failed: Cannot assign requested address (99)"
To avoid these issues and improve general performance, you have to change the following system options:
ulimit -aH (this lists your hard limits)
sudo sh -c "ulimit -HSn 200000" (this sets your limit)
To make the following options permanent (available after a reboot) you must edit a couple of system configuration files:
Edit the file /etc/security/limits.conf:
sudo gedit /etc/security/limits.conf
And add the values below:
* soft nofile 200000
* hard nofile 200000
Edit the file /etc/sysctl.conf:
sudo gedit /etc/sysctl.conf
And add the values below:
# "Performance Scalability of a Multi-Core Web Server", Nov 2007 # Bryan Veal and Annie Foong, Intel Corporation, Page 4/10 fs.file-max = 5000000 net.core.netdev_max_backlog = 400000 net.core.optmem_max = 10000000 net.core.rmem_default = 10000000 net.core.rmem_max = 10000000 net.core.somaxconn = 100000 net.core.wmem_default = 10000000 net.core.wmem_max = 10000000 net.ipv4.conf.all.rp_filter = 1 net.ipv4.conf.default.rp_filter = 1 net.ipv4.ip_local_port_range = 1024 65535 net.ipv4.tcp_congestion_control = bic net.ipv4.tcp_ecn = 0 net.ipv4.tcp_max_syn_backlog = 12000 net.ipv4.tcp_max_tw_buckets = 2000000 net.ipv4.tcp_mem = 30000000 30000000 30000000 net.ipv4.tcp_rmem = 30000000 30000000 30000000 net.ipv4.tcp_sack = 1 net.ipv4.tcp_syncookies = 0 net.ipv4.tcp_timestamps = 1 net.ipv4.tcp_wmem = 30000000 30000000 30000000 # optionally, avoid TIME_WAIT states on localhost no-HTTP Keep-Alive tests: # "error: connect() failed: Cannot assign requested address (99)" # On Linux, the 2MSL time is hardcoded to 60 seconds in /include/net/tcp.h: # #define TCP_TIMEWAIT_LEN (60*HZ) # The option below is safe to use: net.ipv4.tcp_tw_reuse = 1 # The option below lets you reduce TIME_WAITs further # but this option is for benchmarks, NOT for production (NAT issues) net.ipv4.tcp_tw_recycle = 1
Then save the file and make the system reload it:
sudo sysctl -p /etc/sysctl.conf
The options above are important because values that are too low just block benchmarks. You will find other options in the ab.c wrapper described below.
If enabled, SELinux may prevent G-WAN from raising the number of file descriptors. If this is the case, apply the following SELinux module:
/usr/sbin/semodule -DB
service auditd restart
service gwan restart
grep gwan /var/log/audit/audit.log | audit2allow -M gwan_maxfds
semodule -i gwan_maxfds.pp
service gwan start
Starting gwan:                                             [  OK  ]
/usr/sbin/semodule -B
The number of file descriptors used by G-WAN can be found in /proc:
cat /proc/`ps ax | grep gwan | grep -v grep | awk '{print $1}'`/limits | grep "Max open files"
Max open files            2048                 2048                 files
This is good for a one-time check, but don't use the above command to constantly monitor G-WAN, use the more efficient ab.c program described below.
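If you prefer to watch the count from your own program rather than a shell, a minimal C sketch (an illustration – not how ab.c does it) lists /proc/<pid>/fd:

#include <stdio.h>
#include <dirent.h>

/* count the file descriptors currently held by a process by listing
   /proc/<pid>/fd (cheaper than a ps|grep|awk shell pipeline) */
int count_fds(int pid)
{
   char path[64];
   snprintf(path, sizeof path, "/proc/%d/fd", pid);
   DIR *d = opendir(path);
   if (!d)
      return -1;
   int n = 0;
   struct dirent *e;
   while ((e = readdir(d)))
      if (e->d_name[0] != '.')   /* skip "." and ".." */
         n++;
   closedir(d);
   return n;
}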
Virtualization (Hypervisors)
Virtualization is another hardware abstraction layer, on top of the OS kernel (which, to avoid more bugs, additional critical security holes, and further loss of performance, should be the only abstraction layer running on any given machine).
And it is not only slower – it also has a completely different performance profile, because everything is wrapped in new code (for example, memory allocation is notoriously degraded by virtualization, even more than other tasks).
One area where hypervisors are a notorious nuisance is in the (altered) detection of the CPU topology (number of CPUs, Cores and Threads per Core).
For some mysterious reason, hypervisors feel the need to corrupt the return values of the CPUID instruction and the Linux kernel's /proc/cpuinfo structure. Both were designed for the sole purpose of letting multi-threaded applications like G-WAN scale on multicore systems.
Some Linux distributions have recently innovated in this matter by also altering what the CPUID instruction returns.
Of course, broken CPU topology detection won't affect single-threaded servers like Nginx or Apache, but it will make G-WAN underperform by a factor of two or more. Even more sneakily, the ab.c test tool described here will also be affected – but only for multi-threaded servers like G-WAN.
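You can cross-check what your environment reports by querying the CPUID instruction directly. A minimal sketch for GCC/Clang on x86 (leaf 1 exposes a hypervisor-present bit and a rough logical-processor count):

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

int main(void)
{
   unsigned eax, ebx, ecx, edx;
   __get_cpuid(1, &eax, &ebx, &ecx, &edx);

   /* EBX bits 23:16 - max addressable logical CPU ids per package
      (an upper bound, not an exact Core count) */
   printf("logical CPUs per package: %u\n", (ebx >> 16) & 0xff);

   /* ECX bit 31 is always 0 on bare metal: hypervisors set it */
   printf("hypervisor: %s\n", (ecx & (1u << 31)) ? "yes" : "no");
   return 0;
}

On bare metal this agrees with /proc/cpuinfo; under a hypervisor the two frequently disagree – which is exactly the trap described above.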
So, instead of having the OS kernel as the bottleneck (like on a normal machine), then you have a (much) slower 'virtual machine' as the new bottleneck (see "Multi-Core scaling in a virtualized environment").
Unsurprisingly, if the speed limit is 30 km/h, even a sports car will not 'run faster' than a bicycle.
Beware what you are testing.
The TCP/IP stack needs to be warmed up
If you run the same AB (Apache Benchmark) test twice then you don't get the same results. Why?
Because the TCP/IP stack works in a conservative manner: it starts slowly and gradually increases its speed as traffic grows. When a new TCP connection is created, there is no way to know whether the server can read as fast as the client can send.
So the client has to send some data, and then wait for the server's confirmation that everything was received. Then it can slowly increase its speed, until the receiving side can no longer cope (either the server or the network is overwhelmed, or the traffic is rerouted to a slower/longer path). As a result, one short shot does not give the stack enough time to reach its optimal state – and such a test will hide how fast a server is and how well it copes with a growing load.
This is why small, isolated AB tests are so variable and so much less relevant than longer tests. And since a test needs to last longer to be relevant, why not use this time to check how the tested web server (or web application) behaves as the number of concurrent users grows? Why not check how much RAM and CPU resources the application and the system consume?
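This ramp-up can be observed directly: Linux exposes the congestion window of any connected socket via getsockopt(TCP_INFO). A minimal sketch – call it periodically on a benchmarking client's socket and watch the window grow as the stack warms up:

#include <stdio.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* print the congestion window of a connected TCP socket: a fresh
   connection starts with a small window that grows with traffic */
void print_cwnd(int fd)
{
   struct tcp_info ti;
   socklen_t len = sizeof ti;
   if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
      printf("cwnd: %u segments  rtt: %u us\n",
             ti.tcpi_snd_cwnd, ti.tcpi_rtt);
}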
Here you will see how to do all this – with one single command.
Caching vs. non-Caching
The very existence of G-WAN has created a debate because its in-memory caches (for static and dynamic contents), when enabled, more than double its performance (by removing the disk I/O bottleneck that... the OS kernel cache is supposed to remove already).
With G-WAN's goal being scalability, if an OS feature does not deliver the promised benefits, then G-WAN feels a duty to fill the gap. The question to ask is why mainstream servers did not do it before.
Other server vendors stated that it was not fair to compare (G-WAN + cache) to Nginx or Apache2 – despite the fact that they also employ diverse (albeit much slower) caching strategies, like a memcached server, or locally cached file blocks and metadata (open file descriptors, file size, etc.).
Why hidden caching strategies would be more legitimate than publicly disclosed ones (unlike G-WAN tests, many Web server benchmarks do not document their system tweaking and configuration files) is another interesting question.
Further, if dedicated cache servers like Apache Traffic Server (ATS) or Varnish are widely compared to Web servers like Nginx or Apache2 then rejecting the same test for G-WAN is... just not fair.
Since version 4.7 G-WAN has made it possible to disable its caches, and since v4.10 it disables caching by default (see the gwan/init.c file). This way, users can check that G-WAN is faster than others, whether caching is used or not.
Not all Benchmark Tools are created equal
As we have seen, one recurring issue with performance tests is the very relevance of the test. And one of the surest ways to distort a test is to use a benchmarking client that is slower than the tested server.
A server cannot send answers faster than it receives queries.
A surprisingly high number of recent "performance tests" insist on using completely irrelevant benchmark tools, either because they are obsolete or because their architecture and implementation cannot match the performance of the tested servers.
Using a slow (single-threaded, and/or Java, Python, Ruby, etc.) benchmark tool makes it possible to "demonstrate" that all the tested servers have similar performance – because what is tested then is no longer the server but the client tool.
Below, we will discuss the most widely known benchmarking tools and their relevance.
IBM - ApacheBench (AB)
To install ApacheBench:
sudo apt-get -y install apache2-utils
Basic usage (ab -h for more options):
ab -n 100000 -c 100 -t 1 -k "http://127.0.0.1:8080/100.html"
-n ........ number of HTTP requests
-c ........ number of concurrent connections
-k ........ enable HTTP Keep-Alives
-t ........ maximum number of seconds for the test
AB is reliable, simple to understand and easy to use. Its only defects are a relatively high CPU usage and its inability to put SMP (Symmetric Multi-Processing, called "hyperthreading" by Intel) or CMP (Chip Multi-Processing, called "multicore") servers – servers that use several worker threads – under pressure.
This is because AB (like many others, including Siege) uses one single thread and an outdated event-polling method. AB was made at a time when CPU Cores did not exist (before 2001) – and this now makes AB mostly irrelevant for load-testing a modern multi-threaded server.
Knowing this, the vendors of single-threaded servers usually use AB to compare themselves to multi-threaded servers: they look much slower with multi-threaded clients like the Weighttp tool presented below. See our comparison of AB and Weighttp benchmarks.
Lighttpd - Weighttp (WG)
Like IBM's AB, Weighttp has been written by Web server authors – probably because they felt the (real) need for a serious HTTP stress tool able to exercise modern multi-Core CPUs. To install Weighttp:

wget http://github.com/lighttpd/weighttp/zipball/master
unzip lighttpd-weighttp-v0.2-6-g1bdbe40.zip
cd lighttpd-weighttp-v0.2-6-g1bdbe40
sudo apt-get install libev-dev
gcc -g2 -O2 -DVERSION='"123"' src/*.c -o weighttp -lev -lpthread
sudo cp ./weighttp /usr/local/bin
Basic usage (weighttp -h for more options):
weighttp -n 100000 -c 100 -t 4 -k "http://127.0.0.1:8080/100.html"
-n ........ number of HTTP requests
-c ........ number of concurrent connections (default: 1)
-k ........ enable HTTP Keep-Alives (default: none)
-t ........ number of threads to use (default: 1; use one thread per CPU Core)
Based on epoll on Linux, Weighttp is much faster than AB – even with one single thread. But its real value shows when you use as many threads/processes as the targeted server has CPU Cores, because THIS IS THE ONLY WAY TO REALLY TEST A SMP/CMP SERVER (that is, a server using several worker threads attached to logical processors – by default G-WAN uses one thread per CPU Core).
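For reference, the epoll pattern that Weighttp relies on looks like this – a minimal sketch (an illustration, not Weighttp's actual code): one system call returns only the sockets that are ready, instead of select()'s linear scan over every descriptor on every call.

#include <sys/epoll.h>
#include <sys/socket.h>

void event_loop(int listen_fd)
{
   int ep = epoll_create1(0);
   struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
   epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

   struct epoll_event ready[1024];
   for (;;)
   {
      int n = epoll_wait(ep, ready, 1024, -1); /* only ready sockets */
      for (int i = 0; i < n; i++)
      {
         if (ready[i].data.fd == listen_fd)    /* new connection */
         {
            int c = accept(listen_fd, 0, 0);
            ev.events = EPOLLIN;
            ev.data.fd = c;
            epoll_ctl(ep, EPOLL_CTL_ADD, c, &ev);
         }
         else
         {
            /* ...read request / write reply / close on EOF... */
         }
      }
   }
}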
With Weighttp being so fast, you will almost certainly hit the TIME_WAIT state wall (see the TIME_WAIT fix above in the "System Configuration" paragraph).
Weighttp is by far the best stress tool we know today: it uses the clean AB interface and works reasonably well. It could be made even faster by using leaner code, but there are not many serious coders investing their time to write decent client tools, it seems.
Hewlett Packard - HTTPerf
Basic usage (httperf -h for more options):
httperf --server=127.0.0.1 --port=8080 --rate=100 --num-conns=100 --num-calls=100000 --timeout=5 --hog --uri=/100.html
Yes, HTTPerf is more complex than AB. This is visible at first glance in its syntax.
And HTTPerf does not let you specify the concurrency level, nor the duration of the test:
--num-calls ...... number of HTTP requests per connection (> 1 implies keep-alives)
--num-conns ...... total number of connections to create
--rate ........... number of connections to start per second
If we want 100,000 HTTP requests, we have to calculate which '--num-conns' and '--num-calls' values to specify:
nbr_req = num-conns x num-calls
A higher 'num-conns' makes the test last longer; '--rate' does not change the total, and to reach any given 'rate', 'num-conns' must always be >= 'rate'.
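For example (values chosen only for illustration): --num-conns=1000 --num-calls=100 produces 1,000 x 100 = 100,000 requests in total, while --rate=100 spreads the opening of those 1,000 connections over roughly 10 seconds.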
HTTPerf takes great care to create new connections progressively, and it only starts collecting statistics after 5 seconds. This was probably done to 'warm up' servers that have problems with 'cold' starts and memory allocation.
Removing this useful information from benchmark results makes them NOT reflect reality (where clients send requests in short but intense bursts).
Also, HTTPerf's pointlessly long runs for each test make the TIME_WAIT state a problem (see the TIME_WAIT fix above in the "System Configuration" paragraph).
Finally, HTTPerf cannot test client concurrency accurately: if rate=1 but num-conns=2 and num-calls=100000, you are more than likely to end up with concurrent connections (despite the rate=1) because not all HTTP requests will have been processed when the second connection is launched.
And if you use a smaller num-call value then you are testing the TCP/IP stack (creating TCP/IP connections is slow and this is done by the kernel, not by the user-mode HTTP server or Web application that you want to test).
As a result, HTTPerf can only be reliably used without HTTP Keep-Alives (with num-call=1). And even in this case, we have found ApacheBench (and even more weighttp) to be a far better proposition.
The (long) story of Nginx's "wrk"
We discovered wrk very late (5 years after G-WAN's first release), after it was mentioned in a stackoverflow question about G-WAN. wrk was written in late 2012 according to GitHub, and it is presented on various Web sites as "the benchmark tool written for Nginx". The fact that its HTTP parser is borrowed from Nginx's source code supports this statement.
Usage: wrk <options> <url>
Options:
    -c, --connections <N>  Connections to keep open
    -d, --duration    <T>  Duration of test
    -t, --threads     <N>  Number of threads to use
In contrast with AB or Weighttp, wrk creates all TCP connections first, and only then sends HTTP requests.
There's a "benchmark" where Nginx is presented as processing 500,000 requests/second and later even 1 million RPS, with a version of wrk modified to support pipelining (oddly, a feature removed from wrk after this test).
This test, using a server with two 6-Core CPUs, did not take into account the ability of Nginx to accept or close many TCP connections quickly: they were all pre-established.
In contrast, with half the CPUs and CPU Cores, G-WAN achieves 850,000 requests per second... TCP handshakes included!
Under common circumstances wrk is biasing HTTP benchmarks because:
- real Web users send an HTTP request immediately after the TCP handshake;
- wrk masks the ability (or the inability) of a server to quickly accept and close connections;
- G-WAN's DoS shield cuts established TCP connections that wait too long before providing HTTP requests.
If you use wrk with a duration (-d) exceeding G-WAN's timeout, or with so many connections (-c) that establishing them exceeds G-WAN's timeout, then G-WAN will cut most of those idle connections because, missing an HTTP request, they look like a Denial of Service (DoS) attack. As wrk reports those connection closes as "I/O errors", some users incorrectly concluded that G-WAN could not cope with the same load that Nginx handles (...if configured with the proper options).
But if you stay below G-WAN I/O timeouts then G-WAN passes the wrk test with better results than Nginx (nginx.conf):
wrk -d 5 -t 6 -c N "http://127.0.0.1:8080/nop.gif"

                G-WAN                   NGINX
    N        RPS        IO Rate      RPS         IO Rate
 ------  ------------  ---------  ------------  ---------
   100     668,884.20   188.18MB    101,627.86    22.78MB
    1k     648,054.56   182.32MB    314,922.46    70.58MB
   10k     456,831.62   128.52MB    301,124.29    67.49MB
  100k         141.91    40.88KB          4.04     0.93KB
Here G-WAN starts with a 2.2 MB footprint while nginx (which needs settings for heavy loads to pass this test) starts with a 454.73MB memory footprint (see the next table for more details).
Nginx pre-allocates buffers before receiving the first connection, so you have to make room for many connections in the nginx.conf file.
As opposed to fixed configuration-file options, G-WAN's adaptive values let resources be allocated dynamically, on a per-need basis.
The last case above, at 100,000 connections, is purposely pathological, to see how wrk copes with an impossible mission: at best there are 65,535 - 1,024 = 64,511 ephemeral ports available on localhost, far fewer than the 100,000 needed to establish 100,000 connections (binding the client and the server on different IP addresses would help).
Now, let's do something more difficult, with connection series growing by steps of 100k clients, up to 1 million concurrent clients:
./abc [0-1m:100k+100kx1] "127.0.0.1:8080/nop.gif"

 G-WAN                   CPU              RAM
--------  --------  ---------------  --------------  --------
 Clients       RPS     user   kernel   SRV MB  SYS MB    Time
--------  --------  -------  -------  ------  ------  --------
       1,    74348,      92,     325,    2.21,   0.0,  10:29:28
  100000,    31161,     104,     484,   29.55,   0.6,  10:29:31
  200000,   103890,     201,     735,   29.68,   0.1,  10:29:32
  300000,    29715,      93,     321,   38.07,   0.7,  10:29:35
  400000,   105434,     212,     739,   38.07,   0.2,  10:29:36
  500000,    25985,     105,     314,   51.91,   0.7,  10:29:40
  600000,    97000,     190,     689,   51.91,   0.2,  10:29:41
  700000,    25271,      89,     336,   53.07,   0.8,  10:29:45
  800000,    31639,     269,     670,   53.33,   0.0,  10:29:49
  900000,    25621,     138,     410,   64.34,   0.8,  10:29:52
 1000000,    84426,     279,     699,   64.34,   0.4,  10:29:54
----------------------------------------------------------
Total RPS:634,490 Time:29 second(s) [00:00:29]
----------------------------------------------------------

 NGINX                   CPU              RAM
--------  --------  ---------------  --------------  --------
 Clients       RPS     user   kernel   SRV MB  SYS MB    Time
--------  --------  -------  -------  ------  ------  --------
       1,    27275,      29,     106,  567.67,   0.0,  10:43:16
  100000,    29917,     135,     344,  596.06,   0.8,  10:43:19
  200000,    62279,      81,     199,  587.53,   0.1,  10:43:21
  300000,    30057,      95,     285,  606.52,   0.9,  10:43:24
  400000,    68045,      74,     174,  591.85,   0.2,  10:43:25
  500000,    28827,      99,     203,  596.98,   0.9,  10:43:29
  600000,    62716,      75,     196,  595.05,   0.3,  10:43:31
  700000,    28161,      96,     250,  606.65,   1.1,  10:43:34
  800000,    55241,      58,     164,  586.46,   0.4,  10:43:36
  900000,    27468,     117,     273,  604.63,   1.1,  10:43:40
 1000000,    53484,      84,     198,  595.52,   0.5,  10:43:42
----------------------------------------------------------
Total RPS:473,470 Time:31 second(s) [00:00:31]
----------------------------------------------------------
This time, we have used Weighttp and our abc wrapper to play on a concurrency range: from 0 to 1m, in increasing steps of 100k clients, each running 100k requests one time.
How come Weighttp manages to go up to 1 million concurrent clients when wrk was dying at a mere 100,000 connections?
Weighttp establishes (and closes) connections on-the-fly. This lets it achieve much higher concurrencies on localhost – albeit at the cost of performance, as we can see here.
RPS results are lower and 'shaky' because the faster the server, the faster free ports are consumed – and then new connections cannot be established until ports are released.
This example demonstrates how fragile the relevance of a tool or test procedure can be.
Vendors may seek to present big numbers to attract users, but numbers make sense only when people understand the nature of the test, and how much tweaking a server requires to merely pass it.
But there's one area where we have found that the particularities of wrk have a good use: when playing with concurrencies in the 10,000-50,000 range. Below that range wrk is too slow, but above it, wrk's pre-allocation tactic helps it scale where Weighttp suffers on localhost.
To use less CPU (and be even faster) G-WAN could cache file blocks and metadata (open file descriptor, file blocks, exist/size/time/type attributes) like Nginx. A future version of G-WAN may do it. But the tests above show that, for now, without any kind of caching, G-WAN is faster than Nginx because G-WAN's architecture and implementation are better.
This matters because on modern multicore systems, server applications are CPU-bound (CPUs are typically underused, pointlessly waiting for things to be done because of the poor multicore scalability of their code). Being able to saturate a greater bandwidth by better using your CPUs lets you do more work with less machines.
At this point, you may question why we do not use a LAN instead of localhost. A 1 Gigabit NIC is limited to ~100 MegaBytes per second. In the wrk test above, Nginx would not be limited: it transferred a maximum of 70 MB per second. But G-WAN would be slowed down by a factor of two(!) as it processed 188 MB per second.
Using a 1 Gigabit link in performance tests would make G-WAN look almost as slow as Nginx. That is, unless you have a 10/40/100 Gigabit LAN (with tuned OS kernels, optimized drivers, high-end switches, etc.) available on your desk to do tests.
Let's recapitulate: Weighttp works very well up to 10,000 clients, and wrk, which is slower until that point, works better for higher concurrencies up to 50,000 (which seems to be its limit on localhost). The next 10-second test confirms our prior findings:
G-WAN

wrk -d 10 -c 50000 -t 6 http://localhost:8080/nop.gif
Running 10s test @ http://localhost:8080/nop.gif
  6 threads and 50000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    52.22ms  161.94ms 882.82ms   95.60%
    Req/Sec    71.61k     6.51k   92.32k    77.27%
  4156389 requests in 10.02s, 1.14GB read
  Socket errors: connect 0, read 0, write 0, timeout 139871
Requests/sec: 414,625.76
Transfer/sec:     116.65MB

NGINX

wrk -d 10 -c 50000 -t 6 http://localhost:8080/nop.gif
Running 10s test @ http://localhost:8080/nop.gif
  6 threads and 50000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   200.72ms  267.68ms 815.53ms   79.69%
    Req/Sec    39.63k    10.04k   51.73k    63.51%
  2213810 requests in 10.00s, 496.14MB read
  Socket errors: connect 0, read 0, write 0, timeout 15336
Requests/sec: 221,364.51
Transfer/sec:      49.61MB
At a concurrency of 50,000 clients, wrk shows that G-WAN, while also enjoying a much lower latency, is twice as fast as Nginx (which uses less verbose HTTP headers than G-WAN).
G-WAN has proportionally more timeouts, but this is expected when the kernel is pushed to its limits (many more requests and a larger payload are processed by G-WAN).
Also, in all fairness, while G-WAN could be slowed down to avoid these timeouts, Nginx is its own bottleneck and can't do any better.
The tests above show that G-WAN is faster whatever the concurrency and the test tool (this point is important to reach a consensus: G-WAN is on firmer ground here because we use the benchmark tools made by others).
As we have seen, selecting a benchmark tool is not easy, as the proper choice depends on the environment (server hardware, network), the load, the type of requests (embedded resources, static file sizes, dynamic contents), and the concurrency.
Lighttpd's Weighttp and Nginx's wrk both have legitimate uses, and this is probably what makes benchmarking such a difficult matter: it requires a lot of time to understand how things really work, and enough honesty to recognize one's errors. After all, like many, we started with AB, until we discovered a "better tool" in Weighttp.
Comparing ApacheBench and Weighttp
We said that Weighttp is better than AB. But how far is this true? It depends on the number of CPU Cores: the more Cores you have, the faster Weighttp will be compared to AB. AB was designed at a time when multicore did not exist:
// -------------------------------------------------------------------------
// ab test: 7.1 seconds and 140,680 req/s
// -------------------------------------------------------------------------
ab -k -n 1000000 -c 300 http://127.0.0.1:8080/100.html

Server Software:        G-WAN
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /100.html
Document Length:        100 bytes

Concurrency Level:      300
Time taken for tests:   7.108 seconds
Complete requests:      1000000
Failed requests:        0
Write errors:           0
Keep-Alive requests:    1000000
Total transferred:      377000000 bytes
HTML transferred:       100000000 bytes
Requests per second:    140680.34 [#/sec] (mean)
Time per request:       2.132 [ms] (mean)
Time per request:       0.007 [ms] (mean, across all concurrent requests)
Transfer rate:          51793.44 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0      15
Processing:     0    2   0.1      2       6
Waiting:        0    2   0.1      2       6
Total:          0    2   0.1      2      17

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      2
  95%      2
  98%      2
  99%      2
 100%     17 (longest request)
// -------------------------------------------------------------------------
// weighttp test: 1.7 seconds and 595,305 req/s
// -------------------------------------------------------------------------
weighttp -k -n 1000000 -c 300 -t 4 http://127.0.0.1:8080/100.html

finished in 1 sec, 679 millisec and 808 microsec, 595305 req/s, 205217 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 353000000 bytes total, 253000000 bytes http, 100000000 bytes data
Clearly, ApacheBench (AB) is unable to benchmark G-WAN: AB is the bottleneck at 140k requests/second while Weighttp delivers 595k requests/second. Most of the difference comes from the 4 CPU Cores used by Weighttp (4 x 140 = 560), and the rest comes from the fact that Weighttp uses a faster event-based Linux mechanism called "epoll".
Someone sent us the following bash script, claiming that AB can be made relevant on multicore with the taskset tool. The only way to see if this is the case is to test it:
for i in `seq 0 3`; do
    taskset -c $i ab -k -c 50 -n 1000000 'http://127.0.0.1/100.html' | grep '(mean)' &> $i.log &
done
ab -k -c 50 -n 1000000 http://127.0.0.1/100.html
...
Time taken for tests:   7.094 seconds
Requests per second:    140965.17 [#/sec] (mean)
Time per request:       0.355 [ms] (mean)

./bench.sh
...
Requests per second:    72731.63 [#/sec] (mean)
Time per request:       0.687 [ms] (mean)
Requests per second:    73923.53 [#/sec] (mean)
Time per request:       0.676 [ms] (mean)
Requests per second:    71168.06 [#/sec] (mean)
Time per request:       0.703 [ms] (mean)
Requests per second:    74232.14 [#/sec] (mean)
Time per request:       0.674 [ms] (mean)
-----------------------------------------------
Total: 292,055.36 req/sec in 18 seconds
Even with taskset, the AB client does not scale on multicore: each instance of AB executed in parallel delivers half the result of a single instance of AB. As a benchmark tool, AB is the bottleneck: it just cannot be used to test servers that scale vertically.
The ab.c wrapper for Weighttp
ab.c initially supported ApacheBench and HTTPerf, but this made the code pointlessly unreadable, so they were ditched – their irrelevance is easily demonstrated anyway. The name 'ab.c' keeps a reference to IBM's 'AB' while marking the next step in performance measurement imposed by multicore CPUs, mainstream since 2001.
ab.c runs tests on the [1 - 1,000] concurrency range. This makes sense, especially if you use >= 3 rounds for each concurrency step to get a minimum, average and maximum value at each step.
With such a long (and continuous) string of tests, you get more relevant results. A general trend can be extracted from the whole test, and the slope of each server's results curve is as useful as its variability for interpreting the behavior of a program:
Running weighttp 1,000 times (or more) for each server, in a continuous way, and each time with different parameters, is a tedious task (best left to computers).
The ab.c program does just that: it lets you define the URLs to test, the range, and it collects the results in a CSV file suitable for charting with LibreOffice or gnuplot (apt-get install gnuplot).
ab.c can also measure the CPU and RAM resources consumed by the web server and the system (some servers delegate a significant part of their job to the system, potentially escaping measurements made on their own process(es) alone).
The ab.c file can either be run by G-WAN or be compiled by GCC (see the comments at the top of the ab.c file).
If you run it with G-WAN, copy ab.c into the directory where the gwan executable is stored and run G-WAN twice (once as a server, and a second time to run the ab.c test program):

sudo ./gwan                             (run the gwan server)
./gwan -r ab.c gwan 127.0.0.1/100.html  (run ab.c [server_name] <URL>)
If you compile ab.c with gcc -O2 ab.c -o abc -lpthread then do this:
sudo ./gwan                    (run the gwan server)
./abc gwan 127.0.0.1/100.html  (run ab.c [server_name] <URL>)
This will display: (saved in "result_gwan_100.html.txt")
===============================================================================
G-WAN ApacheBench / Weighttp / HTTPerf wrapper     http://gwan.ch/source/ab.c
-------------------------------------------------------------------------------
Now: Thu Sep 28 18:52:03 2013
CPU: 1 x 6-Core CPU(s) Intel(R) Xeon(R) CPU W3680 @ 3.33GHz
RAM: 6.88/7.80 (Free/Total, in GB)
OS : Linux x86_64 v#50-Ubuntu SMP Fri Mar 18 18:42:20 UTC 2011 2.6.35-28-generic
     Ubuntu 10.10 \n \l

> Server 'gwan' process topology:
---------------------------------------------
  6] pid:5622 Thread
  5] pid:5621 Thread
  4] pid:5620 Thread
  3] pid:5619 Thread
  2] pid:5618 Thread
  1] pid:5617 Thread
  0] pid:5506 Process RAM: 2.29 MB
---------------------------------------------
Total 'gwan' server footprint: 2.29 MB

/home/pierre/gwan/gwan G-WAN 4.9.28 64-bit (Sep 28 2013 15:07:23)

weighttp -n 1000000 -c [0-1000 step:10 rounds:3] -t 6 -k "http://127.0.0.1/100.html"

=> HTTP/1.1 200 OK
   Server: G-WAN
   Date: Thu, 28 Sep 2013 20:52:03 GMT
   Last-Modified: Thu, 13 Oct 2011 13:15:12 GMT
   ETag: "1f538066-4e96e460-64"
   Vary: Accept-Encoding
   Accept-Ranges: bytes
   Content-Type: text/html; charset=UTF-8
   Content-Length: 100
   Connection: close

=> XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...

 Number      Requests per second            CPU
   of      ----------------------------   ----------------
 Clients      min      ave      max        user   kernel   MB RAM    Time
--------  -------- -------- --------     ------- -------  -------  --------
1, 152678, 220251, 279337, 392, 1484, 2.32, 08:07:40
10, 283362, 301733, 318171, 474, 1568, 2.32, 08:07:50
20, 591956, 618668, 662597, 369, 1440, 2.32, 08:07:55
30, 718260, 739384, 777479, 425, 1286, 2.32, 08:07:59
40, 658536, 736343, 797682, 422, 1241, 2.32, 08:08:03
50, 775151, 788168, 800177, 378, 1274, 2.32, 08:08:07
60, 615532, 667071, 764336, 357, 1331, 2.32, 08:08:11
70, 792621, 802455, 816932, 334, 1303, 2.32, 08:08:15
80, 807515, 823596, 839406, 401, 1258, 2.32, 08:08:19
90, 756659, 797750, 824328, 343, 1300, 2.32, 08:08:22
100, 823602, 831829, 836091, 313, 1302, 2.32, 08:08:26
110, 726856, 797436, 837584, 392, 1349, 2.32, 08:08:30
120, 822744, 834193, 840570, 327, 1289, 2.32, 08:08:33
130, 828726, 840265, 846978, 418, 1220, 2.32, 08:08:37
140, 797041, 820530, 857351, 366, 1242, 2.32, 08:08:41
150, 778589, 818561, 847754, 366, 1311, 2.32, 08:08:44
160, 840447, 846376, 855455, 352, 1260, 2.32, 08:08:48
170, 840446, 844842, 848324, 366, 1243, 2.32, 08:08:51
180, 818502, 834756, 854787, 364, 1243, 2.32, 08:08:55
190, 818874, 830426, 844438, 318, 1307, 2.32, 08:08:59
200, 792267, 827353, 846006, 343, 1286, 2.32, 08:09:02
210, 719377, 800988, 846982, 980, 3549, 2.32, 08:09:06
220, 841211, 846224, 850133, 413, 1213, 2.32, 08:09:10
230, 843732, 847659, 850862, 369, 1248, 2.32, 08:09:13
240, 837850, 842696, 847033, 373, 1252, 2.32, 08:09:17
250, 840531, 846152, 849944, 364, 1260, 2.32, 08:09:20
260, 844300, 848033, 850777, 292, 1321, 2.32, 08:09:24
270, 829653, 838570, 856243, 355, 1246, 2.32, 08:09:27
280, 777657, 823413, 849086, 383, 1211, 2.32, 08:09:31
290, 840282, 843783, 849591, 320, 1297, 2.32, 08:09:35
300, 836449, 842222, 850771, 369, 1258, 2.32, 08:09:38
310, 815157, 835302, 850166, 331, 1298, 2.32, 08:09:42
320, 838998, 841898, 843357, 338, 1300, 2.32, 08:09:45
330, 829140, 835129, 841611, 331, 1304, 2.32, 08:09:49
340, 779995, 805174, 838465, 334, 1356, 2.32, 08:09:53
350, 843775, 844373, 845439, 362, 1274, 2.32, 08:09:56
360, 841599, 845898, 848848, 343, 1288, 2.32, 08:10:00
370, 793889, 824918, 843308, 369, 1297, 2.32, 08:10:03
380, 772776, 821580, 847750, 341, 1295, 2.32, 08:10:07
390, 837953, 845142, 850654, 299, 1327, 2.32, 08:10:11
400, 840662, 845963, 853726, 369, 1253, 2.32, 08:10:14
410, 803891, 831454, 845386, 331, 1302, 2.32, 08:10:18
420, 827370, 838941, 848850, 357, 1272, 2.32, 08:10:21
430, 842613, 845519, 847311, 387, 1239, 2.32, 08:10:25
440, 843442, 846206, 851032, 355, 1267, 2.32, 08:10:28
450, 785780, 823147, 845937, 385, 1252, 2.32, 08:10:32
460, 834546, 839099, 841910, 401, 1227, 2.32, 08:10:36
470, 830758, 835113, 840067, 357, 1276, 2.32, 08:10:39
480, 788556, 822908, 848298, 366, 1277, 2.32, 08:10:43
490, 833311, 841979, 848084, 371, 1253, 2.32, 08:10:46
500, 831943, 838932, 849478, 341, 1279, 2.32, 08:10:50
510, 836036, 842524, 849206, 355, 1274, 2.32, 08:10:54
520, 844894, 848526, 852083, 334, 1295, 2.32, 08:10:57
530, 840880, 847663, 853107, 299, 1330, 2.32, 08:11:01
540, 788728, 824870, 845167, 396, 1300, 2.32, 08:11:04
550, 834439, 842538, 851526, 404, 1230, 2.32, 08:11:08
560, 833872, 838739, 844909, 394, 1227, 2.32, 08:11:12
570, 831906, 839622, 846857, 371, 1253, 2.32, 08:11:15
580, 798701, 826954, 843199, 355, 1269, 2.32, 08:11:19
590, 801124, 827933, 846073, 338, 1304, 2.32, 08:11:22
600, 826464, 837761, 850053, 336, 1283, 2.32, 08:11:26
610, 817589, 830982, 838716, 344, 1325, 2.32, 08:11:30
620, 797157, 824149, 844845, 324, 1325, 2.32, 08:11:33
630, 835954, 839543, 843219, 355, 1286, 2.32, 08:11:37
640, 805020, 813326, 827024, 355, 1281, 2.32, 08:11:40
650, 727524, 798570, 836528, 401, 1246, 2.32, 08:11:44
660, 806741, 829463, 841187, 390, 1251, 2.32, 08:11:48
670, 838255, 844630, 850748, 357, 1274, 2.32, 08:11:51
680, 830290, 840271, 847011, 406, 1237, 2.32, 08:11:55
690, 843669, 845673, 846854, 338, 1290, 2.32, 08:11:59
700, 839085, 844716, 848786, 331, 1300, 2.32, 08:12:02
710, 825393, 835650, 847304, 366, 1279, 2.32, 08:12:06
720, 822331, 836362, 848265, 313, 1335, 2.32, 08:12:09
730, 841236, 844834, 848198, 322, 1318, 2.32, 08:12:13
740, 837295, 839035, 841725, 348, 1279, 2.32, 08:12:16
750, 842605, 844794, 847091, 357, 1274, 2.32, 08:12:20
760, 838908, 843304, 845683, 313, 1328, 2.32, 08:12:24
770, 833358, 836914, 838761, 348, 1269, 2.32, 08:12:27
780, 839677, 843890, 848215, 376, 1265, 2.32, 08:12:31
790, 810000, 832713, 844356, 364, 1267, 2.32, 08:12:34
800, 796423, 825282, 845784, 376, 1291, 2.32, 08:12:38
810, 828858, 837415, 845210, 380, 1248, 2.32, 08:12:42
820, 826608, 836079, 843948, 350, 1288, 2.32, 08:12:45
830, 797722, 824175, 838571, 362, 1232, 2.32, 08:12:49
840, 827289, 834943, 840495, 334, 1314, 2.32, 08:12:52
850, 829754, 837062, 840883, 348, 1295, 2.32, 08:12:56
860, 811395, 827793, 838070, 390, 1262, 2.32, 08:13:00
870, 823325, 828099, 837477, 365, 1281, 2.32, 08:13:03
880, 833904, 838703, 844646, 378, 1258, 2.32, 08:13:07
890, 823148, 832111, 839696, 362, 1295, 2.32, 08:13:10
900, 831513, 835048, 837028, 366, 1283, 2.32, 08:13:14
910, 833647, 835858, 837625, 331, 1309, 2.32, 08:13:18
920, 804111, 824672, 836163, 341, 1324, 2.32, 08:13:21
930, 821785, 827819, 833732, 348, 1307, 2.32, 08:13:25
940, 821629, 829049, 839868, 350, 1288, 2.32, 08:13:29
950, 797319, 808653, 826050, 364, 1279, 2.32, 08:13:32
960, 828641, 833717, 836952, 348, 1295, 2.32, 08:13:36
970, 817147, 825612, 831693, 338, 1293, 2.32, 08:13:40
980, 822798, 831615, 837336, 313, 1323, 2.32, 08:13:43
990, 699485, 785803, 831896, 315, 1212, 2.32, 08:13:47
1000, 715512, 774247, 823320, 352, 1179, 2.32, 08:13:51
-------------------------------------------------------------------------------
min:80528704  avg:82290105  max:83730021  Time:388 second(s) [00:06:28]
-------------------------------------------------------------------------------
CPU jiffies: user:323931  kernel:1170785  total:1494716
Why such a test? Servers can hardly be compared without all this information:
- Performance (requests per second, total elapsed time)
- Scalability (the [1-1,000] concurrency range)
- Efficiency (CPU and RAM resources)
The min, ave and max columns show the requests per second over the number of rounds (here "10") at the specified concurrency (which differs on each line).
The totals at the bottom of the min, ave and max columns are the SUMS of all the values above. These values are useful to quickly compare the final scores of several tests (but they are not averages per second: (a) these SUMS are amounts unrelated to the TOTAL elapsed time, and (b) they do not cover the whole interval of time, both because of the concurrency step and because the test itself takes time to execute – each server being unable to accept connections and process requests at the same pace).
The CPU and RAM values are amounts of jiffies for the CPU (a unit defined by the Linux kernel, unrelated to a percentage of the available CPU resources) and of MB for the RAM used by G-WAN.
To get a taste of what ab.c does, here are comparisons made with this test processing 1 billion HTTP requests:
(weighttp requests per concurrency: 1 million, range: [1-1,000], concurrency step: 10, concurrency rounds: 10)
100.html (100-byte static file, keep-alives)

G-WAN server    min:80528704  avg:82290105  max:83730021  Time: 1292 seconds [00:21:32]
Lighty server   min:21218587  avg:21648673  max:21956268  Time: 4740 seconds [01:19:00]
Nginx server    min:15072297  avg:15927773  max:16797720  Time: 6823 seconds [01:53:43]
Varnish cache   min: 8817943  avg: 9612933  max:10399610  Time:10817 seconds [03:00:17]

hello world

G-WAN + C       min:69718806  avg:76476595  max:80158520  Time: 1551 seconds [00:25:51]
G-WAN + Java    min:67829055  avg:72811610  max:75972646  Time: 1648 seconds [00:27:28]
G-WAN + Scala   min:67646477  avg:72637502  max:75776744  Time: 1660 seconds [00:27:40]
G-WAN + JS      min:68204491  avg:73775202  max:76865982  Time: 1696 seconds [00:28:16]
G-WAN + Go      min:69063401  avg:75203358  max:78411364  Time: 1892 seconds [00:31:32]
G-WAN + Lua     min:67727190  avg:72511554  max:75758859  Time: 1920 seconds [00:32:00]
G-WAN + Perl    min:69802019  avg:75089420  max:78208829  Time: 1977 seconds [00:32:57]
G-WAN + Ruby    min:69274839  avg:74538113  max:77808764  Time: 2054 seconds [00:34:14]
G-WAN + Python  min:69158531  avg:74223281  max:77418044  Time: 2110 seconds [00:35:10]
G-WAN + PHP     min:56242039  avg:59764709  max:61338987  Time: 2212 seconds [00:36:52]
Tomcat          min: 5715150  avg: 6709361  max: 7655606  Time:20312 seconds [05:38:32]
Node.js         min: 1239556  avg: 1336105  max: 1420920  Time:80102 seconds [22:15:02]
Google Go       min: 1148172  avg: 1208407  max: 1280151  Time:84811 seconds [23:33:31]
Those results are sorted by order of performance (shortest elapsed time for the whole test). Divide the min/avg/max numbers by 100 (they are the SUM of 100 values) to get an estimate of the average RPS over the whole [1-1,000] concurrency range.
Note that, depending on the server being tested, a test can take several weeks. This is usually correlated with the RPS, but other factors come into play, like the ability of server processes to accept connections on-the-fly (some are filling a backlog).
How to Spot "Creative Accounting" (aka: Fake Tests, F.U.D., etc.)
That's easier than anticipated – for three reasons: (a) there's no real work nor skill behind FUD campaigns, (b) to stay anonymous, their authors often use new accounts, unknown nicknames, or young guys desperate to find a job rather than seasoned experts (understandably unwilling to shame themselves), and (c) "influencing the minds" only needs to focus on volume (you have to "occupy the space") rather than on correctness:
"It is easier to believe a lie that one has heard a thousand times than to
believe a fact that one has never heard before."
– Robert Lynd
Here is a quick check-list:
- the environment is not correctly documented (system, hardware, server configuration, etc. see above)
- on their 2013 6/8-Core CPU(s) G-WAN is slower than (or just as fast as) on a 2006 Core2 Duo (very common)
- the benchmark tool has been chosen for its irrelevance (it's the bottleneck, it's pre-establishing connections, etc.)
- "Server X" is much faster than G-WAN... but "Server X" will die in pain with the loan.c test (despite G-WAN using a script).
- the published figures target a very narrow concurrency range rather than G-WAN's [0-1,000] or even [0-1,000,000] tests... and/or they just carefully avoid any resemblance to reality (sadly, outright plain lies are often the most widely published).
Again, an easy-to-check criterion is the extraordinary volume of copies of certain "neutral tests" or hate blog posts: if the authors are not part of a well-funded FUD campaign, how is it possible for their prose to be duplicated on hundreds of Web sites and to be "validated" by a chain of comments from anonymous accounts and fake identities?
Logic dictates that, the more you see FUD against someone, the more the target is a threat to the business of the FUD sponsors.
Conclusion
The fact that benchmarking tools do not tell you how to make successful tests should raise some questions – like the fact that Web/Proxy server vendors rarely make extensive comparative benchmarks, and forget to document them when they do bother to publish their tests.
When facing a choice in server technologies, do your homework! At least now you know how to proceed.