Theory vs. Practice
Diagnosis is not the end, but the beginning of practice.
› What makes something "scalable"?
Almost all Web servers and Database servers distributed today are "highly-scalable". Or so say the vendors because that's a keyword that end-users, media and search engines value as the promise for big savings.
And scalability indeed matters: it is the ability to perform well while concurrency grows. For a server, it means that one or 10,000 users must be served with low latency (this is the application's responsiveness, also called 'user experience').
Let's consider G-WAN and the ORACLE noSQL Database.
This demo will be presented in the noSQL and Big Data sessions of the ORACLE Open World (OOW), a 45,000-person event held in San Francisco on Sept. 30 – Oct. 4, 2012:
On the left, there is a photo of the session pitch taken by Alex.
The G-WAN presentation (the G-WAN-based PaaS) can be found here.
The Initial Demo
In early 2012, Alex from EON (the G-WAN-based PaaS in San Mateo) told me "We have to use noSQL from ORACLE". That was the beginning of it. First, Java support was needed for G-WAN, so this was done. Second, we were looking for a relevant use case.
On Sept. 8th Alex sent me a 6.4 MB Tomcat JSP demo adding "we need to accelerate it for the OOW". I was not given any specification, just this JSP application containing 3,455 files. The README.txt file stated:
Populate the database with 10,000 avatars records. Among them 1,000 avatars' position range is { 0 < x < 600, 0 < y < 600}, which is related with current game board size 600 * 600. Others' x, y coordinates are all larger than 600.
The rest was left as an exercise for the reader (the code is the documentation). I had 20 days to "accelerate" that demo with G-WAN.
Here is a screenshot made on Sept. 12th after Alex gave me access to a running demo so I had a chance to know what it looked like:
This grid is the 600 x 600 game board where the moving 1,000 bots are constrained.
Note how the same user names are duplicated. This is because names are collected when people log in and stored in a database to decorate the 1,000 moving avatars (or new human players joining the game).
The other 9,000 avatars were kept (still) outside of this area, which makes it much easier to find who is in the vicinity of a human player and way faster to update a limited set of DB records.
Nearby bots are not within the 'radar' yellow circle, but I believe that this was corrected at a later time in the Tomcat JSP application.
In this demo, players were moving the yellow dot, not the game board background.
As a result, once you moved outside of the game board the yellow dot was invisible and the grid was empty. At least that was consistent with the decision to limit the bots' moves to the game board. But this was not exactly my idea of how to illustrate a scalable architecture in a scalability demo. A few things clearly deserved to be changed for the sake of the exercise – and to show the true value of the technologies involved: a Java Application server and ORACLE's Java noSQL.
The impatient will jump to the bottom of this page to have a quick look at the 'enhanced' version.
The New Demo
Reading the code, I found that Tomcat received 3 requests per second and noSQL 3x 1k records DB search + 3k updates per second.
Reading this, I scratched my head, wondering how G-WAN could "visibly accelerate" such a demo.
Other than showing a lower latency than Tomcat, it was not clear to me how G-WAN could shine in such an environment: G-WAN can't serve requests faster than the Internet Browser sends them – and that was 3 times per second – not really a challenge.
As raising this refresh rate would break the demo on the Internet (where the typical end-user network latency ranges between 100 and 300ms), I decided instead to break the (unspecified) rules, keeping the HTML5 canvas demo and the 300ms refresh delay, but with a different load:
demo version | number of bots | moves /second | moves area | visitable area | records searched /client/second | record updates /second |
---|---|---|---|---|---|---|
Original | 10,000 | 3,000 | game board | game board | 3,000 | 3,000 |
New | 100,000,000 | 300,000,000 | whole arena | whole arena | 300,000,000 | 300,000,000 |
Now, should Tomcat attempt to follow the new (specified) rules, there would be some "visible" difference between the G-WAN demo and Tomcat's (the latter would immediately die). G-WAN was now in a position to demonstrate its ability to act as a caching reverse proxy (for the noSQL DB backend) as well as a Web application server in this modern AJAX + HTML5 demo.
Plus, searching, processing and updating 100 million records stored in a persistent DB – within 300ms – is a decent challenge.
Here are the latency values for a 3.33GHz 6-Core Xeon W3680 CPU (with plenty of room for concurrency):
number of bots | 1,000,000 | 10,000,000 | 100,000,000 |
---|---|---|---|
db latency | 0.6 ms | 7.1 ms | 79.5 ms |
Remember that those figures are for 1 single client making 3 requests per second. For 2 clients, we have twice as many records searched per second because each client has a different position and because – in my version of the demo – all the bots are constantly moving in the whole arena.
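The load and latency figures above can be tied together with a little arithmetic. The sketch below is only an illustration: the function names are made up for this example, and it assumes, as the table suggests, that every request searches all the bots and that each client sends 3 requests per second.

```c
#include <stdio.h>

// Records searched per second: every request scans all the bots and each
// client sends `req_per_sec` requests (3 in this demo).
// (hypothetical helper, not taken from the demo's source code)
double records_searched_per_sec(double nbr_bots, double req_per_sec,
                                double nbr_clients)
{
   return nbr_bots * req_per_sec * nbr_clients;
}

// Average DB time spent per record for one request, in nanoseconds.
double ns_per_record(double db_latency_ms, double nbr_bots)
{
   return db_latency_ms * 1e6 / nbr_bots; // 1 ms = 1,000,000 ns
}

// Print the derived figures for the three measured configurations.
void print_load_report(void)
{
   const double bots[]  = { 1e6, 1e7, 1e8 };
   const double db_ms[] = { 0.6, 7.1, 79.5 }; // latencies quoted above
   int i;
   for(i = 0; i < 3; i++)
      printf("%.0f bots: %.0f records/s for 1 client, %.3f ns per record\n",
             bots[i], records_searched_per_sec(bots[i], 3, 1),
             ns_per_record(db_ms[i], bots[i]));
}
```

With the measured latencies, the cost per record works out to 0.6, 0.71 and 0.795 ns, so the DB latency grows roughly linearly with the number of bots, and a second client doubles the number of records searched per second.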
As surfaces grow with the square of their side length, to keep a constant density of bots per square meter (this was needed for the demo to run with 10k, 100k, 1m, 10m and 100m bots) I settled on the following formula:
// Firefox 15.0.1 does it but Chrome 20.0.1132.47 does not support the
// Mathematical Markup Language (MathML) <math><msqrt> tags
//
// so here is the MathML version followed by plain old C source code:
//
// arena_length = √nbr_bots x 75  // not all browsers print a '√' around nbr_bots

arena_length = ceil(sqrt(nbr_bots)) * 75; // the same formula in C
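To see what this formula gives at each scale, here is a small self-contained sketch. It is not the demo's code: the helper names are invented for the illustration, and an integer ceil-sqrt loop replaces `ceil(sqrt())` so that the example has no floating-point rounding concerns.

```c
#include <stdio.h>

#define BOT_SPACING 75 // arena length allotted per bot along one side

// Smallest integer r such that r * r >= n, i.e. ceil(sqrt(n)) for n >= 0.
long ceil_sqrt(long n)
{
   long r = 0;
   while(r * r < n)
      r++;
   return r;
}

// Side length of the square arena for a given number of bots.
long arena_length(long nbr_bots)
{
   return ceil_sqrt(nbr_bots) * BOT_SPACING;
}

// Bots per unit of surface; stays close to 1 / (75 * 75) at every scale.
double bot_density(long nbr_bots)
{
   double side = (double)arena_length(nbr_bots);
   return (double)nbr_bots / (side * side);
}

// Print the arena side and density for the bot counts used by the demo.
void print_arena_sizes(void)
{
   const long counts[] = { 10000, 100000, 1000000, 10000000, 100000000 };
   int i;
   for(i = 0; i < 5; i++)
      printf("%9ld bots: arena side = %6ld, density = %.8f\n",
             counts[i], arena_length(counts[i]), bot_density(counts[i]));
}
```

For 10k, 100k, 1m, 10m and 100m bots the arena side comes out as 7,500, 23,775, 75,000, 237,225 and 750,000 respectively, while the density stays near 1 bot per 5,625 square units.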
The game board size remains the same: it is just a window showing a (now moving) part of the whole arena where all bots move in real time (some human players may stay still for a while – that's their choice).
In such a massive dynamic environment, cache servers like Varnish make no sense: no two server replies will be the same because all bots (and human players) move constantly.
With these new rules, there was no alternative to efficiency – the point that we precisely wanted to illustrate with G-WAN.
New Challenges
What is trivial in the thousands becomes non-trivial in the millions. Scalability is the very difference between client (1 user) and server (many users) applications.
That issue obviously required particular attention, like involving threading (to parallelise synchronous tasks), optimized processing (the routines to move bots randomly) and proper data structures (to store avatars, their bot or human status, and their constantly varying position, orientation, direction and speed).
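As an illustration of these three points, here is a minimal sketch of what a compact avatar record and a random-move routine could look like. Everything in it is an assumption made for the example (the `avatar_t` layout, the 8 compass directions, the xorshift generator, the wrap-around at the arena edges); it is not the code used in the demo.

```c
#include <stdint.h>

// One avatar (bot or human player); layout and field names are hypothetical.
typedef struct
{
   uint32_t id;          // unique id number
   float    x, y;        // position in the arena
   float    speed;       // distance travelled per update along each axis
   uint8_t  is_human;    // 0: bot, 1: human player
   uint8_t  direction;   // course, one of 8 compass directions (0..7)
   uint8_t  orientation; // which way the avatar faces (0..7)
} avatar_t;

static const float DIR_X[8] = { 1, 1, 0, -1, -1, -1,  0,  1 };
static const float DIR_Y[8] = { 0, 1, 1,  1,  0, -1, -1, -1 };

// Marsaglia's xorshift32 generator; the state must be non-zero.
// One state per thread avoids any locking.
uint32_t xorshift32(uint32_t *state)
{
   uint32_t v = *state;
   v ^= v << 13; v ^= v >> 17; v ^= v << 5;
   return *state = v;
}

// Keep a coordinate inside [0, arena_length) by wrapping around the edges.
float wrap(float v, float arena_length)
{
   while(v < 0) v += arena_length;
   while(v >= arena_length) v -= arena_length;
   return v;
}

// Move the bots stored in a[begin .. end-1]; human players are skipped
// because they are moved by their own requests.
void move_bots(avatar_t *a, long begin, long end, float arena_length,
               uint32_t *rng)
{
   long i;
   for(i = begin; i < end; i++)
   {
      if(a[i].is_human)
         continue;
      if((xorshift32(rng) & 7) == 0) // new course about once every 8 updates
      {
         a[i].direction   = (uint8_t)(xorshift32(rng) & 7);
         a[i].orientation = a[i].direction;
      }
      a[i].x = wrap(a[i].x + DIR_X[a[i].direction & 7] * a[i].speed, arena_length);
      a[i].y = wrap(a[i].y + DIR_Y[a[i].direction & 7] * a[i].speed, arena_length);
   }
}
```

Because `move_bots()` works on a `[begin, end)` slice with its own generator state, each worker thread could be handed a disjoint slice of the array, which is one simple way to parallelise the update without sharing any mutable data between threads.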
I also faced the fact that JSP containers are not portable: their implementation relies on calls that are not part of Java but rather proprietary APIs of a given Java app. server. As a result, while G-WAN supports Java, to support JSP it would have to rely on a third-party JSP container and this would most likely violate some third-party IP rights, including patents.
To avoid this issue, it was decided to rewrite the demo from scratch and to avoid using JSP scripts.
But some other mind-boggling challenges were unexpected:
Take Firefox. The demo worked for a minute or so and then stopped with a cryptic error. Chrome, tested later, worked as expected. It took me a while to find that Firefox did not fail if the machine was connected to the Internet, while Chrome worked in both cases. Firefox also failed with a JavaScript function accepting a variable number of parameters: only one version of the function was actually implemented (the latest version of both browsers was used). Great ways to waste one's time figuring out what can cause an error first attributed to one's code rather than to the (inconsistent) development environment.
Once rewritten, this demo made some usability points work differently:
- human players (in blue) can travel the whole arena – where all the bots (in black) are moving. Dots are replaced by centered labels with a more readable opaque background. To see which bots are roaming they do not use (duplicated) human names: they are identified by their unique id number.
- the player stays in the middle of the game board. Like in old 2D arcade games this is the game board background which is now scrolling to visit the whole arena. And while new human players are all occupying the same position after they log in, they can visit the whole arena instead of being limited to the game board.
- rather than a blank grid, the demo background displays a map of San Francisco (these are screenshots; live Google Maps could have been used if I knew for sure that the demo machine at the OOW would be connected to the Internet).
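The scrolling window described in the list above can be sketched as a simple coordinate conversion. The code below is a hypothetical illustration, not the demo's implementation: it assumes a 600 x 600 board centred on the player, ignores what happens at the arena edges, and uses a plain linear scan over all bots.

```c
#define BOARD_SIZE 600 // the game board is a 600 x 600 window on the arena

typedef struct { float x, y; } point_t; // hypothetical 2D position type

// Convert an arena position to game-board coordinates for a player kept at
// the centre of the board. Returns 1 and fills `out` when the position falls
// inside the window, 0 otherwise (`out` is then left untouched).
int arena_to_board(point_t player, point_t bot, point_t *out)
{
   const float half = BOARD_SIZE / 2.0f;
   float bx = bot.x - player.x + half;
   float by = bot.y - player.y + half;
   if(bx < 0 || bx >= BOARD_SIZE || by < 0 || by >= BOARD_SIZE)
      return 0; // outside the window: nothing to draw
   out->x = bx;
   out->y = by;
   return 1;
}

// Count the bots visible around a player by scanning every bot.
long count_visible(point_t player, const point_t *bots, long nbr_bots)
{
   long i, visible = 0;
   point_t on_board;
   for(i = 0; i < nbr_bots; i++)
      visible += arena_to_board(player, bots[i], &on_board);
   return visible;
}
```

A player standing at (1000, 1000) is always drawn at (300, 300); a bot at (1299, 701) lands at (599, 1) on the board, while a bot at (1300, 1000) is just outside the window. Since each player has a different position, the scan has to be repeated for every client.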
Also, to make the engine's performance "visible" (an obvious requirement for a scalability demo) a real-time dashboard is exposing the latencies (Network, Server and Database) to make it clear where the time is spent.
The Internet browser knows only the Network time, so I have added a new G-WAN performance counter for the elapsed request execution time (which, like the DB time, is passed to the client by the G-WAN server-side script).
Here is a screenshot of what I finally came up with:
This new demo looks better and it offers more insight: only G-WAN can make ORACLE noSQL scale with such a low latency.
In fact, I do not know any other application server able to merely run this demo with 100 million bots – even the super-star of all in-memory DB engines, VoltDB, boasts about a mere 675k TPS on... 12 servers.
Here G-WAN does 1,777 times better than VoltDB – and on 1 single server:
100 million record updates * (1,000 ms / 84 ms) = 1.2 billion TPS
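The arithmetic behind these two figures can be checked with a few lines of C. This is only a worked example with hypothetical helper names; the inputs (100 million updates, 84 ms, 675k TPS) are the ones quoted in the text.

```c
#include <stdio.h>

// Record updates per second when `nbr_records` are updated in `latency_ms`.
double updates_per_sec(double nbr_records, double latency_ms)
{
   return nbr_records * (1000.0 / latency_ms);
}

// How many times `ours` exceeds `theirs`.
double speedup(double ours, double theirs)
{
   return ours / theirs;
}

// Print the throughput and the ratio, with and without rounding.
void print_tps_report(void)
{
   double exact   = updates_per_sec(100e6, 84.0); // about 1.19 billion
   double rounded = 1.2e9;                        // figure used in the text
   printf("updates/second: %.0f\n", exact);
   printf("ratio (rounded 1.2 billion): %.0f\n", speedup(rounded, 675e3));
   printf("ratio (unrounded):           %.0f\n", speedup(exact, 675e3));
}
```

The 1,777 ratio follows from the rounded 1.2 billion figure; starting from the unrounded 1.19 billion, the ratio is about 1,764, which does not change the order of magnitude.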
That was a decent Parallel Programming challenge illustrating G-WAN's value for online Communities, online Games, Social Networks, Auction sites, High-Frequency Trading, SaaS, PaaS, and Big Data.
Occupying my time for the past weeks, writing this demo has delayed the new G-WAN release (with C# scripts, an elastic reverse proxy and load balancer, etc.), but at the same time it made me find and fix a JavaScript comment minifying bug and add a new get_env(argv, REQUEST_TIME) value to let G-WAN scripts find the time used by G-WAN to process a request (validating and parsing it + optionally executing handlers + generating the reply if REQUEST_TIME is invoked at the end of a servlet). All in all, that was a positive experience that will certainly benefit the server ecosystem.