Tuesday, April 26, 2011

The More Things Change...

Jeff Atwood's latest post pointing out the obvious lessons of the recent minor EC2 outage at Amazon reminded me once again of how everything in the computer industry moves in painful cycles of forgetting and reinvention. In this case, the specific technique he describes Netflix using, their "Chaos Monkey", is essentially the same thing Tandem Computers did routinely in their Guardian OS some 30 years before Netflix.

It's hard to overstate just how many great innovations in computing Tandem pioneered; commercially, their main problem was really being so far ahead of their time in so many ways. Many of the architectural innovations they didn't just invent but turned into commercial successes are being directly copied in the modern web era; building distributed web systems out of EC2 virtual machines and message queuing systems (or their equivalents in the Google or Azure ecosystems) is just a scaled-out version of what happened inside a Tandem mainframe's chassis.

In the case of the Chaos Monkey, it's important to understand that Tandem's fault-tolerant machines weren't built with triple-modular redundancy, as a lot of people always seemed to assume. Rather, the fault tolerance of the system came from comprehensive hardware error detection (obviously you have to detect faults to recover from them) combined with having no single point of failure. This did mean that there were duplicates of things inside the chassis: the disks, for instance, had two independent paths to them from different CPU nodes, and the CPU nodes didn't share memory but were instead connected by two separate internal interconnects (in essence, two mini-LANs). But all of this hardware actually got used; it wasn't there "just in case".

The brilliance of the Guardian OS architecture (and remember, this was set up in the 1970s) was that it was a microkernel-based distributed system, composed of small servers (small by necessity, as the architecture was essentially 16-bit, so a service process had 64k 16-bit words of data storage) which communicated via explicit, fault-tolerant message queues. Each service process, as it worked on messages in its queue, would checkpoint its memory to a separate warm standby copy of itself, which the executive would always ensure ran on a separate CPU node inside the chassis - collectively, the nodes formed a single effective computer, in what we would now call a cluster.
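To make the shape of that concrete, here's a minimal sketch in Python of the process-pair idea. The names (ServicePrimary, ServiceBackup, checkpoint, run_once) are invented for illustration and bear no relation to Guardian's actual interfaces; the point is just that the primary works through its message queue and checkpoints its state, plus the unprocessed remainder of the queue, to a warm standby before acknowledging each message.

    import copy

    class ServiceBackup:
        """Warm standby: holds the last checkpointed state and unhandled queue."""
        def __init__(self):
            self.state = {}
            self.pending = []

        def checkpoint(self, state, pending):
            # In Guardian this crossed the interconnect to another CPU node;
            # here we just copy within the process.
            self.state = copy.deepcopy(state)
            self.pending = list(pending)

    class ServicePrimary:
        def __init__(self, backup):
            self.state = {"balance": 0}   # the process's working data
            self.queue = []               # toy stand-in for the message queue
            self.backup = backup

        def deliver(self, msg):
            self.queue.append(msg)

        def run_once(self):
            if not self.queue:
                return
            msg = self.queue[0]
            self.state["balance"] += msg["amount"]   # do the work
            # Checkpoint the state *and* the remaining queue before
            # acknowledging, so the standby can resume from exactly here.
            self.backup.checkpoint(self.state, self.queue[1:])
            self.queue.pop(0)

    backup = ServiceBackup()
    svc = ServicePrimary(backup)
    svc.deliver({"amount": 100})
    svc.run_once()
    print(svc.state, backup.state)   # both reflect the post-message state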

If anything untoward happened to a node - a CPU fault, an ECC memory failure, just about anything - Guardian did not try to work out exactly what the consequences of the failure were and stitch up an ad-hoc response to keep running (essentially the hopeless approach encouraged, to no good end, by exception handling in programming languages like Java and C++). Instead, the node was simply shut down, the in-progress work abandoned, and the other nodes began a clever recovery process.

In this recovery, the warm standby copy of each affected service process (including its message queue) was located and started up from the last saved checkpoint, and new standby copies were created on freshly selected nodes; operation resumed relatively seamlessly, with little required of the clients.
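As a rough illustration of that recovery step (again with invented names - this is a sketch of the idea, not of Guardian's real mechanism), the logic amounts to promoting the standby of every pair whose primary lived on the failed node, and re-creating any lost standbys on freshly chosen nodes:

    import itertools

    class Pair:
        """A service placed as a primary/backup pair on two different nodes."""
        def __init__(self, name, primary_node, backup_node):
            self.name = name
            self.primary_node = primary_node
            self.backup_node = backup_node

    def recover(pairs, live_nodes, failed_node):
        """Fail-stop recovery: promote standbys, re-create lost ones elsewhere."""
        live_nodes.discard(failed_node)
        spare = itertools.cycle(sorted(live_nodes))
        for p in pairs:
            if p.primary_node == failed_node:
                # The warm standby takes over from its last checkpoint...
                p.primary_node = p.backup_node
                p.backup_node = None
            if p.backup_node in (failed_node, None):
                # ...and a new standby is started on a freshly chosen node.
                p.backup_node = next(n for n in spare if n != p.primary_node)

    # Example: three nodes, two service pairs; node 1 fails.
    nodes = {0, 1, 2}
    pairs = [Pair("disk-server", primary_node=1, backup_node=2),
             Pair("tx-monitor", primary_node=0, backup_node=1)]
    recover(pairs, nodes, failed_node=1)
    for p in pairs:
        print(p.name, p.primary_node, p.backup_node)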

This, combined with transactional management of writes to disk - these systems were built for online transaction processing, so their main workload was database management - meant that the microkernel provided software fault tolerance for the system as a whole.

The relevance to the Chaos Monkey is in how Tandem continually tested their recovery system: one of the standard services that could be run was a process which did nothing but wait some amount of time and then issue a command to reset the node it was currently running on, forcing recovery of that node. Since all faults led to node shutdown, this was an effective way to simulate them.

And amusingly, since the suicidal shutdown service was itself one of these recoverable services, the recovery process would promote its warm standby on another node, which would then reset that node in turn. Essentially, once this service was started you'd see random nodes inside the chassis being reset and recovered continually, during which time you could run a normal test workload and observe that it carried on with temporarily degraded performance but no other ill effects (and if there were ill effects, that typically meant a bug in one of the services - it wasn't cooperating correctly with the checkpoint system).
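The whole loop is easy to simulate. The toy below (invented names throughout, and nothing like the real NonStop tooling) just models a reset service that takes down its current node each step, gets recovered onto a surviving node, and repeats, while a workload keeps completing on whatever nodes remain up:

    import random

    def simulate(nodes=4, steps=10, seed=1):
        rng = random.Random(seed)
        reset_svc_node = 0          # node the reset service currently runs on
        completed_work = 0
        for step in range(steps):
            up = set(range(nodes))
            # The reset service fires: its own node goes down.
            victim = reset_svc_node
            up.discard(victim)
            # Recovery: the warm standby of every service on the victim node
            # is promoted elsewhere - including the reset service itself,
            # so the cycle continues on some other node.
            reset_svc_node = rng.choice(sorted(up))
            # A normal workload keeps running on the surviving nodes, just
            # with temporarily degraded capacity.
            completed_work += len(up)
            print(f"step {step}: node {victim} reset, "
                  f"reset service now on node {reset_svc_node}, "
                  f"{len(up)}/{nodes} nodes serving")
        print("total work units completed:", completed_work)

    simulate()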

It's also worth mentioning the SQL database that ran on the Tandem systems I used - CLX/R's, during a time I spent at Tandem learning immensely about the company's architecture, a period which made this classic tome one of my all-time desert island books, right alongside The Structure and Interpretation of Computer Programs.

Almost everything about building highly available, scalable web services that is being painfully relearned now could be picked up from a good study of those Tandem systems; for instance, their database system used key-range partitioning to distribute queries in parallel across the cluster (and could do so quite transparently), much as sharding is used today.
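For the curious, key-range partitioning of the kind described is simple to sketch. The code below is an illustration only, with made-up names - not Tandem's NonStop SQL interfaces: a table is split into contiguous key ranges owned by different nodes, and a range query is fanned out only to the partitions it overlaps.

    import bisect

    class KeyRangePartitionedTable:
        def __init__(self, split_points, nodes):
            # split_points like [1000, 2000]: node 0 owns keys < 1000,
            # node 1 owns [1000, 2000), node 2 owns keys >= 2000.
            assert len(nodes) == len(split_points) + 1
            self.split_points = split_points
            self.nodes = nodes                # each node: dict key -> row

        def node_for(self, key):
            return self.nodes[bisect.bisect_right(self.split_points, key)]

        def insert(self, key, row):
            self.node_for(key)[key] = row

        def range_query(self, lo, hi):
            # Only the partitions overlapping [lo, hi] are consulted; in a
            # real system each would scan in parallel on its own CPU node.
            first = bisect.bisect_right(self.split_points, lo)
            last = bisect.bisect_right(self.split_points, hi)
            results = []
            for node in self.nodes[first:last + 1]:
                results.extend(row for k, row in node.items() if lo <= k <= hi)
            return results

    table = KeyRangePartitionedTable([1000, 2000], [{}, {}, {}])
    for k in (5, 1500, 2500):
        table.insert(k, {"acct": k})
    print(table.range_query(900, 1800))   # touches only partitions 0 and 1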

Of course, this is how it ever was: lessons learned on mainframes forgotten and re-learned by the upstart minicomputer folks, then forgotten and re-learned again in the age of the microprocessor, and now once again in the age of web development. But it's remarkable to think that Tandem were basically the only firm pursuing this particular line, and that they did it better than most people do now, despite starting some 35 years ago - before Ethernet, before SCSI, before UNIX was widespread.