Friday, October 14, 2011

Steam Unmetered

I've long been a fan of Valve's Steam game service, although for this part of the world it's often been infeasible to use it to its full extent. Anyhow, a couple of days ago my ISP, who I was already pretty happy with, made my day by formally announcing their unmetered Steam content server.

Given how many Steam games I've bought over the years but not been able to download - having to carefully look at my usage at the end of each billing cycle and plan what I could get - this was a real boon, on top of TelstraClear already having bumped my data cap from 25GB to 40GB for the same money just a month earlier. Several other New Zealand ISPs have also recently started running Steam content servers, and I expect that like the TelstraClear one they'll be locked to their own customers, as is common in both Australia and New Zealand.

However, Steam doesn't play nicely with ISPs who do this; it always downloads from multiple content servers, and since only one of those is unmetered, at best you still get about a third of the data from the regular Steam servers. Better than nothing, but still a nasty chunk out of your data cap - and if things go badly and the content hasn't yet been replicated to the local server, rather than waiting until it has been, Steam will fetch the whole thing from its regular metered servers and potentially leave you with a hefty bill. Not cool.

The third-party add-ons for trying to prevent this are awful, and the only one which claimed to work at all on Windows XP evidently had its homepage vanish just before the launch. So, I wrote my own, which should work better than the other techniques; it defaults to directing Steam at TelstraClear's server, but it can be pointed at other servers if you're on another ISP which provides an unmetered one.

As it happens I've written various Detours-like code before, along with fun things like a Windows DLL loader for DJGPP under MS-DOS and things of that ilk, but of course I lost all that and had to leave it behind at Symantec. So, it was kinda fun to take a fresh crack at that; I didn't go to too much effort as this was a hack I put together in a few hours (mainly while waiting for tests to run on other code) so it's not quite perfect, but I rather like the approach.

This does DLL injection in a way I think is reasonably clean; it opens the target process, allocates a page of memory in it, builds an argument frame there, and then JITs some shim code in after it. One of the mostly-handy but sometimes-irritating things about the x86 is that it tends to use a lot of relative addressing, but this method of building the injection shim creates a nice, clean, address-independent result without much work.

The shim takes care of loading the target DLL (using a path string passed through the argument frame), obtaining the desired entry point via its exported function name (ditto), and then calling that function with an argument string (again, passed in the frame) before releasing the reference count on the loaded DLL and returning. The injected code can be invoked from the injecting process using CreateRemoteThread, and the shim page can then be deallocated - if the DLL containing the called function wants to persist past that, it can just add another reference to itself.
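
For anyone who hasn't seen this done before, a minimal sketch of the Win32 mechanics looks something like the code below. Note this is the simpler classic variant that just points the remote thread at LoadLibraryA; the approach described above goes further, JITting a position-independent shim and argument frame so an arbitrary exported function can be called.

    // Minimal sketch (not the original tool): inject a DLL into another process by
    // allocating memory for the DLL path and starting a remote thread at LoadLibraryA.
    #include <windows.h>
    #include <cstring>

    bool InjectDll(DWORD pid, const char *dllPath)
    {
        HANDLE process = OpenProcess(PROCESS_CREATE_THREAD | PROCESS_VM_OPERATION |
                                     PROCESS_VM_WRITE | PROCESS_QUERY_INFORMATION,
                                     FALSE, pid);
        if (!process)
            return false;

        // Allocate memory in the target and copy the DLL path into it.
        SIZE_T size = std::strlen(dllPath) + 1;
        void *remote = VirtualAllocEx(process, nullptr, size,
                                      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        bool ok = false;
        if (remote && WriteProcessMemory(process, remote, dllPath, size, nullptr)) {
            // kernel32 is mapped at the same address in every process, and
            // LoadLibraryA's signature happens to match LPTHREAD_START_ROUTINE.
            auto entry = reinterpret_cast<LPTHREAD_START_ROUTINE>(
                GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA"));
            HANDLE thread = entry
                ? CreateRemoteThread(process, nullptr, 0, entry, remote, 0, nullptr)
                : nullptr;
            if (thread) {
                WaitForSingleObject(thread, INFINITE);  // wait for the load to finish
                CloseHandle(thread);
                ok = true;
            }
        }
        if (remote)
            VirtualFreeEx(process, remote, 0, MEM_RELEASE); // path buffer no longer needed
        CloseHandle(process);
        return ok;
    }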

What makes this worth the effort are a couple of things, the most important being that it's really, really, really bad to do non-trivial things in your DllMain function; going to the effort of doing DLL injection this way means that you can call arbitrary code in a target process without having to rely on any of that. Another nice thing - and in practice just as valuable - is that unloading is clean: your DLL can export an unload call for the shim to use, which calls FreeLibrary to drop the reference the DLL took on itself, before the temporary code in the outer shim calls FreeLibrary for the final time to actually get the unload done.
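
The DLL side of that hand-off can be as small as the sketch below (the Attach and Unload entry points are hypothetical names, not anything from my tool): the injected entry point pins its own module with an extra reference so it survives the shim's FreeLibrary, and the exported unload call drops that reference again so the outer shim's final FreeLibrary actually unloads it.

    // Hypothetical DLL side of the scheme above. Both exports use the
    // LPTHREAD_START_ROUTINE-compatible signature so the shim can call them directly.
    #include <windows.h>

    static HMODULE g_self;

    extern "C" __declspec(dllexport) DWORD WINAPI Attach(LPVOID argumentString)
    {
        // Pin ourselves: bump this module's reference count using our own code address.
        GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
                           reinterpret_cast<LPCSTR>(&Attach), &g_self);
        // ... start worker threads, install hooks, etc., using argumentString ...
        return 0;
    }

    extern "C" __declspec(dllexport) DWORD WINAPI Unload(LPVOID)
    {
        // ... stop workers, remove hooks ...
        FreeLibrary(g_self);   // drop our own reference; the outer shim's final
        return 0;              // FreeLibrary is what actually unloads the DLL
    }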

While this particular code isn't that special (it was after all just a quick hack to scratch an itch I had) it's worth noting that a lot of this kind of code injection is going on these days. Lots of add-ons for games, in particular, like to use it. There are graphics and script extenders for Bethesda's Morrowind, Oblivion and Fallout games, for instance, all of which use similar techniques to get inside the original code and make it do new things. And of course, the shim engine included in Windows itself for Application Compatibility purposes can do some quite nifty things too.

One of the nicest parts of this is actually how little reverse-engineering you need to do (indeed, none at all for this particular app), because the API componentry in Windows itself is so well engineered for debugging.

Sunday, July 10, 2011

The Byte Shop

In 1981, the first retail computer store in New Zealand opened in downtown Auckland, in Fort Street just off Queen Street; called the Byte Shop, it was run by a wonderfully enthusiastic man, the late Andrew Tearle, whose death I only just learned of while preparing to write this post.

When I was 14, I often took the bus from Glenfield College to Auckland after high school on a Friday; most of the time, my main aim was to visit the then brand-new video arcades and play a little Pinball or Galaxian. Once the Byte Shop opened though, it became a mecca for a number of young nerds like myself to gawk at the computers and the programming books.

The first couple of undergraduate programming texts I bought came from there; Principles of Interactive Computer Graphics (by Newman and Sproull) came from the Byte Shop, as did Sorting and Sort Systems, by Harold Lorin, from which I learned the wonders of things like the polyphase merge sort (extremely important on machines of those days, where 16KB was still considered a lot).

Andy himself was a genuinely friendly guy, and very tolerant of us computer-obsessed youngsters using his store as a hangout to discuss programming; I hope it worked out for him, having us around to help answer some of the curlier technical questions customers had about the machines.

Lots of great meetings happened there; one of the most important for me was meeting a young man named Justin Farrelly. While chatting we found out that he was an Air Force avionics engineer based at 5 Squadron in Hobsonville, which meant he was only a couple of kilometres from where I lived with my father in Greenhithe, just on the other side of the harbour across a bridge.

Justin and I became firm friends and worked together for many years; it was a wonderful partnership, with Justin doing electronics design and me writing software, and around the same time I finished high school Justin left the Air Force and went into business for himself with me coding for him. We did a wonderfully diverse set of embedded projects, mostly based on the Intel 8051; later on Justin took pilot training and learned to fly helicopters, and went down to Antarctica to winter over with Greenpeace to help them get access to the Antarctic treaty negotiations: in this image on the Greenpeace site, Justin is the chap on the left smiling at the camera in a blue coat.

After that Justin went all over the world and we didn't work together again, but those years from 1981 through the late 80's were incredibly productive and fun. And without Andrew Tearle at the Byte Shop, that wouldn't have happened.

Saturday, July 9, 2011

The Turbo C Overlay Manager

During the late 80's I was doing most of my work - writing point-of-sale software for the retail oil industry - in a mixture of assembly language and Turbo C, rewriting my earlier point-of-sale system (written in Turbo Pascal on the not-quite-PC-compatible Sanyo MBC 550), replacing its cooperative multitasking system with a proper preemptive task switcher.

Even more fun, this little doodad took advantage of the technique of coercing DOS into supporting multitasking by using the INDOS flag and the very under-documented fact that the MS-DOS kernel thoughtfully stored all the key state for the current program in a contiguous block of memory. With the address and size of this block, as long as the MS-DOS kernel wasn't busy doing I/O at the time, task switches could be done by not just saving the CPU state and switching stacks as per normal, but also saving and restoring the MS-DOS state in the "swappable data area".
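
To give a flavour of how that worked, here's a rough Turbo C-style sketch of the idea. The swappable data area details vary between DOS versions, and a real switcher also checks the critical-error flag, so treat this as illustrative rather than definitive.

    /* Rough sketch of the InDOS/swappable-data-area trick described above. */
    #include <dos.h>
    #include <string.h>

    static unsigned char far *indos;    /* InDOS flag: non-zero while DOS is busy */
    static void far *sda;               /* swappable data area base */
    static unsigned sdaSize;            /* bytes of DOS state to save per task */

    void initDosState(void)
    {
        union REGS r;
        struct SREGS s;

        segread(&s);
        r.h.ah = 0x34;                  /* INT 21h/34h: InDOS flag address in ES:BX */
        int86x(0x21, &r, &r, &s);
        indos = (unsigned char far *) MK_FP(s.es, r.x.bx);

        segread(&s);
        r.x.ax = 0x5D06;                /* INT 21h/5D06h: swappable data area, DOS 3+ */
        int86x(0x21, &r, &r, &s);
        sda = MK_FP(s.ds, r.x.si);
        sdaSize = r.x.dx;               /* size DOS reports as "always swap" */
    }

    /* Called from the timer tick: only switch when DOS isn't mid-call. */
    int trySwitch(void far *saveArea, void far *restoreArea)
    {
        if (*indos != 0)                /* DOS is busy - defer the switch */
            return 0;
        _fmemcpy(saveArea, sda, sdaSize);      /* save outgoing task's DOS state */
        _fmemcpy(sda, restoreArea, sdaSize);   /* restore incoming task's DOS state */
        /* ...then save/restore CPU registers and switch stacks as usual... */
        return 1;
    }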

With two monitors, the resulting point-of-sale system would run as a TSR in the background, typically driving a monochrome adapter plus several serial ports (with its own keyboard) and printers, while the rest of the machine was available to run a back-office program on the colour display adapter. The next elaboration was to use the EGA adapter but split each colour plane out to drive a separate green screen; since the task system would happily run several instances of the main point-of-sale task, we could run up to three point-of-sale registers from the same PC, which could still run the accounting software on the other display adapter.

While this worked well, as features were added to the point of sale the memory pressure in real mode started to grow, so I looked into one of the fancy features of Turbo C: what Borland called the VROOMM overlay manager.

Overlays were a crude technique in the early days; typically, functions in a Pascal program would be designated as "overlay" routines, in which case they were split out from the main executable into a separate piece managed by library code in the main routine. Before calling an overlay routine, the caller would ensure that the right overlay was loaded in first so that the code for the routine was resident in memory.

All well and good, but with one drawback: since the memory used to load the routines in the "overlay" sections was overwritten as different code was loaded in, if an overlaid routine called a non-overlaid routine, which then called into a different overlay, the original overlay would be evicted and the eventual return would land in whatever code had replaced it.

VROOMM was more clever than that, though; the folk at Borland had come up with a neat trick to allow routines in overlays to work safely. I believe the same fundamental idea was also picked up by real-mode Windows to manage code sections: since real-mode Windows was under really extreme memory pressure, pretty much all application code needed to be swappable, which led to similar problems.

With some persistence and reverse-engineering, it turned out that the overlay support was mediated by the linker which glued together the final program; the linker emitted a small thunk for each entry point in any object module which had been tagged on the linker command line for overlaying, and this thunk took care of transparently loading the right overlay segment on function entry.

However, it also cooperated cleverly with any code higher up in the call stack; since Borland's compiler used a consistent frame format, the overlay library walked up through the stack looking for return addresses whose code segment matched the overlay area. If it found one of these call records on the stack it would rewrite it, extracting the original return address and replacing it with one pointing into the overlay manager. Finally, it would tag that stack frame as one it had modified, by adjusting the saved frame pointer to set the low bit (which for normal frames would always be 0, as the 8086 stack is always 16-bit aligned).

This meant that when the flow of control eventually tried to return to some code in an evicted overlay, the rewritten return address would be used instead, landing in a thunk in the overlay manager which would reload the correct overlay segment and fix the stack back up to its original state before continuing execution.
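
Purely as a conceptual illustration of that rewrite-and-tag trick (the real thing works on the live 8086 stack, in assembly; the frame layout, names and bookkeeping below are invented), the heart of it is something like:

    // Conceptual sketch only: patch stack frames whose return address points into an
    // overlay that is about to be evicted, and tag them via the low bit of saved BP.
    #include <cstdint>

    struct Frame {              // shape of a Borland-style far call frame:
        uint16_t savedBp;       //   pushed BP (always even - SP is word-aligned)
        uint16_t retOffset;     //   return address offset
        uint16_t retSegment;    //   return address segment
    };

    void patchFramesForEvictedOverlay(Frame *frames, int frameCount,
                                      uint16_t overlaySegment,
                                      uint16_t reloadStubSegment,
                                      uint16_t reloadStubOffset,
                                      uint16_t *stashedOffsets /* one slot per frame */)
    {
        for (int i = 0; i < frameCount; ++i) {
            Frame &f = frames[i];
            if (f.retSegment != overlaySegment)
                continue;                        // doesn't return into the evicted overlay
            stashedOffsets[i] = f.retOffset;     // remember the real return address
            f.retSegment = reloadStubSegment;    // redirect the return into the overlay
            f.retOffset  = reloadStubOffset;     //   manager's reload stub
            f.savedBp   |= 1;                    // tag: a normal saved BP is even, so the
                                                 //   low bit marks "this frame was patched"
        }
    }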

This made the overlay process pretty transparent, and reasonably reliable, although since it worked at the linker level it sometimes took some care to ensure that the right routines were bundled together so that call hotspots didn't thrash overlay loads.

One remaining problem, though, was that it still needed a lot of space for the overlay region itself (as much as the biggest segment it might load, which in practice, for a program big enough to need overlaying, typically meant 64KB). That was a particular problem for our point-of-sale TSR.

So, having reverse-engineered this I decided to get creative and write my own overlay manager, which took things to the next level by putting the overlay in an unusual place: the Expanded Memory Specification (EMS) page frame. My replacement overlay manager preloaded all the overlays into expanded memory pages, and then handled the overlay loads by remapping the EMS page frame to the right ones, giving both zippy performance and no additional burden on MS-DOS real memory.
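
For the curious, the EMS side of this is only a handful of INT 67h calls. Here's a sketch - the overlay bookkeeping is invented for illustration, but the EMS functions are the standard LIM ones:

    /* Sketch of loading an "overlay" by remapping the EMS page frame (INT 67h). */
    #include <dos.h>

    static unsigned emsFrameSegment;   /* segment of the 64KB EMS page frame */
    static unsigned emsHandle;         /* handle whose logical pages hold the overlays */

    int emsInit(unsigned pagesNeeded)
    {
        union REGS r;
        r.h.ah = 0x41;                         /* get page frame segment */
        int86(0x67, &r, &r);
        if (r.h.ah != 0) return 0;
        emsFrameSegment = r.x.bx;

        r.h.ah = 0x43;                         /* allocate logical pages (16KB each) */
        r.x.bx = pagesNeeded;
        int86(0x67, &r, &r);
        if (r.h.ah != 0) return 0;
        emsHandle = r.x.dx;
        return 1;
    }

    /* Make a preloaded overlay addressable: map its four 16KB logical pages into
       the frame. An "overlay load" is now just this remap, not a disk read. */
    int mapOverlay(unsigned firstLogicalPage)
    {
        union REGS r;
        unsigned phys;
        for (phys = 0; phys < 4; ++phys) {
            r.h.ah = 0x44;                     /* map handle page */
            r.h.al = (unsigned char) phys;     /* physical page within the frame */
            r.x.bx = firstLogicalPage + phys;  /* logical page holding overlay code */
            r.x.dx = emsHandle;
            int86(0x67, &r, &r);
            if (r.h.ah != 0) return 0;
        }
        return 1;
        /* A preemptive task switcher also needs to save/restore the mapping per
           task, e.g. with EMS functions 47h/48h (save/restore page map). */
    }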

And since I had a preemptive multitasker which was already managing the DOS state switching, I then expanded it to track the EMS page frame mapping as well, so that my application, running as a TSR, could run concurrently with the main program even if the main program was using EMS too. Hey presto, we could then run our point-of-sale in the background with something like 96KB of real memory load: 32KB of core C library and overlay manager plus a 64KB data segment for the point-of-sale task instance.

And since one of the things I wanted to do around that time was write my own UNIX kernel for a 68k board to replace all this DOS malarkey, when I got to working on that I could use this technique to help get GCC running in real-mode MS-DOS - although even with the overlay manager running out of the page frame, GCC still wouldn't fit. Fortunately, that too turned out to be a solvable problem...

Friday, July 8, 2011

Reverse Engineering

One of the things about learning to program is that, like just about anything else, it takes effort to learn how to do it well. Reading a book is all very well, but reading well takes effort too. Most of the Great Works of programming literature are not easy, and one of the ways you can tell - even in those works not intended to be used as college texts - is the presence of exercises.

What a programmer says they have read is interesting, but the thing to really probe is whether they have done the exercises. The Art of Computer Programming is a great set of books, but of the all too few people who have read it, even fewer have so much as glanced at the exercises, let alone done any of them - a shame, both because of the lost learning and because in the great texts the exercises are usually carefully constructed to take the reader beyond the main text.

[ Of course, back when I was a lad starting out, I didn't have the benefit of ready access to the Great Works (many of which hadn't been written yet). Getting access to a machine at all was challenge enough, at least circa 1980 in New Zealand - hence still having to use card decks to share a "personal" computer. ]

One of the more interesting learning exercises I went through in my early days came via the medium of computer chess, thanks to the classic early chess program Sargon II, by Dan and Kathe Spracklen, immortalized in the classic horror film The Thing as the program Kurt Russell is playing against. My father had taught me the basics of the game, but lacking many other people to play (and having discovered the distraction of computing) I didn't play enough to become any good; when a copy of Sargon II turned up I naturally gave it a whirl, and got a nasty surprise.

If you're feeling keen to get the retro vibe, you can try Sargon in an emulator, although I've linked to Sargon III instead of II as it's rather less forbidding to get started with - one small trick if you use the emulator is that the original Apple ][ had no lower-case. You have to enter moves in algebraic notation (using the board positions) as was a common convention for written chess games, and ESC toggles between the graphical board display and the move list.

Now, there's nothing particularly earth-shaking about Sargon's play, and when I could get enough time on an Apple ][ to play it (which I did; I hate to think how annoying and persistent I must have been) I could beat it handily at lower levels - but by difficulty level 3 it was consistently beating me, enough that I found it quite frustrating. At that point I could have just kept playing the program to get better at chess, but I didn't; instead I decided I needed to figure out how it worked, reasoning that insight into its underlying playstyle would be a better (and less painful) learning method than being repeatedly beaten.

The other advantage of studying how the program worked was that unlike playing against it, I didn't actually need time with access to the computer to study the program.

Of course, I've since discovered that the original Sargon was published as a book consisting of a commented Z80 source code listing - a common way of distributing programs in the U.S. back in those days, as exemplified by Compute! magazine - but that book never made its way to our shores. However, there were disassemblers for the 6502; indeed the Apple ][ system ROM contained one, and an even better one by Glen Bredon called Sourceror was a companion to the assembly tool Big Mac, which I'd started to play with as I wrote ever more complex 6502 assembly language of my own.

So, I loaded the Sargon II binary into memory, pointed Sourceror at it, and got a rough assembly dump; with some persuasion I got that printed out into a big stack, which I then carried with me everywhere for a month or so at high school, spending every free moment poring over it, trying to trace the program's execution and divine its logic and internal data structures (including working out which parts of the dump were actually data rather than code).

While that printout is long gone, I can at least cherish the memory of unwinding it all and discovering how it worked. In essence, the key element - and the part of the program I spent the most time trying to figure out, since that meant making suppositions about what its data structures were and then tracing through the code to see if those guesses made sense - was a position evaluator that "scored" a given board layout by scanning it to see which moves were available, and thus which pieces were threatened with capture. The sum of the weights of the remaining pieces, with weight values chosen to reflect their tactical power, adjusted by the sum of the weights of pieces threatened with capture, gave a simple score for the "desirability" of that position.

[ There was an interesting adjustment to the weights of pieces for pawns; their point weighting changed depending on how many rows across the board they had advanced, to reflect the potential value of the pawn if it could reach the end row and be queened. ]

The difficulty level of the program then controlled how deep into the tree of possible positions the program would recursively search, using simple alpha-beta pruning to try and constrain the search; the move leading to the position it scored as the best future outcome at the requested depth became the program's choice. Although my printout with hundreds of scribbled annotations is long lost, you can get a sense of it by reading the Sargon I source, along with its original commentary.
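
None of the code below is Sargon's, but as a minimal modern sketch of the same shape - a material-style evaluator (including the pawn-advancement tweak) feeding a fixed-depth alpha-beta search, with move generation abstracted away - it captures the structure I eventually unwound:

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Illustrative centipawn-style piece weights (Sargon's actual values differ);
    // pawns gain value as they advance, reflecting their promotion potential.
    inline int pieceValue(char piece, int ranksAdvanced /* 0..6 from the home row */)
    {
        switch (piece) {
        case 'P': return 100 + 8 * ranksAdvanced;
        case 'N': case 'B': return 300;
        case 'R': return 500;
        case 'Q': return 900;
        case 'K': return 20000;   // effectively infinite - losing it loses the game
        default:  return 0;
        }
    }

    // A game is just "successor positions" plus "static score for the side to move".
    template <typename Position>
    struct Game {
        std::function<std::vector<Position>(const Position&)> successors;
        std::function<int(const Position&)> evaluate;
    };

    // Fixed-depth negamax search with alpha-beta pruning: the difficulty level is
    // simply the depth passed in here.
    template <typename Position>
    int alphaBeta(const Game<Position>& game, const Position& position,
                  int depth, int alpha, int beta)
    {
        std::vector<Position> next = game.successors(position);
        if (depth == 0 || next.empty())
            return game.evaluate(position);            // leaf: static evaluation

        for (const Position& child : next) {
            // Score the child from the opponent's point of view, negated for us.
            int score = -alphaBeta(game, child, depth - 1, -beta, -alpha);
            alpha = std::max(alpha, score);
            if (alpha >= beta)
                break;                                 // cut-off: opponent avoids this line
        }
        return alpha;
    }

Choosing the program's move is then just a loop over the legal moves from the current position, calling the search on each resulting position and keeping the best.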

I definitely owe the Spracklens a debt of gratitude; the fact that their chess program was better than I was helped make me a developer, by making me mad enough to learn how to pull programs apart from the raw bytes. I might not have put in the effort to learn how otherwise, but I'm certainly glad I did.

Otherwise, for instance, I might not have gone on to study disassemblies of the Apple ][ BASIC floating-point code (helped by "What's Where in the Apple", which I don't think I have kept, although I do have some other old bits of Apple ][ trivia), learned floating-point, and come to a fuller understanding of the Taylor and Maclaurin series expansions - around the same time I was teaching myself calculus, and recognising the series expansion by figuring out what the tables of constants represented was an epiphany to treasure.


As I said at the start, all good learning takes effort; and although I never did pursue chess seriously, the time and effort I put into reverse-engineering that chess program really paid off. Even though it seems like a frivolous exercise, and at the time I certainly had no ambition beyond the exercise itself, it had real practical value. Reverse engineering is a powerful skill which I put to good use in the next few years (and indeed, it serves me well to this day, albeit with better tools to help the process), first writing my own development tools for embedded systems - for the SC/MP and Intel 8051, notably - and then using them to write lots of embedded code and system-level stuff.

Thursday, May 19, 2011

CPU Emulation for Improved Debugging

I'd been meaning to describe this for a while but hadn't got around to it; since it's somewhat related to Fabrice Bellard's great Javascript hack, that's finally got me motivated to write it up.

After Symantec shut down development of Ghost Solution Suite in early 2009 (closing our offices and laying off all but myself and another senior developer), my job changed rather rapidly. At the time I had been developing a Javascript interpreter for embedding into the Ghost product release due later that year, something I had finally gotten management approval for in mid-2008 - some ten years after deciding, back in 1998, that I wanted the system to support scripting.

Now, despite not having an actual script parser, the system was very much set up to support dynamic languages, as I'd been a Lisp and Smalltalk fan for many years. Although I'd read about Smalltalk occasionally since the early 80's, I actually first used it seriously via the excellent Actor programming language in the Windows 3.1 era - Actor had a more Pascal-inspired surface syntax, but its semantics were pure Smalltalk, and it had a lot of influence; its excellent GUI class library was licensed by Borland, and its design, along with some of the Smalltalk concepts such as reflection, was very successfully incorporated into Delphi - designed by Anders Hejlsberg, who later created the even more Smalltalk-like .NET environment. Oddly, Actor ended up owned by Symantec - I did spend some time trying to locate the source while I was there, but as with the QEMM and DESQview/X source which I went looking for after the Quarterdeck acquisition, it was basically impossible to find.

So, when I set out in 1998 to make the Ghost Enterprise (as it was called then) management components, I decided a few simple things: it would be in C++, it would be garbage-collected, network transport would work by binary marshalling of objects and collections, and the core library of management objects would be a Smalltalk-like class library with heterogeneous collections, so that C++ routines working with specific types would use dynamic type inquiry written in the style of the Eiffel assignment-attempt operator, via C++ operator overloading.
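
That library is long gone, but to show what I mean by the last point, here's a small sketch of an Eiffel-style assignment attempt done with operator overloading and dynamic_cast - the names are invented and the C++ is more modern than what I was writing in 1998:

    // Sketch (names invented, not the Ghost library): an Eiffel-style "assignment
    // attempt" in C++ - assigning a base pointer to an attempt<T> succeeds only if
    // the object really is a T, otherwise the attempt is left null.
    #include <iostream>

    struct Object {                        // root of a heterogeneous class library
        virtual ~Object() = default;
    };

    struct StringValue : Object { const char *text = "hello"; };
    struct IntegerValue : Object { int value = 42; };

    template <typename T>
    class attempt {
    public:
        // The assignment attempt itself: dynamic type inquiry hidden behind operator=.
        attempt& operator=(Object *candidate) {
            object_ = dynamic_cast<T*>(candidate);   // null if the type doesn't match
            return *this;
        }
        explicit operator bool() const { return object_ != nullptr; }
        T* operator->() const { return object_; }
    private:
        T *object_ = nullptr;
    };

    int main() {
        Object *item = new StringValue;    // heterogeneous collections hand out Object*

        attempt<StringValue> asString;
        attempt<IntegerValue> asInteger;
        asString = item;                   // succeeds: item really is a StringValue
        asInteger = item;                  // fails quietly: the attempt is left null

        if (asString) std::cout << asString->text << "\n";
        if (!asInteger) std::cout << "not an integer\n";
        delete item;
        return 0;
    }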

All fairly simple and straightforward, but it's fair to say that the resulting mix of styles wasn't exactly something the other folks on the team found easy to work with, and another problem was that the debugging experience in Visual C++ and later Visual Studio wasn't entirely stellar. I was entirely used to this kind of thing, so it didn't bother me - aside from my years of assembly language and embedded work, when I worked on Tandem CLX/R mainframes the only C++ compiler was a port of Cfront 2.0 done with Lattice C, which meant that debugging was done on the Cfront-translated source - so I probably didn't appreciate that it should have been easier to debug.

Anyway, since development had been cancelled midway through the project to deliver the next version, in 2009 my job became one of stabilising what we had done so that it could be released if necessary, and taking over more maintenance (previously I was maintaining most of the management framework, but with only two of us left behind to look after the entire 1.5M LOC suite, that scope expanded quite a bit). A few months later, once a small team in India was put together, we started training them up on how things worked, at which point the rather esoteric style of coding became more of a problem, especially because of the poor debugging experience in Visual Studio.

Now, what was particularly awkward about C++ source debugging in Visual Studio (and this applied to every version from 2002 through to 2008) was that although it mostly worked well in templated code where the full type of an object was statically available, it would struggle mightily with inspecting an object in a context where the static type was a base class (or COM interface). On occasion it would manage to tell you what the real type of the pointed-to object was, but in general it couldn't - and the Ghost management platform was written in a coding style where almost everything pointed to a generic base, so you'd be left with a soup of raw pointers that you'd have to manually cast to something to make sense of.

This resulted in a plea from the newly-formed maintenance team to do something to help them manage. But what could I do? There was, in the Visual Studio debugger, an extensible system for writing custom data inspectors, but it was entirely driven by matching things by static type - the exact piece of information that Visual Studio wasn't correctly working out for itself. And while it's possible to extend the .NET parts of Visual Studio in interesting ways including writing custom visualizers and expression evaluators, the C++ native code debugging didn't appear to have those extension points; the only solution seemed to be a plug-in system created for Visual C++ 6.0 which later releases had kept as a legacy system.

So, I had an extension interface which I could use to generate a string, and all the objects in my run-time knew how to print themselves (either in JSON or an earlier, more Lisp-like native text syntax), but the extension system had only one way to look at the debug target: a callback function which would read bytes of memory from it.

At this point, I had the inspiration. Well, if my program's objects could print themselves, and the ONLY thing I could do was read bytes of memory, then the only possible way forward was to write an x86 (or x64, since all the management platform code and JS interpreter also ran 64-bit) software VM and put that in as the debug plug-in; the plug-in could then set up a frame on the emulated stack to call the print routine in the target object, have it print into a supplied buffer, then extract the result from the VM and pass it back to Visual Studio.

How hard could it be?

As it turned out, not hard at all, although there were a few things that did surprise me while putting it all together over the course of about three weeks. I went with a simple, straightforward emulator rather than a translator - although I would have liked to have a crack at dynamic translation (and still would, when I have the time) this particular problem didn't really warrant it.

Writing an x86 and/or x64 (mine actually supported both) emulator really isn't that hard - aside from knowing the architectures reasonably well, being able to decode x86 code has enough practical uses that I'd actually done it some years before, to write an API interception library similar in spirit to the Microsoft Research Detours project. My interceptor would disassemble the start of the interception target so it could relocate that code into a side thunk (which branched back into the original function), making room to insert a branch to the hook thunk. This method allowed the hook to work safely even in Windows 95, where the kernel code was system-global, as the thunk I'd patch in would compare the current process/thread ID so it would only activate in the right process context.

Since I was emulating user-mode code, I probably didn't need to emulate the x86 paging unit, but I did a rough emulation of it anyway, since that also seemed to be the sanest way to use the memory-read callback which I had to use to get the debug target state. As instructions requested memory, I would consult the emulated MMU, and if the page wasn't already present I'd populate it using the memory-read callback, in effect "faulting in" the contents of the debug target as I needed them.
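
In outline - the names and interface below are invented rather than taken from the actual plug-in - the guest memory side of that looks something like this:

    // Sketch: a lazily-populated guest memory map for the emulator. Pages are
    // "faulted in" from the debugger's read-memory callback the first time an
    // emulated instruction touches them, so only the state actually needed for
    // the print call gets pulled out of the debug target.
    #include <cstdint>
    #include <functional>
    #include <stdexcept>
    #include <unordered_map>
    #include <vector>

    class GuestMemory {
    public:
        static constexpr uint64_t kPageSize = 4096;

        // readTarget(address, buffer, size) -> bytes read; supplied by the debugger.
        using ReadCallback = std::function<size_t(uint64_t, void*, size_t)>;

        explicit GuestMemory(ReadCallback readTarget) : readTarget_(std::move(readTarget)) {}

        uint8_t read8(uint64_t address) {
            return pageFor(address)[address % kPageSize];
        }

        void write8(uint64_t address, uint8_t value) {
            // Writes stay local to the emulator - the debug target is never modified.
            pageFor(address)[address % kPageSize] = value;
        }

    private:
        std::vector<uint8_t>& pageFor(uint64_t address) {
            uint64_t base = address & ~(kPageSize - 1);
            auto it = pages_.find(base);
            if (it == pages_.end()) {
                // "Page fault": populate the page from the debug target on first touch.
                std::vector<uint8_t> data(kPageSize, 0);
                if (readTarget_(base, data.data(), kPageSize) == 0)
                    throw std::runtime_error("unreadable guest address");
                it = pages_.emplace(base, std::move(data)).first;
            }
            return it->second;
        }

        ReadCallback readTarget_;
        std::unordered_map<uint64_t, std::vector<uint8_t>> pages_;
    };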

What did surprise me is how much of the x86 instruction set I needed to support, including not just regular 8087 floating-point (not too hard to emulate using plain C++ floating-point math, except for quirks like rounding modes) but also a number of SSE2 instructions which turned out to be conditionally used as optimizations in the Visual C++ run-time library. Although that code could potentially have avoided those instructions, since they were guarded by CPU feature tests, the debug target process had typically already gone through those tests and set up global state recording that the instructions were available - so I had to emulate them, which meant including a suitable register file organization for the XMM register set in the emulator.

In addition, as the printing code I was calling in the target was pretty general-purpose, it would rely on things like iostreams, which meant memory allocation and entering Win32 APIs to handle critical sections and the like; so although I didn't need to do any hardware emulation, I did need to provide a fairly complete emulation of the Win32 process environment, including manufacturing Thread Environment Block and Process Environment Block structures as part of creating the initial VM for the emulated call to the print routine.

Possibly the most entertaining part of the whole thing was devising a scheme for representing the condition code computations; fortunately I ran across a rather clever and reasonably clear article demonstrating some neat boolean hackery to recover the carry-out and overflow status values of an operation, even though they aren't directly available from C++ arithmetic. Lazily computing the condition codes is conceptually simple enough, but I hadn't actually bothered to until I stumbled across that article, after which it was too tempting not to make the CC emulation lazy.
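
For the flavour of it, here's a tiny sketch of lazy flags for 32-bit addition; the carry and overflow recoveries are the standard bit-twiddling identities, while the record structure itself is just illustrative:

    // Sketch of lazy condition codes: instead of computing EFLAGS after every
    // instruction, record the operands and result of the last flag-setting
    // operation and derive individual flags only when a later instruction
    // (Jcc, ADC, SETcc, ...) actually asks for them.
    #include <cstdint>

    struct LazyFlags {
        uint32_t src1 = 0, src2 = 0, result = 0;   // operands and result of the last ADD

        void recordAdd(uint32_t a, uint32_t b) {
            src1 = a; src2 = b; result = a + b;    // just remember; no flag work yet
        }

        bool carry() const {
            // Carry out of bit 31: a carry is generated where both inputs are 1,
            // or propagated where either input is 1 and the sum bit came out 0.
            return (((src1 & src2) | ((src1 | src2) & ~result)) >> 31) & 1;
        }
        bool overflow() const {
            // Signed overflow: both inputs had the same sign and the result differs.
            return (((src1 ^ result) & (src2 ^ result)) >> 31) & 1;
        }
        bool zero() const { return result == 0; }
        bool sign() const { return (result >> 31) & 1; }
    };

A real emulator also records which kind of operation produced the values (add, subtract, logical, shift and so on) and switches on that in the flag accessors.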

Finally, after making the above work with a direct call to the print routine, I couldn't resist one final improvement even though I knew it would probably not ship: the root of my dynamic class library was an IUnknown, so rather than directly call the print routine, the plug-in DLL first performed an emulated virtual function call on QueryInterface for a debug-helper interface IID, and if that was supported it then emulated those COM calls to get the printed representations of things portably - so the plug-in system could be adapted to work with all kinds of stuff, not just my personal GC'd class library.
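
A hypothetical version of such a debug-helper interface (the name, IID and method below are invented for illustration) is about this small; anything rooted in IUnknown that answers QueryInterface for it gets the richer debugger display, and everything else just falls back to a raw memory view:

    // Hypothetical debug-helper interface - name, IID and method are placeholders.
    #include <cstddef>
    #include <unknwn.h>

    // {D3B6A1C2-0000-4000-8000-000000000001} - placeholder IID for illustration only.
    struct __declspec(uuid("D3B6A1C2-0000-4000-8000-000000000001"))
    IDebugPrint : public IUnknown
    {
        // Render a human-readable representation of this object into the buffer.
        virtual HRESULT STDMETHODCALLTYPE PrintTo(char *buffer, size_t bufferSize) = 0;
    };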

This is what made the whole thing worthwhile to me: it turned my little DLL into a general extension system that almost any C++ program could use to create more useful printed representations of its objects under debugging.

The end result worked astonishingly well, especially since one final quirk of Visual Studio was that the plug-in DLL containing the VM code got loaded and unloaded on every single attempt to resolve an expression; if you hit a breakpoint and the debug window wanted to show the values of, say, 16 local variables which had dynamic object type, the VM system would be loaded, run, and then completely torn down 16 times. Upon discovering that, it was tempting to add an extra shim DLL to unload the VM only on a timer, but in actual fact things ran fast enough that I decided not to bother.

Despite being a relatively naive direct interpreter, this turned out to perform more than adequately in practice; the emulator DLL was fully statically linked and occupied under 150KB, and even though it was probably about 50 times slower than native code, it was fast enough at printing items - even quite complex nested hash tables and such - that tapping F8 to single-step through code under debug was still quite snappy.

What was a particular bonus with this approach, compared to other debug systems, was that it also worked with post-mortem dumps supplied by customers; we could be sent a minidump of any code which failed in the field, and this debug assistant would be able to untangle the representation of all the objects just by running their print routines against the dump.

In contrast, consider the co-operative debugging model used in .NET, where processes running under a debugger have additional debugger support code injected into them, which the debugger will then RPC to in order to obtain rich debug state and get objects marshalled out. That kind of cooperative debugging does create a great experience, but it's not so great for post-mortem work. Using an emulator to partially run the debug target is a great way to enhance the post-mortem debug experience, and indeed with a bit of additional elbow grease the VM could probably even support running the cooperative-debugging extensions against the emulated process.

Having done all this, I had hoped that wouldn't be the end of it - I had been hoping to extend my Javascript interpreter with a JIT to x86 once I got it fully working as a direct interpreter, and since the emulator was so tiny - about 50KB of compiled code - I could include it with the run-time so it could assist with developing and debugging the JIT by running the generated code in a little VM. Unfortunately, as with all the code I wrote at Symantec, I lost access to this when I was laid off, and until I get motivated to write another language run-time (or I decide to go back to my 1980's roots writing embedded software development tools) I probably won't have a good excuse to repeat the exercise anytime soon. However, seeing Fabrice Bellard's awesome Javascript hack in action is a good reminder that this kind of CPU emulation is really handy, and it's actually not as difficult to do as you probably think.

In fact, it's really quite an entertaining kind of project to set for yourself, and I recommend it as an exercise - you may not end up running Linux in a browser, but it's not only fun, in my case at least it was quite practical.

Tuesday, April 26, 2011

The More Things Change...

Jeff Atwood's latest post, reminding folks of the obvious lessons of the recent minor EC2 outage at Amazon, got me thinking again about how everything moves in painful cycles of forgetting and reinvention in the computer industry. In this case the specific technique he refers to Netflix using, their "Chaos Monkey", is basically the exact same thing Tandem Computers regularly used with their Guardian OS some 30 years before Netflix.

It's hard to overstate just how many great innovations in computing were pioneered by Tandem; their main problem commercially was really being so far ahead of their time in so many ways. Many of the architectural innovations they didn't just invent but turned into commercial success are being directly copied in the modern web era; building distributed web systems out of EC2 virtual machines and message queuing systems (or their equivalents in the Google or Azure ecosystems) is just a scaled-out version of what happened inside a Tandem mainframe's chassis.

In the case of the Chaos Monkey, it's important to understand that Tandem's fault-tolerant machines weren't built with triple-modular redundancy, as a lot of people always seemed to assume. Rather, the fault-tolerance of the system was due to comprehensive hardware error detection (obviously you have to detect faults to recover from them) combined with having no single point of failure. This did mean there were duplicates of things inside the chassis: the disks, for instance, had two independent paths to them from different CPU nodes, and the CPU nodes didn't share memory but rather were connected by two separate internal interconnects (in essence, two mini-LANs) - but all of this actually got used; it wasn't there "just in case".

The brilliance of the Guardian OS architecture (and remember, this was set up in the 1970's) was that it was a microkernel-based distributed system, composed out of small servers (small by necessity, as the architecture was essentially 16-bit, so a service process had 64k 16-bit words of data storage) which communicated by fault-tolerant explicit message queues. Each service process, as it worked on messages in its queue, would checkpoint its memory to a separate warm standby copy of itself, which the executive would always ensure ran on a separate CPU node inside the chassis - collectively, the nodes formed a single effective computer in what we now would call clustering.

If anything untoward happened to a node - a CPU fault, an ECC memory failure, or just about anything else - the system didn't try to work out exactly what the consequences of the failure were and stitch up an ad-hoc response to keep running (essentially the kind of hopeless approach encouraged, to no good end, by exception handling in programming languages like Java and C++). Rather, the node was simply shut down and the in-process work abandoned, and the other nodes began a clever recovery process.

In this recovery, the warm standby copy of each affected service process (including its message queue) was found and started, picking up from its last saved checkpoint, and new standby copies were created on freshly selected nodes - so operation resumed relatively seamlessly and without the clients needing to do much.
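
Stripped of everything Tandem-specific (nothing below resembles actual Guardian code; the names and shape are purely illustrative), the process-pair pattern at the heart of this looks roughly like:

    // Conceptual sketch only: the process-pair pattern - the primary checkpoints its
    // state to a warm standby after each message, so the standby can resume from the
    // last checkpoint if the primary's node dies.
    #include <deque>
    #include <functional>
    #include <string>

    struct State {
        long transactionsHandled = 0;      // whatever the service's working state is
    };

    class ServicePair {
    public:
        // sendCheckpoint stands in for the interconnect message to the standby's node.
        explicit ServicePair(std::function<void(const State&)> sendCheckpoint)
            : sendCheckpoint_(std::move(sendCheckpoint)) {}

        // Primary side: process one queued request, then checkpoint before replying.
        void handleMessage(const std::string& request) {
            applyRequest(state_, request);
            sendCheckpoint_(state_);       // standby now has everything needed to take over
            // ... reply to the requester only after the checkpoint is acknowledged ...
        }

        // Standby side: remember the latest checkpoint from the primary.
        void onCheckpoint(const State& snapshot) { state_ = snapshot; }

        // Standby side: the primary's node was shut down - carry on from the checkpoint.
        void takeOver(std::deque<std::string>& unacknowledgedRequests) {
            while (!unacknowledgedRequests.empty()) {
                handleMessage(unacknowledgedRequests.front());   // now acting as primary
                unacknowledgedRequests.pop_front();
            }
        }

    private:
        static void applyRequest(State& s, const std::string&) { ++s.transactionsHandled; }

        State state_;
        std::function<void(const State&)> sendCheckpoint_;
    };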

This, combined with transactional management of writes to disk (these systems were built to do online transaction processing, so their main workload was database management), meant that the microkernel provided software fault tolerance for the system as a whole.

The relevance to the Chaos Monkey is how Tandem used to continually test their recovery system; one of the standard services that could be run was a process which did nothing but wait some amount of time and then issue a command to reset the node it was currently running on, forcing the recovery of that node. Since all faults led to node shutdown, this was an effective simulation technique.

And amusingly, since the suicidal shutdown service was itself one of these recoverable services, the recovery process would instantiate a new warm standby of it on another node, which would in due course reset that node too. Essentially, then, once this service was started you'd see random nodes inside the chassis reset and recover continually, during which time you could run a normal test workload and observe that it carried on with temporarily degraded performance but no other ill effects (and if there were ill effects, that typically meant a bug in one of the services - it wasn't cooperating correctly with the checkpoint system).

It's also worth noting that the SQL database running on the Tandem systems I used - CLX/Rs, during a time I spent at Tandem learning immensely about Tandem's architecture, and making this classic tome one of my all-time desert island books, right alongside The Structure and Interpretation of Computer Programs - was similarly far ahead of its time.

Almost everything about making high-availability scalable web services that is being painfully relearned now can be understood by just making a good study of those Tandem systems; for instance, their database system used key-range partitioning to distribute queries in parallel across the cluster (and could do so quite transparently), much as sharding is used today.

Of course, this is how it ever was; lessons learned in mainframes being forgotten and re-learned by the upstart minicomputer folks, then in turn forgotten and re-learned in the age of the microprocessor, and now once again in the age of web development. But it's remarkable to think that Tandem were basically the only firm pursuing this particular line, and they did it better than most people do now, despite starting some 35 years ago - before Ethernet, before SCSI, before UNIX.