Friday, October 14, 2011

Steam Unmetered

I've long been a fan of Valve's Steam game service, although for this part of the world it's often been infeasible to use it to its full extent. Anyhow, a couple of days ago my ISP, who I was already pretty happy with, made my day by formally announcing their unmetered Steam content server.

Given how many Steam games I've bought over the years but not been able to download - having to carefully watch my usage at the end of each billing cycle and plan what I could get - this was a real boon, on top of TelstraClear having already bumped my data cap from 25GB to 40GB for the same money just a month earlier. Several other New Zealand ISPs have also recently started running Steam content servers, and I expect that like the TelstraClear one they'll be locked to their own customers, as is common in both Australia and New Zealand.

However, Steam doesn't play nice with ISPs who do this; it always downloads from multiple content servers, and since only one of those is unmetered, at best you still get about 1/3 of the data from the regular Steam servers. Better than nothing, but still a nasty chunk out of your data cap - and if things go badly, when the content hasn't yet been replicated to the local server, Steam will fetch the whole thing from its regular metered servers rather than waiting, potentially leaving you with a hefty bill. Not cool.

The third-party add-ons for trying to prevent this are awful, and the only one which claimed to work at all on Windows XP evidently had its homepage vanish just before the launch. So, I wrote my own, which should work better than the other techniques; it defaults to directing Steam at TelstraClear's server, but it can be pointed elsewhere if you're on another ISP which provides an unmetered server.

As it happens I've written various Detours-like code before, along with fun things like a Windows DLL loader for DJGPP under MS-DOS and things of that ilk, but of course I lost all that and had to leave it behind at Symantec. So, it was kinda fun to take a fresh crack at that; I didn't go to too much effort as this was a hack I put together in a few hours (mainly while waiting for tests to run on other code) so it's not quite perfect, but I rather like the approach.

This does DLL injection in a way I think is reasonably clean; it opens the target process, allocates a page of memory, builds an argument frame, and then JITs some shim code in after it. One of the mostly-handy but sometimes-irritating things about the x86 is that it tends to use a lot of relative addressing, but this method of building the injection shim creates a nice, clean, address-independent result without much work.

The shim takes care of loading the target DLL (using a path string passed through the argument frame), obtaining the desired entry point via the exported function name (ditto), and then calling the function with an argument string (again, in the frame) before releasing the reference count on the loaded DLL and returning. This injected code can be invoked from another process using CreateRemoteThread, and then the shim can be deallocated - if the DLL containing the called function wants to persist past that, it can just add another reference to itself.
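
To make that concrete, here's a minimal sketch of the classic, simpler variant of the same plumbing - the well-known LoadLibrary-via-remote-thread skeleton, not my frame-and-shim version, which replaces the bare LoadLibraryA call below with the JITted shim that also does the GetProcAddress and entry-point call inside the target:

#include <windows.h>
#include <string.h>

/* Classic remote-thread injection skeleton. Relies on kernel32 being
   mapped at the same base in every process, so our LoadLibraryA
   address is valid in the target as well. */
BOOL InjectDll(DWORD pid, const char *dllPath)
{
    SIZE_T len = strlen(dllPath) + 1;
    HANDLE thread = NULL;
    HANDLE proc = OpenProcess(PROCESS_CREATE_THREAD | PROCESS_VM_OPERATION |
                              PROCESS_VM_WRITE, FALSE, pid);
    if (!proc)
        return FALSE;

    /* Allocate a page in the target and copy the DLL path into it. */
    void *remote = VirtualAllocEx(proc, NULL, len, MEM_COMMIT | MEM_RESERVE,
                                  PAGE_READWRITE);
    if (remote && WriteProcessMemory(proc, remote, dllPath, len, NULL)) {
        thread = CreateRemoteThread(proc, NULL, 0,
            (LPTHREAD_START_ROUTINE)GetProcAddress(
                GetModuleHandleA("kernel32.dll"), "LoadLibraryA"),
            remote, 0, NULL);
        if (thread) {
            WaitForSingleObject(thread, INFINITE);
            CloseHandle(thread);
        }
    }
    if (remote)
        VirtualFreeEx(proc, remote, 0, MEM_RELEASE);
    CloseHandle(proc);
    return thread != NULL;
}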

What makes this worth the effort is a couple of things, the first being that it's really, really, really bad to do non-trivial things in your DllMain function. Going to the effort of doing DLL injection this way means that you can call arbitrary code in a target process without having to rely on any of that. Another nice thing - and in practice the most important - is that unloading is clean: your DLL can export an unload call for the shim to use, which calls FreeLibrary to drop the reference the DLL holds on itself, before the temporary code in the outer shim calls FreeLibrary one final time to actually get the unload done.
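
To make that unload dance concrete, here's a hedged sketch of the DLL side (the exported names are mine, purely illustrative): the startup entry pins the DLL with an extra self-reference, and the unload entry drops that pin so the shim's final FreeLibrary actually unmaps it.

#include <windows.h>

static HMODULE g_self;

/* Called via the injection shim once the DLL is loaded. */
__declspec(dllexport) DWORD WINAPI Startup(LPVOID arg)
{
    /* Pin ourselves with an extra reference so we outlive the shim's
       own FreeLibrary when it cleans up after the initial call. */
    GetModuleHandleExW(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS,
                       (LPCWSTR)&Startup, &g_self);
    /* ... install hooks, start worker threads, etc. ... */
    return 0;
}

/* Called via the shim again when it's time to go away. */
__declspec(dllexport) DWORD WINAPI Shutdown(LPVOID arg)
{
    /* ... remove hooks ... */
    FreeLibrary(g_self);  /* drop the self-pin; the outer shim's final
                             FreeLibrary then really unloads the DLL */
    return 0;
}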

While this particular code isn't that special (it was after all just a quick hack to scratch an itch I had), it's worth noting that a lot of this kind of code injection is going on these days. Lots of add-ons for games, in particular, like to use it. There are graphics and script extenders for Bethesda's Morrowind and Oblivion and Fallout games, for instance, all of which use similar techniques to get inside the original code and make it do new things. And of course, the shim engine included in Windows itself for Application Compatibility purposes can do some quite nifty things too.

One of the nicest parts of this is actually how little reverse-engineering you need to do (indeed, none at all for this particular app), because the API componentry in Windows itself is engineered so well for debugging.

Sunday, July 10, 2011

The Byte Shop

In 1981, the first retail computer store in New Zealand opened in downtown Auckland, in Fort Street just off Queen Street; called the Byte Shop, it was run by a wonderfully enthusiastic man, the late Andrew Tearle, whose death I only just learned of while preparing to write this post.

When I was 14, I often took the bus from Glenfield College to Auckland after high school on a Friday; most of the time, my main aim was to visit the then brand-new video arcades and play a little Pinball or Galaxian. Once the Byte Shop opened though, it became a mecca for a number of young nerds like myself to gawk at the computers and the programming books.

The first couple of undergraduate programming texts I bought came from there; Principles of Interactive Computer Graphics (by Newman and Sproull) came from the Byte Shop, as did Sorting and Sort Systems, by Harold Lorin, from which I learned the wonders of things like the Polyphase Merge Sort (extremely important on machines of those days, when 16Kb was still considered a lot).

Andy himself was a genuinely friendly guy, and very tolerant of us computer-obsessed youngsters using his store as a hangout to discuss programming; I hope it worked out for him having us around to help answer some of the curlier technical questions customers had about the machines.

Lots of great meetings happened there; one of the most important for me was with a young man named Justin Farrelly. While chatting we found out that he was an Air Force avionics engineer based at 5 Squadron in Hobsonville, meaning he was only a couple of kilometres from where I lived with my father in Greenhithe, just on the other side of the harbour across a bridge.

Justin and I became firm friends and worked together for many years; it was a wonderful partnership, with Justin doing electronics design and me writing software. Around the time I finished high school, Justin left the Air Force and went into business for himself, with me coding for him. We did a wonderfully diverse set of embedded projects, mostly based on the Intel 8051; later on Justin took pilot training, learned to fly helicopters, and went down to Antarctica to winter over with Greenpeace to help them get access to the Antarctic treaty negotiations: in this image on the Greenpeace site, Justin is the chap on the left smiling at the camera in a blue coat.

After that Justin went all over the world and we didn't work together again, but those years from 1981 through the late 80's were incredibly productive and fun. And without Andrew Tearle at the Byte Shop, that wouldn't have happened.

Saturday, July 9, 2011

The Turbo C Overlay Manager

During the late 80's I was doing most of my work - writing point-of-sale software for the retail oil industry - in a mixture of assembly language and Turbo C, rewriting my earlier point-of-sale system (written in Turbo Pascal on the not-quite-PC-compatible Sanyo MBC 550), which had used a cooperative multitasking system, in favour of one with a proper preemptive task switcher.

Even more fun, this little doodad took advantage of the technique of coercing DOS into supporting multitasking by using the INDOS flag, plus the very under-documented fact that the MS-DOS kernel thoughtfully stored all the key state for the current program in a contiguous block of memory. With the address and size of this block, as long as the MS-DOS kernel wasn't busy doing I/O at the time, task switches could be done by not just saving the CPU state and switching stacks as per normal, but also saving and restoring the MS-DOS state in the "swappable data area".
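
For flavour, obtaining that flag looks roughly like this (a hedged Turbo C-style sketch; the layout of the swappable data area itself varied between DOS versions, so I'm only showing the InDOS part):

#include <dos.h>

static unsigned char far *indos_flag;

/* INT 21h function 34h returns the address of the InDOS flag in ES:BX. */
void init_indos(void)
{
    union REGS r;
    struct SREGS s;

    segread(&s);
    r.h.ah = 0x34;
    intdosx(&r, &r, &s);
    indos_flag = (unsigned char far *)MK_FP(s.es, r.x.bx);
}

/* A task switch is only safe while DOS isn't re-entrantly busy. */
int dos_is_idle(void)
{
    return *indos_flag == 0;
}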

With two monitors, the resulting point-of-sale system would run as a TSR in the background, typically driving a monochrome adapter, several serial ports (with its own keyboard) and printers, while the rest of the machine was available to run a back-office program on the colour display adapter. The next elaboration was to use the EGA adapter, but to split each colour plane out and use it to drive a separate green screen; since the task system would happily run several instances of the main point-of-sale task, we could run up to three point-of-sale registers from the same PC, which could still run the accounting software on the other display adapter.

While this worked well, as features were added to the point-of-sale system the memory pressure in real mode started to grow, so I looked into one of the fancy features of Turbo C: what Borland called the VROOMM overlay manager.

Overlays were a crude technique in the early days; typically, functions in a Pascal program would be designated as "overlay" routines, in which case they were split out from the main executable into a separate piece managed by library code in the main routine. Before calling an overlay routine, the caller would ensure that the right overlay was loaded in first so that the code for the routine was resident in memory.

All well and good, but with one drawback; since the memory used to load each of the routines in the "overlay" sections was overwritten as different code was loaded in, if an overlaid routine called a non-overlaid routine, which then called into a different overlay, the return path could lead back into code that had since been overwritten - and things didn't work out so well.

VROOMM was more clever than that, though; some clever folk at Borland had found a neat trick to allow routines in overlays to call each other safely. I believe the same fundamental idea was also picked up by real-mode Windows to manage code sections: real-mode Windows was under really extreme memory pressure, so pretty much all application code needed to be swappable, which resulted in similar problems.

With some persistence and reverse-engineering, it turned out that the overlay support was mediated by the linker which glued together the final program; the linker emitted a small thunk for each entry point in any object module which had been tagged on the linker command line for overlaying, and this thunk took care of transparently loading the right overlay segment on function entry.

However, it also cooperated cleverly with any code higher up in the call stack; since Borland's compiler used a consistent frame format, the overlay library walked up through the stack looking for return addresses whose code segment matched the overlay area. If it found one of these call records on the stack it would rewrite it, extracting the original return address and replacing it with one pointing into the overlay manager. Finally, it would mark that stack frame as one it had modified by setting the low bit of the saved frame pointer (which for normal frames would always be 0, as the 8086 stack is always 16-bit aligned).

This meant that when the flow of control eventually tried to return to code in an evicted overlay, the rewritten return address would be used instead, landing in a thunk in the overlay manager which would reload the correct overlay segment and fix the stack back up to its original state before continuing execution.
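
In outline, the fixup looks something like the following hedged sketch - the BP-chain walk and low-bit tag are as described, though where the original return addresses get stashed is my own simplification, not Borland's actual bookkeeping:

/* On the 8086 the stack is word-aligned, so a saved BP always has its
   low bit clear - leaving that bit free to mark rewritten frames. */
#include <dos.h>

struct frame {
    unsigned saved_bp;           /* caller's BP; low bit used as a tag */
    unsigned ret_off, ret_seg;   /* far return address */
};

#define MAX_FIXUPS 32
static struct { unsigned off, seg; } orig_ret[MAX_FIXUPS];
static int n_fixups;

void evict_fixup(unsigned ss, unsigned bp, unsigned ovl_seg,
                 unsigned thunk_seg, unsigned thunk_off)
{
    while (bp != 0 && n_fixups < MAX_FIXUPS) {
        struct frame far *f = (struct frame far *)MK_FP(ss, bp);

        if (f->ret_seg == ovl_seg && !(f->saved_bp & 1)) {
            orig_ret[n_fixups].off = f->ret_off;  /* remember the real */
            orig_ret[n_fixups].seg = f->ret_seg;  /* return address    */
            n_fixups++;
            f->ret_off = thunk_off;   /* divert the return into the     */
            f->ret_seg = thunk_seg;   /* overlay manager's reload thunk */
            f->saved_bp |= 1;         /* tag this frame as rewritten    */
        }
        bp = f->saved_bp & ~1;        /* follow the chain, ignoring tags */
    }
}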

This made the overlay process pretty transparent, and reasonably reliable, although since it worked at a linker level it sometimes took some care to ensure that the right routines were bundled together, so the call hotspots didn't thrash overlay loads.

One remaining problem with it, though, was that it still needed a lot of space for the overlay region itself - as much as the biggest segment it might load, which in practice, for a program big enough to need overlaying, typically meant 64kb. That was a particular problem for our point-of-sale TSR.

So, having reverse-engineered this I decided to get creative and write my own overlay manager, which took things to the next level by putting the overlay region in an unusual place: the Expanded Memory System page frame. My replacement overlay manager preloaded all the overlays into expanded memory, and then handled overlay loads by remapping the EMS page frame to the right pages, for both zippy performance and no additional burden on MS-DOS real memory.
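
The remap itself is a tiny operation; here's a hedged sketch of the EMS call involved (INT 67h, function 44h, "map handle page"), which swaps a preloaded 16KB logical page into one of the four physical slots of the page frame:

#include <dos.h>

/* Map one logical EMS page into a physical slot of the page frame. */
static int ems_map_page(unsigned handle, unsigned char phys_page,
                        unsigned log_page)
{
    union REGS r;

    r.h.ah = 0x44;          /* EMS: map/unmap handle page */
    r.h.al = phys_page;     /* which 16KB slot of the page frame */
    r.x.bx = log_page;      /* which preloaded logical page */
    r.x.dx = handle;        /* EMS handle holding the overlays */
    int86(0x67, &r, &r);
    return r.h.ah == 0;     /* AH = 0 means success */
}

/* "Load" a 64KB overlay: remap all four 16KB pages of the frame. */
int load_overlay(unsigned handle, unsigned first_log_page)
{
    int i;

    for (i = 0; i < 4; i++)
        if (!ems_map_page(handle, (unsigned char)i, first_log_page + i))
            return 0;
    return 1;
}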

And since I had a preemptive multitasker which was already managing the DOS state switching, I then expanded that to track the EMS page frame state as well, so that my application which was running as a TSR could run concurrently with the main program even if the main program was using EMS as well. Hey presto, we could then run our point-of-sale in the background with something like 96kb of real memory load; 32k of core C library and overlay manager plus a 64kb data segment for the point-of-sale task instance.

And since one of the things I wanted to do around that time was write my own UNIX kernel for a 68k board to replace all this DOS malarkey, when I was working on that I could use this technique to help get GCC running in real-mode MS-DOS, although even with the overlay manager running out of the page frame, GCC still wouldn't fit. Fortunately, that too turned out to be a solvable problem....

Friday, July 8, 2011

Reverse Engineering

One of the things about learning to program is that, like just about anything else, it takes effort to learn to do it well. Reading a book is all very well, but reading well takes effort too. What distinguishes most of the Great Works of programming literature is that they are not easy, and one of the ways they stretch the reader - even those works not intended to be used as college texts - is the presence of exercises.

What a programmer says they have read is interesting, but the thing to really do is probe whether they have done the exercises. The Art of Computer Programming is a great set of books, but of the all too few people who have read it, even fewer have even glanced at - let alone done - any of the exercises. That's a shame, both because of the lost learning and because in the great texts the exercises are usually carefully constructed to take the reader beyond the main text.

[ Of course, back when I was a lad starting out, I didn't have the benefit of ready access to the Great Works (many of which hadn't been written yet, of course). Getting access to a machine was quite enough, at least circa 1980 in New Zealand, hence having to still use card decks to share a "personal" computer. ]

One of the more interesting learning exercises I went through in my early days came via the medium of computer chess, thanks to the classic early chess program Sargon II, by Dan and Kathe Spracklen - immortalized in the classic horror film The Thing as the program Kurt Russell plays against. My father had taught me the basics of the game, but lacking many other people to play (and having discovered the distraction of computing) I didn't play enough to become any good; when a copy of Sargon II turned up I naturally gave it a whirl, and got a nasty surprise.

If you're feeling keen to get the retro vibe, you can try Sargon in an emulator, although I've linked to Sargon III instead of II as it's rather less forbidding to get started with - one small trick if you use the emulator is that the original Apple ][ had no lower-case. You have to enter moves in algebraic notation (using the board positions) as was a common convention for written chess games, and ESC toggles between the graphical board display and the move list.

Now, there's nothing particularly earth-shaking about Sargon's play, and when I could get enough time on an Apple ][ to play it (which I did; I hate to think how annoying and persistent I must have been) I could beat it handily at lower levels - but by difficulty level 3 it was consistently beating me, enough that I found it quite frustrating. At that point I could have just kept playing the program to get better at chess, but I didn't; instead I decided that I needed to figure out how it worked, reasoning that insight into its underlying playstyle would be a better (and less painful) learning method than being repeatedly beaten.

The other advantage of studying how the program worked was that unlike playing against it, I didn't actually need time with access to the computer to study the program.

Of course, I've since discovered that the original Sargon was published as a book consisting of a commented Z80 source code listing - a common way of distributing programs in the U.S. back in those days, as exemplified by Compute! magazine - but that book never made its way to our shores. However, there were disassemblers for the 6502; indeed the Apple ][ system ROM contained one, and an even better one by Glen Bredon called Sourceror was a companion to the assembler Big Mac, which I'd started to play with while writing ever more complex 6502 assembly language of my own.

So, I loaded the Sargon II binary into memory, pointed Sourceror at it, and got a rough assembly dump; then with some persuasion I got it printed out into a big stack, which I carried with me everywhere for a month or so at high school, spending every free moment poring over it, trying to trace the program's execution and divine its logic and internal data structures (including working out which parts of the assembly dump were actually data rather than code).

While that printout is long gone, I can at least cherish the memory of unwinding it all and discovering how it worked. In essence, the key element - and the part of the program I spent most time trying to figure out, since that meant making suppositions about its data structures and then tracing through the code to see if those guesses made sense - was a position evaluator that "scored" a given board layout by scanning it to see which moves were available, and thus which pieces were threatened with capture. The sum of the weights of the remaining pieces, with weight values chosen to reflect their tactical power, adjusted by the sum of the weights of pieces threatened with capture, gave a simple score for the "desirability" of that position.

[ There was an interesting adjustment to the weights of pieces for pawns; their point weighting changed depending on how many rows across the board they had advanced, to reflect the potential value of the pawn if it could reach the end row and be queened. ]
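
In code, that kind of weighting only takes a few lines; a hedged sketch with illustrative values (not the Spracklens' actual numbers):

/* Illustrative piece weights - a pawn's weight grows with the number
   of rows it has advanced, reflecting its promotion potential. */
enum piece { NONE, PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

int piece_weight(enum piece p, int rows_advanced)
{
    switch (p) {
    case PAWN:   return 2 + rows_advanced;  /* worth more near queening */
    case KNIGHT: return 6;
    case BISHOP: return 6;
    case ROOK:   return 10;
    case QUEEN:  return 18;
    case KING:   return 1000;               /* effectively priceless */
    default:     return 0;
    }
}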

The difficulty level of the program then controlled how deep into the tree of possible positions the program would recursively search, using simple Alpha-Beta pruning to try and constrain the search; the position it scored as the best future outcome at the requested depth determined the program's choice of move. Although my printout with hundreds of scribbled annotations is long lost, you can get a sense of it by reading the Sargon I source, along with its original commentary.
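
The search loop itself is easy to sketch (again a hedged, modern rendering of the scheme rather than Sargon's actual code, with move generation stubbed out so it stands alone):

#include <limits.h>

struct board { int state; /* position details elided for the sketch */ };

/* Stubs so the sketch compiles; a real program enumerates the legal
   moves and scores material as described above. */
static int evaluate(const struct board *b) { (void)b; return 0; }
static int gen_moves(const struct board *b, struct board out[]) {
    (void)b; (void)out; return 0;
}

#define MAX_MOVES 64

/* Negamax with alpha-beta pruning: the score of the best reachable
   position at the requested depth, from the mover's point of view.
   Call as search(&start, depth, -INT_MAX, INT_MAX). */
int search(const struct board *b, int depth, int alpha, int beta)
{
    struct board moves[MAX_MOVES];
    int i, n, score;

    n = (depth == 0) ? 0 : gen_moves(b, moves);
    if (n == 0)
        return evaluate(b);          /* leaf: just score the layout */

    for (i = 0; i < n; i++) {
        /* What's good for the opponent is bad for us, hence negation. */
        score = -search(&moves[i], depth - 1, -beta, -alpha);
        if (score > alpha)
            alpha = score;
        if (alpha >= beta)
            break;                   /* cutoff: opponent avoids this line */
    }
    return alpha;
}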

I definitely owe the Spracklens a debt of gratitude; the fact that their chess program was better at chess than I was helped make me a developer, by making me mad enough to learn how to pull programs apart from the raw bytes. I might not have put in enough effort to learn otherwise, but I'm certainly glad I did.

Otherwise, for instance, I might not have gone on to study disassemblies of the Apple ][ BASIC floating-point code (helped by "What's Where in the Apple", which I don't think I have kept, although I do have some other old bits of Apple ][ trivia), learning floating-point and coming to a fuller understanding of the Taylor and Maclaurin series expansions - around the same time I was teaching myself calculus, and recognising the series expansion by figuring out what the tables of constants represented was an epiphany to treasure.


As I said at the start, all good learning takes effort; and although I never did pursue chess seriously, the effort of reverse-engineering that chess program took quite a bit of time and really paid off. Even though it seems like a frivolous exercise, and at the time I certainly had no ambition beyond the exercise itself, it had real practical value. Reverse engineering is a powerful skill which I put to good use in the next few years (and indeed, it serves me well to this day, albeit with better tools to help the process), first writing my own development tools for embedded systems - the SC/MP and Intel 8051, notably - and then using them to write lots of embedded code and system-level stuff.

Thursday, May 19, 2011

CPU Emulation for Improved Debugging

I'd been meaning to describe this for a while, but hadn't got around to it; since it's somewhat related to Fabrice Bellard's great Javascript hack, that finally got me motivated to write it up.

After Symantec shut down development of Ghost Solution Suite in early 2009 (closing our offices and laying off all but myself and another senior developer), my job changed rather rapidly. At the time I had been developing a Javascript interpreter for embedding into the Ghost product release due later that year, something I had finally gotten management approval to do in mid-2008 some 10 years after deciding that I wanted to build the system to support scripting in 1998.

Now, despite not having an actual script parser, the system was very much set up to support dynamic languages, as I'd been a Lisp and Smalltalk fan for many years. Although I'd read about Smalltalk occasionally since the early 80's, I actually first used it seriously via the excellent Actor programming language in the Windows 3.1 era - Actor had a more Pascal-inspired surface syntax, but its semantics were pure Smalltalk, and it had a lot of influence; its excellent GUI class library was licensed by Borland, and its design, along with some of the Smalltalk concepts such as reflection, was very successfully incorporated into Delphi - designed by Anders Hejlsberg, who later created the even more Smalltalk-like .NET environment. Oddly, Actor ended up owned by Symantec - I did spend some time trying to locate the source while I was there, but as with the QEMM and DESQview/X source which I went looking for after the Quarterdeck acquisition, it was basically impossible to find.

So, when I set out in 1998 to build the Ghost Enterprise (as it was called then) management components, I decided a few simple things: it would be in C++, it would be garbage-collected, network transport would work by binary marshalling of objects and collections, and the core library of management objects would use a Smalltalk-like class library with heterogeneous collections - with C++ routines working on specific types using dynamic type inquiry, written in the style of the Eiffel assignment-attempt operator via C++ operator overloading.

All fairly simple and straightforward, but it's fair to say that the resulting mix of styles wasn't exactly something the other folks on the team found easy to work with; another problem was that the debugging experience in Visual C++ and later Visual Studio wasn't entirely stellar. I was entirely used to this kind of thing, so it didn't bother me - aside from my years of assembly language and embedded work, when I was working on Tandem CLX/R mainframes the only C++ compiler was a port of Cfront 2.0 done with Lattice C, which meant that debugging was done on the Cfront-translated source - so I probably didn't appreciate that it should have been easier to debug.

Anyway, in 2009, since development had been cancelled midway through the project to deliver the next version, my job became one of stabilising what we had done so that it could be released if necessary, and taking over more maintenance (previously I was maintaining most of the management framework, but with only two of us left behind to look after the entire 1.5M LOC suite, that scope expanded quite a bit). A few months later, once a small team in India was put together, we started training them up on how things worked, at which point the rather esoteric style of coding became more of a problem, especially because of the poor debugging experience in Visual Studio.

Now, what was particularly awkward about the C++ source debugging in Visual Studio (and this applied for every version from 2002 through to 2008) was that although it mostly worked well in templated code where the full type of an object was statically available, it would struggle mightily with inspecting an object in a context where the static type was of a base class (or COM interface). On occasion it would manage to tell you what the real type of the pointed-to object was, but in general it couldn't manage to - and the Ghost management platform was written in a coding style which meant almost everything pointed to a generic base, so you'd be left with a soup of raw pointers that you'd have to manually cast to something to make sense of.

This resulted in a plea from the newly-formed maintenance team to do something to help them manage. But what could I do? There was, in the Visual Studio debugger, an extensible system for writing custom data inspectors, but it was entirely driven by matching things by static type - the exact piece of information that Visual Studio wasn't correctly working out for itself. And while it's possible to extend the .NET parts of Visual Studio in interesting ways including writing custom visualizers and expression evaluators, the C++ native code debugging didn't appear to have those extension points; the only solution seemed to be a plug-in system created for Visual C++ 6.0 which later releases had kept as a legacy system.

So, I had an extension interface which I could use to generate a string, and all the objects in my run-time knew how to print themselves (either in JSON or an earlier, more Lisp-like native text syntax) - but the extension system had only one way to look at the debug target: a callback function which would read bytes of memory from it.
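
From memory, the shape of that legacy interface was roughly as follows - treat the details as approximate and check the old autoexp.dat add-in documentation before relying on any of it - but note how the helper structure really does give you nothing beyond a read-memory primitive:

#include <windows.h>

/* The VC6-era expression-evaluator add-in interface, roughly as I
   recall it: the debugger hands over the variable's address plus a
   helper whose only real capability is reading debuggee memory. */
typedef struct tagDEBUGHELPER {
    DWORD dwVersion;
    HRESULT (WINAPI *ReadDebuggeeMemory)(struct tagDEBUGHELPER *pThis,
                                         DWORD dwAddr, DWORD nWant,
                                         VOID *pWhere, DWORD *nGot);
    /* later versions add 64-bit-address variants after this */
} DEBUGHELPER;

HRESULT WINAPI RenderObject(DWORD dwAddress, DEBUGHELPER *pHelper,
                            int nBase, BOOL bUniStrings,
                            char *pResult, size_t maxLen, DWORD reserved)
{
    /* This is where the emulator slots in: fault pages in through
       pHelper->ReadDebuggeeMemory, run the object's own print routine
       inside the VM, then hand the resulting text back. */
    lstrcpynA(pResult, "<printed via emulated call>", (int)maxLen);
    return S_OK;
}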

At this point, I had the inspiration. Well, if my program's objects could print themselves, and the ONLY thing I could do was read bytes of memory, then the only possible way forward was to write an x86 (or x64, since all the management platform code and JS interpreter also ran 64-bit) software VM and put that in as the debug plug-in; the plug-in could then set up a frame in the emulated stack which would call the print routine in the target object to print to a supplied buffer, then extract the result from the VM and pass it back to Visual Studio.

How hard could it be?

As it turned out, not hard at all, although there were a few things that did surprise me while putting it all together over the course of about three weeks. I went with a simple, straightforward emulator rather than a translator - although I would have liked to have a crack at dynamic translation (and still would, when I have the time) this particular problem didn't really warrant it.

Writing an x86 and/or x64 emulator (mine actually supported both) really isn't that hard - aside from knowing the architectures reasonably well, being able to decode x86 code has enough practical uses that I'd actually done it some years before, to write an API interception library similar in spirit to the Microsoft Research Detours project. My interceptor would disassemble the start of the interception target and relocate that code to a side thunk which branched back into the original location, making room to insert a branch to the thunk. This method allowed the hook to work safely even in Windows 95, where the kernel code was system-global: the thunk I'd patch in would compare the current process/thread ID so it'd only activate in the right process context.

Since I was emulating user-mode code, I probably didn't need to emulate the x86 paging unit, but I did a rough emulation of it anyway, since that also seemed to be the sanest way to use the memory-read callback which I had to use to get at the debug target state. As instructions requested memory, I would consult the emulated MMU, and if the page wasn't already present I'd populate it using the memory-read callback - in effect "faulting in" the contents of the debug target as I needed them.
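
The fault-in logic is simple; a hedged sketch (the names and the direct-mapped cache are mine) of how a flat guest address space can be backed lazily by that read callback:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_SLOTS  1024              /* small direct-mapped page cache */

typedef int (*read_target_fn)(uint64_t addr, void *buf, size_t len);

typedef struct { uint64_t page; uint8_t *data; } PageSlot;
static PageSlot slots[NUM_SLOTS];

/* Return a host pointer for a guest address, faulting the page in
   from the debug target the first time it's touched. */
uint8_t *guest_ptr(uint64_t addr, read_target_fn read_target)
{
    uint64_t page = addr >> PAGE_SHIFT;
    PageSlot *s = &slots[page % NUM_SLOTS];

    if (s->data == NULL || s->page != page) {
        if (s->data == NULL)
            s->data = malloc(PAGE_SIZE);
        if (!read_target(page << PAGE_SHIFT, s->data, PAGE_SIZE))
            memset(s->data, 0, PAGE_SIZE);   /* unmapped in the target */
        s->page = page;
    }
    return s->data + (addr & (PAGE_SIZE - 1));
}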

What did surprise me is how much of the x86 instruction set I needed to support: not just regular 8087 floating-point (not too hard to emulate using plain C++ floating-point math, except for quirks like rounding modes), but also a number of SSE2 instructions which turned out to be conditionally used in the Visual C++ run-time library as optimizations. Although those paths were guarded by CPU feature tests, the debug target process had typically already gone through the tests and set up global state recording that the instructions existed, so I had to emulate them too - which meant including a suitable register file organization in the emulator for the xmm register set.

In addition, as the printing code I was calling in the target was pretty general-purpose, it relied on things like iostreams - which meant memory allocation, and entering Win32 APIs to handle critical sections and the like. So although I didn't need to do any hardware emulation, I did need to provide a fairly complete emulation of the Win32 process environment, including manufacturing Thread Environment Block and Process Environment Block structures as part of creating the initial VM for the emulated call to the print routine.

Possibly the most entertaining part of the whole thing was devising a scheme for representing the condition code computations; fortunately I ran across this rather clever and reasonably clear article, which demonstrates some neat boolean hackery to recover the carry-out and overflow status values for an operation even though they aren't directly available with C++ arithmetic. While lazily computing the condition codes is conceptually simple enough, I hadn't actually bothered to start with - but once I stumbled across the above article, it was too tempting not to make the CC emulation lazy.
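
The payoff of the lazy scheme is that the hot path does no flag work at all; each flag is recovered from the recorded operands only when an instruction actually reads it. A hedged sketch for 32-bit addition, using the same boolean identities the article demonstrates:

#include <stdint.h>

static uint32_t cc_a, cc_b, cc_r;   /* operands/result of last ADD */

void emu_add32(uint32_t a, uint32_t b, uint32_t *dst)
{
    cc_a = a; cc_b = b; cc_r = a + b;   /* no flag work here */
    *dst = cc_r;
}

/* Carry-out of bit 31 of a + b, recovered after the fact. */
int flag_carry(void)
{
    return ((cc_a & cc_b) | ((cc_a | cc_b) & ~cc_r)) >> 31;
}

/* Signed overflow: both operands agree in sign, the result differs. */
int flag_overflow(void)
{
    return ((cc_a ^ cc_r) & (cc_b ^ cc_r)) >> 31;
}

int flag_zero(void) { return cc_r == 0; }
int flag_sign(void) { return (int32_t)cc_r < 0; }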

Finally, after making the above work with a direct call to the print routine, I couldn't resist one final improvement, even though I knew it would probably not ship: the root of my dynamic class library was an IUnknown, so rather than directly call the print routine, the plug-in DLL first performed an emulated virtual function call on QueryInterface for a debug-helper interface IID, and if that was supported it emulated those COM calls to get the printed representations of things portably - so the plug-in system could be adapted to work with all kinds of stuff, not just my personal GC'd class library.

This is what made the whole thing worthwhile to me, in that it turned my little DLL into a general extension system that almost any C++ program could use to present more useful printed representations of its objects under debugging.

The end result worked astonishingly well, especially given one final quirk of Visual Studio: the plug-in DLL containing the VM code got loaded and unloaded on every single attempt to resolve an expression. If you hit a breakpoint and the debug window wanted to show the values of, say, 16 local variables which had dynamic object type, the VM system would be loaded, run, and then completely torn down 16 times. Upon discovering that, it was tempting to add an extra shim DLL to unload the VM only on a timer, but in actual fact things ran fast enough that I decided not to bother.

Despite being a relatively naive direct interpreter, this turned out to perform more than adequately in practice; the emulator DLL was fully statically linked and occupied under 150kb, and even though it was probably about 50 times slower than native code, it was fast enough at printing items - even quite complex nested hash tables and such - that tapping F8 to single-step through code under debug was still quite snappy.

A particular bonus of this approach compared to other debug systems was that it also worked with post-mortem dumps supplied by customers: we could be sent a minidump of any code which failed in the field, and this debug assistant would be able to untangle the representation of all the objects by just running it.

In contrast, compare that to the co-operative debugging model used in .NET, where processes running under a debugger have additional debugger support code injected into them, which the normal debugger will then RPC to in order to obtain rich debug state and get objects marshalled out. That kind of cooperative debugging does create a great experience, but it's not so great for post-mortem work. Using an emulator to partially run the debug target is a great way to enhance the post-mortem debug experience, and indeed with a bit of additional elbow grease the VM could probably even support running the cooperative-debugging extensions inside the emulated process.

Having done all this, I had hoped that wouldn't be the end of it - the plan was to extend my Javascript interpreter with a JIT-to-x86 once I got it fully working as a direct interpreter, and since the emulator was so tiny - about 50kb of compiled code - I could include it with the run-time so it could assist with developing and debugging the JIT by running the built code in a little VM. Unfortunately, as with all the code I wrote at Symantec, I lost access to this when I was laid off, and until I get motivated to write another language run-time (or I decide to go back to my 1980's roots writing embedded software development tools) I probably won't have a good excuse to repeat the exercise anytime soon. However, seeing Fabrice Bellard's awesome Javascript hack in action is a good reminder that this kind of CPU emulation is really handy, and it's actually not as difficult to do as you probably think.

In fact, it's really quite an entertaining kind of project to set for yourself, and I recommend it as an exercise - you may not end up running Linux in a browser, but it's not only fun, in my case at least it was quite practical.

Tuesday, April 26, 2011

The More Things Change...

Jeff Atwood's latest post, on the obvious lessons of the recent minor EC2 outage at Amazon, reminded me again how everything moves in painful cycles of forgetting and reinvention in the computer industry. In this case the specific technique he refers to Netflix using, their "Chaos Monkey", is basically the exact same thing Tandem Computers regularly used in their Guardian OS some 30 years before Netflix.

It's hard to overstate just how many great innovations in computing were pioneered by Tandem; their main problem commercially was really being so far ahead of their time in so many ways. Many of the architectural innovations they didn't just invent but turn into commercial success are being directly copied in the modern web era; building distributed web systems with EC2 virtual machines and message queuing systems (or their equivalents in the Google or Azure ecosystems) are just scaled-out versions of what happened inside a Tandem mainframe's chassis.

In the case of the Chaos Monkey, it's important to understand that Tandem's fault-tolerant machines weren't built with triple-modular redundancy, as a lot of people always seemed to assume. Rather, the fault-tolerance of the system was due to comprehensive hardware error detection (obviously you have to detect faults to recover from them) combined with having no single point of failure. This did mean that there were duplicates of things inside the chassis: the disks, for instance, had two independent paths to them from different CPU nodes, and the CPU nodes didn't share memory but rather had two separate internal interconnects (in essence, two mini-LANs) connecting them - but all of this actually got used; it wasn't there "just in case".

The brilliance of the Guardian OS architecture (and remember, this was set up in the 1970's) was that it was a microkernel-based distributed system, composed out of small servers (small by necessity, as the architecture was essentially 16-bit, so a service process had 64k 16-bit words of data storage) which communicated by fault-tolerant explicit message queues. Each service process, as it worked on messages in its queue, would checkpoint its memory to a separate warm standby copy of itself, which the executive would always ensure ran on a separate CPU node inside the chassis - collectively, the nodes formed a single effective computer in what we now would call clustering.

If anything untoward happened to a node - a CPU fault, an ECC memory failure, or just about anything - the system didn't try to work out exactly what the consequences of the failure were and stitch up an ad-hoc response to keep running (essentially the kind of hopeless approach encouraged, to no good end, by exception handling in programming languages like Java and C++). Rather, the node was simply shut down and the in-process work abandoned, and the other nodes began a clever recovery process.

In this recovery, the warm standby copies of each service process (including its message queue) were found and started, picking up from the last saved checkpoint, and new standby copies were made on freshly selected nodes - resuming operation relatively seamlessly and without the need for the clients to do much.

This, combined with transactional management of writes to disk (these systems were built to do online transaction processing, so their main workload was database management), meant that the microkernel provided software fault tolerance to the system as a whole.

The relevance to the Chaos Monkey is how Tandem used to continually test their recovery system: one of the standard services that could be run was a process which did nothing but wait some amount of time, and then issue a command to reset the node it was currently running on, forcing the recovery of that node. Since all faults led to node shutdown, this was an effective simulation technique.

And amusingly, since the suicidal shutdown service was itself one of these recoverable services, the recovery process would instantiate a new warm standby of it on another node, so that it too would in turn be reset. Essentially, once this service was started you'd see random nodes inside the chassis reset and recover continually, during which time you could run a normal test workload and observe that it continued - with temporarily degraded performance but no other ill effects (or if there were, that typically meant a bug in one of the services, which wasn't cooperating correctly with the checkpoint system).

It's also worth noting the SQL database that ran on the Tandem systems I used - CLX/Rs, which I got to know during a time spent at Tandem learning immensely about Tandem's architecture, and which made this classic tome one of my all-time desert island books, right alongside The Structure and Interpretation of Computer Programs.

Almost everything about making high-availability scalable web services that is being painfully relearned now can be understood by just making a good study of those Tandem systems; for instance, their database system used key-range partitioning to distribute queries in parallel across the cluster (and could do so quite transparently), much as sharding is used today.

Of course, this is how it ever was: lessons learned on mainframes being forgotten and re-learned by the upstart minicomputer folks, then in turn forgotten and re-learned in the age of the microprocessor, and now once again in the age of web development. But it's remarkable to think that Tandem were basically the only firm pursuing this particular line, and that they did it better than most people do now, starting some 35 years ago - before Ethernet, before SCSI, before UNIX was widespread.

Tuesday, September 14, 2010

Old Dogs, New(ish) Tricks

One of the many little things I've wanted to have a crack at over recent years but couldn't really do was tackle writing a Regular Expression library for the GC'd runtime I designed and wrote to use inside Ghost Solution Suite. When Russ Cox went and wrote a series of awesome articles on that after I'd written a big chunk of a Javascript interpreter for that runtime, I wanted to do it even more.

Alas, one of the things that I lost when Symantec cut me loose was access to all that code which I'd created and looked after over more than a decade. It's not of any use to them, but that's the way it goes and so I need to start my personal collection of Useful Stuff from scratch again, only this time I'll take a lot more care to avoid losing ownership of it. And to be fair, starting again means this time I don't have to worry quite so much about compatibility with broken C++ compilers, as I had to back in 1995 when even 1990-level template support and return type covariance were considered supremely exotic.

Anyway, while reading Russ's articles again and contemplating where I'd fit such a thing into my long-term roadmap, I noticed an offhand reference to this gem of an idea. It's not a total game-changer, in that most of the times I've hit this kind of problem no sparse representation was feasible, but now I do find myself contemplating ways of trying this out.

As with so many great ideas, as Russ points out, it's not really new at all; there are hints of it being known back at least as far as 1974. Some really neat ideas like this languish in obscurity for a long time; some of my favourite CS papers took a huge amount of time to become well-known, like this classic paper by Richard Brent, which describes a simple and practical method for high-occupancy hash tables that very few people ever knew about.
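
The trick itself is tiny; a hedged sketch of the sparse-set representation in question, where two deliberately uninitialized arrays plus a count give O(1) insert, membership test, and clear:

#include <stdlib.h>

typedef struct {
    unsigned *dense;    /* dense[0..n-1] lists the members */
    unsigned *sparse;   /* sparse[v] = index of v in dense, if present */
    unsigned n;         /* current member count */
} SparseSet;

SparseSet *ss_new(unsigned universe)
{
    SparseSet *s = malloc(sizeof *s);
    s->dense = malloc(universe * sizeof *s->dense);
    s->sparse = malloc(universe * sizeof *s->sparse);  /* stays garbage! */
    s->n = 0;
    return s;
}

int ss_has(const SparseSet *s, unsigned v)
{
    /* Garbage in sparse[] can't fool us: the back-pointer must agree. */
    return s->sparse[v] < s->n && s->dense[s->sparse[v]] == v;
}

void ss_add(SparseSet *s, unsigned v)
{
    if (!ss_has(s, v)) {
        s->dense[s->n] = v;
        s->sparse[v] = s->n++;
    }
}

void ss_clear(SparseSet *s)
{
    s->n = 0;           /* the O(1) reset that makes it all worthwhile */
}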

It's really fun trying these things out and demonstrating just how well they work - often, so well that it's truly baffling why they aren't better known. So, extra kudos to Russ not just on the Regexp front but for popularising a classic algorithm like this.

On the more practical front, just a small addendum to yesterday's note on MSBuild and Visual Studio 2010, about something that had been puzzling me: for years I've been used to setting up dependent projects so that an application which depends on a DLL gets the DLL's .LIB automatically linked with it. However, in VS2010 it wasn't working, which was perplexing me.

It turns out that this is a consequence of a peculiar change in the build system; the old project dependency system still exists (and it's easy to encounter in the VS UI, since it's in the same context menu it's always been in), but it now only deals with build order. The automatic .LIB linking is instead handled exclusively through the "Framework and References" part of the project settings, which is especially bizarre since this mainly affects C++ projects but the relevant UI is very .NET-oriented. So, there's no way this would be easy to discover, but there you have it - treat your DLL projects like .NET assembly references and you're back at the classic behaviour that's been around since at least Visual C++ 6.0 circa 1998.

Monday, September 13, 2010

A Handy Versioning Trick

I've been setting myself up so that I can develop code at home again - something I couldn't do before, because Symantec automatically owned everything I did, making personal projects impossible even if I'd had the emotional energy left after the kinds of high pressure we were basically always under. One of the things that has involved is setting up all the normal process stuff you really want to have in place these days for everything but the simplest work.

Fortunately there are lots of good tools around to start using now; for source control there's Mercurial which I've been happily using for a while along with BitBucket as a hosted repository which allows private storage (and has reasonable prices) and comes with the usual accoutrements like a wiki and bug-tracker. Joel Spolsky's Kiln+FogBugz is also excellent but the pricing bump from the free tier to paid is a little steep by comparison (which isn't a complaint, it's worth it) and I may use that for side stuff too - the main drawback there is that they don't appear to have OpenID yet.

Building code is where things get more interesting. Thanks to some sterling work by Anuradha Dissanayake who set up a fantastic CM system for Ghost, I've long been a fan of continuous build/continuous integration systems and there's some good automation out there. It's amusing how almost all the main CI tools you can get are Java-based, but fortunately here since most of what they are doing is coordination work - polling the version-control system for work and launching scripts - this is not the problem it would normally be. In fact, they are one of the few kinds of thing which really benefit from Java's famously over-sold "run anywhere" because they don't have classic UIs, aren't innately very complex, but can derive a lot of benefit from dynamic class loading for extensions.

On top of that, some of the CI systems are actually pretty good. Hudson and TeamCity both seem up to the job - I'm going with Hudson to start, because TeamCity's supposed great feature is IDE integration, which is actually pretty useless to me, and if I ever need to get into writing Hudson extensions for myself I can. Really, the only thing it doesn't have right out of the box is support for Windows slave instances on Amazon EC2 - it only supports UNIX (well, OpenSolaris or Linux) slaves - but that's hardly a deal-breaker.

However, one interesting thing I've been looking at today is the irritation of managing build numbers; most of the CI systems have something-or-other to try and do for this, and there are all kinds of obscure work-arounds to install .NET extensions to get build numbers into MSBuild, but it's certainly not exactly seamless and it's nice if developer builds in Visual Studio (even plain old Express) can do this too.

Turns out it's actually pretty easy nowadays, thanks to some long overdue improvements in Visual Studio 2010. The old project and solution build system for C++ code was, frankly, hideous, and so part of the solution is that Visual Studio now uses for C++ the same unified build tool as has long been used in the .NET world: MSBuild. And fortunately, MSBuild itself was helped a lot by the .NET 4.0 release that came with VS2010, which filled in a lot of the basic missing features.

Probably the single neatest thing in MSBuild 4.0 is property functions, which allow build scripts to call into arbitrary parts of the .NET framework and get results as strings which can be used to parameterize the build actions, without having to go to the tedium of writing and building .NET assemblies to extend the build system. With these plus the general improvements to the built-in MSBuild "Tasks" elements, a regular vcxproj file (typically an otherwise empty project of "Utility" type) in Visual Studio can contain something like this inside the outermost <Project> element:

<Target Name="Build">
  <PropertyGroup>
    <Name>$(OutDir)\ver_def.h</Name>
    <Major>1</Major>
    <Minor>0</Minor>
    <Now>$([System.DateTime]::UtcNow)</Now>
    <Year>$([System.DateTime]::Parse($(Now)).ToString("yy"))</Year>
    <Day>$([System.DateTime]::Parse($(Now)).DayOfYear.ToString())</Day>
    <Time>$([System.DateTime]::Parse($(Now)).ToString("HHmm"))</Time>
  </PropertyGroup>
  <MakeDir Directories="$(OutDir)" />
  <WriteLinesToFile File="$(Name)" Lines="#define VER_MAJOR $(Major)" Overwrite="True" />
  <WriteLinesToFile File="$(Name)" Lines="#define VER_MINOR $(Minor)" Overwrite="False" />
  <WriteLinesToFile File="$(Name)" Lines="#define VER_BUILD $(Year)$(Day)" Overwrite="False" />
  <WriteLinesToFile File="$(Name)" Lines="#define VER_REV ;  $(Time)" Overwrite="False" />
</Target>


This is just the simplest kind of example, using hours and minutes as the last number part; one of the tricks to watch for is that each component of the total version number is 16-bit and so limited to 65535. That's why I used the day-of-year after the year (so this won't overflow until 2065) and why another popular build number format is to use half the number of seconds elapsed in the day instead of hours or minutes (unfortunately there are slightly too many seconds per day to use it raw, so we need to divide it down a little):

    <Seconds>$([System.DateTime]::Parse($(Now)).TimeOfDay.TotalSeconds)</Seconds>
    <Half>$([System.Convert]::ToInt32($([MSBuild]::Divide($(Seconds),2))))</Half>


In practice the hours-and-minutes time is more than good enough, since if you're doing your builds through an automated system they aren't going to be scheduled more than one per minute anyway, and the result is slightly easier to interpret. You can also choose to use "Now" instead of "UtcNow" in the example above for simplicity if you aren't particularly worried about the effects of timezones.

If you do that, then there are some nice benefits: Hudson has a plug-in which will post build results to a Google Calendar, so it's simplicity itself to correlate any given binary with the build that produced it just by looking it up in a browser. Of course that's not particularly hard with UTC or seconds-per-day either, but if you're a one-man-band it's your choice to make it easier on yourself.

Note that by overriding the "Build" target in the XML inserted into the vcxproj file, we don't get the normal inherited behaviour for the "Build" target (which comes from some of the imported boilerplate) of doing minimal builds, so this action always runs which is what we want. In a minimal vcxproj file you will still want to have some of the regular IDE crud present just to stop the IDE property editors going haywire, and one advantage of that is that you will get properties like $(OutDir) and friends set the right way if you use property sheets to ensure those are consistent across all your projects (as you definitely should).

Another thing to consider is that this is just one way of using the above data; you can also pick it up in other MSBuild projects and tasks in your overall build for e.g. publishing the build data to a version-dependent location and all kinds of other steps, or write it out in other formats for other tools (the above example is just the first thing I did because I happen to work mostly in C++, the language which has only just joined the MSBuild family in Visual Studio).

And now with this in place I can start to get on with some simple coding.

Friday, September 10, 2010

The Magic of Differential Pricing

It's really odd to watch the Amazon store at work for Kindle books, even free ones. As I'd previously noted, there basically aren't any free Kindle books. Except, there are, but just not for me.

An odd quirk of the Amazon storefront is that the first time I load it up on a machine after a couple of days, I get to see Kindle books based on what I last viewed or in various other sections of the page, all with alluring prices - and in one notable case, it did this for an item I remembered the price of - USD$6.95 (now USD$5.95 when I checked it again for this article) for the otherwise freely downloadable Romance of the Three Kingdoms.

Except, this time it was showing me the price as USD$3.95 ... so I clicked on the item, and hey presto! the price suddenly jumps by USD$2.00 - and if I refresh the Amazon landing page, suddenly the prices of half the Kindle books (most notably the "free" ones) change to reflect this.

It's very repeatable, and very frustrating, since it seems to only happen for Kindle books and not for anything else I've ever looked at on the Amazon store; Kindle books are "special" in several ways to the Amazon storefront, partly because of the bizarre regional availability (so the store tries very hard to pretend books I can't buy don't exist), but also because they seem to be the only thing to which Amazon are applying regionally differentiated pricing.

Basically, there seems to be a bug in the storefront which isn't picking up on my geographical location at first visit, so I'm given a tantalizing glimpse of the pricing US customers see, just as I sometimes get a glimpse of Kindle editions I can't buy. However, then it adds the "regional surcharge" to the browser session if I click on an item and thereafter the storefront only shows me the regionally adjusted prices.

The really interesting thing is that when this happens I can use "open in a new tab" to check all kinds of things, and it appears thus far that this hidden regional surcharge doesn't exist for full-priced Kindle e-texts. Presumably, for e-texts over a certain publisher price Amazon are happy just getting their cut; it's only for ones that fall under some mysterious threshold price to which a crude surcharge is applied.

Ultimately, I could live with a small regional surcharge, because I'm aware of what Amazon have sunk into their incredible infrastructure where small regional price differentials do exist. But what really bites is the scale of the discrepancy; it's 4 cents per Gigabyte difference in S3 pricing for AsiaPac, but being charged two dollars for a megabyte is just wrong, given that even the dastardly local telcos don't charge that for 3G data (and even the NZ$1/Mb rate is just to soak the unwary, it's easy to avoid).

So, I've almost finished Accelerando and the Kindle has really proven itself (barring some odd formatting in the book, which may be an artifact of the preparation - 10^23 was rendered without superscripts as 1023, the kind of thing that makes a difference in a book using Big Numbers). It's really a great device, and the notion that in 5-10 years something like it could be almost ubiquitous is pretty exciting, although I imagine the textbook publishers will have a much harder time adapting their business models to e-text than mass-market fiction/non-fiction publishers are managing now (where although it's a bumpy ride, it's at least slightly less disruptive).

Thursday, September 9, 2010

Through the Looking Glass

Or possibly the most bizarre road trip ever: Jeffrey Goldberg goes to Cuba and has a series of astonishing conversations with Fidel Castro (and Che Guevara's daughter makes a cameo appearance).

To pad out the post, The Economist points at OECD data on graduates in unskilled jobs, which as one commenter notes is interesting when compared to this OECD press release on expanding tertiary education. Meanwhile the Schumpeter column worries about the U.S. university system (although I'm in no position to critically evaluate his argument, the decline of science and engineering in western-country universities generally has concerned me for a while, as young people demonstrate by their choices how the status of scientists and engineers has fallen).

Also via The Economist's Free Exchange blog, a thoughtful article from Martin Wolf of the Financial Times on the role of the state in Democratic societies which has an enormous amount of interesting comment arising from it.