Thursday, May 19, 2011

CPU Emulation for Improved Debugging

I've been meaning to describe this for a while but never got around to it; since it's somewhat related to Fabrice Bellard's great Javascript hack, seeing that has finally motivated me to write up the technique.

After Symantec shut down development of Ghost Solution Suite in early 2009 (closing our offices and laying off all but myself and another senior developer), my job changed rather rapidly. At the time I had been developing a Javascript interpreter for embedding into the Ghost product release due later that year, something I had finally gotten management approval to do in mid-2008 some 10 years after deciding that I wanted to build the system to support scripting in 1998.

Now, despite not having an actual script parser, the system was very much set up to support dynamic languages, as I'd been a Lisp and Smalltalk fan for many years. Although I'd read about Smalltalk occasionally since the early 80's, I actually first used it seriously via the excellent Actor programming language in the Windows 3.1 era - Actor had a more Pascal-inspired surface syntax, but its semantics were pure Smalltalk, and it had a lot of influence; the excellent GUI class library it had was licensed by Borland, and its design, along with some of the Smalltalk concepts such as reflection, was very successfully incorporated into Delphi - designed by Anders Hejlsberg, who later created the even more Smalltalk-like .NET environment. Oddly, Actor ended up owned by Symantec - I did spend some time trying to locate the source while I was there, but as with the QEMM and DESQview/X source, which I went looking for after the Quarterdeck acquisition, it was basically impossible to find.

So, when I set out in 1998 to make the Ghost Enterprise (as it was called then) management components, I decided a few simple things: it would be in C++, it would be garbage-collected, network transport would work by binary marshalling of objects and collections, and the core library of management objects would use a Smalltalk-like class library with heterogeneous collections. C++ routines working with specific types would then use dynamic type inquiry, written in the style of the Eiffel assignment-attempt operator using C++ operator overloading.
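As an illustration only, an assignment-attempt style wrapper might look roughly like the sketch below. The names Object, Text, Number, Attempt and sumNumbers are all made up for this example, and it leans on the standard dynamic_cast, which may well not be the type-inquiry mechanism the real library used.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical root of a dynamically typed class library.
struct Object {
    virtual ~Object() = default;
};

struct Text : Object {
    std::string value;
    explicit Text(std::string v) : value(std::move(v)) {}
};

struct Number : Object {
    int value;
    explicit Number(int v) : value(v) {}
};

// Attempt<T> mimics Eiffel's `x ?= y`: assigning any Object* leaves either
// a usable T* or null, depending on the object's run-time type.
template <typename T>
class Attempt {
public:
    Attempt() = default;
    Attempt(Object* source) : ptr_(dynamic_cast<T*>(source)) {}

    Attempt& operator=(Object* source) {
        ptr_ = dynamic_cast<T*>(source);
        return *this;
    }

    // True when the attempted assignment succeeded.
    explicit operator bool() const { return ptr_ != nullptr; }

    T* operator->() const {
        assert(ptr_ != nullptr);
        return ptr_;
    }

private:
    T* ptr_ = nullptr;
};

// Example consumer: adds up the Numbers in a heterogeneous collection,
// silently skipping everything else.
inline int sumNumbers(const std::vector<Object*>& items) {
    int total = 0;
    for (Object* item : items) {
        if (Attempt<Number> n = item) total += n->value;
    }
    return total;
}
```

The appeal of this style is that the type test and the narrowing happen in one assignment, so code walking a mixed collection never touches a pointer of the wrong type.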

All fairly simple and straightforward, but it's fair to say that the resulting mix of styles wasn't exactly something that the other folks on the team found easy to work with, and another problem was that the debugging experience in Visual C++ and later Visual Studio wasn't entirely stellar. I was entirely used to this kind of thing, so it didn't bother me - aside from my years of assembly language and embedded work, when working on Tandem CLX/R mainframes, the only C++ compiler was a port of Cfront 2.0 done with Lattice C, which meant that debugging was done on the Cfront-translated source, so I probably didn't appreciate that it should have been easier to debug.

Anyway, in 2009 since development had been cancelled midway through the project to deliver the next version, my job became one of stabilising what we had done so that if necessary it could be released and taking over more maintenance (previously I was maintaining most of the management framework, but with only two of us left behind to look after the entire suite of 1.5M LOC that expanded quite a bit). A few months later, once a small team in India was put together, we then started training them up on how things worked, at which point the rather esoteric style of coding became more of a problem, especially because of the poor debugging experience in Visual Studio.

Now, what was particularly awkward about the C++ source debugging in Visual Studio (and this applied for every version from 2002 through to 2008) was that although it mostly worked well in templated code where the full type of an object was statically available, it would struggle mightily with inspecting an object in a context where the static type was of a base class (or COM interface). On occasion it would manage to tell you what the real type of the pointed-to object was, but in general it couldn't manage to - and the Ghost management platform was written in a coding style which meant almost everything pointed to a generic base, so you'd be left with a soup of raw pointers that you'd have to manually cast to something to make sense of.

This resulted in a plea from the newly-formed maintenance team to do something to help them manage. But what could I do? There was, in the Visual Studio debugger, an extensible system for writing custom data inspectors, but it was entirely driven by matching things by static type - the exact piece of information that Visual Studio wasn't correctly working out for itself. And while it's possible to extend the .NET parts of Visual Studio in interesting ways including writing custom visualizers and expression evaluators, the C++ native code debugging didn't appear to have those extension points; the only solution seemed to be a plug-in system created for Visual C++ 6.0 which later releases had kept as a legacy system.

So, I had an extension interface which I could use to generate a string, and all the objects in my run-time knew how to print themselves (either in JSON or an earlier more Lisp-like native text syntax), but the extension system had only one way to look at the debug target: a callback function which would read bytes of memory from the target.

At this point, I had the inspiration. Well, if my program's objects could print themselves, and the ONLY thing I could do was read bytes of memory, then the only possible way I could go forward was to write an x86 (or x64, since all the management platform code and JS interpreter also ran 64-bit) software VM and put that in as the debug plug-in; the plug-in could then set up in the emulated stack a frame which would call the print routine in the target object to print to a supplied buffer, and then extract it from the VM and pass it back to Visual Studio.

How hard could it be?

As it turned out, not hard at all, although there were a few things that did surprise me while putting it all together over the course of about three weeks. I went with a simple, straightforward emulator rather than a translator - although I would have liked to have a crack at dynamic translation (and still would, when I have the time) this particular problem didn't really warrant it.

Writing an x86 and/or x64 (mine actually supported both) emulator really isn't that hard - aside from knowing the architectures reasonably well, being able to decode x86 code has enough practical uses that I'd actually done it some years before, to write an API interception library similar in spirit to the Microsoft Research Detours project. My interceptor would disassemble the interception target so that it could relocate the target code to a side thunk, which branched back into the original location; this made room to insert a branch to the thunk. This method allowed the hook thunk to work safely even in Windows 95 where the kernel code was system-global, as the thunk I'd patch in would compare the current process/thread ID so it'd only activate in the right process context.

Since I was emulating user-mode code, I probably didn't need to emulate the x86 paging unit, but I did a rough emulation of it anyway since that also seemed to be the sanest way to use the memory-read callback which I had to use to get the debug target state. As instructions requested memory, I would consult the emulated MMU and if the page wasn't already present I'd populate it using the memory read callback, in effect "faulting in" the context from the debug target as I needed it.
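The faulting-in idea can be sketched in a few lines. This is a simplified illustration rather than the original emulator: ReadTargetFn is a hypothetical stand-in for the debugger's memory-read callback, LazyMemory is an invented name, and the sketch assumes flat 4KB pages with no protection bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the debugger's callback: copy `size` bytes from
// target address `addr` into `out`, returning false if they are unreadable.
using ReadTargetFn =
    std::function<bool(std::uint64_t addr, void* out, std::size_t size)>;

class LazyMemory {
public:
    static constexpr std::uint64_t kPageSize = 4096;

    explicit LazyMemory(ReadTargetFn reader) : reader_(std::move(reader)) {
        assert(reader_ && "a read callback is required");
    }

    bool read8(std::uint64_t addr, std::uint8_t& value) {
        Page* page = pageFor(addr);
        if (!page) return false;
        value = (*page)[addr % kPageSize];
        return true;
    }

    // Writes only touch the emulator's private copy of the page; nothing is
    // ever written back to the debug target.
    bool write8(std::uint64_t addr, std::uint8_t value) {
        Page* page = pageFor(addr);
        if (!page) return false;
        (*page)[addr % kPageSize] = value;
        return true;
    }

    std::size_t residentPages() const { return pages_.size(); }

private:
    using Page = std::vector<std::uint8_t>;

    // Returns the cached page containing `addr`, pulling it from the target
    // through the callback the first time it is touched.
    Page* pageFor(std::uint64_t addr) {
        const std::uint64_t base = addr - (addr % kPageSize);
        auto it = pages_.find(base);
        if (it == pages_.end()) {
            Page fresh(kPageSize);
            if (!reader_(base, fresh.data(), fresh.size())) return nullptr;
            it = pages_.emplace(base, std::move(fresh)).first;
        }
        return &it->second;
    }

    ReadTargetFn reader_;
    std::unordered_map<std::uint64_t, Page> pages_;
};
```

Because writes land only in the private copy, the emulated call can scribble on its stack and heap freely without disturbing the real target, which matters when the "target" is a crash dump.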

What did surprise me was how much of the x86 instruction set I needed to support, including not just regular 8087 floating-point (not too hard to emulate using plain C++ floating-point math, except for quirks like rounding modes) but also a number of SSE2 instructions which turned out to be conditionally used in the Visual C++ run-time library as optimizations. Although the debug environment potentially could have avoided using those because they were guarded by CPU feature tests, the debug target process had typically already run those tests and set up global state recording that the instructions existed. So I had to emulate them, which meant including a suitable register file organization in the emulator for the xmm register set.

In addition, as the printing code I was calling in the target was pretty general-purpose, it would rely on things like iostreams, which meant memory allocation, and entering Win32 APIs to handle critical sections and the like, so although I didn't need to do any hardware emulation I did need to provide a fairly complete emulation of the Win32 process environment including manufacturing a Thread Environment Block and Process Environment Block structure as part of creating the initial VM for the emulated call to the print routine.

Possibly the most entertaining part of the whole thing was devising a scheme for representing the condition code computations; fortunately I ran across this rather clever and reasonably clear article which demonstrates some neat boolean hackery to recover the carry-out and overflow status values for an operation even though they aren't directly available with C++ arithmetic. While lazily computing the condition codes is conceptually simple enough, I didn't actually bother with it at first; once I stumbled across the above article, though, it was too tempting not to make the CC emulation lazy.
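To give a flavour of the kind of boolean hackery involved, here is a minimal sketch for 32-bit ADD only. It uses one well-known formulation of carry-out and signed overflow, which is not necessarily the one from the article, and the names LazyAddFlags and crossCheck are invented for the example. The "lazy" part is simply that the operands and result are recorded, and each flag is derived only if something asks for it.

```cpp
#include <cassert>
#include <cstdint>

// Remembers the most recent ADD so that flags are computed on demand.
struct LazyAddFlags {
    std::uint32_t lhs = 0;
    std::uint32_t rhs = 0;
    std::uint32_t result = 0;

    std::uint32_t add(std::uint32_t a, std::uint32_t b) {
        lhs = a;
        rhs = b;
        result = a + b;  // unsigned arithmetic wraps modulo 2^32
        return result;
    }

    // Carry out of each bit is (a & b) | ((a | b) & ~result); the carry
    // flag is the carry out of the top bit.
    bool carry() const {
        return (((lhs & rhs) | ((lhs | rhs) & ~result)) >> 31) != 0;
    }

    // Signed overflow: both operands differ in sign from the result.
    bool overflow() const {
        return (((lhs ^ result) & (rhs ^ result)) >> 31) != 0;
    }

    bool zero() const { return result == 0; }
    bool sign() const { return (result >> 31) != 0; }
};

// Cross-checks the bit tricks against 64-bit arithmetic; handy while
// testing. Assumes the usual two's complement conversion to int32_t.
inline void crossCheck(std::uint32_t a, std::uint32_t b) {
    LazyAddFlags f;
    f.add(a, b);
    const std::uint64_t wide =
        static_cast<std::uint64_t>(a) + static_cast<std::uint64_t>(b);
    assert(f.carry() == ((wide >> 32) != 0));
    const std::int64_t swide =
        static_cast<std::int64_t>(static_cast<std::int32_t>(a)) +
        static_cast<std::int64_t>(static_cast<std::int32_t>(b));
    assert(f.overflow() == (swide < INT32_MIN || swide > INT32_MAX));
}
```

Since most arithmetic instructions overwrite the flags before anything reads them, deferring the work this way skips the majority of flag computations.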

Finally, after making the above work with a direct call to the print routine, I couldn't resist one final improvement even though I knew it would probably not ship; the root of my dynamic class library was an IUnknown, so rather than directly call the print routine the plug-in DLL instead first performed an emulated virtual function call on QueryInterface for a debug-helper interface IID, and if that was supported then emulated those COM calls to get the print representations on things portably, so the plug-in system could be adapted to work with all kinds of stuff and not just my personal GC'd class library.
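The first step of such an emulated virtual call is just two pointer reads through the memory callback: the object's first word gives the vtable, and the vtable slot gives the function to start executing. The sketch below shows only that lookup, under the usual assumption that the vtable pointer sits at offset zero of a COM object; ReadTargetFn, readPointer and virtualTarget are hypothetical names, and building the stack frame and arguments is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the debugger's memory-read callback.
using ReadTargetFn =
    std::function<bool(std::uint64_t addr, void* out, std::size_t size)>;

// IUnknown's three vtable slots, in their standard COM order.
enum IUnknownSlot { kQueryInterface = 0, kAddRef = 1, kRelease = 2 };

// Reads a little-endian pointer of ptrSize bytes (4 for x86, 8 for x64).
inline bool readPointer(const ReadTargetFn& read, std::uint64_t addr,
                        std::size_t ptrSize, std::uint64_t& value) {
    assert(ptrSize == 4 || ptrSize == 8);
    std::uint8_t raw[8] = {};
    if (!read(addr, raw, ptrSize)) return false;
    value = 0;
    for (std::size_t i = 0; i < ptrSize; ++i)
        value |= static_cast<std::uint64_t>(raw[i]) << (8 * i);
    return true;
}

// Finds the code address a virtual call on `object` would jump to:
// object -> vtable pointer -> function pointer in the requested slot.
inline bool virtualTarget(const ReadTargetFn& read, std::uint64_t object,
                          std::size_t slot, std::size_t ptrSize,
                          std::uint64_t& target) {
    std::uint64_t vtable = 0;
    if (!readPointer(read, object, ptrSize, vtable)) return false;
    return readPointer(read, vtable + slot * ptrSize, ptrSize, target);
}
```

With the target address in hand, the emulator would point its instruction pointer there with the object, the IID and an out-pointer arranged as arguments, and run until the call returns.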

This is what made the whole thing worthwhile to me, in that this turned my little DLL into a general extension system that almost any C++ program could use to create more useful printed representations of themselves under debugging.

The end result worked astonishingly well, especially since one final quirk of Visual Studio was that the plug-in DLL containing the VM code got loaded and unloaded on every single attempt to resolve an expression; if you hit a breakpoint and the debug window wanted to show the values of, say, 16 local variables which had dynamic object type, the VM system would be loaded, run, and then completely torn down 16 times. Upon discovering that, it was tempting to add an extra shim DLL to unload the VM only on a timer, but in actual fact things ran fast enough that I decided not to bother.

Despite being a relatively naive direct interpreter, this turned out to perform more than adequately in practice; the emulator DLL was fully statically linked and occupied under 150kb, and even though it was probably about 50 times slower than native code it was fast enough at printing items and even quite complex nested hash tables and such that tapping F8 to single-step through code under debug was still quite snappy.

What was a particular bonus with this approach compared to other debug systems was that it also worked with post-mortem dumps supplied by customers, so that we could be sent a minidump of any code which failed in the field and this debug assistant would be able to untangle the representation of all the objects by just running it.

Compare that to the co-operative debugging model used in .NET, where processes running under a debugger have additional debugger support code injected into them which the normal debugger will then RPC to in order to obtain rich debug state and get objects marshalled out. That kind of cooperative debugging does create a great experience, but it's not so great for post-mortem work. Using an emulator to partially run the debug target is a great way to enhance the post-mortem debug experience, and indeed with a bit of additional elbow grease the VM could probably even support running the cooperative-debugging extensions in the emulated process.

Having done all this, I had hoped that wouldn't be the end of it - I had been hoping to expand my Javascript with a JIT-to-x86 once I got it fully working as a direct interpreter, and since the emulator was so tiny - about 50kb of compiled code - I could include it with the run-time so it could assist with developing and debugging the JIT by running the built code in a little VM. Unfortunately, as with all the code I wrote at Symantec I lost access to this when I was laid off, and until I get motivated to write another language run-time (or I decide to go back to my 1980's roots writing embedded software development tools) I probably won't have a good excuse to repeat the exercise anytime soon. However, seeing Fabrice Bellard's awesome Javascript hack in action is a good reminder that this kind of CPU emulation is really handy, and it's actually not as difficult to do as you probably think.

In fact, it's really quite an entertaining kind of project to set for yourself, and I recommend it as an exercise - you may not end up running Linux in a browser, but it's not only fun, in my case at least it was quite practical.