Core Concepts
Interpreter versus Compiler
CPU emulation is generally implemented in one of two ways. The simplest and most straightforward approach is to write what is known as a CPU interpreter. An interpreter executes emulated CPU code by analyzing each instruction one at a time, and then performing the equivalent action. Interpreters are written to mirror the way a real CPU works. The logic within a CPU chip will fetch each instruction in turn, determine what the operation is, and then handle the request. Modern CPUs use many tricks to speed up this process, but fundamentally, it is the same for all CPUs.
The primary problem with CPU interpreters is that they are slow. The process of fetching and decoding an instruction can take quite a number of real CPU cycles to complete, on top of the cost of actually performing the operation that is encoded. For running emulated CPUs in the 1-20 MHz range, where the execution of each instruction takes multiple emulated clock cycles, this overhead is generally acceptable. Fortunately, this covers most "classic" CISC CPUs, from the Z80 to the 68000 to the 8086, and so it makes sense to keep the emulation of these CPUs as interpreters.
With the advent of high speed DSPs and RISC processors, however, the overhead of the interpreter becomes too much for even modern processors to handle. The problem is twofold. First, these processors tend to run at higher speeds, ranging from 20 MHz up into the hundreds of MHz. And making the problem worse, each emulated instruction is often pipelined such that it only takes a single emulated cycle per instruction. So you move from needing to execute several hundred thousand instructions per second to needing the execute several tens or hundreds of millions of instructions per second. At this point, the overhead of decoding each instruction prior to execution dominates the time spent emulating the CPU.
In order to address this, a second, more aggressive approach is commonly used: the dynamic recompiler (often abbreviated as dynarec or drc, and synonymous with just-in-time or JIT compilation). A dynamic recompiler at its heart translates opcodes from the emulated CPU architecture to the native CPU architecture on the fly. In general, a good dynamic recompiler takes advantage of two approaches that largely eliminate the overhead of decoding instructions. First, it performs a sort of bulk analysis of code; that is, instead of examining and decoding one instruction at a time, it "looks ahead" and decodes blocks of instructions all at once. And second, once it has decoded the instructions and translated them to the native CPU architecture, it saves the results of its work in a code cache for later reuse. This means that when the same block of code is executed, or if the code is part of a loop, there is no need at all to do any decoding; all the work has already been done, apart from simply executing the instructions.
While this all probably sounds great in theory, there are a few things working against a dynamic recompiler. The first thing is the complexity: dynamic recompilers are inherently more complex than interpreters and more difficult to follow. This is simply because an interpreter is a fairly logical and straightforward piece of code that simulates actual CPU behavior, whereas a dynamic recompiler is performing an on-the-fly translation of code.
A second big problem with dynamic recompilers is self-modifying code. Because the translated code is cached and reused, if some other code comes along and modifies the original code, the dynamic recompiler needs to detect that modification so that it can throw away its cached translation. Otherwise, when the modified code is executed in the future, the dynamic recompiler will not be wise to the changes and will execute its originally cached version instead!
The third major problem with dynamic recompilers is that they are platform-specific. An interpreter is most commonly written in a high level language like C or C++, and can be designed without much effort into being very portable across many computer architectures. In contrast, the entire purpose of the dynamic recompiler is to translate from the emulated CPU architecture to the native CPU architecture, meaning that the dynamic recompiler is by definition tied to that native CPU architecture. This implies a multiplicative effect on the amount of code that needs to be written, since each emulated CPU architecture needs one dynamic recompiler for each target CPU architecture.
The Universal Machine Language
It is exactly this final problem that the Universal Machine Language (abbreviated UML) was designed to solve. The UML defines an abstract CPU architecture that can be targeted by a dynamic recompiler. This abstract computer architecture is similar to a typical modern CPU, in that it defines a set of registers, flags, opcodes, and system behaviors. But of course, since there is no such thing as a UML CPU, generating code that targets this architecture is only half the battle. Once a dynamic recompiler has generated code for the UML, a back-end module then translates the UML code into equivalent instructions in the native CPU architecture.
Thus, the UML effectively decouples the writing of a dynamic recompiler from the writing of a back-end. This removes the multiplier involved in developing multiple dynamic recompilers for multiple target CPU architectures. Once a collection of back-ends exists, one only has to write a single dynamic recompiler targeting the UML, and it will work on all architectures that have back-ends. On the flip side, given a collection of UML-targeted dynamic recompilers, one only has to develop a single back-end translator to support all of them on a new target architecture.
In addition to this benefit, validation becomes easier on all sides. A "canonical" back-end can be developed, written in a high level language, which implements the UML in a way that is easy to verify. This canonical back-end can then be used to compare behaviors against other back-ends; if the other back-ends are written correctly, identical results should be produced for all dynamic recompilers. Similarly, bugs in the system can be easily isolated to the dynamic recompiler or to the back-end by simply switching back-ends and looking for identical behaviors.
The UML, which will be described in detail in future articles, was designed with a few core concepts in mind. First, the opcodes and behaviors are designed with a slight bias toward the writer of the dynamic recompiler, at the expense of the back-end author. This was done because it is expected that there will be fewer back-ends than dynamic recompilers, when all is said and done. Second, the UML describes a primarily 32-bit architecture, with native support for 64-bit operations. This was done because it most closely represents the current state of computer architectures, while allowing for optimization on native 64-bit platforms. And third, the UML opcodes are more CISC-like than RISC-like in their support of parameters. This was a conscious decision that not only makes writing the dynamic recompiler simpler, but also enables reasonable speed in a high-level canonical back-end.