1 Introduction

The first step in developing any machine code manipulation program is to understand the source object file format - what type of information is stored, how it is stored, and how it can be accessed. Normally we are not concerned with any of these issues, since the operating system (OS) automatically handles all file access operations and provides users with an interface to these functions. When a program needs to be executed, the OS first extracts all relevant file information (usually from the file header) and carries out any necessary actions before putting the program into memory; in other words, the OS simply decodes the object file into an understandable form in memory. This behaviour is often described as loading a program.

This thesis is divided into 7 chapters. In this chapter, we introduce the notation and concepts behind a Retargetable Loader (RL) and the motivation for developing one. Chapter 2 reviews some of the more recent work in the areas of binary decompilation and translation. Chapter 3 discusses the design choices for the RL, namely the three possible ways to implement one. Chapter 4 discusses the structure of various binary file formats (BFFs), with examples. The BFF properties and the grammar used in the Simple Retargetable Loader (SRL) are the main topic of chapter 5. Chapter 6 describes how we tested the SRL and the results obtained. Lastly, a summary concludes this thesis.


1.1 Notation used

A binary object file is either an executable file that runs on a particular machine or a file containing object code that needs to be linked. The object code (or executable code) is generated by a compiler or by an assembler. Fig. 1 shows the object code generation process.



A binary object file has the following environment characteristics (attributes):

  1. the machine architecture that it runs on,
  2. its operating system,
  3. and the Binary File Format (BFF) - its structure.

Throughout this thesis, when referring to a binary object file, its attributes are specified with the triplet (Machine, OS, BFF); for example (x86, DOS, EXE) describes the environment where a binary program runs on the Intel x86 machine under the DOS operating system and has the EXE file format.
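
As an illustration only, the triplet notation can be modelled as a small record type. The sketch below is an assumption made for this example and is not part of the thesis: the names Environment and changed_attributes are hypothetical. The two sample values are the (x86, DOS, EXE) and (x86, Windows, NE) environments mentioned in this chapter.

```python
from typing import NamedTuple


class Environment(NamedTuple):
    """The (Machine, OS, BFF) triplet. The class name is hypothetical."""
    machine: str
    os: str
    bff: str

    def __str__(self) -> str:
        # Print in the same form as the notation used in the text.
        return f"({self.machine}, {self.os}, {self.bff})"


def changed_attributes(source: Environment, target: Environment) -> list:
    """Return the names of the attributes that differ between two environments."""
    return [name for name, a, b in zip(source._fields, source, target) if a != b]


dos_exe = Environment("x86", "DOS", "EXE")
win_ne = Environment("x86", "Windows", "NE")

print(dos_exe)                              # (x86, DOS, EXE)
print(changed_attributes(dos_exe, win_ne))  # ['os', 'bff']
```

Comparing two triplets attribute by attribute makes it explicit which parts of the environment a tool has to deal with when it moves from one binary object file to another.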



1.2 Tools that deal with machine code instructions

Apart from file accesses used by the OS, the loader plays an important role in some of the more complex machine code related tools such as disassemblers, decompilers, debuggers, binary translators and tracers/profilers. The general object decoding abstraction of these tools is shown in Fig. 2.

At the top of Fig. 2, a binary object source (M1,OS1,BFF1) is fed into a loader, where file related information is extracted. The object file is then decoded according to the target code manipulation tool - binary translator, disassembler, debugger, decompiler or tracer/profiler (the file is always decoded; what changes with the tool being built is the intermediate representation used). For example, if the tool is a disassembler, the binary object source is decoded to produce an assembler program. For a tracer/profiler, the source object is modified but its environment attributes are unchanged.




Perhaps the most interesting output in Fig. 2 is that of the binary translator; in this case a new binary object with a completely different environment (M2,OS2,BFF2) is produced. In the most general case, the source program undergoes a BFF change from BFF1 to BFF2, an operating system change from OS1 to OS2 and a machine platform change from M1 to M2. Most binary translators will support all three changes.


1.3 Binary translation and code migration

Binary translation, as the name suggests, translates binary files from one form to another. Advances in technology lead to newer architectural designs and hence to better performance. Programs that run on the newer systems, however, are always scarce at first, so it is desirable to have all, or almost all, existing programs running on the new system (at least those that are frequently used). The ability to migrate an existing program depends not only on the differences between the two system architectures; the availability of the program source and of the source code for all libraries included by the program, the file structure, and the operating system services are also major factors that must not be overlooked.

The ways to migrate code from one environment (M1,OS1,BFF1) to another (M2,OS2,BFF2) are indicated in [1]; from fastest to slowest:

  1. Re-compilation of source using native compilers.
  2. Binary translation of object binaries.
  3. Micro-coded emulation of old machine's instructions.
  4. Software emulation or interpretation.

Option 1 is ideal, but the availability of source code is a big problem. Large programs are not self-contained, i.e. they use routines from libraries which are in object format. Operating system service routines are another problem when migrating programs. Option 3 has the drawback that the target machine needs to have a micro-coded hardware layer, which RISC machines often do not have. Option 4 is the easiest to implement but is often too slow. Do we want programs to run slower on a new state-of-the-art machine? Obviously not, and thus option 2 is the most sensible way to do code migration.



1.4 The loader

As discussed earlier, the loader can be used in many different machine code manipulation tools, including binary translators. In any machine code tool, the loader is the first stage in understanding the BFF structure. As newer architectures introduce newer environments, and often newer BFFs, more machine code tools are needed to analyse and manipulate these BFFs.

The rest of this thesis presents the motivation for the design of a retargetable loader (RL): a generic loader which can be targeted at different platforms based on an environment description. Traditional loaders are embedded in the OS kernel and hence understand only one particular type of BFF. The retargetable loader, by contrast, is designed to understand most types of BFF regardless of the OS or machine architecture it runs on; in other words, it is a generic loader.


1.5 Motivation for a Retargetable loader (RL)

To develop any machine code manipulation tool, understanding the source BFF (BFFs) and, if developing a binary translator, the target BFF (BFFt) is a key factor in the overall development. The loader plays an important role here, as it is the very first stage of development to shed light on the BFF structure. The loader can be quite simple, in that it only tells the programmer where file information is located, but it is this fundamental element that describes the various sections of the BFF (similar to the table of contents in a book) and hence provides the basis for decoding machine instructions.

Traditionally, when developing a machine code manipulation tool, we need to write a decoder for every BFF we want to manipulate. For example, if we want to write a disassembler for an Intel x86 machine running DOS and using the EXE binary file format, we will write a loader for (x86, DOS, EXE) and most probably write it together with the disassembler as a single program. If we then decide to write another disassembler for the Windows New Executable (NE) BFF, we will need to write another loader for (x86, Windows, NE) and another disassembler, as the interface to the information in the BFF will be different. So, if we have n different (M, OS, BFF) tuples, we will need to write n different loaders. This model is shown in Fig. 3.




Hence, for X machine architectures, Y operating systems and Z BFFs, we will need to write a total of X*Y*Z different loaders, at least if we want to test on all those different platforms.
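
The X*Y*Z count can be checked with a small enumeration. The sketch below is only an illustration: apart from x86, DOS, Windows, EXE and NE, which appear in this chapter, the entries ("M2", "OS3", "BFF3") are placeholders invented for the example, and the function name environments is hypothetical.

```python
from itertools import product


def environments(machines, operating_systems, bffs):
    """Enumerate every (M, OS, BFF) combination as a tuple."""
    return list(product(machines, operating_systems, bffs))


# Placeholder inputs: X = 2 machines, Y = 3 operating systems, Z = 3 BFFs.
machines = ["x86", "M2"]
operating_systems = ["DOS", "Windows", "OS3"]
bffs = ["EXE", "NE", "BFF3"]

# One hand-written loader per combination in the traditional model.
traditional_loaders = len(environments(machines, operating_systems, bffs))

print(traditional_loaders)  # 18, i.e. 2 * 3 * 3
```

The count grows multiplicatively: adding one more machine architecture to the lists above adds Y*Z = 9 further combinations.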

The processes carried out by all loaders are similar, although the (M, OS, BFF) tuples are different. Ideally, the n different loaders of the above model are united to form a single generic one - a retargetable loader or RL. The new model is shown in Fig. 4.



The input to the RL is the BFF description, which combines the (M,OS,BFF) attributes into a form that can be understood by the RL.
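
To make the idea concrete, the following is a minimal sketch of a description-driven loader. It is not the SRL itself and does not use the grammar presented later in this thesis; it assumes a much simpler table-based description. The names HeaderField, BFFDescription and load, and the two toy file formats TOYA and TOYB with their machine and OS labels, are all hypothetical and invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderField:
    name: str     # what the field means, e.g. "code_offset"
    offset: int   # byte position of the field in the file
    size: int     # width of the field in bytes


@dataclass(frozen=True)
class BFFDescription:
    """A toy stand-in for the BFF description given to the RL."""
    machine: str
    os: str
    bff: str
    byte_order: str   # "little" or "big"
    magic: bytes      # signature expected at the start of the file
    fields: tuple     # tuple of HeaderField


def load(description: BFFDescription, image: bytes) -> dict:
    """Generic loader: decode header fields as directed by the description."""
    if not image.startswith(description.magic):
        raise ValueError(f"not a {description.bff} file")
    header = {}
    for field in description.fields:
        end = field.offset + field.size
        if end > len(image):
            raise ValueError(f"file too short for field {field.name!r}")
        header[field.name] = int.from_bytes(image[field.offset:end],
                                            description.byte_order)
    return header


# Two invented formats; only the descriptions differ, not the loader code.
TOY_A = BFFDescription("M1", "OS1", "TOYA", "little", b"TA",
                       (HeaderField("code_offset", 2, 2),
                        HeaderField("code_size", 4, 2)))
TOY_B = BFFDescription("M2", "OS2", "TOYB", "big", b"TOYB",
                       (HeaderField("code_offset", 4, 4),
                        HeaderField("code_size", 8, 4)))

image_a = b"TA" + (6).to_bytes(2, "little") + (3).to_bytes(2, "little") + b"\x90\x90\x90"
image_b = b"TOYB" + (12).to_bytes(4, "big") + (2).to_bytes(4, "big") + b"\x01\x02"

print(load(TOY_A, image_a))  # {'code_offset': 6, 'code_size': 3}
print(load(TOY_B, image_b))  # {'code_offset': 12, 'code_size': 2}
```

In this sketch, supporting a further (M, OS, BFF) tuple means writing another description value rather than another loader, which is the property the RL model aims for.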