The first step in developing any machine code manipulation program is to
understand the source object file format - what type of information is stored,
how it is stored, and how it can be accessed. Normally we are not concerned
with any of these issues since the operating system (OS) automatically handles
all file access operations and provides users with an interface to these
functions. When a program needs to be executed, the OS first extracts all
relevant file information (usually from the file header), and carries out
any necessary actions before putting it into memory; in other words the OS
simply decodes the object file into an understandable form in memory. This
behaviour is often described as loading of a program.
This thesis is divided into 7 chapters. In this chapter, we introduce the
notation and concepts behind a Retargetable Loader (RL) and the motivation
for developing one. Chapter 2 reviews some of the more recent work in the
areas of binary decompilation and translation. Chapter 3 discusses the design
choices for the RL, in particular the three possible ways to implement one.
Structures and examples for various binary file formats (BFF) are discussed
in chapter 4. The BFF properties and the grammar used in the Simple
Retargetable Loader (SRL) are the main topic of chapter 5. Chapter 6 describes
how we tested the SRL and the results obtained. Lastly, a summary concludes
this thesis.
1.1 Notation used
A binary object file is either an executable file that runs on a particular
machine or a file containing object code that needs to be linked. The object
code (or executable code) is generated by a compiler or by an assembler.
Fig. 1 shows the object code generation process.
An object file program has three environment characteristics (attributes): the machine it runs on, the operating system it runs under, and its binary file format.
Throughout this thesis, when referring to a binary object file, its attributes
are specified with the triplet (Machine, OS, BFF); for example (x86, DOS,
EXE) describes the environment where a binary program runs on the Intel x86
machine under the DOS operating system and has the EXE file format.
1.2 Tools that deal with machine code instructions
Apart from the file accesses performed by the OS, the loader plays an important
role in some of the more complex machine code related tools such as disassemblers,
decompilers, debuggers, binary translators and tracers/profilers. The general
object decoding abstraction of these tools is shown in Fig. 2.
At the top of Fig. 2, a binary object source (M1,OS1,BFF1) is fed into a loader,
where file related information is extracted. The object file is then decoded
according to the target code manipulation tool - binary translator,
disassembler, debugger, decompiler or tracer/profiler (it is always decoded;
what changes is the intermediate representation, which depends on the tool
being built). For example, if the tool is a disassembler, the binary object
source is decoded to produce an assembler program. For a tracer/profiler,
the source object is modified but its environment attributes are unchanged.
Perhaps the most interesting output in Fig. 2 is that of the binary translator;
in this case a new binary object with a completely different environment
(M2,OS2,BFF2) is produced. In the most general case, the source program
undergoes a BFF change BFF1 → BFF2, an operating system change OS1 → OS2 and
a machine platform change M1 → M2. Most binary translators will support all
three changes.
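The triplet notation and the three kinds of change can be sketched in a few lines of Python. This is only an illustration: the `Environment` type and the `changes` helper are hypothetical names, not part of any tool described in this thesis, and the target triplet in the usage lines is an arbitrary example.

```python
from typing import NamedTuple


class Environment(NamedTuple):
    """Hypothetical record for the (Machine, OS, BFF) triplet."""
    machine: str
    os: str
    bff: str


def changes(source: Environment, target: Environment) -> list[str]:
    """Return the names of the attributes that differ between two environments."""
    return [
        field
        for field in Environment._fields
        if getattr(source, field) != getattr(target, field)
    ]


source = Environment("x86", "DOS", "EXE")
target = Environment("x86", "Windows", "NE")  # arbitrary example target
print(changes(source, target))  # ['os', 'bff']
```

A tracer/profiler would correspond to an empty list of changes, while the most general binary translation would report all three attributes.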
1.3 Binary translation and code migration
Binary translation, as the name suggests, translates binary files from one
form to another. Advances in technology lead to newer architectural designs
and hence to better performance. However, programs that run on a newer system
are always scarce at first. It is therefore desirable to have all, or almost
all, existing programs running on the new system, or at least those that are
frequently used. The ability to migrate an existing program depends not only
on the differences between the two system architectures; the availability of
the program source and of source code for all libraries used by the program,
the file structure and the operating system services are also major
factors that must not be overlooked.
The ways to migrate code from one environment (M1,OS1,BFF1) to another (M2,OS2,BFF2) are given in [1]; from fastest to slowest:
Option 1 is ideal, but the availability of source code is a big problem.
Large programs are not self-contained, i.e. they use routines from libraries
which are in object format. Operating system service routines are another
problem when migrating programs. Option 3 has the drawback that the target
machine needs to have a micro-coded hardware layer, and RISC machines often
do not have one. Option 4 is the easiest to implement but is often too
slow. Do we want programs to run slower on a new state-of-the-art machine?
Obviously not, and thus option 2 is the most sensible way to
do code migration.
1.4 The loader
As discussed earlier, the loader can be used in many different machine code
manipulation tools, including binary translators. In any machine code tool,
the loader is the first stage in understanding the BFF structure. As newer
architectures introduce newer environments, and often newer BFFs,
more machine code tools are needed to analyse and manipulate these BFFs.
The rest of this thesis presents the motivation for the design of a
retargetable loader (RL): a generic loader which can be targeted at different
platforms based on an environment description. Traditional loaders
are embedded in the OS kernel and hence understand only one
particular type of BFF. The retargetable loader, in contrast, is designed to
understand most types of BFF regardless of the OS or machine architecture
it is on; in other words it is a generic loader.
1.5 Motivation for a Retargetable loader (RL)
To develop any machine code manipulation tool, understanding the
BFF_s/BFF_t (the source BFF and, if developing a binary translator, the
target BFF) is a key factor in the overall development. The loader
plays an important role in this, as it is the first stage to shed light
on the BFF structure. The loader can be quite
simple, in that it only tells the programmer where file information
is located, but it is this fundamental element that describes the various
sections of the BFF (similar to the table of contents in a book) and hence
provides the basis for decoding machine instructions.
Traditionally, when developing a machine code manipulation tool, we need
to write a decoder for every BFF we want to manipulate. For example, if we
want to write a disassembler for an Intel x86 machine running DOS and using
the EXE binary file format, we will write a loader for (x86, DOS, EXE) and
most probably write it with the disassembler as a single program. If we
then decide to write another disassembler for the Windows New Executable
(NE) BFF, we will need to write another loader for (x86, Windows, NE) and
another disassembler, as the interface to information on the BFF will be
different. So, if we have n different (M, OS, BFF) tuples, we will need to
write n different loaders. This model is shown in Fig. 3.
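To make the traditional model concrete, a dedicated loader for (x86, DOS, EXE) might begin as in the sketch below. It relies on the well-known layout of the DOS EXE (MZ) header, a sequence of 16-bit little-endian words starting with the signature `MZ`. The function name `load_exe_header`, the dictionary keys and the sample bytes are illustrative assumptions, and only the first 14 header words are decoded.

```python
import struct

# First 14 words of the DOS EXE (MZ) header, all 16-bit little-endian.
# The key names below are illustrative, not taken from the thesis.
_MZ_FIELDS = (
    "signature", "bytes_in_last_page", "pages", "relocation_count",
    "header_paragraphs", "min_alloc", "max_alloc", "initial_ss",
    "initial_sp", "checksum", "initial_ip", "initial_cs",
    "relocation_table_offset", "overlay_number",
)
_MZ_FORMAT = "<2s13H"  # 2-byte signature followed by 13 unsigned words


def load_exe_header(data: bytes) -> dict:
    """Decode the start of an MZ header into a dictionary of named fields."""
    if len(data) < struct.calcsize(_MZ_FORMAT):
        raise ValueError("file too short to hold an EXE header")
    header = dict(zip(_MZ_FIELDS, struct.unpack_from(_MZ_FORMAT, data, 0)))
    if header["signature"] != b"MZ":
        raise ValueError("not a DOS EXE file")
    # The header size is stored in 16-byte paragraphs; the load image follows it.
    header["image_offset"] = header["header_paragraphs"] * 16
    return header


# Made-up header values, used only to exercise the function.
sample = struct.pack(_MZ_FORMAT, b"MZ", 0x90, 3, 0, 4, 0, 0xFFFF,
                     0, 0xB8, 0, 0, 0, 0x40, 0)
print(load_exe_header(sample)["image_offset"])  # 64
```

Every constant in this sketch is specific to one BFF, which is why a loader for (x86, Windows, NE) would have to be written again from scratch.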
Hence, for X machine architectures, Y operating systems
and Z BFFs, we will need to write a total of X*Y*Z different
loaders, at least if we want to test on all of those platforms.
The process carried out by all loaders is similar, although the (M, OS, BFF)
tuples are different. Ideally, the n different loaders of the above model are
united to form a single generic one - a retargetable
loader or RL. The new model is shown in Fig. 4.
The input to the RL is the BFF description, which combines the (M,OS,BFF) attributes into a form that can be understood by the RL.
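The idea of a description-driven loader can be illustrated with a minimal sketch. The description format below (a dictionary holding a magic value and a list of field name, offset and format entries) is purely hypothetical and is not the SRL grammar presented in chapter 5; `EXE_DESCRIPTION` and `generic_load` are made-up names, and the sample bytes are invented for the example.

```python
import struct

# Hypothetical BFF description for (x86, DOS, EXE).
# Each field is (name, byte offset, struct format); NOT the SRL grammar.
EXE_DESCRIPTION = {
    "environment": ("x86", "DOS", "EXE"),
    "magic": (0, b"MZ"),  # (offset, expected bytes)
    "fields": [
        ("header_paragraphs", 8, "<H"),
        ("initial_ip", 20, "<H"),
        ("initial_cs", 22, "<H"),
    ],
}


def generic_load(description: dict, data: bytes) -> dict:
    """Extract the fields named in a description; no format-specific code here."""
    offset, magic = description["magic"]
    if data[offset:offset + len(magic)] != magic:
        raise ValueError("file does not match the BFF description")
    result = {"environment": description["environment"]}
    for name, field_offset, fmt in description["fields"]:
        result[name] = struct.unpack_from(fmt, data, field_offset)[0]
    return result


# Made-up 28-byte EXE header used only to exercise the loader.
sample = struct.pack("<2s13H", b"MZ", 0x90, 3, 0, 4, 0, 0xFFFF,
                     0, 0xB8, 0, 0x10, 0x20, 0x40, 0)
info = generic_load(EXE_DESCRIPTION, sample)
print(info["header_paragraphs"], info["initial_ip"], info["initial_cs"])  # 4 16 32
```

Under this assumption, supporting another (M, OS, BFF) tuple means supplying another description rather than writing another loader; real formats need more than fixed offsets, which is one reason a richer description language is required.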