3 Designing a Retargetable Loader (RL)

3.1 Design Choices

Being retargetable also mean being portable. Not only that loader will need to understand different Binary File Formats (BFFs), we also want it to run on different machine platforms. For portability reasons, the RL needs to be implemented using a suitable programming language. Assembly languages are often machine dependent, complex and hard to debug, thus it is not suitable for the task. Functional and logic languages are also unsuitable due to its restricted availability in different environments. High level languages like C, C++ and Java are quite common amongst various machine platforms and are ideal for implementing the RL.

To provide loading of a binary source, 3 basic approach can be used:

  1. hand craft the code,
  2. use library routines to assist writing the code, or
  3. use specifications for generation of code.

The first approach is the easiest and quickest although sometimes time-consuming to test. This simplicity advantage is only good for creating loaders that are limited to the knowledge of one BFF. As was described in Fig.2, you will need to hand craft n different loaders for n different (M, OS, BFF) tuples.

Approaches 2 and 3 can be very complex to create at first (but not to use), and can provide mechanisms to support the building of an RL. Both approaches are difficult to produce and time-consuming, however, once this time investment has been made, the production of other loaders are quick and simple. Even when the RL is finished, the user has the overhead of learning how to use the library or the specification language. Although slow at first, the advantage of a RL is remarkable and could save time and effort in the long run (the approach via specification is at least as good as the hand crafted approach even at an intermediate level).

3.2 Developing an RL via library routines



Fig. 5 highlights the fundamental components of an RL created via library routines. The (M, OS, BFF) triplet is described in the format defined by the library; internal binary representation. Any program needing access to the internal binary representation must access it through the interface between library and program. A set of interface routines are used by the interface to retrieve information about the BFF. For each new environment to be described, the user must present this environment in the structure that is specified by the library and provide the set of interface access routines.

The interface between library and program is often fixed and provide a standard for accessing the internal representation. A typical interface will be a set of prototype routine definitions (like the ones in the .h header file of a C program) offered to the program for BFF access. Although the interface is the same for all BFFs and has one single uniform internal representation, the Interface routines can be difficult to code and takes a considerable amount of effort to implement. Another problem that must be faced during design of the library is how to represent all BFFs. Obviously, the structure to store the BFF needs to be as general as possible for it to be able to capture all BFFs. To do this, one would need to examine almost all the BFFs that exists and derive a generic format for them. As a result, the internal binary representation structure in the library will be very large and tedious to construct. Limitation on the interface routines can cause other problems. For example, if an environment happens to have some wired section that is not defined in the internal binary representation, then the program has not way of accessing this section. To overcome these problems, the interface either needs to be very general or the library itself can allow additional structures and interface routines to be added onto of the existing ones.

Due to its complexity, one would not attempt to write from scratch, thus we will look at an existing package that will offer such abilities.


3.2.1 Binary File Descriptor Library (BFD)

GNU's Binary File Descriptor Library (BFD) [19] is a package containing common routines that applications can use regardless of its underlaying file format. The BFD Library is divided into two separate pieces for each BFF: the front end, and the back end. The front end interfaces between the user and BFD (Interface between library and program in fig. 4), the back end provides a set of calls which the BFD front end can use to process and manage the object file (Interface routines). To support a new BFF, the programmer creates the new BFD back end and adds it to the library. The internal binary representation used in BFD is called the BFD canonical object file format.

When an object file (M, OS, BFF) is opened, the BFD routines (from the front end) automatically determines the format of the input object file. A descriptor is build in memory with information about which routines are to be used to access elements of the object file's data structure. When the program wants information about the object file, the BFD reads from different sections of the file and process them. Each BFD back end will have routines to convert sections representations of the object file to BFD's internal canonical object file format.

For example, when a program wants to access an object file's symbol table information, there will be a routine in the BFD back end for that BFF object file that will convert the concrete representation of the object file to the canonical format. So when the program asks for symbol information, this BFD back end routine is invoked and any operations will then be carried on the canonical representation. When the program is finished and wants to write out the new information, another BFD back end routine is called to convert the new canonical representation to the chosen BFF output format.

A BFD canonical format has the following information about an object file:

Information stored

files Machine architecture.
Implementation format types.
Demand pageable bit.
Write protected bit.
sections Name.
Address in the object file.
Size.
Alignment information.
Various flags.
Pointers into BFD data structure.
symbols Definition in object file.
Name.
Value.
Various flags.
relocation level Pointer to the symbol to relocate to.
Offset of the data to relocate.
Section of where the data belongs.
Pointer to the relocation type descriptor.
line numbers Information for debugging purposes.

One difficulty in using the BFD library is its complexity. The library itself is very large, the number of functions offered in the front end are exceptionally many. The BFD front end was designed in mind to allow the programmer to be able to retrieve all type of information about any BFF, at least the existing ones. The BFD library can be integrated with disassemblers, decompilers, debuggers, etc. Due to this generality and hence its bulkiness, it is difficult to use it without spending a great deal of time learning how to use it. Perhaps because it is too general, it often contains information that is not needed for some applications.


3.3 Developing an RL via specifications

The idea of using specification is not a new one. Parser and compiler generators have been around for a while and have proven to be very useful. The input to these generators is often a specification, usually in the form of a context free grammar found commonly in specifying the syntax of programming languages. The idea can also be apply to binary objects where each of the BFFs can be specified in a grammar that can be understood by the RL. The resulting object file specification contains information about the structure of the object file and how various sections can be accessed. Each specification acts as a model frame for all object files in the group, ie. a template for all objects belonging to the environment (M, OS, BFF). In programming language terms, the BFF template is the variable type while each of the object files are the instances of this type.



In Fig. 6, an object file (M1, OS1, BFF1) has a specification template according to the syntax of the BFF grammar. The RL uses the template information as the basis for processing. The combination of this template along with an object file belonging to this group is then decoded by the RL.

The nature of the BFF grammar is slightly different to most of the language grammars. The main difference in the BFF grammar is its need to reference previously defined information. Information that is parsed can be very important to the rest of the contents. A good example of this is the address location of a section in the file. The object file can hold the file address of a data section in a field at the beginning of the file header. When having parsed over the header, this information must be kept for later use to locate the beginning of the data section. If another field indicates the size of the section, then this field is store as well so the parser/loader is able to identify where the section ends and possibly the start of the next section. Another example is the way some object files stored strings in the strings table. A string has two parts, a number and the actual string; usually not null terminated. The number is used to specify the length of the string that follows.

Because of the vast number of different BFFs, it is difficult to develop a suitable BFF grammar that can provide a complete frame model. Nevertheless, our goal is to develop such a grammar. There has been an attempt to create a grammar for the DWG (used for Autocad drawing) [22] file format and because of this, the grammar is biased towards that format. The next section describes this grammar.


3.3.1 DWG-based BFF grammar

This BFF grammar was developed and tested for the DWG format [22]. As the DWG is used as the basis for the requirement of the grammar, the result is biased. In this version of the BFF grammar, terminal symbols consists of a number of bytes as the fundamental set of base types found in most programming languages; char, int, long, float and double. A binary object file is view as a stream of bytes. A special note about this grammar is the ability to handle different byte ordering for integers, and different formats for floating point number for various machines. For example, the definition of a word representation in a little-endian machine is given by:

type word :=
	byte : first,
	byte : second
	return ((word)first | ((word)second << 8).

In a little-endian machine, the lower order byte comes before the higher order byte. In extended BNF form, this grammar has the following basic types [3]:

data_types :=
"char" | "byte" | "short" | "word" | "long" | "longword" |
"float" | "double".

The grammar rule for defining types is:

type_def_rule :=
	"typedef" data_type
	basic_type_name ":="
	("byte" ":" byte_name) LIST
	"return" expr ".".

Using the above rule, the definition of the DWG longp type is as follows:

typedef longword longp :=
	byte : b1, byte : b2, byte : b3, byte : b4
	return (longword)b1 | ((longword)b2 << 8)
	       | ((longword)b3 << 16) | ((longword)b4 << 24). 

A complete specification for the DWG (release 12) can be found in [22].