3.1 Design Choices
Being retargetable also mean being portable. Not only that loader will need
to understand different Binary File Formats (BFFs), we also want it to run
on different machine platforms. For portability reasons, the RL needs to
be implemented using a suitable programming language. Assembly languages
are often machine dependent, complex and hard to debug, thus it is not suitable
for the task. Functional and logic languages are also unsuitable due to its
restricted availability in different environments. High level languages like
C, C++ and Java are quite common amongst various machine platforms and are
ideal for implementing the RL.
To provide loading of a binary source, 3 basic approach can be used:
The first approach is the easiest and quickest although sometimes time-consuming
to test. This simplicity advantage is only good for creating loaders that
are limited to the knowledge of one BFF. As was described in Fig.2, you will
need to hand craft n different loaders for n different (M,
OS, BFF) tuples.
Approaches 2 and 3 can be very complex to create at first (but not to use),
and can provide mechanisms to support the building of an RL. Both approaches
are difficult to produce and time-consuming, however, once this time investment
has been made, the production of other loaders are quick and simple. Even
when the RL is finished, the user has the overhead of learning how to use
the library or the specification language. Although slow at first, the advantage
of a RL is remarkable and could save time and effort in the long run (the
approach via specification is at least as good as the hand crafted approach
even at an intermediate level).
3.2 Developing an RL via library routines
Fig. 5 highlights the fundamental components of an RL created via library
routines. The (M, OS, BFF) triplet is described in the format defined by
the library; internal binary representation. Any program needing access to
the internal binary representation must access it through the interface between
library and program. A set of interface routines are used by the interface
to retrieve information about the BFF. For each new environment to be described,
the user must present this environment in the structure that is specified
by the library and provide the set of interface access routines.
The interface between library and program is often fixed and provide a standard
for accessing the internal representation. A typical interface will be a
set of prototype routine definitions (like the ones in the .h header file
of a C program) offered to the program for BFF access. Although the interface
is the same for all BFFs and has one single uniform internal representation,
the Interface routines can be difficult to code and takes a considerable
amount of effort to implement. Another problem that must be faced during
design of the library is how to represent all BFFs. Obviously, the structure
to store the BFF needs to be as general as possible for it to be able to
capture all BFFs. To do this, one would need to examine almost all the BFFs
that exists and derive a generic format for them. As a result, the internal
binary representation structure in the library will be very large and tedious
to construct. Limitation on the interface routines can cause other problems.
For example, if an environment happens to have some wired section that is
not defined in the internal binary representation, then the program has not
way of accessing this section. To overcome these problems, the interface
either needs to be very general or the library itself can allow additional
structures and interface routines to be added onto of the existing ones.
Due to its complexity, one would not attempt to write from scratch, thus
we will look at an existing package that will offer such abilities.
3.2.1 Binary File Descriptor Library (BFD)
GNU's Binary File Descriptor Library (BFD) [19] is a package containing common
routines that applications can use regardless of its underlaying file format.
The BFD Library is divided into two separate pieces for each BFF: the front
end, and the back end. The front end interfaces between the user and BFD
(Interface between library and program in fig. 4), the back end provides
a set of calls which the BFD front end can use to process and manage the
object file (Interface routines). To support a new BFF, the programmer
creates the new BFD back end and adds it to the library. The internal
binary representation used in BFD is called the BFD canonical object
file format.
When an object file (M, OS, BFF) is opened, the BFD routines (from the front
end) automatically determines the format of the input object file. A descriptor
is build in memory with information about which routines are to be used to
access elements of the object file's data structure. When the program wants
information about the object file, the BFD reads from different sections
of the file and process them. Each BFD back end will have routines to convert
sections representations of the object file to BFD's internal canonical object
file format.
For example, when a program wants to access an object file's symbol table
information, there will be a routine in the BFD back end for that BFF object
file that will convert the concrete representation of the object file to
the canonical format. So when the program asks for symbol information, this
BFD back end routine is invoked and any operations will then be carried on
the canonical representation. When the program is finished and wants to write
out the new information, another BFD back end routine is called to convert
the new canonical representation to the chosen BFF output format.
A BFD canonical format has the following information about an object file:
|
Information stored |
|
|---|---|
| files |
Machine architecture. Implementation format types. Demand pageable bit. Write protected bit. |
| sections |
Name. Address in the object file. Size. Alignment information. Various flags. Pointers into BFD data structure. |
| symbols |
Definition in object file. Name. Value. Various flags. |
| relocation level |
Pointer to the symbol to relocate to. Offset of the data to relocate. Section of where the data belongs. Pointer to the relocation type descriptor. |
| line numbers | Information for debugging purposes. |
One difficulty in using the BFD library is its complexity. The library itself
is very large, the number of functions offered in the front end are exceptionally
many. The BFD front end was designed in mind to allow the programmer to be
able to retrieve all type of information about any BFF, at least the existing
ones. The BFD library can be integrated with disassemblers, decompilers,
debuggers, etc. Due to this generality and hence its bulkiness, it is difficult
to use it without spending a great deal of time learning how to use it. Perhaps
because it is too general, it often contains information that is not needed
for some applications.
3.3 Developing an RL via specifications
The idea of using specification is not a new one. Parser and compiler generators
have been around for a while and have proven to be very useful. The input
to these generators is often a specification, usually in the form of a context
free grammar found commonly in specifying the syntax of programming languages.
The idea can also be apply to binary objects where each of the BFFs can be
specified in a grammar that can be understood by the RL. The resulting object
file specification contains information about the structure of the object
file and how various sections can be accessed. Each specification acts as
a model frame for all object files in the group, ie. a template for all objects
belonging to the environment (M, OS, BFF). In programming language terms,
the BFF template is the variable type while each of the object files are
the instances of this type.
In Fig. 6, an object file (M1, OS1, BFF1)
has a specification template according to the syntax of the BFF grammar.
The RL uses the template information as the basis for processing. The combination
of this template along with an object file belonging to this group is then
decoded by the RL.
The nature of the BFF grammar is slightly different to most of the language
grammars. The main difference in the BFF grammar is its need to reference
previously defined information. Information that is parsed can be very important
to the rest of the contents. A good example of this is the address location
of a section in the file. The object file can hold the file address of a
data section in a field at the beginning of the file header. When having
parsed over the header, this information must be kept for later use to locate
the beginning of the data section. If another field indicates the size of
the section, then this field is store as well so the parser/loader is able
to identify where the section ends and possibly the start of the next section.
Another example is the way some object files stored strings in the strings
table. A string has two parts, a number and the actual string; usually not
null terminated. The number is used to specify the length of the string that
follows.
Because of the vast number of different BFFs, it is difficult to develop
a suitable BFF grammar that can provide a complete frame model. Nevertheless,
our goal is to develop such a grammar. There has been an attempt to create
a grammar for the DWG (used for Autocad drawing) [22] file format and because
of this, the grammar is biased towards that format. The next section describes
this grammar.
3.3.1 DWG-based BFF grammar
This BFF grammar was developed and tested for the DWG format [22]. As the
DWG is used as the basis for the requirement of the grammar, the result is
biased. In this version of the BFF grammar, terminal symbols consists of
a number of bytes as the fundamental set of base types found in most programming
languages; char, int, long, float and double. A binary object file is view
as a stream of bytes. A special note about this grammar is the ability to
handle different byte ordering for integers, and different formats for floating
point number for various machines. For example, the definition of a word
representation in a little-endian machine is given by:
type word := byte : first, byte : second return ((word)first | ((word)second << 8).
In a little-endian machine, the lower order byte comes before the higher
order byte. In extended BNF form, this grammar has the following basic types
[3]:
data_types :=
"char" | "byte" | "short" | "word" | "long" | "longword" |
"float" | "double".
The grammar rule for defining types is:
type_def_rule :=
"typedef" data_type
basic_type_name ":="
("byte" ":" byte_name) LIST
"return" expr ".".
Using the above rule, the definition of the DWG longp type is as follows:
typedef longword longp := byte : b1, byte : b2, byte : b3, byte : b4 return (longword)b1 | ((longword)b2 << 8) | ((longword)b3 << 16) | ((longword)b4 << 24).
A complete specification for the DWG (release 12) can be found in [22].