Because the number of binary file formats (BFFs) keeps increasing, we need to capture their structure in a consistent and appropriate way. The BFF is a prime consideration when building applications that manipulate machine code instructions. As the Retargetable Loader (RL) acts as a front end to all machine-code manipulation tools, its ability to understand the BFF is of the utmost importance. As discussed in previous chapters, the ideal method of capturing a BFF is via specifications. The rest of this chapter describes the properties of the BFF specification language derived by the author to create a Simple Retargetable Loader (SRL), a scaled-down version of the RL. It also explains some of the grammar rules and patterns, and gives some examples.
Some definitions before we start:
In programming terminology, a specification is like a program written in the grammar language.
5.1 BFF properties
In a BFF, some parts of the file are interrelated. Although the structure of a binary file format never changes, the file size, the location of regions and the contents can vary dramatically from file to file. Because of this behaviour, the ability of the grammar (and thus the RL) to reference previously defined information is very important. The resulting grammar not only needs to be general, it must also be flexible enough to assist the RL in re-referencing previously parsed information. Information that needs to be re-referenced is usually found in the file header of the binary object file. As the specification for a particular BFF is parsed, any reference to previously read information needs to be handled appropriately. This idea of re-referencing is not found in the BNF grammars of most programming languages.
In most programming languages, the ability to re-use defined information later in the program is very limited. User-defined types can be referenced later when declaring instances of that type, but a type does not have a value. Macros can be referenced throughout the rest of the program after their definition, but their values are fixed and cannot be changed. Each binary file of the same BFF has its own set of records that identifies it. The BFF specification captures the structure of these records, but not their contents (types versus variables). The general structure of the BFF is known through the specification; however, each instance of the BFF can only be understood by the RL during its parsing process. The specification defines the items, but their values at run time give meaning to other definitions in the rest of the specification. This is similar to the idea of dynamically typed languages.
The following example clarifies the idea of re-referencing in a BFF specification:
Let us assume we have a "Hello World" program stored in a Windows NE BFF. The segment table for the Windows NE BFF consists of a fixed number of segment table entries. The exact number of entries is listed in one of the fields of the program's file header, NumSegEnt. To create a copy of the segment table for the "Hello World" program in memory, the RL must allocate the number of entries given by NumSegEnt in the "Hello World" program's file header. In the Windows NE specification, the definition of the file header and segment table could be:
FileHeader : STRUCTURE {
..
..
NumSegEnt : int;
..
}
SegmentTable : ARRAY NumSegEnt OF SegTableEnt;
The value of NumSegEnt is used to specify the size of the array. In many traditional languages, the array size must be fixed at compile time. Here, the size of the segment table is only known at run time.
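The run-time sizing described above can be sketched in Python. This is only an illustration, not the SRL implementation: the function name, the field widths, the byte order and the synthetic file image are all assumptions made for the example.

```python
import struct

def read_segment_table(data, count_offset, table_offset):
    """Read NumSegEnt from the header, then read that many table entries."""
    # The entry count is only known once the header bytes have been read.
    (num_seg_ent,) = struct.unpack_from("<H", data, count_offset)
    # Hypothetical entry layout: four little-endian 16-bit fields.
    entry = struct.Struct("<4H")
    return [entry.unpack_from(data, table_offset + i * entry.size)
            for i in range(num_seg_ent)]

# Synthetic image: a 4-byte header whose second word is NumSegEnt = 2,
# followed directly by two 8-byte entries.
image = (struct.pack("<HH", 0xBEEF, 2)
         + struct.pack("<4H", 1, 2, 3, 4)
         + struct.pack("<4H", 5, 6, 7, 8))

table = read_segment_table(image, count_offset=2, table_offset=4)
print(table)  # [(1, 2, 3, 4), (5, 6, 7, 8)]
```

The length of the resulting list is decided by a value stored in the file itself, which is the behaviour the ARRAY NumSegEnt declaration expresses.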
Most re-referenced information is located in the file header, but this is not necessarily the case. For example, to locate the segment table, the address where it can be found must be defined:
SegmentTable : ARRAY NumSegEnt OF SegTableEnt; ADDRESS NewHoff + SegToff;
The address is the sum of two values: SegToff, which is found within the new header, and NewHoff, the address of this new header, which is located in the old header.
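The chain of look-ups behind ADDRESS NewHoff + SegToff can be sketched as follows. The positions of the two fields inside their headers, their widths and the byte order are hypothetical values chosen for the example, and the function name is not part of the SRL.

```python
import struct

def segment_table_address(data, newhoff_pos, segtoff_pos):
    """Resolve the segment table address from two previously parsed fields."""
    # Step 1: NewHoff is stored in the old header and locates the new header.
    (new_hoff,) = struct.unpack_from("<I", data, newhoff_pos)
    # Step 2: SegToff is stored inside the new header (position is relative
    # to the start of that header).
    (seg_toff,) = struct.unpack_from("<H", data, new_hoff + segtoff_pos)
    # Step 3: the expression from the specification, NewHoff + SegToff.
    return new_hoff + seg_toff

# Synthetic image: an 8-byte "old header" holding NewHoff = 8 at position 4,
# then a 4-byte "new header" holding SegToff = 4 at position 2.
image = struct.pack("<II", 0, 8) + struct.pack("<HH", 0, 4)

address = segment_table_address(image, newhoff_pos=4, segtoff_pos=2)
print(address)  # 12
```

Neither value is available until the corresponding header has been read, so the address can only be resolved during parsing of a particular file.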
5.2 Simple Retargetable Loader (SRL)
The SRL was developed primarily against the DOS EXE, Windows NE and Sparc ELF formats. The DOS EXE format is very simple and limited in structure, the Sparc ELF format is the most complicated, and the Windows NE format is somewhere in between. Nevertheless, the grammar is designed to be a generic BFF grammar and can be easily extended if it is later found to be insufficient. At this stage, our focus has been mainly on these three BFFs. The differences in complexity between them (DOS EXE: simple, Windows NE: moderate, Sparc ELF: very general) give a good indication of how well the grammar works. Examples of the DOS EXE, Windows NE and Sparc ELF formats in the SRL's BFF grammar can be found in appendices 2a, 2b and 2c.
5.3 BFF grammar for Simple Retargetable Loader (SRL)
This section describes the abstract syntax of the BFF grammar for the Simple Retargetable Loader (SRL). The grammar syntax is in extended BNF (EBNF). EBNF has the following language symbols:
In the grammar, non-terminals appear in italics, terminals appear as normal, "literal strings" appear with double quotes, and examples appear in bold. The start symbol for this grammar is BFFspec.
BFFspec
    {spec}.
spec
    format-def defin {defin} loading-info.
format-def
    "DEFINITION" "FORMAT" ident {ident} "END" "FORMAT".
defin
    "DEFINITION" ident "ADDRESS" expression
    scope-def
    "END" ident.
loading-info
    "FILEHEADER" ident
    "IMAGESIZE" expression
    "IMAGEADDRESS" expression.
scope-def
    ident type-exp {ident type-exp}.
type-exp
    "SIZE" expression |
    "ARRAY" expression
    scope-def
    "END" ident.
expression
    "(" ident operator expression ")" |
    ident operator expression | .
operator
    "+" | "-" | "*" | "/" | "^" | "%".
ident
    "a".."z" | "A".."Z" {"a".."z" | "A".."Z" | "_"}
5.4 Grammar explanations
The body for any BFF specification is of the form:
spec
    format-def defin {defin} loading-info.
5.4.1 format-def
format-def
    "DEFINITION" "FORMAT" ident {ident} "END" "FORMAT".
The format-def specifies the overall structure of the BFF. Every ident listed in format-def must be defined later in the specification, although the format need not include all parts of a BFF. For example, a particular object file may contain areas that are never used or referenced and that typically act as space fillers separating different parts of the file. An example format-def for a simple BFF is:
DEFINITION FORMAT
file_header
section
END FORMAT
Here, file_header and section will need to be defined later in the specification. If this specification is for a DOS EXE file, then the relocation table would go between file_header and section; but if we are not concerned with it, it can be omitted from the definition. The order of the identifiers is not enforced; it merely indicates the relative ordering of the divisions. The above definition does not imply that section starts at the end of file_header; in fact, section could be placed before file_header. The syntax of format-def does not put any ordering restrictions on its identifiers. For clarity, however, the user should arrange the definitions so that they reflect the actual file structure.
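The requirement that every identifier in format-def be defined later can be checked mechanically. The following Python sketch assumes whitespace-separated tokens and uses hypothetical helper names; it is not taken from the SRL.

```python
def parse_format_def(text):
    """Return the identifiers listed between DEFINITION FORMAT and END FORMAT."""
    tokens = text.split()
    if tokens[:2] != ["DEFINITION", "FORMAT"] or tokens[-2:] != ["END", "FORMAT"]:
        raise ValueError("not a format-def")
    idents = tokens[2:-2]
    if not idents:
        # The rule requires at least one ident.
        raise ValueError("format-def needs at least one ident")
    return idents

def undefined_divisions(format_idents, defined_idents):
    """Return the format-def identifiers that have no matching DEFINITION."""
    defined = set(defined_idents)
    return [name for name in format_idents if name not in defined]

spec = """
DEFINITION FORMAT
file_header
section
END FORMAT
"""

divisions = parse_format_def(spec)
print(divisions)                                        # ['file_header', 'section']
print(undefined_divisions(divisions, ["file_header"]))  # ['section']
# The order of the later definitions does not matter for this check.
print(undefined_divisions(divisions, ["section", "file_header"]))  # []
```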
5.4.2 defin
Each ident that appears in format-def is defined using the defin rule:
defin
    "DEFINITION" ident "ADDRESS" expression
    scope-def
    "END" ident.
The ident after the keyword DEFINITION must have been declared previously in the specification (in format-def). The start location of this new structure (relative to the start of the file) is specified by the expression after the keyword ADDRESS. For example, the definition of file_header might be:
DEFINITION file_header ADDRESS 0
h_sigLo SIZE 8
h_sigHi SIZE 8
h_lastPageSize SIZE 16
..
..
..
END file_header
The above definition indicates that file_header starts at the beginning of the file. All declarations that follow belong to this definition, in this case file_header. h_sigLo, h_sigHi and h_lastPageSize all belong to the same scope level and have a parent named file_header. This concept is equivalent to the definition of a structure type in most programming languages.
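To illustrate the analogy with a structure type, the following Python sketch turns the three declared fields into a layout table. It rests on two assumptions that the text does not state explicitly: that SIZE is given in bits, and that fields are packed back to back in declaration order starting at the ADDRESS of the definition. The function name is hypothetical.

```python
def layout(fields, base_address=0):
    """Map each field name to (byte_offset, byte_size).

    Assumes SIZE values are in bits, are whole bytes, and that fields are
    packed back to back starting at base_address (a byte address).
    """
    offsets = {}
    position = base_address
    for name, size_bits in fields:
        if size_bits % 8:
            raise ValueError(f"{name}: size is not a whole number of bytes")
        size_bytes = size_bits // 8
        offsets[name] = (position, size_bytes)
        position += size_bytes
    return offsets

# The declarations from DEFINITION file_header ADDRESS 0 (elided fields omitted).
file_header = [("h_sigLo", 8), ("h_sigHi", 8), ("h_lastPageSize", 16)]

header_layout = layout(file_header, base_address=0)
print(header_layout)
# {'h_sigLo': (0, 1), 'h_sigHi': (1, 1), 'h_lastPageSize': (2, 2)}
```

All three entries live in one table owned by file_header, which mirrors the shared scope level described above.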
5.4.3 loading-info
loading-info
    "FILEHEADER" ident
    "IMAGESIZE" expression
    "IMAGEADDRESS" expression.
The loading-info rule holds the fundamental information about the object file that is needed for loading to occur. It is crucial for any BFF specification to provide this loading information. The three constructs may occur in any order, as long as all three exist in the specification. The FILEHEADER construct identifies the first region of the object file that must be loaded into memory. This region is often the file header, as it contains critical information about the locations of other regions as well as some housekeeping information. The IMAGESIZE construct specifies the size of the load image. The size is often calculated from information obtained in the file header. The IMAGEADDRESS construct specifies the start address (relative to the beginning of the file) where the image should be loaded.
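The rule that the three constructs may appear in any order, provided all are present, can be sketched in Python as follows. The tokenisation on whitespace, the function name and the identifiers used in the sample input (h_numPages, h_pageSize, h_hdrSize) are hypothetical and serve only as an illustration.

```python
KEYWORDS = ("FILEHEADER", "IMAGESIZE", "IMAGEADDRESS")

def parse_loading_info(text):
    """Collect the tokens following each loading-info keyword, in any order."""
    info = {}
    current = None
    for token in text.split():
        if token in KEYWORDS:
            if token in info:
                raise ValueError(f"duplicate {token}")
            current = token
            info[current] = []
        elif current is None:
            raise ValueError(f"unexpected token {token!r}")
        else:
            info[current].append(token)
    missing = [keyword for keyword in KEYWORDS if keyword not in info]
    if missing:
        raise ValueError("missing: " + ", ".join(missing))
    # Keep each expression as text; evaluating it is a separate step.
    return {keyword: " ".join(tokens) for keyword, tokens in info.items()}

sample = """
IMAGESIZE h_numPages * h_pageSize
FILEHEADER file_header
IMAGEADDRESS h_hdrSize
"""

loading_info = parse_loading_info(sample)
print(loading_info["FILEHEADER"])  # file_header
print(loading_info["IMAGESIZE"])   # h_numPages * h_pageSize
```

A specification that omits any of the three constructs is rejected with a ValueError rather than loaded with a default.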
5.4.4 type-exp
type-exp
    "SIZE" expression |
    "ARRAY" expression
    scope-def
    "END" ident.
The type-exp rule defines the type of an identifier. An identifier can be either a single element of a particular size or a group specified by the ARRAY construct. The expression after the keyword ARRAY gives the number of elements in the array. Declarations within the ARRAY definition are bound to the same scope, with the array identifier as their parent. For example, the definition of the segment table in the Windows NE format is:
DEFINITION seg_table ADDRESS (sh_segToff + sho_off)
seg_table_ent ARRAY sh_segTent
ste_logSectoff SIZE 16
ste_size SIZE 16
ste_flag SIZE 16
ste_minsize SIZE 16
END seg_table_ent
END seg_table
In the Windows NE file, the segment table is defined to be an array of structures named seg_table_ent. The number of array elements is sh_segTent, so sh_segTent must have been parsed earlier in the specification. ste_logSectoff, ste_size, ste_flag and ste_minsize are the elements within seg_table_ent.
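Expressions such as (sh_segToff + sho_off) can only be evaluated against values that were parsed earlier. The Python sketch below follows the right-recursive shape of the expression rule and looks identifiers up in a table of previously parsed fields. Several points are assumptions: integer literals and a lone identifier are accepted (the examples use ADDRESS 0, although the rule as printed lists neither), "/" is treated as integer division, and "^" is left out because its meaning is not stated. The field values are invented for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.floordiv, "%": operator.mod}

def evaluate(text, env):
    """Evaluate an expression using previously parsed values in env."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    value, pos = _expr(tokens, 0, env)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return value

def _operand(token, env):
    # Assumption: integer literals are allowed alongside identifiers.
    if token.isdigit():
        return int(token)
    if token not in env:
        raise KeyError(f"{token} has not been parsed yet")
    return env[token]

def _expr(tokens, pos, env):
    if pos >= len(tokens):
        raise ValueError("unexpected end of expression")
    if tokens[pos] == "(":
        value, pos = _expr(tokens, pos + 1, env)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')'")
        return value, pos + 1
    left = _operand(tokens[pos], env)
    pos += 1
    if pos < len(tokens) and tokens[pos] in OPS:
        op = OPS[tokens[pos]]
        # Right recursion, as in: ident operator expression.
        right, pos = _expr(tokens, pos + 1, env)
        return op(left, right), pos
    return left, pos

# Hypothetical values read from the headers earlier in the parse.
parsed = {"sh_segToff": 0x40, "sho_off": 0x80, "sh_segTent": 3}

print(evaluate("(sh_segToff + sho_off)", parsed))  # 192
print(evaluate("sh_segTent", parsed))              # 3
```

One consequence of reading the rule literally is that operators group to the right, so "10 - 3 - 2" evaluates as 10 - (3 - 2); whether the SRL behaves this way is not stated in the text.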
5.5 Limitation of the SRL grammar
The SRL grammar was designed to be as simple as possible. It merely provides a basic framework grammar for creating elementary loading routines. Most of the SRL's functions deal with information in the file header and apply its definitions to the rest of the object file.
There are a number of areas that the SRL grammar does not include: