5 BFF grammar and the SRL

Because the number of binary file formats (BFFs) keeps increasing, we need to capture their structure in a consistent and appropriate way. The BFF is a prime consideration when building any application that manipulates machine-code instructions. As the Retargetable Loader (RL) acts as a front end to all machine-code manipulation tools, its ability to understand the BFF is of the utmost importance. As discussed in previous chapters, the ideal method of capturing a BFF is via specifications. The rest of this chapter describes the properties of the BFF specification language devised by the author to create a Simple Retargetable Loader (SRL), a scaled-down version of the RL. It also explains some of the grammar rules and patterns, and gives some examples.

Some definitions before we start:

In programming terms, a specification is like a program written in the language that the grammar defines.

5.1 BFF properties

In a BFF, some parts of the file are interrelated. Although the structure of the binary file never changes, its size and the location and contents of its regions can vary dramatically. Because of this behaviour, the ability of the grammar (and thus the RL) to reference previously defined information is very important. The resulting grammar not only needs to be general; it must also be flexible enough to help the RL re-reference previously parsed information. Information that needs to be re-referenced is usually found in the file header of the binary object file. As the specification for a particular BFF is parsed, any reference to previously read information needs to be handled appropriately. This idea of re-referencing is not found in the BNF grammars of most programming languages.

In most programming languages, the ability to re-use defined information later in the program is very limited. A user-defined type can be referenced later when declaring instances of that type, but the type itself does not have a value. A macro can be referenced throughout the rest of the program after its definition, but its value is fixed and cannot be changed. Each binary file of a given BFF has its own set of records that identify it. The BFF specification captures the structure of these records, but not their contents (types versus variables). The general structure of the BFF is known from the specification; however, each instance of the BFF can only be understood by the RL as it parses the file. The specification defines the items, but their values at run time give meaning to other definitions in the rest of the specification. This is similar to the idea of dynamically typed languages.

The following example clarifies the idea of re-referencing in a BFF specification:

Let us assume we have a "Hello World" program stored in the Windows NE BFF. The segment table of a Windows NE file consists of a fixed number of segment table entries. The exact number of entries is given by one of the fields in the program's file header, NumSegEnt. To create a copy of the segment table for the "Hello World" program in memory, the RL must allocate the number of entries given by NumSegEnt in the program's file header. In the Windows NE specification, the definition of the file header and segment table could be:

FileHeader : STRUCTURE {
  ..
  ..
  NumSegEnt : int;
  ..
}
SegmentTable : ARRAY NumSegEnt OF SegTableEnt;

The value of NumSegEnt is used to specify the size of the array. In traditional languages, the array size must be fixed at compile time. Here, the size of the segment table is only known at run time, once the file header has been read.

Most re-referenced information is located in the file header, but this is not necessarily the case. For example, to locate the segment table, the address at which it can be found must be defined:
SegmentTable : ARRAY NumSegEnt OF SegTableEnt;
ADDRESS NewHoff + SegToff;

The address of the segment table is the offset SegToff, found within the new header, added to NewHoff, the address of the new header, which is located in the old header.
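The re-referencing described above can be sketched in Python. This is only an illustration under stated assumptions: the byte layout below is made up for the example and is not the real Windows NE layout, and the function names read_segment_table and build_sample are hypothetical. The point is that both the array size and the array address come from values read earlier in the same file.

```python
import struct

# Hypothetical layout, for illustration only (NOT the real Windows NE layout):
#   old header at offset 0:             NewHoff   (uint16, little-endian)
#   new header at offset NewHoff:       NumSegEnt (uint16), SegToff (uint16)
#   segment table at NewHoff + SegToff: NumSegEnt entries of four uint16 fields


def read_segment_table(data: bytes) -> list:
    """Read the segment table using values parsed earlier from the headers."""
    (new_hoff,) = struct.unpack_from("<H", data, 0)
    num_seg_ent, seg_toff = struct.unpack_from("<HH", data, new_hoff)
    table_addr = new_hoff + seg_toff            # ADDRESS NewHoff + SegToff
    entry = struct.Struct("<4H")
    return [
        entry.unpack_from(data, table_addr + i * entry.size)
        for i in range(num_seg_ent)             # ARRAY NumSegEnt OF SegTableEnt
    ]


def build_sample() -> bytes:
    """Build a tiny synthetic file that follows the hypothetical layout."""
    old_header = struct.pack("<H", 6) + b"\x00" * 4   # NewHoff = 6, then padding
    new_header = struct.pack("<HH", 2, 4)              # NumSegEnt = 2, SegToff = 4
    table = struct.pack("<4H", 1, 2, 3, 4) + struct.pack("<4H", 5, 6, 7, 8)
    return old_header + new_header + table


segments = read_segment_table(build_sample())
```

Neither the number of entries nor the table address appears as a constant in read_segment_table; both are looked up in the file being parsed, which is what a fixed-size array declaration in a traditional language cannot express.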

5.2 Simple Retargetable Loader (SRL)

The SRL was developed primarily around the DOS EXE, Windows NE and Sparc ELF formats. The DOS EXE format is very simple and limited in structure. The Sparc ELF format is the most complicated, while the Windows NE format is somewhere in between. Nevertheless, the grammar is designed to be a generic BFF grammar and can easily be extended if it is later found insufficient. At this stage, our focus has been mainly on these three BFFs. The differences in complexity between them (DOS EXE: simple; Windows NE: moderate; Sparc ELF: very general) give a good indication of how well the grammar works. Examples of the DOS EXE, Windows NE and Sparc ELF formats in the SRL's BFF grammar can be found in appendices 2a, 2b and 2c.

5.3 BFF grammar for Simple Retargetable Loader (SRL)

This section describes the abstract syntax of the BFF grammar for the Simple Retargetable Loader (SRL). The grammar syntax is in extended BNF (EBNF). EBNF has the following language symbols:

In the grammar, non-terminals appear in italics, terminals appear as normal, "literal strings" appear with double quotes, and examples appear in bold. The start symbol for this grammar is BFFspec.

BFFspec = {spec}.

spec = format-def defin {defin} loading-info.

format-def = "DEFINITION" "FORMAT" ident {ident} "END" "FORMAT".

defin = "DEFINITION" ident "ADDRESS" expression
          scope-def
        "END" ident.

loading-info = "FILEHEADER" ident
               "IMAGESIZE" expression
               "IMAGEADDRESS" expression.

scope-def = ident type-exp {ident type-exp}.

type-exp = "SIZE" expression |
           "ARRAY" expression
             scope-def
           "END" ident.

expression = "(" ident operator expression ")" |
             ident operator expression | .

operator = "+" | "-" | "*" | "/" | "^" | "%".

ident = "a".."z" | "A".."Z" {"a".."z" | "A".."Z" | "_"}.


5.4 Grammar explanations

The body of any BFF specification has the form:

spec = format-def defin {defin} loading-info.

5.4.1 format-def

format-def = "DEFINITION" "FORMAT" ident {ident} "END" "FORMAT".

The format-def rule specifies the overall structure of the BFF. Every ident listed in format-def must be defined later in the specification, although the format need not include all parts of the BFF. For example, an object file may contain areas that are never used or referenced and typically act as space fillers separating different parts of the file; such areas can be left out. An example format-def for a simple BFF is:

DEFINITION FORMAT
file_header
section
END FORMAT

Here, file_header and section will need to be defined later in the specification. If this specification were for a DOS EXE file, the relocation table would go between file_header and section, but if we are not concerned with it, it can be omitted from the definition. The order of the identifiers is not enforced; it merely indicates the relative ordering of the divisions. The above definition does not imply that section starts at the end of file_header; in fact, section could be placed before file_header. The syntax of format-def places no ordering restrictions on its identifiers. For clarity and ease of understanding, however, the user should arrange the definitions so that they reflect the actual file structure.
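The requirement that every ident in format-def is defined later can be checked mechanically. The following Python sketch is not the SRL's implementation; parse_format_def and undefined_idents are hypothetical helper names, and the sketch assumes that tokens are separated by whitespace.

```python
def parse_format_def(text: str) -> list:
    """Return the idents listed between DEFINITION FORMAT and END FORMAT."""
    tokens = text.split()
    if tokens[:2] != ["DEFINITION", "FORMAT"] or tokens[-2:] != ["END", "FORMAT"]:
        raise SyntaxError("expected DEFINITION FORMAT ... END FORMAT")
    idents = tokens[2:-2]
    if not idents:
        # The rule is ident {ident}, so at least one ident is required.
        raise SyntaxError("format-def needs at least one ident")
    return idents


def undefined_idents(declared: list, defined: set) -> list:
    """Idents named in format-def that have no DEFINITION block of their own."""
    return [name for name in declared if name not in defined]


declared = parse_format_def("""
DEFINITION FORMAT
file_header
section
END FORMAT
""")

# Suppose only file_header has a DEFINITION block so far.
missing = undefined_idents(declared, {"file_header"})
```

The list returned by parse_format_def keeps the order in which the idents were written, but nothing in the check depends on that order, in line with the rule placing no ordering restrictions on them.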

5.4.2 defin

Each ident listed in format-def is defined using the defin rule:

defin = "DEFINITION" ident "ADDRESS" expression
          scope-def
        "END" ident.

The ident after the keyword DEFINITION must have been declared previously in the specification. The start location of this new structure (relative to the start of the file) is specified by the expression after the keyword ADDRESS. For example, the definition of file_header might be:

DEFINITION file_header ADDRESS 0
  h_sigLo SIZE 8
  h_sigHi SIZE 8
  h_lastPageSize SIZE 16
  ..
  ..
  ..
END file_header

The above definition indicates that file_header starts at the beginning of the file. All declarations that follow belong to this definition, in this case file_header. h_sigLo, h_sigHi and h_lastPageSize all belong to the same scope level and have a parent named file_header. This concept is equivalent to the definition of a structure type in most programming languages.
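To make the analogy with a structure type concrete, the three fields shown above can be read from a byte string as follows. This is a sketch under assumptions that the text does not state: SIZE is taken to be in bits and a multiple of 8, fields are taken to be laid out consecutively, and values are taken to be little-endian. The helper name read_fields is hypothetical.

```python
def read_fields(data: bytes, base: int, fields: list) -> dict:
    """Read consecutive fields starting at byte offset `base`.

    `fields` is a list of (name, size_in_bits) pairs. Whole-byte sizes and
    little-endian byte order are assumptions made for this sketch.
    """
    values = {}
    offset = base
    for name, bits in fields:
        nbytes = bits // 8
        values[name] = int.from_bytes(data[offset:offset + nbytes], "little")
        offset += nbytes
    return values


# Only the three fields shown in the example definition are modelled here.
file_header = [("h_sigLo", 8), ("h_sigHi", 8), ("h_lastPageSize", 16)]

# A DOS EXE file begins with the signature bytes "MZ"; the page size is made up.
sample = b"MZ" + (0x0190).to_bytes(2, "little")

header = read_fields(sample, 0, file_header)   # ADDRESS 0
```

The resulting dictionary plays the role of the scope named file_header: each field is reachable by name once the definition has been parsed.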

5.4.3 Loading-info

loading-info = "FILEHEADER" ident
               "IMAGESIZE" expression
               "IMAGEADDRESS" expression.

The loading-info rule holds the fundamental information about the object file that is needed for loading to occur. It is crucial that every BFF specification provides its loading information. The three constructs may occur in any order, as long as all three exist in the specification. The FILEHEADER construct identifies the first region of the object file that must be loaded into memory. This region is often the file header, as it contains critical information about the locations of other regions as well as some housekeeping information. The IMAGESIZE construct specifies the size of the load image. The size is often calculated from information obtained in the file header. The IMAGEADDRESS construct specifies the start address (relative to the beginning of the file) from which the image should be loaded.
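How a loader might use the three constructs can be sketched as follows. This is an illustration only, not the SRL's code: the expressions are modelled as Python callables instead of being parsed from a specification, and the field names h_headerBytes and h_imageBytes are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Values = Dict[str, int]


@dataclass
class LoadingInfo:
    fileheader: str                           # FILEHEADER ident
    imagesize: Callable[[Values], int]        # IMAGESIZE expression
    imageaddress: Callable[[Values], int]     # IMAGEADDRESS expression


def load_image(data: bytes, regions: Dict[str, Values], info: LoadingInfo) -> bytes:
    """Extract the load image using values from the already parsed header."""
    header = regions[info.fileheader]
    size = info.imagesize(header)
    start = info.imageaddress(header)         # relative to the start of the file
    if start < 0 or size < 0 or start + size > len(data):
        raise ValueError("load image lies outside the file")
    return data[start:start + size]


info = LoadingInfo(
    fileheader="file_header",
    imagesize=lambda h: h["h_imageBytes"],        # hypothetical header field
    imageaddress=lambda h: h["h_headerBytes"],    # hypothetical header field
)

# Values as they might look after the header region has been parsed.
regions = {"file_header": {"h_headerBytes": 4, "h_imageBytes": 3}}

image = load_image(b"HDR!abcXYZ", regions, info)
```

The header region has to be parsed before either expression can be evaluated, which is why FILEHEADER names the first region to be loaded.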

5.4.4 type-exp

type-exp = "SIZE" expression |
           "ARRAY" expression
             scope-def
           "END" ident.

The type-exp rule defines the type of an identifier. An identifier can be either a single element of a particular size or a group of elements specified by the ARRAY construct. The expression after the keyword ARRAY gives the number of elements in the array. Declarations within the ARRAY definition are bound to the same scope, with the array identifier as their parent. For example, the segment table in the Windows NE format could be defined as:

DEFINITION seg_table ADDRESS (sh_segToff + sho_off)
  seg_table_ent ARRAY sh_segTent
    ste_logSectoff SIZE 16
    ste_size SIZE 16
    ste_flag SIZE 16
    ste_minsize SIZE 16
  END seg_table_ent
END seg_table

In a Windows NE file, the segment table is defined as an array of structures named seg_table_ent. The number of array elements is sh_segTent, so sh_segTent must have been parsed earlier in the specification. ste_logSectoff, ste_size, ste_flag and ste_minsize are the elements within seg_table_ent.
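The ARRAY construct can be interpreted in a table-driven way, with the element count and the address taken from values parsed earlier. The Python sketch below makes several assumptions: SIZE is in bits, values are little-endian, and the numbers in env and in the sample data are made up. The helper name read_array is hypothetical.

```python
def read_array(data: bytes, address: int, count: int, fields: list) -> list:
    """Read `count` consecutive entries, each described by (name, bits) pairs."""
    entry_bytes = sum(bits for _, bits in fields) // 8
    entries = []
    for i in range(count):
        offset = address + i * entry_bytes
        entry = {}
        for name, bits in fields:
            nbytes = bits // 8
            entry[name] = int.from_bytes(data[offset:offset + nbytes], "little")
            offset += nbytes
        entries.append(entry)
    return entries


# Values assumed to have been parsed earlier in the specification (made up).
env = {"sh_segTent": 2, "sh_segToff": 2, "sho_off": 4}

seg_table_ent = [
    ("ste_logSectoff", 16),
    ("ste_size", 16),
    ("ste_flag", 16),
    ("ste_minsize", 16),
]

# Synthetic file: 6 filler bytes, then two 8-byte segment table entries.
data = bytes(6) + b"".join(v.to_bytes(2, "little") for v in range(1, 9))

seg_table = read_array(
    data,
    env["sh_segToff"] + env["sho_off"],   # ADDRESS (sh_segToff + sho_off)
    env["sh_segTent"],                    # ARRAY sh_segTent
    seg_table_ent,
)
```

Each dictionary in seg_table corresponds to one seg_table_ent, so the four ste_ fields share a scope whose parent is the array identifier.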

5.5 Limitation of the SRL grammar

The SRL grammar was designed to be as simple as possible. It merely provides a basic framework grammar for creating elementary loading routines. Most of the SRL's functions deal with information in the file header and apply its definitions to the rest of the object file.

There are a number of areas that the SRL grammar does not include: