Permission Requested. IEEE Transactions of Software Engineering.
Figure courtesy of Bill Caudle. |
This page is dedicated to the memory of Professor Maurice Halstead, the
father of decompilation.
The page contains information of decompilation projects directed by Professor
Halstead during the 1960-1976 period. Thanks to
Bill Caudle for all the
information and anecdotes shared. Bill worked on decompilation
projects directed by Professor Halstead between 1965 and 1976.
Throughout this page, unless otherwise noted, all text in quotes are Bill's words.
Professor Halstead is best known for his work in Software Science, an experimental and theoretical discipline concerned with measuring properties of computer programs. In the 1960s he wrote one of the first books on compiler implementation.
|
|---|
"Maury's background in physics, meteorology and the physical sciences
provided him with insights and interests that ultimately led to
the development of software metrics which he initially called
'Software Thermodynamics.'
"As a boss, Maury could be the epitome of a father figure: he was
supportive, interested, and always made a person feel like the
ideas that he fed to them were their own (at least this was the
case for me). He could also be an aggressive politician and
created enemies as well as friends as a player in the development
of computing capabilities of the U.S. Navy."
"Maury was a true experimental scientist, pioneer and visionary in the
field of computer software. At a time when software developers
were still arguing about the viability of higher level languages,
Maury committed part of the resources of a Navy development project
to create the first self-compiling compiler (Neliac)
which was then used to complete the project on time and within
budget.
At another time, with a team of 30 people working on the Global
Weather conversion project apparently in schedule trouble,
he took the gamble to cut the project team staff in half (rather
than increasing the staff). The project was completed on time
due to releasing the productive members' time to productive
work rather than assisting junior members with their problems.
The following sections describe three decompilation projects for a variety of machines. The aim of these decompiler projects was to aid in the migration of software to newer/successor machines, hence eliminating part of the time consuming task of rewriting code for the new machines. "A primary requirement was to be able to decompile from binary (or more specifically using binary card images as input)."
In 1960, Maury directed a project on decompilation mainly to show the usefulness of the Neliac language as a Universal Computer Oriented Language, as well as a problem-oriented language. Joel Donnelly and Herman Englander implemented the D-Neliac decompiler for the Univac M-460 Countess Computer while on Maury's staff. "Each time Joel would bring an 'impossibility' to Maury, they would sit down and figure a way around it." As Herm noted, "the difficult we do immediately -- the impossible takes a little longer." The D-Neliac decompiler was an operational decompiler that decompiled Univac M-460 binary code into Neliac source code. There was also a CDC 1604 version of D-Neliac.
In the summer of 1962, Bill joined NEL and worked for Maury developing an earth atmosphere simulation program. After returning to NMSU for one more year of graduate studies, Bill returned to NEL in 1963.
"The first version of the IBM 7094 to Neliac Decompiler was from IBM 7094 binary card images to IBM 7094 Neliac. Then, after the Univac 1107-8 compiler was bootstrapped from the IBM 7094 version, the decompiler was modified to translate to Univac 1108 Neliac. Since the resulting programs were running on a different system and instruction set, we did not have the luxury of inserting crutch code (i.e., machine code), even though Neliac allowed it. What we did do, however, was to modify the Neliac compiler, which we then called the Neliac "recompiler", to compile in ways to assist the correct operation of the decompiled code. One of these was the order of evaluation of arithmetic. That was one of the first tasks I accomplished after enhancing the Neliac compiler to include compilation of algebraic expressions and exponentiation. The original compiler allowed only very simple arithmetic. The recompiler used this simpler, and predictable, order of evaluation. This allowed such idioms as the conversion of integer to floating point to work as they had on the IBM 7094 (where the conversion was performed by a load of the integer, ORing the result with a floating point exponent, followed by a floating point add of zero). Of course, maintaining the order of arithmetic is sometimes important, and that was the primary emphasis of the recompiler."
Original team members of the advanced software development group working for Maury on the Neliac compiler and decompiler projects included: Bill Caudle, Lambert Lui and Gordon Pelton on the decompiler; Bob Stelloh and Alan Howard on the compiler. Al Fuhrman developed the Neliac compressor/reformatter. Others were involved in the later stages of the group.
The Lockheed decompiler was used in several demonstrations that included Phillips Petroleum, Ontario Hydro and the US Air Force. "One of the more humorous decompilations I ever did was the decompilation of a Monte Carlo application that included a random number generator. The converted random number generator was required to produce the same sequence of random numbers, and the Monte Carlo application was required to generate exactly the same results on the same inputs."
The Lockheed decompiler was used on a sizable project with Lockheed as the subcontractor to Univac. This contract was won after Bill put on a demonstration to the Air Force in Washington DC. Lambert Lui supported Bill in this effort, and Maury attended the actual demo (to assure our success, ashtrays were placed in the computer room for the Air Force Colonel. The team was warned not to ask the Colonel to extinguish his cigar.) The other vendors in the competition were IBM (using strictly conversion) and CDC (using code simulation).
Starting in the fall of 1967 and completed in about May 1968, the US Air Force Global Weather Center's weather forecasting programs (125 programs written in assembler and FORTRAN) were converted to run on the Univac 1108 at the Offutt AFB in Omaha. Bill recalls having encountered a bug in one of the decompiled programs -- the largest assembler program in the set. This program was 24K lines of code and 6K lines of data (corresponding approximately to a 180K byte program). Using a set of input data picked by the Air Force as part of an acceptance test, the program generated by the decompiler would crash. After exhaustive manual checking of the decompiled code, a parallel run of the original program on the IBM 7094 was requested. The result? The parallel run showed that the original program running against the same test data would crash too -- the bug was in the original program, and was faithfully reproduced in the decompiled program.
"We were very successful in decompiling programs that had resulted from FORTRAN programs. These would run with little manual change. The IBM 7094 to Neliac Decompiler recognized FORTRAN calling sequences and replaced them with a more compact version. For pure assembler programs, the decompiler was estimated to remove 80-90% of the effort over manual conversion. At that time, we estimated the cost of conversion to be less than U$1.00/instruction decompiled. This was substantially less than the estimate at that time for manual conversion of assembler programs."
When the primary investors in TAI failed to produce sustaining capital for the TAI, it was abandoned. In 1970, the company staff and university associates established another company called Combinatorics (Also in Lafayette, IN). In 1973, Combinatorics was bought by Mathematica, Inc., of Princeton, NJ.
Throughout this period, Maury supervised a PhD thesis on decompilation at Purdue University, that of Barry Housel (awarded 1973). At the same time, this was the period in which Maury began his formal investigation of the complexity of programs, resulting in what are now known as the "Halstead Metrics".
The Sperry*Univac Inverse Compiler was made up of the following modules: input processing, basic block generation, local data usage analysis, control and data flow analysis, data mapping, reporting, intermediate form preparation, level transformations, and target language production. The model for decompilation is distinguished from prior models by the extensive use of analysis techniques in an attempt to derive data structures. The latter was in recognition of the fact that data conversion is of primary importance. Additional details on the analysis techniques are given in the previously unpublished paper On the Inverse of Compiling by Bill Caudle, 1980.
The minimum input to the inverse compiler was the binary code in the absolute or relocatable load modules. Additional input in the form of the associated assembly and map listings for the program (produced by the ASM or SPURT assemblers) were optional; these listings assisted in the conversion process. The reporting module flagged missing libraries and parts of the program that would be hard to decompile. Using the input processor, the user could provide additional information to completely define the instruction blocks, data blocks, and subroutine definitions. Any information that the decompiler obtained by analysis could be modified or enhanced by the conversion specialist. However, no changes were allowed to be made to the binary programs, which remained unmodified. The enhanced input files were then input to the decompiler which produced COBOL code. The optimizer was optionally used to improve the readability of the generated COBOL code. The 1100 series sales memo from 1976 announces the 494 to 1100 inverse compiler.
The following figures were reported in internal documents:
Subsequent to the 2 year contract, Bill went to work with Sperry*Univac in Roseville, MN, and continued on the decompilation project for another 2 years. About 10 people worked on the decompilation project in Roseville, with as many at times in Salt Lake City, UT. Work on decompilation in Roseville was in the area of loop analysis, interval analysis, global data analysis, and dictionary techniques. The Salt Lake City group provided input processing, conversion to internal forms, level transformations, and conversion to target language. Effectively, this separated the project into that related to decompiler techniques and that more directly related to compiler techniques.
In 1975, there was a version of the decompiler that generated PLUS (a PL/1-like language for systems programming) code. Univac did a benchmark test on decompilation in Zurich using the PLUS version. This was the first conversion done outside their own environment. Bill led the team in Zurich that translated 8 programs of 5600 lines of assembly code. These programs were picked from the most difficult type of programs to be converted and resulted in many improvements in the process. Part of the results of this field test were the recommendation to use COBOL as the target language.
As a result of this field effort, in late 1975/76 a version of the inverse compiler was developed that generated COBOL code. Both PLUS and COBOL target languages were generated from the same internal representation used in the decompiler.