
Relogix: Converting assembler to high quality C





Simon Marsden, MicroAPL Ltd

22nd September 2010

Introduction
MicroAPL's Relogix translator is designed to convert programs written in IBM mainframe
assembly language to C.
Although there have been other attempts at doing this, some of them are fairly simplistic.
What distinguishes Relogix is that it produces high quality C code. Our aim is to produce C
code of a standard that's close to what a human programmer would write - readable, easy
to understand and easy to maintain.
To achieve this it's necessary to think carefully about how the original program should be
represented in C. If you model the behaviour of the processor too closely you will get a
translation that's slow and unreadable. On the other hand, the author of the assembler
code may have used some quite subtle coding tricks, so a very detailed analysis of the
program is necessary.
This document explores some of the techniques used by Relogix to achieve high-quality
code. Along the way we'll look at how not to convert assembler code to C, and how choosing
the right level of abstraction brings benefits in both readability and performance.

Representation of Registers and Conditions
The IBM 370 mainframe architecture has 16 user-programmable General Purpose Registers,
R0-R15. The instruction set is very rich, with both register- and memory-based operands,
and most instructions set the condition code.
For a simple example, consider the following register-to-register add instruction:
AR R1,R2

This instruction adds register R2 to R1, and also sets the condition code as
follows:
Resulting Condition Code:
0 Result zero; no overflow
1 Result less than zero; no overflow
2 Result greater than zero; no overflow
3 Overflow

How not to do it!
First let's look at how not to translate this instruction to C. Some simplistic translators model
the general purpose registers in a C global variable similar to the following:
struct {
    long r0;
    long r1;
    long r2;
    ...
    long r15;
} registers;

The translation of the AR instruction is then something like:
long temp;

temp = registers.r1;
registers.r1 += registers.r2;

if ( (temp > 0 && registers.r2 > 0 && registers.r1 <= 0) ||
     (temp < 0 && registers.r2 < 0 && registers.r1 >= 0))
    condition = 3;
else if (registers.r1 > 0)
    condition = 2;
else if (registers.r1 < 0)
    condition = 1;
else
    condition = 0;

Even if you use a C pre-processor macro like AR(r1,r2) to hide the implementation details,
this code is ugly and inefficient, and does not lead to a translation that's easy to understand
or maintain.

Representation of registers in Relogix-translated code
By contrast, Relogix maps the general purpose registers onto normal C variables. These are
not global, but rather are ordinary local variables within a subroutine (or subroutine
arguments if the register is passed into a routine as a parameter).
In addition, even within a subroutine a particular register like R1 is not always modelled by
the same local variable. Instead Relogix performs variable lifetime analysis: it looks at how
R1 is used within the subroutine. In the following example there are two different lives, the
second load instruction completely overwriting the results of the first with new data:
L R1,MYPTR
STC R3,0(R1)
L R1,OFFSET << This use of R1 is unconnected to the last
AR R1,R2

One important consequence is that Relogix can allocate proper types to the local variables
representing registers, as shown in the following example (ignore for a moment the clumsy
names given to the local variables; they're shown this way to make the example clear, but
Relogix normally chooses more meaningful names):
char *r1_1; << First version of R1 has type: char *
char r3;
long r1_2; << Second version of R1 has type: long
long r2;
...

r1_1 = myptr;
*r1_1 = r3;
r1_2 = offset;
r1_2 += r2;

Variable types are chosen by Relogix after a deep inspection of the code, examining all the
ways in which a variable is used. Does it seem to be a signed quantity or an unsigned
quantity? A pointer? What sort of pointer? Relogix is able to choose sensible variable types,
even discovering structures and unions.
A second consequence of modelling registers as local variables is that Relogix can perform
intermediate variable elimination, so that the example above becomes:
char r3;
long r1_2;
long r2;
...

*myptr = r3;
r1_2 = offset + r2;

Representation of conditions in Relogix-translated code
We saw how the simplistic translation of the AR instruction faithfully reproduced its effect
on the condition code. The translation is accurate but inefficient. In effect this approach
is close to writing an IBM 370 instruction-set simulator without the overhead of interpreting
the opcodes at runtime.
Relogix takes a different approach. It analyses what comes after the AR instruction. Does the
program actually check the conditions? If so, which ones?
For example, the AR instruction might occur in a sequence like this:
AR R1,R2
A R1,8(R4)

In the example above the programmer doesn't care about how the AR instruction sets the
condition code, since this is immediately overwritten by the instruction following. In this
case the translation of AR can just be:
r1 += r2;

Or maybe the programmer tests whether the result of the instruction is negative:
AR R1,R2
BM MINUS
LA R3,1
B DONE
MINUS LA R3,-1
DONE DS 0H

In this case, the Relogix translation might be:
r1 += r2;
if (r1 >= 0)
    r3 = 1;
else
    r3 = -1;

If the final value in R1 is not needed again, the translation might even be:
if (r1 + r2 >= 0)
    r3 = 1;
else
    r3 = -1;

The key points here are:
- The code matches what a human programmer might write, apart from the unusual
  choice of variable names. In fact, Relogix can choose better names, a fact which we
  will explore in a subsequent section.
- The code is efficient: it doesn't reproduce unwanted side-effects from the original
  IBM 370 instruction.
- In order to determine the best translation of an instruction it is necessary to look at
  it in context, analysing what comes before and what comes after. Relogix makes
  heavy use of recursive analysis techniques to perform a detailed investigation of the
  code.
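To make the contrast concrete, here is a minimal sketch that places the literal condition-code model next to the context-aware translation of the AR / BM sequence. The function names are invented for illustration and are not Relogix output. Registers are assumed to be 32 bits wide, and the literal version widens to 64 bits because signed overflow is undefined behaviour in C.

```c
#include <stdint.h>

/* Literal model: compute the full condition code for AR R1,R2,
   following the condition code table given earlier. */
static int ar_condition_code(int32_t r1, int32_t r2)
{
    int64_t wide = (int64_t)r1 + (int64_t)r2;

    if (wide > INT32_MAX || wide < INT32_MIN)
        return 3;                       /* overflow */
    if (wide > 0)
        return 2;                       /* result greater than zero */
    if (wide < 0)
        return 1;                       /* result less than zero */
    return 0;                           /* result zero */
}

/* Literal translation of AR / BM MINUS: BM is taken only for code 1. */
static int sign_literal(int32_t r1, int32_t r2)
{
    int condition = ar_condition_code(r1, r2);

    return (condition == 1) ? -1 : 1;
}

/* Context-aware translation: only the sign test that BM needs.
   Assumes the sum does not overflow. */
static int sign_direct(int32_t r1, int32_t r2)
{
    return (r1 + r2 >= 0) ? 1 : -1;
}
```

For inputs whose sum fits in 32 bits the two functions return the same value, but the second one is what a programmer would actually write.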
Unravelling Spaghetti: Go-to statement considered harmful
Edsger Dijkstra famously wrote a paper in 1968 called "Go To Statement Considered Harmful",
in which he argues that the existence of goto is an invitation to make a mess of one's
program.
As assembly-language programmers we're used to dealing with this - unconditional and
conditional branches, subroutine calls and returns are pretty much all we have. However,
no good C programmer would make excessive use of the goto statement.
Relogix is able to analyse the assembler program and recover a higher-level structure. In
particular it can detect the following C flow-control constructs:
if...else if...else statements
do...while loops
while loops
for loops
switch statements
subroutine calls and returns

This includes handling more complex spaghetti, like code which jumps into the middle of a
loop, code which jumps out of a loop to more than one exit point, and code which
manipulates a return address so that a subroutine doesn't return cleanly to its caller.
As an example of flow recovery, consider the following piece of assembler code which uses a
typical jump-table idiom found in 370 assembler:
B *+4(R3)
B L1
B L2
B L3
L1 LA R1,=C'ANIMAL' TYPE IS ANIMAL
B DONE
L2 LA R1,=C'VEGETABLE' TYPE IS VEGETABLE
B DONE
L3 LA R1,=C'MINERAL' TYPE IS MINERAL
DONE DS 0H

A bad translation of this might be as follows (Relogix never produces this):
if (r3 == 0)
    goto L1;
if (r3 == 4)
    goto L2;
if (r3 == 8)
    goto L3;
_rlx_flow_error_trap (); /* (Does not return) */

L1: r1 = "ANIMAL";
    goto DONE;
L2: r1 = "VEGETABLE";
    goto DONE;
L3: r1 = "MINERAL";
DONE:

As a first step, Relogix will convert this sequence into the following (Notice also that the
comments in the assembler code are carried over into the C code):
if (r3 == 0)
    r1 = "ANIMAL"; /* type is animal */
else if (r3 == 4)
    r1 = "VEGETABLE"; /* type is vegetable */
else if (r3 == 8)
    r1 = "MINERAL"; /* type is mineral */
else
    _rlx_flow_error_trap (); /* (Does not return) */

However, if there are enough cases to make it worthwhile, Relogix will convert this into a
switch statement:
switch (r3) {
case 0:
    r1 = "ANIMAL"; /* type is animal */
    break;
case 4:
    r1 = "VEGETABLE"; /* type is vegetable */
    break;
case 8:
    r1 = "MINERAL"; /* type is mineral */
    break;
default:
    _rlx_flow_error_trap (); /* (Does not return) */
    break;
}

Note that Relogix adds a call to a routine named _rlx_flow_error_trap which is called
in the event that an unexpected value is passed in R3. This is just a safety feature which
helps in catching programming errors. The implementation of _rlx_flow_error_trap
typically prints an error message and aborts the program.
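As a compilable illustration of the recovered switch, the sketch below wraps it in a function and supplies a trap routine. The names flow_error_trap and type_name are hypothetical stand-ins chosen for this example. The real _rlx_flow_error_trap belongs to Relogix and its exact behaviour is not shown in this document beyond printing a message and aborting.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for _rlx_flow_error_trap: report and abort. */
static void flow_error_trap(int value)
{
    fprintf(stderr, "unexpected jump-table index: %d\n", value);
    abort();
}

/* The jump table B *+4(R3) recovered as a switch on the byte offset in r3. */
static const char *type_name(int r3)
{
    switch (r3) {
    case 0:
        return "ANIMAL";        /* type is animal */
    case 4:
        return "VEGETABLE";     /* type is vegetable */
    case 8:
        return "MINERAL";       /* type is mineral */
    default:
        flow_error_trap(r3);    /* does not return */
        return NULL;            /* unreachable; keeps the compiler quiet */
    }
}
```

The case labels are 0, 4 and 8 rather than 0, 1 and 2 because R3 holds a byte offset into a table of 4-byte branch instructions.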

Better Variable Names
In the examples so far we have used variable names like r3 to make it clear which registers
the variables represent. To get closer to the goal of producing C code that a human
programmer might have written, Relogix needs to choose better variable names.
Relogix includes a module known as the name manager which takes care of this. The name
manager uses a number of techniques to choose a suitable name for each variable:
- The name manager inspects the way that a variable is used. If r1 is used in an
  example like this:
  *r1 = 10;
  ...then it is some kind of pointer. A generic name like ptr might be suitable. Similarly,
  a variable used as a counter in a for loop might be called i.
- If a pointer variable is initialised to point to a global data item, we can improve on
  the name. For example, instead of
  ptr = &date;
  ...we could use the name date_ptr.
- Similarly, if a variable is loaded from a named structure field, we can get a useful
  name:
  elapsed_time = performance.elapsed_time;
- And if it's a pointer to a structure of type time, Relogix might choose a name like
  time_ptr.
- Another useful source of variable names is the comments. For example, if we see the
  assembler statement:
  XR R1,R1 CLEAR THE TOTAL
  ...then a good name for the variable which represents r1 might be total.
- The names of variables can also be specified explicitly by the Relogix user.

To make it easy to relate the translated C code back to the original assembler code, the
original location of each variable is included in a comment when the variable is declared, e.g.
unsigned long total; /* [Originally in R1] */

Data Types
Relogix performs a detailed analysis of the way that variables are used in order to determine
their types. For example in the following code R1 is tested to check whether it's negative.
This indicates that it's a signed value, and so by inference PROFIT is signed too:

L R1,PROFIT
LTR R1,R1
BM MADE_A_LOSS

The translation might be something like:
long profit;

if (profit < 0)
...

In the example below, R2 is used in a logical shift right operation. It's probably an unsigned
value, and R3 is a pointer to an unsigned value:
L R2,0(R3)
SRL R2,8

In this case the translation might be:
unsigned long r2;
unsigned long *r3;

r2 = *r3 >> 8;
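The reason a logical shift points to an unsigned type can be seen in a small sketch. SRL fills the vacated high-order bits with zeros, which is what C guarantees for a right shift of an unsigned value. The function name is invented for illustration, and uint32_t is used here as an assumption to model a 32-bit register exactly, since unsigned long is wider than 32 bits on many current platforms.

```c
#include <stdint.h>

/* L R2,0(R3) followed by SRL R2,8, with R2 modelled as 32-bit unsigned. */
static uint32_t load_and_shift(const uint32_t *r3)
{
    return *r3 >> 8;    /* unsigned shift: zero bits enter from the left */
}
```

With the top bit set, for example 0x80000000, the result is 0x00800000. A signed type would leave the treatment of the sign bit to the implementation, so it would not be a safe model for SRL.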

Type analysis also allows Relogix to recover C structures and unions from the code, as in this
example:

ADDSALES CSECT
REGIONS EQU 5

USING SALES,R3
XR R1,R1
LA R2,REGIONS
LOOP CLI TYPE,'A'
BNZ SKIP
A R1,VALUE
SKIP LA R3,SIZE(R3)
BCT R2,LOOP
ST R1,TOTAL
BR R14

TOTAL DC A(0)


SALES DSECT
VALUE DS A
TYPE DS CL1
SIZE EQU *-SALES

In this example the DSECT declaration converts to a C structure of the following type:
struct sales {
long value;
char type;
};

The converted assembler code is shown below (notice that it's close to what a human
programmer might write):
#define REGIONS 5

/* Private file-scope variables */
static long total = 0;


/*
***************************************************************
* addsales *
***************************************************************
*
* Parameters:
*
* struct sales *sales_ptr [Originally in r3; In]
*/
void addsales (struct sales *sales_ptr)
{
    long i; /* [Originally in r2] */
    long v; /* [Originally in r1] */

    v = 0;
    for (i = 0; i < REGIONS; i++) {
        if (sales_ptr->type == 'A')
            v += sales_ptr->value;
        sales_ptr++;
    }
    total = v;
}

All the examples shown above are produced by Relogix automatically, without any human
intervention. However in all cases the type solving system can also be guided by
supplementary information provided by the user.
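For readers who want to try the translated routine, the following sketch gathers the structure, the file-scope variable and the function into one compilable unit and adds a small driver. The driver function example_total and its sample data are assumptions made up for this illustration. They are not part of the original program.

```c
#define REGIONS 5

struct sales {
    long value;
    char type;
};

/* Private file-scope variable, as in the translated code */
static long total = 0;

static void addsales(struct sales *sales_ptr)
{
    long i;                     /* [Originally in r2] */
    long v;                     /* [Originally in r1] */

    v = 0;
    for (i = 0; i < REGIONS; i++) {
        if (sales_ptr->type == 'A')
            v += sales_ptr->value;
        sales_ptr++;
    }
    total = v;
}

/* Hypothetical driver: sums the type 'A' entries of a sample table. */
static long example_total(void)
{
    struct sales table[REGIONS] = {
        { 10, 'A' }, { 20, 'B' }, { 30, 'A' }, { 40, 'C' }, { 50, 'A' }
    };

    addsales(table);
    return total;
}
```

Only the three entries of type 'A' contribute, so the sample table gives a total of 90.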
Self-Modifying Code
The technique of modifying instructions at runtime is very often used in IBM
mainframe assembler applications. In fact it seems to be far more frequent than on other
processors which we've seen.
The following is a typical example:
LTR R1,R1
BZ LABEL
MVI LABEL+1,C'Y'
LABEL CLI 0(R2),C'X'

Viewed in isolation the CLI instruction seems unambiguous - it's just comparing the
memory location pointed to by R2 with the immediate value 'X'. A simple translator might
mistakenly convert the instruction to something like this:
if (*r2 == 'X')
...

However, it's necessary to look at the whole code. The MVI instruction is modifying the
CLI instruction itself: it patches the immediate value used as the right-hand operand,
overwriting it with a 'Y'. Relogix is able to detect this case and it produces the following
translation:
static unsigned char immediate = 'X';

if (r1 != 0)
    immediate = 'Y';

if (*r2 == immediate)
    ...
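
The static variable matters here. The sketch below wraps the translation in a function, with a name invented for this example, to show that once the byte has been patched it stays patched on later calls, just as the modified CLI instruction would in the assembler program.

```c
/* Models the immediate byte at LABEL+1, which MVI may overwrite. */
static unsigned char immediate = 'X';

/* Returns 1 if the byte at r2 equals the current immediate value. */
static int compare_patched(long r1, const unsigned char *r2)
{
    if (r1 != 0)
        immediate = 'Y';        /* MVI LABEL+1,C'Y' */

    return *r2 == immediate;    /* CLI 0(R2),C'X' with a patchable operand */
}
```

Calling the function with r1 equal to zero compares against 'X' only until some earlier call has passed a non-zero r1. After that every call compares against 'Y'.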

Other examples of self-modifying code include:
Branch conditions modified at runtime so that a branch is either taken or not taken:
LABEL BC 0,TARGET
OI LABEL+1,X'F0'

String lengths modified at runtime:
STC R1,*+5
MVC WORKB(0),0(R3)

Operand offsets modified at runtime (Displacement DISP is modified in the
following example):
SR R3,R4 Calculate displacement
A R3,=X'00009000' OR in the R9 field
STH R3,LABEL+4 Poke the instruction
LABEL MVC RESULT,DISP(R9)

Many simple cases of this kind are handled automatically. By analysing the whole of the
program, Relogix chooses a translation which reproduces the original behaviour.
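The patched branch condition can be modelled in the same spirit. BC 0,TARGET never branches, and OI LABEL+1,X'F0' sets the mask to 15 so that later passes always branch. The sketch below is one plausible C rendering of this first-time switch, using a static flag. It is an assumption for illustration and not necessarily what Relogix emits. The names branch_armed, init_count and process are invented.

```c
#include <stdbool.h>

/* Models the mask nibble at LABEL+1: false = BC 0 (no-op), true = BC 15. */
static bool branch_armed = false;

/* Counts how often the fall-through path runs (for demonstration only). */
static int init_count = 0;

static void process(void)
{
    if (!branch_armed) {        /* LABEL BC 0,TARGET: not taken first time */
        branch_armed = true;    /* OI LABEL+1,X'F0': branch always from now on */
        init_count++;           /* one-time work between the OI and TARGET */
    }
    /* TARGET: code common to every pass would follow here */
}
```

However many times process is called, the fall-through path runs exactly once.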
Sometimes a program will actually generate whole sequences of instructions at runtime. For
example to get maximum performance it might compile an in-house query language into
machine code which it then executes. Although Relogix can detect and warn about such
cases it doesn't attempt to translate them automatically; they typically require rewriting by
hand.
If your application generates code at runtime please contact MicroAPL for advice. We have
wide experience in handling this type of problem. One approach which we've used
successfully on several past projects is to change the application to generate pseudo-code,
which is then executed using a simple pseudo-code interpreter.

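To give a feel for the pseudo-code approach, here is a minimal sketch of a stack-based interpreter. Everything in it is hypothetical: the opcode set, the instruction layout and the function name are invented for this example and do not describe any MicroAPL implementation. The idea is only that the application emits an array of such instructions where it previously emitted machine code.

```c
#include <stddef.h>

#define STACK_MAX 32

enum opcode { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

struct instr {
    enum opcode op;
    long operand;               /* used by OP_PUSH only */
};

/* Run a pseudo-code program. Returns 0 and stores the top of the stack
   in *result on success, or -1 if the program is malformed. */
static int run_pseudo_code(const struct instr *program, size_t length,
                           long *result)
{
    long stack[STACK_MAX];
    size_t sp = 0;

    for (size_t pc = 0; pc < length; pc++) {
        const struct instr *in = &program[pc];

        switch (in->op) {
        case OP_PUSH:
            if (sp == STACK_MAX)
                return -1;      /* stack overflow */
            stack[sp++] = in->operand;
            break;
        case OP_ADD:
        case OP_MUL:
            if (sp < 2)
                return -1;      /* not enough operands */
            sp--;
            if (in->op == OP_ADD)
                stack[sp - 1] += stack[sp];
            else
                stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            if (sp == 0)
                return -1;      /* nothing to return */
            *result = stack[sp - 1];
            return 0;
        default:
            return -1;          /* unknown opcode */
        }
    }
    return -1;                  /* ran off the end without OP_HALT */
}
```

A program of PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT evaluates (2 + 3) * 4 and yields 20. Because the generated program is plain data, it is portable C and needs no executable memory.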
Function calls and parameter passing
Relogix typically uses a one-to-one mapping such that each subroutine in the original
assembler code becomes a function of the same name in the C code.
There are two parameter-passing techniques commonly used in mainframe assembler code:
Parameters can be passed in registers
Parameters can be passed in a parameter block pointed to by the R1 register
Alternatively a subroutine may take an R1 parameter block and additional parameters in
registers.
Parameters in Registers
The case where parameters are passed in registers is simple to handle: Each register
becomes a separate parameter to the subroutine. Consider the following assembler code
example:
LA R3,=C'THE TOTAL IS: '
L R4,SUM
BAL R14,PRINTVAL

In this case the translation is straightforward:
char *r3;
long r4;

r3 = "THE TOTAL IS: ";
r4 = sum;
printval (r3, r4);

...which, after intermediate variable elimination is simply:
printval ("THE TOTAL IS: ", sum);

If the subroutine only returns a single register, that becomes the explicit result of the
function. Additional results are handled by passing values by pointer, just like a human
programmer would write:
result = myfunction (&res2);
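
As a concrete case of one explicit result plus a second result passed by pointer, consider a routine that returns a quotient and a remainder, much as a mainframe divide leaves two values in a register pair. The function below is a hypothetical illustration, not taken from any translated program.

```c
/* Returns the quotient; the remainder is passed back through a pointer.
   The divisor is assumed to be non-zero. */
static long divide(long dividend, long divisor, long *remainder)
{
    *remainder = dividend % divisor;
    return dividend / divisor;
}
```

A call such as quotient = divide (17, 5, &rem); leaves 3 in quotient and 2 in rem.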


Parameters in a Parameter Block pointed to by R1
A very common technique in mainframe assembler code is to pass subroutine parameters in a
parameter block. Typically the parameter block is declared in-line, and filled in with pointers to
all the parameters, and then the parameter block address is passed in register R1:
BAL R1,*+12 Branch around in-line param block
DC 2A(0)
LA R2,=C'THE TOTAL IS: '
ST R2,0(R1)
LA R2,SUM
ST R2,4(R1)
BAL R14,PRINTVAL

What should the C translation be in this case? One approach would be a very literal
translation, but it would produce very unnatural C code:

void * param_block [2];
void **r1;
char *r2_1;
long *r2_2;


r1 = &param_block[0];        // BAL R1,*+12
r2_1 = "THE TOTAL IS: ";     // LA R2,=C'THE TOTAL IS: '
r1 [0] = r2_1;               // ST R2,0(R1)
r2_2 = &sum;                 // LA R2,SUM
r1 [1] = r2_2;               // ST R2,4(R1)
printval (r1);               // BAL R14,PRINTVAL

...which, after intermediate variable elimination, becomes:

void * param_block [2];

param_block [0] = "THE TOTAL IS: ";
param_block [1] = &sum;
printval (param_block);

Instead, Relogix models each entry in the parameter block as a separate argument to the
subroutine:
char *param1;
long *param2;
char *r2_1;
long *r2_2;


r2_1 = "THE TOTAL IS: ";     // LA R2,=C'THE TOTAL IS: '
param1 = r2_1;               // ST R2,0(R1)
r2_2 = &sum;                 // LA R2,SUM
param2 = r2_2;               // ST R2,4(R1)
printval (param1, param2);   // BAL R14,PRINTVAL

...which, after intermediate variable elimination, is just:
printval ("THE TOTAL IS: ", &sum);

Once again the best code is obtained by choosing the appropriate level of abstraction. The
goal of Relogix is to separate what the code really does - pass some parameters into a
subroutine - from the assembler-specific way that it does this.
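
The difference also shows up on the receiving side of the call. In the sketch below, both versions of a hypothetical printval format their output into a caller-supplied buffer so that the result can be checked. The buffer parameters and the names printval_block and printval are assumptions made for this example. The parameter-block version has to recover each argument from an untyped pointer, while the separate-argument version lets the compiler check the types.

```c
#include <stdio.h>

static long sum = 42;

/* Literal model: arguments arrive in an untyped parameter block. */
static int printval_block(void **params, char *out, size_t size)
{
    const char *label = params[0];      /* nothing checks these types */
    const long *value = params[1];

    return snprintf(out, size, "%s%ld", label, *value);
}

/* Separate arguments: the prototype documents and enforces the types. */
static int printval(const char *label, const long *value,
                    char *out, size_t size)
{
    return snprintf(out, size, "%s%ld", label, *value);
}
```

Both versions produce the same text for the same inputs, but a caller that passes the arguments in the wrong order is only caught at compile time in the second form.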

Dynamic loading of modules
As a final example, consider the treatment of dynamically loaded modules.
In mainframe assembler code it's typical to write one major function per file. Each file is
separately assembled, but unlike a typical C program the object modules are not linked
together. Instead code which wants to call a function in a different module must dynamically
load the module during program execution:

MOD1 CSECT
LOAD EP=MOD2 Dynamically load module
ST R0,MOD2ADR Save entry point
...
L R15,MOD2ADR Get MOD2 entry point
BALR R14,R15 and call function
...
MOD2ADR DS A

In most cases the C version of the application will be built from source files which are
compiled separately, but with the resulting object files then linked together into a single
executable.
By default, Relogix detects sequences involving the LOAD call and tracks which module is
loaded and where the module address is stored. When converting this code to C, Relogix
strips out the assembler-specific detail and substitutes a normal external function call. The
LOAD instruction, the code which stores its entry address, and the variable used to store the
address are not required:

extern void mod2 (void);
...
mod2 ();

Note that in this example the external function takes no parameters and returns no result.
However this is just to make the example assembler code simple. In general Relogix
performs detailed cross-module analysis of the code to determine the parameters and
results of each function.
Although the default action of Relogix is to strip out the dynamic loading of an external
module and substitute a simple function call, there are times when this is not appropriate.
Sometimes the application genuinely needs to load external modules at runtime. To
accommodate this, Relogix can optionally convert the code to use a dynamic loading
technique in the target environment.

Conclusion
If you have a small assembler application it's possible to re-code it in C by hand, but for
larger applications this rapidly becomes unthinkable. You will spend years rewriting the code
and probably introduce many bugs along the way.
To quickly get the code working you need to consider some form of automatic translation.
Reliably converting assembler code to C is hard, even for a human. Producing C code that's
easy to read and easy to maintain is harder, and not possible using a simplistic approach to
automatic translation. However a tool like Relogix which performs detailed analysis of the
code can produce remarkable results.
For more information about Relogix please visit our website:
http://www.microapl.co.uk/asm2c/index.html

About MicroAPL
MicroAPL was founded in 1979, and developed one of the world's first multi-user
microcomputers.
Since 1990 the company has concentrated on the translation of assembly language, working
for major clients such as Apple Computer, EMC, Motorola/Freescale Semiconductor, Novell,
Schneider, Philips, DaimlerChrysler, Nortel, Alcatel, and many others.
Our first automated translation tool, PortAsm, translated code from one architecture to
assembler of another architecture. However, the main emphasis since 2003 has been on
translation to C, using our Relogix translation tool which builds on the considerable practical
experience we had built up in assembler-to-assembler translation.
