
A C code example

We shall now look at the C code of a coprocessing system consisting of the three components
mentioned earlier. First, a C source file with a very simple example of a synthesized function
and its wrapper is shown. Next, a host program outlines how the host communicates with the
logic. This host program is then improved for the sake of performance and reliability,
showing the recommended programming techniques.
Sample code for HLS synthesis
To clarify how HLS works with Xillybus, let's consider a simple example, which
demonstrates the calculation of a trigonometric sine and a simple integer operation, both
covered in a custom function, mycalc(). This is a very simple function, but as Xilinx's guide to
Vivado HLS shows, the possibilities go way beyond this. mycalc() takes the role of the
synthesized function.
This function is called by a wrapper function, xillybus_wrapper(), which is responsible for
the interface with the host. It accepts an integer and a floating point number from the host
through a data pipe, which is represented by the in argument. It returns the integer plus one
and the (trigonometric) sine of the floating point number, using the out argument.
How the *in++ and *out++ operations transport data from and to the host application is
explained below. A walkthrough of the code is given immediately after its listing here.
#include <math.h>
#include <stdint.h>
#include "xilly_debug.h"

extern float sinf(float);

int mycalc(int a, float *x2) {
  *x2 = sinf(*x2);
  return a + 1;
}

void xillybus_wrapper(int *in, int *out) {
#pragma AP interface ap_fifo port=in
#pragma AP interface ap_fifo port=out
#pragma AP interface ap_ctrl_none port=return

  uint32_t x1, tmp, y1;
  float x2, y2;

  xilly_puts("Hello, world\n");

  // Handle input data
  x1 = *in++;
  tmp = *in++;
  x2 = *((float *) &tmp); // Convert uint32_t to float

  // Debug output
  xilly_puts("x1=");
  xilly_decprint(x1, 1);
  xilly_puts("\n");

  // Run the calculations
  y1 = mycalc(x1, &x2);
  y2 = x2; // This helps HLS in the conversion below

  // Handle output data
  tmp = *((uint32_t *) &y2); // Convert float to uint32_t
  *out++ = y1;
  *out++ = tmp;
}

A brief explanation of the code above


This piece of code starts with #include statements. The math.h inclusion is necessary for
the sine function. xilly_debug.h contains headers for debug functions.
The declaration of xillybus_wrapper(), as well as the pragma directives that follow it, relate to
the interface with the Xillybus IP core, and must always appear as shown. In particular, neither
the name of this function nor its arguments may be changed.
Next, we have a call to xilly_puts(), which produces a debug message that can be displayed
easily on the host computer's console, as elaborated in part VI.
After this, the input data is fetched. Each *in++ operation fetches a 32-bit word originating
from the host. In the code shown, the first word is interpreted as an unsigned integer, and is
put in x1. The second word is treated as a 32-bit float, and is stored in x2. The
communication of data is explained further below.
This is followed by writing x1's value to the host computer's console as a decimal number
for debug purposes.
Next comes a call to mycalc(), the synthesized function. This function returns
one result as its return value, and the second piece of data goes back by changing x2. The
wrapper function copies the updated value of x2 into a new variable, y2, which may appear to
be a redundant operation.
Had this code been compiled for execution on a processor, the copying to y2 would have
been redundant indeed. When using HLS, however, it is necessary to make the compiler
handle the conversion from float later on. This reflects a somewhat quirky behavior of the HLS
compiler, and is one of the delicate issues of using a pointer: The HLS compiler doesn't
really generate a memory array and a pointer to it. The use of the pointer is just a hint on what
we want to accomplish, and sometimes these hints need to be pushed a bit.
Finally, the results are sent back to the host: Each *out++ sends a 32-bit word to the
computer, with due conversion from float.
Note that the *in++ and *out++ operators don't really move pointers, and there is no
underlying memory array. Rather, these symbolize moving data from and to FIFOs (and
eventually from and to Xillybus pipes). Hence, the only way the "in" and "out" variables may
be used is *in++ and *out++.
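The wrapper reinterprets a 32-bit word as a float (and back) by casting pointers. In a host program or a software test bench, where ordinary C rules apply, the same reinterpretation can also be written with memcpy(), which avoids pointer aliasing concerns. The following is only a sketch: the helper names word_to_float() and float_to_word() are hypothetical, a 32-bit float is assumed, and whether a given HLS tool accepts this form for synthesis is not covered here.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float is a 32-bit type, as it is in the wrapper function */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper: view the bits of a 32-bit word as a float */
static float word_to_float(uint32_t w) {
  float f;
  memcpy(&f, &w, sizeof(f)); /* Copies the bit pattern, no numeric conversion */
  return f;
}

/* Hypothetical helper: view the bits of a float as a 32-bit word */
static uint32_t float_to_word(float f) {
  uint32_t w;
  memcpy(&w, &f, sizeof(w));
  return w;
}
```

On the host, word_to_float() could be applied to the second word read back from the logic, and float_to_word() to a float before it is sent.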
The host program
The following program can be used to communicate with the logic. Most notably, two
device files, which behave like named pipes, are used for communication with the logic:
/dev/xillybus_read_32 and /dev/xillybus_write_32. These two files are created by
the Xillybus driver, as explained on this page.
As before, the listing is followed by comments.
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

int main(int argc, char *argv[]) {

  int fdr, fdw;

  struct {
    uint32_t v1;
    float v2;
  } tologic, fromlogic;

  fdr = open("/dev/xillybus_read_32", O_RDONLY);
  fdw = open("/dev/xillybus_write_32", O_WRONLY);

  if ((fdr < 0) || (fdw < 0)) {
    perror("Failed to open Xillybus device file(s)");
    exit(1);
  }

  tologic.v1 = 123;
  tologic.v2 = 0.78539816; // ~ pi/4

  // Not checking return values of write() and read(). This must be done
  // in a real-life program to ensure reliability.

  write(fdw, (void *) &tologic, sizeof(tologic));
  read(fdr, (void *) &fromlogic, sizeof(fromlogic));

  printf("FPGA said: %d + 1 = %d and also "
         "sin(%f) = %f\n",
         tologic.v1, fromlogic.v1,
         tologic.v2, fromlogic.v2);

  close(fdr);
  close(fdw);

  return 0;
}

The program begins with opening two files, /dev/xillybus_read_32 and
/dev/xillybus_write_32. These two files are the operating system's representation of the data
pipes through which the host communicates with the logic.
The tologic structure is then filled with some values for transmission to the logic, after
which it's written directly from memory to xillybus_write_32. Effectively, this writes 8 bytes,
or more precisely, two 32-bit words. The first is the integer 123 put in tologic.v1, and the
second is the float in tologic.v2. The tologic structure was hence set up to match the data that
the logic expects: One integer for the first *in++ operation, and one float for the second.
It is crucial to match the amount of data sent to /dev/xillybus_write_32 with the number of
*in++ operations in the wrapper function. If too little data is sent, the synthesized
function may not execute at all. If there's too much, the following execution will probably be
faulty.
At this point, the function is executed in logic, and the result is returned as two 32-bit words
by virtue of the *out++ operations at the end of the wrapper function. These two values are
read from /dev/xillybus_read_32 by the read() call that follows write().
In this example, the same structure format was chosen for tologic and fromlogic, but
there's no need to stick to this. It's just important that the data sent and received is in sync
with the wrapper function's number of *in++ and *out++ operations.
Finally, the input and output structures are printed out for review.
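Since the amount of data must match the wrapper's *in++ and *out++ operations exactly, it may help to let the compiler verify the size of the structure. The sketch below assumes a C11 compiler. The names struct packet, WORDS_PER_CALL and bytes_for_calls() are illustrative, not part of Xillybus.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the wrapper performs two *in++ operations per execution */
#define WORDS_PER_CALL 2

struct packet {
  uint32_t v1; /* Consumed by the first *in++ */
  float v2;    /* Consumed by the second *in++ */
};

/* Compilation fails if padding or a type change breaks the word count */
_Static_assert(sizeof(struct packet) == WORDS_PER_CALL * sizeof(uint32_t),
               "struct packet must occupy exactly WORDS_PER_CALL 32-bit words");

/* Number of bytes to write for a given number of executions */
static size_t bytes_for_calls(size_t calls) {
  return calls * sizeof(struct packet);
}
```

If a field is added to the structure without a matching *in++ in the wrapper, the build stops instead of producing a host program that is out of sync with the logic.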
It's important to note that the program above demonstrates a single execution of the
synthesized function. This is not the way to measure the efficiency of coprocessing, as
I/O latency and other delays will cause a poor outcome. Rather, it's kept concise for the sake
of illustration. A more realistic program is given below for reference.
This code was written for compilation on Linux. Windows users may need to make all or
some of the following adjustments:
Change the file name string from /dev/xillybus_read_32 to \\\\.\\xillybus_read_32
(the actual file name on Windows is \\.\xillybus_read_32, but escaping is necessary). The
second file name changes to \\\\.\\xillybus_write_32.
Replace the #include statement for unistd.h with io.h.
Replace the calls to open(), read(), write() and close() with _open(), _read(), _write() and
_close().
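The adjustments listed above can be collected in one place with conditional compilation, so that the same source builds on both systems. This is a minimal sketch, not a tested Windows port: the xilly_* macro names are hypothetical, and adding _O_BINARY on Windows is an assumption made here to keep the C runtime from translating bytes in text mode.

```c
#include <fcntl.h>

#ifdef _WIN32
#include <io.h>
/* Escaped form of \\.\xillybus_read_32 and \\.\xillybus_write_32 */
#define XILLY_READ_DEV  "\\\\.\\xillybus_read_32"
#define XILLY_WRITE_DEV "\\\\.\\xillybus_write_32"
/* Assumption: binary mode is wanted, so no newline translation takes place */
#define xilly_open(name, flags) _open((name), (flags) | _O_BINARY)
#define xilly_read  _read
#define xilly_write _write
#define xilly_close _close
#else
#include <unistd.h>
#define XILLY_READ_DEV  "/dev/xillybus_read_32"
#define XILLY_WRITE_DEV "/dev/xillybus_write_32"
#define xilly_open(name, flags) open((name), (flags))
#define xilly_read  read
#define xilly_write write
#define xilly_close close
#endif
```

With these definitions, the host program would call xilly_open(XILLY_READ_DEV, O_RDONLY) and xilly_open(XILLY_WRITE_DEV, O_WRONLY) instead of naming the files and functions directly.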
Running
The expected behavior of a test run is now shown. For this to work, the Xillybus driver must
have been loaded and must have detected its counterpart in the logic fabric. How this is set up is
explained in part IV.
Before attempting a test run, it's recommended to begin watching the debug output by typing
cat /dev/xillybus_read_8 at the shell prompt. In another terminal window, run the program,
which should look like this:

$ ./hlsdemo

FPGA said: 123 + 1 = 124 and also sin(0.785398) = 0.707107

As a result of the execution, some debug output will be generated:

$ cat /dev/xillybus_read_8

Hello, world

x1=123

Hello, world

The origins of the first two lines are easily found in the wrapper function above. The third
line, a second "Hello, world", may come as a surprise, and may not appear in some cases. It's a
result of the HLS compiler's attempt to promote data flow. The logic's state machine always
assumes that new input data is on its way, and attempts to move things forward as much as
possible to save processing time once the data arrives.
Since no input data is needed for the second "Hello, world", it's sent out as soon as possible.
In this case, that's immediately after x1=123, which depends on input data. In theory, it could
go on printing out the x1= part as well, but the compiler didn't optimize things this far.
A practical host program
The code above outlines the way data is exchanged, but two changes are necessary in
a practical system:
Sending a single set of data for processing is extremely inefficient, making I/O overhead
a major delay component. It's also wrong to wait for the outcome of a single execution
before sending the next set.
The return values from read() and write() aren't checked, so partial operations and UNIX
signals aren't handled properly. This is a negligible issue when a single chunk of 8 bytes
is going back and forth, but may cause weird problems in real-life applications.
The program below shows a suggested practical Linux-style implementation of using the
logic for coprocessing. This is a throughput-oriented implementation, focused on keeping the
data flowing rather than completing rounds of requests and responses.
The following differences are most notable:
Rather than generating a single set of data for processing, an array of structures is
allocated and sent. Likewise, an array of data is received from the logic. This reduces the
I/O overhead, and the impact of software and hardware latencies.
The program forks into two processes, one for writing and one for reading data. Making
these two tasks independent prevents the processing from stalling due to lack of data to
process or output data waiting to be cleared up. This independence can also be achieved with
threads (in particular on Windows) or with the select() call.
The read() and write() calls are made in while loops to ensure reliable I/O. These
loops may appear cumbersome, but they are necessary to respond correctly to partial
completions of these calls (not all bytes read or written), which occur frequently under
load. The EINTR error is also handled, in order to react properly to POSIX signals,
which may be sent to the running processes, possibly by unrelated software.
Note that for real use, the debug messages must be removed from the synthesized and
wrapper functions, as they may slow down execution dramatically, in particular by forcing
sequential execution where a speedup is possible by parallel execution.
The program's listing follows.
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

#define N 1000

struct packet {
  uint32_t v1;
  float v2;
};

int main(int argc, char *argv[]) {

  int fdr, fdw, rc, donebytes;
  char *buf;
  pid_t pid;
  struct packet *tologic, *fromlogic;
  int i;
  float a, da;

  fdr = open("/dev/xillybus_read_32", O_RDONLY);
  fdw = open("/dev/xillybus_write_32", O_WRONLY);

  if ((fdr < 0) || (fdw < 0)) {
    perror("Failed to open Xillybus device file(s)");
    exit(1);
  }

  pid = fork();

  if (pid < 0) {
    perror("Failed to fork()");
    exit(1);
  }

  if (pid) {
    close(fdr);

    tologic = malloc(sizeof(struct packet) * N);

    if (!tologic) {
      fprintf(stderr, "Failed to allocate memory\n");
      exit(1);
    }

    // Fill array of structures with just some numbers
    da = 6.283185 / ((float) N);

    for (i=0, a=0.0; i<N; i++, a+=da) {
      tologic[i].v1 = i;
      tologic[i].v2 = a;
    }

    buf = (char *) tologic;
    donebytes = 0;

    while (donebytes < sizeof(struct packet) * N) {
      rc = write(fdw, buf + donebytes, sizeof(struct packet) * N - donebytes);

      if ((rc < 0) && (errno == EINTR))
        continue;

      if (rc <= 0) {
        perror("write() failed");
        exit(1);
      }

      donebytes += rc;
    }

    sleep(1); // Let debug output drain (if used)

    close(fdw);
    return 0;
  } else {
    close(fdw);

    fromlogic = malloc(sizeof(struct packet) * N);

    if (!fromlogic) {
      fprintf(stderr, "Failed to allocate memory\n");
      exit(1);
    }

    buf = (char *) fromlogic;
    donebytes = 0;

    while (donebytes < sizeof(struct packet) * N) {
      rc = read(fdr, buf + donebytes, sizeof(struct packet) * N - donebytes);

      if ((rc < 0) && (errno == EINTR))
        continue;

      if (rc < 0) {
        perror("read() failed");
        exit(1);
      }

      if (rc == 0) {
        fprintf(stderr, "Reached read EOF!? Should never happen.\n");
        exit(0);
      }

      donebytes += rc;
    }

    for (i=0; i<N; i++)
      printf("%d: %f\n", fromlogic[i].v1, fromlogic[i].v2);

    sleep(1); // Let debug output drain (if used)

    close(fdr);
    return 0;
  }
}
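The two while loops in the listing follow the same pattern, so they could be factored into a pair of helper functions. The sketch below is one possible arrangement, not part of the original program: the names write_all() and read_all() are hypothetical, and like the listing, it treats a zero return value from write() as a failure and from read() as an unexpected end of file.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes. Returns 0 on success, -1 on failure. */
static int write_all(int fd, const void *data, size_t len) {
  const char *p = data;

  while (len > 0) {
    ssize_t rc = write(fd, p, len);

    if ((rc < 0) && (errno == EINTR))
      continue; /* Interrupted by a signal: just try again */

    if (rc <= 0)
      return -1; /* Real error, or nothing written at all */

    p += rc; /* Partial completion: advance and write the rest */
    len -= (size_t) rc;
  }
  return 0;
}

/* Read exactly len bytes. Returns 0 on success, -1 on error or early EOF. */
static int read_all(int fd, void *data, size_t len) {
  char *p = data;

  while (len > 0) {
    ssize_t rc = read(fd, p, len);

    if ((rc < 0) && (errno == EINTR))
      continue; /* Interrupted by a signal: just try again */

    if (rc <= 0)
      return -1; /* Real error (rc < 0) or EOF (rc == 0) */

    p += rc; /* Partial completion: advance and read the rest */
    len -= (size_t) rc;
  }
  return 0;
}
```

With these helpers, the writing process would call write_all(fdw, tologic, sizeof(struct packet) * N) and the reading process read_all(fdr, fromlogic, sizeof(struct packet) * N), each checking for a return value of -1.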
