
Application eXecute-In-Place (XIP) with Linux and AXFS

Sören Wellhöfer
soeren.wellhoefer@gmx.net
September 17, 2009

Abstract

XIP, or eXecute-In-Place when written out, is a technique of directly accessing application code and data in non-volatile flash memory rather than transferring it to physical RAM first in order for execution to proceed. It is frequently used in the context of embedded computing.

The sheer notion, however, of using more cost-intensive flash chips to run programs from has been looked upon rather suspiciously by many a system engineer in the past. Moreover, in the world of Linux, running applications in place is far less common than doing so for the kernel.^1 For that matter, kernel XIP has already been used successfully to improve boot-up time in a number of cases. Application XIP, on the other hand, is not yet as widely taken advantage of by open-source system designers.

This article demonstrates, in concept as well as in practice, how application eXecute-In-Place basically works, and sheds some light on the argument that XIP might not be such an aberrant idea after all.

Alas, free, comprehensive and practical resources on the subject matter are relatively scarce; it is hoped that this short article can at least somewhat remedy this condition.

In its first part, this article gives a broader view of the concepts of eXecute-In-Place (XIP) for user applications, with specific references to Linux. The second part focuses on AXFS (Advanced XIP File System) and begins by explaining its basic ideas, followed by concrete instructions on how to set up a Linux box with it. Finally, this article is complemented by presenting and analyzing the results of performance tests that have been conducted on two embedded systems to compare AXFS and JFFS2 in terms of execution speed.

1 Regular program execution

A program, when stored on disk or flash, is in essence nothing but binary data, that is, code that gets executed and program data that will be used by the program during execution. Linux uses a specific and very flexible format for this purpose called ELF (Executable and Linkable Format). It is widely accepted and has become the de-facto standard in the Unix world.

When a user decides to execute a program, the shell invokes the Linux system call fork(). By doing so, a new process with a unique PID (Process Identification) is created. This process is initially an exact clone of the shell process itself; all the executable code as well as the process data is merely copied.^2

The next thing that happens is that a system call of the exec() family of functions is invoked, which receives the path name of the file that the user wishes to execute as one of its arguments. The exec()-like call now recognizes the ELF format and attempts to replace sections of the currently running process with those found in the file. While doing so, most of the program code as well as the program data is replaced with that of the program to be newly executed.

1. Here, application XIP denotes the fact that user-space programs are to be executed from flash memory.
2. Actually, process data is not blindly copied: parent and child process share the same regions of memory unless writes occur; only then are the needed sections really copied. This technique is called copy-on-write.
on the file system image.
2 Application XIP

What has been described in the previous section is the conventional way of doing things. In-place execution, abbreviated XIP, takes a slightly different approach in that the actual program code as well as the program's static data (the .text and .data sections of an ELF file, respectively) are never copied to RAM, and neither will they be copied when a process forks itself by calling fork().

When a program is executed in place, the only things actually given space in RAM are the .bss section^3, meaning the uninitialized data that the program will be using, as well as the program's stack, which grows dynamically as execution proceeds^4. The .text and .data sections of an ELF file do not get transferred to RAM, since execution proceeds directly from non-volatile flash memory by setting the execution pointer to a memory location on the flash itself.

This is potentially useful when a tight limit on RAM resources constrains their utilization, as is often the case for embedded systems and smaller devices; with XIP, no fetching of pages from flash and copying them to RAM is necessary whatsoever.

Because of its nature, XIP can be perceived as a form of shared memory access in that multiple processes executing the same code all share it in flash memory without maintaining a unique copy in RAM just for themselves.

A method commonly found in the Linux world is to use XIP in connection with compressed read-only file systems such as cramfs or SquashFS. There are XIP extensions (patches) available for those file systems that aim to minimize flash storage utilization through compression while providing XIP capabilities. This is usually achieved by storing application binary data unchanged in flash memory while still applying compression to everything else on the file system image.

Within the virtual memory address space of a process, the executable code and static data (.text and .data) are directly mapped to flash memory so that paging^5 of these sections becomes unnecessary.

Note that XIP data cannot be compressed, as this would thwart the very purpose of in-place execution itself; the data would have to be inflated and moved to RAM first in order to execute it, which is precisely what is to be avoided.

Again, not making use of XIP means that program data (the ELF .data and .text sections) is stored in compressed form in flash memory, just like everything else on a compressed file system. It is then decompressed and loaded into RAM when needed during the execution flow, that is, when misses occur in the page cache lookup performed by the kernel.^6

3. The .bss (Block Started by Symbol) section of an ELF file describes all the uninitialized data of a program. It does not occupy any actual space on disk but gets allocated in RAM upon execution and is initially filled with zeros. By contrast, the .data section does already occupy memory on disk; it contains all the non-changeable data which is hard-coded into the application, for example a statically assigned string constant in a C program.
4. Note that all this is only true for read-only file systems. If data is also to be modified, the .data section does get (partially) copied to RAM so that it can be changed before being committed back to the disk or flash.
5. Paging, as performed by the operating system in the context at hand, is the process of retrieving data segments called pages (usually 4 kB in size) from external media, such as hard disks or flashes, and loading them into RAM. This allows the memory space of a process to be built up when execution begins. Because the memory as seen by the process is not truly contiguous, but rather made up of many fragmented portions residing at various locations in physical RAM, this method is often called virtual memory, or VM. With most file systems that apply compression, it becomes necessary, in addition to merely copying pages, to decompress them whilst paging; this naturally takes up time as well.
6. The kernel is, on most architectures, aided in this task by an MMU (Memory Management Unit), a hardware component that speeds up page table look-ups.

This method of pulling pages in only when truly needed is often called demand-paging, because loading occurs only when they are actually wanted.

One disadvantage of XIP is that read access to flash memory is relatively slow compared to RAM. Whereas retrieving a page of memory in random-access fashion from regular SRAM takes about 25 ns, this increases to about 100 ns for an average NOR flash chip. Although recent developments have dramatically reduced this number, down to 70 ns for some rather expensive NOR chips, flash memory access can never be as fast as SRAM due to its intrinsic technical make-up.

It is important to note that eXecute-In-Place is only truly possible with NOR flash memory, unless some sort of emulation layer is applied. Unlike NOR, NAND flash memory is not directly memory-mappable and can only be accessed on a per-block basis (usually 512 bytes). This generally yields a faster data rate; however, it renders NAND unsuitable for XIP, where it must be possible to access and read individual bytes, as is necessary for program execution.

Another disadvantage is posed by the fact that RAM is generally available at a lower cost than flash memory. For XIP to be used beneficially, one must consider whether the advantages gained in RAM preservation outweigh the disadvantages of higher flash memory utilization and slower memory access, and thus render the trade-off worthwhile.

In a test case constructed by engineers at Intel, it has been shown that when running an application and reducing the available RAM at fixed successive intervals, system performance drops rapidly at a certain point where memory saturation is reached. As the system runs out of memory, the operating system is excessively burdened with the task of swapping^7 pages out to an external storage medium; this, of course, degrades system performance tremendously.^8

On the contrary, when conducting the same test with the application being XIP, the drop in performance is much more gradual and not as abrupt. At very low levels of free memory (less than 5%), where regular execution from RAM would result in a complete system freeze, XIP is still capable of keeping up fairly good system performance and overall responsiveness.

Moreover, the differences in speed mentioned above between executing from NOR flash and from RAM can be alleviated by employing larger CPU instruction caches that continually pre-fetch soon-to-be-executed chunks of code from flash memory.

For instance, the Alchemy Au1100 MIPS32-based processor, part of one of the systems used for the benchmarks described in the last part of this article, is endowed with a 16 kB instruction cache. As the percentage of cache hits increases at a steady rate (relative to the number of cache flushes), the differences between RAM and NOR access speed do indeed begin to smooth out, especially under higher system loads. Performance is, of course, not the same, nor can it be; however, instruction caches can have a noticeable impact on the speed of execution from flash memory.

In summary, it can be said that if a major point of emphasis in design choices is placed on minimizing RAM utilization, XIP fulfills this requirement at large. As having huge amounts of RAM available also always means putting up with higher power consumption, XIP can bring about an improvement because an excessive amount of RAM becomes quite unnecessary.

7. If physical memory becomes exhausted, it is common to apply swapping. The principle is to temporarily store data from RAM on disk ("swap out") so as to forthwith create space for other, currently more exigent data.
8. The condition of high "memory pressure", keeping the system from doing anything useful, has generally become known as thrashing.

Thus, it is especially useful for applications that involve low power consumption and portability, such as cell phones, PDAs and other portable media devices.

Furthermore, the concern that storing XIP data uncompressed necessarily implies using up more flash memory is only partially justified. Making use of XIP in conjunction with compressed read-only file systems can also keep flash utilization relatively low, provided the balance between compressed segments and XIP segments on the flash is carefully kept and adjusted according to the specific requirements of the system.

Conclusively, XIP must be acknowledged to be a viable solution that can safely be taken into account by system engineers designing embedded systems and smaller devices.

3 AXFS

Although XIP patches for cramfs (called "linear cramfs") have by now literally been used for years, they have never made it into the Linux kernel mainline, nor have they ever been cleanly implemented in any well-defined way.

By designing and implementing AXFS (Advanced XIP File System), the author intended to create a stand-alone file system that elegantly supports application XIP while decreasing RAM footprint and speeding up the execution of certain time-critical code.

In a paper describing the motivations behind AXFS, the author furthermore laments the fact that in the past, Linux kernel developers seem to have misperceived the notion of flash memory in that they created abstraction layers that allowed them to essentially treat flashes like block devices, without making proper use of the intricacies of flash as memory. In this sense, the author speaks of a "Flash-as-block paradigm"[6] that was to be overcome with AXFS.

The main feature of AXFS, however, is that individual pages of an application's binary can be marked for in-place execution, meaning that only those pages will be run directly from flash memory instead of being copied (paged) to RAM first. This is a clear advantage compared to patched versions of cramfs, where only applications in their entirety can be marked for in-place execution, by means of setting a "sticky bit".

A user usually decides which pages (or chunks in AXFS terminology) should be marked for XIP by applying a specific method called profiling. By doing so, it is possible to determine the most frequently called-upon portions of a program's binary; these can then be marked for XIP. Being able to mark pages individually conveniently allows the user to specifically pinpoint the hot spots where most execution time is spent.

This has the advantage of offering true eXecute-In-Place capabilities while still partially offering the benefits associated with a compressed file system. In plain words, space is saved through compression, but where appropriate, no compression is applied in favor of in-place executability.

All other portions of a program that are not marked for XIP remain compressed on the flash and are only inflated and copied to RAM when necessary; here, precisely the notion of demand-paging mentioned in the previous section applies. Eventually, XIP regions from flash and uncompressed blocks now residing in RAM are used to piece together an application's virtual memory space. Figure 1 is intended to illustrate how the linear virtual address space of an example process might be composed.

Application code and data marked for XIP, as was seen, is, unlike with the unpatched versions of cramfs or SquashFS, not stored in compressed form. Although this increases the size

9. In the schematic, assume the size of each shown chunk of data to be a multiple of the system's default page size.

[Figure 1 diagram: three columns labeled "NOR Flash memory" (XIP code, XIP data, compressed and uncompressed blocks), "RAM" (uncompressed code and data), and "Virtual memory" from low to high addresses (.text, .data, .bss).]

Figure 1: This diagram shows how elegantly AXFS combines its inherent eXecute-In-Place capabilities with the effectiveness of file system compression. XIP regions and compressed blocks alike reside in flash memory^9, here shown on the left-hand side. As indicated by the blue arrows, XIP regions are directly mapped into the program's virtual address space shown to the far right; storing those in RAM is avoided entirely. This, however, is not so for compressed blocks; when required during the program flow, page faults are generated and the necessary pieces are loaded into RAM (demand-paging). Once there, they are used to complete the virtual memory layout. It should furthermore be noted that it does not matter in which section of the process's memory map XIP regions eventually show up; they can be either .text or .data.

of data residing in flash memory slightly, the author of AXFS argues for this to be an actual advantage.

By running and profiling a particular program it is indeed very likely that one discovers and identifies often-executed chunks within the application binary; for most programs, execution time tends to accumulate very distinctively in at least a few places.

The author's main arguments for his file system are based on the fact that with JFFS2, SquashFS, and even cramfs, lots of memory space effectively gets wasted, and that with AXFS, this is avoided altogether:

"Consider 2MB of file data being actively used. It would exist both as 1MB compressed in Flash (assuming 50% compression) and 2MB uncompressed in RAM, 3MB of total memory."[4]

In spite of the severity of this argument, imagine a system where many larger-sized programs are to be executed at once. In this case, free memory space would dwindle rather quickly, because for each individual application, code and data sections would be stored in RAM in their inflated state, wasting a considerable amount of memory.

If, on the other hand, use of AXFS' eXecute-In-Place capabilities had been made, the many program instances could all have been executing directly from flash memory, thus circumventing the time-consuming process of reading, decompressing and copying.

Proceeding in this line of reasoning about wasted RAM space, the author of AXFS assesses that in the example from above "AXFS would use 2MB total memory instead of 3MB."

As AXFS application code and data exist uncompressed, the full 2MB would need to be stored in flash memory. However, he concludes that "it uses 1MB extra flash and saves 2MB of RAM", a definite advantage in the realm of limited memory resources. In this sense the author speaks of a 2:1 ratio by which AXFS always beats fellow contenders like cramfs.

The author of AXFS also sees another advantage in the fact that with his file system, paging is, in certain cases, practically unnecessary. This, he says, can boost system performance tremendously: "Eliminating the overhead of paging (copying the compressed data and decompressing) allows AXFS to be much faster at booting systems and launching applications."

With all of the above said, it is safe to deem AXFS a suitable file system for platforms that have to be content with rather low RAM resources while still having to maintain a good level of overall system performance.

It should lastly be mentioned that, as stated in the previous section, true in-place execution is only possible with NOR flash chips. To this end, AXFS is luckily capable of operating on the basis of NAND flashes, too; this is done by practically simulating execution from a NOR flash in RAM. This happens as follows: pages, or chunks as they are called in AXFS parlance, that are marked for XIP are silently copied to RAM upon mounting the AXFS file system image; this process may or may not incur a noticeable latency. The actual execution of those XIP chunks then proceeds from RAM rather than from flash memory itself. Naturally, this whole process is much slower than doing true in-place execution using NOR flashes. Figure 2 aims to illustrate the process just described.

One other feature AXFS offers, and that should at least be mentioned herein, is that it allows NOR and NAND flashes to be combined in such a way that the first one holds all the XIP data and the latter contains the compressed payload.

4 Setting up AXFS

This section attempts to briefly outline the steps required to successfully set up a Linux box with AXFS.

First and foremost it is necessary to integrate support for the AXFS file system into a reasonably recent version of the Linux kernel. In order to do so, a copy of the kernel must be obtained; it can e.g. be downloaded from http://www.kernel.org.

Having subsequently extracted the kernel sources, the next step is to also download a reasonably new version of AXFS. The easiest way to achieve this is to do a checkout from the developer's SVN repository at SourceForge:

svn co https://svn.sourceforge.net\
/svnroot/axfs

After having completed the download, one can proceed by entering the now present branch directory. The ./patchin.pl perl script^10 that resides in one of the sub-directories should be called with the appropriate command line arguments to automate the process of applying the patches that integrate the AXFS sources into the present kernel tree.

./patchin.sh <path_to_kernel_source>

Fortunate is the one who does not run into more deeply-rooted problems at this point. However, fixing a few compiler complaints by adjusting the source accordingly should in many cases be the end of most troubles.

The kernel should now be in a patched state and ready to be configured. This can be

10. Obviously, the perl interpreter must be installed on the development host.

[Figure 2 diagram: three columns labeled "NAND Flash memory" (XIP regions and compressed blocks), "RAM" (XIP code, XIP data, uncompressed blocks), and "Virtual memory" from low to high addresses (.text, .data, .bss).]

Figure 2: This schematic depicts how AXFS copes with NAND flashes, which in reality are unsuitable for in-place execution. Initially, all AXFS data structures, compressed and uncompressed alike (so-called blocks and XIP regions), reside in flash memory, here shown on the left. Upon mounting the AXFS file system image, those chunks marked for in-place execution (the XIP regions) are copied to RAM, as indicated by the blue lines. When page faults then occur during program execution, the compressed blocks still in flash memory are decompressed and loaded into RAM (indicated by the red lines); together with the XIP chunks already loaded, they are used to complete the process's virtual memory layout shown on the right-hand side. Note in all this that, compared to figure 1 (NOR), much more overall RAM is used up, yielding practically no positive effect towards memory preservation.

achieved by running make menuconfig in the kernel source top-level directory. To enable AXFS, the corresponding option must be selected in the file system section of the configuration menu; AXFS profiling support can be enabled there as well.

Later on, when enough profiling information has been gathered, the kernel can be recompiled with profiling support disabled; in fact, it is highly recommended to do so in order to permanently improve performance for the production release.

Finally, the new kernel can be compiled with make prepare && make. After this, it should be deployable to the target device. If module compilation of AXFS was opted for, a quick make modules_install should turn up the module axfs.ko at the proper location; it can then be copied and loaded into the running kernel on the target system.

The next step is to compile and use the mkfs.axfs utility that is used to create an AXFS file system image. Having entered the appropriate SVN directory and having compiled the tool, a directory tree populated with a typical root file system structure as well as all the relevant tools should be procured. Invoking the mkfs.axfs utility in the following way produces a valid file system image:

mkfs.axfs <root_fs_dir> <img_name>

Here, the first argument denotes the root directory mentioned above and the second one is the name of the output file the file system image will be written to.

Once the new kernel is up and running, it is time to try out whether it is capable of recognizing and using the AXFS file system image previously created. This is done by simply mounting it.

First off, the image file should be made accessible on the target platform by means such as directly copying it to the device, NFS, or similar methods.^11 Before actually mounting it, it is necessary to create a loop device^12 node first, so that the AXFS driver within the kernel can associate itself with the image file when mounting later on. The following command creates the new device node:

mknod /dev/loop0 b 7 0

Eventually, the actual mounting may take place. This can be achieved with the following command:^13

mount -t axfs -o loop \
<axfs_root_img> <mountpoint>

Here, the last argument indicates the path to a directory that serves as the mount point for the file system; it can be practically any arbitrary location.

Having, with the previous step, confirmed that the file system is indeed working and usable, it is now time to get the image deployed to flash. It is very likely that the reader uses the MTD (Memory Technology Device) subsystem, which provides an abstraction layer for accessing flash memory as device blocks.^14 The tool eraseall, which is part of the mtd-utils tool suite, should be used to make sure that the mtd device is fully erased before using it for the AXFS file system image. After this step is taken, the image can simply be copied onto it. The following listing gives the necessary commands:

eraseall /dev/mtdX
cp <axfs_root_img> /dev/mtdblockX

Here X is the number given to the mtd device.

As the last finalizing step, it should be made sure that the kernel command line contains the right parameter to actually find the new root file system. Therefore, one should be certain that the kernel command line contains something very similar to this: root=/dev/mtdblockX.

Now, after re-booting the platform, the AXFS root file system is used. Note that at this stage, everything is compressed; only the next step of profiling reveals the segments of the image that should be marked for XIP. After this, the file system image can be recreated using the information gathered, to enable XIP.

5 AXFS Profiling

In order to make use of AXFS' eXecute-In-Place capabilities, it becomes necessary to profile one or more target applications to see where most of the execution cycles get spent; once identified, the hot pages can easily be marked for XIP. Recall again that with AXFS, the XIP mechanism operates on the basis of individual pages instead of entire executables as a whole, as is done for linear cramfs.

What basically happens on the file system level when profiling is that the AXFS kernel driver records the number of page faults that occur for a specific location within the file system memory map.

11. Note that for the purpose of simply mounting the image, it is not yet required for it to be located on the flash.
12. Loop devices allow file system images to be mounted as if they were block devices. Block devices are most frequently associated with hard disks.
13. Note that the mount utility supplied with busybox might have problems recognizing the -t flag. It is better to obtain and compile a full-fledged version of mount, which is part of the util-linux tool suite. It can be downloaded from http://www.kernel.org/pub/linux/utils/util-linux/.
14. For more information, see http://www.linux-mtd.infradead.org/.

Profiling a particular application in practice means simply running it on the target platform and letting the AXFS profiler do its work.^15 It collects data and writes the information it gathers to /proc/axfs/volume0.

Profiling data therein is presented line-wise; each line reveals how often which page was called for, starting at the time the file system got mounted. The format specifies this information separated by commas. For instance, a line in the profiling file could look similar to this:

/sbin/init, 2412, 8

This would mean that in the executable saved as /sbin/init, page 2412 has up until now been called exactly 8 times. Calling the application under scrutiny a few times should eventually yield a fairly clear picture of where the hot spots are.

Having obtained this information, it now becomes possible to regenerate the AXFS file system (as previously described in section 4), but this time with the desired pages marked for XIP. The mkfs.axfs tool can receive an additional argument via the -i flag which points to a user-created file in XML format; it indicates which pages the user desires to have marked for XIP within the file system image.

An XML file created for this purpose typically has a format similar to the one presented in the following excerpt:

<xipfsmap>
<file>
<name>/bin/sed</name>
<chunk>
<size>4096</size>
<offset>0</offset>
</chunk>
<chunk>
<size>4096</size>
<offset>12512</offset>
</chunk>
...
</file>
...
</xipfsmap>

Multiple binary files and their XIP chunks may be given in the XML file. For each chunk specified, the size must be given explicitly. This number usually equals the page size of the current system, which in most cases is 4 Kilobytes (4096 bytes).

To all other pages not listed in the input XML file, AXFS applies its compression algorithm; the pages marked for XIP, however, remain altogether unaltered (uncompressed).

When subsequently executing an application, the kernel internally decides which method is the right one to invoke for reading and handling a specific page of the program, depending on whether it is compressed or marked for XIP (uncompressed).

To give some last insight at this point, the following C-code snippet from the AXFS kernel driver is shown; it initializes a structure containing pointers to the functions that perform the appropriate operation for retrieving a page from flash, depending on whether the page is XIP or not.

static struct
address_space_operations axfs_aops =
{
    .readpage     = axfs_readpage,
    .get_xip_page = axfs_get_xip_page
};

After having successfully created the new file system image, the steps outlined in the preceding section should be redone, except that the kernel should now be recompiled with AXFS profiling disabled.

6 Benchmarks

In the sections above it was freely stated that it is possible to reduce memory footprint and to speed up program execution times by marking certain sections of a program for in-place execution.

15. Note that for this to work, the kernel must have been compiled with AXFS profiling enabled, as described in the prior section.

That is now to be shown by proof of concept, by presenting the results of actual tests that have been conducted.

The general idea of testing was not only to measure run time (as is frequently done for XIP benchmarks), but also to put the data obtained for raw execution time in relation to the sizes of the binary files that were executed. Doing this, it is possible to get a much better feel for how execution time changes as file size grows.

In the succeeding section, the basic ideas and principles behind this approach are explained.

6.1 Procedure

First, assume the following snippet of C code, which in itself is a small program. Its unspectacular functionality does nothing else but decrementally step through an array of 4-byte integers while copying single data fields from it.

#include <stdint.h>

const int32_t a[0x40000<<(SIZE)] =
{1};

int main()
{
    register int i =
        sizeof a / sizeof(int32_t);

    int32_t v;
    while(--i >= 0) v = a[i];

    return 0;
}

While executing, the while-loop of this nifty program traverses the integer array and internally causes page faults as it goes. By copying from the array it is ascertained that each single piece of array data will indeed be truly read.

For non-XIP execution, this forces the kernel to load the required pages from flash memory and store them in RAM, applying decompression algorithms if necessary; there they can now be accessed by the program. For XIP, however, no copying takes place, as data is directly accessed on the flash, "in place", so to speak. It goes without saying that this is much faster.^16

Furthermore, by making the global array's total length variable at compile time, it is conveniently possible to alter the actual file size of the resulting executable by specifying different values for SIZE. This can be achieved by passing the compiler flag -DSIZE=X to GCC, where X denotes the desired value^17. Using this method, files of considerable size can be produced.

If X equals zero, the binary's size after compilation is about 1 Megabyte. When X is increased sequentially, the file size increases exponentially with a base of two; thus, the file's final size is always about 2^X Megabytes.

Now, using the following simple shell script, it is easily possible to generate a number of programs with identical functionality whose file sizes increase rather rapidly.

#!/bin/sh
for i in $(seq 0 $1); do
    gcc -DSIZE=$i -o $2$i $2.c
done

Assuming that the C program from above was saved as bmark.c and that the shell script was saved as gen.sh, one may now invoke the script as follows and thereby create eight executables of which the largest is about 2^7 = 128 Megabytes, that is, half of a quarter

16. Notice that this holds only true for the very first round of execution. Upon subsequent invocations, the virtual memory manager will have kicked in, aiding the program with its caches that still contain the uncompressed program data from previous executions; running the program now literally proceeds from RAM rather than from flash, which is faster by nature.
17. Note that a cross-compiler version of GCC for the target platform was used.

10
of a Gigabyte, in size.18

18. 128 Mb was chosen as a limit because more RAM was not available on the systems used for testing. Had an even larger file size been used, the system would likely have been in a condition of thrashing; that is, most of the system's resources would have been used up copying pages back and forth between swap space and physical RAM.

    ./gen.sh 7 bmark

After a short while, the results can easily be verified by issuing the command ls -1hs, which should result in an output similar to this:

    1.1M bmark0
    2.1M bmark1
    4.1M bmark2
    8.1M bmark3
    17M  bmark4
    33M  bmark5
    65M  bmark6
    129M bmark7

As can be seen, each program's size is about as much as was predicted.19 Eventually, a pool of ancillary programs that can be utilized for the purpose of execution speed benchmarking has been obtained.

19. The slight overhead that is discernible is probably due to inaccuracies in measurement, compiler alignment, and general program overhead, as the array is not the only piece of data that makes up a program as a whole.

To now gauge the definite run time of each program, the Unix tool time is used, which is capable of measuring the elapsed real time and of displaying the result in seconds after the process has finished running. For this to work, the following command was run on the target platform:

    time -p ./bmarkX

Here, the X indicates the size of the program according to the conventions devised above. Waiting out the time the program takes to complete, an output similar to this is given by time:

    real 0.21
    user 0.02
    sys 0.19

The interesting part here is the topmost line, which is the elapsed real time in seconds; herein, this is taken as the measure of execution speed.

In all this, it is worth pointing out that between each successive step in the series of running the next larger program, it was made sure that no data remained in the read buffers of the kernel's virtual memory manager by clearing the page caches each time. This was accomplished by running the following command sequence:

    sync && \
    echo 3 > /proc/sys/vm/drop_caches

Even if it is unlikely that cached pages from one application should be reused for another one at all, clearing the page cache at least has the effect of providing each program with an empty starting ground; furthermore, the kernel is now definitely forced to go through the complete sequence of reading, decompressing, and copying each page from flash memory anew.

6.2 Results and evaluation

The testing procedure delineated in the section prior was applied in this exact form to two embedded systems, which will subsequently be referred to as system A and system B.

Both systems are equipped with an RMI Alchemy Au1x00 processing core, which is based on the MIPS32 processor architecture. Whereas system A is endowed only with an Au1100 CPU clocked at about 350 MHz, system B is geared up with the newer succeeding model, namely the Au1200, that runs at a core frequency of about 500 MHz. Both systems have an on-board SRAM of 128 Megabytes in size integrated into them. System A as well as system B were running a Linux kernel of
version 2.6.22 with AXFS patched in; preemption20 was enabled and the slab allocator was opted for.

20. Preemption allows the kernel to interrupt the currently running task at any given point in favor of another, more predominant one. Under usual circumstances, execution is guaranteed to return to this point; however, this may happen after an indeterminable period of time has elapsed.

The other file system besides AXFS that was made use of for the purpose of obtaining comparable data was JFFS2, a fully log-structured (journaled) and zlib-compressed file system built on top of the MTD (Memory Technology Device) kernel subsystem; it is generally accepted to be a well-performing read/write file system for flashes.

By now successively running the benchmarking application series bmarkX, the following two tables (1 and 2) of values, for system A and B respectively, were obtained. A more telling representation is given by figures 3 and 4.

Note that in addition to the values for the two file systems, a third set of data was gathered along the way; it represents the run times of the programs shortly after the first round of execution for JFFS2 had been completed. By that time, most of the application code and data had already been cached in RAM so that swapping did in essence not occur, hence the column "Cached". The fourth column, termed "Delta", just shows the differences in time between JFFS2 and AXFS.

    Size  JFFS2  AXFS  Delta  Cached
    1     0.11   0.05  0.06   0.02
    2     0.20   0.12  0.08   0.05
    4     0.39   0.25  0.14   0.09
    8     0.77   0.51  0.26   0.17
    16    1.50   1.03  0.47   0.35
    32    2.99   2.01  0.98   0.69
    64    5.91   4.06  1.85   1.37
    128   12.31  8.11  4.22   2.83

Table 1: Results for system A

    Size  JFFS2  AXFS  Delta  Cached
    1     0.08   0.05  0.03   0.02
    2     0.15   0.10  0.05   0.04
    4     0.30   0.21  0.09   0.08
    8     0.61   0.40  0.21   0.15
    16    1.20   0.83  0.37   0.29
    32    2.42   1.59  0.82   0.60
    64    4.91   3.00  1.91   1.23
    128   9.66   5.91  3.75   2.50

Table 2: Results for system B

Taking a first casual glance at the data at hand unsurprisingly reveals that with each consecutive execution of the next larger file, which is twice the size of the previous one, execution time also roughly appears to double in general. However, it is crucial to recognize that, although the initial difference in speed may not seem too large for an executable of 1 Megabyte, the gap rapidly widens as file size increases; this is easily discernible in both of the bar charts. The rate of "chasm growth" (the delta change) also settles at roughly about two.

AXFS does not only beat JFFS2 at an increasingly better rate regarding execution time; one must also always bear in mind that for JFFS2 this exact amount of memory is actually used up in physical RAM, whereas for AXFS none is consumed at all. Moreover, JFFS2 also occupies a considerable amount of flash memory to store the exact same data, only this time with compression applied.

So, for instance, taking the case with an executable file size of 32 Megabytes and estimating JFFS2's applied zlib compression at 50%21, there are 16 Megabytes stored compressed on flash as well as 32 Megabytes stored uncompressed in RAM, adding up to a total of 48 Megabytes. Compared to AXFS, 16 Megabytes were thus wasted.

21. In most real cases this ratio stays somewhat below 50%.
Figure 3: Speeds on System A (bar chart; Execution Time (sec) against Executable Size (Mb) for JFFS2, AXFS, and Cached)

Figure 4: Speeds on System B (bar chart; same axes and series)

Therefore, the gain from using AXFS for larger files is twofold: once in execution speed and once in the amount of RAM saved.22

22. One might bring up the argument that file sizes of 64 Mb and more are beyond reality in embedded systems; however, such memory-intensive applications do exist even on smaller devices.

Stated quantitatively, AXFS has, for both systems tested, brought a gain of about 35% in overall execution speed compared to JFFS2. Nevertheless, once cached, AXFS is certain to be at the short end; here it is approximately 60% slower than SRAM. In section 2 it was indirectly expressed that NOR flash access speeds are around 75% slower than they are for SRAM. The ratio observed here is not quite as bad, but it comes directionally near it.

One finding that comes as no big surprise is that system B is generally faster than system A when overlooking the range of data more broadly. The rate at which system B beats A is roughly 20%, relative to system B; this does not appear to change much for either of the file systems, nor does the executable file size seem to matter much in this regard.

Interestingly enough, the execution speeds for RAM, which are directly correlatable with memory access speeds, appear to differ only marginally, leaning a little more towards the faster system B.

All that was examined in the above discussion is conducive to the conclusion that in cases where the program images to be executed become larger, the benefits for execution speed when using AXFS begin to show more distinctively. Adding in the factor that multiple instances of the same larger-sized program might be running at once, or that more than only one memory-heavy application is run, makes the mentioned benefits appear even clearer.

7 Prospects

Currently everything makes it appear as if NAND will continue to push NOR flashes further out of the market sector. NAND is available at a lower "price per bit", has faster access speeds for bulky data, and can
generally store more data.23 Values for power consumption are relatively comparable. NAND is deemed faster for writing, whereas NOR is generally believed to be more effective for reading.

23. As of this writing, common-use NAND flashes are capable of holding up to 16 Gigabytes of data, whereas for NOR it is at about 4 Gigabytes.

Notwithstanding these facts, there have been various attempts endeavoring to redesign NAND chips in such a way as to render them palatable for XIP mechanisms to take place; this, however, is a difficult task in and of itself. Most of the solutions proposed and implemented involve at least some form or another of temporary buffering.24 Plainly asserted, regarding a NAND chip's internal organization, there simply is no effortlessly feasible way for linear reads and random access to occur; this, however, is a fundamental requirement for eXecute-In-Place.

24. Key terms often associated with these approaches are, e.g., code-shadowing, sliding-window data retrieval, or fast-paging. The OneNAND technology is, for instance, one that makes use of some of these concepts.

An imaginable future for XIP is advertised by exploiting the expedient synergy that might result from the combination of NOR and NAND flashes in one and the same device; if NAND - because of its cost-effectiveness, speed, and larger allowable storage volume - and NOR - because of its faster reading speeds and inherent XIP capabilities - were to be dually incorporated, the combined power of both could be harnessed by finding the right balance between XIP and data storage requirements. AXFS has gone an important step in this direction by making exactly this one of its features. Perhaps adding other technologies like PSRAM (Pseudo-Static RAM) into the mix might even be worth a consideration, too.

One pro-argument for XIP might furthermore be that many devices, especially cell phones, PDAs, and some routers, are already equipped with NOR flashes by design. If flash memory is available in any case, it might as well be utilized for purposes besides exclusively storing persistent data.

After all, the final reckoning remains with the engineers who are trusted to make the right design choices for their specific requirements - if, in the future, XIP turns out to be useful in other places than cell phones, it will likely be seen around more often.

AXFS specifically is still rather fresh, and documentation for it other than the source code itself is scarce, if present at all. Luckily, it is sufficiently simple and can be understood by almost anybody having a good level of proficiency in C and some experience in kernel-programming-related topics.

With the advent of AXFS, some good ideas were put into practice, especially regarding application XIP. If propagated more in the future, it might well change how embedded open source developers judge the possible benefits gained from using XIP.

Acknowledgements

First of all, much thanks must gratefully be given to Ultratronik Entwicklungs GmbH for supporting the efforts of this article by providing the embedded boards used for testing. Thanks also goes to Jared Hulbert, the author of AXFS, for kindly answering a few questions.

References

[1] Jared Hulbert, Justin Treon, "Creating optimized XIP systems", Intel, 2006

[2] "Execute-in-place for file mappings", http://www.mjmwired.net/kernel/Documentation/filesystems/xip.txt, 2009

[3] "NAND vs. NOR Flash Memory - Technology Overview", Toshiba America Electronics Components, Inc., 2008
[4] Jared Hulbert, "[RFC] Advanced XIP File System", http://lwn.net/Articles/182337, 2006

[5] Jonathan Corbet, "AXFS: a compressed, execute-in-place filesystem", http://lwn.net/Articles/295545, 2008

[6] Jared Hulbert, "Introducing the Advanced XIP File System", Proceedings of the Linux Symposium, 2007
Copyright

Copyright (c) 2009 Sören Wellhöfer

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license can be obtained from http://www.gnu.org/copyleft/fdl.html