
Real-Time Operating System Kernel for Multithreaded Processor

Kiyofumi Tanaka School of Information Science, Japan Advanced Institute of Science and Technology kiyofumi@jaist.ac.jp

Abstract
In future embedded system development, multithreaded processors will be used for further performance improvement to satisfy large-scale and sophisticated applications. PRESTOR-1, a multithreaded processor we developed, has a processor context buffer (PCB), a mechanism that accommodates thread contexts spilled from the built-in context slots. Threads/tasks located in the PCB are controlled and swapped for the built-in active contexts entirely by hardware, so the performance of a system with many threads/tasks can be enhanced. Our RTOS kernel, which is compatible with the ITRON specification, has been extended to exploit the PRESTOR-1 multithreaded architecture, including the PCB mechanism and several extended instructions. Evaluation of execution with the PCB showed higher performance than single-threaded execution or multithreaded execution without the PCB, in spite of more cache misses.


1. Introduction
The growing diversity and complexity of embedded systems increase the demands on both processing performance and real-time operating systems (RTOS) from the viewpoint of software development. RTOSs provide several functions that are useful for the development of embedded software, for example, task management, synchronization/communication, memory pool management, time management, etc., as well as real-time task schedulers that are indispensable for real-time processing. To support efficient development of embedded systems, we developed a new real-time operating system kernel based on ITRON [3], which is the standard specification widely used in the Japanese embedded system industry [12]. The ITRON specification is simple and clear, and therefore conformity to the specification is effective for mutual understanding between developers and for programming education [8]. There are many implementations/products that follow the ITRON specification: More [2], NORTi [4], TOPPERS/JSP [6], T-Kernel [5], and so on. These kernels target various embedded processors. One of the features of our RTOS kernel is its adaptability: the task scheduling algorithm can be selected from the original ITRON method, rate monotonic, earliest deadline first, least laxity first, the mixed method we proposed [12], and so on, according to the objectives of the system, whereas ITRON originally specifies that scheduling is based only on fixed priorities statically assigned to tasks. In addition, we revised the RTOS kernel and adapted it to PRESTOR-1 [11], a multithreaded processor that we developed, in order to meet the demands for performance improvement. The processor has a hardware-controlled context buffer that virtually increases the number of built-in task contexts, and several extended instructions for fast execution. Our RTOS kernel exploits these mechanisms to enhance system performance. We evaluated the mechanisms by executing several tasks with our RTOS. In this paper, we describe the outline of PRESTOR-1 and our RTOS, and show the results of the evaluation. Section 2 describes the organization and architectural mechanisms of the target multithreaded processor, PRESTOR-1. In Section 3, the organization of our RTOS kernel and its extension to the multithreaded processor are shown. Section 4 shows the results of evaluation for task execution on the RTOS and the multithreaded processor, and Section 5 concludes the paper.

2. PRESTOR-1
In this section, we describe the organization of the multithreaded processor, PRESTOR-1. All logic circuits of PRESTOR-1 were designed in VHDL, and the processor LSI was fabricated with Fujitsu 0.18 µm ASIC technology. The die size of the LSI is 5.0 mm × 10.0 mm in the HQFP240 plastic package. The number of gates for memory elements (caches, TLBs, and register files) is 530,125, and that for the other logic is 503,259.


2.1 Organization
Figure 1. Block diagram of PRESTOR-1.

The organization of PRESTOR-1 is shown in Table 1 and the block diagram is depicted in Figure 1. The processor consists of an integer unit (IU), a processor context buffer (PCB), instruction/data caches, a memory management unit (MMU), special registers, a bus interface unit (BIU), and an interrupt request controller (IRC), which are connected by internal busses. The IU executes SPARC version 8 [13] integer instructions through a 10-stage scalar execution pipeline. Although the instruction set is based on the SPARC architecture, the organization is quite different from that of commercial SPARC processors. The main difference is that PRESTOR-1 does not have the register windows that all SPARC processors have. We assume that an executed program code does not include the save and restore instructions that explicitly rotate the register windows. The GNU compilers targeting SPARC processors can generate executable code without such instructions by specifying the option -mflat [1]. Therefore, the lack of register windows does not impede software development. PRESTOR-1 has a 4-entry PCB. Each entry accommodates a thread context that consists of the values of a program counter, a next program counter, a processor status register, a Y register that is used for multiply/divide calculation in the SPARC architecture, MMU registers (page table pointer, etc.), special registers (priority, CSB index; the CSB is a context storage buffer described in section 2.2.1), thirty-two user-mode general-purpose registers, and thirty-two supervisor-mode general-purpose registers. The size of each entry is 288 bytes. The details of the PCB mechanisms are described in section 2.2.1. The processor includes split primary instruction and data caches. Each 16KB cache is four-way set associative and the cache block size is 32 bytes. The instruction and data caches have an entry-lock mechanism to assist fast trap processing. In addition, the data cache has mechanisms for prefetching and explicit control described in section 2.4. Moreover, the caches are reconfigurable for priority-based partitioning and a FIFO buffer [9].

2.2 Multithreaded Execution Facilities


PRESTOR-1 is based on a block-multithreaded architecture that includes multiple thread contexts for fast context switching [14]. In this paper, we call a context slot that a task (or thread) occupies a processor context set (PCS). A PCS consists of a program counter, a next program counter, processor status registers, general-purpose registers, and several special registers, which include an address space specifier. In our block-multithreaded architecture, switching of active contexts occurs (1) when a cache miss occurs, (2) when a thread/task with higher priority becomes ready, (3) when an interrupt request occurs, (4) when an instruction for thread switching is executed, or (5) when the OS performs scheduling/dispatching. The switching caused by (1) and (2) can be prevented in a supervisor execution mode. In the switching, the next active candidate is selected by hardware based on the priority of the threads/tasks. Therefore, the mechanism supports priority-based real-time processing.

2.2.1 Processor Context Buffer

Table 1. Organization of PRESTOR-1.


Instruction set: SPARC Version 8 integer instructions [13]
Extended instructions (see section 2.4): Inter-PCS Instructions, Swap Instructions, Cache Line Control Instructions, Processor Interrupt Level Instructions, Byte Twisting Instructions
Execution pipeline: 10 stages, scalar
Cache memory (4-way set associative): Instruction: 16KB, 32B block / Data: 16KB, 32B block
TLB (4-way set associative): Instruction: 256 entries / Data: 256 entries
# of PCS (see sections 2.2, 2.2.3): Normal: 4 / Interrupt: 4
PCB (see section 2.2.1): 4 entries


Mechanisms for fast context switching in a conventional multithreaded architecture are effective only for the processor's built-in PCSs. Therefore, when all tasks/threads in the PCSs are in a waiting state, pipeline stalls cannot be avoided. In a practical computer system, there can be dozens or more than a hundred processes under the control of an operating system. Even in embedded systems, quite a few tasks can exist. It is difficult for a processor based on a multithreaded architecture to accommodate all the process/task contexts, since many PCSs would easily increase both hardware size and cost while also reducing the clock frequency. Therefore, the operating system sometimes has to swap the contents of PCSs with those in memory or the cache by executing load and store instructions, which can easily cause cache misses and degrade performance. In general, a primary cache is split into an instruction cache and a data cache according to their purposes. Task context data is different from instructions, and slightly different from task-execution or application data, although it is indeed data for a context-switching handler routine. PRESTOR-1 therefore provides a cache dedicated to task/thread contexts, the processor context buffer (PCB). In the priority hierarchy, the contexts of tasks/threads with the highest priority are located in the PCSs, those with the next higher priority are located in the PCB, and those with lower priority are in memory. A task/thread in the PCSs can become active when its priority is higher than that of any other task/thread. When the number of tasks managed by the OS exceeds the capacity of the PCSs and PCB, overflowing tasks are located in memory. The context storage buffer (CSB) region in memory contains such task contexts. In the CSB, each task context is stored as a data structure that a special swap instruction can quickly save and fetch; this is the same data format as is used in the PCB.
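To make the three-level hierarchy concrete, the following C fragment models the placement policy described above in software. It is only an illustrative sketch (not the hardware or the kernel source); the structure layout and names are assumptions, loosely following the context fields listed in section 2.1.

```c
#include <stdint.h>

#define NUM_PCS 4   /* built-in context slots (normal processing) */
#define NUM_PCB 4   /* processor context buffer entries           */

/* Assumed software model of one saved thread context (cf. section 2.1). */
typedef struct {
    uint32_t pc, npc, psr, y;
    uint32_t user_regs[32];
    uint32_t sup_regs[32];
    int      priority;      /* larger value = higher priority (assumption) */
} context_t;

typedef enum { IN_PCS, IN_PCB, IN_CSB } location_t;

/* Given a task's rank when all tasks are ordered from highest to lowest
 * priority, return where its context resides under the policy above.    */
static location_t location_for_rank(int rank)
{
    if (rank < NUM_PCS)           return IN_PCS;  /* active candidates        */
    if (rank < NUM_PCS + NUM_PCB) return IN_PCB;  /* hardware-managed spill   */
    return IN_CSB;                                /* in memory, PCB format    */
}
```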



Figure 2. Context preloading and prefetching.


2.2.2 Context Preloading and Prefetching

When a running task meets a cache miss, the task's priority is dynamically lowered. When the priority has been lowered several times and becomes lower than that of a task in the PCB, the context is pushed out to the PCB and the PCS entry is filled with the higher-priority task from the PCB. Because the active task is now one in another PCS, the replacement is performed in the background of the active task's execution. After this context preloading is done, the PCS becomes a candidate for the next active task execution. The PCB can virtually increase the number of built-in task/thread contexts in a multithreaded processor as long as the preloading works well. All tasks in the PCSs and PCB can become active fully under hardware control. On the other hand, tasks in the CSB are initiated and prefetched into the processor when an active task executes a special swap instruction (CSI in section 2.4). The task that has executed the instruction is degraded to the PCB and then to the CSB, while another task in the CSB is loaded into the PCB. The swapping is performed by hardware in the background of the active task's execution. Figure 2 depicts context preloading between the PCSs and the PCB, and prefetching between the PCB and the CSB. In the background of the active thread execution in PCS 2, the context of PCS 1 is moved to PCB entry 2 while the content of that PCB entry is preloaded into PCS 1. After that, if the degraded context in PCB entry 2 had executed the swap instruction, it is swapped for one in the CSB. The locations for swapping (A and X in the figure) are specified in advance of the execution of the swap instruction.
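The following fragment sketches, in C, the spill decision just described: each cache miss lowers the running task's effective priority, and once it drops below the best candidate waiting in the PCB, the two contexts are exchanged in the background. It is an illustrative software model only; the field names, the priority encoding, and the swap primitive are assumptions.

```c
/* Illustrative model of the miss-driven preloading policy (not hardware RTL). */
typedef struct {
    int base_prio;   /* static priority assigned by the OS                 */
    int eff_prio;    /* dynamically lowered on each cache miss             */
    int valid;
} slot_t;

extern slot_t pcb[4];

/* Assumed hardware primitive: exchange a PCS context with a PCB entry
 * while another PCS keeps executing (background preloading).            */
extern void background_swap(slot_t *pcs_slot, slot_t *pcb_entry);

void on_cache_miss(slot_t *running)
{
    running->eff_prio--;                 /* lower priority on every miss  */

    /* Find the highest-priority context currently parked in the PCB
     * (larger eff_prio = higher priority in this sketch).                */
    slot_t *best = NULL;
    for (int i = 0; i < 4; i++)
        if (pcb[i].valid && (!best || pcb[i].eff_prio > best->eff_prio))
            best = &pcb[i];

    /* Spill the degraded context and preload the better one in background. */
    if (best && best->eff_prio > running->eff_prio)
        background_swap(running, best);
}
```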

2.2.3 Fast Response to Interrupt Requests

Generally, when an interrupt occurs, the interrupt handler must save a processor context through the execution of store instructions. This can be a bottleneck with respect to a fast response to the interrupt and a barrier to real-time processing. PRESTOR-1 has four PCSs dedicated to interrupt routines and automatically switches to one of them when an interrupt occurs, which eliminates the need to save contexts in software. When control returns from the interrupt routine, a set of load instructions is not needed to restore the saved context, since the previous PCS is automatically and quickly activated through the execution of a return-trap instruction. This mechanism enables a fast response to interrupts. The priority of an interrupt is fixed by hardware. When an interrupt with higher priority occurs while some interrupt routine is already running, the higher interrupt request should preempt the lower one and start immediately. This is why there are multiple PCSs for interrupts. The multiple PCSs permit further interrupts to be received and accepted without context-switching overhead during the processing of an interrupt. The organization of the PCSs for interrupt processing differs slightly from those for normal processing. To return to the previous PCS in overlapped interrupt processing, each PCS for interrupts has a PPCSP (previous PCS pointer). Figure 3 illustrates how the active PCS is switched when the processor receives interrupt requests. First, PCS 3 for normal processing is active. When an interrupt request that uses PCS 4 occurs, the CPCSP (current PCS pointer) changes and points to PCS 4 (the dotted line in the figure). At the same time, the PPCSP of PCS 4 is set to 3 in order to return to PCS 3 after the interrupt processing has ended. In the figure, an interrupt request that uses PCS 7, whose priority is higher than that of PCS 4, occurs during the processing in PCS 4. The CPCSP then becomes 7 (broken line) and the PPCSP of PCS 7 is set to 4. Similarly, a further interrupt with a still higher priority, which uses PCS 6, generates the solid line and the value 7 in the PPCSP of PCS 6. Execution of a RETT (return from trap) instruction at the end of each interrupt routine causes the active PCS to return to the previous processing according to the corresponding PPCSP value. Therefore, after PCS 6 processing, the CPCSP becomes 7. Then it returns to 4, and finally to 3.
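A small C model of the CPCSP/PPCSP bookkeeping may clarify the nesting behavior: entering an interrupt records the current PCS in the new PCS's PPCSP, and RETT unwinds it. This is only a behavioral sketch of the pointer updates described above, with assumed array and variable names.

```c
/* Behavioral sketch of nested interrupt PCS switching (not the hardware). */
#define PCS_COUNT 8          /* 4 normal + 4 interrupt PCSs                */

static int cpcsp = 3;        /* current PCS pointer; PCS 3 runs normal code */
static int ppcsp[PCS_COUNT]; /* previous PCS pointer, one per interrupt PCS */

/* Interrupt accepted: switch to the PCS assigned to this interrupt. */
void enter_interrupt(int interrupt_pcs)
{
    ppcsp[interrupt_pcs] = cpcsp;  /* remember where to return            */
    cpcsp = interrupt_pcs;         /* no software register save needed    */
}

/* RETT at the end of the handler: unwind to the previous PCS. */
void return_from_trap(void)
{
    cpcsp = ppcsp[cpcsp];
}

/* Example from Figure 3: 3 -> 4 -> 7 -> 6, then unwind 6 -> 7 -> 4 -> 3. */
```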






2.3 Reconfigurable Caches


PRESTOR-1 has reconfigurable instruction and data caches. Both caches can be reconfigured for priority-based partitioning. In addition, the data cache can be reconfigured so that a part of the cache memory forms a FIFO buffer that receives data sequences from the outside.

2.3.1 Priority-based Partitioning

The efficiency of processor caches tends to be lower in a multithreaded processor than in a single-threaded processor, since multiple threads share a common cache at the same time. PRESTOR-1 has reconfigurable caches to keep cache hit rates high for tasks/threads with higher priority. In the reconfigurable caches, a cache memory is dynamically divided into multiple partitions, and each partition can be used for distinct tasks/threads. We designed the instruction and data caches as four-way set-associative. Each way can become a partition and be dynamically allocated to a task/thread. For example, the active task with the highest priority uses two partitions, the next highest task uses one partition, and the other tasks share the remaining partition. For data consistency, all partitions are searched during every cache access. The cache behaves as a conventional cache upon a cache hit. The allocation of the partitions takes effect only when a missed cache block is being filled. Therefore, only a task identifier for each partition and a mechanism to select a block entry to be replaced and overwritten are required to enable the partitioning. This partitioning has the merit that a partition of a task with higher priority is not affected by the execution of lower-priority tasks. Therefore, cache efficiency is kept to some extent according to the priority of tasks.
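The sketch below illustrates the replacement side of this scheme in C: lookup probes all four ways, but a miss refills only a way whose partition is assigned to the requesting task. The way-ownership encoding and helper names are assumptions for illustration; the real control logic is in hardware, and every task is assumed to own or share at least one way.

```c
/* Illustrative C model of way-partitioned refill in a 4-way cache. */
#define WAYS 4

typedef struct {
    int owner_task[WAYS];   /* task identifier allocated to each way/partition */
} set_partition_t;

extern int way_hits(int set, int way, unsigned tag);       /* assumed tag probe   */
extern int pick_victim_among(int set, unsigned way_mask);   /* assumed LRU/random  */

/* Returns the way index that hits, or the way selected for refill on a miss. */
int lookup_or_refill(const set_partition_t *p, int set, unsigned tag, int tid)
{
    /* 1. For data consistency, every access searches ALL partitions. */
    for (int w = 0; w < WAYS; w++)
        if (way_hits(set, w, tag))
            return w;                     /* behaves as a conventional cache */

    /* 2. On a miss, only ways owned by the requesting task may be replaced,
     *    so higher-priority tasks' partitions are never evicted by others.  */
    unsigned mask = 0;
    for (int w = 0; w < WAYS; w++)
        if (p->owner_task[w] == tid)
            mask |= 1u << w;

    return pick_victim_among(set, mask);
}
```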

Figure 3. Switching of PCSs upon the occurrence of interrupts.

2.3.2 FIFO Buffer

Sequential accesses to a large amount of stride data occur in scientific computing, media processing, and main-memory database processing [7], and they cause fragmentation in a conventional cache memory since the accesses may have neither temporal nor spatial locality. In real-time embedded systems, media processing with stream data is spreading. We proposed a mechanism for stride data transfer (SDT), which is a kind of DMA provided by a memory controller [10]. To make the mechanism work efficiently, PRESTOR-1 has a FIFO buffer that can be used to receive a stride or continuous data sequence. After one partition of the data cache is reconfigured to provide a FIFO buffer, the partition is addressed by two pointers, one for read and the other for write. The read and write pointers are automatically incremented by one at a read operation by the execution pipeline or at a write operation by the reply data from outside the processor, respectively. When the execution pipeline issues a read request (triggered by a load instruction) to the FIFO buffer and the value of the read pointer is equal to that of the write pointer, no valid data exists in the FIFO and a miss signal is asserted. The FIFO does not need to be indexed by any address other than the two pointers. The structure needs no tag matching for searching, which simplifies the hardware of our reconfigurable cache. To ensure data consistency when the FIFO is used, cached data in a dirty state whose memory address is the same as that of data to be referenced in the FIFO buffer must not reside in any other partition. Cache blocks containing such dirty data must be written back to memory prior to the use of the FIFO buffer. Otherwise, the DMA mechanism might send stale data to the FIFO. A special instruction, FU-Store, is provided for efficient write-back operations, as described in the next section.
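A minimal C sketch of the two-pointer protocol described above: a read drains the FIFO partition and signals a miss when it catches up with the write pointer. The buffer depth, wrap-around handling, and names are assumptions (full-condition handling is omitted for brevity); the real buffer is one cache partition managed in hardware.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 128          /* assumed number of entries in the partition */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    unsigned rd;                /* advanced by pipeline reads                 */
    unsigned wr;                /* advanced by reply data from outside        */
} fifo_t;

/* DMA/SDT reply data arriving from the memory controller. */
void fifo_write(fifo_t *f, uint32_t value)
{
    f->data[f->wr % FIFO_DEPTH] = value;
    f->wr++;
}

/* Load issued by the execution pipeline.  Returns false (miss asserted)
 * when read and write pointers are equal, i.e. no valid data is present. */
bool fifo_read(fifo_t *f, uint32_t *out)
{
    if (f->rd == f->wr)
        return false;           /* miss signal: FIFO empty                    */
    *out = f->data[f->rd % FIFO_DEPTH];
    f->rd++;
    return true;                /* no tag match needed, only the two pointers */
}
```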




2.4 Extended Instructions


PRESTOR-1 utilizes instructions that are implementation dependent in the SPARC version 8 architecture specification to provide several extended instructions. This implementation dependence makes it feasible to use the existing compiler, gcc, and assemblers in application development. Our RTOS kernel described in Section 3 exploits these extended instructions in its implementation, mainly to improve efficiency.

2.4.1 IPCSI: Inter-PCS Instructions

Instructions that enable access to any register in any PCS are provided to realize fast inter-thread communication and fast trap handling. The following three instructions utilize load/store instructions from/into an alternate space in the SPARC version 8 architecture. IPCS (Inter-PCS)-Load is an instruction that loads from an alternate space specified by an address space identifier (ASI) that includes information on the destination PCS. A value read from memory or the cache is written to a register in the destination PCS. IPCS-Store is an instruction that stores into an alternate space specified by an ASI that includes a source PCS number. A register value in the source PCS is stored into memory or the cache. IPCS-Add uses a load instruction with an ASI that includes a destination PCS number and performs an addition. The sum of two register values in the active PCS is written to a register in the destination PCS. (The sum is originally a memory address in a load instruction.) When the zero register, %r0, is specified as either of the two source registers, a simple register move can be performed.

2.4.2 CSI: Context Swapping Instructions

This instruction utilizes a store instruction into an alternate space. The context of the task/thread that has executed this instruction is moved to the PCB and then to an entry of the CSB specified in advance, while another task/thread in the CSB, also specified in advance, is loaded into the PCB. The swapping is performed in the background of the active task's execution.

2.4.3 CLFI: Cache Line Forcing Instructions

The following instructions, forced-fill load (FF-Load) and forced-update store (FU-Store), utilize a load/store instruction from/into an alternate space. These instructions implement explicit control of the data cache and support fast DMA without a hardware-controlled cache coherence mechanism. FF-Load is a load instruction that forces the data cache to fill a cache entry with a 32-byte block from memory regardless of whether the reference hits in the cache. At the same time, the addressed data, whose size is specified by the instruction, is written to the destination register. FU-Store is a store instruction that updates memory with a cache block if and only if the specified block exists in the cache and is in a dirty state. The source register value of the store is ignored and therefore written to neither the cache nor memory. This instruction can be issued as a non-blocking instruction and used for cache scrubbing (consistency) in advance of the FIFO reconfiguration.

2.4.4 DCPI: Data Cache Prefetch Instructions

The content of the zero register (%r0) is always zero in the SPARC architecture. A load instruction whose destination register is %r0 and whose size is 4 or fewer bytes has no architectural effect. We make this type of load instruction function as a data prefetch instruction when the address of the load is in a cacheable space. Such an instruction can be combined with other extended instructions. For example, an FF-Load instruction whose destination is %r0 can forcibly fill a cache entry before the succeeding, corresponding load instruction is executed.

2.4.5 BTI: Byte Twisting Instructions

Instructions for reversing the endian byte order are provided by utilizing load/store instructions from/into alternate space with special ASIs. For a load, a value read from memory or the cache is twisted and the twisted value is written to the destination register. For a store, a source register value is twisted and the twisted value is stored. These instructions can be combined with other extended instructions. For example, when combined with IPCS-Add, the calculated result is twisted and the resulting value is written to the destination register.

2.4.6 PILI: Processor Interrupt Level Instructions

A critical section among interrupt handlers can be controlled by regulating the processor interrupt level (PIL). Although the PIL can be changed by executing the WRPSR instruction, which rewrites the processor status register (PSR) in SPARC version 8, this procedure requires the execution of several instructions beforehand to generate the PSR value. Therefore, we provide extended instructions, Save&Write PIL and Restore PIL, for directly modifying only the PIL field of the PSR with an immediate value in the instruction. Save&Write PIL utilizes the WRASR instruction, which is originally for writing to ancillary state registers (ASRs). ASRs are implementation-dependent registers in SPARC version 8. The execution of this instruction moves the value of the current PIL to a shelter register, and at the same time an immediate value in the instruction is written to the PIL field of the PSR. Distinct shelters are prepared for each PCS dedicated to interrupts. Restore PIL utilizes the RDASR instruction, which reads from the ASRs. The execution of this instruction restores the PIL field with the value in the corresponding shelter.
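As a concrete illustration of the DCPI idea in section 2.4.4, the following C fragment issues a plain SPARC load whose destination is %r0 (written %g0 in assembler syntax) so that it acts only as a prefetch; this uses standard SPARC V8 syntax and no PRESTOR-1-specific encoding. Combining it with FF-Load would additionally require the implementation-specific ASI, which is not reproduced here.

```c
/* Hint the data cache to fetch the block containing 'addr' ahead of use.
 * The load targets %g0, so the loaded value is discarded; on PRESTOR-1
 * such a load to a cacheable address is treated as a data prefetch.      */
static inline void dcache_prefetch(const void *addr)
{
    __asm__ volatile ("ld [%0], %%g0" : : "r" (addr) : "memory");
}

/* Example: prefetch the next array block while processing the current one. */
void sum_blocks(const int *a, int n, long *sum)
{
    for (int i = 0; i < n; i += 8) {
        if (i + 8 < n)
            dcache_prefetch(&a[i + 8]);   /* overlap the miss with computation */
        for (int j = i; j < i + 8 && j < n; j++)
            *sum += a[j];
    }
}
```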

Table 2. System calls in the standard profile.

Task management:
  (i)act_tsk                    Activate task
  can_act                       Cancel task activation
  ext_tsk                       Terminate invoking task
  ter_tsk                       Terminate task
  chg_pri                       Change task priority
  get_pri                       Reference task priority

Task dependent synchronization:
  slp_tsk, tslp_tsk             Put task to sleep
  (i)wup_tsk                    Wakeup task
  can_wup                       Cancel task wakeup
  (i)rel_wai                    Release task from waiting
  sus_tsk                       Suspend task
  rsm_tsk                       Resume suspended task
  frsm_tsk                      Forcibly resume suspended task
  dly_tsk                       Delay task

Task exception handling:
  (i)ras_tex                    Raise task exception
  dis_tex                       Disable task exception
  ena_tex                       Enable task exception
  sns_tex                       Reference task exception state

Synchronization & communication:
  (i)sig_sem                    Release semaphore
  wai_sem, twai_sem, pol_sem    Acquire semaphore
  (i)set_flg                    Set event flag
  clr_flg                       Clear event flag
  wai_flg, twai_flg, pol_flg    Wait for event flag
  snd_dtq, tsnd_dtq, (i)psnd_dtq, (i)fsnd_dtq   Send to data queue
  rcv_dtq, trcv_dtq, prcv_dtq   Receive from data queue
  snd_mbx                       Send to mailbox
  rcv_mbx, trcv_mbx, prcv_mbx   Receive from mailbox

Memory pool management:
  get_mpf, tget_mpf, pget_mpf   Acquire fixed-sized block
  rel_mpf                       Release fixed-sized block

Time management:
  set_tim                       Set system time
  get_tim                       Reference system time
  isig_tim                      Supply time tick
  sta_cyc                       Start cyclic handler
  stp_cyc                       Stop cyclic handler

System state management:
  (i)rot_rdq                    Rotate task precedence
  (i)get_tid                    Reference task ID
  (i)loc_cpu                    Lock the CPU
  (i)unl_cpu                    Unlock the CPU
  dis_dsp                       Disable dispatching
  ena_dsp                       Enable dispatching
  sns_ctx                       Reference contexts
  sns_loc                       Reference CPU state
  sns_dsp                       Reference dispatching state
  sns_dpn                       Reference dispatch pending state

3. RTOS Kernel
This section describes the organization of our RTOS kernel and the extensions to the kernel for multithreaded execution on PRESTOR-1.



3.1 Organization
Our RTOS kernel provides all functions and system calls in the standard profile defined by the ITRON4.0 specification.² The standard profile consists of basic rules for task states and their transitions, task scheduling, static APIs, and system calls for task management, synchronization and communication, memory pool management, time management, and so on. Table 2 shows the seventy system calls in the standard profile. (In the table, system calls listed with the (i) prefix also have an i-prefixed variant, for example iact_tsk, for invocation from interrupt handlers.) Generally, task routines in ITRON applications are described using these system calls, which have C language interfaces. Keeping to the rules of the standard profile ensures compatibility and portability of software. The programming environment we provide allows the static-API-based description defined in ITRON4.0, which facilitates software development. Table 3 shows the eleven static APIs, and Figure 4 depicts the process of executable binary generation. In addition, our programming environment provides several libraries that utilize the hardware mechanisms and the special instructions described in sections 2.3 and 2.4, so that application programmers can easily improve performance. In the ITRON programming model, tasks, communication means, and memory pools are implemented as objects. The static API functions are written in a configuration file (the bottom-left file in the figure) and translated by the configurator into code (the Config. files in the figure) that instantiates the various objects, together with header files. After that, a final executable binary file is generated through compilation by gcc (the GNU C compiler) and linking with the kernel and library object files. Our RTOS kernel provides adaptability: the task scheduling algorithm can be selected from the static, fixed priority that ITRON originally specifies, rate monotonic, earliest deadline first, least laxity first, and the mixed method we proposed [12], according to the objectives of the system/application. Our implementation completely separates the scheduling function from other processing. Therefore, we can configure the kernel by merely replacing the scheduling function; a minimal sketch of this interface follows the footnote below. In addition, to adapt to rate monotonic, earliest deadline first, and least laxity first scheduling, the static API function CRE_TSK allows programmers to describe the period or worst-case execution time as operands. In the next section, we used the fixed priority-based scheduling for evaluation.
² In the ITRON4.0 specification, there are four profiles: full set, standard profile, automotive control profile, and minimum set. The standard profile aims at strictly defining the set of standard functions while maintaining the loose standardization policy that allows adaptability to hardware [8].
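The following C sketch illustrates, under assumed type and function names, how a kernel whose scheduler is isolated behind a single function pointer can switch among fixed-priority, rate-monotonic, EDF, or least-laxity selection by replacing that one function; it is not the actual kernel source.

```c
/* Illustrative pluggable-scheduler interface (assumed names, not kernel code). */
typedef struct task {
    struct task *next;          /* link in the ready queue              */
    int priority;               /* smaller value = higher priority      */
    unsigned long abs_deadline; /* used by earliest deadline first      */
} task_t;

/* The kernel calls this single hook to pick the next task to dispatch. */
typedef task_t *(*scheduler_fn)(task_t *ready_queue);

static task_t *select_fixed_priority(task_t *q)
{
    task_t *best = q;
    for (task_t *t = q; t; t = t->next)
        if (t->priority < best->priority)
            best = t;
    return best;
}

static task_t *select_edf(task_t *q)
{
    task_t *best = q;
    for (task_t *t = q; t; t = t->next)
        if (t->abs_deadline < best->abs_deadline)
            best = t;
    return best;
}

/* Configuring the kernel amounts to replacing this one pointer. */
static scheduler_fn scheduler = select_fixed_priority;  /* or select_edf, ... */
```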



Figure 4. Configuration with static APIs.

Table 3. Static API.

  CRE_TSK   Create task
  DEF_TEX   Define task exception
  CRE_SEM   Create semaphore
  CRE_FLG   Create event flag
  CRE_DTQ   Create data queue
  CRE_MBX   Create mailbox
  CRE_MPF   Create fixed-sized memory pool
  CRE_CYC   Create cyclic handler
  DEF_INH   Define interrupt handler
  DEF_EXC   Define CPU exception handler
  ATT_INI   Attach initialization routine
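To make the flow in Figure 4 concrete, here is a small configuration-file sketch using the static APIs of Table 3. The general shape follows the µITRON4.0 style, but the concrete identifiers, priorities, stack sizes, and interrupt-number parameters are illustrative assumptions, not values taken from the paper.

```c
/* Illustrative configuration file (operands are assumptions, cf. Figure 4). */
INCLUDE(<itron.h>);
INCLUDE(<app.h>);                       /* declares task(), init(), handlers   */

ATT_INI({ TA_HLNG, 0, init });          /* attach the initialization routine    */

/* CRE_TSK(ID, { attributes, exinf, entry, priority, stack size, stack }) */
CRE_TSK(TSK1, { TA_HLNG | TA_ACT, 0, task, 1, 1024, NULL });

CRE_SEM(SEM1, { TA_TFIFO, 1, 1 });      /* binary semaphore                     */

DEF_INH(INHNO_TIMER, { TA_HLNG, timer_handler });   /* interrupt handler binding */
```

The configurator translates these static APIs into object-instantiation code and header files, which are then compiled with gcc and linked with the kernel and library object files to produce the final binary, as described above.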


3.2 Extension for the Multithreaded Processor


Our first implementation of the kernel was for a single-threaded processor [12]. For future embedded systems that require higher performance and better real-time properties, we extended the kernel to execute on a multithreaded processor. In the second implementation, targeting the multithreaded processor PRESTOR-1, kernel data protected by lock/unlock mechanisms can be shared by all executing threads. The original lock/unlock mechanisms are concerned only with protection against interrupts. Thread switching, however, can occur independently of interrupts, for example on cache misses. Therefore, to prevent the kernel data from being corrupted, thread switching must be disabled while a thread is accessing such kernel data. PRESTOR-1 can dynamically enable and disable thread switching according to a bit in a special control register. The kernel utilizes this mechanism and quickly enables or disables thread switching by updating the control bit; a sketch of this locking pattern follows this paragraph. Using this version of the kernel, two or more tasks can be in the running state simultaneously, which improves system performance and throughput. The priority management of tasks in the kernel and the priority-based thread-switching mechanisms of PRESTOR-1 work together, which enhances the real-time properties of the system. The kernel also utilizes the fast interrupt response mechanisms of PRESTOR-1. Interrupt handlers attached to the binary code by the static API DEF_INH are quickly invoked in cooperation with the hardware mechanisms for fast interrupt response. Application programmers only have to specify, in the static API, the correspondence between interrupt processing and the PCS for that interrupt. The specified PCS is used when the interrupt occurs. The priority of the interrupt, specified in the static API, is directly used by the hardware mechanisms for multiple and overlapped interrupt processing. This version of the kernel can also utilize the processor context buffer of PRESTOR-1 described in section 2.2.1. In an initialization step, as many execution threads as there are PCSs and PCB entries are activated. Each thread then picks a task from the task ready queue managed by the kernel when it becomes active, and executes the task code. After the initialization, all threads in the PCSs and PCB are controlled fully by hardware. The priority-based hardware control of the PCSs and PCB cooperates with the real-time scheduling of the RTOS kernel.
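A hedged sketch of that locking pattern in C: the kernel clears a thread-switch-enable bit before touching shared kernel data and restores it afterwards. The control-register accessors, the bit position, and the function names are assumptions for illustration; the paper only states that one bit in a special control register enables and disables thread switching.

```c
#include <stdint.h>

/* Assumed accessors for the special control register; the real register
 * and bit position are implementation details not given in the paper.   */
extern uint32_t read_thread_ctrl(void);
extern void     write_thread_ctrl(uint32_t value);

#define THREAD_SWITCH_ENABLE  (1u << 0)   /* assumed bit position */

/* Disable hardware thread switching around accesses to shared kernel data. */
static inline uint32_t kernel_lock(void)
{
    uint32_t prev = read_thread_ctrl();
    write_thread_ctrl(prev & ~THREAD_SWITCH_ENABLE);
    return prev;                           /* caller restores this later */
}

static inline void kernel_unlock(uint32_t prev)
{
    write_thread_ctrl(prev);
}

/* Usage: updating a shared kernel structure safely. */
void enqueue_ready(void *tcb)
{
    uint32_t s = kernel_lock();            /* no thread switch can intervene */
    /* ... manipulate the shared ready queue ... */
    kernel_unlock(s);
}
```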





Table 5. Total execution time.

  Execution type    Total execution cycles   # of cache misses
  Single(10)              33,213                   405
  Multi(10)               31,710                   448
  Multi_pcb(10)           33,280                   495
  Single(50)              49,413                   405
  Multi(50)               37,117                   446
  Multi_pcb(50)           39,791                   487
  Single(100)             69,663                   405
  Multi(100)              48,121                   441
  Multi_pcb(100)          46,905                   474
  Single(150)             89,913                   405
  Multi(150)              58,727                   442
  Multi_pcb(150)          54,721                   474
  Single(200)            110,163                   405
  Multi(200)              68,305                   444
  Multi_pcb(200)          64,755                   475

4. Evaluation
We evaluated the RTOS kernel and the hardware mechanisms of PRESTOR-1 by using a cycle-based processor simulator that executes binary codes in the same manner as PRESTOR-1. The simulator traces program execution and reports elapsed clock cycles and the number of cache misses. The latency settings for memory references and PCB accesses are shown in Table 4. In the simulation, several latencies for main memory accesses (Miss in the table) were attempted. The PCB-preloading latency of 18 cycles results from the size of a PCB entry and the bandwidth between the PCS and the PCB. We prepared a task set of eight tasks that were independent of each other, that is, there is no communication or synchronization between tasks. Four tasks in the task set have higher fixed priority and the other four have lower priority. These tasks were activated simultaneously by executing the system call act_tsk. Each task program executes several ITRON4.0 system calls and includes a loop structure in which a data array is scanned. Execution of these tasks tends to cause many cache misses while scanning the data array. The binary code consisting of the tasks and the kernel was input to the simulator, and execution was simulated until all tasks finished. The results are shown in Table 5. The table includes the total execution clock cycles and the number of cache misses that occurred during execution. In the table, Single(10) means that the execution used a single PCS, which is single-threaded execution. The number in parentheses indicates the cache miss penalty; for example, in Single(10), every cache miss took ten cycles for a missed block to be filled. Multi means execution with four PCSs, but without the PCB. Multi_pcb is execution that uses four PCSs and four PCB entries. From the results, we can see that the Multi and Multi_pcb executions are faster than Single execution, except when the miss penalty is as short as 10 cycles, although the multithreaded executions cause more cache misses than single-threaded execution. In addition, at 100 cycles of miss latency or more, the PCB mechanism decreases the execution time. In general, multithreaded execution tends to increase cache misses because of contention among threads in the shared cache memory. However, the performance improvement from the efficiency of multithreaded execution exceeded the performance degradation from the increase in cache misses. Moreover, multithreaded execution applied to multiple tasks could decrease the software overheads of task scheduling and dispatching, which contributed to the reduction of execution cycles. Under 100 cycles, Multi_pcb was not better than Multi. This is because of the increased number of cache misses and the preloading latency. The PCB mechanism can sometimes cause useless preloading or thrashing when the memory access latency is relatively low; that is, the PCB does not work well when a degraded thread becomes ready again shortly afterwards. In the future, the speed gap between processors and memory systems will become larger, even in embedded systems, and therefore the PCB mechanism can be effective.

Table 4. Simulation settings.

  L1 I/D-cache access   Hit: 1 cycle
                        Miss: 10, 50, 100, 150, or 200 cycles
  PCB preload           18 cycles (background processing)



5. Conclusion
In this paper, we outlined PRESTOR-1, a multithreaded processor we developed with an ASIC process, described its architectural mechanisms, and presented the implementation of a real-time operating system kernel for the multithreaded processor based on the ITRON4.0 specification, which is widely used in the embedded system industry, expecting that multithreaded processors will be used even in embedded system development for further performance improvement and high-quality real-time processing. The processor includes a hardware mechanism, the processor context buffer (PCB), that virtually increases the number of built-in threads/tasks; it can automatically swap thread/task contexts based on priority, hide cache miss penalties, and, as a result, enhance real-time properties by alleviating the software overheads of task scheduling and dispatching in the RTOS. Using this kernel on the PRESTOR-1 architecture, we evaluated task set execution by cycle-based simulation. The results showed that the PCB mechanism was effective when memory access latency was high, which means the mechanism will become more effective in the future as the speed gap between processors and memory systems grows, even in embedded systems. PRESTOR-1 runs on an evaluation board that has various interfaces such as PCI, USB, UART, PCMCIA, and so on (Figure 5). On the board, several devices (a graphics card, an Ethernet card, a keyboard/mouse, a sound device, and a disk storage card) are controlled by PRESTOR-1 and system software with the RTOS kernel. In the future, we will evaluate actual embedded-system applications on this board.

Figure 5. Evaluation board.


References
[1] http://gcc.gnu.org/.
[2] http://www.access-company.com/.
[3] http://www.assoc.tron.org/.
[4] http://www.mispo.co.jp/.
[5] http://www.t-engine.org/.
[6] http://www.toppers.jp/en/.
[7] H. Garcia-Molina and K. Salem. Main Memory Database Systems: An Overview. IEEE Trans. on Knowledge and Data Engineering, 4(6):509-516, Dec. 1992.
[8] T. A. ITRON Committee. ITRON4.0 Specification Ver.4.00.00.
[9] K. Tanaka. Fast Context Switching by Hierarchical Task Allocation and Reconfigurable Cache. In Proc. of Intl. Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, pages 20-29, 2003.
[10] K. Tanaka. Highly Functional Memory Architecture for Large-Scale Data Application. In Proc. of Intl. Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, pages 109-118, 2004.
[11] K. Tanaka. PRESTOR-1: A Processor Extending Multithreaded Architecture. In Proc. of Intl. Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, pages 91-98, 2005.
[12] K. Tanaka. Real-Time Adaptive Task Scheduling. In Proc. of Intl. Conference on Embedded Systems and Applications, pages 24-30, 2005.
[13] SPARC International, Inc. The SPARC Architecture Manual Version 8. Prentice-Hall, Inc., 1992.
[14] W. D. Weber and A. Gupta. Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results. In Proc. of Intl. Symposium on Computer Architecture, pages 273-280, 1989.

