
Computer Architecture 10ELB019

Task Two

Mehul. M. Mitha B015938

Benchmarking
This report is a performance study of assembler, C and C++ code simulated on a variety of ARM microprocessors at two clock frequencies and with two memory configurations and their access times. The aim is to see how the memory configuration, the processor architecture and the processor speed affect the performance of the code; these variations produce bottlenecks in the individual processor that depend on its architecture. This report goes into detail on how the ARM7TDMI and ARM9TDMI microprocessors perform with different memory configurations.

Introduction

ARM microprocessors are very versatile and are used in many applications such as embedded systems. For someone designing a new piece of hardware, this report gives a brief insight into using an ARM microprocessor with a particular memory configuration and clock speed. The processor, processor speed and memory type selected affect the power consumption, the performance and the manufacturing cost; this is a key factor to consider, especially for mobile devices. Because of these factors, in-depth research is essential to find the most suitable combination for the desired application. The results presented here should reduce the development time for the desired hardware, as they make it easier to select the processor type, processor speed and memory type.
The data provide sets of results indicating the memory usage of the code and static data, the number and types of cycles executed, the total clock cycles and the execution time.

Procedure

To conduct the tests, the ARM development suite was used. This industry-standard software precisely simulates the ARM microprocessor on a desktop or laptop computer in most environments. The ARMulator tools can also be used to simulate a system-on-chip (SoC). To achieve accurate results and to show how the two ARM processors execute, three types of code were used: a) ARM assembly block copying of data, b) the C Dhrystone benchmark, c) C++ memory allocation. These pieces of code were run on the two processors at 50MHz and 100MHz, with two memory configurations: a) code in ROM and data in SRAM, and b) code in SRAM and data in DRAM. Fig 0.1 gives details of the memory configurations. To keep the results fair, accurate and unbiased, the memory maps and scatter files were not altered during the testing stage and were kept in accordance with the specification.

Memory configuration A: code in ROM and data in SRAM
Memory configuration B: code in SRAM and data in DRAM

Name   Start address   Length      Width (bits)   Read times (ns) NSA/SA   Write times (ns) NSA/SA   Access
ROM    0x10000         0xA000      8              100/100                  N/A                       R
SRAM   0x9000000       0xA000000   32             10/10                    10/10                     RW
DRAM   0x9000000       0xA000000   16             100/80                   100/80                    RW

Fig (0.1) memory configurations (NSA = non-sequential access, SA = sequential access)
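As a rough illustration of how the access times in Fig 0.1 relate to the two clock frequencies, the sketch below converts an access time in nanoseconds into whole clock periods. It assumes a simplified model in which every access occupies a whole number of clock periods; the ARMulator's own cycle accounting may differ, and the names `READ_TIMES_NS` and `cycles_per_access` are illustrative only.

```python
# Read access times in ns as (non-sequential, sequential), taken from Fig 0.1.
READ_TIMES_NS = {"ROM": (100, 100), "SRAM": (10, 10), "DRAM": (100, 80)}


def cycles_per_access(access_ns, clock_mhz):
    """Whole clock periods needed to cover one access (simplified model)."""
    # One clock period lasts 1000 / clock_mhz ns, so the access spans
    # access_ns * clock_mhz / 1000 periods; round up with integer arithmetic.
    return max(1, -(-access_ns * clock_mhz // 1000))


if __name__ == "__main__":
    for name, (nsa_ns, sa_ns) in READ_TIMES_NS.items():
        for mhz in (50, 100):
            print(name, f"{mhz}MHz",
                  "NSA:", cycles_per_access(nsa_ns, mhz),
                  "SA:", cycles_per_access(sa_ns, mhz))
```

Under this model a 100 ns access spans 5 clock periods at 50MHz and 10 at 100MHz, whereas a 10 ns SRAM access fits in a single period at both frequencies, so doubling the clock roughly doubles the cycles spent waiting on slow memory.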


To test the individual codes, a scatter-load file and a map file had to be created and linked into the ARM tools using armulink. The three types of code were built with their respective tools: the assembly code with armasm, the C code with armcc and the C++ code with armcpp.

Results asm

Block copying of data.

Code 1 assembly testing Map A

Fig (1.0) code 1 memory Map A clock

Fig 1.0 shows the total clocks needed to execute the code; the data show that the ARM9TDMI is quicker at processing the code. For the ARM7TDMI the data suggest that the clock speed had no effect on the number of cycles taken to execute the code, which shows that another factor is contributing to the performance.

Code 1 assembly testing Map B

Fig (1.1) code 1 memory Map B clock

Fig 1.1 shows the total clocks needed to execute the code. The data show that the ARM7TDMI is quicker than the ARM9TDMI; the optimal processor here is the ARM7TDMI working at 100MHz. At both 50MHz and 100MHz the total clocks needed to execute the code is 18, which shows that the processor is working at the maximum rate available and that another factor, most probably the memory access times, is restricting the performance, as the ARM9TDMI should be quicker.

These results show that the memory configuration has a major effect on the cycles needed to execute the block copying code. In Map B the code is read from SRAM, which makes reading the code very fast, but the data accesses are slow because the data is written to DRAM, which has slow read and write times. In Map A the data is written to SRAM, whose very fast access times reduce the execution time, but the code is read from ROM, which is slow compared with SRAM. With memory configuration B the processor therefore has to wait while the data is called, accessed and fed back into the processor; this longer process shows up in the wait states for the two memory configurations.

Code 1 wait states Map A

Fig (1.2) code 1 memory Map A wait states

Code 1 wait states Map B

Fig (1.3) code 1 memory Map B wait states

Fig 1.2 and Fig 1.3 show that the Wait States decrease significantly, by around 50%, when the data is written in SRAM, but reading the code is still slow, so the wait states are not completely eliminated. The system will only be as fast as its slowest component: in memory configuration A this is the ROM and in memory configuration B it is the DRAM. For the ARM9TDMI the Wait States are given by the D Wait States added to the I Wait States. The number of Wait States increases when the frequency is increased from 50MHz to 100MHz, even though code 1 accesses the memory sequentially as it is block copying. For Map A the Wait States increase by a factor of 2.053 for both the ARM7TDMI and the ARM9TDMI; for Map B they increase by a factor of 2.138 for both. These factors are not consistent, most likely because the SRAM is very fast, which creates a bottleneck: the data is ready but the processor is still processing the previous data and writing it to DRAM.

Code 1 Core Cycles Map A

Fig (2.0) code 1 Map A Core Cycles

Code 1 Core Cycles Map B

Fig (2.1) code 1 Map B Core Cycles

Fig 2.0 and Fig 2.1 show that as the frequency is increased the Wait States decrease, especially for the ARM9TDMI, and that execution is generally faster with the correct memory configuration, which reduces the bottleneck. If a slow data memory is used, the Wait States increase, raising the total cycles and thus the execution time; this wastes processor power and resources while the processor is waiting. The idle cycles increase but the other cycles are not affected as much.

Total Cycles   Map A                  Map B
               ARM7TDMI   ARM9TDMI    ARM7TDMI   ARM9TDMI
50MHz          1087       1027        524        502
100MHz         2107       2007        986        964

Fig (2.2) code 1 Total Cycles taken to execute

From Fig 2.2 you can see that as the clock frequency is increased from 50MHz to 100MHz, the total cycles increase correspondingly. For Map A, the ARM7TDMI shows an increase by a factor of 1.94 and the ARM9TDMI by a factor of 1.95. For Map B, the ARM7TDMI shows an increase by a factor of 1.88 and the ARM9TDMI by a factor of 1.92. This also shows that both the memory configuration and the processor speed have an effect on performance; the memory access times are causing a bottleneck.
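The increase factors quoted for Fig 2.2 are simply the 100MHz total divided by the 50MHz total. A minimal sketch of that arithmetic, using the values from the table (the names `TOTAL_CYCLES` and `increase_factor` are illustrative, not part of the ARM tools):

```python
# Total cycles for code 1 as (50MHz, 100MHz), transcribed from Fig 2.2.
TOTAL_CYCLES = {
    ("Map A", "ARM7TDMI"): (1087, 2107),
    ("Map A", "ARM9TDMI"): (1027, 2007),
    ("Map B", "ARM7TDMI"): (524, 986),
    ("Map B", "ARM9TDMI"): (502, 964),
}


def increase_factor(cycles_50mhz, cycles_100mhz):
    """Ratio of total cycles at 100MHz to total cycles at 50MHz."""
    return cycles_100mhz / cycles_50mhz


# Factors rounded to two decimal places, as quoted in the text.
FACTORS = {key: round(increase_factor(*pair), 2)
           for key, pair in TOTAL_CYCLES.items()}

if __name__ == "__main__":
    for (memory_map, cpu), factor in FACTORS.items():
        print(memory_map, cpu, factor)
```

A factor close to 2 means the cycle count almost doubles when the clock doubles, so the execution time barely improves; that is the signature of a memory-bound workload.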


Results C

This is a Dhrystone benchmark.

Code 2 C Code testing Map A

Fig (3.0) code 2 memory Map A Clock

Code 2 C Code testing Map B

Fig (3.1) code 2 memory Map B Clock

From Fig 3.0 and 3.1, the ARM9TDMI with the memory Map A configuration was quickest to complete the Dhrystone benchmark, as the code is read from ROM and the data is read and written in SRAM, which is more efficient. The increase in clock speed does not have a great effect on the clocks taken; as seen with the assembly code, this shows that the memory configuration is the limiting factor. The Dhrystone benchmark gives only a close representation, as it is run in the ARM tools, which run under the Windows operating system on an Intel processor. The Dhrystone result is therefore not as accurate as one obtained on the actual hardware.

Code 2 C Code Core Cycles Map A

Fig (3.2) code 2 core cycles Map A

In Fig 3.2 the idle cycles increase for the ARM9TDMI as the clock frequency is increased from 50MHz to 100MHz. This shows that the code is being read from ROM but cannot be accessed fast enough, so when the processor calls for the next instruction it has to wait a few cycles to receive it; the wait times increase by a factor of 1.968. The S Cycles / ID Cycles, N Cycles / Ibus Cycles and C Cycles / Dbus Cycles remain constant, and the clock frequency has little or no effect on them. This verifies that the memory access times are causing the bottleneck.

Code 2 C Code Core Cycles Map B

Fig (3.3) code 2 core cycles Map B

In Fig 3.3 the idle cycles increase as the clock frequency is increased, by a factor of 2.20, which is more than double, while all the other cycles remain constant. This is because the data is being read and written in DRAM while the code runs from SRAM; the DRAM is causing the bottleneck.

Code 2 C Code idle cycle Map A vs. Map B


Fig (4.0) code 2 core Idle Cycles

From Fig 4.0 the data show that neither the increase in frequency nor the change of memory configuration affects the idle cycles for the ARM7TDMI. This is due to the architecture of the processor, which uses a 3-stage pipeline with a Von Neumann architecture; the idle cycles therefore remain constant and the memory access times are not a factor. The ARM9TDMI uses a 5-stage pipeline with a Harvard architecture; consequently the memory access times do cause a bottleneck, as the processor is able to deal with the higher memory timings.

Results C++

This is a memory allocation operation, also known as malloc, a C/C++ library routine which performs dynamic memory allocation.

Code 3 C++ Wait States

Wait States    Map A                  Map B
               ARM7TDMI   ARM9TDMI    ARM7TDMI   ARM9TDMI
50MHz          0          0           71648      71648
100MHz         0          0           152804     152804

Fig (4.1) code 3 C++ wait states

From Fig 4.1 the data show that there are no wait states for memory configuration A: the code is being read from ROM with no lag and the data is written in SRAM without any memory access time issues. This also shows that the processor is working at its optimal level and that all the pipeline stages are filled without any wait gaps. With memory configuration B, the Wait States increase by a factor of 2.13 when the clock frequency is increased, but the number of Wait States remains constant when changing processor from ARM7TDMI to ARM9TDMI; this is most likely due to code limitations. This trend is also apparent in the total cycles.

Code 3 C++ Total Cycles

Total Cycles   Map A                  Map B
               ARM7TDMI   ARM9TDMI    ARM7TDMI   ARM9TDMI
50MHz          47081      37394       118729     109042
100MHz         47081      37394       199885     190198

Fig (4.1) code 3 C++ Total Cycles

For Map A, the total cycles taken to execute the code remain constant when the clock frequency is increased. This indicates that the rate at which the code is being accessed from memory is slow compared with the rate at which the processor can work, so the cycles taken remain the same. For Map B, the total cycles increase by a factor of 1.68 for the ARM7TDMI and by a factor of 1.74 for the ARM9TDMI when the clock frequency is increased. This shows that doubling the clock frequency does not mean that the total cycles will double, indicating that the code is being accessed at a sufficient rate but the data is not written to the DRAM fast enough. This is why the ARM7TDMI has a marginally smaller increase factor than the ARM9TDMI, which has a 5-stage pipeline. The ARM7TDMI can only do a FETCH, DECODE then EXECUTE, whereas the ARM9TDMI has separate instruction and data busses that allow simultaneous access. Instructions and data are handled at the same time, which in turn increases the bandwidth, making the process faster and more efficient.
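The factors quoted for the C++ code can be checked directly from the two Code 3 tables. The sketch below also subtracts the wait states from the total cycles; that subtraction is an observation made here from the tabulated values rather than a step described in the report, and all names in the code are illustrative.

```python
# Code 3 values as (50MHz, 100MHz), transcribed from the Wait States and
# Total Cycles tables.
WAIT_STATES = {
    ("Map A", "ARM7TDMI"): (0, 0),
    ("Map A", "ARM9TDMI"): (0, 0),
    ("Map B", "ARM7TDMI"): (71648, 152804),
    ("Map B", "ARM9TDMI"): (71648, 152804),
}
TOTAL_CYCLES = {
    ("Map A", "ARM7TDMI"): (47081, 47081),
    ("Map A", "ARM9TDMI"): (37394, 37394),
    ("Map B", "ARM7TDMI"): (118729, 199885),
    ("Map B", "ARM9TDMI"): (109042, 190198),
}


def increase_factor(pair):
    """Ratio of the 100MHz value to the 50MHz value."""
    low, high = pair
    return high / low


def non_wait_cycles(key):
    """Total cycles minus wait states at (50MHz, 100MHz)."""
    return tuple(t - w for t, w in zip(TOTAL_CYCLES[key], WAIT_STATES[key]))


if __name__ == "__main__":
    for key in TOTAL_CYCLES:
        print(key, "non-wait cycles:", non_wait_cycles(key))
    for cpu in ("ARM7TDMI", "ARM9TDMI"):
        key = ("Map B", cpu)
        print(cpu, "total x", round(increase_factor(TOTAL_CYCLES[key]), 2),
              "wait x", round(increase_factor(WAIT_STATES[key]), 2))
```

With these values the non-wait cycles on Map B equal the Map A totals (47081 for the ARM7TDMI and 37394 for the ARM9TDMI) at both frequencies, which suggests that the whole difference between the two maps is accounted for by wait states on the DRAM.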


Code 3 C++ Total idle cycles Map A

Fig (5.0) code 3 total Idle Cycles

As the data in Fig 5.0 show, the total idle cycles remain fairly low when compared with the ARM9TDMI running on Map B; this is purely down to the ARM9TDMI being able to use all five stages of its pipeline. With Map B the data is written in DRAM, which again is slow, so the data is kept in the latches waiting to be written. The reason why the ARM7TDMI does not also have higher total idle cycles is that the DRAM is actually faster than the rate at which the processor is working, which is why there is no difference in its total idle cycles between Map A and Map B.

Code 3 C++ Clock Map A

Fig 5.1 and 5.2 show that in Map A the total clock taken to execute the C++ code decreases when the clock frequency is doubled: for both the ARM7TDMI and the ARM9TDMI there is a 50% increase in performance. For Map B, the ARM7TDMI shows only a 15.8% increase in performance in the clocks needed to execute the code. The ARM9TDMI shows only a 5.1% increase in performance, and the clocks needed to execute the code are actually around 1/3 greater than for the ARM7TDMI. This is because the ARM9TDMI is performing the FETCH, DECODE, EXECUTE, MEMORY and WRITEBACK stages at the same time and the DRAM cannot handle this flow of data, so the clock count is higher. This general trend is also seen in the core cycles.
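As a cross-check on the performance gains quoted for the ARM7TDMI, the sketch below estimates the execution time as total cycles divided by clock frequency, using the Code 3 Total Cycles values. This is an assumption made here: the report's percentages come from the clock data behind Fig 5.1 and 5.2, which are not reproduced in the text, so the same calculation is not guaranteed to match every quoted figure. The function names are illustrative.

```python
# ARM7TDMI total cycles for code 3 as {clock_mhz: cycles}, from the
# Code 3 Total Cycles table.
ARM7_TOTAL_CYCLES = {
    "Map A": {50: 47081, 100: 47081},
    "Map B": {50: 118729, 100: 199885},
}


def execution_time_ms(cycles, clock_mhz):
    """Assumed model: time = cycles / frequency, returned in milliseconds."""
    return cycles / (clock_mhz * 1000)


def time_reduction_percent(cycles_by_mhz):
    """Percentage drop in execution time going from 50MHz to 100MHz."""
    slow = execution_time_ms(cycles_by_mhz[50], 50)
    fast = execution_time_ms(cycles_by_mhz[100], 100)
    return (1 - fast / slow) * 100


if __name__ == "__main__":
    for memory_map, cycles_by_mhz in ARM7_TOTAL_CYCLES.items():
        print(memory_map, round(time_reduction_percent(cycles_by_mhz), 1), "%")
```

Under this assumption the ARM7TDMI gains 50% on Map A and about 15.8% on Map B, in line with the figures quoted for that processor.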

Fig (5.1) code 3 Total Clock Map A

Code 3 C++ Clock Map B

Fig (5.2) code 3 Total Clock Map B

Code 3 C++ Core Cycles Map A

Fig (5.3) code 3 Core Cycles Map A

Code 3 C++ Core Cycles Map B

Fig (5.4) code 3 Core Cycles Map B


The general trend seen for the assembly and C code is followed by the C++ code, as seen in Fig 5.3 and 5.4: as the clock frequency is increased, the idle cycles increase for the ARM9TDMI when DRAM is used for the data, as this type of memory has a slow access time. When SRAM is used for the data, however, the idle cycles are minimised. For both processors and both memory configurations the S Cycles / ID Cycles, N Cycles / Ibus Cycles and C Cycles / Dbus Cycles remain constant, and the clock frequency has little or no effect on them. The data also show that for the ARM9TDMI the DRAM is the cause of the bottleneck, as when SRAM is used the bottleneck is reduced and the code takes fewer clocks to execute.

Static Data

                                            Code 1 Block Copying (block.s)   Code 2 C Dhrystone   Code 3 C++ Memory Allocation
Total RO Size (Code + RO Data)              0.10 Kb                          25.06 Kb             9.71 Kb
Total RW Size (RW Data + ZI Data)           0.19 Kb                          10.30 Kb             0.30 Kb
Total ROM Size (Code + RO Data + RW Data)   0.29 Kb                          25.06 Kb             9.71 Kb

The static data show the memory usage of the individual codes.
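Because the ROM size is defined as Code + RO Data + RW Data and the RO size as Code + RO Data, the RW Data and ZI Data portions can be estimated by subtraction. The sketch below does this with the rounded values quoted above, so the results are only accurate to about 0.01 Kb; the dictionary and function names are illustrative.

```python
# Static sizes in Kb as quoted in the Static Data section.
SIZES_KB = {
    "Code 1 block.s": {"ro": 0.10, "rw": 0.19, "rom": 0.29},
    "Code 2 C Dhrystone": {"ro": 25.06, "rw": 10.30, "rom": 25.06},
    "Code 3 C++ memory allocation": {"ro": 9.71, "rw": 0.30, "rom": 9.71},
}


def split_rw(sizes):
    """Estimate (RW Data, ZI Data) in Kb from the three quoted totals."""
    # ROM = Code + RO Data + RW Data and RO = Code + RO Data,
    # so RW Data = ROM - RO.
    rw_data = round(sizes["rom"] - sizes["ro"], 2)
    # RW total = RW Data + ZI Data, so ZI Data = RW total - RW Data.
    zi_data = round(sizes["rw"] - rw_data, 2)
    return rw_data, zi_data


if __name__ == "__main__":
    for name, sizes in SIZES_KB.items():
        rw_data, zi_data = split_rw(sizes)
        print(f"{name}: RW Data ~ {rw_data} Kb, ZI Data ~ {zi_data} Kb")
```

On these figures the block copy code carries all of its 0.19 Kb as initialised RW data, whereas the Dhrystone and C++ codes appear to hold almost all of their RW region as zero-initialised data, which does not need to be stored in ROM.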

Conclusion

The results presented in this report clearly show that different memory configurations affect the performance of the ARM7TDMI and ARM9TDMI. The increase in clock frequency also has a significant effect on the performance, but not the one expected. The data show that the ARM9TDMI is marginally faster than the ARM7TDMI under certain circumstances, i.e. with the correct memory configuration. The clock speed does not show a clear positive correlation with performance, as seen in the C++ code. This comes down to the memory access times during the execution of the code; the clock speed also does not have a major effect on the non-sequential and sequential access times, as expected.

Overall, the increase in the processor clock frequency results in more Wait States, as the data is not being written fast enough. The increase in the Wait States means that the data transfer rate is not as fast as the processor, so the processor is not working at its optimal level and the bottleneck is caused by the memory configuration. This is particularly noticeable on the ARM9TDMI when using Map B, where the data is in DRAM. The ARM7TDMI uses a 3-stage pipeline and a Von Neumann architecture, in which the same bus is used for both instructions and data, whereas the ARM9TDMI uses a more complex Harvard architecture, which has separate instruction and data busses.

There is also a slight performance advantage in using DRAM over ROM for the initial code access, as the sequential access time is improved by 25%, although this has not been tested. For the initial code access the SRAM is a major improvement over ROM: an improvement of 1000% in non-sequential access time and of 800% in sequential access time, a level of improvement that is noticeable when accessing the code. This result is not seen in the code 1 assembly block copying, most likely because the assembly code has a small number of instructions compared with the C and C++ code.

The results in this report show that reading code from SRAM is faster than reading it from ROM, and that reading and writing data in SRAM is quicker than in DRAM. The major drawback is that SRAM is a form of volatile memory, so the system has to have a continuous power supply and in the event of a power outage the code and data will be wiped from the memory. Consequently it is better to opt for the code being in ROM and the data being in SRAM, so that in the event of a power outage the code is not wiped and the system can be reset.

Another factor to consider is that using high-speed memory with higher processor clocks increases the power consumption; this could be a major deciding factor for a mobile application, where increasing the battery life is key. SRAM is very expensive compared with ROM and DRAM, so choosing SRAM for the data will increase the product cost. Trade-offs have to be made between power consumption, manufacturing cost and the retail price to consumers: if the performance increase is not noticeable to consumers, then the extra cost of using SRAM is not viable.

In conclusion, designers and engineers have to be careful when choosing the processor and memory configuration, so that the hardware setup matches its intended purpose. The bottlenecks and Wait States during execution can be minimised by choosing a specific type of memory configuration. To help minimise the bottleneck, the code also has to be efficient, as the nature of the code affects the overall performance.

Note: All the graphs represented in this technical data sheet and the RAW data are also presented in the appendix.
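The access-time comparisons quoted in the conclusion can be recomputed from the read times in Fig 0.1. The sketch below expresses each comparison as a plain ratio of access times; treating a ratio of 10 as the "1000%" and a ratio of 1.25 as the "25%" quoted in the conclusion is an interpretation made here, and the names are illustrative.

```python
# Read access times in ns as (non-sequential, sequential), from Fig 0.1.
READ_TIMES_NS = {"ROM": (100, 100), "SRAM": (10, 10), "DRAM": (100, 80)}
NSA, SA = 0, 1  # tuple indices for the two access types


def speed_ratio(slower, faster, access_type):
    """How many times shorter the access time of `faster` is than `slower`."""
    return READ_TIMES_NS[slower][access_type] / READ_TIMES_NS[faster][access_type]


if __name__ == "__main__":
    print("ROM -> SRAM, non-sequential:", speed_ratio("ROM", "SRAM", NSA))
    print("ROM -> SRAM, sequential:    ", speed_ratio("ROM", "SRAM", SA))
    print("ROM -> DRAM, sequential:    ", speed_ratio("ROM", "DRAM", SA))
    print("DRAM -> SRAM, sequential:   ", speed_ratio("DRAM", "SRAM", SA))
```

With the Fig 0.1 values the ROM to SRAM ratio is 10 for both access types and the ROM to DRAM sequential ratio is 1.25; a ratio of 8 arises only for DRAM to SRAM sequential accesses (80 ns against 10 ns), which is worth bearing in mind when reading the 800% figure.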
