White Paper

Building an Efficient, Tightly Coupled Embedded System Using an Extensible Processor
June 2014

Authors

Jeroen Geuzebroek
Sr. R&D Engineer,
Synopsys Inc.

Ad Vaassen
Sr. System Engineer
Synopsys Inc.

Abstract

The increasing demand for better filtering and processing capabilities of the processor within embedded
systems results in a trend to shift from 8-bit microcontroller tightly coupled embedded systems towards 32-bit
processor bus-based embedded systems. As a consequence, the power, performance and area (PPA) ratio of
these systems also shifts in favor of performance at the cost of power and area. However, ultra low power and
small area are the main drivers for these embedded systems. This white paper describes how closely coupled
memories and processor extensions can be leveraged to improve the power and area of these embedded
systems by making the bus infrastructure superfluous. Removing the bus infrastructure reduces area costs as
well as the latency observed when accessing memories and peripheral registers. Reduced latency translates
into performance improvement and power reduction. Improving the PPA ratio of an embedded system by
tightly coupling memories and peripherals is demonstrated by comparing a bus-based implementation of a
sensor hub with a tightly coupled implementation. The tightly coupled implementation results in a minimum
area reduction of 7K equivalent NAND2 gates and a 2.0x energy reduction.

Introduction
Embedded systems exist in many devices that are a part of our daily lives, not only in high-performance
devices such as smart phones and tablets, but also in devices requiring low energy consumption like medical
monitors, hearing aids, ID cards and wearable electronics. These embedded systems perform dedicated tasks
in a very efficient and optimized way. In general they consist of a deeply embedded processor that collects
data from embedded peripherals, filters and processes this data, and returns the processed data to an
application host processor or another peripheral.
The right PPA ratio is key for such processor-based embedded systems. Ultra low power and small area are
the main drivers. Ultra low power is very important as these embedded systems are often found in battery-powered
devices with a limited energy budget and where a long battery life is important. Small area is required
to reduce BoM costs and limit the device size as embedded systems are often put in very low-cost devices
where there is limited room for the (packaged) embedded system. The performance of the embedded systems
in the low-cost, low-energy devices only needs to be as good as is required to perform its dedicated tasks
within the specified timing constraints, while still enabling low energy consumption and small area. The PPA
ratio of an embedded system differs from the ratio in a high-end host processor in a mobile device, where
performance is the main driver and power and area are sacrificed.
Embedded systems based on an 8-bit microcontroller with tightly coupled peripherals have always been
very popular due to their small size and low-power operation. Today, the dedicated tasks running on these
embedded systems are becoming more complex, resulting in a demand for better filtering and processing
capabilities of the deeply embedded processor. This has resulted in a shift from 8-bit microcontroller based
embedded systems to 32-bit processor-based embedded systems [1, 2].

[Block diagram: a 32-bit CPU with JTAG and system control on an AHB bus; memory controllers for embedded SRAM and embedded ROM; a bridge to an APB bus serving the connectivity peripherals IP0, IP1, ..., IPN.]
Figure 1: Typical 32-bit processor bus-based embedded system

Figure 1 depicts a typical 32-bit processor-based embedded system. Besides the processor, the embedded
system consists of embedded memory (ROM and SRAM), memory controllers, a system controller that takes
care of clock and reset, and several connectivity peripherals, e.g., I2C, GPIO, SPI, ADC, DAC, etc. The processor
communicates with all the IP and memories using an external (multi-layer) AHB bus via one or more AHB interfaces.
An AHB2APB bridge and an APB bus are added to communicate with the connectivity peripherals.
Integrating a 32-bit processor enables the execution of more complex tasks with higher performance demands
than was possible with an 8-bit microcontroller based system. However, the typical 32-bit processor-based
embedded system architecture, shown in Figure 1, does result in a shift in the PPA ratio. The available performance
increases, at the cost of power and area. The additional hardware bus infrastructure for communicating with
embedded memory and peripherals costs area (gates) and power. Even though a 32-bit processor-based
embedded system enables complex tasks that are not possible with an 8-bit microcontroller, its success may be
limited when the requirements for small area and low power are not satisfied.
To keep the right PPA ratio for embedded systems, developers want an embedded system with the performance
of a 32-bit processor and the area footprint and lower power of an 8-bit microcontroller with tightly coupled
peripherals and memories.
The remainder of this paper describes how a configurable processor can be leveraged to optimize an embedded
system for power and area, even while improving the performance of a 32-bit microcontroller.

Tight integration of memories and peripherals


Processors that can be configured to contain closely coupled instruction and data memories no longer need to
access the AHB bus to get instructions and data from embedded memories. This removes latencies introduced
by the AHB multi-layer bus (e.g., by an arbiter that arbitrates between instruction and data fetches) and enables
performance improvements for any application.
Still, the AHB and APB buses remain, allowing the processor to access the system controller and the peripherals.
These buses, including arbiters and bridges, are costly, both in terms of area (several KGates) and in latency. Each
peripheral read request forces the core to translate the request into an AHB read transaction, which requires two
passes through an AHB2APB bridge: one for the request and one for the subsequent response. Each pass takes
(at least) two clock cycles. In the best case scenario, where the CPU and bus frequency are identical, this results
in a minimum of four clock cycles per transaction. In typical applications, where the CPU clock frequency is higher
than the bus frequency, the latency on the processor side can become significantly higher. All peripherals share the
same APB bus. This also adds latency as only one APB transaction for one peripheral can be active at a time and
a pending APB transaction for one peripheral will block APB transactions to all other peripherals until the pending
APB transaction has finished.
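
The latency arithmetic above can be sketched in a few lines of Python. This is only an illustrative model, not a description of a particular bus implementation: the function name is introduced here, and the model assumes that each bridge pass costs exactly two bus cycles, that there is no arbitration delay, and that the latency seen by the CPU scales linearly with the ratio of CPU clock to bus clock. The 15 MHz and 5 MHz values in the second call are a hypothetical clock pairing.

```python
from fractions import Fraction


def peripheral_read_latency_cpu_cycles(cpu_hz, bus_hz, passes=2, cycles_per_pass=2):
    """Minimum peripheral read latency seen by the CPU, in CPU clock cycles.

    Simplifying assumptions: each bridge pass costs exactly `cycles_per_pass`
    bus cycles and no other transaction is pending on the bus.
    """
    bus_cycles = passes * cycles_per_pass          # 2 passes x 2 cycles = 4
    return Fraction(cpu_hz, bus_hz) * bus_cycles   # convert to CPU cycles


# Best case from the text: CPU and bus run at the same frequency.
same_clock = peripheral_read_latency_cpu_cycles(15_000_000, 15_000_000)

# Hypothetical case: the CPU clock is three times the bus clock.
slow_bus = peripheral_read_latency_cpu_cycles(15_000_000, 5_000_000)

print(same_clock, slow_bus)
```

Under these assumptions the four-cycle minimum grows in proportion to the clock ratio, which is why a slower peripheral bus is felt more strongly on the processor side.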

The ability to add custom registers and interfaces to a processor core enables integrators to tightly couple
peripherals with a custom interface and to have direct access from the processor to the peripherals with custom
registers. Tightly coupling all peripherals removes the need for the bus infrastructure and its area and latency
(performance) penalties, which will improve the PPA ratio. Having tightly coupled interfaces to each peripheral also
enables concurrent access to each peripheral, as there is no shared bus interface any longer.

ARC EM Core with ARC Processor Extensions


DesignWare ARC processor cores [3] can be configured to contain Closely Coupled Memories (CCMs) for
Instructions (ICCM) and Data (DCCM), removing the need to access the AHB bus for memory accesses.
The ARC Processor EXtensions (APEX) technology [4,5,6] enables designers to add custom instructions to the
ARC processor core. APEX gives ARC cores the ability to perform certain functions in hardware and to directly
control the added hardware accelerators from the processor pipeline. The performance improvements enable
a power reduction, as the clock frequency of the processor can be lowered while still meeting the targeted
performance [6,7,8]. APEX is not limited to adding custom instructions to an ARC core. It also provides the ability
to add custom auxiliary registers and custom external interface signals to the processor core. The method
for leveraging these APEX auxiliary registers and external interfaces to tightly couple a peripheral to the ARC
processor is described in the following three steps, also depicted in Figure 2. In order to achieve this, the RTL HDL
code of the peripheral must be available and it must be possible to change it.
1) Start by removing the APB bus interface from the peripheral. The ARC processor will directly access the
peripheral using APEX auxiliary registers and will not rely on a bus interface.
2) For each existing peripheral register, create an exact APEX auxiliary register copy of this register and remove
the original peripheral register. Four billion (2^32) auxiliary registers can be specified. You should choose an address
range for the auxiliary registers that enables use of the same address offsets for each auxiliary register that would
otherwise be reserved in the external address space to access each peripheral register via the AHB/APB bus.
This simplifies software porting as now the memory mapped I/O calls used to access the peripheral register only
need to be replaced by LR/SR (load from auxiliary register/store to auxiliary register) instructions while keeping the same
register offsets.
3) Add APEX external interface signals for the new APEX registers and connect these to the peripheral internal
register interface. As a result, the ARC processor becomes tightly coupled to the peripheral internals using the
APEX auxiliary registers and APEX external interface.
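
The software side of step 2 can be illustrated with a small Python model. Everything named here is hypothetical: the register offsets, the two base addresses, and the `RegisterSpace` class, which merely stands in for memory-mapped loads and stores on one side and for LR/SR accesses on the other. The sketch only shows why keeping the register offsets identical confines the porting effort to the access primitive and the base address.

```python
# Hypothetical register offsets of one peripheral (illustrative only).
CTRL, STATUS, DATA = 0x00, 0x04, 0x08

MMIO_BASE = 0xF000_0000  # assumed base address on the AHB/APB bus
AUX_BASE = 0xF000_0000   # assumed auxiliary-register base with matching offsets


class RegisterSpace:
    """Toy 32-bit register store used to model both access paths."""

    def __init__(self):
        self._regs = {}

    def write(self, addr, value):
        self._regs[addr] = value & 0xFFFF_FFFF

    def read(self, addr):
        return self._regs.get(addr, 0)


def start_transfer(write, read, base, payload):
    """Driver routine written against offsets, not absolute addresses."""
    write(base + DATA, payload)
    write(base + CTRL, 1)
    return read(base + DATA)


bus = RegisterSpace()  # stands in for memory-mapped loads and stores
aux = RegisterSpace()  # stands in for LR/SR accesses to auxiliary registers

before = start_transfer(bus.write, bus.read, MMIO_BASE, 0xABCD)
after = start_transfer(aux.write, aux.read, AUX_BASE, 0xABCD)
print(hex(before), hex(after))
```

The driver routine is unchanged between the two calls; only the accessor functions and the base address passed to it differ.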

[Diagram of the three steps for an ARC EM4 and a peripheral, showing the peripheral's APB interface, the register map, and the register interfaces on the APEX and peripheral sides.]
Figure 2: Tightly coupled peripherals with APEX

When all IP blocks have been tightly coupled to the ARC processor core, the external bus interface, including
bridges, arbiters, adapters, etc., becomes superfluous and can be removed, resulting in the tightly coupled
embedded system depicted in Figure 3.

[Block diagram: an ARC EM4 with JTAG debug, timer, interrupt controller, execute/commit pipeline, ICCM, DCCM and APEX hardware accelerators, with system control and the connectivity peripherals IP0, IP1, ..., IPN attached directly.]
Figure 3: ARC+APEX integrated embedded system

Such an integrated embedded system not only has a lower area cost than a typical bus-based
embedded system; the tight integration of the peripherals also reduces the latency of
accessing the memories and peripheral registers. The processor core accesses the auxiliary registers
in one cycle instead of a minimum of four cycles for the peripheral registers in a bus-based system. If there are
many peripheral register accesses, the power and area savings can be significant. The next section demonstrates
the PPA improvements by tightly coupling memories and peripherals to a processor by means of an embedded
sensor system.

Optimized Sensor Hub Implementation


Sensors have become an integral part of many consumer devices such as smart phones and tablets. Designers of
such devices typically do not want the application processors in these devices to be over-burdened with collecting,
filtering and processing sensor data. Therefore the designers deploy embedded systems with deeply embedded
processors to collect, filter and process sensor data, and then provide the result to the application processor.
This way, a power-hungry application processor can be put into sleep mode, while an ultra low-power, sensor-optimized
embedded system takes care of the sensor processing function and only wakes up the application
processor when required.
This section highlights a processor-based sensor hub for determining the orientation of a device to demonstrate
the PPA improvements by tightly coupling peripherals and memories to the embedded processor core. The
sensor hub collects sensor data from a magnetometer, accelerometer and a gyroscope, filters and processes the
collected sensor data to determine the orientation of a device, and sends the results wirelessly to a host. Normally
a customer using the ARC processor for a sensor application would use tightly coupled memories and peripherals.
In order to make a quantitative comparison with 32-bit processors that are bus-based we implemented both an
ARC EM4 tightly coupled implementation and an ARC EM4 bus-based implementation.

Figure 4 depicts the bus-based implementation of such a sensor hub, implemented with an ARC EM4 without
closely coupled memories and with bus-based peripherals. The system contains two I2C masters and one SPI
master to collect the sensor data from the different sensor transducers concurrently, GPIO to act upon other
events in the system, and a UART to communicate with a host.

[Block diagram: an ARC EM4 (JTAG debug, IFQ, timer, interrupt controller, execute/commit pipeline) with AHB-I, AHB-D and AHB peripheral interfaces on a multi-layer AHB; memory controllers for embedded SRAM and embedded ROM; an AHB2APB bridge to an APB bus with a UART (host) and two I2C masters, an SPI master and GPIO (connectivity).]
Figure 4: A bus-based implementation of the sensor hub

Figure 5 shows the tightly-coupled implementation of the same sensor hub. Compared to the typical bus-based
implementation provided in Figure 4, the external memories are replaced by ICCM and DCCM and the peripherals
are tightly coupled to the ARC processor using APEX technology. CCMs and tightly coupled peripherals have
made the external bus infrastructure superfluous.

[Block diagram: an ARC EM4 (JTAG debug, timer, interrupt controller, execute/commit pipeline) with ICCM and DCCM, and with a UART (host) and two I2C masters, an SPI master and GPIO (connectivity) attached directly.]
Figure 5: ARC + APEX tightly coupled implementation of the sensor hub

Table 1 shows the area savings in NAND2 equivalent gates achieved by the tightly coupled implementation.
As the external bus infrastructure has become superfluous, the area of the AHB multilayer and the AHB2APB
bridge is saved. Inside the EM4 core, memory requests do not have to be converted to AHB transactions, saving
the area of the EM4 internal AHB bus converters and initiators. Also, the Instruction Fetch Queue (IFQ) is no
longer required. The IFQ is present in the bus-based implementation to have a small prefetch queue available
in the core, in this case of one instruction word, to reduce the latency impact of AHB memory accesses on the
processor performance. In total, 7K NAND2 equivalent gates are saved by the tightly coupled implementation of
the sensor hub. The area savings do not include the memory controllers, as memory controllers are present in
both implementations.

Component          NAND2 equivalent gates   Comments
AHB interconnect   3650                     AHB interconnect and interfaces
AHB2APB bridge     1850                     AHB2APB bridge
IFQ                1500                     Instruction fetch queue
Total              7000                     Typical bus-based vs. ARC+APEX optimized

Table 1: Sensor hub gate count reduction with an ARC + APEX tightly coupled implementation

Note that most of the area savings, such as those from the AHB multi-layer bus, AHB interfaces, and the IFQ, are
independent of the number of peripherals implemented. Only the area savings of the AHB2APB bridge will slightly
change with the number of attached peripherals. This means that smaller (low-cost) embedded systems with fewer
peripherals will benefit more from a tightly coupled implementation with respect to area. The size of the IFQ, a
configurable option of the ARC EM4 core, enables designers to make a balanced trade-off between performance,
power and area, e.g., a larger IFQ size improves performance at the cost of area. The area reduction by a tightly
coupled implementation will vary between 7K and 15.8K equivalent NAND2 gates, depending on the otherwise
chosen IFQ size in a bus-based implementation.
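
As a quick check of the figures above, the following Python sketch sums the entries of Table 1 and derives the IFQ contribution implied by the 15.8K upper bound. The implied IFQ value is an inference made here, under the assumption that only the IFQ size differs between the two bounds; the paper itself does not state it.

```python
# Gate counts from Table 1 (NAND2 equivalent gates).
savings = {"AHB interconnect": 3650, "AHB2APB bridge": 1850, "IFQ": 1500}

minimum_total = sum(savings.values())

# Upper bound quoted in the text. Assumption: only the IFQ size changes
# between the minimum and the maximum area reduction.
maximum_total = 15_800
fixed_part = minimum_total - savings["IFQ"]
implied_largest_ifq = maximum_total - fixed_part

print(minimum_total, fixed_part, implied_largest_ifq)
```

In this reading, the part of the saving that does not depend on the IFQ configuration is the AHB interconnect plus the AHB2APB bridge.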
In the remainder of this section we will quantify the performance and power benefits of the tightly coupled solution.
The sensor application runs at 100 Hz, meaning that within 10 ms the EM4 core needs to:
1. Collect the sensor data
2. Process the sensor data
3. Send the sensor data to a host (UART)

In order to meet these real-time constraints, the EM4 core, the memories and the AHB multi-layer bus in the bus-based
implementation need to run at 15 MHz. To save power, the APB interfaces of the peripherals are
kept at 5 MHz. In order to meet the same real-time constraints the tightly coupled implementation only needs to
run at 5 MHz.
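
These clock frequencies translate directly into a cycle budget per sample period. A minimal sketch, assuming only the 100 Hz rate and the two core frequencies quoted above (the function name is introduced here for illustration):

```python
def cycles_per_period(core_hz, sample_rate_hz):
    """Core clock cycles available in one sample period."""
    return core_hz // sample_rate_hz


SAMPLE_RATE_HZ = 100
period_ms = 1000 / SAMPLE_RATE_HZ

bus_based_budget = cycles_per_period(15_000_000, SAMPLE_RATE_HZ)
tightly_coupled_budget = cycles_per_period(5_000_000, SAMPLE_RATE_HZ)

print(period_ms, bus_based_budget, tightly_coupled_budget)
```

The tightly coupled implementation therefore has to complete the same three tasks within one-third of the cycle budget of the bus-based implementation.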

[Bar chart, relative to bus-based (0% to 100%), comparing the bus-based and tightly coupled implementations: cycle counts (4.2x reduction) and energy (2.1x reduction).]
Figure 6: Tightly coupled results for processing the sensor data

The bus-based implementation needs to run at a higher clock frequency, as it suffers from the higher latency of
memory fetches and peripheral accesses. The impact of tightly-coupled memories on performance and energy
consumption is illustrated in Figure 6. In the second step of the sensor application, when the sensor data is
processed by the EM4 core, there is no interaction with the peripherals. The cycle counts spent on processing
are independent of the clock frequency as, for each implementation, both core and memories run at the same
frequency. This allows a direct comparison of the cycle counts of the bus-based implementation with the cycle
counts of the tightly coupled implementation. Figure 6 shows the number of cycles the EM4 core spends in the
processing stage relative to the number of cycles spent by the bus-based implementation. The tightly coupled
implementation requires 4.2x fewer cycles for the processing than the bus-based implementation. The tightly
coupled implementation fetches from the memories with single cycle latency while the AHB read transactions of
the bus-based implementation cost several additional cycles due to the AHB bus infrastructure. For the processing
stage, the tightly coupled implementation results in an energy reduction excluding memories of 2.1x, in a 40-nm
technology node. These numbers are a good reflection of the performance and energy gains that can be achieved
with a tightly coupled ARC solution over other 32-bit processor solutions using a bus-based approach.
The first stage of the sensor application, when collecting the sensor data, also demonstrates the impact of the
tight integration of the peripherals. In this stage the I2C and SPI peripherals retrieve the data from the transducers,
after which the EM4 core collects the retrieved data from the peripherals. Therefore the time spent in the collecting
stage depends on both the I/O time the peripherals require to collect the data from the transducers and the time
the core requires to collect the data from the peripherals. For both implementations the same I/O transfer rates are
programmed at the peripherals.

[Bar chart, relative to bus-based (0% to 100%), comparing the bus-based and tightly coupled implementations: elapsed time and energy, with a 1.6x reduction annotated.]
Figure 7: Tightly coupled results for collecting the sensor data

To be able to compare the time spent in the collecting stage we have to compare elapsed time because the
I/O transfers of the peripherals translate into different cycle counts for different core clock frequencies. Figure 7
shows the elapsed time and energy consumption of the tightly coupled implementation relative to the bus-based
implementation. Even though the core in the tightly coupled implementation runs at only one-third of the clock
frequency of the bus-based implementation, the same amount of time is spent on collecting the data from the
peripherals. In the tightly coupled implementation the processor is able to retrieve the data from the different
peripherals with less latency compared to the bus-based implementation, where the shared APB bus infrastructure
reduces the performance.
While the peripherals are retrieving sensor data, the EM4 core is waiting for data to become available. In the
bus-based implementation, the core consumes power at 15 MHz while waiting. In contrast, in the tightly coupled
implementation, the core consumes power at only 5 MHz. For the data collection part of the sensor application,
the tightly coupled implementation results in 1.6x energy reduction excluding memories compared to the bus-based
implementation.

[Bar chart, relative to bus-based (0% to 100%), comparing the bus-based and tightly coupled implementations: energy (2.0x energy reduction).]
Figure 8: Tightly coupled results for sensor application

Figure 8 shows the energy consumption of the tightly coupled implementation compared to the bus-based
implementation during a complete iteration of collecting, processing and sending the sensor data. The tightly
coupled implementation results in an energy reduction of 2.0x excluding memories compared to the bus-based
implementation. The energy reduction has been achieved by both removing the AHB and APB bus infrastructure
and by reducing the core clock frequency due to the performance improvements of the tight integration of
memories and peripherals.
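
The reduction factors quoted in this section can be converted to the "relative to bus-based" percentages used on the axes of Figures 6 to 8. The sketch below performs only that conversion; the helper name is introduced here, and the percentages are derived from the quoted factors, not read from the charts.

```python
def relative_percent(reduction_factor):
    """Tightly coupled value as a percentage of the bus-based value."""
    return 100.0 / reduction_factor


# Reduction factors as reported in the text.
reported = {
    "processing cycles": 4.2,
    "processing energy": 2.1,
    "collection energy": 1.6,
    "total energy": 2.0,
}

relative = {name: round(relative_percent(f), 1) for name, f in reported.items()}
print(relative)
```

A 2.0x energy reduction, for example, corresponds to the tightly coupled implementation consuming half the energy of the bus-based implementation for one complete iteration.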

Conclusion
The trend to shift from tightly coupled embedded systems utilizing an 8-bit microcontroller towards 32-bit
processor bus-based embedded systems also shifts the PPA ratio of these embedded systems in favor of
performance at the cost of power and area. Closely coupled memories together with ARC APEX provide a means
to tightly couple memories and peripherals to an ARC processor core and make the area- and latency-expensive
bus infrastructure redundant. This reduces both the power consumption and area costs of the embedded system
without sacrificing performance. A comparison between a bus-based implementation and a tightly coupled
implementation of a sensor hub system has shown that the tightly coupled implementation results in a minimum
area reduction of 7K NAND2 equivalent gates, and a 2.0x energy reduction.

References
[1] Migrating to Andes from 8051, https://www.semiwiki.com/forum/content/3171-migrating-andes-8051.html, Semiwiki.com, February 2014
[2] Migrating from 8051 to Cortex Microcontrollers, Application Note 237, http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0237a/index.html, ARM
[3] ARC Simplifies Processor Configurations, http://www.design-reuse.com/news/8838/arc-simplifiesprocessor-configurations.html, ARC, October 2004
[4] DesignWare Extensions and Options, Synopsys IP website, http://www.synopsys.com/IP/ProcessorIP/ConfigurableExtensions/Pages/default.aspx, Synopsys, Inc.
[5] DesignWare ARC Processor Cores, Synopsys IP website, http://www.synopsys.com/IP/ProcessorIP/ARCProcessors/Pages/default.aspx, Synopsys, Inc.
[6] A. Vaassen and P. Struik, Next Generation Smart Sensor System, SAME 2013 conference, http://www.same-conference.org/images/documents/papers/2013/S4-Paper-Synopsys-2013.pdf, 2013
[7] J. Geuzebroek, Leveraging Processor Configurability to Build an Ultra-Low Power Embedded Subsystem, white paper, https://www.synopsys.com/dw/doc.php/wp/leveraging_processor_extensibility.pdf, Synopsys, Inc., March 2014
[8] P. Struik, Ultra Low-Power 9D Fusion Implementation, white paper, https://www.synopsys.com/dw/doc.php/wp/9d_sensor_fusion_implementation.pdf, Synopsys, Inc., June 2014

Synopsys, Inc. 700 East Middlefield Road Mountain View, CA 94043 www.synopsys.com
2014 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks is
available at http://www.synopsys.com/copyright.html. All other names mentioned herein are trademarks or registered trademarks of their respective owners.
06/14.AP.CS4315.
