
INTRODUCTION

"Will I benefit from multiple processors?" has long been a frequent question, and with
the growing popularity of dual core processors the topic is more important than ever.
Will multiple processors or a dual core processor be beneficial to you, and what are the
differences between them? These are the questions this article will attempt to lay to rest.

A major question for some people getting ready to buy a high-end system is whether they
want or need to have two processors available to them. For anyone doing video editing,
multi-threaded applications, or a lot of multitasking, the answer is a very clear 'yes'. Then
the question becomes whether two separate processors (as in a dual Xeon or Opteron
system) are the way to go, or whether a single dual core processor (like a Pentium D or
Athlon64 X2) will do just as well.

MULTI-CORE

A multi-core CPU (or chip-level multiprocessor, CMP) combines two or more
independent cores into a single package composed of a single integrated circuit (IC),
called a die, or more dies packaged together. A dual-core processor contains two cores
and a quad-core processor contains four cores. A multi-core microprocessor implements
multiprocessing in a single physical package. A processor with all cores on a single die is
called a monolithic processor. Cores in a multicore device may share a single coherent
cache at the highest on-device cache level (e.g. L2 for the Intel Core 2) or may have
separate caches (e.g. current AMD dual-core processors). The processors also share the
same interconnect to the rest of the system. Each "core" independently implements
optimizations such as superscalar execution, pipelining, and multithreading. A system
with N cores is effective when it is presented with N or more threads concurrently. The
most commercially significant (or at least the most 'obvious') multi-core processors are
those used in computers (primarily from Intel & AMD) and game consoles (e.g., the Cell
processor in the PS3). In this context, "multi" typically means a relatively small number
of cores. However, the technology is widely used in other technology areas, especially
those of embedded processors, such as network processors and digital signal processors,
and in GPUs.

DEVELOPMENT
While manufacturing technology continues to improve, reducing the size of single gates,
physical limits of semiconductor-based microelectronics have become a major design
concern. Some effects of these physical limitations can cause significant heat dissipation
and data synchronization problems. The demand for more capable microprocessors
causes CPU designers to use various methods of increasing performance. Some
instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for
many applications, but are inefficient for others that tend to contain difficult-to-predict
code. Many applications are better suited to thread level parallelism (TLP) methods, and
multiple independent CPUs are one common method used to increase a system's overall
TLP. A combination of increased available space due to refined manufacturing processes
and the demand for increased TLP is the logic behind the creation of multi-core CPUs.

DUAL CORE DEFINED


As the tasks that computers can perform get more complicated, and as people desire to do
more at once, computer manufacturers are trying hard to increase speed in order to keep
up with demand. Having a faster CPU has been the traditional way to keep up, since a
faster CPU can finish a task and then quickly switch to work on the next. However, due to size,
complexity and heat issues it has become increasingly difficult to make CPUs faster. In
order to continue to improve performance, another solution had to be found.

Having two CPUs (and a motherboard capable of hosting
them) is more expensive, so computer engineers came up
with another approach: take two CPUs, smash them
together onto one chip, and presto! The power of two
CPUs, but only one socket on the motherboard. This
keeps the price of the motherboards reasonable, and
allows for the power of two CPUs (also known as cores)
with a cost that is less than two separate chips. This, in a
nutshell, is what the term "Dual Core" refers to - two
CPUs put together on one chip.

There are more subtle differences between brands (how they combined two cores onto
one chip, and the speeds they run each core at) that can affect how much of a boost in
performance you can get from having a dual core CPU. Additionally, different types of
programs get differing benefits from having a dual core chip.

DUAL CORE IMPLEMENTATION

Because of the different ways AMD and Intel came into the dual-core market, each
platform deals with the increased communication needs of their new processors
differently. AMD claims that they have been planning the move to dual-core for several
years now, since the first Athlon64s and Opterons were released. The benefit of this can
be seen in the way that the two cores on their processors communicate directly -- the
structure was already in place for the dual cores to work together. Intel, on the other
hand, simply put two of their Pentium cores on the same chip, and if they need to
communicate with each other it has to be done through the motherboard chipset. This is
not as elegant a solution, but it does its job well and allowed Intel to get dual-core
designs to the market quickly. In the future Intel plans to move to a more unified design,
and only time can tell what that will look like.

Intel did not increase the speed of their front-side-bus (the connection
between the CPU and the motherboard) when they switched to dual-core,
meaning that though the processing power doubled, the amount of
bandwidth for each core did not. This puts a bit of a strain on the Intel
design, and likely prevents it from being as powerful as it could be. To
counteract this effect, Intel continues to use faster system memory to keep
information supplied to the processor cores. As a side note, the highest-
end Intel chip, the Pentium Extreme Edition 955, has a higher front-side-bus speed, as
well as having a larger (2MB per core) cache memory and the ability to use
Hyper-Threading (which all non-Extreme Edition Pentium D processors lack). This makes
it a very tempting choice for those wanting to overcome some of the design handicaps of
Intel's dual-core solution.
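
To put rough numbers on this, here is a back-of-the-envelope sketch of how the shared
front-side-bus bandwidth is split between the cores. It assumes an 800 MT/s, 64-bit wide
bus (the speed used by most Pentium D models); the figures are illustrative, not vendor
measurements.

    #include <iostream>

    int main() {
        // Assumed figures: 800 million transfers/s on a 64-bit (8-byte) wide FSB.
        const double transfers_per_second = 800e6;
        const double bytes_per_transfer   = 8.0;
        const int    cores                = 2;

        double total_bw_gbs = transfers_per_second * bytes_per_transfer / 1e9;
        double per_core_gbs = total_bw_gbs / cores;

        std::cout << "Total FSB bandwidth: " << total_bw_gbs << " GB/s\n"; // ~6.4 GB/s
        std::cout << "Bandwidth per core:  " << per_core_gbs << " GB/s\n"; // ~3.2 GB/s
        return 0;
    }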

AMD, on the other hand, does not use a front-side-bus in the traditional
sense. They use a technology called HyperTransport to communicate with
the chipset and system memory, and they have also moved the memory
controller from the chipset to the CPU. By having the memory controller
directly on the processor, AMD has given their platform a large
advantage, especially with the move to dual-core. The latest generation of
AMD single-core processors can use single- or dual-channel PC3200
memory, but it is interesting to note that even though dual-channel operation doubles the
memory speed, it does not double the actual memory performance for single-core
processors. It appears that dual-channel memory simply provides significantly more
bandwidth than a single processor core can use. However, with dual-core processors all
that extra bandwidth can be put to good use, allowing the same technology already
present in single-core chips to remain unchanged without causing the same sort of
bottleneck Intel suffers from.

DUAL CORE PROCESSORS


A dual core processor is a CPU with two separate cores on the same die, each with its
own cache. It's the equivalent of getting two microprocessors in one.
In a single-core or traditional processor the CPU is fed strings of instructions it must
order, execute, then selectively store in its cache for quick retrieval. When data outside
the cache is required, it is retrieved through the system bus from random access memory
(RAM) or from storage devices. Accessing these slows down performance to the
maximum speed the bus, RAM or storage device will allow, which is far slower than the
speed of the CPU. The situation is compounded when multi-tasking.
In this case the processor must switch back and forth between two or more sets of data
streams and programs. CPU resources are depleted and performance suffers.
In a dual core processor each core handles incoming data strings simultaneously to
improve efficiency. Just as two heads are better than one, so are two hands. Now when
one is executing the other can be accessing the system bus or executing its own code.
Adding to this favorable scenario, both AMD and Intel's dual-core flagships are 64-bit.
To utilize a dual core processor, the operating system must be able to recognize multi-
threading and the software must have simultaneous multi-threading technology (SMT)
written into its code. SMT enables parallel multi-threading wherein the cores are served
multi-threaded instructions in parallel. Without SMT the software will only recognize
one core. Adobe Photoshop is an example of SMT-aware software. SMT is also used
with multi-processor systems common to servers.
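
As an illustration of what "written for multiple threads" means in practice, the sketch
below splits a simple workload (summing an array, a made-up example) across two
threads so that a dual core CPU can work on both halves at once. It is not taken from any
particular SMT-aware application.

    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Sum one slice of the data; each call can be scheduled on its own core.
    static long long partial_sum_of(const std::vector<int>& data, std::size_t begin, std::size_t end) {
        return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
    }

    int main() {
        std::vector<int> data(1'000'000, 1);
        long long first = 0, second = 0;

        // Two threads: on a dual core CPU the operating system can run them simultaneously.
        std::thread t1([&] { first  = partial_sum_of(data, 0, data.size() / 2); });
        std::thread t2([&] { second = partial_sum_of(data, data.size() / 2, data.size()); });
        t1.join();
        t2.join();

        std::cout << "Total: " << (first + second) << "\n";
        return 0;
    }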

A dual core processor is different from a multi-processor system. In the latter there are
two separate CPUs with their own resources. In the former, resources are shared and the
cores reside on the same chip. A multi-processor system is faster than a system with a
dual core processor, while a dual core system is faster than a single-core system, all else
being equal.

An attractive feature of dual core processors is that they do not require a new motherboard,
but can be used in existing boards that feature the correct socket. For the average user the
difference in performance will be most noticeable in multi-tasking until more software is
SMT aware. Servers running multiple dual core processors will see an appreciable
increase in performance.

Multi-core processors are the goal, and as process technology shrinks, there is more "real estate"
available on the die. In the fall of 2004, Bill Siu of Intel predicted that current
accommodating motherboards would be here to stay until 4-core CPUs eventually force a
changeover to incorporate a new memory controller that will be required for handling 4
or more cores.

WHY DUAL CORE PROCESSORS?


Processors with two cores are the future, because parallelizing computation increases
performance much more than any incremental clock speed gain ever could. While 4 GHz
is doable from a technical point of view, the disadvantages such as huge thermal
dissipation, large cooling requirements and high energy consumption speak clearly
against it. Eventually, CPUs will go to four cores and even larger numbers.
The key to increasing performance lies in the thread-optimization of applications:
Modular software based on multiple threads can be executed quickly by distributing
these threads to all available processing units, a process called multi-threading. Operating
systems have been doing this on an application level for a while, by assigning processor
time to different applications, which is called multi-tasking. With two cores, your
computer will be much more responsive and is unlikely to stall at any time due to a single
task consuming a lot of processor resources.
It is logical to assume that two processor cores could also draw twice the amount of
power, but that is not the case. Although they do require more power when both cores are
under a high load, the dual core will finish multi-threaded tasks much quicker than a
comparable single core processor. This will result in a total power consumption that is
usually below the total power draw of a single core CPU. In addition, the CPU makers
need to meet their platform specifications and can only change these when a new
platform revision is due. Also, more and more processors use energy saving mechanisms
that put unused processor units to sleep while they are not needed.
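
A rough energy comparison illustrates the point; the wattages and run times below are
purely illustrative assumptions, not measured values, but they show why finishing a
multi-threaded task sooner can outweigh the higher instantaneous power draw.

    #include <iostream>

    int main() {
        // Illustrative assumptions: a single core draws 89 W for 100 s on a job,
        // while a dual core draws 110 W but finishes the same job in 55 s.
        const double single_watts = 89.0,  single_seconds = 100.0;
        const double dual_watts   = 110.0, dual_seconds   = 55.0;

        double single_joules = single_watts * single_seconds; // 8900 J
        double dual_joules   = dual_watts * dual_seconds;     // 6050 J

        std::cout << "Single core energy: " << single_joules << " J\n";
        std::cout << "Dual core energy:   " << dual_joules   << " J\n";
        return 0;
    }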

COMPONENTS OF DUAL CORE PROCESSORS

TEXAS INSTRUMENTS TMS320

Texas Instruments TMS320 is a blanket name for a series of digital signal processors
(DSPs) from Texas Instruments. It was introduced on April 8, 1983 through the
TMS32010 processor, which was then the fastest DSP on the market.
The processor is available in many different variants, some with fixed-point arithmetic
and some with floating point arithmetic. The floating point DSP TMS320C3x, which
exploits delayed branch logic, has as many as three delay slots.

The flexibility of this line of processors has led to it being used not merely as a co-
processor for digital signal processing but also as a main CPU. They all support standard
IEEE JTAG control for development.
The original TMS32010 and its subsequent variants are examples of CPUs with a
Modified Harvard architecture, which features separate address spaces for instruction and
data memory but the ability to read data values from instruction memory. The
TMS32010 featured a fast multiply-and-accumulate operation useful in both DSP applications as
well as transformations used in computer graphics. The graphics controller card for the
Apollo Computer DN570 Workstation, released in 1985, was based on the TMS32010
and could transform 20,000 2D vectors/second.

KILOCORE

Kilocore, from Rapport Inc. and IBM, is a high-performance, low-power multi-core
processor with 1025 cores. It contains a single PowerPC processing core and 1024 8-bit
Processing Elements running at 125 MHz each, which can be dynamically reconfigured,
connected by a shared interconnect. It allows high performance parallel processing.
Rapport's first product to market is the KC256, with 256 8-bit processing elements. The
KC256 started shipping in 2006[1]. The elements are grouped in 16 "stripes" of 16
processing elements each, with each stripe able to be dedicated to a particular task.
The "thousand core" products are the KC1024 and KC1025, due in 2008. Both have 1024
8-bit processing elements, in a 32 x 32-stripe configuration. The KC1025 has the
PowerPC CPU, while the KC1024 has processing elements only.

EMOTION ENGINE

The Emotion Engine is a CPU developed and manufactured by Sony and Toshiba for use
in the Sony PlayStation 2. It consists of a MIPS based core, two Vector Processing Units
(VPU), a graphics interface (GIF), a 10 channel DMA unit, a memory controller, an
Image Processing Unit (IPU) and an input output interface.

At the heart of the Emotion Engine is a two-way superscalar, in-order, MIPS-based core
primarily based on the MIPS III ISA, but it includes some instructions defined by the MIPS
IV ISA. The MIPS-based core consists of two 64-bit fixed-point units and one single-
precision (32-bit) floating-point unit with a six-stage pipeline. To feed the execution units
with instructions and data, there is a 16 KB two-way set-associative instruction cache, an
8 KB two-way set-associative non-blocking data cache and a 16 KB scratchpad RAM.
Both the instruction and data caches are virtually indexed and physically tagged, while the
scratchpad RAM exists in a separate memory space. A combined 48 double-entry
instruction and data translation lookaside buffer is provided for translating virtual
addresses. Branch prediction is achieved by a 64-entry branch target address cache and a
branch history table that is integrated into the instruction cache. The branch mispredict
penalty is three cycles due to the short six-stage pipeline.

The two VPUs (VPU0 and VPU1) provide the majority of the Emotion Engine's floating-
point performance. Each VPU features thirty-two 128-bit registers, sixteen 16-bit fixed-
point registers, four FMAC units, an FDIV unit and a local data memory. The data
memory for VPU0 is 4 KB in size, while VPU1 features a 16 KB data memory. To
achieve high bandwidth, the VPU's data memory is connected directly to the GIF, and
both of the data memories can be read directly by the DMA unit. A single vector
instruction consists of four 32-bit IEEE-compliant single-precision floating-point values,
which are distributed to the four single-precision (32-bit) FMAC units for processing.
Contrary to popular belief, the Emotion Engine is not a 128-bit processor, as it does not
process a single 128-bit value, only four 32-bit values that fit into one 128-bit
register. This scheme is similar to Intel's SSE extensions. The FMAC units have
an instruction latency of four cycles, but as they have a six-stage pipeline, they have a
throughput of one instruction per cycle. The FDIV unit has a nine-stage pipeline and
can execute one instruction every seven cycles.
Communication between the MIPS core, the two VPUs, the GIF, the memory controller and
other units is handled by a 128-bit wide internal data bus running at half the clock
frequency of the CPU. At 300 MHz, the internal data bus provides a maximum
theoretical bandwidth of 2.4 GiB/s. DMA transfers over this bus occur in packets of
eight 128-bit words, achieving a peak bandwidth of 2 GiB/s. The Emotion Engine
interfaces directly to the Graphics Synthesizer via the GIF and a dedicated 64-bit wide,
150 MHz bus with a maximum theoretical bandwidth of 1.2 GiB/s.

Communication between the Emotion Engine and RAM occurs through two channels of
DRDRAM and the memory controller, which interfaces to the internal data bus. The two
channels of DRDRAM have a maximum theoretical bandwidth of 3.2 GiB/s, about 33%
more bandwidth than the internal data bus. Because of this, the memory controller
buffers data sent from the DRDRAM channels so the extra bandwidth can be utilised by
the CPU.

To provide communication between the Emotion Engine and the Input Output Processor
(IOP), the input/output interface bridges a 32-bit wide, 37.5 MHz input/output bus with
a maximum theoretical bandwidth of 150 MB/s to the internal data bus. This interface
provides vastly more bandwidth than is required by the PlayStation's input/output
devices.

The first versions of the PlayStation 3 featured an Emotion Engine on the motherboard to
achieve backwards compatibility with PlayStation and PlayStation 2 titles. However,
subsequent releases of the PlayStation 3, including the initial PAL release, dropped the
Emotion Engine to lower costs. Instead, software emulation is used to provide backwards
compatibility.

GRAPHICS PROCESSING UNIT

A graphics processing unit or GPU (also occasionally called visual processing unit or
VPU) is a dedicated graphics rendering device for a personal computer, workstation, or
game console. Modern GPUs are very efficient at manipulating and displaying computer
graphics, and their highly parallel structure makes them more effective than general-
purpose CPUs for a range of complex algorithms. A GPU can sit on top of a video card,
or it can be integrated directly into the motherboard. More than 90% of new desktop and
notebook computers have integrated GPUs, which are usually far less powerful than their
add-in counterparts.

GRAPHICS ACCELERATORS

A graphics accelerator is a GPU attached to the graphics card that offloads rendering work
from the CPU, making the graphics card perform better.
A graphics accelerator incorporates custom microchips which contain special
mathematical operations commonly used in graphics rendering. The efficiency of the
microchips therefore determines the effectiveness of the graphics accelerator. They are
mainly used for playing 3D games or high-end 3D rendering.

A GPU implements a number of graphics primitive operations in a way that makes
running them much faster than drawing directly to the screen with the host CPU. The
most common operations for early 2D computer graphics include the BitBLT operation
(combines several bitmap patterns using a RasterOp), usually in special hardware called a
"blitter", and operations for drawing rectangles, triangles, circles, and arcs. Modern
GPUs also have support for 3D computer graphics, and typically include digital video-
related functions.
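
To make the "BitBLT with a RasterOp" idea concrete, here is a minimal software sketch of
the operation. Real blitters do this in dedicated hardware; the bitmap contents and the OR
raster operation below are arbitrary choices for illustration.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    int main() {
        // Two tiny 1-bit-per-pixel bitmaps packed into bytes (8 pixels per byte).
        std::array<std::uint8_t, 8> source{0x0F, 0xF0, 0xAA, 0x55, 0xFF, 0x00, 0x3C, 0xC3};
        std::array<std::uint8_t, 8> dest  {0xFF, 0x00, 0x0F, 0xF0, 0x81, 0x7E, 0x00, 0xFF};

        // BitBLT: combine source and destination with a raster operation (here, OR).
        for (std::size_t i = 0; i < dest.size(); ++i)
            dest[i] = static_cast<std::uint8_t>(dest[i] | source[i]);

        for (auto byte : dest)
            std::cout << std::hex << static_cast<int>(byte) << ' ';
        std::cout << '\n';
        return 0;
    }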

PARALLAX PROPELLER

The Parallax P8X32 Propeller is a parallel microcontroller with eight 32-bit RISC CPU
cores, introduced in 2006.

The Parallax Propeller, its built-in Spin programming language and byte code
interpreter, and the "Propeller Tool" integrated programming environment were all
designed by a single person, Parallax's co-founder and president Chip Gracey.

SPEED AND POWER MANAGEMENT

The Propeller can be clocked using either an internal, on-chip oscillator (providing a
lower total parts count, but sacrificing some accuracy and thermal stability) or an
external crystal or resonator (providing higher maximum speed with greater accuracy at
an increased total cost). Either of these sources may be run through an on-chip PLL clock
multiplier, which may be set at 1x, 2x, 4x, 8x, or 16x.

Both the on-board oscillator frequency (if used) and the PLL multiplier value may be
changed at run-time. If used correctly, this can improve power efficiency; for example,
the PLL multiplier can be decreased before a long "no operation" wait required for
timing purposes, then increased afterwards, causing the processor to use less power.
However, the utility of this technique is limited to situations where no other cog is
executing timing-dependent code (or is carefully designed to cope with the change), since
the effective clock rate is common to all cogs.

The effective clock rate ranges from 32 kHz up to 80 MHz (with the exact values
available for dynamic control dependent on the configuration used, as described above).
When running at 80 MHz, the proprietary interpreted Spin programming language
executes approximately 80,000 instruction-tokens per second on each core, giving 8
times 80,000 for 640,000 high-level instructions per second. Most machine-language
instructions take 4 clock cycles to execute, resulting in 20 MIPS per cog, or 160 MIPS in
total for an 8-cog Propeller.
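
The throughput figures quoted above follow from simple arithmetic, reproduced below
using the numbers from the paragraph; this is just the calculation, not a simulation of the
chip.

    #include <iostream>

    int main() {
        const double clock_hz            = 80e6;   // 80 MHz system clock
        const int    cogs                = 8;      // eight cores ("cogs")
        const int    cycles_per_instr    = 4;      // typical machine-language instruction
        const double spin_tokens_per_cog = 80000;  // interpreted Spin throughput per cog

        double mips_per_cog = clock_hz / cycles_per_instr / 1e6; // 20 MIPS
        double total_mips   = mips_per_cog * cogs;               // 160 MIPS
        double spin_total   = spin_tokens_per_cog * cogs;        // 640,000 tokens/s

        std::cout << "Per-cog machine code: " << mips_per_cog << " MIPS\n";
        std::cout << "Total machine code:   " << total_mips   << " MIPS\n";
        std::cout << "Total Spin tokens/s:  " << spin_total   << "\n";
        return 0;
    }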

In addition to lowering the clock rate to that actually required, power consumption can be
reduced by turning off cogs (which then use very little power), and by reconfiguring I/O
pins which are not needed, or can be safely placed in a high-impedance state ("tristated"),
as inputs. Pins can be reconfigured dynamically, but again, the change applies to all cogs,
so synchronization is important for certain designs.

FIELD-PROGRAMMABLE GATE ARRAY (FPGA)

Sometimes multiple CPU cores are placed on a single FPGA.

A field-programmable gate array is a semiconductor device containing programmable
logic components called "logic blocks", and programmable interconnects. Logic blocks
can be programmed to perform the function of basic logic gates such as AND and XOR,
or more complex combinational functions such as decoders or simple mathematical
functions. In most FPGAs, the logic blocks also include memory elements, which may be
simple flip-flops or more complete blocks of memory.

A hierarchy of programmable interconnects allows logic blocks to be interconnected as
needed by the system designer, somewhat like a one-chip programmable breadboard.
Logic blocks and interconnects can be programmed by the customer or designer, after the
FPGA is manufactured, to implement any logical function—hence the name "field-
programmable".

FPGAs are usually slower than their application-specific integrated circuit (ASIC)
counterparts, cannot handle as complex a design, and draw more power (for any given
semiconductor process). But their advantages include a shorter time to market, ability to
re-program in the field to fix bugs, and lower non-recurring engineering costs. Vendors
can sell cheaper, less flexible versions of their FPGAs which cannot be modified after the
design is committed. The designs are developed on regular FPGAs and then migrated
into a fixed version that more resembles an ASIC.

Complex programmable logic devices (CPLDs) are an alternative for simpler designs.

PHYSICS PROCESSING UNIT

A Physics Processing Unit (PPU) is a dedicated microprocessor designed to handle the
calculations of physics, especially in the physics engine of video games. Examples of
calculations involving a PPU might include rigid body dynamics, soft body dynamics,
collision detection, fluid dynamics, hair and clothing simulation, finite element analysis,
and fracturing of objects. The idea is that specialized processors offload time consuming
tasks from a computer's CPU, much like how a GPU performs graphics operations in the
main CPU's place.

The first PPUs were the SPARTA and HELLAS.

The term was coined by Ageia's marketing to describe their PhysX chip to consumers.
Several other technologies in the CPU-GPU spectrum have some features in common
with it, although Ageia's solution is the only complete one designed, marketed, supported,
and placed within a system exclusively as a PPU.

DUAL CORE PROCESSING: OVER-SIMPLIFIED, DEMYSTIFIED AND EXPLAINED
The latest buzz in the processor industry is about dual core processors. AMD may be the
first to take the limelight with their announcement of dual core AMD Opteron processors
set to launch in mid-2005, but Intel and IBM are queuing up their dual core processors as
well.

A dual core processor is exactly what it sounds like. It is two processor cores on one die --
essentially like having a dual processor system in one processor. AMD's Opteron
processor has been dual processor capable since its inception. Opteron was designed with
an extra HyperTransport link. The relevance of it was mostly overlooked.
HyperTransport Technology simply means a faster connection that is able to transfer
more data between two chips. This does not mean that the chip itself is faster. It means
that the capability exists via the HyperTransport pathway for one chip to "talk" to another
chip or device at a faster speed and with greater data throughput.
We knew that HyperTransport Technology would provide for a faster connection to
system memory, the GPU and the rest of the motherboard but back in the fall of 2003 we
thought of the extra HyperTransport link as a connection to another physical processor.
It didn't dawn on us that the "extra" processor could be on the same die. While some will
say "I knew that" most didn't pick up on it.

AMD have the added punch of being able to drop their dual core Opteron processors into
existing 940-pin sockets. This upgrade path is extremely favorable as all it will require is
a processor swap and, perhaps, a BIOS update.

Intel are continuing with their Pentium 4 cores by releasing two flavors codenamed
Paxville and Dempsey. The codenames will very likely change once the marketing
department gets their hands on it as "Introducing the new Dempsey" has a very lackluster
ring to it.

Mac-oriented Think Secret posted IBM's plans for the PowerPC 970MP, codenamed
Antares and rumored to clock in at 3 GHz with a 1 GHz EI (Elastic Interface) bus.

The horses are now in the paddock. AMD, Intel and Mac loyalists are beginning to
group at the fence to eye up their favorite and the competition. The post parade is still a
ways off and with post time now set at mid-2005 it's anybody's guess who will be out of
the gate first.

HARDWARE TREND
The general trend in processor development has been from multi-core to many-core:
from dual-, quad-, eight-core chips to ones with tens or even hundreds of cores; see
manycore processing unit. In addition, multi-core chips mixed with simultaneous
multithreading, memory-on-chip, and special-purpose "heterogeneous" cores promise
further performance and efficiency gains, especially in processing multimedia,
recognition and networking applications. There is also a trend of improving energy
efficiency by focusing on performance-per-watt with advanced fine-grain or ultra fine-
grain power management and dynamic voltage and frequency scaling (DVFS).

SOFTWARE IMPACT
Software benefits from multicore architectures where code can be executed in parallel.
Under most common operating systems this requires code to execute in separate threads
or processes. Each application running on a system runs in its own process so multiple
applications will benefit from multicore architectures. Each application may also have
multiple threads but, in most cases, it must be specifically written to utilize multiple
threads. Operating system software also tends to run many threads as a part of its normal
operation. Running virtual machines will benefit from adoption of multiple core
architectures since each virtual machine runs independently of others and can be
executed in parallel.

Most application software is not written to use multiple concurrent threads intensively
because of the challenge of doing so. A frequent pattern in multithreaded application
design is where a single thread does the intensive work while other threads do much less.

For example, a virus scan application may create a new thread for the scan process, while
the GUI thread waits for commands from the user (e.g. cancel the scan). In such cases,
multicore architecture is of little benefit for the application itself due to the single thread
doing all heavy lifting and the inability to balance the work evenly across multiple cores.
Programming truly multithreaded code often requires complex co-ordination of threads
and can easily introduce subtle and difficult-to-find bugs due to the interleaving of
processing on data shared between threads (thread-safety). Consequently, such code is
much more difficult to debug than single-threaded code when it breaks. There has been a
perceived lack of motivation for writing consumer-level threaded applications because of
the relative rarity of consumer-level multiprocessor hardware. Although threaded
applications incur little additional performance penalty on single-processor machines, the
extra overhead of development has been difficult to justify due to the preponderance of
single-processor machines.
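
The virus-scanner pattern described above can be sketched as follows: one worker thread
does all of the heavy lifting while the main ("GUI") thread merely waits and can signal a
cancel. The file-scanning work here is a placeholder, not real scanning code, so the only
point is the thread structure.

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    std::atomic<bool> cancel_requested{false};

    // The single thread that does all of the intensive work.
    void scan_files() {
        for (int file = 0; file < 1000 && !cancel_requested; ++file) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1)); // pretend to scan a file
        }
    }

    int main() {
        std::thread worker(scan_files);

        // The "GUI" thread stays responsive but does almost no work itself,
        // so a second core adds little beyond keeping the interface snappy.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        cancel_requested = true; // e.g. the user pressed "Cancel"

        worker.join();
        std::cout << "Scan stopped.\n";
        return 0;
    }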

As of September 2006, with the typical mix of mass-market applications the main benefit
to an ordinary user from a multi-core CPU will be improved multitasking performance,
which may apply more often than expected. Ordinary users are already running many
threads; operating systems utilize multiple threads, as well as antivirus programs and
other 'background processes' including audio and video controls. The largest boost in
performance will likely be noticed in improved response time while running CPU-
intensive processes, like antivirus scans, defragmenting, ripping/burning media (requiring
file conversion), or searching for folders. For example, if the automatic virus scan
initiates while a movie is being watched, the movie is far less likely to lag, as the
antivirus program will be assigned to a different processor than the processor running the
movie playback.

Given the increasing emphasis on multicore chip design, stemming from the grave
thermal and power consumption problems posed by any further significant increase in
processor clock speeds, the extent to which software can be multithreaded to take
advantage of these new chips is likely to be the single greatest constraint on computer
performance in the future. If developers are unable to design software to fully exploit the
resources provided by multiple cores, then they will ultimately reach an insurmountable
performance ceiling.

The telecommunications market was one of the first to need a new design for
parallel datapath packet processing, and there was very quick adoption of these
multi-core processors for the datapath and the control plane. These MPUs are going to
replace the traditional network processors that were based on proprietary micro- or pico-
code. 6WIND was the first company to provide embedded software for these
applications.

PARALLEL PROGRAMMING

Parallel programming techniques can benefit from multiple cores directly. Some existing
parallel programming models such as OpenMP and MPI can be used on multi-core
platforms. Intel has introduced a new abstraction for C++ parallelism called Threading
Building Blocks (TBB). Other research efforts include the Codeplay Sieve System, Cray's
Chapel, Sun's Fortress, and IBM's X10.
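
As a small illustration of one of these models, the sketch below parallelizes a loop with an
OpenMP directive. Built with an OpenMP-capable compiler (for example, -fopenmp on
GCC), the iterations are distributed across the available cores and the partial sums are
combined automatically; the loop body itself is an arbitrary example.

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n = 1000000;
        double sum = 0.0;

        // Each core receives a share of the iterations; partial sums are reduced at the end.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += 1.0 / (i + 1);
        }

        std::printf("Max threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }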

Managing concurrency acquires a central role in developing parallel applications. The
basic steps in designing parallel applications are:

PARTITIONING

The partitioning stage of a design is intended to expose opportunities for parallel
execution. Hence, the focus is on defining a large number of small tasks in order to yield
what is termed a fine-grained decomposition of a problem.

COMMUNICATION

The tasks generated by a partition are intended to execute concurrently but cannot, in
general, execute independently. The computation to be performed in one task will
typically require data associated with another task. Data must then be transferred between
tasks so as to allow computation to proceed. This information flow is specified in the
communication phase of a design.

AGGLOMERATION

In the third stage, we move from the abstract toward the concrete. We revisit decisions
made in the partitioning and communication phases with a view to obtaining an
algorithm that will execute efficiently on some class of parallel computer. In particular,
we consider whether it is useful to combine, or agglomerate, tasks identified by the
partitioning phase, so as to provide a smaller number of tasks, each of greater size. We
also determine whether it is worthwhile to replicate data and/or computation.

MAPPING

In the fourth and final stage of the parallel algorithm design process, we specify where
each task is to execute. This mapping problem does not arise on uniprocessors or on
shared-memory computers that provide automatic task scheduling.
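
A compact way to see all four stages at once, using plain C++ threads (the problem,
summing an array, is chosen purely for illustration): the individual elements are the
fine-grained tasks (partitioning), the partial sums are the data exchanged (communication),
the elements are grouped into one chunk per thread (agglomeration), and each chunk is
assigned to a hardware thread (mapping).

    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(1'000'000, 2);

        // Mapping target: one task per available hardware thread.
        unsigned workers = std::max(1u, std::thread::hardware_concurrency());

        std::vector<long long> partial(workers, 0); // buffer for communicated results
        std::vector<std::thread> pool;

        std::size_t chunk = data.size() / workers;  // agglomerate fine-grained tasks into chunks
        for (unsigned w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end   = (w + 1 == workers) ? data.size() : begin + chunk;
            pool.emplace_back([&, w, begin, end] {
                partial[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
            });
        }
        for (auto& t : pool) t.join();

        std::cout << "Sum: " << std::accumulate(partial.begin(), partial.end(), 0LL) << "\n";
        return 0;
    }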

On the server side, multicore processors are ideal because they allow
many users to connect to a site simultaneously and have independent threads of
execution. This allows for Web servers and application servers that have much better
throughput.

ARE TWO CORES BETTER THAN ONE?


There will most likely be three terms that come up to fuel the dual core debate: pipeline,
cache and bus.

What follows is the most basic of explanations of what a processor pipeline is. First, the
processor needs a set of instructions to work on.
A processor loads instructions into the pipeline. Think of the pipeline like a conveyor
belt. The data is processed sequentially, one instruction after another.
The AMD processor pipeline is shorter than the Intel processor pipeline, and this is one
of the reasons why AMD runs at a lower clock speed.

Pipelining, like most things in life, is good in moderation. Making a processor's pipeline
too short causes a longer minimum clock period which hinders the manufacturer's ability
to ramp up the clock speed. Making the pipeline very long allows faster clock speeds
however it also increases the cost of stalls and flushes which negatively affects
performance and also increases the amount of resources required to pipeline the
processor.

This is discussed in-depth in Short-Media's Pipelining Explained article.


A shorter pipeline means that more work has to be done in the pipeline per clock cycle
thus the clock speed cannot be as high compared to a processor with a longer pipeline.
However, with a shorter pipeline, the data gets through it faster thus balancing the
equation. This is one of the reasons why an AMD processor can compete with higher
clocked INTEL processors.
Data that is continually used in preparation for the pipeline is stored in the
processor's cache, and a processor is smart enough to anticipate what data it may require.
If the processor needs to reach outside of the cache then it does so through the bus to
system RAM. Remember that the processor cache is running at the same clock speed
as the processor itself. If it is a 2 GHz processor, then the speed limit on the highway
between the cache and the rest of the processor is 2 GHz. If the processor has to reach out
through the bus to main system memory then it must slow down to that bus speed. A bus
speed of 400 MHz is five times slower than the 2 GHz example.

In layman's terms, think of the processor as a carpenter. The carpenter's truck is system
memory, and the cache is the set of tools he has brought into the house for the job. The
carpenter has anticipated what tools he may need to do the job. If the tool is not at hand
then he must go back to the truck to get the right tool, thus slowing down the job at hand.

PUTTING IT ALL TOGETHER

Two pairs of hands make the work go faster. This is quite true in computers with dual
processors especially with SMP (Symmetric Multiprocessing) software. Not all software
is SMP aware. In fact only a small percentage of it is. SMP capability is something that
must be written into the code. The program must know that it can utilize two processors
to complete processes simultaneously. This is known as multithreading.

In terms of architecture, a dual core processor sits between a single core processor and a
dual processor system. A dual core processor has two cores but shares some of the other
hardware, like the memory controller and bus. A dual processor system has completely
separate hardware and shares nothing with the other processor.

A dual core processor won't be twice as fast as a single core processor nor will it be as
fast as a dual processor system.

It will fall somewhere in the middle but there are going to be specific advantages.
There will be two pipelines and that means there can be two sets of instructions being
carried out simultaneously.
There will also be two processor caches to keep more of the necessary "tools" or data on
the processor die for faster access.

The trick will be the bus. If everyone wants on the bus at the same time then there will be
the Keystone Cops comedy of errors as everyone tries to squeeze through the door at the
same time. The two processor cores have to be designed to be smart enough to "wait" for
the other to finish accessing the bus.

Now all of this is happening at the nanosecond level so don't think there's time for a
coffee. Nanosecond wait states mean there's not even enough time to THINK about
thinking about having a coffee.

TO SMP OR NOT TO SMP?

The processor engineers have probably already thought about tackling the SMP situation.
What good is a dual core processor if the software only recognizes and then uses only
one of the cores? The majority of software is not written to utilize multithreading at
present. This breaks open a whole new can of worms in concepts of parallel computing.
Intel's Hyper-Threading is a logical, single-processor variation on the dual core idea.
AMD has just taken it one step further with two physical cores on one processor die.
Could AMD's engineers have cracked the hardware problem of a dual core processor and
load balancing a program that isn't written for multithreading?

This is where dual core processors could fall short of expectations for mainstream users.
If the software cannot "see" the second processor then it will not benefit from it.
Programs, such as Adobe Photoshop, are SMP aware and are much faster on a dual
processor system. There is no doubt that a program like Photoshop will be much faster on
a dual core system than its single core counterpart. The majority of operating systems do
recognize and support at least two processors. There is some load balancing of non-SMP
applications, but it is not as efficient as with applications written for multithreading.

DUAL CORE PROCESSORS FOR LOW-POWER, HIGH-PERFORMANCE

More and more PC users run their systems 24 hours a day to permit functions such as
downloading files, running backups, scanning for viruses and operating Web servers.
PCs for these operations should be quiet and energy-efficient, while offering sufficient
performance for the applications of both today and tomorrow.
AMD's Turion 64 and Intel's Pentium M are the thriftiest processors when it comes to
energy consumption, but, strictly speaking, both are becoming obsolete. The future
belongs to dual core processors, as they provide substantial performance enhancements
for a low-power PC.

If you look for a dual core processor, the obvious choices will be the AMD Athlon 64 X2
and the Intel Pentium D. We only recommend the 65 nm version of the Pentium D (the
900 series), because the aged 90 nm 800 series suffers from high thermal dissipation.

Clearly, the Athlon 64 X2 offers superior performance and efficiency, but we are looking
for high efficiency solutions, which draws our attention to Intel's latest mobile dual core
processor: the Core Duo.

Intel's Core Duo is a 65 nm part and offers optimized performance per clock cycle thanks
to its reconditioned microarchitecture, while drawing no more power than its
predecessor, the Pentium M. Intel specifies a maximum design power of 31 W, which is
an excellent result when put in the context of its performance. The Centrino Duo launch
was spoiled by a USB power drain issue, which causes battery time to decrease
dramatically. Since this proved to be a software issue, it is Microsoft courting our
resentment now, as the promised patch has still not arrived. Luckily, this does not affect
desktop applications, nor does it change our assessment of the Core Duo processor being
one of the finest we have seen to date; it would be very appealing for a low-power PC.

The Core Duo uses the same processor socket as the Pentium M, but requires some
electrical modifications, which means that you cannot use existing Socket 479
motherboards. Suitable products are not yet available, but several motherboard makers
are working on them. We had a look at one of the first solutions back in March, AOpen's
i975Xa-YDG. This is a full-fledged Core Duo ATX motherboard, but its 975X
chipset is not known to be energy efficient. As there is currently no alternative, we
decided to go for this motherboard and a Core Duo T2600 processor running at 2.16 GHz
and FSB667.

AMD is getting ready to release the Turion 64 X2 dual core processor, but until this
delicacy is served up, we will stick with the desktop dual core Athlon 64 X2 3800+ at 2.0
GHz and a Biostar TForce 6100-939 motherboard. Although the processor itself requires
more energy than the Core Duo, the entire AMD64 platform is more energy efficient, so
this promises to be a very interesting competition.

We also added a Pentium M 780 on an MSI 915GM Speedster and a Turion 64 MT-40 on
a K8NGM-V to the lineup. Notice that these motherboards come with integrated
graphics, but for the sake of a fair comparison with the AOpen i975Xa-YDG, we ran all
systems with a dedicated graphics card.
TYPICAL PC ENERGY CONSUMPTION REVIEWED

Component            Desktop PC       Energy Efficient PC    Savings
Processor            ~ 30-120 W       ~ 10-35 W              ~ 60-70%
Platform             ~ 15-50 W        ~ 10-30 W              ~ 20-40%
Graphics             ~ 10-120 W       ~ 5-25 W               ~ 50-80%
Power supply         Varying                                 n/a
Other                ~ 15-30 W        ~ 5-20 W               ~ 30-50%
Total power draw*    ~ 70-350 W**     ~ 35-110 W             ~ 50-70%
* Does not include display
** Dual and quad graphics solutions require more energy, as do systems with multiple drives and add-in cards

The table above shows the power draw for typical components, as well as the potential
energy savings from using efficient PC parts. You will notice that the savings can vary
heavily; this is due to many product options for various components, and their power
differences:

PROCESSOR

Energy consumption rises disproportionately with clock speed, since higher clocks usually
also require higher operating voltages. Accordingly, reducing the clock speed will reduce
the power draw, especially if energy saving features such as Cool'n'Quiet (AMD) or
SpeedStep (Intel) are used to reduce the operating voltage. In
this context, the processor type makes a huge difference as well: a modern 65 nm
Pentium D 900 or Pentium 4 6x1 series runs cooler and wastes less energy than the 90
nm Pentium D 800 and Pentium 4 500/600 processors. AMD processors show similar
effects from one generation to the next, but the impact is less dramatic due to AMD's
more elaborate SOI (silicon on insulator) manufacturing and processor architecture.
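
The relationship can be sketched with the usual first-order formula for CMOS dynamic
power, P ~ C * V^2 * f. The switched-capacitance and voltage figures below are
illustrative assumptions rather than vendor data; the point is that lowering clock and
voltage together, as Cool'n'Quiet and SpeedStep do, saves far more power than lowering
the clock alone.

    #include <iostream>

    // First-order CMOS dynamic power: P ~ C * V^2 * f
    double dynamic_power_watts(double capacitance_nf, double volts, double freq_ghz) {
        return (capacitance_nf * 1e-9) * volts * volts * (freq_ghz * 1e9);
    }

    int main() {
        const double c_nf = 15.0;                            // illustrative switched capacitance
        double full = dynamic_power_watts(c_nf, 1.40, 3.0);  // full clock, full voltage (~88 W)
        double eco  = dynamic_power_watts(c_nf, 1.10, 2.0);  // reduced clock AND voltage (~36 W)

        std::cout << "Full speed: " << full << " W\n";
        std::cout << "Throttled:  " << eco  << " W\n";
        std::cout << "Savings:    " << 100.0 * (1.0 - eco / full) << " %\n";
        return 0;
    }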

PLATFORM

This term refers to the motherboard, including the chipset and on-board components. It
comprises functional parts such as audio chips and additional controllers, as well as basic
components such as voltage regulators. Intel's current desktop chipsets are not
particularly efficient these days, while Athlon 64 core logic benefits from the memory
controller being a part of AMD's current processors. However, motherboards that use a
mobile chipset rather than a desktop version require considerably less energy.

GRAPHICS

Today's graphics cards are able to squeeze more and more visual effects and even physics
calculations out of any graphics processor, but this comes at a tremendous energy price
due to the several hundred million transistors used. The basic power draw just for
displaying the Windows screen can be 15 to 30 W.

As the 3D units become active, power consumption increases further; a modern graphics
card will convert from 50 to 120 W of electricity into heat. High-end graphics cards even
come with a separate power connector to satisfy requirements that exceed the power
supply specifications for PCI Express. Of course, dual graphics setups - ATI Crossfire or
Nvidia SLI - will almost double the graphics power requirements.

If you want to save as much energy as possible, there is no alternative to using an
integrated graphics solution. Unfortunately, at least today, you must choose between fast
3D graphics and low power operation (see the test results).

POWER SUPPLY

Power supplies become less efficient the closer they run to their maximum output, which
means that a larger amount of energy will be converted into heat. It is difficult to provide
precise numbers, however, because the degree of efficiency varies with the load.

OTHER

There are other components where power is a concern, such as the hard drive or optical
drive, but with these the energy consumption is usually under 10 W. Using 2.5" or even
1.8" hard drives helps to reduce power consumption, but this has a noticeable negative
impact on performance. Since the difference in power is not very large, we recommend
focusing on other components first.

AMD PERFORMANCE COMPARISON
This is a comparison of two systems with virtually identical hardware. The video card
and hard drive used were the same brand and model. The amount of RAM is identical (2x
1GB PC3200) with the only difference being that the Opteron system used ECC memory.
The real differences were just the motherboard and CPUs. For the Opteron system we
have 2 model '248' processors running at 2.2 GHz each with 1 MB of cache. They are
running on a Tyan Thunder K8WE board, which uses an nVidia nForce Professional
chipset. The single-CPU solution is an Athlon64 X2 4400+, with two cores each running
at 2.2 GHz and each sporting a 1 MB cache. This processor was installed in an Asus A8N-
SLI Premium motherboard, utilizing an nVidia nForce4 SLI chipset.

As you can see, graphics performance is very similar, with 3DMark05 scores only 1 point
apart and less than 4% variation in the 3DMark03 scores. Looking closer, we can also see
that the important specific metrics in PCMark04 are very similar as well -- the biggest
difference is seen in the additional overhead of ECC impacting the memory performance.
All around, I would say that with the AMD platform there is little noticeable difference
between dual-core and dual processors.
For those with plenty of money to burn, it is also common for us to build an AMD
Opteron system with a dual CPU motherboard, using a dual core CPU in each socket.
That gives a grand total of four functional CPU cores! This setup is especially desirable if
you need to have multiple heavy duty applications open (CAD, video editing, and
modeling come to mind) - just make sure you complement those processors with plenty
of memory.

INTEL PERFORMANCE COMPARISON


For Intel, we have compiled a comparison between a pair of Xeon 3.0 GHz CPUs with
1 MB cache each and a single Pentium D 830. The Pentium D has two cores with each
running at 3.0 GHz with 1 MB of cache. Furthermore, both setups use an 800 MHz front-
side-bus to communicate with the motherboard. Again, the motherboards themselves are
different, but each system has the same amount of memory (2 GB) and similar video cards
(GeForce 6800GT 256 MB). There is a little more variation between these two systems
because the memory is configured differently: the Xeon is using two sticks of 1 GB
PC3200, while the Pentium is using 4 sticks of 512 MB PC2-5400. This gives the
Pentium D a definite advantage in overall memory bandwidth available, but that is a very
tangible benefit of using the Pentium D line. Intel has not yet, as of this writing, updated
its Xeon processors and their chipsets to handle higher speed RAM.

Here again we see fairly close performance in graphics, with the Xeon system in a very
slight lead. In more performance-oriented tests, however, we see the Pentium system
tending to pull ahead by a fair margin. This is most likely due to its significant memory
speed advantage, but again this is a very valid and important result. The RAM that was
used in the Pentium D system is standard for that platform, but even if we wanted to, we
could not build a Xeon setup with the same speed of memory. So while the processors
may be very comparable in performance the overall win definitely goes to the Pentium D
dual-core platform.

ADVANTAGES
• CACHE COHERENCY

The proximity of multiple CPU cores on the same die allows the cache coherency
circuitry to operate at a much higher clock rate than is possible if the signals have to
travel off-chip.

• CACHE SNOOP/BUS SNOOP


Combining equivalent CPUs on a single die significantly improves the performance of
cache snoop (alternative: Bus snooping) operations. Put simply, this means that signals
between different CPUs travel shorter distances, and therefore those signals degrade less.

• LESS DEGRADATION OF SIGNALS

These higher quality signals allow more data to be sent in a given time period since
individual signals can be shorter and do not need to be repeated as often.

• PRINTED CIRCUIT BOARD (PCB)

Assuming that the die can fit into the package, physically, the multi-core CPU designs
require much less Printed Circuit Board (PCB) space than multi-chip SMP designs.
• LESS POWER USAGE

A dual-core processor uses slightly less power than two coupled single-core processors,
principally because of the increased power required to drive signals external to the chip
and because the smaller silicon process geometry allows the cores to operate at lower
voltages.

• REDUCES LATENCY

A reduction in power usage, i.e. the cores operating at lower voltages, leads to reduced
latency. Furthermore, the cores share some circuitry, like the L2 cache and the interface
to the front side bus (FSB).

Thus, in terms of competing technologies for the available silicon die area, multi-core
design can make use of proven CPU core library designs and produce a product with
lower risk of design error than devising a new wider core design. Also, adding more
cache suffers from diminishing returns.

DISADVANTAGES
• ADJUSTMENT TO OPERATING SYSTEM SUPPORT

Existing software must be adapted to maximize utilization of the computing
resources provided by multi-core processors. Also, the ability of multi-core processors to
increase application performance depends on the use of multiple threads within
applications.

• LOWER PRODUCTION YIELDS


Integrating more cores onto a single chip drives production yields down, and such chips
are more difficult to manage thermally than lower-density single-chip designs. Intel has
partially countered this first problem by creating its quad-core designs by combining two
dual-core dies, each with a unified cache, in a single package, hence any two working
dual-core dies can be used, as opposed to producing four cores on a single die and
requiring all four to work to produce a quad-core. From an architectural point of view,
ultimately, single CPU designs may make better use of the silicon surface area than
multiprocessing cores, so a development commitment to this architecture may carry the
risk of obsolescence.

• LIMITS THE REAL-WORLD PERFORMANCE

Finally, raw processing power is not the only constraint on system performance. Two
processing cores sharing the same system bus and memory bandwidth limits the real-
world performance advantage. If a single core is close to being memory bandwidth
limited, going to dual-core might only give 30% to 70% improvement. If memory
bandwidth is not a problem, a 90% improvement can be expected. It would be possible
for an application that used 2 CPUs to end up running faster on one dual-core if
communication between the CPUs was the limiting factor, which would count as more
than 100% improvement.
CONCLUSION
Our conclusion for the Core Duo processor is particularly interesting, because in theory it
is capable of enabling the assembly of a dual core desktop system that requires only 45
W. This, however, requires a motherboard that uses the 945GM chipset, which is not yet
available. Using AOpen's 975X motherboard forces the user to go for discrete graphics,
which drives the system power consumption to a minimum of 70 W. Several
motherboard companies are currently working on desktop motherboards for Core Duo,
but most are not bringing the 945GM chipset to the desktop. Core Duo is a great product
and is very efficient, but it stumbles due to the lack of a suitable low-power desktop
platform. (Ironically, it was Intel who has been trying to refocus the industry to think
about platforms...)

Efficient systems depend on efficient components, and there is one component that lately
has turned into a serious energy hog: the graphics card. The $500 monsters from ATI and
Nvidia consume 100 W or more when doing their 3D work. Even the basic requirement
is at least around 20 W, which would seem to make high performance and energy
efficiency mutually exclusive.

But wait a minute - there are graphics solutions that work without excessive power
requirements. Have a look at the products that both ATI and Nvidia send into multimedia
and gaming notebooks: there is the Mobility Radeon and the GeForce Go, both of
which include technology that helps conserve energy. How about offering PCI Express
graphics cards that are based on mobile graphics solutions? Although the market
certainly is not huge, it seems like there would definitely be interest. Such products
would also help in the comparison of different mobile graphics solutions, by running
them on a reference system.

The bottom line question is simple: does a low-power PC make sense for you? We
recommend first finding out your energy costs per kWh (kilowatt-hour), and then
calculating how much energy would be consumed over the course of a year if you
operate the system 24/7. An example would be 100 W of power consumption times 24
hours, times 365 days. The result is 876 kWh, which you have to multiply by your
energy cost per kWh.
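
The same calculation, written out as a tiny sketch; the power draw and the price per kWh
below are placeholders to be replaced with your own figures.

    #include <iostream>

    int main() {
        const double watts          = 100.0;        // average system power draw (assumed)
        const double hours_per_year = 24.0 * 365.0; // running 24/7
        const double price_per_kwh  = 0.20;         // your local energy price (assumed)

        double kwh_per_year  = watts * hours_per_year / 1000.0; // 876 kWh
        double cost_per_year = kwh_per_year * price_per_kwh;

        std::cout << "Energy per year: " << kwh_per_year  << " kWh\n";
        std::cout << "Cost per year:   " << cost_per_year << "\n";
        return 0;
    }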

It is obvious that any high-end component would spoil your saving efforts. For example,
we cannot recommend buying a Core Duo T2600 in order to reduce your energy bill.
Instead, go for the mid range; here, Turion 64 solutions currently offer the best bang for
the energy saving buck. But the Turion 64 X2 will be available soon, and the first
945GM motherboards for Core Duo should also hit retail in the not too distant future.
We expect both to renew the energy efficiency debate.

REFERENCES

1. Multi-Core Architectures: Understanding Mechanisms, by R. Kumar and V. Zyuban.
2. Wikipedia - The Free Encyclopedia
3. www.google.com
4. www.intel.com
5. www.webopedia.com
6. www.icrontic.com
7. www.amd.com
