It has always been a common question: "Will I benefit from multiple processors?" With
the growing popularity of dual core processors, the topic is more important than ever!
Will multiple processors or a dual core processor be beneficial to you, and what are the
differences between them? These are the questions this article will attempt to lay to rest.
A major question for some people getting ready to buy a high-end system is whether they
want or need to have two processors available to them. For anyone doing video editing,
multi-threaded applications, or a lot of multitasking the answer is a very clear 'yes'. Then
the question becomes whether two separate processors (as in a dual Xeon or Opteron
system) is the way to go, or whether a single dual core processor (like a Pentium D or
Athlon64 X2) will do just as well.
MULTI-CORE DEVELOPMENT
While manufacturing technology continues to improve, reducing the size of single gates,
physical limits of semiconductor-based microelectronics have become a major design
concern. Some effects of these physical limitations can cause significant heat dissipation
and data synchronization problems. The demand for more capable microprocessors
causes CPU designers to use various methods of increasing performance. Some
instruction-level parallelism (ILP) methods like superscalar pipelining are suitable for
many applications, but are inefficient for others that tend to contain difficult-to-predict
code. Many applications are better suited to thread level parallelism (TLP) methods, and
multiple independent CPUs is one common method used to increase a system's overall
TLP. A combination of increased available space due to refined manufacturing processes
and the demand for increased TLP is the logic behind the creation of multi-core CPUs.
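The difference matters in practice: a toy sketch (illustrative only, not tied to any particular CPU) shows thread-level parallelism, with two independent CPU-bound tasks farmed out to separate processes that the operating system can schedule on separate cores.

```python
from multiprocessing import Pool

def count_primes(limit):
    """CPU-bound task: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    # Two independent tasks: a multi-core CPU can run them at the same
    # time, which is thread-level parallelism (TLP) rather than ILP.
    with Pool(processes=2) as pool:
        results = pool.map(count_primes, [10_000, 20_000])
    print(results)
```

The two calls share no data, so no coordination is needed; this is the easy case that multiple independent CPUs (or cores) exploit directly.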
There are more subtle differences between brands (how they combined two cores onto
one chip, and the speeds they run each core at) that can affect how much of a boost in
performance you can get from having a dual core CPU. Additionally, different types of
programs get differing benefits from having a dual core chip.
Dual Core Implementation
Because of the different ways AMD and Intel came into the dual-core market, each
platform deals with the increased communication needs of their new processors
differently. AMD claims that they have been planning the move to dual-core for several
years now, since the first Athlon64s and Opterons were released. The benefit of this can
be seen in the way that the two cores on their processors communicate directly -- the
structure was already in place for the dual cores to work together. Intel, on the other
hand, simply put two of their Pentium cores on the same chip, and if they need to
communicate with each other it has to be done through the motherboard chipset. This is
not as elegant a solution, but it does its job well and allowed Intel to get dual-core
designs to the market quickly. In the future Intel plans to move to a more unified design,
and only time can tell what that will look like.
Intel did not increase the speed of their front-side-bus (the connection
between the CPU and the motherboard) when they switched to dual-core,
meaning that though the processing power doubled, the amount of
bandwidth for each core did not. This puts a bit of a strain on the Intel
design, and likely prevents it from being as powerful as it could be. To
counteract this effect, Intel continues to use faster system memory to keep
information supplied to the processor cores. As a side note, the highest-
end Intel chip, the Pentium Extreme Edition 955, has a higher front-side-bus speed, as
well as having a larger (2MB per core) cache memory and the ability to use
Hyperthreading (which all non-Extreme Edition Pentium D processors lack). This makes
it a very tempting choice for those wanting to overcome some of the design handicaps of
Intel's dual-core solution.
AMD, on the other hand, does not use a front-side-bus in the traditional
sense. They use a technology called HyperTransport to communicate with
the chipset and system memory, and they have also moved the memory
controller from the chipset to the CPU. By having the memory controller
directly on the processor, AMD has given their platform a large
advantage, especially with the move to dual-core. The latest generation of
AMD single-core processors can use single- or dual-channel PC3200
memory, but it is interesting to note that even though dual-channel operation doubles the
memory speed, it does not double the actual memory performance for single-core
processors. It appears that dual-channel memory just provides significantly more
bandwidth than a single processor core can use. However, with dual-core processors all
that extra bandwidth can be put to good use, allowing the same technology already
present in single-core chips to remain unchanged without causing the same sort of
bottleneck Intel suffers from.
A dual core processor is different from a multi-processor system. In the latter there are
two separate CPUs with their own resources. In the former, resources are shared and the
cores reside on the same chip. A multi-processor system is faster than a system with a
dual core processor, while a dual core system is faster than a single-core system, all else
being equal.
An attractive feature of dual core processors is that they do not require a new motherboard,
but can be used in existing boards that feature the correct socket. For the average user the
difference in performance will be most noticeable in multi-tasking until more software is
SMT aware. Servers running multiple dual core processors will see an appreciable
increase in performance.
Multi-core processors are the goal and as technology shrinks, there is more "real-estate"
available on the die. In the fall of 2004 Bill Siu of Intel predicted that current
accommodating motherboards would be here to stay until 4-core CPUs eventually force a
changeover to incorporate a new memory controller that will be required for handling 4
or more cores.
Texas Instruments TMS320 is a blanket name for a series of digital signal processors
(DSPs) from Texas Instruments. It was introduced on April 8, 1983 with the
TMS32010 processor, which was then the fastest DSP on the market.
The processor is available in many different variants, some with fixed-point arithmetic
and some with floating point arithmetic. The floating point DSP TMS320C3x, which
exploits delayed branch logic, has as many as three delay slots.
The flexibility of this line of processors has led to it being used not merely as a co-
processor for digital signal processing but also as a main CPU. They all support standard
IEEE JTAG control for development.
The original TMS32010 and its subsequent variants are examples of CPUs with a
Modified Harvard architecture, which features separate address spaces for instruction and
data memory but the ability to read data values from instruction memory. The
TMS32010 featured a fast multiply-and-accumulate useful in both DSP applications as
well as transformations used in computer graphics. The graphics controller card for the
Apollo Computer DN570 Workstation, released in 1985, was based on the TMS32010
and could transform 20,000 2D vectors/second.
EMOTION ENGINE
At the heart of the Emotion Engine is a two-way superscalar, in-order, MIPS-based core
primarily based on the MIPS III ISA, with some instructions defined by the MIPS IV
ISA. The core consists of two 64-bit fixed-point units and one single-precision (32-bit)
floating-point unit with a six-stage pipeline. To feed the execution units with
instructions and data, there is a 16 KB two-way set-associative instruction cache, an
8 KB two-way set-associative non-blocking data cache, and a 16 KB scratchpad RAM.
Both the instruction and data caches are virtually indexed and physically tagged, while
the scratchpad RAM exists in a separate memory space. A combined 48-double-entry
instruction and data translation lookaside buffer is provided for translating virtual
addresses. Branch prediction is performed by a 64-entry branch target address cache and
a branch history table integrated into the instruction cache. The branch mispredict
penalty is three cycles thanks to the short six-stage pipeline.
The two VPUs (VPU0 and VPU1) provide the majority of the Emotion Engine's floating-
point performance. Each VPU features thirty-two 128-bit registers, sixteen 16-bit fixed-
point registers, four FMAC units, an FDIV unit, and a local data memory. The data
memory for VPU0 is 4 KB in size, while VPU1 features a 16 KB data memory. To
achieve high bandwidth, the VPU's data memory is connected directly to the GIF, and
both of the data memories can be read directly by the DMA unit. A single vector
instruction consists of four 32-bit IEEE-compliant single-precision floating-point values,
which are distributed to the four single-precision (32-bit) FMAC units for processing.
Contrary to popular belief, the Emotion Engine is not a 128-bit processor, as it does not
process a single 128-bit value, only groups of four 32-bit values packed into one 128-bit
register. This scheme is similar to Intel's SSE extensions. The FMAC units have an
instruction latency of four cycles, but because they have a six-stage pipeline, they
achieve a throughput of one instruction per cycle. The FDIV unit has a nine-stage
pipeline and can execute one instruction every seven cycles.
Communication between the MIPS core, the two VPUs, the GIF, the memory controller,
and other units is handled by a 128-bit-wide internal data bus running at half the clock
frequency of the CPU. At 300 MHz, the internal data bus provides a maximum
theoretical bandwidth of 2.4 GiB/s. DMA transfers over this bus occur in packets of
eight 128-bit words, achieving a peak bandwidth of 2 GiB/s. The Emotion Engine
interfaces directly to the Graphics Synthesizer via the GIF and a dedicated 64-bit-wide,
150 MHz bus with a maximum theoretical bandwidth of 1.2 GiB/s.
Communication between the Emotion Engine and RAM occurs through two channels of
DRDRAM and the memory controller, which interfaces to the internal data bus. The two
channels of DRDRAM have a maximum theoretical bandwidth of 3.2 GiB/s, about 33%
more bandwidth than the internal data bus. Because of this, the memory controller
buffers data sent from the DRDRAM channels so the extra bandwidth can be utilised by
the CPU.
To provide communication between the Emotion Engine and the Input/Output Processor
(IOP), the input/output interface bridges a 32-bit-wide, 37.5 MHz input/output bus, with
a maximum theoretical bandwidth of 150 MB/s, to the internal data bus. This interface
provides vastly more bandwidth than the PlayStation's input/output devices require.
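These bus figures follow directly from width times clock rate; a small helper (an illustration of the arithmetic, not vendor code) reproduces them:

```python
def bus_bandwidth_mb_s(width_bits, clock_mhz):
    """Peak bandwidth in MB/s: (bus width in bytes) x (clock in MHz)."""
    return (width_bits // 8) * clock_mhz

# Internal data bus: 128 bits wide at 150 MHz (half of the 300 MHz core)
internal = bus_bandwidth_mb_s(128, 150)   # 2400 MB/s, ~2.4 GB/s
# GIF-to-Graphics Synthesizer bus: 64 bits wide at 150 MHz
gif = bus_bandwidth_mb_s(64, 150)         # 1200 MB/s, ~1.2 GB/s
# Input/output bus to the IOP: 32 bits wide at 37.5 MHz
iop = bus_bandwidth_mb_s(32, 37.5)        # 150.0 MB/s
print(internal, gif, iop)
```

The article quotes the first two as GiB/s; the decimal values above match the stated figures to the precision given.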
The first versions of the PlayStation 3 featured an Emotion Engine on the motherboard to
achieve backwards compatibility with PlayStation and PlayStation 2 titles. However,
subsequent releases of the PlayStation 3, including the initial PAL release, dropped the
Emotion Engine to lower costs; software emulation is used instead to provide backwards
compatibility.
A graphics processing unit or GPU (also occasionally called visual processing unit or
VPU) is a dedicated graphics rendering device for a personal computer, workstation, or
game console. Modern GPUs are very efficient at manipulating and displaying computer
graphics, and their highly parallel structure makes them more effective than general-
purpose CPUs for a range of complex algorithms. A GPU can sit on top of a video card,
or it can be integrated directly into the motherboard. More than 90% of new desktop and
notebook computers have integrated GPUs, which are usually far less powerful than their
add-in counterparts.
GRAPHICS ACCELERATORS
A GPU (Graphics Processing Unit) attached to the graphics card works alongside the
CPU, taking over rendering work and making the graphics card perform better.
A graphics accelerator incorporates custom microchips which implement special
mathematical operations commonly used in graphics rendering. The efficiency of the
microchips therefore determines the effectiveness of the graphics accelerator. They are
mainly used for playing 3D games or high-end 3D rendering.
PARALLAX PROPELLER
The Parallax P8X32 Propeller is a parallel microcontroller with eight 32-bit RISC CPU
cores, introduced in 2006.
The Parallax Propeller, its built-in Spin programming language and bytecode
interpreter, and the "Propeller Tool" integrated programming environment were all
designed by a single person, Parallax's co-founder and president Chip Gracey.
The Propeller can be clocked using either an internal, on-chip oscillator (providing a
lower total parts count, but sacrificing some accuracy and thermal stability) or an
external crystal or resonator (providing higher maximum speed with greater accuracy at
an increased total cost). Either of these sources may be run through an on-chip PLL clock
multiplier, which may be set at 1x, 2x, 4x, 8x, or 16x.
Both the on-board oscillator frequency (if used) and the PLL multiplier value may be
changed at run-time. If used correctly, this can improve power efficiency; for example,
the PLL multiplier can be decreased before a long "no operation" wait required for
timing purposes, then increased afterwards, causing the processor to use less power.
However, the utility of this technique is limited to situations where no other cog is
executing timing-dependent code (or is carefully designed to cope with the change), since
the effective clock rate is common to all cogs.
The effective clock rate ranges from 32 kHz up to 80 MHz (with the exact values
available for dynamic control dependent on the configuration used, as described above).
When running at 80MHz, the proprietary interpreted Spin programming language
executes approximately 80,000 instruction-tokens per second on each core, giving 8
times 80,000 for 640,000 high level instructions per second. Most machine-language
instructions take 4 clock-cycles to execute, resulting in 20 MIPS per cog, or 160 MIPS in
total for an 8-cog Propeller.
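The throughput arithmetic above can be checked directly (the numbers are taken from the text; the helper names are our own):

```python
def spin_tokens_per_second(cogs, tokens_per_cog=80_000):
    """Interpreted Spin: ~80,000 instruction-tokens/s per cog at 80 MHz."""
    return cogs * tokens_per_cog

def native_mips(clock_mhz=80, cycles_per_instr=4, cogs=8):
    """Most machine instructions take 4 clocks: 80 MHz / 4 = 20 MIPS/cog."""
    per_cog = clock_mhz // cycles_per_instr
    return per_cog, per_cog * cogs

print(spin_tokens_per_second(8))  # 640000 high-level instructions/s
print(native_mips())              # (20, 160): 20 MIPS/cog, 160 MIPS total
```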
In addition to lowering the clock rate to that actually required, power consumption can be
reduced by turning off cogs (which then use very little power), and by reconfiguring I/O
pins which are not needed, or can be safely placed in a high-impedance state ("tristated"),
as inputs. Pins can be reconfigured dynamically, but again, the change applies to all cogs,
so synchronization is important for certain designs.
FPGAs are usually slower than their application-specific integrated circuit (ASIC)
counterparts, cannot handle as complex a design, and draw more power (for any given
semiconductor process). But their advantages include a shorter time to market, ability to
re-program in the field to fix bugs, and lower non-recurring engineering costs. Vendors
can sell cheaper, less flexible versions of their FPGAs which cannot be modified after the
design is committed. The designs are developed on regular FPGAs and then migrated
into a fixed version that more resembles an ASIC.
Complex programmable logic devices (CPLDs) are an alternative for simpler designs.
The term PPU (physics processing unit) was coined by Ageia's marketing to describe
their PhysX chip to consumers. Several other technologies in the CPU-GPU spectrum
have some features in common with it, although Ageia's solution is the only complete
one designed, marketed, supported, and placed within a system exclusively as a PPU.
A dual core processor is exactly what it sounds like: two processor cores on one die,
essentially a dual processor system within a single processor. AMD's Opteron has been
dual-processor capable since its inception, because Opteron was designed with an extra
HyperTransport link. The relevance of that link was mostly overlooked.
HyperTransport Technology simply means a faster connection that is able to transfer
more data between two chips. This does not mean that the chip itself is faster. It means
that the capability exists via the HyperTransport pathway for one chip to "talk" to another
chip or device at a faster speed and with greater data throughput.
We knew that HyperTransport Technology would provide for a faster connection to
system memory, the GPU and the rest of the motherboard but back in the fall of 2003 we
thought of the extra HyperTransport link as a connection to another physical processor.
It didn't dawn on us that the "extra" processor could be on the same die. While some will
say "I knew that" most didn't pick up on it.
AMD have the added punch of being able to drop their dual core Opteron processors into
existing 940-pin sockets. This upgrade path is extremely favorable as all it will require is
a processor swap and, perhaps, a BIOS update.
Intel are continuing with their Pentium 4 cores by releasing two flavors codenamed
Paxville and Dempsey. The codenames will very likely change once the marketing
department gets their hands on it as "Introducing the new Dempsey" has a very lackluster
ring to it.
Mac-oriented Think Secret posted IBM's plans for the PowerPC 970MP, codenamed
Antares and rumored to clock in at 3 GHz with a 1 GHz EI (Elastic Interface) bus.
The horses are now in the paddock. AMD, Intel and Mac loyalists are beginning to
group at the fence to eye up their favorite and the competition. The post parade is still a
ways off and with post time now set at mid-2005 it's anybody's guess who will be out of
the gate first.
HARDWARE TREND
The general trend in processor development has been from multi-core to many-core:
from dual-, quad-, eight-core chips to ones with tens or even hundreds of cores; see
manycore processing unit. In addition, multi-core chips mixed with simultaneous
multithreading, memory-on-chip, and special-purpose "heterogeneous" cores promise
further performance and efficiency gains, especially in processing multimedia,
recognition and networking applications. There is also a trend of improving energy
efficiency by focusing on performance-per-watt with advanced fine-grain or ultra fine-
grain power management and dynamic voltage and frequency scaling (DVFS).
SOFTWARE IMPACT
Software benefits from multicore architectures where code can be executed in parallel.
Under most common operating systems this requires code to execute in separate threads
or processes. Each application running on a system runs in its own process so multiple
applications will benefit from multicore architectures. Each application may also have
multiple threads but, in most cases, it must be specifically written to utilize multiple
threads. Operating system software also tends to run many threads as a part of its normal
operation. Running virtual machines will benefit from adoption of multiple core
architectures since each virtual machine runs independently of others and can be
executed in parallel.
Most application software is not written to use multiple concurrent threads intensively
because of the challenge of doing so. A frequent pattern in multithreaded application
design is where a single thread does the intensive work while other threads do much less.
For example, a virus scan application may create a new thread for the scan process, while
the GUI thread waits for commands from the user (e.g. cancel the scan). In such cases,
multicore architecture is of little benefit for the application itself due to the single thread
doing all heavy lifting and the inability to balance the work evenly across multiple cores.
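The pattern described above, one busy worker thread plus a mostly idle control thread, can be sketched as follows (a generic illustration, not any real scanner's code):

```python
import threading
import queue

def scan_files(files, results):
    """Worker thread: does all the heavy lifting on one core."""
    for name in files:
        results.put((name, "clean"))   # stand-in for real scanning work

results = queue.Queue()
worker = threading.Thread(target=scan_files,
                          args=(["a.txt", "b.txt"], results))
worker.start()
# The "GUI" thread would just wait for user commands (e.g. cancel);
# it does almost no work, so a second core gains little here.
worker.join()
print([results.get() for _ in range(2)])
```

Because only the worker is compute-bound, adding cores does not speed this program up; it merely keeps the interface responsive.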
Programming truly multithreaded code often requires complex co-ordination of threads
and can easily introduce subtle and difficult-to-find bugs due to the interleaving of
processing on data shared between threads (thread-safety). Consequently, such code is
much more difficult to debug than single-threaded code when it breaks. There has been a
perceived lack of motivation for writing consumer-level threaded applications because of
the relative rarity of consumer-level multiprocessor hardware. Although threaded
applications incur little additional performance penalty on single-processor machines, the
extra overhead of development has been difficult to justify due to the preponderance of
single-processor machines.
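A minimal sketch of the kind of thread-safety problem described above, and the usual fix of guarding shared data with a lock (the counter and sizes are arbitrary illustrations):

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, holding the lock each time."""
    global counter
    for _ in range(n):
        # Without the lock, this read-modify-write could interleave
        # between threads and lose updates: a classic race condition.
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(50_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000: with the lock, no increments are lost
```

Remove the lock and the final count becomes nondeterministic, which is exactly the sort of subtle, hard-to-reproduce bug the text warns about.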
As of September 2006, with the typical mix of mass-market applications the main benefit
to an ordinary user from a multi-core CPU will be improved multitasking performance,
which may apply more often than expected. Ordinary users are already running many
threads; operating systems utilize multiple threads, as well as antivirus programs and
other 'background processes' including audio and video controls. The largest boost in
performance will likely be noticed in improved response time while running CPU-
intensive processes, like antivirus scans, defragmenting, ripping/burning media (requiring
file conversion), or searching for folders. For example, if the automatic virus scan
initiates while a movie is being watched, the movie is far less likely to lag, as the
antivirus program will be assigned to a different processor than the processor running the
movie playback.
Given the increasing emphasis on multicore chip design, stemming from the grave
thermal and power consumption problems posed by any further significant increase in
processor clock speeds, the extent to which software can be multithreaded to take
advantage of these new chips is likely to be the single greatest constraint on computer
performance in the future. If developers are unable to design software to fully exploit the
resources provided by multiple cores, then they will ultimately reach an insurmountable
performance ceiling.
The telecommunications market was one of the first to need a new design for parallel
datapath packet processing, and adoption of multi-core processors for the datapath and
the control plane there was very quick. These MPUs are set to replace the traditional
network processors that were based on proprietary micro- or pico-code. 6WIND was the
first company to provide embedded software for these applications.
PARALLEL PROGRAMMING
Parallel programming techniques can benefit from multiple cores directly. Some existing
parallel programming models such as OpenMP and MPI can be used on multi-core
platforms. Intel introduced a new abstraction for C++ parallelism called TBB. Other
research efforts include the Codeplay Sieve System, Cray's Chapel, Sun's Fortress, and
IBM's X10.
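OpenMP, MPI, and TBB target C, C++, and Fortran; the same data-parallel shape can be sketched in Python with a process pool standing in for an OpenMP-style parallel loop (an analogy, not an equivalent):

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    """The loop body: applied independently to each element."""
    return x * x

if __name__ == "__main__":
    # Like a "parallel for": the executor splits the iteration space
    # across worker processes, typically one per available core.
    with ProcessPoolExecutor() as ex:
        out = list(ex.map(square, range(8)))
    print(out)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The key property shared with OpenMP's worksharing loops is that iterations are independent, so they can be distributed across cores without communication.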
PARTITIONING
COMMUNICATION
The tasks generated by a partition are intended to execute concurrently but cannot, in
general, execute independently. The computation to be performed in one task will
typically require data associated with another task. Data must then be transferred between
tasks so as to allow computation to proceed. This information flow is specified in the
communication phase of a design.
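The information flow specified in the communication phase can be made concrete as an explicit channel between a producing and a consuming task (a minimal sketch; the queue stands in for whatever message mechanism the target machine provides):

```python
import threading
import queue

channel = queue.Queue()   # explicit communication link between two tasks

def producer(n):
    for i in range(n):
        channel.put(i * 2)   # data the other task's computation needs
    channel.put(None)        # sentinel: no more data

def consumer(out):
    while True:
        item = channel.get()
        if item is None:
            break
        out.append(item)

out = []
t1 = threading.Thread(target=producer, args=(4,))
t2 = threading.Thread(target=consumer, args=(out,))
t1.start(); t2.start(); t1.join(); t2.join()
print(out)  # [0, 2, 4, 6]
```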
AGGLOMERATION
In the third stage, we move from the abstract toward the concrete. We revisit decisions
made in the partitioning and communication phases with a view to obtaining an
algorithm that will execute efficiently on some class of parallel computer. In particular,
we consider whether it is useful to combine, or agglomerate, tasks identified by the
partitioning phase, so as to provide a smaller number of tasks, each of greater size. We
also determine whether it is worthwhile to replicate data and/or computation.
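Agglomeration can be illustrated by grouping many fine-grained tasks into fewer, coarser ones, trading some parallel slack for lower per-task overhead (the sizes below are arbitrary):

```python
def agglomerate(tasks, group_size):
    """Combine fine-grained tasks into fewer, larger tasks."""
    return [tasks[i:i + group_size]
            for i in range(0, len(tasks), group_size)]

fine = list(range(12))          # 12 tiny tasks from the partitioning phase
coarse = agglomerate(fine, 4)   # 3 larger tasks, less scheduling overhead
print(len(coarse), coarse[0])   # 3 [0, 1, 2, 3]
```

Choosing the group size is the design decision: too small and overhead dominates, too large and there are fewer tasks than processors.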
MAPPING
In the fourth and final stage of the parallel algorithm design process, we specify where
each task is to execute. This mapping problem does not arise on uniprocessors or on
shared-memory computers that provide automatic task scheduling.
On the other hand, on the server side, multicore processors are ideal because they allow
many users to connect to a site simultaneously and have independent threads of
execution. This allows for Web servers and application servers that have much better
throughput.
What follows is the most basic explanation of what a processor pipeline is. First, the
processor needs a set of instructions to work on. The processor loads those instructions
into the pipeline; think of the pipeline as a conveyor belt, with the data processed
sequentially, one item after another.
The AMD processor pipeline is shorter than the Intel processor pipeline, and this is one
of the reasons why AMD runs at a lower clock speed.
Pipelining, like most things in life, is good in moderation. Making a processor's pipeline
too short forces a longer minimum clock period, which hinders the manufacturer's ability
to ramp up the clock speed. Making the pipeline very long allows faster clock speeds,
but it also increases the cost of stalls and flushes, which hurts performance and increases
the amount of resources required to pipeline the processor.
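The trade-off can be made concrete with the classic idealized pipeline timing model: n instructions through a d-stage pipeline take roughly d + (n - 1) cycles, plus a full d-cycle refill after every flush. The depths and flush counts below are chosen only for illustration:

```python
def pipeline_cycles(instructions, depth, flushes=0):
    """Idealized cycle count: fill the pipeline once, retire one result
    per cycle afterwards, and pay a full refill (depth cycles) for every
    flush or branch mispredict. Real pipelines have many more effects."""
    return depth + (instructions - 1) + flushes * depth

# A longer pipeline permits a faster clock but pays more per mispredict:
print(pipeline_cycles(1000, depth=12, flushes=10))   # 1131
print(pipeline_cycles(1000, depth=31, flushes=10))   # 1340
```

The deeper pipeline loses more cycles to the same ten flushes, which is exactly why very long pipelines need higher clock speeds just to break even.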
In layman's terms think of the processor as a carpenter. The carpenter's truck is system
memory and the cache are the tools he's packed into the house for the job. The carpenter
has anticipated what tools he may need to do the job. If the tool is not at hand then he
must go back to the truck to get the right tool thus slowing down the job at hand.
Two pairs of hands make the work go faster. This is quite true in computers with dual
processors especially with SMP (Symmetric Multiprocessing) software. Not all software
is SMP aware. In fact only a small percentage of it is. SMP capability is something that
must be written into the code. The program must know that it can utilize two processors
to complete processes simultaneously. This is known as multithreading.
Architecturally, a dual core processor sits between a single core processor and a dual
processor system. A dual core processor has two cores but shares some of the other
hardware, such as the memory controller and bus; a dual processor system has
completely separate hardware and shares nothing between the processors.
A dual core processor won't be twice as fast as a single core processor nor will it be as
fast as a dual processor system.
It will fall somewhere in the middle but there are going to be specific advantages.
There will be two pipelines and that means there can be two sets of instructions being
carried out simultaneously.
There will also be two processor caches to keep more of the necessary "tools" or data on
the processor die for faster access.
The trick will be the bus. If everyone wants on the bus at the same time then there will be
the Keystone Cops comedy of errors as everyone tries to squeeze through the door at the
same time. The two processor cores have to be designed to be smart enough to "wait" for
the other to finish accessing the bus.
Now all of this is happening at the nanosecond level so don't think there's time for a
coffee. Nanosecond wait states mean there's not even enough time to THINK about
thinking about having a coffee.
TO SMP OR NOT TO SMP?
The processor engineers have probably already thought about tackling the SMP situation.
What good is a dual core processor if the software recognizes and uses only one of the
cores? The majority of software is not written to utilize multithreading at
present. This breaks open a whole new can of worms in concepts of parallel computing.
Intel's Hyper-Threading, which presents one physical core as two logical processors, is a
single-processor variation on the dual core idea. AMD has taken it one step further with
two physical cores on one processor die.
Could AMD's engineers have cracked the hardware problem of a dual core processor and
load balancing a program that isn't written for multithreading?
This is where dual core processors could fall short of expectations for mainstream users.
If the software cannot "see" the second processor then it will not benefit from it.
Programs, such as Adobe Photoshop, are SMP aware and are much faster on a dual
processor system. There is no doubt that a program like Photoshop will be much faster on
a dual core system than its single core counterpart. The majority of operating systems do
recognize and support at least two processors. There is some load balancing of non-SMP
applications, but it is not as efficient as with applications written for multithreading.
More and more PC users run their systems 24 hours a day to permit functions such as
downloading files, running backups, scanning for viruses and operating Web servers.
PCs for these operations should be quiet and energy-efficient, while offering sufficient
performance for the applications of both today and tomorrow.
AMD's Turion 64 and Intel's Pentium M are the thriftiest processors when it comes to
energy consumption, but, strictly speaking, both are becoming obsolete. The future
belongs to dual core processors, as they provide substantial performance enhancements
for a low-power PC.
If you look for a dual core processor, the obvious choices will be the AMD Athlon 64 X2
and the Intel Pentium D. We only recommend the 65 nm version of the Pentium D (the
900 series), because the aged 90 nm 800 series suffers from high thermal dissipation.
Clearly, the Athlon 64 X2 offers superior performance and efficiency, but we are looking
for high efficiency solutions, which draws our attention to Intel's latest mobile dual core
processor: the Core Duo.
Intel's Core Duo is a 65 nm part and offers optimized performance per clock cycle thanks
to its reconditioned microarchitecture, while drawing no more power than its
predecessor, the Pentium M. Intel specifies a maximum design power of 31 W, which is
an excellent result when put in the context of its performance. The Centrino Duo launch
was spoiled by a USB power drain issue, which causes battery time to decrease
dramatically. Since this proved to be a software issue, it is Microsoft courting our
resentment now, as the promised patch has still not arrived. Luckily, this does not affect
desktop applications, nor does it change our assessment of the Core Duo processor being
one of the finest we have seen to date; it would be very appealing for a low-power PC.
The Core Duo uses the same processor socket as the Pentium M, but requires some
electrical modifications, which means that you cannot use existing Socket 479
motherboards. Suitable products are not yet available, but several motherboard makers
are working on them. We had a look at one of the first solutions back in March, AOpen's
i975Xa-YDG. This is a full-fledged Core Duo ATX motherboard, but its 975X
chipset is not known to be energy efficient. As there is currently no alternative, we
decided to go for this motherboard and a Core Duo T2600 processor running at 2.16 GHz
and FSB667.
AMD is getting ready to release the Turion 64 X2 dual core processor, but until this
delicacy is served up, we will stick with the desktop dual core Athlon 64 X2 3800+ at 2.0
GHz and a Biostar TForce 6100-939 motherboard. Although the processor itself requires
more energy than the Core Duo, the entire AMD64 platform is more energy efficient, so
this promises to be a very interesting competition.
We also added a Pentium M 780 on an MSI 915GM Speedster and a Turion 64 MT-40 on
a K8NGM-V to the lineup. Notice that these motherboards come with integrated
graphics, but for the sake of a fair comparison with the AOpen i975Xa-YDG, we ran all
systems with a dedicated graphics card.
TYPICAL PC ENERGY CONSUMPTION REVIEWED
The table above shows the power draw for typical components, as well as the potential
energy savings from using efficient PC parts. You will notice that the savings can vary
heavily; this is due to many product options for various components, and their power
differences:
PROCESSOR
Energy consumption rises sharply with clock speed: dynamic power scales linearly with
frequency, and higher clocks typically require higher voltage, with power scaling as the
square of that voltage. Accordingly, reducing the clock speed will reduce the power
draw, especially if energy saving features such as Cool & Quiet (AMD) or SpeedStep
(Intel) are used to reduce the operating voltage.
this context, the processor type makes a huge difference as well: a modern 65 nm
Pentium D 900 or Pentium 4 6x1 series runs cooler and wastes less energy than the 90
nm Pentium D 800 and Pentium 4 500/600 processors. AMD processors show similar
effects from one generation to the next, but the impact is less dramatic due to AMD's
more elaborate SOI (silicon on insulator) manufacturing and processor architecture.
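The interplay of clock speed and voltage described above can be sketched with the standard dynamic power approximation, P ≈ C·V²·f. The capacitance and voltage/frequency pairs below are illustrative assumptions, not measured values for any particular CPU:

```python
def dynamic_power(capacitance_nf, voltage_v, freq_mhz):
    """Approximate dynamic CPU power in watts: P = C * V^2 * f."""
    return capacitance_nf * 1e-9 * voltage_v ** 2 * freq_mhz * 1e6

# Full speed: 2.2 GHz at 1.40 V (illustrative values)
full = dynamic_power(15.0, 1.40, 2200)
# A Cool'n'Quiet / SpeedStep state: 1.0 GHz at 1.10 V
throttled = dynamic_power(15.0, 1.10, 1000)

print(f"full: {full:.1f} W, throttled: {throttled:.1f} W")
```

Because voltage enters squared, lowering frequency *and* voltage together cuts power far more than lowering frequency alone, which is exactly why these energy saving features pay off.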
PLATFORM
This term refers to the motherboard, including the chipset and on-board components. It
comprises functional parts such as audio chips and additional controllers, as well as basic
components such as voltage regulators. Intel's current desktop chipsets are not
particularly efficient, while Athlon 64 core logic benefits from the memory controller
being integrated into AMD's current processors. However, motherboards that use a
mobile chipset rather than a desktop version require considerably less energy.
GRAPHICS
Today's games squeeze ever more visual effects and even physics calculations out of the
graphics processor, but this comes at a tremendous energy price
due to the several hundred million transistors used. The basic power draw just for
displaying the Windows screen can be 15 to 30 W.
As the 3D units become active, power consumption increases further; a modern graphics
card will convert from 50 to 120 W of electricity into heat. High-end graphics cards even
come with a separate power connector to satisfy requirements that exceed the power
supply specifications for PCI Express. Of course, dual graphics setups - ATI Crossfire or
Nvidia SLI - will almost double the graphics power requirements.
POWER SUPPLY
Power supplies become less efficient the closer they run to their maximum output, which
means that a larger amount of energy will be converted into heat. It is difficult to provide
precise numbers, however, because the degree of efficiency varies with the load.
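The effect described above can be illustrated with a rough calculation. The efficiency figures are assumptions for the sake of the example, since, as noted, efficiency varies with load:

```python
def wall_draw(dc_load_w, efficiency):
    """Power drawn from the wall outlet for a given DC load and PSU efficiency."""
    return dc_load_w / efficiency

# The same 200 W system load through two hypothetical power supplies:
print(wall_draw(200, 0.80))  # 250.0 W from the wall at 80% efficiency
print(wall_draw(200, 0.68))  # ~294 W at 68%; the extra ~44 W becomes heat
```

The components never see the difference; the wasted watts simply heat up the power supply.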
OTHER
There are other components where power is a concern, such as the hard drive or optical
drive, but with these the energy consumption is usually under 10 W. Using 2.5" or even
1.8" hard drives helps to reduce power consumption, but this has a noticeable negative
impact on performance. Since the difference in power is not very large, we recommend
focusing on other components first.
AMD PERFORMANCE COMPARISON
This is a comparison of two systems with virtually identical hardware. The video card
and hard drive used were the same brand and model. The amount of RAM is identical (2x
1GB PC3200) with the only difference being that the Opteron system used ECC memory.
The real differences were just the motherboard and CPUs. For the Opteron system we
have two model '248' processors running at 2.2 GHz each, with 1 MB of cache apiece.
They are running on a Tyan Thunder K8WE board, which uses an nVidia nForce
Professional chipset. The single-CPU solution is an Athlon64 X2 4400+, with two cores
each running at 2.2 GHz and each sporting a 1 MB cache. This processor was installed in an Asus A8N-
SLI Premium motherboard, utilizing an nVidia nForce4 SLI chipset.
As you can see, graphics performance is very similar, with 3DMark05 scores only 1 point
apart and less than 4% variation in the 3DMark03 scores. Looking closer, we can also see
that the important specific metrics in PCMark04 are very similar as well -- the biggest
difference is seen in the additional overhead of ECC impacting the memory performance.
All around, I would say that with the AMD platform there is little noticeable difference
between dual-core and dual processors.
For those with plenty of money to burn, it is also common for us to build an AMD
Opteron system with a dual-CPU motherboard, using a dual core CPU in each socket.
That gives a grand total of four functional CPU cores! This setup is especially desirable if
you need to have multiple heavy duty applications open (CAD, video editing, and
modeling come to mind) - just make sure you complement those processors with plenty
of memory.
INTEL PERFORMANCE COMPARISON
Here again we see fairly close performance in graphics, with the Xeon system in a very
slight lead. In more performance-oriented tests, however, we see the Pentium system
tending to pull ahead by a fair margin. This is most likely due to its significant memory
speed advantage, but that advantage is itself a valid and important result. The RAM that was
used in the Pentium D system is standard for that platform, but even if we wanted to, we
could not build a Xeon setup with the same speed of memory. So while the processors
may be very comparable in performance, the overall win definitely goes to the Pentium D
dual-core platform.
ADVANTAGES
• CACHE COHERENCY
The proximity of multiple CPU cores on the same die allows the cache coherency
circuitry to operate at a much higher clock rate than is possible if the signals have to
travel off-chip.
These higher quality signals allow more data to be sent in a given time period since
individual signals can be shorter and do not need to be repeated as often.
Assuming the die can physically fit into the package, multi-core CPU designs require
much less printed circuit board (PCB) space than multi-chip SMP designs.
• LESS POWER USAGE
A dual-core processor uses slightly less power than two coupled single-core processors,
principally because of the increased power required to drive signals external to the chip
and because the smaller silicon process geometry allows the cores to operate at lower
voltages.
• REDUCES LATENCY
Because the cores reside on the same die, signals between them travel much shorter
distances, which reduces communication latency. Furthermore, the cores can share some
circuitry, like the L2 cache and the interface to the front side bus (FSB).
Thus, in terms of competing technologies for the available silicon die area, multi-core
design can make use of proven CPU core library designs and produce a product with
lower risk of design error than devising a new wider core design. Also, adding more
cache suffers from diminishing returns.
DISADVANTAGES
• OPERATING SYSTEM AND SOFTWARE SUPPORT
Operating systems and existing software may require adjustments to make full use of the
computing resources provided by multi-core processors. Also, the ability of multi-core
processors to increase application performance depends on the use of multiple threads
within applications.
Finally, raw processing power is not the only constraint on system performance. Two
processing cores sharing the same system bus and memory bandwidth limits the real-
world performance advantage. If a single core is close to being memory bandwidth
limited, going to dual-core might only give 30% to 70% improvement. If memory
bandwidth is not a problem, a 90% improvement can be expected. It would be possible
for an application that used 2 CPUs to end up running faster on one dual-core if
communication between the CPUs was the limiting factor, which would count as more
than 100% improvement.
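The speedup ranges quoted above can be framed as a toy model that combines Amdahl's law with a bandwidth-contention penalty. Both parameters are illustrative assumptions, not measurements:

```python
def dual_core_speedup(parallel_fraction, bandwidth_penalty):
    """Toy estimate of dual-core speedup over a single core.

    parallel_fraction: share of the work that can run on both cores (Amdahl).
    bandwidth_penalty: fractional slowdown from both cores sharing the
    system bus and memory bandwidth (0.0 = no contention).
    """
    serial = 1.0 - parallel_fraction
    amdahl = 1.0 / (serial + parallel_fraction / 2.0)
    return amdahl * (1.0 - bandwidth_penalty)

# Ample memory bandwidth: close to the ~90% improvement mentioned above.
print(dual_core_speedup(0.95, 0.00))
# Bandwidth-contended workload: gains shrink toward the 30-70% range.
print(dual_core_speedup(0.95, 0.20))
```

The model is deliberately crude, but it captures the article's point: the same application can land anywhere in that range depending on how memory-bound it is.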
CONCLUSION
Our conclusion for the Core Duo processor is particularly interesting, because in theory it
is capable of enabling the assembly of a dual core desktop system that requires only 45
W. This, however, requires a motherboard that uses the 945GM chipset, which is not yet
available. Using AOpen's 975X motherboard forces the user to go for discrete graphics,
which drives the system power consumption to a minimum of 70 W. Several
motherboard companies are currently working on desktop motherboards for Core Duo,
but most do not bring the 945GM chipset to the desktop. Core Duo is a great product
and is very efficient, but it stumbles due to the lack of a suitable low-power desktop
platform. (Ironically, it was Intel who has been trying to refocus the industry to think
about platforms...)
Efficient systems depend on efficient components, and there is one component that lately
has turned into a serious energy hog: the graphics card. The $500 monsters from ATI and
Nvidia consume 100 W or more when doing their 3D work. Even the basic requirement
is at least around 20 W, which would seem to make high performance and energy
efficiency mutually exclusive.
But wait a minute - there are graphics solutions that work without excessive power
requirements. Have a look at the products that both ATI and Nvidia send into multimedia
and gaming notebooks: there is the Mobility Radeon and the GeForce Go, both of
which include technology that helps conserve energy. How about offering PCI Express
graphics cards that are based on mobile graphics solutions? Although the market
certainly is not huge, there would definitely be interest. Such products
would also help in the comparison of different mobile graphics solutions, by running
them on a reference system.
The bottom line question is simple: does a low-power PC make sense for you? We
recommend first finding out your energy costs per kWh (kilowatt-hour), and then
calculating how much energy would be consumed over the course of a year if you
operate the system 24/7. An example would be 100 W of power consumption times 24
hours, times 365 days. The result is 876 kWh, which you have to multiply by your
energy cost per kWh.
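The 24/7 calculation described above can be written as a short script. The electricity price used here is a placeholder; substitute your own rate per kWh:

```python
def annual_energy_cost(power_w, price_per_kwh, hours_per_day=24):
    """Return (kWh per year, cost per year) for a system running every day."""
    kwh_per_year = power_w * hours_per_day * 365 / 1000.0
    return kwh_per_year, kwh_per_year * price_per_kwh

# 100 W around the clock at an assumed $0.15 per kWh:
kwh, cost = annual_energy_cost(100, 0.15)
print(f"{kwh:.0f} kWh/year -> ${cost:.2f}")  # 876 kWh/year -> $131.40
```

Running the same numbers for two candidate systems makes it easy to see whether a more efficient configuration pays for itself within its useful life.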
It is obvious that any high-end component would spoil your saving efforts. For example,
we cannot recommend buying a Core Duo T2600 in order to reduce your energy bill.
Instead, go for the mid range; here, Turion 64 solutions currently offer the best bang for
the energy saving buck. But the Turion 64 X2 will be available soon, and the first
945GM motherboards for Core Duo should also hit retail in the not too distant future.
We expect both to renew the energy efficiency debate.