
Part I

Introduction to Parallelism

Chapter 1

Why parallel computation?



1.1

The only hope for faster computers

Almost all computation done during the first forty years of the history of computers could be called sequential. One of the characteristics of sequential computation is that it employs a single processor to solve some problem. (Here, the term problem is used in the broad sense, i.e., performing some task.) These processors had become continuously faster (and cheaper) during the first three decades, doubling their speed every two or three years. (See Fig. 1.1.) However, due to the limit that the speed of light imposes on us, it seems extremely unlikely that we can build uni-processor computers (i.e., computers that contain only one processor) that can achieve performance significantly higher than 1,000,000,000 floating-point operations per second, usually called 1 Gflops. The unit flops is a widely used measure of floating-point performance: it equals the rate at which a machine can perform single-precision floating-point operations, i.e., how many such operations the computer can perform in a unit of time (a second in our case). As with physical quantities, computing power is measured using the prefixes kilo (1K = 10^3), mega (1M = 10^6), giga (1G = 10^9) and tera (1T = 10^12). If the computer does not have the capability to handle floating-point operations in hardware, we use the term ips instead, which stands for instructions per second.
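As a rough illustration of what a flops measurement involves, the sketch below (not from the text; the loop size and coding style are arbitrary choices) times a long run of single-precision additions and divides the operation count by the elapsed time. Modern compilers, caches and pipelines make such naive benchmarks unreliable, so the result should be read only as an order-of-magnitude indication.

#include <stdio.h>
#include <time.h>

int main(void) {
    const long n = 100000000L;   /* 10^8 floating-point additions */
    volatile float s = 0.0f;     /* volatile keeps the loop from being optimized away */

    clock_t start = clock();
    for (long i = 0; i < n; i++)
        s = s + 1.0f;            /* one single-precision floating-point operation */
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("about %.1f Mflops\n", n / seconds / 1e6);
    return 0;
}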

It is true that 1 Gflops may sound like quite a large number. However, if you consider the ever-increasing human appetite for computational power (i.e., for solving larger problems faster), the need for an alternative route to sequential computation becomes apparent. Parallel computation seems to be the most promising (if not the only) alternative. So, parallel computation is defined as the practice of employing a (usually large) number of cooperating processors, communicating among themselves to solve large problems fast. (It is not easy to define what "large" is. Even though a machine with only two processors qualifies as a parallel computer, in this book we will focus on machines with at least two orders of magnitude more processors than that, often referred to as massively parallel. Moderately parallel computers are usually treated in different ways.) Parallel computation has quickly become an important area in computer science. During the past five years, it has grown so wide and strong that most of the research conducted in the fields of design and analysis of algorithms, computer languages, computer applications and computer architectures is within its context.

New parallel machines with novel architectures are being built every year. The number of processors in these machines reaches 65,536 today, with some dreaming of building a machine with 128,000 processors in the near future. However, the huge cost of building such machines, combined with the lack of available funding that traditionally came from the military, delays such plans. Nevertheless, the demand for laying the theoretical foundations and, more generally, understanding the nature of parallel computation grows continuously. The issues that arise in this process are many. To mention only a few of them: What computational models can provide both realizable architectures and a basis for abstraction simple enough for understanding parallel computation? What sort of problems are naturally suited for large-scale parallelization? What are the basic techniques for designing parallel algorithms? While no definite answer to these questions exists today, the efforts of the research community have provided a better understanding of parallel computation. In particular, several models have been introduced, a large number of problems have been identified as amenable to parallel computation, and new techniques have been developed to help the design of parallel algorithms. We are not there yet, but we are definitely making progress towards this goal.

1.2

Parallelism in real life

The world of computing until recently was dominated by the sequential way of thinking. In fact, sequential processing has been very successful and has set high standards that parallel processing will have to try hard to match.

Before introducing the parallel way of thinking, let us make sure that we have a clear idea of what problems the sequential way of thinking may have. For example, consider the simple problem of assigning 0 to the 100 memory locations of an integer array A, an initialization performed by most compilers every time arrays are used. The processor would have to execute the code typically produced by the compiler, which visits every memory location of A[1..100] sequentially and assigns the value 0 to it. This code could look like the following. (The compiler would probably produce a for-loop instead of a while-loop, but that is not important here.)

i = 0;
while (i < 100) {
    i++;
    A[i] = 0;
}

This is an operation many of you were taught in your first CS course. It has been observed that students in that introductory course often have some trouble coming up with such a solution, and one of the reasons may be that it seems unnatural to those who are not yet used to "thinking like the computer," as the popular expression goes. Let us explain what we mean by that. Consider the logically equivalent problem of an instructor distributing handouts to her 100 students at the beginning of a class period. It is very unlikely that she would walk around the class giving the handouts to each and every student in a sequential fashion:

take a deep breath;
while (there are students you have not visited yet) {
    visit a student that you have not visited yet;
    give a copy of the handouts to that student;
}

Such an action would be very time consuming and would occupy her for most of the class period, while the students stay idle, not to mention bored. (It is irrelevant here that almost certainly this habit would make the instructor well known around campus.) Instead, the instructor would hand the whole package to the student sitting, say, at an end of the first row, and then return to her previous activity. Each student will be occupied for only a short period of time: receiving the handouts when they reach him, keeping one copy, and handing the remaining copies to the nearest student that has not received handouts yet.

Instructor's action:
    visit the first student;
    give the pile with the handouts to that student;
    continue teaching the class;

Student's action:
    while (the pile with the handouts has not reached you)
        attend the class;
    pick up a copy of the handouts;
    if (there is a student that has not received handouts yet)
        pass up the handouts to this student;

This is a special parallel method used commonly in assembly lines, known as the pipeline, and in the classroom example it is, apparently, preferable. In parallel computing terms, we say that this technique has better load balance than the sequential one, because the task of distributing the handouts is evenly divided among the persons involved.

There is not, however, a single parallel way of doing this job, but several. When time is at a premium, for example when a midterm exam is being handed out, it is important that every student gets a copy of the exam as soon as possible. In this case, the instructor may speed up the process by handing a portion of the exams to the students sitting at the end of each desk row. This process is faster than the previous one by a factor that equals, roughly, the number of desk rows. This approach is more efficient than the first because the total amount of time needed to complete the same task is smaller.

1.2.1

A more interesting example

The pipelining trick we just mentioned probably seems too obvious, and one may think that we could easily adapt sequential machines in an assembly-line fashion to simulate pipelining. Indeed, the first attempts at building parallel computers used exactly this idea, producing the so-called systolic or linear arrays (which we will examine in Chapter 2). But the idea also has applications inside the sequential processor architecture itself. Consider a sequential processor executing assembly-code instructions. Typically, each instruction is split into five pieces (instruction fetch, instruction decoding, operand fetch, evaluation, and store, shown in Figure 1.2), which are executed in a pipeline fashion. In fact, today every sequential processor manufactured takes advantage of parallelism in the form of pipelining instructions (and not only that). But is that all there is to parallel computing? Definitely not! Pipelining is just one of the many parallel techniques that have been developed. Let us see another one, less obvious, yet still simple. Consider the problem of calculating the sum of n numbers stored in an array. For simplicity, let us assume that n is a power of 2.


Figure 1.2: Five stages of the pipelined execution of an instruction. Five instructions are being executed in 10 pipelined parallel steps, instead of 25 sequential ones.


Figure 1.3: Comparison between the sequential (on the left) and parallel ways of summing up eight numbers. The sequential way takes n - 1 = 7 steps, but the parallel one takes only log₂(8) = 3 steps using 4 processors.


As with the distribution of the handouts, a sequential algorithm would require n - 1 steps for this calculation: in the first step the first two numbers are added, in the second step the third number is added to the sum, and so on, until the (n-1)-st step, during which the last number is added. If more processors are available, though, we can speed things up considerably. Here is one possibility. In the first step we have processors add up pairs of consecutive numbers: one processor adds up the first two numbers, another sums up the third and fourth, etc., all the way to the pair of the last two numbers. Now we are left with half as many numbers to add, and we can do the same with them recursively. Therefore, after log₂(n) steps all the numbers have been summed up. (That is why we wanted n to be a power of 2: to make the counting of the parallel steps easier. Of course, the algorithm can be made to work for any n.) For example, summing up a million numbers could take only about 20 parallel steps, versus a million sequential steps. This is quite some saving in time! You may have noted, however, that we also need about n/2 processors (to execute the first parallel step; all the rest need fewer processors). That's not too bad if you have eight numbers to sum, but for a million numbers you need about half a million processors, which seems unthinkable today (yet not impossible in the future). After all, technology moves forward because people think about and plan for what seemed impossible at some point in time!
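The following C fragment is a sequential simulation of this scheme, shown only as an illustration (it is not code from the book, and the particular values in the array are arbitrary). Each pass of the outer loop corresponds to one parallel step: all the additions performed in that pass touch disjoint pairs of cells, so they could be carried out simultaneously, each by a different processor. The outer loop therefore runs log₂(N) times.

#include <stdio.h>

#define N 8   /* number of values to add; assumed to be a power of 2 */

int main(void) {
    int x[N] = {5, 2, 7, 1, 4, 9, 3, 6};

    /* Each iteration of the outer loop is one "parallel step": the inner
       additions are independent of one another, so N/(2*gap) processors
       could perform them at the same time. */
    for (int gap = 1; gap < N; gap *= 2)
        for (int i = 0; i < N; i += 2 * gap)
            x[i] = x[i] + x[i + gap];   /* this pair handled by one processor */

    printf("sum = %d\n", x[0]);   /* after log2(N) steps the total is in x[0] */
    return 0;
}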

1.3

Parallel vs. sequential solutions: The magic box

Probably the first approach one may think of for designing a parallel algorithm is to modify and parallelize an existing sequential one. It would be nice if someone had written a program s2p.c that takes as input a sequential program and produces an equivalent parallel program which runs much faster and exhibits good load balancing (see Figure 1.4).

Figure 1.4: The magic box that converts sequential code to parallel.

After all, this is not very difficult to do for the initialization problem we saw earlier: one has only to write a program that can recognize a code fragment with the general structure

var1 = val1;
while (var1 < val2) {
    var1++;
    var2[var1] = val3;
}

and convert it to equivalent parallel code which divides the operations to be performed among the available processors. This simple solution, of course, requires that there are no data dependencies among the operations assigned to different processors. (We do not present the converted code here, since we have not yet discussed the syntax of parallel code; an illustrative sketch follows for the curious reader.)
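For the curious reader, here is one possible shape of such converted code, written with POSIX threads only because the book's own parallel notation has not been introduced yet. The thread count, the chunking scheme and the 0-based indexing are illustrative assumptions, not the output of any actual s2p.c program.

#include <pthread.h>
#include <stdio.h>

#define N        100   /* size of the array, as in the example above (0-based here) */
#define NTHREADS 4     /* illustrative number of available processors */

static int A[N];

/* Each thread initializes one contiguous chunk of A.  The chunks are
   disjoint, so there are no data dependencies between the threads. */
static void *init_chunk(void *arg) {
    long t  = (long)arg;
    int  lo = t * (N / NTHREADS);
    int  hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
    for (int i = lo; i < hi; i++)
        A[i] = 0;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, init_chunk, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("A[0] = %d, A[%d] = %d\n", A[0], N - 1, A[N - 1]);
    return 0;
}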

Unfortunately, it seems that very few sequential algorithms can be modified to produce significantly faster parallel algorithms. At the same time, even fewer of them have obvious or simple parallel modifications. Moreover, problems that happen to have simple sequential solutions do not necessarily have a practical parallel solution; sometimes they do not have an efficient parallel solution at all! An interesting example of the latter case is depth-first search, a technique which has applications in almost every area of computer science. It has been used as a basis for many problem solutions (according to the myth, it was even used by Theseus to search the Labyrinth looking for the Minotaur) and, as is well known, it has a simple linear-time sequential implementation.

For example, it is used as a technique to search the nodes of a graph for some property. However, despite considerable research efforts, an efficient parallel implementation of this technique has not yet been found. What is worse, there is some evidence that an efficient parallel algorithm for this problem may not even exist! (The fastest deterministic parallel implementation known today runs in O(log^11 n) time using n^3 processors, and even the best randomized algorithm needs O(log^7 n) parallel time and almost n^4 processors; the exact number of processors is n · MM(n), where MM(n) is the sequential time required to multiply two n × n matrices, currently n^2.376 using fast matrix multiplication in the tradition of Strassen's technique.) As a result, all the sequential algorithms that have been developed using depth-first search cannot be easily converted to parallel ones through some magic box. We have to find new parallel solutions that use different methods, as we will see later.

1.4

Interconnection Networks

Parallel computing came of age in the mid-eighties, when chip manufacturers were able to produce large quantities of processor chips economically. So, suddenly, having lots of chips containing processors along with small local memories (the so-called processing elements, or PEs) was not a problem. New problems arose, however:

1. How do you connect all these processors to create a fast machine of cooperating processing elements?

2. How do you program such a machine?

Researchers and manufacturers have, during the last ten years, come up with various designs of interconnection networks for making PEs communicate fast (Figure 1.5): linear (also known as systolic) arrays, rings, meshes (2-dimensional arrays), tori, 3-dimensional arrays, trees, butterflies, hypercubes, and complete graphs (usually implemented through crossbar switches), as well as combinations of them.

Figure 1.5: Various popular interconnection networks: linear array, 2-dimensional array, complete graph, butterfly, hypercube, tree, 3-dimensional array.

Unfortunately, creating a fast and cheap interconnection network has proven to be a more difficult task than initially anticipated. The research continues on all fronts, but manufacturers have postponed the ambitious plans of building machines with thousands of weak processors, and they concentrate on building computers with fewer (e.g., up to a thousand or so) powerful processors.
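To make one of these topologies concrete, consider the hypercube: in a d-dimensional hypercube the 2^d PEs carry d-bit labels, and two PEs are linked exactly when their labels differ in a single bit, so each PE has exactly d neighbors. The short C sketch below (an illustration, not code from the book) prints the connections of a small hypercube by flipping each bit of every label.

#include <stdio.h>

int main(void) {
    int d = 3;                                   /* a 3-dimensional hypercube: 8 PEs */
    for (int node = 0; node < (1 << d); node++) {
        printf("PE %d is connected to:", node);
        for (int bit = 0; bit < d; bit++)
            printf(" %d", node ^ (1 << bit));    /* flip one bit of the label */
        printf("\n");
    }
    return 0;
}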

1.5

Commercial parallel machines and simulators

The last decade saw a plethora of new parallel machines of varying success, abilities, computing power and price tags. There exists an ongoing report, updated regularly, that lists the peak performance of the top 500 supercomputers worldwide. Peak performance is the theoretical top performance of a computer. (It has also been described as the performance that the manufacturer assures you will never actually achieve.) According to the report published in July 1993, the CM-5 computer with 1024 processors and 131 Gflops peak performance was the most powerful computer available; its price tag was about $25 million.

Here are some more of the contestants:

Manufacturer         Model         # Proc    Peak perf. (Gflops)   Linpack (Gflops)
Thinking Machines    CM-5          1024      131                   60
NEC                  SX-3/44       4         26                    23
Intel                Delta         512       20                    14
Cray                 YMP C916      16        15                    14
Hitachi              S-3800/480    4         32                    5.7
MasPar               MP-2216       16384     2.4                   1.6

The table above summarizes the most powerful machines of six manufacturers. Peak performance and Linpack performance are measured in Gflops. The Linpack performance is the performance achieved when running a particular application that solves a system of thousands of linear equations using the Gaussian elimination method. To get a feeling for what these numbers mean, the uni-processor IBM RS/6000 workstation has a peak performance of about 20 Mflops.

1.6

Who Needs Parallel Computers?

At the beginning of this chapter we remarked that the human appetite for greater computing power never ends. This claim may seem to need justification; most people consider computers already extremely powerful and fast. If you use a personal computer for writing letters, browsing the internet, balancing your checkbook and playing chess, then you probably do not need more computing power (unless you are a really good chess player). On the other hand, there are several scientific applications that could certainly use much more computing power. Among them:

Graphics: Volume rendering, virtual reality, ray tracing.

Simulation: Weather prediction, chip verification, oil exploration.

Image Processing: Image enhancement, feature extraction.

Artificial Intelligence: Image recognition, game playing (chess, Go, etc.).

Large Database Searching: Air flight scheduling, DNA matching (the Human Genome Project).

This list can go on and on, since any scientific area could use better and faster computers. In fact, computers are considered to be the laboratories of the future. A biologist or a chemist would not have to mix expensive or dangerous elements in the laboratory to test some scientific hypothesis; she would just do it in front of a computer that simulated the molecules of the elements!

1.7

Parallel I/O: The next challenge

Even though the processing performance of computers has increased considerably in the last few years, input/output (I/O) devices have not kept up with this trend. While processor speed gets roughly eight times faster every ten years, main memory access and disk cycle times decrease by only one third in the same time period. As you can see, this widens the gap between I/O and processing times. So, our ability to use very fast computers efficiently depends on our ability to feed them with data sufficiently fast. This is the so-called parallel I/O bottleneck; the bad news is that it is expected to get worse in the future, and a lot of research is being done today in this direction. The federal High-Performance Computing and Communications program (HPCC) calls for the development of the 3T computer, which will have 1 Tflops computing power, 1 TByte main memory, and 1 TByte/sec I/O bandwidth. The last requirement seems the most daunting.
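To see how quickly such a gap compounds, the sketch below simply multiplies out the growth rates quoted above over a few decades. The rates themselves, and the reading of "decrease by one third" as roughly a 1.5x speedup, are assumptions made only for illustration.

#include <stdio.h>

int main(void) {
    double cpu = 1.0, io = 1.0;
    for (int decade = 0; decade <= 3; decade++) {
        printf("after %d decade(s): cpu %7.1fx, i/o %5.2fx, gap %6.1fx\n",
               decade, cpu, io, cpu / io);
        cpu *= 8.0;    /* processor speed: roughly 8x per decade (as quoted) */
        io  *= 1.5;    /* cycle time drops by a third, i.e. about 1.5x faster */
    }
    return 0;
}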

