This is the first tutorial in the "Livermore Computing Getting Started" workshop. It is intended to provide only a brief overview of the extensive and broad topic of Parallel Computing, as a lead-in for the tutorials that follow it. As such, it covers just the very basics of parallel computing, and is intended for someone who is just becoming acquainted with the subject and who is planning to attend one or more of the other
tutorials in this workshop. It is not intended to cover Parallel Programming in depth, as this would require significantly more time. The tutorial begins with a discussion on parallel computing - what it is and how it's used, followed by a discussion on concepts and terminology associated with parallel computing. The topics of parallel memory architectures and programming models are then explored. These topics are followed by a series of practical discussions on a number of the complex issues
related to designing and running parallel programs. The tutorial concludes with several examples of how to parallelize simple problems. References are included for further self-study.
Traditionally, software has been written for serial computation: a problem is broken into a discrete series of instructions that are executed one after another on a single processor, with only one instruction executing at any moment in time. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: the problem is broken into discrete parts that can be solved concurrently, with instructions from each part executing simultaneously on different processors. Historically, parallel computing has been considered to be "the high end of computing," and has been used to model difficult problems in many areas of science and engineering. Today, commercial applications provide an equal or greater driving force in the development of faster computers; these applications require the processing of large amounts of data in sophisticated ways. Parallel computers still follow the basic von Neumann design (a memory holding both program instructions and data, with a processing unit that fetches and executes them), just multiplied in units. The basic, fundamental architecture remains the same. More info on von Neumann's other remarkable accomplishments: http://en.wikipedia.org/wiki/John_von_Neumann
CPU / Socket / Processor / Core: Contemporary CPUs consist of one or more cores - a distinct execution unit with its own instruction stream. Cores within a CPU may be organized into one or more sockets, each socket with its own distinct memory. When a CPU consists of two or more sockets, usually hardware infrastructure supports memory sharing across sockets.
Node: A standalone "computer in a box," usually comprised of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer.
Task: A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.
Pipelining: Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.
Shared memory: Describes a computer architecture where all processors have direct access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.
Symmetric Multi-Processor (SMP): Shared memory hardware architecture where multiple processors share a single address space and have equal access to all resources - memory, disk, etc.
Distributed memory: In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
Communications: Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network.
Synchronization: The coordination of parallel tasks in real time, very often associated with communications. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.
Granularity: In parallel computing, granularity is a quantitative or qualitative measure of the ratio of computation to communication.
Observed speedup: Observed speedup of a code which has been parallelized, defined as wall-clock time of serial execution divided by wall-clock time of parallel execution. One of the simplest and most widely used indicators for a parallel program's performance.
Parallel overhead: Required execution time that is unique to parallel tasks, as opposed to that for doing useful work. Parallel overhead can include factors such as task start-up and termination time, synchronizations, data communications, and software overhead imposed by parallel languages, libraries, and the operating system.
Massively parallel: Refers to the hardware that comprises a given parallel system - having many processing elements. The meaning of "many" keeps increasing, but currently the largest parallel computers are comprised of processing elements numbering in the hundreds of thousands to millions.
Embarrassingly parallel: Solving many similar but independent tasks simultaneously; little to no need for coordination between the tasks.
Scalability: Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources. Factors that contribute to scalability include hardware (particularly memory-CPU and network bandwidths), the application algorithm, parallel overhead, and characteristics of the specific application.
Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory global address space. Generically, this approach is referred to as "virtual shared memory".
Message Passing Interface (MPI) on SGI Origin 2000. The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.
Implementations: In both cases (POSIX Threads and OpenMP, listed below), the programmer is responsible for determining the parallelism, although compilers can sometimes help. A minimal OpenMP sketch follows the list.
POSIX Threads
OpenMP
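As an illustration of the threads model (a minimal sketch, not code from this tutorial), the OpenMP loop below divides its iterations among a team of threads. The array size and the per-element work are arbitrary choices; compile with an OpenMP-capable compiler (for example with -fopenmp).

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];

    /* iterations of this loop are divided among the threads in the team */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = (double)i * 2.0;

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}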
More Information
Distributed Memory / Message Passing Model
[Figure: distributed memory model]
Implementations: the Message Passing Interface (MPI) is the de facto industry standard for message passing; a minimal MPI sketch is shown below.
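The sketch below (illustrative, not part of the tutorial text) shows the explicit send/receive pairing the message passing model requires: task 0 sends a single value, which exists only in its own memory, to task 1. Run with at least two tasks, for example mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                              /* data lives only in task 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("task 1 received %d\n", value);   /* the matching receive completes the transfer */
    }

    MPI_Finalize();
    return 0;
}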
More Information
Data Parallel Model
[Figure: data parallel model]
Implementations: partitioned global address space (PGAS) languages such as Coarray Fortran and Unified Parallel C (UPC) are representative examples.
Hybrid Model
[Figure: hybrid model with MPI and OpenMP] [Figure: hybrid model with MPI and CUDA]
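A minimal hybrid sketch (an illustration under assumptions, not the tutorial's code): MPI divides the work among tasks, while OpenMP threads share memory within each task; the loop bounds and the summed quantity are arbitrary.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local = 0.0, global = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* threads within this MPI task cooperate on the local partial sum */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000; i++)
        local += (double)(rank * 1000 + i);

    /* MPI then combines the per-task partial sums across tasks */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}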
SPMD and MPMD
Single Program Multiple Data (SPMD)
[Figure: SPMD model]
Multiple Program Multiple Data (MPMD)
[Figure: MPMD model]
Designing Parallel Programs
Automatic vs. Manual Parallelization
Fully Automatic
Programmer Directed
Understand the Problem and the Program
Programs = algorithms + data + (hardware). Determine whether the problem can actually be parallelized.
Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation. This problem can be solved in parallel: each of the molecular conformations is independently determinable, and the calculation of the minimum energy conformation is also a parallelizable problem.
Calculation of the first 10,000 members of the Fibonacci series (0,1,1,2,3,5,8,13,21,...) by use of the formula F(n) = F(n-1) + F(n-2). The calculation of the F(n) value uses those of both F(n-1) and F(n-2), which must be computed first, so the terms cannot be calculated independently in this form. An example of a parallel algorithm for solving this problem uses Binet's formula, which computes each term directly: F(n) = (phi^n - psi^n) / sqrt(5), where phi = (1 + sqrt(5))/2 and psi = (1 - sqrt(5))/2.
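A minimal C sketch of the Binet approach (illustrative only): because each term is computed directly from its index, the loop iterations are independent and can be shared among threads. Double precision is only exact up to roughly F(70), so a real code for 10,000 terms would need arbitrary-precision arithmetic; the range below is kept small for that reason.

#include <math.h>
#include <stdio.h>

#define NTERMS 40   /* illustrative range; doubles lose exactness near F(70) */

int main(void) {
    const double sqrt5 = sqrt(5.0);
    const double phi = (1.0 + sqrt5) / 2.0;
    const double psi = (1.0 - sqrt5) / 2.0;
    double fib[NTERMS + 1];

    /* each term depends only on its own index, so iterations are independent */
    #pragma omp parallel for
    for (int n = 0; n <= NTERMS; n++)
        fib[n] = round((pow(phi, n) - pow(psi, n)) / sqrt5);

    for (int n = 0; n <= NTERMS; n++)
        printf("F(%d) = %.0f\n", n, fib[n]);
    return 0;
}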
Partitioning
Domain Decomposition
In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data. [Figure: domain decomposition] There are different ways to partition data. [Figure: partitioning examples]
Functional Decomposition
In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work. [Figure: functional decomposition] Functional decomposition lends itself well to problems that can be split into different tasks. For example:
Ecosystem Modeling
Each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step. [Figure: ecosystem modeling]
Signal Processing
An audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. When it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy. [Figure: signal processing]
Climate Modeling
Each model component can be thought of as a separate task. Arrows represent exchanges of data between components during computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on. [Figure: climate modeling; complex relationships between climate and atmospheric modeling components]
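As a concrete sketch of the domain decomposition idea above (the names ntasks and mytask are illustrative, not from the tutorial), a common block partition assigns each task a contiguous range of a 1-D index space, spreading any remainder over the first few tasks:

#include <stdio.h>

/* Compute the half-open index range [start, end) owned by task 'mytask'
 * when n items are block-partitioned among ntasks tasks. */
void block_range(int n, int ntasks, int mytask, int *start, int *end) {
    int base = n / ntasks, extra = n % ntasks;
    *start = mytask * base + (mytask < extra ? mytask : extra);
    *end   = *start + base + (mytask < extra ? 1 : 0);
}

int main(void) {
    int start, end;
    for (int t = 0; t < 4; t++) {          /* e.g., 10 items over 4 tasks */
        block_range(10, 4, t, &start, &end);
        printf("task %d owns indices %d..%d\n", t, start, end - 1);
    }
    return 0;
}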
Communications
Who Needs Communications?
The need for communications between tasks depends upon your problem:
You DON'T need communications: some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data (these are often called embarrassingly parallel problems).
You DO need communications: most parallel applications are not so simple, and do require tasks to share data with each other.
Factors to Consider
There are a number of important factors to consider when designing your program's inter-task communications:
Communication overhead
[Figure: communication is complex]
Latency vs. bandwidth
Visibility of communications
Synchronous vs. asynchronous communications
Scope of communications
Efficiency of communications
Overhead and complexity
[Figure: example of parallel communications overhead and complexity]
Finally, realize that this is only a partial list of things to consider!
Synchronization
Types of Synchronization
A minimal barrier/lock sketch in OpenMP follows this list.
Barrier
Lock / semaphore
Synchronous communication operations
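The sketch below (an OpenMP illustration, not the tutorial's code) shows the lock and barrier ideas from the list above: the critical section serializes updates to shared data, and the barrier holds every thread until all have arrived.

#include <omp.h>
#include <stdio.h>

int main(void) {
    int counter = 0;

    #pragma omp parallel
    {
        /* lock-like: only one thread at a time may update the shared counter */
        #pragma omp critical
        counter++;

        /* barrier: no thread proceeds until every thread has incremented */
        #pragma omp barrier

        #pragma omp single
        printf("all %d threads reached the barrier\n", counter);
    }
    return 0;
}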
Data Dependencies
Definition: A dependence exists between program statements when the order of statement execution affects the results of the program. A data dependence results from multiple uses of the same location(s) in storage by different tasks. Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
Examples
Loop carried data dependence:
DO J = MYSTART,MYEND
   A(J) = A(J-1) * 2.0
END DO
The value of A(J-1) must be computed before the value of A(J); A(J) therefore exhibits a data dependence on A(J-1), and parallelism is inhibited.
Loop independent data dependence:
task 1        task 2
------        ------
X = 2         X = 4
.             .
.             .
Y = X**2      Y = X**3
Parallelism is inhibited here as well: the value of Y depends on which task last stores a value of X (or, on a distributed memory machine, on if and when the value of X is communicated between the tasks).
How to Handle Data Dependencies: on distributed memory architectures, communicate the required data at synchronization points; on shared memory architectures, synchronize read/write operations between tasks.
Load Balancing
How to Achieve Load Balance
Equally partition the work each task receives
Use dynamic work assignment
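A minimal sketch of dynamic work assignment in a shared memory setting (the workload function and chunk size are illustrative): a dynamically scheduled loop lets idle threads grab the next chunk of iterations as they finish, rather than fixing the assignment in advance.

#include <omp.h>
#include <stdio.h>

/* placeholder for work whose cost varies strongly from iteration to iteration */
static double uneven_work(int i) {
    double s = 0.0;
    for (int k = 0; k < (i % 1000) * 1000; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void) {
    const int n = 10000;
    double total = 0.0;

    /* dynamic scheduling hands out chunks of 16 iterations on demand,
     * so fast threads are not left idle waiting for slow ones */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += uneven_work(i);

    printf("total = %f\n", total);
    return 0;
}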
Granularity
Computation / Communication Ratio
Fine-grain Parallelism
[Figure: fine-grain parallelism]
Coarse-grain Parallelism
[Figure: coarse-grain parallelism]
Which is Best?
I/O
The Bad News
I/O operations are generally regarded as inhibitors to parallelism.
The Good News
Debugging
Performance Analysis and Tuning
Parallel Examples
Array Processing
[Figure: 2-D array]
This example demonstrates calculations on 2-D array elements: a function is evaluated on each array element, as in the serial loop below.
do j = 1,n
  do i = 1,n
    a(i,j) = fcn(i,j)
  end do
end do
Parallel Solution 1
[Figure: parallel solution 1]
Column-major (Fortran):
do j = mystart, myend
  do i = 1, n
    a(i,j) = fcn(i,j)
  end do
end do

Row-major (C):
for (i = mystart; i < myend; i++) {
  for (j = 0; j < n; j++) {
    a[i][j] = fcn(i, j);
  }
}
One possible solution:
find out if I am MASTER or WORKER
if I am MASTER
  initialize the array
  send each WORKER info on part of array it owns
  send each WORKER its portion of initial array
  receive from each WORKER results
else if I am WORKER
  receive from MASTER info on part of array I own
  receive from MASTER my portion of initial array
  # calculate my portion of array
  do j = my first column, my last column
    do i = 1, n
      a(i,j) = fcn(i,j)
    end do
  end do
  send MASTER results
endif
Example programs
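For reference, a minimal MPI rendering of this first solution (a sketch under assumptions: N, fcn, and the even divisibility of N by the task count are illustrative, and MPI_Gather replaces the explicit send/receive pairs of the pseudocode): each task computes only the block of rows its rank owns, and the MASTER (rank 0) collects the results.

#include <mpi.h>
#include <stdlib.h>

#define N 1024                       /* assume N divisible by the task count */

static double fcn(int i, int j) { return (double)i * j; }   /* placeholder */

int main(int argc, char **argv) {
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    int rows = N / ntasks;                       /* my contiguous block of rows */
    double *mine = malloc((size_t)rows * N * sizeof(double));
    double *full = (rank == 0) ? malloc((size_t)N * N * sizeof(double)) : NULL;

    /* each task fills only the rows it owns */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++)
            mine[i * N + j] = fcn(rank * rows + i, j);

    /* the root (MASTER) collects every task's block in rank order */
    MPI_Gather(mine, rows * N, MPI_DOUBLE, full, rows * N, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(mine);
    free(full);
    MPI_Finalize();
    return 0;
}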
Parallel Solution 2: Pool of Tasks
[Figure: pool of tasks scheme]
Master Process: holds the pool of tasks for worker processes to do, sends a worker a task when requested, and collects results from the workers.
Worker Process: repeatedly does the following - gets a task from the master, performs the work, and sends the results back to the master.
find out if I am MASTER or WORKER
if I am MASTER
  do until no more jobs
    if request send to WORKER next job
    else receive results from WORKER
  end do
else if I am WORKER
  do until no more jobs
    request job from MASTER
    receive from MASTER next job
    calculate array element: a(i,j) = fcn(i,j)
    send results to MASTER
  end do
endif
Discussion
PI Calculation
[Figure: calculating pi in serial]
The Monte Carlo approach inscribes a circle of radius r in a square of side 2r; the ratio of the circle's area to the square's area is pi/4, so pi can be approximated as 4 times the fraction of randomly generated points that fall inside the circle. Serial pseudocode:
npoints = 10000
circle_count = 0
do j = 1,npoints
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
PI = 4.0*circle_count/npoints
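A runnable C version of the serial pseudocode above (the point count and the use of rand() are illustrative choices, not part of the tutorial):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int npoints = 1000000;
    int circle_count = 0;

    for (int j = 0; j < npoints; j++) {
        double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)               /* inside the quarter circle */
            circle_count++;
    }
    printf("PI is approximately %f\n", 4.0 * circle_count / npoints);
    return 0;
}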
Parallel Solution
[Figure: calculating pi in parallel]
npoints = 10000
circle_count = 0
p = number of tasks
num = npoints/p
find out if I am MASTER or WORKER
do j = 1,num
  generate 2 random numbers between 0 and 1
  xcoordinate = random1
  ycoordinate = random2
  if (xcoordinate, ycoordinate) inside circle then
    circle_count = circle_count + 1
end do
if I am MASTER
  receive from WORKERS their circle_counts
  compute PI (use MASTER and WORKER calculations)
else if I am WORKER
  send to MASTER circle_count
endif
Example Programs
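A minimal MPI sketch of the parallel version (illustrative; it collects the per-task counts with MPI_Reduce rather than the explicit sends and receives in the pseudocode):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    const long npoints = 1000000;
    long num = npoints / ntasks;                 /* my share of the samples */
    long my_count = 0, total_count = 0;

    srand(rank + 1);                             /* different random stream per task */
    for (long j = 0; j < num; j++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            my_count++;
    }

    /* sum every task's count onto the MASTER (rank 0) */
    MPI_Reduce(&my_count, &total_count, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("PI is approximately %f\n", 4.0 * total_count / (num * ntasks));

    MPI_Finalize();
    return 0;
}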
Simple Heat Equation
[Figure: 2-D heat problem] [Figure: finite differencing scheme]
The heat equation describes the temperature change over time, given an initial temperature distribution and boundary conditions. A finite differencing scheme is employed: each interior element's value at the next time step is computed from its current value and the values of its neighboring elements, as in the serial loop below.
do iy = 2, ny - 1
  do ix = 2, nx - 1
    u2(ix, iy) = u1(ix, iy) +
        cx * (u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) +
        cy * (u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
  end do
end do
Parallel Solution
[Figure: parallel solution to 2-D heat problem]
find out if I am MASTER or WORKER
if I am MASTER
  initialize array
  send each WORKER starting info and subarray
  receive results from each WORKER
else if I am WORKER
  receive from MASTER starting info and subarray
  # Perform time steps
  do t = 1, nsteps
    update time
    send neighbors my border info
    receive from neighbors their border info
    update my portion of solution array
  end do
  send MASTER results
endif
Example Programs
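The "send neighbors my border info / receive from neighbors their border info" step is commonly written as a halo (ghost cell) exchange. The function below is a hypothetical sketch under assumed names: u holds this task's rows plus one ghost row above and below, and up/down are the neighboring ranks (or MPI_PROC_NULL at the domain edges, which turns the calls into no-ops).

#include <mpi.h>

/* Exchange boundary rows with the neighbor above and below for one time step. */
void exchange_borders(double *u, int nx, int myrows, int up, int down) {
    /* u is (myrows + 2) x nx: row 0 and row myrows+1 are ghost rows */
    MPI_Sendrecv(&u[1 * nx], nx, MPI_DOUBLE, up, 0,              /* send my top row up      */
                 &u[(myrows + 1) * nx], nx, MPI_DOUBLE, down, 0, /* recv ghost row from below */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[myrows * nx], nx, MPI_DOUBLE, down, 1,       /* send my bottom row down */
                 &u[0], nx, MPI_DOUBLE, up, 1,                   /* recv ghost row from above */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}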
1-D Wave Equation
A(i,t+1) = (2.0 * A(i,t)) - A(i,t-1) + (c * (A(i-1,t) - (2.0 * A(i,t)) + A(i+1,t))) where c is a constant
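As a sketch (array names and the handling of endpoints are illustrative assumptions), one time step of this update in C might look like:

/* One time step of the 1-D wave equation update shown above.
 * oldval, values, and newval hold A(i, t-1), A(i, t), and A(i, t+1). */
void wave_step(const double *oldval, const double *values, double *newval,
               int npoints, double c) {
    for (int i = 1; i < npoints - 1; i++)        /* endpoints handled separately */
        newval[i] = 2.0 * values[i] - oldval[i]
                  + c * (values[i - 1] - 2.0 * values[i] + values[i + 1]);
}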
1-D Wave Equation Parallel Solution
find out number of tasks and task identities
# Identify left and right neighbors
left_neighbor = mytaskid - 1
right_neighbor = mytaskid + 1
if mytaskid = first then left_neighbor = last
if mytaskid = last then right_neighbor = first
find out if I am MASTER or WORKER
if I am MASTER
  initialize array
  send each WORKER starting info and subarray
else if I am WORKER
  receive starting info and subarray from MASTER
endif
# Perform time steps
# In this example the master participates in calculations
do t = 1, nsteps
  send left endpoint to left neighbor
  receive left endpoint from right neighbor
  send right endpoint to right neighbor
  receive right endpoint from left neighbor
  # Update points along line
  do i = 1, npoints
    newval(i) = (2.0 * values(i)) - oldval(i) + (sqtau * (values(i-1) - (2.0 * values(i)) + values(i+1)))
  end do
end do
# Collect results and write to file
if I am MASTER
  receive results from each WORKER
  write results to file
else if I am WORKER
  send results to MASTER
endif
Example Programs
This completes the tutorial. Please complete the online evaluation form.
References and More Information
Which of the following best describes the ability of parallel computing solutions to improve efficiency? The efficiency of a solution that can be broken down into parallel portions is still limited by a sequential portion.
Which of the following best describes a challenge involved in using a parallel computing solution? A parallel computing solution may not be appropriate for an algorithm in which each step requires the output from the preceding step.
Which of the following parallel computing solutions would minimize the amount of time it takes to execute all four processes, and why? With two processors running in parallel, execution time is minimized when the processors take on as close to an equal workload as possible.
Which of the following best explains the consequence of the problem being undecidable? No algorithm can be constructed that is capable of solving all instances of the problem, regardless of how much time is available.