This is the first tutorial in the "Livermore Computing Getting Started" workshop. It is intended to provide only a brief overview of the extensive and broad topic of parallel computing, as a lead-in for the tutorials that follow it. As such, it covers just the very basics of parallel computing, and is intended for someone who is just becoming acquainted with the subject and who is planning to attend one or more of the other tutorials in this workshop. It is not intended to cover parallel programming in depth, as this would require significantly more time.

The tutorial begins with a discussion of what parallel computing is and how it is used, followed by the concepts and terminology associated with parallel computing. The topics of parallel memory architectures and programming models are then explored. These are followed by a series of practical discussions on a number of the complex issues related to designing and running parallel programs. The tutorial concludes with examples of how to parallelize several simple problems. References are included for further self-study.

Traditionally, software has been written for serial computation: a problem is broken into a discrete series of instructions that are executed sequentially, one after another, on a single processor, with only one instruction executing at any moment in time. In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem.

Historically, parallel computing has been considered to be "the high end of computing," and has been used to model difficult problems in many areas of science and engineering. Today, commercial applications provide an equal or greater driving force in the development of faster computers; these applications require the processing of large amounts of data in sophisticated ways.

Parallel computers still follow the basic von Neumann computer architecture, named after the Hungarian mathematician John von Neumann; the design is just multiplied in units. The basic, fundamental architecture remains the same. More information on von Neumann's other remarkable accomplishments: //en.wikipedia.org/wiki/John_von_Neumann
General parallel computing terminology:

CPU: Contemporary CPUs consist of one or more cores, each a distinct execution unit with its own instruction stream. Cores within a CPU may be organized into one or more sockets, each socket with its own distinct memory. When a CPU consists of two or more sockets, the hardware infrastructure usually supports memory sharing across sockets.

Node: A standalone "computer in a box," usually comprised of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer.

Task: A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.

Pipelining: Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.

Shared memory: Describes a computer architecture where all processors have direct access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.

Symmetric Multi-Processor (SMP): Shared memory hardware architecture where multiple processors share a single address space and have equal access to all resources - memory, disk, etc.

Distributed memory: In hardware, refers to network-based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.

Communications: Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network.

Synchronization: The coordination of parallel tasks in real time, very often associated with communications. Synchronization usually involves waiting by at least one task, and can therefore cause a parallel application's wall clock execution time to increase.

Granularity: In parallel computing, granularity is a quantitative or qualitative measure of the ratio of computation to communication.

Observed speedup: Observed speedup of a code which has been parallelized, defined as the wall-clock time of serial execution divided by the wall-clock time of parallel execution. One of the simplest and most widely used indicators of a parallel program's performance.

Parallel overhead: Required execution time that is unique to parallel tasks, as opposed to that for doing useful work. Parallel overhead can include factors such as task start-up time, synchronizations, data communications, software overhead imposed by parallel languages, libraries and operating systems, and task termination time.

Massively parallel: Refers to the hardware that comprises a given parallel system - having many processing elements. The meaning of "many" keeps increasing, but currently the largest parallel computers are comprised of processing elements numbering in the hundreds of thousands to millions.

Embarrassingly parallel: Solving many similar, but independent tasks simultaneously; little to no need for coordination between the tasks.

Scalability: Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources. Factors that contribute to scalability include the hardware (particularly memory-CPU and network bandwidths), the application algorithm, the associated parallel overhead, and the characteristics of your specific application.
Two examples from the past illustrate that parallel programming models are not tied to a particular type of machine or memory architecture:

Shared memory model on a distributed memory machine: the Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed across networked machines, but appeared to the user as a single shared memory global address space. Generically, this approach is referred to as "virtual shared memory".

Distributed memory model on a shared memory machine: Message Passing Interface (MPI) on the SGI Origin 2000. The SGI Origin 2000 employed the CC-NUMA type of shared memory architecture, where every task has direct access to global address space spread across all machines. However, the ability to send and receive messages using MPI, as is commonly done over a network of distributed memory machines, was implemented and commonly used.

In both cases, the programmer is responsible for determining the parallelism (although compilers can sometimes help).
Programs = algorithms + data + (hardware). The first step in developing parallel software is to understand the problem that you wish to solve in parallel and to determine whether or not it can actually be parallelized.

Example of an easy-to-parallelize problem: calculate the potential energy for each of several thousand independent conformations of a molecule; when done, find the minimum energy conformation. This problem is able to be solved in parallel: each of the molecular conformations is independently determinable, and the calculation of the minimum energy conformation is also a parallelizable problem.

Example of a problem with little-to-no parallelism: calculation of the first 10,000 members of the Fibonacci series (0, 1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula F(n) = F(n-1) + F(n-2). The calculation of the F(n) value uses those of both F(n-1) and F(n-2), which must be computed first, so the terms cannot simply be computed independently. An example of a parallel algorithm for this problem uses Binet's closed-form formula, F(n) = (phi^n - psi^n) / sqrt(5), where phi = (1 + sqrt(5)) / 2 and psi = (1 - sqrt(5)) / 2, so each F(n) can be evaluated without first computing F(n-1) and F(n-2).
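As a rough illustration of why the closed form removes the dependency, here is a minimal C sketch (not from the original tutorial) in which every F(n) is computed independently, so the loop iterations could be divided among parallel tasks. Double precision reproduces the series exactly only for roughly n up to 70, so this shows the idea rather than a way to generate 10,000 exact terms.

    #include <math.h>
    #include <stdio.h>

    /* Closed-form (Binet) evaluation of the n-th Fibonacci number.
       Each call is independent of F(n-1) and F(n-2). */
    double fib_binet(int n)
    {
        const double sqrt5 = sqrt(5.0);
        const double phi = (1.0 + sqrt5) / 2.0;
        const double psi = (1.0 - sqrt5) / 2.0;
        return (pow(phi, n) - pow(psi, n)) / sqrt5;
    }

    int main(void)
    {
        /* These iterations are independent, so they could be split among tasks */
        for (int n = 0; n < 20; n++)
            printf("F(%d) = %.0f\n", n, fib_binet(n));
        return 0;
    }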
One of the first steps in designing a parallel program is to break the problem into discrete "chunks" of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.

Domain decomposition: in this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data. There are different ways to partition data, such as block and cyclic distributions.

Functional decomposition: in this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work. Functional decomposition lends itself well to problems that can be split into different tasks. For example:

Ecosystem modeling: each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step.

Signal processing: an audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. When it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy.
Abstract
Overview
What Is Parallel Computing?
Serial Computing
Parallel Computing
Parallel Computers (figures: IBM BG/Q Compute Chip with 18 cores (PU) and 16 L2 Cache units (L2); network connections; example of a typical parallel computer cluster; source: Top500.org)
Why Use Parallel Computing?
The Real World Is Massively Complex
Main Reasons for Using Parallel Programming
SAVE TIME AND/OR MONEY: working in parallel shortens completion time
SOLVE LARGER / MORE COMPLEX PROBLEMS: parallel computing can solve increasingly complex problems
PROVIDE CONCURRENCY: collaborative networks
TAKE ADVANTAGE OF NON-LOCAL RESOURCES: SETI has a large worldwide user base
MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE: parallel computing cores
The Future (source: Top500.org)
Who Is Using Parallel Computing?
Science and Engineering: parallel computing is key to simulating a range of complex physical phenomena
Industrial and Commercial: parallel computing is used in many commercial applications
Global Applications (source: Top500.org)
Concepts and Terminology
von Neumann Computer Architecture (figure: John von Neumann circa 1940s; source: LANL archives)
Flynn's Classical Taxonomy
Single Instruction, Single Data (SISD)
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)
General Parallel Computing Terminology
CPU
Observed Speedup
Massively Parallel
Potential Benefits, Limits and Costs of Parallel Programming
Amdahl's Law

Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 - P)

If none of the code can be parallelized, P = 0 and the speedup is 1 (no speedup). If all of the code is parallelized, P = 1 and the speedup is (in theory) infinite. Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:

    speedup = 1 / (P/N + S)

where P = parallel fraction, N = number of processors, and S = serial fraction. It soon becomes obvious that there are limits to the scalability of parallelism. For example:

                           speedup
    N           P = .50    P = .90    P = .95    P = .99
    -------     -------    -------    -------    -------
    10             1.82       5.26       6.89       9.17
    100            1.98       9.17      16.80      50.25
    1,000          1.99       9.91      19.62      90.99
    10,000         1.99       9.99      19.96      99.02
    100,000        1.99       9.99      19.99      99.90
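To make the relationship concrete, here is a small C sketch (not part of the original tutorial) that evaluates the second formula; the function name amdahl_speedup is an illustrative placeholder. Running it reproduces the table above.

    #include <stdio.h>

    /* Predicted speedup from Amdahl's Law:
       P = parallel fraction of the code, N = number of processors,
       S = 1 - P = serial fraction. */
    double amdahl_speedup(double P, int N)
    {
        double S = 1.0 - P;
        return 1.0 / (P / N + S);
    }

    int main(void)
    {
        double P[] = {0.50, 0.90, 0.95, 0.99};
        int    N[] = {10, 100, 1000, 10000, 100000};

        printf("%-9s %8s %8s %8s %8s\n", "N", "P=.50", "P=.90", "P=.95", "P=.99");
        for (int i = 0; i < 5; i++) {
            printf("%-9d", N[i]);
            for (int j = 0; j < 4; j++)
                printf(" %8.2f", amdahl_speedup(P[j], N[i]));
            printf("\n");
        }
        return 0;
    }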
However, certain problems spend a larger fraction of time in the parallelizable portion as the problem size grows. For example, with a fixed serial fraction of 15 seconds:

    2D Grid Calculations     85 seconds     85%
    Serial fraction          15 seconds     15%

    2D Grid Calculations    680 seconds     97.84%
    Serial fraction          15 seconds      2.16%

Problems that increase the percentage of parallel time with their size are more scalable than problems with a fixed percentage of parallel time.

Complexity
Portability
Resource Requirements
Scalability
Strong and weak scaling
Parallel Computer Memory Architectures
Shared Memory
General Characteristics
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Advantages
Disadvantages
Distributed Memory
General Characteristics
Advantages
Disadvantages
Hybrid Distributed-Shared Memory
General Characteristics
Advantages and Disadvantages
Parallel Programming Models
SHARED memory model on a DISTRIBUTED memory machine
DISTRIBUTED memory model on a SHARED memory machine
Shared Memory Model (without threads)
Threads Model
POSIX Threads
OpenMP
More Information
Distributed Memory / Message Passing Model
More Information
Data Parallel Model
Hybrid Model (figures: hybrid model with MPI and OpenMP; hybrid model with MPI and CUDA)
SPMD and MPMD
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
Designing Parallel Programs
Automatic vs. Manual Parallelization
Fully Automatic
Programmer Directed
Understand the Problem and the Program
F(n) = F(n-1) + F(n-2)
Partitioning
Domain Decomposition
Functional Decomposition

Each model component can be thought of as a separate task. Arrows represent exchanges of data between components during computation: the atmosphere model generates wind velocity data that are used by the ocean model, the ocean model generates sea surface temperature data that are used by the atmosphere model, and so on (climate modeling: complex relationships between climate and atmospheric modeling components).

- Combining these two types of problem decomposition is common and natural.
Communications
Who Needs Communications?
The need for communications between tasks depends upon your problem:
You DON'T need communications
- Some types of problems can be decomposed and executed in parallel with virtually no need for tasks to share data. These types of problems are often called embarrassingly parallel - little or no communications are required.
- For example, imagine an image processing operation where every pixel in a black and white image needs to have its color reversed. The image data can easily be distributed to multiple tasks that then act independently of each other to do their portion of the work.
You DO need communications
- Most parallel applications are not quite so simple, and do require tasks to share data with each other.
- For example, a 2-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that have neighboring data. Changes to neighboring data have a direct effect on that task's data.
Factors to Consider
There are a number of important factors to consider when designing your program's inter-task communications:
- Inter-task communication virtually always implies overhead.
- Machine cycles and resources that could be used for computation are instead used to package and transmit data.
- Communications frequently require some type of synchronization between tasks, which can result in tasks spending time "waiting" instead of doing work.
- Competing communication traffic can saturate the available network bandwidth, further aggravating performance problems.
- Latency is the time it takes to send a minimal (0 byte) message from point A to point B. Commonly expressed as microseconds.
- Bandwidth is the amount of data that can be communicated per unit of time. Commonly expressed as megabytes/sec or gigabytes/sec.
- Sending many small messages can cause latency to dominate communication overheads. Often it is more efficient to package small messages into a larger message, thus increasing the effective communications bandwidth.
- With the Message Passing Model, communications are explicit and generally quite visible and under the control of the programmer.
- With the Data Parallel Model, communications often occur transparently to the programmer, particularly on distributed memory architectures. The programmer may not even be able to know exactly how inter-task communications are being accomplished.
- Synchronous communications require some type of "handshaking" between tasks that are sharing data. This can be explicitly structured in code by the programmer, or it may happen at a lower level unknown to the programmer.
- Synchronous communications are often referred to as blocking communications since other work must wait until the communications have completed.
- Asynchronous communications allow tasks to transfer data independently from one another. For example, task 1 can prepare and send a message to task 2, and then immediately begin doing other work. When task 2 actually receives the data doesn't matter.
- Asynchronous communications are often referred to as non-blocking communications since other work can be done while the communications are taking place.
- Interleaving computation with communication is the single greatest benefit for using asynchronous communications.
- Knowing which tasks must communicate with each other is critical during the design stage of a parallel code. Both of the two scopings described below can be implemented synchronously or asynchronously.
- Point-to-point - involves two tasks with one task acting as the sender/producer of data, and the other acting as the receiver/consumer.
- Collective - involves data sharing between more than two tasks, which are often specified as being members in a common group, or collective. Some common variations (there are more):
- Oftentimes, the programmer has choices that can affect communications performance. Only a few are mentioned here.
- Which implementation for a given model should be used? Using the Message Passing Model as an example, one MPI implementation may be faster on a given hardware platform than another.
- What type of communication operations should be used? As mentioned previously, asynchronous communication operations can improve overall program performance.
- Network fabric—different platforms use different networks. Some networks perform better than others. Choosing a platform with a faster network may be an option.
Finally, realize that this is only a partial list of things to consider!
Synchronization
- Managing the sequence of work and the tasks performing it is a critical design consideration for most parallel programs.
- Can be a significant factor in program performance (or lack of it)
- Often requires "serialization" of segments of the program.
Types of Synchronization
Barrier
- Usually implies that all tasks are involved
- Each task performs its work until it reaches the barrier. It then stops, or "blocks".
- When the last task reaches the barrier, all tasks are synchronized.
- What happens from here varies. Often, a serial section of work must be done. In other cases, the tasks are automatically released to continue their work.
Lock / semaphore
- Can involve any number of tasks
- Typically used to serialize (protect) access to global data or a section of code. Only one task at a time may use (own) the lock / semaphore / flag.
- The first task to acquire the lock "sets" it. This task can then safely (serially) access the protected data or code.
- Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it.
- Can be blocking or non-blocking (a minimal lock example appears after this list).
Synchronous communication operations
- Involves only those tasks executing a communication operation.
- When a task performs a communication operation, some form of coordination is required with the other task(s) participating in the communication. For example, before a task can perform a send operation, it must first receive an acknowledgment from the receiving task that it is OK to send.
- Discussed previously in the Communications section.
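As a rough illustration of the lock / semaphore type of synchronization described above, here is a minimal POSIX Threads sketch in C (not from the original tutorial); the thread count and iteration count are arbitrary placeholder values. The mutex serializes access to the shared counter, so only one task at a time "owns" the protected section.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                         /* placeholder thread count */

    static long counter = 0;                   /* shared (global) data     */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* the first thread to arrive acquires the lock */
            counter++;                    /* protected (serialized) update                */
            pthread_mutex_unlock(&lock);  /* other threads wait until the lock is freed   */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, work, NULL);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
        printf("counter = %ld\n", counter);    /* NTHREADS * 100000 */
        return 0;
    }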
Data Dependencies
Definition
- A dependence exists between program statements when the order of statement execution affects the results of the program.
- A data dependence results from multiple use of the same location(s) in storage by different tasks.
- Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
Examples
Loop carried data dependence:

    DO J = MYSTART,MYEND
       A(J) = A(J-1) * 2.0
    END DO

- The value of A(J-1) must be computed before the value of A(J), therefore A(J) exhibits a data dependency on A(J-1). Parallelism is inhibited.
- If Task 2 has A(J) and task 1 has A(J-1), computing the correct value of A(J) necessitates:
- Distributed memory architecture - task 2 must obtain the value of A(J-1) from task 1 after task 1 finishes its computation
- Shared memory architecture - task 2 must read A(J-1) after task 1 updates it
- In a second example, suppose two tasks each store a different value into the same shared variable X and then compute Y from X. As with the previous example, parallelism is inhibited. The value of Y is dependent on:
- Distributed memory architecture - if or when the value of X is communicated between the tasks.
- Shared memory architecture - which task last stores the value of X.
- Although all data dependencies are important to identify when designing parallel programs, loop carried dependencies are particularly important since loops are possibly the most common target of parallelization efforts.
How to Handle Data Dependencies
- Distributed memory architectures - communicate required data at synchronization points.
- Shared memory architectures - synchronize read/write operations between tasks.
Load Balancing
- Load balancing refers to the practice of distributing approximately equal amounts of work among tasks so that all tasks are kept busy all of the time. It can be considered a minimization of task idle time.
- Load balancing is important to parallel programs for performance reasons. For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance.
How to Achieve Load Balance
Equally partition the work each task receives
- For array/matrix operations where each task performs similar work, evenly distribute the data set among the tasks.
- For loop iterations where the work done in each iteration is similar, evenly distribute the iterations across the tasks.
- If a heterogeneous mix of machines with varying performance characteristics is being used, be sure to use some type of performance analysis tool to detect any load imbalances. Adjust work accordingly.
Use dynamic work assignment
- Certain classes of problems result in load imbalances even if data is evenly distributed among tasks:
  - Sparse arrays - some tasks will have actual data to work on while others have mostly "zeros."
  - Adaptive grid methods - some tasks may need to refine their mesh while others don't.
  - N-body simulations - particles may migrate across task domains, requiring more work for some tasks.
- When the amount of work each task will perform is intentionally variable, or is unable to be predicted, it may be helpful to use a scheduler-task pool approach. As each task finishes its work, it receives a new piece from the work queue.
- Ultimately, it may become necessary to design an algorithm which detects and handles load imbalances as they occur dynamically within the code.
Granularity
Computation / Communication Ratio
- In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
- Periods of computation are typically separated from periods of communication by synchronization events.
Fine-grain Parallelism
- Relatively small amounts of computational work are done between communication events.
- Low computation to communication ratio.
- Facilitates load balancing.
- Implies high communication overhead and less opportunity for performance enhancement.
- If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
Coarse-grain Parallelism
- Relatively large amounts of computational work are done between communication/synchronization events
- High computation to communication ratio
- Implies more opportunity for performance increase
- Harder to load balance efficiently
Which is Best?
- The most efficient granularity is dependent on the algorithm and the hardware environment in which it runs.
- In most cases the overhead associated with communications and synchronization is high relative to execution speed so it is advantageous to have coarse granularity.
- Fine-grain parallelism can help reduce overheads due to load imbalance.
I/O
The Bad News
- I/O operations are generally regarded as inhibitors to parallelism.
- I/O operations require orders of magnitude more time than memory operations.
- Parallel I/O systems may be immature or not available for all platforms.
- In an environment where all tasks see the same file space, write operations can result in file overwriting.
- Read operations can be affected by the file server's ability to handle multiple read requests at the same time.
- I/O that must be conducted over the network (NFS, non-local) can cause severe bottlenecks and even crash file servers.
The Good News
- Parallel file systems are available. For example:
- GPFS: General Parallel File System (IBM). Now called IBM Spectrum Scale.
- Lustre: for Linux clusters (Intel)
- HDFS: Hadoop Distributed File System (Apache)
- PanFS: Panasas ActiveScale File System for Linux clusters (Panasas, Inc.)
- And more - see //en.wikipedia.org/wiki/List_of_file_systems#Distributed_parallel_file_systems
- The parallel I/O programming interface specification for MPI has been available since 1996 as part of MPI-2. Vendor and "free" implementations are now commonly available.
- A few pointers:
- Rule #1: Reduce overall I/O as much as possible.
- If you have access to a parallel file system, use it.
- Writing large chunks of data rather than small chunks is usually significantly more efficient.
- Fewer, larger files perform better than many small files.
- Confine I/O to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks. For example, Task 1 could read an input file and then communicate required data to other tasks. Likewise, Task 1 could perform the write operation after receiving required data from all other tasks.
- Aggregate I/O operations across tasks - rather than having many tasks perform I/O, have a subset of tasks perform it.
Debugging
- Debugging parallel codes can be incredibly difficult, particularly as codes scale upwards.
- The good news is that there are some excellent debuggers available to assist:
- Threaded - pthreads and OpenMP
- MPI
- GPU / accelerator
- Hybrid
- Livermore Computing users have access to several parallel debugging tools installed on LC's clusters:
- TotalView from RogueWave Software
- DDT from Allinea
- Inspector from Intel
- Stack Trace Analysis Tool (STAT) - locally developed at LLNL
- All of these tools have a learning curve associated with them.
- For details and getting started information, see:
- LC's web pages at hpc.llnl.gov/software/development-environment-software
- TotalView tutorial: hpc.llnl.gov/documentation/tutorials/totalview-tutorial
Performance Analysis and Tuning
- As with debugging, analyzing and tuning parallel program performance can be much more challenging than for serial programs.
- Fortunately, there are a number of excellent tools for parallel program performance analysis and tuning.
- Livermore Computing users have access to several such tools, most of which are available on all production clusters.
- Some starting points for tools installed on LC systems:
- LC's web pages at //hpc.llnl.gov/software/development-environment-software
- TAU: //www.cs.uoregon.edu/research/tau/docs.php
- HPCToolkit: //hpctoolkit.org/documentation.html
- Open|Speedshop: //www.openspeedshop.org/
- Vampir / Vampirtrace: //vampir.eu/
- Valgrind: //valgrind.org/
- PAPI: //icl.cs.utk.edu/papi/
- mpiP: //mpip.sourceforge.net/
- memP: //memp.sourceforge.net/
Parallel Examples
Array Processing
- This example demonstrates calculations on 2-dimensional array elements; a function is evaluated on each array element.
- The computation on each array element is independent from other array elements.
- The problem is computationally intensive.
- The serial program calculates one element at a time in sequential order.
- Serial code could be of the form:
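A minimal serial sketch in C (illustrative only; fcn and the array size N are placeholders for the real per-element computation and problem size):

    #include <stdio.h>

    #define N 8                       /* placeholder problem size */

    /* Placeholder for the independent per-element computation */
    static double fcn(int i, int j) { return (double)(i * N + j); }

    int main(void)
    {
        static double a[N][N];

        /* Serial version: compute one element at a time, in sequential order */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = fcn(i, j);

        printf("a[%d][%d] = %g\n", N - 1, N - 1, a[N - 1][N - 1]);
        return 0;
    }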
- Questions to ask:
- Is this problem able to be parallelized?
- How would the problem be partitioned?
- Are communications needed?
- Are there any data dependencies?
- Are there synchronization needs?
- Will load balancing be a concern?
Parallel Solution 1
- The calculation of elements is independent of one another - leads to an embarrassingly parallel solution.
- Array elements are evenly distributed so that each process owns a portion of the array (subarray).
- Distribution scheme is chosen for efficient memory access; e.g. unit stride (stride of 1) through the subarrays. Unit stride maximizes cache/memory usage.
- Since it is desirable to have unit stride through the subarrays, the choice of a distribution scheme depends on the programming language. See the Block - Cyclic Distributions Diagram for the options.
- Independent calculation of array elements ensures there is no need for communication or synchronization between tasks.
- Since the amount of work is evenly distributed across processes, there should not be load balance concerns.
- After the array is distributed, each task executes the portion of the loop corresponding to the data it owns.
- For example, both Fortran (column-major) and C (row-major) block distributions are shown:
Column-major (Fortran):

    do j = mystart, myend
       do i = 1, n
          a(i,j) = fcn(i,j)
       end do
    end do

Row-major (C):

    for (i = mystart; i < myend; i++) {
       for (j = 0; j < n; j++) {
          a[i][j] = fcn(i, j);
       }
    }

- Notice that only the outer loop variables are different from the serial solution.
- Implement as a Single Program Multiple Data (SPMD) model - every task executes the same program.
- Master process initializes array, sends info to worker processes and receives results.
- Worker process receives info, performs its share of computation and sends results to master.
- Using the Fortran storage scheme, perform block distribution of the array.
- Pseudo code solution: red highlights changes for parallelism.
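As one possible realization (not the tutorial's original highlighted pseudo code), here is a compact SPMD sketch in C with MPI, using a collective gather in place of the explicit master/worker sends and receives described above. N, fcn, and the assumption that N divides evenly by the task count are illustrative simplifications.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 512                      /* assume N divides evenly by the number of tasks */

    /* Placeholder for the independent per-element computation */
    static double fcn(int i, int j) { return (double)i * j; }

    int main(int argc, char **argv)
    {
        int rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        int myrows   = N / ntasks;     /* block distribution: each task owns myrows rows */
        int firstrow = rank * myrows;

        /* Each task computes only the portion of the array it owns */
        double *mypart = malloc((size_t)myrows * N * sizeof(double));
        for (int i = 0; i < myrows; i++)
            for (int j = 0; j < N; j++)
                mypart[i * N + j] = fcn(firstrow + i, j);

        /* The master (rank 0) collects every task's block into the full array */
        double *a = (rank == 0) ? malloc((size_t)N * N * sizeof(double)) : NULL;
        MPI_Gather(mypart, myrows * N, MPI_DOUBLE,
                   a,      myrows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("a[N-1][N-1] = %g\n", a[(N - 1) * N + (N - 1)]);
            free(a);
        }
        free(mypart);
        MPI_Finalize();
        return 0;
    }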
- MPI Array Program in C
- MPI Array Program in Fortran
Parallel Solution 2: Pool of Tasks
- The previous array solution demonstrated static load balancing:
- Each task has a fixed amount of work to do
- May be significant idle time for faster or more lightly loaded processors - the slowest task determines the overall performance.
- Static load balancing is not usually a major concern if all tasks are performing the same amount of work on identical machines.
- If you have a load balance problem (some tasks work faster than others), you may benefit by using a "pool of tasks" scheme.
- Two processes are employed
Master Process:
- Holds pool of tasks for worker processes to do
- Sends worker a task when requested
- Collects results from workers
Worker Process: repeatedly does the following
- Gets task from master process
- Performs computation
- Sends results to master
- Worker processes do not know before runtime which portion of array they will handle or how many tasks they will perform.
- Dynamic load balancing occurs at run time: the faster tasks will get more work to do.
- Pseudo code solution: red highlights changes for parallelism.
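As a hedged illustration (not the tutorial's original highlighted pseudo code), the following C/MPI sketch shows one way to implement the master/worker "pool of tasks" scheme: the master hands out one job at a time, so faster workers automatically receive more jobs. NJOBS, do_work, and the message tags are placeholders; the sketch assumes at least two MPI tasks and NJOBS greater than or equal to the number of workers.

    #include <mpi.h>
    #include <stdio.h>

    #define NJOBS    1000               /* total number of work units (placeholder) */
    #define TAG_WORK 1
    #define TAG_STOP 2

    /* Placeholder for the per-job computation */
    static double do_work(int job) { return (double)job * job; }

    int main(int argc, char **argv)
    {
        int rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        if (rank == 0) {                          /* master: holds the pool of jobs */
            double sum = 0.0, result;
            int next_job = 0, active = ntasks - 1;
            MPI_Status st;

            /* Prime every worker with one job (assumes NJOBS >= ntasks - 1) */
            for (int w = 1; w < ntasks; w++) {
                MPI_Send(&next_job, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next_job++;
            }
            /* Hand out remaining jobs as results come back; faster workers get more */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                sum += result;
                if (next_job < NJOBS) {
                    MPI_Send(&next_job, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                    next_job++;
                } else {
                    MPI_Send(&next_job, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                    active--;
                }
            }
            printf("sum of results = %g\n", sum);
        } else {                                  /* worker: get job, compute, return result */
            int job;
            MPI_Status st;
            while (1) {
                MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                double result = do_work(job);
                MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }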
- In the above pool of tasks example, each task calculated an individual array element as a job. The computation to communication ratio is finely granular.
- Finely granular solutions incur more communication overhead in order to reduce task idle time.
- A more optimal solution might be to distribute more work with each job. The "right" amount of work is problem dependent.
PI Calculation
- The value of PI can be calculated in various ways. Consider the Monte Carlo method of approximating PI:
- Inscribe a circle with radius r in a square with side length of 2r
- The area of the circle is πr² and the area of the square is 4r².
- The ratio of the area of the circle to the area of the square is:

      πr² / 4r² = π / 4

- If you randomly generate N points inside the square, approximately N * π / 4 of those points (M) should fall inside the circle.
- π is then approximated as:

      N * π / 4 = M
      π / 4 = M / N
      π = 4 * M / N

- Note that increasing the number of points generated improves the approximation.
- Serial pseudo code for this procedure:
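A minimal serial sketch of that procedure in C (illustrative only; the point count and seed are arbitrary, and a unit quarter-circle inside the unit square is used, which gives the same π/4 ratio):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const long npoints = 1000000;      /* more points -> better approximation */
        long circle_count = 0;

        srand(12345);                      /* fixed seed, just for repeatability */
        for (long j = 0; j < npoints; j++) {
            /* random point in the unit square */
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)      /* does the point fall inside the circle? */
                circle_count++;
        }

        printf("pi is approximately %f\n", 4.0 * (double)circle_count / npoints);
        return 0;
    }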
- The problem is computationally intensive—most of the time is spent executing the loop
- Questions to ask:
- Is this problem able to be parallelized?
- How would the problem be partitioned?
- Are communications needed?
- Are there any data dependencies?
- Are there synchronization needs?
- Will load balancing be a concern?
Parallel Solution
- Another problem that's easy to parallelize:
- All point calculations are independent; no data dependencies
- Work can be evenly divided; no load balance concerns
- No need for communication or synchronization between tasks
- Parallel strategy:
- Divide the loop into equal portions that can be executed by the pool of tasks
- Each task independently performs its work
- A SPMD model is used
- One task acts as the master to collect results and compute the value of PI
- Pseudo code solution: red highlights changes for parallelism.
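As one possible realization (not the tutorial's original highlighted pseudo code), the C/MPI sketch below follows the strategy above: each task computes an equal share of the points, and rank 0 acts as the master, summing the counts with a reduction. The point count and seeding scheme are placeholders.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const long npoints = 1000000;      /* total points, divided among all tasks */
        long my_count = 0, total_count = 0;
        int rank, ntasks;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        long my_points = npoints / ntasks; /* each task gets an equal share of the loop */
        srand(12345 + rank);               /* different random stream per task          */

        for (long j = 0; j < my_points; j++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)
                my_count++;
        }

        /* One task (rank 0) acts as master: collect (sum) all of the counts */
        MPI_Reduce(&my_count, &total_count, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi is approximately %f\n",
                   4.0 * (double)total_count / (double)(my_points * ntasks));

        MPI_Finalize();
        return 0;
    }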
Example Programs
- MPI Pi Calculation Program in C
- MPI Pi Calculation Program in Fortran
Simple Heat Equation
- Most problems in parallel computing require communication among the tasks. A number of common problems require communication with "neighbor" tasks.
- The 2-D heat equation describes the temperature change over time, given initial temperature distribution and boundary conditions.
- A finite differencing scheme is employed to solve the heat equation numerically on a square region.
- The elements of a 2-dimensional array represent the temperature at points on the square.
- The initial temperature is zero on the boundaries and high in the middle.
- The boundary temperature is held at zero.
- A time stepping algorithm is used.
- The calculation of an element is dependent upon the values of its neighboring elements (a finite difference stencil involving the four nearest neighbors).
- A serial program would contain code like:
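A minimal serial sketch in C (illustrative only; the grid size, diffusion constants cx and cy, step count, and initial hot spot are placeholder choices):

    #include <stdio.h>

    #define NX 64
    #define NY 64

    int main(void)
    {
        static double u1[NX][NY], u2[NX][NY];
        const double cx = 0.1, cy = 0.1;   /* placeholder diffusion constants */
        const int nsteps = 500;

        /* Initial condition: zero on the boundaries, high in the middle */
        u1[NX / 2][NY / 2] = 100.0;

        for (int step = 0; step < nsteps; step++) {
            /* each interior element depends on its four neighbors */
            for (int ix = 1; ix < NX - 1; ix++)
                for (int iy = 1; iy < NY - 1; iy++)
                    u2[ix][iy] = u1[ix][iy]
                        + cx * (u1[ix + 1][iy] + u1[ix - 1][iy] - 2.0 * u1[ix][iy])
                        + cy * (u1[ix][iy + 1] + u1[ix][iy - 1] - 2.0 * u1[ix][iy]);

            /* advance: next step becomes current step; boundaries stay at zero */
            for (int ix = 1; ix < NX - 1; ix++)
                for (int iy = 1; iy < NY - 1; iy++)
                    u1[ix][iy] = u2[ix][iy];
        }

        printf("center temperature after %d steps: %f\n", nsteps, u1[NX / 2][NY / 2]);
        return 0;
    }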
- Questions to ask:
- Is this problem able to be parallelized?
- How would the problem be partitioned?
- Are communications needed?
- Are there any data dependencies?
- Are there synchronization needs?
- Will load balancing be a concern?
Parallel Solution
- This problem is more challenging, since there are data dependencies, which require communications and synchronization.
- The entire array is partitioned and distributed as subarrays to all tasks. Each task owns an equal portion of the total array.
- Because the amount of work is equal, load balancing should not be a concern
- Determine data dependencies:
- interior elements belonging to a task are independent of other tasks
- border elements are dependent upon a neighbor task's data, necessitating communication.
- Implement as an SPMD model:
- Master process sends initial info to workers, and then waits to collect results from all workers
- Worker processes calculate solution within specified number of time steps, communicating as necessary with neighbor processes
- Pseudo code solution: red highlights changes for parallelism.
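As a hedged illustration (not the tutorial's original highlighted pseudo code), the C/MPI sketch below uses a 1-D block distribution of rows with ghost ("halo") rows exchanged between neighbor tasks each time step. The grid size, constants, and the assumption that the row count divides evenly by the task count are illustrative simplifications.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define NX 64    /* global grid rows; assumed to divide evenly by the task count */
    #define NY 64    /* global grid columns */

    int main(int argc, char **argv)
    {
        int rank, ntasks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

        int myrows = NX / ntasks;                  /* block of rows owned by this task */
        int up   = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
        int down = (rank == ntasks - 1) ? MPI_PROC_NULL : rank + 1;

        /* local block plus one "ghost" row above and below for neighbor data */
        double u1[myrows + 2][NY], u2[myrows + 2][NY];
        memset(u1, 0, sizeof u1);
        memset(u2, 0, sizeof u2);
        if (rank == ntasks / 2)
            u1[myrows / 2 + 1][NY / 2] = 100.0;    /* hot spot near the middle */

        const double cx = 0.1, cy = 0.1;           /* placeholder diffusion constants   */
        int istart = (rank == 0)          ? 2 : 1;              /* keep the global      */
        int iend   = (rank == ntasks - 1) ? myrows - 1 : myrows;/* boundary rows at zero */

        for (int step = 0; step < 500; step++) {
            /* exchange border rows with neighbor tasks (the communication step) */
            MPI_Sendrecv(u1[1],          NY, MPI_DOUBLE, up,   0,
                         u1[myrows + 1], NY, MPI_DOUBLE, down, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u1[myrows],     NY, MPI_DOUBLE, down, 1,
                         u1[0],          NY, MPI_DOUBLE, up,   1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* update the interior points of the local block */
            for (int i = istart; i <= iend; i++)
                for (int j = 1; j < NY - 1; j++)
                    u2[i][j] = u1[i][j]
                        + cx * (u1[i + 1][j] + u1[i - 1][j] - 2.0 * u1[i][j])
                        + cy * (u1[i][j + 1] + u1[i][j - 1] - 2.0 * u1[i][j]);

            for (int i = istart; i <= iend; i++)   /* next step becomes current step */
                memcpy(u1[i], u2[i], NY * sizeof(double));
        }

        if (rank == ntasks / 2)
            printf("center temperature: %f\n", u1[myrows / 2 + 1][NY / 2]);
        MPI_Finalize();
        return 0;
    }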
Example Programs
- MPI Heat Equation Program in C
- MPI Heat Equation Program in Fortran
1-D Wave Equation
- In this example, the amplitude along a uniform, vibrating string is calculated after a specified amount of time has elapsed.
- The calculation involves:
- the amplitude on the y axis
- i as the position index along the x axis
- node points imposed along the string
- update of the amplitude at discrete time steps.
- The equation to be solved is the discretized one-dimensional wave equation:

      A(i,t+1) = (2.0 * A(i,t)) - A(i,t-1) + (c * (A(i-1,t) - (2.0 * A(i,t)) + A(i+1,t)))

  where c is a constant
- Note that amplitude will depend on previous timesteps (t, t-1) and neighboring points (i-1, i+1).
- Questions to ask:
- Is this problem able to be parallelized?
- How would the problem be partitioned?
- Are communications needed?
- Are there any data dependencies?
- Are there synchronization needs?
- Will load balancing be a concern?
1-D Wave Equation Parallel Solution
- This is another example of a problem involving data dependencies. A parallel solution will involve communications and synchronization.
- The entire amplitude array is partitioned and distributed as subarrays to all tasks. Each task owns an equal portion of the total array.
- Load balancing: all points require equal work, so the points should be divided equally
- A block decomposition would partition the work into as many chunks as there are tasks, allowing each task to own mostly contiguous data points.
- Communication need only occur on data borders. The larger the block size the less the communication.
- Implement as an SPMD model:
- Master process sends initial info to workers, and then waits to collect results from all workers
- Worker processes calculate solution within specified number of time steps, communicating as necessary with neighbor processes
- Pseudo code solution: red highlights changes for parallelism.
Example Programs
- MPI Concurrent Wave Equation Program in C
- MPI Concurrent Wave Equation Program in Fortran
This completes the tutorial.
Please complete the online evaluation form.
References and More Information
- Author: Blaise Barney, Livermore Computing (retired), Donald Frederick, LLNL
- Contact:
- A search on the Web for "parallel programming" or "parallel computing" will yield a wide variety of information.
- Recommended reading - Parallel Programming:
- "Designing and Building Parallel Programs", Ian Foster - from the early days of parallel computing, but still illuminating.
//www.mcs.anl.gov/~itf/dbpp/ - "Introduction to Parallel Computing", Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar.
- University of Oregon - Intel Parallel Computing Curriculum - //ipcc.cs.uoregon.edu/curriculum.html
- UC Berkeley CS267, Applications of Parallel Computing, Prof. Jim Demmel, UCB - //sites.google.com/lbl.gov/cs267-spr2020
- Udacity CS344: Intro to Parallel Programming - //developer.nvidia.com/udacity-cs344-intro-parallel-programming
- "Programming on Parallel Machines", Norm Matloff, UC Davis: //heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBookS2011.pdf
- Cornell Virtual Workshop: Parallel Programming Concepts and High-Performance Computing - //cvw.cac.cornell.edu/Parallel/
- CS267, Applications of Parallel Computers, Spring 2021, Prof. Jim Demmel, UCB - //sites.google.com/lbl.gov/cs267-spr2021
- "Introduction to High Performance Scientific Computing", Victor Eijkhout, TACC - //pages.tacc.utexas.edu/~eijkhout/istc/istc.html
- COMP 705: Advanced Parallel Computing (Fall, 2017), SDSU, Prof. Mary Thomas - //edoras.sdsu.edu/~mthomas/f17.705
- Georg Hager's SC '20 Tutorial on Node-Level Performance Tuning - //blogs.fau.de/hager/tutorials/sc20
- "Designing and Building Parallel Programs", Ian Foster - from the early days of parallel computing, but still illuminating.
- Recommended reading - Linux
- An Introduction to Linux - //cvw.cac.cornell.edu/Linux/
- Linux Tutorial for Beginners: Introduction to Linux Operating System - //www.youtube.com/watch?v=V1y-mbWM3B8
- "Introduction to Linux" - Boston University - //www.bu.edu/tech/files/2018/05/2018-Summer-Tutorial-Intro-to-Lin…
- Photos/Graphics have been created by the authors, created by other LLNL employees, obtained from non-copyrighted, government or public domain (such as //commons.wikimedia.org/) sources, or used with the permission of authors from other presentations and web pages.
History: These materials evolved from the following sources:
- Tutorials developed by the Cornell University Center for Advanced Computing (CAC) available at //cvw.cac.cornell.edu/.
- Tutorials developed by the Maui High Performance Computing Center’s “SP Parallel Programming Workshop” (no longer available).