Cover
Halftitle Page
About the Author
Title Page
Copyright Page
Dedication
Preface
Acknowledgements
Book Organization
Contents
Unit I — Multi-Core Processors
Chapter 1 SIMD and MIMD Systems
1.1 Single Core to Multi-Core Architectures
1.1.1 Basic Concepts of Single Core
1.1.2 Single Core Architecture
1.1.3 Multi-core Architecture
1.2 SIMD and MIMD Systems
1.2.1 SIMD Systems
1.2.2 MIMD Systems
1.3 Interconnection Networks
1.3.1 Shared Medium
1.3.2 Switched Medium
1.3.3 Topologies of Switched Network
1.3.4 2-D Mesh Network
1.3.5 Binary Tree Network
1.3.6 Hypertree Network
1.3.7 Butterfly Network
1.3.8 Hypercube Network
1.3.9 Shuffle-Exchange Network
1.3.10 Shared Memory Interconnection Network
1.3.11 Distributed Memory Interconnection Networks
1.3.12 Direct Interconnect Distributed Memory
1.3.13 Indirect Interconnect Distributed Memory
1.4 Symmetric Shared Memory Architecture
1.4.1 UMA Shared Memory Architecture
1.4.2 NUMA Symmetric Shared Memory Architecture
1.4.3 Pros and Cons of Symmetric Shared Memory Architecture
1.5 Distributed Shared Memory Architecture
1.5.1 Advantages of Distributed Shared Memory Architecture
1.6 Cache Coherence
1.6.1 Snooping Cache Coherence
1.6.2 Directory-Based Cache Coherence
1.7 Performance Issues
1.7.1 Speedup
1.7.2 Efficiency
1.7.3 Amdahl’s Law
1.7.4 Scalability
1.7.5 Timing Analysis
1.8 Parallel Program Design
1.8.1 Task/Channel Computation Model
1.8.2 Ian Foster’s Design Methodology
1.8.3 Boundary Value Problem
1.8.4 Finding the Maximum
1.8.5 n-Body Problem
1.8.6 Adding Data Input
Unit II — Parallel Program Challenges
Chapter 2 Performance and Synchronization in Parallel Programming
2.1 Challenges in Parallel Programming
2.1.1 Synchronization Challenge
2.1.2 Communication Challenge
2.1.3 Load Balancing Challenge
2.1.4 Scalability Challenge
2.2 Performance
2.2.1 Speedup of Parallel Program
2.2.2 Efficiency of Parallel Program
2.2.3 Amdahl’s Law
2.2.4 Gustafson-Barsis's Law
2.2.5 Karp-Flatt Metric
2.2.6 Common Metrics for Performance
2.2.7 Algorithm Complexity Impacting Performance
2.2.8 Structure Impacting Performance
2.2.9 Compiler Optimization Impacting Performance
2.3 Scalability
2.3.1 Isoefficiency Relation
2.4 Synchronization and Data Sharing
2.5 Data Races
2.5.1 Conditions for Data Races
2.5.2 Detecting Data Races Using Tools
2.5.3 Avoiding Data Race
2.6 Synchronization Primitives
2.7 Mutexes
2.7.1 Mutex Lock in the Critical Region
2.7.2 Protecting Critical Section by Mutex
2.8 Locks
2.8.1 Types of Locks
2.8.2 Mutex Locks
2.8.3 Recursive Locks
2.8.4 Reader-Writer Locks
2.8.5 Spin Locks
2.9 Semaphores
2.9.1 Implementation of Semaphore
2.9.2 Semaphore Solution to Producer-Consumer Problem
2.10 Barriers
2.11 Deadlocks
2.12 Livelocks
2.13 Communication between Threads
2.14 Condition Variables
2.15 Signals
2.16 Message Queues
2.16.1 Opening and Closing Message Queue
2.16.2 Passing Message into Message Queue
2.17 Pipes
2.17.1 Creating Anonymous Pipe between Parent and Child Process
2.17.2 Creating Named Pipe between Parent and Child Process
2.17.3 Creating Anonymous Pipes in Windows
Unit III — Shared Memory Programming with OpenMP
Chapter 3 OpenMP Programming
3.1 OpenMP Execution Model
3.1.1 Compiling and Executing OpenMP Programs
3.1.2 OpenMP Program Code
3.1.3 OpenMP Code Error Debugging
3.1.4 OpenMP Compilation Commands
3.2 Memory Model
3.3 OpenMP Directives
3.3.1 Parallel Directive
3.3.2 For Directive
3.3.3 Parallel For Directive
3.3.4 Critical Directive
3.3.5 Critical Directive
3.3.6 Single Directive
3.3.7 Sections and Parallel Sections Directive
3.3.8 Master Directive
3.3.9 Nowait Clause
3.3.10 Barrier Directive
3.3.11 Atomic Directive
3.3.12 Flush Directive
3.3.13 Task Directive
3.3.14 Taskwait Directive
3.3.15 Ordered Directive
3.4 Work Sharing Constructs
3.5 Library Functions
3.6 Handling Data Parallelism
3.6.1 Data Parallelism through Parallel Pragma
3.6.2 Data Parallelism through For Directive
3.6.3 Data Parallelism through Single Directive
3.6.4 Data Parallelism through “omp_get_thread_num” and “omp_get_num_threads”
3.6.5 Data Parallelism through “Nowait” Clause
3.7 Handling Functional Parallelism
3.8 Handling Loops
3.8.1 Parallelizing Loops
3.8.2 Variable Scoping in Parallelized Loops
3.8.3 Parallelizing Loop Reductions
3.8.4 Scheduling Loops
3.8.5 Collapsing Loops
3.9 Performance Considerations
3.9.1 Performance through Barrier and Nowait
3.9.2 Performance through Multithreading
3.9.3 Performance through Copyin and Copyprivate
Unit IV — Distributed Memory Programming with MPI
Chapter 4 MPI Programming
4.1 MPI Program Execution Model
4.1.1 MPI Program Compilation and Execution
4.2 MPI Constructs
4.2.1 MPI_Init Construct
4.2.2 MPI_Finalize Construct
4.2.3 MPI_Comm_size Construct
4.2.4 MPI_Comm_rank Construct
4.3 MPI Libraries
4.3.1 MPI_Abort
4.3.2 MPI_Address
4.3.3 MPI_Allgather
4.3.4 MPI_Allreduce
4.3.5 MPI_Alltoall
4.3.6 MPI_Attr_delete
4.3.7 MPI_Attr_get
4.3.8 MPI_Attr_put
4.3.9 MPI_Barrier
4.3.10 MPI_Bcast
4.3.11 MPI_Bsend
4.3.12 MPI_Buffer_attach
4.3.13 MPI_Buffer_detach
4.3.14 MPI_Cancel
4.3.15 MPI_Cart_create
4.3.16 MPI_Comm_compare
4.3.17 MPI_Comm_create
4.3.18 MPI_Comm_dup
4.3.19 MPI_Comm_free
4.3.20 MPI_Comm_group
4.3.21 MPI_Comm_remote_group
4.3.22 MPI_Comm_remote_size
4.3.23 MPI_Comm_split
4.3.24 MPI_Errhandler_create
4.3.25 MPI_Error_string
4.3.26 MPI_Gather
4.3.27 MPI_Get_count
4.3.28 MPI_Get_processor_name
4.3.29 MPI_Get_version
4.3.30 MPI_Graph_create
4.3.31 MPI_Graph_map
4.3.32 MPI_Group_compare
4.3.33 MPI_Group_free
4.3.34 MPI_Group_intersection
4.3.35 MPI_Group_rank
4.3.36 MPI_Group_size
4.3.37 MPI_Group_union
4.3.38 MPI_Init_thread
4.3.39 MPI_Intercomm_create
4.3.40 MPI_Iprobe
4.3.41 MPI_Irecv
4.3.42 MPI_Isend
4.3.43 MPI_Keyval_create
4.3.44 MPI_Pack
4.3.45 MPI_Recv
4.3.46 MPI_Probe
4.3.47 MPI_Reduce
4.3.48 MPI_Reduce_scatter
4.3.49 MPI_Request_free
4.3.50 MPI_Scan
4.3.51 MPI_Scatter
4.3.52 MPI_Send
4.3.53 MPI_Sendrecv
4.3.54 MPI_Start
4.3.55 MPI_Test
4.3.56 MPI_Topo_test
4.3.57 MPI_Type_commit
4.3.58 MPI_Type_count
4.3.59 MPI_Type_extent
4.3.60 MPI_Type_indexed
4.3.61 MPI_Type_size
4.3.62 MPI_Type_struct
4.3.63 MPI_Type_vector
4.3.64 MPI_Unpack
4.3.65 MPI_Wait
4.3.66 MPI_Waitall
4.3.67 MPI_Waitany
4.3.68 MPI_Waitsome
4.3.69 MPI_Type_ub
4.3.70 MPI_Testall
4.4 MPI Send and Receive
4.5 Point-to-Point Communication
4.6 Collective Communication
4.6.1 MPI_Reduce
4.6.2 MPI_Allreduce
4.6.3 MPI_Bcast
4.6.4 MPI_Scatter
4.6.5 MPI_Gather
4.6.6 MPI_Allgather
4.7 MPI Derived Datatypes
4.7.1 Use of MPI Derived Datatype
4.8 Performance Evaluation
4.8.1 Timing Performance
4.8.2 Performance Results
4.8.3 Speedup Performance
4.8.4 Efficiency Performance
4.8.5 Scalability Performance
Unit V — Parallel Program Development
Chapter 5 Tree Search with OpenMP and MPI
5.1 Case Studies
5.1.1 Neutron Transport System
5.1.2 Temperature at a Point Inside a Two-Dimensional Plate
5.1.3 Two-Dimensional Ising Model
5.1.4 Room Assignment Problem
5.1.5 Parking Garage System
5.1.6 Traffic Circle System
5.2 n-Body Solvers
5.2.1 Definition of n-Body Problem
5.2.2 Definition of n-Body Solver
5.2.3 n-Body Solver for Moving Stars
5.2.4 Serial n-Body Solver Program
5.2.5 Parallelizing the Serial n-Body Solver
5.2.6 Parallelizing n-Body Basic Solver Using OpenMP
5.2.7 Parallelizing n-Body Reduced Solver Using OpenMP
5.2.8 Performance of OpenMP Program
5.2.9 Parallelizing n-Body Basic Solver Using MPI
5.2.10 Parallelizing n-Body Reduced Solver Using MPI
5.2.11 Performance of MPI n-Body Solvers
5.3 Tree Search
5.3.1 Recursive Depth First Search
5.3.2 Non-Recursive Depth First Search
5.3.3 Data Structures Used in Tree Search
5.3.4 Parallelizing Tree Search
5.3.5 Tree Search Static Parallelization using Pthreads
5.3.6 Tree Search Dynamic Parallelization using Pthreads
5.3.7 Termination in Tree Search Dynamic Parallelization using Pthreads
5.3.8 Stack Splitting in Dynamic Parallelization
5.3.9 Evaluating Tree Search of Pthreads
5.4 OpenMP and MPI Implementation of Tree Search and Comparison
5.4.1 Parallelizing Tree Search Programs using OpenMP
5.4.2 Performance of OpenMP Implementation
5.4.3 Static Partitioning Tree Search Implementation using MPI
5.4.4 Dynamic Partitioning Tree Search Implementation using MPI
5.4.5 Performance of MPI Programs