#### Scientific Computing WS 2017/2018

Lecture 25

Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de

## Why parallelization?

► Computers became faster and faster without that...



- ▶ But: clock rate of processors limited due to physical limits
- $\blacktriangleright \Rightarrow$  parallelization is the main road to increase the amount of data processed
- ► Parallel systems nowadays ubiquitous: even laptops and smartphones have multicore processors
- ► Amount of accessible memory per processor is limited ⇒ systems with large memory can be created based on parallel processors

#### TOP 500 2016 rank 1-6

# Based on linpack benchmark: solution of dense linear system. Typical desktop computer: $R_{max}\approx 100\dots 1000\,GFlop/s$

| Rank | Site                                                                  | System                                                                                                                            | Cores      | Rmax<br>(TFlop/s) | Rpeak<br>(TFlop/s) | Power<br>(kW) |
|------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|------------|-------------------|--------------------|---------------|
| 1    | National Supercomputing Center in<br>Wuxi<br>China                    | Sunway TaihuLight - Sunway MPP,<br>Sunway SW26010 260C 1.45GHz,<br>Sunway<br>NRCPC                                                | 10,649,600 | 93,014.6          | 125,435.9          | 15,371        |
| 2    | National Super Computer Center in<br>Guangzhou<br>China               | Tianhe-2 [MilkyWay-2] - TH-IVB-<br>FEP Cluster, Intel Xeon E5-2692 12C<br>2.200GHz, TH Express-2, Intel Xeon<br>Phi 3151P<br>NUDT | 3,120,000  | 33,862.7          | 54,902.4           | 17,808        |
| 3    | DOE/SC/Oak Ridge National<br>Laboratory<br>United States              | Titan - Cray XK7 , Opteron 6274 16C<br>2.200GHz, Cray Gemini interconnect,<br>NVIDIA K20x<br>Cray Inc.                            | 560,640    | 17,590.0          | 27,112.5           | 8,209         |
| 4    | DOE/NNSA/LLNL<br>United States                                        | Sequoia - BlueGene/Q, Power BQC<br>16C 1.60 GHz, Custom<br>IBM                                                                    | 1,572,864  | 17,173.2          | 20,132.7           | 7,890         |
| 5    | RIKEN Advanced Institute for<br>Computational Science [AICS]<br>Japan | K computer, SPARC64 VIIIfx 2.0GHz,<br>Tofu interconnect<br>Fujitsu                                                                | 705,024    | 10,510.0          | 11,280.4           | 12,660        |
| 6    | DOE/SC/Argonne National Laboratory<br>United States                   | Mira - BlueGene/Q, Power BQC 16C<br>1.60GHz, Custom<br>IBM                                                                        | 786,432    | 8,586.6           | 10,066.3           | 3,945         |

[Source:www.top500.org ]

#### TOP 500 2016 rank 7-13

| Rank | Site                                                                  | System                                                                                                   | Cores   | Rmax<br>(TFlop/s) | Rpeak<br>(TFlop/s) | Power<br>(kW) |
|------|-----------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|---------|-------------------|--------------------|---------------|
| 7    | DOE/NNSA/LANL/SNL<br>United States                                    | Trinity - Cray XC40, Xeon E5-2698v3<br>16C 2.3GHz, Aries interconnect<br>Cray Inc.                       | 301,056 | 8,100.9           | 11,078.9           | 4,233         |
| 8    | Swiss National Supercomputing<br>Centre (CSCS)<br>Switzerland         | Piz Daint - Cray XC30, Xeon E5-2670<br>8C 2.600GHz, Aries interconnect ,<br>NVIDIA K20x<br>Cray Inc.     | 115,984 | 6,271.0           | 7,788.9            | 1,754         |
| 9    | HLRS -<br>Höchstleistungsrechenzentrum<br>Stuttgart<br>Germany        | Hazet Hen - Cray XC40, Xeon E5-<br>2680v3 12C 2.5GHz, Aries<br>interconnect<br>Cray Inc.                 | 185,088 | 5,640.2           | 7,403.5            | 3,615         |
| 10   | King Abdullah University of Science<br>and Technology<br>Saudi Arabia | Shaheen II - Cray XC40, Xeon E5-<br>2698v3 16C 2.3GHz, Aries<br>interconnect<br>Cray Inc.                | 196,608 | 5,537.0           | 7,235.2            | 2,834         |
| 11   | Total Exploration Production<br>France                                | Pangea - SGI ICE X, Xeon Xeon E5-<br>2670/ E5-2680v3 12C 2.5GHz,<br>Infiniband FDR<br>HPE/SGI            | 220,800 | 5,283.1           | 6,712.3            | 4,150         |
| 12   | Texas Advanced Computing<br>Center/Univ. of Texas<br>United States    | Stampede - PowerEdge C8220, Xeor<br>E5-2680 8C 2.700GHz, Infiniband<br>FDR, Intel Xeon Phi SE10P<br>Dell | 462,462 | 5,168.1           | 8,520.1            | 4,510         |
| 13   | Forschungszentrum Juelich (FZJ)<br>Germany                            | JUQUEEN - BlueGene/Q, Power BQC<br>16C 1.600GHz, Custom Interconnect<br>IBM                              | 458,752 | 5,008.9           | 5,872.0            | 2,301         |

[Source:www.top500.org]

#### Parallel paradigms

#### SIMD Single Instruction Multiple Data



[Source: computing.llnl.gov/tutorials]

- "classical" vector systems: Cray, Convex
- Graphics processing units (GPU)

## MIMD Multiple Instruction Multiple Data



[Source: computing.llnl.gov/tutorials]

- Shared memory systems
  - ► IBM Power, Intel Xeon, AMD Opteron . . .
  - Smartphones . . .
  - Xeon Phi R.I.P.
- Distributed memory systems
  - interconnected CPUs

#### MIMD Hardware: Distributed memory



[Source: computing.llnl.gov/tutorials]

- "Linux Cluster"
- "Commodity Hardware"
- Memory scales with number of CPUs interconneted
- ► High latency for communication
- Mostly programmed using MPI (Message passing interface)
- Explicit programming of communications: gather data, pack, send, receive, unpack, scatter

MPI\_Send(buf,count,type,dest,tag,comm)
MPI\_Recv(buf,count,type,src,tag,comm,stat)

#### MIMD Hardware: Shared Memory

Symmetric Multiprocessing (SMP)/Uniform memory acces (UMA)



[Source: computing.llnl.gov/tutorials]

- Similar processors
- ► Similar memory access times

#### Nonuniform Memory Access (NUMA)



[Source: computing.llnl.gov/tutorials]

- Possibly varying memory access latencies
- Combination of SMP systems
- ccNUMA: Cache coherent NUMA
- Shared memory: one (virtual) address space for all processors involved
- Communication hidden behind memory acces
- Not easy to scale large numbers of CPUS
- MPI works on these systems as well

## Hybrid distributed/shared memory

- Combination of shared and distributed memory approach
- ► Top 500 computers



[Source: computing.llnl.gov/tutorials]

- ▶ Shared memory nodes can be mixed CPU-GPU
- Need to master both kinds of programming paradigms

## Shared memory programming: pthreads

- ▶ Thread: lightweight process which can run parallel to others
- pthreads (POSIX threads): widely distributed
- cumbersome tuning + syncronization
- basic structure for higher level interfaces

```
#include <pthread.h>
void *PrintHello(void *threadid)
{ long tid = (long)threadid;
  printf("Hello World! It's me, thread #%ld!\n", tid);
  pthread_exit(NULL);
int main (int argc, char *argv[])
{ pthread_t threads[NUM_THREADS];
  int rc; long t;
  for(t=0; t<NUM_THREADS; t++){</pre>
    printf("In main: creating thread %ld\n", t);
    rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
    if (rc) {printf("ERROR; return code from pthread_create() is %d\n", rc
    pthread_exit(NULL);
Source: computing.llnl.gov/tutorials
```

compile and link with

gcc -pthread -o pthreads pthreads.c

#### Shared memory programming: C++11 threads

- ▶ Threads introduced into C++ standard with C++11
- ▶ Quite late... many codes already use other approaches
- ▶ But interesting for new applications

```
#include <iostream>
#include <thread>
void call from thread(int tid) {
  std::cout << "Launched by thread " << tid << std::endl;
int main() {
  std::thread t[num threads];
  for (int i = 0; i < num_threads; ++i) {</pre>
    t[i] = std::thread(call from thread, i);
  std::cout << "Launched from main\n":
  //Join the threads with the main thread
  for (int i = 0; i < num_threads; ++i) {</pre>
    t[i].join();
  return 0:
```

Source: https://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/

compile and link with

## Thread programming: mutexes and locking

- ▶ If threads work with common data (write to the same memory address, use the same output channel) access must be synchronized
- ► Mutexes allow to define regions in a program which are accessed by all threads in a sequential manner.

```
#include <mutex>
std::mutex mtx:
void call from thread(int tid) {
  mtx.lock()
  std::cout << "Launched by thread " << tid << std::endl;
  mtx.unlock()
int main() {
  std::thread t[num threads];
  for (int i = 0; i < num_threads; ++i) {</pre>
    t[i] = std::thread(call from thread, i);
  std::cout << "Launched from main\n":
  for (int i = 0; i < num threads; ++i) t[i].join();</pre>
  return 0:
```

- ▶ Barrier: all threads use the same mutex for the same region
- ▶ Deadlock: two threads block each other by locking two different locks and waiting for each other to finish

## Shared memory programming: OpenMP

- Mostly based on pthreads
- ▶ Available in C++,C,Fortran for all common compilers
- ► Compiler directives (pragmas) describe parallel regions

```
... sequential code ...
#pragma omp parallel
{
    ... parallel code ...
}
(implicit barrier)
... sequential code ...
```



[Source: computing.llnl.gov/tutorials]

## Shared memory programming: OpenMP II

```
#include <iostream>
#include <cstdlib>
void call from thread(int tid) {
  std::cout << "Launched by thread " << tid << std::endl;
int main (int argc, char *argv[])
  int num threads=1:
  if (argc>1) num_threads=atoi(argv[1]);
  #pragma omp parallel for
  for (int i = 0; i < num threads; ++i)</pre>
    call from thread(i):
  return 0;
```

compile and link with

```
g++ -fopenmp -o cppomp cppomp.cxx
```

#### Example: u = au + v und $s = u \cdot v$

```
double u[n],v[n];
#pragma omp parallel for
for(int i=0; i<n; i++)
    u[i]+=a*v[i];

//implicit barrier
double s=0.0;
#pragma omp parallel for
for(int i=0; i<n; i++)
    s+=u[i]*v[i];</pre>
```

- Code can be parallelized by introducing compiler directives
- ► Compiler directives are ignored if not in parallel mode
- ightharpoonup Write conflict with + s: several threads may access the same variable

## Preventing conflicts in OpenMP

Critical sections are performed only by one thread at a time

```
double s=0.0;
#pragma omp parallel for
for(int i=0; i<n; i++)
#pragma omp critical
{
   s+=u[i]*v[i];
}</pre>
```

► Expensive, parallel program flow is interrupted

#### Do it yourself reduction

- Remedy: accumulate partial results per thread, combine them after main loop
- "Reduction"

```
#include <omp.h>
int maxthreads=omp_get_max_threads();
double s0[maxthreads];
double u[n],v[n];
for (int ithread=0;ithread<maxthreads; ithread++)</pre>
   s0[ithread]=0.0;
#pragma omp parallel for
for(int i=0; i<n; i++)</pre>
  int ithread=omp_get_thread_num();
  s0[ithread]+=u[i]*v[i];
double s=0.0;
for (int ithread=0:ithread<maxthreads: ithread++)</pre>
  s+=s0[ithread]:
```

## OpenMP Reduction Variables

```
double s=0.0;
#pragma omp parallel for reduction(+:s)
for(int i=0; i<n; i++)
    s+=u[i]*v[i];</pre>
```

▶ In standard situations, reduction variables can be used to avoid write conflicts, no need to organize this by programmer

#### OpenMP: further aspects

```
double u[n],v[n];
#pragma omp parallel for
for(int i=0; i<n; i++)
u[i]+=a*u[i];</pre>
```

```
        Memory
        CPU | C
```

[Quelle: computing.llnl.gov/tutorials]

- ► Distribution of indices with thread is implicit and can be influenced by scheduling directives
- Number of threads can be set via OMP\_NUM\_THREADS environment variable or call to omp\_set\_num\_threads()
- ► First Touch Principle (NUMA): first thread which "touches" data triggers the allocation of memory with the processeor where the thread is running on

#### Parallelization of PDE solution

$$\Delta u = f \text{ in}\Omega,$$
  $u|_{\partial\Omega} = 0$   $\Rightarrow u = \int_{\Omega} f(y)G(x,y)dy.$ 

- ▶ Solution in  $x \in \Omega$  is influenced by values of f in all points in  $\Omega$
- → global coupling: any solution algorithm needs global communication

## Structured and unstructured grids

#### Structured grid



- Easy next neighbor access via index calculation
- Efficient implementation on SIMD/GPU
- Strong limitations on geometry

#### Unstructured grid



[Quelle: tetgen.org]

- General geometries
- Irregular, index vector based access to next neighbors
- ► Hardly feasible fo SIMD/GPU

## Stiffness matrix assembly for Laplace operator for P1 FEM

$$a_{ij} = a(\phi_i, \phi_j) = \int_{\Omega} \nabla \phi_i \nabla \phi_j \, dx$$
$$= \int_{\Omega} \sum_{K \in \mathcal{T}_h} \nabla \phi_i |_K \nabla \phi_j |_K \, dx$$

Assembly loop:

Set  $a_{ij} = 0$ .

For each  $K \in \mathcal{T}_h$ :

For each  $m, n = 0 \dots d$ :

$$s_{mn} = \int_{K} 
abla \lambda_{m} 
abla \lambda_{n} \ dx$$
  $a_{j_{dof}(K,m),j_{dof}(K,n)} = a_{j_{dof}(K,m),j_{dof}(K,n)} + s_{mn}$ 

#### Mesh partitioning

Partition set of cells in  $\mathcal{T}_h$ , and color the graph of the partitions.

Result:  $\mathcal{C}$ : set of colors,  $\mathcal{P}_c$ : set of partitions of given color. Then:  $\mathcal{T}_h = \bigcup_{c \in \mathcal{C}} \bigcup_{p \in \mathcal{P}_c} p$ 

- ► Sample algorithm:
  - ightharpoonup Subdivision of grid cells into equally sized subsets by METIS (Karypis/Kumar) ightharpoonup Partitions of color 1
  - $\blacktriangleright$  Create separators along boundaries  $\rightarrow$  Partitions of color 2
  - ightharpoonup "triple points" ightharpoonup Partitions of color 3



# Parallel stiffness matrix assembly for Laplace operator for P1 FEM

```
Set a_{ij}=0.

For each color c\in\mathcal{C}

#pragma omp parallel for

For each p\in\mathcal{P}_c:

For each K\in p:

For each m,n=0\ldots d:

s_{mn}=\int_K \nabla \lambda_m \nabla \lambda_n \ dx

a_{j_{dof}(K,m),j_{dof}(K,n)}+=s_{mn}
```

- Prevent write conflicts by loop organization
- No need for critical sections
- Similar structure for Voronoi finite volumes, nonlinear operator evaluation, Jacobi matrix assembly

#### Linear system solution

- Sparse matrices
- ▶ Direct solvers are hard to parallelize though many efforts are undertaken, e.g. Pardiso
- Iterative methods easier to parallelize
  - partitioning of vectors + coloring inherited from cell partitioning
  - keep loop structure (first touch principle)
  - parallelize
    - vector algebra
    - scalar products
    - matrix vector products
    - preconditioners