Parallel programming with OpenMP and MPI

OpenMP

OpenMP is a multiprocessing library for C, C++, and Fortran based on shared memory. It is very concise, and many serial programs can be parallelised by adding just a few extra lines. However, this simplicity is somewhat deceptive: the shared memory model can lead to subtle errors when multiple threads access the same resource simultaneously, or when code in which the order of operations matters is not properly synchronised.

OpenMP is mainly implemented through preprocessor directives starting with #pragma omp. Below is an example Hello World program in C using OpenMP. To compile it with gcc, just add -fopenmp to the command (e.g. gcc -fopenmp hello.c -o hello).

#include <stdio.h>
#include <omp.h>
int main()
{
    #pragma omp parallel
    {
        printf("Hello from %d\n",omp_get_thread_num());
    }
    return 0;
}

Parallelising a for loop

Add #pragma omp for inside a parallel block, or, if the parallel block starts at the loop, use #pragma omp parallel for as a shorthand. This distributes the iterations of the loop among the threads rather than having each thread execute the entire loop:

#include <stdio.h>
#include <omp.h>
int main()
{
    int i;
    #pragma omp parallel for
    for (i=0;i<10;i++)
    {
        printf("Hello from %d\n",omp_get_thread_num());
    }
    return 0;
}

A practical example: Fitting a Gaussian distribution

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#define NVALS 10000
#define NBINS 100
void bin_data(double* data, double* xbin, double delta, double* data_binned)
{
      //bin the data
}
double calc_chi2(double mu, double sigma, double* model, double* data)
{
    //calculate chi square of model given data
}
void calc_model(double mu, double sigma, double delta, double* binx, double* model)
{
    //calculate model for given binning
}
int main() {
    double data[NVALS],xbin[NBINS],data_binned[NBINS],model[NBINS];
    double chisq[100][100];
    //Read random Gaussian distributed numbers from file to array data
    double delta = 5.0/NBINS;
    bin_data(data, xbin, delta, data_binned);
    int imu,isigma;
    //Calculate chi^2 for Gaussian distributions with different mu and sigma
#pragma omp parallel for private(model,imu,isigma)
    for (imu=0;imu<100;imu++)
    {
        for (isigma=0;isigma<100;isigma++)
        {
            double mu = imu/50.0;
            double sigma = (isigma+1.0)/100.0;
            calc_model(mu, sigma, delta, xbin, model);
            chisq[imu][isigma] = calc_chi2(mu, sigma, model, data_binned);
        }
    }
    //Find best fitting mu and sigma for Gaussian
    double minchisq = 1.e12;
    int imu_min = 0, isigma_min = 0;
    for (imu=0;imu<100;imu++)
    {
        for (isigma=0;isigma<100;isigma++)
        {
            if (chisq[imu][isigma] < minchisq)
            {
                minchisq = chisq[imu][isigma];
                imu_min = imu;
                isigma_min = isigma;
            }
        }
    }
    printf("min mu %f, min sigma %f, min chisq %f\n",imu_min/50.0, (isigma_min+1.0)/100.0,chisq[imu_min][isigma_min]);
    return 0;
}

As can be seen, the only modification needed to parallelise this program is the addition of the preprocessor directive.

Controlling variable sharing: the private clause

The private(model,imu,isigma) clause in the OpenMP directive is necessary in this case for the program to run correctly in parallel. By default, all variables declared outside of the parallel region are shared, while variables declared inside the block are private to each thread (i.e. each thread has its own local copy). Loop counters should always be private, since they must not change in the middle of an iteration. Additionally, in the example above the model array is declared outside the parallel loop and so must be made private. Otherwise, one thread could overwrite the model that another thread is still in the process of using, leading to unexpected and unpredictable behaviour.

Since they exist in a different scope, private variables are actually distinct variables even though they share the same name as a previously declared variable. By default, they are not initialised to the value the variable had outside of the parallel region, and at the end of the parallel region their values are not copied back into the outside variable. If the initial value is important, use firstprivate to copy the value from outside the parallel region into the private copy at the start of the region. Similarly, in a for loop, lastprivate can be used to copy the value of the private copy from the last iteration back into the outside variable.
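
As a minimal illustration (separate from the fitting example; the variable names scale and last are just placeholders), firstprivate and lastprivate behave as follows:

#include <stdio.h>
#include <omp.h>
int main()
{
    int i;
    int scale = 2;  //value we want every thread to start from
    int last = 0;   //will receive the value from the last iteration
#pragma omp parallel for firstprivate(scale) lastprivate(last)
    for (i=0;i<10;i++)
    {
        //each thread gets its own copies; scale starts at 2 in every thread
        last = i*scale;
    }
    //lastprivate copies the value from the last iteration (i=9) back out: 9*2=18
    printf("last %d\n",last);
    return 0;
}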

Parallelising the other for loop

We should also be able to parallelise the other for loop, which finds the minimum chi square and the corresponding best fit parameters. Doing this in full is too complicated for a bytes ‘n biscuits session, but the problem is considerably simpler if we only care about the minimum chi square value without the corresponding mu and sigma, because it then becomes a straightforward reduction.

Example of critical, atomic, and reduction: calculating a sum

Say we want to calculate the sum of the numbers from 1 to 9 (which is 1+2+3+…+9=45). Naively we might do the following:

#include <stdio.h>
#include <omp.h>
int main()
{
    int i;
    int sum = 0;
#pragma omp parallel for
    for (i=1; i < 10; i++)
    {
        sum += i;
    }
    printf("sum %d\n",sum);
}

This won’t work reliably, however, because even though the order of the additions doesn’t matter, the sum variable can be overwritten by one thread while another thread is in the process of updating it. We can’t get around this by using private in this case, because the result depends on the total sum and each thread’s own copy would only hold the partial sum of the iterations it was assigned. Instead, we can force a part of the code to be executed by only one thread at a time using the critical directive:

#pragma omp parallel for
    for (i=1; i < 10; i++)
    {
#pragma omp critical
        {
            sum += i;
        }
    }

This will print 45 every time as expected; however, it is very inefficient because the threads have to wait for each other. Therefore critical should not be used unless there is no other solution. We can make the code slightly more efficient by using atomic instead:

#pragma omp parallel for
    for (i=1; i < 10; i++)
    {
#pragma omp atomic
        sum += i;
    }

atomic is like critical but only meant for single small operations, where the compiler can optimise the code more effectively. For multiple lines, a critical block must still be used. In this case, though, OpenMP provides a far better solution: reduction. A reduction is a calculation that takes the same variable from each thread and combines the copies using some operator, e.g. summation in this case. It is such a common pattern that OpenMP has a special syntax for it:

#pragma omp parallel for reduction(+:sum)
    for (i=1; i < 10; i++)
    {
        sum += i;
    }

Then OpenMP takes care of correctly managing access to the reduction variable. If we wanted to calculate a factorial instead, we would just change the operator to reduction(*:sum), use sum *= i;, and initialise sum to 1 instead of 0 (the operator in the reduction clause and the operation in the loop body must match).
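
For instance, a minimal sketch of such a product reduction, computing 9! with the same loop bounds as before, could look like this:

#include <stdio.h>
#include <omp.h>
int main()
{
    int i;
    int sum = 1;    //must start at 1 for a product, not 0
#pragma omp parallel for reduction(*:sum)
    for (i=1; i < 10; i++)
    {
        sum *= i;
    }
    printf("product %d\n",sum);     //prints 9! = 362880
    return 0;
}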

Parallelising the other for loop using reduction

Coming back to the fitting programme: since OpenMP also has a minimum operator, we can express finding the minimum chi square value as a reduction. The loop then becomes:

#pragma omp parallel for private(imu,isigma) reduction(min:minchisq)
    for (imu=0;imu<100;imu++)
    {
        for (isigma=0;isigma<100;isigma++)
        {
            if (chisq[imu][isigma] < minchisq)
            {
                minchisq = chisq[imu][isigma];
            }
        }
    }
    printf("min chisq %f\n",minchisq);
    return 0;
}

Synchronising threads: barrier and sections

An important concept I haven’t used here is thread synchronisation. At the end of each parallel region and each parallel for loop there is an implicit barrier: the threads all wait for each other to get there before continuing. A barrier can also be placed explicitly anywhere with the #pragma omp barrier directive. Another way to control the flow is through sections, which is outside the scope of this lesson.
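
As a minimal sketch, the explicit barrier below guarantees that every thread prints its first line before any thread prints its second:

#include <stdio.h>
#include <omp.h>
int main()
{
    #pragma omp parallel
    {
        printf("Before the barrier: %d\n",omp_get_thread_num());
        //no thread continues past this point until all threads have reached it
        #pragma omp barrier
        printf("After the barrier: %d\n",omp_get_thread_num());
    }
    return 0;
}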

MPI

MPI is another library for parallel processing in C, C++, and Fortran. It uses separate memory for every process (i.e. functionally all variables are private) and does not automatically distribute loops among the processes. As such, it takes a lot more effort to parallelise a programme with MPI than with OpenMP. The advantage, though, is that it can work on larger scales where memory cannot be shared, such as on a cluster. For optimal performance the two can be combined, with OpenMP used within each node of the cluster and MPI taking care of inter-node communication. Because of the complicated nature of MPI, I’ll just show a few examples to contrast it with OpenMP without going into too much detail.

An MPI programme runs in parallel from beginning to end (i.e. it is as though the entire code were enclosed in a #pragma omp parallel block). MPI is based on explicitly sending messages between processes. A hello world programme in MPI looks like:

#include <mpi.h>
#include <stdio.h>
int main() {
    MPI_Init(NULL, NULL);                //initialise the MPI environment
    int id;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);  //get the rank (id) of this process
    printf("Hello world from %d\n",id);
    MPI_Finalize();                      //shut down the MPI environment
}

The MPI version of the fitting programme looks quite different from the OpenMP version. As all processes are active at the same time, any code that should run serially can be enclosed in an if(id==0) to ensure that it is only executed by a single process (referred to as root or master). The data file, for example, is read only by the root process and its contents are then broadcast to all the other processes, rather than having every process try to read the file. Since each process has its own chi square array, the array is allocated smaller and the results are combined afterwards with an MPI_Reduce call. This produces, on the root process, a combined array of element-wise minima across the processes, in which the root then has to find the minimum value. Below is the main function of the MPI version of the fitting programme:

int main() {
    MPI_Init(NULL, NULL);
    int id, comm_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    double data[NVALS],xbin[NBINS],data_binned[NBINS],model[NBINS];
    double chisq[100*101/comm_size];
    int i, j;
    for (i = 0; i < 100*101/comm_size; i++) chisq[i] = 1.e12;
    if (id==0)
    {
        //Read file into data array
    }
    //Send data array from root processor to everyone
    MPI_Bcast(data, NVALS, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double delta = 5.0/NBINS;
    bin_data(data, xbin, delta, data_binned);
    int imu,isigma;
    int icpu = 0;
    for (imu=0;imu<100;imu++)
    {
        if (imu%comm_size != id) continue;
        for (isigma=0;isigma<100;isigma++)
        {
            double mu = imu/50.0;
            double sigma = (isigma+1.0)/100.0;
            calc_model(mu, sigma, delta, xbin, model);
            chisq[icpu] = calc_chi2(mu, sigma, model, data_binned);
            icpu++;
        }
    }
    double chisq_min[100*101/comm_size];
    //Combine into a single minimum chi square array
    MPI_Reduce(chisq, chisq_min, 100*101/comm_size, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    //Find minimum value in the combined chi square array
    if (id==0)
    {
        double minval = 1.e12;
        for(i=0;i<100*101/comm_size;i++)
        {
            if (chisq_min[i] < minval) minval=chisq_min[i];
        }
        printf("min chisq: %f\n",minval);
    }
    MPI_Finalize();
}