Parallel programming with OpenMP and MPI
OpenMP
OpenMP is a multiprocessing library for C, C++, and Fortran based on shared memory. It’s very concise and many serial programs can be parallelised by adding just a few extra lines. However, the simplicity is a bit deceptive in that the shared memory model can lead to some subtle errors due to multiple threads trying to access the same resource simultaneously or pieces of code where the order of operations is important not being properly synchronised.
OpenMP is mainly implemented through preprocessor directives starting with #pragma omp. Below is an example Hello World program in C using OpenMP. To compile it with gcc, just add -fopenmp to the command.
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        printf("Hello from %d\n", omp_get_thread_num());
    }
    return 0;
}
Parallelising a for loop
Add #pragma omp for inside a parallel block, or, if the parallel block starts at the loop, #pragma omp parallel for can be used as a shorthand. This will delegate the iterations of the loop among the threads rather than each thread executing the entire loop:
#include <stdio.h>
#include <omp.h>

int main()
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < 10; i++)
    {
        printf("Hello from %d\n", omp_get_thread_num());
    }
    return 0;
}
A practical example: Fitting a Gaussian distribution
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>

#define NVALS 10000
#define NBINS 100

void bin_data(double* data, double* xbin, double delta, double* data_binned)
{
    //bin the data
}

double calc_chi2(double mu, double sigma, double* model, double* data)
{
    //calculate chi square of model given data
}

void calc_model(double mu, double sigma, double delta, double* binx, double* model)
{
    //calculate model for given binning
}

int main() {
    double data[NVALS], xbin[NBINS], data_binned[NBINS], model[NBINS];
    double chisq[100][100];

    //Read random Gaussian distributed numbers from file to array data

    double delta = 5.0/NBINS;
    bin_data(data, xbin, delta, data_binned);

    int imu, isigma;
    //Calculate chi^2 for Gaussian distributions with different mu and sigma
    #pragma omp parallel for private(model,imu,isigma)
    for (imu = 0; imu < 100; imu++)
    {
        for (isigma = 0; isigma < 100; isigma++)
        {
            double mu = imu/50.0;
            double sigma = (isigma+1.0)/100.0;
            calc_model(mu, sigma, delta, xbin, model);
            chisq[imu][isigma] = calc_chi2(mu, sigma, model, data_binned);
        }
    }

    //Find best fitting mu and sigma for Gaussian
    double minchisq = 1.e12;
    int imu_min, isigma_min;
    for (imu = 0; imu < 100; imu++)
    {
        for (isigma = 0; isigma < 100; isigma++)
        {
            if (chisq[imu][isigma] < minchisq)
            {
                minchisq = chisq[imu][isigma];
                imu_min = imu;
                isigma_min = isigma;
            }
        }
    }
    printf("min mu %f, min sigma %f, min chisq %f\n", imu_min/50.0, (isigma_min+1.0)/100.0, chisq[imu_min][isigma_min]);
    return 0;
}
As can be seen, the only modification to parallelise this program is to add the preprocessor directive.
Controlling variable sharing: the private clause
The private(model,imu,isigma)
part of the OpenMP directive is necessary in
this case for the program to run correctly in parallel. By default, all
variables declared outside of the parallel region are shared while variables
declared inside the block are private to each thread (i.e. each thread has its
own local copy). Loop counters should always be private since they shouldn’t
change in the middle of an iteration. Additionally, in the example above the
model array is declared outside the parallel loop and so must be set to private.
Otherwise multiple threads might try to fit different models at the same time
such that one of them overwrites the model that another thread is in the process
of using, leading to unexpected and unpredictable behaviour.
Since they exist in a different scope, private variables are actually distinct
variables even though they share the same name as a variable previously
declared. By default, they are not set to the value they had outside of the
parallel region and at the end of the parallel region their values are not
copied into the variables existing outside. If the initial value is important
use firstprivate
to copy the value from outside the parallel region into the
private variable at the start of the region. Similarly, in a for loop
lastprivate
can be used to copy the value of the private variable at the last
iteration into the outside variable.
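As a quick illustration (this is not part of the fitting example; the variable names offset and last are just placeholders), the following sketch shows firstprivate carrying a value into the parallel region and lastprivate carrying a value back out:
#include <stdio.h>
#include <omp.h>

int main()
{
    int i;
    int offset = 100;   //initial value each thread's private copy should start from
    int last = 0;       //will receive the value from the final loop iteration

    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < 10; i++)
    {
        last = offset + i;   //offset is 100 in every thread thanks to firstprivate
    }

    //lastprivate copies the value from the sequentially last iteration (i == 9) back out
    printf("last = %d\n", last);
    return 0;
}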
Parallelising the other for loop
We should also be able to parallelise the other for loop, which finds the minimum chi square and the corresponding best-fit parameters. Doing this in full is too complicated for a bytes ‘n biscuits session, but the problem is considerably simpler if we only care about the minimum chi square value without the corresponding mu and sigma, because it then becomes a straightforward reduction.
Example of critical, atomic, and reduction: calculating a sum
Say we want to calculate the sum of the numbers from 1 to 9 (which is 1+2+3+…+9=45). Naively we might do the following:
#include <stdio.h>
#include <omp.h>

int main()
{
    int i;
    int sum = 0;
    #pragma omp parallel for
    for (i = 1; i < 10; i++)
    {
        sum += i;
    }
    printf("sum %d\n", sum);
    return 0;
}
This won’t work, however: even though the order of the additions doesn’t matter, one thread may overwrite sum while another thread is in the middle of updating it (a race condition). We can’t get around this by using private in this case, because the result depends on the total sum and each thread’s own copy would only hold the partial sum of the iterations it was delegated. Instead, we can force a part of the code to be executed by only one thread at a time using the critical directive:
#pragma omp parallel for
for (i = 1; i < 10; i++)
{
    #pragma omp critical
    {
        sum += i;
    }
}
This will print 45 every time as expected, but it’s very inefficient because the threads have to wait for each other. Therefore critical should not be used unless there’s no other solution. We can make the code slightly more efficient by using atomic instead:
#pragma omp parallel for
for (i = 1; i < 10; i++)
{
    #pragma omp atomic
    sum += i;
}
atomic is like critical but only meant for a single small operation. In those cases the compiler can optimise the code a bit more effectively. For multiple lines, a critical block must still be used. Even so, OpenMP provides a far better solution for this problem: reduction. A reduction is a calculation that takes the same variable from each thread and combines the copies using some operator, e.g. summation in this case. It’s such a common pattern that OpenMP has special syntax for it:
#pragma omp parallel for reduction(+:sum)
for (i = 1; i < 10; i++)
{
    sum += i;
}
OpenMP then takes care of correctly managing access to the reduction variable. If we wanted to calculate a factorial instead, we would change the clause to reduction(*:sum), the update to sum *= i;, and initialise sum to 1 rather than 0 (the operator in the reduction clause and the operation in the loop body must match).
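For example, a minimal sketch of the factorial as a product reduction might look like this (the variable name fact and the range 1 to 10 are only for illustration; note the initialisation to 1):
#include <stdio.h>
#include <omp.h>

int main()
{
    int i;
    long long fact = 1;   //must start at 1, the identity of multiplication

    #pragma omp parallel for reduction(*:fact)
    for (i = 1; i <= 10; i++)
    {
        fact *= i;
    }

    printf("10! = %lld\n", fact);   //prints 3628800
    return 0;
}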
Parallelising the other for loop using reduction
Coming back to the fitting programme, since OpenMP also provides a minimum reduction operator (min, available from OpenMP 3.1 onwards) we can express finding the minimum chi square value as a reduction. The loop then becomes:
#pragma omp parallel for private(imu,isigma) reduction(min:minchisq)
for (imu = 0; imu < 100; imu++)
{
    for (isigma = 0; isigma < 100; isigma++)
    {
        if (chisq[imu][isigma] < minchisq)
        {
            minchisq = chisq[imu][isigma];
        }
    }
}
printf("min chisq %f\n", minchisq);
return 0;
}
Synchronising threads: barrier and sections
An important concept I haven’t used here is thread synchronisation. At the end of each parallel section and for loop there is an implied barrier. That is, the threads all wait for each other to get there before continuing. A barrier can also be explicitly put anywhere with the #pragma omp barrier directive. Another way to control the flow is through sections which is outside of the scope of this lesson.
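As a small sketch (not part of the fitting example), an explicit barrier looks like this: no thread prints its “after” line until every thread has printed its “before” line.
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("Before the barrier: thread %d\n", id);

        //all threads wait here until every thread has reached this point
        #pragma omp barrier

        printf("After the barrier: thread %d\n", id);
    }
    return 0;
}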
MPI
MPI is another library for parallel processing in C, C++, and Fortran. It uses separate memory for every process (i.e. functionally all variables are private) and does not automatically delegate loops among the processes. As such, it takes a lot more effort to parallelise a programme in MPI than in OpenMP. The advantage, though, is that it can work on larger scales where memory cannot be shared, such as on a cluster. For optimal performance the two can be combined, with OpenMP being used within each node of the cluster and MPI taking care of inter-node communication. Because of the complicated nature of MPI, I’ll just show a few examples to contrast it with OpenMP without going into too much detail.
An MPI programme is always running in parallel from beginning to end (i.e. it’s as though the entire code were enclosed in a #pragma omp parallel block). MPI is based on explicitly sending messages between processes. A hello world programme in MPI looks like:
#include <mpi.h>
#include <stdio.h>

int main() {
    MPI_Init(NULL, NULL);

    int id;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    printf("Hello world from %d\n", id);

    MPI_Finalize();
    return 0;
}
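The explicit message passing mentioned above is done with calls such as MPI_Send and MPI_Recv, which are not needed in the fitting example below (it uses collective operations instead). A minimal point-to-point sketch, not part of the original lesson and assuming the programme is run with at least two processes:
#include <mpi.h>
#include <stdio.h>

int main() {
    MPI_Init(NULL, NULL);

    int id;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    int value;
    if (id == 0)
    {
        value = 42;
        //send one int to rank 1 with message tag 0
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (id == 1)
    {
        //receive one int from rank 0 with matching tag 0
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}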
The MPI version of the fitting programme looks quite different from the OpenMP version. As all processes are active at the same time, any code that should run in serial can be enclosed in an if (id==0) to ensure that it will only be run by a single process (referred to as the root or master). The file, for example, is only read by the root process and then broadcast to all the other processes, rather than having every process try to read from the file. Since each process has its own private chi square array, it is allocated to be smaller and then combined afterwards with a reduction command. This creates a combined array of the minimum elements across the processes on the root process, which then has to find the minimum value in that array. Below is the main function of the MPI version of the fitting programme:
int main() {
    MPI_Init(NULL, NULL);

    int id, comm_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    double data[NVALS], xbin[NBINS], data_binned[NBINS], model[NBINS];

    //Each process only stores its own share of the chi square values;
    //round up so the array is large enough even when comm_size does not divide 100
    int nlocal = ((100 + comm_size - 1) / comm_size) * 100;
    double chisq[nlocal];
    int i;
    for (i = 0; i < nlocal; i++) chisq[i] = 1.e12;

    if (id == 0)
    {
        //Read file into data array
    }

    //Send the data array from the root process to everyone
    MPI_Bcast(data, NVALS, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double delta = 5.0/NBINS;
    bin_data(data, xbin, delta, data_binned);

    int imu, isigma;
    int icpu = 0;
    //Each process only calculates the imu values assigned to it
    for (imu = 0; imu < 100; imu++)
    {
        if (imu % comm_size != id) continue;
        for (isigma = 0; isigma < 100; isigma++)
        {
            double mu = imu/50.0;
            double sigma = (isigma+1.0)/100.0;
            calc_model(mu, sigma, delta, xbin, model);
            chisq[icpu] = calc_chi2(mu, sigma, model, data_binned);
            icpu++;
        }
    }

    //Combine into a single element-wise minimum chi square array on the root process
    double chisq_min[nlocal];
    MPI_Reduce(chisq, chisq_min, nlocal, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    //Find the minimum value in the combined chi square array
    if (id == 0)
    {
        double minval = 1.e12;
        for (i = 0; i < nlocal; i++)
        {
            if (chisq_min[i] < minval) minval = chisq_min[i];
        }
        printf("min chisq: %f\n", minval);
    }

    MPI_Finalize();
    return 0;
}