Some notes and experiences from scaling LSDTopoTools code to parallel computing facilities (HPC). I’m using the OpenMP standard, which is widely used and applicable to a wide range of computing architectures. It will also work on your desktop PC or laptop if you have a multicore CPU.

Identifying hotspots and ‘parallelisable’ functions

An example with LSDCatchmentModel: First, I profiled the code to find where the bottlenecks are. I used gprof for this. Before you compile the code, remember to enable the profiling flag (-pg) in the makefile. When you then run the compiled code, it will produce a gmon.out file containing all the profiling data. To generate some usable output from this, run:

gprof MyLSDExecutable.out gmon.out > analysis.txt
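(For reference, the compile step with profiling enabled might look something like this for gcc. The file names here are only placeholders and your makefile will differ; the important part is that -pg is present at both the compile and link stages.)

g++ -pg -O2 -o MyLSDExecutable.out LSDCatchmentModel_driver.cpp LSDCatchmentModel.cpp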

The analysis text file will give you some output like so:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 77.71   5993.06  5993.06   234993    25.50    25.50  LSDCatchmentModel::qroute()
 17.63   7352.97  1359.91   234993     5.79     5.79  LSDCatchmentModel::depth_update()
  3.58   7629.10   276.13    46998     5.88     5.88  LSDCatchmentModel::scan_area()
  1.02   7707.40    78.29   234993     0.33     0.34  LSDCatchmentModel::catchment_water_input_and_hydrology(double)
  0.03   7709.43     2.03   234993     0.01     0.01  LSDCatchmentModel::water_flux_out(double)
  0.01   7710.29     0.86        7   122.86   191.43  LSDCatchmentModel::get_area4()
  ...

The analysis tells us that one function in this code takes up the bulk of the compute time: qroute() (in this case, a water-routing algorithm), followed by a water-depth update function and an area-scanning function. This is quite a good scenario - we don’t have to parallelise dozens of functions to get a good return.

Parallelisation

I used the OpenMP library to parallelise the code. OpenMP is a magical set of preprocessor directives that you add to sections of your code to run them in parallel. It does some clever stuff to your code when it compiles. For some light bedtime reading, you could consult the 568-page specification document.

Okay, now you’ve read that, you will know that you need to add #include <omp.h> to one of your header or source files. The OpenMP libraries come bundled with most compilers as standard; if you have the gcc-devel package installed, you will almost certainly have the OpenMP libraries and thus the omp.h header file.
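If you want to check that OpenMP is set up correctly before touching any LSDTopoTools code, a minimal stand-alone test program (not part of LSDTT, just a sanity check) might look like this:

#include <omp.h>
#include <cstdio>

int main()
{
  // Each thread reports its ID; the number of threads is controlled
  // by the OMP_NUM_THREADS environment variable (see the later section).
  #pragma omp parallel
  {
    printf("Hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}

Compile it with g++ -fopenmp and you should see one line printed per thread.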

Secondly, you need to identify the parts of the function that can be parallelised. I’ll use the scan_area() function as a simple example (qroute() is not so straightforward…). This function has a for loop that iterates over every element in a 2D raster matrix. It checks whether the water depth is greater than zero, and sets a value in another array based on this test. This loop is easily parallelisable because the iterations do not depend on each other, and could be calculated in any order.

  #pragma omp parallel for
  for (int j=1; j <= jmax; j++)
  {
    int inc = 1;
    for (int i=1; i <= imax; i++)
    {
      // zero scan bit..
      down_scan[j][i] = 0;
      // and work out scanned area by checking the 8 surrounding cells and centre cell.
      if (water_depth[i][j] > 0
          || water_depth[i][j - 1] > 0
          || water_depth[i][j + 1] > 0
          || water_depth[i - 1][j] > 0
          || water_depth[i - 1][j - 1] > 0
          || water_depth[i - 1][j + 1] > 0
          || water_depth[i + 1][j - 1] > 0
          || water_depth[i + 1][j + 1] > 0
          || water_depth[i + 1][j] > 0
          )
      {
        down_scan[j][inc] = i;
        inc++;
      }
    }
  }

When the code runs with OpenMP enabled, the iterations in this loop will be distributed across different threads and CPUs. (More on this later.) I put similar #pragma omp parallel for lines in the two other expensive functions found by the profiler.

Note that jmax and imax are variables declared outside the scope of the for loop. We don’t need to worry about them being modified or changing value, since we only read them. The same goes for water_depth[][]: we are just reading the array, not modifying it. When the parallel region is reached, it is divided up between the number of threads specified before runtime (see the later section). Since the value of jmax is known, each thread performs roughly jmax/threads iterations. Each thread has its own private copy of all the variables declared inside the loop.

What happens at the inner loop? Each thread must carry out the inner loop over its full number of iterations, i.e. imax. (There is no further ‘nested’ parallelism here.) The shared down_scan[][] array is modified, but there is no chance of accessing/overwriting the wrong part of the matrix because: 1) the j indices are unique to each outer loop thread - this is taken care of by the #pragma omp for statement, and 2) inc is always local to the inner loop iterations, and always starts at 1 in every thread.

Note the incrementer inc++. Luckily, inc is declared as a local variable inside the outer for loop, i.e. every outer-loop iteration starts with a fresh int inc = 1;. If this were not the case, and inc were a shared (global) variable, we would be in trouble: inc could be read and written by different threads at different times, so the final value could be wrong. (There is a solution to this kind of problem using a reduction - see the sketch after the tutorial links below - but thankfully it isn’t needed here, although it is a common occurrence.) You will need to check which loops are suitable for parallelisation in your own code. This is the tricky bit: not all loop iterations are independent of each other and parallelise so easily. It is best to consult a brief tutorial on OpenMP and look at other working examples, so I won’t go into any more detail here. Here are some useful tutorials:

Dr Dobbs - Getting Started with OpenMP

Lawrence Livermore National Lab - OpenMP tutorial

ARCHER (the UK national supercomputing centre) also lists details of upcoming training courses, and you can look through their training slides and exercises here:

Open MP Tutorial by ARCHER - Course Materials
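As promised, here is a quick sketch of the reduction mentioned above. Suppose (hypothetically - this is not LSDCatchmentModel code, just an illustration reusing the same variable names) you wanted to sum the water depth over the whole grid. The total is a single shared variable, so a naive parallel loop would give wrong answers; the reduction clause gives each thread a private partial sum and combines them safely at the end:

double total_depth = 0.0;
#pragma omp parallel for reduction(+:total_depth)
for (int j = 1; j <= jmax; j++)
{
  for (int i = 1; i <= imax; i++)
  {
    // Each thread accumulates into its own private copy of total_depth;
    // OpenMP adds the partial sums together when the loop finishes.
    total_depth += water_depth[i][j];
  }
}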

Compiling and running

You need to add just one extra flag to your makefile: -fopenmp (for gcc users). That’s it. You probably want to turn on an optimisation flag as well (e.g. -O2), unless you are debugging.
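For the record, a full compile line outside of a makefile might look something like this (the file names are just placeholders for your own sources):

g++ -O2 -fopenmp -o MyLSDExecutable.out LSDCatchmentModel_driver.cpp LSDCatchmentModel.cpp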

Once compiled, you can run the executable on your computer/cluster. All you need to do beforehand is set the number of threads that you want the code to use. (It can’t always detect this from your environment, so it’s best to set it explicitly.) In the Linux terminal this is done like so (I have no idea how you do it in Windows):

export OMP_NUM_THREADS=8

Here, 8 should usually be the same as the number of available cores on your machine. Usually this means one physical processor with several physical cores on the chip. If you are running this in some sort of cluster environment, OpenMP is limited to accessing a single compute node and the memory associated with that node. You may have more than one compute node available, but an OpenMP application cannot distribute a program across multiple separate nodes. (Read the Dr Dobbs link above if this is unclear.)
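You can also set the thread count from inside the code with the OpenMP runtime API; a call to omp_set_num_threads() overrides whatever OMP_NUM_THREADS was set to for subsequent parallel regions. A minimal sketch (the hard-coded 8 is just an example value):

#include <omp.h>
#include <cstdio>

int main()
{
  // Request 8 threads for all subsequent parallel regions,
  // regardless of OMP_NUM_THREADS.
  omp_set_num_threads(8);

  #pragma omp parallel
  {
    // Only one thread prints the team size.
    #pragma omp single
    printf("Running with %d threads\n", omp_get_num_threads());
  }
  return 0;
}

On a cluster, the environment variable is usually the more flexible option, since you can change it per job without recompiling.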

Then just run the executable as normal. You can check the usage of your CPUs using the top command in Linux; press 1 to get a breakdown of usage by CPU. You should hopefully see that all of your CPUs (cores) are now in use, compared to just the one that would have been used when running the code serially.

Results from porting to ARCHER

ARCHER is a massive HPC cluster housed in Edinburgh, used for academic research in the UK. It’s made up of a load of compute nodes (2000+ of them). Each node has two 12-core processors sharing 64 GB of RAM. There is some clever technology called hyper-threading which you can turn on to ‘double’ the number of logical cores available, giving you a maximum of 48 processing units. (Confusingly, these 48 logical cores are referred to as CPUs, even though they aren’t physical CPUs as we might normally think of them.)

I did some simulations with LSDCatchmentModel using a variety of core/thread configurations. The test simulation was 48 hrs (real time) of the Boscastle storm and flooding in 2004. In serial mode (i.e. no parallelisation) the simulation took around 4 hours. With the optimised, parallel code, I could get this down to about 11 minutes in the best-case scenario.

Cores/CPUs   Hyper-threading   OpenMP threads   Time
1*           no                1                c. 4 hours
8*           yes               8                c. 50 mins
12           no                12               28 mins
24           yes               24               23 mins
24           no                24               15 mins
48           yes               48               11 mins

*These two simulations were done on a desktop PC with a single CPU broadly comparable to the ARCHER nodes, so it’s not quite a fair test, but close enough. The rest were done on a standard ARCHER compute node. 48 (hyper-threaded) logical cores is the maximum available on a single compute node.

Comments on scaling

Linear speed-up is theoretically possible (i.e. the speed-up is simply a multiple of the number of processors), but it is unlikely to happen perfectly in practice. The scaling here is roughly sub-linear. Hyper-threading (running, in effect, two logical cores on a physical core, ‘doubling’ the number of CPUs available) is rarely as effective as adding true cores, though the gains are noticeable here. See Amdahl’s Law if you are really curious about how far this could scale - I think (as a rough guess) it’s approaching the region of diminishing returns. Unfortunately I don’t have enough CPUs/cores to test it further!
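As a very rough back-of-the-envelope illustration (using the approximate timings above, so treat the numbers loosely): Amdahl’s Law says the best possible speed-up on N cores is

speedup(N) = 1 / ((1 - p) + p/N)

where p is the fraction of the runtime that can be parallelised. Going from roughly 240 minutes in serial to 15 minutes on 24 physical cores is a speed-up of about 16x, which corresponds to p of roughly 0.98. That in turn puts a ceiling of around 1/(1 - p) on the achievable speed-up - somewhere in the region of 45-50x, or about 5 minutes for this simulation - no matter how many more cores you throw at it, which is consistent with the flattening-out visible in the table.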

Some general thoughts about LSDTopoTools and parallelisation

Warning: These ideas/ramblings have not been verified by an actual computer scientist…

A lot of the algorithms in LSDTT involve iteration over the elements of an array or 2D matrix. On a high-resolution or large DEM this is potentially hundreds of thousands, if not millions, of calculations in a single function call. Other algorithms, such as the segment-fitting algorithm, check millions of permutations. I am generalising here, but many of the algorithms’ iterations are independent of one another. In other words, you don’t always need the answer from the current iteration before you can start the next one. This makes a lot of the analyses ripe for parallelisation. The hillshade algorithm is probably a classic simple example.

    for (int i = 1; i < NRows-1; ++i){
        for (int j = 1; j < NCols-1; ++j){
            float slope_rad = 0;
            float aspect_rad = 0;
            float dzdx = 0;
            float dzdy = 0;

            if (RasterData[i][j] != NoDataValue){
                dzdx = ((RasterData[i][j+1] + 2*RasterData[i+1][j] + RasterData[i+1][j+1]) -
                       (RasterData[i-1][j-1] + 2*RasterData[i-1][j] + RasterData[i-1][j+1]))
                        / (8 * DataResolution);
                dzdy = ((RasterData[i-1][j+1] + 2*RasterData[i][j+1] + RasterData[i+1][j+1]) -
                       (RasterData[i-1][j-1] + 2*RasterData[i][j-1] + RasterData[i+1][j-1]))
                       / (8 * DataResolution);

                slope_rad = atan(z_factor * sqrt((dzdx*dzdx) + (dzdy*dzdy)));

                if (dzdx != 0){
                    aspect_rad = atan2(dzdy, (dzdx*-1));
                    if (aspect_rad < 0) aspect_rad = 2*M_PI + aspect_rad;
                }
                else{
                    if (dzdy > 0) aspect_rad = M_PI/2;
                    else if (dzdy < 0) aspect_rad = 2 * M_PI - M_PI/2;
                    else aspect_rad = aspect_rad;
                }
                hillshade[i][j] = 255.0 * ((cos(zenith_rad) * cos(slope_rad)) +
                                  (sin(zenith_rad) * sin(slope_rad) *
                                  cos(azimuth_rad - aspect_rad)));

                if (hillshade[i][j] < 0) hillshade[i][j] = 0;
            }
        }

Basically the algorithm looks at the eight cells surrounding each cell and performs some fairly trivial calculations, but none of the calculations depend on the answers from other iterations. (Again, I haven’t tested this yet - see the scan_area() example earlier for a tested and working case.) You could probably parallelise this by placing a single statement before the outer for loop:

#pragma omp parallel for
for (int i = 1; i < NRows-1; ++i){
    for (int j = 1; j < NCols-1; ++j){
    // Do stuff...

Finally, most of the computers we use have some multi-core capability - even your laptop probably has 2 or 4 cores. All this computing power is sitting there waiting to be put to good use! Depending on how parallelisable the algorithm is, you might expect something like a 3-3.5x speed-up on a 4-core machine compared to running the algorithm in serial. (You rarely get an n-times speed-up, where n is the number of cores, because of overheads, communication between cores and memory, and so on.)
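If you want to measure what speed-up you actually get on your own machine, omp_get_wtime() gives you a simple wall-clock timer. Here is a stand-alone toy example (not LSDTT code - the loop is just a placeholder for real work): run it with OMP_NUM_THREADS=1 and then with the number of cores you have, and compare the times.

#include <omp.h>
#include <cstdio>
#include <cmath>
#include <vector>

int main()
{
  const int n = 4000;
  std::vector<double> data(n * n, 1.0);

  double t0 = omp_get_wtime();

  // The iterations are independent, so this is safe to parallelise.
  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
  {
    for (int j = 0; j < n; ++j)
    {
      data[i * n + j] = std::sqrt(data[i * n + j] + i + j);
    }
  }

  double t1 = omp_get_wtime();
  printf("Elapsed wall-clock time: %.3f s\n", t1 - t0);
  return 0;
}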


Workflow for documentation pages

After quite a bit of trial and error, I have settled on a documentation workflow I’m happy with.

Setting up the repositories

To start, you need to set up a documentation structure that builds into a static HTML site. This could be done with a variety of tools, such as Asciidoctor, Doxygen, Sphinx, Jekyll, etc.

Then make a directory structure so there is a folder for both the master and gh-pages branches of the code.

I found having separate folders minimizes the chances of having merge conflicts.

So, for example, you might have a repository called My_docs containing two subfolders, master and gh-pages.

For example:

$ pwd
home/Git_projects/My_docs/
$ ls
master gh-pages

Keep the source files in the master branch

In the master branch, keep the source files for the static website. For example, I keep all my asciidoctor source files in the master branch.

Build the website, and keep the website files in the gh-pages branch.

I use Asciidoctor to build websites, but the built website is not tracked in the master folder. Instead, when I build the website, I copy the html and image files over to gh-pages and commit them ONLY to the gh-pages branch.

Making sure you don’t have conflicts

To avoid conflicts, it is helpful to only have the gh-pages branch in the gh-pages folder, and I also find it useful to only have files needed for building the website in the gh-pages branch.

For the master branch, you just work as normal.

For the gh-pages branch, you:
  1. Clone the repository into the gh-pages folder.
  2. Check out the gh-pages branch:
$ pwd
home/Git_projects/My_docs/gh-pages
$ git checkout origin/gh-pages -b gh-pages
  3. Now delete the master branch in the gh-pages folder:
$ git branch -d master
This step is mainly to ensure that you don’t accidentally commit something to master while you are in the gh-pages directory. It is not needed but is a useful precaution.
  4. Copy across the html and image files every time you build the website (a sketch of this step is given below).
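For step 4, the build-and-copy step might look something like this. This is only a sketch: it assumes an Asciidoctor source file called index.adoc and an images folder, which are placeholder names - substitute your own.

$ pwd
home/Git_projects/My_docs/master
$ asciidoctor index.adoc
$ cp index.html ../gh-pages/
$ cp -r images ../gh-pages/
$ cd ../gh-pages
$ git add index.html images
$ git commit -m "Rebuild website"
$ git push origin gh-pages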


Managing repository pages in Github (alternative approach)

In the previous post, I went through a method that involves keeping different directories for the two branches of your code. There is another method for keeping a single directory, which is detailed below.

WARNING: This method led me down a rabbit hole of many merge conflicts. Keeping two separate directories is MUCH better!!

Setting up a website branch

In Github, the repository website is always read from a branch called gh-pages. The first thing you need to do to set up a repository website is to create this branch.

You can go here: https://pages.github.com/ to see the instructions.

In this example, we are going to make a project site, and start from scratch.

Making a new branch in the repository is the easy bit!

Setting up your repositories

We are going to follow this approach: http://lea.verou.me/2011/10/easily-keep-gh-pages-in-sync-with-master/

  1. Make changes to the master branch and push them.
  2. Check out the gh-pages branch:
$ pwd
home/Git_projects/LSDTT_book/
$ git checkout origin/gh-pages -b gh-pages
  3. Use the rebase function:
$ git branch
* gh-pages
master
$ git rebase master
  4. Push the changes in master to gh-pages (so it keeps up to date).

WARNING: This can lead to major merging problems

I don’t really like this approach since it can lead to horrible merge conflicts if you don’t remember to keep rebasing every time. Also, when working on Windows you sometimes get a merge conflict with Thumbs.db, which is very difficult to fix. I would avoid this approach.


Managing repository pages in Github

Github enables you to have a website for each of your repositories. Managing these websites takes a little bit of getting used to, because you will have to be familiar with branches in git. This post explains one way to manage a master branch for the actual software and a gh-pages branch for the website.

Setting up a website branch

In Github, the repository website is always read from a branch called gh-pages. The first thing you need to do to set up a repository website is to create this branch.

You can go here: https://pages.github.com/ to see the instructions.

In this example, we are going to make a project site, and start from scratch.

Making a new branch in the repository is the easy bit!

Setting up your repositories

Once you have created a branch, you need to manage both the master and the gh-pages branches. You might not want them to sit in the same place.

We are going to follow the instructions here: https://gist.github.com/chrisjacob/833223

  1. First, make sure your master branch is up to date on Github and then delete it locally. You are going to clone it into a master subdirectory. I am going to do this with the LSDTT_book repository.
  2. On Github, create a gh-pages branch.
  3. In the LSDTT_book directory, make two directories: master and gh-pages.
  4. Now clone the main repository into the master directory, from within the LSDTT_book directory:
$ pwd
home/Git_projects/LSDTT_book/
$ git clone https://github.com/LSDtopotools/LSDTT_book.git master
  5. Now clone the repo AGAIN, but this time into the gh-pages directory:
$ pwd
home/Git_projects/LSDTT_book/
$ git clone https://github.com/LSDtopotools/LSDTT_book.git gh-pages
  6. Check out the gh-pages branch:
$ cd gh-pages
$ pwd
home/Git_projects/LSDTT_book/gh-pages
$ git checkout origin/gh-pages -b gh-pages
  7. Now delete the master branch here:
$ git branch -d master
This step is mainly to ensure that you don't accidentally commit something to master while you are in the gh-pages directory. It is not needed but is a useful precaution.
  8. Now you can push changes to your gh-pages branch from this repo without having to check it out each time:
$ git push -u origin gh-pages



Why Git is not like SVN

Git is different to SVN, and I think that’s where a lot of the confusion stems from. (Certainly where my confusion comes from…) SVN is designed around having a central repository that people check out, make changes to, and then commit back up to the centrally hosted repository. Locally checked-out repos are just that – local copies of the repo. Git calls itself a distributed version control system, and so when you clone a repository to your local machine, you are getting a fully-fledged copy of that repository with its own detailed versioning history, branches, merges, etc. It seems like we want to use git like SVN at the moment (and git lets you do this), and there is a good tutorial on how to do this here:

Git tutorial for ‘centralised workflow’

Summary of the ‘SVN-style’ workflow

Here’s a brief summary of the above tutorial.

Since we all have merge rights and work from a single repository, there’s no need to fork your own branches of the code (although you could do, and it’s more “Git-esque” to do so apparently). After you’ve cloned a copy of the repo to your computer, you open your favourite editor (vim) and hack away all morning, powered by coffee and Tunnock’s teacakes. Then you do this:

git commit LSDWhiskyFlowDynamics.cpp -m "Added some changes to blended malt routines"

But this is git, not svn, and this will only commit your modifications to your local repository. If you want to push it to the central repository (i.e. the one on github) you have to tell git to push:

git push origin master

And if nobody has made any changes since, your changes will be added to the github repo. If someone has made changes since, you will get an error message from Git telling you something about tips of branches being behind in your local repo (due to an early frost, presumably) and you need to do a git pull. But don’t just type git pull, instead go for:

git pull --rebase origin master

This tells git to replay your local commits on top of the master branch in the remote repo, so your changes sit at the tip of the history. A plain git pull still works, but it merges the remote changes into your local branch with a superfluous merge commit (which is not necessarily bad, it just looks a bit confusing when browsing the repo).

If you want to see what the differences are between your ‘local’ and the github repository of the project, you can do:

git fetch
git diff <local branch> <remote-tracking branch>

git fetch just grabs the changes from the remote repo – it doesn’t actually merge them into your local branch, but it is needed before you can run the diff command.

For us this is probably:

git diff origin origin/master

Other gotchas

  1. There is no such thing as git update. (As much as I wanted to believe there was.) Do a git pull --rebase instead.
  2. git add does not commit the file just added to anything. You have to follow it with a git commit to get it under version control in your local repository. (And remember git commit only commits it locally…)
  3. git status tells you the status of your local repository. Actually this isn’t a gotcha, and is one of the few things to work like svn does. Hooray!