Below I will discuss the necessary components to turn these tasks into a working task based MCRT algorithm.
The first important component in the task based MCRT algorithm is the grid, which is split up into many smaller parts. I will refer to these parts as subgrids. Each subgrid has a number of neighbouring subgrids with which it shares a face, edge or corner, and to process photon packets efficiently, this neighbour information needs to be stored. Each subgrid also requires a lock, i.e. some way of preventing multiple threads from accessing the subgrid at once.
Subgrids should be as self-contained as possible, i.e. they should store all the geometrical information required to compute positions and volumes within the subgrid. They can however store this information in a compressed way: for a regular grid, e.g., there is no need to store the volumes and coordinates of all cells; you only need to know the dimensions of the subgrid box and the number of cells within the subgrid.
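As a concrete illustration, a self-contained subgrid could look roughly like the following C++ sketch. All names are illustrative and do not necessarily match the actual CMacIonize classes:

```cpp
#include <array>
#include <atomic>
#include <vector>

struct Subgrid {
  std::array<double, 3> box_anchor;  // corner of the subgrid box
  std::array<double, 3> box_sides;   // side lengths of the box
  std::array<int, 3> ncell;          // number of cells per dimension
  std::array<int, 26> neighbours;    // neighbour indices (-1: no neighbour)
  std::vector<double> optical_depth; // per cell optical properties
  std::atomic_flag lock = ATOMIC_FLAG_INIT; // one thread at a time

  bool try_lock() { return !lock.test_and_set(std::memory_order_acquire); }
  void unlock() { lock.clear(std::memory_order_release); }

  // The compressed geometry suffices to compute cell volumes on the fly.
  double cell_volume() const {
    return (box_sides[0] / ncell[0]) * (box_sides[1] / ncell[1]) *
           (box_sides[2] / ncell[2]);
  }
};
```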
Due to the strongly non-local character of Monte Carlo radiation transfer, some parts of the grid (i.e. subgrids) are much more likely to be visited by photon packets than others, especially if the number of sources is low. This means that the locking mechanism that protects individual subgrids can potentially cause serious performance bottlenecks. To address this issue, it is possible to make subgrid copies: a subgrid that is visited a lot more than average is duplicated, and the neighbour relations for the neighbouring subgrids are adjusted so that half of the neighbours link to the original subgrid and the other half to the copy. From the point of view of the individual subgrids, the algorithm can still proceed in the same way. At the end of the photon propagation step for all photon packets we can simply synchronise the optical properties between the copy and the original. This procedure can be repeated to generate more copies; I opted to do this using a power of two copy hierarchy, as shown below. In this hierarchy, each subgrid gets assigned a copy level ℓ, and the number of copies for that subgrid is then given by 2^ℓ. Since all subgrids are small in memory, the duplication and synchronisation steps are very cheap and cause only a small overhead.
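To make the copy hierarchy concrete, here is a small sketch reusing the hypothetical Subgrid from above. Treating the synchronisation as a simple sum and reset of accumulated counters is my assumption, not necessarily what CMacIonize does:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of copies for a subgrid with copy level l in the power of two
// hierarchy.
inline uint32_t number_of_copies(uint32_t copy_level) {
  return 1u << copy_level;
}

// Synchronise the optical properties gathered by the copies back into
// the original subgrid (assumed here: sum the counters accumulated by
// each copy, then reset the copies for the next step).
void synchronise_copies(Subgrid &original,
                        const std::vector<Subgrid *> &copies) {
  for (Subgrid *copy : copies) {
    for (std::size_t i = 0; i < original.optical_depth.size(); ++i) {
      original.optical_depth[i] += copy->optical_depth[i];
      copy->optical_depth[i] = 0.;
    }
  }
}
```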
To determine optimal values for the number of subgrid copies, there are a number of options. One option is to keep track of the number of visits to each subgrid during a previous step of the MCRT algorithm and to use that value to refine the copy hierarchy dynamically. Although this is arguably the most correct method, it is quite tricky to implement and has a significant overhead. Furthermore, the load per subgrid tends to change between steps, so this measurement is not necessarily representative. I therefore opted for simpler methods.
The less simple of these methods consists of running a very low resolution MCRT simulation with few photon packets before the start of the actual simulation. Each subgrid in the actual simulation makes up a single cell in this low resolution simulation, and the optical properties of this cell are used to estimate a good load for the corresponding subgrid. Despite its simplicity, this method performs surprisingly well, and is also the best method I have so far to perform the distributed memory load balancing for the task based MCRT algorithm. However, the additional bootstrap simulation clearly comes with some overhead, not least in the amount of additional code that needs to be written and checked. And I have not tested this method in enough different scenarios to ensure that it always leads to a good load balance.
This is why I opted for an even simpler method for CMacIonize 2.0: a copy hierarchy based on the source positions. The main idea behind this is that subgrids with a high load are almost exclusively subgrids that either contain a source, or have a neighbour that does. We can flag the subgrids that contain a source and assign an arbitrary, relatively high copy level ℓ to these, corresponding to 2^ℓ copies. Next, we can ensure that subgrids that have a neighbour with a high copy level get assigned a high copy level as well, by e.g. limiting the copy level difference between neighbouring subgrids to 1, as sketched below. All of this can be done once, at the start of the simulation, and imposes very little overhead. Nevertheless, I found that it works very well.
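In code, this source based assignment could look roughly as follows, again reusing the hypothetical Subgrid from above. The relaxation loop is one straightforward way to enforce the level difference limit:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assign copy levels based on source positions: subgrids that contain
// a source get a fixed level, after which levels are relaxed so that
// neighbouring subgrids differ by at most one level. The number of
// copies for subgrid i is then 2^level[i].
std::vector<uint32_t>
assign_copy_levels(const std::vector<Subgrid> &subgrids,
                   const std::vector<bool> &contains_source,
                   uint32_t source_level) {
  std::vector<uint32_t> level(subgrids.size(), 0);
  for (std::size_t i = 0; i < subgrids.size(); ++i) {
    if (contains_source[i]) {
      level[i] = source_level;
    }
  }
  // Relax until no neighbour pair differs by more than one level.
  bool changed = true;
  while (changed) {
    changed = false;
    for (std::size_t i = 0; i < subgrids.size(); ++i) {
      for (int ngb : subgrids[i].neighbours) {
        if (ngb >= 0 && level[ngb] + 1 < level[i]) {
          level[ngb] = level[i] - 1;
          changed = true;
        }
      }
    }
  }
  return level;
}
```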
A last subtlety about the use of subgrid copies is the treatment of source photons. Subgrids with a high copy level are almost exclusively subgrids that contain a source, and these subgrids get most of the photon packet load directly from photon packets originating at that source. To make sure the copies get a fair share of this load, we need to keep track of the number of copies of each subgrid that contains a source and distribute the photon packets from that source evenly among these copies. This requires some bookkeeping, but can be done without too much overhead.
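A minimal sketch of this bookkeeping (my own illustrative version, not the actual CMacIonize code) is a simple round robin over the instances of the source's subgrid:

```cpp
#include <cstddef>
#include <vector>

// Given the indices of all instances of a source's subgrid (the
// original plus its copies), pick the instance that receives emission
// batch number `batch` in round robin fashion, so that all instances
// get an even share of the source's photon packets.
std::size_t target_for_batch(const std::vector<std::size_t> &instances,
                             std::size_t batch) {
  return instances[batch % instances.size()];
}
```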
The task based MCRT algorithm introduced here has one big disadvantage compared to traditional task based algorithms like the one in SWIFT: the generation of additional tasks. Since we have no idea what subgrids will be visited by any given photon packet, we cannot a priori predict how many propagation tasks are required for each subgrid or what the task dependencies for these will be. The only way to get around this is by creating the propagation tasks when necessary, i.e. a task that is already running can spawn new tasks.
In itself this is not a problem for the execution of the tasks, as the task based locking mechanisms prevent unspawned tasks from being executed. It does however require us to implement an additional stop condition for the photon propagation step, as an empty queue for all threads does not necessarily mean the step is done. And it creates a significant issue with memory management for the resources that are used by the tasks: the tasks themselves and the photon packet buffers.
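One possible way to implement such an additional stop condition, sketched here purely as an assumption and not necessarily the mechanism used in CMacIonize 2.0, is a global atomic counter of photon packets in flight:

```cpp
#include <atomic>

// A global count of photon packets that are still alive. Source tasks
// increment it upon emission; it is decremented whenever a packet is
// absorbed or leaves the simulation box.
std::atomic<long> packets_in_flight(0);

// Empty queues alone are not sufficient to end the step, because a
// running task may still spawn new tasks; the count of live packets
// must also have dropped to zero.
bool propagation_step_done(bool all_queues_empty) {
  return all_queues_empty && packets_in_flight.load() == 0;
}
```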
The tasks themselves are very small, but still require some variables to be stored: the type of the task, the index of the subgrid it operates on, the indices of the photon buffers involved (if any) and some locking mechanism to make sure only one thread can execute it. These variables require memory that needs to be allocated and managed. The same is true for the photon buffers, which are considerably larger in memory. I very quickly discovered that allocating the task and photon buffer memory when the task or photon buffer is created and deallocating it when it is no longer used is very slow. Furthermore, it causes huge bottlenecks: memory allocations are almost always made thread safe by locks inside the allocator, so that concurrent allocations effectively serialise in a shared memory context. To avoid this, it is possible to preallocate a fixed amount of task and photon buffer memory at the start and reuse this during the simulation.
Managing preallocated memory requires some bookkeeping. First of all, it means that instead of handling tasks and photon buffers directly, we will be handling indices to these objects within the preallocated memory, and we need to pass around the relevant memory block whenever we want to access task or photon buffer properties. Secondly, it means that we need to keep track of which indices are already in use and which ones are free. To make everything thread safe, we have to do these checks using a thread safe mechanism, e.g. a lock for every element in the preallocated memory.
All of these requirements can be met by constructing a thread safe memory array as follows. First, a large block of memory is allocated with a fixed size that is chosen in advance (the current implementation does not allow changing the size at run time). Then, a shared memory lock is created for each element in the newly allocated memory block. Finally, an atomic index variable is created and set to point to the first element in the memory block. Atomic means that this variable can only be changed using low level CPU instructions that are guaranteed to be thread safe: only one thread can change the value of the atomic index at any given time, so each thread that advances it is guaranteed to obtain a unique value.
Then, we define three functions to deal with the elements of the memory block: one to obtain the index of a free element, one to access an element through its index, and one to release an element once it is no longer needed. To make sure that freed up elements are reused, we wrap the index value modulo the memory block size. The entire procedure is illustrated in the figure below.
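In code, such a thread safe array could look roughly as follows. This is a minimal sketch; the names get_free_element and free_element are illustrative and not necessarily the ones used in CMacIonize:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template <typename T> class ThreadSafeArray {
private:
  std::vector<T> _elements;               // preallocated, fixed size
  std::vector<std::atomic<bool>> _in_use; // per element lock
  std::atomic<std::size_t> _next;         // next candidate slot

public:
  explicit ThreadSafeArray(std::size_t size)
      : _elements(size), _in_use(size), _next(0) {
    for (auto &flag : _in_use) {
      flag.store(false);
    }
  }

  // (1) Obtain the index of a free element.
  std::size_t get_free_element() {
    while (true) {
      // Atomically advance the index, wrapping modulo the block size
      // so that freed up slots are eventually revisited and reused.
      const std::size_t index = _next.fetch_add(1) % _elements.size();
      bool expected = false;
      // Claim the slot if it is currently free.
      if (_in_use[index].compare_exchange_strong(expected, true)) {
        return index;
      }
      // Slot taken: simply try the next candidate.
    }
  }

  // (2) Access an element through its index.
  T &operator[](std::size_t index) { return _elements[index]; }

  // (3) Release an element so its slot can be reused.
  void free_element(std::size_t index) { _in_use[index].store(false); }
};
```

Note that get_free_element in this sketch spins for as long as no slot is free, so the preallocated size needs to be chosen generously.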
A last important aspect of the task based MCRT algorithm is dealing with partially filled photon packet buffers. The size of a photon packet buffer is clearly quite arbitrary (I found that a value of 200 works well, but I do not have a good explanation for this particular value), and there is absolutely no guarantee that a buffer will ever end up completely full, especially for subgrids that are far away from sources and near the end of the photon packet propagation step. If photon buffers were only propagated when full, a lot of photon packets would end up not being propagated, which would invalidate the Monte Carlo technique; or the algorithm would deadlock because it would never be able to process the partially filled buffers.
The solution is to prematurely schedule partially filled photon buffers in a smart way. In the current implementation, each thread first tries to obtain free tasks from its own task queue and execute those. If this fails, the thread will try to steal a task from another queue. If this fails too, it will try to get a task from a shared queue, which contains all photon source and scattering tasks. Only if this fails and the thread would otherwise end up being idle will it attempt to prematurely launch a partially filled photon buffer as follows. Each subgrid internally keeps track of the associated photon buffer with the largest number of photon packets. The idle thread starts with a reasonably high target number of photon packets, and will query all subgrids for their largest photon buffer size. Whenever it finds a subgrid whose largest photon buffer exceeds the target number, it will try to lock the subgrid and schedule the corresponding task for that photon buffer. If it fails to find a matching subgrid, it will reduce the target number and repeat the procedure, until a photon buffer has been scheduled.
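A rough sketch of this procedure follows, with an assumed, heavily simplified subgrid interface and a hypothetical scheduler hook; halving the target is my assumption for how the target number is reduced:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Assumed, heavily simplified view of a subgrid (illustrative only).
struct FlushCandidate {
  std::atomic<std::size_t> largest_buffer_fill{0}; // fullest buffer
  std::atomic<bool> locked{false};
  bool try_lock() {
    bool expected = false;
    return locked.compare_exchange_strong(expected, true);
  }
  void unlock() { locked.store(false); }
};

// Hypothetical hook into the task scheduler; declared only to keep
// the sketch focused.
void schedule_propagation_task(std::size_t subgrid_index);

bool try_flush_partial_buffer(std::vector<FlushCandidate> &subgrids,
                              std::size_t initial_target) {
  // Reduce the target fill level (here: halve it) until some
  // partially filled buffer qualifies.
  for (std::size_t target = initial_target; target > 0; target /= 2) {
    for (std::size_t i = 0; i < subgrids.size(); ++i) {
      if (subgrids[i].largest_buffer_fill.load() >= target &&
          subgrids[i].try_lock()) {
        schedule_propagation_task(i);
        // Release the lock as soon as possible: at this stage other
        // threads are starving and will pick the task up immediately.
        subgrids[i].unlock();
        return true;
      }
    }
  }
  return false; // no partially filled buffer left to flush
}
```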
Note that this premature scheduling routine is very sensitive to deadlocks and thread conflicts, as it is mainly active near the end of the photon propagation step, when most threads are idle and are desperately looking for work, like a hungry pack of hyenas looking for food. This means that proper locks should be used whenever required, but also that locks should be released as soon as possible, so that threads suffer as little as possible from concurrency bottlenecks. In other words, it is safe to assume that any work that shows up at this stage will be executed almost immediately, and likely by another thread than the one that schedules it.
The above gives a pretty good overview of the basic building blocks of the new task based MCRT algorithm I have developed, although a few details were left out. There are of course many more subtleties that I either forgot about, or that are just too technical for this post. It is not unlikely that I will discuss some of these in the future.