In this post we are going to analyze a simplified parallel version of the famous algorithm PageRank, the algorithm used by Google Search to rank web pages in their search engine results. The code of the algorithm has been written in C++ exploiting the library OpenMP to parallelize the code. Finally it has been tested over different number of threads as well as different scheduling policies in order to compare its performances and analyze them.
Important: The algorithm’s implementation details have not been reported here but can be found on the following repository.
The table below gives an overview of the configuration of the environment used to run the experiment.
|Operative system||Ubuntu Server 16.04.1 LTS 64-bit|
|Memory||Intel(R) Xeon(R) CPU E5-2680 32 cores|
|C/C++ compiler||128 GB|
The experiments have been run on a huge graph containing 3072441 nodes (pages) and 117185083 edges (links). Each node had an average of 38,14 ingoing or outgoing edges. In addition, the node with more outgoing connections had a total of 33007 outgoing edges while the node receiving more input connections had not more than 3415 incoming edges. Furthermore, it has been realized that 11% of the nodes didn’t have any outgoing edge and only 1 of the 3072441 nodes had zero incoming edges. Thus, the graph was not symmetric and its imbalances could result in different workloads while computing the rank of different nodes.
Important: The graph used for this experiment in not publicly available but a toy a graph has been created to test the correctness of the algorithm.
The schedule clause affects how loop iterations are mapped onto threads. In this experiment has been taken into examination 3 of the scheduling policies offered by OpenMP.
Static Policy: For N iterations and T threads, all threads equally get one chunk of N/T of loop iterations. The static scheduling type is appropriate when all iterations have the same computational cost.
Dynamic Policy: For N iterations and T threads, each thread gets a fixed-size chunk of k loop iterations and when a particular thread finishes its chunk of iterations, it simply gets assigned a new chunk. So, the relationship between iterations and threads is non deterministic and would only be decide at run-time. However, despite its good flexibility, this schedule presents a high overhead due the lots of decision regarding which thread gets each chunk. Thus, the dynamic scheduling type is appropriate when the iterations require different computational costs, that is, the iterations are poorly balanced between each other.
Guided Policy: For N iterations and T threads, initially the first thread gets a fixed-size chunk of k=N/T loop iterations. Afterward, the second thread gets a chunk of size proportional to the number of iterations that have not been assigned divided by the number of the threads. Therefore, the size of the chunks decreases until a minimum threshold decided by the user. However, the chunk which contains the last iterations may also have smaller size. Its advantage over the static policy is that it can better handle imbalanced load and it also takes fewer decisions than the pure dynamic version, thus causing less overhead. The guided scheduling type is appropriate when the iterations are poorly balanced between each other. The initial chunks are larger, because they reduce overhead. The smaller chunks fill the schedule towards the end of the computation and improve load balancing. This scheduling type is especially appropriate when poor load balancing occurs toward the end of the computation.
Running time comparison
Finally, the program has been tested on different numbers of threads and under different scheduling policies and the respective performances have been reported in the tables below. In particular, in the first experiment has been used a chunk size of 1, which is the default set by OpenMP, while in the second experiment has been used a chunk size of 32. Let’s remind that the chunk size is the minimum number of iterations assigned to each tread.
Important: The last three columns of the first row of the table have to be interpreted as: Scheduling policy (chunk size). Furthermore, the running times have been expressed in seconds.
|threads||static (1)||dynamic (1)||guided (1)|
|threads||STATIC (32)||dynamic (32)||guided (32)|
To make a comparison with the serial version, the program took in average 9,5 seconds when running on a single thread. To understand the different running times taken by the program to terminate its task, let’s first briefly review how each of the examined policies works.
Overall, the program run faster when using a bigger chunk size. This is due to the overhead generated by the scheduling policies while assigning chunks to threads. As expected, the improvement is much more remarked in the dynamic and guided policy where the chunk assignment process is the main bottleneck. The speedup obtained by the dynamic scheduling policies suggest us that the analyzed graph was very unbalanced, thus resulting in some iterations taking much more time than others. We can also note how the program’s running time for different numbers of threads followed the Amdahl’s law. In fact, it benefits of bigger relative speedups when the number of multiple processes remains low and its increase in speedup cools down as more threads are used at the same time. The program has also been tested running with 32 threads but the results have not been reported since the running time was similar or even worse respect to the 16 threads version.