In this post, we are going to analyze a simplified parallel version of the famous algorithm PageRank, the algorithm used by Google Search to rank web pages in their search engine results.
The code of the PageRank algorithm has been written in C++ exploiting the library OpenMP to parallelize the code. Finally, it has been tested over a different number of threads as well as different scheduling policies in order to compare its performances and analyze them.
Important: The algorithm’s implementation details have not been reported here but can be found in the following repository.
The table below gives an overview of the configuration of the environment used to run the experiment.
|Operative system||Ubuntu Server 16.04.1 LTS 64-bit|
|Memory||Intel(R) Xeon(R) CPU E5-2680 32 cores|
|C/C++ compiler||128 GB|
The experiments have been run on a huge graph containing 3072441 nodes (pages) and 117185083 edges (links). Each node had an average of 38,14 ingoing or outgoing edges. In addition, the node with more outgoing connections had a total of 33007 outgoing edges while the node receiving more input connections had not more than 3415 incoming edges.
Furthermore, it has been realized that 11% of the nodes didn’t have any outgoing edge and only 1 of the 3072441 nodes had zero incoming edges. Thus, the graph was not symmetric and its imbalances could result in different workloads while computing the rank of different nodes.
Important: The graph used for this experiment is not publicly available but a toy graph has been created to test the correctness of the algorithm.
The schedule clause affects how to loop iterations are mapped onto threads. For this experiment have been taken into examination 3 of the scheduling policies offered by OpenMP.
For N iterations and T threads, all threads equally get one chunk of N/T of loop iterations. The static scheduling type is appropriate when all iterations have the same computational cost.
For N iterations and T threads, each thread gets a fixed-size chunk of k loop iterations and when a particular thread finishes its chunk of iterations, it simply gets assigned a new chunk. So, the relationship between iterations and threads is non-deterministic and would only be decided at run-time.
However, despite its good flexibility, this schedule presents a high overhead due to the lots of decisions regarding which thread gets each chunk. Thus, the dynamic scheduling type is appropriate when the iterations require different computational costs, that is, the iterations are poorly balanced between each other.
For N iterations and T threads, initially, the first thread gets a fixed-size chunk of k=N/T loop iterations. Afterward, the second thread gets a chunk of size proportional to the number of iterations that have not been assigned divided by the number of threads. Therefore, the size of the chunks decreases until a minimum threshold is decided by the user. However, the chunk which contains the last iterations may also have a smaller size.
Its advantage over the static policy is that it can better handle the imbalanced load and it also takes fewer decisions than the pure dynamic version, thus causing less overhead. The guided scheduling type is appropriate when the iterations are poorly balanced between each other. The initial chunks are larger because they reduce overhead.
The smaller chunks fill the schedule towards the end of the computation and improve load balancing. This scheduling type is especially appropriate when poor load balancing occurs toward the end of the computation.
Running time comparison
Finally, the program has been tested on different numbers of threads and under different scheduling policies and the respective performances have been reported in the tables below.
In particular, in the first experiment has been used a chunk size of 1, which is the default set by OpenMP, while in the second experiment has been used a chunk size of 32 Let’s remind that the chunk size is the minimum number of iterations assigned to each tread.
Important: The last three columns of the first row of the table have to be interpreted as scheduling policy (chunk size). Furthermore, the running times have been expressed in seconds.
|threads||static (1)||dynamic (1)||guided (1)|
|threads||STATIC (32)||dynamic (32)||guided (32)|
To make a comparison with the serial version, the program took an average of 9,5 seconds when running on a single thread. To understand the different running times taken by the program to terminate its task, let’s first briefly review how each of the examined policies works.
Overall, the program runs faster when using a bigger chunk size. This is due to the overhead generated by the scheduling policies while assigning chunks to threads. As expected, the improvement is much more remarked in the dynamic and guided policy where the chunk assignment process is the main bottleneck.
The speedup obtained by the dynamic scheduling policies suggests that the analyzed graph was very unbalanced, thus resulting in some iterations taking much more time than others.
We can also note how the program’s running time for different numbers of threads followed Amdahl’s law. In fact, it benefits from bigger relative speedups when the number of multiple processes remains low and its increase in speedup cools down as more threads are used at the same time.
The program has also been tested running with 32 threads but the results have not been reported since the running time was similar or even worse with respect to the 16 threads version.
- GitHub – PageRank-Parallel: Analysis of parallel version of PageRank algorithm implementation in C++ with OpenMP library