Analysis of a parallel version of the PageRank algorithm

In this post, we are going to analyze a simplified parallel version of PageRank, the famous algorithm used by Google Search to rank web pages in its search engine results.

The PageRank algorithm has been implemented in C++ and parallelized with the OpenMP library. It has then been tested with different numbers of threads and different scheduling policies in order to compare and analyze its performance.

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.
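
To make the idea concrete, below is a minimal sketch of a single PageRank power-iteration step. It is not the implementation from the repository mentioned below: the data layout (per-node lists of incoming links) and all names are illustrative assumptions, and the handling of dangling nodes (pages without outgoing links) is omitted for brevity. Compile with -fopenmp.

    #include <vector>

    // One simplified PageRank power-iteration step with damping factor d = 0.85.
    // in_edges[v] lists the pages linking to v; out_degree[u] is the number of
    // links leaving u. Layout and names are illustrative, not the author's.
    std::vector<double> pagerank_step(const std::vector<std::vector<int>>& in_edges,
                                      const std::vector<int>& out_degree,
                                      const std::vector<double>& rank,
                                      double d = 0.85)
    {
        const int n = static_cast<int>(rank.size());
        std::vector<double> next(n);
        // Each node v touches a different number of incoming links, which is
        // exactly why the choice of schedule clause matters (see below).
        #pragma omp parallel for
        for (int v = 0; v < n; ++v) {
            double sum = 0.0;
            for (int u : in_edges[v])            // pages pointing to v
                sum += rank[u] / out_degree[u];  // u splits its rank among its links
            next[v] = (1.0 - d) / n + d * sum;
        }
        return next;
    }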

Important: The algorithm’s implementation details have not been reported here but can be found in the following repository.

Environment

The table below gives an overview of the configuration of the environment used to run the experiment.

Operating system    Ubuntu Server 16.04.1 LTS 64-bit
CPU                 Intel(R) Xeon(R) CPU E5-2680, 32 cores
Memory              128 GB
C/C++ compiler      gcc 7.4

Graph description

The experiments have been run on a huge graph containing 3072441 nodes (pages) and 117185083 edges (links). Each node had an average of 38.14 incoming or outgoing edges. In addition, the node with the most outgoing connections had a total of 33007 outgoing edges, while the node receiving the most incoming connections had no more than 3415 incoming edges.

Furthermore, it turned out that 11% of the nodes had no outgoing edges, while only 1 of the 3072441 nodes had zero incoming edges. Thus, the graph was far from symmetric, and its imbalances could result in very different workloads when computing the ranks of different nodes.

Important: The graph used for this experiment is not publicly available, but a toy graph has been created to test the correctness of the algorithm.

Scheduling policies

The schedule clause affects how loop iterations are mapped onto threads. Three of the scheduling policies offered by OpenMP have been examined in this experiment.

Static Policy

For N iterations and T threads, each thread gets one contiguous chunk of roughly N/T loop iterations. The static scheduling type is appropriate when all iterations have the same computational cost. A toy example is shown below.
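
As an illustration (not the benchmark code from the repository), the policy is requested with a single clause on the parallel loop; compile with -fopenmp:

    #include <omp.h>
    #include <cstdio>

    int main() {
        const int N = 16;
        // schedule(static): the iterations are split into equal contiguous
        // chunks of about N/T iterations, assigned before the loop runs.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; ++i)
            std::printf("iteration %2d ran on thread %d\n", i, omp_get_thread_num());
        return 0;
    }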

Dynamic Policy

For N iterations and T threads, each thread gets a fixed-size chunk of k loop iterations, and when a thread finishes its chunk, it simply gets assigned a new one. The mapping between iterations and threads is therefore non-deterministic and is decided only at run time.

However, despite its flexibility, this schedule incurs a high overhead because of the many run-time decisions about which thread gets each chunk. The dynamic scheduling type is therefore appropriate when the iterations have different computational costs, that is, when the load is poorly balanced across iterations.
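
In the code, only the schedule clause changes with respect to the static example above; for instance, with chunks of k = 4 iterations:

    // Each idle thread grabs the next chunk of 4 iterations at run time,
    // so unevenly expensive iterations are rebalanced automatically.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < N; ++i)
        std::printf("iteration %2d ran on thread %d\n", i, omp_get_thread_num());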

Guided Policy

For N iterations and T threads, the first thread initially gets a chunk of k = N/T loop iterations. The second thread then gets a chunk whose size is proportional to the number of not-yet-assigned iterations divided by the number of threads. The chunk size therefore keeps decreasing until it reaches a minimum threshold specified by the user, although the chunk containing the last iterations may be smaller still.

Its advantage over the static policy is that it handles imbalanced loads better, and it also makes fewer scheduling decisions than the pure dynamic version, thus causing less overhead. The guided scheduling type is appropriate when the load is poorly balanced across iterations. The initial chunks are larger because they reduce overhead.

The smaller chunks fill the schedule towards the end of the computation and improve load balancing. This scheduling type is especially appropriate when poor load balancing occurs toward the end of the computation.
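
Again, only the clause changes; the second argument is the minimum chunk size discussed above:

    // Chunks start near N/T iterations and shrink as the loop progresses,
    // never dropping below 4 (except possibly for the very last chunk).
    #pragma omp parallel for schedule(guided, 4)
    for (int i = 0; i < N; ++i)
        std::printf("iteration %2d ran on thread %d\n", i, omp_get_thread_num());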

Running time comparison

Finally, the program has been tested with different numbers of threads and under different scheduling policies, and the respective running times are reported in the tables below.

In particular, the first experiment used a chunk size of 1, which is the default set by OpenMP, while the second experiment used a chunk size of 32. Recall that the chunk size is the minimum number of iterations assigned to each thread.
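
A convenient way to run this kind of sweep without recompiling (whether the original experiment did so is an assumption) is OpenMP's schedule(runtime) clause, which reads both the policy and the chunk size from the OMP_SCHEDULE environment variable:

    // Policy and chunk size are chosen at launch time, e.g.:
    //   OMP_NUM_THREADS=16 OMP_SCHEDULE="dynamic,32" ./pagerank
    #pragma omp parallel for schedule(runtime)
    for (int v = 0; v < n; ++v)
        update_rank(v);  // hypothetical per-node work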

Important: The headers of the last three columns of each table should be read as scheduling policy (chunk size). The running times are expressed in seconds.

threads    static (1)    dynamic (1)    guided (1)
4          4.401911      4.351563       4.632874
8          2.879228      2.69747        2.846156
16         2.015405      1.91141        2.045555

Chunk size of 1
threads    static (32)    dynamic (32)    guided (32)
4          4.281          3.427674        3.151209
8          2.38342        1.731104        1.887575
16         1.848175       1.140203        1.322415

Chunk size of 32

To make a comparison with the serial version, the program took an average of 9.5 seconds when running on a single thread. To interpret the different running times, keep in mind how each of the examined scheduling policies works (see the section above).
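
For instance, the fastest configuration in the tables above (16 threads, dynamic policy, chunk size 32) finished in about 1.14 seconds, a speedup of roughly 9.5 / 1.14 ≈ 8.3x over the serial run.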

[Figure: PageRank algorithm running time with 16 threads and dynamic policy]

Results

Overall, the program runs faster when using a bigger chunk size. This is due to the overhead generated by the scheduling policies while assigning chunks to threads: larger chunks mean fewer assignments. As expected, the improvement is much more marked for the dynamic and guided policies, where the chunk assignment process is the main bottleneck.

The speedup obtained by the dynamic scheduling policy suggests that the analyzed graph was very unbalanced, with some iterations taking much more time than others.

We can also note that the program's running time for different numbers of threads followed Amdahl's law: the relative speedup is large while the number of threads remains low, and the gains taper off as more threads are used at the same time.
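
As a reminder, Amdahl's law bounds the speedup attainable with T threads by

    S(T) = 1 / ((1 - p) + p / T)

where p is the fraction of the program that can run in parallel: as T grows, S(T) approaches the ceiling 1 / (1 - p), which is consistent with the diminishing returns observed here.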

The program has also been tested with 32 threads, but the results have not been reported since the running time was similar to, or even worse than, the 16-thread version.
