Skip to content

Latest commit

 

History

History
13 lines (9 loc) · 1003 Bytes

README.md

File metadata and controls

13 lines (9 loc) · 1003 Bytes

PageRank Algorithm (using Spark)

This is a python script for a page rank algorithm using Spark.

  • The input is taken from a command line argument. The format of running it is =>

    python pagerank.py [input-file] [no. of iterations]
    
  • Here is where the spark implementation begins. It collects the pairs of URLS and groups them by the same key.

  • The pairs are initialized to a weight of '1'.

  • Now with the no. of iterations read from the CL args, we divide each value to the key it goes and add that up.

  • There is small implementation to handle spider traps called "damping". in quick words, Spider Traps are dead-ends or infinite loops where the iterations get caught in a repetitive chain or cant iterate to another point or key in the algorithm. For this we've assumed the constant d. Its normally d=0.85 but I've assumed 0.80. The line of code for this is =>

    ranks = urls_joined.reduceByKey(add).mapValues(lambda rank: rank * 0.80 + 0.20)