This project implements a simple search engine using a simplified version of the PageRank algorithm. The objective is to build a graph structure, calculate PageRanks, and rank pages based on these values. The search engine operates on a collection of mock web pages provided as plain text files.
The project is divided into three main parts:
- pagerank.c
- invertedIndex.c
- searchPagerank.c

Supporting files: BST.c, BST.h, DLListStr.c, DLListStr.h, Graph.c, Graph.h, helpers.c, helpers.h, readData.c, readData.h, InvertedIdx.c, InvertedIdx.h
Each mock web page is divided into two sections:
- Section 1: Contains URLs representing outgoing links.
- Section 2: Contains the actual content of the web page.
Sections are delimited by specific markers, ensuring clear separation of content and links.
Implement pagerank.c to calculate the PageRank for URLs listed in collection.txt. The program will construct a graph from these URLs and compute their PageRank using the provided algorithm.
collection.txt contains URLs separated by whitespace. Each URL corresponds to a text file.
Example:
url11 url21 url22
url23
url31 url24 url34
- Read collection.txt and construct a graph (adjacency list or matrix).
- Calculate Weighted PageRank for each page using the algorithm provided.
The PageRank algorithm is used to calculate the importance of pages within a collection. The following steps are performed:
1. Read Pages: Read pages from the collection in the file collection.txt and build a graph structure using a chosen representation (adjacency list or matrix).
2. Initialize Variables:
   - N: the total number of URLs in the collection.
   - For each URL pi in the collection, initialize its PageRank: PR(pi, 0) = 1/N.
   - Set iteration to 0.
   - Set diff to the threshold diffPR.
3. Iterate: While iteration < maxIterations and diff >= diffPR:
   - Set t to the current iteration count.
   - For each URL pi in the collection, update its PageRank using the following formula:

     PR(pi, t+1) = (1 - d)/N + d * Σ (PR(pj, t) / L(pj)), summed over pj ∈ M(pi)

     where:
     - M(pi): the set of pages with outgoing links to pi (excluding self-loops and parallel edges).
     - L(pj): the out-degree of page pj.
   - Calculate Difference: after updating every page, set diff = Σ |PR(pi, t+1) - PR(pi, t)| for all i from 1 to N.
   - Increment Iteration Count: increase iteration by 1.
In this algorithm:

- M(pi) is the set of pages that have outgoing links to page pi.
- L(pj) is the number of outgoing links from page pj.
- t and t+1 refer to iteration numbers; for example, PR(pi, 2) refers to the PageRank of pi after iteration 2.
- d: the damping factor.
- diffPR: the threshold for PageRank differences.
- maxIterations: the maximum number of iterations.

Example:
./pagerank 0.85 0.00001 1000
- Format: URL, outdegree, PageRank
- PageRank values are formatted to 7 decimal places (%.7lf).
- Lines are sorted in descending order of PageRank.
- Example:
url31, 3, 0.2623546
url21, 1, 0.1843112
...
PageRank values may vary slightly but should match the expected values to the first 6 decimal places.
In this part, you will implement invertedIndex.c, which reads data from the collection.txt file and generates an inverted index. This index maps every word in the collection to a sorted list of URLs where the word appears.
- Binary Search Tree (BST): Represent the inverted index as a BST where each node contains a word.
- File List: Each BST node has a linked list of filenames (URLs) where the word is found.
- Ordering:
- The BST is alphabetically ordered by words.
- Each file list is alphabetically ordered by filename.
Before inserting words into the index, normalize them by:
- Converting all letters to lowercase.
- Removing the following punctuation marks from the end of words: . , : ; ? *
- If a word becomes empty after normalization, it is not inserted.
| Word | Normalised Word |
|---|---|
| Data | data |
| BSTs | bsts |
| algorithms. | algorithms |
| Why? | why |
| graphs*. | graphs |
| .NET | .net |
| unsw.edu.au. | unsw.edu.au |
| Sydney? | sydney |
| .,!.,:; | .,! |
| new.........s | new.........s |
| * | (empty word) |
Run the program with:
./invertedIndex

- Each line starts with a word followed by a list of filenames.
- Sort lines alphabetically by the initial word.
- Sort filenames alphabetically within each line.
design url2 url25 url31 url61
mars url101 url25 url31
vegetation url31 url61
- Words and URLs are separated by spaces (additional spaces are ignored by the testing program).
Implement searchPagerank.c to process search queries using two key data files: invertedIndex.txt and pagerankList.txt. The program finds the most relevant pages based on search terms and outputs the top 30 URLs sorted by relevance and PageRank.
- invertedIndex.txt: Contains words and their corresponding page URLs, as produced in Part 2.
- pagerankList.txt: Contains pages with their URL, outdegree, and PageRank, as produced in Part 1. The values are separated by ", " (a comma followed by a space).
- Accepts one or more search terms as command-line arguments.
- Identifies pages that contain one or more matching search terms.
- Outputs the top 30 pages, sorted by:
- Descending order of the number of matching search terms.
- Descending order of PageRank for ties.
- Matches are based on exact words; word variants and plurals are treated as distinct.
- A page's relevance is determined by the presence of search terms, not their frequency on the page.
./searchPagerank [search_term1] [search_term2] [...]

When searching for "mars" and "design":
./searchPagerank mars design
url31
url25

To run these tests, a bash script is provided, which will change into each directory test* and run the test. To use it, assuming the script runtests is in your current directory, do the following:
./runtests

You may have to clean up all the binary files and recompile again:
make clean
make