Spark Search Pipeline

A batch-based text search and filtering pipeline in Apache Spark: given a large set of documents and a set of user-defined queries, the pipeline ranks the documents by relevance to each query and filters out overly similar documents.

This is an assessed coursework for the Big Data postgraduate course at the University of Glasgow, academic year 2022-2023.

Coursework Specification

Summary

The goal of this exercise is to familiarize yourselves with the design, implementation and performance testing of Big Data analysis tasks using Apache Spark. You will be required to design and implement a single, reasonably complex Spark application. You will then test this application locally on datasets of various sizes. Finally, you will write a short report describing your design, your design decisions and, where appropriate, a critique of your design. You will be evaluated on code functionality (does it produce the expected outcome), code quality (is it well designed and does it follow good software engineering practices) and efficiency (how fast is it and does it use resources efficiently), as well as on your submitted report.

Task Description

You are to develop a batch-based text search and filtering pipeline in Apache Spark. The core goal of this pipeline is to take in a large set of text documents and a set of user-defined queries, then, for each query, rank the text documents by relevance and filter out any overly similar documents from the final ranking. The top 10 documents for each query should be returned as output. Each document and query should be processed to remove stopwords (words with little discriminative value, e.g. ‘the’) and to apply stemming (which converts each word into its ‘stem’, a shorter form that helps with term mismatch between documents and queries). Documents should be scored using the DPH ranking model. As a final stage, the ranking of documents for each query should be analysed to remove unneeded redundancy (near-duplicate documents): if any pair of documents is found whose titles have a textual distance (using a provided comparison function) of less than 0.5, only the more relevant of the two (based on the DPH score) should be kept. Note that 10 documents should still be returned for each query, even after redundancy filtering.
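The repository's own code is not reproduced on this page, but the following is a minimal Scala sketch of the three core stages under stated assumptions: the stemmer and the title-distance function are taken as parameters (the coursework's provided comparison function is not shown here), and the corpus statistics that DPH needs are assumed to be precomputed in an earlier Spark stage. The DPH formulation follows the one commonly used in the Terrier IR platform.

```scala
object SearchPipelineSketch {

  // Tokenise, lowercase, drop stopwords, and stem each term. `stem` stands
  // in for whichever stemmer the real pipeline uses (e.g. a Porter stemmer).
  def preprocess(text: String, stopwords: Set[String],
                 stem: String => String): Seq[String] =
    text.toLowerCase
      .split("\\W+")
      .toSeq
      .filter(t => t.nonEmpty && !stopwords(t))
      .map(stem)

  // DPH score for one query term in one document. Corpus statistics
  // (average document length, total document count, collection term
  // frequency) are assumed to be precomputed over the whole corpus.
  def dph(tf: Double, docLength: Double, avgDocLength: Double,
          totalDocs: Double, collectionTf: Double): Double = {
    if (tf <= 0 || docLength <= 0 || tf >= docLength) return 0.0
    val f    = tf / docLength
    val norm = math.pow(1 - f, 2) / (tf + 1)
    def log2(x: Double) = math.log(x) / math.log(2)
    norm * (tf * log2((tf * avgDocLength / docLength) * (totalDocs / collectionTf))
            + 0.5 * log2(2 * math.Pi * tf * (1 - f)))
  }

  // Greedy redundancy filter over (title, dphScore) pairs: walk the ranking
  // from the highest score down, keep a document only if its title is at
  // distance >= 0.5 from every title already kept, and stop once 10 survive.
  // `distance` stands in for the comparison function provided with the
  // coursework.
  def filterRedundant(ranked: Seq[(String, Double)],
                      distance: (String, String) => Double): Seq[(String, Double)] =
    ranked.sortBy(-_._2).foldLeft(Vector.empty[(String, Double)]) { (kept, cand) =>
      if (kept.size >= 10) kept
      else if (kept.exists(k => distance(k._1, cand._1) < 0.5)) kept
      else kept :+ cand
    }
}
```

Because the filter keeps walking down the ranking after skipping a near-duplicate, each discarded document is replaced by the next most relevant one, which is what keeps the output at 10 documents per query.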

Results

Efficiency

Processing the full dataset (~5 GB) with a set of ten queries, using four local executors, took ~69 seconds.
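For reference, a minimal sketch of how such a local run might be configured in Scala: "local[4]" runs Spark in a single JVM with four worker threads, matching the four-executor setup used for the timing above. The application name is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Run Spark locally with four worker threads.
val spark = SparkSession.builder()
  .appName("spark-search-pipeline") // illustrative name
  .master("local[4]")
  .getOrCreate()
```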

Note

  • The dataset was too large to include in the repository, but it can be accessed from here.
