This repository was archived by the owner on Dec 6, 2024. It is now read-only.

URL Clusterer - White Paper

Description

A prototype implementation of a methodology for clustering the dynamic URLs of a website. This organization hosts two repositories that together achieve this:

  • LinkGraphExtractor: Crawls a given website and stores its URLs on Neo4j.
  • URLClusterer: Clusters the URLs it takes as input by running an Apache Spark pipeline over them.
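To give a feel for what "clustering dynamic URLs" means, here is a minimal plain-Python sketch. It is not the method from the paper and does not use Neo4j or Apache Spark; it only groups URLs by a naive template in which ID-like path segments are replaced by a placeholder. The names `url_template` and `cluster_urls`, the `{id}` placeholder, and the digits-or-hex heuristic are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

# Assumed heuristic (not from the paper): a path segment made only of digits,
# or of 8 or more hex characters, is treated as a dynamic value.
_DYNAMIC_SEGMENT = re.compile(r"^(\d+|[0-9a-fA-F]{8,})$")


def url_template(url: str) -> str:
    """Reduce a URL to a template: path with placeholders plus sorted query keys."""
    parts = urlsplit(url)
    segments = [
        "{id}" if _DYNAMIC_SEGMENT.match(seg) else seg
        for seg in parts.path.split("/")
        if seg
    ]
    # Keep only the query parameter names; their values are assumed to be dynamic.
    keys = sorted({key for key, _ in parse_qsl(parts.query, keep_blank_values=True)})
    template = "/" + "/".join(segments)
    if keys:
        template += "?" + "&".join(keys)
    return template


def cluster_urls(urls):
    """Group URLs that share the same template."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_template(url)].append(url)
    return dict(clusters)
```

For example, `cluster_urls` would place `https://example.com/products/123` and `https://example.com/products/456` in the same group, since both reduce to the template `/products/{id}`.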

We also wrote a paper on this study, published in the proceedings of the 2020 IEEE International Conference on Big Data.

Credits