- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
- On Windows:
.\venv\Scripts\activate
- On Unix or macOS:
source venv/bin/activate
- Install the required dependencies:
pip install -r requirements.txt
You can run the project in two ways:
python main.py --limit {limit} --k {k} --top {top}
This command will build the index with a specified limit on the number of terms and save the index and mapper to the given paths. Once the index is built and saved, it will cluster the data into k clusters and display the top n vocabulary terms for each cluster.
Note, the index will be built and saved in the index/ directory.
The default values for each parameter are:
limit: 50
k: 7
top: 20
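As a rough illustration of how these flags and defaults fit together, here is a minimal sketch using argparse. The function name build_parser and the help strings are assumptions made for this example. Only the flag names and default values come from this README, and the real main.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: flag names and defaults mirror the README,
    # everything else is illustrative.
    parser = argparse.ArgumentParser(description="Build the index, then cluster it.")
    parser.add_argument("--limit", type=int, default=50,
                        help="limit applied while building the index")
    parser.add_argument("--k", type=int, default=7,
                        help="number of clusters")
    parser.add_argument("--top", type=int, default=20,
                        help="number of top terms shown per cluster")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--k", "3"])
print(args.limit, args.k, args.top)
```

Any flag left off the command line falls back to its default, so running with only --k 3 still uses limit 50 and top 20.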
python build_index.py --limit {limit}
This command will only build the index.
Note, the index will be built and saved in the index/ directory.
The default values for each parameter are:
limit: 50
python clustering.py --k {number_of_clusters} --top {top_n_terms_per_cluster}
This command runs the clustering process with k clusters and displays the top n terms for each cluster.
Note: when only executing the clustering, the index and mapper used are from the index_main/ directory.
The default values for each parameter are:
k: 7
top: 20
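To make "top n terms for each cluster" concrete, here is a small sketch of one way to pick them. It assumes each cluster is summarized by a centroid, a list of term weights aligned with the vocabulary. The function top_terms and the sample data are hypothetical and are not taken from clustering.py.

```python
import heapq


def top_terms(centroid, vocabulary, n=20):
    """Return the n vocabulary terms with the largest centroid weights."""
    # Pair each weight with its term, then keep the n heaviest pairs.
    best = heapq.nlargest(n, zip(centroid, vocabulary))
    return [term for _weight, term in best]


# Illustrative data only: four terms and one made-up centroid.
vocabulary = ["course", "engineering", "research", "student"]
centroid = [0.1, 0.7, 0.4, 0.2]
print(top_terms(centroid, vocabulary, n=2))
```

If n is larger than the vocabulary, the function simply returns every term, ordered from heaviest to lightest.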
If you run the project and get an error like the following:
Traceback (most recent call last):
File "/home/patrick/comp479/p2/main.py", line 18, in <module>
cluster(args.k, args.top, index_path="index/mapper.json", mapper_path="index/mapper.json")
File "/home/patrick/comp479/p2/clustering.py", line 24, in cluster
raise ValueError("All tokens were filtered out. Adjust the threshold.")
ValueError: All tokens were filtered out. Adjust the threshold.
Change the threshold from its current value of 75% at line 19 of clustering.py.
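This README does not show how clustering.py applies the threshold, so the following is only a sketch of one plausible reading: tokens that appear in more than 75% of the documents are dropped, and the error is raised when nothing survives. The function filter_tokens, its parameters, and the sample index are all assumptions made for illustration.

```python
def filter_tokens(index, num_docs, max_df_ratio=0.75):
    """Keep tokens that appear in at most max_df_ratio of the documents.

    Assumed layout: index maps each token to a list of document ids.
    """
    kept = {
        token: postings
        for token, postings in index.items()
        if len(set(postings)) / num_docs <= max_df_ratio
    }
    if not kept:
        # Same message as in the traceback above.
        raise ValueError("All tokens were filtered out. Adjust the threshold.")
    return kept


# Illustrative index over 4 documents: "concordia" is in all of them.
index = {"concordia": [1, 2, 3, 4], "biology": [2], "physics": [3, 4]}
print(sorted(filter_tokens(index, num_docs=4)))
```

Under this reading, a very small index is the likely trigger: with a single document, every token appears in 100% of the documents and is filtered out, so raising the threshold or building a larger index avoids the error.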
clustering: directory containing the results from clustering with k=3 and k=6
clustering_faculty_department: directory containing the clustering results from k = number of faculties (7) and k = number of departments (49)
index_main: directory containing a compiled index and the document url to id mapper.