patrickmac3/comp479-project2

Project Setup

Setup

  1. Create a virtual environment:

    python -m venv venv
  2. Activate the virtual environment:

    • On Windows:
      .\venv\Scripts\activate
    • On Unix or macOS:
      source venv/bin/activate
  3. Install the required dependencies:

    pip install -r requirements.txt

Running the Project

You can run the project in three ways:

1. Run main.py

python main.py --limit {limit} --k {k} --top {top} 

This command builds the index with the specified limit on the number of terms and saves the index and the mapper. Once the index is built and saved, it clusters the data into k clusters and displays the top n vocabulary terms for each cluster, where n is the value of --top.

Note: the index is built and saved in the index/ directory.

The default values for each parameter are:

  • limit: 50
  • k: 7
  • top: 20
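
As an illustration of how these options and defaults fit together, here is a minimal sketch of an argument parser. It is not taken from main.py; the function name parse_args and the help strings are assumptions, and only the flag names and default values come from this README.

```python
import argparse


def parse_args(argv=None):
    """Parse --limit, --k and --top with the defaults documented above.

    Hypothetical helper: the real main.py may define its options differently.
    """
    parser = argparse.ArgumentParser(description="Build the index, then cluster it.")
    parser.add_argument("--limit", type=int, default=50, help="limit used when building the index")
    parser.add_argument("--k", type=int, default=7, help="number of clusters")
    parser.add_argument("--top", type=int, default=20, help="number of top terms shown per cluster")
    return parser.parse_args(argv)


# Passing an explicit list instead of reading sys.argv keeps the example reproducible.
args = parse_args(["--k", "3"])
print(f"limit={args.limit} k={args.k} top={args.top}")
```

Any flag left off the command line falls back to its default, so running with only --k 3 still uses a limit of 50 and a top of 20.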

2. Build the index

python build_index.py --limit {limit}

This command only builds the index.

Note: the index is built and saved in the index/ directory. The default value of its only parameter is:

  • limit: 50

3. Run the clustering process

python clustering.py --k {number_of_clusters} --top {top_n_terms_per_cluster}

This command runs the clustering process with k clusters and displays the top n terms for each cluster.

Note: when only executing the clustering, the index and mapper used are from the index_main/ directory.

The default values for each parameter are:

  • k: 7
  • top: 20
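
To make "top n terms for each cluster" concrete, the sketch below ranks vocabulary terms by their weight in a cluster centroid. It assumes each cluster is summarised by a centroid vector with one weight per vocabulary term; the function name top_terms and the sample data are hypothetical, and clustering.py may compute its ranking differently.

```python
import heapq


def top_terms(centroid, vocabulary, top=20):
    """Return the `top` terms with the largest weights in one centroid.

    Assumes centroid[i] is the weight of vocabulary[i] (illustrative only).
    """
    # Pair each weight with its term, then keep the `top` heaviest pairs.
    ranked = heapq.nlargest(top, zip(centroid, vocabulary))
    return [term for _, term in ranked]


# Made-up vocabulary and centroids, used only to show the call.
vocabulary = ["course", "engineering", "library", "research"]
centroids = [
    [0.1, 0.7, 0.0, 0.4],
    [0.2, 0.0, 0.9, 0.3],
]
for cluster_id, centroid in enumerate(centroids):
    print(cluster_id, top_terms(centroid, vocabulary, top=2))
```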

Note

If you run the project and get an error like the following:

Traceback (most recent call last):
  File "/home/patrick/comp479/p2/main.py", line 18, in <module>
    cluster(args.k, args.top, index_path="index/mapper.json", mapper_path="index/mapper.json")
  File "/home/patrick/comp479/p2/clustering.py", line 24, in cluster
    raise ValueError("All tokens were filtered out. Adjust the threshold.")
ValueError: All tokens were filtered out. Adjust the threshold.

Adjust the threshold (currently 75%) at line 19 of clustering.py.
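
This README does not show how clustering.py applies the threshold, so the following is only a sketch of one way such an error can arise. It assumes the threshold is a maximum document frequency in percent, so tokens appearing in more than that share of documents are dropped; the function name filter_tokens and its parameters are hypothetical.

```python
def filter_tokens(doc_freq, num_docs, threshold=75):
    """Keep tokens that appear in at most `threshold` percent of the documents.

    Assumption, not the project's confirmed logic: `doc_freq` maps each token
    to the number of documents containing it, and `num_docs` is positive.
    """
    kept = {
        token: df
        for token, df in doc_freq.items()
        if 100 * df / num_docs <= threshold
    }
    if not kept:
        # Same message as in the traceback above.
        raise ValueError("All tokens were filtered out. Adjust the threshold.")
    return kept


# With a small index every token may be too common, which triggers the error.
try:
    filter_tokens({"university": 10, "student": 9}, num_docs=10)
except ValueError as err:
    print(err)

# A more permissive threshold lets the tokens through.
print(filter_tokens({"university": 10, "student": 9}, num_docs=10, threshold=100))
```

Under this assumption, a very small index makes it more likely that every token exceeds the cutoff, which is why changing the threshold resolves the error.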

Project Demo

  • clustering: directory containing the results from clustering with k=3 and k=6
  • clustering_faculty_department: directory containing the clustering results from k = number of faculties (7) and k = number of departments (49)
  • index_main: directory containing a compiled index and the document URL-to-ID mapper.
