StuyActivities Catalog Scraper

This project is a Python-based web scraper for the StuyActivities website. It extracts information about various clubs and activities, and formats the data for further use.

Project Structure

The main scripts for the project are located in the scraping/ directory:

club_names.py: This script extracts the names of clubs from an HTML file and writes them to a text file.
get_charter_info.py: This script reads the 'charter.txt' file from each club's directory, extracts specific sections such as the mission statement and meeting schedule using BeautifulSoup, and writes this information to a new 'charter_text.txt' file in the same directory.
get_club_info.py: This script reads an HTML file containing club information, extracts details such as the club name, description, commitment level, categories, image URL, and link for each club using BeautifulSoup, and writes this information to a CSV file.
get_main_info.py: This script reads the 'main.txt' file from each club's directory, extracts the mission, meeting schedule, leaders, and related clubs using BeautifulSoup, and writes this information to a new 'main_info.txt' file in the same directory.
get_members.py: This script reads the 'members.txt' file from each club's directory, extracts the names of the members using BeautifulSoup, and writes these names to a new 'member_name_list.txt' file in the same directory.
get_related_clubs.py: This script uses TF-IDF vectorization and cosine similarity to find and print the top three most related clubs for each club based on the combined text of their charter, main info, and member list.
main.py: This script uses Selenium and BeautifulSoup to navigate to the StuyActivities website, waits for the page to load completely, and then prints the prettified HTML of the page.
scrape_all.py: This script uses Selenium and BeautifulSoup to scrape club information from the StuyActivities website and saves the main page, charter, and members page HTML for each club into separate text files.

The data/ directory contains the scraped data for each club.

How to Run

Install the required Python packages; it's also recommended to create a venv:

python -m venv venv # Optional
source venv/bin/activate #Optional

pip install -r requirements.txt

Run the main script:

python main.py

Data Format

The data for each club is stored in a separate directory under data/. Each directory contains the following files:

main.txt: The raw HTML of the club's main page

main_info.txt: Extracted information from the main page, including the club's mission, meeting schedule, leaders, and related clubs

charter.txt: The raw HTML of the club's charter page

charter_info.txt: Extracted information from the charter page, including the club's mission statement, meeting schedule, purpose, benefits to Stuyvesant, leadership appointment process, and unique qualities

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
scraping		scraping
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

StuyActivities Catalog Scraper

Project Structure

How to Run

Data Format

About

Releases

Packages

Contributors 3

Languages

License

stuyai/catalog

Folders and files

Latest commit

History

Repository files navigation

StuyActivities Catalog Scraper

Project Structure

How to Run

Data Format

About

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages