This is Tübingo, the search engine built for the course Modern Search Engines in the summer term 2024 by:
- Daniel Flat
- Lenard Rommel
- Veronika Smilga
- Lilli Diederichs
## Installation

The project runs on Python 3.11.0. To install the required packages, follow the instructions below.

Ensure Python 3.11.0 is installed on your system. For Ubuntu:

```bash
sudo apt install python3.11
```

For macOS:

```bash
brew install [email protected]
```

Make sure you have Poetry installed to manage the project dependencies. You can install Poetry via pip:

```bash
pip install poetry
```

To install the project dependencies, run the following command:

```bash
poetry install
```

This command reads the `pyproject.toml` file and installs all specified dependencies into a virtual environment managed by Poetry.
Note: When adding a package to a specific group, ensure that this package is NOT specified anywhere else in the file. Duplicates can cause issues that are hard to resolve.
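As an illustration, here is a hypothetical `pyproject.toml` fragment with exactly this problem; the package name and versions are made up for the example and are not part of this project's actual configuration:

```toml
[tool.poetry.dependencies]
python = "^3.11"
requests = "^2.31"      # main dependency

[tool.poetry.group.dev.dependencies]
requests = "^2.31"      # duplicate of the entry above -- remove one of the two
```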
Then download the spaCy English model:

```bash
python -m spacy download en_core_web_sm
```
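If you want to verify the download worked, a quick check like the following should do (the sample sentence is arbitrary):

```python
import spacy

# Loading the model raises an OSError if the download step was skipped.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tübingen is a university town in Germany.")
print([(token.text, token.pos_) for token in doc])
```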
Collaborators and reviewers should make sure that Git LFS (Git Large File Storage) is installed on their machine; see the Git LFS documentation for more details. For macOS:

```bash
brew install git-lfs
```

We use Git LFS to track `dump.sql` (the index for our search engine) because it might be too large for a regular repository. We ran

```bash
git lfs track ./data/dump.sql
```

to still be able to push this file.
To activate the Poetry environment, run the following command:

```bash
poetry shell
```

## Database Setup

Our search engine uses a shared database that holds our index. It is required to run and test the search engine, so make sure to start it with:

```bash
docker compose down
docker compose up --build db
```

After that, wait a few seconds until you get the message:

```
LOG: database system is ready to accept connections
```

If you want to try out how the database works, you can experiment with `001_Flat_db_example_connection.ipynb`.
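For reference, a minimal connection sketch could look like the following. The user name and database name are taken from the `pg_dump` command further down in this README; the host, port, and password are placeholder assumptions, so check the notebook or the Docker configuration for the real credentials:

```python
import psycopg2  # PostgreSQL driver, used here for illustration

# user and dbname match the pg_dump command below; host/port assume the
# default Docker port mapping; the password is a placeholder.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="user",
    password="your_password",  # placeholder -- see the Docker setup
    dbname="search_engine_db",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```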
## Running the Project

To run the project, execute the following command:

```bash
python Main.py
```

This command runs `Main.py`, which is the entry point of the project.
If you want to run the crawler, go to `006_Nika_crawling_with_checkpointing.ipynb` and run it; all instructions are provided in the notebook.
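To give a rough idea of what checkpointing means here, below is a minimal, hypothetical sketch; the real logic lives in the notebook, and the checkpoint file name and seed URL are made-up assumptions:

```python
import json
import os
import urllib.request

CHECKPOINT_FILE = "crawler_checkpoint.json"  # hypothetical path

def load_checkpoint():
    """Resume the frontier and visited set from disk, if a checkpoint exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            state = json.load(f)
        return state["frontier"], set(state["visited"])
    return ["https://uni-tuebingen.de/"], set()  # illustrative seed URL

def save_checkpoint(frontier, visited):
    """Persist crawl state so an interrupted run can pick up where it left off."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"frontier": frontier, "visited": sorted(visited)}, f)

frontier, visited = load_checkpoint()
while frontier:
    url = frontier.pop(0)
    if url in visited:
        continue
    try:
        html = urllib.request.urlopen(url, timeout=10).read()
        # ... parse links into the frontier, extract text into the index ...
    except Exception:
        pass  # skip unreachable pages
    visited.add(url)
    save_checkpoint(frontier, visited)  # checkpoint after every page
```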
## Updating the Database Dump

For the project we all want to be on the same page, so the whole team shares one common PostgreSQL database. To keep our data in sync, there always exists a `dump.sql`; this file makes it possible to get the latest data whenever you call `docker compose up --build db`. When your work produces progress for the team, you should update `dump.sql`. To do that, follow these steps:
1. Make sure the Docker container is running; see the chapter "Running the Project" above.
2. Look up the container ID by running:

   ```bash
   docker ps
   ```

   This lists all your running containers. Look for the right one: the image name is `project_mse-db`, and the ID might look like this: `c946285e9b4f`.
3. Overwrite `dump.sql` by executing the following command:

   ```bash
   docker exec -t your_container_name_or_id pg_dump -U user search_engine_db > ./db/dump.sql
   ```

   For example, with the container ID `c946285e9b4f` the command looks like this:

   ```bash
   docker exec -t c946285e9b4f pg_dump -U user search_engine_db > ./db/dump.sql
   ```

4. Push the updated `dump.sql` using git:

   ```bash
   git add db/dump.sql
   git commit -m "Update dump.sql"
   git push
   ```