This is Tübingo, the search engine built for the course Modern Search Engines in the summer term 2024 by:
- Daniel Flat
- Lenard Rommel
- Veronika Smilga
- Lilli Diederichs
## Installation

The project runs on Python 3.11.0. To install the required packages, follow the instructions below.

Ensure Python 3.11.0 is installed on your system. For Ubuntu:

```bash
sudo apt install python3.11
```

For macOS:

```bash
brew install [email protected]
```

Make sure you have Poetry installed to manage the project dependencies. You can install Poetry via pip:

```bash
pip install poetry
```

To install the project dependencies, run the following command:

```bash
poetry install
```

This command reads the `pyproject.toml` file and installs all specified dependencies into a virtual environment managed by Poetry.
Note: When adding a package to a specific group, ensure that this package is NOT specified anywhere else in the file. Duplicates can cause issues that are hard to resolve.
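As an illustration, here is a hypothetical `pyproject.toml` fragment with exactly this problem; the package name and versions are made up for the example and are not part of this project's actual configuration:

```toml
[tool.poetry.dependencies]
python = "^3.11"
requests = "^2.31"      # main dependency

[tool.poetry.group.dev.dependencies]
requests = "^2.31"      # duplicate of the entry above -- remove one of the two
```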
Then download the spaCy English model:

```bash
python -m spacy download en_core_web_sm
```
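If you want to verify the download worked, a quick check like the following should do (the sample sentence is arbitrary):

```python
import spacy

# Loading the model raises an OSError if the download step was skipped.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tübingen is a university town in Germany.")
print([(token.text, token.pos_) for token in doc])
```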
Collaborators and reviewers should make sure that Git LFS (Git Large File Storage) is installed on their machine; see the Git LFS documentation for more details. For macOS:

```bash
brew install git-lfs
```

We use Git LFS to track `dump.sql` (the index for our search engine) because it might be too large for a regular repository. We ran

```bash
git lfs track ./data/dump.sql
```

to still be able to push this file.
To activate the Poetry environment, run the following command:

```bash
poetry shell
```

## Database Setup

Our search engine uses a shared database that holds our index. It is required to run and test the search engine, so make sure to start it with:

```bash
docker compose down
docker compose up --build db
```

After that, wait a few seconds until you get the message:

```
LOG: database system is ready to accept connections
```

If you want to try out how the database works, you can experiment with `001_Flat_db_example_connection.ipynb`.
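For reference, a minimal connection sketch could look like the following. The user name and database name are taken from the `pg_dump` command further down in this README; the host, port, and password are placeholder assumptions, so check the notebook or the Docker configuration for the real credentials:

```python
import psycopg2  # PostgreSQL driver, used here for illustration

# user and dbname match the pg_dump command below; host/port assume the
# default Docker port mapping; the password is a placeholder.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    user="user",
    password="your_password",  # placeholder -- see the Docker setup
    dbname="search_engine_db",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone())
conn.close()
```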
## Running the Project

To run the project, execute the following command:

```bash
python Main.py
```

This command runs `Main.py`, which is the entry point of the project.
If you want to run the crawler, go to `006_Nika_crawling_with_checkpointing.ipynb` and run it; all instructions are provided in the notebook.
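To give a rough idea of what checkpointing means here, below is a minimal, hypothetical sketch; the real logic lives in the notebook, and the checkpoint file name and seed URL are made-up assumptions:

```python
import json
import os
import urllib.request

CHECKPOINT_FILE = "crawler_checkpoint.json"  # hypothetical path

def load_checkpoint():
    """Resume the frontier and visited set from disk, if a checkpoint exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            state = json.load(f)
        return state["frontier"], set(state["visited"])
    return ["https://uni-tuebingen.de/"], set()  # illustrative seed URL

def save_checkpoint(frontier, visited):
    """Persist crawl state so an interrupted run can pick up where it left off."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"frontier": frontier, "visited": sorted(visited)}, f)

frontier, visited = load_checkpoint()
while frontier:
    url = frontier.pop(0)
    if url in visited:
        continue
    try:
        html = urllib.request.urlopen(url, timeout=10).read()
        # ... parse links into the frontier, extract text into the index ...
    except Exception:
        pass  # skip unreachable pages
    visited.add(url)
    save_checkpoint(frontier, visited)  # checkpoint after every page
```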
## Updating the Database Dump

For the project we all want to be on the same page, so the whole team shares one common PostgreSQL database. To keep our data in sync, there always exists a `dump.sql`; this file makes it possible to get the latest data whenever you call `docker compose up --build db`. When your work produces progress for the team, you should update `dump.sql`. To do that, follow these steps:
1. Make sure the Docker container is running; see the chapter "Running the Project" above.
2. Look up the container ID by running:

   ```bash
   docker ps
   ```

   This lists all your running containers. Look for the right one: the image name is `project_mse-db`, and the ID might look like this: `c946285e9b4f`.
3. Overwrite `dump.sql` by executing the following command:

   ```bash
   docker exec -t your_container_name_or_id pg_dump -U user search_engine_db > ./db/dump.sql
   ```

   For example, with the container ID `c946285e9b4f` the command looks like this:

   ```bash
   docker exec -t c946285e9b4f pg_dump -U user search_engine_db > ./db/dump.sql
   ```

4. Push the updated `dump.sql` using git:

   ```bash
   git add db/dump.sql
   git commit -m "Update dump.sql"
   git push
   ```