---
title: Chat with the UCL module catalogue
emoji: 🎓
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
---
This is a demonstration developed for educational purposes only and is not affiliated with or endorsed by University College London (UCL). The model may provide incorrect or outdated information. Interactions should therefore not be used to inform decisions such as programme choices or module selection.
Please refer to the official UCL module catalogue for accurate and up-to-date information.
The code is licensed under the Apache License 2.0, but the module catalogue content is copyright UCL.
The easiest way to chat with the model is using the Hugging Face space.
You can use the code snippet below to run the app locally. This project uses uv to manage dependencies and the snippet assumes that you have uv installed.
The app requires an OpenAI API key to run locally.
```bash
# Clone the repo and install dependencies in a venv
git clone https://github.com/joefarrington/ucl_module_chat.git
cd ucl_module_chat
uv venv
source .venv/bin/activate
uv pip install .

# Add the API key as an environment variable
# (alternatively, create a .env file in the main directory)
export OPENAI_API_KEY=<Your API key>

# Run the app
python app.py
```
One advantage of LangChain is that you could easily swap the embedding model and/or LLM for alternatives, including locally hosted models using Hugging Face, llama.cpp or Ollama.
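As an illustrative sketch (not code from this repository), swapping in models served by Ollama might look like the following; the model names are examples only:

```python
# Hypothetical substitution: locally hosted models via Ollama.
# Assumes the langchain-ollama package is installed and an Ollama
# server is running with these (example) models pulled.
from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(model="llama3.1")  # in place of ChatOpenAI(model="gpt-4o-mini")
embeddings = OllamaEmbeddings(model="nomic-embed-text")  # in place of OpenAIEmbeddings
```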
The repository includes the vectorstore with pages from the module catalogue embedded using OpenAI's text-embedding-3-small.
The process for downloading the pages from the module catalogue, converting the pages to markdown documents, and embedding the documents can be re-run using the script setup.py. There is no need to run this script unless you want to change the way data is extracted from the HTML pages to markdown, or embed the documents using an alternative model.
The documents in the vectorstore used to provide context to the LLM are based on the publicly available webpages describing each module offered by UCL.
The URLs for the individual module catalogue pages are identified from the module catalogue search page. The module pages are then visited in sequence and the HTML is downloaded for each page.
There are more efficient ways to scrape the content from the module catalogue (e.g. scrapy). The current method is designed to minimise the load on the server: there is a long wait time between requests, and the raw HTML is saved so that alternative methods of extracting the content can be considered without needing to request additional data from the server.
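A minimal sketch of this polite download loop (the output directory and wait time are assumptions, not values from the repository):

```python
import time
from pathlib import Path

import requests

WAIT_SECONDS = 30  # deliberately long pause between requests (value assumed)
OUT_DIR = Path("data/raw_html")  # hypothetical output directory
OUT_DIR.mkdir(parents=True, exist_ok=True)


def download_pages(urls: list[str]) -> None:
    """Visit each module page in sequence and save the raw HTML to disk."""
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        name = url.rstrip("/").split("/")[-1] + ".html"
        (OUT_DIR / name).write_text(resp.text, encoding="utf-8")
        time.sleep(WAIT_SECONDS)  # be kind to the server
```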
The raw HTML for each module page is converted to a markdown document using Beautiful Soup to parse the HTML and a Jinja template to format the extracted information.
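A rough sketch of that conversion step; the selectors and template fields below are placeholders for illustration, not the ones used in the repository:

```python
from bs4 import BeautifulSoup
from jinja2 import Template

# Hypothetical template; the real template and extracted fields live in the repo.
MODULE_TEMPLATE = Template("# {{ title }}\n\n{{ description }}\n")


def html_to_markdown(html: str) -> str:
    """Parse a saved module page and render the extracted fields as markdown."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    # The class name below is a guess, not the real page structure.
    description = soup.find("div", class_="module-description").get_text(" ", strip=True)
    return MODULE_TEMPLATE.render(title=title, description=description)
```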
The module pages are relatively short documents, so each is treated as a single chunk and embedded as a whole using text-embedding-3-small. FAISS is used to store and search the embedded documents.
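A sketch of the embed-and-index step using LangChain's FAISS wrapper (the `pages` variable and save path are assumptions):

```python
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# `pages` is a hypothetical list of (source_path, markdown_text) pairs
# produced by the conversion step above.
docs = [Document(page_content=text, metadata={"source": src}) for src, text in pages]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(docs, embeddings)  # one vector per module page
vectorstore.save_local("vectorstore")  # directory name is an assumption
```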
The chat interface is a simple Gradio app and uses OpenAI's gpt-4o-mini as the underlying LLM.

At each turn of the conversation the following steps are performed, managed using LangChain (see the sketch after this list):

- Call the LLM to rephrase the user's query, given the conversation history, so that it includes relevant context from the conversation.
- Embed the rephrased query and retrieve relevant documents from the vectorstore.
- Call the LLM with the current user input, the retrieved documents as context, and the conversation history. Output the result as the LLM's response in the chat interface.
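One way to express these three steps with LangChain's built-in helpers is sketched below; the prompts are simplified placeholders and `vectorstore` is the FAISS index from the earlier sketch, so this is illustrative rather than the repository's exact code:

```python
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
retriever = vectorstore.as_retriever()

# Step 1: rephrase the user's query using the conversation history.
rephrase_prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the user's question as a standalone query."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
history_aware_retriever = create_history_aware_retriever(llm, retriever, rephrase_prompt)

# Steps 2 and 3: retrieve documents, then answer with them as context.
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using this module catalogue context:\n\n{context}"),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
rag_chain = create_retrieval_chain(
    history_aware_retriever,
    create_stuff_documents_chain(llm, answer_prompt),
)

# A Gradio callback could then call, for example:
# rag_chain.invoke({"input": user_msg, "chat_history": history})["answer"]
```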
Possible future improvements include:

- Add course descriptions to the vectorstore so that the app is more useful to potential applicants and can explain, for example, which modules are mandatory on certain courses.
- Provide links to the module catalogue for modules suggested by the application, either within the conversation or as a separate interface element.
- Use an agent-based approach to avoid unnecessary retrieval steps and/or support more complex queries that require multiple retrieval steps.
- Use a LangGraph app to manage the conversation history and state.
Useful resources:

- Hands-On Large Language Models book and GitHub repository