Uplift (#4)
* bring up-to-date

* bring up-to-date
ivyONS authored Apr 18, 2024
1 parent 288013b commit 85c8f7e
Showing 70 changed files with 2,158 additions and 1,426 deletions.
18 changes: 18 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Ignore everything
*

# Allow files and directories
!/statschat
!/log/.gitkeep
!/notebooks/load_llm_models_docker.py
!/data/db_langchain
!/data/db_langchain_latest
!/pyproject.toml
!/fast-api

# ignore python temp
*.pyc
*.pyo
*.pyd
__pycache__
.pytest_cache
37 changes: 0 additions & 37 deletions .envrc

This file was deleted.

32 changes: 32 additions & 0 deletions .gcloudignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# This file specifies files that are *not* uploaded to Google Cloud
# using gcloud. It follows the same syntax as .gitignore, with the addition of
# "#!include" directives (which insert the entries of the given .gitignore-style
# file at that point).
#
# For more information, run:
# $ gcloud topic gcloudignore
#
.gcloudignore
# If you would like to upload your .git directory, .gitignore file or files
# from your .gitignore file, remove the corresponding line
# below:
.git
.gitignore


# Python pycache:
__pycache__/
# Ignored by the build system


# virtual environments
venv/
env/
.env/
.env

# dev stuff
tests/
docs/
data/bulletins/
flask-app/
17 changes: 17 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Set base image (this loads the Debian Linux operating system)
FROM python:3.10.4-buster
ENV PYTHONUNBUFFERED True

WORKDIR /Statschat

# copy subset of files as specified by dockerignore
COPY . ./
RUN mv notebooks/load_llm_models_docker.py ./load_llm_models_docker.py

RUN python -m pip install --upgrade pip
RUN python -m pip install ".[backend]"

RUN python load_llm_models_docker.py

EXPOSE 8080
CMD ["uvicorn", "fast-api.main_api:app", "--host", "0.0.0.0", "--port", "8080"]
205 changes: 124 additions & 81 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,36 +8,30 @@

## Code state

**Please be aware that for development purposes, these experiments use experimental Large Language Models (LLM's) not intended for production. They can present inaccurate information, hallucinated statements and offensive text by random chance or through malevolent prompts.**
> [!WARNING]
> Please be aware that for development purposes, these experiments use
> experimental Large Language Models (LLMs) not intended for production. They
> can present inaccurate information, hallucinated statements and offensive
> text by random chance or through malevolent prompts.
**Tested on OSX only**

**Peer-reviewed**

**Depends on external API's**

**Under development**

**Experimental**
- **Under development** / **Experimental**
- **Tested on macOS only**
- **Peer-reviewed**
- **Depends on external APIs**

## Introduction

This is an experimental application for semantic search of ONS statistical publications.
It uses LangChain to implement a fairly simple embedding search and QA information retrieval
process. Upon receiving a query, documents are returned as search results
using embedding similarity to score relevance. Additionally, the relevant text is
passed to a locally-hosted Large language Model (LLM), which is prompted to write an
answer to the original question, if it can, using only the information contained within
the documents.
This is an experimental application for semantic search of ONS statistical
publications. It uses LangChain to implement a fairly simple Retrieval
Augmented Generation (RAG) process using embedding search and QA information
retrieval.

For this prototype, the program is run entirely locally; relevant web pages are scraped and the data
stored in `data/bulletins`, the docstore / embedding store that is created is likewise
in local folders and files, and the LLM and all other code is run in memory on your
desktop or laptop.
Upon receiving a query, documents are
returned as search results using embedding similarity to score relevance. Next, the relevant text is passed to a Large Language
Model (LLM), which is prompted to write an answer to the original question, if it can, using only the information contained within the documents.

The search program should be able to run on a system with 16GB of ram. The LLM is
set up to run on CPU at this research stage. Different models from the Hugging Face
repository can be specified for the search and QA functions.
For this prototype, relevant web pages are
scraped and the data stored in `data/bulletins`, the docstore / embedding store
that is created is likewise in local folders and files, and the LLM is either run in memory or accessed through VertexAI.
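In outline, the retrieve-then-generate loop described above looks like the following sketch. This is pure Python with a toy character-frequency "embedding" standing in for the real LangChain embedding model, and the LLM call is stubbed out; it is purely illustrative, not the statschat implementation:

```python
import math

def embed(text: str) -> list[float]:
    # Stub embedding: normalised character-frequency vector. The real app
    # uses a Hugging Face or Vertex AI embedding model via LangChain.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalised, so distance = 1 - dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def answer(query: str, docs: list[str], k_contexts: int = 3) -> str:
    # 1. Score every document by embedding similarity to the query.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine_distance(embed(d), q))
    # 2. Pass only the top documents to the generative LLM as context.
    context = "\n".join(ranked[:k_contexts])
    prompt = (
        "Answer the question using ONLY the information below.\n"
        f"Context:\n{context}\nQuestion: {query}"
    )
    return prompt  # The real app sends this prompt to the LLM.

docs = ["Inflation rose to 4% in March.", "Employment was stable."]
print(answer("What happened to inflation?", docs, k_contexts=1))
```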

## Installation

@@ -49,106 +49,155 @@ python3.10 -m venv env
source env/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install .
```

> [!NOTE]
> If you are doing development work on `statschat`, you should install the
> package locally as editable with our optional `dev` dependencies:
> ```shell
> python -m pip install -e ".[dev]"
> ```
### Pre-commit actions
This repository contains a configuration of pre-commit hooks. These are language agnostic and focussed on repository security (such as detection of passwords and API keys). If approaching this project as a developer, you are encouraged to install and enable `pre-commits` by running the following in your shell:
1. Install `pre-commit`:
```shell
pip install pre-commit
```
2. Enable `pre-commit`:
This repository contains a configuration of pre-commit hooks. These are
language agnostic and focussed on repository security (such as detection of
passwords and API keys).
```shell
pre-commit install
```
If approaching this project as a developer, you are encouraged to install and
enable `pre-commits` by running the following in your shell:
1. Install `pre-commit`:
```shell
pip install pre-commit
```
2. Enable `pre-commit`:
```shell
pre-commit install
```

Once pre-commits are activated, whenever you commit to this repository a series of checks will be executed. The pre-commits include checking for security keys, large files and unresolved merge conflict headers. The use of active pre-commits are highly encouraged and the given hooks can be expanded with Python or R specific hooks that can automate the code style and linting. For example, the `flake8` and `black` hooks are useful for maintaining consistent Python code formatting.
Once pre-commits are activated, whenever you commit to this repository a series of checks will be executed. The use of active
pre-commits are highly encouraged.
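As an illustration, a minimal `.pre-commit-config.yaml` covering the checks mentioned above might look like the sketch below (hook ids are from the standard `pre-commit-hooks` repository; the revision pin and this repository's actual configuration may differ):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0  # assumed pin; check this repo's actual file
    hooks:
      - id: detect-private-key       # security keys
      - id: check-added-large-files  # large files
      - id: check-merge-conflict     # unresolved merge conflict headers
```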

**NOTE:** Pre-commit hooks execute Python, so it expects a working Python build.
> [!NOTE]
> Pre-commit hooks execute Python, so they expect a working Python build.
## Usage

By default, flask will look for a file called `app.py`, you can also name a specific python program to run.
With `--debug` in play, flask will restart every time it detects a saved change in the underlying
python files.
The first time you run the app, any ML models specified in the code will be downloaded
to your machine. This will use a few GB of data and take a few minutes.
App and search pipeline parameter are stored and can be updated by editing `app_config.toml`.
This main module statschat can be either called directly or deployed as an API (using fastapi).
A lightweight flask front end is implemented separately in a subfolder and relies on the API running.

The first time you instantiate the `Inquirer` class, any ML models specified in the code will be
downloaded to your machine. This will use a few GB of data and take a few
minutes. App and search pipeline parameters are stored and can be updated by
editing `statschat/_config/main.toml`.

We have included three EXAMPLE scraped data files in `data/bulletins` so that
We have included a few EXAMPLE scraped data files in `data/bulletins` so that
the preprocessing and app can be run as a small example system without waiting
on webscraping.

### To webscrape the source documents from ONS
#### We have removed this script, and for the sake of demonstration included some example scrape results so that the process can be continued from the next step below
### With Vertex AI

If you wish to use Google's model API update the model variables in
`statschat/_config/main.toml`:
* to use the question-answering system with Google's PaLM2 API set the
`generative_model_name` parameter to `text-unicorn` or `gemini-pro` (their
name for the model).
* for PaLM2 (Gecko) to create embeddings, set the `embedding_model_name`
parameter to `textembedding-gecko@001`. You may also wish to disable the
removal of near-identical documents in the preprocessing pipeline (line 59,
`statschat/embedding/preprocess.py`), to reduce calls to the embedding API.
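For example, the relevant lines in `statschat/_config/main.toml` would then read something like this sketch (key names per the bullets above; any surrounding sections and other keys are omitted):

```toml
generative_model_name = "gemini-pro"             # or "text-unicorn"
embedding_model_name = "textembedding-gecko@001"
```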

In addition to changing this parameter, you will need a Google Cloud Platform
(GCP) project set up, with the Vertex AI API enabled. You will need to have the
GCP Command Line Interface installed in the machine running this code, logged
in to an account with sufficient permissions to access the API (you may need to
set up [application default credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to)).
Usually this can be achieved by running:
```shell
# python statschat/webscraping/main.py
gcloud config set project "<PROJECT_ID>"
gcloud auth application-default login
```

### To create a local document store
```shell
python statschat/preprocess.py
```
## Example endpoint commands

### To run the interactive app
1. #### Webscraping the source documents (not included in the public repository, only examples in `data/bulletins`)

```shell
python statschat/webscraping/main.py
```

2. #### Creating a local document store

```shell
flask --debug run
```
or
```shell
python app.py
```
```shell
python statschat/embedding/preprocess.py
```

The flask app is set respond to https requests on port 5000. To use the user UI navigate in your browser to http://localhost:5000.
3. #### Updating an existing local document store with new articles

The API default url would be http://localhost:5000/api. See [API endpoint documentation](docs/api/README.md) for more details (note, this is a work in progress).
```shell
python statschat/embedding/preprocess_update_db.py
```

4. #### Run the interactive Statschat API

### Search engine parameters
```shell
uvicorn fast-api.main_api:app
```

There are some key parameters in `app_config.toml` that we're experimenting with to improve the search results,
and the generated text answer. The current values are initial guesses:
The FastAPI app is set to respond to HTTP requests on port 8000. When running, you can see the docs at http://localhost:8000/docs.
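A response from the running API can then be unpacked with Python's standard library. Note that the endpoint path and response fields in this sketch are assumptions for illustration; check http://localhost:8000/docs for the real schema:

```python
import json

# Against a live instance you might fetch, e.g. with urllib:
#   from urllib.request import urlopen
#   body = urlopen("http://localhost:8000/search?q=inflation").read()
# A canned body stands in here so the snippet is self-contained.
body = b'{"answer": "Inflation rose to 4%.", "references": [{"title": "CPI bulletin", "score": 0.12}]}'

data = json.loads(body)
print(data["answer"])           # the generated answer
for ref in data["references"]:  # the supporting search results
    print(ref["title"], ref["score"])
```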

| Parameter | Current Value | Function |
| --- | --- | --- |
| k_docs | 10 | Maximum number of search results to return |
| similarity_threshold | 1.0 | Cosine distance, a searched document is only returned if it is at least this similar (EQUAL or LOWER) |
| k_contexts | 3 | Number of top documents to pass to generative QA LLM |
5. #### Run the flask web interface

### Alternatively, to run the search evaluation pipeline
```shell
python flask-app/app.py
```
To use the web UI,
navigate in your browser to http://localhost:5000. Note that it requires the API to be running and the endpoint specified in the app.

The StatsChat pipeline is currently evaluated based on small number of test question. The main 'app_config.toml' determines pipeline setting used in evaluation and results are written to `data/model_evaluation` folder. The evaluation script requires that project root (assumed working directory) be added to PYTHONPATH, this is handled through [direnv](https://direnv.net/) and
the `.envrc` file.
6. #### Run the search evaluation pipeline
```shell
python statschat/model_evaluation/evaluation.py
```
The StatsChat pipeline is currently evaluated on a small number of test
questions. The main `app_config.toml` determines the pipeline settings used in
evaluation, and results are written to the `data/model_evaluation` folder.


7. #### Testing
```shell
python -m pytest
```
Preferred unit testing framework is PyTest.

## Testing
### Search engine parameters

Preferred unittesting framework is PyTest:
There are some key parameters in `statschat/_config/main.toml` that we're
experimenting with to improve the search results, and the generated text
answer. The current values are initial guesses:

| Parameter | Current Value | Function |
| --- | --- | --- |
| k_docs | 10 | Maximum number of search results to return |
| similarity_threshold | 2.0 | Cosine distance, a searched document is only returned if it is at least this similar (EQUAL or LOWER) |
| k_contexts | 3 | Number of top documents to pass to generative QA LLM |
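The interaction of these three parameters can be illustrated with a small sketch (pure Python with made-up distances; the actual filtering lives inside the statschat retrieval code):

```python
def filter_results(scored_docs, k_docs=10, similarity_threshold=2.0, k_contexts=3):
    """scored_docs: (doc, cosine_distance) pairs; lower distance = more similar."""
    # Keep only documents at least as similar as the threshold (EQUAL or LOWER).
    kept = [(d, s) for d, s in scored_docs if s <= similarity_threshold]
    # Return at most k_docs search results, best first.
    results = sorted(kept, key=lambda pair: pair[1])[:k_docs]
    # Only the top k_contexts are passed on to the generative QA LLM.
    contexts = [d for d, _ in results[:k_contexts]]
    return results, contexts

scored = [("doc_a", 0.3), ("doc_b", 1.9), ("doc_c", 2.5), ("doc_d", 0.9)]
results, contexts = filter_results(scored, k_docs=10, similarity_threshold=2.0, k_contexts=2)
print(results)   # doc_c dropped: distance 2.5 > 2.0
print(contexts)  # top 2 by similarity
```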
```shell
pytest
```
# Data Science Campus
At the [Data Science Campus](https://datasciencecampus.ons.gov.uk/about-us/) we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at [[email protected]]([email protected]).
At the [Data Science Campus](https://datasciencecampus.ons.gov.uk/about-us/) we
apply data science, and build skills, for public good across the UK and
internationally. Get in touch with the Campus at
[[email protected]]([email protected]).
# License
<!-- Unless stated otherwise, the codebase is released under [the MIT Licence][mit]. -->
The code, unless otherwise stated, is released under [the MIT License][mit].
The documentation for this work is subject to [© Crown copyright][copyright] and is available under the terms of the [Open Government 3.0][ogl] licence.
The documentation for this work is subject to [© Crown copyright][copyright]
and is available under the terms of the [Open Government 3.0][ogl] licence.
[mit]: LICENSE
[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
