Uplift (#4)
* bring up-to-date

* bring up-to-date
ivyONS authored Apr 18, 2024
1 parent 288013b commit 85c8f7e
Showing 70 changed files with 2,158 additions and 1,426 deletions.
18 changes: 18 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Ignore everything
*

# Allow files and directories
!/statschat
!/log/.gitkeep
!/notebooks/load_llm_models_docker.py
!/data/db_langchain
!/data/db_langchain_latest
!/pyproject.toml
!/fast-api

# ignore python temp
*.pyc
*.pyo
*.pyd
__pycache__
.pytest_cache
37 changes: 0 additions & 37 deletions .envrc

This file was deleted.

32 changes: 32 additions & 0 deletions .gcloudignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# This file specifies files that are *not* uploaded to Google Cloud
# using gcloud. It follows the same syntax as .gitignore, with the addition of
# "#!include" directives (which insert the entries of the given .gitignore-style
# file at that point).
#
# For more information, run:
# $ gcloud topic gcloudignore
#
.gcloudignore
# If you would like to upload your .git directory, .gitignore file or files
# from your .gitignore file, remove the corresponding line
# below:
.git
.gitignore


# Python pycache:
__pycache__/
# Ignored by the build system


# virtual environments
venv/
env/
.env/
.env

# dev stuff
tests/
docs/
data/bulletins/
flask-app/
17 changes: 17 additions & 0 deletions Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Set base image (this loads the Debian Linux operating system)
FROM python:3.10.4-buster
ENV PYTHONUNBUFFERED True

WORKDIR /Statschat

# copy subset of files as specified by dockerignore
COPY . ./
RUN mv notebooks/load_llm_models_docker.py ./load_llm_models_docker.py

RUN python -m pip install --upgrade pip
RUN python -m pip install ".[backend]"

RUN python load_llm_models_docker.py

EXPOSE 8080
CMD ["uvicorn", "fast-api.main_api:app", "--host", "0.0.0.0", "--port", "8080"]
205 changes: 124 additions & 81 deletions README.md
Original file line number Diff line number Diff line change
@@ -8,36 +8,30 @@

## Code state

**Please be aware that for development purposes, these experiments use experimental Large Language Models (LLM's) not intended for production. They can present inaccurate information, hallucinated statements and offensive text by random chance or through malevolent prompts.**
> [!WARNING]
> Please be aware that for development purposes, these experiments use
> experimental Large Language Models (LLMs) not intended for production. They
> can present inaccurate information, hallucinated statements and offensive
> text by random chance or through malevolent prompts.
**Tested on OSX only**

**Peer-reviewed**

**Depends on external API's**

**Under development**

**Experimental**
- **Under development** / **Experimental**
- **Tested on macOS only**
- **Peer-reviewed**
- **Depends on external APIs**

## Introduction

This is an experimental application for semantic search of ONS statistical publications.
It uses LangChain to implement a fairly simple embedding search and QA information retrieval
process. Upon receiving a query, documents are returned as search results
using embedding similarity to score relevance. Additionally, the relevant text is
passed to a locally-hosted Large language Model (LLM), which is prompted to write an
answer to the original question, if it can, using only the information contained within
the documents.
This is an experimental application for semantic search of ONS statistical
publications. It uses LangChain to implement a fairly simple Retrieval
Augmented Generation (RAG) process using embedding search and QA information
retrieval.

For this prototype, the program is run entirely locally; relevant web pages are scraped and the data
stored in `data/bulletins`, the docstore / embedding store that is created is likewise
in local folders and files, and the LLM and all other code is run in memory on your
desktop or laptop.
Upon receiving a query, documents are
returned as search results using embedding similarity to score relevance. Next, the relevant text is passed to a Large Language
Model (LLM), which is prompted to write an answer to the original question, if it can, using only the information contained within the documents.

The search program should be able to run on a system with 16GB of ram. The LLM is
set up to run on CPU at this research stage. Different models from the Hugging Face
repository can be specified for the search and QA functions.
For this prototype, relevant web pages are
scraped and the data stored in `data/bulletins`, the docstore / embedding store
that is created is likewise in local folders and files, and the LLM is either run in memory or accessed through VertexAI.
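In outline, the retrieve-then-generate loop described above looks like the following sketch. This is pure Python with a toy character-frequency "embedding" standing in for the real LangChain embedding model, and the LLM call is stubbed out; it is purely illustrative, not the statschat implementation:

```python
import math

def embed(text: str) -> list[float]:
    # Stub embedding: normalised character-frequency vector. The real app
    # uses a Hugging Face or Vertex AI embedding model via LangChain.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine_distance(a: list[float], b: list[float]) -> float:
    # Vectors are unit-normalised, so distance = 1 - dot product.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def answer(query: str, docs: list[str], k_contexts: int = 3) -> str:
    # 1. Score every document by embedding similarity to the query.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine_distance(embed(d), q))
    # 2. Pass only the top documents to the generative LLM as context.
    context = "\n".join(ranked[:k_contexts])
    prompt = (
        "Answer the question using ONLY the information below.\n"
        f"Context:\n{context}\nQuestion: {query}"
    )
    return prompt  # The real app sends this prompt to the LLM.

docs = ["Inflation rose to 4% in March.", "Employment was stable."]
print(answer("What happened to inflation?", docs, k_contexts=1))
```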

## Installation

@@ -49,106 +49,155 @@ python3.10 -m venv env
source env/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install .
```

> [!NOTE]
> If you are doing development work on `statschat`, you should install the
> package locally as editable with our optional `dev` dependencies:
> ```shell
> python -m pip install -e ".[dev]"
> ```
### Pre-commit actions
This repository contains a configuration of pre-commit hooks. These are language agnostic and focussed on repository security (such as detection of passwords and API keys). If approaching this project as a developer, you are encouraged to install and enable `pre-commits` by running the following in your shell:
1. Install `pre-commit`:
```shell
pip install pre-commit
```
2. Enable `pre-commit`:
This repository contains a configuration of pre-commit hooks. These are
language agnostic and focussed on repository security (such as detection of
passwords and API keys).
```shell
pre-commit install
```
If approaching this project as a developer, you are encouraged to install and
enable `pre-commits` by running the following in your shell:
1. Install `pre-commit`:
```shell
pip install pre-commit
```
2. Enable `pre-commit`:
```shell
pre-commit install
```

Once pre-commits are activated, whenever you commit to this repository a series of checks will be executed. The pre-commits include checking for security keys, large files and unresolved merge conflict headers. The use of active pre-commits are highly encouraged and the given hooks can be expanded with Python or R specific hooks that can automate the code style and linting. For example, the `flake8` and `black` hooks are useful for maintaining consistent Python code formatting.
Once pre-commits are activated, whenever you commit to this repository a series of checks will be executed. The use of active
pre-commits are highly encouraged.
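As an illustration, a minimal `.pre-commit-config.yaml` covering the checks mentioned above might look like the sketch below (hook ids are from the standard `pre-commit-hooks` repository; the revision pin and this repository's actual configuration may differ):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0  # assumed pin; check this repo's actual file
    hooks:
      - id: detect-private-key       # security keys
      - id: check-added-large-files  # large files
      - id: check-merge-conflict     # unresolved merge conflict headers
```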

**NOTE:** Pre-commit hooks execute Python, so it expects a working Python build.
> [!NOTE]
> Pre-commit hooks execute Python, so they expect a working Python build.
## Usage

By default, flask will look for a file called `app.py`, you can also name a specific python program to run.
With `--debug` in play, flask will restart every time it detects a saved change in the underlying
python files.
The first time you run the app, any ML models specified in the code will be downloaded
to your machine. This will use a few GB of data and take a few minutes.
App and search pipeline parameter are stored and can be updated by editing `app_config.toml`.
This main module statschat can be either called directly or deployed as an API (using fastapi).
A lightweight flask front end is implemented separately in a subfolder and relies on the API running.

The first time you instantiate the `Inquirer` class, any ML models specified in the code will be
downloaded to your machine. This will use a few GB of data and take a few
minutes. App and search pipeline parameters are stored and can be updated by
editing `statschat/_config/main.toml`.

We have included three EXAMPLE scraped data files in `data/bulletins` so that
We have included a few EXAMPLE scraped data files in `data/bulletins` so that
the preprocessing and app can be run as a small example system without waiting
on webscraping.

### To webscrape the source documents from ONS
#### We have removed this script, and for the sake of demonstration included some example scrape results so that the process can be continued from the next step below
### With Vertex AI

If you wish to use Google's model API update the model variables in
`statschat/_config/main.toml`:
* to use the question-answering system with Google's PaLM2 API set the
`generative_model_name` parameter to `text-unicorn` or `gemini-pro` (their
name for the model).
* for PaLM2 (Gecko) to create embeddings, set the `embedding_model_name`
parameter to `textembedding-gecko@001`. You may also wish to disable the
removal of near-identical documents in the preprocessing pipeline (line 59,
`statschat/embedding/preprocess.py`), to reduce calls to the embedding API.
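For example, the relevant lines in `statschat/_config/main.toml` would then read something like this sketch (key names per the bullets above; any surrounding sections and other keys are omitted):

```toml
generative_model_name = "gemini-pro"             # or "text-unicorn"
embedding_model_name = "textembedding-gecko@001"
```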

In addition to changing this parameter, you will need a Google Cloud Platform
(GCP) project set up, with the Vertex AI API enabled. You will need to have the
GCP Command Line Interface installed in the machine running this code, logged
in to an account with sufficient permissions to access the API (you may need to
set up [application default credentials](https://cloud.google.com/docs/authentication/provide-credentials-adc#how-to)).
Usually this can be achieved by running:
```shell
# python statschat/webscraping/main.py
gcloud config set project "<PROJECT_ID>"
gcloud auth application-default login
```

### To create a local document store
```shell
python statschat/preprocess.py
```
## Example endpoint commands

### To run the interactive app
1. #### Webscraping the source documents (not included in the public repository, only examples in `data/bulletins`)

```shell
python statschat/webscraping/main.py
```

2. #### Creating a local document store

```shell
flask --debug run
```
or
```shell
python app.py
```
```shell
python statschat/embedding/preprocess.py
```

The flask app is set respond to https requests on port 5000. To use the user UI navigate in your browser to http://localhost:5000.
3. #### Updating an existing local document store with new articles

The API default url would be http://localhost:5000/api. See [API endpoint documentation](docs/api/README.md) for more details (note, this is a work in progress).
```shell
python statschat/embedding/preprocess_update_db.py
```

4. #### Run the interactive Statschat API

### Search engine parameters
```shell
uvicorn fast-api.main_api:app
```

There are some key parameters in `app_config.toml` that we're experimenting with to improve the search results,
and the generated text answer. The current values are initial guesses:
The FastAPI app is set to respond to HTTP requests on port 8000. When running, you can see the docs at http://localhost:8000/docs.
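A response from the running API can then be unpacked with Python's standard library. Note that the endpoint path and response fields in this sketch are assumptions for illustration; check http://localhost:8000/docs for the real schema:

```python
import json

# Against a live instance you might fetch, e.g. with urllib:
#   from urllib.request import urlopen
#   body = urlopen("http://localhost:8000/search?q=inflation").read()
# A canned body stands in here so the snippet is self-contained.
body = b'{"answer": "Inflation rose to 4%.", "references": [{"title": "CPI bulletin", "score": 0.12}]}'

data = json.loads(body)
print(data["answer"])           # the generated answer
for ref in data["references"]:  # the supporting search results
    print(ref["title"], ref["score"])
```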

| Parameter | Current Value | Function |
| --- | --- | --- |
| k_docs | 10 | Maximum number of search results to return |
| similarity_threshold | 1.0 | Cosine distance, a searched document is only returned if it is at least this similar (EQUAL or LOWER) |
| k_contexts | 3 | Number of top documents to pass to generative QA LLM |
5. #### Run the flask web interface

### Alternatively, to run the search evaluation pipeline
```shell
python flask-app/app.py
```
To use the web UI,
navigate in your browser to http://localhost:5000. Note that it requires the API to be running and the endpoint specified in the app.

The StatsChat pipeline is currently evaluated based on small number of test question. The main 'app_config.toml' determines pipeline setting used in evaluation and results are written to `data/model_evaluation` folder. The evaluation script requires that project root (assumed working directory) be added to PYTHONPATH, this is handled through [direnv](https://direnv.net/) and
the `.envrc` file.
6. #### Run the search evaluation pipeline
```shell
python statschat/model_evaluation/evaluation.py
```
The StatsChat pipeline is currently evaluated on a small number of test
questions. The main `app_config.toml` determines the pipeline settings used in
evaluation, and results are written to the `data/model_evaluation` folder.


7. #### Testing
```shell
python -m pytest
```
Preferred unit testing framework is PyTest.

## Testing
### Search engine parameters

Preferred unittesting framework is PyTest:
There are some key parameters in `statschat/_config/main.toml` that we're
experimenting with to improve the search results, and the generated text
answer. The current values are initial guesses:

| Parameter | Current Value | Function |
| --- | --- | --- |
| k_docs | 10 | Maximum number of search results to return |
| similarity_threshold | 2.0 | Cosine distance, a searched document is only returned if it is at least this similar (EQUAL or LOWER) |
| k_contexts | 3 | Number of top documents to pass to generative QA LLM |
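The interaction of these three parameters can be illustrated with a small sketch (pure Python with made-up distances; the actual filtering lives inside the statschat retrieval code):

```python
def filter_results(scored_docs, k_docs=10, similarity_threshold=2.0, k_contexts=3):
    """scored_docs: (doc, cosine_distance) pairs; lower distance = more similar."""
    # Keep only documents at least as similar as the threshold (EQUAL or LOWER).
    kept = [(d, s) for d, s in scored_docs if s <= similarity_threshold]
    # Return at most k_docs search results, best first.
    results = sorted(kept, key=lambda pair: pair[1])[:k_docs]
    # Only the top k_contexts are passed on to the generative QA LLM.
    contexts = [d for d, _ in results[:k_contexts]]
    return results, contexts

scored = [("doc_a", 0.3), ("doc_b", 1.9), ("doc_c", 2.5), ("doc_d", 0.9)]
results, contexts = filter_results(scored, k_docs=10, similarity_threshold=2.0, k_contexts=2)
print(results)   # doc_c dropped: distance 2.5 > 2.0
print(contexts)  # top 2 by similarity
```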
```shell
pytest
```
# Data Science Campus
At the [Data Science Campus](https://datasciencecampus.ons.gov.uk/about-us/) we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at [[email protected]]([email protected]).
At the [Data Science Campus](https://datasciencecampus.ons.gov.uk/about-us/) we
apply data science, and build skills, for public good across the UK and
internationally. Get in touch with the Campus at
[[email protected]]([email protected]).
# License
<!-- Unless stated otherwise, the codebase is released under [the MIT Licence][mit]. -->
The code, unless otherwise stated, is released under [the MIT License][mit].
The documentation for this work is subject to [© Crown copyright][copyright] and is available under the terms of the [Open Government 3.0][ogl] licence.
The documentation for this work is subject to [© Crown copyright][copyright]
and is available under the terms of the [Open Government 3.0][ogl] licence.
[mit]: LICENSE
[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
