Martin Wood authored and Martin Wood committed Aug 22, 2023
1 parent fe4c9e4 commit 48fa25c
Showing 54 changed files with 3,452 additions and 2 deletions.
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Data Science Campus

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
157 changes: 155 additions & 2 deletions README.md
@@ -1,2 +1,155 @@
# statschat-app
Prototype search engine for ONS bulletins
<img src="https://github.com/datasciencecampus/awesome-campus/blob/master/ons_dsc_logo.png">

# `StatsChat`
[![Stability](https://img.shields.io/badge/stability-experimental-orange.svg)](https://github.com/mkenney/software-guides/blob/master/STABILITY-BADGES.md#experimental)
[![codecov](https://codecov.io/gh/datasciencecampus/Statschat/branch/main/graph/badge.svg?token=QALqIC7CDX)](https://codecov.io/gh/datasciencecampus/Statschat)
[![Twitter](https://img.shields.io/twitter/url?label=Follow%20%40DataSciCampus&style=social&url=https%3A%2F%2Ftwitter.com%2FDataSciCampus)](https://twitter.com/DataSciCampus)
[![Shared under the MIT License](https://img.shields.io/badge/license-MIT-green)](https://github.com/datasciencecampus/Statschat/blob/main/LICENSE)
[![Mac-OS compatible](https://shields.io/badge/MacOS--9cf?logo=Apple&style=social)]()

## Code state

**Please be aware that for development purposes, these experiments use experimental Large Language Models (LLMs) not intended for production. They can present inaccurate information, hallucinated statements and offensive text by random chance or through malevolent prompts.**

**Tested on OSX only**

**Peer-reviewed**

**Depends on external APIs**

**Under development**

**Experimental**

## Introduction

This is an experimental application for semantic search of ONS statistical publications.
It uses LangChain to implement a fairly simple embedding search and QA information retrieval
process. Upon receiving a query, documents are returned as search results
using embedding similarity to score relevance. Additionally, the relevant text is
passed to a locally-hosted Large Language Model (LLM), which is prompted to write an
answer to the original question, if it can, using only the information contained within
the documents.
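
The retrieval step described above can be sketched roughly as follows. This is a minimal illustration using hand-made toy vectors and plain Python, not the project's LangChain code; the names `l2_distance` and `rank_documents` and the document ids are hypothetical, and a real embedding model would produce vectors with hundreds of dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, distance) pairs, closest match first.

    A lower distance means the document is more similar to the query.
    """
    scored = [(doc_id, l2_distance(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
toy_docs = {
    "cpi_bulletin": [0.9, 0.1, 0.0],
    "labour_market": [0.1, 0.9, 0.0],
    "gdp_estimate": [0.5, 0.5, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_documents(toy_query, toy_docs)
for doc_id, distance in ranking:
    print(f"{doc_id}: {distance:.3f}")
```

In the app itself, the top-ranked passages would then be handed to the LLM as context for the generated answer.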

For this prototype, the program runs entirely locally: relevant web pages are scraped and the data
stored in `data/bulletins`, the docstore / embedding store that is created is likewise
held in local folders and files, and the LLM and all other code run in memory on your
desktop or laptop.

The search program should be able to run on a system with 16 GB of RAM. The LLM is
set up to run on CPU at this research stage. Different models from the Hugging Face
repository can be specified for the search and QA functions.

## Installation

The project requires specific versions of some packages so it is recommended to
set up a virtual environment. Using venv and pip:

```shell
python3.10 -m venv env
source env/bin/activate

python -m pip install --upgrade pip
python -m pip install -r requirements.txt
```

### Pre-commit actions
This repository contains a configuration of pre-commit hooks. These are language agnostic and focussed on repository security (such as detection of passwords and API keys). If approaching this project as a developer, you are encouraged to install and enable `pre-commits` by running the following in your shell:
1. Install `pre-commit`:

```shell
pip install pre-commit
```
2. Enable `pre-commit`:

```shell
pre-commit install
```

Once pre-commits are activated, whenever you commit to this repository a series of checks will be executed. The pre-commits include checking for security keys, large files and unresolved merge conflict headers. The use of active pre-commits is highly encouraged, and the given hooks can be expanded with Python- or R-specific hooks that automate code style and linting. For example, the `flake8` and `black` hooks are useful for maintaining consistent Python code formatting.

**NOTE:** Pre-commit hooks execute Python, so they expect a working Python build.

## Usage

By default, flask will look for a file called `app.py`; you can also name a specific Python program to run.
With `--debug` in play, flask will restart every time it detects a saved change in the underlying
Python files.
The first time you run the app, any ML models specified in the code will be downloaded
to your machine. This will use a few GB of data and take a few minutes.
App and search pipeline parameters are stored in `app_config.toml` and can be updated by editing that file.

We have included three EXAMPLE scraped data files in `data/bulletins` so that
the preprocessing and app can be run as a small example system without waiting
on webscraping.

### To webscrape the source documents from ONS
#### By default, the script is limited to retrieving 10 articles (`statschat/webscraping/main.py`, line 61); this limit can be edited out to allow the program to run to completion.
```shell
python statschat/webscraping/main.py
```

### To create a local document store
```shell
python statschat/preprocess.py
```

### To run the interactive app



```shell
flask --debug run
```
or
```shell
python app.py
```

The flask app is set to respond to HTTP requests on port 5000. To use the UI, navigate in your browser to http://localhost:5000.

The default API URL is http://localhost:5000/api. See [API endpoint documentation](docs/api/README.md) for more details (note that this is a work in progress).
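
As a rough sketch of calling the API from Python: the `/search` path and the `q` query parameter are taken from `app.py` in this commit, while the helper names `build_search_url` and `ask` and the timeout value are hypothetical. Only the URL is built here; `ask` needs the app to be running locally before it is called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "http://localhost:5000/api"  # default API URL given above


def build_search_url(question, api_root=API_ROOT):
    """Build the GET URL for the search endpoint, URL-encoding the question."""
    return f"{api_root}/search?{urlencode({'q': question})}"


def ask(question, api_root=API_ROOT, timeout=120):
    """Send the question to a running app and return the decoded JSON reply.

    The generous timeout is an assumption: the LLM runs on CPU and can be slow.
    """
    with urlopen(build_search_url(question, api_root), timeout=timeout) as response:
        return json.load(response)


url = build_search_url("What is the latest rate of inflation?")
print(url)
```

With the app running, `ask("...")` should return a dictionary; judging by `make_query` in `app.py`, it carries the keys `answer`, `question`, `references` and `timing`.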


### Search engine parameters

There are some key parameters in `app_config.toml` that we're experimenting with to improve the search results
and the generated text answer. The current values are initial guesses:

| Parameter | Current Value | Function |
| --- | --- | --- |
| k_docs | 10 | Maximum number of search results to return |
| similarity_threshold | 1.0 | Distance threshold (cosine distance): a document is returned only if its score is EQUAL to or LOWER than this value |
| k_contexts | 3 | Number of top documents to pass to generative QA LLM |
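
One way to picture how these three parameters interact is the sketch below. It is an illustration under assumptions rather than the project's implementation: `select_results` is a hypothetical helper, the scores are made up, and applying the threshold before the `k_docs` cut-off is a guess at the ordering.

```python
def select_results(scored_docs, k_docs=10, similarity_threshold=1.0, k_contexts=3):
    """Apply the three parameters to (doc_id, distance) pairs.

    Returns a tuple (search_results, qa_contexts).
    """
    # Lower distance means a better match, so sort ascending
    ranked = sorted(scored_docs, key=lambda pair: pair[1])
    # Keep only documents at or below the threshold (EQUAL or LOWER)
    within_threshold = [pair for pair in ranked if pair[1] <= similarity_threshold]
    # Cap the number of search results shown to the user
    search_results = within_threshold[:k_docs]
    # Only the best few results are passed to the generative QA model
    qa_contexts = search_results[:k_contexts]
    return search_results, qa_contexts


# Made-up scores for five documents
scored = [("a", 0.42), ("b", 1.30), ("c", 0.95), ("d", 0.10), ("e", 1.00)]
results, contexts = select_results(scored)
print([doc_id for doc_id, _ in results])
print([doc_id for doc_id, _ in contexts])
```

Lowering `similarity_threshold` makes the search stricter, while raising `k_contexts` gives the LLM more text to draw on at the cost of a longer prompt.
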
### Alternatively, to run the search evaluation pipeline
The StatsChat pipeline is currently evaluated on a small number of test questions. The main `app_config.toml` determines the pipeline settings used in evaluation, and results are written to the `data/model_evaluation` folder.
```shell
python statschat/model_evaluation/evaluation.py
```
## Testing
The preferred unit testing framework is pytest:
```shell
pytest
```
# Data Science Campus
At the [Data Science Campus](https://datasciencecampus.ons.gov.uk/about-us/) we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at [[email protected]]([email protected]).
# License
<!-- Unless stated otherwise, the codebase is released under [the MIT Licence][mit]. -->
The code, unless otherwise stated, is released under [the MIT License][mit].
The documentation for this work is subject to [© Crown copyright][copyright] and is available under the terms of the [Open Government 3.0][ogl] licence.
[mit]: LICENSE
[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
[ogl]: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
160 changes: 160 additions & 0 deletions app.py
@@ -0,0 +1,160 @@
import toml
import logging

import pandas as pd

from datetime import datetime
from flask import Flask, render_template, request, jsonify
from flask.logging import default_handler
from markupsafe import escape
from werkzeug.datastructures import MultiDict

from statschat.llm import Inquirer
from statschat.utils import deduplicator
from statschat.latest_flag_helpers import get_latest_flag, time_decay


# Config file to load
CONFIG = toml.load("app_config.toml")

# define session_id that will be used for log file and feedback
SESSION_NAME = f"statschat_app_{format(datetime.now(), '%Y_%m_%d_%H:%M')}"

logger = logging.getLogger(__name__)
log_fmt = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
logging.basicConfig(
    level=logging.INFO,
    format=log_fmt,
    filename=f"log/{SESSION_NAME}.log",
    filemode="a",
)
logger.addHandler(default_handler)


# define global variable to link last answer to ratings for feedback capture
last_answer = {}
feedback_file = f"data/feedback/{SESSION_NAME}.csv"
pd.DataFrame(
    {"question": [], "answer": [], "confidence": [], "timing": [], "feedback": []}
).to_csv(feedback_file, index=False)

# initiate Statschat AI and start the app
searcher = Inquirer(**CONFIG["db"], **CONFIG["search"], logger=logger)


def make_query(question: str, latest_max: bool = True) -> dict:
    """
    Utility, wraps code for querying the search engine, and then the summarizer.
    Also handles storing the last answer made for feedback purposes.
    Args:
        question (str): The user query.
        latest_max (bool, optional): Whether to weight in favour of recent releases.
            Defaults to True.
    Returns:
        dict: answer and supporting documents returned.
    """
    now = datetime.now()
    # TODO: pass the advanced filters to the searcher
    # TODO: move deduplication keys to config['app']
    docs = deduplicator(
        searcher.similarity_search(question),
        keys=["section", "title", "date"],
    )
    if len(docs) > 0:
        if latest_max:
            for doc in docs:
                # Divided by decay term because similarity scores are inverted
                # Original score is L2 distance; lower is better
                # https://python.langchain.com/docs/integrations/vectorstores/faiss
                doc["weighted_score"] = doc["score"] / time_decay(
                    doc["date"], latest=latest_max
                )
            docs.sort(key=lambda doc: doc["weighted_score"])
            logger.info(
                f"Weighted and reordered docs to latest with decay = {latest_max}"
            )

        answer = searcher.query_texts(question, docs)
    else:
        answer = "NA"

    results = {
        "answer": answer,
        "question": question,
        "references": docs,
        "timing": (datetime.now() - now).total_seconds(),
    }
    logger.info(f"Received answer: {results['answer']}")

    # Handles storing last answer for feedback purposes
    global last_answer
    last_answer = results.copy()

    return results


app = Flask(__name__)


@app.route("/")
def home():
    advanced = MultiDict()
    return render_template("statschat.html", advanced=advanced, question="?")


@app.route("/advanced")
def advanced():
    advanced = MultiDict(
        [("latest-publication", "Off"), ("bulletins", "on"), ("articles", "on")]
    )
    return render_template("statschat.html", advanced=advanced, question="?")


@app.route("/search", methods=["GET", "POST"])
def search():
    # Default to "" so that a missing `q` is not escaped to the truthy string "None"
    question = escape(request.args.get("q", ""))
    advanced, latest_max = get_latest_flag(request.args, CONFIG["app"]["latest_max"])
    logger.info(f"Search query: {question}")
    if question:
        results = make_query(question, latest_max)
        return render_template(
            "statschat.html", advanced=advanced, question=question, results=results
        )
    else:
        return render_template("statschat.html", advanced=advanced, question="?")


@app.route("/record_rating", methods=["POST"])
def record_rating():
    rating = request.form["rating"]
    logger.info(f"Recorded answer rating: {rating}")
    last_answer["rating"] = rating
    pd.DataFrame([last_answer]).to_csv(
        feedback_file, mode="a", index=False, header=False
    )
    return "", 204  # Return empty response with status code 204


@app.route("/api/search", methods=["GET", "POST"])
def api_search():
    # Default to "" so that a missing `q` yields the 400 response below
    question = escape(request.args.get("q", ""))
    _, latest_max = get_latest_flag(request.args, CONFIG["app"]["latest_max"])
    logger.info(f"Search query: {question}")
    if question:
        results = make_query(question, latest_max)
        logger.info(f"Received {len(results['references'])} documents.")
        return jsonify(results), 200
    else:
        return jsonify({"error": "Empty question"}), 400


@app.route("/api/about", methods=["GET", "POST"])
def about():
    info = {"version": "ONS StatsChat API v0.1", "contact": "[email protected]"}
    return jsonify(info)


if __name__ == "__main__":
    app.run(debug=False, host="0.0.0.0")
30 changes: 30 additions & 0 deletions app_config.toml
@@ -0,0 +1,30 @@
[db]
faiss_db_root = "db_lc"
embedding_model = "sentence-transformers/all-mpnet-base-v2" # "sentence-transformers/paraphrase-MiniLM-L3-v2"

[setup]
directory = "data/bulletins"
split_directory = "data/full_bulletins_split"
split_length = 1000
split_overlap = 50

[search]
model_name_or_path = "google/flan-t5-large" # "lmsys/fastchat-t5-3b-v1.0" "google/flan-t5-large" "google/flan-ul2"
k_docs = 10
k_contexts = 3
similarity_threshold = 1.0 # Threshold score below which a document is returned in a search
return_source_documents = false
llm_summarize_temperature = 0.0
llm_generate_temperature = 0.0

[app]
latest_max = 2 # Takes value int >= 0, commonly 0, 1 or 2

[NYI]
prompt_text = """Synthesize a comprehensive answer from the following text
for the given question. Provide a clear and concise response, that summarizes
the key points and information presented in the text. Your answer should be
in your own words and be no longer than 50 words. If the question cannot be
confidently answered from the information in the text, or if the question is
not related to the text, reply 'NA'. \n\n
Related text: {summaries} \n\n Question: {question} \n\n Answer:"""
16 changes: 16 additions & 0 deletions codecov.yml
@@ -0,0 +1,16 @@
comment: true

coverage:
  status:
    project:
      default:
        target: auto
        threshold: 80%
        informational: true
    patch:
      default:
        target: auto
        threshold: 20%
        informational: true

ignore: ["tests"]
Empty file added data/.gitkeep
Empty file.
Empty file added data/bulletins/.gitkeep
Empty file.