datasciencecampus · Aug 22, 2023 · 48fa25c
1 parent fe4c9e4
commit 48fa25c
Showing 54 changed files with 3,452 additions and 2 deletions.
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Data Science Campus
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -1,2 +1,155 @@
-# statschat-app
-Prototype search engine for ONS bulletins
+<img src="https://github.com/datasciencecampus/awesome-campus/blob/master/ons_dsc_logo.png">
+
+# `StatsChat`
+[![Stability](https://img.shields.io/badge/stability-experimental-orange.svg)](https://github.com/mkenney/software-guides/blob/master/STABILITY-BADGES.md#experimental)
+[![codecov](https://codecov.io/gh/datasciencecampus/Statschat/branch/main/graph/badge.svg?token=QALqIC7CDX)](https://codecov.io/gh/datasciencecampus/Statschat)
+[![Twitter](https://img.shields.io/twitter/url?label=Follow%20%40DataSciCampus&style=social&url=https%3A%2F%2Ftwitter.com%2FDataSciCampus)](https://twitter.com/DataSciCampus)
+[![Shared under the MIT License](https://img.shields.io/badge/license-MIT-green)](https://github.com/datasciencecampus/Statschat/blob/main/LICENSE)
+[![Mac-OS compatible](https://shields.io/badge/MacOS--9cf?logo=Apple&style=social)]()
+
+## Code state
+
+**Please be aware that, for development purposes, these experiments use experimental Large Language Models (LLMs) not intended for production. They can present inaccurate information, hallucinated statements and offensive text by random chance or through malevolent prompts.**
+
+**Tested on OSX only**
+
+**Peer-reviewed**
+
+**Depends on external APIs**
+
+**Under development**
+
+**Experimental**
+
+## Introduction
+
+This is an experimental application for semantic search of ONS statistical publications.
+It uses LangChain to implement a fairly simple embedding search and QA information retrieval
+process.  Upon receiving a query, documents are returned as search results
+using embedding similarity to score relevance.  Additionally, the relevant text is
+passed to a locally-hosted Large Language Model (LLM), which is prompted to write an
+answer to the original question, if it can, using only the information contained within
+the documents.
+
+For this prototype, the program is run entirely locally; relevant web pages are scraped and the data
+stored in `data/bulletins`, the docstore / embedding store that is created is likewise
+in local folders and files, and the LLM and all other code are run in memory on your
+desktop or laptop.
+
+The search program should be able to run on a system with 16 GB of RAM.  The LLM is
+set up to run on CPU at this research stage.  Different models from the Hugging Face
+repository can be specified for the search and QA functions.
+
+## Installation
+
+The project requires specific versions of some packages so it is recommended to
+set up a virtual environment.  Using venv and pip:
+
+```shell
+python3.10 -m venv env
+source env/bin/activate
+
+python -m pip install --upgrade pip
+python -m pip install -r requirements.txt
+```
+
+### Pre-commit actions
+This repository contains a configuration of pre-commit hooks. These are language agnostic and focussed on repository security (such as detection of passwords and API keys). If approaching this project as a developer, you are encouraged to install and enable `pre-commits` by running the following in your shell:
+   1. Install `pre-commit`:
+
+      ```shell
+      pip install pre-commit
+      ```
+   2. Enable `pre-commit`:
+
+      ```shell
+      pre-commit install
+      ```
+
+Once pre-commits are activated, a series of checks is executed whenever you commit to this repository. The pre-commits include checks for security keys, large files and unresolved merge conflict headers. The use of active pre-commits is highly encouraged, and the given hooks can be expanded with Python or R specific hooks that automate code style and linting. For example, the `flake8` and `black` hooks are useful for maintaining consistent Python code formatting.
+
+**NOTE:** Pre-commit hooks execute Python, so they expect a working Python build.
+
+## Usage
+
+By default, flask looks for a file called `app.py`; you can also name a specific Python program to run.
+With `--debug` in play, flask restarts every time it detects a saved change in the underlying
+Python files.
+The first time you run the app, any ML models specified in the code will be downloaded
+to your machine.  This will use a few GB of data and take a few minutes.
+App and search pipeline parameters are stored in `app_config.toml` and can be updated by editing that file.
+
+We have included three EXAMPLE scraped data files in `data/bulletins` so that
+the preprocessing and app can be run as a small example system without waiting
+on webscraping.
+
+### To webscrape the source documents from ONS
+By default, the script is limited to retrieving 10 articles (`statschat/webscraping/main.py`, line 61); this limit can be edited out to allow the program to run to completion.
+```shell
+python statschat/webscraping/main.py
+```
+
+### To create a local document store
+```shell
+python statschat/preprocess.py
+```
+
+### To run the interactive app
+
+
+
+```shell
+flask --debug run
+```
+or
+```shell
+python app.py
+```
+
+The flask app is set to respond to HTTP requests on port 5000. To use the UI, navigate in your browser to http://localhost:5000.
+
+The default API URL is http://localhost:5000/api. See [API endpoint documentation](docs/api/README.md) for more details (note, this is a work in progress).
+
+
+### Search engine parameters
+
+There are some key parameters in `app_config.toml` that we're experimenting with to improve the search results,
+and the generated text answer.  The current values are initial guesses:
+
+| Parameter | Current Value | Function |
+| --- | --- | --- |
+| k_docs | 10 | Maximum number of search results to return |
+| similarity_threshold | 1.0 | Cosine distance threshold; a searched document is only returned if its distance is EQUAL to or LOWER than this value |
+| k_contexts | 3 | Number of top documents to pass to generative QA LLM |
+
+### Alternatively, to run the search evaluation pipeline
+
+The StatsChat pipeline is currently evaluated on a small number of test questions. The main `app_config.toml` determines the pipeline settings used in evaluation, and results are written to the `data/model_evaluation` folder.
+
+```shell
+python statschat/model_evaluation/evaluation.py
+```
+
+
+## Testing
+
+The preferred unit testing framework is pytest:
+
+```shell
+pytest
+```
+
+# Data Science Campus
+At the [Data Science Campus](https://datasciencecampus.ons.gov.uk/about-us/) we apply data science, and build skills, for public good across the UK and internationally. Get in touch with the Campus at [[email protected]]([email protected]).
+
+# License
+
+<!-- Unless stated otherwise, the codebase is released under [the MIT Licence][mit]. -->
+
+The code, unless otherwise stated, is released under [the MIT License][mit].
+
+The documentation for this work is subject to [© Crown copyright][copyright] and is available under the terms of the [Open Government Licence v3.0][ogl].
+
+[mit]: LICENSE
+[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
+[ogl]: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
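A note on the README's "Search engine parameters" table: the sketch below illustrates one way `k_docs`, `similarity_threshold` and `k_contexts` could interact, assuming scores are distances where lower means more similar. The helper `select_documents` and the sample documents are hypothetical; the actual `Inquirer` logic in `statschat.llm` is not shown in this section and may differ.

```python
from typing import Dict, List, Tuple


def select_documents(
    scored_docs: List[Dict],
    k_docs: int = 10,
    similarity_threshold: float = 1.0,
    k_contexts: int = 3,
) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical helper: return (search_results, llm_contexts)."""
    # Lower distance means more similar, so sort ascending.
    ranked = sorted(scored_docs, key=lambda d: d["score"])
    # Keep documents whose distance is equal to or lower than the threshold.
    kept = [d for d in ranked if d["score"] <= similarity_threshold]
    # Cap the number of search results, then pass only the top few to the LLM.
    results = kept[:k_docs]
    contexts = results[:k_contexts]
    return results, contexts


# Made-up scores, for illustration only.
docs = [
    {"title": "A", "score": 0.4},
    {"title": "B", "score": 1.3},
    {"title": "C", "score": 0.9},
    {"title": "D", "score": 0.2},
]
results, contexts = select_documents(
    docs, k_docs=10, similarity_threshold=1.0, k_contexts=2
)
print([d["title"] for d in results])   # ['D', 'A', 'C']
print([d["title"] for d in contexts])  # ['D', 'A']
```

Document "B" is dropped because its distance exceeds the threshold, and only the two closest documents would be handed to the generative QA step.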
diff --git a/app.py b/app.py
@@ -0,0 +1,160 @@
+import toml
+import logging
+
+import pandas as pd
+
+from datetime import datetime
+from flask import Flask, render_template, request, jsonify
+from flask.logging import default_handler
+from markupsafe import escape
+from werkzeug.datastructures import MultiDict
+
+from statschat.llm import Inquirer
+from statschat.utils import deduplicator
+from statschat.latest_flag_helpers import get_latest_flag, time_decay
+
+
+# Config file to load
+CONFIG = toml.load("app_config.toml")
+
+# define session_id that will be used for log file and feedback
+SESSION_NAME = f"statschat_app_{format(datetime.now(), '%Y_%m_%d_%H:%M')}"
+
+logger = logging.getLogger(__name__)
+log_fmt = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+logging.basicConfig(
+    level=logging.INFO,
+    format=log_fmt,
+    filename=f"log/{SESSION_NAME}.log",
+    filemode="a",
+)
+logger.addHandler(default_handler)
+
+
+# define global variable to link last answer to ratings for feedback capture
+last_answer = {}
+feedback_file = f"data/feedback/{SESSION_NAME}.csv"
+pd.DataFrame(
+    {"question": [], "answer": [], "confidence": [], "timing": [], "feedback": []}
+).to_csv(feedback_file, index=False)
+
+# initiate Statschat AI and start the app
+searcher = Inquirer(**CONFIG["db"], **CONFIG["search"], logger=logger)
+
+
+def make_query(question: str, latest_max: bool = True) -> dict:
+    """
+    Utility, wraps code for querying the search engine, and then the summarizer.
+    Also handles storing the last answer made for feedback purposes.
+
+    Args:
+        question (str): The user query.
+        latest_max (bool, optional): Whether to weight in favour of recent releases.
+            Defaults to True.
+
+    Returns:
+        dict: answer and supporting documents returned.
+    """
+    now = datetime.now()
+    # TODO: pass the advanced filters to the searcher
+    # TODO: move deduplication keys to config['app']
+    docs = deduplicator(
+        searcher.similarity_search(question),
+        keys=["section", "title", "date"],
+    )
+    if len(docs) > 0:
+        if latest_max:
+            for doc in docs:
+                # Divided by decay term because similarity scores are inverted
+                # Original score is L2 distance; lower is better
+                #  https://python.langchain.com/docs/integrations/vectorstores/faiss
+                doc["weighted_score"] = doc["score"] / time_decay(
+                    doc["date"], latest=latest_max
+                )
+            docs.sort(key=lambda doc: doc["weighted_score"])
+            logger.info(
+                f"Weighted and reordered docs to latest with decay = {latest_max}"
+            )
+
+        answer = searcher.query_texts(question, docs)
+    else:
+        answer = "NA"
+
+    results = {
+        "answer": answer,
+        "question": question,
+        "references": docs,
+        "timing": (datetime.now() - now).total_seconds(),
+    }
+    logger.info(f"Received answer: {results['answer']}")
+
+    # Handles storing last answer for feedback purposes
+    global last_answer
+    last_answer = results.copy()
+
+    return results
+
+
+app = Flask(__name__)
+
+
+@app.route("/")
+def home():
+    advanced = MultiDict()
+    return render_template("statschat.html", advanced=advanced, question="?")
+
+
+@app.route("/advanced")
+def advanced():
+    advanced = MultiDict(
+        [("latest-publication", "Off"), ("bulletins", "on"), ("articles", "on")]
+    )
+    return render_template("statschat.html", advanced=advanced, question="?")
+
+
+@app.route("/search", methods=["GET", "POST"])
+def search():
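+    # Default to "" so a missing `q` is treated as an empty question;
+    # escape(None) would otherwise produce the truthy string "None".
+    question = escape(request.args.get("q", ""))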
+    advanced, latest_max = get_latest_flag(request.args, CONFIG["app"]["latest_max"])
+    logger.info(f"Search query: {question}")
+    if question:
+        results = make_query(question, latest_max)
+        return render_template(
+            "statschat.html", advanced=advanced, question=question, results=results
+        )
+    else:
+        return render_template("statschat.html", advanced=advanced, question="?")
+
+
+@app.route("/record_rating", methods=["POST"])
+def record_rating():
+    rating = request.form["rating"]
+    logger.info(f"Recorded answer rating: {rating}")
+    last_answer["rating"] = rating
+    pd.DataFrame([last_answer]).to_csv(
+        feedback_file, mode="a", index=False, header=False
+    )
+    return "", 204  # Return empty response with status code 204
+
+
+@app.route("/api/search", methods=["GET", "POST"])
+def api_search():
+    # Default to "" so a missing `q` returns the 400 "Empty question" response.
+    question = escape(request.args.get("q", ""))
+    _, latest_max = get_latest_flag(request.args, CONFIG["app"]["latest_max"])
+    logger.info(f"Search query: {question}")
+    if question:
+        results = make_query(question, latest_max)
+        logger.info(f"Received {len(results['references'])} documents.")
+        return jsonify(results), 200
+    else:
+        return jsonify({"error": "Empty question"}), 400
+
+
+@app.route("/api/about", methods=["GET", "POST"])
+def about():
+    info = {"version": "ONS StatsChat API v0.1", "contact": "[email protected]"}
+    return jsonify(info)
+
+
+if __name__ == "__main__":
+    app.run(debug=False, host="0.0.0.0")
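A note on the re-ranking step in `make_query`: the sketch below shows why dividing a distance score by a decay term favours recent documents. `example_time_decay` is a hypothetical stand-in; the real `time_decay` in `statschat.latest_flag_helpers` is not part of this section, and its formula and date format (ISO strings are assumed here) may differ.

```python
from datetime import date


def example_time_decay(doc_date: str, latest: int, today: date) -> float:
    """Hypothetical decay in (0, 1]: 1.0 for a document dated today, smaller for older ones."""
    age_years = max((today - date.fromisoformat(doc_date)).days, 0) / 365.25
    return 1.0 / (1.0 + age_years) ** latest


def reweight(docs: list, latest: int, today: date) -> list:
    """Divide each distance by the decay term, then sort ascending (lower is better)."""
    for doc in docs:
        decay = example_time_decay(doc["date"], latest, today)
        # A smaller decay (older document) inflates the distance.
        doc["weighted_score"] = doc["score"] / decay
    return sorted(docs, key=lambda d: d["weighted_score"])


# Made-up documents, for illustration only.
docs = [
    {"title": "older, closer match", "score": 0.50, "date": "2020-01-01"},
    {"title": "newer, weaker match", "score": 0.60, "date": "2023-01-01"},
]
ranked = reweight(docs, latest=1, today=date(2023, 8, 22))
print([d["title"] for d in ranked])
```

With `latest=1` the newer document moves ahead despite its larger raw distance; with `latest=0` the decay term is 1 for every document and the original ordering is kept.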
diff --git a/app_config.toml b/app_config.toml
@@ -0,0 +1,30 @@
+[db]
+faiss_db_root = "db_lc"
+embedding_model = "sentence-transformers/all-mpnet-base-v2" # "sentence-transformers/paraphrase-MiniLM-L3-v2"
+
+[setup]
+directory = "data/bulletins"
+split_directory = "data/full_bulletins_split"
+split_length = 1000
+split_overlap = 50
+
+[search]
+model_name_or_path = "google/flan-t5-large" # "lmsys/fastchat-t5-3b-v1.0" "google/flan-t5-large" "google/flan-ul2"
+k_docs = 10
+k_contexts = 3
+similarity_threshold = 1.0     # Threshold score below which a document is returned in a search
+return_source_documents = false
+llm_summarize_temperature = 0.0
+llm_generate_temperature = 0.0
+
+[app]
+latest_max = 2    # Takes value int >= 0, commonly 0, 1 or 2
+
+[NYI]
+prompt_text = """Synthesize a comprehensive answer from the following text
+        for the given question. Provide a clear and concise response, that summarizes
+        the key points and information presented in the text. Your answer should be
+        in your own words and be no longer than 50 words. If the question cannot be
+        confidently answered from the information in the text, or if the question is
+        not related to the text, reply 'NA'. \n\n
+        Related text: {summaries} \n\n Question: {question} \n\n Answer:"""
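A note on the `[setup]` values: the sketch below illustrates what `split_length = 1000` and `split_overlap = 50` could mean if they are interpreted as character counts, which is an assumption. `split_with_overlap` is a hypothetical, simplified splitter; the splitter actually used by `statschat/preprocess.py` is not shown in this section.

```python
try:
    import tomllib as toml_parser  # standard library in Python 3.11+
except ModuleNotFoundError:
    import toml as toml_parser  # third-party package imported by app.py

# Fragment copied from the [setup] table of app_config.toml.
CONFIG_TEXT = """
[setup]
split_length = 1000
split_overlap = 50
"""


def split_with_overlap(text: str, length: int, overlap: int) -> list:
    """Hypothetical character-window splitter; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap
    # Stop before a start position that would yield only already-covered text.
    last_start = max(len(text) - overlap, 1)
    return [text[i : i + length] for i in range(0, last_start, step)]


setup = toml_parser.loads(CONFIG_TEXT)["setup"]
text = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = split_with_overlap(text, setup["split_length"], setup["split_overlap"])
print([len(c) for c in chunks])  # [1000, 1000, 600]
```

Each chunk starts 950 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.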
diff --git a/codecov.yml b/codecov.yml
@@ -0,0 +1,16 @@
+comment: true
+
+coverage:
+  status:
+    project:
+      default:
+        target: auto
+        threshold: 80%
+        informational: true
+    patch:
+      default:
+        target: auto
+        threshold: 20%
+        informational: true
+
+ignore: ["tests"]
diff --git a/data/.gitkeep b/data/.gitkeep
diff --git a/data/bulletins/.gitkeep b/data/bulletins/.gitkeep