LLM Trust Lens - Open Intent Classification


1. Overview

LLM Trust Lens - Open Intent Classification is a pipeline for evaluating how well different methods (such as LLMs) perform on the task of "Open Intent Classification" across multiple datasets.

What is "Open Intent Classification"?

There are 2 ways to evaluate open intent classification:

  1. Binary classification of open-intent/oos/unknown class vs 1 known class (grouped from all known classes)
  2. Multi-class Classification of open-intent/oos/unknown class vs individual known classes

Project Team Members

2. Key Features

  • Multi-Model Support: Evaluate both local models (via Ollama) and API-based models (Nebius, Google Gemini)
  • Flexible Prompt Scenarios: Support for both zero-shot and few-shot prompt scenarios
  • Multiple Datasets: Built-in support for Banking77, StackOverflow, and CLINC150OOS TSV datasets (Source: 2021 Adaptive Decision Boundary Clustering GitHub repo). For new datasets, bring them into the pipeline!
  • Multiple Dataset Formats: Support for CSV, TSV and JSON file formats
  • Configurable Experiments: YAML-based configuration system for easy experiment setup
  • Traceable Results: Generate LLM predictions, classification metrics and confusion matrix files for evaluation
  • Hybrid Embedding and Non-Embedding Approach: Explored 2 pipelines with a hybrid approach (currently in notebooks)

3. Setup

  1. Clone the Repository
# If you have not done so, update your Ubuntu packages and install git
sudo apt update && sudo apt install git -y

git clone https://github.com/KaiquanMah/llm-trust-lens.git
cd llm-trust-lens
  2. Create a Virtual Environment (Recommended)
# If you have not done so, install python, pip and venv on your Ubuntu machine
sudo apt install -y python3 python3-pip python3-venv

python3 -m venv venv
source venv/bin/activate    # On Windows use `venv\Scripts\activate`
  3. Install Dependencies: install Ollama, then install the required Python packages using the requirements.txt file
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Upgrade pip
pip install --upgrade pip
# Install python dependencies to run the pipeline
pip install -r requirements.txt
  4. Test that Ollama has been installed successfully
# Check Ollama version
ollama --version
# As of August 2025: ollama version is 0.9.6

# Check whether the Ollama server process is running
ps aux | grep ollama
# codespa+    1425  0.0  0.0   7080  2048 pts/0    S+   14:21   0:00 grep --color=auto ollama
# If only the grep process appears (as above), the Ollama server is not yet running; start it with `ollama serve`

4. Environment Configuration

To use API-based models from providers like Nebius or Google, you must configure your API keys in an environment file.

Create a .env file in the project root and add your API keys. These variables are loaded by the pipeline to authenticate with the respective API services.

NEBIUS_API_KEY="your_nebius_api_key_here"
GOOGLE_API_KEY="your_google_api_key_here"
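
As a minimal sketch of how these keys can be read before creating an API client (assuming the python-dotenv package is available, which is not confirmed here):

import os
from dotenv import load_dotenv           # assumption: python-dotenv is installed

load_dotenv()                             # read the .env file from the project root
nebius_api_key = os.getenv("NEBIUS_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")

if not (nebius_api_key or google_api_key):
    raise RuntimeError("No API key found - check your .env file")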

5. Usage

  • Non-embedding methods: The experiment_*.py pipeline files currently support non-embedding methods (zero-shot prompt and few-shot prompt)
  • Embedding methods: For embedding methods (finetune BERT, then run Adaptive Decision Boundary Clustering or a Variational Autoencoder), the team is still exploring these approaches. Please see the workings in the *.ipynb notebooks referenced below.

5.1. Non-Embedding Methods (Zero-Shot Prompt and Few-Shot Prompt)

  • The terminal commands shown below should be run from the root directory of the project
  • You can execute different experiments by using the appropriate experiment_*.py file and experiment configuration file
  • If you wish to check out the terminal workings and printouts during each pipeline run, please visit the terminal_workings folder

5.1.1. Common Steps for Ollama/Local Model and API Model Experiments

  1. Navigate to the llm-trust-lens folder
  2. Activate the venv virtual environment containing the required Python libraries to run the pipeline
    python3 -m venv venv        # only needed if the virtual environment does not exist yet
    source venv/bin/activate
  3. Dataset: Use an existing CSV/TSV/JSON dataset or bring new datasets into the data folder
  4. idx2label: Use an existing idx2label.csv (mapping class indexes to labels) or create a new idx2label.csv in the [respective data folder](https://github.com/KaiquanMah/llm-trust-lens/tree/main/data) (a small sketch of creating this mapping follows this list)
  5. Dataset yaml: Use an existing dataset yaml file or create a new dataset yaml file in the dataset yaml folder
  6. Experiment yaml: Use an existing experiment yaml file or create a new experiment yaml file in the experiment yaml folder
    • We recommend creating separate experiment yaml files so you can trace back each experiment's configuration (e.g. Ollama vs API, the model used, zero-shot vs few-shot, threshold test or not)
  7. Prompt: Use an existing zero-shot or few-shot prompt, or create a new prompt.txt in the prompts folder
  8. Few-shot Prompt Examples: If you wish to use the few-shot prompt method, please use an existing few-shot examples file, or create a new few-shot examples txt file in the few_shot_examples folder
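
For illustration, a minimal sketch of building an idx2label.csv from a training TSV is shown below. The column names ("label", "idx") are assumptions; follow the "analyse-results-zeroshot-fewshot, create-idx2label.ipynb" notebook for the exact format the pipeline expects.

import pandas as pd

# Illustrative only - the "label" column name is an assumption; see the create-idx2label notebook for the exact format
train = pd.read_csv("data/banking/train.tsv", sep="\t")
labels = sorted(train["label"].unique())
idx2label = pd.DataFrame({"idx": range(len(labels)), "label": labels})
idx2label.to_csv("data/banking/banking77_idx2label.csv", index=False)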

5.1.2. Run Ollama/Local Model Experiments

  • For Ollama models, we ran the pipeline successfully for 6 models
    llama3.2:3b
    qwen3:8b (Mixture-of-Experts LLM)
    gemma3:4b-it-qat (Instruction-Following and Quantised LLM)
    mistral:7b (General-Purpose LLM)
    tulu3:8b (Instruction-Following LLM)
    deepseek-r1:7b (Reasoning LLM)
    
    • We expect the pipeline to support other models published on Ollama
    • To explore Ollama models you can use, please visit Ollama's model directory
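
As a rough, simplified illustration of what one zero-shot classification call to a local Ollama model looks like (using the ollama Python package; the real pipeline in src/experiment_ollama.py adds config handling, pydantic enums, retries and result files, so treat this only as a sketch):

# Requires `pip install ollama` and pulling the model first with `ollama pull llama3.2:3b`
import ollama

prompt = open("prompts/zeroshot_prompt_without_oos_in_intentlist.txt").read()
question = "How do I activate my new card?"    # illustrative input sentence

response = ollama.chat(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": question},
    ],
)
print(response["message"]["content"])          # predicted intent label (or 'oos')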

5.1.2.1. Zero-Shot llama3-2-3b on Banking77 dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_zeroshot_banking77.yaml

5.1.2.2. Zero-Shot gemma-3-4b-it-qat on Banking77 dataset

python src/experiment_ollama.py --config config/experiment/ollama_gemma3_4b-it-qat_zeroshot_banking77.yaml

5.1.2.3. Few-Shot llama3-2-3b on Banking77 dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_fewshot_banking77.yaml

5.1.2.4. Few-Shot llama3-2-3b on Stackoverflow dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_fewshot_stackoverflow.yaml

5.1.2.5. Few-Shot llama3-2-3b on CLINC150OOS dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_fewshot_clinc150oos.yaml

5.1.3. Run API Model Experiments

  • For API models, our pipeline currently supports individual calls to 2 API providers (where we had some credits)
    Nebius
    Google
    
  • To understand how to use the Nebius batch API, please visit notebooks 01l6 to 01l9*, where we prepared inputs for the batch API, called it, downloaded the results, then stitched the results back onto the original dataset to produce the output expected for further analysis
  • To integrate with other model providers, you will need to
    • Create a <model_provider>_utils.py file using nebius_utils.py as a template. This file covers initialising your API client, retry config, and how to work with messages
    • Add a class to experiment_api.py. The class should have 2 basic functions: initialize and predict (see the sketch after this list)
  • For all API models, please remember to specify your model_provider, model_name and configuration in the experiment yaml file described in section 5.1.1 Common Steps - Step 6
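
The rough shape such a class could take is sketched below. Only the initialize/predict contract comes from the description above; the class name, config keys and bodies are illustrative, so mirror the existing classes in experiment_api.py rather than copying this verbatim.

# Illustrative sketch only - adapt to the existing classes in experiment_api.py
class MyProviderModel:
    def initialize(self, config: dict) -> None:
        # Build the provider's API client from the experiment yaml config
        # (API key, model name, retry settings), as done in <model_provider>_utils.py
        self.model_name = config["model_name"]    # hypothetical config key
        self.client = None                        # replace with the provider's client object

    def predict(self, prompt: str, question: str) -> str:
        # Send one prompt + input sentence to the API and return the predicted intent label (or 'oos')
        raise NotImplementedError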

5.1.3.1. Few-Shot Nebius Qwen API on Banking77 dataset

python src/experiment_api.py --config config/experiment/api_nebius_qwen3-30b-a3b_fewshot_banking77.yaml

5.1.3.2. Few-Shot Google Gemini API on Banking77 dataset

python src/experiment_api.py --config config/experiment/api_google_gemini-2.5-flash-preview-05-20_fewshot_banking77.yaml

5.2. Hybrid Embedding Method then Non-Embedding Method

  • Please visit the notebook in the hybrid_embedding_nonembedding folder for the workings
  • In the notebook, we
    • Finetuned BERT only on known classes to
      • Adapt BERT to the dataset's domain
      • Move sentences from the same class closer, and move sentences from different classes further apart, to give "higher quality embeddings", which can be used to better differentiate between different classes
    • Used the finetuned BERT to generate embeddings
    • Trained a Variational Autoencoder (VAE) on binary classification (open class vs known class)
    • Stored embeddings in a FAISS (Facebook AI Similarity Search) vector store: IndexFlatIP (inner-product search, which is equivalent to cosine similarity once the embeddings are L2-normalised); a condensed retrieval sketch follows this list
    • Retrieved the 5 nearest sentence examples and their classes for each input sentence's few-shot prompt
    • Saved a pickle file containing a list of row indexes for sentences classified by the VAE as the known class
    • Ran 2 separate pipelines
      1. (After finetuned BERT -> Trained VAE -> ) Fewshot prompt on known subset of sentences (classified by VAE) only
      2. (After finetuned BERT -> ) Fewshot prompt on full dataset
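
A condensed, self-contained sketch of that retrieval step is below. The random arrays stand in for the finetuned-BERT embeddings, and the variable names are illustrative rather than taken from the notebook.

import faiss
import numpy as np

# Toy stand-ins for finetuned-BERT embeddings (float32): 1000 candidate examples, 1 query sentence
rng = np.random.default_rng(0)
example_embeddings = rng.standard_normal((1000, 768)).astype("float32")
query_embedding = rng.standard_normal((1, 768)).astype("float32")

faiss.normalize_L2(example_embeddings)    # after L2-normalisation, inner product == cosine similarity
faiss.normalize_L2(query_embedding)

index = faiss.IndexFlatIP(example_embeddings.shape[1])
index.add(example_embeddings)

scores, ids = index.search(query_embedding, 5)    # 5 nearest examples for the few-shot prompt
print(ids[0], scores[0])                          # row indexes and similarity scores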

6. Folder Structure

.
├── LICENSE
├── README.md
├── .env                               # File containing API keys. Follow the "Setup" section to create your own .env file
├── config
│   ├── dataset                        # Dataset config. Naming convention: <dataset_name>.yaml
│   │   ├── banking77.yaml             # dataset config for banking77 TSV files (to test support for TSVs - original design)
│   │   ├── clinc150oos.yaml
│   │   ├── stackoverflow.yaml
│   │   ├── banking77_csv.yaml         # dataset config for CSV files (to test support for CSVs - add-on)
│   │   └── banking77_json.yaml        # dataset config for JSON files (to test support for JSON - add-on)
│   └── experiment                     # Experiment config. Naming convention: <ollama/api>_<modelprovider>_<modelname>_<method>_<dataset>.yaml
│       ├── api_google_gemini-2.5-flash-preview-05-20_fewshot_banking77.yaml
│       ├── api_nebius_qwen3-30b-a3b_fewshot_banking77.yaml
│       ├── ollama_deepseek-r1_7b_zeroshot_banking77.yaml
│       ├── ollama_gemma3_4b-it-qat_zeroshot_banking77.yaml
│       ├── ollama_llama3_2_3b_fewshot_banking77.yaml          # experiment config for llama3.2:3b fewshot on banking77 TSV files
│       ├── ollama_llama3_2_3b_fewshot_banking77_csv.yaml
│       ├── ollama_llama3_2_3b_fewshot_banking77_json.yaml
│       ├── ollama_llama3_2_3b_fewshot_banking77_thresholdtest.yaml
│       ├── ollama_llama3_2_3b_fewshot_clinc150oos.yaml
│       ├── ollama_llama3_2_3b_fewshot_stackoverflow.yaml
│       ├── ollama_llama3_2_3b_zeroshot_banking77.yaml
│       ├── ollama_mistral_7b_zeroshot_banking77.yaml
│       ├── ollama_qwen3_8b_zeroshot_banking77.yaml
│       └── ollama_tulu3_8b_zeroshot_banking77.yaml
├── data                              # Place your datasets here - 1 folder per dataset
│   ├── banking                       # Train, dev, test TSV files, including class index to label mapping (idx2label.csv). Visit "analyse-results-zeroshot-fewshot, create-idx2label.ipynb" to understand how to create idx2label.csv for your dataset
│   │   ├── banking77_idx2label.csv
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── oos
│   ├── stackoverflow
│   ├── banking_simulate_csv
│   └── banking_simulate_json
├── debugger                          # Optional debugger directory to verify Ollama has been set up correctly
│   └── debug_ollama.py
├── examples
│   ├── example_usage.py
│   ├── few_shot_examples             # Fewshot prompt notebook examples
│   │   ├── 01i1-openintent-ollama-llama3-2-3b-banking77-fewshot_5hardcoded-previouslymisclassifiedexamples.ipynb
│   │   ├── 01j1-openintent-ollama-llama3-2-3b-banking77_10001_13082.ipynb
│   │   ├── 01j2-openintent-ollama-llama3-2-3b-stackoverflow_10001_19999.ipynb
│   │   ├── 01j3-openintent-ollama-llama3-2-3b-oos_22001_23699.ipynb
│   │   ├── 01l1-openintent-gemini-2.5-flash-banking77_0_19.ipynb
│   │   ├── 01l5-openintent-nebiusqwen-banking77-individualAPIcall.ipynb
│   │   ├── 01l6-openintent-nebiusqwen-banking77-batchof10.ipynb                          # Nebius qwen batch API - run for batch of 10 examples
│   │   ├── 01l7a-openintent-nebiusqwen-banking77-batchfull-n-downloadresults.ipynb       # Nebius qwen batch API - run for full banking77 dataset, then download results
│   │   ├── 01l7b-openintent-nebiusqwen-bk77-stitchresults.ipynb                          # Nebius qwen batch API - stitch results together with original dataframe for further analysis
│   │   ├── 01l8a-openintent-nebiusqwen-stackoverflow-batchfull-n-downloadresults.ipynb
│   │   ├── 01l8b-openintent-nebiusqwen-stkoflw-stitchresults.ipynb
│   │   ├── 01l9a-openintent-nebiusqwen-clincoos-batchfull.ipynb
│   │   ├── 01l9b-openintent-nebiusqwen-clincoos-downloadresults.ipynb
│   │   └── 01l9c-openintent-nebiusqwen-c150oos-stitchresults.ipynb
│   ├── hybrid_embedding_nonembedding
│   │   └── 25perc_OOS_Finetune_BERT,_train_VAE,_apply_on_full_dataset,_get_5_nearest_examples_for_fewshot.ipynb
│   ├── thresholdtest                 # Fewshot threshold test notebook examples
│   │   ├── 01k1-openintent-ollama-llama3-2-3b-banking77-1notoos.ipynb
│   │   ├── 01k1-openintent-ollama-llama3-2-3b-banking77-4notoos.ipynb
│   │   ├── 01k2-openintent-ollama-llama3-2-3b-stackoverflow-5notoos.ipynb
│   │   └── 01k3-openintent-ollama-llama3-2-3b-clinc150oos-14notoos.ipynb
│   ├── zero_shot_examples            # Zeroshot prompt notebook examples
│   │   ├── 01e-kaggle-ollama-llama3-2-3b-banking77-w-force-oos-no-pydantic-enums.ipynb
│   │   ├── 01f-kaggle-ollama-llama3-2-3b-banking77-w-pydantic-schema.ipynb
│   │   ├── 01g1-kaggle-ollama-llama3-2-3b-banking77-no-oos-in-intentlist-keep-oos-in-enums.ipynb
│   │   ├── 01g2-kaggle-ollama-llama3-2-stackoverflow-no-oos-in-intentlist-keep-oos-in-enums.ipynb
│   │   ├── 01g3-kaggle-ollama-llama3-2-clinc150oos-no-oos-in-intentlist-keep-oos-in-enums.ipynb
│   │   ├── 01g4-openintent-ollama-deepseek-r1-7b-banking77-test-reasoningmodel.ipynb
│   │   ├── 01g5-openintent-ollama-gemma3-4b-it-qat-banking77-test-generalquantisedmodel.ipynb
│   │   ├── 01g6-openintent-ollama-qwen3-8b-banking77-test-mixtureofexpertmodel.ipynb
│   │   ├── 01g7-openintent-ollama-mistral-7b-banking77-test-generalmodel.ipynb
│   │   ├── 01g8-openintent-ollama-tulu3-8b-banking77-test-instructiontunedmodel.ipynb
│   │   └── 01h1-openintent-ollama-llama3-2-3b-banking77-group4similarclassesinoos-zeroshot.ipynb
│   └── terminal_workings             # Terminal workings on how to run the Ollama or API Model pipeline
├── prompts
│   ├── archive_zeroshot_fewshot        # Archived zeroshot, fewshot prompts
│   │   ├── fewshot_prompt_with_5_nearest_examples          # folder of 5-nearest-examples fewshot prompts (in CSV, parquet, zip of txt files format). We included the FAISS vector store containing our banking77 embeddings
│   │   ├── fewshot_prompt_with_5hardcoded-previouslymisclassifiedexamples.txt
│   │   ├── zeroshot_prompt_with_oos_in_intentlist.txt
│   │   └── zeroshot_prompt_with_oos_in_intentlist_w_anchor_confidence.txt
│   ├── few_shot_examples               # Create file containing fewshot prompt examples for each dataset. Visit "analyse-results-zeroshot-fewshot, create-idx2label.ipynb" to understand how to create these files
│   │   ├── banking77
│   │   │   ├── banking_25perc_oos.txt
│   │   │   ├── banking_only1notoos.txt
            ...
│   │   │   └── banking_only70notoos.txt
│   │   ├── oos
│   │   │   ├── oos_25perc_oos.txt
│   │   │   ├── oos_1notoos.txt
            ...
│   │   │   └── oos_100notoos.txt
│   │   └── stackoverflow
│   │       ├── stackoverflow_25perc_oos.txt
│   │       ├── stackoverflow_only1notoos.txt
            ...
│   │       └── stackoverflow_only18notoos.txt
│   ├── fewshot_prompt.txt                                       # Last used fewshot prompt
│   └── zeroshot_prompt_without_oos_in_intentlist.txt            # Last used zeroshot prompt
├── requirements.txt                                             # Python libraries to install
├── results                                                      # Folder containing experiment results
│   ├── analysis                                                 # Folder containing analysis of zeroshot, fewshot, threshold-test
│   │   ├── analyse-results-fewshot-threshold-test.ipynb
│   │   ├── analyse-results-zeroshot-fewshot, create-idx2label.ipynb    # Analyse results for zeroshot, fewshot, hybrid approaches. Create idx2label
│   │   ├── analyse-different-methods-sentence-level-errors_n_confusion-matrix.ipynb
│   │   └── EDA_THUIAR_Banking_n_StackOverflow_n_OOS_Query_Classification_Datasets.ipynb
│   ├── banking77_fewshot_google_gemini-2.5-flash-preview-05-20
│   ├── banking77_fewshot_llama3.2_3b                                    # In each experiment folder
│   │   ├── classification_report_llama3.2_3b_banking.txt                # Multi-class classification report (OOS vs individual known classes)
│   │   ├── classification_report_llama3.2_3b_banking_open_vs_known.txt  # Binary classification report (OOS/Open vs known class)
│   │   ├── cm_llama3.2_3b_banking.csv                                   # Multi-class classification's confusion matrix (OOS vs individual known classes) in CSV
│   │   ├── cm_llama3.2_3b_banking.png                                   # Multi-class classification's confusion matrix (OOS vs individual known classes) in PNG
│   │   ├── metrics_llama3.2_3b_banking.txt                              # Multi-class classification metrics (OOS vs individual known classes)
│   │   ├── metrics_llama3.2_3b_banking_open_vs_known.txt                # Binary classification metrics (OOS/Open vs known class)
│   │   └── results_llama3.2_3b_banking_0_7.json                         # Results JSON: list of dictionaries, with 1 dictionary per classified example
│   ├── banking77_fewshot_llama3.2_3b_csv
│   ├── banking77_fewshot_llama3.2_3b_json
│   ├── banking77_fewshot_nebiusqwen3-30b-a3b                            # Experiment folders from banking77, stackoverflow, oos pipeline test runs (after refactoring from Jupyter notebooks to GitHub repo)
│   ├── banking77_fewshot_thresholdtest
│   ├── banking77_fewshot_thresholdtest_only1notoos
│   ├── banking77_fewshot_thresholdtest_only4notoos
│   ├── banking77_zeroshot_deepseek-r1_7b
│   ├── banking77_zeroshot_gemma3_4b-it-qat
│   ├── banking77_zeroshot_llama_3.2_3b
│   ├── banking77_zeroshot_mistral_7b
│   ├── banking77_zeroshot_qwen3_8b
│   ├── banking77_zeroshot_tulu3_8b
│   ├── oos_fewshot_llama3.2_3b
│   ├── round1                                                           # round1 to round9 folders: Results from experiments conducted in Jupyter notebooks
│   ├── round2-force_oos
│   ├── round3-pydantic
│   ├── round4-oos-out-of-prompt
│   ├── round5-other-models
│   ├── round6-groupsimilarclasses-n-fewshot
│   ├── round7-fewshot-5examplesperknownintent
│   ├── round8-fewshot-1exampleeach-k-knownintent-restoos-100oossentences
│   ├── round9-fewshot-nebiusqwen
│   └── stackoverflow_fewshot_llama3.2_3b
└── src                                     # Folder containing files to run for each experiment
    ├── data_utils.py                       # Preprocess dataset
    ├── ollama_utils.py                     # Ollama model setup
    ├── nebius_utils.py                     # Nebius API setup
    ├── google_utils.py                     # Google API setup
    ├── experiment_common.py                # Common experiment file used by the Ollama and API pipeline
    ├── experiment_ollama.py                # Ollama experiment file
    └── experiment_api.py                   # API experiment file

7. Results Summary

Please note that for the results section below, we show only

  • experiments where 25% of classes were converted to OOS, to allow comparison with the THUIAR paper
  • zero-shot and few-shot experiments that use pydantic enums to enforce the allowed list of classes for prediction
    • to keep the summary simple, we do not show the results of experiments run without pydantic enums

For experiments with other percentages of OOS classes, or where we initially explored not enforcing the allowed list of classes, you can still access the results in the results folder.

To analyse results in Jupyter notebooks instead, please visit the results/analysis folder.

7.1 Overall Accuracy & Macro F1-score - 25% OOS Class

  • From experiments where we converted 25% of classes to 'OOS'/Open and ran the pipeline, below are the Overall Accuracy & Macro F1-scores.
  • Note that
    • 'overall' refers to all questions/examples across the entire dataset (multi-class classification)
    • The figures below are percentages (from 0.00% to 100.00%)
    • '-' refers to experiments which have yet to be conducted
    • In terms of sort order, we list
      • our llama3.2:3b base model on top
      • then the remaining zero-shot models sorted from highest to lowest score
      • then our few-shot models
      • then our hybrid models
| Methods | Banking77 Overall Accuracy | Banking77 Overall Macro F1 | StackOverflow Overall Accuracy | StackOverflow Overall Macro F1 | CLINC150OOS Overall Accuracy | CLINC150OOS Overall Macro F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot llama3.2:3b (Our Base Ollama/Local LLM) | 43.74 | 53.00 | 66.62 | 73.10 | 45.76 | 55.79 |
| Zero-Shot qwen3:8b (Mixture-of-Experts LLM) | 53.86 | 63.97 | - | - | - | - |
| Zero-Shot gemma3:4b-it-qat (Instruction-Following & Quantised LLM) | 48.17 | 57.48 | - | - | - | - |
| Zero-Shot mistral:7b (General-Purpose LLM) | 46.62 | 54.99 | - | - | - | - |
| Zero-Shot tulu3:8b (Instruction-Following LLM) | 44.58 | 52.69 | - | - | - | - |
| Zero-Shot deepseek-r1:7b (Reasoning LLM) | 32.14 | 36.70 | - | - | - | - |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples per known class | 14.09 | 16.30 | 69.67 | 75.72 | 23.23 | 28.09 |
| Few-Shot QWEN3-30B-A3B (Mixture-of-Experts API LLM) with 5 examples per known class | 70.28 | 76.48 | 85.72 | 87.87 | 80.35 | 80.99 |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples reused across input sentences | 50.58 | 50.63 | - | - | - | - |
| Hybrid Finetuned BERT --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 69.14 | 78.92 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 64.15 | 74.47 | - | - | - | - |

What do the overall metrics tell us?

  • Zero-shot is not enough, lagging behind approaches using
    • Few-shot with better models
    • Hybrid embedding then non-embedding methods (with the 5 nearest fewshot examples providing relevant context to the LLM to classify each input sentence)
  • Focusing on small models (defined as models with fewer than 10B parameters)
    • qwen3:8b, a mixture-of-experts LLM, achieved the highest overall and OOS metrics
      • This could be because the mixture-of-experts model might have an "expert" specialising in the banking domain, giving it a better understanding of banking nuances and therefore a better ability to separate the classes in the Banking77 dataset
      • While tulu3:8b has the same number of parameters (8B), tulu did not perform as well as qwen3:8b
  • For llama3.2:3b few-shot (when we used 5 examples for every known class in the few-shot prompt)
    • The few-shot example counts are
      • Banking77: 285 few-shot examples = 57 known classes x 5 examples
      • StackOverflow: 75 few-shot examples = 15 known classes x 5 examples
      • CLINC150OOS: 570 few-shot examples = 114 known classes x 5 examples
    • Zooming in on the middle case (the StackOverflow dataset), with fewer than 100 examples in the few-shot prompt, overall metrics increased a little, while OOS F1 increased by approximately 20 percentage points (from 5% to 25%)
    • Too many examples in the few-shot prompt can confuse a 'small' model (fewer than 10B parameters): for the Banking77 and CLINC150OOS datasets, overall metrics dropped (while still staying above 0%) and OOS F1 dropped to 0%
  • For few-shot (using 5 examples for every known class in the few-shot prompt), switching to a bigger model such as QWEN3-30B-A3B
    • Increased overall metrics (relative to llama3.2:3b) and increased OOS metrics by a moderate to large extent
    • We infer that many examples in a few-shot prompt provide useful additional context for a 'large' model (defined as greater than 10B parameters)
  • Comparing standalone fewshot (using llama3.2:3b) to hybrid approaches
    • Hybrid approaches increased overall and OOS metrics
    • If we prioritise OOS F1, Hybrid BERT-VAE-Fewshot approach was the best model/technique
      • Standalone Fewshot OOS F1: 8%
      • Hybrid BERT-Fewshot OOS F1: 32%
      • Hybrid BERT-VAE-Fewshot OOS F1: 33%
    • Observing how a large model (greater than 10B parameters) such as QWEN3-30B-A3B can increase overall and OOS metrics, we expect a greater increase in overall and OOS metrics (approximately 40-60%) from using the Hybrid BERT-VAE-Fewshot approach with a large model (e.g. QWEN3-30B-A3B)
  • For more information on the overall and OOS metrics, please reach out to the team; we have a presentation walking through the results in greater detail than the tables presented here for comparison with the 2021 THUIAR paper

7.2 OOS/Open vs Known F1-score - 25% OOS Class

  • From experiments where we converted 25% of classes to 'OOS'/Open and ran the pipeline, below are the OOS/Open vs Known F1-scores.
  • Note that
    • Non-embedding methods (zero-shot prompt, few-shot prompt) perform multi-class classification, so to obtain the 'known' class we grouped all non-OOS classes under 'known' (a small sketch of this grouping follows the table below). For such experiments, we therefore have
      • For multi-class classification
        • 1 classification_report.txt
        • 1 metrics.txt (Overall accuracy, Overall Weighted F1, Overall Macro F1)
        • 1 confusion_matrix.csv
        • 1 confusion_matrix.png
        • 1 results.json - containing the individual multi-class classification predictions
      • For open vs known (after grouping)
        • 1 classification_report.txt - We use the F1 scores for open vs known in this report for the table
        • 1 metrics.txt (Accuracy, Weighted F1, Macro F1)
    • Embedding methods (Adaptive Decision Boundary Clustering and Variational Autoencoder) currently perform only binary classification: open vs known
    • The figures below are in percentage terms (from 0.00% to 100.00%)
    • '-' refers to experiments which have yet to be conducted
    • We followed our sort order in section 7.1's table
    • Although the 2021 THUIAR paper states that its table 3 consists of open vs known Macro F1-scores, that labelling is inconsistent: the Macro F1-score is an aggregate across all classes (here, open and known), so it cannot be reported separately per class. We therefore treat it as a typo in the paper and take the paper's table 3 to mean the class-specific open F1-score and known F1-score, which we summarise for our experiments below.
| Methods | Banking77 Open F1 | Banking77 Known F1 | StackOverflow Open F1 | StackOverflow Known F1 | CLINC150OOS Open F1 | CLINC150OOS Known F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot llama3.2:3b (Our Base Ollama/Local LLM) | 85.00 | 0.00 | 86.00 | 5.00 | 83.00 | 1.00 |
| Zero-Shot qwen3:8b (Mixture-of-Experts LLM) | 84.00 | 12.00 | - | - | - | - |
| Zero-Shot gemma3:4b-it-qat (Instruction-Following & Quantised LLM) | 84.00 | 5.00 | - | - | - | - |
| Zero-Shot mistral:7b (General-Purpose LLM) | 85.00 | 0.00 | - | - | - | - |
| Zero-Shot tulu3:8b (Instruction-Following LLM) | 85.00 | 0.00 | - | - | - | - |
| Zero-Shot deepseek-r1:7b (Reasoning LLM) | 84.00 | 2.00 | - | - | - | - |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples per known class | 85.00 | 0.00 | 87.00 | 25.00 | 84.00 | 0.00 |
| Few-Shot QWEN3-30B-A3B (Mixture-of-Experts API LLM) with 5 examples per known class | 89.00 | 56.00 | 91.00 | 79.00 | 90.00 | 75.00 |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples reused across input sentences | 97.00 | 8.00 | - | - | - | - |
| Hybrid Finetuned BERT --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 87.00 | 32.00 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE | 81.00 | 16.00 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 83.00 | 33.00 | - | - | - | - |
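
For illustration, a small sketch of collapsing multi-class predictions into the open-vs-known report (assuming scikit-learn; the intent labels shown are illustrative, not taken from the pipeline's output files):

from sklearn.metrics import classification_report    # assumption: scikit-learn is installed

# Illustrative labels only - in the pipeline these come from the results_*.json predictions
y_true = ["oos", "card_arrival", "oos", "top_up_failed", "card_arrival"]
y_pred = ["oos", "card_arrival", "top_up_failed", "oos", "card_arrival"]

# Group every non-OOS class under 'known', keep 'oos' as 'open'
to_binary = lambda labels: ["open" if y == "oos" else "known" for y in labels]
print(classification_report(to_binary(y_true), to_binary(y_pred), digits=2))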

7.3 OOS/Open vs Known - Grouped Accuracy & Macro F1-score - 25% OOS Class

  • As an add-on for our project (which was not presented in the 2021 THUIAR paper), we summarise the OOS/Open vs Known Grouped Accuracy & Macro F1-score
  • We hope this helps to facilitate a comparison between
    • Section 7.1's Table: Overall Accuracy and Macro-F1 score from Multi-class Classification
    • Section 7.3's Table: OOS/Open vs Known Grouped Accuracy and Macro-F1 score from 'Binary' Classification
  • Note that for the base "ADB (2021 THUIAR Paper)" row, we assume table 2 of the paper applies to both multi-class classification, and binary classification
| Methods | Banking77 Grouped Accuracy | Banking77 Grouped Macro F1 | StackOverflow Grouped Accuracy | StackOverflow Grouped Macro F1 | CLINC150OOS Grouped Accuracy | CLINC150OOS Grouped Macro F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot llama3.2:3b (Our Base Ollama/Local LLM) | 73.00 | 42.00 | 75.00 | 45.00 | 72.00 | 42.00 |
| Zero-Shot qwen3:8b (Mixture-of-Experts LLM) | 73.00 | 48.00 | - | - | - | - |
| Zero-Shot gemma3:4b-it-qat (Instruction-Following & Quantised LLM) | 73.00 | 45.00 | - | - | - | - |
| Zero-Shot mistral:7b (General-Purpose LLM) | 74.00 | 42.00 | - | - | - | - |
| Zero-Shot tulu3:8b (Instruction-Following LLM) | 74.00 | 42.00 | - | - | - | - |
| Zero-Shot deepseek-r1:7b (Reasoning LLM) | 72.00 | 43.00 | - | - | - | - |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples per known class | 74.00 | 43.00 | 78.00 | 56.00 | 72.00 | 42.00 |
| Few-Shot QWEN3-30B-A3B (Mixture-of-Experts API LLM) with 5 examples per known class | 82.00 | 73.00 | 87.00 | 85.00 | 85.00 | 82.00 |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples reused across input sentences | 94.19 | 52.27 | - | - | - | - |
| Hybrid Finetuned BERT --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 78.51 | 59.55 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE | 69.23 | 48.71 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 72.67 | 57.90 | - | - | - | - |

7.4 Threshold Test When Recall Drops

  • As an add-on for our project, we evaluated how the count of known intents affects the recall of an LLM
  • Note that we
    • performed the threshold test using only llama3.2:3b (our base Ollama model)
    • used the few-shot prompt method, with 'k' known intents and 1 example for every known intent class
    • input 100 question strings from each dataset's 'OOS' class into the LLM to predict an output class
  • As we increase the number of known intents for the model to classify
    • Precision on the 'OOS' class stays at 100%, BUT recall on the 'OOS' class decreases
    • Intuitively, as the number of known intent classes grows, and with some classes being similar to one another, it is reasonable to expect an LLM to become less likely to classify the input question strings as the 'OOS'/Open class
    • And this contradicts table 3 from the 2021 THUIAR paper, where the F1-score of the 'Open' class decreases as the percentage of known classes increases
(Chart: 'OOS' precision and recall as the count of known intents increases)

8. References

Here are the references underpinning the methods we explored above.

9. License

This project is licensed under the MIT License - see the LICENSE file for details.
