LLM Trust Lens - Open Intent Classification


1. Overview

LLM Trust Lens - Open Intent Classification is a pipeline for evaluating how well different methods (such as LLMs) perform on the task of "Open Intent Classification" across multiple datasets.

What is "Open Intent Classification"?

There are 2 ways to evaluate open intent classification:

  1. Binary classification of open-intent/oos/unknown class vs 1 known class (grouped from all known classes)
  2. Multi-class Classification of open-intent/oos/unknown class vs individual known classes

Project Team Members

2. Key Features

  • Multi-Model Support: Evaluate both local models (via Ollama) and API-based models (Nebius, Google Gemini)
  • Flexible Prompt Scenarios: Support for both zero-shot and few-shot prompt scenarios
  • Multiple Datasets: Built-in support for Banking77, StackOverflow, and CLINC150OOS TSV datasets (Source: 2021 Adaptive Decision Boundary Clustering GitHub repo). For new datasets, bring them into the pipeline!
  • Multiple Dataset Formats: Support for CSV, TSV and JSON file formats
  • Configurable Experiments: YAML-based configuration system for easy experiment setup
  • Traceable Results: Generate LLM predictions, classification metrics and confusion matrix files for evaluation
  • Hybrid Embedding and Non-Embedding Approach: Explored 2 pipelines with a hybrid approach (currently in notebooks)

3. Setup

  1. Clone the Repository
# If you have not done so, update your Ubuntu packages and install git
sudo apt update && sudo apt install git -y

git clone https://github.com/KaiquanMah/llm-trust-lens.git
cd llm-trust-lens
  2. Create a Virtual Environment (Recommended)
# If you have not done so, install python, pip and venv on your Ubuntu machine
sudo apt install -y python3 python3-pip python3-venv

python3 -m venv venv
source venv/bin/activate    # On Windows use `venv\Scripts\activate`
  3. Install Dependencies: install Ollama, then install the required Python packages using the requirements.txt file
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Upgrade pip
pip install --upgrade pip
# Install python dependencies to run the pipeline
pip install -r requirements.txt
  4. Test that Ollama has been installed successfully
# Check Ollama version
ollama --version
# As of August 2025: ollama version is 0.9.6

# Check whether the Ollama server process is running
ps aux | grep ollama
# codespa+    1425  0.0  0.0   7080  2048 pts/0    S+   14:21   0:00 grep --color=auto ollama
# If only the grep process appears (as above), the Ollama server is not yet running; start it with `ollama serve`

4. Environment Configuration

To use API-based models from providers like Nebius or Google, you must configure your API keys in an environment file.

Create a .env file in the project root and add your API keys. These variables are loaded by the pipeline to authenticate with the respective API services.

NEBIUS_API_KEY="your_nebius_api_key_here"
GOOGLE_API_KEY="your_google_api_key_here"
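
As a minimal sketch of how these keys can be read before creating an API client (assuming the python-dotenv package is available, which is not confirmed here):

import os
from dotenv import load_dotenv           # assumption: python-dotenv is installed

load_dotenv()                             # read the .env file from the project root
nebius_api_key = os.getenv("NEBIUS_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")

if not (nebius_api_key or google_api_key):
    raise RuntimeError("No API key found - check your .env file")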

5. Usage

  • Non-embedding methods: The experiment_*.py pipeline files currently support non-embedding methods (zero-shot prompt and few-shot prompt)
  • Embedding methods: For embedding methods (finetune BERT, then run Adaptive Decision Boundary Clustering or a Variational Autoencoder), the team is still exploring these approaches. Please see the workings in the *.ipynb notebooks referenced below.

5.1. Non-Embedding Methods (Zero-Shot Prompt and Few-Shot Prompt)

  • The terminal commands shown below should be run from the root directory of the project
  • You can execute different experiments by using the appropriate experiment_*.py file and experiment configuration file
  • If you wish to check out the terminal workings and printouts during each pipeline run, please visit the terminal_workings folder

5.1.1. Common Steps for Ollama/Local Model and API Model Experiments

  1. Navigate to the llm-trust-lens folder
  2. Activate the venv virtual environment containing the required Python libraries to run the pipeline
    python3 -m venv venv        # only needed if the virtual environment does not exist yet
    source venv/bin/activate
  3. Dataset: Use an existing CSV/TSV/JSON dataset or bring new datasets into the data folder
  4. idx2label: Use an existing idx2label.csv (mapping class indexes to labels) or create a new idx2label.csv in the [respective data folder](https://github.com/KaiquanMah/llm-trust-lens/tree/main/data) (a small sketch of creating this mapping follows this list)
  5. Dataset yaml: Use an existing dataset yaml file or create a new dataset yaml file in the dataset yaml folder
  6. Experiment yaml: Use an existing experiment yaml file or create a new experiment yaml file in the experiment yaml folder
    • We recommend creating separate experiment yaml files so you can trace back each experiment's configuration (e.g. Ollama vs API, the model used, zero-shot vs few-shot, threshold test or not)
  7. Prompt: Use an existing zero-shot or few-shot prompt, or create a new prompt.txt in the prompts folder
  8. Few-shot Prompt Examples: If you wish to use the few-shot prompt method, please use an existing few-shot examples file, or create a new few-shot examples txt file in the few_shot_examples folder
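
For illustration, a minimal sketch of building an idx2label.csv from a training TSV is shown below. The column names ("label", "idx") are assumptions; follow the "analyse-results-zeroshot-fewshot, create-idx2label.ipynb" notebook for the exact format the pipeline expects.

import pandas as pd

# Illustrative only - the "label" column name is an assumption; see the create-idx2label notebook for the exact format
train = pd.read_csv("data/banking/train.tsv", sep="\t")
labels = sorted(train["label"].unique())
idx2label = pd.DataFrame({"idx": range(len(labels)), "label": labels})
idx2label.to_csv("data/banking/banking77_idx2label.csv", index=False)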

5.1.2. Run Ollama/Local Model Experiments

  • For Ollama models, we ran the pipeline successfully for 6 models
    llama3.2:3b
    qwen3:8b (Mixture-of-Experts LLM)
    gemma3:4b-it-qat (Instruction-Following and Quantised LLM)
    mistral:7b (General-Purpose LLM)
    tulu3:8b (Instruction-Following LLM)
    deepseek-r1:7b (Reasoning LLM)
    
    • We expect the pipeline to support other models published on Ollama
    • To explore Ollama models you can use, please visit Ollama's model directory
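
As a rough, simplified illustration of what one zero-shot classification call to a local Ollama model looks like (using the ollama Python package; the real pipeline in src/experiment_ollama.py adds config handling, pydantic enums, retries and result files, so treat this only as a sketch):

# Requires `pip install ollama` and pulling the model first with `ollama pull llama3.2:3b`
import ollama

prompt = open("prompts/zeroshot_prompt_without_oos_in_intentlist.txt").read()
question = "How do I activate my new card?"    # illustrative input sentence

response = ollama.chat(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": question},
    ],
)
print(response["message"]["content"])          # predicted intent label (or 'oos')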

5.1.2.1. Zero-Shot llama3-2-3b on Banking77 dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_zeroshot_banking77.yaml

5.1.2.2. Zero-Shot gemma-3-4b-it-qat on Banking77 dataset

python src/experiment_ollama.py --config config/experiment/ollama_gemma3_4b-it-qat_zeroshot_banking77.yaml

5.1.2.3. Few-Shot llama3-2-3b on Banking77 dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_fewshot_banking77.yaml

5.1.2.4. Few-Shot llama3-2-3b on Stackoverflow dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_fewshot_stackoverflow.yaml

5.1.2.5. Few-Shot llama3-2-3b on CLINC150OOS dataset

python src/experiment_ollama.py --config config/experiment/ollama_llama3_2_3b_fewshot_clinc150oos.yaml

5.1.3. Run API Model Experiments

  • For API models, our pipeline currently supports individual calls to 2 API providers (where we had some credits)
    Nebius
    Google
    
  • To understand how to use the Nebius batch API, please visit notebooks 01l6 to 01l9*, where we prepared inputs for the batch API, called it, downloaded the results, then stitched the results back onto the original dataset to produce the output expected for further analysis
  • To integrate with other model providers, you will need to
    • Create a <model_provider>_utils.py file using nebius_utils.py as a template. This file covers initialising your API client, retry config, and how to work with messages
    • Add a class to experiment_api.py. The class should have 2 basic functions: initialize and predict (see the sketch after this list)
  • For all API models, please remember to specify your model_provider, model_name and configuration in the experiment yaml file described in section 5.1.1 Common Steps - Step 6
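
The rough shape such a class could take is sketched below. Only the initialize/predict contract comes from the description above; the class name, config keys and bodies are illustrative, so mirror the existing classes in experiment_api.py rather than copying this verbatim.

# Illustrative sketch only - adapt to the existing classes in experiment_api.py
class MyProviderModel:
    def initialize(self, config: dict) -> None:
        # Build the provider's API client from the experiment yaml config
        # (API key, model name, retry settings), as done in <model_provider>_utils.py
        self.model_name = config["model_name"]    # hypothetical config key
        self.client = None                        # replace with the provider's client object

    def predict(self, prompt: str, question: str) -> str:
        # Send one prompt + input sentence to the API and return the predicted intent label (or 'oos')
        raise NotImplementedError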

5.1.3.1. Few-Shot Nebius Qwen API on Banking77 dataset

python src/experiment_api.py --config config/experiment/api_nebius_qwen3-30b-a3b_fewshot_banking77.yaml

5.1.3.2. Few-Shot Google Gemini API on Banking77 dataset

python src/experiment_api.py --config config/experiment/api_google_gemini-2.5-flash-preview-05-20_fewshot_banking77.yaml

5.2. Hybrid Embedding Method then Non-Embedding Method

  • Please visit the notebook in the hybrid_embedding_nonembedding folder for the workings
  • In the notebook, we
    • Finetuned BERT only on known classes to
      • Adapt BERT to the dataset's domain
      • Move sentences from the same class closer, and move sentences from different classes further apart, to give "higher quality embeddings", which can be used to better differentiate between different classes
    • Used the finetuned BERT to generate embeddings
    • Trained a Variational Autoencoder (VAE) on binary classification (open class vs known class)
    • Stored embeddings in a FAISS (Facebook AI Similarity Search) vector store: IndexFlatIP (inner-product search, which is equivalent to cosine similarity once the embeddings are L2-normalised); a condensed retrieval sketch follows this list
    • Retrieved the 5 nearest sentence examples and their classes for each input sentence's few-shot prompt
    • Saved a pickle file containing a list of row indexes for sentences classified by the VAE as the known class
    • Ran 2 separate pipelines
      1. (After finetuned BERT -> Trained VAE -> ) Fewshot prompt on known subset of sentences (classified by VAE) only
      2. (After finetuned BERT -> ) Fewshot prompt on full dataset
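
A condensed, self-contained sketch of that retrieval step is below. The random arrays stand in for the finetuned-BERT embeddings, and the variable names are illustrative rather than taken from the notebook.

import faiss
import numpy as np

# Toy stand-ins for finetuned-BERT embeddings (float32): 1000 candidate examples, 1 query sentence
rng = np.random.default_rng(0)
example_embeddings = rng.standard_normal((1000, 768)).astype("float32")
query_embedding = rng.standard_normal((1, 768)).astype("float32")

faiss.normalize_L2(example_embeddings)    # after L2-normalisation, inner product == cosine similarity
faiss.normalize_L2(query_embedding)

index = faiss.IndexFlatIP(example_embeddings.shape[1])
index.add(example_embeddings)

scores, ids = index.search(query_embedding, 5)    # 5 nearest examples for the few-shot prompt
print(ids[0], scores[0])                          # row indexes and similarity scores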

6. Folder Structure

.
├── LICENSE
├── README.md
├── .env                               # File containing API keys. Follow the "Setup" section to create your own .env file
├── config
│   ├── dataset                        # Dataset config. Naming convention: <dataset_name>.yaml
│   │   ├── banking77.yaml             # dataset config for banking77 TSV files (to test support for TSVs - original design)
│   │   ├── clinc150oos.yaml
│   │   ├── stackoverflow.yaml
│   │   ├── banking77_csv.yaml         # dataset config for CSV files (to test support for CSVs - add-on)
│   │   └── banking77_json.yaml        # dataset config for JSON files (to test support for JSON - add-on)
│   └── experiment                     # Experiment config. Naming convention: <ollama/api>_<modelprovider>_<modelname>_<method>_<dataset>.yaml
│       ├── api_google_gemini-2.5-flash-preview-05-20_fewshot_banking77.yaml
│       ├── api_nebius_qwen3-30b-a3b_fewshot_banking77.yaml
│       ├── ollama_deepseek-r1_7b_zeroshot_banking77.yaml
│       ├── ollama_gemma3_4b-it-qat_zeroshot_banking77.yaml
│       ├── ollama_llama3_2_3b_fewshot_banking77.yaml          # experiment config for llama3.2:3b fewshot on banking77 TSV files
│       ├── ollama_llama3_2_3b_fewshot_banking77_csv.yaml
│       ├── ollama_llama3_2_3b_fewshot_banking77_json.yaml
│       ├── ollama_llama3_2_3b_fewshot_banking77_thresholdtest.yaml
│       ├── ollama_llama3_2_3b_fewshot_clinc150oos.yaml
│       ├── ollama_llama3_2_3b_fewshot_stackoverflow.yaml
│       ├── ollama_llama3_2_3b_zeroshot_banking77.yaml
│       ├── ollama_mistral_7b_zeroshot_banking77.yaml
│       ├── ollama_qwen3_8b_zeroshot_banking77.yaml
│       └── ollama_tulu3_8b_zeroshot_banking77.yaml
├── data                              # Place your datasets here - 1 folder per dataset
│   ├── banking                       # Train, dev, test TSV files, including class index to label mapping (idx2label.csv). Visit "analyse-results-zeroshot-fewshot, create-idx2label.ipynb" to understand how to create idx2label.csv for your dataset
│   │   ├── banking77_idx2label.csv
│   │   ├── dev.tsv
│   │   ├── test.tsv
│   │   └── train.tsv
│   ├── oos
│   ├── stackoverflow
│   ├── banking_simulate_csv
│   └── banking_simulate_json
├── debugger                          # Optional debugger directory to verify Ollama has been set up correctly
│   └── debug_ollama.py
├── examples
│   ├── example_usage.py
│   ├── few_shot_examples             # Fewshot prompt notebook examples
│   │   ├── 01i1-openintent-ollama-llama3-2-3b-banking77-fewshot_5hardcoded-previouslymisclassifiedexamples.ipynb
│   │   ├── 01j1-openintent-ollama-llama3-2-3b-banking77_10001_13082.ipynb
│   │   ├── 01j2-openintent-ollama-llama3-2-3b-stackoverflow_10001_19999.ipynb
│   │   ├── 01j3-openintent-ollama-llama3-2-3b-oos_22001_23699.ipynb
│   │   ├── 01l1-openintent-gemini-2.5-flash-banking77_0_19.ipynb
│   │   ├── 01l5-openintent-nebiusqwen-banking77-individualAPIcall.ipynb
│   │   ├── 01l6-openintent-nebiusqwen-banking77-batchof10.ipynb                          # Nebius qwen batch API - run for batch of 10 examples
│   │   ├── 01l7a-openintent-nebiusqwen-banking77-batchfull-n-downloadresults.ipynb       # Nebius qwen batch API - run for full banking77 dataset, then download results
│   │   ├── 01l7b-openintent-nebiusqwen-bk77-stitchresults.ipynb                          # Nebius qwen batch API - stitch results together with original dataframe for further analysis
│   │   ├── 01l8a-openintent-nebiusqwen-stackoverflow-batchfull-n-downloadresults.ipynb
│   │   ├── 01l8b-openintent-nebiusqwen-stkoflw-stitchresults.ipynb
│   │   ├── 01l9a-openintent-nebiusqwen-clincoos-batchfull.ipynb
│   │   ├── 01l9b-openintent-nebiusqwen-clincoos-downloadresults.ipynb
│   │   └── 01l9c-openintent-nebiusqwen-c150oos-stitchresults.ipynb
│   ├── hybrid_embedding_nonembedding
│   │   └── 25perc_OOS_Finetune_BERT,_train_VAE,_apply_on_full_dataset,_get_5_nearest_examples_for_fewshot.ipynb
│   ├── thresholdtest                 # Fewshot threshold test notebook examples
│   │   ├── 01k1-openintent-ollama-llama3-2-3b-banking77-1notoos.ipynb
│   │   ├── 01k1-openintent-ollama-llama3-2-3b-banking77-4notoos.ipynb
│   │   ├── 01k2-openintent-ollama-llama3-2-3b-stackoverflow-5notoos.ipynb
│   │   └── 01k3-openintent-ollama-llama3-2-3b-clinc150oos-14notoos.ipynb
│   ├── zero_shot_examples            # Zeroshot prompt notebook examples
│   │   ├── 01e-kaggle-ollama-llama3-2-3b-banking77-w-force-oos-no-pydantic-enums.ipynb
│   │   ├── 01f-kaggle-ollama-llama3-2-3b-banking77-w-pydantic-schema.ipynb
│   │   ├── 01g1-kaggle-ollama-llama3-2-3b-banking77-no-oos-in-intentlist-keep-oos-in-enums.ipynb
│   │   ├── 01g2-kaggle-ollama-llama3-2-stackoverflow-no-oos-in-intentlist-keep-oos-in-enums.ipynb
│   │   ├── 01g3-kaggle-ollama-llama3-2-clinc150oos-no-oos-in-intentlist-keep-oos-in-enums.ipynb
│   │   ├── 01g4-openintent-ollama-deepseek-r1-7b-banking77-test-reasoningmodel.ipynb
│   │   ├── 01g5-openintent-ollama-gemma3-4b-it-qat-banking77-test-generalquantisedmodel.ipynb
│   │   ├── 01g6-openintent-ollama-qwen3-8b-banking77-test-mixtureofexpertmodel.ipynb
│   │   ├── 01g7-openintent-ollama-mistral-7b-banking77-test-generalmodel.ipynb
│   │   ├── 01g8-openintent-ollama-tulu3-8b-banking77-test-instructiontunedmodel.ipynb
│   │   └── 01h1-openintent-ollama-llama3-2-3b-banking77-group4similarclassesinoos-zeroshot.ipynb
│   └── terminal_workings             # Terminal workings on how to run the Ollama or API Model pipeline
├── prompts
│   ├── archive_zeroshot_fewshot        # Archived zeroshot, fewshot prompts
│   │   ├── fewshot_prompt_with_5_nearest_examples          # folder of 5-nearest-examples fewshot prompts (in CSV, parquet, zip of txt files format). We included the FAISS vector store containing our banking77 embeddings
│   │   ├── fewshot_prompt_with_5hardcoded-previouslymisclassifiedexamples.txt
│   │   ├── zeroshot_prompt_with_oos_in_intentlist.txt
│   │   └── zeroshot_prompt_with_oos_in_intentlist_w_anchor_confidence.txt
│   ├── few_shot_examples               # Create file containing fewshot prompt examples for each dataset. Visit "analyse-results-zeroshot-fewshot, create-idx2label.ipynb" to understand how to create these files
│   │   ├── banking77
│   │   │   ├── banking_25perc_oos.txt
│   │   │   ├── banking_only1notoos.txt
            ...
│   │   │   └── banking_only70notoos.txt
│   │   ├── oos
│   │   │   ├── oos_25perc_oos.txt
│   │   │   ├── oos_1notoos.txt
            ...
│   │   │   └── oos_100notoos.txt
│   │   └── stackoverflow
│   │       ├── stackoverflow_25perc_oos.txt
│   │       ├── stackoverflow_only1notoos.txt
            ...
│   │       └── stackoverflow_only18notoos.txt
│   ├── fewshot_prompt.txt                                       # Last used fewshot prompt
│   └── zeroshot_prompt_without_oos_in_intentlist.txt            # Last used zeroshot prompt
├── requirements.txt                                             # Python libraries to install
├── results                                                      # Folder containing experiment results
│   ├── analysis                                                 # Folder containing analysis of zeroshot, fewshot, threshold-test
│   │   ├── analyse-results-fewshot-threshold-test.ipynb
│   │   ├── analyse-results-zeroshot-fewshot, create-idx2label.ipynb    # Analyse results for zeroshot, fewshot, hybrid approaches. Create idx2label
│   │   ├── analyse-different-methods-sentence-level-errors_n_confusion-matrix.ipynb
│   │   └── EDA_THUIAR_Banking_n_StackOverflow_n_OOS_Query_Classification_Datasets.ipynb
│   ├── banking77_fewshot_google_gemini-2.5-flash-preview-05-20
│   ├── banking77_fewshot_llama3.2_3b                                    # In each experiment folder
│   │   ├── classification_report_llama3.2_3b_banking.txt                # Multi-class classification report (OOS vs individual known classes)
│   │   ├── classification_report_llama3.2_3b_banking_open_vs_known.txt  # Binary classification report (OOS/Open vs known class)
│   │   ├── cm_llama3.2_3b_banking.csv                                   # Multi-class classification's confusion matrix (OOS vs individual known classes) in CSV
│   │   ├── cm_llama3.2_3b_banking.png                                   # Multi-class classification's confusion matrix (OOS vs individual known classes) in PNG
│   │   ├── metrics_llama3.2_3b_banking.txt                              # Multi-class classification metrics (OOS vs individual known classes)
│   │   ├── metrics_llama3.2_3b_banking_open_vs_known.txt                # Binary classification metrics (OOS/Open vs known class)
│   │   └── results_llama3.2_3b_banking_0_7.json                         # Results JSON: list of dictionaries, with 1 dictionary per classified example
│   ├── banking77_fewshot_llama3.2_3b_csv
│   ├── banking77_fewshot_llama3.2_3b_json
│   ├── banking77_fewshot_nebiusqwen3-30b-a3b                            # Experiment folders from banking77, stackoverflow, oos pipeline test runs (after refactoring from Jupyter notebooks to GitHub repo)
│   ├── banking77_fewshot_thresholdtest
│   ├── banking77_fewshot_thresholdtest_only1notoos
│   ├── banking77_fewshot_thresholdtest_only4notoos
│   ├── banking77_zeroshot_deepseek-r1_7b
│   ├── banking77_zeroshot_gemma3_4b-it-qat
│   ├── banking77_zeroshot_llama_3.2_3b
│   ├── banking77_zeroshot_mistral_7b
│   ├── banking77_zeroshot_qwen3_8b
│   ├── banking77_zeroshot_tulu3_8b
│   ├── oos_fewshot_llama3.2_3b
│   ├── round1                                                           # round1 to round9 folders: Results from experiments conducted in Jupyter notebooks
│   ├── round2-force_oos
│   ├── round3-pydantic
│   ├── round4-oos-out-of-prompt
│   ├── round5-other-models
│   ├── round6-groupsimilarclasses-n-fewshot
│   ├── round7-fewshot-5examplesperknownintent
│   ├── round8-fewshot-1exampleeach-k-knownintent-restoos-100oossentences
│   ├── round9-fewshot-nebiusqwen
│   └── stackoverflow_fewshot_llama3.2_3b
└── src                                     # Folder containing files to run for each experiment
    ├── data_utils.py                       # Preprocess dataset
    ├── ollama_utils.py                     # Ollama model setup
    ├── nebius_utils.py                     # Nebius API setup
    ├── google_utils.py                     # Google API setup
    ├── experiment_common.py                # Common experiment file used by the Ollama and API pipeline
    ├── experiment_ollama.py                # Ollama experiment file
    └── experiment_api.py                   # API experiment file

7. Results Summary

Please note that for the results section below, we show only

  • experiments where 25% of classes were converted to OOS, to allow comparison with the THUIAR paper
  • zero-shot and few-shot experiments that use pydantic enums to enforce the allowed list of classes for prediction
    • to keep the summary simple, we do not show the results of experiments run without pydantic enums

For experiments with other percentages of OOS classes, or where we initially explored not enforcing the allowed list of classes, you can still access the results in the results folder.

To analyse results in Jupyter notebooks instead, please visit the results/analysis folder.

7.1 Overall Accuracy & Macro F1-score - 25% OOS Class

  • From experiments where we converted 25% of classes to 'OOS'/Open and ran the pipeline, below are the Overall Accuracy & Macro F1-scores.
  • Note that
    • 'overall' refers to all questions/examples across the entire dataset (multi-class classification)
    • The figures below are percentages (from 0.00% to 100.00%)
    • '-' refers to experiments which have yet to be conducted
    • In terms of sort order, we list
      • our llama3.2:3b base model on top
      • then the remaining zero-shot models sorted from highest to lowest score
      • then our few-shot models
      • then our hybrid models
| Methods | Banking77 Overall Accuracy | Banking77 Overall Macro F1 | StackOverflow Overall Accuracy | StackOverflow Overall Macro F1 | CLINC150OOS Overall Accuracy | CLINC150OOS Overall Macro F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot llama3.2:3b (Our Base Ollama/Local LLM) | 43.74 | 53.00 | 66.62 | 73.10 | 45.76 | 55.79 |
| Zero-Shot qwen3:8b (Mixture-of-Experts LLM) | 53.86 | 63.97 | - | - | - | - |
| Zero-Shot gemma3:4b-it-qat (Instruction-Following & Quantised LLM) | 48.17 | 57.48 | - | - | - | - |
| Zero-Shot mistral:7b (General-Purpose LLM) | 46.62 | 54.99 | - | - | - | - |
| Zero-Shot tulu3:8b (Instruction-Following LLM) | 44.58 | 52.69 | - | - | - | - |
| Zero-Shot deepseek-r1:7b (Reasoning LLM) | 32.14 | 36.70 | - | - | - | - |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples per known class | 14.09 | 16.30 | 69.67 | 75.72 | 23.23 | 28.09 |
| Few-Shot QWEN3-30B-A3B (Mixture-of-Experts API LLM) with 5 examples per known class | 70.28 | 76.48 | 85.72 | 87.87 | 80.35 | 80.99 |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples reused across input sentences | 50.58 | 50.63 | - | - | - | - |
| Hybrid Finetuned BERT --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 69.14 | 78.92 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 64.15 | 74.47 | - | - | - | - |

What do the overall metrics tell us?

  • Zero-shot is not enough, lagging behind approaches using
    • Few-shot with better models
    • Hybrid embedding then non-embedding methods (with the 5 nearest fewshot examples providing relevant context to the LLM to classify each input sentence)
  • Focusing on small models (defined as models with fewer than 10B parameters)
    • qwen3:8b, a mixture-of-experts LLM, achieved the highest overall and OOS metrics
      • This could be because the mixture-of-experts model might have an "expert" specialising in the banking domain, giving it a better understanding of banking nuances and therefore a better ability to separate the classes in the Banking77 dataset
      • While tulu3:8b has the same number of parameters (8B), tulu did not perform as well as qwen3:8b
  • For llama3.2:3b few-shot (when we used 5 examples for every known class in the few-shot prompt)
    • The few-shot example counts are
      • Banking77: 285 few-shot examples = 57 known classes x 5 examples
      • StackOverflow: 75 few-shot examples = 15 known classes x 5 examples
      • CLINC150OOS: 570 few-shot examples = 114 known classes x 5 examples
    • Zooming in on the middle case (the StackOverflow dataset), with fewer than 100 examples in the few-shot prompt, overall metrics increased a little, while OOS F1 increased by approximately 20 percentage points (from 5% to 25%)
    • Too many examples in the few-shot prompt can confuse a 'small' model (fewer than 10B parameters): for the Banking77 and CLINC150OOS datasets, overall metrics dropped (while still staying above 0%) and OOS F1 dropped to 0%
  • For few-shot (using 5 examples for every known class in the few-shot prompt), switching to a bigger model such as QWEN3-30B-A3B
    • Increased overall metrics (relative to llama3.2:3b) and increased OOS metrics by a moderate to large extent
    • We infer that many examples in a few-shot prompt provide useful additional context for a 'large' model (defined as greater than 10B parameters)
  • Comparing standalone fewshot (using llama3.2:3b) to hybrid approaches
    • Hybrid approaches increased overall and OOS metrics
    • If we prioritise OOS F1, Hybrid BERT-VAE-Fewshot approach was the best model/technique
      • Standalone Fewshot OOS F1: 8%
      • Hybrid BERT-Fewshot OOS F1: 32%
      • Hybrid BERT-VAE-Fewshot OOS F1: 33%
    • Observing how a large model (greater than 10B parameters) such as QWEN3-30B-A3B can increase overall and OOS metrics, we expect a greater increase in overall and OOS metrics (approximately 40-60%) from using the Hybrid BERT-VAE-Fewshot approach with a large model (e.g. QWEN3-30B-A3B)
  • For more information on the overall and OOS metrics, please reach out to the team; we have a presentation walking through the results in greater detail than the tables presented here for comparison with the 2021 THUIAR paper

7.2 OOS/Open vs Known F1-score - 25% OOS Class

  • From experiments where we converted 25% of classes to 'OOS'/Open and ran the pipeline, below are the OOS/Open vs Known F1-scores.
  • Note that
    • Non-embedding methods (zero-shot prompt, few-shot prompt) perform multi-class classification, so to obtain the 'known' class we grouped all non-OOS classes under 'known' (a small sketch of this grouping follows the table below). For such experiments, we therefore have
      • For multi-class classification
        • 1 classification_report.txt
        • 1 metrics.txt (Overall accuracy, Overall Weighted F1, Overall Macro F1)
        • 1 confusion_matrix.csv
        • 1 confusion_matrix.png
        • 1 results.json - containing the individual multi-class classification predictions
      • For open vs known (after grouping)
        • 1 classification_report.txt - We use the F1 scores for open vs known in this report for the table
        • 1 metrics.txt (Accuracy, Weighted F1, Macro F1)
    • Embedding methods (Adaptive Decision Boundary Clustering and Variational Autoencoder) currently perform only binary classification: open vs known
    • The figures below are in percentage terms (from 0.00% to 100.00%)
    • '-' refers to experiments which have yet to be conducted
    • We followed our sort order in section 7.1's table
    • Although the 2021 THUIAR paper states that its table 3 consists of open vs known Macro F1-scores, that labelling is inconsistent: the Macro F1-score is an aggregate across all classes (here, open and known), so it cannot be reported separately per class. We therefore treat it as a typo in the paper and take the paper's table 3 to mean the class-specific open F1-score and known F1-score, which we summarise for our experiments below.
| Methods | Banking77 Open F1 | Banking77 Known F1 | StackOverflow Open F1 | StackOverflow Known F1 | CLINC150OOS Open F1 | CLINC150OOS Known F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot llama3.2:3b (Our Base Ollama/Local LLM) | 85.00 | 0.00 | 86.00 | 5.00 | 83.00 | 1.00 |
| Zero-Shot qwen3:8b (Mixture-of-Experts LLM) | 84.00 | 12.00 | - | - | - | - |
| Zero-Shot gemma3:4b-it-qat (Instruction-Following & Quantised LLM) | 84.00 | 5.00 | - | - | - | - |
| Zero-Shot mistral:7b (General-Purpose LLM) | 85.00 | 0.00 | - | - | - | - |
| Zero-Shot tulu3:8b (Instruction-Following LLM) | 85.00 | 0.00 | - | - | - | - |
| Zero-Shot deepseek-r1:7b (Reasoning LLM) | 84.00 | 2.00 | - | - | - | - |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples per known class | 85.00 | 0.00 | 87.00 | 25.00 | 84.00 | 0.00 |
| Few-Shot QWEN3-30B-A3B (Mixture-of-Experts API LLM) with 5 examples per known class | 89.00 | 56.00 | 91.00 | 79.00 | 90.00 | 75.00 |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples reused across input sentences | 97.00 | 8.00 | - | - | - | - |
| Hybrid Finetuned BERT --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 87.00 | 32.00 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE | 81.00 | 16.00 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 83.00 | 33.00 | - | - | - | - |
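
For illustration, a small sketch of collapsing multi-class predictions into the open-vs-known report (assuming scikit-learn; the intent labels shown are illustrative, not taken from the pipeline's output files):

from sklearn.metrics import classification_report    # assumption: scikit-learn is installed

# Illustrative labels only - in the pipeline these come from the results_*.json predictions
y_true = ["oos", "card_arrival", "oos", "top_up_failed", "card_arrival"]
y_pred = ["oos", "card_arrival", "top_up_failed", "oos", "card_arrival"]

# Group every non-OOS class under 'known', keep 'oos' as 'open'
to_binary = lambda labels: ["open" if y == "oos" else "known" for y in labels]
print(classification_report(to_binary(y_true), to_binary(y_pred), digits=2))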

7.3 OOS/Open vs Known - Grouped Accuracy & Macro F1-score - 25% OOS Class

  • As an add-on for our project (which was not presented in the 2021 THUIAR paper), we summarise the OOS/Open vs Known Grouped Accuracy & Macro F1-score
  • We hope this helps to facilitate a comparison between
    • Section 7.1's Table: Overall Accuracy and Macro-F1 score from Multi-class Classification
    • Section 7.3's Table: OOS/Open vs Known Grouped Accuracy and Macro-F1 score from 'Binary' Classification
  • Note that for the base "ADB (2021 THUIAR Paper)" row, we assume table 2 of the paper applies to both multi-class classification, and binary classification
| Methods | Banking77 Grouped Accuracy | Banking77 Grouped Macro F1 | StackOverflow Grouped Accuracy | StackOverflow Grouped Macro F1 | CLINC150OOS Grouped Accuracy | CLINC150OOS Grouped Macro F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Zero-Shot llama3.2:3b (Our Base Ollama/Local LLM) | 73.00 | 42.00 | 75.00 | 45.00 | 72.00 | 42.00 |
| Zero-Shot qwen3:8b (Mixture-of-Experts LLM) | 73.00 | 48.00 | - | - | - | - |
| Zero-Shot gemma3:4b-it-qat (Instruction-Following & Quantised LLM) | 73.00 | 45.00 | - | - | - | - |
| Zero-Shot mistral:7b (General-Purpose LLM) | 74.00 | 42.00 | - | - | - | - |
| Zero-Shot tulu3:8b (Instruction-Following LLM) | 74.00 | 42.00 | - | - | - | - |
| Zero-Shot deepseek-r1:7b (Reasoning LLM) | 72.00 | 43.00 | - | - | - | - |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples per known class | 74.00 | 43.00 | 78.00 | 56.00 | 72.00 | 42.00 |
| Few-Shot QWEN3-30B-A3B (Mixture-of-Experts API LLM) with 5 examples per known class | 82.00 | 73.00 | 87.00 | 85.00 | 85.00 | 82.00 |
| Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 examples reused across input sentences | 94.19 | 52.27 | - | - | - | - |
| Hybrid Finetuned BERT --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 78.51 | 59.55 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE | 69.23 | 48.71 | - | - | - | - |
| Hybrid Finetuned BERT --> Trained VAE --> Few-Shot llama3.2:3b (Our Base Ollama/Local LLM) with 5 nearest examples to input sentence | 72.67 | 57.90 | - | - | - | - |

7.4 Threshold Test When Recall Drops

  • As an add-on for our project, we evaluated how the count of known intents affects the recall of an LLM
  • Note that we
    • performed the threshold test using only llama3.2:3b (our base Ollama model)
    • used the few-shot prompt method, with 'k' known intents and 1 example for every known intent class
    • input 100 question strings from each dataset's 'OOS' class into the LLM to predict an output class
  • As we increase the number of known intents for the model to classify
    • Precision on the 'OOS' class stays at 100%, BUT recall on the 'OOS' class decreases
    • Intuitively, as the number of known intent classes grows, and with some classes being similar to one another, it is reasonable to expect an LLM to become less likely to classify the input question strings as the 'OOS'/Open class
    • And this contradicts table 3 from the 2021 THUIAR paper, where the F1-score of the 'Open' class decreases as the percentage of known classes increases
(Chart: 'OOS' precision and recall as the count of known intents increases)

8. References

Here are the references underpinning the methods we explored above.

9. License

This project is licensed under the MIT License - see the LICENSE file for details.
