Bridging the Emotional Gap: Enhancing Compact Language Model’s Emotional Intelligence through Targeted Fine-Tuning

Overview

UCL COMP0087 Statistical Natural Language Processing 2023-24 Group Project

This study explores enhancing the emotional intelligence of compact language models, such as GPT-2 variants, through systematic data augmentation, fine-tuning, and detailed evaluation on the BabyLM and Alpaca datasets. Our work uses data-processing techniques to enrich the training data with emotional depth, improving small models' responsiveness on emotional-intelligence tasks.

NLPeekaboo

Setup and Installation

git clone https://github.com/chantomkit/COMP0087_SNLP
cd COMP0087_SNLP
pip install -r requirements.txt

Key Components and Scripts

Data Preparation and Augmentation

  • process_pretrain_data.py
    • Purpose: Processes BabyLM data to prepare it for augmentation.
    • Output: Processed data files ready for augmentation.
  • pretrain_data_augment.ipynb
    • Purpose: Augments the processed data using the Mistral-7B model to enrich its emotional depth (see the sketch after this list).
    • Output: Augmented data chunks saved in babylm_augment/.
  • pretrain_data_corpus.ipynb
    • Purpose: Compiles the augmented data into a final corpus, integrating additional wiki data from BabyLM to create a developmentally appropriate training dataset.
    • Output: Final corpus and its composition logs stored in babylm_pretrain_corpus/.
  • alpaca-instruction/emotion-alpaca.py
    • Purpose: Augments the Alpaca instruction data using the Mistral-7B model to enrich its emotional granularity.
    • Output: Emotionally augmented alpaca data saved in alpaca-instruction/.
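
At its core, the augmentation step in pretrain_data_augment.ipynb and emotion-alpaca.py prompts Mistral-7B to rewrite each passage with more explicit emotional content. The sketch below shows one way this can look; the checkpoint name, prompt wording, and generation settings are assumptions, not the notebooks' exact code.

# Illustrative augmentation loop; the real prompts and settings live in
# pretrain_data_augment.ipynb. The checkpoint name is an assumption.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

def augment(passage):
    # Ask the model to rewrite a passage with richer emotional content.
    prompt = (
        "[INST] Rewrite the following passage so that the characters' "
        "emotions are expressed more explicitly, keeping the vocabulary "
        "simple:\n\n" + passage + " [/INST]"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
    # The pipeline echoes the prompt; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()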

Model Fine-Tuning and Development

  • casual_lm.ipynb
    • Purpose: Applies causal language modeling fine-tuning on the augmented BabyLM dataset to improve the model's ability to generate emotionally rich responses.
    • Output: Fine-tuned models ready for further development, evaluation, and deployment.
  • fine-tune/sft_instruction.ipynb
    • Purpose: Performs supervised fine-tuning on the models using the emotionally augmented Alpaca dataset.
    • Output: Fine-tuned models ready for further development, evaluation, and deployment.

Each notebook targets a different aspect of language model fine-tuning, enhancing emotional intelligence for a specific data context: sft_instruction.ipynb refines response quality on instruction-following tasks using the augmented Alpaca data, while casual_lm.ipynb focuses on causal (free-form) generation tailored to the linguistic and emotional development reflected in the BabyLM data. A minimal sketch of the causal-LM setup follows.
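
For reference, a causal-LM fine-tuning setup with the Hugging Face Trainer might look like the sketch below; the checkpoint, corpus path, and hyperparameters are assumptions, and the project's actual configuration lives in casual_lm.ipynb.

# Illustrative causal-LM fine-tuning; paths and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical file name for the compiled corpus in babylm_pretrain_corpus/.
dataset = load_dataset("text", data_files={"train": "babylm_pretrain_corpus/corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-emotional", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()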

Augmented Dataset Analysis

  • prompt_analysis_babylm.ipynb
    • Purpose: Analyzes the diversity and effectiveness of prompts used in the augmented BabyLM dataset.
    • Output: Statistics and insights into the variation and richness of the augmented BabyLM dataset.
  • prompt_analysis.ipynb
    • Purpose: Analyzes the diversity and effectiveness of prompts used in the augmented Alpaca dataset (a sketch of one such diversity metric follows this list).
    • Output: Statistics and insights into the variation and richness of the augmented Alpaca dataset.
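
One simple way to quantify prompt diversity is the distinct-n ratio (unique n-grams over total n-grams). The helper below is a sketch of that metric, not necessarily the exact statistic the analysis notebooks compute.

# Sketch of a prompt-diversity statistic (distinct-n: unique n-grams over
# total n-grams; higher means more varied prompts).
from collections import Counter

def distinct_n(texts, n):
    ngrams = Counter()
    for text in texts:
        tokens = text.split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Toy usage: a fully duplicated prompt halves the distinct-2 ratio.
prompts = ["Describe a time you felt proud.", "Describe a time you felt proud."]
print(distinct_n(prompts, 2))  # 0.5: every bigram appears twice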

Model Emotional Evaluation and Analysis

  • quantitative_analysis/quantitative_analysis.ipynb
    • Purpose: Provides a quantitative assessment of model outputs, evaluating how different input emotions influence output sentiment for both the original and fine-tuned models (see the sketch after this list).
    • Output: Sentiment analysis results detailing the frequency and type of sentiment outputs for each input emotion.
  • model_analysis.ipynb
    • Purpose: Analyzes the fine-tuned models, assessing gains in emotional intelligence relative to the original models.
    • Output: Comparative visualizations of model internals, such as logit-lens plots.
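
Conceptually, the quantitative evaluation generates a continuation for each emotion-tagged prompt and tallies the sentiment of the continuation. The sketch below illustrates this with off-the-shelf Hugging Face pipelines; the prompt set and sentiment classifier are assumptions, not the notebook's exact choices.

# Sketch of the evaluation loop: generate a reply per emotion-tagged prompt
# and tally reply sentiment. Prompts and classifier here are assumptions.
from collections import Counter, defaultdict
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")  # default SST-2 classifier

prompts = {
    "joy": "I just got some wonderful news today and",
    "sadness": "I have been feeling very lonely lately and",
}

counts = defaultdict(Counter)
for emotion, prompt in prompts.items():
    reply = generator(prompt, max_new_tokens=40)[0]["generated_text"][len(prompt):]
    counts[emotion][sentiment(reply)[0]["label"]] += 1

for emotion, dist in counts.items():
    print(emotion, dict(dist))  # e.g. joy {'POSITIVE': 1}

Aggregated over a full emotion-tagged prompt set, counts like these give roughly the per-emotion sentiment distributions the notebook reports.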
