
Commit

update training procedure
peternutter committed Dec 21, 2023
1 parent 969c0be commit 09fce67
Showing 2 changed files with 3 additions and 1 deletion.
1 change: 1 addition & 0 deletions report/sections/methodology.tex
@@ -27,6 +27,7 @@ \subsection*{Phase 2: Transferring Knowledge via Finetuning}

% Training
Training is performed on the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets for a maximum of 100 epochs. We use a 30\% held-out validation split from the \texttt{crowdsourced} dataset to monitor the validation F1 score and stop training if no improvement is observed for 10 epochs; this prevents overfitting to the LLM labels. We search the hyperparameter space using the Bayesian TPE sampler from Optuna~\cite{optuna}, running $\eta=100$ trials with $\tau=10$ startup trials. The hyperparameter values are detailed in Table~\ref{tab:hyperparameters}. The model that performs best on macro F1 on the validation split is chosen for evaluation.
The training loss is defined as the average binary cross-entropy over the 14 classes, with a per-class reweighting factor, set to the negative-to-positive sample ratio, to address class imbalance.
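
As a rough illustration of this loss, the reweighting can be expressed through the pos_weight argument of PyTorch's BCEWithLogitsLoss. This is a minimal sketch assuming a PyTorch training setup; the names make_weighted_bce, labels, logits, and targets are placeholders, not identifiers from the repository.

import torch

def make_weighted_bce(labels: torch.Tensor) -> torch.nn.BCEWithLogitsLoss:
    # labels: binary matrix of shape (num_samples, 14), one column per class.
    num_pos = labels.sum(dim=0)                   # positive samples per class
    num_neg = labels.shape[0] - num_pos           # negative samples per class
    pos_weight = num_neg / num_pos.clamp(min=1)   # negative-to-positive ratio
    # The default 'mean' reduction averages the reweighted BCE over all classes.
    return torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Usage sketch: loss = make_weighted_bce(train_labels)(logits, targets)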

\input{tables/hyperparameters.tex}
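
For concreteness, the search described above could be driven as in the Optuna sketch below. The sampler and study calls are real Optuna APIs; the search space and the train_and_validate helper (train for at most 100 epochs, stop after 10 epochs without validation-F1 improvement, return the best validation macro F1) are hypothetical stand-ins for the actual configuration listed in the hyperparameter table.

import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space; the real ranges are in the hyperparameter table.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    # Placeholder: trains the model with early stopping (patience 10) and
    # returns the best validation macro F1 score.
    return train_and_validate(lr=lr, batch_size=batch_size)

# tau = 10 random startup trials before the TPE model takes over.
sampler = optuna.samplers.TPESampler(n_startup_trials=10)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)  # eta = 100 trials
print(study.best_params)  # configuration with the highest validation macro F1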

3 changes: 2 additions & 1 deletion report/sections/summary.tex
@@ -1,3 +1,4 @@
\section{Summary}\label{sec:summary}

We have demonstrated that LLMs can provide cost-effective, high-quality annotations in the setting of multilingual, multilabel website topic classification. Our approach, which involved finetuning a pre-trained Homepage2vec model on LLM-generated labels, resulted in an improvement of 4.3 percentage points in the macro F1 score. Additionally, we are releasing the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets \cite{curlie-gpt-10k} with the intention of supporting further research in the open-source community.
We have demonstrated that LLMs can provide cost-effective, high-quality annotations in the setting of multilingual, multilabel website topic classification. Our approach, which involved finetuning a pre-trained Homepage2vec model on LLM-generated labels, resulted in an improvement of 4.3 percentage points in the macro F1 score.
Additionally, the \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} datasets \cite{curlie-gpt-10k} are being released to aid open-source research.
