Commit
Merge branch 'main' of github.com:CS-433/ml-project-2-mlp
peternutter committed Dec 21, 2023
2 parents 5b5ef12 + 2a348c5 commit 969c0be
Showing 11 changed files with 131 additions and 217 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -100,7 +100,7 @@ The command triggers the following steps:
5. Finetune Homepage2Vec on the train dataset while validating and evaluating on splits from the test dataset.


- 📣 **Important:** By default the repository only contains the raw URLs for a website corpus. Thus, running this script will first scrape, process and embed all the webpages for the dataset and then subsequently annotate it with the given labeler. For the `curlie` dataset, this will take a significant amount of time. To test reproducibility, you can download the entire compressed `data` folder from [Google Drive](https://drive.google.com/file/d/1ts8nDp21JrN1oqyiLihQeIWgzSs7lDp4/view?usp=sharing). The folder contains all scraped, processed and embedded websites and the labels from all labelers considered in this study. Uncompress the folder and put it in the correct location and re-run the above command to run a finetuning run.
+ 📣 **Important:** By default the repository only contains the raw URLs for a website corpus. Thus, running this script will first scrape, process and embed all the webpages for the dataset and then subsequently annotate it with the given labeler. For the `curlie` dataset, this will take a significant amount of time. To test reproducibility, you can download the entire compressed `data` folder from [Google Drive](https://drive.google.com/file/d/1tKRNv9PtUG13Z_hZ1JO_4foyMNG6rsuT/view?usp=sharing). The folder contains all scraped, processed and embedded websites and the labels from all labelers considered in this study. Uncompress the folder and put it in the correct location and re-run the above command to run a finetuning run.
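For readers reproducing this, the download-and-extract step can be scripted. The sketch below is a hypothetical helper, not part of the repository: it assumes the archive is a zip file, that the third-party `gdown` package is installed, and it reuses the file ID from the Drive link above.

```python
# Hypothetical download helper (not from this repo). Assumes `pip install gdown`
# and a zip archive; adjust the extraction step if the archive is a tarball.
import zipfile

import gdown

FILE_ID = "1tKRNv9PtUG13Z_hZ1JO_4foyMNG6rsuT"  # taken from the Drive link above

archive = gdown.download(id=FILE_ID, output="data.zip", quiet=False)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(".")  # should leave the uncompressed `data/` folder in the repo root
```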

Finally, we provide convenience bash scripts that exactly reproduce the experiments in the report. In the first phase, we aim to find the best labeler by re-annotating the `crowdsourced` data with all GPT labelers. Run this script as follows:

244 changes: 69 additions & 175 deletions notebooks/analysis.ipynb

Large diffs are not rendered by default.

Binary file modified report/figures/finetune-results.pdf
Binary file not shown.
7 changes: 4 additions & 3 deletions report/main.tex
@@ -56,7 +56,7 @@
BoldItalicFont = {Times New Roman Bold Italic}
]

- \setlength{\parskip}{.5em}
+ \setlength{\parskip}{.3em}

% ---- Main document
\begin{document}
@@ -78,20 +78,21 @@

\begin{figure}[!h]
\centering
- \includegraphics[width=.8\columnwidth]{figures/labeler-grid.pdf}
+ \includegraphics[width=.6\columnwidth]{figures/labeler-grid.pdf}
\caption{\textbf{Labeler Parameter Grid.} The Figure displays the mean macro F1 score for all unique parameter combinations of the LLM labelers. For example, the top-right cell shows the average macro F1 score for all labelers that use GPT-3.5 with \texttt{1-shot} across all contexts.}
\label{fig:labelers-grid}
\end{figure}
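The aggregation behind such a grid is a simple group-by. The following is a minimal sketch with made-up run records; the column names and scores are placeholders, not the project's actual logging schema.

```python
# Each cell of the grid = mean macro F1 over all labelers sharing a
# (model, shots) combination, averaged across contexts.
import pandas as pd

runs = pd.DataFrame({
    "model":    ["gpt-3.5", "gpt-3.5", "gpt-4", "gpt-4"],
    "shots":    [0, 1, 0, 1],
    "context":  [1, 2, 2, 3],
    "macro_f1": [0.31, 0.35, 0.44, 0.46],  # made-up scores
})
grid = runs.pivot_table(index="model", columns="shots",
                        values="macro_f1", aggfunc="mean")
print(grid)
```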

\input{sections/results}

\input{sections/discussion}
+ \input{sections/limitations.tex}
\input{sections/summary}

\newpage
\bibliographystyle{plainnat}
\bibliography{literature}

\newpage
\appendix

\input{sections/appendix}
2 changes: 1 addition & 1 deletion report/sections/abstract.tex
@@ -1,6 +1,6 @@
\thispagestyle{empty} % To prevent number on the first page
\begin{abstract}

- Homepage2Vec~\cite{homepage2vec}, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average only contains a single topic for a website. This study explores the use of Large Language Models (LLMs) for creating a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one through comparison to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, for finetuning \nobreak{Homepage2Vec}. We show that finetuning \nobreak{Homepage2Vec} with these datasets improves its macro F1 from 38\% to 42\%. We release both LLM-annotated datasets \cite{curlie-gpt-10k} publicly to encourage further research in this area.
+ Homepage2Vec~\cite{homepage2vec}, a state-of-the-art open-source model for multilingual, multilabel website classification, has proven powerful in accurately classifying website topics. However, it is limited by its initial training data, which on average only contains a single topic for a website. This study explores the use of Large Language Models (LLMs) for creating a high-quality finetuning dataset that more accurately reflects the topic diversity of a website. We assess various LLM-based labelers and select the best one through comparison to crowdsourced annotations. We generate two variants of a new 10,000-website dataset, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, for finetuning \nobreak{Homepage2Vec}. We show that finetuning \nobreak{Homepage2Vec} with these datasets improves its macro F1 from 39\% to 43\%. We release both LLM-annotated datasets \cite{curlie-gpt-10k} publicly to encourage further research in this area.

\end{abstract}
31 changes: 25 additions & 6 deletions report/sections/appendix.tex
@@ -6,15 +6,14 @@ \subsection{Acknowledgements}\label{appendix:acknowledgements}

% -------- Ethical considerations
\subsection{Ethical Considerations}\label{appendix:ethical-considerations}
- This study employs the Curlie dataset, managed by dedicated volunteers and moderators ensuring its content remains legal and free from marketing schemes.
- To further support these efforts, we are releasing the re-labeled datasets \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} to the public.
-
- Additionally, we employed the \texttt{crowdsourced} dataset, originally created by Amazon Mechanical Turk workers for the homepage2vec paper \cite{homepage2vec}.
- These workers were compensated in accordance with ethical standards and minimum wage requirements set by the Fair Work platform \cite{ethics2}.
+ This study employs the Curlie dataset, managed by dedicated volunteers and moderators ensuring its content remains legal and free from marketing schemes. To further support these efforts, we are releasing the re-labeled datasets \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k} to the public.
+
+ Additionally, we employed the \texttt{crowdsourced} dataset, originally created by Amazon Mechanical Turk workers for the homepage2vec paper~\cite{homepage2vec}.
+ These workers were compensated in accordance with ethical standards and minimum wage requirements set by the Fair Work platform~\cite{ethics2}.

The use of LLMs for annotation, while efficient, raises concerns regarding the economic impact on human annotators who depend on such tasks for their livelihood.
- It is imperative to ensure that this process supplements, rather than replaces, human annotators. In this context, providing platforms like Dynamo \cite{ethics1} for Amazon Mechanical Turk workers to communicate and organize is crucial.
- Additionally, it is crirical to maintain these principles and be cautious of influences from large entities that may hinder the efforts of workers to organize and advocate for their rights.
+ It is imperative to ensure that this process supplements, rather than replaces, human annotators. In this context, providing platforms like Dynamo~\cite{ethics1} for Amazon Mechanical Turk workers to communicate and organize is crucial. Additionally, it is critical to maintain these principles and be cautious of influences from large entities that may hinder the efforts of workers to organize and advocate for their rights.

Moreover, the extensive datasets training LLMs may contain biases, potentially influencing the labeling process and perpetuating stereotypes or inequalities.
It's essential to address these biases to maintain fairness and uphold ethical standards in automated systems.
@@ -82,3 +81,23 @@ \subsection{Example for a \texttt{1-shot} model}\label{app:example-1-shot}
...
}
\end{lstlisting}
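To make the labeler setup concrete, here is a minimal sketch of how a 1-shot request could be issued against the OpenAI chat API. The prompt strings, model name, and output schema are placeholders, not the prompts actually used in the study.

```python
# Hypothetical 1-shot labeler call (placeholder prompts, not the study's).
# The assistant turn supplies the worked example that makes it "1-shot".
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Label the website with the 14 Curlie topics. Reply as JSON."},
        {"role": "user", "content": "<context features of the example website>"},
        {"role": "assistant", "content": '{"labels": ["Business"]}'},
        {"role": "user", "content": "<context features of the website to annotate>"},
    ],
)
labels = json.loads(response.choices[0].message.content)["labels"]
print(labels)
```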

+ \subsection{Best Hyperparameters}\label{app:hyperparameters}
+
+ Table~\ref{tab:best-hyperparameters} shows the best hyperparameters found for finetuning Homepage2Vec on labels from the GPT-3.5 and GPT-4 labelers.
+
+ \begin{table}[h]
+ \centering
+ \caption{\textbf{Best Hyperparameters.} Details the best hyperparameters found for finetuning Homepage2Vec on labels from the GPT-3.5 and GPT-4 labelers. Notation follows Section~\ref{sec:methodology}.}
+ \begin{tabular}{lcccc}
+ \toprule
+ \textbf{Model} & $\lambda$ & $\beta$ & $\gamma$ & $\delta$ \\
+ \midrule
+ GPT-3.5 & 1.6e-5 & 6.4e-2 & 3.7e-1 & 64 \\
+ GPT-4 & 1.5e-3 & 2.5e-4 & 4.6e-1 & 64 \\
+ \bottomrule
+ \end{tabular}
+ \label{tab:best-hyperparameters}
+ \end{table}
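Matching the hyperparameter comments added to results.tex further down in this commit, λ, β, γ, and δ appear to denote learning rate, weight decay, scheduler factor, and batch size. Under that inferred mapping, a finetuning setup for the GPT-4 row might be sketched as follows; the stand-in model, embedding size, and random tensors are illustrative only.

```python
# Sketch only: a stand-in for Homepage2Vec's classifier head (14 Curlie topics).
# Hyperparameters are the GPT-4 row of the table; the lambda/beta/gamma/delta
# mapping is inferred from comments elsewhere in this commit, and the input
# width (768) is a placeholder for the real feature dimension.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 14))

optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-3, weight_decay=2.5e-4)  # lambda, beta
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=4.6e-1)  # gamma
criterion = nn.BCEWithLogitsLoss()  # multilabel objective

# One illustrative step on a random batch of size 64 (delta).
x, y = torch.randn(64, 768), torch.randint(0, 2, (64, 14)).float()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step(loss.item())  # plateau scheduler steps on a monitored metric
```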
2 changes: 1 addition & 1 deletion report/sections/introduction.tex
@@ -6,4 +6,4 @@ \section{Introduction}

In summary, our work contributes in three key areas. Firstly, we demonstrate the use of LLMs to obtain high-quality annotations for multilingual multilabel website classification. Secondly, we enhance Homepage2vec's performance through finetuning on LLM-annotated data. Lastly, we release two LLM-annotated datasets \cite{curlie-gpt-10k}, \texttt{curlie-gpt3.5-10k} and \texttt{curlie-gpt4-10k}, facilitating further advancements in the field of multilingual website classification.

- \textit{The code and experiments are available on \href{https://github.com/CS-433/ml-project-2-mlp}{GitHub} and \href{https://wandb.ai/ml-project-2-mlp/homepage2vec}{W\&B}.}
+ \textit{The code and experiments are available on \href{https://github.com/CS-433/ml-project-2-mlp}{GitHub} and \href{https://wandb.ai/ml-project-2-mlp/homepage2vec}{W\&B}. You can find a demo \href{https://huggingface.co/spaces/ludekcizinsky/homepage2vec}{here}.}
report/sections/discussion.tex → report/sections/limitations.tex
@@ -1,5 +1,5 @@
- \section{Discussion}
- \label{sec:discussion}
+ \section{Limitations \& Future Work}
+ \label{sec:limitations}

A significant difficulty in website topic classification is the scarcity of extensive, high-quality open source datasets. The subjective nature of multilabel website classification often leads to ambiguous ground truths, as evidenced by the low inter-annotator agreement scores observed within the \texttt{crowdsourced} dataset. To enhance model learning and performance, the development of more precise and narrowly defined category scopes is essential.

42 changes: 21 additions & 21 deletions report/sections/results.tex
@@ -2,22 +2,18 @@ \section{Results}

\begin{figure*}[h]
\centering
- \includegraphics[width=\textwidth]{./figures/finetune-results.pdf}
+ \includegraphics[width=1\textwidth]{./figures/finetune-results.pdf}
\caption{\textbf{Finetune Results.} Class-wise F1 score for the pre-trained model and the finetuned model on the original crowdsourced data.}
\label{fig:finetune-results}
\end{figure*}


\subsection*{Phase 1: Identifying an Optimal LLM Labeler}

- Table~\ref{tab:labeler-results} shows the results of re-labelling \texttt{crowdsourced} dataset. Our findings demonstrate that LLM labelers can provide \textit{consistent}, \textit{cost-effective}, and \textit{high-quality} annotations for the complex task of multilingual, multilabel website topic classification.
-
- % Consistency
- Remarkably, not a single incorrect output was produced, underscoring the reliability the models in annotating websites.
-
+ Table~\ref{tab:labeler-results} shows the results of re-labelling the \texttt{crowdsourced} dataset. Our findings demonstrate that LLM labelers can provide \textit{cost-effective} and \textit{high-quality} annotations for the complex task of multilingual, multilabel website topic classification.

% Cost
- In terms of cost, the labeling of the \texttt{crowdsourced} corpus cost approximately \$130 per 1000 pages. Our approach, utilising GPT-3.5 and GPT-4 labelers, drastically reduces this cost to an average of \$0.54 and \$6.44, respectively, achieving a reduction by factors of 240x and 20x.
+ The labeling cost for the \texttt{crowdsourced} corpus was around \$130 per 1,000 pages. By employing GPT-3.5 and GPT-4 labelers, we reduced the expense to merely \$0.54 and \$6.44 on average, respectively, achieving cost reductions of 240x and 20x.

% Calculations
% Human annotator cost: 327 USD
@@ -34,22 +30,19 @@ \subsection*{Phase 1: Identifying an Optimal LLM Labeler}
% GPT-3.5: 130 / 0.54 = 240x
% GPT-4: 130 / 6.44 = 20x

- % Performance
- Performance-wise, the best labeler, GPT-4 with \texttt{context3} and \texttt{1-shot}, achieves a macro F1 score of 46\% compared to the human annotations on the same dataset. Thus, the GPT labelers are better website classifiers than the baseline Homepage2Vec model, which achieves a macro F1 score of 39\% on the same dataset. This improvement gives us reason to believe that Homepage2Vec can learn from knowledge of the LLM labelers - the goal of the second phase of our study.
+ % Performance & Effect of Parameters
+ The best labeler, GPT-4 with \texttt{context3} and \texttt{1-shot}, achieves a macro F1 score of 46\% compared to the human annotations on the same dataset. It is therefore a better website classifier than the baseline Homepage2Vec model, which achieves a macro F1 score of 39\% on the same dataset. This improvement gives us reason to believe that Homepage2Vec can learn from the knowledge of the LLM labelers - the goal of the second phase of our study.
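For reference, the macro F1 reported here is the unweighted mean of per-topic F1 scores over binary indicator matrices. A toy sketch of the computation; the arrays are made up, not study data.

```python
# Macro F1 for multilabel output: compute F1 per topic, then average without
# weighting, so rare topics count as much as frequent ones.
import numpy as np
from sklearn.metrics import f1_score

y_human = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])  # toy crowdsourced labels
y_llm   = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])  # toy LLM labeler output

print(f1_score(y_human, y_llm, average="macro"))
```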

+ We find that label quality improves with increased information (context and few-shot examples) and model complexity, as shown in Figure~\ref{fig:labelers-grid}. A notable enhancement in label quality occurs when upgrading from \texttt{context1} to \texttt{context2}, and from GPT-3.5 to GPT-4. However, adding sentences and links in \texttt{context3} yields only minor improvements. This implies that solely using the domain and meta-tags in \texttt{context1} is insufficient for accurate topic prediction. Moreover, except for GPT-3.5 in \texttt{context1}, few-shot examples have limited impact on most labelers, suggesting that the task is sufficiently clear from the system prompt and website context.

- \input{tables/labeler-results.tex}
+ The range of labels assigned by annotators spans from 0.4 to 2.8. Models are reluctant to assign multiple topics to websites when provided with limited context. However, as more website information becomes available, the number of labels increases, aligning the annotations more closely with those made by humans, who had full website access during their annotation.

- % GPT labeler parameter grid
- \textbf{Labeler Parameter Grid.} Figure~\ref{fig:labelers-grid} visualises the effect of the labeler parameters on the annotation quality. As expected, we find that the quality of the labels increases with the amount of context provided and the complexity of the model used. Interestingly, the added features in \texttt{context3} (links and text) do not increase the annotation quality on average.
+ \input{tables/labeler-results.tex}


- % Cost-quality trade-of
- \textbf{Cost-Quality Trade-Off:} Our analysis reveals a positve trend between label quality and cost, attributable to the use of longer prompts or more sophisticated models. In the next phase, we aimed to select two labelers, one per model. In case of marginal improvements in label quality, we opted for the cheaper labeler.
- The best balance was achieved using \texttt{context2}; the GPT-3.5 labeler employed \texttt{1-shot}, whereas the GPT-4 used \texttt{0-shot}.
-
- % Curlie-10k dataset
- \textbf{Curlie-10k Dataset.} The average number of topics assigned to a page by the GPT 3.5 labeler is \textbf{1.6} and \textbf{2.03} for the GPT-4 labeler, which is both significantly than \textbf{1.07} for the original Curlie dataset. Figure~\ref{fig:curlie-10k-dist} shows the distribution of the labels in the re-labelled dataset compared to the original. We can see that, as hoped, more topics are assigned to each page. Interesting differences in the GPT-3.5 and GPT-4 labelers visible: the GPT-4 labeler tends to assign more websites to the topics that are less frequent in the original dataset, such as \textit{References}, \textit{Kids \& Teens} and \textit{Games}, leading to a more balanced distribution of topics. Surprisingly, the category \textit{Recreation} is assigned to a disproportionally high number of websites by the GPT-4 labeler.
+ \textbf{Curlie-10k Dataset.} Our analysis reveals a positive trend between label quality and cost, attributable to the use of longer prompts or more sophisticated models. In the next phase, we aimed to select two labelers, one per model. In case of marginal improvements in label quality, we opted for the cheaper labeler.
+ The best balance was achieved using \texttt{context2}; the GPT-3.5 labeler employed \texttt{1-shot}, whereas the GPT-4 used \texttt{0-shot}. The average number of topics assigned to a page by the GPT-3.5 labeler is \textbf{1.6} and \textbf{2.03} for the GPT-4 labeler, both significantly more than \textbf{1.07} for the original Curlie dataset. Figure~\ref{fig:curlie-10k-dist} shows the distribution of the labels in the re-labelled dataset compared to the original. We can see that, as hoped, more topics are assigned to each page. Interesting differences between the GPT-3.5 and GPT-4 labelers are visible: the GPT-4 labeler tends to assign more websites to the topics that are less frequent in the original dataset, such as \textit{References}, \textit{Kids \& Teens} and \textit{Games}, leading to a more balanced distribution of topics. Surprisingly, the category \textit{Recreation} is assigned to a disproportionally high number of websites by the GPT-4 labeler.
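The label-cardinality figures quoted in this paragraph are row sums over the binary page-by-topic label matrix. A toy sketch of the computation; the matrix is made up.

```python
# Average topics per page = mean row sum of the page-by-topic indicator matrix;
# column means give the per-topic frequencies behind the distribution plot.
import numpy as np

labels = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 0]])  # toy matrix: 3 pages x 4 topics

print(labels.sum(axis=1).mean())   # label cardinality, ~1.67 here
print(labels.mean(axis=0))         # per-topic assignment frequency
```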

\begin{figure}[!ht]
\centering
@@ -60,10 +53,17 @@ \subsection*{Phase 1: Identifying an Optimal LLM Labeler}

\subsection*{Phase 2: Transferring Knowledge via Finetuning}

- Table~\ref{tab:finetune-results} shows the results of the finetuning experiments.
- We observe that the model increases the recall significantly from 39.4\% to 47.6\% and /49\% when finetuned on GPT-3.5 and GPT-4 labels, respectively.
- However, this comes at cost of a minor decreases in precision. Overall, the macro F1 score from increases from 39.2\% to 42.6\% and 42.8\% - an improvement of 3.4 and 3.6 percentage points, respectively.
- This improvement shows that we were able to transfer the superior labeling capabilities of the LLM to Homepage2Vec, by finetuning on LLM-generated labels. Figure~\ref{fig:finetune-results} shows that the increase in macro F1 score is achieved by consistently acrosss the classes, with 12 out of the 14 classes improving.
+ % GPT-3.5:
+ % LR / Weight Decay / Scheduler Factor / Batch Size
+ % 0.000016 0.064037 0.376673 64
+ % 1.6e-05 / 6.40e-02 / 3.77e-01 / 64
+
+ % GPT-4:
+ % 0.001535 0.000252 0.460896 64
+ % 1.5e-03 / 2.52e-04 / 4.61e-01 / 64
+
+ Table~\ref{tab:finetune-results} shows the results of the finetuning experiments. We report only the results for the hyperparameter configuration with the best validation macro F1 score; the best hyperparameters are listed in Appendix~\ref{app:hyperparameters}. We observe that both models increase the recall from 39.4\% to 51.1\% and 46.4\% when finetuned on GPT-3.5 and GPT-4 labels, respectively. Overall, the macro F1 score increases from 39.2\% to 43.5\% and 43.1\% - an improvement of 4.3 and 3.9 percentage points, respectively.
+ This improvement shows that we were able to transfer the superior labeling capabilities of the LLM to Homepage2Vec. Figure~\ref{fig:finetune-results} shows that the increase in macro F1 score is achieved consistently across the classes, with 12 out of the 14 classes improving for both models.
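The class-wise breakdown in Figure~\ref{fig:finetune-results} corresponds to per-class F1, which scikit-learn exposes via `average=None`. A sketch with random toy predictions, not the study's data:

```python
# Per-class F1 (average=None) yields one score per topic, so counting how many
# of the 14 classes improved is a simple elementwise comparison.
import numpy as np
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(0)
y_true   = rng.integers(0, 2, size=(100, 14))  # toy ground truth
y_before = rng.integers(0, 2, size=(100, 14))  # toy pre-trained predictions
y_after  = rng.integers(0, 2, size=(100, 14))  # toy finetuned predictions

print(recall_score(y_true, y_after, average="macro"))  # macro recall
improved = f1_score(y_true, y_after, average=None) > f1_score(y_true, y_before, average=None)
print(int(improved.sum()), "of 14 classes improved")
```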

% 0.391610 = 39.2% (Pre-trained Homepage2Vec)
% 0.426289 = 42.6% (GPT-3.5) (+3.4 percentage points)