
Methodological error in zero cost, zero time, zero shot notebook #511

Open
stephantul opened this issue Apr 20, 2024 · 3 comments


stephantul commented Apr 20, 2024

Hi,

I was looking at the zero cost, zero time, zero shot notebook for financial sentiment analysis (i.e., this one), and discovered a methodological error that invalidates the conclusions of the distillation section.

What happens is that the train and test dataframes, i.e., the CSV files loaded from Moritz Laurer's blog, are created by splitting the train split of the dataset (the dataset doesn't have a test split). Later on, when distilling, the authors of the blog post reload the entire train split of the dataset and use it to distill the MLP. This means that the test data is also used to distill the model, which leads to a large overestimation of performance.
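To make the leak concrete, here is a minimal sketch of the data flow described above, with hypothetical dataframe names (the notebook's actual variable names may differ). The fix is simply to drop every test row from the reloaded split before distilling:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the dataset's single "train" split (hypothetical toy data).
full_train = pd.DataFrame({
    "text": [f"sentence {i}" for i in range(20)],
    "label": [i % 3 for i in range(20)],
})

# The blog's CSVs are produced by splitting this single split.
train_df, test_df = train_test_split(full_train, test_size=0.25, random_state=42)

# The bug: the distillation step reloads the *entire* split,
# so the test rows leak into the distillation data.
leaky_distill_df = full_train

# The fix: exclude every row that appears in the test set.
distill_df = full_train[~full_train["text"].isin(test_df["text"])]

# Sanity check: no overlap remains between distillation and test data.
assert distill_df["text"].isin(test_df["text"]).sum() == 0
```

With this filter in place, the distilled MLP is evaluated on genuinely unseen data, which is what produces the lower scores reported below.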

In my experiments, the original PRF score I got was:

```
(array([0.85507246, 0.97348485, 0.94166667]),
 array([0.96721311, 0.96981132, 0.88976378]),
 array([0.90769231, 0.97164461, 0.91497976]),
 array([ 61, 265, 127]))
```

This is close to the score reported in the article.
If I instead remove the test data from the data used to distill the MLP, I get much lower scores:

```
(array([0.76785714, 0.87632509, 0.78947368]),
 array([0.70491803, 0.93584906, 0.70866142]),
 array([0.73504274, 0.90510949, 0.74688797]),
 array([ 61, 265, 127]))
```

These scores are much lower than the reported scores, and also much lower than the LLM scores, which invalidates the conclusion of the notebook and article. Note that these scores are still a bit higher than the scores you would get when just directly optimizing cross entropy, so you could argue that the point still makes sense.
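For readers unfamiliar with the tuple format quoted above: it matches what scikit-learn's `precision_recall_fscore_support` returns with `average=None`, i.e. per-class precision, recall, F1, and support arrays (the final `(61, 265, 127)` being the per-class test counts). A small illustrative example with made-up labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy 3-class labels, purely for illustration of the output shape.
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 0]

# With average=None, each returned array has one entry per class,
# in the same order as the arrays quoted in this issue.
p, r, f, support = precision_recall_fscore_support(y_true, y_pred, average=None)
print(support)  # per-class counts, analogous to (61, 265, 127) above
```

The identical support array in both runs confirms the evaluation set itself was unchanged; only the distillation data differed.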

If you want I can do a PR on the notebook.

@tomaarsen
Member

@MosheWasserb

@MosheWasserb
Collaborator

Hi @tomaarsen, sorry I missed your message :(
Great catch.
Yes, go ahead and open a PR.

@stephantul
Author

Hey @MosheWasserb ,

Thanks for replying, really appreciated.

Before I submit a PR, could we discuss what the final conclusion of the article should look like? The part after the dataset is reloaded no longer holds. Should I just remove those parts?
