Would it be possible to include date information? #1

Open
johann-petrak opened this issue Feb 17, 2023 · 2 comments


@johann-petrak

Since topics, NEs etc. related to sexist remarks change over time, it would be interesting to have date information associated with the labeled and unlabeled texts. Would that be possible?

@paul-rottger

Hi Johann! Thanks for raising this and sorry for the late reply.

We cannot easily provide timestamps, but even if we could, I would not expect that information to be super useful because of how we sampled the data. All the comments we collected from Gab and Reddit were originally posted between August 2016 and October 2018. This is the time span of the Gab dump we used, and we chose to match it on Reddit. Beyond that, we did not account for time in our sampling. Therefore, there is likely a strong imbalance across time periods correlated with general activity (e.g. Gab started in 2016 and was much more active in 2018). Also, each individual month will have relatively little data. Based on my experience in other work, you need a fairly large and well-structured dataset to meaningfully investigate language change.

Sorry to not have better news! This is a super interesting problem, but not one we set up to investigate with this dataset.

Cheers,
Paul

@johann-petrak

johann-petrak commented Apr 13, 2023

Thank you! I was just wondering whether it would be technically possible to trace back the dates, because datasets like this one often get used in downstream research with different aims. Even if the time span is not that long or the data is a bit sparse, I have seen data like this get combined with other data, in which case date information can be useful. Also, an even distribution over time is not necessarily needed for all kinds of research; just knowing which time period the texts are from would be extremely useful.
So if there is a technical way to add date information, I still think it could eventually greatly benefit the research community.
This would be useful to have for both the labeled and unlabeled data.
