Update sbom
Signed-off-by: Arthit Suriyawongkul <[email protected]>
bact committed Nov 7, 2024
1 parent a2d94cb commit 96b66d4
Showing 3 changed files with 106 additions and 72 deletions.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -29,7 +29,7 @@ authors:
given-names: "Can"
orcid: "https://orcid.org/0000-0002-7090-0536"
title: "PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label"
version: v1.0.1
version: v1.1
license: CC0-1.0
identifiers:
- description: This is the collection of archived snapshots of all versions of the dataset
@@ -39,4 +39,4 @@ identifiers:
type: doi
value: "10.5281/zenodo.3457447"
repository: "https://github.com/PyThaiNLP/wisesight-sentiment/"
date-released: 2024-11-06
date-released: 2024-11-07
122 changes: 68 additions & 54 deletions README.md
@@ -7,49 +7,47 @@ SPDX-License-Identifier: CC0-1.0
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3457446.svg)](https://doi.org/10.5281/zenodo.3457446)

ข้อความภาษาไทยจากสื่อสังคมออนไลน์ พร้อมกับป้ายกำกับความรู้สึก (บวก, กลางๆ, ลบ, คำถาม) รวม 26,737 ข้อความ
**เผยแพร่เป็นสมบัติสาธารณะ** ภายใต้[สัญญาอนุญาต Creative Commons Zero v1.0 Universal][cc]
**เผยแพร่เป็นสมบัติสาธารณะ** โดยการสละสิทธิ์ตาม [CC0 1.0 Universal][cc0]

Social media messages in Thai language with sentiment label (positive, neutral, negative, question).
**Released to public domain** under [Creative Commons Zero v1.0 Universal license][cc].
**Dedicated to the public domain** under [CC0 1.0 Universal][cc0].

[cc]: https://creativecommons.org/publicdomain/zero/1.0/
[cc0]: https://creativecommons.org/publicdomain/zero/1.0/

## Changelog

- 2024-11-06 - Release v1.0.1 with updated contributors and software bill of materials (SBOM)
- 2020-12-01 - Add Hugging Face format - [PR #7](https://github.com/PyThaiNLP/wisesight-sentiment/pull/7)
- 2019-10-01 - Fix path in data preparation notebook - [PR #6](https://github.com/PyThaiNLP/wisesight-sentiment/pull/6)
- 2019-08-22 - Add tokenization annotation for ~1,000 samples - [PR #4](https://github.com/PyThaiNLP/wisesight-sentiment/pull/4)
- 2019-07-03 - Add tokenization annotation for 160 samples - [PR #2](https://github.com/PyThaiNLP/wisesight-sentiment/pull/2)
- 2019-03-31 - Update data
- 2024-11-07: Released v1.1 with updated copyright text, contributors, and a software bill of materials (SBOM).
- 2020-12-01: Added Hugging Face format - [PR #7](https://github.com/PyThaiNLP/wisesight-sentiment/pull/7)
- 2019-10-01: Fixed path in data preparation notebook - [PR #6](https://github.com/PyThaiNLP/wisesight-sentiment/pull/6)
- 2019-08-22: Added tokenization annotation for ~1,000 samples - [PR #4](https://github.com/PyThaiNLP/wisesight-sentiment/pull/4)
- 2019-07-03: Added tokenization annotation for 160 samples - [PR #2](https://github.com/PyThaiNLP/wisesight-sentiment/pull/2)
- 2019-03-31: Updated data.

## Related corpus
## Related datasets

- For `wisesight-160` and `wisesight-1000`, which are samples from this corpus in a tokenized form,
see <https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization>

- For data exploration and classification examples,
see [Thai Text Classification Benchmarks](https://github.com/PyThaiNLP/classification-benchmarks).

- Also available as Huggingface datasets:
see [word-tokenization](./word-tokenization/) directory.
- Also available at Hugging Face Datasets:
- [wisesight_sentiment](https://huggingface.co/datasets/wisesight_sentiment)
(using the earlier version of this corpus)
- [wisesight1000](https://huggingface.co/datasets/wisesight1000)
- For data exploration and classification examples,
see [Thai Text Classification Benchmarks](https://github.com/PyThaiNLP/classification-benchmarks).

## Source

- Size: 26,737 messages
- Language: Central Thai
- Style: Informal and conversational. With some news headlines and advertisement.
- Time period: Approximately 2016 to early 2019, with a small amount from other periods.
- Domains: Mixed. Majority are consumer products and services
- **Size:** 26,737 messages.
- **Language:** Central Thai.
- **Style:** Informal and conversational. With some news headlines and advertisement.
- **Time period:** Approximately 2016 to early 2019, with a small amount from other periods.
- **Domains:** Mixed. Majority are consumer products and services
(restaurants, cosmetics, drinks, car, hotels), with some current affairs.
- Privacy:
- **Privacy:**
- Only messages that made available to the public on the internet
(websites, blogs, social network sites).
- For Facebook, this means the public comments (everyone can see) that made on a public page.
- Private/protected messages and messages in groups, chat, and inbox are not included.
- Alternations and modifications:
- **Alternations and modifications:**
- Keep in mind that this corpus does not statistically represent anything in the language register.
- Large amount of messages are not in their original form. Personal data are removed or masked.
- Duplicated, leading, and trailing whitespaces are removed.
@@ -63,19 +61,19 @@ Social media messages in Thai language with sentiment label (positive, neutral,

## Corpus file structure

- All files are UTF-8 encoded plaintext
- All files are UTF-8 encoded plaintext.
- One message per line. A newline character in the original message will be replaced with a space.
- `q.txt` Questions (575 messages)
- `neg.txt` Message with negative sentiment (6,823)
- `neu.txt` Message with neutral sentiment (14,561)
- `pos.txt` Message with positive sentiment (4,778)
- `q.txt`: Questions (575 messages).
- `neg.txt`: Message with negative sentiment (6,823).
- `neu.txt`: Message with neutral sentiment (14,561).
- `pos.txt`: Message with positive sentiment (4,778).
- The legacy dataset in Kaggle competition format is also provided inside `kaggle-competition/` directory:
- `train.txt` - Message for training (24,066 messages)
- `train_label.txt` - Label for training. Each line is the label corresponding to the same line in `train.txt`
- `train.txt`: Message for training (24,066 messages).
- `train_label.txt`: Label for training. Each line is the label corresponding to the same line in `train.txt`.
- `test.txt` - Message for testing (2,674 messages)
- `test_label.txt` - Label for testing. Each line is the label corresponding to the same line in `test.txt`
- `test_majority.csv` - Sample submission in Kaggle format. Contains `neu` class as all the predictions.
- `test_solution.csv` - Test solution in Kaggle format.
- `test_label.txt`: Label for testing. Each line is the label corresponding to the same line in `test.txt`.
- `test_majority.csv`: Sample submission in Kaggle format. Contains `neu` class as all the predictions.
- `test_solution.csv`: Test solution in Kaggle format.
- Sample code for data exploration, training, and prediction are also provided.
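
A minimal sketch of reading the files described above, assuming only what the list states: UTF-8 plain text, one message per line, and label files aligned line by line with their text files. The function names `read_lines`, `load_corpus`, and `load_aligned`, and the `LABEL_FILES` mapping, are hypothetical helpers for illustration and are not part of this repository.

```python
from pathlib import Path

# Label name -> file name, taken from the file list above.
# The short label names themselves are an assumption of this sketch.
LABEL_FILES = {
    "q": "q.txt",
    "neg": "neg.txt",
    "neu": "neu.txt",
    "pos": "pos.txt",
}


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_corpus(corpus_dir):
    """Load (message, label) pairs from the four per-class files."""
    corpus_dir = Path(corpus_dir)
    pairs = []
    for label, filename in LABEL_FILES.items():
        for message in read_lines(corpus_dir / filename):
            pairs.append((message, label))
    return pairs


def load_aligned(text_path, label_path):
    """Pair line i of a text file with line i of its label file."""
    texts = read_lines(text_path)
    labels = read_lines(label_path)
    if len(texts) != len(labels):
        raise ValueError(
            f"{text_path} has {len(texts)} lines but "
            f"{label_path} has {len(labels)}"
        )
    return list(zip(texts, labels))
```

For the Kaggle-format files, `load_aligned("kaggle-competition/train.txt", "kaggle-competition/train_label.txt")` would pair each training message with its label. The length check is there because a line-aligned format silently mislabels every following message if one file loses a line.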

## Personal data
@@ -101,24 +99,38 @@ Social media messages in Thai language with sentiment label (positive, neutral,

## Copyright and Disclaimer

- If applicable, copyright of each message content belongs to the original poster.
- **Annotation data (labels) are released to public domain.**
- [Wisesight (Thailand) Co., Ltd.](https://github.com/wisesight/) helps facilitate the annotation,
but does not necessarily agree upon the labels made by the human annotators.
This annotation is for research purpose and does not reflect the professional work that Wisesight has been done for its customers.
- The human annotator does not necessarily agree or disagree with the message.
Likewise, the label he/she made to the message does not necessarily reflect his/her personal view towards the message.
This dataset contains social media text extracted from publicly accessible
sources on the internet. The selection, organization, and curation of this
dataset are original works that were previously copyrighted.
However, the copyright holder has waived all rights to this dataset and
dedicated it to the public domain under the
[Creative Commons Zero v1.0 Universal Public Domain Dedication][cc0].

Any trademarks or trade names appearing in the messages belong to their
respective owners.

[Wisesight (Thailand) Co., Ltd.](https://wisesight.com/) has assisted in the
collection and sentiment labeling of this dataset, but does not necessarily
endorse the labels assigned by human annotators.
These annotations are for research purposes only and do not represent the
professional work Wisesight performs for its clients.

Please note that human annotators may not personally agree or disagree with
the messages they label. Additionally, the labels assigned do not necessarily
reflect their personal opinions on the content.

*You are free to use this dataset for any purpose, without any restrictions.*

## Citation

Please cite the following if you make use of the dataset:

> Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, Nitchakarn Chantarapratin, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, Attapol Rutherford, Charin Polpanumas, and Can Udomcharoenchaikit. 2024. **PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label (Version 1.0.1)** November.
> Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, Nitchakarn Chantarapratin, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, Attapol Rutherford, Charin Polpanumas, and Can Udomcharoenchaikit. 2024. **PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label (Version 1.1)** November.
BibTeX:

```
@software{Suriyawongkul_PyThaiNLP_Wisesight_Sentiment_Corpus_2020,
```bibtex
@misc{Suriyawongkul_PyThaiNLP_Wisesight_Sentiment_Corpus_2020,
author = {Suriyawongkul, Arthit and
Chuangsuwanich, Ekapol and
Chormai, Pattarawat and
@@ -135,20 +147,22 @@ BibTeX:
publisher = {Zenodo},
title = {{PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label}},
url = {https://doi.org/10.5281/zenodo.3457446},
version = {v1.0.1},
version = {v1.1},
year = 2024
}
```

## Acknowledgement

- Thanks [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) community and
[Kitsuchart Pasupa](https://www.it.kmitl.ac.th/~kitsuchart/)
(Faculty of Information Technology, King Mongkut's Institute of Technology
Ladkrabang) for advice.
- The tokenization annotation was done by the support of
the [Natural Language Processing Lab](https://attapol.github.io/lab.html)
at Department of Linguistics, Faculty of Arts, Chulalongkorn University.
- The original Kaggle competition, by Ekapol Chuangsuwanich,
using the earlier version of this corpus,
can be found at <https://www.kaggle.com/c/wisesight-sentiment/>.
We would like to thank:

- The [PyThaiNLP community](https://github.com/PyThaiNLP/) and
[Kitsuchart Pasupa](https://www.it.kmitl.ac.th/~kitsuchart/) (Faculty of
Information Technology, King Mongkut's Institute of Technology Ladkrabang)
for their advice.
- The [Natural Language Processing Lab](https://attapol.github.io/lab.html) at
Department of Linguistics, Faculty of Arts, Chulalongkorn University for
their support with tokenization annotation.
- Ekapol Chuangsuwanich for his initiative in creating the original Kaggle
competition using an earlier version of this corpus. The competition can be
found at <https://www.kaggle.com/c/wisesight-sentiment/>.