Update sbom

Signed-off-by: Arthit Suriyawongkul <[email protected]>
PyThaiNLP · Nov 7, 2024 · 96b66d4
1 parent a2d94cb
commit 96b66d4

Showing 3 changed files with 106 additions and 72 deletions.
diff --git a/CITATION.cff b/CITATION.cff
@@ -29,7 +29,7 @@ authors:
   given-names: "Can"
   orcid: "https://orcid.org/0000-0002-7090-0536"
 title: "PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label"
-version: v1.0.1
+version: v1.1
 license: CC0-1.0
 identifiers:
   - description: This is the collection of archived snapshots of all versions of the dataset
@@ -39,4 +39,4 @@ identifiers:
     type: doi
     value: "10.5281/zenodo.3457447"
 repository: "https://github.com/PyThaiNLP/wisesight-sentiment/"
-date-released: 2024-11-06
+date-released: 2024-11-07
diff --git a/README.md b/README.md
@@ -7,49 +7,47 @@ SPDX-License-Identifier: CC0-1.0
 [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3457446.svg)](https://doi.org/10.5281/zenodo.3457446)
 
 ข้อความภาษาไทยจากสื่อสังคมออนไลน์ พร้อมกับป้ายกำกับความรู้สึก (บวก, กลางๆ, ลบ, คำถาม) รวม 26,737 ข้อความ
-**เผยแพร่เป็นสมบัติสาธารณะ** ภายใต้[สัญญาอนุญาต Creative Commons Zero v1.0 Universal][cc]
+**เผยแพร่เป็นสมบัติสาธารณะ** โดยการสละสิทธิ์ตาม [CC0 1.0 Universal][cc0]
 
 Social media messages in Thai language with sentiment label (positive, neutral, negative, question).
-**Released to public domain** under [Creative Commons Zero v1.0 Universal license][cc].
+**Dedicated to the public domain** under [CC0 1.0 Universal][cc0].
 
-[cc]: https://creativecommons.org/publicdomain/zero/1.0/
+[cc0]: https://creativecommons.org/publicdomain/zero/1.0/
 
 ## Changelog
 
-- 2024-11-06 - Release v1.0.1 with updated contributors and software bill of materials (SBOM)
-- 2020-12-01 - Add Hugging Face format - [PR #7](https://github.com/PyThaiNLP/wisesight-sentiment/pull/7)
-- 2019-10-01 - Fix path in data preparation notebook - [PR #6](https://github.com/PyThaiNLP/wisesight-sentiment/pull/6)
-- 2019-08-22 - Add tokenization annotation for ~1,000 samples - [PR #4](https://github.com/PyThaiNLP/wisesight-sentiment/pull/4)
-- 2019-07-03 - Add tokenization annotation for 160 samples - [PR #2](https://github.com/PyThaiNLP/wisesight-sentiment/pull/2)
-- 2019-03-31 - Update data
+- 2024-11-07: Released v1.1 with updated copyright text, contributors, and a software bill of materials (SBOM).
+- 2020-12-01: Added Hugging Face format - [PR #7](https://github.com/PyThaiNLP/wisesight-sentiment/pull/7)
+- 2019-10-01: Fixed path in data preparation notebook - [PR #6](https://github.com/PyThaiNLP/wisesight-sentiment/pull/6)
+- 2019-08-22: Added tokenization annotation for ~1,000 samples - [PR #4](https://github.com/PyThaiNLP/wisesight-sentiment/pull/4)
+- 2019-07-03: Added tokenization annotation for 160 samples - [PR #2](https://github.com/PyThaiNLP/wisesight-sentiment/pull/2)
+- 2019-03-31: Updated data.
 
-## Related corpus
+## Related datasets
 
 - For `wisesight-160` and `wisesight-1000`, which are samples from this corpus in a tokenized form,
-  see <https://github.com/PyThaiNLP/wisesight-sentiment/tree/master/word-tokenization>
-
-- For data exploration and classification examples,
-  see [Thai Text Classification Benchmarks](https://github.com/PyThaiNLP/classification-benchmarks).
-
-- Also available as Huggingface datasets:
+  see the [word-tokenization](./word-tokenization/) directory.
+- Also available at Hugging Face Datasets:
   - [wisesight_sentiment](https://huggingface.co/datasets/wisesight_sentiment)
     (using the earlier version of this corpus)
   - [wisesight1000](https://huggingface.co/datasets/wisesight1000)
+- For data exploration and classification examples,
+  see [Thai Text Classification Benchmarks](https://github.com/PyThaiNLP/classification-benchmarks).
 
 ## Source
 
-- Size: 26,737 messages
-- Language: Central Thai
-- Style: Informal and conversational. With some news headlines and advertisement.
-- Time period: Approximately 2016 to early 2019, with a small amount from other periods.
-- Domains: Mixed. Majority are consumer products and services
+- **Size:** 26,737 messages.
+- **Language:** Central Thai.
+- **Style:** Informal and conversational, with some news headlines and advertisements.
+- **Time period:** Approximately 2016 to early 2019, with a small amount from other periods.
+- **Domains:** Mixed. Majority are consumer products and services
   (restaurants, cosmetics, drinks, car, hotels), with some current affairs.
-- Privacy:
+- **Privacy:**
   - Only messages that made available to the public on the internet
     (websites, blogs, social network sites).
   - For Facebook, this means the public comments (everyone can see) that made on a public page.
   - Private/protected messages and messages in groups, chat, and inbox are not included.
-- Alternations and modifications:
+- **Alterations and modifications:**
   - Keep in mind that this corpus does not statistically represent anything in the language register.
   - Large amount of messages are not in their original form. Personal data are removed or masked.
   - Duplicated, leading, and trailing whitespaces are removed.
@@ -63,19 +61,19 @@ Social media messages in Thai language with sentiment label (positive, neutral,
 
 ## Corpus file structure
 
-- All files are UTF-8 encoded plaintext
+- All files are UTF-8 encoded plaintext.
 - One message per line. A newline character in the original message will be replaced with a space.
-- `q.txt` Questions (575 messages)
-- `neg.txt` Message with negative sentiment (6,823)
-- `neu.txt` Message with neutral sentiment (14,561)
-- `pos.txt` Message with positive sentiment (4,778)
+- `q.txt`: Questions (575 messages).
+- `neg.txt`: Messages with negative sentiment (6,823).
+- `neu.txt`: Messages with neutral sentiment (14,561).
+- `pos.txt`: Messages with positive sentiment (4,778).
 - The legacy dataset in Kaggle competition format is also provided inside `kaggle-competition/` directory:
-  - `train.txt` - Message for training (24,066 messages)
-  - `train_label.txt` - Label for training. Each line is the label corresponding to the same line in `train.txt`
+  - `train.txt`: Messages for training (24,066 messages).
+  - `train_label.txt`: Labels for training. Each line is the label corresponding to the same line in `train.txt`.
   - `test.txt` - Message for testing (2,674 messages)
-  - `test_label.txt` - Label for testing. Each line is the label corresponding to the same line in `test.txt`
-  - `test_majority.csv` - Sample submission in Kaggle format. Contains `neu` class as all the predictions.
-  - `test_solution.csv` - Test solution in Kaggle format.
+  - `test_label.txt`: Label for testing. Each line is the label corresponding to the same line in `test.txt`.
+  - `test_majority.csv`: Sample submission in Kaggle format. Contains `neu` class as all the predictions.
+  - `test_solution.csv`: Test solution in Kaggle format.
   - Sample code for data exploration, training, and prediction are also provided.
 
 ## Personal data
@@ -101,24 +99,38 @@ Social media messages in Thai language with sentiment label (positive, neutral,
 
 ## Copyright and Disclaimer
 
-- If applicable, copyright of each message content belongs to the original poster.
-- **Annotation data (labels) are released to public domain.**
-- [Wisesight (Thailand) Co., Ltd.](https://github.com/wisesight/) helps facilitate the annotation,
-  but does not necessarily agree upon the labels made by the human annotators.
-  This annotation is for research purpose and does not reflect the professional work that Wisesight has been done for its customers.
-- The human annotator does not necessarily agree or disagree with the message.
-  Likewise, the label he/she made to the message does not necessarily reflect his/her personal view towards the message.
+This dataset contains social media text extracted from publicly accessible
+sources on the internet. The selection, organization, and curation of this
+dataset are original works that were previously copyrighted.
+However, the copyright holder has waived all rights to this dataset and
+dedicated it to the public domain under the
+[Creative Commons Zero v1.0 Universal Public Domain Dedication][cc0].
+
+Any trademarks or trade names appearing in the messages belong to their
+respective owners.
+
+[Wisesight (Thailand) Co., Ltd.](https://wisesight.com/) has assisted in the
+collection and sentiment labeling of this dataset, but does not necessarily
+endorse the labels assigned by human annotators.
+These annotations are for research purposes only and do not represent the
+professional work Wisesight performs for its clients.
+
+Please note that human annotators may not personally agree or disagree with
+the messages they label. Additionally, the labels assigned do not necessarily
+reflect their personal opinions on the content.
+
+*You are free to use this dataset for any purpose, without any restrictions.*
 
 ## Citation
 
 Please cite the following if you make use of the dataset:
 
-> Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, Nitchakarn Chantarapratin, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, Attapol Rutherford, Charin Polpanumas, and Can Udomcharoenchaikit. 2024. **PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label (Version 1.0.1)** November.
+> Arthit Suriyawongkul, Ekapol Chuangsuwanich, Pattarawat Chormai, Nitchakarn Chantarapratin, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, Attapol Rutherford, Charin Polpanumas, and Can Udomcharoenchaikit. 2024. **PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label (Version 1.1)** November.
 
 BibTeX:
 
-```
-@software{Suriyawongkul_PyThaiNLP_Wisesight_Sentiment_Corpus_2020,
+```bibtex
+@misc{Suriyawongkul_PyThaiNLP_Wisesight_Sentiment_Corpus_2020,
   author       = {Suriyawongkul, Arthit and
                   Chuangsuwanich, Ekapol and
                   Chormai, Pattarawat and
@@ -135,20 +147,22 @@ BibTeX:
   publisher    = {Zenodo},
   title        = {{PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label}},
   url          = {https://doi.org/10.5281/zenodo.3457446},
-  version      = {v1.0.1},
+  version      = {v1.1},
   year         = 2024
 }
 ```
 
 ## Acknowledgement
 
-- Thanks [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) community and
-  [Kitsuchart Pasupa](https://www.it.kmitl.ac.th/~kitsuchart/)
-  (Faculty of Information Technology, King Mongkut's Institute of Technology
-  Ladkrabang) for advice.
-- The tokenization annotation was done by the support of
-  the [Natural Language Processing Lab](https://attapol.github.io/lab.html)
-  at Department of Linguistics, Faculty of Arts, Chulalongkorn University.
-- The original Kaggle competition, by Ekapol Chuangsuwanich,
-  using the earlier version of this corpus,
-  can be found at <https://www.kaggle.com/c/wisesight-sentiment/>.
+We would like to thank:
+
+- The [PyThaiNLP community](https://github.com/PyThaiNLP/) and
+  [Kitsuchart Pasupa](https://www.it.kmitl.ac.th/~kitsuchart/) (Faculty of
+  Information Technology, King Mongkut's Institute of Technology Ladkrabang)
+  for their advice.
+- The [Natural Language Processing Lab](https://attapol.github.io/lab.html) at
+  the Department of Linguistics, Faculty of Arts, Chulalongkorn University for
+  their support with tokenization annotation.
+- Ekapol Chuangsuwanich for his initiative in creating the original Kaggle
+  competition using an earlier version of this corpus. The competition can be
+  found at <https://www.kaggle.com/c/wisesight-sentiment/>.
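
The "Corpus file structure" section in the README diff above describes UTF-8 plaintext files with one message per line, split by label, plus line-aligned text and label files under `kaggle-competition/`. Reading that layout could look roughly like the sketch below. The function names are hypothetical, and the label strings `q`, `neg`, and `pos` are assumptions derived from the file names (the README only names `neu` as a class). The values inside `train_label.txt` are returned unchanged because the README does not list them.

```python
from pathlib import Path

# Assumed mapping from label string to file name; only the file names
# themselves come from the README.
LABEL_FILES = {"q": "q.txt", "neg": "neg.txt", "neu": "neu.txt", "pos": "pos.txt"}


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_corpus(corpus_dir):
    """Load the per-label files into a list of (message, label) pairs."""
    pairs = []
    for label, filename in LABEL_FILES.items():
        for message in read_lines(Path(corpus_dir) / filename):
            pairs.append((message, label))
    return pairs


def load_aligned(text_path, label_path):
    """Pair line i of a text file with line i of its label file."""
    texts = read_lines(text_path)
    labels = read_lines(label_path)
    if len(texts) != len(labels):
        raise ValueError(
            f"{text_path} has {len(texts)} lines but "
            f"{label_path} has {len(labels)}"
        )
    return list(zip(texts, labels))
```

With a local checkout, `load_corpus(".")` would read the four label files from the repository root, and `load_aligned("kaggle-competition/train.txt", "kaggle-competition/train_label.txt")` would pair each training message with its label. The explicit length check is there because a silent `zip` would hide a misaligned label file.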