- GlotScript-Resource: a resource listing the writing systems used by various languages.
- GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.
What writing system is each language written in?
Example:
| Language | CORE | AUXILIARY |
|---|---|---|
| Turkish (tur) | Latn | Arab, Cyrl, Grek | 
| Thai (tha) | Thai | Latn | 
| Vietnamese (vie) | Latn | Hani | 
See metadata folder for more languages.
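To make the table concrete, the sketch below encodes the three example rows above as a plain Python dictionary and looks up the scripts for a language. The `EXAMPLE_SCRIPTS` name and the `scripts_for` helper are hypothetical, and the layout is an assumption for illustration only; it is not the format of the files in the metadata folder.

```python
# Hypothetical in-memory copy of the three example rows (not the real metadata format).
EXAMPLE_SCRIPTS = {
    "tur": {"core": ["Latn"], "auxiliary": ["Arab", "Cyrl", "Grek"]},
    "tha": {"core": ["Thai"], "auxiliary": ["Latn"]},
    "vie": {"core": ["Latn"], "auxiliary": ["Hani"]},
}


def scripts_for(lang_code, include_auxiliary=False):
    """Return the core scripts for a language, optionally followed by auxiliary ones."""
    entry = EXAMPLE_SCRIPTS[lang_code]
    scripts = list(entry["core"])
    if include_auxiliary:
        scripts += entry["auxiliary"]
    return scripts


print(scripts_for("tur"))                          # core scripts only
print(scripts_for("tur", include_auxiliary=True))  # core plus auxiliary
```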
GlotScript-Tool is a Python library that detects the script (writing system) of a text based on ISO 15924.
- Unicode version: 15.0.0
- The script codes were sourced from Wikipedia (ISO_15924).
- Unicode ranges were extracted from the Unicode Character Database.
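As a rough illustration of the last point, the sketch below parses a few lines in the layout used by the Unicode Character Database file `Scripts.txt` (`start..end ; Script # comment`) into numeric ranges. The sample lines are embedded so the block runs offline, and `parse_script_ranges` is a hypothetical helper, not the code used to build GlotScript. Note that `Scripts.txt` uses long script names such as `Latin`; mapping them to ISO 15924 codes such as `Latn` would be a separate step.

```python
# Sample lines written in the Scripts.txt layout; a real run would read the full file.
SAMPLE = """\
# Sample lines in the Scripts.txt layout
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
3041..3096    ; Hiragana # Lo  [86] HIRAGANA LETTER SMALL A..HIRAGANA LETTER SMALL KE
"""


def parse_script_ranges(text):
    """Return a list of (start, end, script_name) tuples with inclusive code points."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        code_points, script = (part.strip() for part in line.split(";"))
        start, _, end = code_points.partition("..")
        # A single code point has no "..", so the range starts and ends on it.
        ranges.append((int(start, 16), int(end or start, 16), script))
    return ranges


for start, end, script in parse_script_ranges(SAMPLE):
    print(f"U+{start:04X}..U+{end:04X} {script}")
```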
Special codes
- The Zinh code is the Unicode script property value for characters that may be used with multiple scripts and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
- The Zyyy code is the Unicode script code for "Common" characters.
- The Zzzz code is the Unicode script code for "uncoded" script.
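When only concrete scripts matter, one option is to drop the three special codes from a per-script share mapping and renormalize what remains. The sketch below assumes a plain dictionary of script shares shaped like the `details` field in the script detection example; `drop_special_codes` is a hypothetical helper and not part of GlotScript.

```python
# The three special ISO 15924 codes described above.
SPECIAL_CODES = {"Zinh", "Zyyy", "Zzzz"}


def drop_special_codes(shares):
    """Remove special codes from a {script: share} dict and rescale the rest to sum to 1."""
    kept = {code: share for code, share in shares.items() if code not in SPECIAL_CODES}
    total = sum(kept.values())
    if total == 0:
        return {}  # nothing but special codes (or an empty input)
    return {code: share / total for code, share in kept.items()}


# Assumed input: half of the characters are "Common" (Zyyy), e.g. spaces and digits.
print(drop_special_codes({"Latn": 0.25, "Arab": 0.25, "Zyyy": 0.5}))
```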
from pip

    pip3 install GlotScript

from git

    pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Script Detection

    from GlotScript import sp

    sp('これは日本人です')
    >> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})

    sp('This is Latin')[:2]
    >> ('Latn', 1.0)

    sp('මේක සිංහල')[0]
    >> 'Sinh'

Script Separation

    from GlotScript import sc

    sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
    sc(sent)
    >> {
       "Latn":"Hello Salut     ",
       "Hebr":"     שלום ",
       "Arab":"  سلام    مرحبا",
       "Hani":"   你好   ",
       "Hira":"    こんにちは  "
    }
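To show where a share such as `0.625` can come from, the sketch below counts characters per script using a tiny hand-written range table and divides by the number of classified characters. The ranges (`Latn` as ASCII letters only, the Hiragana block, the CJK Unified Ideographs block) and the `script_of` and `script_shares` helpers are simplifying assumptions for illustration; this is not GlotScript's implementation, which relies on full Unicode script data.

```python
from collections import Counter

# Toy range table (assumption): far smaller than real Unicode script data.
TOY_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A-Z
    (0x0061, 0x007A, "Latn"),  # a-z
    (0x3040, 0x309F, "Hira"),  # Hiragana block
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs block
]


def script_of(char):
    """Return the toy script code for a character, or None if it is not covered."""
    cp = ord(char)
    for start, end, code in TOY_RANGES:
        if start <= cp <= end:
            return code
    return None


def script_shares(text):
    """Return {script: fraction} over the characters that the toy table can classify."""
    counts = Counter(code for code in map(script_of, text) if code is not None)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


print(script_shares("これは日本人です"))  # {'Hira': 0.625, 'Hani': 0.375}
```

Five of the eight characters are Hiragana and three are Han ideographs, which gives 5/8 = 0.625 and 3/8 = 0.375, matching the `details` values in the detection example.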
- List of Unicode characters - Wikipedia
- Lightweight Plain-Text Editor for macOS - CotEditor
- The Cygwin Terminal – terminal emulator for Cygwin, MSYS, and WSL - mintty
- ISO_15924 Wikipedia
- Unicode Character Database (Blocks) - Unicode
- Unicode Character Database (Scripts) - Unicode
- A free, web-based font editor, focusing on font design hobbyists. - Glyphr-Studio-1
- Kotlin - JetBrains
- UNIX-like reverse engineering framework and command-line toolset - radare2
- FreeOrion Game
- DOMinator - Firefox
- SHSans-derived CJK font family - glow-sans
- Unicode Subset Bitfields - Microsoft
- Stops - FAIR NLLB FB
- Gradient Boosting on Decision Trees - catboost
- Blender
- Unicode Wikipedia
If you use any part of our resource or tool in your research, please cite it using the following BibTeX entry.
@inproceedings{kargaran-etal-2024-glotscript-resource,
    title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
    author = {Kargaran, Amir Hossein  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.687",
    pages = "7774--7784"
}