kataoto

kataoto stands for 音 (oto, vocal information or phonetic information) and 形 (katachi, visual information or glyphic information).

This is a repository offering model checkpoints and codes related to the encoding and decoding process described in the paper "Japanese Sub-Character Tokenization Utilizing both Vocal and Visual Information". The paper proposed a sub-character tokenization algorithm which helps neural models understand Japanese text in a more natural way.

issueを英語で書く必要はありません。
不必使用英语提出issue。

To install prerequisites, run the following command.

pip install -r requirements.txt
python -m unidic download

To get your hands dirty with some encoding and decoding examples, try example.py. Enter a blank line to exit.

$ python src/example.py
original:  吾輩は猫である。名前はまだ無い。
encoded:  #わがはい<bixj kcfk>は#ねこ<qmek>である。#なまえ<rsj uaqk>はまだ#ない<maeu い>。
decoded:  吾輩は猫である。名前はまだ無い。
original: ハロー世界
encoded:  ハロー#せかい<eaz kion>
decoded:  ハロー世界
original: 
Goodbye!

To explore our models in an interactive way, download the model checkpoints in advance and extract them to the folder model/. After extracting, the project folder should be something like this.

kataoto
├── LICENSE
├── README.md
├── data
├── model
│   ├── aspec_encode
│   │   ├── bpecodes.en
│   │   ├── bpecodes.ja
│   │   ├── dict.en.txt
│   │   ├── dict.ja.txt
│   │   └── model.pt
│   ├── aspec_kana
│   ├── jesc_encode
│   └── jesc_kana
├── requirements.txt
└── src

Then try interact.py. Enter a blank line to exit.

$ python src/interact.py
Please choose the type of the dataset.
0 for jesc and 1 for aspec: 0 
You chose jesc
Please choose the type of the model.
0 for encode and 1 for kana: 0
You chose encode
...
2023-05-10 08:58:09 | INFO | fairseq_cli.interactive | Type the input sentence and press return:
日本語お上手ですね
2023-05-10 08:58:17 | INFO | fairseq.tasks.fairseq_task | can_reuse_epoch_itr = True
2023-05-10 08:58:17 | INFO | fairseq.tasks.fairseq_task | reuse_dataloader = True
2023-05-10 08:58:17 | INFO | fairseq.tasks.fairseq_task | rebuild_batches = False
2023-05-10 08:58:17 | INFO | fairseq.tasks.fairseq_task | creating new batches for epoch 1
S-0     # にっぽん < ka favv > # ご < sbxj > お # じょうず < ida md > です ね
W-0     0.156   seconds
H-0     -0.7305049300193787     you speak very good japanese .
D-0     -0.7305049300193787     you speak very good japanese .
P-0     -0.9716 -0.7284 -1.3580 -0.6123 -0.4469 -0.8214 -0.1750

Goodbye!
2023-05-10 08:58:25 | INFO | fairseq_cli.interactive | Total time: 17.218 seconds; translation time: 0.156

All the codes and model checkpoints are available under the MIT license.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kataoto

About

Releases 1

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
data		data
model		model
src		src
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

License

liuchengwucn/kataoto

Folders and files

Latest commit

History

Repository files navigation

kataoto

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages