joshdavham/jreadability

Text readability calculator for Japanese learners 🇯🇵


jreadability lets Python developers calculate the readability of Japanese text using the model developed by Jae-ho Lee and Yoichiro Hasebe in "Readability measurement of Japanese texts based on levelled corpora." Note that this is not an official implementation.

Installation

pip install jreadability

Quickstart

from jreadability import compute_readability

# "Good morning! The weather is nice today."
text = 'おはようございます!今日は天気がいいですね。'

score = compute_readability(text)

print(score) # 5.596333333333334

Readability scores

Level                 Readability score range
Upper-advanced        0.5 - 1.4
Lower-advanced        1.5 - 2.4
Upper-intermediate    2.5 - 3.4
Lower-intermediate    3.5 - 4.4
Upper-elementary      4.5 - 5.4
Lower-elementary      5.5 - 6.4

Note that this readability calculator is specifically for non-native speakers learning to read Japanese. This is not to be confused with something like grade level or other readability scores meant for native speakers.
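
As a rough illustration of how the table above might be applied in code, the sketch below maps a score to its level. The name `score_to_level` is hypothetical and not part of jreadability. It also assumes the score is rounded to one decimal place before lookup, since the published ranges leave gaps (for example between 1.4 and 1.5).

```python
from typing import Optional

# (level, lower bound, upper bound), copied from the table above.
LEVELS = [
    ("Upper-advanced", 0.5, 1.4),
    ("Lower-advanced", 1.5, 2.4),
    ("Upper-intermediate", 2.5, 3.4),
    ("Lower-intermediate", 3.5, 4.4),
    ("Upper-elementary", 4.5, 5.4),
    ("Lower-elementary", 5.5, 6.4),
]


def score_to_level(score: float) -> Optional[str]:
    """Hypothetical helper (not part of jreadability): look up the level for a score."""
    # Assumption: round to one decimal so values such as 1.45 fall into a range.
    rounded = round(score, 1)
    for level, low, high in LEVELS:
        if low <= rounded <= high:
            return level
    return None  # the score lies outside the documented ranges


print(score_to_level(5.596333333333334))  # quickstart score -> Lower-elementary
```

Scores outside 0.5 - 6.4 return `None` here; how such scores should be labelled is not stated in this README.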

Model

readability = {mean number of words per sentence} * -0.056
            + {proportion of kango} * -0.126
            + {proportion of wago} * -0.042
            + {proportion of verbs} * -0.145
            + {proportion of auxiliary verbs} * -0.044
            + 11.724

* "kango" (漢語) means a Japanese word of Chinese origin, while "wago" (和語) means a native Japanese word.
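
To make the formula above concrete, here is a minimal sketch that evaluates it for features you have already computed. The function and parameter names are hypothetical and this is not the package's internal code. It assumes the proportions are given as percentages (0 to 100) rather than fractions, which the quickstart score of about 5.6 suggests but this README does not state.

```python
def readability_from_features(
    mean_words_per_sentence: float,
    pct_kango: float,
    pct_wago: float,
    pct_verbs: float,
    pct_auxiliary_verbs: float,
) -> float:
    """Hypothetical helper: apply the linear model to precomputed features.

    Assumption: the four proportions are percentages in the range 0-100.
    """
    return (
        mean_words_per_sentence * -0.056
        + pct_kango * -0.126
        + pct_wago * -0.042
        + pct_verbs * -0.145
        + pct_auxiliary_verbs * -0.044
        + 11.724  # intercept
    )


# Made-up feature values, for illustration only.
score = readability_from_features(10.0, 10.0, 50.0, 10.0, 10.0)
print(score)
```

Every coefficient is negative, so longer sentences and higher proportions of any of these word classes lower the score, which corresponds to a more advanced level in the table of readability scores.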

Note on model consistency

The readability scores produced by this Python package tend to differ slightly from the scores produced on the official jreadability website. This is likely due to a difference in UniDic versions: this package uses UniDic 2.1.2, while the website uses UniDic 2.2.0. This issue will hopefully be resolved in the future.

Batch processing

jreadability uses fugashi's tagger under the hood and initializes a new tagger every time compute_readability is invoked. If you are processing a large number of texts, it is recommended to initialize the tagger once yourself, then pass it as an argument to each subsequent compute_readability call.

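from fugashi import Tagger
from jreadability import compute_readability

texts = [...]

# Initialize the tagger once and reuse it for every text.
tagger = Tagger()

for text in texts:
    score = compute_readability(text, tagger) # fast :D
    #score = compute_readability(text) # slow :'(
    ...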