Compute percentage of languages present in a document (paragraph) #159
Unanswered
thangld201
asked this question in
Q&A
Replies: 0 comments
Hi, I'm trying to compute the percentage of each language appearing in a document. My current use case involves two known languages and a document in which the two are mixed (code-switching). I'm training an ML model to make its output monolingual (leaning towards one particular language), so I need a reliable measure to estimate whether the model is making progress (that is, whether the language percentages shift in the right direction). Currently, I use lingua with the compute_language_confidence_values() function, but the predictions are quite poor.
For example, given a piece of text in Japanese and English:
So the result is not quite correct (it should be roughly 0.8 / 0.2 or something similar). Do you have any advice on how I can improve or modify the library for my use case?
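For reference, here is a minimal baseline I could compare against. It is only a sketch and does not use lingua at all: it counts characters by Unicode script and reports each script's share. The function names (script_of, script_shares) and the labels "japanese" / "english" are my own, and treating every kana or kanji character as Japanese and every Latin letter as English is an assumption that only holds because the two languages are known in advance and use different scripts.

```python
import unicodedata
from collections import Counter
from typing import Dict, Optional


def script_of(ch: str) -> Optional[str]:
    """Return a coarse label for one character, or None if it is not a letter."""
    if not ch.isalpha():
        return None  # skip digits, punctuation, whitespace
    name = unicodedata.name(ch, "")
    # Assumption: kana and CJK ideographs count as Japanese in this two-language setup.
    if name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED IDEOGRAPH")):
        return "japanese"
    # Assumption: any Latin letter counts as English.
    if name.startswith("LATIN"):
        return "english"
    return "other"


def script_shares(text: str) -> Dict[str, float]:
    """Return the fraction of letters belonging to each label (empty dict if no letters)."""
    counts = Counter(label for label in map(script_of, text) if label is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Four kana characters and one Latin letter.
    print(script_shares("あいうえ a"))  # {'japanese': 0.8, 'english': 0.2}
```

This counts characters rather than words, so a single kanji and a single Latin letter weigh the same. That may or may not match what "percentage" should mean for code-switched text, so it is a sanity check rather than a replacement for a proper language detector.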