Merged
113 changes: 57 additions & 56 deletions README.md
@@ -29,69 +29,70 @@ This is a refined and re-implemented version of the archived plugin for ElasticS
## About this library

The library uses character 3-grams and a Bayesian filter with various normalizations and feature sampling.
-The precision is over **99%** for **53** languages.
+The precision is over **99%** for **54** languages.

See the following PR description to read about the benchmarks done by @yanirs: https://github.com/jprante/elasticsearch-langdetect/pull/69

### Supported ISO 639-1 codes

The following ISO 639-1 language codes are recognized:

-| Code | Description |
-|--------|--------------------------------------------------|
-| af | Afrikaans |
-| ar | Arabic |
-| bg | Bulgarian |
-| bn | Bengali |
-| cs | Czech |
-| da | Danish |
-| de | German |
-| el | Greek |
-| en | English |
-| es | Spanish |
-| et | Estonian |
-| fa | Farsi |
-| fi | Finnish |
-| fr | French |
-| gu | Gujarati |
-| he | Hebrew |
-| hi | Hindi |
-| hr | Croatian |
-| hu | Hungarian |
-| id | Indonesian |
-| it | Italian |
-| ja | Japanese |
-| kn | Kannada |
-| ko | Korean |
-| lt | Lithuanian |
-| lv | Latvian |
-| mk | Macedonian |
-| ml | Malayalam |
-| mr | Marathi |
-| ne | Nepali |
-| nl | Dutch |
-| no | Norwegian |
-| pa | Eastern Punjabi |
-| pl | Polish |
-| pt | Portuguese |
-| ro | Romanian |
-| ru | Russian |
-| sk | Slovak |
-| sl | Slovene |
-| so | Somali |
-| sq | Albanian |
-| sv | Swedish |
-| sw | Swahili |
-| ta | Tamil |
-| te | Telugu |
-| th | Thai |
-| tl | Tagalog |
-| tr | Turkish |
-| uk | Ukrainian |
-| ur | Urdu |
-| vi | Vietnamese |
-| zh-cn | Chinese |
-| zh-tw | Traditional Chinese (Taiwan, Hongkong and Macau) |
+| Code | Description |
+|-------|--------------------------------------------------|
+| af | Afrikaans |
+| ar | Arabic |
+| bg | Bulgarian |
+| bn | Bengali |
+| cs | Czech |
+| da | Danish |
+| de | German |
+| el | Greek |
+| en | English |
+| es | Spanish |
+| et | Estonian |
+| fa | Farsi |
+| fi | Finnish |
+| fr | French |
+| gu | Gujarati |
+| he | Hebrew |
+| hi | Hindi |
+| hr | Croatian |
+| hu | Hungarian |
+| id | Indonesian |
+| it | Italian |
+| ja | Japanese |
+| kn | Kannada |
+| ko | Korean |
+| lb | Luxembourgish |
+| lt | Lithuanian |
+| lv | Latvian |
+| mk | Macedonian |
+| ml | Malayalam |
+| mr | Marathi |
+| ne | Nepali |
+| nl | Dutch |
+| no | Norwegian |
+| pa | Eastern Punjabi |
+| pl | Polish |
+| pt | Portuguese |
+| ro | Romanian |
+| ru | Russian |
+| sk | Slovak |
+| sl | Slovene |
+| so | Somali |
+| sq | Albanian |
+| sv | Swedish |
+| sw | Swahili |
+| ta | Tamil |
+| te | Telugu |
+| th | Thai |
+| tl | Tagalog |
+| tr | Turkish |
+| uk | Ukrainian |
+| ur | Urdu |
+| vi | Vietnamese |
+| zh-cn | Chinese |
+| zh-tw | Traditional Chinese (Taiwan, Hongkong and Macau) |

### Quick detection of CJK languages

@@ -9,7 +9,7 @@
 public class LanguageDetectionSettings {
 
   private static final String ALL_SUPPORTED_ISO_CODES_639_1 =
-      "af,ar,bg,bn,ca,cs,da,de,el,en,es,et,fa,fi,fr,gu,he,hi,hr,hu,id,it,ja,kn,ko,lt,lv,mk,ml,mr,ne,nl,no,pa,pl,pt,"
+      "af,ar,bg,bn,ca,cs,da,de,el,en,es,et,fa,fi,fr,gu,he,hi,hr,hu,id,it,ja,kn,ko,lb,lt,lv,mk,ml,mr,ne,nl,no,pa,pl,pt,"
           + "ro,ru,si,sk,sl,so,sq,sv,sw,ta,te,th,tl,tr,uk,ur,vi,zh-cn,zh-tw";
 
   static final LanguageDetectionSettings DEFAULT_SETTINGS_ALL_LANGUAGES =
1 change: 1 addition & 0 deletions src/main/resources/langdetect/lb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions src/main/resources/langdetect/merged-average/lb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions src/main/resources/langdetect/short-text/lb

Large diffs are not rendered by default.

@@ -213,4 +213,23 @@ public final void languageDetectorRespondsWithUndeterminedLanguage() throws Exce
     assertEquals("und", detector.detectAll("1234567").get(0).getIsoCode639_1());
     assertEquals("und", detector.detectAll("한국어").get(0).getIsoCode639_1());
   }
+
+  @Test
+  public final void languageDetectorTestLuxembourgish() throws Exception {
+    final LanguageDetectionSettings supportedLanguages =
+        LanguageDetectionSettings.fromIsoCodes639_1("de,lb").build();
+    final LanguageDetectorFactory factory = new LanguageDetectorFactory(supportedLanguages);
+    final LanguageDetector detector =
+        new LanguageDetector(factory.getSupportedIsoCodes639_1(), factory.getLanguageCorporaProbabilities());
+
+    assertEquals(
+        "lb", detector.detectAll("Ech léiere Lëtzebuergesch").get(0).getIsoCode639_1());
+    assertEquals("de", detector.detectAll("Ich lerne Deutsch").get(0).getIsoCode639_1());
+
+    assertEquals(
+        "lb",
+        detector.detectAll("Schwätzt wannechgelift méi lues").get(0).getIsoCode639_1());
+    assertEquals(
+        "de", detector.detectAll("Bitte sprechen Sie langsamer").get(0).getIsoCode639_1());
+  }
 }
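
For readers unfamiliar with the approach the README describes (character 3-grams plus a Bayesian filter), the following is a minimal, self-contained sketch of that general idea. It is not the library's implementation: the class `TrigramBayesSketch` and its methods are hypothetical, the training text is a toy sample borrowed from the test sentences above, and it uses plain add-one smoothing with a uniform prior in place of the normalizations and feature sampling the README mentions. Returning `"und"` when no decision is possible is an assumption that mirrors the code asserted in the existing tests.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in the library.
class TrigramBayesSketch {
  // language code -> (trigram -> occurrence count)
  private final Map<String, Map<String, Integer>> counts = new HashMap<>();
  // language code -> total number of trigrams seen in training
  private final Map<String, Integer> totals = new HashMap<>();
  // every distinct trigram seen for any language (used for smoothing)
  private final Set<String> vocabulary = new HashSet<>();

  // Splits lowercased text into overlapping windows of three UTF-16 chars.
  static List<String> trigrams(String text) {
    String s = text.toLowerCase(Locale.ROOT);
    List<String> grams = new ArrayList<>();
    for (int i = 0; i + 3 <= s.length(); i++) {
      grams.add(s.substring(i, i + 3));
    }
    return grams;
  }

  // Adds the trigrams of a sample text to the profile of one language.
  void train(String lang, String text) {
    Map<String, Integer> langCounts = counts.computeIfAbsent(lang, k -> new HashMap<>());
    for (String gram : trigrams(text)) {
      langCounts.merge(gram, 1, Integer::sum);
      totals.merge(lang, 1, Integer::sum);
      vocabulary.add(gram);
    }
  }

  // Sum of log P(trigram | lang) with add-one smoothing; one extra slot in the
  // denominator is reserved for trigrams never seen in training.
  private double logLikelihood(String lang, List<String> grams) {
    Map<String, Integer> langCounts = counts.get(lang);
    double denominator = totals.getOrDefault(lang, 0) + vocabulary.size() + 1.0;
    double sum = 0.0;
    for (String gram : grams) {
      sum += Math.log((langCounts.getOrDefault(gram, 0) + 1.0) / denominator);
    }
    return sum;
  }

  // Returns the trained language with the highest score, or "und" when the
  // text has no trigrams or nothing has been trained (uniform prior assumed).
  String detect(String text) {
    List<String> grams = trigrams(text);
    String best = "und";
    if (grams.isEmpty()) {
      return best;
    }
    double bestScore = Double.NEGATIVE_INFINITY;
    for (String lang : counts.keySet()) {
      double score = logLikelihood(lang, grams);
      if (score > bestScore) {
        bestScore = score;
        best = lang;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    TrigramBayesSketch model = new TrigramBayesSketch();
    model.train("de", "Ich lerne Deutsch. Bitte sprechen Sie langsamer.");
    model.train("lb", "Ech léiere Lëtzebuergesch. Schwätzt wannechgelift méi lues.");
    System.out.println(model.detect("Ich lerne Deutsch"));
    System.out.println(model.detect("Ech léiere Lëtzebuergesch"));
  }
}
```

With profiles this small the sketch only separates inputs that overlap heavily with its training text; the per-language profile files added in this PR (for example `src/main/resources/langdetect/lb`) presumably play the role of the trained counts here, though their format is not shown in the diff.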