GitHub - martinhoang11/lexpp: A module for lexical pre-processing

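lexpp: a pre-processing tool for Japanese text based on a synonym dictionary

This module is under development.

About this module

This module provides synonym-related processing for tokenized dictionary entry units. It uses the synonym dictionary from SudachiDict as its default dictionary of Japanese synonyms.

Installation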

pip install lexpp

Usage

import lexpp as lp
pp = lp.Lexpp()

The current version provides the following features.

Look up a string in the dictionary and retrieve the information registered for it (Entry). lookup(surface: str) -> Tuple(Entry)

TESTCASE = "マンガ喫茶"
result = pp.lookup(TESTCASE)

Using an Entry as the key, get a tuple of the strings registered in the same group. get_synset(e: Entry) -> Tuple(str)

entry = result[0]
synset = pp.get_synset(entry)
# synset = ["漫画喫茶", "まんが喫茶", "マンガ喫茶", "漫喫", "まん喫", "マン喫"]

Using an Entry as the key, get the string registered as its representative form. get_representive_form(e: Entry) -> str

repr_form = pp.get_representive_form(entry)
# repr_form = "漫画喫茶"

Given multiple strings as a query, get the set of group IDs under which all of them are registered. If no common group exists, an empty set is returned. get_common_category_id_set(surfaces: List[str]) -> Set[int]

TESTCASE_LIST = ["漫画喫茶", "まんが喫茶", "マンガ喫茶", "漫喫", "まん喫", "マン喫"]
gid_set = pp.get_common_category_id_set(TESTCASE_LIST)
# gid_set = {27}

For sample code, see samples/sample.py.

How to build your own dictionary

python -m lexpp.dict_builder {your lexicon} {output filename}

Note: the input file is assumed to be in the same format as the synonym dictionary.

After building, pass the file name as a parameter when instantiating the Lexpp class.

pp = lp.Lexpp(external_dict_path={your dictionary})

License

This software is available under the terms of the Apache 2.0 license. It includes work distributed under the Apache 2.0 license.

References

Thanks to SudachiDict for publishing a useful dataset.

lexpp: lexical pre-processing module for Japanese text

THIS MODULE IS UNDER DEVELOPMENT

What this module is

This module pre-processes Japanese text using lexical knowledge. The default dictionary is built from the Sudachi synonym dictionary.

How to install

pip install lexpp

How to use

import lexpp as lp
pp = lp.Lexpp()

The current version of the software provides the following utilities.

Look up a key string in the dictionary to get its lexical entries. lookup(surface: str) -> Tuple(Entry)

TESTCASE = "マンガ喫茶"
result = pp.lookup(TESTCASE)

Look up a key entry to obtain its synset (a tuple of synonyms). get_synset(e: Entry) -> Tuple(str)

entry = result[0]
synset = pp.get_synset(entry)
# synset = ["漫画喫茶", "まんが喫茶", "マンガ喫茶", "漫喫", "まん喫", "マン喫"]

Transform a key entry into its representative form (a string). get_representive_form(e: Entry) -> str

repr_form = pp.get_representive_form(entry)
# repr_form = "漫画喫茶"
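The three calls above chain naturally into a normalization step: look up a token, take its first entry, and replace the token with the representative form. The sketch below illustrates that idea without importing lexpp, so it runs anywhere. Everything in it is hypothetical: `TOY_GROUPS`, `ToyEntry`, the `toy_*` helpers, and group 28 are made up for illustration, and treating the first member of a group as its representative form is an assumption, not something the lexpp source confirms.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy data: group ID -> surfaces. The first surface of each
# group is treated as the representative form (an assumption of this sketch).
TOY_GROUPS: Dict[int, Tuple[str, ...]] = {
    27: ("漫画喫茶", "まんが喫茶", "マンガ喫茶", "漫喫", "まん喫", "マン喫"),
    28: ("インターネット", "ネット"),
}


@dataclass(frozen=True)
class ToyEntry:
    """Stand-in for a dictionary entry: a surface and the group it belongs to."""
    surface: str
    group_id: int


def toy_lookup(surface: str) -> Tuple[ToyEntry, ...]:
    """Return one entry per group that contains the surface (possibly none)."""
    return tuple(
        ToyEntry(surface, gid)
        for gid, members in TOY_GROUPS.items()
        if surface in members
    )


def toy_get_synset(entry: ToyEntry) -> Tuple[str, ...]:
    """Return every surface registered in the entry's group."""
    return TOY_GROUPS[entry.group_id]


def toy_get_representative_form(entry: ToyEntry) -> str:
    """Return the group's representative surface (first member in this toy)."""
    return TOY_GROUPS[entry.group_id][0]


def normalize(tokens: List[str]) -> List[str]:
    """Replace each known token with its representative form; keep the rest."""
    normalized = []
    for token in tokens:
        entries = toy_lookup(token)
        normalized.append(
            toy_get_representative_form(entries[0]) if entries else token
        )
    return normalized


print(normalize(["マン喫", "で", "ネット"]))
```

With the real module, the same loop would presumably call `pp.lookup` and `pp.get_representive_form` instead of the toy helpers, and would need the same guard for tokens that have no entry.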

Look up the set of group IDs shared by all surfaces in the input list. If no common group exists, an empty set is returned. get_common_category_id_set(surfaces: List[str]) -> Set[int]

TESTCASE_LIST = ["漫画喫茶", "まんが喫茶", "マンガ喫茶", "漫喫", "まん喫", "マン喫"]
gid_set = pp.get_common_category_id_set(TESTCASE_LIST)
# gid_set = {27}
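Conceptually, the common group lookup is a set intersection over the group IDs of each surface. The following is a minimal sketch of that reading, not lexpp's actual implementation: `TOY_INDEX`, `common_group_ids`, and every group ID other than 27 are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical index: surface -> IDs of every group that contains it.
# A surface may belong to more than one group.
TOY_INDEX: Dict[str, Set[int]] = {
    "漫画喫茶": {27},
    "漫喫": {27},
    "マン喫": {27},
    "インターネット": {28},
    "ネット": {28, 29},
    "網": {29},
}


def common_group_ids(surfaces: List[str]) -> Set[int]:
    """Return the group IDs shared by all surfaces, or an empty set."""
    if not surfaces:
        return set()
    # Unknown surfaces map to the empty set, which empties the intersection.
    id_sets = [TOY_INDEX.get(surface, set()) for surface in surfaces]
    return set.intersection(*id_sets)


print(common_group_ids(["漫画喫茶", "漫喫", "マン喫"]))
print(common_group_ids(["漫画喫茶", "ネット"]))
```

An empty result is a quick signal that the surfaces are not all synonyms of one another under the dictionary in use.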

For more details, see samples/sample.py.

How to build your dictionary

python -m lexpp.dict_builder {your lexicon} {output filename}

NOTE: The input file must follow [the Sudachi synonym dict format](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md).
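Before running the builder on your own lexicon, it can help to check that the file groups headwords the way you expect. The sketch below rests on my reading of the format, which you should verify against the linked document: comma-separated rows, the group ID in the first column, the headword in the ninth, and blank lines between groups. The sample rows, `GROUP_COL`, `HEADWORD_COL`, and `read_groups` are illustrative, and this is not how `lexpp.dict_builder` itself parses input.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Assumed column positions (0-based); check them against the format document.
GROUP_COL = 0
HEADWORD_COL = 8

# Illustrative rows in the assumed layout, not taken from a real dictionary file.
SAMPLE = """\
000001,1,0,1,0,0,0,(),曖昧,,
000001,1,0,1,0,0,2,(),あいまい,,
000001,1,0,2,0,0,0,(),不明確,,

000002,1,0,1,0,0,0,(),書庫,,
000002,1,0,2,0,0,0,(),アーカイブ,,
"""


def read_groups(text: str) -> Dict[int, List[str]]:
    """Group headwords by group ID, skipping the blank separator lines."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:  # blank line between groups
            continue
        if len(row) <= HEADWORD_COL:
            raise ValueError(f"line {line_no}: expected at least 9 columns")
        groups[int(row[GROUP_COL])].append(row[HEADWORD_COL])
    return dict(groups)


groups = read_groups(SAMPLE)
for group_id, headwords in groups.items():
    print(group_id, headwords)
```

To check a real file, pass its contents (for example `open(path, encoding="utf-8").read()`) instead of `SAMPLE`.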

When instantiating the Lexpp class, specify your dictionary as a parameter.

pp = lp.Lexpp(external_dict_path={your dictionary})

License

This software is licensed under Apache 2.0.

This software contains derivative work from a product released under Apache 2.0.

References

Thanks to SudachiDict for releasing useful resources.
