Commit c715708

Refactor some of the code and add two new models

AlongWY committed Apr 6, 2021
1 parent 0551569 commit c715708
Showing 46 changed files with 2,676 additions and 952 deletions.
34 changes: 29 additions & 5 deletions MODELS.md
@@ -2,6 +2,8 @@

| Model | Size | SHA256 |
| :-------: | :---------------------------------------------: | :--------------------------------------------------------------: |
| Base1(v3) | [583MB](http://39.96.43.154/ltp/v3/base1.tgz) | 397c3893e39692ced5858930e0cf8556454747a7c76521d70423a147d6f8c6d7 |
| Base2(v3) | [583MB](http://39.96.43.154/ltp/v3/base2.tgz) | 685a195f09c1947231394ef1bb814e8608252888a9a6dcc1fa5080a5a186e096 |
| Base(v3) | [491.9MB](http://39.96.43.154/ltp/v3/base.tgz) | 777a97d6770285e5ab3b0720923bc86781e3279508a72a30c2dd9140b09e5ec8 |
| Small(v3) | [156.8MB](http://39.96.43.154/ltp/v3/small.tgz) | 0992d5037cd1c62779a3b5c6d45b883a46e4782c6bcc5850117faf69a9ee6c56 |
| Tiny(v3) | [31.3MB](http://39.96.43.154/ltp/v3/tiny.tgz) | d0ab69f1493db232676423270d481080bf636bf8547e4297129b6a21c6f73612 |
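
To verify a downloaded archive against the SHA256 column, you can hash it locally. A small standalone helper (not part of LTP; the file name below is an example):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

# e.g. sha256_of('base1.tgz') should equal the Base1(v3) entry above
```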
@@ -16,14 +18,27 @@

## V2/V3 Metrics

| Model | Segmentation | POS | NER | SRL | Dependency | Semantic Dep. | Speed (sents/s) |
| :-------------: | :----------: | :---: | :---: | :---: | :--------: | :-----------: | :-------------: |
| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  | 39.12 |
| LTP 4.0 (Base1) | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.-- |
| LTP 4.0 (Base2) | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.-- |
| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 43.13 |
| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 53.22 |

**Note**: In this version, SDP uses the [CCL2020 Semantic Dependency Parsing](http://ir.hit.edu.cn/sdp2020ccl) corpus; the other corpora are the same as in V1.

The test environment is as follows:

+ Python 3.8.5
+ LTP 4.1, batch size = 8
+ CentOS Linux release 8.3.2011
+ Tesla V100-SXM2-16GB
+ Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz

**Note**: Speed figures were measured on the People's Daily NER test data, with all tasks executed sequentially. Also, the old and new versions use different corpora for semantic role labeling and semantic dependency parsing, so those scores are not directly comparable (the new version uses the SemEval 2016 corpus for semantic dependencies and CPB3.0 for semantic role labeling).

## V1 Metrics

| Model | Segmentation | POS | NER | SRL | Dependency | Semantic Dep. | Speed (sents/s) | Model Size |
@@ -55,3 +70,12 @@
| GSDSimp + OntoNotes ([GSDSimp](http://39.96.43.154/ltp/ud/gsds.tgz)) | 98.44 | 96.84 | 78.06 | 87.58 | d51b8508f290ab82d1c3844541eb774506213c1f6cf7d2b86fe7d69358d0d52a |
| GSDSimp + OntoNotes ([GSDSimp+CRF](http://39.96.43.154/ltp/ud/gsds_crf.tgz)) | 98.4 | 96.47 | 79.69 | 86.39 | 1e9ea20cfbc1837bf5736f8b8502aaecb9343590e98b04bba18e15724d3092b2 |

### UD System Performance

| Model | Segmentation | POS (XPOS) | NER | Dependency | SHA256 |
| :--------------------------------------------------------------------------: | :---: | :--------: | :------: | :------: | :--------------------------------------------------------------: |
| GSD + OntoNotes ([GSD](http://39.96.43.154/ltp/ud/gsd.tgz)) | 98.12 | 97.22 | 78.56 | 86.91 | e4fd41c6f2c6d84d6df2657f1e47078cb98364366d91e852f0980102c755592a |
| GSD + OntoNotes ([GSD+CRF](http://39.96.43.154/ltp/ud/gsd_crf.tgz)) | 97.96 | 96.81 | 79.77 | 86.06 | 0264b4a92e34bb97054ff06f99068b884c54908d1ad265926b0983f2594e1e6a |
| GSDSimp + OntoNotes ([GSDSimp](http://39.96.43.154/ltp/ud/gsds.tgz)) | 97.49 | 96.24 | 78.06 | 82.48 | d51b8508f290ab82d1c3844541eb774506213c1f6cf7d2b86fe7d69358d0d52a |
| GSDSimp + OntoNotes ([GSDSimp+CRF](http://39.96.43.154/ltp/ud/gsds_crf.tgz)) | 97.25 | 96.22 | 79.69 | 82.92 | 1e9ea20cfbc1837bf5736f8b8502aaecb9343590e98b04bba18e15724d3092b2 |

188 changes: 95 additions & 93 deletions README.md
@@ -1,93 +1,95 @@
[![LTP](https://img.shields.io/pypi/v/ltp?label=LTP4%20ALPHA)](https://pypi.org/project/ltp/)
![VERSION](https://img.shields.io/pypi/pyversions/ltp)
![CODE SIZE](https://img.shields.io/github/languages/code-size/HIT-SCIR/ltp)
![CONTRIBUTORS](https://img.shields.io/github/contributors/HIT-SCIR/ltp)
![LAST COMMIT](https://img.shields.io/github/last-commit/HIT-SCIR/ltp)
[![Documentation Status](https://readthedocs.org/projects/ltp/badge/?version=latest)](https://ltp.readthedocs.io/zh_CN/latest/?badge=latest)
[![PyPI Downloads](https://img.shields.io/pypi/dm/ltp)](https://pypi.python.org/pypi/ltp)

# LTP 4

LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools with which users can perform word segmentation, part-of-speech tagging, syntactic parsing, and other tasks on Chinese text.

If you use any source code included in this toolkit in your work, please kindly cite the following paper. The BibTeX entry is listed below:
<pre>
@article{che2020n,
title={N-LTP: A Open-source Neural Chinese Language Technology Platform with Pretrained Models},
author={Che, Wanxiang and Feng, Yunlong and Qin, Libo and Liu, Ting},
journal={arXiv preprint arXiv:2009.11616},
year={2020}
}
</pre>

## Quick Start

```python
from ltp import LTP

ltp = LTP()  # the Small model is loaded by default
seg, hidden = ltp.seg(["他叫汤姆去拿外衣。"])
pos = ltp.pos(hidden)
ner = ltp.ner(hidden)
srl = ltp.srl(hidden)
dep = ltp.dep(hidden)
sdp = ltp.sdp(hidden)
```
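
Continuing the snippet above, each call returns nested lists aligned with the segmented sentence. The values below illustrate the expected shape of the results and are indicative rather than guaranteed outputs:

```python
print(seg)  # e.g. [['他', '叫', '汤姆', '去', '拿', '外衣', '。']]
print(pos)  # e.g. [['r', 'v', 'nh', 'v', 'v', 'n', 'wp']], one tag per word
```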

**[Detailed documentation](docs/quickstart.rst)**

## Language Bindings

+ C++
+ Rust
+ Java
+ Python Rebinding

[libltp](https://github.com/HIT-SCIR/libltp)

## Metrics

| Model | Segmentation | POS | NER | SRL | Dependency | Semantic Dep. | Speed (sents/s) |
| :-------------: | :----------: | :---: | :---: | :---: | :--------: | :-----------: | :-------------: |
| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  | 39.12 |
| LTP 4.0 (Base1) | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.-- |
| LTP 4.0 (Base2) | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.-- |
| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 43.13 |
| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 53.22 |

**[Model downloads](MODELS.md)**

## Model Architecture

+ Word segmentation: Electra Small<sup>[1](#ELECTRA)</sup> + Linear
+ POS tagging: Electra Small + Linear
+ Named entity recognition: Electra Small + Relative Transformer<sup>[2](#RELTRANS)</sup> + Linear
+ Dependency parsing: Electra Small + BiAffine + Eisner<sup>[3](#Eisner)</sup> (see the biaffine sketch below)
+ Semantic dependency parsing: Electra Small + BiAffine
+ Semantic role labeling: Electra Small + BiAffine + CRF
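
The BiAffine component above refers to biaffine attention in the style of Dozat & Manning. The following is a minimal, self-contained sketch of that idea, not the repository's actual module; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Sketch of a biaffine arc scorer: scores[b, i, j] rates token j as the head of token i."""

    def __init__(self, hidden_size: int, arc_dim: int = 128):
        super().__init__()
        self.dep_mlp = nn.Sequential(nn.Linear(hidden_size, arc_dim), nn.ReLU())
        self.head_mlp = nn.Sequential(nn.Linear(hidden_size, arc_dim), nn.ReLU())
        # the extra input on the dependent side acts as a bias term
        self.weight = nn.Parameter(torch.zeros(arc_dim + 1, arc_dim))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, hidden_size], e.g. the ELECTRA encoder output
        dep = self.dep_mlp(hidden)    # [batch, seq_len, arc_dim]
        head = self.head_mlp(hidden)  # [batch, seq_len, arc_dim]
        dep = torch.cat([dep, torch.ones_like(dep[..., :1])], dim=-1)
        # scores[b, i, j] = dep[b, i] @ W @ head[b, j]
        return torch.einsum('bid,dk,bjk->bij', dep, self.weight, head)
```

The resulting score matrix is what a decoder then operates on: Eisner's algorithm for projective dependency trees, or a thresholded readout for semantic dependency graphs.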

## Building the Wheel Package

```shell script
python setup.py sdist bdist_wheel
python -m twine upload dist/*
```

## Authors

+ 冯云龙 (Yunlong Feng) <<[[email protected]](mailto:[email protected])>>

## License

1. The Language Technology Platform's source code is free for universities at home and abroad, institutes of the Chinese Academy of Sciences, and individual researchers; however, if the above institutions or individuals use the platform for commercial purposes (such as corporate cooperation projects), a fee is required.
2. Enterprises and public institutions other than the above must pay to use the platform.
3. For any payment-related matters, please email [email protected].
4. If you publish papers or obtain research results based on LTP, please state "The Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, was used" when publishing or filing the results, and send an email to [email protected] with the title and venue of the paper or result.

## Footnotes

+ <a name="ELECTRA">1</a>: [Chinese-ELECTRA](https://github.com/ymcui/Chinese-ELECTRA)
+ <a name="RELTRANS">2</a>: [TENER: Adapting Transformer Encoder for Named Entity Recognition](https://arxiv.org/abs/1911.04474)
+ <a name="Eisner">3</a>: [A PyTorch implementation of "Deep Biaffine Attention for Neural Dependency Parsing"](https://github.com/yzhangcs/parser)
1 change: 1 addition & 0 deletions ltp/algorithms/eisner.py
@@ -10,6 +10,7 @@
def eisner(scores, mask) -> torch.Tensor:
    lens = mask.sum(1)
    batch_size, seq_len, _ = scores.shape
    # permute [batch_size, seq_len, seq_len] -> [seq_len, seq_len, batch_size]
    scores = scores.permute(2, 1, 0)
    s_i = torch.full_like(scores, float('-inf'))  # chart of incomplete spans
    s_c = torch.full_like(scores, float('-inf'))  # chart of complete spans
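
The diff shows only the first lines of `eisner`; the function decodes the highest-scoring projective dependency tree from a batch of arc scores. A hedged calling sketch, with shapes inferred from the visible lines (the direction convention of `scores` is an assumption):

```python
import torch

batch_size, seq_len = 2, 6
scores = torch.randn(batch_size, seq_len, seq_len)        # scores[b, i, j]: arc score between tokens i and j
mask = torch.ones(batch_size, seq_len, dtype=torch.bool)  # lens = mask.sum(1), as in the snippet

# heads = eisner(scores, mask)  # expected to yield one head index per token
```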
1 change: 1 addition & 0 deletions ltp/data/dataset/__init__.py
@@ -8,6 +8,7 @@

from typing import Optional, Union, Dict, List
from datasets import DatasetBuilder, Features, Split, DatasetDict, Dataset
from datasets import Sequence, ClassLabel, Value, Translation, TranslationVariableLanguages


def load_dataset(
93 changes: 68 additions & 25 deletions ltp/data/dataset/bio.py
@@ -4,16 +4,41 @@

import logging

import os
import itertools
from collections import Counter

import datasets
from os.path import join
from dataclasses import dataclass
from ltp.data.utils import iter_blocks, vocab_builder

_TRAINING_FILE = "train.bio"
_DEV_FILE = "dev.bio"
_TEST_FILE = "test.bio"


@vocab_builder
def build_vocabs(data_dir, *files):
    counter = Counter()

    # skip the work if the vocab file already exists
    if os.path.exists(os.path.join(data_dir, 'vocabs', 'bio.txt')):
        return

    if not os.path.exists(os.path.join(data_dir, 'vocabs')):
        os.makedirs(os.path.join(data_dir, 'vocabs'))

    # count every BIO tag (column 1) across all input files
    for filename in files:
        for line_num, block in iter_blocks(filename=filename):
            values = [list(value) for value in zip(*block)]
            counter.update(values[1])

    # write the vocab with 'O' pinned to index 0
    with open(os.path.join(data_dir, 'vocabs', 'bio.txt'), mode='w') as f:
        tags = sorted(counter.keys())
        tags.remove('O')
        f.write('\n'.join(['O'] + tags))


def create_feature(file=None):
    if file:
        return datasets.ClassLabel(names_file=file)
@@ -31,37 +56,55 @@ class BioConfig(datasets.BuilderConfig):
class Bio(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = BioConfig

    @staticmethod
    def default_files(data_dir) -> dict:
        return {
            datasets.Split.TRAIN: join(data_dir, _TRAINING_FILE),
            datasets.Split.VALIDATION: join(data_dir, _DEV_FILE),
            datasets.Split.TEST: join(data_dir, _TEST_FILE),
        }

    def _info(self):
        build_vocabs(self.config)
        feats = {'bio': self.config.bio}
        for key in feats:
            if feats[key] is None:
                feats[key] = os.path.join(self.config.data_dir, 'vocabs', f'{key}.txt')

        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "form": datasets.Sequence(datasets.Value("string")),
                    "bio": datasets.Sequence(create_feature(feats['bio']))
                }
            ),
            supervised_keys=None,
        )

    def _split_generators(self, dl_manager):
        """We handle strings, lists and dicts in data_files."""
        if not self.config.data_files:
            raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
        data_files = dl_manager.download_and_extract(self.config.data_files)
        if isinstance(data_files, (str, list, tuple)):
            files = data_files
            if isinstance(files, str):
                files = [files]
            return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})]
        splits = []
        for split_name, files in data_files.items():
            if isinstance(files, str):
                files = [files]
            splits.append(datasets.SplitGenerator(name=split_name, gen_kwargs={"files": files}))
        return splits

    def _generate_examples(self, files):
        for filename in files:
            logging.info("⏳ Generating examples from = %s", filename)
            for line_num, block in iter_blocks(filename=filename):
                # transpose the block: column 0 holds the tokens, column 1 the BIO tags
                words, bio = [list(value) for value in zip(*block)]

                yield line_num, {
                    "form": words, "bio": bio
                }
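
`iter_blocks` is imported from `ltp.data.utils` and is not part of this diff. Judging from how it is used above, the `.bio` files are assumed to be CoNLL-style: one token and its tag per line, with blank lines separating sentences. A minimal stand-in consistent with that usage (the real implementation may differ):

```python
from typing import Iterator, List, Tuple

def iter_blocks_sketch(filename: str) -> Iterator[Tuple[int, List[List[str]]]]:
    """Yield (line_num, block) pairs, where a block is one sentence's whitespace-split rows."""
    block: List[List[str]] = []
    line_num = 0
    with open(filename, encoding='utf-8') as f:
        for line_num, line in enumerate(f):
            line = line.strip()
            if line:
                block.append(line.split())
            elif block:
                yield line_num, block
                block = []
    if block:  # the file may end without a trailing blank line
        yield line_num, block
```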