Commit c715708

Refactor some of the code and add two new models

AlongWY committed Apr 6, 2021
1 parent 0551569 commit c715708
Showing 46 changed files with 2,676 additions and 952 deletions.
34 changes: 29 additions & 5 deletions MODELS.md
@@ -2,6 +2,8 @@

| Model | Size | SHA256 |
| :-------: | :---------------------------------------------: | :--------------------------------------------------------------: |
| Base1(v3) | [583MB](http://39.96.43.154/ltp/v3/base1.tgz) | 397c3893e39692ced5858930e0cf8556454747a7c76521d70423a147d6f8c6d7 |
| Base2(v3) | [583MB](http://39.96.43.154/ltp/v3/base2.tgz) | 685a195f09c1947231394ef1bb814e8608252888a9a6dcc1fa5080a5a186e096 |
| Base(v3) | [491.9MB](http://39.96.43.154/ltp/v3/base.tgz) | 777a97d6770285e5ab3b0720923bc86781e3279508a72a30c2dd9140b09e5ec8 |
| Small(v3) | [156.8MB](http://39.96.43.154/ltp/v3/small.tgz) | 0992d5037cd1c62779a3b5c6d45b883a46e4782c6bcc5850117faf69a9ee6c56 |
| Tiny(v3) | [31.3MB](http://39.96.43.154/ltp/v3/tiny.tgz) | d0ab69f1493db232676423270d481080bf636bf8547e4297129b6a21c6f73612 |
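
To verify a downloaded archive against the SHA256 column, you can hash it locally. A small standalone helper (not part of LTP; the file name below is an example):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Return the SHA256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()

# e.g. sha256_of('base1.tgz') should equal the Base1(v3) entry above
```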
@@ -16,14 +18,27 @@

## V2/V3 Metrics

| Model | Segmentation | POS | NER | SRL | Dependency | Semantic Dep. | Speed (sents/s) |
| :-------------: | :----------: | :---: | :---: | :---: | :--------: | :-----------: | :-------------: |
| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  | 39.12 |
| LTP 4.0 (Base1) | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.-- |
| LTP 4.0 (Base2) | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.-- |
| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 43.13 |
| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 53.22 |

**Note**: In this version, SDP uses the [CCL2020 Semantic Dependency Parsing](http://ir.hit.edu.cn/sdp2020ccl) corpus; the other corpora are the same as in V1.

The test environment is as follows:

+ Python 3.8.5
+ LTP 4.1, batch size = 8
+ CentOS Linux release 8.3.2011
+ Tesla V100-SXM2-16GB
+ Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz

**Note**: Speed figures were measured on the People's Daily NER test data, with all tasks executed sequentially. Also, the old and new versions use different corpora for semantic role labeling and semantic dependency parsing, so those scores are not directly comparable (the new version uses the SemEval 2016 corpus for semantic dependencies and CPB3.0 for semantic role labeling).

## V1 Metrics

| Model | Segmentation | POS | NER | SRL | Dependency | Semantic Dep. | Speed (sents/s) | Model Size |
@@ -55,3 +70,12 @@
| GSDSimp + OntoNotes ([GSDSimp](http://39.96.43.154/ltp/ud/gsds.tgz)) | 98.44 | 96.84 | 78.06 | 87.58 | d51b8508f290ab82d1c3844541eb774506213c1f6cf7d2b86fe7d69358d0d52a |
| GSDSimp + OntoNotes ([GSDSimp+CRF](http://39.96.43.154/ltp/ud/gsds_crf.tgz)) | 98.4 | 96.47 | 79.69 | 86.39 | 1e9ea20cfbc1837bf5736f8b8502aaecb9343590e98b04bba18e15724d3092b2 |

### UD System Performance

| Model | Segmentation | POS (XPOS) | NER | Dependency | SHA256 |
| :--------------------------------------------------------------------------: | :---: | :--------: | :------: | :------: | :--------------------------------------------------------------: |
| GSD + OntoNotes ([GSD](http://39.96.43.154/ltp/ud/gsd.tgz)) | 98.12 | 97.22 | 78.56 | 86.91 | e4fd41c6f2c6d84d6df2657f1e47078cb98364366d91e852f0980102c755592a |
| GSD + OntoNotes ([GSD+CRF](http://39.96.43.154/ltp/ud/gsd_crf.tgz)) | 97.96 | 96.81 | 79.77 | 86.06 | 0264b4a92e34bb97054ff06f99068b884c54908d1ad265926b0983f2594e1e6a |
| GSDSimp + OntoNotes ([GSDSimp](http://39.96.43.154/ltp/ud/gsds.tgz)) | 97.49 | 96.24 | 78.06 | 82.48 | d51b8508f290ab82d1c3844541eb774506213c1f6cf7d2b86fe7d69358d0d52a |
| GSDSimp + OntoNotes ([GSDSimp+CRF](http://39.96.43.154/ltp/ud/gsds_crf.tgz)) | 97.25 | 96.22 | 79.69 | 82.92 | 1e9ea20cfbc1837bf5736f8b8502aaecb9343590e98b04bba18e15724d3092b2 |

188 changes: 95 additions & 93 deletions README.md
@@ -1,93 +1,95 @@
[![LTP](https://img.shields.io/pypi/v/ltp?label=LTP4%20ALPHA)](https://pypi.org/project/ltp/)
![VERSION](https://img.shields.io/pypi/pyversions/ltp)
![CODE SIZE](https://img.shields.io/github/languages/code-size/HIT-SCIR/ltp)
![CONTRIBUTORS](https://img.shields.io/github/contributors/HIT-SCIR/ltp)
![LAST COMMIT](https://img.shields.io/github/last-commit/HIT-SCIR/ltp)
[![Documentation Status](https://readthedocs.org/projects/ltp/badge/?version=latest)](https://ltp.readthedocs.io/zh_CN/latest/?badge=latest)
[![PyPI Downloads](https://img.shields.io/pypi/dm/ltp)](https://pypi.python.org/pypi/ltp)

# LTP 4

LTP (Language Technology Platform) provides a suite of Chinese natural language processing tools with which users can perform word segmentation, part-of-speech tagging, syntactic parsing, and other tasks on Chinese text.

If you use any source code included in this toolkit in your work, please kindly cite the following paper. The BibTeX entry is listed below:
<pre>
@article{che2020n,
title={N-LTP: A Open-source Neural Chinese Language Technology Platform with Pretrained Models},
author={Che, Wanxiang and Feng, Yunlong and Qin, Libo and Liu, Ting},
journal={arXiv preprint arXiv:2009.11616},
year={2020}
}
</pre>

## Quick Start

```python
from ltp import LTP

ltp = LTP()  # the Small model is loaded by default
seg, hidden = ltp.seg(["他叫汤姆去拿外衣。"])
pos = ltp.pos(hidden)
ner = ltp.ner(hidden)
srl = ltp.srl(hidden)
dep = ltp.dep(hidden)
sdp = ltp.sdp(hidden)
```
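
Continuing the snippet above, each call returns nested lists aligned with the segmented sentence. The values below illustrate the expected shape of the results and are indicative rather than guaranteed outputs:

```python
print(seg)  # e.g. [['他', '叫', '汤姆', '去', '拿', '外衣', '。']]
print(pos)  # e.g. [['r', 'v', 'nh', 'v', 'v', 'n', 'wp']], one tag per word
```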

**[Detailed documentation](docs/quickstart.rst)**

## Language Bindings

+ C++
+ Rust
+ Java
+ Python Rebinding

[libltp](https://github.com/HIT-SCIR/libltp)

## Metrics

| Model | Segmentation | POS | NER | SRL | Dependency | Semantic Dep. | Speed (sents/s) |
| :-------------: | :----------: | :---: | :---: | :---: | :--------: | :-----------: | :-------------: |
| LTP 4.0 (Base)  | 98.7  | 98.5  | 95.4  | 80.6  | 89.5  | 75.2  | 39.12 |
| LTP 4.0 (Base1) | 99.22 | 98.73 | 96.39 | 79.28 | 89.57 | 76.57 | --.-- |
| LTP 4.0 (Base2) | 99.18 | 98.69 | 95.97 | 79.49 | 90.19 | 76.62 | --.-- |
| LTP 4.0 (Small) | 98.4  | 98.2  | 94.3  | 78.4  | 88.3  | 74.7  | 43.13 |
| LTP 4.0 (Tiny)  | 96.8  | 97.1  | 91.6  | 70.9  | 83.8  | 70.1  | 53.22 |

**[Model downloads](MODELS.md)**

## Model Architecture

+ Word segmentation: Electra Small<sup>[1](#ELECTRA)</sup> + Linear
+ POS tagging: Electra Small + Linear
+ Named entity recognition: Electra Small + Relative Transformer<sup>[2](#RELTRANS)</sup> + Linear
+ Dependency parsing: Electra Small + BiAffine + Eisner<sup>[3](#Eisner)</sup> (see the biaffine sketch below)
+ Semantic dependency parsing: Electra Small + BiAffine
+ Semantic role labeling: Electra Small + BiAffine + CRF
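
The BiAffine component above refers to biaffine attention in the style of Dozat & Manning. The following is a minimal, self-contained sketch of that idea, not the repository's actual module; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Sketch of a biaffine arc scorer: scores[b, i, j] rates token j as the head of token i."""

    def __init__(self, hidden_size: int, arc_dim: int = 128):
        super().__init__()
        self.dep_mlp = nn.Sequential(nn.Linear(hidden_size, arc_dim), nn.ReLU())
        self.head_mlp = nn.Sequential(nn.Linear(hidden_size, arc_dim), nn.ReLU())
        # the extra input on the dependent side acts as a bias term
        self.weight = nn.Parameter(torch.zeros(arc_dim + 1, arc_dim))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, hidden_size], e.g. the ELECTRA encoder output
        dep = self.dep_mlp(hidden)    # [batch, seq_len, arc_dim]
        head = self.head_mlp(hidden)  # [batch, seq_len, arc_dim]
        dep = torch.cat([dep, torch.ones_like(dep[..., :1])], dim=-1)
        # scores[b, i, j] = dep[b, i] @ W @ head[b, j]
        return torch.einsum('bid,dk,bjk->bij', dep, self.weight, head)
```

The resulting score matrix is what a decoder then operates on: Eisner's algorithm for projective dependency trees, or a thresholded readout for semantic dependency graphs.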

## Building the Wheel Package

```shell script
python setup.py sdist bdist_wheel
python -m twine upload dist/*
```

## Authors

+ 冯云龙 (Yunlong Feng) <<[[email protected]](mailto:[email protected])>>

## License

1. The Language Technology Platform's source code is free for universities at home and abroad, institutes of the Chinese Academy of Sciences, and individual researchers; however, if the above institutions or individuals use the platform for commercial purposes (such as corporate cooperation projects), a fee is required.
2. Enterprises and public institutions other than the above must pay to use the platform.
3. For any payment-related matters, please email [email protected].
4. If you publish papers or obtain research results based on LTP, please state "The Language Technology Platform (LTP) developed by the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, was used" when publishing or filing the results, and send an email to [email protected] with the title and venue of the paper or result.

## Footnotes

+ <a name="ELECTRA">1</a>: [Chinese-ELECTRA](https://github.com/ymcui/Chinese-ELECTRA)
+ <a name="RELTRANS">2</a>: [TENER: Adapting Transformer Encoder for Named Entity Recognition](https://arxiv.org/abs/1911.04474)
+ <a name="Eisner">3</a>: [A PyTorch implementation of "Deep Biaffine Attention for Neural Dependency Parsing"](https://github.com/yzhangcs/parser)
1 change: 1 addition & 0 deletions ltp/algorithms/eisner.py
@@ -10,6 +10,7 @@
def eisner(scores, mask) -> torch.Tensor:
    lens = mask.sum(1)
    batch_size, seq_len, _ = scores.shape
    # permute [batch_size, seq_len, seq_len] -> [seq_len, seq_len, batch_size]
    scores = scores.permute(2, 1, 0)
    s_i = torch.full_like(scores, float('-inf'))  # chart of incomplete spans
    s_c = torch.full_like(scores, float('-inf'))  # chart of complete spans
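
The diff shows only the first lines of `eisner`; the function decodes the highest-scoring projective dependency tree from a batch of arc scores. A hedged calling sketch, with shapes inferred from the visible lines (the direction convention of `scores` is an assumption):

```python
import torch

batch_size, seq_len = 2, 6
scores = torch.randn(batch_size, seq_len, seq_len)        # scores[b, i, j]: arc score between tokens i and j
mask = torch.ones(batch_size, seq_len, dtype=torch.bool)  # lens = mask.sum(1), as in the snippet

# heads = eisner(scores, mask)  # expected to yield one head index per token
```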
1 change: 1 addition & 0 deletions ltp/data/dataset/__init__.py
@@ -8,6 +8,7 @@

from typing import Optional, Union, Dict, List
from datasets import DatasetBuilder, Features, Split, DatasetDict, Dataset
from datasets import Sequence, ClassLabel, Value, Translation, TranslationVariableLanguages


def load_dataset(
93 changes: 68 additions & 25 deletions ltp/data/dataset/bio.py
@@ -4,16 +4,41 @@

import logging

import os
import itertools
from collections import Counter

import datasets
from os.path import join
from dataclasses import dataclass
from ltp.data.utils import iter_blocks, vocab_builder

_TRAINING_FILE = "train.bio"
_DEV_FILE = "dev.bio"
_TEST_FILE = "test.bio"


@vocab_builder
def build_vocabs(data_dir, *files):
    counter = Counter()

    # skip the work if the vocab file already exists
    if os.path.exists(os.path.join(data_dir, 'vocabs', 'bio.txt')):
        return

    if not os.path.exists(os.path.join(data_dir, 'vocabs')):
        os.makedirs(os.path.join(data_dir, 'vocabs'))

    # count every BIO tag (column 1) across all input files
    for filename in files:
        for line_num, block in iter_blocks(filename=filename):
            values = [list(value) for value in zip(*block)]
            counter.update(values[1])

    # write the vocab with 'O' pinned to index 0
    with open(os.path.join(data_dir, 'vocabs', 'bio.txt'), mode='w') as f:
        tags = sorted(counter.keys())
        tags.remove('O')
        f.write('\n'.join(['O'] + tags))


def create_feature(file=None):
    if file:
        return datasets.ClassLabel(names_file=file)
@@ -31,37 +56,55 @@ class BioConfig(datasets.BuilderConfig):
class Bio(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIG_CLASS = BioConfig

    @staticmethod
    def default_files(data_dir) -> dict:
        return {
            datasets.Split.TRAIN: join(data_dir, _TRAINING_FILE),
            datasets.Split.VALIDATION: join(data_dir, _DEV_FILE),
            datasets.Split.TEST: join(data_dir, _TEST_FILE),
        }

    def _info(self):
        build_vocabs(self.config)
        feats = {'bio': self.config.bio}
        for key in feats:
            if feats[key] is None:
                feats[key] = os.path.join(self.config.data_dir, 'vocabs', f'{key}.txt')

        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "form": datasets.Sequence(datasets.Value("string")),
                    "bio": datasets.Sequence(create_feature(feats['bio']))
                }
            ),
            supervised_keys=None,
        )

    def _split_generators(self, dl_manager):
        """We handle strings, lists and dicts in data_files."""
        if not self.config.data_files:
            raise ValueError(f"At least one data file must be specified, but got data_files={self.config.data_files}")
        data_files = dl_manager.download_and_extract(self.config.data_files)
        if isinstance(data_files, (str, list, tuple)):
            files = data_files
            if isinstance(files, str):
                files = [files]
            return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"files": files})]
        splits = []
        for split_name, files in data_files.items():
            if isinstance(files, str):
                files = [files]
            splits.append(datasets.SplitGenerator(name=split_name, gen_kwargs={"files": files}))
        return splits

    def _generate_examples(self, files):
        for filename in files:
            logging.info("⏳ Generating examples from = %s", filename)
            for line_num, block in iter_blocks(filename=filename):
                # transpose the block: column 0 holds the tokens, column 1 the BIO tags
                words, bio = [list(value) for value in zip(*block)]

                yield line_num, {
                    "form": words, "bio": bio
                }
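
`iter_blocks` is imported from `ltp.data.utils` and is not part of this diff. Judging from how it is used above, the `.bio` files are assumed to be CoNLL-style: one token and its tag per line, with blank lines separating sentences. A minimal stand-in consistent with that usage (the real implementation may differ):

```python
from typing import Iterator, List, Tuple

def iter_blocks_sketch(filename: str) -> Iterator[Tuple[int, List[List[str]]]]:
    """Yield (line_num, block) pairs, where a block is one sentence's whitespace-split rows."""
    block: List[List[str]] = []
    line_num = 0
    with open(filename, encoding='utf-8') as f:
        for line_num, line in enumerate(f):
            line = line.strip()
            if line:
                block.append(line.split())
            elif block:
                yield line_num, block
                block = []
    if block:  # the file may end without a trailing blank line
        yield line_num, block
```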