Feature request: add an option for tok to preserve spaces so the text can be restored after tokenization #1802

amalgame21 · 2023-01-22T11:28:16Z

Describe the feature and the current behavior/state.

Spaces in the text (both full-width and half-width) are discarded by tok.

Will this change the current api? How?

I don't know.

Who will benefit with this feature?

People who convert between Simplified and Traditional Chinese.

Are you willing to contribute it (Yes/No):

No, it is beyond my ability.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
Python version: 3.10.9
HanLP version: 2.1.0b45, installed with pip install hanlp

Any other info

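I mainly want to use hanlp for Simplified/Traditional Chinese text conversion.

opencc's conversion sometimes goes wrong (for example, when converting between 只 and 隻).
In the discussion at its GitHub #224 (comment), I saw someone tokenize with HanLP first and then pass the tokens to opencc.
I tried that for a whole day and the results look good.
However, because tok does not preserve spaces, the original text cannot be restored correctly.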

Example

import hanlp
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
print(tok(['2021年HanLPv2.1为生产环境带来次世代最先进的多语种Neuro-linguistic programming技术。', '阿婆主来到北京立方庭参观自然语义科技公司。']))

Output:

[['2021年', 'HanLPv2.1', '为', '生产', '环境', '带来', '次世代', '最', '先进', '的', '多', '语种', 'Neuro-linguistic', 'programming', '技术', '。'], ['阿婆', '主', '来到', '北京立方庭', '参观', '自然语义科技公司', '。']]

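The space between the two words in Neuro-linguistic programming has disappeared.
After passing this output to opencc and joining the tokens back together,
the result becomes Neuro-linguisticprogramming.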

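Because my programming ability is extremely limited,
my current workflow is to read the txt file with Python,
tokenize it with hanlp's tok in Python as shown above,
dump the result to the terminal with json.dumps,
run opencc in the terminal for the Simplified/Traditional conversion,
and then restore the text with tools such as jq and sed.

Alternatively, is there a more efficient way to do tokenization-based Simplified/Traditional conversion?
Thanks!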

I've carefully completed this form.


hankcs · 2023-01-22T16:17:39Z

Hi,

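tok does not discard spaces in the text (full-width or half-width). A space between words is not part of either word, so naturally it does not appear inside the tokens. If tok decides that a word itself contains a space, that space is kept as part of the word. For example, '2021年蝴蝶图标HanLPv2.1为生产环境带来次世代最先进的多语种Neuro-linguistic programming技术。' is tokenized as ['2021年', '蝴蝶', '图标', 'HanLPv2.1', '为', '生产', '环境', '带来', '次世代', '最', '先进', '的', '多', '语种', 'Neuro-linguistic', 'programming', '技术', '。'].
What you probably mean is that 'Neuro-linguistic programming' should be treated as a single word. That is a difference of opinion about tokenization granularity. Under the MSR segmentation standard, English phrases should be split apart, so HanLP's model is accurate here.
As for your actual goal, HanLP can output the original position of each token in the text, so "restoring the text" is entirely feasible and takes only a few lines of code: https://colab.research.google.com/drive/1Q-CV_G-zSErzoT7PlVWzgYj-MNK1BBpf?usp=sharing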
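
For readers who cannot open the notebook, here is a minimal sketch of the idea rather than the notebook's actual code. It assumes each token arrives as a (token, begin, end) triple with an exclusive end offset into the original string. In HanLP this is, as far as I know, what the tokenizer returns after setting `tok.config.output_spans = True`, but treat that setting as an assumption and check it against the notebook. The names `rebuild_from_spans` and `convert` are hypothetical, and `convert` stands in for whatever per-token converter is used (opencc, for instance). The spans below are written by hand, not produced by HanLP.

```python
from typing import Callable, Iterable, Tuple

# (token, begin, end) with an exclusive end offset. This layout is assumed.
Span = Tuple[str, int, int]


def rebuild_from_spans(
    text: str,
    spans: Iterable[Span],
    convert: Callable[[str], str] = lambda token: token,
) -> str:
    """Rebuild text from token spans, copying the gaps from the original."""
    pieces = []
    cursor = 0
    for token, begin, end in spans:
        # Whatever lies between the previous token and this one (for
        # example a space) is copied verbatim from the original text.
        pieces.append(text[cursor:begin])
        pieces.append(convert(token))
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last token
    return "".join(pieces)


text = "多语种Neuro-linguistic programming技术。"
# Hand-written spans for illustration only.
spans = [
    ("多", 0, 1),
    ("语种", 1, 3),
    ("Neuro-linguistic", 3, 19),
    ("programming", 20, 31),
    ("技术", 31, 33),
    ("。", 33, 34),
]

restored = rebuild_from_spans(text, spans)
shouted = rebuild_from_spans(text, spans, convert=str.upper)
```

Because the gaps come from the original string, the space between 'Neuro-linguistic' and 'programming' survives even though it belongs to neither token.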

wencan · 2024-02-24T00:48:26Z

This is a bit awkward: I wrote my own code that compares the original text against the list of tokens in order to "restore the text".
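
The code mentioned above was not posted, so the following is only a guess at what such a comparison might look like. It assumes every token appears verbatim in the original text, in order. That would not hold if the tokenizer normalizes characters, in which case the search fails and raises an error. `align_tokens` and `restore_text` are hypothetical names, and `convert` is a placeholder for a per-token converter such as opencc.

```python
from typing import Callable, List, Sequence, Tuple


def align_tokens(text: str, tokens: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Locate each token in text, scanning left to right."""
    spans = []
    cursor = 0
    for token in tokens:
        begin = text.find(token, cursor)
        if begin < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = begin + len(token)
        spans.append((token, begin, end))
        cursor = end
    return spans


def restore_text(
    text: str,
    tokens: Sequence[str],
    convert: Callable[[str], str] = lambda token: token,
) -> str:
    """Re-join converted tokens, reinserting the gaps found in text."""
    pieces = []
    cursor = 0
    for token, begin, end in align_tokens(text, tokens):
        pieces.append(text[cursor:begin])  # skipped characters, e.g. spaces
        pieces.append(convert(token))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Sentence and tokens copied from the example in this issue.
text = "2021年HanLPv2.1为生产环境带来次世代最先进的多语种Neuro-linguistic programming技术。"
tokens = ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最",
          "先进", "的", "多", "语种", "Neuro-linguistic", "programming", "技术", "。"]

naive = "".join(tokens)               # loses the space
restored = restore_text(text, tokens)  # keeps the space
```

A plain join drops the space, while the aligned version returns the original sentence unchanged when `convert` is the identity.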

amalgame21 added the "feature request" label Jan 22, 2023

amalgame21 assigned hankcs Jan 22, 2023
