When I run pdf2txt.py on a Chinese PDF, the output is a string of English characters instead of the expected Chinese text. How can I fix this? #16

Open
LilySys opened this issue Feb 6, 2024 · 2 comments

Comments

LilySys commented Feb 6, 2024

The PDF I ran through pdf2txt.py contains Chinese text, but the output looks like this:
image

@xiaowanziyayaya

Did you ever get this solved? I'm running into the same problem.

conser12 commented Dec 24, 2024

I ran into the same problem and solved it as follows:

  1. When installing tesseract, tick the Chinese language pack
    image
  2. When calling partition_pdf, set the language to Chinese with languages=['chi_sim']
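
The two steps above can be sketched roughly like this. It assumes that partition_pdf is the one from the `unstructured` library and that the `tesseract` binary is on PATH. The helper names (`parse_tesseract_langs`, `installed_tesseract_langs`, `partition_chinese_pdf`) are made up for illustration and are not part of either tool. The check relies on `tesseract --list-langs` printing one language code per line after a header line, which may differ between versions.

```python
import subprocess


def parse_tesseract_langs(listing: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_tesseract_langs() -> list[str]:
    """Ask the local tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the listing to stderr, so read both streams.
    return parse_tesseract_langs(result.stdout + "\n" + result.stderr)


def partition_chinese_pdf(path: str):
    """Partition a PDF with the OCR language set to Simplified Chinese."""
    # Imported lazily so the helpers above work without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    # Step 1: confirm the Chinese language pack was actually installed.
    if "chi_sim" not in installed_tesseract_langs():
        raise RuntimeError("tesseract has no chi_sim language pack installed")
    # Step 2: pass the language code through to partition_pdf.
    return partition_pdf(filename=path, languages=["chi_sim"])
```

Calling `partition_chinese_pdf("example.pdf")` (a placeholder file name) should return a list of elements whose `.text` can be printed. If `chi_sim` is missing from the installed languages, OCR cannot produce Chinese output no matter what is passed to partition_pdf.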

That said, the final parsed output was not great, so Chinese support may simply not be very good.
