Symbols parsed as text #3161

pn12 · 2025-02-26T18:20:46Z

I am trying to extract text using pdfreader. The icons for email, LinkedIn, and phone are converted to text. Is there any way to prevent them from being converted to text?

stefan6419846 · 2025-02-26T18:56:02Z

Maintainer

Without seeing the actual PDF file, this is just guessing, but in general, pypdf just extracts the characters present there. You might be able to filter this by text position, but it sounds like this is not what you are looking for.
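Filtering during extraction can be sketched with the `visitor_text` hook of pypdf's `extract_text`, which calls a function with `(text, cm, tm, font_dict, font_size)` for each text chunk. The matrices carry the position. The variation below filters on the font name instead, on the assumption (not confirmed for this file) that the icons come from a dedicated icon font. `make_font_filter` and the `"FontAwesome"` default are hypothetical names chosen for illustration.

```python
def make_font_filter(skip_fonts=("FontAwesome",)):
    """Build a visitor that drops text chunks drawn with an icon font.

    Returns (visitor, parts); `parts` collects the chunks that were kept.
    The font names in `skip_fonts` are assumptions, adjust them per file.
    """
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        # font_dict can be None when no font is active yet.
        base_font = str(font_dict.get("/BaseFont", "")) if font_dict else ""
        if any(name.lower() in base_font.lower() for name in skip_fonts):
            return  # skip chunks rendered with a listed icon font
        parts.append(text)

    return visitor, parts


# Usage with pypdf (not executed here, "cv.pdf" is a placeholder):
#   from pypdf import PdfReader
#   visitor, parts = make_font_filter()
#   PdfReader("cv.pdf").pages[0].extract_text(visitor_text=visitor)
#   text = "".join(parts)
```

Whether the icon glyphs arrive as separate chunks depends on how the PDF was produced, so this may not separate them cleanly in every file.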

5 replies

pn12 Feb 27, 2025
Author

Hi @stefan6419846, please see the attached file. The icons for email, LinkedIn, and phone are converted to text:
♂phone+9184208873 /linkedinwww.linkedin.com/in/priyaa-aora /envel⌢[email protected]
19536802cc0f0a49 (3) Copy.pdf

stefan6419846 Feb 27, 2025
Maintainer

I just had a look at the file. This is because the corresponding cmaps have /Differences entries which map this accordingly, for example to avoid losing the semantic meaning.
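For readers unfamiliar with /Differences: in the PDF format it is an array of character codes, each followed by one or more glyph names that apply to consecutive codes. A minimal sketch of how such an array expands into a code-to-name table follows. The helper `parse_differences` and the sample array are hypothetical and not taken from the attached file.

```python
def parse_differences(differences):
    """Expand a /Differences-style array into {character code: glyph name}.

    An integer starts a new run; each following name is assigned to the
    current code, which then advances by one.
    """
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item  # start of a new run of codes
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = str(item)
            code += 1
    return mapping


# Hypothetical array, shaped like what an icon font might declare.
sample = [32, "/space", 65, "/phone", "/linkedin", "/envelope"]
glyph_names = parse_differences(sample)
```

A glyph name such as `/linkedin` has no standard Unicode character behind it, which is presumably why the name itself ends up in the extracted text.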

pn12 Feb 27, 2025
Author

OK, thanks for sharing the information.
Can you suggest a fix or workaround to parse these icons correctly?
In all similar input PDF files, the icons for email, LinkedIn, and phone are parsed as text, which hinders extracting those entities (email, LinkedIn, and phone).

stefan6419846 Feb 27, 2025
Maintainer

No, I do not have a reliable, direct solution for this that is free of side effects, although my experience with cmaps is rather limited. pypdf will probably not change this behavior unless there is evidence that we are violating the standard here, and provided that a fix would still work for most other use cases.

Apart from this, we as pypdf are not responsible for your post-processing as perfect text extraction from PDF files is nearly impossible - PDFs are meant for printing, not for automated processing (except maybe with the PDF/UA standard). If all or most of your CVs look like this, you should easily be able to post-process this accordingly or in general account for such alternatives.
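A minimal post-processing sketch along these lines, assuming the icon residue looks like the line quoted earlier in this thread: rather than stripping the icon names, it searches for the entities themselves. The patterns, the `extract_contacts` helper, and the sample values are illustrative assumptions and not part of pypdf.

```python
import re

# Deliberately simple patterns; real-world data will need tuning.
PATTERNS = {
    "phone": re.compile(r"\+?\d{7,15}"),
    "linkedin": re.compile(r"(?:www\.)?linkedin\.com/in/[\w-]+"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def extract_contacts(line):
    """Return the first phone, LinkedIn URL, and email found in `line`."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            found[field] = match.group(0)
    return found


# Made-up sample shaped like the extracted line from the thread.
sample = "♂phone+15550100 /linkedinwww.linkedin.com/in/jane-doe /envel⌢jane@example.com"
contacts = extract_contacts(sample)
```

This works on the sample because the residue before the email ends in a non-ASCII character. If the residue were plain letters directly in front of the address, the email pattern would absorb them, so a prefix list of known icon names may be needed as well.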

pn12 Feb 27, 2025
Author

OK, thank you for sharing this. I'll try to look for an alternative solution.
