New GetText() option: NegativeGapAsWhitespace #952

Merged (1 commit) on Dec 9, 2024

Contributor

@Kizaemon Kizaemon commented Dec 8, 2024

When parsing PDF files with tables that contain multi-line or "merged" cells, consecutive words can appear out of horizontal order. This option predicts the spaces between such words more reliably.

@Kizaemon
Contributor Author

Kizaemon commented Dec 8, 2024

Solves [#951]

@Kizaemon
Contributor Author

Kizaemon commented Dec 8, 2024

This is a sample screenshot from the PDF, where the next word starts to the left of the previous word, producing a negative gap.

[image: sample screenshot from the PDF]
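
To make the term concrete, here is a minimal sketch of how a negative gap can arise. The `Letter` class and `horizontal_gap` function are hypothetical names made up for this illustration; they are not PdfPig types, and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass
class Letter:
    """Hypothetical stand-in for an extracted glyph; not a PdfPig type."""
    value: str
    left: float   # x coordinate of the glyph's left edge
    right: float  # x coordinate of the glyph's right edge


def horizontal_gap(previous: Letter, current: Letter) -> float:
    """Distance from the previous glyph's right edge to the current glyph's left edge."""
    return current.left - previous.right


# Normal reading order: the next glyph starts to the right, so the gap is positive.
a = Letter("a", left=10.0, right=15.0)
b = Letter("b", left=18.0, right=23.0)

# Multi-line or merged cell: the next word starts back near the left of the cell.
c = Letter("c", left=2.0, right=7.0)

gap_ab = horizontal_gap(a, b)  # 18.0 - 15.0 = 3.0
gap_bc = horizontal_gap(b, c)  # 2.0 - 23.0 = -21.0
```

The second gap is negative because the glyph arrives later in content order but sits further left on the page.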

@BobLd
Collaborator

BobLd commented Dec 8, 2024

@Kizaemon thanks for sharing the information here.

Table extraction is generally tricky to do in PDF documents.

Can I ask you to look into tabula-sharp, a package based on PdfPig that specialises in table extraction?

@Kizaemon
Contributor Author

Kizaemon commented Dec 8, 2024

@BobLd Thank you for recommending tabula-sharp; I will take a look at it.

As for this pull request, I still think an option to treat negative gaps as whitespace would be useful.
The current implementation does not handle negative letter distances.

@BobLd
Collaborator

BobLd commented Dec 8, 2024

@Kizaemon I tend to agree here. While reviewing your code, I started to wonder whether we should simply always use the absolute value.

I'll give it a proper look later today.

@Kizaemon
Contributor Author

Kizaemon commented Dec 9, 2024

There is a possibility that this change (taking the absolute value unconditionally) could break existing parsing patterns for current users.
That is why I suggested adding an option, for backwards compatibility.

You know the typical usage patterns of existing users better and may decide to use the absolute value in all cases.
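
The trade-off can be sketched as follows. This is an illustration only, not the merged code: `get_text`, the `Glyph` tuple, the `negative_gap_as_whitespace` parameter name and the fixed `space_threshold` of 2.0 are all assumptions made for the example, and the real space detection in PdfPig may use different criteria.

```python
from typing import List, Tuple

# (text, left, right): a hypothetical, simplified glyph record, not a PdfPig type.
Glyph = Tuple[str, float, float]


def get_text(glyphs: List[Glyph],
             space_threshold: float = 2.0,
             negative_gap_as_whitespace: bool = False) -> str:
    """Join glyph texts, inserting a space where the horizontal gap is large enough.

    With the flag off (the default), negative gaps never produce a space,
    so existing callers see unchanged output. With the flag on, the
    absolute value of the gap is compared against the threshold.
    """
    if not glyphs:
        return ""
    parts = [glyphs[0][0]]
    for (_, _, prev_right), (text, left, _) in zip(glyphs, glyphs[1:]):
        gap = left - prev_right
        if negative_gap_as_whitespace:
            gap = abs(gap)  # assumed opt-in behaviour
        if gap > space_threshold:
            parts.append(" ")
        parts.append(text)
    return "".join(parts)


# Second word starts to the left of the first one: gap = 5.0 - 70.0 = -65.0.
wrapped = [("Total", 40.0, 70.0), ("amount", 5.0, 45.0)]
# Ordinary left-to-right words: gap = 10.0 - 5.0 = 5.0.
ordinary = [("a", 0.0, 5.0), ("b", 10.0, 15.0)]

default_text = get_text(wrapped)
opt_in_text = get_text(wrapped, negative_gap_as_whitespace=True)
ordinary_default = get_text(ordinary)
ordinary_opt_in = get_text(ordinary, negative_gap_as_whitespace=True)
```

Keeping the flag off by default leaves the wrapped words joined as before, while positive gaps behave identically in both modes. Applying the absolute value unconditionally would be equivalent to always passing `True`.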

@BobLd BobLd merged commit a2ae1f1 into UglyToad:master Dec 9, 2024
1 check passed