Parsing table columns #752

Open
mpele opened this issue Dec 21, 2024 · 1 comment

mpele commented Dec 21, 2024

I want to parse a PDF document that contains a table.
I extracted the text and its coordinates with getDataTm(). I expected that defining x-coordinate limits for each column would be enough to assign every text to its column.
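
As an illustration of the column-limit idea, here is a minimal Python sketch. It assumes the getDataTm() output has already been reduced to (x, y, text) triples like those listed in this issue. The boundary values and the `assign_column` helper are hypothetical and not part of the library.

```python
from bisect import bisect_right

def assign_column(x, boundaries):
    """Return the 0-based column index for x.

    boundaries holds the sorted left edges of the columns (hypothetical values).
    An x left of the first edge yields -1.
    """
    return bisect_right(boundaries, x) - 1

# Hypothetical left edges of four columns, in PDF user-space units.
boundaries = [0, 100, 300, 500]

# (x, y, text) triples in the shape quoted in this issue.
entries = [(50, 331, "1"), (50, 298, "2"), (396, 367, "16.12.2024")]

# Group texts by row (y value), then by column index.
rows = {}
for x, y, text in entries:
    rows.setdefault(y, {})[assign_column(x, boundaries)] = text
```

This approach only works if every text is paired with its own coordinates. In practice the y values of one visual row may also differ slightly, so they would probably need rounding or a tolerance before grouping.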

Unfortunately, some of the coordinates I got are confusing. I tried to find out what is happening, but without success.

I noticed two anomalies. The first is in the values for the row numbers in the first column (x, y, text):

50 331 1
50 298 2
796 42 3

Visually, the numbers are stacked one above the other. I should also mention that the page is landscape and the $details['MediaBox'] values are 842.25 and 595.5. I noticed that 796 + 50 ≈ 842 and that the approximate row height is about 35 for all other cells, so is it possible that the reference point has changed to the bottom right of the table?

The second mystery is the x coordinate of the last column, for which I got these values:

396 367 16.12.2024
396 333 16.12.2024
396 299 16.12.2024

The problem is that these x values fall in the middle of the table. Columns that are to the left of this column have greater x values.

My question is: is there some math that I have missed, and is it possible that the coordinates do not use the same reference system throughout the document?

mpele commented Dec 22, 2024

I created a scatter plot of the coordinates, and everything looks as it should:
[Image: newplot]

All coordinates are read correctly, and it looks like every text is read correctly too, but they are not paired correctly: some coordinates and texts are mixed up.

This is definitely a bug.

After investigation: the second text read is empty, and after that every text is shifted to the coordinates of the next text. The last text (the page number) is not shown, but its coordinates are used.
I am not sure whether it is relevant, but the problem starts when processing text in the logo area.
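
The shift described above can be modelled with a small Python sketch. It only reproduces the symptom (one extra empty text near the start, every later text paired with the next coordinates, the last text dropped). It is not a description of the library's internals, and the coordinate list, the texts, and the `pair` helper are hypothetical and simplified to the first column.

```python
# Hypothetical coordinates in reading order: logo, row 1, row 2, row 3, page number.
coords = [(30, 560), (50, 365), (50, 331), (50, 298), (796, 42)]
texts = ["Logo", "1", "2", "3", "Page 1"]

def pair(coords, texts):
    """Pair coordinates and texts by position; zip stops at the shorter list."""
    return [(x, y, t) for (x, y), t in zip(coords, texts)]

# Correct pairing: each text keeps its own coordinates.
expected = pair(coords, texts)

# Modelled symptom: an empty string shows up as the second text,
# so every following text slides onto the next coordinates.
shifted_texts = texts[:1] + [""] + texts[1:]
observed = pair(coords, shifted_texts)

# Position of the first empty text, which marks where the shift begins.
first_empty = next(i for i, (_, _, t) in enumerate(observed) if t == "")
```

Under these assumptions `observed` contains the triples reported in the first comment (50 331 1, 50 298 2, 796 42 3), and the page number text never appears even though its coordinates are used.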

k00ni added the bug label Dec 23, 2024