getDataTm is not returning all the text #761

ridgey-dev · 2025-02-06T11:40:21Z

PHP Version: 8.2.27
PDFParser Version: 2.11.0

Description:

PDF input

Expected output & actual output

The PDF contains the word {{signer1}}, but getDataTm does not return this text for the second page.

[
  [
    [
      "1",
      "0",
      "0",
      "1",
      "36.266",
      "754.031"
    ],
    "Test"
  ],
  [
    [
      "1",
      "0",
      "0",
      "1",
      "34.016",
      "701.653"
    ],
    ""
  ]
]

Note that getTextArray does return {{signer1}}. The problem seems to be related to getTextArray returning an empty string for the second page (probably because of the image?).

[
  "Test",
  "",
  "{{signer1}}"
]
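The mismatch between the two outputs above can be checked mechanically. The following is a minimal sketch, not part of PdfParser: `findMissingText` is a hypothetical helper that assumes getDataTm entries are shaped like `[matrix, text]` and that getTextArray returns a flat list of strings, as in the dumps shown here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): list the strings that appear
 * in a getTextArray()-style result but in no getDataTm()-style entry.
 *
 * @param array $dataTm    entries shaped like [matrix, text]
 * @param array $textArray flat list of strings
 * @return array strings missing from $dataTm, in their original order
 */
function findMissingText(array $dataTm, array $textArray): array
{
    // Collect the text part (index 1) of every getDataTm-style entry.
    $positioned = array_map(
        static fn (array $entry): string => (string) $entry[1],
        $dataTm
    );

    $missing = [];
    foreach ($textArray as $text) {
        // Skip empty strings: there is no text to look for.
        if (trim($text) === '') {
            continue;
        }
        if (!in_array($text, $positioned, true)) {
            $missing[] = $text;
        }
    }

    return $missing;
}

// Data copied from the two outputs in this report.
$dataTm = [
    [['1', '0', '0', '1', '36.266', '754.031'], 'Test'],
    [['1', '0', '0', '1', '34.016', '701.653'], ''],
];
$textArray = ['Test', '', '{{signer1}}'];

print_r(findMissingText($dataTm, $textArray));
```

With the data from this report, the only string reported as missing is {{signer1}}.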

Code

use Smalot\PdfParser\Parser;

// $content holds the raw contents of the PDF file
$parser = new Parser();
$document = $parser->parseContent($content);

foreach ($document->getPages() as $page) {
    foreach ($page->getDataTm() as $value) {
        var_dump($value);
    }
}
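Each value dumped by the loop above pairs a six-element matrix with a text string. As a rough sketch of how the position can be read from such an entry, the hypothetical helper `toPositionedText` below takes the last two matrix values as x and y. This assumes they are the translation components of the PDF text matrix, which matches the shape of the output in this report; it is not a PdfParser function.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: flatten one getDataTm()-style entry, shaped like
 * [matrix, text], into x, y and text.
 */
function toPositionedText(array $entry): array
{
    [$matrix, $text] = $entry;

    return [
        'x' => (float) $matrix[4], // horizontal translation (assumed)
        'y' => (float) $matrix[5], // vertical translation (assumed)
        'text' => (string) $text,
    ];
}

// First entry from the output in this report.
$entry = [['1', '0', '0', '1', '36.266', '754.031'], 'Test'];
var_dump(toPositionedText($entry));
```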


MaheKarim · 2025-02-12T09:33:37Z

// Extract PDF (Text Based Content)
$parser = new Parser();
$pdf = $parser->parseFile($filePath);
$resumeContent = $pdf->getText();

This works for me.

ridgey-dev · 2025-02-12T10:24:28Z

Yup. That is because the getText function contains some trim logic:

pdfparser/src/Smalot/PdfParser/Document.php

Line 439 in 0ddcc54

if ($text = trim($page->getText())) {

Sadly, I need the position of the text too, so I can't just use the getText function.

Maybe a solution could be to move that check to getTextArray? I'm not very familiar with that code, so it would be helpful if someone could think about this.
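To illustrate what a trim check of this kind does, here is a small stand-alone sketch. `collectNonEmptyText` is a hypothetical function written for this example only. It reuses the assign-then-test pattern from the quoted line, but it is not the library's code and not the change proposed for getTextArray.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the trim guard quoted above: keep only the
 * fragments that are still truthy after trim().
 */
function collectNonEmptyText(array $fragments): array
{
    $texts = [];
    foreach ($fragments as $fragment) {
        // Same pattern as the quoted line: assign, then test truthiness.
        if ($text = trim($fragment)) {
            $texts[] = $text;
        }
    }

    return $texts;
}

// The getTextArray output from this report.
var_dump(collectNonEmptyText(['Test', '', '{{signer1}}']));
```

The empty string is dropped and the other two values are kept. One thing to keep in mind with this pattern: PHP treats the string "0" as falsy, so a fragment consisting only of "0" would be dropped as well.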

ridgey-dev · 2025-02-12T13:10:53Z

I created a PR to fix this: #762. Can someone take a look at it?

k00ni added the bug label Feb 6, 2025

ridgey-dev added a commit to ridgey-dev/pdfparser that referenced this issue Feb 12, 2025

Fix issue smalot#761 - Add trim check in getTextArray

f3caf26

ridgey-dev mentioned this issue Feb 12, 2025

getTextArray: Add trim check in Do command #762 (Open, 4 tasks)
