Processing Results from PDF Scanning #2

pecknigel · 2020-09-07T18:26:56Z

I'm trying to make use of this and may be missing something. I get:

TypeError: Cannot read property '0' of undefined
    at init (.../node_modules/line-segmentation-gcp-vision-ocr/index.js:12:37)

It seems that it is looking for data.textAnnotations[0] but the data I'm getting back from GC Vision is structured as:

{ fullTextAnnotation: [Object], context: [Object] }

Has the data format changed? Am I missing something here?

Thanks.


ionistrati · 2020-09-07T21:36:57Z

Hi @nigelbpeck, the Google Vision OCR response contains the following keys:

[
  'faceAnnotations',
  'landmarkAnnotations',
  'logoAnnotations',
  'labelAnnotations',
  'textAnnotations',
  'localizedObjectAnnotations',
  'safeSearchAnnotation',
  'imagePropertiesAnnotation',
  'error',
  'cropHintsAnnotation',
  'fullTextAnnotation',
  'webDetection',
  'productSearchResults',
  'context'
]

fullTextAnnotation contains only the text result produced by Google's own algorithm. Both textDetection and documentTextDetection in GC Vision return the full response with the keys listed above.
You should initialize your client like this:

const vision = require('@google-cloud/vision');
const client = new vision.ImageAnnotatorClient({
    // Project ID from Google Cloud Vision
    projectId: 'your-project-id',
    // Keyfile from Google Cloud Vision
    keyFilename: './google-vision-api-keyfile.json',
});

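and call textDetection on the client you defined (both methods return a promise, so await them):
const [data] = await client.textDetection(filePath);
or
const [data] = await client.documentTextDetection(filePath);
data should then hold the full response. (The promise resolves to an array whose first element is the response object, hence the destructuring.)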

Please let me know if you initialized GC Vision OCR differently, or if you get a different response when following the same steps, so I can help you.
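
To make a mismatch like this easier to spot, here is a minimal sketch of a guard that could run before the response is handed to the library. The helper name unwrapVisionResponse is hypothetical (it is not part of this package or of @google-cloud/vision); it only assumes the response is either the object itself or an array whose first element is that object.

```javascript
// Hypothetical helper: unwrap the client result and confirm it has the
// textAnnotations array that data.textAnnotations[0] relies on.
function unwrapVisionResponse(data) {
  // The Node client resolves to an array; accept a bare object as well.
  const response = Array.isArray(data) ? data[0] : data;
  const ok =
    response &&
    Array.isArray(response.textAnnotations) &&
    response.textAnnotations.length > 0;
  if (!ok) {
    // Report the keys that are present so the actual shape is visible.
    const keys = response ? Object.keys(response).join(', ') : 'none';
    throw new Error(
      `Expected a non-empty textAnnotations array; response keys: ${keys}`
    );
  }
  return response;
}
```

With a response shaped like the one reported above, this would fail with a message listing fullTextAnnotation and context instead of the less helpful "Cannot read property '0' of undefined".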

pecknigel · 2020-09-08T12:51:25Z

Hi @ionistrati ,

Thanks for getting back to me.

I'm going off the approach recommended for detecting text in files on the Google Cloud Vision API documentation (link below), which recommends using asyncBatchAnnotateFiles() and not client.textDetection(filePath) or client.documentTextDetection(filePath).

This seems to be because my source is a PDF rather than an image.

Detect text in files (PDF/TIFF)
https://cloud.google.com/vision/docs/pdf

Submitting a one-page PDF with:

    const [operation] = await gcVisionClient.asyncBatchAnnotateFiles({
        requests: [
            {
                inputConfig: {
                    // Supported mime_types are: 'application/pdf' and 'image/tiff'
                    mimeType: 'application/pdf',
                    gcsSource: {
                        uri: `gs://${bucketName}/${fileName}`,
                    },
                },
                features: [{type: 'DOCUMENT_TEXT_DETECTION'}],
                outputConfig: {
                    gcsDestination: {
                        uri: `gs://${bucketName}/${resultsOutputFolder}`,
                    },
                },
            },
        ],
    });

I get back the following data:

{
  fullTextAnnotation: {
    pages: [
      {
        property: { detectedLanguages: [] },
        width: 746,
        height: 829,
        blocks: [ {}, {}, {}, ... ]
      }
    ],
    text: 'Many lines of text\n' +
      'Many lines of text\n' +
      '...\n'
  },
  context: {
    uri: 'gs://bucket-name/test.pdf',
    pageNumber: 1
  }
}

So it only has fullTextAnnotation and no textAnnotations.
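
One possible workaround is to rebuild a textAnnotations-like array from fullTextAnnotation. The sketch below is only that, a sketch: the function names are hypothetical, it assumes the usual pages > blocks > paragraphs > words > symbols nesting, and it assumes word boxes carry either pixel vertices or 0..1 normalizedVertices (which, as far as I know, is what PDF/TIFF results use). Whether line-segmentation-gcp-vision-ocr accepts the result is untested.

```javascript
// Convert a bounding box to pixel vertices, scaling normalized (0..1)
// coordinates by the page size when no pixel vertices are present.
function toPixelVertices(box, page) {
  if (box && box.vertices && box.vertices.length > 0) {
    return box.vertices.map((v) => ({ x: v.x || 0, y: v.y || 0 }));
  }
  const normalized = (box && box.normalizedVertices) || [];
  return normalized.map((v) => ({
    x: Math.round((v.x || 0) * page.width),
    y: Math.round((v.y || 0) * page.height),
  }));
}

// Axis-aligned rectangle enclosing every word box (empty if no words).
function enclosingBox(words) {
  const points = words.flatMap((w) => w.boundingPoly.vertices);
  if (points.length === 0) return [];
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  const [minX, maxX] = [Math.min(...xs), Math.max(...xs)];
  const [minY, maxY] = [Math.min(...ys), Math.max(...ys)];
  return [
    { x: minX, y: minY },
    { x: maxX, y: minY },
    { x: maxX, y: maxY },
    { x: minX, y: maxY },
  ];
}

// Hypothetical adapter: one entry per word, preceded by a summary entry
// holding the whole text, mirroring the layout of textAnnotations.
function fullTextToTextAnnotations(fullTextAnnotation) {
  const words = [];
  for (const page of fullTextAnnotation.pages || []) {
    for (const block of page.blocks || []) {
      for (const paragraph of block.paragraphs || []) {
        for (const word of paragraph.words || []) {
          words.push({
            description: (word.symbols || []).map((s) => s.text).join(''),
            boundingPoly: { vertices: toPixelVertices(word.boundingBox, page) },
          });
        }
      }
    }
  }
  const summary = {
    description: fullTextAnnotation.text || '',
    boundingPoly: { vertices: enclosingBox(words) },
  };
  return [summary, ...words];
}
```

Something like `data.textAnnotations = fullTextToTextAnnotations(data.fullTextAnnotation)` could then be tried before passing data to the library. For multi-page PDFs the summary box mixes coordinates from different pages, so converting one page at a time is probably safer.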

pecknigel changed the title ~~Change in Data Format?~~ Processing Results from PDF Scanning Sep 8, 2020
