Hi All, as discussed in our last meeting, I think ALTO should consider the use case where text in video is OCRed. Some common types of video text:
- Incidental text in the background (signs, storefronts, etc.)
- Hard captions or subtitles, e.g., text burned into the pixels
ALTO should support OCR of video efficiently.
This is a future-looking issue, not something we're likely to address immediately, but something to keep in mind as we work toward making ALTO a suitable representation of all OCR output and ground truth in general, whether the source is scanned documents, scene text, screenshots, or video.
Video may require special consideration because the straightforward approach, emitting an ALTO record for the OCR result of each frame, would be grossly inefficient: in most videos, text appears in only some portions, and it tends to persist across segments, either wholly or in part.
To track and drive this capability, this issue proposes that ALTO represent the "ideal" OCR results of a video much the way a human commentator would: by describing the text once, and by representing the changing text-region boundaries in the moving scene efficiently, e.g., by encoding differences between bounding boxes or by describing the motion parametrically.
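As a rough illustration of this idea, a persistent caption might be described once for the span of frames it covers, with its region given only at keyframes and interpolated in between. Note that every element and attribute below (`TextSegment`, `RegionTrack`, `KeyFrame`, etc.) is hypothetical and does not exist in the current ALTO schema; only the `HPOS`/`VPOS`/`WIDTH`/`HEIGHT` naming style is borrowed from ALTO:

```xml
<!-- Hypothetical sketch only; not valid against any released ALTO schema. -->
<!-- A caption persisting from frame 120 to frame 360 is described once.   -->
<!-- Its bounding box is stated at two keyframes and interpolated between. -->
<TextSegment STARTFRAME="120" ENDFRAME="360">
  <TextLine>
    <String CONTENT="BREAKING NEWS"/>
  </TextLine>
  <RegionTrack INTERPOLATION="linear">
    <KeyFrame FRAME="120" HPOS="100" VPOS="620" WIDTH="400" HEIGHT="40"/>
    <KeyFrame FRAME="360" HPOS="140" VPOS="620" WIDTH="400" HEIGHT="40"/>
  </RegionTrack>
</TextSegment>
```

A sketch like this stores one text string and two boxes instead of 241 per-frame records; a parametric variant could replace the keyframe pair with an explicit motion function.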
Considering video may also drive discussion of the relative roles of layout representation versus text-fragment representation, and of collection-level annotation (e.g., book, video, or newspaper) versus page-level annotation.
Relevant files:
This issue will be considered fixed when the following has happened:
For the two referenced video files, represent the ideal OCR results (i.e., OCR ground truth) efficiently in ALTO and attach the XML files to this issue.