\documentclass[runningheads]{llncs}
\usepackage{graphicx}
\usepackage[colorlinks]{hyperref}
\usepackage{listings}
\usepackage{enumitem}
\usepackage{courier}
\usepackage{booktabs}
\usepackage[dvipsnames]{xcolor}
\usepackage{amssymb}
\usepackage{pifont}
\usepackage{times}
\renewcommand{\floatpagefraction}{.8}
\usepackage{etoolbox}
\usepackage{multirow}
\usepackage{tabularx}
\usepackage{sectsty}
\usepackage{wrapfig}
\AtBeginEnvironment{quote}{\par\singlespacing\small}
% Spacing around floats
\setlength{\floatsep}{11pt plus 2pt minus 4pt}
\setlength{\textfloatsep}{11pt plus 2pt minus 4pt}
\setlength{\dblfloatsep}{\floatsep}
\setlength{\dbltextfloatsep}{11pt plus 2pt minus 4pt}
\setlength{\intextsep}{\floatsep}
\setlength{\abovecaptionskip}{5pt plus 3pt minus 2pt}
% check mark and x for table
\newcommand{\cmark}{{\color{ForestGreen}\ding{51}}}%
\newcommand{\xmark}{{\color{Maroon}\ding{55}}}%
\hypersetup{
colorlinks = false,
linkbordercolor = {blue}
}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.92}
\definecolor{superlightgray}{RGB}{242,242,242}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
\lstset{style=mystyle}
\newcommand{\numAugraphyAugmentations}{\emph{30}}
\newcommand{\numNoisyOfficeUNetEpochs}{\emph{426}}
%
\begin{document}
\title{Augraphy: A Data Augmentation Library for Document Images}
\author{Alexander Groleau\inst{1} \and
Kok Wei Chee\inst{1} \and
Stefan Larson\inst{2} \and
Samay Maini\inst{1} \and
Jonathan Boarman\inst{1}}
\authorrunning{A. Groleau et al.}
\institute{Sparkfish LLC (\email{[email protected]}) \and
Vanderbilt University}
\maketitle
\begin{abstract}
This paper introduces \emph{Augraphy}, a Python library for constructing data augmentation pipelines which produce distortions commonly seen in real-world document image datasets.
\emph{Augraphy} stands apart from other data augmentation tools by providing many different strategies to produce augmented versions of clean document images that appear as if they have been altered by standard office operations, such as printing, scanning, and faxing through old or dirty machines, degradation of ink over time, and handwritten markings.
This paper discusses the \emph{Augraphy} tool, and shows how it can be used both as a data augmentation tool for producing diverse training data for tasks such as document denoising, and also for generating challenging test data to evaluate model robustness on document image modeling tasks.
\end{abstract}
\section{Introduction and Motivation}
The \emph{Augraphy} library is designed to introduce realistic degradations to images of documents, producing training datasets for supervised learning by pairing synthetic dirty documents with their clean ground truth document image sources. The need for such a data augmentation tool in document image analysis is twofold: existing document analytics software must be robust against various forms of noise, and ground truth images are unavailable for most of the documents recorded within human history. Automated recovery of knowledge stored in such documents is thwarted by the presence of distortion artifacts, which typically increase with time. As such, there is an ongoing \textit{and growing} need to improve both document restoration techniques and software processes, unlocking contents for use in search or other tasks and providing challenging data for tooling development. \emph{Augraphy} answers this demand by supporting document-modality tasks such as reconstruction (Figure~\ref{fig:intro}), denoising (Figure~\ref{fig:shabbypages_denoising}), and robustness testing (Figure~\ref{fig:ocr_output}).
An ongoing digital transformation has reduced everyday reliance on paper. However, major industries such as healthcare~\cite{menachemi2011benefits}, insurance~\cite{gatzert2016impact}, legal~\cite{staudt2015access}, financial services~\cite{chuen2017handbook}, government~\cite{albano2018role}, and education~\cite{alkhezzi2020factors} continue to produce physical documents that are later digitally archived.
\begin{figure}
\centering
\begin{tabularx}{\linewidth}{*{3}{>{\centering\arraybackslash}X}}
Noisy Original & Clean Reproduction & \emph{Augraphy} Reproduction \\
\multicolumn{3}{c}{
\resizebox{\textwidth}{!}{
\includegraphics{figures/archetype-memo.pdf}}} \\
\end{tabularx}
\caption{\emph{Augraphy} can be used to introduce noisy perturbations to document images like the original noise seen above, a real-life sample from RVL-CDIP. We demonstrate this by reproducing a clean image, then introducing noise with \emph{Augraphy}.}
\label{fig:intro}
\end{figure}
Paper documents accumulate physical damage and alterations through printing, scanning, photocopying, and regular use. Real-world processes introduce many types of distortion that result in color changes, shadows in scanned documents, and degraded text, where low ink, annotations, highlighters, or other markings add noise that hinders semantic access by digital means.
Document image analysis tasks are impacted by the presence of such noise; high-level tasks like document classification and information extraction are frequently expected to perform well even on noisily-scanned document images.
For instance, the RVL-CDIP document classification corpus \cite{harley2015icdar-rvlcdip} consists of scanned document images, many of which have substantial amounts of scanner-induced noise, as does the FUNSD form understanding benchmark \cite{jaume2019-funsd}.
Other intermediate-level tasks like optical character recognition (OCR) and page layout analysis may perform optimally if noise in a document image is minimized \cite{character-recognition-systems,ogorman-document-image-analysis,Rotman2022-hh}.
Attempts have been made to denoise document images in order to improve document tasks \cite{ref_noisyoffice,blind-denoising-iccv-2021,kulkarni-2020,patch-based-document-denoising,Mustafa_2018-wan}. However, they have lacked the benefit of large datasets with ground truth images to train those denoisers.
A tool such as \emph{Augraphy} that helps produce copious amounts of training data with document noise artefacts is essential in this space for developing advanced document binarization and denoising processes.
For this reason we introduce \emph{Augraphy}, an open-source Python-based data augmentation library for generating versions of document images that contain realistic noise artefacts commonly introduced via scanning, photocopying, and other office procedures.
\emph{Augraphy} differs from most image data augmentation tools by specifically targeting the types of alterations and degradations seen in document images.
\emph{Augraphy} offers \numAugraphyAugmentations ~individual augmentation methods out-of-the-box across three ``phases'' of augmentations, and these individual phase augmentations can be composed together along with a ``paper factory'' step where different paper backgrounds can be added to the augmented image.
The resulting images are realistic, noisy versions of clean documents, as evidenced in Figure~\ref{fig:intro}, where we apply \emph{Augraphy} augmentations to a clean document image in order to mimic the types of noise seen in a real-world noisy document image from RVL-CDIP.
Several research efforts have used \emph{Augraphy}: Larson et al. (2022) \cite{larson-2022-rvlcdip-ood} used \emph{Augraphy} to mimic scanner-like noise for evaluating document classifiers trained on RVL-CDIP; Jadhav et al. (2022) \cite{jadhav2022pix2pix} used \emph{Augraphy} to generate noisy document images for training a document denoising GAN; and Kim et al. (2022) \cite{webvicob-2022-naver} used \emph{Augraphy} as part of a document generation pipeline for document understanding tasks.
\begin{table}
\centering
\caption{Comparison of various image-based data augmentation libraries. Number of augmentations is a rough count, and many augmentations in other tools are what \emph{Augraphy} calls Utilities. Further, many single augmentations in \emph{Augraphy} --- geometric transforms, for example --- are represented by multiple classes in other libraries.}
\resizebox{0.9\textwidth}{!}{
\begin{tabular}{lccccc}
\toprule
& \textbf{Number of} & \textbf{Document} & \textbf{Pipeline} & & \\
\textbf{Library} & \textbf{Augmentations} & \textbf{Centric} & \textbf{Based} & \textbf{Python} & \textbf{License} \\
\midrule
\emph{Augmentor} \cite{augmentor} & 27 & \xmark & \cmark & \cmark & MIT \\
\emph{Albumentations} \cite{ref_albumentations} & 216 & \xmark & \cmark & \cmark & MIT \\
\emph{imgaug} \cite{ref_imgaug} & 168 & \xmark & \cmark & \cmark & MIT \\
\emph{Augly} \cite{Papakipos2022-gq-augly} & 34 & \xmark & \cmark & \cmark & MIT \\
\emph{DocCreator} \cite{ref_DocCreator} & 12 & \cmark & \xmark & \xmark & LGPL-3.0 \\
\textbf{\emph{Augraphy} (ours)} & \textbf{\numAugraphyAugmentations} & \cmark & \cmark & \cmark & \textbf{MIT}\\
\bottomrule
\end{tabular}}
\label{tab:augmentation-library-comparison}
\end{table}
\section{Related Work}
This section reviews prior work on data augmentation and available tooling for robustness testing, particularly in the context of document understanding, processing, and analysis tasks.
Table~\ref{tab:augmentation-library-comparison} compares \emph{Augraphy} with other image augmentation libraries and tools, highlighting that most existing data augmentation libraries do not specifically cater to the corruptions typically encountered in document image analysis datasets. Instead of general image manipulation, \emph{Augraphy} focuses on document-specific distortions and alterations that are better-suited to challenging and evaluating OCR, object-detection, and other models with images which still retain sufficient quality to remain human-readable. Crucially, \emph{Augraphy} is also designed to be easily integrated into machine learning pipelines alongside other popular Python packages used in machine learning and document analysis.
\subsection{Data Augmentation}
A wide variety of data augmentation tools and pipelines exist for machine learning tasks, including natural language processing (e.g., \cite{feng-etal-2021-survey,fadaee-etal-2017-data,wei-zou-2019-eda}), audio and speech processing (e.g., \cite{ko15_interspeech,audiogmenter,audio-framework}), and computer vision and image processing.
As it pertains to computer vision, popular tools such as \emph{Augly} \cite{Papakipos2022-gq-augly}, \emph{Augmentor} \cite{augmentor}, \emph{Albumentations} \cite{ref_albumentations}, and \emph{imgaug} \cite{ref_imgaug} provide image-centric augmentation strategies including general-purpose transformations like rotations, warps, masking, and color modifications.
In contrast, as seen in Table~\ref{tab:augmentation-library-comparison}, only \emph{Augraphy} offers a data augmentation library that both imitates the corruptions commonly seen in document analysis corpora and integrates easily into modern Python-based machine learning training processes, including specialized ink, paper, and post-processing pipelines not found elsewhere. Best-effort attempts to use \emph{Albumentations} to recreate the document in Figure~\ref{fig:intro} yielded results so poor and unlike the target that we could not in good faith include them in this paper.
\emph{DocCreator} \cite{ref_DocCreator} is the next best option, providing an image-synthesizing tool for creating synthetic images that mimic common corruptions seen in document collections. However, \emph{DocCreator} differs from \emph{Augraphy} in several crucial ways.
The first difference is that \emph{DocCreator}'s augmentations are meant to imitate those seen in historical (e.g., ancient or medieval) documents, while \emph{Augraphy} is intended to replicate a wider variety of both historical and modern scenarios.
\emph{DocCreator} was also written in the C++ programming language as a monolithic \emph{what-you-see-is-what-you-get (WYSIWYG)} tool, and does not have a scripting or API interface to enable use in a broader machine learning pipeline.
\emph{Augraphy}, on the other hand, is written in Python and can be easily integrated into machine learning model development and evaluation pipelines, alongside other Python packages like PyTorch \cite{ref_pytorch}.
\subsection{Robustness Testing}
Testing a model's robustness with challenging datasets is a vital step in evaluating its resilience against real-world scenarios, helping developers identify and address weaknesses for more effective document processing.
Prior work for evaluating image classification and object detection models includes image blur; contrast, brightness and color alterations; masking; geometric transforms; pixel-level noise; and compression artefacts (e.g., \cite{image-quality-impact,imagenet-c,pathology-recommendations,Hosseini2017-kn-google-api,face-recognition-impact,pathology-schomig,Vasiljevic2016-al-bluring-impact}).
More specific to the document understanding field, recent prior work has used basic noise-like corruptions to evaluate the robustness of document classifiers trained on RVL-CDIP \cite{saifullah-2022}.
Our paper also uses robustness testing of OCR and face-detection models as a way to showcase \emph{Augraphy}'s efficacy in producing document-centric distortions which challenge these models while remaining human-readable.
\section{Document Distortion, Theory \& Technique}
As referenced in the \emph{Related Work} section of this paper, many general-purpose image distortion techniques exist for data augmentation and robustness testing. While these techniques are useful in general image analysis and understanding, they bear little resemblance to the features commonly found in real-world documents.
\emph{Augraphy}'s suite of augmentations was designed to faithfully reproduce the level of complexity in the document lifecycle as seen in the real world. Such real-world conditions and features have direct implementations either within the library or on the development roadmap.
\begin{samepage}
\noindent Some techniques exist for introducing these features into images of documents, including but not limited to the following:
\begin{enumerate}
\item Text can be generated independently of the paper texture, overlaid onto the ``paper'' by a number of blending functions, allowing use of a variety of paper textures.
\item Similarly, any markup features may be generated and overlaid by the same methods.
\item Documents can be digitized with a commercial scanner, or converted to a continuous analog signal and back with a fax machine.
\item The finished document image can be used as a texture and attached to a 3D mesh, then projected back to 2 dimensions to simulate physical deformation. DocCreator's 3D mesh support here is excellent and has inspired \emph{Augraphy} to add similar functionality in its roadmap. In the meantime, \emph{Augraphy} currently offers support for paper folds and book bindings.
\end{enumerate}
\end{samepage}
\begin{figure}
\centering
\scalebox{0.3}{
\includegraphics{figures/pipeline.png}}
\caption{Visualization of an \emph{Augraphy} pipeline, showing the composition of several image augmentations together with a specific paper background.}
\label{fig:pipeline}
\end{figure}
\emph{Augraphy} attempts to decompose the lifetime of features accumulating in a document by separating the pipeline into \emph{three phases}: \textbf{ink}, \textbf{paper}, and \textbf{post}.
The ink phase exists to sequence effects which specifically alter the printed ink --- like bleedthrough from too much ink on the page, extraneous lines or regions from a mechanically-faulty printer, or fading over time --- and transform them prior to ``printing''.
The paper phase applies transformations to the underlying paper on which the ink gets printed; here, a \texttt{PaperFactory} generator creates a random texture from a set of given texture images, and the phase can also apply effects like random noise, shadowing, watermarking, and staining.
After the ink and paper textures are separately computed, they are merged in the manner of Technique 1 from the previous section, simulating the printing of the document.
After ``printing", the document enters the post phase, wherein it may undergo modifications that might alter an already-printed document out in the world.
Augmentations are available here which simulate the printed page being folded along multiple axes, marked by handwriting or color highlighter, faxed, photocopied, scanned, photographed, burned, stained, and so on.
Figure~\ref{fig:pipeline} shows the individual phases of an example pipeline combining to produce a noised document image.
\section{Augraphy}
\emph{Augraphy} is a lightweight Python package for applying realistic perturbations to document images. It is registered on the Python Package Index (PyPI) and can be installed simply using \colorbox{superlightgray}{\textbf{\texttt{pip install augraphy}}}.
\emph{Augraphy} requires only a few commonly-used Python scientific computing and image handling packages, such as NumPy \cite{ref_numpy}. It has been tested on Windows, Linux, and Mac computing environments, and supports recent major versions of Python 3. Below is a basic out-of-the-box \emph{Augraphy} pipeline demonstrating its usage:
\begin{lstlisting}[language=Python, label={lst:sample-code}, caption={Augmenting a document image with \emph{Augraphy}.}]
import augraphy
import cv2
pipeline = augraphy.default_augraphy_pipeline()
img = cv2.imread("image.png")
data = pipeline.augment(img)
augmented_img = data["output"]
\end{lstlisting}
\emph{Augraphy} is designed to be immediately useful with little effort, especially as part of a preprocessing step for training machine learning models, so care was taken to establish good defaults.
\emph{Augraphy} provides \numAugraphyAugmentations ~unique augmentations, which may be sequenced into pipeline objects that carry out the image manipulation. The default \emph{Augraphy} pipeline (used in the code snippet above) makes use of all of these augmentations, with starting parameters selected after manual visual inspection of thousands of images.
Users of the library can define directed acyclic graphs of image transformations via the \texttt{AugraphyPipeline} API, representing the passage of a document through real-world alterations.
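As an illustration, a small custom pipeline might be assembled as follows. This is a sketch using augmentation names from Table~\ref{tab:augmentations}; the probability parameter \texttt{p} and the \texttt{OneOf} call follow the library's documented conventions, but exact constructor arguments may differ between versions.
\begin{lstlisting}[language=Python]
from augraphy import *  # documented import style for the library
import cv2

# One list of augmentations per phase; each augmentation fires
# independently with its own probability p during execution.
ink_phase = [InkBleed(p=0.7), LowInkRandomLines(p=0.5)]
paper_phase = [ColorPaper(p=0.5), SubtleNoise(p=1.0)]
post_phase = [OneOf([Folding(), BadPhotoCopy()], p=0.5)]

pipeline = AugraphyPipeline(ink_phase, paper_phase, post_phase)
noisy = pipeline.augment(cv2.imread("clean_page.png"))["output"]
\end{lstlisting}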
\begin{figure}
\includegraphics[width=\textwidth]{figures/augmentation-matrix.png}
\caption{Examples of some \emph{Augraphy} augmentations.} \label{fig2}
\end{figure}
\subsection{Augraphy Augmentations}
\emph{Augraphy} provides \numAugraphyAugmentations ~out-of-the-box augmentations, listed in Table~\ref{tab:augmentations}.
As mentioned before, Ink Phase augmentations include those that imitate noisy processes that occur in a document's life cycle when ink is printed on paper.
These augmentations include \texttt{BleedThrough}, which imitates what happens when ink bleeds through from the opposite side of the page.
Another, \texttt{LowInkLines}, produces a streaking behavior common to printers running out of ink.
Augmentations provided by the Paper Phase include \texttt{BrightnessTexturize}, which introduces random noise in the brightness channel to emulate paper textures, and \texttt{Watermark}, which imitates watermarks in a piece of paper.
Finally, the Post Phase includes augmentations that imitate noisy processes that occur after a document has been created.
Here we find \texttt{BadPhotoCopy}, which uses added noise to generate the effect of a dirty photocopier, and \texttt{BookBinding}, which creates an effect with shadow and curved lines to imitate how a page from a book might appear after capture by a flatbed scanner.
\begin{table}
\centering
\caption{Individual \emph{Augraphy} augmentations for each augmentation phase, in suggested position within a pipeline. Augmentations that work well in more than one phase are listed in the last column.}
\resizebox{0.9\textwidth}{!}{
\begin{tabular}{llll}
\toprule
\textbf{Ink Phase} & \textbf{Paper Phase} & \textbf{Post Phase} & \textbf{Multiple} \\
\midrule
\textsf{BleedThrough} & \textsf{ColorPaper}& \textsf{BadPhotoCopy} & \textsf{BrightnessTexturize} \\
\textsf{LowInkPeriodicLines} & \textsf{Watermark} & \textsf{BindingsAndFasteners} & \textsf{DirtyDrum} \\
\textsf{LowInkRandomLines} & \textsf{Gamma} & \textsf{BookBinding} & \textsf{DirtyRollers} \\
\textsf{InkBleed} & \textsf{LightingGradient} & \textsf{Folding} & \textsf{Dithering} \\
\textsf{Letterpress} & \textsf{SubtleNoise} & \textsf{JPEG} & \textsf{Geometric}\\
& \textsf{DelaunayTessellation} & \textsf{Markup} & \textsf{NoiseTexturize}\\
& \textsf{VoronoiTessellation} & \textsf{Faxify} & \textsf{Scribbles}\\
& \textsf{PatternGenerator}& \textsf{PageBorder} & \textsf{LinesDegradation}\\
& & & \textsf{Brightness}\\
\bottomrule
\end{tabular}}
\label{tab:augmentations}
\end{table}
In addition to document-specific transformations, \emph{Augraphy} also includes transformations like blur, scaling, and rotation, reducing the need for additional general-purpose augmentation libraries.
Descriptions of all \emph{Augraphy} augmentations are available online, along with the motivation for their development and usage examples.\footnote{\url{https://augraphy.readthedocs.io/en/latest/}}
\subsection{The Library}
There are four ``main sequence'' classes in the \emph{Augraphy} codebase, which together provide the bulk of the library's functionality, and when composed generate realistic synthetically-augmented document images:
\smallskip
\noindent\textbf{\texttt{Augmentation}.} ~The \texttt{Augmentation} class is the most basic class in the project, and essentially exists as a thin wrapper over a probability value in the interval $[0,1]$.
Every augmentation contained in a pipeline has its own chance of being applied during that pipeline's execution.
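Conceptually, the base class behaves like the following simplified sketch (illustrative only, not the library's exact implementation):
\begin{lstlisting}[language=Python]
import random

class Augmentation:
    """A thin wrapper over a probability of applying an effect."""
    def __init__(self, p=0.5):
        self.p = p  # chance of running during pipeline execution

    def should_run(self) -> bool:
        return random.random() <= self.p
\end{lstlisting}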
\smallskip
\noindent\textbf{\texttt{AugmentationResult}.} ~After an augmentation is applied, the output of its execution is stored in an \texttt{AugmentationResult} object and passed forward through the pipeline.
These also record a full copy of the augmentation runtime data, as well as any metadata that might be relevant for debugging or other advanced use.
\smallskip
\noindent\textbf{\texttt{AugmentationSequence}.} ~A list of \texttt{Augmentation}s --- intended to be applied in sequence --- determines an \texttt{AugmentationSequence}, which is itself both an \texttt{Augmentation} and callable (behaves like a function).
In practice, these model the pipeline phases discussed previously: each phase is a list of \texttt{Augmentation} constructor calls producing callable \texttt{Augmentation} objects, and applying the resulting \texttt{AugmentationSequence} within an \texttt{AugmentationPipeline} phase yields the image, transformed by whichever augmentations in the sequence were selected to run.
\smallskip
\noindent\textbf{\texttt{AugmentationPipeline}.} ~The bulk of the heavy lifting in \emph{Augraphy} resides in the \texttt{AugmentationPipeline}, which is our abstraction over one or more events in a physical document's life.
\smallskip
\noindent~Consider the following sequence:
\begin{enumerate}[leftmargin=4em]
\item Ink is adhered to paper material during the initial printing of a document.
\item The document is attached to a public bulletin board in a high-traffic area.
\item The pages are annotated, defaced, and eventually torn away from their securing staples, flying away in the wind.
\item Fifty years later, the tattered pages are discovered and turned over to library archivists.
\item These conservationists use delicate tools to carefully position and record images of the document, storing these in a digital repository.
\end{enumerate}
An \texttt{AugmentationPipeline} represents such document lifetimes by composing augmentations modeling each of these individual events, while collecting runtime metadata about augmentations and their parameters and storing copies of intermediate images. This allows for inspection and fine-tuning to achieve outputs with desired features, facilitating the (re)production of documents as in Figure~\ref{fig:intro}.
Realistically reproducing effects in document images requires careful attention to how those effects are produced in the real world.
Many issues, like the various forms of misprint, only affect text and images on the page.
Others, like a coffee spill, change properties of the paper itself. Further still, there are transformations like physical deformations which alter the geometry and topology of both the page material and the graphical artifacts on it.
Effectively capturing processes like these in reproducible augmentations means separating our model of a document pipeline into ink, paper, and post-processing layers, each containing some augmentations that modify the document image as it passes through.
With \emph{Augraphy}, producing realistically-noisy document images can be reduced to the definition and application of one or more \emph{Augraphy} pipelines to some clean, born-digital document images.
There are also two classes that provide additional critical functionality in order to round out the \emph{Augraphy} base library:
\noindent\textbf{\texttt{OneOf}.} ~To model the possibility that a document image has undergone one and only one of a collection of augmentations, we use \texttt{OneOf}, which simply selects one of those augmentations from the given list, and uses this to modify the image.
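For example, to apply exactly one of two ink effects on roughly half of all pipeline runs (a sketch; constructor defaults assumed):
\begin{lstlisting}[language=Python]
# Exactly one of the two augmentations is chosen when OneOf fires.
ink_phase = [OneOf([InkBleed(), Letterpress()], p=0.5)]
\end{lstlisting}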
\noindent\textbf{\texttt{PaperFactory}.} ~We often print on multiple sizes and kinds of paper, and out in the world we certainly \textit{encounter} such diverse documents.
We introduce this variation into the \texttt{AugmentationPipeline} by including \texttt{PaperFactory} in the \texttt{paper} phase of the pipeline.
This augmentation checks a local directory for images of paper to crop, scale, and use as a background for the document image.
The pipeline contains edge detection logic for lifting only text and other foreground objects from a clean image, greatly simplifying the ``printing'' onto another ``sheet'', and capturing in a reproducible way the construction method used to generate the NoisyOffice database \cite{ref_NoisyOfficeDatabase}.
Taken together, these features make \texttt{PaperFactory} a trivial way to re-print a document onto other surfaces, like hemp paper, cardboard, or wood.
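A paper phase drawing backgrounds from a local texture directory might look like the following sketch (the \texttt{texture\_path} parameter name is an assumption based on the behavior described above):
\begin{lstlisting}[language=Python]
# Crop and scale random textures from ./paper_textures as backgrounds.
paper_phase = [PaperFactory(texture_path="./paper_textures", p=1)]
\end{lstlisting}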
\smallskip
Interoperability and flexibility are core requirements of any data augmentation library, so \emph{Augraphy} includes several utility classes designed to improve developer experience:
%\smallskip
\noindent\textbf{\texttt{ComposePipelines}.} ~This class provides a means of composing two pipelines into one, allowing for the construction of complex multi-pipeline operations.
\smallskip
\noindent\textbf{\texttt{Foreign}.} ~This class can be used to wrap augmentations from external projects like \emph{Albumentations} and \emph{imgaug}.
\smallskip
\noindent\textbf{\texttt{OverlayBuilder}.} ~This class uses various blending algorithms to fuse foreground and background images together, which is useful for ``printing'' or applying other artifacts like staples or stamps.
\smallskip
\noindent\textbf{\texttt{NoiseGenerator}.} ~This class uses a blob-making algorithm to generate noise masks of varying shape, size, and location.
\section{Deep Learning with \emph{Augraphy}}
\emph{Augraphy} aims to facilitate rapid dataset creation, advancing the state of the art for document image analysis tasks.
This section describes a brief experiment in which \emph{Augraphy} was used to augment a new collection of clean documents, producing a new corpus which fills a feature gap in existing public datasets. This collection was used to train a denoising convolutional neural network capable of high-accuracy predictions, demonstrating \emph{Augraphy}'s utility for robust data augmentation.
All code used in these experiments is available in the \texttt{augraphy-paper} GitHub repository.\footnote{\url{https://github.com/sparkfish/augraphy-paper}}
\subsection{Model Architecture}
To evaluate \emph{Augraphy}, we used an off-the-shelf Nonlinear Activation-Free Network (NAFNet) \cite{ref_nafnet}, making only minor alterations to the model's training hyperparameters, changing the batch size and learning epoch count to fit our training data.
\subsection{Data Generation}
Despite recent techniques (\cite{ref_tvt2040}, \cite{ref_vtssds}) for reducing the volume of input data required to train models, large datasets remain king; feeding a model more data during training can help ensure better latent representations of more features, improving robustness of the model to noise and increasing its capacity for generalization to new data.
To train our model, we developed the \emph{ShabbyPages} dataset \cite{ref_ShabbyPages}, a real-world document image corpus with \emph{Augraphy}-generated synthetic noise.
Gathering the 600 ground-truth documents was the most challenging part, requiring the efforts of several crowdworkers across multiple days. Adding the noise to these images was trivial, however; \emph{Augraphy} makes it easy to produce large training sets.
For additional realistic variation, we also gathered 300 paper textures and used the \texttt{PaperFactory} augmentation to ``print'' our source documents onto these.
The initial 600 PDF documents were split into their component pages, totaling 6202 clean document images exported at 150 pixels per inch.
We iteratively developed an \emph{Augraphy} pipeline which produced satisfactory output with few overly-noised images, then ran each of the clean images through this to produce the base training set.
Patches were taken from each of these images and used for the final training collection.
Another new dataset --- \emph{ShabbyReal} --- was produced as part of the larger \emph{ShabbyPages} corpus \cite{ref_ShabbyPages}; this data was manually produced by applying sequences of physical operations to real paper. \emph{ShabbyReal} is fully out-of-distribution for this experiment, and also has a very high degree of diversity, providing good conditions for evaluating \emph{Augraphy}'s effect.
\begin{figure}
\centering
\includegraphics[width=0.9\columnwidth]{figures/shabbyreal_denoised.png}
\caption{Sample \emph{ShabbyReal} noisy images (left) and their denoised counterparts (right).}
\label{fig:shabbyreal_denoised}
\end{figure}
\subsection{Training Regime}
The NAFNet was given a large sample to learn from: for each image in the \emph{ShabbyPages} set, we sampled ten patches of $400 \times 400$ pixels, using these to train the model for just 16 epochs.
The model was trained with mean squared error as the loss function, using the Adam optimizer and a cosine annealing learning rate scheduler, then evaluated with the mean absolute error metric.
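The patch-sampling step can be sketched as follows; this is a hypothetical helper, and the actual training code is available in the \texttt{augraphy-paper} repository:
\begin{lstlisting}[language=Python]
import numpy as np

def paired_patches(noisy, clean, n=10, size=400, rng=None):
    # Crop n aligned size-by-size patches from a noisy/clean pair;
    # assumes both images share the same height and width.
    rng = rng or np.random.default_rng()
    h, w = clean.shape[:2]
    for _ in range(n):
        y = int(rng.integers(0, h - size + 1))
        x = int(rng.integers(0, w - size + 1))
        yield noisy[y:y+size, x:x+size], clean[y:y+size, x:x+size]
\end{lstlisting}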
\begin{figure}
\centering
\includegraphics[width=0.95\columnwidth]{figures/shabbypages_denoising.png}
\caption{Sample \emph{ShabbyPages} noisy images (left), denoised by Shabby\_NAFNet (center), and ground truth (right).}
\label{fig:shabbypages_denoising}
\end{figure}
\begin{samepage}
\subsection{Results}
To assess the model's performance on the validation task, we considered the root mean square error (RMSE), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR) metrics.
Sample predictions from the model are presented in Figure \ref{fig:shabbyreal_denoised}. The SSIM score indicates that the model generalizes very well, though it does overcompensate for shadowing features around text by increasing line thickness in the prediction, resulting in a bolder font.
\end{samepage}
\begin{table}
\centering
\caption{Denoising performance on \emph{ShabbyReal} validation task}\label{tab1}
\begin{tabular}{l@{\qquad}l}
\toprule
\textbf{Metric} & \textbf{Score} \\
\midrule
SSIM & 0.71\\
PSNR & 32.52\\
RMSE & 6.52\\
\bottomrule
\end{tabular}
\end{table}
As can be seen in Figure \ref{fig:shabbyreal_denoised}, the model was able to remove a significant amount of blur, line noise, shadowing from tire tracks, and watercoloring.
\section{Robustness Testing}
In this section, we examine the use of \emph{Augraphy} to add noise which challenges existing models.
We first use \emph{Augraphy} to add noise to an image of text with known ground truth.
The Tesseract \cite{ref_tesseract} pre-trained OCR model's performance on the clean image is compared to the OCR result on the \emph{Augraphy}-noised image using the Levenshtein distance.
Then, we add \emph{Augraphy}-generated noise to document images which contain pictures of people, and perform face-detection on the output, comparing the detection accuracy to that of the clean images.
\subsection{Character Recognition}
We first compiled 15 noise-free document images from a new corpus of born-digital documents. We then used Tesseract to generate OCR predictions on these noise-free documents, and treated these predictions as the ground-truth labels for each document, giving a baseline for comparison. Next, we generated noisy versions of the 15 documents by running them through an \emph{Augraphy} pipeline, and again used Tesseract to generate OCR predictions on these noisy documents. Comparing the word accuracy rate of the noisy OCR results against the ground-truth noise-free OCR results, we found that the noisy OCR results were on average 52\% less accurate, with degradations of up to 84\% on individual documents. This example use-case demonstrates the effectiveness of using \emph{Augraphy} to create challenging test data that remains human-readable, suitable for evaluating OCR systems.
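The comparison underlying Figure~\ref{fig:ocr_output} can be sketched as follows, assuming the \texttt{pytesseract} bindings and the default \emph{Augraphy} pipeline; file names are placeholders:
\begin{lstlisting}[language=Python]
import augraphy, cv2, pytesseract

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

clean = cv2.imread("document.png")
noisy = augraphy.default_augraphy_pipeline().augment(clean)["output"]
baseline = pytesseract.image_to_string(clean)
degraded = pytesseract.image_to_string(noisy)
print("Levenshtein distance:", levenshtein(baseline, degraded))
\end{lstlisting}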
\begin{figure}
\centering
\includegraphics[width=0.9\columnwidth]{figures/ocr_figure.png}
\caption{Sample \emph{ShabbyReal} clean/noisy image pair, and the OCR output for each. The computed Levenshtein distance between the OCR result strings is 203.}
\label{fig:ocr_output}
\end{figure}
\subsection{Face Detection Robustness Testing}
\begin{wraptable}{r}{6cm}
\centering
\caption{Face detection performance.}
\begin{tabular}{lr}
\toprule
\textbf{Model} & \textbf{Accuracy} \\
\midrule
Azure & 57.1\%\\
Google & 60.1\%\\
Amazon & 49.1\%\\
UltraFace & 4.5\%\\
\bottomrule
\end{tabular}
\label{tab:face-detection-results}
\end{wraptable}
In this section, we move beyond text-related tasks to robustness analysis of image classifiers and object detectors. In particular, we examine the robustness of face detection models on \emph{Augraphy}-altered images containing faces, such as newspapers \cite{newspaper-navigator} and legal identification documents \cite{midv-500}, which are often captured with noisy scanners.
We begin by sampling 75 images from the FDDB face detection benchmark \cite{fddbTech}, using \emph{Augraphy} to generate 10 altered versions of each image and manually removing any augmented image in which the faces are no longer reasonably visible.
We then test four face detection models on the noisy and noise-free images.
These models are: proprietary face detection models from Google,\footnote{\url{https://cloud.google.com/vision/docs/detecting-faces}} Amazon,\footnote{\url{https://aws.amazon.com/rekognition/}} and Microsoft\footnote{\url{https://azure.microsoft.com/en-us/services/cognitive-services/face}}, and lastly the freely-available UltraFace\footnote{\url{https://github.com/onnx/models/tree/main/vision/body_analysis/ultraface}} model.
Table~\ref{tab:face-detection-results} shows the face detection accuracy of all four models on the noisy test set, and example images where no faces were detected by the Azure model are shown in Figure \ref{fig:face-detection-mistakes}.
We see that all four models struggle on the \emph{Augraphy}-augmented data, with the proprietary models seeing detection performance drop to roughly 50--60\%, and UltraFace detecting only 4.5\% of the faces that it found in the noise-free data.
\begin{figure}
\centering\scalebox{0.47}{
\includegraphics{figures/azure_1x3_face_errors.pdf}}
\caption{Example augmented images in which the Azure face detector failed to detect any faces.}
\label{fig:face-detection-mistakes}
\end{figure}
\section{Conclusion and Future Work}
We presented \emph{Augraphy}, a framework licensed under the MIT open source license for generating realistic synthetically-augmented datasets of document images.
Other available tools were examined and found to lack needed features, motivating the creation of this library specifically supporting alterations and degradations seen in document images.
We described how to use \emph{Augraphy} to create a new document image dataset containing synthetic real-world noise, then presented results obtained by training a NAFNet instance on this corpus.
Finally, we examined \emph{Augraphy}'s efficacy in generating confounding data for testing the robustness of document vision models.
Future work on the \emph{Augraphy} library will focus on adding new types of augmentations, increasing performance to enable faster creation of larger datasets, providing more scale-invariant support for all document image resolutions, and responding to community-initiated feature requests.
%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{paper}
\end{document}