diff --git a/prog/cleanpdf.c b/prog/cleanpdf.c index 9976771af..727d37e00 100644 --- a/prog/cleanpdf.c +++ b/prog/cleanpdf.c @@ -87,6 +87,10 @@ * * Whenever possible, the images will be deskewed. * + * As the first step in processing, images are saved in the ./image + * directory as RGB at 300 ppi in ppm format. Each image is about 26MB. + * Delete those images after use. + * * Some pdf files have oversize media boxes. PDF is a * resolution-independent format for storing data that can be imaged. * Usually the data is stored in fonts, which are a description of the diff --git a/prog/croppdf.c b/prog/croppdf.c index ec12a12ee..24e866ea2 100644 --- a/prog/croppdf.c +++ b/prog/croppdf.c @@ -28,55 +28,28 @@ * croppdf.c * * This program concatenates all pdfs in a directory by rendering them - * as images, optionally scaling the images, and generating an output pdf. - * The pdfs are taken in lexical order. Pages are encoded with either - * tiffg4 or jpeg (DCT), or a mixture of them depending on input parameters - * and page color content. For DCT encoding, the jpeg quality factor - * can be used to trade off the size of the resulting pdf against - * the image quality. + * as images, cropping each image to the foreground region with + * options for noise removal and margins, slightly thickens long + * horizontal lines (e.g., of a music staff), and to some extent + * scales the width to fill a printed page. See documentation for + * pixCropImage() for the parameters. * - * If the pages are monochrome (black and white), use of the %onebit - * flag will achieve better compression with less distortion. - * If most of the pages are black and white, but some have color that - * needs to be saved, input parameters %onebit and %savecolor should - * be both set to 1. Then the pages with color are compressed with DCT - * and the monochrome pages are compressed with tiffg4. + * The pdfs are concatenated in lexical order, and each image + * is encoded with tiffg4. * - * The first step is to render the images as RGB, using Poppler's pdftoppm. - * Compare compresspdf with cleanpdf, which carries out several cleanup - * operations, such as deskewing and adaptive thresholding to clean - * noisy or dark backgrounds in grayscale or color images, resulting - * in high resolution, 1 bpp tiffg4 encoded images in the pdf. - * - * Syntax: - * compresspdf basedir scalefactor onebit savecolor quality title fileout + * Syntax: + * croppdf basedir threshold lrclear tbclear edgeclean + * lradd tbadd title fileout * * The %basedir is a directory where the input pdf files are located. * The program will operate on every file in this directory with * the ".pdf" extension. * - * The %scalefactor is typically used to downscale the image to - * reduce the size of the generated pdf. It should not affect the - * pdf display otherwise. For normal text on images scanned at 300 ppi, - * a 2x reduction (%scalefactor = 0.5) may be satisfactory. - * We compute an output resolution for that pdf that will cause it - * to print 11 inches high, based on the height in pixels of the - * first image in the set. - * - * Images are saved in the ./image directory as RGB in ppm format. - * If the %onebit flag is 0, these will be encoded in the output pdf - * using DCT. To force the images to be 1 bpp, with tiffg4 encoding, set - * the $onebit flag to 1. - * - * The %savecolor flag is ignored unless %onebit is 1. In that case, - * if %savecolor is 1, the image is tested for color content, and if - * even a relatively small amount is found, the image will be encoded - * with DCT instead of tiffg4. + * The %lrclear and %tbclear parameters give the number of background + * pixels to be added to the foreground region. * - * The %quality is the jpeg output quality factor for images stored - * in the pdf. Use 0 for the default value (50), which is satisfactory - * for many purposes. Use 75 for standard jpeq quality; 85-95 are very - * high quality. Allowed values are between 25 and 95. + * The %edgeclean parameter is used to remove edge noise, going from + * 0 (default, no removal) to 15 (maximally aggressive removal). * * The %title is the title given to the pdf. Use %title == "none" * to omit the title. @@ -84,14 +57,9 @@ * The pdf output is written to %fileout. It is advisable (but not * required) to have a '.pdf' extension. * - * The intent is to use pdftoppm to render the images at 150 pixels/inch - * for a full page, when scalefactor = 1.0. The renderer uses the - * mediaboxes to decide how big to make the images. If those boxes - * have values that are too large, the intermediate ppm images can - * be very large. To prevent that, we compute the resolution to input - * to pdftoppm that results in RGB ppm images representing page images - * at about 150 ppi (when scalefactor = 1.0). These images are about - * 6MB, but are written quickly because there is no compression. + * As the first step in processing, images are saved in the ./image + * directory as RGB at 300 ppi in ppm format. Each image is about 26MB. + * Delete those images after use. * * N.B. This requires the Poppler package of pdf utilities, such as * pdfimages and pdftoppm. For non-unix systems, this requires @@ -123,21 +91,20 @@ l_int32 main(int argc, char buf[256]; char *basedir, *fname, *tail, *basename, *imagedir, *title, *fileout; l_int32 threshold, lrclear, tbclear, edgeclean, lradd, tbadd; -l_int32 render_res, onebit, savecolor, i, n, ret; -l_float32 scalefactor; +l_int32 render_res, i, n, ret; SARRAY *sa; if (argc != 10) return ERROR_INT( - "Syntax: croppdf basedir threshold lrclean tbclear edgeclean " + "Syntax: croppdf basedir threshold lrclear tbclear edgeclean " "lradd tbadd title fileout", __func__, 1); basedir = argv[1]; threshold = atoi(argv[2]); - lrclear = atoi(argv[3]); /* set to 1 to enforce 1 bpp tiffg4 encoding */ - tbclear = atoi(argv[4]); /* set to 1 to enforce 1 bpp tiffg4 encoding */ - edgeclean = atoi(argv[5]); /* jpeg quality */ - lradd = atoi(argv[6]); /* set to 1 to enforce 1 bpp tiffg4 encoding */ - tbadd = atoi(argv[7]); /* set to 1 to enforce 1 bpp tiffg4 encoding */ + lrclear = atoi(argv[3]); + tbclear = atoi(argv[4]); + edgeclean = atoi(argv[5]); + lradd = atoi(argv[6]); + tbadd = atoi(argv[7]); title = argv[8]; fileout = argv[9]; setLeptDebugOK(1); @@ -184,17 +151,13 @@ SARRAY *sa; } sarrayDestroy(&sa); - /* Optionally binarize, then scale and collect all images in memory. - * If n > 100, use pixacomp instead of pixa to store everything - * before generating the pdf. - * When using the onebit option, It is important to binarize - * the images in leptonica. Do not let 'pdftoppm -mono' do - * the binarization, because it will apply error-diffusion - * dithering to gray and color images. */ + /* Process each image and collect all resulting 1 bpp images + * in memory. If n > 200, use pixacomp instead of pixa to + * store the images before generating the pdf. */ sa = getSortedPathnamesInDirectory(imagedir, NULL, 0, 0); lept_free(imagedir); sarrayWriteStderr(sa); - lept_stderr("compressing ...\n"); + lept_stderr("croping ...\n"); cropFilesToPdf(sa, threshold, lrclear, tbclear, edgeclean, lradd, tbadd, title, fileout); diff --git a/src/pdfapp.c b/src/pdfapp.c index 511bb5fc2..b2b1ea1e9 100644 --- a/src/pdfapp.c +++ b/src/pdfapp.c @@ -231,41 +231,29 @@ PIXAC *pixac1 = NULL; * \brief cropFilesToPdf() * * \param[in] sa sorted full pathnames of images - * \param[in] threshold set to 1 to enforce 1 bpp tiffg4 encoding - * \param[in] lr_clear set to 1 to enforce 1 bpp tiffg4 encoding - * \param[in] tb_clear if %onebit == 1, set to 1 to save color - * \param[in] edgeclean scaling factor applied to each image; > 0.0 - * \param[in] lr_add for jpeg: 0 for default (50; otherwise 25 - 95. - * \param[in] tb_add for jpeg: 0 for default (50; otherwise 25 - 95. + * \param[in] threshold threshold for binarization + * \param[in] lr_clear full res pixels cleared at left and right sides + * \param[in] tb_clear full res pixels cleared at top and bottom sides + * \param[in] edgeclean parameter for removing edge noise (0-15) + * default = 0 (no removal); + * \param[in] lr_add full res expansion of crop box on left and right + * \param[in] tb_add full res expansion of crop box on top and bottom * \param[in] title [optional] pdf title; can be null * \param[in] fileout pdf file of all images * \return 0 if OK, 1 on error * *
  * Notes:
- *    (1) This function is designed to optionally scale and compress a set of
- *        images, wrapping them in a pdf in the order given in the input %sa.
- *    (2) It does the image processing for prog/compresspdf.c.
- *    (3) Images in the output pdf are encoded with either tiffg4 or jpeg (DCT),
- *        or a mixture of them depending on parameters %onebit and %savecolor.
- *    (4) Parameters %onebit and %savecolor work as follows:
- *        %onebit = 0: no depth conversion, default encoding depends on depth
- *        %onebit = 1, %savecolor = 0: all images converted to 1 bpp
- *        %onebit = 1, %savecolor = 1: images without color are converted
- *           to 1 bpp; images with color have the color preserved.
- *    (5) In use, if most of the pages are 1 bpp but some have color that needs
- *        to be preserved, %onebit and %savecolor should both be 1.  This
- *        causes DCT compression of color images and tiffg4 compression
- *        of monochrome images.
- *    (6) The images will be concatenated in the order given in %sa.
- *    (7) The scalefactor is applied to each image before encoding.
- *        If you enter a value <= 0.0, it will be set to 1.0.
- *    (8) Default jpeg quality is 50; otherwise, quality factors between
- *        25 and 95 are enforced.
- *    (9) Page images at 300 ppi are about 8 Mpixels.  RGB(A) rasters are
- *        then about 32 MB (1 bpp images are about 1 MB).  If there are
- *        more than 25 images, store the images after processing as an
- *        array of compressed images (a Pixac); otherwise, use a Pixa.
+ *    (1) This function is designed to optionally remove white space from
+ *        around the page images, and generate a pdf that prints with
+ *        foreground occupying much of the full page.
+ *    (2) It does the image processing for prog/croppdf.c.
+ *    (3) Images in the output pdf are 1 bpp and encoded with tiffg4.
+ *    (4) See documentation in pixCropImage() for details on the processing.
+ *    (5) The images will be concatenated in the order given in %sa.
+ *    (6) Page images at 300 ppi are about 1 Mpixels.  We allow up to 200
+ *        uncompressed rasters to be stored in memory.  If more than 200
+ *        pages, the stored images are compressed with tiffg4.
  * 
*/ l_ok