Mx ocrmypdf #690

mimimi1968 · 2020-07-16T13:37:27Z

First: I like the project, and it was what finally got me to switch over to a paperless workflow. Thanks a lot for your work!
Second: I'm new to Python and Django. I used to code in C and C++, but Python always stayed on the to-do list... So these are my first lines ever in Python. And Django? It's really, really cool, but a lot to learn.

So in short I made the following changes:

- fought against python, pip, docker, alpine, pypi, dependencies... ;-)
- added an ocrmypdf-based parser, because I wanted the OCR to put the text into my PDFs. Unfortunately, for my scanner (Brother ADS 1700W) this feature was sold separately and only for Windows.
- changed the way the consumer works: it can "eat" the input documents (default) or preserve the unchanged input files, so I can feed the originals into a more advanced OCR tool in the future, just in case.
- addressed problems in the consumer caused by the slow transfer from the scanner into the consume dir, together with Docker on macOS: the no-inotify method ended up with zero-length documents and errors
- i18n for most of the admin GUI
- some more small enhancements, like actions for tags and correspondents directly in the form

I hope someone will find it useful.

This change was needed for slowly growing files coming in from the scanner. The old implementation grabbed them even when they were not completely written. So now, when not using inotify, we wait and watch for changes in the mtime and/or size of the file. Normally this gives the scanning process enough time to complete its work and the transfer.
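As an illustration of the polling idea described above, here is a minimal sketch. It is not the code from this PR: the function name `wait_until_stable` and its parameters (`interval`, `checks`, `timeout`) are hypothetical and chosen only for this example. It treats a file as complete once its size and mtime have stayed unchanged for a few consecutive polls, and never accepts a zero-length file.

```python
import os
import time


def wait_until_stable(path, interval=1.0, checks=3, timeout=60.0):
    """Return True once size and mtime of `path` stay unchanged for
    `checks` consecutive polls; return False on timeout or if the file
    disappears. All names and defaults here are illustrative assumptions."""
    deadline = time.monotonic() + timeout
    last = None      # (size, mtime) seen on the previous poll
    stable = 0       # number of consecutive polls without a change
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return False
        current = (st.st_size, st.st_mtime_ns)
        # A zero-length file is never considered complete.
        if current == last and st.st_size > 0:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0
            last = current
        time.sleep(interval)
    return False
```

A consumer loop could call this for each candidate file and skip the file for now when it returns False. Suitable values for `interval` and `checks` depend on how slowly the scanner writes.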


The handling of the source filename is non-trivial and possibly broken, so this is work for later. When the file object is created, the file doesn't have to exist, but for now this is assumed (and a moved file is searched for).

There were different functions to clean up the directory structure in MEDIAROOT/originals when using an intelligent file naming format. These were reduced and cleaned up to be more to the point. The idea is to address all situations in which a file is moved or could be renamed by us or others. After that, it is time to clean up empty dirs. There is no need for optimization, so one recursive function can do this for us.
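The single recursive cleanup could look roughly like the sketch below. This is an assumption about the shape of such a function, not the implementation in this PR; the name `remove_empty_dirs` and the `keep_root` parameter are hypothetical.

```python
import os


def remove_empty_dirs(root, keep_root=True):
    """Recursively delete empty directories below `root`.

    Returns True if `root` ended up empty. Empty subdirectories are
    removed; `root` itself is kept when `keep_root` is True.
    Hypothetical helper, written only for illustration."""
    # Read the listing first so that removing children does not
    # interfere with the directory iteration.
    with os.scandir(root) as it:
        entries = list(it)

    is_empty = True
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # A subdirectory only counts as gone if it was empty.
            if not remove_empty_dirs(entry.path, keep_root=False):
                is_empty = False
        else:
            # Files (and symlinks) keep the directory alive.
            is_empty = False

    if is_empty and not keep_root:
        os.rmdir(root)
    return is_empty
```

Calling it on the originals directory after a move or rename would prune any directories the naming format left empty, while never deleting the top-level directory itself.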

pitkley · 2020-10-03T17:10:11Z

Thank you for this huge contribution! I just wanted to let you know that you will have a very hard time getting a review for this PR, mainly because it is incredibly large and contains multiple, distinct features, which makes reviewing it extra tough.

If you want to see your changes land in Paperless, your best bet is to split this up into separate PRs, where each PR contains a distinct feature.

Also, without looking into this in detail, dropping support for Python 3.5, while sensible since it is EOL, at least requires an update to the documentation, which I'm not seeing. I'm not seeing any documentation at all, and for new features it would be nice to at least have short descriptions for what each thing does (and what might change for users of the application, because there seem to be quite a few changes that could be breaking to our users).

mimimi1968 · 2020-10-05T19:27:45Z

Thank you for your kind note and the hints! I developed this in a straightforward way to suit my own needs. I have now reached a point where I am considering what to do next. I can break down the rather huge change into small and distinct branches, each a separate PR. But I'm unsure if this is the way for me. Yes, it is completely understood that any merge into upstream needs review. My main question is: will there be any review in the near future?

My second question is whether my changes are acceptable at all. I tried not to break the existing behavior, apart from some changes, e.g. in the consumer code, where I let the configuration decide. So I made some rather small enhancements, started i18n with German localization, and fixed small errors.

One big feature was the support for another OCR backend, which does change the content of the consumed document in order to produce a searchable PDF.
Is there "a big plan" for features, acceptable changes or some guidance for such questions?
Of course I owe some updated documentation. I will add this soon, at least for me to remember the design decisions.

Thanks again for the time you spent looking into it and writing the comment.


mimimi1968 added 30 commits July 16, 2020 12:35

Switch to ocrmypdf

7ccf228

Introduce 'move' feature in infolder

b775152

Exclude ./consume dir

0be5e8a

Optimize order in Dockerfile

ede0f87

Catch the exception when Ocrmypdf dies

25a0982

Made LANGUAGE configurable

ad34445

Just corrected the path for static/

7caeafa

Include some management actions into GUI

10e5cbc

Renamed gunicorn.conf to remove warning

211d15e

Updated Django to 2.2 and fixed the deps

3cedc37

Started i18n localization

686b2c8

Filter for "added" field too

c26f6ab

Remove hack for date input

8db0e7c

Corrected the Dockerfile after squashing the commits

a01c9d9

Upgraded to alpine 3.12 - it just worked

9a826e7

Fixed the tests to pass at least

6cc9917

Finished the i18n for admin interface

fba442b

Added the forgotten package dep

24648f8

Added jquery.min.js so the colouring works now

ddff376

Changed some translated words to be consistent

7387257

Rollback to alpine 3.10 for tesseract not working in 3.12

220a6a7

The ocrmypdf process came back with "tesseract took too long"... So instead of investigating this, I took the shortcut.

Raised the worker timeout

d0530a1

It was too short when deleting some 10 files. So I set this to 90s for now.

Fixed the whitespace handling and small improvement for get_text

d60b94b

Better handling of change detection when scanning /consume

5785416

Add a page count to gui and business logic

01332c2

Changed the 'created' Field to DateField for better user experience

c277c57

Display the version information in footer

b36128e

Ignore further changes of version.txt

633d194

Silence pycodestyle errors and warnings

18b50c1

mimimi1968 added 16 commits July 31, 2020 23:20

Instruct ocrmypdf to rotate when needed

d06d08c

Enable forcing the conversion again

dd574c4

Update to Alpine 3.11

07e8325

Move transaction to _store function

e6efbd1

Silence optipng output

98fd129

Make ocrmypdf more robust and debug friendly

ed350b6

Remove build for python 3.5 and add 3.8

753ecfd

Just added some .keep files

d951ae1

Keep the file metadata when exporting/importing

8fa4a3c

Added the field "pages" to the serialiser code

b163c96

Create directories in MEDIAROOT when we save a new document

28a1f52

Updated the packages

2b0db66

Support for mysql as database

03e38d5

Update packages because of factoryboy not running

79c7778

Fixed an i18n issue in change form

c7305c9

mimimi1968 added 10 commits November 27, 2020 16:01

Introduced an upload facility for documents

75ab1a1

Silence pycodestyle warnings

cc3f00b

Fix some C-style format strings

6275aff

Add some build-time deps for Pillow package

6737560

Fixes again for the last 2 commits

55f5fd3

Switch over to a binary dependency on ocrmypdf

4cb1047

It was too hard to build a reliably working ocrmypdf. So for now we use the pre-built docker container and make a multi-stage build for the rest.

Add /usr/local/bin to find ocrmypdf in path

d182fb1

Silence a db warning for mysql backend

a01c5fe

Optimized Dockerfile for size of the image

0d2570e

Lock the dependencies for now

1868fff
