This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Mx ocrmypdf #690

Open
wants to merge 57 commits into
base: master

Conversation

mimimi1968

First: I like the project, and it was what finally made me switch to a paperless workflow. Thanks a lot for your work!
Second: I'm new to Python and Django. I used to code in C and C++, but Python was always on the to-do list... So these are my first lines of Python ever. And Django? It's really cool, but there is a lot to learn.

So in short I made the following changes:

  • fought with Python, pip, Docker, Alpine, PyPI, dependencies... ;-)
  • added an ocrmypdf-based parser, because I wanted the OCR to put the text into my PDFs. Unfortunately, for my scanner (Brother ADS 1700W) this feature was sold separately and only for Windows.
  • changed the way the consumer works: it can "eat" the input documents (default) or preserve the unchanged input files, so I can feed the originals into a more advanced OCR tool in the future if needed.
  • fixed consumer problems caused by the slow transfer from the scanner into the consume directory in combination with Docker on macOS: the no-inotify method ended up with zero-length documents and errors
  • i18n for most of the admin GUI
  • some smaller enhancements, such as actions for tags and correspondents directly in the form

I hope someone will find it useful.

This change was needed for slowly growing files coming in from the
scanner. The old implementation grabbed them even when they were
not completely written.
So now, when not using inotify, we wait and watch for changes in the
mtime and/or size of the file. Normally this gives the scanning
process enough time to complete its work and the transfer.
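
The wait described above could be sketched roughly as follows. This is only an illustration, not the code from this PR: the function name `wait_until_stable` and the parameters `interval`, `checks`, and `timeout` are assumptions made up for the example. It polls the size and mtime of a file and treats the file as complete only after both have stayed unchanged (and the file is non-empty) for several consecutive polls.

```python
import os
import time


def wait_until_stable(path, interval=1.0, checks=3, timeout=60.0):
    """Return True once size and mtime of `path` stayed unchanged for
    `checks` consecutive polls, or False if `timeout` seconds pass first.

    Hypothetical helper for illustration; all names are assumptions.
    """
    deadline = time.monotonic() + timeout
    last = None   # (size, mtime_ns) seen at the previous poll
    stable = 0    # number of consecutive polls without a change
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            # The file vanished or does not exist yet: start over.
            last, stable = None, 0
        else:
            current = (st.st_size, st.st_mtime_ns)
            # Zero-length files are never considered complete.
            if current == last and st.st_size > 0:
                stable += 1
                if stable >= checks:
                    return True
            else:
                last, stable = current, 0
        time.sleep(interval)
    return False
```

A consumer loop without inotify could call `wait_until_stable(path)` before handing a file to the parser and simply skip the file for this round if it returns False.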

The ocrmypdf process came back with "tesseract took too long"...
So instead of investigating this, I took the shortcut.

It was too short when deleting some 10 files. So I set this to 90s
for now.

The handling of the source filename is somewhat non-trivial and
possibly broken. So this is work for a later moment...

When the file object is created, the file doesn't have to exist,
but for now this is assumed (and a moved file is searched for).

There were several different functions to clean up the directory structure
in MEDIAROOT/originals when using an intelligent file naming format.
These were reduced and cleaned up to be more to the point. The idea is
to cover all situations in which a file is moved or could be renamed
by us or others. After that, it is time to clean up empty directories.
There is no need for optimization, so one recursive function can do
this for us.
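
As a rough illustration of the "one recursive function" idea, and not the actual code from this PR, a cleanup of empty directories might look like the sketch below. The name `remove_empty_dirs` and the `keep_root` parameter are assumptions made up for the example. The top-level directory itself is kept even if it ends up empty.

```python
import os


def remove_empty_dirs(root, keep_root=True):
    """Recursively delete empty directories below `root`.

    Returns True if `root` contained nothing but (now removed) empty
    directories. Hypothetical helper; the names are assumptions.
    """
    # Snapshot the entries first so we do not modify the directory
    # while iterating over it.
    with os.scandir(root) as it:
        entries = list(it)

    is_empty = True
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # A subdirectory only counts as "gone" if it was emptied.
            if not remove_empty_dirs(entry.path, keep_root=False):
                is_empty = False
        else:
            # Any file (or symlink) keeps this directory alive.
            is_empty = False

    if is_empty and not keep_root:
        os.rmdir(root)
    return is_empty
```

Calling it on the originals directory after a file has been moved or renamed would remove any directories that were left empty, without touching directories that still hold documents.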
@pitkley
Member

pitkley commented Oct 3, 2020

Thank you for this huge contribution! I just wanted to let you know that you will have a very hard time getting a review for this PR, mainly because it is incredibly large and contains multiple, distinct features, which makes reviewing it extra tough.

If you want to see your changes land in Paperless, your best bet is to split this up into separate PRs, where each PR contains a distinct feature.

Also, without looking into this in detail, dropping support for Python 3.5, while sensible since it is EOL, at least requires an update to the documentation, which I'm not seeing. I'm not seeing any documentation at all, and for new features it would be nice to at least have short descriptions for what each thing does (and what might change for users of the application, because there seem to be quite a few changes that could be breaking to our users).

@mimimi1968
Author

Thank you for your kind note and the hints! I developed this in a straightforward way to suit my own needs. The current state is a point where I am considering what to do next. I can break the rather huge change down into small, distinct branches, each as a separate PR, but I'm unsure whether this is the right way for me. Yes, it is completely understood that any merge into upstream needs review. My main question is: will there be any review in the near future?

The second point of interest is whether my changes are acceptable at all. I tried not to break existing behavior, apart from some changes, e.g. in the consumer code, where I let the configuration decide. So I made some rather small enhancements, started i18n with a German localization, and fixed small errors.

One big feature was the support for another OCR backend, which actually changes the content of the consumed document to produce a searchable PDF.
Is there a "big plan" for features and acceptable changes, or some guidance for such questions?
Of course I owe some updated documentation. I will add this soon, at least so that I remember the design decisions myself.

Thanks again for the time you spent looking into it and writing the comment.
