| 1 | 1 | # saram - Image/PDF OCR conversion |
| 2 | | -Get OCR in txt form from an image or pdf extension supporting multiple files from directory using pytesseract |
| 2 | +Get OCR text output (`.txt`) from image or PDF files, processing multiple files from a directory with `pytesseract`, with support for rotation when the orientation is wrong. |
| 3 | + |
| 4 | +**Currently in alpha state** |
| 5 | + |
| 6 | +[Video](https://youtu.be/Cpj3XVdsK_g) |
| 7 | + |
| 8 | +**Note:** |
| 9 | +Make sure you have an OCR tool such as `tesseract` and a language data package for it, e.g. `tesseract-data-eng`, along with `Pillow` and `Wand` for image conversion and loading. |
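
A quick way to confirm these prerequisites before running the tool is a small check script. The sketch below is not part of saram; the function name `missing_dependencies` is hypothetical, and it assumes the Python packages are importable as `pytesseract`, `PIL` (Pillow) and `wand` (Wand), and that the OCR executable is called `tesseract`.

```python
import importlib.util
import shutil

# Assumed import names: Pillow is imported as "PIL", Wand as "wand".
REQUIRED_MODULES = ("pytesseract", "PIL", "wand")


def missing_dependencies(modules=REQUIRED_MODULES, binary="tesseract"):
    """Return human-readable names of everything that could not be found."""
    missing = []
    # The OCR engine itself is an external program, so look for it on PATH.
    if shutil.which(binary) is None:
        missing.append(f"{binary} (executable not on PATH)")
    # find_spec() locates a top-level module without importing it.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            missing.append(f"{name} (Python module)")
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    if problems:
        print("Missing:", ", ".join(problems))
    else:
        print("All dependencies found.")
```

This only checks that the pieces are present; it does not verify that the language data (for example `tesseract-data-eng`) is installed.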
| 10 | + |
| 11 | +## Installation |
| 12 | + |
| 13 | +Clone the source locally: |
| 14 | +``` |
| 15 | +$ git clone https://github.com/aryaminus/saram |
| 16 | +$ cd saram |
| 17 | +$ git checkout py-module |
| 18 | +$ python main.py <dirname> |
| 19 | +``` |
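
For orientation, here is a minimal sketch of what a directory run like the one above might do for image files. It is an illustration under stated assumptions, not the actual contents of `main.py`: the helper names (`collect_inputs`, `ocr_image`, `convert_directory`) and the extension lists are hypothetical, and PDF files are only collected, not converted.

```python
from pathlib import Path

# Assumed extension sets; the real script may accept a different list.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
PDF_EXTS = {".pdf"}


def collect_inputs(dirname):
    """Return sorted image and PDF paths found directly inside dirname."""
    wanted = IMAGE_EXTS | PDF_EXTS
    return sorted(
        p for p in Path(dirname).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


def ocr_image(path):
    """Run Tesseract on one image file and return the recognized text."""
    # Imported lazily so collect_inputs() works without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def convert_directory(dirname):
    """Write <name>.txt next to every image in dirname."""
    for path in collect_inputs(dirname):
        if path.suffix.lower() in PDF_EXTS:
            # PDF pages would first need rendering to images (e.g. with Wand).
            continue
        path.with_suffix(".txt").write_text(ocr_image(path), encoding="utf-8")
```

Calling `convert_directory("scans")` would then leave one `.txt` file beside each image in a directory named `scans` (a placeholder name).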
| 20 | + |
| 21 | +## Todo |
| 22 | +- [x] Add support for PDF by PDF -> image -> txt with converted image deletion after processing |
| 23 | +- [x] Double check for orientation in case of image and PDF |
| 24 | +- [ ] Add NLP to process the most frequently repeated characters to filter content |
| 25 | +- [ ] Add Cloud Vision support for more effective character recognition |
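
The orientation check listed above can be approximated with Tesseract's orientation and script detection (OSD), which `pytesseract` exposes as `image_to_osd`. The sketch below is one possible approach and not necessarily how saram implements it; `rotation_from_osd` and `upright` are hypothetical helper names, and the rotation direction assumes the reported `Rotate` value is a clockwise correction.

```python
import re


def rotation_from_osd(osd_text):
    """Extract the 'Rotate' angle in degrees from Tesseract OSD text, or 0."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def upright(image):
    """Return the PIL image, rotated if OSD reports a non-zero angle."""
    import pytesseract  # lazy import: only needed when actually running OCR

    angle = rotation_from_osd(pytesseract.image_to_osd(image))
    if angle == 0:
        return image
    # PIL rotates counter-clockwise, so the angle is negated here
    # (assumption: 'Rotate' is the clockwise correction to apply).
    return image.rotate(-angle, expand=True)
```

OSD can fail on pages with very little text, so a caller would probably want to catch the resulting error and fall back to the unrotated image.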
| 26 | + |
| 27 | +## Reference |
| 28 | +1. <a href="https://github.com/lucab85/PDFtoTXT" target="_blank">PDFtoTXT</a> |
| 29 | +2. <a href="https://github.com/prabhakar267/ocr-convert-image-to-text" target="_blank">ocr-convert-image-to-text</a> |
| 30 | +3. <a href="https://pastebin.com/QFMpp28T" target="_blank">Fix-image-rotation</a> |
| 31 | + |
| 32 | + |
| 33 | +----------------------------------------------------------------------------------------------------------- |
| 34 | + |
| 35 | +## Contributing |
| 36 | + |
| 37 | +1. Fork it (<https://github.com/aryaminus/saram/fork>) |
| 38 | +2. Create your feature branch (`git checkout -b feature/fooBar`) |
| 39 | +3. Commit your changes (`git commit -am 'Add some fooBar'`) |
| 40 | +4. Push to the branch (`git push origin feature/fooBar`) |
| 41 | +5. Create a new Pull Request |
| 42 | + |
| 43 | +**Enjoy!** |