The project enables the automatic translation of image-based documents into various languages.
This was my first major project involving AI models.
It arose from the idea of translating longer image-based texts into my own language in order to better understand their content and meaning and to avoid misunderstandings.
Because it is my first major project, not everything is guaranteed to work.
I use Tesseract for text extraction and multiple models from Hugging Face's Helsinki-NLP for text translation.
- Extracting text from images
- Translating text
- Supporting different languages
- Tesseract 5.x.x
- Python 3.11.1+
- Pipenv (optional)
Please install these requirements. Pipenv is not necessary, but it gives you a virtual environment and makes it easier to install the packages this project needs.
Check installation with:
tesseract --version
python --version
pipenv --version
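The same check can also be scripted. The following is a minimal sketch that uses only the Python standard library; the function name `check_tools` is hypothetical and not part of the project.

```python
import shutil


def check_tools(names=("tesseract", "python", "pipenv")):
    """Map each tool name to its resolved path on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in check_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

A tool reported as not found is either not installed or not on your `PATH`.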
You have to make a few changes to the following files:
project/frontend/templates/index.html
project/backend/utility/language_shortcuts.json
To make a new language available in the web browser, you have to add two lines of code:
<!-- index.html -->
<!-- Example adding Polish -->
<option value="polish" {{ "selected" if src_language == "polish" else "" }} >Polish</option>
The models also need to know the new languages, so we have to update the language_shortcuts.json file. Just add the following lines (example with Polish). You need to know the shortcuts of the languages for both tools. The shortcuts for Tesseract are listed here and those for the translation model here:
{
"english": {
"tesseract": "eng",
"translation": "en"
},
"french": {
"tesseract": "fra",
"translation": "fr"
},
"german": {
"tesseract": "deu",
"translation": "de"
},
"spanish": {
"tesseract": "spa",
"translation": "es"
},
// Adding new language
"polish": {
"tesseract": "pol",
"translation": "pl"
}
}
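Before starting the application, it can help to check that every entry has both shortcuts. The sketch below assumes the file is read with a standard JSON parser; in that case the `// Adding new language` line above is only an annotation and must be left out of the real file, because JSON does not allow comments. The function name `load_shortcuts` is hypothetical.

```python
import json


def load_shortcuts(text):
    """Parse the JSON text and check that every language has both shortcuts."""
    shortcuts = json.loads(text)
    for language, codes in shortcuts.items():
        missing = {"tesseract", "translation"} - set(codes)
        if missing:
            raise ValueError(f"{language}: missing keys {sorted(missing)}")
    return shortcuts


example = """
{
  "english": {"tesseract": "eng", "translation": "en"},
  "polish": {"tesseract": "pol", "translation": "pl"}
}
"""
shortcuts = load_shortcuts(example)
print(shortcuts["polish"]["tesseract"])  # pol
```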
Tesseract does not know the new language yet. You have to install the corresponding language package from here. Just download the xxx.traineddata file for your language and copy it into the existing tessdata folder in your Tesseract installation path.
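To confirm that the files ended up in the right place, you could list the shortcuts that still lack a `.traineddata` file. This is a minimal sketch using only the standard library; `missing_traineddata` is a hypothetical helper, and the tessdata folder is whatever path your own installation uses.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, codes):
    """Return the Tesseract shortcuts that have no <code>.traineddata file."""
    folder = Path(tessdata_dir)
    return [code for code in codes if not (folder / f"{code}.traineddata").is_file()]
```

Calling it with your tessdata folder and `["eng", "pol"]` returns an empty list once both language packages are installed.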
The translation model is downloaded automatically if it is available on Hugging Face. Not every language pair has a model, so certain translation directions may not work. You can check here to see which models are available.
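The Helsinki-NLP models on Hugging Face are published per direction under ids of the form `Helsinki-NLP/opus-mt-<source>-<target>`. The sketch below builds such an id from two translation shortcuts; whether this project constructs its model names the same way is an assumption, and `opus_mt_model_id` is a hypothetical name.

```python
def opus_mt_model_id(src, tgt):
    """Build the Hugging Face model id for one translation direction."""
    if src == tgt:
        raise ValueError("source and target language must differ")
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


print(opus_mt_model_id("pl", "en"))  # Helsinki-NLP/opus-mt-pl-en
```

Because each direction is a separate model, Polish to English and English to Polish have to be looked up separately.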
You have to change the path in the file ocr_model.py in line 59. You need the absolute path to your Tesseract OCR executable.
pytesseract.pytesseract.tesseract_cmd = os.path.join("C:\\Users", getpass.getuser(), r"AppData\Local\Programs\Tesseract-OCR" , "tesseract.exe")
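If you would rather not edit the path by hand, one possible alternative is to look the executable up at runtime. The sketch below is not what the project does: it checks an environment variable first (the name `TESSERACT_CMD` is an assumption made for this example) and then falls back to searching `PATH`.

```python
import os
import shutil


def resolve_tesseract_cmd(env=None):
    """Return a path to the Tesseract executable, or None if none is found."""
    env = os.environ if env is None else env
    override = env.get("TESSERACT_CMD")  # hypothetical variable name
    if override:
        return override
    # Falls back to whatever "tesseract" resolves to on PATH.
    return shutil.which("tesseract")
```

The result could then be assigned to `pytesseract.pytesseract.tesseract_cmd`, provided it is not `None`.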
You have to change the path in the file utility.py in line 15. You need the absolute path to your language_shortcuts.json file.
return os.path.join("C:\\Users", getpass.getuser(), r"Desktop\text_translation_of_image_base_documents\project\backend\utility", r"language_shortcuts.json")
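A path derived from the module's own location would avoid the user-specific part. The following sketch assumes that utility.py sits in the same folder as language_shortcuts.json, which the source does not confirm; `shortcuts_path` is a hypothetical name.

```python
from pathlib import Path


def shortcuts_path(module_file):
    """Return the language_shortcuts.json path next to the given module file."""
    return Path(module_file).resolve().parent / "language_shortcuts.json"
```

Inside utility.py this would be called as `shortcuts_path(__file__)`.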
The Pipfiles needed to create the virtual environment are already in the project folder.
Just navigate to project/ and execute the following command:
pipenv install
This installs all packages. You may need to reload your IDE to get access to the newly created environment.
To start the application, navigate to project/frontend and execute the following command:
pipenv run py .\main.py
The application is now running. To use it in your web browser, you need to know your IPv4 address. You can look it up with:
ipconfig
Now open your browser and enter the following URL: <your-ipv4-address>:8000. You should now see the application.
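As an alternative to reading the address from `ipconfig`, the address can be guessed from Python. This is a best-effort sketch using only the standard library; `local_ipv4` is a hypothetical helper, and on machines with several network adapters it may not pick the one you expect.

```python
import socket


def local_ipv4():
    """Best-effort guess of this machine's LAN IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Connecting a UDP socket sends no packets; it only selects a route.
        sock.connect(("192.0.2.1", 80))  # documentation-only address, never contacted
        return sock.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no usable network route
    finally:
        sock.close()


print(f"{local_ipv4()}:8000")
```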
You can use the same URL to run the application on different devices.