subtile-ocr

subtile-ocr is a blazingly fast and accurate tool for converting DVD VobSub subtitles to SRT. It started as a fork of vobsubocr.

Background

DVD subtitles are unfortunately encoded essentially as a series of images. This is a problem when you need a text representation of the subtitles, e.g. for language learning. subtile-ocr addresses this by generating SRT subtitles from an input VobSub file, using Tesseract for the text recognition.

Installation

Install the latest release with cargo:

cargo install subtile-ocr

Alternatively, install the development version from git:

cargo install --git https://github.com/gwen-lg/subtile-ocr

You will need Tesseract's development libraries installed; see the leptess README for more details. If you use Nix, the included shell.nix provides an environment with all of the necessary dependencies.

Usage

# Convert Simplified Chinese VobSub subtitles and print them to stdout.
subtile-ocr -l chi_sim shrek_chi.idx

# Convert English VobSub subtitles and write them to a file named "shrek_eng.srt".
subtile-ocr -l eng -o shrek_eng.srt shrek_eng.idx

We can also specify more advanced configuration options for Tesseract with -c.

# Convert subtitles and blacklist the specified characters from being (mistakenly) recognized.
subtile-ocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx

How does it work/compare to similar tools?

The most comparable tool to subtile-ocr is VobSub2SRT. subtile-ocr produces significantly better output, especially for non-English languages, mainly because VobSub2SRT does very little preprocessing of the image before sending it to Tesseract. For example, Tesseract 4.0 expects black text on a white background, which subtile-ocr guarantees and VobSub2SRT does not. Additionally, subtile-ocr splits each subtitle line into a separate image to take advantage of page segmentation method 7 (which treats the image as a single text line). This greatly improves accuracy, particularly for non-English languages.
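To make the two preprocessing ideas above more concrete, here is a minimal sketch in Rust using only the standard library. It is not subtile-ocr's actual implementation: the GrayImage type, the function names to_black_on_white and split_lines, and the fixed threshold are all assumptions made for illustration. The first function forces black text on a white background, and the second finds the row ranges that contain text so that each line could be cropped and passed to Tesseract on its own.

```rust
/// A grayscale image stored row by row; 0 is black, 255 is white.
/// (Hypothetical type for illustration, not part of subtile-ocr.)
pub struct GrayImage {
    pub width: usize,
    pub height: usize,
    pub pixels: Vec<u8>,
}

/// Binarize so that text becomes black (0) on a white (255) background.
/// `text_is_light` says whether the glyphs are brighter than the
/// background in the input image (an assumption supplied by the caller).
pub fn to_black_on_white(img: &GrayImage, threshold: u8, text_is_light: bool) -> GrayImage {
    let pixels = img
        .pixels
        .iter()
        .map(|&p| {
            let is_text = if text_is_light { p >= threshold } else { p < threshold };
            if is_text { 0 } else { 255 }
        })
        .collect();
    GrayImage { width: img.width, height: img.height, pixels }
}

/// Find the row ranges that contain at least one black pixel.
/// Each `(start, end)` pair is a half-open range of rows for one text line.
pub fn split_lines(img: &GrayImage) -> Vec<(usize, usize)> {
    let mut lines = Vec::new();
    let mut start: Option<usize> = None;
    for y in 0..img.height {
        let row = &img.pixels[y * img.width..(y + 1) * img.width];
        let has_ink = row.iter().any(|&p| p == 0);
        match (has_ink, start) {
            // First inked row after a gap: a new line begins here.
            (true, None) => start = Some(y),
            // First empty row after ink: the current line ends here.
            (false, Some(s)) => {
                lines.push((s, y));
                start = None;
            }
            _ => {}
        }
    }
    // A line that runs to the bottom edge of the image.
    if let Some(s) = start {
        lines.push((s, img.height));
    }
    lines
}
```

In this sketch, a caller would first run to_black_on_white on the decoded subtitle image, then crop each row range returned by split_lines and recognize the crops one at a time. A real pipeline would likely also add padding around each crop and tolerate small gaps within a line.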

For more on improving the accuracy of Tesseract output, see the official Tesseract documentation.

Miscellaneous Notes

From my understanding, the chi_sim and chi_tra Tesseract models work on both simplified and traditional Chinese text, but automatically convert said text to their respective forms.
