Caption processing and tagging for YouTube videos with Japanese captions.
- Get video metadata and captions
- Extract captions and model them as a data structure
- Segment the Japanese text
- Compute frequency of words
- Compare kanji against 常用漢字 (Jōyō kanji)
- Visualize frequency graph
- Visualize other interesting data
- Rank words/kanji by JLPT level (tentative; reliable level data is hard to find)
- Rank entire video captions based on frequency and difficulty
- Store the output in a reasonable way
- Decide whether to make this a package or add a REST API
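
The frequency and Jōyō comparison steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names (`kanji_frequency`, `split_by_joyo`, `word_frequency`) are hypothetical, `JOYO_SAMPLE` is a five-character placeholder rather than the real 2,136-character list, kanji are detected only in the basic CJK Unified Ideographs block, and word counting assumes the text has already been segmented by a tokenizer.

```python
import re
from collections import Counter

# Matches characters in the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
# Assumption: extension blocks and rare variants are ignored for simplicity.
KANJI_RE = re.compile(r"[\u4e00-\u9fff]")

# Placeholder subset for illustration only. A real run would load the full
# Jōyō list from a data file.
JOYO_SAMPLE = frozenset("日本語学校")


def kanji_frequency(text):
    """Count how often each kanji occurs in the caption text."""
    return Counter(KANJI_RE.findall(text))


def split_by_joyo(freq, joyo=JOYO_SAMPLE):
    """Split a kanji frequency table into (in the set, not in the set)."""
    inside = {k: n for k, n in freq.items() if k in joyo}
    outside = {k: n for k, n in freq.items() if k not in joyo}
    return inside, outside


def word_frequency(tokens):
    """Count words, given tokens already produced by a segmenter."""
    return Counter(tokens)


if __name__ == "__main__":
    caption = "日本語の学校で日本語を学ぶ"
    freq = kanji_frequency(caption)
    print(freq.most_common())
    print(split_by_joyo(freq))
```

Keeping the kanji set as a parameter means the same split can later be reused for other reference lists, such as per-level JLPT sets, if usable data turns up.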