Kinopoisk Review Analysis

KPRA is an experimental service that condenses lengthy film reviews so that a film's essence can be grasped more quickly, and that answers two questions:

Development Process Outline

Find the location (in HTML) of the numbers that indicate how many positive/negative/neutral reviews there are, and extract them.
Based on the above numbers, calculate the maximum number of reviews to display per page (in order to minimize the number of pages to download and parse).
Write a script to download all the reviews (and, perhaps, store them temporarily on local disk).
Reduce all words to their base (dictionary) form before computing tf–idf, for example with Yandex's mystem.
Compute the tf–idf statistic on the obtained data. Presumably the better approach is to treat the data as follows: each review is a document, and all reviews of a given mood form a collection, or corpus. Note, however, that since it is important to know a word's weight for a particular mood, it probably also makes sense to treat the whole set of reviews (regardless of mood) as a single corpus.
Identify collocations using the t-test, chi-square test, MLE, MI/PMI, etc. As above, it may be necessary to work with the base word forms. After obtaining the various metrics, select the most appropriate collocations (based on some factors?).
Develop a simple GUI for a web service.
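The two corpus choices discussed in the tf–idf step can be compared with a minimal sketch like the one below. It assumes the reviews are already lemmatized and tokenized; the method names (`tf`, `idf`, `tf_idf`), the `reviews_by_mood` hash, and the sample tokens are hypothetical and are not taken from the project's `tfidf.rb`. It uses the plain `tf * log(N / df)` variant, which is only one of several common weightings.

```ruby
# Term frequency: share of each term within one document (an array of tokens).
def tf(tokens)
  total = tokens.size.to_f
  tokens.tally.transform_values { |count| count / total }
end

# Inverse document frequency over a corpus (an array of token arrays).
def idf(documents)
  doc_freq = Hash.new(0)
  documents.each { |tokens| tokens.uniq.each { |term| doc_freq[term] += 1 } }
  doc_freq.transform_values { |df| Math.log(documents.size.to_f / df) }
end

# tf-idf weights for every document in the corpus: one Hash per document.
def tf_idf(documents)
  weights = idf(documents)
  documents.map do |tokens|
    tf(tokens).to_h { |term, freq| [term, freq * weights[term]] }
  end
end

# Hypothetical, already lemmatized sample data (not real reviews).
reviews_by_mood = {
  good: [%w[great plot great cast], %w[great music]],
  bad:  [%w[boring plot], %w[weak cast boring]]
}

# Option 1: each mood is its own corpus.
per_mood = reviews_by_mood.transform_values { |docs| tf_idf(docs) }

# Option 2: all reviews, regardless of mood, form a single corpus.
overall = tf_idf(reviews_by_mood.values.flatten(1))
```

In this toy data, "great" occurs in every positive review, so its weight inside the positive-only corpus is zero, while in the combined corpus it keeps a positive weight. This illustrates why the combined corpus may be the more useful one for finding mood-specific words.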

KPRA should function as a standalone web service that lets users quickly look up information about a film.
KPRA should somehow retain already collected information to speed up the processing of later requests.
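One possible way to retain collected information is a small file-based cache keyed by film ID, sketched below. The `ResultCache` class, its `fetch` method, and the one-JSON-file-per-film layout are assumptions made for illustration and do not describe how KPRA actually stores data.

```ruby
require "json"
require "fileutils"

# Hypothetical cache: one JSON file per film ID inside a cache directory.
class ResultCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the stored result for film_id. On a cache miss, runs the block
  # (the expensive download and analysis) once and stores its result.
  def fetch(film_id)
    path = File.join(@dir, "#{film_id}.json")
    File.write(path, JSON.generate(yield)) unless File.exist?(path)
    JSON.parse(File.read(path))
  end
end

# Usage (analyze_film is a hypothetical method):
#   cache = ResultCache.new("cache")
#   result = cache.fetch(film_id) { analyze_film(film_id) }
```

Because results always pass through JSON, hash keys come back as strings on both the first and later calls. A real service would also need some expiry policy, since new reviews keep appearing.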
