cinemaanalytics

JFuture 2019 Cinema Analytics System Challenge

Challenge

Imagine that you're a developer working on Cinema Analytics System, who needs to collect data and make calculations on the following items:

Display the dynamics of film releases in China and the United States over the past 3 years. Please take into account 5 different film genres.

Example: Fiction, 2017: China — 23 films; USA — 42. Fiction, 2018: China — 10 films; USA — 22.
Make the list of top 5 Directors of the highest-rated films.

Assumptions

The data is collected for 2019, 2018 and 2017 years, as 2019 is almost over. Due to automatic data parsing and loading it is easy to add support for 2016 year instead of 2019 by using the corresponding link to the Wikipedia page.
When setting the gross (in USD) value for the film the mapping is done by the film's title, so there is the potential chance of duplicate films with the same title per year per country.
The actual output format for the "top 5 Directors of the highest-rated films" has not been explicitly specified, so it is set to "Director (Country) / Title (month day, year) / Gross (in USD) / [Genres]".
"Top 5 Directors of the highest-rated films" query is run on top of the whole data set, not per country.
Criteria for choosing the highest-rated films is based on the gross (in USD) value of the film.

Information gathering and calculation

Data for the films is taken from Wikipedia pages using Wikipedia REST API.
For each of the page the corresponding GET /page/title/{title} has been called manually to locate revision value (rev in the response), so the data is "frozen" and parsing can rely on known page HTML format.
Each page is then programmatically fetched using Wikipedia REST API and parsed for the corresponding data, both films and rankings (gross value in USD). HTML parsing is done using jsoup Java library.
Data is "flushed" (see configuration parameter below) using analytics.action=flush option when running into the local JSON data files. Which are then used for the analytics queries, unless analytics.mode=online when local data is ignored, and the data is fetched from Wikipedia.
There are subtle differences in both China and USA Wikipedia films pages, so there are different parsers for each of the countries.
In addition, there are subtle differences in parsing current 2019 year data, and passed years, like 2018 and 2017, this is also reflected in the parser implementations.

Here is the list of the Wikipedia pages used to fetch and parse the data (these are also controlled by china.sources and usa.sources configuration parameters described below):

Configuration

The application supports a few configuration parameters:

parameter	values	default	description
`analytics.action`	`flush` - (it then ignores the `analytics.mode` property, i.e. it is always run in online mode)	`run`	action to run
	`run` - (run actual analytics)
`analytics.mode`	`online` - (fetches the data from the Wikipedia REST API)	`offline`	either read local data or remote
	`offline` - (reads the data from the locally stored JSON data files)
`analytics.genres`	list of genres (from `Genre`enum separated by comma)	`COMEDY,DRAMA,SCI_FI,HISTORY,FANTASY`	genres to use in the dynamics of film releases analytics
`analytics.topdirectors`	number	5	number of top directors
`useragent.email`	email of user running the analytics in the `online` mode	`[email protected]`	Wikipedia REST API requires to set the contact information in the `User-Agent` header
`china.sources`	list of Wikipedia pages to retrieve data from	see `application.properties`	list of china films pages in Wikipedia
`usa.sources`	list of Wikipedia pages to retrieve data from	see `application.properties`	list of usa films pages in Wikipedia

Note: if you are using analytics.mode=online be sure you change the value of the useragent.email unless you want to use the default value!

To set the parameter on run either (in the priority order):

set JVM system property by adding -D for the running command, ex.: gradle -Danalytics.action=flush run or gradle -Danalytics.mode=online -Danalytics.genres=DRAMA,COMEDY,NATURE run, etc.
OS specific environment variable (in that case upper case the property name and replace . with _, i.e. analytics.action becomes ANALYTICS_ACTION
change the value directly in the application.properties

Running

gradle -q run

It assumes you have Gradle installed. Otherwise it is easy to install it with SDKMAN: sdk install gradle.

Output

Comedy, 2017: China - 99 films; USA - 81. Comedy, 2018: China - 100 films; USA - 68. Comedy, 2019: China - 29 films; USA - 75.
Drama, 2017: China - 146 films; USA - 109. Drama, 2018: China - 169 films; USA - 117. Drama, 2019: China - 51 films; USA - 102.
Sci-Fi, 2017: China - 5 films; USA - 30. Sci-Fi, 2018: China - 5 films; USA - 25. Sci-Fi, 2019: China - 4 films; USA - 21.
History, 2017: China - 12 films; USA - 3. History, 2018: China - 4 films; USA - 4. History, 2019: China - 3 films; USA - 1.
Fantasy, 2017: China - 34 films; USA - 19. Fantasy, 2018: China - 32 films; USA - 20. Fantasy, 2019: China - 10 films; USA - 20.
------------------------------
Anthony and Joe Russo (USA) / Avengers: Endgame (APRIL 26, 2019) / $858,373,000 / [Action, Adventure, Drama, Epic, Fantasy, Sci-Fi, Superhero]
Wu Jing (CHINA) / Wolf Warriors 2 (JULY 27, 2017) / $854,248,869 / [Action]
Yu Yang (CHINA) / Nezha (JULY 26, 2019) / $710,400,000 / [Animation, Comedy, Drama, Fantasy]
Ryan Coogler (USA) / Black Panther (FEBRUARY 16, 2018) / $700,059,566 / [Action, Adventure, Superhero]
Anthony Russo and Joe Russo (USA) / Avengers: Infinity War (APRIL 27, 2018) / $678,815,482 / [Action, Adventure, Drama, Epic, Fantasy, Sci-Fi, Superhero]

License

MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
gradle/wrapper		gradle/wrapper
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
build.gradle		build.gradle
gradle.properties		gradle.properties
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cinemaanalytics

Challenge

Assumptions

Information gathering and calculation

Configuration

Running

Output

License

About

Releases

Packages

Languages

License

zshamrock/cinemaanalytics

Folders and files

Latest commit

History

Repository files navigation

cinemaanalytics

Challenge

Assumptions

Information gathering and calculation

Configuration

Running

Output

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages