GitHub - alejandrojdf/Kafka-examples

The project has the following structure:

App: the backbone of the project, where all the DataFrame transformations are performed.
SparkResources: contains the basic configuration for the Spark session and some methods related to DataFrame processing. It also contains some unused methods (left unused due to lack of time and environment limitations) that can dynamically calculate the Spark resources needed.
HDFSServices: contains a small method for reading the properties from HDFS, in case that is required.

ETL Steps:

First, all the source files are read.
Then we get the top 20 movies with a minimum of 50 votes, with the ranking determined by: (numVotes/averageNumberOfVotes) * averageRating
After that, we perform a series of joins, aggregations and other transformations on the data in order to list the persons who are most often credited and the different titles of the 20 movies.
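
The ranking step can be illustrated with a minimal sketch. It uses plain Scala collections rather than Spark DataFrames so that it runs without a Spark session, and it is not the code from App. The `Rating` type and the `Ranking.topTitles` helper are hypothetical names, and the sketch assumes that averageNumberOfVotes is computed over the titles that pass the minimum-vote filter; the project may compute the average over a different population.

```scala
// Hypothetical record type; the field names follow the columns used in the formula.
final case class Rating(tconst: String, averageRating: Double, numVotes: Long)

object Ranking {
  // Keeps titles with at least `minVotes` votes, scores each one as
  // (numVotes / averageNumberOfVotes) * averageRating, and returns the
  // `n` highest-scoring titles together with their scores.
  // Assumption: the average is taken over the titles that pass the vote filter.
  def topTitles(ratings: Seq[Rating], minVotes: Long = 50, n: Int = 20): Seq[(Rating, Double)] = {
    val eligible = ratings.filter(_.numVotes >= minVotes)
    if (eligible.isEmpty) Seq.empty
    else {
      val averageNumberOfVotes = eligible.map(_.numVotes.toDouble).sum / eligible.size
      eligible
        .map(r => (r, (r.numVotes / averageNumberOfVotes) * r.averageRating))
        .sortBy { case (_, score) => -score } // highest score first
        .take(n)
    }
  }
}
```

Filtering on the vote threshold before computing the average and the score keeps the expensive steps on the smaller data set, in line with the filter-first approach described in the final notes.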

Resources:

The project contains a project.properties file in the resources folder, which holds the paths to all the needed sources, so they do not need to be hardcoded.

If the source file paths change more frequently, the properties file can be placed outside the project, so there is no need to package and redeploy the software.

The file can also be configured to be read from HDFS, to avoid path problems when running in YARN cluster mode.
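
As a rough sketch of the first two options, the properties can be loaded either from the copy bundled in the resources folder or from a file outside the jar. The `ProjectProperties` object and its `load` method are hypothetical names used for illustration, not the project's actual code, and the HDFS variant is not shown.

```scala
import java.io.{FileInputStream, InputStream}
import java.util.Properties

object ProjectProperties {
  // Loads project.properties from an external file when a path is given,
  // otherwise from the copy packaged in the resources folder (classpath root).
  def load(externalPath: Option[String] = None): Properties = {
    val in: InputStream = externalPath match {
      case Some(path) => new FileInputStream(path)
      case None =>
        Option(getClass.getResourceAsStream("/project.properties"))
          .getOrElse(throw new IllegalStateException("project.properties not found on the classpath"))
    }
    val props = new Properties()
    try props.load(in) finally in.close() // always release the stream
    props
  }
}
```

With an external file, a changed source path only requires editing that file and passing its location to `load`, for example through a program argument; the jar itself stays untouched.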

Running the project:

First, git clone the project and wait for the Maven dependencies to be indexed and downloaded. All dependencies are public.
After this, package the jar file using Maven.
In order to run the jar file:
- Local mode:
spark-submit --master local[*] --deploy-mode client --class com.bgc.scala.App C:\Users\aleja\Downloads\bgcExample\target\bgcExample-1.0-SNAPSHOT.jar
- Cluster mode (Spark standalone mode):
spark-submit --master spark://IP:PORT --num-executors=2 --class com.bgc.scala.App C:\Users\aleja\Downloads\bgcExample\target\bgcExample-1.0-SNAPSHOT.jar

Testing: Two tests were performed, one in local mode and one in cluster mode. For the second test, a standalone local cluster was created with one master and 3 executors. Both tests were successful and produced the following result:

tconst	primaryNames	titleslist
tt0111161	[Stephen King, Bo...	[The Shawshank Re...
tt0468569	[Lorne Orleans, M...	[The Dark Knight,...
tt1375666	[Hans Zimmer, Chr...	[Inception, Incep...
tt0137523	[Ross Grayson Bel...	[Fight Club, Figh...
tt0110912	[Bruce Willis, La...	[Pulp Fiction, Pu...
tt0109830	[Gary Sinise, Rob...	[Forrest Gump, Fo...
tt0944947	[Kit Harington, E...	[Game of Thrones,...
tt0133093	[Carrie-Anne Moss...	[The Matrix, The ...
tt0120737	[J.R.R. Tolkien, ...	[The Lord of the ...
tt0167260	[Fran Walsh, Pete...	[The Lord of the ...
tt0068646	[Mario Puzo, Fran...	[The Godfather, T...
tt1345836	[Gary Oldman, Ann...	[The Dark Knight ...
tt0167261	[J.R.R. Tolkien, ...	[The Lord of the ...
tt0816692	[Emma Thomas, Jes...	[Interstellar, In...
tt0114369	[Richard Francis-...	[Se7en, Se7en]
tt0903747	[Steven Michael Q...	[Breaking Bad, Br...
tt1853728	[Kerry Washington...	[Django Unchained...
tt0172495	[Russell Crowe, R...	[Gladiator, Gladi...
tt0372784	[Christopher Nola...	[Batman Begins, B...
tt0848228	[Robert Downey Jr...	[The Avengers, Th...

Screenshots of the test performed in cluster mode are available here: https://drive.google.com/open?id=1BuAYaw98siOXGY59rOqJn80JThaAUBrK

Final notes:

JDK 1.8.0_191 was used.
Although in my daily work I always focus on the details and try to be as strict as possible about good practices, the lack of time made it unfeasible to write unit and integration tests where possible, and some inconsistencies in naming conventions, etc. may remain. All the effort was focused on producing the expected results.
No spark.sql(...) statements were used, as required.
All the DataFrames were filtered before the join or aggregation steps to improve performance and reduce shuffling.
The result of the process is written to standard output, as no output target was specified (such as an HDFS CSV file, a Hive table or a NoSQL table).

