
WIP: movie_json_data analysis #2

Open
dayanajoseph3091 wants to merge 22 commits into FEND16:master from dayanajoseph3091:master

Conversation

dayanajoseph3091 commented on Dec 15, 2020

This is a work-in-progress pull request.

  • extract and transform the JSON into a pandas DataFrame
  • load the transformed data into an MS SQL DB
  • added the DB schema
  • added teardown for repeat runs
  • modeling and reporting
  • document insights
  • project cleanup and addressing the remaining review comments

code/__main__.py Outdated
import load_db as ldb

if __name__ == '__main__':
    # execute only if run as the entry point into the program
    ldb.main()
(No newline at end of file)
Reviewer:
Missing end of line.

Author:

added end of line.

# Load JSON
#x=pd.DataFrame()
def load():
    with open("../json/top-rated-movies-02.json") as f:
Reviewer:
I wouldn't hardcode names; rather, I would pass the file name as a parameter.

Author:

Moved the file name to dev.ini.
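Reading the file name from dev.ini could look like the sketch below. The `[files]` section and `json_path` key are assumptions of mine; the actual dev.ini in the PR may use different names.

```python
import configparser
import json

def load(config_path="dev.ini"):
    # Read the JSON file name from the config instead of hardcoding it.
    # Section/key names here are illustrative, not the PR's actual ones.
    parser = configparser.ConfigParser()
    parser.read(config_path)
    json_path = parser.get("files", "json_path")
    with open(json_path) as f:
        return json.load(f)
```

Passing `config_path` as a parameter also keeps the function testable with a throwaway config file.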

#dataframe
# type conversion
dataframe['genres'] = dataframe['genres'].astype('str').apply(
    lambda x: x.lower().strip().replace("[", "").replace("]", "").replace("\'", "").replace("\"", "").replace(", ",
Reviewer:
Please keep lines under 80 characters.

Author:

updated accordingly
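One way to keep that chain of `.replace()` calls under 80 columns is to factor it into a small helper. The name `clean_list_string` and the regex below are illustrative sketches of mine, not the PR's actual code (the original chain also had a truncated trailing `.replace(", ", …)` whose target isn't visible here):

```python
import re

def clean_list_string(value: str) -> str:
    # Strip list punctuation and quotes from a stringified Python list,
    # e.g. "['Action', 'Drama']" -> "action, drama". One regex replaces
    # the long chain of .replace() calls and stays well under 80 chars.
    return re.sub(r"[\[\]'\"]", "", value).lower().strip()
```

Usage would then be a single short line: `dataframe['genres'] = dataframe['genres'].astype(str).apply(clean_list_string)`.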

print(dataframe)
#dataframe
# type conversion
dataframe['genres'] = dataframe['genres'].astype('str').apply(
Reviewer:
Why do we need this?

Author:

added relevant comments to all transformations.

    lambda x: x.strip().replace("PT", "").replace("M", "")).astype(int)
dataframe['imdbRating'] = dataframe['imdbRating'].astype('float')
dataframe['actors'] = dataframe['actors'].astype('str').apply(
    lambda x: x.lower().strip().replace("[", "").replace("]", "").replace("\'", "").replace("\"", "").replace(", ",
Reviewer:
What is the purpose of this transformation?

Author:

Extracts the relevant duration: PT89M --> 89.
Handles names like Genelia D'Souza, which were causing string-handling issues.
Added relevant comments to all transformations.
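A regex-based version of that duration transformation might look like the sketch below; the helper name is mine, not the PR's. Compared to chained `.replace("PT", "").replace("M", "")` calls, it fails loudly on input that isn't of the `PT<minutes>M` form instead of silently producing garbage:

```python
import re

def duration_minutes(value: str) -> int:
    # Convert an ISO-8601-style duration such as "PT89M" to 89.
    match = re.fullmatch(r"\s*PT(\d+)M\s*", value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    return int(match.group(1))
```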

code/load_db.py Outdated

# DB_Connection
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};TrustServerCertificate=No;DATABASE=Movies_DB;WSID=LAPTOP-BLDSMT2E;APP={Microsoft® Windows® Operating System};Trusted_Connection=Yes;SERVER=(localdb)\MSSQLLocalDB;Description=movies')
Reviewer:
Why do you hardcode the DB connection string?

Author:

Moved the configuration to dev.ini.
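Assembling the pyodbc connection string from dev.ini could look like this sketch. The `[db]` section and the `driver`/`server`/`database` keys are assumptions; they would need to match the actual dev.ini:

```python
import configparser

def build_connection_string(config_path="dev.ini") -> str:
    # Build the pyodbc connection string from config values instead of
    # hardcoding it. Key names here are illustrative assumptions.
    parser = configparser.ConfigParser()
    parser.read(config_path)
    db = parser["db"]
    return (
        f"DRIVER={{{db['driver']}}};"
        f"SERVER={db['server']};"
        f"DATABASE={db['database']};"
        "Trusted_Connection=Yes;"
    )

# conn = pyodbc.connect(build_connection_string())
```

This also keeps machine-specific values like `WSID` out of version control.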

code/load_db.py Outdated
# create the connection cursor
cursor = conn.cursor()
# Create Tables
cursor.execute('\n'
Reviewer:
Please use multiline string literals.

Author:

done.
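The multiline-literal suggestion amounts to something like the sketch below; the `movies` table and its columns are illustrative, not the PR's actual schema:

```python
# A triple-quoted literal keeps the DDL readable, compared to
# concatenating '\n'-joined fragments inside cursor.execute().
CREATE_MOVIES_TABLE = """
CREATE TABLE movies (
    id          INT PRIMARY KEY,
    title       NVARCHAR(255) NOT NULL,
    imdb_rating FLOAT,
    duration    INT
);
"""

# cursor.execute(CREATE_MOVIES_TABLE)
```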

@@ -0,0 +1,310 @@
/*
Reviewer:
How did you generate this code?

Author:

From the Visual Studio 2019 SQL Server Object Explorer, the steps below can be used to replicate this for a database:

  • Extract a Data-tier Application to generate the SQL schema and save it locally (the extension will be .dacpac).
  • Right-click Databases and select the Publish Data-tier Application option.
  • Browse to and choose the SQL schema (.dacpac) file generated in the first step, and choose Generate Script.

ababo left a comment:
Some thoughts and suggestions.

  1. You include an SQL file that is generated by external tooling. Generally, projects should be self-contained, i.e. capable of generating files from source files included in the project itself.
  2. You include several binary files, and it's not clear how they are generated. A much better way would be to include the source files used to generate the binaries instead of the resulting binaries.
  3. Your "release year and genre shares of movies" graph is totally unreadable. I would prefer a series of pie charts instead.
  4. See https://stephenfollows.com/are-movies-getting-longer/, a document that shows (among other things) a correlation between release year and average movie duration; this data contradicts your linear regression graph. Do you have an explanation?

def get_SQLCONFIG():
    parser = load_config()
    # Read corresponding file parameters
    _driver = parser.get("db", "driver")
Reviewer:
Why do you prepend local variables with underscore?

Author:

updated

code/load_db.py Outdated

# Inserting data in SQL Table:-
for index, row in dataframe.iterrows():
    cursor.execute(
Reviewer:
I guess you compile the SQL query on each iteration here. Maybe it's cached internally, but typically it makes sense to create pre-compiled queries.

Author:

updated
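The "bulk executing database commands by pre-compiling them" change could look like the sketch below: one parameterized statement executed over all rows with `executemany`, instead of `cursor.execute()` inside a per-row loop. The table and column names are illustrative assumptions, not the PR's actual schema:

```python
import pandas as pd

# Hypothetical table/column names for illustration only.
INSERT_MOVIE = (
    "INSERT INTO movies (id, title, imdb_rating, duration) "
    "VALUES (?, ?, ?, ?)"
)

def insert_movies(cursor, dataframe: pd.DataFrame) -> None:
    # Build all parameter tuples up front, then send one parameterized
    # statement for every row at once.
    rows = list(
        dataframe[["id", "title", "imdbRating", "duration"]]
        .itertuples(index=False, name=None)
    )
    cursor.fast_executemany = True  # pyodbc-specific bulk speedup
    cursor.executemany(INSERT_MOVIE, rows)
```

With pyodbc, `fast_executemany = True` additionally batches the parameter arrays on the wire rather than round-tripping per row.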

dayanajoseph3091 and others added 2 commits on December 21, 2020:

  • bulk executing database commands by pre-compiling them
  • updating variable names
dayanajoseph3091 (Author) commented:

Replying to the review above ("Some thoughts and suggestions", items 1-4):
  1. The SQL file was added just in case you wanted to run it in MSSQL independently. The Python project creates the tables on every run and inserts the necessary data; the database name is taken from the config file.
  2. The dacpac and Tableau files are the binary files. Tableau files need the Tableau software to open them; I had assumed you had access to Tableau. PDF and PowerPoint formats give only the visuals and hamper convenience. Regarding the dacpac file for the schema, my understanding was that it was the file you were looking for when you asked for the schema.
  3. The data spans 1930 to 2017. A pie chart for such a long duration would be too confusing for the end user and require too much scrolling. I can go over the release-year and genre shares with you in a Zoom call; some of the readability issues are due to the non-Tableau formats.
  4. The test data had only a few movies per year, making the sample set small enough that a single movie determines the result for a year (e.g., in 2012 the value is skewed by one movie). Hence it can only be used for tests, not for extensive analysis.

I believe that for items 3 and 4 it is better we connect to go over the implications of the graphs and their premises.
