
WIP: movie_json_data analysis #2

Open
dayanajoseph3091 wants to merge 22 commits into FEND16:master from dayanajoseph3091:master

Conversation

dayanajoseph3091 commented on Dec 15, 2020

This is a work-in-progress pull request.

  • extract and transform the JSON into a pandas DataFrame
  • load the transformed data into an MS SQL DB
  • added the DB schema
  • added teardown for repeat runs
  • modeling and reporting
  • document insights
  • project cleanup and addressing the remaining review comments

code/__main__.py Outdated
import load_db as ldb

if __name__ == '__main__':
    # execute only if run as the entry point into the program
    ldb.main()
(No newline at end of file)
Reviewer:
Missing end of line.

Author:

added end of line.

# Load JSON
#x=pd.DataFrame()
def load():
    with open("../json/top-rated-movies-02.json") as f:
Reviewer:
I wouldn't hardcode names; rather, I would pass the file name as a parameter.

Author:

Moved the file name to dev.ini.
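Reading the file name from dev.ini could look like the sketch below. The `[files]` section and `json_path` key are assumptions of mine; the actual dev.ini in the PR may use different names.

```python
import configparser
import json

def load(config_path="dev.ini"):
    # Read the JSON file name from the config instead of hardcoding it.
    # Section/key names here are illustrative, not the PR's actual ones.
    parser = configparser.ConfigParser()
    parser.read(config_path)
    json_path = parser.get("files", "json_path")
    with open(json_path) as f:
        return json.load(f)
```

Passing `config_path` as a parameter also keeps the function testable with a throwaway config file.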

#dataframe
# type conversion
dataframe['genres'] = dataframe['genres'].astype('str').apply(
    lambda x: x.lower().strip().replace("[", "").replace("]", "").replace("\'", "").replace("\"", "").replace(", ",
Reviewer:
Please keep lines under 80 characters.

Author:

updated accordingly
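One way to keep that chain of `.replace()` calls under 80 columns is to factor it into a small helper. The name `clean_list_string` and the regex below are illustrative sketches of mine, not the PR's actual code (the original chain also had a truncated trailing `.replace(", ", …)` whose target isn't visible here):

```python
import re

def clean_list_string(value: str) -> str:
    # Strip list punctuation and quotes from a stringified Python list,
    # e.g. "['Action', 'Drama']" -> "action, drama". One regex replaces
    # the long chain of .replace() calls and stays well under 80 chars.
    return re.sub(r"[\[\]'\"]", "", value).lower().strip()
```

Usage would then be a single short line: `dataframe['genres'] = dataframe['genres'].astype(str).apply(clean_list_string)`.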

print(dataframe)
#dataframe
# type conversion
dataframe['genres'] = dataframe['genres'].astype('str').apply(
Reviewer:
Why do we need this?

Author:

added relevant comments to all transformations.

    lambda x: x.strip().replace("PT", "").replace("M", "")).astype(int)
dataframe['imdbRating'] = dataframe['imdbRating'].astype('float')
dataframe['actors'] = dataframe['actors'].astype('str').apply(
    lambda x: x.lower().strip().replace("[", "").replace("]", "").replace("\'", "").replace("\"", "").replace(", ",
Reviewer:
What is the purpose of this transformation?

Author:

Extracts the relevant duration: PT89M --> 89.
Handles names like Genelia D'Souza, which were causing string-handling issues.
Added relevant comments to all transformations.
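A regex-based version of that duration transformation might look like the sketch below; the helper name is mine, not the PR's. Compared to chained `.replace("PT", "").replace("M", "")` calls, it fails loudly on input that isn't of the `PT<minutes>M` form instead of silently producing garbage:

```python
import re

def duration_minutes(value: str) -> int:
    # Convert an ISO-8601-style duration such as "PT89M" to 89.
    match = re.fullmatch(r"\s*PT(\d+)M\s*", value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    return int(match.group(1))
```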

code/load_db.py Outdated

# DB_Connection
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};TrustServerCertificate=No;DATABASE=Movies_DB;WSID=LAPTOP-BLDSMT2E;APP={Microsoft® Windows® Operating System};Trusted_Connection=Yes;SERVER=(localdb)\MSSQLLocalDB;Description=movies')
Reviewer:
Why do you hardcode the DB connection string?

Author:

Moved the configuration to dev.ini.
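Assembling the pyodbc connection string from dev.ini could look like this sketch. The `[db]` section and the `driver`/`server`/`database` keys are assumptions; they would need to match the actual dev.ini:

```python
import configparser

def build_connection_string(config_path="dev.ini") -> str:
    # Build the pyodbc connection string from config values instead of
    # hardcoding it. Key names here are illustrative assumptions.
    parser = configparser.ConfigParser()
    parser.read(config_path)
    db = parser["db"]
    return (
        f"DRIVER={{{db['driver']}}};"
        f"SERVER={db['server']};"
        f"DATABASE={db['database']};"
        "Trusted_Connection=Yes;"
    )

# conn = pyodbc.connect(build_connection_string())
```

This also keeps machine-specific values like `WSID` out of version control.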

code/load_db.py Outdated
# create the connection cursor
cursor = conn.cursor()
# Create Tables
cursor.execute('\n'
Reviewer:
Please use multiline string literals.

Author:

done.
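The multiline-literal suggestion amounts to something like the sketch below; the `movies` table and its columns are illustrative, not the PR's actual schema:

```python
# A triple-quoted literal keeps the DDL readable, compared to
# concatenating '\n'-joined fragments inside cursor.execute().
CREATE_MOVIES_TABLE = """
CREATE TABLE movies (
    id          INT PRIMARY KEY,
    title       NVARCHAR(255) NOT NULL,
    imdb_rating FLOAT,
    duration    INT
);
"""

# cursor.execute(CREATE_MOVIES_TABLE)
```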

@@ -0,0 +1,310 @@
/*
Reviewer:
How did you generate this code?

Author:

From the Visual Studio 2019 SQL Server Object Explorer, the steps below can be used to replicate this for a database:

  • Extract a Data-tier Application to generate the SQL schema and save it locally (the extension will be .dacpac).
  • Right-click Databases and select the Publish Data-tier Application option.
  • Browse to and choose the SQL schema (.dacpac) file generated in the first step, and choose Generate Script.

ababo left a comment:
Some thoughts and suggestions.

  1. You include an SQL file that is generated by external tooling. Generally, projects should be self-contained, i.e. capable of generating files from source files included in the project itself.
  2. You include several binary files, and it's not clear how they are generated. A much better way would be to include the source files used to generate the binaries instead of the resulting binaries.
  3. Your "release year and genre shares of movies" graph is totally unreadable. I would prefer a series of pie charts instead.
  4. See https://stephenfollows.com/are-movies-getting-longer/, a document that shows (among other things) a correlation between release year and average movie duration; this data contradicts your linear regression graph. Do you have an explanation?

def get_SQLCONFIG():
    parser = load_config()
    # Read corresponding file parameters
    _driver = parser.get("db", "driver")
Reviewer:
Why do you prepend local variables with underscore?

Author:

updated

code/load_db.py Outdated

# Inserting data in SQL Table:-
for index, row in dataframe.iterrows():
    cursor.execute(
Reviewer:
I guess you compile the SQL query on each iteration here. Maybe it's cached internally, but typically it makes sense to create pre-compiled queries.

Author:

updated
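The "bulk executing database commands by pre-compiling them" change could look like the sketch below: one parameterized statement executed over all rows with `executemany`, instead of `cursor.execute()` inside a per-row loop. The table and column names are illustrative assumptions, not the PR's actual schema:

```python
import pandas as pd

# Hypothetical table/column names for illustration only.
INSERT_MOVIE = (
    "INSERT INTO movies (id, title, imdb_rating, duration) "
    "VALUES (?, ?, ?, ?)"
)

def insert_movies(cursor, dataframe: pd.DataFrame) -> None:
    # Build all parameter tuples up front, then send one parameterized
    # statement for every row at once.
    rows = list(
        dataframe[["id", "title", "imdbRating", "duration"]]
        .itertuples(index=False, name=None)
    )
    cursor.fast_executemany = True  # pyodbc-specific bulk speedup
    cursor.executemany(INSERT_MOVIE, rows)
```

With pyodbc, `fast_executemany = True` additionally batches the parameter arrays on the wire rather than round-tripping per row.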

dayanajoseph3091 and others added 2 commits on December 21, 2020:

  • bulk executing database commands by pre-compiling them
  • updating variable names
dayanajoseph3091 (Author) commented:

Replying to the review above ("Some thoughts and suggestions", items 1-4):
  1. The SQL file was added just in case you wanted to run it in MSSQL independently. The Python project creates the tables on every run and inserts the necessary data; the database name is taken from the config file.
  2. The dacpac and Tableau files are the binary files. Tableau files need the Tableau software to open them; I had assumed you had access to Tableau. PDF and PowerPoint formats give only the visuals and hamper convenience. Regarding the dacpac file for the schema, my understanding was that it was the file you were looking for when you asked for the schema.
  3. The data spans 1930 to 2017. A pie chart for such a long duration would be too confusing for the end user and require too much scrolling. I can go over the release-year and genre shares with you in a Zoom call; some of the readability issues are due to the non-Tableau formats.
  4. The test data had only a few movies per year, making the sample set small enough that a single movie determines the result for a year (e.g., in 2012 the value is skewed by one movie). Hence it can only be used for tests, not for extensive analysis.

I believe that for items 3 and 4 it is better we connect to go over the implications of the graphs and their premises.
