and connect MSSQL
ETL
Feature/dayana
remove unused code
code/__main__.py
Outdated
import load_db as ldb
if __name__ == '__main__':
    # execute only if run as the entry point into the program
    ldb.main()
code/extract_transform.py
Outdated
# Load JSON
#x=pd.DataFrame()
def load():
    with open("../json/top-rated-movies-02.json") as f:
I wouldn't hardcode file names; I would rather pass the file name as a parameter.
Moved the file name to dev.ini.
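A minimal sketch of what the suggestion and the reply could look like together, assuming the path is stored in an INI file. The `[json]` section, its `path` key, and the helper name `read_json_path` are hypothetical; the PR only states that the name was moved to dev.ini.

```python
import configparser
import json


def read_json_path(config_file):
    """Return the JSON input path stored in an INI file.

    The [json] section and its "path" key are hypothetical names.
    """
    parser = configparser.ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips missing files, so check the result.
    if not parser.read(config_file):
        raise FileNotFoundError(config_file)
    return parser.get("json", "path")


def load(json_path):
    """Load the movie records from the file passed in by the caller."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)
```

With this shape, the caller decides which file to read, for example `load(read_json_path("dev.ini"))`, and the loader can be pointed at a small fixture file in tests.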
code/extract_transform.py
Outdated
#dataframe
# type conversion
dataframe['genres'] = dataframe['genres'].astype('str').apply(
    lambda x: x.lower().strip().replace("[", "").replace("]", "").replace("\'", "").replace("\"", "").replace(", ",
Please keep lines under 80 characters long.
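One way to get under the limit is to move the chained `replace` calls into a small named helper. The sketch below is an assumption about how that could look, not the code that was eventually committed; `DROP_CHARS` and `clean_list_text` are hypothetical names. The final `.replace(", ", ...)` call is left out because its second argument is cut off in the quoted diff.

```python
# Characters removed from the stringified list: brackets and both quote types.
DROP_CHARS = str.maketrans("", "", "[]'\"")


def clean_list_text(value):
    """Lower-case a stringified list and strip brackets and quotes."""
    return str(value).lower().strip().translate(DROP_CHARS)
```

The helper can then be applied with `dataframe['genres'].apply(clean_list_text)`, which keeps the transformation on one short line and gives it a name that documents its purpose.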
print(dataframe)
#dataframe
# type conversion
dataframe['genres'] = dataframe['genres'].astype('str').apply(
added relevant comments to all transformations.
code/extract_transform.py
Outdated
    lambda x: x.strip().replace("PT", "").replace("M", "")).astype(int)
dataframe['imdbRating'] = dataframe['imdbRating'].astype('float')
dataframe['actors'] = dataframe['actors'].astype('str').apply(
    lambda x: x.lower().strip().replace("[", "").replace("]", "").replace("\'", "").replace("\"", "").replace(", ",
What is the purpose of this transformation?
The duration transformation extracts the number of minutes: PT89M --> 89.
The actors transformation handles names like Genelia D'Souza, which were causing string-handling issues.
Added relevant comments to all transformations.
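The two transformations described in the reply can be sketched as small functions. This is an illustration under assumptions, not the PR's code: it assumes durations always have the form `PT<minutes>M` and that the actors column holds Python lists before it is stringified. The names `duration_minutes` and `join_names` are hypothetical. Note that joining the list keeps the apostrophe in a name, whereas the quoted code strips quote characters.

```python
import re

# Matches durations such as "PT89M" and captures the minutes.
DURATION_PATTERN = re.compile(r"PT(\d+)M")


def duration_minutes(value):
    """Convert a duration such as 'PT89M' to the integer 89."""
    match = DURATION_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    return int(match.group(1))


def join_names(names):
    """Join a list of names without stringifying the list first."""
    return ", ".join(name.strip().lower() for name in names)
```

Validating the format up front means an unexpected value such as an hours component raises an error instead of failing later inside `astype(int)`.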
code/load_db.py
Outdated
# DB_Connection
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};TrustServerCertificate=No;DATABASE=Movies_DB;WSID=LAPTOP-BLDSMT2E;APP={Microsoft® Windows® Operating System};Trusted_Connection=Yes;SERVER=(localdb)\MSSQLLocalDB;Description=movies')
Why do you hardcode the DB connection string?
moved configuration to dev.ini
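A rough sketch of building the connection string from configuration rather than a literal. The `db` section and `driver` key match the `parser.get("db", "driver")` call quoted later in this review; the `server`, `database`, and `trusted_connection` keys, the `SAMPLE_INI` text, and the function name are assumptions made for illustration.

```python
import configparser

# Stand-in for the contents of dev.ini; the key names besides "driver"
# are hypothetical.
SAMPLE_INI = r"""
[db]
driver = ODBC Driver 17 for SQL Server
server = (localdb)\MSSQLLocalDB
database = Movies_DB
trusted_connection = Yes
"""


def build_connection_string(config_text):
    """Assemble an ODBC connection string from INI-formatted text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    db = parser["db"]
    parts = [
        "DRIVER={" + db["driver"] + "}",
        "SERVER=" + db["server"],
        "DATABASE=" + db["database"],
        "Trusted_Connection=" + db["trusted_connection"],
    ]
    return ";".join(parts)
```

The resulting string can be passed to `pyodbc.connect(...)`, so machine-specific values such as the server name live in the configuration file rather than in the source.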
code/load_db.py
Outdated
# create the connection cursor
cursor = conn.cursor()
# Create Tables
cursor.execute('\n'
@@ -0,0 +1,310 @@
/*
From the Visual Studio 2019 SQL Server Object Explorer, the following steps replicate the database:
- Use Extract Data-tier Application to generate the SQL schema and save it locally (the extension will be .dacpac).
- Right-click Databases and select the Publish Data-tier Application option.
- Browse to the .dacpac schema file generated in the first step and choose Generate Script.
Tableau based modeling and reporting
adding end of file instruction
all instructions intended to under 80 char Added comments
Feature/dayana
added json path in config_handler.py for reading json file path from …
ababo
left a comment
Some thoughts and suggestions.
- You include an SQL file that is generated by external tooling. Generally, projects should be self-contained, i.e. capable of generating such files from source files included in the project itself.
- You include several binary files, and it's not clear how they are generated. It would be much better to include the source files used to generate the binaries instead of the resulting binaries.
- Your "release year and genre shares of movies" graph is totally unreadable. I would prefer a series of pie charts instead.
- See https://stephenfollows.com/are-movies-getting-longer/, which shows a correlation involving average movie durations, among other things; that data contradicts your linear regression graph. Do you have an explanation?
code/config_handler.py
Outdated
def get_SQLCONFIG():
    parser = load_config()
    # Read corresponding file parameters
    _driver = parser.get("db", "driver")
Why do you prepend local variables with underscore?
code/load_db.py
Outdated
# Inserting data in SQL Table:-
for index, row in dataframe.iterrows():
    cursor.execute(
I guess you compile the SQL query on each iteration here. It may be cached internally, but it typically makes sense to create a pre-compiled query.
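To illustrate the suggestion, here is a sketch that defines one parameterized statement and hands all rows to `executemany` instead of calling `execute` inside a loop. It uses the standard-library `sqlite3` module as a stand-in so that it runs without SQL Server; pyodbc cursors offer an analogous `executemany(sql, rows)` call with `?` placeholders. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3

# One parameterized statement, defined once and reused for every row.
INSERT_SQL = "INSERT INTO movies (title, duration, rating) VALUES (?, ?, ?)"


def insert_movies(conn, rows):
    """Insert all rows through a single executemany call."""
    cursor = conn.cursor()
    cursor.executemany(INSERT_SQL, rows)
    conn.commit()


# In-memory database used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, duration INTEGER, rating REAL)")
insert_movies(conn, [("Sample A", 89, 7.5), ("Sample B", 120, 8.1)])
```

With a DataFrame, the rows could come from `dataframe.itertuples(index=False, name=None)`, which yields plain tuples. With pyodbc, setting `cursor.fast_executemany = True` before the call may reduce round trips further, though whether it helps depends on the driver.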
bulk executing database commands by pre compiling them updating variable names
addressing review comments
I believe that for items 3 and 4 it is better that we connect to go over the implications of the graphs and their premises.
It is a work-in-progress pull request.