Sales-data-analysis-using-PySpark

The dataset is sample sales data containing information about orders and products sold. It has 53,130 records and various columns. The code performs the following steps:

1. Reads the 'sales_data_sample_csv' table into a Spark DataFrame.
2. Prints the schema of the DataFrame using the dtypes attribute.
3. Selects the columns 'QUANTITYORDERED', 'PRICEEACH', 'ORDERLINENUMBER', and 'SALES' from the DataFrame and displays the results using the display method.
4. Creates a VectorAssembler object with input columns 'QUANTITYORDERED' and 'PRICEEACH' and output column 'features'.
5. Transforms the DataFrame with the VectorAssembler created in step 4 to add the 'features' column, then selects only the 'features' and 'total_sales' columns.
6. Splits the transformed DataFrame into a training set (90%) and a test set (10%) using the randomSplit method.
7. Displays the test set using the display method.
8. Creates a LinearRegression object with features column 'features' and label column 'total_sales', and fits the model on the training set.
9. Prints the coefficients and intercept of the linear regression model.
10. Calculates and prints the root mean squared error (RMSE) and R-squared (R2) of the model on the training set.
11. Predicts the total_sales values of the test set using the trained linear regression model and displays the predictions, actual total_sales values, and input features using the display method.
12. Calculates and prints the R2 of the predictions using a RegressionEvaluator object.
13. Predicts the total_sales values of the test set using a DecisionTreeRegressor model, then calculates and prints the RMSE of the predictions.
14. Displays the predictions using the show method.
15. Predicts the total_sales values of the test set using a GBTRegressor model and displays the predictions, actual total_sales values, and input features of the first 5 records using the show method.
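
The steps above report RMSE and R2 for the fitted models. As a rough illustration of what those two numbers measure, here is a minimal plain-Python sketch, independent of Spark. The function names `rmse` and `r_squared` and the sample values are hypothetical and are not taken from the notebook or the dataset; the notebook itself obtains these metrics from the Spark model summary and RegressionEvaluator.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """R2 = 1 - SS_res / SS_tot, the share of variance explained by the model."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical total_sales values and predictions, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print("RMSE:", rmse(actual, predicted))
print("R2:", r_squared(actual, predicted))
```

A lower RMSE means predictions are closer to the actual values, in the same units as total_sales, while an R2 closer to 1 means the model explains more of the variation in total_sales.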
