10-Day Hackathon on ML!!
Google Colab
This project needs Python and uses the following libraries:
NumPy
Pandas
Matplotlib
Seaborn
Scikit-learn
SciPy
An introduction to machine learning was given, followed by a discussion of the roadmap.
We set up a private GitHub repository named 'GeeksFiesta_AI-ML_geeky_couple' and invited mentors to join as collaborators.
To clean and process the data, we counted the missing values in each column and generated summary statistics for the dataset.
The majority of datasets are available in .csv format. The CSV file format stores tabular data as plain text. To read a local CSV file into a dataframe, we used the Pandas library's read_csv function.
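A minimal sketch of this step, using an inline CSV string in place of the real file (the column names here are assumptions, not the actual dataset schema):

```python
import pandas as pd
from io import StringIO

# Hypothetical sample mimicking the air-quality CSV (columns are assumptions)
csv_text = """year,month,day,hour,temperature,PM2.5
2015,1,1,0,1.9,12.0
2015,1,1,1,2.1,
2015,1,1,2,2.0,15.0
"""

# read_csv parses plain-text CSV into a DataFrame; for a local file,
# pass the path instead of the StringIO wrapper
df = pd.read_csv(StringIO(csv_text))

print(df.isnull().sum())   # count of missing values in each column
print(df.describe())       # summary statistics for the dataset
```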
Data can be missing for a variety of reasons and must be handled so that it does not impair the performance of our model. Here we used the backfilling method to replace the missing data. The dataset is already sorted by hour, day, month, and year, so filling the missing values with the column mean would increase the deviation, and hence the error. The features change little from one hour to the next: if the temperature is 1.9 at a particular hour, the next hour it will be around 1.9, not 14.356532039734322. So we opted for the backfilling method.
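The backfilling idea can be sketched on a tiny hourly series (values illustrative):

```python
import pandas as pd

# Hourly-sorted temperatures with one missing reading
temps = pd.Series([1.9, None, 2.0, 2.1])

# Backfilling copies the next valid observation into the gap, so the
# filled value stays close to its neighbours in time (unlike the mean)
filled = temps.bfill()
print(filled.tolist())  # [1.9, 2.0, 2.0, 2.1]
```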
Categorical data consists of variables that contain label values rather than numeric values, and the number of possible values is often limited to a fixed set. For example, the wind_direction column cannot be used directly in the mathematical equation of the model, so we need to encode these variables into numbers. To achieve this, we imported the Pandas library and used the get_dummies function.
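A short sketch of one-hot encoding with get_dummies, using made-up wind directions:

```python
import pandas as pd

# wind_direction holds labels, not numbers, so it must be encoded
df = pd.DataFrame({"wind_direction": ["N", "S", "E", "N"]})

# get_dummies creates one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["wind_direction"])
print(encoded.columns.tolist())
```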
High-magnitude features would be given more weight in the distance calculation than low-magnitude features. To avoid this, we used MinMaxScaler from sklearn.preprocessing.
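A minimal example of the scaling step (the correct class name is MinMaxScaler, from sklearn.preprocessing), on two features of very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features with very different magnitudes
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# MinMaxScaler rescales every feature independently to the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)
```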
We identified outliers by creating a boxplot for each column and observing the range and the number of dots, or "too far away" points, i.e. values that fall outside the expected range.
First, we printed the outliers' locations by specifying the required condition. Then we set each detected outlier to null and filled it using the median approach.
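The two steps above can be sketched with the IQR rule (the same criterion a boxplot uses for its "far away" dots), on an illustrative column:

```python
import numpy as np
import pandas as pd

# Illustrative column with one obvious outlier
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# IQR rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(s[mask].index.tolist())  # locations of the detected outliers

# Set the detected outliers to null, then fill with the median
s[mask] = np.nan
s = s.fillna(s.median())
```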
We found the relationship between each column and every other column using the corr() function of the Pandas library, and then plotted the result using the Seaborn library.
We found the relation of PM2.5 with every column using the corr() function of Pandas.
For the correlation analysis, we used the heatmap method from the Seaborn library.
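A minimal sketch of the corr() plus heatmap combination, on a small made-up frame (the real dataset's columns are assumptions here):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive runs
import seaborn as sns

# Small illustrative frame standing in for the air-quality data
df = pd.DataFrame({
    "temperature": [1.9, 2.1, 2.0, 3.5, 4.0],
    "wind_speed": [4.4, 4.7, 5.6, 3.1, 2.0],
    "PM2.5": [12.0, 15.0, 14.0, 30.0, 35.0],
})

corr = df.corr()  # pairwise Pearson correlations between all columns
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
```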
We then plotted a graph of the dataframe using the Pandas library.
We observed that the "year" and "temperature" columns have median > mean, while the "month" and "wind_speed" columns have median < mean.
We then plotted a pairplot with PM2.5 on the x-axis against each column, and observed that PM2.5 is constant across all columns.
We have to build and train a model that predicts air quality, i.e. PM2.5 concentrations.
We split the dataset into a 70:30 ratio for training and testing using the scikit-learn library.
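The split can be sketched with scikit-learn's train_test_split on synthetic arrays standing in for the real features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic feature matrix and target, standing in for the real dataset
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 gives the 70:30 train/test split described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```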
Here, we trained our model by importing RandomForestRegressor from sklearn.ensemble and calling its fit method.
We also trained a model by importing DecisionTreeRegressor from sklearn.tree and calling its fit method.
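A minimal sketch of fitting both regressors on synthetic data (note that DecisionTreeRegressor lives in sklearn.tree, not sklearn.ensemble):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor  # sklearn.tree, not sklearn.ensemble

# Synthetic regression data standing in for the air-quality features
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
y = 10 * X[:, 0] + rng.rand(50)

# fit() trains each model on the training data
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
dt = DecisionTreeRegressor(random_state=0).fit(X, y)
```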
We used R^2 to calculate accuracy and also printed the errors: mean absolute error, mean squared error, and root mean squared error.
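These four metrics can be computed with sklearn.metrics, shown here on illustrative actual and predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Illustrative actual vs. predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

r2 = r2_score(y_true, y_pred)              # R^2, used as the accuracy measure
mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
mse = mean_squared_error(y_true, y_pred)   # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error
print(r2, mae, mse, rmse)
```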
Here, we tested our model with the remaining 30% of the data that we split off at the beginning.
We plotted a graph of the actual values against the predicted values using Matplotlib.
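A sketch of such an actual-vs-predicted plot, with made-up PM2.5 values and a y = x reference line (a common addition, not necessarily in the original plot):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive runs
import matplotlib.pyplot as plt

# Illustrative actual and predicted PM2.5 values
y_test = [12.0, 15.0, 9.0, 20.0]
y_pred = [11.0, 16.0, 10.0, 18.0]

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred, label="predictions")
lims = [min(y_test), max(y_test)]
ax.plot(lims, lims, linestyle="--", label="ideal")  # y = x reference line
ax.set_xlabel("Actual PM2.5")
ax.set_ylabel("Predicted PM2.5")
ax.legend()
```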
Here, we observed that random forest gives higher accuracy (46.35%) than decision tree (30.10%), as shown below. So we opted for random forest regression for our model.
In this step, our random forest model provides PM2.5 concentration predictions for ten alternative inputs from the dataset that we import.
First, we filled the missing values of the given test dataset.
Then we used feature scaling to scale the test dataset.
Finally, using our model, we predicted PM2.5 concentrations and added them to our final dataframe.