In our project, we relied on a traffic prediction dataset to train several models, with the aim of reaching high accuracy and observing which of the models performs better. Kaggle link:
https://www.kaggle.com/datasets/hasibullahaman/traffic-prediction-dataset/data
The traffic prediction dataset we have used contains 5952 records and 9 columns. It records the traffic situation every 15 minutes over a period of 2 months, based on the total number of cars, bikes, buses and trucks observed in each 15-minute interval. The Total column represents the total number of vehicles for that interval. Four traffic situations are distinguished: low, normal, high and heavy.
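A minimal sketch of how the dataset can be loaded and inspected with pandas; the local filename and the column name "Traffic Situation" are assumptions, since only the record/column counts and the class names are stated above.

```python
import pandas as pd

# Load the Kaggle CSV (the filename is an assumption; adjust the path to
# wherever the dataset was downloaded).
df = pd.read_csv("Traffic.csv")

print(df.shape)   # expected: (5952, 9)
print(df.dtypes)  # column names and types

# The target column name is assumed from the dataset description.
print(df["Traffic Situation"].value_counts())  # low / normal / high / heavy
```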
After studying the information contained within the dataset and visualizing it, we applied the following preprocessing steps (a sketch in Python is given after the list):
- Splitting the time column into hours and minutes (or, alternatively, converting it into seconds).
- Creating a midday column containing the AM/PM which we encoded with 0 for AM and 1 for PM.
- Encoding the days of the week with numbers from 1 to 7, one for each day.
- Encoding the traffic situation with numbers 0-3, one for each traffic situation.
- Removing the target column (traffic situation) from the features.
- Normalizing the features using:
- MinMaxScaler
- StandardScaler
- Applying dimensionality reduction using PCA.
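The sketch below illustrates these preprocessing steps with pandas and scikit-learn. The column names (Time, Day of the week, Traffic Situation), the label strings, and the choice of 2 PCA components are assumptions based on the description above, not the exact code used in the project.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA

def preprocess(df: pd.DataFrame):
    """Return the two scaled feature matrices, the PCA projection and the labels."""
    # Split the time column into hours and minutes, plus an AM/PM flag
    # (the "Time" column name and its format are assumptions).
    parsed = pd.to_datetime(df["Time"])
    df = df.assign(hours=parsed.dt.hour,
                   minutes=parsed.dt.minute,
                   midday=(parsed.dt.hour >= 12).astype(int))  # 0 = AM, 1 = PM

    # Encode the day of the week as 1-7 and the traffic situation as 0-3
    # (the exact label strings in the CSV are assumptions).
    day_map = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
               "Friday": 5, "Saturday": 6, "Sunday": 7}
    target_map = {"low": 0, "normal": 1, "high": 2, "heavy": 3}
    df["Day of the week"] = df["Day of the week"].map(day_map)
    y = df["Traffic Situation"].map(target_map)

    # Remove the target and the raw time column; keep only numeric features.
    X = df.drop(columns=["Time", "Traffic Situation"]).select_dtypes("number")

    # Two alternative normalizations, plus a 2-component PCA reduction.
    X_minmax = MinMaxScaler().fit_transform(X)
    X_standard = StandardScaler().fit_transform(X)
    X_pca = PCA(n_components=2).fit_transform(X_standard)
    return X_minmax, X_standard, X_pca, y
```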
We decided to use both classification (supervised learning) and clustering (unsupervised learning). Classification was possible because we have the target column, which we remove and do not use for clustering.
- Classification Model:
- Artificial Neural Networks with Multilayer Perceptron and Backpropagation.
- Clustering Model:
- Agglomerative Clustering.
- Classification Model:
- Support Vector Machines.
- Clustering Model:
- DBSCAN.
- Classification Model:
- Decision Trees.
- Clustering Model:
- K-means.
It is important to note that the performance of unsupervised models on this dataset might be suboptimal. Since the dataset was curated with a focus on classification, unsupervised algorithms, which often thrive on inherent data structures, did not yield satisfactory results.
The neural network trained on the dataset without dimensionality reduction consists of two dense layers:
- Input layer with 16 neurons, 9 inputs (equal to the number of features) and a ReLU activation function.
- Output layer with 4 neurons (equal to the number of Traffic Situation classes) and softmax activation function for multiclass classification.
The neural network trained on the PCA-reduced dataset consists of two dense layers:
- Input layer with 16 neurons, 2 inputs (equal to the number of features after PCA is applied) and a ReLU activation function.
- Output layer with 4 neurons (equal to the number of Traffic Situation classes) and softmax activation function for multiclass classification.
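A minimal Keras sketch of the two architectures described above; the loss function (sparse categorical cross-entropy for the integer-encoded labels) and the Adam optimizer are assumptions drawn from the experiments discussed later, not confirmed implementation details.

```python
from tensorflow import keras

def build_mlp(n_features: int, hidden_nodes: int = 16) -> keras.Model:
    """Two dense layers as described above: hidden ReLU layer + softmax output."""
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),            # 9 without PCA, 2 with PCA
        keras.layers.Dense(hidden_nodes, activation="relu"),
        keras.layers.Dense(4, activation="softmax"),         # 4 Traffic Situation classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",    # integer labels 0-3
                  metrics=["accuracy"])
    return model

model_full = build_mlp(n_features=9)  # dataset without dimensionality reduction
model_pca = build_mlp(n_features=2)   # PCA-reduced dataset
```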
I have also experimented with the number of hidden neurons as well as activation and optimization functions, but these are shown in more detail below.
Figure: visualization of the input, hidden and output layers and their relation to each other (16 nodes in the hidden layer, no PCA applied).
For each model I have experimented with a different number of hidden nodes, batch size, and activation and optimization functions. I have also gathered statistics about accuracy scores for different train-test splits, drawing the conclusions listed below.
Results for data normalized with StandardScaler:
The case of ReLU, softmax and Adam with a small batch size and a large number of hidden nodes:
Results for data normalized with MinMaxScaler and reduced with PCA:
Conclusions regarding the number of hidden nodes:
After several experiments in which I increased and decreased the number of hidden nodes, I came to the conclusion that this choice has distinct effects on the model:
- Fewer hidden nodes result in a simpler model that requires less computational time during training and prediction.
- More hidden nodes give the model a higher capacity to learn complex patterns in the training data, with the risk of overfitting, where the model memorizes the training data but fails to generalize to unseen data.
- A large number of hidden nodes increases computational cost during training and inference.
As seen above, I have tried building models both with a dataset to which PCA (dimensionality reduction) is applied and with a dataset where it is not. The results show that our Traffic Dataset works better in the second case; realistically, it does not need PCA, since it has a small number of features. In terms of computational performance, however, the cases that use the reduced dataset clearly require fewer computational resources.
Batch size defines the number of samples used in each iteration of training the neural network. From the results in the table we can clearly see that a larger batch size leads to better computational performance, as more samples are processed in parallel. However, we should keep in mind that smaller batch sizes lead to better generalization, since the model updates its weights more frequently based on fewer examples, even though it requires more iterations to converge.
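A hypothetical experiment loop along these lines, reusing build_mlp() from the sketch above; the specific hidden-node counts, batch sizes, epoch count and 80/20 split are illustrative assumptions, not the exact grid used in the experiments.

```python
from sklearn.model_selection import train_test_split

# X_standard and y come from the preprocessing sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    X_standard, y, test_size=0.2, random_state=42)

for hidden in (8, 16, 32, 64):          # number of hidden nodes
    for batch in (16, 64, 256):         # batch size
        model = build_mlp(n_features=X_train.shape[1], hidden_nodes=hidden)
        model.fit(X_train, y_train, epochs=50, batch_size=batch, verbose=0)
        _, acc = model.evaluate(X_test, y_test, verbose=0)
        print(f"hidden={hidden:3d}  batch={batch:4d}  accuracy={acc:.3f}")
```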
Agglomerative Clustering is a form of unsupervised learning used to group objects into clusters based on their similarity. The algorithm starts by treating each object as a singleton cluster. Next, the closest pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, called a dendrogram.
The data has been normalized in the same way as described above.
I have used complete and ward linkage. Complete linkage (maximum linkage) defines the distance between two clusters as the maximum distance between any pair of their members, and at each step the two closest clusters under this measure are merged. The silhouette score has shown better results with ward linkage than with complete linkage. The number of clusters is set to 4 in both cases, since we have 4 classes of Traffic Situation.
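A minimal scikit-learn sketch of this comparison; X_scaled stands for the normalized feature matrix with the target column removed, and the silhouette score is the comparison metric mentioned above.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Compare ward and complete linkage with 4 clusters (one per traffic class).
for linkage in ("ward", "complete"):
    labels = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit_predict(X_scaled)
    print(f"{linkage:8s} silhouette = {silhouette_score(X_scaled, labels):.3f}")
```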
The very low scores are explained by the fact that our problem is a classification one, so it is expected that the clustering does not recover the labeled classes well.
For Support Vector Machines I developed a Python script in which I first familiarized myself with the data, also visualizing it.
- I transformed the time column into a column named hours and a column named minutes.
- I also created a midday column that contains whether that hour is in the AM or PM.
- I have also encoded each of the categorical attributes, namely the day of the week and the traffic situation.
- I performed the data normalizations as explained in the third section, first applying the MinMaxScaler and saving the result into features_first_normalization, and then applying StandardScaler and saving the result into features_second_normalization.
- For dimensionality reduction I used PCA, and then visualized the PCA-projected data for each normalization (a sketch is given after this list).
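A sketch of the two normalizations and the PCA visualization; the names features and target are assumptions standing for the encoded feature matrix and labels, and matplotlib is assumed for the plots.

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA

# Two normalizations of the encoded feature matrix.
features_first_normalization = MinMaxScaler().fit_transform(features)
features_second_normalization = StandardScaler().fit_transform(features)

# Visualize each normalization in the 2-D PCA space, colored by class label.
for name, data in [("MinMaxScaler", features_first_normalization),
                   ("StandardScaler", features_second_normalization)]:
    reduced = PCA(n_components=2).fit_transform(data)
    plt.figure()
    plt.scatter(reduced[:, 0], reduced[:, 1], c=target, s=5)
    plt.title(f"PCA projection of {name}-normalized data")
plt.show()
```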
Support vector machines can use several kernel functions. In order to obtain the best results, for this project I have used the RBF, linear, polynomial and sigmoid kernels. I have applied SVM to the data normalized with MinMaxScaler, to the data normalized with StandardScaler, to the data reduced with PCA, and to the non-normalized data, and for each kernel on each form of the data I have used 60%, 70% and 80% train-test splits. This results in 48 different SVM configurations and 48 different performance measurements.
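The 48 configurations can be produced with a loop along the following lines; the datasets dictionary and the fixed random seed are illustrative assumptions, not the exact script used in the project.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# `datasets` is a hypothetical dict mapping a label to one of the four feature
# matrices (MinMaxScaler, StandardScaler, raw, PCA); `target` holds the labels.
kernels = ("rbf", "linear", "poly", "sigmoid")
for data_name, X in datasets.items():
    for kernel in kernels:
        for train_size in (0.6, 0.7, 0.8):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, target, train_size=train_size, random_state=42)
            clf = SVC(kernel=kernel).fit(X_tr, y_tr)
            acc = accuracy_score(y_te, clf.predict(X_te))
            print(f"{data_name:14s} {kernel:8s} train={train_size:.0%} acc={acc:.3f}")
```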
Data normalized with MinMaxScaler:
Data normalized with StandardScaler:
Not normalized data:
Data with dimensionality reduction (with PCA):
As we can see from the tables, the overall performance is quite satisfying, and the choice of normalization does not make much difference. The highest accuracy achieved is 0.95, with an RBF kernel and a training size of 80% of the original data. The sigmoid kernel has not been effective with this dataset, producing very low results. Data that has undergone dimensionality reduction also does not achieve very high results, particularly in precision, recall and F1 score. This is understandable, because dimensionality reduction is used to trade performance for efficiency when we have a huge number of columns (which would take a lot of time to compute). For our dataset, it was not necessary to perform dimensionality reduction, because the number of features is 10. However, I used it to compare the results.