-
Notifications
You must be signed in to change notification settings - Fork 54
/
Copy path2-1-phones.Rmd
167 lines (102 loc) · 22.7 KB
/
2-1-phones.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
## Does brand has an impact on smartphone prices? {#xai2-phones}
*Authors: Agata Kaczmarek, Agata Makarewicz, Jacek Wiśniewski (Warsaw University of Technology)*
### Abstract
Mobile phone became indispensable item in our daily life. From a simple device enabling contacting other people, it developed into a tool facilitating web browsing, gaming, creating/playing multimedia and much more. Therefore, the choice of phone is an important decision, but given the wide variety of models available nowadays, it is sometimes hard to choose the best one and also easy to overpay.
In this paper, we analyze the phone market to investigate whether the prices of the phones depend only on their technical parameters, or some of them have an artificially higher price regarding the possessed functionalities. Research is conducted on the *phones* dataset, provided by the lecturer, containing prices and features of different phones. As a regressor, Random Forest from package *ranger* [@2-1-ranger] was chosen. Given the type of the model (black box i.e. non-interpretable), we use Explainable Artificial Intelligence (XAI) methods from *DALEX* [@2-1-dalex] and *DALEXtra* [@2-1-DALEXtra] packages, for both local and global explanations, to interpret its predictions and get to know which features influence them the most, and raise (or lower) the price of the phone.
### Introduction, motivation and related work
Mobile phones, since their coming onto the market, have gradually entered people everyday life. According to GSMA [@2-1-gsma], almost 70% of the world's population has one. This market has gained significant popularity in 2007, with the introduction of Apple's iPhone [@2-1-phones-smartphone]. It revolutionized the industry by offering features such as a touch screen interface and a virtual keyboard. The smartphone market has been steadily developing and growing since then, both in size, as well as in models and suppliers [@2-1-mobile-industry],[@2-1-mobile-industry-2]. Smartphone, due to its mobility and computer abilities, has become a source of entertainment, a communication tool, a search engine and so much more.
Suppliers constantly outdo each other introducing new improvements, better cameras or batteries, to attract customers. One could wonder which of them (or whether all of them) cause the price increase. Another question is, whether the price depends on the manufacturer - are there certain producers whose phones are more expensive, no matter the parameters? The task of determining a relationship between a smartphone's features, brand, and the price is surely non-trivial.
In such a problem, machine learning can be useful. Machine learning algorithms step into more and more areas of our life. We use them in risk analysis, medical diagnosis or credit approval [@2-1-risk-analysis], [@2-1-medical-diagnosis], [@2-1-credit-scoring], so why could they not be used in phone pricing? The same question Ibrahim M. Nasser et.al. asked themselves, later proving the point in predicting smartphone prices with the use of neural networks [@2-1-phones-ann]. Also, Ritika Singh [@2-1-phones-eda] examines the importance of various features in predicting smartphone prices.
In general, we can distinguish two types of models: glass-box, which steps can be followed from inputs to outputs, and black-box, which do not have a readable way of determining predictions [@EMA]. In many cases, such as the one considered in this article, simple interpretable models are not capable of dealing with our problem satisfyingly, so we turn to more complex, non-transparent ones, which grant us higher accuracy, but lower understanding, and therefore, lower trust. Explainable Artificial Intelligence (XAI) addresses this problem. It is a set of tools to help you understand and interpret predictions made by your machine learning models [@2-1-xai-1]. With it, one can debug and improve model performance, and help others understand models' behaviour. More XAI methods were described by Przemysław Biecek et.al. in "Explanatory Model Analysis" book [@EMA], with local methods as Breakdown, Shapley and Ceteris Paribus; and global methods as Partial-dependence profiles being also explained there.
In the article below, we deal with the problem of creating an explainable regressor for mobile phones prices. We build a black-box model and then use XAI methods to find out which features and brands contribute mostly to the final price.
### Methodology
**Data description**
The research was carried out on the *phones* dataset provided by the lecturer i. e. data frame with prices and technical parameters of 414 phones. It contains ten explanatory variables and one target variable (`price`), therefore we deal with regression task. The sample of the data is presented below (\@ref(fig:2-1-sample-data)).
```{r 2-1-sample-data, fig.align='center',fig.cap='Sample of the data',out.width="100%", echo=FALSE, warning=FALSE, message=FALSE}
knitr::include_graphics('images/2-1-sample.PNG')
```
**Exploratory Data Analysis and data preprocessing**
At the beginning of our research, we conducted Exploratory Data Analysis to get a better understanding of the data we deal with. We mainly focused on the target variable and its distribution versus explanatory ones to identify potential influential features for our prediction. Below we present some results important for further work.
```{r mean-price, out.width='700', fig.align="center", fig.cap='Ten brands with the highest mean price of the phone among the observations from the dataset.', echo=FALSE}
knitr::include_graphics('images/2-1-eda-mean-price.png')
```
Analyzing brands by the mean price of the phones produced, there can be identified a distinct leader, which is an Apple company. On average, a phone from it costs more than 3500 PLN. In the top 10 brands, we can also see common ones such as Samsung or Huawei.
```{r violin, fig.align="center", fig.cap='Phone price distribution for 4 most common brands among the observations from the dataset.', echo=FALSE}
knitr::include_graphics('images/2-1-eda-violin-plot.png')
```
On the plot above (\@ref(fig:violin)) we can identify some outliers in terms of price, especially a phone made by Samsung company, which costs 9000 PLN. Concerning the Xiaomi brand, we can observe that despite being a popular choice, the price of a single phone is relatively low - no phones are exceeding 4000 PLN. As for the Apple products, conclusions from the plot above are confirmed.
```{r corr, fig.align="center", fig.cap='Correlation matrix for numeric features', echo=FALSE}
knitr::include_graphics('images/2-1-correlation-matrix.png')
```
Based on the partially presented EDA, we needed to conduct simple data preprocessing before modelling. The following steps were executed:
* ***handling missing values:*** Two features containing missing values were identified; both related to camera parameters (`back_camera_mpix`, `front_camera_mpix`). Those values turned out to be meaningful, as they mean that the given mobile phone has no camera (back or front). Given that information, NAs were imputed with a constant value - 0.
* ***removing outliers:*** Based on features distribution, some extreme values were identified in the dataset's explanatory variables (`back_camera_mpix`, `front_camera_mpix`, `battery_mAh`, `flash_gb`, `price`), which would weaken the model's performance. Therefore they were removed.
* ***dealing with unimportant and correlated features:*** The variable `name` has been omitted, because it was practically unique in the dataset and naturally connected to the `brand` feature. Moreover, `height_px` and `width_px` were deleted due to their strong correlation with the `diag` feature (and with each other) (\@ref(fig:corr)); this feature was considered as a sufficient determinant of the phone's dimensions.
**Models**
The next step after EDA was creating prediction models. To compare results and find the best model for mentioned data, there were created 3 models: Random Forest from *ranger* package [@2-1-ranger], XGBoost from *mlr* package [@2-1-mlr] and SVM, also from the *mlr* package. For each of them, training using 5-fold cross-validation was performed. SVM and XGBoost models cannot be trained using categorical variables so variable `brand` needed to be target encoded in these cases. There were used two measures to compare models results: Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The results are presented in the table below.
```{r models, fig.cap='5-fold crossvalidation errors - Root Mean Squared Error and Mean Absolute Error - for each of the tested models (Random Forest, XGBoost, SVM)', echo=FALSE}
models <- data.frame(rmse = c(547, 1515, 678), mae = c(295, 1033, 376), row.names = c('ranger', 'xgboost', 'svm'))
knitr::kable(models, format='markdown')
```
The presented results point, that the ranger model is the best choice. This model was used in further analysis.
### Results
Ranger model results have been analyzed using Explainable Artificial Intelligence methods. Following paragraphs present local and global explanations.
**Local explainations**
In the first step of our XAI, the focus was on instance level explanations - analysis of single predictions and how each feature influences their values. Every local explanation method [@2-1-local-explanations] takes one observation and draws the results for it. The results for each observation can differ greatly from the results for the other observation. Thanks to it we can eg. compare two observations and the influence of various features on the prediction. However, this causes that we cannot assume general ideas for a whole data set based on results from local explanations. Therefore below we present only the most interesting observations found during the research. Breakdown [@xai1-breakdown], SHAP [@xai1-shapleyvalues], Lime [@2-5-lime] and Ceteris Paribus [@EMA] profiles were used to show dependencies and draw conclusions. Observations below were grouped into pairs to show how identical parameters, but different brands can lead to totally different prices or vice versa.
```{r 2-1-first-example, fig.align="center", fig.cap='Two phones (Samsung and Realme) from data set, which have similar features, various prices.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-breakdown-sample.PNG')
```
```{r 2-1-breakdown-plot20, fig.align="center", fig.cap='Breakdown profile plot drawn for observation 20 - Samsung phone from above table.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-breakdown_20.png')
```
```{r 2-1-breakdown-plot246, fig.align="center", fig.cap='Breakdown profile plot drawn for observation 246 - Realme phone from above table.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-breakdown_246.png')
```
The above-shown Breakdown [@xai1-breakdown] profiles help answer a common question: which variable contributes to the prediction the most. It is of course only one of the possible solutions, some of the others are presented further. This method shows the decomposition of the prediction done by model, most important features plotting the highest. Next to each feature, there is a number showing the value of the influence of the feature on the prediction. We can separate features with positive (green bars) and negative (red) influence.
Shown above two observations (\@ref(fig:2-1-first-example)) vary three features - *ram_gb*, *brand* and *diag*. It is visible on these two Breakdown charts (\@ref(fig:2-1-breakdown-plot20), \@ref(fig:2-1-breakdown-plot246)) that Samsung has a bigger diagonal, but less RAM GB and according to the model is more expensive by 200. However, this difference, in reality, is higher, Samsung is more expensive by 700. There are also differences in the impact of features - in first *batter_mAh* has a positive impact and in the second negative. In the first case for model the most important were *ram_gb*, *diag* and *brand* (in this order), in second *ram_gb*, *brand* and *front_camera_px*. The question is, whether in reality *brand* does not have a bigger impact on price than shown here?
```{r 2-1-second-example, fig.align="center", fig.cap='Two phones (Apple and Motorola) from data set, Motorola has better features and is cheaper.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-lime-sample.PNG')
```
```{r 2-1-lime-plot-40, fig.align="center", fig.cap='Lime plot drawn for observation 40 - Apple phone from above table.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-lime-40.png')
```
```{r 2-1-lime-plot-319, fig.align="center", fig.cap='Lime plot drawn for observation 319 - Motorola phone from above table.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-lime-319.png')
```
The above-shown Breakdown profile is most suitable for models with a small number of features. However, it sometimes happens that we need to use more features and Breakdown is not the best choice. Better is Lime (Local Interpretable Model-agnostic Explanations) profile [@2-5-lime]. It is one of the most popular sparse explainers, which uses a small number of variables. The main idea of it is to change the complicated black-box model into approximated glass-box one. In this situation, the second one is much easier to interpret. On the Lime plots, we can also see the influence of the features on the final prediction, with division (shown as various colours) on positive and negative.
On LIME plots (\@ref(fig:2-1-lime-plot-40), \@ref(fig:2-1-lime-plot-319)) two phones, which have similar values in many features (\@ref(fig:2-1-second-example)), in two (*battery_mAh* and *diag*) second phone (\@ref(fig:2-1-lime-plot-319)) has better values than first one (\@ref(fig:2-1-lime-plot-40)). Even though the price of the first phone is three times higher according to our model. The only difference not mentioned above between them is a brand - the first one is the iPhone. That seems to be a conclusion consistent with reality.
```{r 2-1-third-example, fig.align="center", fig.cap='Two phones (OPPO and Apple) from data set, Apple is about three times more expensive having similar features.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-ceteris-sample.PNG')
```
```{r 2-1-ceteris, fig.align="center", fig.cap='Ceteris paribus profile plot drawn for observations 59 and 77 (shown in above table).',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-ceteris-paribus.jpg')
```
Above mentioned profiles showed the various influence of features on the prediction and the comparison between them. Ceteris Paribus profile [@EMA] shows the effect on the model's predictions made by changes in the value of one individual feature, while the rest of them held constant. These profiles help us understand how changes in one feature affect the prediction of the model for all of the features.
Ceteris Paribus profile (\@ref(fig:2-1-ceteris)) shows different influence of some features concerning two mobile phones (\@ref(fig:2-1-third-example)). The biggest contrast we can observe in the case of *battery_mAh*, which lowers the price significantly in the case of OPPO phone, and increases when it comes to Apple one, leading to the same prediction for both if the value exceeds 5000 mAh. It is quite surprising because in the case of the first one such battery parameters should lead to a bigger price. Another difference which can be observed in `front_camera_mpix` influence - whereas above ~ 15 Mpix we reach similar price, for smaller values it causes prediction's increase for Apple, and steady value for OPPO (for both peaks around 10 Mpix value). Once more those impacts are unexpected because the OPPO phone has a much better front camera.
**Global explanations**
In the second step of our XAI analysis, we focused on dataset level explanations - those concerning not a single observation, but all of them. Global methods let us gain a better understanding of how our model behaves in general and what influences it the most. We perform an analysis of all predictions for the dataset together to find out how each feature affects their average value. The methods used for this purpose are Permutation-based Variable Importance (Feature Importance) [@2-1-feat-imp] and Partial Dependence Profile (PDP) [@2-1-pdp].
Feature Importance is a method assessing which variables have the most significant impact on the overall model prediction. The idea of it is to evaluate a change in the model's performance in a situation when the effect of a given variable is omitted. This simulation is achieved by resampling or permuting the values of the variable. If a given feature is important to our model, then we expect that after those permutations the model’s performance will worsen; therefore the greater the change in the model's performance, the bigger is the importance of a feature. The calculations, which results are presented below, were performed for 5000 permutations, to achieve stability.
```{r 2-1-feat-importance, fig.align="center", fig.cap='Feature importance explanation for the phones dataset variables. Blue thick bars correspond to RMSE loss after permutations. Dark blue bars represent boxplots for 100 bootstrapped explanations.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-feature-importance.png')
```
According to the plot (\@ref(fig:2-1-feat-importance)), the most important variables for the model are `ram_gb`, `flash_gb` and `brand`, with visible domination of `ram_gb`. After these, the next variables in terms of impact are the camera-related ones, then `diag` and `battery_mAh`. These observations coincide with the conclusions drawn from local explanations (\@ref(fig:2-1-breakdown-plot20), \@ref(fig:2-1-breakdown-plot246)) - for single predictions, brand and memory-related variables were also very important and had a significant impact on the price of the phone, and camera-related were next. Whereas those previous results could be a coincidence, and true only for those 2 chosen observations, now we are ensured that those conclusions are adequate for the whole dataset. Moreover, these results confirm also our intuition. Memory parameters are arguably the most important phone's technical parameters and the high position of `brand` on the chart shows, that, probably in the case of the bigger companies, the prices of the phones are likely to be biased with regard to their parameters.
In order to perform further analysis of the model's behaviour on the global level, Partial Dependence Profile was used. The general idea underlying the construction of PDP is to show how the expected value of model prediction behaves as a function of a selected explanatory variable. Comparison of these profiles may give additional insight into the stability of the model's prediction.
```{r 2-1-pdp, fig.align="center", fig.cap='Partial Dependence Profile (PDP) explanation for numeric variables in phones dataset. Blue lines represent variable (horizontal axis) and price (vertical axis) dependence.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-pdp.png')
```
The plot above (\@ref(fig:2-1-pdp)) confirms observation from the previous plot (\@ref(fig:2-1-feat-importance)), presenting a strong dependency between our target (`price`) and memory parameters. For the variables `flash_gb` and `ram_gb`, we can observe similar profiles - as the value of the variable increases, the estimated price of the phone increases. In the case of the `front_camera_mpix` variable, there can be seen an unnatural behaviour near 10-12 mpx, suggesting that phones with these specific parameters are the most expensive. We have already observed this characteristic behaviour in previous analyzes for single predictions (\@ref(fig:2-1-ceteris)). In order to investigate this anomaly, the further analysis presented below was performed.
```{r 2-1-pdp-front-camera, fig.align="center", out.width='60%',fig.cap='Results of the research explaining front camera mpix PD profile anomaly. The table above presents the number of phones grouped by brand having from 10 to 12 front camera mpix value. The PDP plot below the table presents the front camera mpix PD profile with marked 10 and 12 values.', echo=FALSE}
brands <- data.frame(count = c(21, 3, 14), row.names = c('Apple', 'Motorola', 'Samsung'))
colnames(brands) <- 'Number of phones'
knitr::kable(brands, format='markdown')
knitr::include_graphics('images/2-1-front-camera.png')
```
The research proved that phones having from 10 to 12 megapixels in their front camera are mostly represented by expensive brands like Samsung and Apple in the *phones* dataset. This leads to the conclusion, that in this case, it was the brand that influenced the phone's price, not the quality of the front camera.
```{r 2-1-pdp-brand, fig.align="center", fig.cap='Partial Dependence Profile (PDP) explanation for brand variable in phones dataset. Blue bars represent the average predictions for each brand in the dataset.',out.width="100%", echo=FALSE}
knitr::include_graphics('images/2-1-pdp-brand.png')
```
Partial Dependence Profile (\@ref(fig:2-1-pdp-brand)) looks slightly different for the brand variable because it is a categorical variable. Surprisingly, this plot presents a weak brand name impact on price, unlike previous plots: (\@ref(fig:2-1-pdp)) and (\@ref(fig:2-1-feat-importance)) - for most brands we observe a very similar prediction value. Brands that increase the price are Apple, Archos, and CAT, but only one of those brands (Apple) is a big phone company, however, for this brand, we actually already observed "unexpected" high prices of phones disproportionate to their parameters (. CAT brand is represented by three phones wheres Archos appears in the dataset only once. That leads to the conclusion, that their PD profile might not be stable.
### Summary and conclusion
The data set explored in the above research included only a small number of features, even though the obtained results seem to be realistic from the customer's point of view. The evaluation of our model revealed expected results.
To summarize, according to all explanations shown above, there are several conclusions, which can be drawn. The biggest impact on the predicted price for the model had features describing brand, RAM storage and internal memory size. We believe that the most expensive brands such as Samsung and Apple have biased prices because they are higher than predicted by the model. What is important to highlight is that these conclusions were made for this specific data set, which had only eleven features at the beginning.
The constantly changing market of mobile devices is unexplored and further research based on more complicated models and bigger data sets, with more features, could be beneficial to both customers and companies. These data sets could contain information about processor and durability, which seem to be important features for customers, yet were not included in our data set. Also, phone prices vary across continents and countries, so the results will be different.