Ahmet Zamanis
- Introduction
- Data Preparation
- Exploratory Analysis
- OLS Linear Regression
- Bayesian Linear Regression
- Conclusions
Bayesian Linear Regression is an alternative approach to linear regression that is based on Bayesian inference, and the Bayes Theorem. While the better known Ordinary Least Squares approach to linear regression estimates a single value for the regression model parameters, the Bayesian approach generates a probability distribution of possible values for each model parameter. BLR offers some benefits and additional insight compared to the OLS method:
- We can more easily quantify and visualize the uncertainty in the model, since the output is a range of values for the model parameters, instead of single values. A more spread out probability distribution for the parameters will mean our model has more uncertainty.
- If we have knowledge or guesses about the prior distributions of model parameters, we can use them in building the model as informative prior distributions.
In this example analysis, we have a dataset of used cars, which includes their prices and various features. We will try to predict the price of used cars, first using regular OLS linear regression, then with BLR. We will compare their results, and analyze the additional information provided by BLR.
The dataset was sourced from Kaggle, provided by the contributors Nehal Birla, Nishant Verma and Nikhil Kushawa.
Let’s load our dataset and view the first 5 rows, along with simple summary statistics.
Used cars dataset | ||||||||
---|---|---|---|---|---|---|---|---|
name | year | selling_price | km_driven | fuel | seller_type | transmission | owner | |
1 | Maruti 800 AC | 2007 | 60000 | 70000 | Petrol | Individual | Manual | First Owner |
2 | Maruti Wagon R LXI Minor | 2007 | 135000 | 50000 | Petrol | Individual | Manual | First Owner |
3 | Hyundai Verna 1.6 SX | 2012 | 600000 | 100000 | Diesel | Individual | Manual | First Owner |
4 | Datsun RediGO T Option | 2017 | 250000 | 46000 | Petrol | Individual | Manual | First Owner |
5 | Honda Amaze VX i-DTEC | 2014 | 450000 | 141000 | Diesel | Individual | Manual | Second Owner |
## name year selling_price km_driven
## Length:4340 Min. :1992 Min. : 20000 Min. : 1
## Class :character 1st Qu.:2011 1st Qu.: 208750 1st Qu.: 35000
## Mode :character Median :2014 Median : 350000 Median : 60000
## Mean :2013 Mean : 504127 Mean : 66216
## 3rd Qu.:2016 3rd Qu.: 600000 3rd Qu.: 90000
## Max. :2020 Max. :8900000 Max. :806599
## fuel seller_type transmission
## CNG : 40 Dealer : 994 Automatic: 448
## Diesel :2153 Individual :3244 Manual :3892
## Electric: 1 Trustmark Dealer: 102
## LPG : 23
## Petrol :2123
##
## owner
## Test Drive Car : 17
## First Owner :2832
## Second Owner :1106
## Third Owner : 304
## Fourth & Above Owner: 81
##
The original dataset has 4340 observations, 8 columns. The variables in the original dataset are:
- name: Name of the car model. Character variable.
- year: Year of the car model. Numeric variable.
- selling_price: Selling price of the car. Numeric variable. The outcome variable we will try to predict.
- km_driven: The distance traveled by the car, in kilometers. Numeric variable.
- fuel: The fuel type of the car. Factor variable with five levels.
- seller_type: The car’s seller. Factor variable with three levels.
- transmission: The car’s transmission type. Factor variable with two levels.
- owner: Number of previous owners. Factor variable with five levels.
There are some data cleaning and recoding steps to consider:
- Selling price and kilometers driven are both large numbers, in hundreds of thousands and millions. Let’s divide both by 1000 to get them in units of thousands.
- The fuel types CNG, LPG and Electric have very few observations, compared to Petrol and Diesel which have almost the same number of observations. We may want to drop the fuel categories besides Petrol and Diesel, to achieve a simpler, almost perfectly balanced variable.
- Similarly, there are few observations with the “Trustmark Dealer” seller type. We can recode these as “Dealer” type observations.
- There are few observations with “Test Drive Car” and “Fourth & Above Owner” levels in the owner variable. Furthermore, without Test Drive Cars, number of owners would be a naturally ordered factor variable. We can drop observations with “Test Drive Car” and “Fourth & Above Owner”, along with these levels of the owner variable.
## name year selling_price km_driven
## Length:4182 Min. :1995 Min. : 20.0 Min. : 0.001
## Class :character 1st Qu.:2011 1st Qu.: 211.2 1st Qu.: 35.000
## Mode :character Median :2014 Median : 370.0 Median : 60.000
## Mean :2013 Mean : 512.4 Mean : 65.724
## 3rd Qu.:2016 3rd Qu.: 600.0 3rd Qu.: 90.000
## Max. :2020 Max. :8900.0 Max. :806.599
## fuel seller_type transmission owner
## Diesel:2112 Dealer :1066 Automatic: 444 First Owner :2801
## Petrol:2070 Individual:3116 Manual :3738 Second Owner:1081
## Third Owner : 300
##
##
##
With these changes, we have a dataset of 4182 variables, and simplified, less unbalanced factor variables.
Let’s look at the distributions of our numeric variables.
Our outcome variable, selling price, follows a very right skewed
distribution. There are numerous very large outliers compared to the
median and the interquartile range.
- While the minimum value is 20k, and the median 370k, the maximum is
8.9m, with numerous observations at a price level higher than 1m.
- It may be sensible to remove the very high priced cars from the dataset, and try to predict the prices of more “ordinary” cars, as different price segments are likely to have different dynamics.
- However, since we want a model with some uncertainty to demonstrate the merits of BLR, we will keep the entire dataset for this analysis.
- The year variable is left skewed, with a median year of 2014. Most cars in our dataset are newer models, roughly after 2010. Very few cars are from before 2005. Half of the cars are from 2011-2016.
- The km_driven variable is very right skewed, with a median of 60k,
and a maximum of 806k.
- There are numerous cars with hundreds of thousands of kilometers driven. One observation has only 1 kilometer driven.
Let’s look at the balances of our factor variables.
- Fuel type is balanced almost perfectly between diesel and petrol.
- Previous owners is unbalanced, most cars are from first owners, and very few are from third owners.
- Seller type is unbalanced, most cars are from individual sellers.
- Transmission type is very unbalanced, most cars are manual transmission cars.
Let’s check the correlations between our variables, starting with the
numeric variables.
- There is a moderate, positive correlation of 0.4 between year and
selling price. The coefficient is highly statistically significant.
- The scatterplot suggests a non-linear, exponential-like relationship. We can see some outliers with much higher selling prices compared to observations in their year. Again, these are the luxury cars with millions in selling price, likely subject to different pricing dynamics.
- There is a moderate, negative correlation of -0.42 between year and
km_driven. The coefficient is highly statistically significant.
- The scatterplot suggests a non-linear relationship, where kilometers driven increases from 1995 to around 2010, and declines again after 2010.
- This is likely due to a lack of observations from older years, as most cars in our dataset are from 2010 and onwards. Intuitively, we would expect older cars to have higher kilometers driven. It’s possible that older cars with higher kilometers are not put up for sale.
- The correlation between kilometers driven and selling price is weak,
with a negative coefficient of -0.18. The coefficient is highly
statistically significant.
- Before a selling price of roughly 1.25m, we have some cars with kilometers driven above 200k. But after a selling price of roughly 1.25m, all cars are below 100k kilometers.
- It’s likely that very high-priced, luxury or classic cars are driven less than ordinary, daily driver cars.
Let’s check the degrees of association between our factor variables,
with tile plots and chi-square tests.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tchi1
## X-squared = 4.5774, df = 1, p-value = 0.0324
We see that dealers are a bit more likely to sell diesel cars, while individuals are a bit more likely to sell petrol cars. The association is weak but statistically significant.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tchi2
## X-squared = 8.0562, df = 1, p-value = 0.004535
Automatic cars are a bit more likely to be diesel, and manual cars are a bit more likely to be petrol. Again, a weak but statistically significant association.
##
## Pearson's Chi-squared test
##
## data: tchi3
## X-squared = 0.5651, df = 2, p-value = 0.7539
There is no significant association between fuel type and number of owners.
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tchi4
## X-squared = 192.08, df = 1, p-value < 0.00000000000000022
Individual sellers are much more likely to sell manual cars compared to dealers. This is a strong, highly statistically significant association.
##
## Pearson's Chi-squared test
##
## data: tchi5
## X-squared = 284.24, df = 2, p-value < 0.00000000000000022
Dealers sell roughly 1/3rd of first owner cars, 1/10th of second owner cars, and 1/33th of third owner cars. There is a very strong and highly statistically significant association between these variables. Dealers strongly prefer selling first owner cars.
##
## Pearson's Chi-squared test
##
## data: tchi6
## X-squared = 30.363, df = 2, p-value = 0.0000002552
First owner cars are a bit more likely to be automatic transmission cars. There is a moderate, statistically significant association between these variables.
Finally, let’s look at the relationships between numeric and factor
variables, using boxplots of numeric variables grouped by factor
variables.
- We see that diesel cars tend to be priced higher than petrol cars.
- Cars sold by dealers tend to be priced a bit higher than cars sold by individuals.
- Automatic transmission cars tend to be priced much higher than manuals.
- The more previous owners a car has, the lower it is priced. The effect is more pronounced for first owner cars, and less so between second and third owner cars.
- Overall, diesel cars tend to be a bit newer than petrol cars, though there are numerous older diesels as well.
- Cars sold by dealers tend to be a bit newer compared to cars sold by individuals.
- Automatic transmission cars tend to be a bit newer compared to manuals.
- Generally, the more previous owners a car has, the older it is.
- Diesel cars tend to have higher kilometers driven compared to petrols. This is expected as diesel engines are known to be more durable.
- Cars sold by individuals tend to have higher kilometers driven compared to cars sold by dealers.
- Manual transmission cars tend to have higher kilometers driven compared to automatics.
- Generally, cars with more previous owners tend to have higher kilometers driven, though there are some first owner cars with the highest kilometers driven among the cars in the dataset.
Now let’s consider the relationships between selling price, and the other variables in our dataset. We have already done this to a degree with our correlations analysis above. For factor variables, we can say:
- Fuel type has a moderate relationship with selling price. Diesel cars are more expensive than petrol cars.
- Transmission type has a strong relationship with selling price. Automatic cars tend to be more expensive, though we have few observations with automatic transmissions..
- Previous owners has a moderate relationship with selling price. First owner cars tend to be more expensive, but the effect for second and third owner cars is less clear, as there are few observations with these levels.
- Seller type has a somewhat weaker relationship with selling price. Cars sold by dealers tend to be priced a bit higher.
Let’s use scatterplots to better visualize the relationship between the
two numeric predictors, year and km_driven, and the selling prices.
Let’s also group the observations by the factor variables for more
possible insights.
- Selling price tends to increase with year, in a non-linear,
exponential-like fashion.
- Diesels, automatics, dealer cars and first owner cars all tend to be newer, and higher priced.
- Selling price doesn’t have a clear relationship with kilometers
driven, though there are no cars priced above roughly 1.25m with
more than 200k kilometers driven.
- Automatics, dealer cars and first owner cars all tend to have lower kilometers.
Before we fit an OLS linear regression model, there are a few transformations we need to consider:
- Our outcome variable, selling price, follows a non-normal distribution. We will use the log transformed version as the outcome variable in our regression, to try and satisfy the normality of residuals assumption.
- The relationship between year and selling price is likely exponential and not linear. Since we will apply a log transformation to selling price, we don’t need to apply an exponential transformation to year.
As for variable choice, we saw that year, fuel type, transmission type and owner type are all potential predictors of selling price. There is no apparent strong correlation within these variables, so we can include all of them as predictors.
Kilometers driven and seller type exhibit weak relationships with selling price. Besides, they may be largely caused by the other variables in our dataset:
- Kilometers driven is correlated / associated with year and fuel type. Diesel engines tend to last longer than petrol engines, and intuitively we would expect older cars to be driven for longer distances. We can consider km_driven largely as a consequence of fuel type and year, rather than a separate predictor.
- Similarly, seller type is associated with transmission type and owner type. It’s likely that dealers prefer automatic, first owner cars, as well as possibly newer, automatic and lower kilometer cars. Again, we can consider dealer type as a consequence of the other predictors in our dataset.
- These variables are largely caused by the other predictors, while not being significant predictors of the outcome variable themselves. Including them in our model would lead to selection bias: Some of the effects of the true predictors on the outcome variable would be held constant, leading to misleading results about the effects of our true predictor variables.
Therefore, we will keep kilometers driven and seller type out of our regression model.
Let’s fit our first linear model, with the log transformed selling price as the outcome variable, and year, fuel, transmission, owner as our predictor variables.
lm1 <- lm(log(selling_price) ~ year + fuel + transmission + owner, data=df)
summary(lm1)
##
## Call:
## lm(formula = log(selling_price) ~ year + fuel + transmission +
## owner, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77309 -0.29788 0.00815 0.31172 2.35456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -229.245853 4.136982 -55.414 < 0.0000000000000002 ***
## year 0.117310 0.002052 57.171 < 0.0000000000000002 ***
## fuelPetrol -0.504037 0.015026 -33.543 < 0.0000000000000002 ***
## transmissionManual -0.850391 0.024433 -34.805 < 0.0000000000000002 ***
## ownerSecond Owner -0.081459 0.018610 -4.377 0.0000123193 ***
## ownerThird Owner -0.166369 0.031027 -5.362 0.0000000867 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4811 on 4176 degrees of freedom
## Multiple R-squared: 0.6665, Adjusted R-squared: 0.6661
## F-statistic: 1669 on 5 and 4176 DF, p-value: < 0.00000000000000022
- Our model explains 67% of the variance in the selling prices.
- All predictors are highly significant, with p-values close to zero.
- Year has a positive effect on price, while the other predictors all have negative effects.
- The intercept is -229, which is very close to 0 when we reverse the log transformation.
Since the coefficients are log transformed, let’s reverse the transformations, convert them into percentage changes, and get the 95% confidence intervals.
## 2.5 % estimate 97.5 %
## year 11.99531 12.446749 12.900012
## fuelPetrol -41.34493 -39.591267 -37.785173
## transmissionManual -59.27353 -57.275207 -55.178830
## ownerSecond Owner -11.12550 -7.822957 -4.397695
## ownerThird Owner -20.32377 -15.326617 -10.016050
Here’s how to interpret these transformed coefficient estimates, and the confidence intervals:
- Increasing year by 1 is expected to increase price by 12.44%, holding everything else constant. The 95% confidence interval is 12-13%: There is a 95% chance the true mean of the coefficient translates into a 12-13% increase in price.
- A petrol car is expected to sell for 40% less compared to a diesel car.
- A manual transmission car is expected to sell for 57% less compared to an automatic.
- A second owner car is expected to sell for 8% less, and a third owner car for 15% less, compared to a first owner car.
Another way to visualize the individual effect of each variable on
price, while holding the effects of other variables constant, is with
added variable plots:
- Year has the biggest effect on price: A positive, approximately
linear increase. The log transformation of selling price apparently
worked well to capture the relationship between year and selling
price.
- Year had the second smallest coefficient, but this can be
misleading: The coefficients are unit effects on price, and year
is a numeric variable with a range of values from 1995-2020. The
other variables are factor variables with 2 or 3 levels.
Therefore, year has more potential to affect the price.
- For example, going from a year value of 1995 to 2020, while keeping other predictors constant, we can expect the new price to be (1.12)^25 times the old one. That is 17 times the original price!
- Year had the second smallest coefficient, but this can be
misleading: The coefficients are unit effects on price, and year
is a numeric variable with a range of values from 1995-2020. The
other variables are factor variables with 2 or 3 levels.
Therefore, year has more potential to affect the price.
- Transmission has the second biggest effect on price: A decline for manual transmission cars, or an increase for automatics.
- Then fuel: A decline for petrol cars, or an increase for diesels.
- Finally owner type: A small decline for second owners, and a slightly bigger decline for third owners.
Let’s plot the diagnostic graphs for lm1, and see if our model fits the
assumptions for an OLS linear regression.
## GVIF Df GVIF^(1/(2*Df))
## year 1.286950 1 1.134438
## fuel 1.019771 1 1.009837
## transmission 1.023523 1 1.011693
## owner 1.253309 2 1.058070
- The linearity assumption is generally satisfied, as there are no strong patterns in the residuals vs fitted plot.
- The normality of residuals assumption generally holds, but the residuals for the highest quantiles deviate from normality, which is likely because of the higher priced, luxury cars in our dataset not being predicted well.
- The equal variance assumption is generally satisfied, as there are no strong patterns in the scale-location plot. However, the residuals for the highest fitted values are considerably higher, again likely due to the high-priced luxury cars being poorly predicted.
- There are no particularly high leverage observations present. There are a few outliers in terms of standardized residuals.
- The variance inflation factors for all variables are between 1-1.3, meaning there are no multicollinearity issues present in our model.
Another way to visually assess the model’s fit is to plot the fitted
values against the real observed values. Let’s do this, while reversing
the log transformations on the model’s predictions.
While the fitted vs. observed values are reasonably close to following a
straight line, quite a few observations, especially higher priced cars
above 2.5m, are vastly underpredicted. We expected this from our
exploratory analysis, and splitting the data can offer more accurate
predictions, but for the sake of our example, let’s stick with these
results.
Now, let’s fit a BLR model with the same formula as lm1. We will go over the outputs of a BLR model, and compare them with the outputs of our OLS model lm1. Our outcome variable is log transformed, which leads to a distribution reasonably close to a standard normal distribution, so we will leave the default setting of a normal distribution as the prior distribution. We don’t have any prior information on the regression coefficients’ distributions, so we will leave them default too.
Our BLR model is constructed with a Markov Chain Monte Carlo algorithm, which draws random samples from the model parameters’ distribution. Each iteration of a BLR model will yield different results, so we need to set seed to get replicable outputs.
set.seed(1)
blr1 <- stan_glm(log(selling_price) ~ year + fuel + transmission + owner,
data=df, seed=1)
The coefficients of the model blr1 are log transformed, so let’s reverse the transformations, convert them into percentage changes, and see the posterior distributions of the model parameters.
## Summary of Posterior Distribution
##
## Parameter | Median | 95% CI | pd | ROPE | % in ROPE | Rhat | ESS
## ---------------------------------------------------------------------------------------------------
## year | 12.46 | [ 11.96, 12.90] | 100% | [-1.00, 1.01] | 0% | 0.999 | 4329.00
## fuelPetrol | -39.58 | [-41.30, -37.76] | 100% | [-1.00, 1.01] | 0% | 1.000 | 5993.00
## transmissionManual | -57.26 | [-59.22, -55.19] | 100% | [-1.00, 1.01] | 0% | 1.000 | 5666.00
## ownerSecond Owner | -7.81 | [-11.10, -4.32] | 100% | [-1.00, 1.01] | 0% | 1.000 | 4917.00
## ownerThird Owner | -15.35 | [-20.29, -9.90] | 100% | [-1.00, 1.01] | 0% | 1.000 | 4602.00
Let’s compare these with the estimated coefficients of lm1, and their 95% confidence intervals.
## 2.5 % estimate 97.5 %
## year 11.99531 12.446749 12.900012
## fuelPetrol -41.34493 -39.591267 -37.785173
## transmissionManual -59.27353 -57.275207 -55.178830
## ownerSecond Owner -11.12550 -7.822957 -4.397695
## ownerThird Owner -20.32377 -15.326617 -10.016050
We can see that the estimates of lm1 and blr1 are very close. The 95% confidence intervals of lm1 are also very close to the 95% credible intervals of blr1. However, their interpretations are a bit different:
- The estimates of lm1 are estimated means of the regression coefficients, while the estimates of blr1 are the estimated medians.
- With an OLS linear regression model, there is a 95% chance that the true mean of the coefficient lies within the 95% confidence interval.
- With a BLR model, 95% of all coefficient estimates lie within the 95% credible interval.
We also have a few metrics to assess the accuracy of each parameter in blr1:
- pd is the probability that the direction of the coefficient is true. This is 100% for all our coefficients.
- ROPE is a range close to zero, which is considered practically zero.
If the credible interval for a coefficient is fully covered by the
ROPE, we can consider the effect of that variable to be practically
zero. All the coefficients in our model are fully outside the ROPE.
- The interpretation of “% in ROPE” is somewhat similar to the p-value in OLS: x% in ROPE means there is a x% chance that the effect is not significant.
- Rhat is the scale reduction factor. If it is close to 1, the MCMC algorithm came close to reaching convergence. In our case, all Rhat values are 1 or very close.
- ESS is effective sample saze, the number of independent draws that give us the same information as the MCMC algorithm. Generally, a high ESS, higher than 400, means the MCMC was informative. In our case, the lowest ESS is 4329.
BLR generates a posterior probability distribution for the model parameters, which can help visualize their level of uncertainty. Let’s plot the posterior distributions of our model coefficients, with the log transformations reversed. Keep in mind that the X-axis values in the below plots are just exponents of the log coefficients. The closer they are to 1, the smaller their effect on price. A coefficient greater than 1 has a positive effect on price, and smaller than 1 has a negative effect on price.
Let’s plot the posterior probability distributions of the model
coefficients.
If we want to compare the effect and uncertainty of each coefficient, we can plot their distributions together, though it will be hard to see the distribution for each individual coefficient this way.
In this plot, coefficients that are farther away from 1 have a bigger
unit effect on the price, coefficients greater than 1 have a positive
effect, and coefficients smaller than 1 have a negative effect.
- Manual transmission has the biggest unit effect on price, with a negative effect.
- Petrol fuel follows with a negative effect.
- Third owner and second owner coefficients have smaller negative effects on price, in that order of effectiveness.
- Year has the second smallest unit effect on the price, after second
owner.
- Again, keep in mind that year is numeric and its values range from 1995 to 2020, and the other predictors are all factor predictors with 2-3 levels, so year has a bigger potential effect on price overall.
The ranges of the coefficient posterior distributions give us an idea about their uncertainty.
- Year’s coefficient has the narrowest range, which means the effect of this variable has the least amount of uncertainty.
- In contrast, third owner’s coefficient has the widest range, making it the variable with the most uncertain effect.
- This is expected, as there are very few observations of the second and third owner types.
The BLR model also gives us a posterior probability distribution for the
predicted values of the outcome variable. We can plot this against the
distribution of the original selling price in our dataset, to check if
the predictions fit well with the original data.
Overall, the posterior probability distributions are reasonably close to
the distribution of the original outcome variable, though there are
considerable deviations in certain ranges. We can see from these plots
that the log transformation brought the distribution of selling price
reasonably close to a normal distribution.
We can plot the average selling prices predicted by blr1, against the
actual values of selling price, just like we did with lm1. This will
yield very similar results, as the average estimates of blr1 are very
close to the estimates of lm1.
We can see that the average fitted values of blr1 are practically the
same with the fitted values of lm1. They overlap almost perfectly in the
plot, though blr1’s predictions are slightly higher.
While the OLS model lm1 generates one prediction per one data point, the BLR model blr1 generates 4,000 predictions per data point. We can visualize the range of predictions made by blr1 as a histogram, and overlay the single prediction made by lm1. Let’s do this for one observation in our dataset:
## name year selling_price km_driven fuel
## 1048 Ford Ecosport 1.5 DV5 MT Ambiente 2013 500 60 Diesel
## seller_type transmission owner
## 1048 Individual Manual Second Owner
Let’s predict a single price for this car, using lm1, and 4,000 prices, using blr1. Let’s plot the prices predicted by blr1 as a histogram, along with the point prediction of lm1, and the original price of the car.
lm1’s point prediction for this observation is 390k, underestimating the
real price of 500k. However, blr1’s posterior prediction tells us more
about the degree of uncertainty in the predictions of price:
- Roughly 50% of blr1’s predictions fall between 250k and 650k, with the median prediction of 385k. This shows us that blr1 tends to underestimate the price of this particular car.
- The full range of blr1’s predictions for this car goes as low as 70k, and as high as 2.4m.
- Just like the distribution of selling prices in the original dataset, the posterior predictive distribution is very right-skewed with large outliers.
Our OLS linear regression model, lm1, yielded a decent fit for our data, generally satisfying the assumptions of OLS linear regression, or coming close.
- The model explained 67% of the variance in used car selling prices, using the highly significant predictor variables year, fuel type, transmission type and owner type.
- The model is prone to underpredicting the price of most higher-priced cars, especially those with selling prices in millions.
Our BLR model, blr1, yielded average estimates for model parameters and average predictions close to those of lm1.
- However, we were able to get a better understanding of the uncertainty in our model, thanks to the posterior distributions generated by blr1.
- blr1 identified transmission type as the predictor with the strongest unit effect on selling price, while the coefficient for year had the lowest uncertainty, and the coefficient for third owner type had the highest uncertainty.
- The posterior predictive distributions from our model, just like the original selling price variable, follow a very right-skewed distribution, with large outliers.
Overall, both models confirmed the predictor variables used as very significant predictors of selling price, but the overall predictions can be very inaccurate, especially for higher priced cars.
- Since we would expect different price dynamics for different segments of cars, removing the high-priced segment and predicting the prices of “ordinary” cars is likely to yield more accurate predictions.
- However, with less data points for the model, we can expect a lower amount of variance explained. There are likely other significant factors at play in used car prices, and without bringing them into our model, we can’t improve our predictions by a lot.