02-data.Rmd

# Data

Before doing any analysis, the factors within the dataset were first checked for missing or invalid data. The individual factors can be described as follows: 15 continuous, 10 nominal, and 1 integer.

Seven of the factors contained missing or improperly coded data. In this dataset in partular, all missing data has been coded with the value of `?`. In all cases below, the records containing the missing data have been removed.  

Index  Factor             Number of records missing a value
-----  -----------------  ---------------------------------
2      normalized-losses  41
6      num-of-doors       2
19     bore               4
20     stroke             4
22     horsepower         2
23     peak-rpm           2
26     price              4

Of the original 205 records, X were removed because they contained missing data for the `normalized-lossess` factor, which was coded as a `?`. This resulted in a dataset of 164 records of clean data. No other factors needed cleaning up, as the data was properly coded for each record.

```{r data-dictionary, echo=FALSE}
data_dict <- readxl::read_xlsx(path = "data/data-dictionary.xlsx")
knitr::kable(data_dict, caption = 'Data Dictionary - Initial', booktabs = TRUE)
```

Of these factors, 10 of the initial 26 were removed, resulting in the 16 factors that will be used in analysis. These factors are noted in green in `Keep` column of the above table.

```{r raw-data, echo=FALSE}
data_dict  <- readxl::read_xlsx(path = "data/data-dictionary.xlsx")
data_clean <- read.csv("data/data-clean.csv", header = TRUE)
names(data_clean) <- data_dict[['Description']]
```

The objective factor in the dataset is determined to be `symboling`.

Next, the data was partitioned into three groups named _training_, _test_, and _validation_. This was 

```{r data-summary-stats, echo=FALSE}
data <- read.csv("data/data-clean.csv")
summary(data)
```