-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-data.Rmd
40 lines (29 loc) · 1.85 KB
/
02-data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
# Data
Before doing any analysis, the factors within the dataset were first checked for missing or invalid data. The individual factors can be described as follows: 15 continuous, 10 nominal, and 1 integer.
Seven of the factors contained missing or improperly coded data. In this dataset in partular, all missing data has been coded with the value of `?`. In all cases below, the records containing the missing data have been removed.
Index Factor Number of records missing a value
----- ----------------- ---------------------------------
2 normalized-losses 41
6 num-of-doors 2
19 bore 4
20 stroke 4
22 horsepower 2
23 peak-rpm 2
26 price 4
Of the original 205 records, X were removed because they contained missing data for the `normalized-lossess` factor, which was coded as a `?`. This resulted in a dataset of 164 records of clean data. No other factors needed cleaning up, as the data was properly coded for each record.
```{r data-dictionary, echo=FALSE}
data_dict <- readxl::read_xlsx(path = "data/data-dictionary.xlsx")
knitr::kable(data_dict, caption = 'Data Dictionary - Initial', booktabs = TRUE)
```
Of these factors, 10 of the initial 26 were removed, resulting in the 16 factors that will be used in analysis. These factors are noted in green in `Keep` column of the above table.
```{r raw-data, echo=FALSE}
data_dict <- readxl::read_xlsx(path = "data/data-dictionary.xlsx")
data_clean <- read.csv("data/data-clean.csv", header = TRUE)
names(data_clean) <- data_dict[['Description']]
```
The objective factor in the dataset is determined to be `symboling`.
Next, the data was partitioned into three groups named _training_, _test_, and _validation_. This was
```{r data-summary-stats, echo=FALSE}
data <- read.csv("data/data-clean.csv")
summary(data)
```