---
title: "CPSC 375 Homework 7"
author: "Kenn Son, Hamid Suda, Vivian Truong"
date: "4/5/2022"
output:
  pdf_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r intro}
library(class)
library(tidyverse)
```
1. Consider the toy dataset below which shows if 4 subjects have diabetes or not, along with two diagnostic measurements.
| Preg | BP | HasDiabetes | Preg.Norm | BP.Norm |  Euc.OG  | Euc.Norm  |
|:----:|:--:|:-----------:|:---------:|:-------:|:--------:|:---------:|
|  2   | 74 |     No      |    0.5    |   1.0   |    4     | 0.2       |
|  3   | 58 |     Yes     |    1.0    |   0.2   | 12.04159 | 0.781025  |
|  2   | 58 |     Yes     |    0.5    |   0.2   |   12     | 0.6       |
|  1   | 54 |     No      |    0.0    |   0.0   | 16.03122 | 0.9433981 |
|  2   | 70 |      ?      |    0.5    |   0.8   |    0     | 0         |
| a) Which variable is the “Class” variable?
| HasDiabetes is the "Class" variable.
| b) Normalize the Preg and BP values by scaling the minimum-maximum range of each column to 0-1. Fill in the empty columns in the table.
|
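| Min-max normalization rescales each column so that its minimum maps to 0 and its maximum maps to 1:

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)}$$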
```{r 1b}
bp <- c(74, 58, 58, 54, 70)
preg <- c(2, 3, 2, 1, 2)
# min-max normalization: rescale x so min(x) maps to 0 and max(x) maps to 1
normalize <- function(x) { return((x - min(x)) / (max(x) - min(x))) }
normalize(bp)
normalize(preg)
```
| c) Predict whether a subject with Preg=2, BP=70 will have diabetes using the 1-NN algorithm and
| i) Using Euclidean distance on the original variables
| Using Euclidean distance on the original variables, the nearest neighbor is Row 1 at a distance of 4, so the prediction is that the subject does not have diabetes (HasDiabetes = No).
| ii) Using Euclidean distance on the normalized variables
| Using Euclidean distance on the normalized variables, the nearest neighbor is again Row 1, now at a distance of 0.2, so the prediction is again No; both sets of distances are verified in the chunk below.
| For each of these cases, give the nearest distance, nearest neighbor (e.g., “Row 1” or “Row 2”), and prediction.
|
|
|
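| As a sanity check, both distance columns in the table can be reproduced in R; element 5 of each vector below is the query subject (Preg=2, BP=70), so its distance to itself is 0:
```{r 1c}
# Euclidean distances from every row to the query subject (row 5)
euc.og <- sqrt((preg - preg[5])^2 + (bp - bp[5])^2)
preg.n <- normalize(preg)
bp.n <- normalize(bp)
euc.norm <- sqrt((preg.n - preg.n[5])^2 + (bp.n - bp.n[5])^2)
euc.og
euc.norm
```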
2. The pima-indians-diabetes-resampled.csv file on Canvas contains records indicating whether the subjects have diabetes or not, along with certain diagnostic measurements. All subjects are of Pima Indian heritage and this dataset is called the Pima Indian Diabetes Database. The goal is to see if it is possible to predict if a subject has diabetes given some of the diagnostic measurements. (Note: this problem is an extension of the classwork assignment; R code from the class is also posted on Canvas.)
| a) Read the data file [code]
```{r 2a}
setwd("~/Documents/CPSC-375")   # adjust to the folder containing the data file
pih <- read.csv("pima-indians-diabetes-resampled.csv")
```
| b) What does “Preg” represent in the dataset? (2-3 sentences. Search for the Pima Indian Diabetes Database online and read up on its background.)
| "Preg" represents the number of times of pregnacy the patient has had.
| Backgroud: All patients are female and 21 years old of pima indian heritage.
| c) 0 values in the Glucose column indicate missing values. Remove rows which contain missing values in the Glucose column. You should have 763 rows. [code]
```{r 2c}
pih.no_gluc <- pih %>% filter(Glucose > 0)   # drop rows where Glucose is missing (coded as 0)
pih.no_gluc %>% nrow()                       # should be 763
```
| d) Create three new columns/variables which are the normalized versions of Preg, Pedigree, and Glucose columns, scaling the minimum-maximum range of each column to 0-1 (you can use the code developed in class). [code]
```{r 2d}
pih <- pih.no_gluc %>% mutate(Preg.Norm = normalize(Preg),
                              Pedigree.Norm = normalize(Pedigree),
                              Glucose.Norm = normalize(Glucose))
```
| e) Split the dataset into train and test datasets with the first 500 rows for training, and the remaining rows for test. Do NOT randomly sample the data (though resampling is usually done, this hw problem does not use this step for ease of grading).
```{r 2e}
trainindex <- 1:500                             # first 500 rows for training
testindex <- setdiff(1:nrow(pih), trainindex)   # remaining 263 rows for test
```
| f) Train and test a k-nearest neighbor classifier with the dataset. Consider only the normalized Preg and Pedigree columns. Set k=1. What is the error rate (number of misclassifications)? [code, error rate]
```{r 2f}
trainfeatures <- pih[trainindex, c(10, 11)]   # normalized Preg and Pedigree
trainlabels <- pih[trainindex, 9]             # class labels
testindex <- setdiff(1:nrow(pih), trainindex)
testfeatures <- pih[testindex, c(10, 11)]
testlabels <- pih[testindex, 9]
predicted <- knn(train = trainfeatures, test = testfeatures, cl = trainlabels, k = 1)
table(testlabels, predicted)
## confusion matrix from our run:
##           predicted
## testlabels   0   1
##          0 120  50
##          1  55  38
```
| (FP + FN)/ALL = (55+50)/263 = 0.3992395 = 39.92%
|
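| Equivalently, the error rate is the fraction of test predictions that disagree with the true labels:
```{r 2f-error}
mean(predicted != testlabels)   # misclassification rate on the test set
```
|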
| g) Repeat part (f) but consider the normalized Preg, Pedigree, and Glucose columns. Set k=1. What is the error rate? Will the error rate always decrease with a larger number of features? Why or why not: answer in 2-3 sentences? [code, error rate, answer]
```{r 2g}
trainfeatures <- pih[trainindex, c(10, 11, 12)]   # normalized Preg, Pedigree, and Glucose
trainlabels <- pih[trainindex, 9]
testindex <- setdiff(1:nrow(pih), trainindex)
testfeatures <- pih[testindex, c(10, 11, 12)]
testlabels <- pih[testindex, 9]
predicted <- knn(train = trainfeatures, test = testfeatures, cl = trainlabels, k = 1)
table(testlabels, predicted)
table(testlabels, predicted)
```
| (FP + FN)/ALL = (42+42)/263 = 0.3193916 = 31.94%
|
| Adding Glucose lowered the error rate from 39.92% to 31.94%, but the error rate will not always decrease with more features. A new feature only helps if it carries information about the class; an irrelevant or noisy feature distorts the distance calculation and can pull unrelated points closer together (the curse of dimensionality), which can increase the error rate.
|
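| One way to test this claim is to append a feature that carries no information and rerun 1-NN; a quick sketch (the random column, the seed, and the name Rand.Norm are our own additions, not part of the assignment):
```{r 2g-noise}
# append a purely random, uninformative feature (illustration only)
set.seed(1)
pih$Rand.Norm <- runif(nrow(pih))
cols <- c("Preg.Norm", "Pedigree.Norm", "Glucose.Norm", "Rand.Norm")
noisypred <- knn(train = pih[trainindex, cols], test = pih[testindex, cols],
                 cl = pih[trainindex, 9], k = 1)
mean(noisypred != pih[testindex, 9])   # compare with the 31.94% above
```
|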
| h) Repeat part (g) but set k=5. What is the error rate? [code, error rate]
```{r 2h}
trainfeatures <- pih[trainindex, c(10, 11, 12)]   # normalized Preg, Pedigree, and Glucose
trainlabels <- pih[trainindex, 9]
testindex <- setdiff(1:nrow(pih), trainindex)
testfeatures <- pih[testindex, c(10, 11, 12)]
testlabels <- pih[testindex, 9]
predicted <- knn(train = trainfeatures, test = testfeatures, cl = trainlabels, k = 5)
table(testlabels, predicted)
```
| (FP + FN)/ALL = (42+21)/263 = 0.2395437 = 23.95%
|
| i) Repeat part (h) but set k=11. What is the error rate? Considering your observations from (g)-(i), which is the best value for k? [code, error rate, answer]
```{r 2i}
trainfeatures <- pih[trainindex, c(10, 11, 12)]   # normalized Preg, Pedigree, and Glucose
trainlabels <- pih[trainindex, 9]
testindex <- setdiff(1:nrow(pih), trainindex)
testfeatures <- pih[testindex, c(10, 11, 12)]
testlabels <- pih[testindex, 9]
predicted <- knn(train = trainfeatures, test = testfeatures, cl = trainlabels, k = 11)
table(testlabels, predicted)
```
| (FP + FN)/ALL = (42+16)/263 = 0.2205323 = 22.05%
|
| We think the best value for k is 11, since it gave the lowest error rate of the values tested (22.05%, versus 23.95% at k=5 and 31.94% at k=1). Averaging over more neighbors smooths out the influence of individual noisy training points, though making k too large would eventually wash out the local structure.
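|
| To see the trend across several values of k in one pass, the comparison can be scripted; a short sketch (k=21 is an extra value we added for illustration, not part of the assignment):
```{r ksweep}
# error rate on the same train/test split for several k values
for (k in c(1, 5, 11, 21)) {
  pred <- knn(train = trainfeatures, test = testfeatures, cl = trainlabels, k = k)
  cat("k =", k, " error rate =", round(mean(pred != testlabels), 4), "\n")
}
```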