<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Practical Data Science with R Errata</title>
<style>
body {
font-size: 10pt;
margin: 1.5em;
background-color: lightblue;
color: darkblue;
font-family: Verdana,sans-serif;
}
h1 {
font-size: 1.2em;
font-weight: bold;
margin-top: 2em;
}
h2 {
font-size: 1.1em;
font-weight: bold;
}
fieldset {
width: 740px;
margin-bottom: 12px;
border-color: #00457b;
background-color: #cfeace;
}
fieldset div {
margin-bottom: 6px;
font-weight: normal;
}
legend {
border: 2px ridge #00457b;
font-size: 1.2em;
font-weight: bold;
background-color: #e36a51;
color: white;
padding: 8px 16px;
}
</style>
</head>
<body>
<fieldset>
<legend>Practical Data Science with R Errata</legend>
<div id="intro">
<a target="_blank" href="https://www.manning.com/books/practical-data-science-with-r"><img id="cover" src="https://manning-content.s3.amazonaws.com/book/d/7460e62-57ec-4954-82ec-e2e6558c374f/zumel.png" height="200" width="150" alt="Practical Data Science with R"></a>
<p>
Your authors (<a href="http://ninazumel.com/" target="_blank">Nina Zumel</a> and <a href="http://johnmount.com/" target="_blank">John Mount</a>)
and <a href="http://www.manning.com" target="_blank">Manning Publications Co.</a> have all worked very hard to make
<a target="_blank" href="https://manning.com/books/practical-data-science-with-r"><b><i>Practical Data Science with R</i></b></a> an excellent
and instructive book that is well worth your investment of time and money.
However, errors do creep in and we do feel bad about that.
To help, we are maintaining a public errata list
<a href="http://winvector.github.io/PDSwR/PracticalDataScienceWithRErrata.html" target="_blank">here</a> (and <a target="_blank" href="https://github.com/WinVector/zmPDSwR/blob/master/PracticalDataScienceWithRErrata.html">source</a>). We also have a <a target="_blank" href="http://practicaldatascience.com">central site for announcements</a>, example <a target="_blank" href="https://github.com/WinVector/zmPDSwR">data</a>, and <a target="_blank" href="https://github.com/WinVector/zmPDSwR/tree/master/CodeExamples">code</a>.
</p>
<p>
If you find any errors in <a target="_blank" href="https://manning.com/books/practical-data-science-with-r"><b><i>Practical Data Science with R</i></b></a>
not listed below, or
just find something that you think is not well explained,
please post in the book's
<a href="https://forums.manning.com/forums/practical-data-science-with-r" target="_blank">Online Forum</a>
so that they may be collected here for everyone's benefit. Thanks!
</p>
<i>Last updated December 31, 2018</i>
<ul>
<li>
<h2>Page 12, Listing 1.2</h2>
<p>
Start the listing with the line <code>creditdata &lt;- d</code>.
</p>
</li>
<li>
<h2>Page 25 Listing 2.8, page 30 Listing 2.11 and any other use of H2DB</h2>
<p>
Starting with version 1.4, the H2DB database requires that the
path specified for the database files be explicit: either absolute
or explicitly relative (starting with "./", "/", or "~/"). The book examples were
written using 1.3 versions of H2DB, which did not enforce this.
The fix is to edit the db definition xml files and R
connection commands to make sure
you have explicit paths. Use:
<code>jdbc:h2:./H2DB;...</code>
and not: <code>jdbc:h2:H2DB;...</code>.
</p>
</li>
<li>
<h2>Pages 25 through 29 section 2.2.1 A production-sized example</h2>
<p>
We no longer recommend using H2DB, Java, and Squirrel SQL to load data. Instead we recommend using R, <code>read.table</code>, and <code>dbWriteTable</code> with either PostgreSQL (for a lot of data) or RSQLite (for in-memory practice). Nina has a write-up of the replacement screwdriver concept <a href="http://www.win-vector.com/blog/2016/02/using-postgresql-in-r/">here</a>.
</p>
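<p>
A minimal sketch of the R-based loading workflow, assuming the <code>DBI</code> and <code>RSQLite</code> packages are installed. The table name <code>example_table</code> and the inline data are hypothetical stand-ins for a real input file.
</p>
<pre><code>library(DBI)

# Hypothetical data; in practice pass a file path to read.table().
d &lt;- read.table(text = "id,value
1,10
2,20
3,30", header = TRUE, sep = ",")

# In-memory SQLite database, suitable for practice only.
con &lt;- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "example_table", d)

# Confirm the rows arrived by counting them in SQL.
n &lt;- dbGetQuery(con, "SELECT COUNT(*) AS n FROM example_table")$n
dbDisconnect(con)
</code></pre>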
</li>
<li>
<h2>Page 86, Table 5.1, second row, second column</h2>
<p>
The random forests cross-reference should be to section 9.1.2.
</p>
</li>
<li>
<h2>Pages 125-138, Section 6.3 Building models using many variables</h2>
<p>We strongly recommend splitting your training set into
two pieces, and using one piece for the construction of single
variable models, and only the other disjoint portion of the training
data for model construction. The issue is any row of data examined
during single variable model construction is no longer exchangeable
with even test data (let alone future data). Models trained on rows
used to build the variable encodings tend to overestimate effect
sizes of the sub-models (or treated variables), underestimate degrees
of freedom, and get significances wrong. We think the KNN model in
this section happened to perform okay due to the aggressive variable
pruning heuristic in listing 6.11.</p>
<p>A rerun of chapter 6 with the stricter data separation
can be found
<a target="_blank" href="https://github.com/WinVector/zmPDSwR/tree/master/KDD2009">here</a>
(including a newer run automating a lot of the steps using
the
<a target="_blank" href="http://www.win-vector.com/blog/2014/08/vtreat-designing-a-package-for-variable-treatment/">vtreat</a>
library; files, of course, are in the course
downloads).</p>
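<p>One way to sketch the stricter separation described above. The data frame and the names <code>dEncode</code> and <code>dModel</code> are hypothetical, not the names used in the book's listings.</p>
<pre><code>set.seed(2014)  # arbitrary seed, only for reproducibility of this sketch

# Hypothetical training data.
dTrain &lt;- data.frame(x = rnorm(100), y = rbinom(100, 1, 0.5))

# Randomly assign each training row to exactly one of two disjoint pieces.
isEncode &lt;- runif(nrow(dTrain)) &lt; 0.5
dEncode &lt;- dTrain[isEncode, ]   # use only for building single-variable models
dModel &lt;- dTrain[!isEncode, ]   # use only for fitting the many-variable model
</code></pre>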
</li>
<li>
<h2>Page 120, Section 6.2.1 Using categorical features</h2>
<p>
It is a good idea to harden the look-up <code>vTab[pos,]</code> to <code>vTab[as.character(pos),]</code>
to defend against the irregular nature of <code>R</code>'s indexing operator (which would fail in this
case if we were to have <code>pos</code> as a logical instead of a string).
</p>
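<p>
A small illustration of the indexing issue, using a hypothetical table in place of the listing's <code>vTab</code>. A logical index is recycled across all rows, while a character index selects the row by name.
</p>
<pre><code># Hypothetical stand-in for vTab: outcome (rows) by variable level (columns).
vTab &lt;- table(
  outcome = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  level = c("a", "a", "b", "b", "b")
)
pos &lt;- TRUE  # a logical, not the string "TRUE"

allRows &lt;- vTab[pos, ]                 # logical TRUE is recycled: every row is returned
posRow &lt;- vTab[as.character(pos), ]    # selects only the row named "TRUE"
</code></pre>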
</li>
<li>
<h2>Page 126, Section 6.2.3 Using cross-validation</h2>
<p>
First paragraph: "cross-reference" should be "cross-validation."
</p>
</li>
<li>
<h2>Page 126, Listing 6.11 Basic variable selection</h2>
<p>
Second "Run through categorical variables and pick based
on a deviance improvement" should read "Run through numerical
variables and pick based on a deviance improvement."
</p>
</li>
<li>
<h2>Page 126, Text and Listing 6.11 Basic variable selection</h2>
<p>
The "2 times" in listing 6.11 is because statistical deviance is
traditionally written as -2*(log(p|model) - log(p|saturatedModel)).
We are using a "delta deviance" or improvement of deviance given by
<code>-2*(log(p|model) - log(p|saturatedModel)) - (-2*(log(p|nullModel) -
log(p|saturatedModel)))</code>. We wrote this as <code>2*log(p|model)-baseRateCheck</code>
and it is a deviance style metric (not an AIC style metric as claimed in the text,
which got behind the code zip file by one revision).
</p>
<p>The
text above listing 6.11 on page 126 claims we are going to use an AIC-like metric. The original
implementation did this by subtracting a variable entropy
estimate from each <code>liCheck</code> in the <code>catVars</code> list and subtracting 1
from each numeric variable (a placeholder). This turned out not to be
useful for the data set at hand (this is the sort of thing you do want to try variations of),
and evidently we removed it without
completely updating the text.
</p>
<p>Corrected text, listings, and results now
<a href="https://github.com/WinVector/zmPDSwR/blob/master/CodeExamples/c06_Memorization_methods/00081_example_6.11_of_section_6.3.1.R" target="_blank">here</a>
and <a href="https://github.com/WinVector/zmPDSwR/blob/master/CodeExamples/c06_Memorization_methods/00082_example_6.12_of_section_6.3.1.txt" target="_blank">here</a> .</p>
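<p>A minimal numeric sketch of a deviance improvement, using made-up outcomes and predictions rather than the book's data. The helper <code>logLik01</code> is hypothetical. For 0/1 outcomes the saturated model has log likelihood 0, and in any case the saturated term cancels in the difference.</p>
<pre><code># Bernoulli log likelihood of outcomes y (TRUE/FALSE) under predicted probabilities p.
logLik01 &lt;- function(y, p) sum(ifelse(y, log(p), log(1 - p)))

y &lt;- c(TRUE, TRUE, FALSE, FALSE)      # hypothetical outcomes
pModel &lt;- c(0.8, 0.7, 0.3, 0.2)       # hypothetical single-variable model predictions
pNull &lt;- rep(mean(y), length(y))      # null model: constant base rate

devModel &lt;- -2 * logLik01(y, pModel)  # deviance of the model
devNull &lt;- -2 * logLik01(y, pNull)    # deviance of the null model

# Improvement in deviance: positive when the model beats the base rate.
improvement &lt;- devNull - devModel
</code></pre>
<p>Here <code>improvement</code> equals <code>2*(logLik01(y, pModel) - logLik01(y, pNull))</code>, the "2 times" form discussed above.</p>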
</li>
<li>
<h2>Page 130, How decision tree models work</h2>
<p>In the text explaining the decision tree
"predVar126 &lt; -0.002810871" should read
"predVar126 &lt; 0.07366888".
</p>
</li>
<li>
<h2>Page 146, Using categorical features, Listing 6.4</h2>
<p>Replace "pPosWna &lt;- (naTab/sum(naTab))[pos]" with
"pPosWna &lt;- (naTab/sum(naTab))[as.character(pos)]" to get around
bad indexing issues due to R's treatment of logical types and expansion of vector lengths.
</p>
</li>
<li>
<h2>Page 148, Are there systematic errors?</h2>
<h2>Page 224, Footnote 5</h2>
<p>
Our description of "heteroscedasticity" is wrong and
(unintentionally) conflates a number of different negative issues.
Our intent was to expand on a simple definition such as "In
statistics, a collection of random variables is
heteroscedastic if there are sub-populations that have
different variabilities from others" (taken
from <a target="_blank"
href="http://en.wikipedia.org/wiki/Heteroscedasticity">Wikipedia:
Heteroscedasticity</a>). Unfortunately we pushed this too
far (roughly saying "it is bad if errors correlate with y's"
when we
meant to say "it is bad if errors correlate with unobserved
ideal y's which don't already have the error term added in", i.e. with functions of the x's).
</p>
<p>The correct thing to do is state that regression
(and in particular the diagnostics of a regression) depends
on a number of assumptions about the data. Important properties
include (paraphrased from Gelman, Carlin,
Stern, Dunson, Vehtari, and Rubin "Bayesian
Data Analysis" 3rd edition pp. 369-376) model
structure, linearity of expected value of y as a function
of the x's, bias of error terms, normality of error terms,
independence of observations, and constancy of variance.
For more see
<a target="_blank" href="http://andrewgelman.com/2013/08/04/19470/">What are the key assumptions of linear regression?</a>.
</p>
<p>
Also, the frequentist version of regression is commonly said to make no assumptions
on the distribution of data or parameters (only on distribution of errors).
This is not quite true as even the frequentist analysis depends on independence
of observations (which is a fact about x's in addition to y's and errors).
And a Bayesian treatment may need to assume some form of priors on one
or more of these to get well behaved detailed posterior estimates.
</p>
<p>We will correct later editions of the book, and hope
our error does not overly distract from the important
lesson that you must be aware of modeling assumptions.
We do note with some amusement how rarely "Bayesian
Data Analysis" 3rd edition pp. 369-376 uses the
term "heteroscedasticity" even though these are
the pages tagged with this term in the index (these authors
rightly emphasize concepts over naming).
</p>
</li>
<li>
<h2>Page 156, Section 7.1.5 Reading the model summary and characterizing coefficient quality</h2>
<p>Normality of errors is desirable, but not necessary for linear regression (see
<a target="_blank" href="https://en.wikipedia.org/wiki/Linear_regression#Assumptions">Wikipedia</a>
and <a target="_blank" href="http://andrewgelman.com/2013/08/04/19470/">What are the key assumptions of linear regression?</a>). We had previously said the errors must be normally distributed (and we certainly did not mean to imply the <code>y</code> or <code>x</code> need to be normally distributed).</p>
<p>Replace "<b>INTERPRETING MODEL SIGNIFICANCES</b>" with:</p>
<p><b>INTERPRETING MODEL SIGNIFICANCES</b></p>
<p>Many of the tests of linear regression, including the tests for coefficient and model significance, are sensitive to model assumptions. It's important to examine the residuals graphically or using quantile analysis to determine if the regression model is behaving as expected.</p>
</li>
<li>
<h2>Page 157, Section 7.2.1 Understanding logistic regression</h2>
<p>Unfortunate typo: the sigmoid function is <code>s(z) = 1/(1+exp(-z))</code>
(not <code>1/(1+exp(z))</code>).
</p>
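<p>As a quick check of the corrected formula, a short sketch (the function name <code>sigmoid</code> is ours, not from the book's listings):</p>
<pre><code># Corrected sigmoid: maps any real z into the interval (0, 1).
sigmoid &lt;- function(z) 1 / (1 + exp(-z))

sigmoid(0)                     # 0.5, the midpoint
sigmoid(2) > sigmoid(-2)       # TRUE: the function is increasing in z

# The logit (log-odds) inverts the sigmoid.
p &lt;- sigmoid(1.5)
logit &lt;- log(p / (1 - p))
</code></pre>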
</li>
<li>
<h2>Page 180, Section 8.1.2, Preparing the data</h2>
<p>After calling <code>scale()</code>, always clear the scaling attributes from the result with <code>attr(., "scaled:center") = NULL; attr(., "scaled:scale") = NULL</code>. Otherwise, for a <code>prcomp</code> result <code>princ</code>, <code>predict(princ, .)</code> does not always equal the desired projection <code>. %*% princ$rotation</code>.
</p>
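<p>A minimal sketch of the attribute clearing on made-up data. The matrix <code>x</code> is hypothetical, and <code>princ</code> follows the name used above.</p>
<pre><code>set.seed(1)  # arbitrary seed for this sketch
x &lt;- matrix(rnorm(20), ncol = 2)   # hypothetical data: 10 rows, 2 columns

# Scale, then strip the attributes that scale() attaches to its result.
xs &lt;- scale(x)
attr(xs, "scaled:center") &lt;- NULL
attr(xs, "scaled:scale") &lt;- NULL

# Principal components and the explicit projection onto them.
princ &lt;- prcomp(xs)
projected &lt;- xs %*% princ$rotation
</code></pre>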
</li>
<li>
<h2>Page 187, Section 8.1.3, Total within sum of squares</h2>
<p>Typo: the text should read "The within sum of squares for a single cluster is the total squared distance ..." (not average). The code is correct.
</p>
</li>
<li>
<h2>Page 205, Section 8.1.1 Distances, Cosine Similarity</h2>
<p>Change "Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors (the angle theta between two vectors is assumed to be between 0 and 90 degrees)." to "Cosine similarity is a common similarity metric in text analysis. It measures the smallest angle between two vectors (in our text example we assume non-negative vectors, so the angle theta between two vectors is between 0 and 90 degrees)."</p>
</li>
<li>
<h2>Page 206, Section 8.2.3 Mining associations with the arules package, Listing 8.20</h2>
<p><code>interestMeasure</code> now uses the argument <code>measure=</code> instead of <code>method=</code> (which now fails with an error). Switching the argument name brings the code up to date.</p>
</li>
<li>
<h2>Page 214, Listing 9.1 Preparing Spambase data and evaluating the performance of decision trees</h2>
<p>Bug: F1 is the harmonic mean of precision and recall: <code>2*precision*recall/(precision+recall)</code>.
See Page 96, Section 5.2.1 "F1" for the correct definition.
</p>
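<p>The corrected definition can be sketched as a small helper (the function name <code>f1</code> is ours, not the name used in the listing):</p>
<pre><code># F1: harmonic mean of precision and recall.
f1 &lt;- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

f1(0.5, 0.5)   # 0.5: equal precision and recall give that same value
f1(1.0, 0.5)   # about 0.667: lower than the arithmetic mean of 0.75
</code></pre>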
</li>
<li>
<h2>Page 230, Listing 8.17 examining the size distribution</h2>
<p>Remove the <code>binwidth=1</code> argument.</p>
</li>
<li>
<h2>Page 387, sqldf index entry</h2>
<p>(Not an error) Another important cross-reference for sqldf
is page 327, where we show an option that prevents an R crash
on OS X.
</p>
</li>
</ul>
</fieldset>
</body>
</html>