Commit b9eef83

author
maechler
committed
move model-fitting-functions.txt to markdown (and html)
git-svn-id: https://svn.r-project.org/R-dev-web/trunk@3379 c52295ea-58df-0310-926a-d16021944841
1 parent 3ce601c commit b9eef83

4 files changed

+342
-4
lines changed

Diff for: Makefile

+12
@@ -0,0 +1,12 @@
1+
2+
HTML_FILES := model-fitting-functions.html
3+
4+
all: $(HTML_FILES)
5+
6+
$(HTML_FILES): %.html: %.md
7+
	pandoc '$<' -o '$@' --standalone --smart
8+
9+
.PHONY: clean
10+
clean:
11+
	$(RM) $(HTML_FILES)
12+

Diff for: index.html

+3-4
@@ -193,6 +193,9 @@ <h3>Pointers</h3>
193193
<li> A <a href="parseRd.pdf">description</a> of the parse_Rd() parser for Rd files. This
194194
document also includes a draft description of the new facility for executing R code
195195
within Rd man pages.
196+
<li>How to write <a href="model-fitting-functions.html">model-fitting</a>
197+
functions in R, and especially how to enable all the safety features.
198+
</li>
196199
<li>A list of things to consider for a possible <a
197200
href="new-homepage.html" >re-design of the R homepage.</a>
198201
<li>A brief writeup on <a href="rtags.html">how to tag R
@@ -392,10 +395,6 @@ <h3>Older Material</h3>
392395
function</a> and <a href="formulasFrame.Rd">docn</a> for making model
393396
frames from multiple formulas.
394397
</li>
395-
<li>Notes on <a href="model-fitting-functions.txt">model-fitting</a>
396-
functions in R, and especially on how to enable all the safety
397-
features.
398-
</li>
399398
</ul>
400399

401400
<li> the <a href=ideas.txt> Ideas List </a></li>

Diff for: model-fitting-functions.html

+125
@@ -0,0 +1,125 @@
1+
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2+
<html xmlns="http://www.w3.org/1999/xhtml">
3+
<head>
4+
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
5+
<meta http-equiv="Content-Style-Type" content="text/css" />
6+
<meta name="generator" content="pandoc" />
7+
<meta name="author" content="Brian Ripley, Nov. 2003; R Core Team" />
8+
<title>Model Fitting Functions in R</title>
9+
<style type="text/css">code{white-space: pre;}</style>
10+
</head>
11+
<body>
12+
<div id="header">
13+
<h1 class="title">Model Fitting Functions in R</h1>
14+
<h2 class="author">Brian Ripley, Nov. 2003; R Core Team</h2>
15+
</div>
16+
<h1 id="how-to-write-model-fitting-functions-in-r">How To Write Model-Fitting Functions in R</h1>
17+
<p>This page documents some of the features that are available to model-fitting functions in R, and especially the safety features that can (and should) be enabled.</p>
18+
<p>By model-fitting functions we mean functions like lm() which take a formula, create a model frame and perhaps a model matrix, and have methods (or use the default methods) for many of the standard accessor functions such as coef(), residuals() and predict().</p>
19+
<p>A fairly complete list of such functions in the standard and recommended packages is</p>
20+
<ul>
21+
<li><p>stats: aov (via lm), lm, glm, ppr</p></li>
22+
<li>MASS: glm.nb, glmmPQL, lda, lm.gls, lqs, polr, qda, rlm</li>
23+
<li>mgcv: mgcv</li>
24+
<li>nlme: gls, lme</li>
25+
<li>nnet: multinom, nnet</li>
26+
<li><p>survival: coxph, survreg</p></li>
27+
</ul>
28+
<p>and those not using a model matrix or not intended to handle categorical predictors</p>
29+
<ul>
30+
<li><p>stats: factanal, loess, nls, prcomp, princomp</p></li>
31+
<li>MASS: loglm</li>
32+
<li>nlme: gnls, nlme</li>
33+
<li><p>rpart: rpart</p></li>
34+
</ul>
35+
<h2 id="standard-behaviour-for-a-fitting-function">Standard behaviour for a fitting function</h2>
36+
<p>The annotations in the following simplified version of <code>lm</code> (its current source is in <a href="https://svn.r-project.org/R/trunk/src/library/stats/R/lm.R" class="uri">https://svn.r-project.org/R/trunk/src/library/stats/R/lm.R</a>) indicate what is standard for a model-fitting function.</p>
37+
<pre class="{r}"><code>lm &lt;- function (formula, data, subset, weights, na.action,
38+
method = &quot;qr&quot;, model = TRUE, contrasts = NULL, offset, ...)
39+
{
40+
cl &lt;- match.call()
41+
42+
## keep only the arguments which should go into the model frame
43+
mf &lt;- match.call(expand.dots = FALSE)
44+
m &lt;- match(c(&quot;formula&quot;, &quot;data&quot;, &quot;subset&quot;, &quot;weights&quot;, &quot;na.action&quot;,
45+
&quot;offset&quot;), names(mf), 0)
46+
mf &lt;- mf[c(1, m)]
47+
mf$drop.unused.levels &lt;- TRUE
48+
mf[[1]] &lt;- quote(stats::model.frame) # was as.name(&quot;model.frame&quot;), but
49+
## need &quot;stats:: ...&quot; for non-standard evaluation
50+
mf &lt;- eval.parent(mf)
51+
if (method == &quot;model.frame&quot;) return(mf)
52+
53+
## 1) allow model.frame to update the terms object before saving it.
54+
mt &lt;- attr(mf, &quot;terms&quot;)
55+
y &lt;- model.response(mf, &quot;numeric&quot;)
56+
57+
## 2) retrieve the weights and offset from the model frame so
58+
## they can be functions of columns in arg data.
59+
w &lt;- model.weights(mf)
60+
offset &lt;- model.offset(mf)
61+
x &lt;- model.matrix(mt, mf, contrasts)
62+
## if any subsetting is done, retrieve the &quot;contrasts&quot; attribute here.
63+
64+
z &lt;- lm.fit(x, y, offset = offset, ...)
65+
class(z) &lt;- c(if(is.matrix(y)) &quot;mlm&quot;, &quot;lm&quot;)
66+
67+
## 3) return the na.action info
68+
z$na.action &lt;- attr(mf, &quot;na.action&quot;)
69+
z$offset &lt;- offset
70+
71+
## 4) return the contrasts used in fitting: possibly as saved earlier.
72+
z$contrasts &lt;- attr(x, &quot;contrasts&quot;)
73+
74+
## 5) return the levelsets for factors in the formula
75+
z$xlevels &lt;- .getXlevels(mt, mf)
76+
z$call &lt;- cl
77+
z$terms &lt;- mt
78+
if (model) z$model &lt;- mf
79+
z
80+
}</code></pre>
81+
<p>Note that if this approach is taken, any defaults for arguments handled by model.frame are never invoked (the defaults in model.frame.default are used) so it is good practice not to supply any. (This behaviour can be overruled, and is by e.g. rpart.)</p>
82+
<p>If this is done, the following pieces of information are stored with the model object:</p>
83+
<ul>
84+
<li><p>The model frame (unless argument model=FALSE). This is useful to avoid scoping problems if the model frame is needed later (most often by predict methods).</p></li>
85+
<li><p>What contrasts and levels were used when coding factors to form the model matrix, and these plus the model frame allow the re-creation of the model matrix. (The real lm() allows the model matrix to be saved, but that is provided for S compatibility, and is normally a waste of space.)</p></li>
86+
<li><p>The na.action results are recorded for use in forming residuals and fitted values/prediction from the original data set.</p></li>
87+
<li>The terms component records</li>
88+
<li>environment(formula) as its environment,</li>
89+
<li>details of the classes supplied for each column of the model frame as attribute “dataClasses”,</li>
90+
<li><p>in the “predvars” attribute, calls to functions such as bs() and poly() which should be used for prediction from a new dataset. (See ?makepredictcall for the details.)</p></li>
91+
</ul>
92+
<p>Some of these are used automatically but most require code in class-specific methods.</p>
93+
<h2 id="residualsfittedweights-methods">residuals/fitted/weights methods</h2>
94+
<p>To make use of na.action options like na.exclude, the fitted() method needs to be along the lines of</p>
95+
<pre><code>fitted.default &lt;- function(object, ...)
96+
napredict(object$na.action, object$fitted.values)</code></pre>
97+
<p>For the residuals() method, replace napredict by naresid (although for all current na.action’s they are the same, this need not be the case in future).</p>
98+
<p>Similar code with a call to naresid is needed in a weights() method.</p>
99+
<h2 id="predict-methods">predict() methods</h2>
100+
<p>Prediction from the original data used in fitting will often be covered by the <code>fitted()</code> method, unless s.e.’s or confidence/prediction intervals are required.</p>
101+
<p>If a <code>newdata</code> argument is supplied, most methods will need to create a model matrix as if the newdata had originally been used (but with na.action as set on the predict method, defaulting to na.pass). A typical piece of code is</p>
102+
<pre><code> m &lt;- model.frame(Terms, newdata, na.action = na.action,
103+
xlev = object$xlevels)
104+
if(!is.null(cl &lt;- attr(Terms, &quot;dataClasses&quot;))) .checkMFClasses(cl, m)
105+
X &lt;- model.matrix(Terms, m, contrasts = object$contrasts)</code></pre>
106+
<p>Note the use of the saved levels and saved contrasts, and the safety check on the classes of the variables found by model.frame (which of course need not be found in <code>newdata</code>). Safe prediction from terms involving poly(), bs() and so on will happen without needing any code in the predict() method as this is handled in model.frame.default().</p>
107+
<p>If your code is like rpart() and handles ordered and unordered factors differently use <code>.checkMFClasses(cl, m, TRUE)</code> — this is not needed for code like lm() as both the set of levels of the factors and the contrasts used at fit time are recorded in the fit object and retrieved by the predict() method.</p>
108+
<h2 id="model.frame-methods">model.frame() methods</h2>
109+
<p>model.frame() methods are most often used to retrieve or recreate the model frame from the fitted object, with no other arguments. For fitting functions following the standard pattern outlined in this document no method is needed: as from R 1.9.0 model.frame.default() will work.</p>
110+
<p>One reason that a special method might be needed is to retrieve columns of the data frame that correspond to arguments of the original call other than <code>formula</code>, <code>subset</code> and <code>weights</code>: for example the glm method handles <code>offset</code>, <code>etastart</code> and <code>mustart</code>.</p>
111+
<p>If you have a <code>model.frame()</code> method it should</p>
112+
<ul>
113+
<li><p>return the <code>model</code> component of the fit if it is present and no other arguments are supplied.</p></li>
114+
<li><p>establish a suitable environment within which to look for variables. The standard recipe is</p></li>
115+
</ul>
116+
<pre><code> fcall &lt;- formula$call
117+
## drop unneeded args
118+
fcall[[1]] &lt;- as.name(&quot;model.frame&quot;)
119+
if (is.null(env &lt;- environment(formula$terms))) env &lt;- parent.frame()
120+
eval(fcall, env)</code></pre>
121+
<ul>
122+
<li>allow <code>...</code> to specify at least <code>data</code>, <code>na.action</code> or <code>subset</code>.</li>
123+
</ul>
124+
</body>
125+
</html>

Diff for: model-fitting-functions.md

+202
@@ -0,0 +1,202 @@
1+
---
2+
title: Model Fitting Functions in R
3+
author: Brian Ripley, Nov. 2003; R Core Team
4+
---
5+
How To Write Model-Fitting Functions in R
6+
=========================================
7+
8+
This page documents some of the features that are available to
9+
model-fitting functions in R, and especially the safety features that
10+
can (and should) be enabled.
11+
12+
By model-fitting functions we mean functions like lm() which take a
13+
formula, create a model frame and perhaps a model matrix, and have
14+
methods (or use the default methods) for many of the standard accessor
15+
functions such as coef(), residuals() and predict().
16+
17+
A fairly complete list of such functions in the standard and
18+
recommended packages is
19+
20+
- stats: aov (via lm), lm, glm, ppr
21+
22+
- MASS: glm.nb, glmmPQL, lda, lm.gls, lqs, polr, qda, rlm
23+
- mgcv: mgcv
24+
- nlme: gls, lme
25+
- nnet: multinom, nnet
26+
- survival: coxph, survreg
27+
28+
and those not using a model matrix or not intended to handle
29+
categorical predictors
30+
31+
- stats: factanal, loess, nls, prcomp, princomp
32+
33+
- MASS: loglm
34+
- nlme: gnls, nlme
35+
- rpart: rpart
36+
37+
38+
Standard behaviour for a fitting function
39+
-----------------------------------------
40+
41+
The annotations in the following simplified version of `lm` (its current
42+
source is in <https://svn.r-project.org/R/trunk/src/library/stats/R/lm.R>) indicate
43+
what is standard for a model-fitting function.
44+
45+
```{r}
46+
lm <- function (formula, data, subset, weights, na.action,
47+
method = "qr", model = TRUE, contrasts = NULL, offset, ...)
48+
{
49+
cl <- match.call()
50+
51+
## keep only the arguments which should go into the model frame
52+
mf <- match.call(expand.dots = FALSE)
53+
m <- match(c("formula", "data", "subset", "weights", "na.action",
54+
"offset"), names(mf), 0)
55+
mf <- mf[c(1, m)]
56+
mf$drop.unused.levels <- TRUE
57+
mf[[1]] <- quote(stats::model.frame) # was as.name("model.frame"), but
58+
## need "stats:: ..." for non-standard evaluation
59+
mf <- eval.parent(mf)
60+
if (method == "model.frame") return(mf)
61+
62+
## 1) allow model.frame to update the terms object before saving it.
63+
mt <- attr(mf, "terms")
64+
y <- model.response(mf, "numeric")
65+
66+
## 2) retrieve the weights and offset from the model frame so
67+
## they can be functions of columns in arg data.
68+
w <- model.weights(mf)
69+
offset <- model.offset(mf)
70+
x <- model.matrix(mt, mf, contrasts)
71+
## if any subsetting is done, retrieve the "contrasts" attribute here.
72+
73+
z <- lm.fit(x, y, offset = offset, ...)
74+
class(z) <- c(if(is.matrix(y)) "mlm", "lm")
75+
76+
## 3) return the na.action info
77+
z$na.action <- attr(mf, "na.action")
78+
z$offset <- offset
79+
80+
## 4) return the contrasts used in fitting: possibly as saved earlier.
81+
z$contrasts <- attr(x, "contrasts")
82+
83+
## 5) return the levelsets for factors in the formula
84+
z$xlevels <- .getXlevels(mt, mf)
85+
z$call <- cl
86+
z$terms <- mt
87+
if (model) z$model <- mf
88+
z
89+
}
90+
```
91+
92+
Note that if this approach is taken, any defaults for arguments
93+
handled by model.frame are never invoked (the defaults in
94+
model.frame.default are used) so it is good practice not to supply
95+
any. (This behaviour can be overruled, and is by e.g. rpart.)
96+
97+
If this is done, the following pieces of information are stored with
98+
the model object:
99+
100+
* The model frame (unless argument model=FALSE). This is useful to
101+
avoid scoping problems if the model frame is needed later (most
102+
often by predict methods).
103+
104+
* What contrasts and levels were used when coding factors to form the
105+
model matrix, and these plus the model frame allow the re-creation
106+
of the model matrix. (The real lm() allows the model matrix to be
107+
saved, but that is provided for S compatibility, and is normally a
108+
waste of space.)
109+
110+
* The na.action results are recorded for use in forming residuals and
111+
fitted values/prediction from the original data set.
112+
113+
* The terms component records
114+
- environment(formula) as its environment,
115+
- details of the classes supplied for each column of the model frame
116+
as attribute "dataClasses",
117+
- in the "predvars" attribute, calls to functions such as bs() and
118+
poly() which should be used for prediction from a new dataset.
119+
(See ?makepredictcall for the details.)
120+
121+
Some of these are used automatically but most require code in
122+
class-specific methods.
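
As an illustration of the stored information listed above, the following
sketch fits an ordinary `lm()` to a small made-up data frame (the data
and variable names are invented for this example) and inspects the
components that later methods rely on.

```r
## Made-up data: one factor predictor and one missing response value.
d <- data.frame(y = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
                x = c(1, 2, 3, 4, 5, 6),
                g = factor(c("a", "b", "a", "b", "a", "b")))

fit <- lm(y ~ poly(x, 2) + g, data = d, na.action = na.exclude)

fit$xlevels                       # levels of g recorded at fit time
fit$contrasts                     # contrasts used to code g
class(fit$na.action)              # "exclude": the incomplete row was dropped
attr(terms(fit), "dataClasses")   # class of each model-frame column
attr(terms(fit), "predvars")      # poly() call with its saved coefficients
environment(terms(fit))           # where the formula's variables are looked up
```

The "predvars" entry is what lets prediction on new data reuse the
polynomial basis from the original fit rather than recomputing it.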
123+
124+
125+
residuals/fitted/weights methods
126+
--------------------------------
127+
128+
To make use of na.action options like na.exclude, the fitted() method
129+
needs to be along the lines of
130+
```
131+
fitted.default <- function(object, ...)
132+
napredict(object$na.action, object$fitted.values)
133+
```
134+
For the residuals() method, replace napredict by naresid (although for
135+
all current na.action's they are the same, this need not be the case
136+
in future).
137+
138+
Similar code with a call to naresid is needed in a weights() method.
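
A minimal sketch of the three methods, assuming a hypothetical class
`myfit` whose objects store `fitted.values`, `residuals`, `weights` and
`na.action` components under those names. The toy object is assembled
from an `lm()` fit purely for demonstration.

```r
## Methods for the hypothetical class "myfit".
fitted.myfit <- function(object, ...)
    napredict(object$na.action, object$fitted.values)
residuals.myfit <- function(object, ...)
    naresid(object$na.action, object$residuals)
weights.myfit <- function(object, ...)
    naresid(object$na.action, object$weights)

## Borrow the components of a weighted lm() fit to build a toy object.
d <- data.frame(y = c(1, 2, NA, 4, 5), x = c(1, 2, 3, 4, 6),
                w = c(1, 1, 2, 2, 1))
fit <- lm(y ~ x, data = d, weights = w, na.action = na.exclude)
obj <- structure(unclass(fit)[c("fitted.values", "residuals",
                                "weights", "na.action")],
                 class = "myfit")

length(obj$fitted.values)   # 4: only complete cases were fitted
length(fitted(obj))         # 5: an NA is re-inserted for the dropped row
residuals(obj)
weights(obj)
```

With `na.action = na.omit` the same methods return the unpadded
vectors, so no special casing is needed in the method itself.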
139+
140+
141+
predict() methods
142+
-----------------
143+
144+
Prediction from the original data used in fitting will often be
145+
covered by the `fitted()` method, unless s.e.'s or confidence/prediction
146+
intervals are required.
147+
148+
If a `newdata` argument is supplied, most methods will need to create
149+
a model matrix as if the newdata had originally been used (but with
150+
na.action as set on the predict method, defaulting to na.pass).
151+
A typical piece of code is
152+
153+
```
154+
m <- model.frame(Terms, newdata, na.action = na.action,
155+
xlev = object$xlevels)
156+
if(!is.null(cl <- attr(Terms, "dataClasses"))) .checkMFClasses(cl, m)
157+
X <- model.matrix(Terms, m, contrasts = object$contrasts)
158+
```
159+
160+
Note the use of the saved levels and saved contrasts, and the safety
161+
check on the classes of the variables found by model.frame (which of
162+
course need not be found in `newdata`). Safe prediction from terms
163+
involving poly(), bs() and so on will happen without needing any code
164+
in the predict() method as this is handled in model.frame.default().
165+
166+
If your code is like rpart() and handles ordered and unordered factors
167+
differently use `.checkMFClasses(cl, m, TRUE)` --- this is not needed
168+
for code like lm() as both the set of levels of the factors and the
169+
contrasts used at fit time are recorded in the fit object and
170+
retrieved by the predict() method.
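
To show where the typical code sits, here is a sketch of a complete
predict() method. The fitter `myfit()` and its class are hypothetical
and exist only for this example; taking `Terms` from
`delete.response(object$terms)` is an assumption borrowed from how
predict methods commonly obtain it, since the fragment above does not
say where `Terms` comes from.

```r
## Hypothetical least-squares fitter following the standard pattern.
myfit <- function(formula, data, subset, na.action, contrasts = NULL)
{
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "subset", "na.action"), names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval.parent(mf)
    mt <- attr(mf, "terms")
    x <- model.matrix(mt, mf, contrasts)
    z <- list(coefficients = qr.coef(qr(x), model.response(mf, "numeric")),
              contrasts = attr(x, "contrasts"),
              xlevels = .getXlevels(mt, mf),
              call = cl, terms = mt)
    class(z) <- "myfit"
    z
}

predict.myfit <- function(object, newdata, na.action = na.pass, ...)
{
    Terms <- delete.response(object$terms)   # newdata need not contain y
    m <- model.frame(Terms, newdata, na.action = na.action,
                     xlev = object$xlevels)
    if (!is.null(cl <- attr(Terms, "dataClasses"))) .checkMFClasses(cl, m)
    X <- model.matrix(Terms, m, contrasts = object$contrasts)
    drop(X %*% object$coefficients)
}

d <- data.frame(y = c(1, 3, 2, 5, 4, 6), x = 1:6,
                g = factor(c("a", "b", "a", "b", "a", "b")))
fit <- myfit(y ~ x + g, data = d)
nd <- data.frame(x = c(2.5, 7), g = c("b", "a"))
predict(fit, nd)
try(predict(fit, data.frame(x = 1, g = "c")))  # stops: level unseen at fit time
```

Because the saved levels are passed as `xlev`, the character column `g`
in `nd` is coded with the same levels as at fit time, and a level that
was never seen is rejected instead of being silently miscoded.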
171+
172+
173+
model.frame() methods
174+
---------------------
175+
176+
model.frame() methods are most often used to retrieve or recreate the
177+
model frame from the fitted object, with no other arguments. For
178+
fitting functions following the standard pattern outlined in this
179+
document no method is needed: as from R 1.9.0 model.frame.default()
180+
will work.
181+
182+
One reason that a special method might be needed is to retrieve
183+
columns of the data frame that correspond to arguments of the original
184+
call other than `formula`, `subset` and `weights`: for example the glm
185+
method handles `offset`, `etastart` and `mustart`.
186+
187+
If you have a `model.frame()` method it should
188+
189+
* return the `model` component of the fit if it is present and no other arguments are supplied.
190+
191+
* establish a suitable environment within which to look for variables.
192+
The standard recipe is
193+
194+
```
195+
fcall <- formula$call
196+
## drop unneeded args
197+
fcall[[1]] <- as.name("model.frame")
198+
if (is.null(env <- environment(formula$terms))) env <- parent.frame()
199+
eval(fcall, env)
200+
```
201+
202+
* allow `...` to specify at least `data`, `na.action` or `subset`.
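
The three requirements can be combined as in the sketch below. The
class `myfit` is hypothetical, its objects are assumed to carry `call`,
`terms`, `xlevels` and (optionally) `model` components, and the toy
object is assembled from an `lm()` fit only to have something to run
the method on.

```r
model.frame.myfit <- function(formula, ...)
{
    dots <- list(...)
    nargs <- dots[match(c("data", "na.action", "subset"), names(dots), 0L)]
    ## 1) nothing overridden and a frame was stored: return it unchanged
    if (length(nargs) == 0L && !is.null(formula$model))
        return(formula$model)
    ## 2) otherwise rebuild the frame from the saved call
    fcall <- formula$call
    m <- match(c("formula", "data", "subset", "na.action"), names(fcall), 0L)
    fcall <- fcall[c(1L, m)]                 # drop unneeded args
    fcall$drop.unused.levels <- TRUE
    fcall[[1L]] <- quote(stats::model.frame)
    fcall$xlev <- formula$xlevels
    fcall$formula <- formula$terms
    ## 3) data, na.action or subset given via ... take precedence
    fcall[names(nargs)] <- nargs
    if (is.null(env <- environment(formula$terms))) env <- parent.frame()
    eval(fcall, env)
}

d <- data.frame(y = c(1, 3, 2, 5, 4, 6), x = 1:6,
                g = factor(rep(c("a", "b"), 3)))
fit <- lm(y ~ x + g, data = d)
obj <- structure(unclass(fit)[c("call", "terms", "xlevels", "model")],
                 class = "myfit")

identical(model.frame(obj), fit$model)    # stored frame returned as is
nrow(model.frame(obj, data = d[1:4, ]))   # rebuilt from the supplied data

obj$model <- NULL                         # as if fitted with model = FALSE
nrow(model.frame(obj))                    # rebuilt from the saved call
```

Arguments passed through `...` are evaluated before the method sees
them, so a `subset` expression has to be evaluable in the caller's
environment; supplying `data` is usually the simpler override.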
