|
| 1 | +--- |
| 2 | +title: Model Fitting Functions in R |
| 3 | +author: Brian Ripley, Nov. 2003; R Core Team |
| 4 | +--- |
| 5 | +How To Write Model-Fitting Functions in R |
| 6 | +========================================= |
| 7 | + |
| 8 | +This page documents some of the features that are available to |
| 9 | +model-fitting functions in R, and especially the safety features that |
| 10 | +can (and should) be enabled. |
| 11 | + |
| 12 | +By model-fitting functions we mean functions like lm() which take a |
| 13 | +formula, create a model frame and perhaps a model matrix, and have |
| 14 | +methods (or use the default methods) for many of the standard accessor |
| 15 | +functions such as coef(), residuals() and predict(). |
| 16 | + |
| 17 | +A fairly complete list of such functions in the standard and |
| 18 | +recommended packages is |
| 19 | + |
| 20 | +- stats: aov (via lm), lm, glm, ppr |
| 21 | + |
| 22 | +- MASS: glm.nb, glmmPQL, lda, lm.gls, lqs, polr, qda, rlm |
| 23 | +- mgcv: mgcv |
| 24 | +- nlme: gls, lme |
| 25 | +- nnet: multinom, nnet |
| 26 | +- survival: coxph, survreg |
| 27 | + |
| 28 | +and those not using a model matrix or not intended to handle |
| 29 | +categorical predictors |
| 30 | + |
| 31 | +- stats: factanal, loess, nls, prcomp, princomp |
| 32 | + |
| 33 | +- MASS: loglm |
| 34 | +- nlme: gnls, nlme |
| 35 | +- rpart: rpart |
| 36 | + |
| 37 | + |
| 38 | +Standard behaviour for a fitting function |
| 39 | +----------------------------------------- |
| 40 | + |
| 41 | +The annotations in the following simplified version of `lm` (its current |
| 42 | +source is in <https::/svn.r-project.org/R/trunk/src/library/stats/R/lm.R>) indicate |
| 43 | +what is standard for a model-fitting function. |
| 44 | + |
| 45 | +```{r} |
| 46 | +lm <- function (formula, data, subset, weights, na.action, |
| 47 | + method = "qr", model = TRUE, contrasts = NULL, offset, ...) |
| 48 | +{ |
| 49 | + cl <- match.call() |
| 50 | +
|
| 51 | + ## keep only the arguments which should go into the model frame |
| 52 | + mf <- match.call(expand.dots = FALSE) |
| 53 | + m <- match(c("formula", "data", "subset", "weights", "na.action", |
| 54 | + "offset"), names(mf), 0) |
| 55 | + mf <- mf[c(1, m)] |
| 56 | + mf$drop.unused.levels <- TRUE |
| 57 | + mf[[1]] <- quote(stats::model.frame) # was as.name("model.frame"), but |
| 58 | + ## need "stats:: ..." for non-standard evaluation |
| 59 | + mf <- eval.parent(mf) |
| 60 | + if (method == "model.frame") return(mf) |
| 61 | +
|
| 62 | + ## 1) allow model.frame to update the terms object before saving it. |
| 63 | + mt <- attr(mf, "terms") |
| 64 | + y <- model.response(mf, "numeric") |
| 65 | +
|
| 66 | + ## 2) retrieve the weights and offset from the model frame so |
| 67 | + ## they can be functions of columns in arg data. |
| 68 | + w <- model.weights(mf) |
| 69 | + offset <- model.offset(mf) |
| 70 | + x <- model.matrix(mt, mf, contrasts) |
| 71 | + ## if any subsetting is done, retrieve the "contrasts" attribute here. |
| 72 | +
|
| 73 | + z <- lm.fit(x, y, offset = offset, ...) |
| 74 | + class(z) <- c(if(is.matrix(y)) "mlm", "lm") |
| 75 | +
|
| 76 | + ## 3) return the na.action info |
| 77 | + z$na.action <- attr(mf, "na.action") |
| 78 | + z$offset <- offset |
| 79 | +
|
| 80 | + ## 4) return the contrasts used in fitting: possibly as saved earlier. |
| 81 | + z$contrasts <- attr(x, "contrasts") |
| 82 | +
|
| 83 | + ## 5) return the levelsets for factors in the formula |
| 84 | + z$xlevels <- .getXlevels(mt, mf) |
| 85 | + z$call <- cl |
| 86 | + z$terms <- mt |
| 87 | + if (model) z$model <- mf |
| 88 | + z |
| 89 | +} |
| 90 | +``` |
| 91 | + |
| 92 | +Note that if this approach is taken, any defaults for arguments |
| 93 | +handled by model.frame are never invoked (the defaults in |
| 94 | +model.frame.default are used) so it is good practice not to supply |
| 95 | +any. (This behaviour can be overruled, and is by e.g. rpart.) |
| 96 | + |
| 97 | +If this is done, the following pieces of information are stored with |
| 98 | +the model object: |
| 99 | + |
| 100 | +* The model frame (unless argument model=FALSE). This is useful to |
| 101 | + avoid scoping problems if the model frame is needed later (most |
| 102 | + often by predict methods). |
| 103 | + |
| 104 | +* What contrasts and levels were used when coding factors to form the |
| 105 | + model matrix, and these plus the model frame allow the re-creation |
| 106 | + of the model matrix. (The real lm() allows the model matrix to be |
| 107 | + saved, but that is provided for S compatibility, and is normally a |
| 108 | + waste of space.) |
| 109 | + |
| 110 | +* The na.action results are recorded for use in forming residuals and |
| 111 | + fitted values/prediction from the original data set. |
| 112 | + |
| 113 | +* The terms component records |
| 114 | + - environment(formula) as its environment, |
| 115 | + - details of the classes supplied for each column of the model frame |
| 116 | + as attribute "dataClasses", |
| 117 | + - in the "predvars" attribute, calls to functions such as bs() and |
| 118 | + poly() which should be used for prediction from a new dataset. |
| 119 | + (See ?makepredictcall for the details.) |
| 120 | + |
| 121 | +Some of these are used automatically but most require code in |
| 122 | +class-specific methods. |
| 123 | + |
| 124 | + |
| 125 | +residuals/fitted/weights methods |
| 126 | +-------------------------------- |
| 127 | + |
| 128 | +To make use of na.action options like na.exclude, the fitted() method |
| 129 | +needs to be along the lines of |
| 130 | +``` |
| 131 | +fitted.default <- function(object, ...) |
| 132 | + napredict(object$na.action, object$fitted.values) |
| 133 | +``` |
| 134 | +For the residuals() method, replace napredict by naresid (although for |
| 135 | +all current na.action's they are the same, this need not be the case |
| 136 | +in future). |
| 137 | + |
| 138 | +Similar code with a call to naresid is needed in a weights() method. |
| 139 | + |
| 140 | + |
| 141 | +predict() methods |
| 142 | +----------------- |
| 143 | + |
| 144 | +Prediction from the original data used in fitting will often be |
| 145 | +covered by the `fitted()` method, unless s.e.'s or confidence/prediction |
| 146 | +intervals are required. |
| 147 | + |
| 148 | +In a `newdata` argument is supplied, most methods will need to create |
| 149 | +a model matrix as if the newdata had originally been used (but with |
| 150 | +na.action as set on the predict method, defaulting to na.pass). |
| 151 | +A typical piece of code is |
| 152 | + |
| 153 | +``` |
| 154 | + m <- model.frame(Terms, newdata, na.action = na.action, |
| 155 | + xlev = object$xlevels) |
| 156 | + if(!is.null(cl <- attr(Terms, "dataClasses"))) .checkMFClasses(cl, m) |
| 157 | + X <- model.matrix(Terms, m, contrasts = object$contrasts) |
| 158 | +``` |
| 159 | + |
| 160 | +Note the use of the saved levels and saved contrasts, and the safety |
| 161 | +check on the classes of the variables found by model.frame (which of |
| 162 | +course need not be found in `newdata`). Safe prediction from terms |
| 163 | +involving poly(), bs() and so on will happen without needing any code |
| 164 | +in the predict() method as this is handled in model.frame.default(). |
| 165 | + |
| 166 | +If your code is like rpart() and handles ordered and unordered factors |
| 167 | +differently use `.checkMFClasses(cl, m, TRUE)` --- this is not needed |
| 168 | +for code like lm() as both the set of levels of the factors and the |
| 169 | +contrasts used at fit time are recorded in the fit object and |
| 170 | +retrieved by the predict() method. |
| 171 | + |
| 172 | + |
| 173 | +model.frame() methods |
| 174 | +--------------------- |
| 175 | + |
| 176 | +model.frame() methods are most often used to retrieve or recreate the |
| 177 | +model frame from the fitted object, with no other arguments. For |
| 178 | +fitting functions following the standard pattern outlined in this |
| 179 | +document no method is needed: as from R 1.9.0 model.frame.default() |
| 180 | +will work. |
| 181 | + |
| 182 | +One reason that a special method might be needed is to retrieve |
| 183 | +columns of the data frame that correspond to arguments of the orginal |
| 184 | +call other than `formula`, `subset` and `weights`: for example the glm |
| 185 | +method handles `offset`, `etastart` and `mustart`. |
| 186 | + |
| 187 | +If you have a `model.frame()` method it should |
| 188 | + |
| 189 | +* return the `model` component of the fit (and there are no other arguments). |
| 190 | + |
| 191 | +* establish a suitable environment within which to look for variables. |
| 192 | + The standard recipe is |
| 193 | + |
| 194 | +``` |
| 195 | + fcall <- formula$call |
| 196 | + ## drop unneeded args |
| 197 | + fcall[[1]] <- as.name("model.frame") |
| 198 | + if (is.null(env <- environment(formula$terms))) env <- parent.frame() |
| 199 | + eval(fcall, env) |
| 200 | +``` |
| 201 | + |
| 202 | +* allow `...` to specify at least `data`, `na.action` or `subset`. |
0 commit comments