|
1 |
| - PD, 27 Dec 1999 |
| 1 | + PD, 30 Dec 1999 |
2 | 2 |
|
3 |
| -Data editor thoughts (or incoherent ramblings...): |
| 3 | + SPREADSHEET-LIKE DATA EDITOR |
| 4 | + ============================ |
4 | 5 |
|
5 |
| -- get something working quickly. Refine later. Not sure using an |
6 |
| - existing spreadsheet program really "cuts it". I *don't* think we |
7 |
| - want a "view" model! |
| 6 | +Background |
| 7 | +---------- |
8 | 8 |
|
9 |
| - Data structure: list of columns. Display side by side in grid. |
10 |
| - possible data types: numeric, character, factor. |
| 9 | +Early versions of R for Windows featured an interactive data editor, |
| 10 | +which for some reason was lost during the transition to the GraphApp |
| 11 | +based version. The Unix version still has it, but it is not getting |
| 12 | +used a lot, for good reasons. It is quite primitive and has some bugs, |
| 13 | +but worse is the fact that it doesn't support factors and data frames. |
11 | 14 |
|
12 |
| - factor types require add'l information on valid |
13 |
| - values + maybe a way of switching between input and |
14 |
| - output representations. |
| 15 | +Many users have seen a data editor in e.g. SPSS for Windows and |
| 16 | +would like to have similar functionality in R. Apart from |
| 17 | +the obvious fact that it is easier to fix errors and enter data, it |
| 18 | +provides some psychological comfort in that it makes data immediately |
| 19 | +visible to the user showing variable names and so forth. For large data |
| 20 | +sets, this advantage tends to disappear, but many projects and |
| 21 | +textbook examples are of a size where data can be contained within a |
| 22 | +few screenfuls of text. |
15 | 23 |
|
16 |
| - Generic vectors? Loose idea: arbitrary field types |
17 |
| - handled by popping up secondary worksheets. (might |
18 |
| - allow variable-length recordings inside individual |
19 |
| - records). |
| 24 | +Thus, it is desirable to update the data editor to match the current |
| 25 | +status of R and the present document discusses our options in doing |
| 26 | +so. |
20 | 27 |
|
21 |
| - Superstructure: grouped columns -> matrices, data frames, lists. |
22 |
| - displayed using multi-line headings. How about things |
23 |
| - like Surv() objects? Arrays might get messy. |
| 28 | +Preliminary considerations |
| 29 | +-------------------------- |
24 | 30 |
|
25 |
| - Attributes? (AsIs, time series) |
26 |
| - |
27 |
| - Coercion (use callbacks into R?) Note that changing one matrix column |
28 |
| - changes all of them... |
| 31 | +There are two major approaches to object editing. One is the "viewer" |
| 32 | +model, in which the display constantly reflects the contents of an |
| 33 | +object, so that if you modify an object (which could happen |
| 34 | +asynchronously if multiple interpreters are active in a multithreaded |
| 35 | +version) the display will update to reflect the changes. The other is |
| 36 | +the simpler "editor" model, in which you work on a copy of the data |
| 37 | +and write it to a new object or back into the original when you're |
| 38 | +done editing. While the "viewer" concept may be appealing in some |
| 39 | +ways, I think it will contain too many difficult issues to be |
| 40 | +considered at this stage. |
29 | 41 |
|
30 |
| - Cell names?? |
| 42 | +Given that we want an "editor" model, there are three things that one |
| 43 | +can do: Try to fix up the existing editor, use an external program |
| 44 | +(spreadsheet or database), or redesign the editor from the ground up. |
| 45 | +I'll describe these options further in the following. The latter option |
| 46 | +is the most appealing to me, but obviously also the most |
| 47 | +work-consuming one to implement. |
31 | 48 |
|
32 |
| - Exact API for the whole slew needs to be defined relatively early, so |
33 |
| - that portable versions can be defined (X11/Windows/WXWindows/Tk...). |
| 49 | +Summary of the current version |
| 50 | +------------------------------ |
34 | 51 |
|
35 |
| -Plan: |
| 52 | +Currently, the data editor works as follows (note that the help page |
| 53 | +is not quite accurate): |
36 | 54 |
|
37 |
| -1) Basic editor with simple columns. Roughly "as is". Needs factor |
38 |
| - specification issues, block deletions/insertions, change cell to NA |
39 |
| - (currently tricky), misc. userfriendliness... Handle |
40 |
| - dataframes/matrices via dispatching. (fix(data) might be an idea. |
41 |
| - fix(data.frame()) to create a new one) |
| 55 | +The workhorse is dataentry() which is just a wrapper for |
| 56 | +.Internal(dataentry(data, modes)) starting up the editing grid. The |
| 57 | +two arguments are a list of vectors and a list of modes. |
| 58 | + |
| 59 | +The tags of the list are used for column labels in the editing grid. |
| 60 | +The modes list, for which "character" and "numeric" (or "double") are |
| 61 | +the only values making sense, contains modes to which the columns will |
| 62 | +be coerced on input. Passing NULL for the modes causes the use of the |
| 63 | +current types, or rather: coercion of everything but character |
| 64 | +variables to REALSXPs. The possibility of passing non-NULL modes seems |
| 65 | +to have been little used - I found a bug causing the first mode in the |
| 66 | +list to be used for *all* columns! |
| 67 | + |
| 68 | +Attributes are unchanged by editing, unless the length of a column was |
| 69 | +changed. |
| 70 | + |
| 71 | +de() is used to pre- and post-process more general data types into a |
| 72 | +form that dataentry() can handle. It calls three helper routines, |
| 73 | +de.setup(), de.ncols(), and de.restore(). de.setup() splits up |
| 74 | +matrices and lists (which must be lists of vectors!) into columns and |
| 75 | +de.restore() pastes them together again as best it can. The result is |
| 76 | +passed back as a list. |
| 77 | + |
| 78 | +Finally, data.entry() calls de() and assigns the result to variables |
| 79 | +in the global environment (using names given by the tags). |
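The split-and-restore step described above can be sketched roughly as follows. This is only an illustration, not the actual de.setup()/de.restore() code: the helper names split_columns() and restore_columns() are hypothetical, the input is assumed to be a named list, and only plain vectors and matrices are handled.

```r
# Hypothetical helpers; these are NOT the real de.setup()/de.restore().
# Assumes `objs` is a named list whose elements are vectors or matrices.
split_columns <- function(objs) {
  cols <- list()
  layout <- list()
  for (nm in names(objs)) {
    x <- objs[[nm]]
    if (is.matrix(x)) {
      # One grid column per matrix column.
      pieces <- lapply(seq_len(ncol(x)), function(j) x[, j])
      labels <- colnames(x)
      if (is.null(labels)) labels <- paste(nm, seq_len(ncol(x)), sep = ".")
      names(pieces) <- labels
    } else {
      pieces <- list(x)
      names(pieces) <- nm
    }
    # Remember how many grid columns each object occupies.
    layout[[nm]] <- list(is_matrix = is.matrix(x), n = length(pieces),
                         labels = names(pieces))
    cols <- c(cols, pieces)
  }
  list(cols = cols, layout = layout)
}

restore_columns <- function(cols, layout) {
  out <- list()
  pos <- 0
  for (nm in names(layout)) {
    info <- layout[[nm]]
    idx <- pos + seq_len(info$n)
    if (info$is_matrix) {
      # Paste the edited columns back into a matrix.
      m <- do.call(cbind, unname(cols[idx]))
      colnames(m) <- info$labels
      out[[nm]] <- m
    } else {
      out[[nm]] <- cols[[idx]]
    }
    pos <- pos + info$n
  }
  out
}

parts <- split_columns(list(x = 1:3, Y = cbind(a = 4:6, b = 7:9)))
back <- restore_columns(parts$cols, parts$layout)
```

Because the layout records a fixed number of grid columns per object, a column added in the grid has nowhere to go on the way back, which is the kind of situation shortcoming (a) below describes.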
| 80 | + |
| 81 | +The scheme sort of works, but has a number of shortcomings (in random |
| 82 | +order): |
| 83 | + |
| 84 | +(a) If complex data structures are used, de() gets confused if one |
| 85 | +adds new variables: |
| 86 | + |
| 87 | +> x<-1:10 |
| 88 | +> Y<-cbind(x=2,y=2) |
| 89 | +> data.entry(x,Y) # add data in 4th column |
| 90 | +Warning message: |
| 91 | +could not restore data types properly in: de(..., Modes = Modes, Names = Names) |
| 92 | +> x |
| 93 | +[1] 2 |
| 94 | + |
| 95 | +(b) Factors are not handled, although dataentry() almost avoids |
| 96 | +mangling them if you don't change the length (but the storage.mode |
| 97 | +turns into "double"). Data frames are converted to lists of vectors, |
| 98 | +and things like data frames containing matrices are not handled at all. |
| 99 | +(The data editor predates data frames, so it is little wonder that |
| 100 | +they're not handled well). |
| 101 | + |
| 102 | +(c) The grid itself is rather primitive. There's no way of adding or |
| 103 | +deleting elements in the middle of a vector, and no way to reshuffle |
| 104 | +columns. No way to back out from editing a field, either (well, there |
| 105 | +is, but you don't get to see the value, cf. "Bugs" below...). Also, |
| 106 | +there have been requests for easier ways of navigating the grid (e.g. |
| 107 | +a screenful at a time). Fields are restricted to 10 characters; longer |
| 108 | +values can be entered, but are not shown. |
| 109 | + |
| 110 | +(d) Bugs. No (simple) way to correct a numeric value to NA: if it is |
| 111 | +blanked out, the previous value is retained; if you convert to |
| 112 | +character and blank the field, it becomes 0 (!!) on output. Adding too |
| 113 | +many empty cells to a vector will cause weirdness: |
| 114 | + |
| 115 | +> de(1) # add "2" in R10 |
| 116 | +$"1" |
| 117 | + [1] 1.00000e+00 NA 5.34186e-315 1.00000e+00 NA |
| 118 | + [6] 5.34186e-315 1.00000e+00 NA 5.34186e-315 1.00000e+00 |
| 119 | + |
| 120 | + |
| 121 | +(e) It just looks plain *ugly*!... |
| 122 | + |
| 123 | +(f) The automatic assignment of data.entry into the global environment |
| 124 | +may not be all that good an idea. |
| 125 | + |
| 126 | +(g) When editing matrices and lists one sees the column names, but not |
| 127 | +the item names. |
| 128 | + |
| 129 | +Option 1: Modifying the existing editor |
| 130 | +--------------------------------------- |
| 131 | + |
| 132 | +It doesn't appear to be too difficult to clear out the actual bugs in |
| 133 | +the current version, but that seems to be not quite enough. |
| 134 | + |
| 135 | +The worst problem is that factors are not handled. A minimal solution |
| 136 | +to that would be to allow them to be edited as integers and have their |
| 137 | +class and levels restored on the way out. Workable, but one would soon |
| 138 | +want to have them displayed in character form. The opposite is also |
| 139 | +possible: edit in character form, but that would be painful if you |
| 140 | +need to do large amounts of editing of factors with long labels such |
| 141 | +as "blue collar worker". |
| 142 | + |
| 143 | +One idea that seems workable would be to make the spreadsheet |
| 144 | +recognize factors, add a popup menu for specifying the level set, |
| 145 | +display the labels (or possibly numbers *and* names), but allow cell |
| 146 | +entry using numbers. |
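A minimal sketch of the "enter numbers, display labels" idea, assuming the grid hands back integer codes for a factor column. The helper names factor_to_codes() and codes_to_factor() are hypothetical and not part of the existing editor; the sketch also ignores ordered factors and any extra attributes.

```r
# Hypothetical helpers illustrating the "edit codes, show labels" idea.
factor_to_codes <- function(f) {
  list(codes = as.integer(f), levels = levels(f))
}

codes_to_factor <- function(codes, levels) {
  codes <- as.integer(codes)
  # Codes outside 1..length(levels) become NA rather than inventing a level.
  codes[!is.na(codes) & (codes < 1L | codes > length(levels))] <- NA
  factor(levels[codes], levels = levels)
}

job <- factor(c("blue collar worker", "clerk", "blue collar worker"))
cell <- factor_to_codes(job)
cell$codes[2] <- 1L   # the user types "1" instead of the long label
edited <- codes_to_factor(cell$codes, cell$levels)
```

Restoring through the saved level set keeps levels that no longer occur in the data, which is probably what one wants after an editing session.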
| 147 | + |
| 148 | +It is clearly important to be able to handle data frames, since they |
| 149 | +are the items that are basic to most users. |
| 150 | + |
| 151 | +Keeping data frames as such should be little trouble, but data frames |
| 152 | +with structured elements might be. Or then again, maybe not: The key |
| 153 | +issue is that the current setup has no way of handling multilevel |
| 154 | +structures. It can handle lists of vectors and matrices, but not lists |
| 155 | +of matrices. So a convention that one just doesn't try to handle |
| 156 | +data frames along with anything else might do the trick. |
| 157 | + |
| 158 | +To get things working on the PC one needs to convert the Xlib calls to |
| 159 | +something understandable to GraphApp (or Win32 itself). Not sure how |
| 160 | +hard that would be, but Robert had something working previously. |
| 161 | + |
| 162 | + |
| 163 | +Option 2: Using external programs |
| 164 | +--------------------------------- |
| 165 | + |
| 166 | +This essentially means spreadsheets, such as Excel and Gnumeric, using |
| 167 | +methods like DCOM and Corba. This is in some ways appealing - in |
| 168 | +particular, one avoids reinventing things that already exist. On the |
| 169 | +other hand, one will necessarily lose control on some level: How does |
| 170 | +one ensure that only numeric data are entered into numeric columns for |
| 171 | +instance? |
| 172 | + |
| 173 | +Another issue is whether one can really assume that e.g. Excel is |
| 174 | +available on every PC (this is almost the case, I know, whether paid |
| 175 | +for or not...). |
| 176 | + |
| 177 | +It would probably be better to interface to database programs, but |
| 178 | +they are more scarce and people tend to know them less well than |
| 179 | +standard spreadsheets. |
| 180 | + |
| 181 | +[I'm well aware that there are people who understand this much better |
| 182 | +than I do. Please feel free to correct] |
| 183 | + |
| 184 | + |
| 185 | +Option 3: Redesigning it from scratch |
| 186 | +------------------------------------- |
| 187 | + |
| 188 | +This is what I feel like doing after thinking about matters for a |
| 189 | +while, but am not sure I really want to do after all... Anyways, it |
| 190 | +might be useful to dream a little: |
| 191 | + |
| 192 | +I'm fairly happy with the notion that a data editor works on a list of |
| 193 | +columns, displayed side by side in a grid, and where entries of a |
| 194 | +column generally have the same basic mode. Numeric and character types |
| 195 | +are easy enough to handle. Factors may require a little further |
| 196 | +elaboration, but it may not be too hard, cf. the discussion under |
| 197 | +Option 1. |
| 198 | + |
| 199 | +However, there are obviously data types that fall outside of this |
| 200 | +framework, and the question is whether one should strive to make it |
| 201 | +possible to handle this as well. |
| 202 | + |
| 203 | +- a data frame might contain a vector of generic type (it is possible |
| 204 | + to use such vectors with the current data frame code, although we're |
| 205 | + not very good at displaying them). One might represent elements of |
| 206 | + such vectors as "active cells" that when clicked pop up a secondary |
| 207 | + worksheet. One major application of this would be to allow |
| 208 | + individual records to contain variable-length recordings (cue: |
| 209 | + repeated measurements data). One might even get the idea of |
| 210 | + "thumbnailing" such secondary sheets in the form of tiny graphs or |
| 211 | + whatever... |
| 212 | + |
| 213 | +- many data vectors have attributes and we'd probably need some way of |
| 214 | + displaying/editing them. Quite possibly the popup sheet idea can |
| 215 | + also be used here. |
| 216 | + |
| 217 | +- depending on class and attributes, some vectors are displayed in |
| 218 | + different formats (Surv objects, time series). Do we wish to handle |
| 219 | + them specially? |
| 220 | + |
| 221 | +- coercion of objects is currently done very crudely (basically: leave |
| 222 | + the text as it is and replace ill-formed numbers with NA upon exit). |
| 223 | + Some sort of callback mechanism into R might be desirable. (Not too |
| 224 | + sure this is a good idea, because of representation issues) |
| 225 | + |
| 226 | +- one needs some way of representing/displaying superstructure. A |
| 227 | + simple solution might be just to use multiline variable headings. |
| 228 | + Arrays might be handled as vectors of vectors of...of matrices. |
| 229 | + |
| 230 | +- attaching cell names to the display (e.g. in a tiny font in the top |
| 231 | + left corner of a cell) might be useful. |
| 232 | + |
| 233 | +- can this be done portably? (X11/Windows/WxWindows/Tk...). |
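The crude coercion mentioned in the list above (keep the cell text as typed, turn ill-formed numbers into NA on exit) could look roughly like this in R. The function name coerce_numeric_column() is hypothetical, and treating a blank cell as NA is an assumption about the desired behaviour, not a description of what the current editor does.

```r
# Hypothetical sketch: coerce the text of one grid column to numeric on exit.
coerce_numeric_column <- function(text) {
  # as.numeric() warns about ill-formed input; NA is the intended result here,
  # so the warning is suppressed. Blank cells also come out as NA.
  suppressWarnings(as.numeric(text))
}

cells <- c("1.5", "abc", "", "2e3")
values <- coerce_numeric_column(cells)
```

A callback into R would replace the body of such a function with user-supplied code, at the cost of the representation issues noted in the list.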
42 | 234 |
|
43 |
| -2) Add superstructure and attribute capabilities. Possibly also the |
44 |
| - "generic field" capability (suspect to need it for the attributes |
45 |
| - anyway). |
46 | 235 |
|
47 |
| -3) Bells + whistles... |
48 | 236 |
|