Commit 22c0fda

author
pd
committed
Rewrite in coherent form
git-svn-id: https://svn.r-project.org/R-dev-web/trunk@29 c52295ea-58df-0310-926a-d16021944841
1 parent 8ceb6e2 commit 22c0fda

1 file changed

+222
-34
lines changed

dataedit.txt

+222-34
@@ -1,48 +1,236 @@
1-
PD, 27 Dec 1999
1+
PD, 30 Dec 1999
22

3-
Data editor thoughts (or incoherent ramblings...):
3+
SPREADSHEET-LIKE DATA EDITOR
4+
============================
45

5-
- get something working quickly. Refine later. Not sure using an
6-
existing spreadsheet program really "cuts it". I *don't* think we
7-
want a "view" model!
6+
Background
7+
----------
88

9-
Data structure: list of columns. Display side by side in grid.
10-
possible data types: numeric, character, factor.
9+
Early versions of R for Windows featured an interactive data editor,
10+
which for some reason was lost during the transition to the GraphApp
11+
based version. The Unix version still has it, but it is not getting
12+
used a lot, for good reasons. It is quite primitive and has some bugs,
13+
but worse is the fact that it doesn't support factors and data frames.
1114

12-
factor types require add'l information on valid
13-
values + maybe a way of switching between input and
14-
output representations.
15+
A data editor is something that many users have seen in e.g. SPSS for
16+
Windows, and they would like similar functionality in R. Apart from
17+
the obvious fact that it is easier to fix errors and enter data, it
18+
provides some psychological comfort in that it makes data immediately
19+
visible to the user showing variable names and so forth. For large data
20+
sets, this advantage tends to disappear, but many projects and
21+
textbook examples are of a size where data can be contained within a
22+
few screenfuls of text.
1523

16-
Generic vectors? Loose idea: arbitrary field types
17-
handled by popping up secondary worksheets. (might
18-
allow variable-length recordings inside individual
19-
records).
24+
Thus, it is desirable to update the data editor to match the current
25+
status of R and the present document discusses our options in doing
26+
so.
2027

21-
Superstructure: grouped columns -> matrices, data frames, lists.
22-
displayed using multi-line headings. How about things
23-
like Surv() objects? Arrays might get messy.
28+
Preliminary considerations
29+
--------------------------
2430

25-
Attributes? (AsIs, time series)
26-
27-
Coercion (use callbacks into R?) Note that changing one matrix column
28-
changes all of them...
31+
There are two major approaches to object editing. One is the "viewer"
32+
model, in which the display constantly reflects the contents of an
33+
object, so that if you modify an object (which could happen
34+
asynchronously if multiple interpreters are active in a multithreaded
35+
version) the display will update to reflect the changes. The other is
36+
the simpler "editor" model, in which you work on a copy of the data
37+
and write it to a new object or back into the original when you're
38+
done editing. While the "viewer" concept may be appealing in some
39+
ways, I think it will contain too many difficult issues to be
40+
considered at this stage.
2941

30-
Cell names??
42+
Given that we want an "editor" model, there are three things that one
43+
can do: Try to fix up the existing editor, use an external program
44+
(spreadsheet or database), or redesign the editor from the ground up.
45+
I'll describe these options further in the following. The last option
46+
is the most appealing to me, but obviously also the most
47+
work-consuming one to implement.
3148

32-
Exact API for the whole slew needs to be defined relatively early, so
33-
that portable versions can be defined (X11/Windows/WXWindows/Tk...).
49+
Summary of the current version
50+
------------------------------
3451

35-
Plan:
52+
Currently, the data editor works as follows (note that the help page
53+
is not quite accurate):
3654

37-
1) Basic editor with simple columns. Roughly "as is". Needs factor
38-
specification issues, block deletions/insertions, change cell to NA
39-
(currently tricky), misc. userfriendliness... Handle
40-
dataframes/matrices via dispatching. (fix(data) might be an idea.
41-
fix(data.frame()) to create a new one)
55+
The workhorse is dataentry() which is just a wrapper for
56+
.Internal(dataentry(data, modes)) starting up the editing grid. The
57+
two arguments are a list of vectors and a list of modes.
58+
59+
The tags of the list are used for column labels in the editing grid.
60+
The modes list, for which "character" and "numeric" (or "double") are
61+
the only values making sense, contains modes to which the columns will
62+
be coerced on input. Passing NULL for the modes causes the use of the
63+
current types, or rather: coercion of everything but character
64+
variables to REALSXPs. The possibility of passing non-NULL modes seems
65+
to have been little used - I found a bug causing the first mode in the
66+
list to be used for *all* columns!
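A minimal sketch in plain R of the per-column coercion that the modes list is meant to drive. The helper name coerce_columns() is hypothetical and is not part of dataentry(); it only illustrates applying each mode to its own column (rather than the first mode to all of them), and the NULL branch mimics the default described above.

```r
# Hypothetical helper, not the real internal code: coerce each column
# to its own requested mode.
coerce_columns <- function(data, modes = NULL) {
  if (is.null(modes)) {
    # Described default: keep character columns, turn the rest into doubles.
    modes <- lapply(data, function(col) {
      if (is.character(col)) "character" else "double"
    })
  }
  stopifnot(length(modes) == length(data))
  # Pair column i with mode i; as.vector() does the actual coercion.
  out <- Map(function(col, mode) as.vector(col, mode = mode), data, modes)
  names(out) <- names(data)
  out
}

cols <- list(id = 1:3, score = c("1.5", "2", "3"))
str(coerce_columns(cols, list("character", "numeric")))
```

Pairing columns and modes with Map() keeps the two lists in step, which is exactly what the bug mentioned above failed to do.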
67+
68+
Attributes are unchanged by editing, unless the length of a column was
69+
changed.
70+
71+
de() is used to pre- and post-process more general data types into a
72+
form that dataentry() can handle. It calls three helper routines,
73+
de.setup(), de.ncols(), and de.restore(). de.setup() splits up
74+
matrices and lists (which must be lists of vectors!) into columns and
75+
de.restore() pastes them together again as best it can. The result is
76+
passed back as a list.
77+
78+
Finally, data.entry() calls de() and assigns the result to variables
79+
in the global environment (using names given by the tags).
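The split-and-restore idea can be sketched roughly as follows. The functions split_columns() and restore_columns() are hypothetical stand-ins written for illustration; they are not the actual de.setup() and de.restore() code, and they assume that the number of columns is unchanged by editing.

```r
# Hypothetical stand-ins for the split/restore steps.
split_columns <- function(objects) {
  cols <- list()
  owner <- character(0)            # name of the object each column came from
  for (nm in names(objects)) {
    obj <- objects[[nm]]
    if (is.matrix(obj)) {
      pieces <- lapply(seq_len(ncol(obj)), function(j) unname(obj[, j]))
    } else if (is.list(obj)) {
      pieces <- unname(obj)        # must be a list of vectors
    } else {
      pieces <- list(obj)          # a plain vector is a single column
    }
    cols <- c(cols, pieces)
    owner <- c(owner, rep(nm, length(pieces)))
  }
  list(cols = cols, owner = owner)
}

restore_columns <- function(cols, owner, originals) {
  out <- list()
  for (nm in names(originals)) {
    mine <- cols[owner == nm]
    orig <- originals[[nm]]
    out[[nm]] <- if (is.matrix(orig)) {
      m <- do.call(cbind, mine)
      dimnames(m) <- dimnames(orig)   # only valid if the shape is unchanged
      m
    } else if (is.list(orig)) {
      stats::setNames(mine, names(orig))
    } else {
      mine[[1]]
    }
  }
  out
}

x <- 1:3
Y <- cbind(a = 1:3, b = 4:6)
parts <- split_columns(list(x = x, Y = Y))
back <- restore_columns(parts$cols, parts$owner, list(x = x, Y = Y))
```

The restore step only has the original objects to go by, so anything that changes the column layout during editing leaves it guessing.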
80+
81+
The scheme sort of works, but has a number of shortcomings (in random
82+
order):
83+
84+
(a) If complex data structures are used, de() gets confused if one
85+
adds new variables:
86+
87+
> x<-1:10
88+
> Y<-cbind(x=2,y=2)
89+
> data.entry(x,Y) # add data in 4th column
90+
Warning message:
91+
could not restore data types properly in: de(..., Modes = Modes, Names = Names)
92+
> x
93+
[1] 2
94+
95+
(b) Factors are not handled, although dataentry() almost avoids
96+
mangling them if you don't change the length (but the storage.mode
97+
turns into "double"). Data frames are converted to lists of vectors,
98+
and things like data frames containing matrices are not handled at all.
99+
(The data editor predates data frames, so it is little wonder that
100+
they're not handled well).
101+
102+
(c) The grid itself is rather primitive. There's no way of adding or
103+
deleting elements in the middle of a vector, and no way to reshuffle
104+
columns. No way to back out from editing a field, either (well, there
105+
is, but you don't get to see the value, cf. "Bugs" below...). Also,
106+
there have been requests for easier ways of navigating the grid (e.g.
107+
a screenful at a time). Fields are restricted to 10 characters; longer
108+
values can be entered, but are not shown.
109+
110+
(d) Bugs. No (simple) way to correct a numeric value to NA: if it is
111+
blanked out, the previous value is retained; if you convert to
112+
character and blank the field, it becomes 0 (!!) on output. Adding too
113+
many empty cells to a vector will cause weirdness:
114+
115+
> de(1) # add "2" in R10
116+
$"1"
117+
[1] 1.00000e+00 NA 5.34186e-315 1.00000e+00 NA
118+
[6] 5.34186e-315 1.00000e+00 NA 5.34186e-315 1.00000e+00
119+
120+
121+
(e) It just looks plain *ugly*!...
122+
123+
(f) The automatic assignment by data.entry() into the global environment
124+
may not be all that good an idea.
125+
126+
(g) When editing matrices and lists one sees the column names, but not
127+
the item names.
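For the operations that item (c) finds missing from the grid, the underlying vector manipulations are simple in R itself. The helper names insert_cell() and delete_cell() below are hypothetical; this is only a sketch of what a grid command could call, not a description of the existing editor.

```r
# Hypothetical helpers for editing in the middle of a column.
insert_cell <- function(col, value, before) {
  # append() inserts 'value' after position 'before - 1'.
  append(col, value, after = before - 1)
}

delete_cell <- function(col, at) {
  if (length(at) == 0) return(col)   # col[-integer(0)] would drop everything
  col[-at]
}

v <- c(10, 20, 40)
v <- insert_cell(v, 30, before = 3)  # put 30 in front of the third element
v <- delete_cell(v, 1)               # remove the first element
```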
128+
129+
Option 1: Modifying the existing editor
130+
---------------------------------------
131+
132+
It doesn't appear to be too difficult to clear out the actual bugs in
133+
the current version, but that seems to be not quite enough.
134+
135+
The worst problem is that factors are not handled. A minimal solution
136+
to that would be to allow them to be edited as integers and have their
137+
class and levels restored on the way out. Workable, but one would soon
138+
want to have them displayed in character form. The opposite is also
139+
possible: edit in character form, but that would be painful if you
140+
need to do large amounts of editing of factors with long labels such
141+
as "blue collar worker".
142+
143+
One idea that seems workable would be to make the spreadsheet
144+
recognize factors, add a popup menu for specifying the level set,
145+
display the labels (or possibly numbers *and* names), but allow cell
146+
entry using numbers.
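A minimal sketch of the "edit as integers, restore on the way out" idea, using only base R. The edits are simulated by plain assignments; in a real editor they would come from the grid.

```r
# Edit a factor through its integer codes, then restore class and levels.
f <- factor(c("blue collar worker", "white collar worker", "blue collar worker"))
lev <- levels(f)             # the level set to restore afterwards
codes <- as.integer(f)       # what the grid would let the user type

codes[2] <- 1L               # simulated edit: set record 2 to level 1
codes <- c(codes, NA)        # simulated new record that is still empty

# Map the codes back to labels and rebuild the factor with the full level set.
restored <- factor(lev[codes], levels = lev)
```

Rebuilding with an explicit levels argument keeps levels that no longer occur in the data, which a plain factor() call on the labels would drop.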
147+
148+
It is clearly important to be able to handle data frames, since they
149+
are the items that are basic to most users.
150+
151+
Keeping data frames as such should be little trouble, but data frames
152+
with structured elements might be. Or then again, maybe not: The key
153+
issue is that the current setup has no way of handling multilevel
154+
structures. It can handle lists of vectors and matrices, but not lists
155+
of matrices. So a convention that one just doesn't try to handle
156+
data frames along with anything else might do the trick.
157+
158+
To get things working on the PC one needs to convert the Xlib calls to
159+
something understandable to GraphApp (or Win32 itself). Not sure how
160+
hard that would be, but Robert had something working previously.
161+
162+
163+
Option 2: Using external programs
164+
---------------------------------
165+
166+
This essentially means spreadsheets, such as Excel and Gnumeric, using
167+
methods like DCOM and Corba. This is in some ways appealing - in
168+
particular, one avoids reinventing things that already exist. On the
169+
other hand, one will necessarily lose control on some level: How does
170+
one ensure that only numeric data are entered into numeric columns for
171+
instance?
172+
173+
Another issue is whether one can really assume that e.g. Excel is
174+
available on every PC (this is almost the case, I know, whether paid
175+
for or not...).
176+
177+
It would probably be better to interface to database programs, but
178+
they are more scarce and people tend to know them less well than
179+
standard spreadsheets.
180+
181+
[I'm well aware that there are people who understand this much better
182+
than I do. Please feel free to correct]
183+
184+
185+
Option 3: Redesigning it from scratch
186+
-------------------------------------
187+
188+
This is what I feel like doing after thinking about matters for a
189+
while, but am not sure I really want to do after all... Anyways, it
190+
might be useful to dream a little:
191+
192+
I'm fairly happy with the notion that a data editor works on a list of
193+
columns, displayed side by side in a grid, and where entries of a
194+
column generally have the same basic mode. Numeric and character types
195+
are easy enough to handle. Factors may require a little further
196+
elaboration, but it may not be too hard, cf. the discussion under
197+
Option 1.
198+
199+
However, there are obviously data types that fall outside of this
200+
framework, and the question is whether one should strive to make it
201+
possible to handle these as well.
202+
203+
- a data frame might contain a vector of generic type (it is possible
204+
to use such vectors with the current data frame code, although we're
205+
not very good at displaying them). One might represent elements of
206+
such vectors as "active cells" that when clicked pop up a secondary
207+
worksheet. One major application of this would be to allow
208+
individual records to contain variable-length recordings (cue:
209+
repeated measurements data). One might even get the idea of
210+
"thumbnailing" such secondary sheets in the form of tiny graphs or
211+
whatever...
212+
213+
- many data vectors have attributes and we'd probably need some way of
214+
displaying/editing them. Quite possibly the popup sheet idea can
215+
also be used here.
216+
217+
- depending on class and attributes, some vectors are displayed in
218+
different formats (Surv objects, time series). Do we wish to handle
219+
them specially?
220+
221+
- coercion of objects is currently done very crudely (basically: leave
222+
the text as it is and replace ill-formed numbers with NA upon exit).
223+
Some sort of callback mechanism into R might be desirable. (Not too
224+
sure this is a good idea, because of representation issues)
225+
226+
- one needs some way of representing and displaying superstructure. A
227+
simple solution might be just to use multiline variable headings.
228+
Arrays might be handled as vectors of vectors of...of matrices.
229+
230+
- attaching cell names to the display (e.g. in a tiny font in the top
231+
left corner of a cell) might be useful.
232+
233+
- can this be done portably? (X11/Windows/WxWindows/Tk...).
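The crude coercion mentioned in the list above (leave the text alone, replace ill-formed numbers with NA on exit) can be sketched in a few lines of R. The helper name cells_to_numeric() is hypothetical, and treating a blank cell as NA is an assumption about the desired behaviour, not what the current editor does.

```r
# Hypothetical helper: convert the text of an edited column to numbers,
# mapping blank or ill-formed entries to NA.
cells_to_numeric <- function(text) {
  text <- trimws(text)
  text[which(text == "")] <- NA_character_   # assumption: blank means missing
  # as.numeric() warns about ill-formed numbers; here NA is the wanted result.
  suppressWarnings(as.numeric(text))
}

cells_to_numeric(c("1.5", "", "abc", " 7 "))
```

A callback into R could replace the body of such a function with class-specific parsing, at the cost of the representation issues noted above.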
42234

43-
2) Add superstructure and attribute capabilities. Possibly also the
44-
"generic field" capability (suspect to need it for the attributes
45-
anyway).
46235

47-
3) Bells + whistles...
48236
