
Commit b4498f8

Week 1, lecture 9 notes
1 parent d5baaa1 commit b4498f8

1 file changed

+114
-0
lines changed

Diff for: week-01/lecture-09-read-and-writing-data-part-2.md

@@ -0,0 +1,114 @@
1+
Read and Write Data - Part 2
2+
============================
3+
4+
Textual Formats
5+
---------------
6+
7+
* dumping and dput-ting are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable
8+
* Unlike writing out a table or CSV file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn't have to specify it all over again
9+
* Textual formats can work much better with version control programs like Subversion and Git, which can only track changes meaningfully in text files
10+
* Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem
11+
* Textual formats adhere to the "Unix philosophy"
12+
* Downside: the format is not very space efficient
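
As a rough illustration of the metadata point above, the sketch below round-trips a small data frame through both write.csv and dput. The data frame, its column names, and the temporary file paths are made up for this example.

```r
# Hypothetical data: one integer column and one Date column.
df <- data.frame(id = 1:2, when = as.Date(c("2020-01-01", "2020-01-02")))

csv_path <- tempfile(fileext = ".csv")
r_path <- tempfile(fileext = ".R")

# CSV keeps the values but not the column classes.
write.csv(df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# dput writes R code that rebuilds the object, classes included.
dput(df, file = r_path)
from_dput <- dget(r_path)

class(from_csv$when)   # not "Date"; the reader would have to convert it again
class(from_dput$when)  # "Date"
```

The CSV reader has to be told (or has to guess) what each column is, while the dput file carries that information with it.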
13+
14+
15+
dput-ting R Objects
16+
-------------------
17+
18+
Another way to pass data around is by deparsing the R object with dput and reading it back in using dget.
19+
20+
> y <- data.frame(a = 1, b = "a")
21+
> dput(y)
22+
structure(list(a = 1,
23+
b = structure(1L, .Label = "a",
24+
class = "factor")),
25+
.Names = c("a", "b"), row.names = c(NA, -1L),
26+
class = "data.frame")
27+
> dput(y, file = "y.R")
28+
> new.y <- dget("y.R")
29+
> new.y
30+
a b
31+
1 1 a
32+
33+
34+
Dumping R Objects
35+
-----------------
36+
37+
Multiple objects can be deparsed using the dump function and read back in using source.
38+
39+
> x <- "foo"
40+
> y <- data.frame(a = 1, b = "a")
41+
> dump(c("x", "y"), file = "data.R")
42+
> rm(x, y)
43+
> source("data.R")
44+
> y
45+
a b
46+
1 1 a
47+
> x
48+
[1] "foo"
49+
50+
51+
Interfaces to the Outside World
52+
-------------------------------
53+
54+
Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.
55+
56+
* file, opens a connection to a file
57+
* gzfile, opens a connection to a file compressed with gzip
58+
* bzfile, opens a connection to a file compressed with bzip2
59+
* url, opens a connection to a webpage
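
A minimal sketch of one of the less common connection types, assuming a temporary file path and made-up contents: lines are written through a gzfile connection and read back the same way.

```r
# Hypothetical compressed file in the session's temporary directory.
path <- tempfile(fileext = ".gz")

# Write three lines through a gzip-compressed connection.
con <- gzfile(path, "w")
writeLines(c("alpha", "beta", "gamma"), con)
close(con)

# Read them back; decompression is handled by the connection.
con <- gzfile(path, "r")
lines <- readLines(con)
close(con)

lines
```

The reading and writing calls are the same ones used for a plain file; only the function that creates the connection changes.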
60+
61+
62+
File Connections
63+
----------------
64+
65+
> str(file)
66+
function (description = "", open = "", blocking = TRUE,
67+
encoding = getOption("encoding"))
68+
69+
* description is the name of the file
70+
* open is a code indicating the mode in which the file is opened:
71+
* "r" read only
72+
* "w" writing (and initializing a new file)
73+
* "a" appending
74+
* "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows)
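
The difference between the "w" and "a" modes can be sketched as follows. The file path is a temporary one chosen for the example.

```r
path <- tempfile(fileext = ".txt")

# "w" creates the file (or truncates an existing one) before writing.
con <- file(path, "w")
writeLines("first", con)
close(con)

# "a" keeps what is already there and adds to the end.
con <- file(path, "a")
writeLines("second", con)
close(con)

contents <- readLines(path)
contents
```

Opening the same path with "w" a second time would have discarded the first line instead of keeping it.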
75+
76+
77+
Connections
78+
-----------
79+
80+
In general, connections are powerful tools that let you navigate files or other external objects. In practice, we often don't need to deal with the connection interface directly.
81+
82+
> con <- file("foo.txt", "r")
83+
> data <- read.csv(con)
84+
> close(con)
85+
86+
This is the same as:
87+
88+
> data <- read.csv("foo.txt")
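
One case where holding a connection explicitly can help is reading a file in pieces, because an open connection remembers its position between calls. The sketch below assumes a small temporary file with made-up contents.

```r
path <- tempfile(fileext = ".txt")
writeLines(c("a", "b", "c", "d"), path)

# Open once, then read in two chunks from the same connection.
con <- file(path, "r")
first_two <- readLines(con, n = 2)
next_two <- readLines(con, n = 2)  # continues where the first read stopped
close(con)

first_two
next_two
```

Passing the file name instead of an open connection would start from the top of the file on every call.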
89+
90+
91+
Reading Lines of a Text File
92+
----------------------------
93+
94+
> con <- gzfile("words.gz")
95+
> x <- readLines(con, 10)
96+
> x
97+
[1] "1080" "10-point" "10th" "11-point"
98+
[5] "12-point" "16-point" "18-point" "1st"
99+
[9] "2" "20-point"
100+
101+
writeLines takes a character vector and writes each element one line at a time to a text file.
102+
103+
readLines can be useful for reading in lines of webpages.
104+
105+
> ## This might take time
106+
> con <- url("http://www.jhsph.edu", "r")
107+
> x <- readLines(con)
108+
> head(x)
109+
[1] "<!DOCTYPE html>"
110+
[2] "<html lang=\"en\">"
111+
[3] ""
112+
[4] "<head>"
113+
[5] "<meta charset=\"utf-8\" />"
114+
[6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"
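
The example above leaves the connection open. One possible way to make sure it is closed is a small wrapper like the one sketched here; read_head is a hypothetical helper, not part of R, and a local temporary file stands in for the web page so the sketch runs without network access.

```r
# Hypothetical helper: read the first n lines from a connection and
# always close it, even if readLines() fails.
read_head <- function(con, n = 6L) {
  on.exit(close(con))
  readLines(con, n = n)
}

# Local stand-in for a web page.
path <- tempfile(fileext = ".html")
writeLines(c("<!DOCTYPE html>", "<html>", "<head>", "</head>", "</html>"), path)
top <- read_head(file(path, "r"), n = 2)
top

# With a real page (requires network access):
# top <- read_head(url("http://www.jhsph.edu", "r"))
```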
