---
output: downlit::readme_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
options(width = 95)
```
utf8
====
<!-- badges: start -->
[![rcc](https://github.com/patperry/r-utf8/workflows/rcc/badge.svg)](https://github.com/patperry/r-utf8/actions)
[![Coverage Status][codecov-badge]][codecov]
[![CRAN Status][cran-badge]][cran]
[![License][apache-badge]][apache]
[![CRAN RStudio Mirror Downloads][cranlogs-badge]][cran]
<!-- badges: end -->
*utf8* is an R package for manipulating and printing UTF-8 text that fixes
[multiple][windows-enc2utf8] [bugs][emoji-print] in R's UTF-8 handling.
Installation
------------
### Stable version
*utf8* is [available on CRAN][cran]. To install the latest released version,
run the following command in R:
```r
install.packages("utf8")
```
### Development version
To install the latest development version, run the following (this requires
the *devtools* package):
```r
devtools::install_github("patperry/r-utf8")
```
Usage
-----
```{r}
library(utf8)
```
### Validate character data and convert to UTF-8
Use `as_utf8()` to validate input text and convert it to UTF-8. The function
raises an error if an entry's declared encoding does not match its bytes:
```{r, error = TRUE}
# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")
as_utf8(x) # fails
# mark the correct encoding
Encoding(x[2]) <- "latin1"
as_utf8(x) # succeeds
```
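When a vector mixes valid and invalid entries, it can help to locate the
offending positions before converting. The following is a minimal sketch,
assuming the package's `utf8_valid()` function, which reports whether each
element can be translated to valid UTF-8. The variable names `ok`, `bad`, and
`result` are illustrative only.

```r
library(utf8)

# same input as above: the second entry holds latin-1 bytes declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

# flag the entries that cannot be translated to valid UTF-8
ok <- utf8_valid(x)
bad <- which(!ok)

# as_utf8() signals an ordinary R error, so tryCatch() can intercept it
# and keep the message instead of stopping the script
result <- tryCatch(as_utf8(x), error = function(e) conditionMessage(e))
```

Checking with `utf8_valid()` first lets a script report every problematic
entry at once rather than stopping at the first failure.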
### Normalize data
Use `utf8_normalize()` to convert to Unicode composed normal form (NFC).
Optionally, apply compatibility maps (`map_compat = TRUE`) to get NFKC normal
form, or apply case folding (`map_case = TRUE`).
```{r}
# three ways to encode an angstrom character
(angstrom <- c("\u00c5", "\u0041\u030a", "\u212b"))
utf8_normalize(angstrom) == "\u00c5"
# perform full Unicode case-folding
utf8_normalize("Größe", map_case = TRUE)
# apply compatibility maps to NFKC normal form
# (example from https://twitter.com/aprilarcus/status/367557195186970624)
utf8_normalize("𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔.",
map_compat = TRUE)
```
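A common reason to normalize is to compare or deduplicate text that looks
identical but is stored as different code point sequences. The sketch below
illustrates this with a hypothetical helper, `same_text()`, which is not part
of the package. It only combines `utf8_normalize()` with `==`.

```r
library(utf8)

# hypothetical helper (not part of the package): compare two character
# vectors after NFC normalization and case folding
same_text <- function(a, b) {
  utf8_normalize(a, map_case = TRUE) == utf8_normalize(b, map_case = TRUE)
}

# precomposed versus decomposed spellings of the same word
composed   <- "\u00c5ngstr\u00f6m"
decomposed <- "A\u030angstro\u0308m"
composed == decomposed            # FALSE: the code point sequences differ
same_text(composed, decomposed)   # TRUE: equal after normalization

# normalizing before unique() collapses equivalent encodings
angstrom <- c("\u00c5", "\u0041\u030a", "\u212b")
length(unique(utf8_normalize(angstrom)))
```

Normalizing both sides of a comparison, rather than only one, keeps the helper
symmetric and avoids depending on how either input was originally encoded.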
### Print emoji
On some platforms (including macOS), R's implementation of `print()` uses an
outdated version of the Unicode standard to determine which characters are
printable. Use `utf8_print()` for an updated print function:
```{r}
print(intToUtf8(0x1F600 + 0:79)) # with default R print function
utf8_print(intToUtf8(0x1F600 + 0:79)) # with utf8_print, truncates line
utf8_print(intToUtf8(0x1F600 + 0:79), chars = 1000) # higher character limit
```
Citation
--------
Cite *utf8* with the following BibTeX entry:
```{r echo = FALSE, comment = NA}
print(suppressWarnings(citation("utf8")), "Bibtex")
```
Contributing
------------
The project maintainer welcomes contributions in the form of feature requests,
bug reports, comments, unit tests, vignettes, or other code. If you'd like to
contribute, you can
- fork the repository and submit a pull request;
- [file an issue][issues]; or
- contact the maintainer via e-mail.
This project is released with a [Contributor Code of Conduct][conduct],
and if you choose to contribute, you must adhere to its terms.
[apache]: https://www.apache.org/licenses/LICENSE-2.0.html "Apache License, Version 2.0"
[apache-badge]: https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache License, Version 2.0"
[building]: #development-version "Building from Source"
[codecov]: https://codecov.io/github/patperry/r-utf8?branch=master "Code Coverage"
[codecov-badge]: https://codecov.io/github/patperry/r-utf8/coverage.svg?branch=master "Code Coverage"
[conduct]: https://github.com/patperry/r-utf8/blob/master/CONDUCT.md "Contributor Code of Conduct"
[cran]: https://cran.r-project.org/package=utf8 "CRAN Page"
[cran-badge]: https://www.r-pkg.org/badges/version/utf8 "CRAN Page"
[cranlogs-badge]: https://cranlogs.r-pkg.org/badges/utf8 "CRAN Downloads"
[emoji-print]: https://twitter.com/ptrckprry/status/887732831161425920 "MacOS Emoji Printing"
[issues]: https://github.com/patperry/r-utf8/issues "Issues"
[windows-enc2utf8]: https://twitter.com/ptrckprry/status/901494853758054401 "Windows enc2utf8 Bug"