Update "Character encoding" and related provisions #438 #461
Changes from 3 commits
0beba16
91f07a0
e391329
e7119e8
90b017d
bc98ead
@@ -114,9 +114,11 @@ Rules for each ``purl`` component
 
 A ``purl`` string is an ASCII URL string composed of seven components.
 
-Some components are allowed to use other characters beyond ASCII: these
-components must then be UTF-8-encoded strings and percent-encoded as defined in
-the "Character encoding" section.
+Except as expressly stated otherwise in this section, each component:
+
+- MAY be composed of any of the characters defined as "Permitted Characters" in
+  the "Character encoding" section
+- MUST be encoded as defined in the "Character encoding" section
 
 The rules for each component are:
 
@@ -225,17 +227,13 @@ Character encoding
 Permitted characters
 --------------------
 
-A canonical ``purl`` is an ASCII string composed of these characters:
+A canonical ``purl`` is composed of these Permitted Characters:
 
 - alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
 - the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
   question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
-- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-',
-  underscore '_' and tilde '~').
-
-All other characters MUST be encoded as UTF-8 and then percent-encoded.
-In addition, each component specifies its permitted characters and
-its percent-encoding rules.
+- the ASCII characters ``+%.-_~`` (plus '+', percent sign '%', period '.',
+  dash '-', underscore '_' and tilde '~').
 
 
 ``purl`` separators
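As a rough illustration of the revised "Permitted characters" list in the hunk above, the following Python sketch reports which characters of a string fall outside that set. The names `PERMITTED` and `non_permitted_characters` are hypothetical and are not defined by the specification; the check covers only the character set, not the per-component rules.

```python
import string

# Assumption: this mirrors the three bullets of the revised list:
# alphanumerics, the purl separators, and the ASCII characters "+%.-_~".
PERMITTED = frozenset(
    string.ascii_letters + string.digits + ":/@?=&#" + "+%.-_~"
)


def non_permitted_characters(purl):
    """Return the characters of `purl` that are not Permitted Characters."""
    return [ch for ch in purl if ch not in PERMITTED]
```

An empty result means every character is in the permitted set; it does not by itself mean the string is a valid ``purl``.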
@@ -259,14 +257,26 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
 Percent-encoding rules
 ----------------------
 
-When applying percent-encoding or decoding to a string, use the rules of RFC
-3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2).
-
-Each component defines when and how to apply percent-encoding and decoding to
-its content.
-
-When percent-encoding is required, all characters MUST be encoded except for
-the colon ':'.
+- In the "Rules for each ``purl`` component" section above, each component
+  defines when and how to apply percent-encoding and decoding to its content,
+  including which characters to percent-encode and when percent-encoding is
+  required.
+- When percent-encoding is required by a component definition, each
+  codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
+  UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
+  section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
+- With the exception of the percent-encoding mechanism, the rules regarding
+  percent-encoding are defined by this specification alone.
+- Where the space ' ' is permitted, it MUST be percent-encoded as
+  '%20'.
+- The following characters do not need to be percent-encoded:
+
+  - the alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
+  - the ASCII characters ``.-_~`` (period '.', dash '-', underscore
+    '_' and tilde '~'),
+  - the percent sign '%' when used to represent a percent-encoded character,
+  - a ``purl`` separator when being used as a ``purl`` separator, and
+  - the colon ':', whether used as a ``purl`` separator or otherwise.
 
 These ``purl`` separator characters MUST NOT be percent-encoded when used as
 ``purl`` separators:
Do we need to repeat it here too?
- line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?
Sure, we need to say that the colon ':' does not need to be percent-encoded, but I don't think we need to repeat that it also does not need to be encoded when used as a separator.
Maybe we could make this paragraph less descriptive and more imperative, like:
To percent-encode a string of characters:
1. encode it using UTF-8,
2. for each byte of the encoded string:
   - if the byte corresponds to:
     - an alphanumeric ASCII character (``A to Z``, ``a to z``, ``0 to 9``),
     - or one of the ASCII characters `.`, `-`, `_`, `~` and `:`,
     copy the byte to the output.
   - otherwise, append the percent-encoding of the byte to the output, as defined in RFC 3986
     section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
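For reference, the imperative wording suggested above translates almost line for line into code. This is only a minimal Python sketch of that suggestion, not text proposed for the specification; the names `NOT_ENCODED` and `percent_encode` are hypothetical. It treats '%' like any other byte, so it is meant for raw, not yet encoded, component values.

```python
# Bytes copied to the output unchanged: ASCII alphanumerics plus ".-_~:".
# Iterating over a bytes object yields ints, so this is a set of ints.
NOT_ENCODED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789"
    b".-_~:"
)


def percent_encode(text):
    """Percent-encode `text` following the two steps suggested above."""
    out = []
    # Step 1: encode the string as UTF-8.
    for byte in text.encode("utf-8"):
        # Step 2: copy the byte or append its RFC 3986 section 2.1 form.
        if byte in NOT_ENCODED:
            out.append(chr(byte))
        else:
            out.append("%{:02X}".format(byte))
    return "".join(out)
```

For example, `percent_encode("a b")` gives `a%20b` and `percent_encode("a:b/c")` gives `a:b%2Fc`. In Python 3.7 and later, `urllib.parse.quote(text, safe=":")` should produce the same output, since its always-safe set is the ASCII alphanumerics plus `_.-~`.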