Merged
Changes from 2 commits
47 changes: 21 additions & 26 deletions PURL-SPECIFICATION.rst
@@ -227,7 +227,7 @@ Character encoding
Permitted characters
--------------------

-A canonical ``purl`` is composed of these characters ("Permitted Characters"):
+A canonical ``purl`` is composed of these Permitted Characters:

- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
@@ -257,31 +257,26 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
Percent-encoding rules
----------------------

-Unless otherwise provided in this specification, when applying percent-encoding
-or decoding to a string, use the rules of RFC 3986 section 2
-(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any
-conflict between this specification and RFC 3986 section 2, this specification
-governs.
-
-In the "Rules for each ``purl`` component" section above, each component
-defines when and how to apply percent-encoding and decoding to its content.
-
-When percent-encoding is required, all Permitted Characters MUST be encoded as
-UTF-8 and then percent-encoded except for the following:
-
-- the alphanumeric characters,
-
-- the ASCII characters ``.-_~`` (period '.', dash '-', underscore
-'_' and tilde '~'),
-
-- the percent sign '%' when used to represent a percent-encoded character,
-
-- a ``purl`` separator when being used as a ``purl`` separator, and
-
-- the colon ':', whether used as a ``purl`` separator or otherwise.
-
-In addition, where the space ' ' is permitted, it MUST be percent-encoded as
-'%20'.
+- In the "Rules for each ``purl`` component" section above, each component
+defines when and how to apply percent-encoding and decoding to its content,
+including which characters to percent-encode and when percent-encoding is
+required.
+- When percent-encoding is required by a component definition, each
+codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
+UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
+section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
+- With the exception of the percent-encoding mechanism, the rules regarding
+percent-encoding are defined by this specification alone.
+- Where the space ' ' is permitted, it MUST be percent-encoded as
+'%20'.
+- The following characters do not need to be percent-encoded:
+
+- the alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
+- the ASCII characters ``.-_~`` (period '.', dash '-', underscore
+'_' and tilde '~'),
+- the percent sign '%' when used to represent a percent-encoded character,
+- a ``purl`` separator when being used as a ``purl`` separator, and
+- the colon ':', whether used as a ``purl`` separator or otherwise.
Contributor

Since each component specifies the arguments of the percent-encode method, I think this is not necessary.
The argument of percent-encode will never contain "characters used as purl separators"; those characters will be added afterwards.

For example:

  • percent-encode each segment of the namespace.
  • join the encoded segments with the / character.
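
A minimal Python sketch of that order of operations, for illustration only. The helper name `encode_namespace` is hypothetical, and passing `safe=":"` to `urllib.parse.quote` is an assumption about which characters stay literal (ASCII alphanumerics, `.-_~`, and the colon); it is not wording from the spec:

```python
from typing import Iterable
from urllib.parse import quote


def encode_namespace(segments: Iterable[str]) -> str:
    """Hypothetical helper: percent-encode each segment, then join with '/'."""
    # quote() leaves ASCII alphanumerics and '_.-~' untouched; safe=":" also
    # keeps the colon. Everything else is UTF-8 encoded and emitted as %XX.
    encoded = [quote(segment, safe=":") for segment in segments]
    # The '/' separators are added only after encoding, so the encoder never
    # sees a character that is acting as a purl separator.
    return "/".join(encoded)
```

With this shape, a '/' that appears inside a segment is encoded as `%2F`, while the joining '/' stays literal, so the encoder itself needs no rule about separators.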

Member Author

Thanks for this comment, @ppkarwasz. Not sure if you're addressing all 3 lines you excerpted or just line 278, so just in case:

  • line 277: that is definitely more than needed -- was just trying to be thorough ;-) no objections at all to deleting if others agree
  • line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded? See, e.g., "Percent encoding spec and : and /". IMHO this is needed to avoid such issues in the future and to make the use of a PURL as clear as possible.
  • line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?

Contributor

@ppkarwasz, Apr 29, 2025

> • line 278: believe it or not, there have been issues that this point addresses, e.g., does the colon between scheme and type need to be percent-encoded?

Lines 242-243 say:

These ``purl`` separator characters MUST NOT be percent-encoded when used as
``purl`` separators:

Do we need to repeat it here too?

> • line 279: perhaps I am misunderstanding your point here -- without line 279, how will users know that colons do not need to be percent-encoded?

Sure, we need to say that the colon ':' does not need to be percent-encoded, but I don't think we need to repeat that it also does not need to be encoded when used as a separator.

Maybe we could make this paragraph less descriptive and more imperative like:

To percent-encode a string of characters:

1. encode it using UTF-8,
2. for each byte of the encoded string:
    - if the byte corresponds to:
       - an alphanumeric ASCII character (``A to Z``, ``a to z``, ``0 to 9``)
       - or one of the ASCII characters ``.``, ``-``, ``_``, ``~`` and ``:``.
       
      copy the byte to the output.
    - otherwise, append the percent-encoding of the byte to the output, as defined in RFC 3986
      section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1).
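
The steps above could be sketched in Python roughly as follows. This is only an illustration of the suggested wording, not a reference implementation: the names `percent_encode` and `UNENCODED_BYTES` are made up for the example, and uppercase hex digits are used because RFC 3986 section 2.1 recommends them.

```python
import string

# Bytes copied to the output unchanged, per the steps above: ASCII
# alphanumerics plus '.', '-', '_', '~' and ':'. (Hypothetical name.)
UNENCODED_BYTES = frozenset(
    (string.ascii_letters + string.digits + ".-_~:").encode("ascii")
)


def percent_encode(text: str) -> str:
    """Percent-encode a decoded component value, byte by byte."""
    out = []
    # Step 1: encode the string as UTF-8.
    for byte in text.encode("utf-8"):
        # Step 2: copy allowed bytes, percent-encode everything else.
        if byte in UNENCODED_BYTES:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)
```

Note that this sketch turns every '%' into `%25`, so it is meant for decoded component values only; applying it to an already encoded string would double-encode it.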



How to build ``purl`` string from its components