47 changes: 31 additions & 16 deletions PURL-SPECIFICATION.rst
@@ -114,9 +114,11 @@ Rules for each ``purl`` component

A ``purl`` string is an ASCII URL string composed of seven components.

-Some components are allowed to use other characters beyond ASCII: these
-components must then be UTF-8-encoded strings and percent-encoded as defined in
-the "Character encoding" section.
+Except as expressly stated otherwise in this section, each component:
+
+- MAY be composed of any of the characters defined as "Permitted Characters" in
+  the "Character encoding" section
+- MUST be encoded as defined in the "Character encoding" section

The rules for each component are:

@@ -225,17 +227,13 @@ Character encoding
Permitted characters
--------------------

-A canonical ``purl`` is an ASCII string composed of these characters:
+A canonical ``purl`` is composed of these characters ("Permitted Characters"):

- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
-- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-',
-  underscore '_' and tilde '~').
-
-All other characters MUST be encoded as UTF-8 and then percent-encoded.
-In addition, each component specifies its permitted characters and
-its percent-encoding rules.
+- the ASCII characters ``+%.-_~`` (plus '+', percent sign '%', period '.',
+  dash '-', underscore '_' and tilde '~').


``purl`` separators
@@ -259,14 +257,31 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
Percent-encoding rules
----------------------

-When applying percent-encoding or decoding to a string, use the rules of RFC
-3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2).
+Unless otherwise provided in this specification, when applying percent-encoding
+or decoding to a string, use the rules of RFC 3986 section 2
+(https://datatracker.ietf.org/doc/html/rfc3986#section-2). In the event of any
+conflict between this specification and RFC 3986 section 2, this specification
+governs.

Contributor

Are we going back to how it was before? As I understood it, the current version was an intentional breaking change away from RFC 3986 section 2: the RFC 3986 encoding rules are more complicated to implement, and most PURL implementations did not try to implement them, instead applying mostly the same encoding rules to all components. I've already had somebody ask why phylum-dev/purl doesn't encode plus signs in version numbers. Encoding them isn't required by RFC 3986 and is non-canonical according to the old PURL, but it is required by the rules in the current version of PURL, or at least we both thought so.
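
To make the plus-sign case concrete, here is a minimal Python sketch using the standard library's `urllib.parse.quote`. The version string is hypothetical, and which of the two outputs counts as canonical depends on the reading of the spec being debated in this thread; the sketch only shows the difference, it does not settle it.

```python
from urllib.parse import quote

version = "1.0.0+build.5"  # hypothetical version string, for illustration only

# RFC 3986 treats '+' as a sub-delim, which may appear unencoded in a path
# segment, so an RFC 3986-style encoder can leave it alone.
rfc3986_style = quote(version, safe="+")

# Under the reading described in this comment, '+' is not exempt from
# encoding, so it becomes %2B.
purl_style = quote(version, safe="")

print(rfc3986_style)
print(purl_style)
```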

Member

@matt-phylum can you elaborate? I am not sure I fully understand your point. I think that what we are trying to convey is that:

  1. this spec defines WHICH characters to encode and WHERE/WHEN (e.g., with specifics for separators and components)
  2. we defer to RFC3986 to define HOW to encode characters we want encoded.

Would this be clearer?

Contributor

By saying "In the event of any conflict between this specification and RFC 3986 section 2", the text makes it sound as if the implementer is supposed to combine the RFC 3986 and PURL rules, e.g., merging RFC 3986 pchar with the PURL character rules when outputting a package name. If the PURL spec clearly specifies WHICH characters to encode and WHERE/WHEN, and RFC 3986 specifies only HOW, then it's much easier to implement and there shouldn't be conflicts.

Member Author

@matt-phylum That's my goal -- each component should specify clearly what characters are permitted or prohibited as well as what needs to be percent-encoded and when. (scheme, type and qualifiers already do exactly that, and namespace, name, version and subpath should do the same to eliminate the possibility of ambiguity.)

With respect to RFC 3986 and the HOW, I'm adopting your earlier suggestion that the RFC 3986 reference be changed from section 2 to section 2.1, which addresses the mechanics of percent-encoding, i.e., the HOW. Please take a look once I push an update and let me know if more fine-tuning is needed and if so I'll take care of it.
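
As an illustration of the HOW that RFC 3986 section 2.1 covers, here is a minimal Python sketch of encoding a single octet as a percent sign followed by two hexadecimal digits, and of decoding such a triplet. The function names are hypothetical and not taken from any PURL library; the uppercase hex digits follow the RFC's recommendation for producers.

```python
import string


def percent_encode_octet(octet: int) -> str:
    """Return '%' followed by two uppercase hex digits for one octet."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return f"%{octet:02X}"


def percent_decode_triplet(triplet: str) -> int:
    """Return the octet for a '%XX' triplet; hex digits are case-insensitive."""
    if (
        len(triplet) != 3
        or triplet[0] != "%"
        or any(c not in string.hexdigits for c in triplet[1:])
    ):
        raise ValueError("expected a three-character '%XX' triplet")
    return int(triplet[1:], 16)
```

Nothing here decides which characters get encoded; that choice stays with the component rules, which is the WHICH/WHERE/WHEN split discussed above.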


+In the "Rules for each ``purl`` component" section above, each component
+defines when and how to apply percent-encoding and decoding to its content.

+When percent-encoding is required, all Permitted Characters MUST be encoded as
+UTF-8 and then percent-encoded except for the following:

Contributor

This seems at odds with the first paragraph. "When percent-encoding is required" should be redundant, because percent-encoding is always required. (PURL has components that must not be encoded, but I would say it's more accurate that those components do not allow any characters that require encoding, especially qualifier keys.) And because the text first says to use RFC 3986 rules and then says "when percent-encoding is required, [...] characters must be [...] percent-encoded except," it could be taken to mean that these are exceptions to the RFC 3986 rules rather than a replacement for them (the RFC 3986 rules don't map one-to-one onto PURL).

Member

@matt-phylum how would you phrase this?

Contributor

It's unfortunate that the process of applying "percent-encoding" to a string does not necessarily change the string because it complicates talking about which characters need to be changed vs which characters do not need to be changed. RFC 3986 talks about percent-encoding a (byte) string, a process which may or may not alter the string, and percent-encoding an octet, a process which consistently converts one octet into three. WHATWG URL is similar, talking about percent-encoding a byte sequence, a process which may or may not alter the byte sequence, and percent-encoding a byte, a process which consistently converts one octet into three. PURL, at least in this section, is less specific about what "percent-encoding" means.

Merging in @ppkarwasz's comment, it could say something like this:

When serializing a string, unless excluded by the following rules, every code point must be replaced by the percent-encoded bytes of the code point's UTF-8 encoding.

Member Author

The intent here is to refer to those component definitions that require percent-encoding, e.g., something like

When percent-encoding is required by a component definition, each
codepoint MUST be replaced by the percent-encoded bytes of the codepoint's
UTF-8 encoding using the percent-encoding mechanism defined in RFC 3986
section 2.1 . . . .
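
One way to read the wording proposed in this thread is sketched below in Python. The function name is hypothetical, and the default `keep` set is an assumption drawn from the exception list in the diff under review (alphanumerics, `.-_~`, and the colon). Separators used as separators are assumed to be added by the caller when joining components, so they never pass through this function.

```python
# Characters left as-is: an assumption based on the exception list in the diff.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    ".-_~"
)


def percent_encode(text: str, keep: frozenset = UNRESERVED | {":"}) -> str:
    """Replace every code point not in `keep` with the percent-encoded
    bytes of its UTF-8 encoding (RFC 3986 section 2.1 mechanics)."""
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)
        else:
            # One '%XX' triplet per UTF-8 byte of the code point.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)
```

With these assumptions, `percent_encode("a b")` gives `a%20b`, a literal percent sign in the input becomes `%25`, and a colon passes through unchanged.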

Contributor

Here on the other hand we are talking about the decoded form of a component (e.g. a package named Pan語).

The domain of the percent-encode function is not restricted to the "Permitted Characters": any Unicode character can be present in the input (components can restrict this set).
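
The distinction between the decoded and encoded forms can be sketched with Python's standard `urllib.parse` functions. This is only an illustration of the round trip for the package name mentioned above; passing `safe=""` is an assumption made here for simplicity, not a statement of what the spec requires for any particular component.

```python
from urllib.parse import quote, unquote

decoded_name = "Pan語"  # decoded form: may contain any Unicode character

# Encoding maps the decoded form onto ASCII: each non-exempt code point
# becomes the percent-encoded bytes of its UTF-8 encoding.
encoded_name = quote(decoded_name, safe="")

# Decoding reverses the mapping and recovers the original name.
round_tripped = unquote(encoded_name)

print(encoded_name)
print(round_tripped == decoded_name)
```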


+- the alphanumeric characters,

+- the ASCII characters ``.-_~`` (period '.', dash '-', underscore
+  '_' and tilde '~'),

+- the percent sign '%' when used to represent a percent-encoded character,

+- a ``purl`` separator when being used as a ``purl`` separator, and

-Each component defines when and how to apply percent-encoding and decoding to
-its content.
+- the colon ':', whether used as a ``purl`` separator or otherwise.

-When percent-encoding is required, all characters MUST be encoded except for
-the colon ':'.
+In addition, where the space ' ' is permitted, it MUST be percent-encoded as
+'%20'.


How to build ``purl`` string from its components