Merged
Changes from 7 commits
121 changes: 82 additions & 39 deletions PURL-SPECIFICATION.rst
@@ -176,25 +176,25 @@ The rules for each component are:

- **qualifiers**:

- The ``qualifiers`` string is prefixed by a '?' separator when not empty
- This '?' is not part of the ``qualifiers``
- This is a query string composed of zero or more ``key=value`` pairs each
separated by a '&' ampersand. A ``key`` and ``value`` are separated by the equal
'=' character
- These '&' are not part of the ``key=value`` pairs.
- ``key`` must be unique within the keys of the ``qualifiers`` string
- ``value`` cannot be an empty string: a ``key=value`` pair with an empty ``value``
is the same as no key/value at all for this key
- For each pair of ``key`` = ``value``:

- The ``key`` must be composed only of ASCII letters and numbers, '.', '-' and
'_' (period, dash and underscore)
- A ``key`` cannot start with a number
- A ``key`` must NOT be percent-encoded
- A ``key`` is case insensitive. The canonical form is lowercase
- A ``key`` cannot contain spaces
- A ``value`` must be a percent-encoded string
- The '=' separator is neither part of the ``key`` nor of the ``value``
- The ``qualifiers`` component MUST be prefixed by a '?' separator when not empty.
- The '?' separator is not part of the ``qualifiers`` component.
- The ``qualifiers`` component is a query string composed of one or more ``key=value``
pairs. Multiple ``key=value`` pairs MUST be separated by an ampersand '&'.
A ``key`` and ``value`` MUST be separated by the equal '=' character.
- Neither the '&' nor the '=' separator is part of the ``key`` or the ``value``.
- Each ``key`` MUST be unique among the keys of the ``qualifiers`` string.
- A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an empty ``value``
is the same as if no ``key=value`` pair exists for this ``key``.

- For each ``key=value`` pair:

- The ``key`` MUST be composed only of ASCII letters and numbers, '.', '-' and
'_' (period, dash and underscore).
- A ``key`` MUST start with an ASCII letter.
- A ``key`` MUST NOT be percent-encoded.
- A ``key`` is case insensitive. The canonical form is lowercase.
- A ``value`` MAY be composed of any character. A ``value`` MUST be
percent-encoded as described in the "Character encoding" section.


- **subpath**:
@@ -206,44 +206,78 @@ The rules for each component are:
in the canonical form
- Each ``subpath`` segment MUST be a percent-encoded string
- When percent-decoded, a segment:

- MUST NOT contain a '/'
- MUST NOT be any of '..' or '.'
- MUST NOT be empty

- The ``subpath`` MUST be interpreted as relative to the root of the package
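
The revised ``qualifiers`` rules above can be sketched as a small parser. This is only an illustration under stated assumptions, not code from the PR or from any purl library: the function name ``parse_qualifiers`` is hypothetical, the input is assumed to be the text after the '?' separator, and raising ``ValueError`` on violations is one possible policy among several.

```python
import re
from typing import Dict
from urllib.parse import unquote

# A key must start with an ASCII letter and may continue with ASCII
# letters, digits, '.', '-' and '_'.
_KEY_RE = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def parse_qualifiers(qualifiers: str) -> Dict[str, str]:
    """Parse the text after '?' into a dict of lowercase keys to decoded values.

    Hypothetical helper: the error handling here is an assumption.
    """
    result: Dict[str, str] = {}
    if not qualifiers:
        return result
    for pair in qualifiers.split("&"):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in qualifier pair: {pair!r}")
        if not _KEY_RE.fullmatch(key):
            raise ValueError(f"invalid qualifier key: {key!r}")
        key = key.lower()  # keys are case insensitive; canonical form is lowercase
        if key in result:
            raise ValueError(f"duplicate qualifier key: {key!r}")
        value = unquote(value)  # values are percent-encoded
        if value == "":
            continue  # an empty value is the same as no pair for this key
        result[key] = value
    return result
```

For example, ``parse_qualifiers("Arch=x86_64&os=linux")`` yields lowercase keys, and a pair such as ``arch=`` is dropped rather than stored.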


Character encoding
~~~~~~~~~~~~~~~~~~

For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that
there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
characters must be UTF-encoded and then percent-encoded as defined at::
A canonical ``purl`` is always an ASCII string composed only of these characters:

- ``A to Z``,
- ``a to z``,
- ``0 to 9`` and
- the punctuation marks ``:/@?#%.-_~`` .
Member

We are missing the "ampersand" in that list:

Suggested change
- the punctuation marks ``:/@?#%.-_~`` .
- the punctuation marks ``:/@?&#%.-_~`` .

Member Author

Ampersand '&' added. Good eye.

Member

What about '+' (plus)?

We are not a www-URL, so a plus is a plus, while a space, which is not in the list of allowed characters, needs to be percent-encoded. That is already declared, but maybe also add an explicit note on how spaces should be handled.

Contributor

#261 recommends that a PURL qualifier value should not contain '+' because some implementations do treat it as a space. But, in the interest of simplifying the encoding, maybe the specified encoding should be the same for qualifier values as everywhere else (whether that's encoded or not).

Member

@jkowalleck jkowalleck Mar 19, 2025

These implementations already do not respect the current purl spec.
A plus is a plus, and a space must be percent-encoded.
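
The difference discussed in this thread can be seen with Python's standard ``urllib.parse`` functions. This is only a sketch of the two decoding conventions, not a statement of what any particular purl implementation does: ``unquote`` leaves '+' untouched, while ``unquote_plus`` applies the form-encoding convention and turns '+' into a space.

```python
from urllib.parse import quote, unquote, unquote_plus

# Percent-encoding a space with nothing marked as safe yields %20, never '+'.
encoded_space = quote("a b", safe="")

# Plain percent-decoding: a plus stays a plus.
decoded_plain = unquote("a+b")

# Form-style decoding (application/x-www-form-urlencoded): a plus becomes a space.
decoded_form = unquote_plus("a+b")

print(encoded_space)   # a%20b
print(decoded_plain)   # a+b
print(decoded_form)    # a b
```

A decoder built on the form-style function would silently change a literal '+' in a value, which is the behavior the comments above describe.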


To ensure that there is no ambiguity when parsing a ``purl``, separator characters
and non-ASCII characters MUST be UTF-encoded and then percent-encoded as defined at
https://en.wikipedia.org/wiki/Percent-encoding and as further defined below.

----

Use these rules for percent-encoding and decoding the characters that comprise
a ``purl`` string. Except as otherwise provided in the "Rules for each
``purl`` component" section above:

- A character used in a ``purl`` component MUST be percent-encoded unless it is:

- an unreserved character as defined in RFC 3986 section 2.3 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.3),

- expressly defined in this PURL-SPECIFICATION.rst as a ``purl`` separator (and only when used as such a separator), or

https://en.wikipedia.org/wiki/Percent-encoding
- expressly permitted in that ``purl`` component.

Use these rules for percent-encoding and decoding ``purl`` components:
- All non-ASCII characters MUST be encoded as UTF-8 and then percent-encoded.

- the ``type`` must NOT be encoded and must NOT contain separators
- The characters used as ``purl`` separators are listed below. These characters:

- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
- MUST NOT be percent-encoded when used as separators.

- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere
- MUST be percent-encoded when not used as separators unless expressly permitted
by a ``purl`` component.

- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
does not need to and must NOT be percent-encoded. It is unambiguous unencoded
everywhere
- ``purl`` separators:

- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
- the '=' ``qualifiers`` key/value separator must NOT be encoded
- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere
':' (colon)
- between ``scheme`` and ``type``

- All non-ASCII characters must be encoded as UTF-8 and then percent-encoded
'@' (at sign)
- ``version`` prefix

It is OK to percent-encode ``purl`` components otherwise except for the ``type``.
Parsers and builders must always percent-decode and percent-encode ``purl``
'?' (question mark)
- ``qualifiers`` prefix

'#' (number sign)
- ``subpath`` prefix

'/' (slash)
- ``type``/``namespace``/``name`` separator
- ``subpath`` segments separator

'=' (equals)
- ``qualifiers`` ``key``/``value`` separator

'&' (ampersand)
- ``qualifiers`` ``key=value`` separator

----

Parsers and builders MUST always percent-decode and percent-encode ``purl``
components and component segments as explained in the "How to parse" and "How to
build" sections.
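
As a rough illustration of the encoding rules above, the following sketch uses Python's standard ``urllib.parse.quote`` and ``unquote``. The helper names ``encode_component`` and ``decode_component`` and the ``permitted`` parameter are hypothetical, introduced only for this example. It assumes Python 3.7 or later, where ``quote`` never encodes the RFC 3986 unreserved characters (letters, digits, '-', '.', '_', '~'), and it leaves the choice of which extra characters a given component permits to the caller.

```python
from urllib.parse import quote, unquote


def encode_component(text: str, permitted: str = "") -> str:
    """Percent-encode everything except unreserved characters and `permitted`.

    Non-ASCII characters are encoded as UTF-8 first, then percent-encoded.
    """
    return quote(text, safe=permitted, encoding="utf-8")


def decode_component(text: str) -> str:
    """Percent-decode a component, interpreting the bytes as UTF-8."""
    return unquote(text, encoding="utf-8")


print(encode_component("a@b"))                  # a%40b
print(encode_component("caf\u00e9"))            # caf%C3%A9
print(encode_component("a:b", permitted=":"))   # a:b
```

Separators are not passed through this helper at all: a builder would encode each component on its own and then join the results with the unencoded separator characters.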

@@ -486,3 +520,12 @@ License
~~~~~~~

This document is licensed under the MIT license

Definitions
~~~~~~~~~~~

[ASCII] See, e.g.,

- American National Standards Institute, "Coded Character Set -- 7-bit
American Standard Code for Information Interchange", ANSI X3.4, 1986.
- https://en.wikipedia.org/wiki/ASCII.
7 changes: 4 additions & 3 deletions faq.rst
@@ -6,7 +6,7 @@ Scheme

**QUESTION**: Can the ``scheme`` component be followed by a colon and two slashes, like a URI?

No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::
**ANSWER**: No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::

If a URI does not contain an authority component, then the path
cannot begin with two slash characters ("//").
@@ -24,9 +24,10 @@ For example, although these two purls are strictly equivalent, the first is in c

pkg://gem/ruby-advisory-db-check@0.12.4


**QUESTION**: Is the colon between ``scheme`` and ``type`` encoded? Can it be encoded? If yes, how?

The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
**ANSWER**: The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
Member

Suggested change
**ANSWER**: The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
**ANSWER**: The "Rules for each ``purl`` component" section provides that the ``scheme`` MUST be followed by an unencoded colon ':'.

Member Author

Good point -- fixed.


In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with malformed purls, but that's not a requirement.
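
The answer above could be sketched as a strict check on the start of the string. This is only an illustration: the function name ``split_scheme`` is hypothetical, the comparison is an exact match on ``pkg:`` (case handling and any recovery from malformed input are left out on purpose), and a lenient tool might choose to do more.

```python
def split_scheme(purl: str) -> str:
    """Return the text after the 'pkg:' prefix, or raise ValueError.

    The colon is a separator here, so it is matched literally and is
    never percent-decoded.
    """
    prefix = "pkg:"
    if not purl.startswith(prefix):
        raise ValueError(f"purl must start with {prefix!r}: {purl!r}")
    return purl[len(prefix):]


print(split_scheme("pkg:gem/ruby-advisory-db-check@0.12.4"))
# gem/ruby-advisory-db-check@0.12.4
```

An encoded colon, as in ``pkg%3Agem/...``, fails this check, which matches the statement that the separator is used as-is.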

@@ -37,7 +38,7 @@ Type
**QUESTION**: What behavior is expected from a purl spec implementation if a
``type`` contains a character like a slash '/' or a colon ':'?

The "Rules for each purl component" section provides that
**ANSWER**: The "Rules for each purl component" section provides that

[t]he package ``type`` MUST be composed only of ASCII letters and numbers,
Member

@johnmhoran Can we refine this with the new wording, and remove the weird square brackets in [t]he?

Member Author

@pombredanne I've fixed the use of square brackets (thanks for catching that) and will commit and push these updates. I'm not sure what you are referring to by "the new wording" aside from the square brackets -- please clarify as needed once the revised faq.rst has been pushed.

'.', '+' and '-' (period, plus, and dash)