Merged
Changes from 9 commits
112 changes: 71 additions & 41 deletions PURL-SPECIFICATION.rst
@@ -176,25 +176,30 @@ The rules for each component are:

- **qualifiers**:

- The ``qualifiers`` string is prefixed by a '?' separator when not empty
- This '?' is not part of the ``qualifiers``
- This is a query string composed of zero or more ``key=value`` pairs each
separated by a '&' ampersand. A ``key`` and ``value`` are separated by the equal
'=' character
- These '&' are not part of the ``key=value`` pairs.
- ``key`` must be unique within the keys of the ``qualifiers`` string
- ``value`` cannot be an empty string: a ``key=value`` pair with an empty ``value``
is the same as no key/value at all for this key
- For each pair of ``key`` = ``value``:

- The ``key`` must be composed only of ASCII letters and numbers, '.', '-' and
'_' (period, dash and underscore)
- A ``key`` cannot start with a number
- A ``key`` must NOT be percent-encoded
- A ``key`` is case insensitive. The canonical form is lowercase
- A ``key`` cannot contain spaces
- A ``value`` must be a percent-encoded string
- The '=' separator is neither part of the ``key`` nor of the ``value``
- The ``qualifiers`` component MUST be prefixed by an unencoded question
mark '?' separator when not empty. This '?' separator is not part of the
``qualifiers`` component.
- The ``qualifiers`` component is a query string composed of one or more
Member

Suggested change
- The ``qualifiers`` component is a query string composed of one or more
- The ``qualifiers`` component is a sequence of one or more

Member

This makes sense, as the query string reference does not bring anything special (though it has a specific meaning in the URL specs).
What about going even simpler, since "sequence" is also a new term:

Suggested change
- The ``qualifiers`` component is a query string composed of one or more
- The ``qualifiers`` component is composed of one or more

Member

@jkowalleck jkowalleck Mar 28, 2025

Yes, "sequence" would mean that things are in a certain order, and this is irrelevant.
Let's just use the "... is composed of one or more ..." phrase.

Member Author

Done -- changed to

The ``qualifiers`` component is composed of one or more

``key=value`` pairs. Multiple ``key=value`` pairs MUST be separated by an
unencoded ampersand '&'. This '&' separator is not part of the
``qualifiers`` component.
Member

Suggested change
``qualifiers`` component.
``qualifiers``' sub-components.

Member

@jkowalleck Let's not introduce a sub-component concept.

Member

@jkowalleck jkowalleck Mar 28, 2025

We still need to reword this: "This '&' separator is not part of the
qualifiers component."
The '&' is definitely part of the qualifiers component, but it is not part of the "key=value" part. What do we want to call this "key=value" part?
For path, we know path segments as the "items". Is there a proper name we can use here? I guess "qualifier" could be fitting?

Suggested change
``qualifiers`` component.
individual ``qualifier``s'.

Member Author

Done.


- A ``key`` and ``value`` MUST be separated by the unencoded equal sign '='
character. This '=' separator is not part of the ``key`` or ``value``.
- A ``value`` MUST NOT be an empty string: a ``key=value`` pair with an
empty ``value`` is the same as if no ``key=value`` pair exists for this
``key``.

- For each ``key=value`` pair:

- The ``key`` MUST be composed only of lowercase ASCII letters and numbers,
period '.', dash '-' and underscore '_'.
- A ``key`` MUST start with an ASCII letter.
- A ``key`` MUST NOT be percent-encoded.
- Each ``key`` MUST be unique among all the keys of the ``qualifiers``
component.
- A ``value`` MAY be composed of any character and all characters MUST be
encoded as described in the "Character encoding" section.
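
To illustrate the revised ``qualifiers`` rules above, here is a minimal Python sketch of a strict parser. The function name ``parse_qualifiers`` and the regular expression ``KEY_RE`` are assumptions made for this example and do not come from the specification. Decoding uses the standard-library ``urllib.parse.unquote``.

```python
import re
from urllib.parse import unquote

# Assumed reading of the key rule: lowercase ASCII letters and digits,
# '.', '-' and '_', starting with a letter.
KEY_RE = re.compile(r"[a-z][a-z0-9._-]*")


def parse_qualifiers(qualifiers):
    """Parse a qualifiers string (without the leading '?') into a dict.

    Hypothetical helper: rejects invalid or duplicate keys and drops
    pairs whose value is empty.
    """
    result = {}
    seen = set()
    if not qualifiers:
        return result
    for pair in qualifiers.split("&"):
        key, sep, value = pair.partition("=")
        if sep != "=":
            raise ValueError(f"missing '=' in qualifier: {pair!r}")
        if KEY_RE.fullmatch(key) is None:
            raise ValueError(f"invalid qualifier key: {key!r}")
        if key in seen:
            raise ValueError(f"duplicate qualifier key: {key!r}")
        seen.add(key)
        if value:
            # An empty value is treated as if the pair did not exist.
            result[key] = unquote(value)
    return result
```

This sketch raises on uppercase keys rather than lowercasing them, which is one possible strict reading of the revised wording; a lenient tool might normalize instead.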


- **subpath**:
@@ -206,46 +211,62 @@ The rules for each component are:
in the canonical form
- Each ``subpath`` segment MUST be a percent-encoded string
- When percent-decoded, a segment:

- MUST NOT contain a '/'
- MUST NOT be any of '..' or '.'
- MUST NOT be empty

- The ``subpath`` MUST be interpreted as relative to the root of the package
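
The ``subpath`` segment rules above can be sketched as a small strict validator. This is only an illustration: the name ``parse_subpath`` is hypothetical, the input is assumed to be a non-empty ``subpath`` without the leading '#', and rejecting bad segments (rather than discarding them) is an assumption of this example.

```python
from urllib.parse import unquote


def parse_subpath(subpath):
    """Return the percent-decoded segments of a subpath.

    Hypothetical strict helper: raises ValueError when a decoded
    segment is empty, is '.' or '..', or contains a '/'.
    """
    segments = []
    for raw in subpath.split("/"):
        segment = unquote(raw)
        if segment == "":
            raise ValueError("empty subpath segment")
        if segment in (".", ".."):
            raise ValueError(f"relative subpath segment: {segment!r}")
        if "/" in segment:
            raise ValueError(f"decoded segment contains '/': {segment!r}")
        segments.append(segment)
    return segments
```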


Character encoding
~~~~~~~~~~~~~~~~~~

For clarity and simplicity a ``purl`` is always an ASCII string. To ensure that
there is no ambiguity when parsing a ``purl``, separator characters and non-ASCII
characters must be UTF-encoded and then percent-encoded as defined at::
Permitted characters
--------------------

A canonical ``purl`` is an ASCII string composed of these characters:

- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-',
underscore '_' and tilde '~').

https://en.wikipedia.org/wiki/Percent-encoding
All other characters MUST be encoded as UTF-8 and then percent-encoded.
In addition, each component specifies its permitted characters and
its percent-encoding rules.
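
As a rough illustration of the "Permitted characters" text, the following Python sketch checks a string against the listed character set and percent-encodes a value. The names ``PERMITTED``, ``has_only_permitted_characters`` and ``encode_value`` are assumptions made for this example. ``encode_value`` relies on the standard-library ``urllib.parse.quote``, which encodes text as UTF-8 by default and leaves letters, digits and ``_.-~`` unencoded; individual components may define different rules.

```python
import string
from urllib.parse import quote

# Character set taken from the "Permitted characters" list.
PERMITTED = set(
    string.ascii_letters
    + string.digits
    + ":/@?=&#"  # purl separators
    + "%.-_~"    # permitted punctuation marks
)


def has_only_permitted_characters(purl):
    """Return True if every character is in the permitted set."""
    return all(ch in PERMITTED for ch in purl)


def encode_value(text):
    """Encode text as UTF-8, then percent-encode everything except
    ASCII letters, digits and '_.-~' (hypothetical generic helper)."""
    return quote(text, safe="")
```

For example, ``encode_value("café/1")`` yields ``caf%C3%A9%2F1``, which then passes ``has_only_permitted_characters``.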

Use these rules for percent-encoding and decoding ``purl`` components:

- the ``type`` must NOT be encoded and must NOT contain separators
``purl`` separators
-------------------

- the '#', '?', '@' and ':' characters must NOT be encoded when used as
separators. They may need to be encoded elsewhere
These ``purl`` separator characters MUST NOT be percent-encoded when used as
``purl`` separators:

- the ':' ``scheme`` and ``type`` separator does not need to and must NOT be encoded.
It is unambiguous unencoded everywhere
- ':' (colon) is the separator between ``scheme`` and ``type``
- '/' (slash) is the separator between ``type``, ``namespace`` and ``name``
- '/' (slash) is the separator between ``subpath`` segments
- '@' (at sign) is the separator between ``name`` and ``version``
- '?' (question mark) is the separator before ``qualifiers``
- '=' (equals) is the separator between a ``key`` and a ``value`` of a
``qualifier``
- '&' (ampersand) is the separator between ``qualifiers`` (each being a
``key=value`` pair)
- '#' (number sign) is the separator before ``subpath``

- the '/' used as ``type``/``namespace``/``name`` and ``subpath`` segments separator
does not need to and must NOT be percent-encoded. It is unambiguous unencoded
everywhere

- the '@' ``version`` separator must be encoded as ``%40`` elsewhere
- the '?' ``qualifiers`` separator must be encoded as ``%3F`` elsewhere
- the '=' ``qualifiers`` key/value separator must NOT be encoded
- the '#' ``subpath`` separator must be encoded as ``%23`` elsewhere
Percent-encoding rules
----------------------

- All non-ASCII characters must be encoded as UTF-8 and then percent-encoded
When applying percent-encoding or decoding to a string, use the rules of RFC
3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2).

It is OK to percent-encode ``purl`` components otherwise except for the ``type``.
Parsers and builders must always percent-decode and percent-encode ``purl``
components and component segments as explained in the "How to parse" and "How to
build" sections.
Each component defines when and how to apply percent-encoding and decoding to
its content.

When percent-encoding is required, all characters MUST be encoded except for
Member

This all looks good to me. IMHO, the last point of discussion left is this sentence. E.g., is the colon all we need as a generic rule (leaving aside the per-component rules)?

Member

@johnmhoran Let's move this section to a new PR for clarity as discussed in today's Ecma call.

Member Author

@pombredanne This has been removed and replaced with the version of "Character encoding" currently in main. I opened a new issue for the current "Character encoding" work:

the colon ':'.


How to build ``purl`` string from its components
@@ -486,3 +507,12 @@ License
~~~~~~~

This document is licensed under the MIT license

Definitions
~~~~~~~~~~~

[ASCII] See, e.g.,

- American National Standards Institute, "Coded Character Set -- 7-bit
American Standard Code for Information Interchange", ANSI X3.4, 1986.
- https://en.wikipedia.org/wiki/ASCII.
7 changes: 4 additions & 3 deletions faq.rst
@@ -6,7 +6,7 @@ Scheme

**QUESTION**: Can the ``scheme`` component be followed by a colon and two slashes, like a URI?

No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::
**ANSWER**: No. Since a ``purl`` never contains a URL Authority, its ``scheme`` should not be suffixed with double slash as in 'pkg://' and should use 'pkg:' instead. Otherwise this would be an invalid URI per RFC 3986 at https://tools.ietf.org/html/rfc3986#section-3.3::

If a URI does not contain an authority component, then the path
cannot begin with two slash characters ("//").
@@ -24,9 +24,10 @@ For example, although these two purls are strictly equivalent, the first is in c

pkg://gem/[email protected]


**QUESTION**: Is the colon between ``scheme`` and ``type`` encoded? Can it be encoded? If yes, how?

The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
**ANSWER**: The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
Member

Suggested change
**ANSWER**: The "Rules for each ``purl`` component" section provides that "[t]he ``scheme`` MUST be followed by an unencoded colon ':'.
**ANSWER**: The "Rules for each ``purl`` component" section provides that the ``scheme`` MUST be followed by an unencoded colon ':'.

Member Author

Good point -- fixed.


In this case, the colon ':' between ``scheme`` and ``type`` is being used as a separator, and consequently should be used as-is, never encoded and never requiring any decoding. Moreover, it should be a parsing error if the colon ':' does not come directly after 'pkg'. Tools are welcome to recover from this error to help with malformed purls, but that's not a requirement.
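
A minimal Python sketch of the check described in this answer follows. The function name ``strip_scheme`` is hypothetical, and matching the scheme exactly as lowercase 'pkg' is an assumption of this example. Leading slashes after the colon are dropped, following the FAQ's statement that the 'pkg://' form is equivalent to the 'pkg:' form.

```python
def strip_scheme(purl):
    """Return the part of a purl after the 'pkg:' scheme.

    Hypothetical helper: raises ValueError unless an unencoded ':'
    comes directly after 'pkg'. Slashes right after the colon are
    ignored, so 'pkg://' is accepted like 'pkg:'.
    """
    scheme, sep, remainder = purl.partition(":")
    if sep != ":" or scheme != "pkg":
        raise ValueError("purl must start with 'pkg:'")
    return remainder.lstrip("/")
```

A percent-encoded colon such as ``pkg%3Agem/...`` is rejected here; a more forgiving tool could try to recover, but that is optional per the answer above.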

@@ -37,7 +38,7 @@ Type
**QUESTION**: What behavior is expected from a purl spec implementation if a
``type`` contains a character like a slash '/' or a colon ':'?

The "Rules for each purl component" section provides that
**ANSWER**: The "Rules for each purl component" section provides that

[t]he package ``type`` MUST be composed only of ASCII letters and numbers,
Member

@johnmhoran Can we refine this with the new wording, and remove the weird square brackets in [t]he?

Member Author

@pombredanne I've fixed the use of square brackets (thanks for catching that) and will commit and push these updates. I'm not sure what you are referring to by "the new wording" aside from the square brackets -- please clarify as needed once the revised faq.rst has been pushed.

'.', '+' and '-' (period, plus, and dash)