Merged
73 changes: 43 additions & 30 deletions PURL-SPECIFICATION.rst
@@ -114,9 +114,11 @@ Rules for each ``purl`` component

A ``purl`` string is an ASCII URL string composed of seven components.

Some components are allowed to use other characters beyond ASCII: these
components must then be UTF-8-encoded strings and percent-encoded as defined in
the "Character encoding" section.
Except as expressly stated otherwise in this section, each component:

- MAY be composed of any of the characters defined in the "Permitted
characters" section
- MUST be encoded as defined in the "Character encoding" section

The rules for each component are:

@@ -219,30 +221,24 @@ The rules for each component are:
- The ``subpath`` MUST be interpreted as relative to the root of the package


Character encoding
~~~~~~~~~~~~~~~~~~

Permitted characters
--------------------

A canonical ``purl`` is an ASCII string composed of these characters:
~~~~~~~~~~~~~~~~~~~~

- alphanumeric characters ``A to Z``, ``a to z``, ``0 to 9``,
- the ``purl`` separators ``:/@?=&#`` (colon ':', slash '/', at sign '@',
question mark '?', equal sign '=', ampersand '&' and pound sign '#'), and
- these punctuation marks ``%.-_~`` (percent sign '%', period '.', dash '-',
underscore '_' and tilde '~').
A canonical ``purl`` is composed of these permitted ASCII characters:

All other characters MUST be encoded as UTF-8 and then percent-encoded.
In addition, each component specifies its permitted characters and
its percent-encoding rules.
- the Alphanumeric Characters: ``A to Z``, ``a to z``, ``0 to 9``,
- the Punctuation Characters: ``.-_~`` (period '.',
dash '-', underscore '_' and tilde '~'),
- the Plus Character: ``+`` (plus '+'),
- the Percent Character: ``%`` (percent sign '%'), and
- the Separator Characters ``:/@?=&#`` (colon ':', slash '/', at sign '@',
question mark '?', equal sign '=', ampersand '&' and pound sign '#').
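
As an illustration of the permitted-character list above, here is a minimal Python sketch that checks whether a string uses only those characters. The constant names and the ``uses_only_permitted_characters`` helper are hypothetical and not part of the specification; the check says nothing about whether the string is otherwise a valid ``purl``.

```python
import string

# Character classes as listed in the "Permitted characters" section.
# The names below are illustrative only; the specification does not define them.
ALPHANUMERIC = string.ascii_letters + string.digits
PUNCTUATION = ".-_~"
PLUS = "+"
PERCENT = "%"
SEPARATORS = ":/@?=&#"

PERMITTED = frozenset(ALPHANUMERIC + PUNCTUATION + PLUS + PERCENT + SEPARATORS)


def uses_only_permitted_characters(purl: str) -> bool:
    """Return True if every character of the string is a permitted ASCII character.

    This is a character-level check only; it does not validate structure.
    """
    return all(ch in PERMITTED for ch in purl)
```

For example, a string containing a raw space or a non-ASCII letter fails the check, while the same string with those characters percent-encoded passes.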


``purl`` separators
-------------------
~~~~~~~~~~~~~~~~~~~

These ``purl`` separator characters MUST NOT be percent-encoded when used as
``purl`` separators:
This is how each of the Separator Characters is used:

- ':' (colon) is the separator between ``scheme`` and ``type``
- '/' (slash) is the separator between ``type``, ``namespace`` and ``name``
@@ -256,17 +252,34 @@ These ``purl`` separator characters MUST NOT be percent-encoded when used as
- '#' (number sign) is the separator before ``subpath``


Percent-encoding rules
----------------------

When applying percent-encoding or decoding to a string, use the rules of RFC
3986 section 2 (https://datatracker.ietf.org/doc/html/rfc3986#section-2).

Each component defines when and how to apply percent-encoding and decoding to
its content.
Character encoding
~~~~~~~~~~~~~~~~~~

When percent-encoding is required, all characters MUST be encoded except for
the colon ':'.
- In the "Rules for each ``purl`` component" section, each component
defines when and how to apply percent-encoding and decoding to its content.
- When percent-encoding is required by a component definition, the component
string MUST first be encoded as UTF-8.
- In the component string, each "data octet" MUST be replaced by the
percent-encoded "character triplet" applying the percent-encoding mechanism
defined in RFC 3986 section 2.1 (https://datatracker.ietf.org/doc/html/rfc3986#section-2.1),
including the RFC definition of "data octet" and "character triplet",
and using these definitions for RFC's "allowed set" and "delimiters":

- "allowed set" is composed of the Alphanumeric Characters and the
Punctuation Characters
- "delimiters" is composed of the Separator Characters

- The following characters MUST NOT be percent-encoded:

- the Alphanumeric Characters,
- the Punctuation Characters,
- the Separator Characters when being used as ``purl`` separators,
- the colon ':', whether used as a Separator Character or otherwise, and
- the percent sign '%' when used to represent a percent-encoded character.

- Where the space ' ' is permitted, it MUST be percent-encoded as '%20'.
- With the exception of the percent-encoding mechanism, the rules regarding
percent-encoding are defined by this specification alone.
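
The encoding rules above can be sketched in Python with the standard library's ``urllib.parse.quote``, which encodes a string as UTF-8 and replaces each octet outside its safe set with a ``%XX`` triplet. This is a rough sketch under stated assumptions, not a reference implementation: ``encode_segment`` and ``decode_segment`` are hypothetical helper names, the input is assumed to be a single component segment with no ``purl`` separators in it, and the plus sign is percent-encoded here only because the rules above do not list it among the characters that MUST NOT be encoded.

```python
from urllib.parse import quote, unquote


def encode_segment(value: str) -> str:
    """Percent-encode one component segment (hypothetical helper).

    quote() encodes the string as UTF-8 first, always leaves ASCII letters,
    digits and the characters "_.-~" untouched, and emits uppercase hex.
    Passing safe=":" additionally keeps the colon unencoded, as the rules
    require. Everything else, including the space (as %20) and, by
    assumption, the plus sign, is percent-encoded.
    """
    return quote(value, safe=":")


def decode_segment(value: str) -> str:
    """Reverse encode_segment: percent-decode, interpreting octets as UTF-8."""
    return unquote(value)
```

A caller would apply ``encode_segment`` to each segment separately and join the results with the relevant Separator Characters, so that the separators themselves are never encoded. Whether a given component actually requires encoding is still decided by that component's own rules.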


How to build ``purl`` string from its components