From 2fce73351fed9941beeac7aebbbb2aae486e7dcf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lars=20Hellstr=C3=B6m?= <lars.hellstrom@mdh.se>
Date: Thu, 28 Sep 2017 17:03:50 +0200
Subject: [PATCH] Rewrite of technical.md and starting xml-for-om tutorial

One thing left untouched here is the example of a CD symbol definition
(transc1#log). That needs more attention, because the implication given
is wrong (see issue OpenMath/CDs#15), and also it seems to be based on
an older version of the CD (but the error is also in the present CD).

In addition, I've included some OMS URIs such as
  http://www.openmath.org/cd/nums1#rational
which currently fail with messages such as
  The requested URL /var/www/www.openmath.org/www/nums1.xhtml
  was not found on this server.
I think that is a misconfiguration on (the old) www.openmath.org however,
so it probably shouldn't call for a change here.
---
 technical.md        | 391 +++++++++++++++++++++++++++++++++++++-------
 xml-for-om/index.md | 100 +++++++++++
 2 files changed, 432 insertions(+), 59 deletions(-)
 create mode 100644 xml-for-om/index.md

diff --git a/technical.md b/technical.md
index 5c78a01f..02bc0bec 100644
--- a/technical.md
+++ b/technical.md
@@ -5,100 +5,212 @@ title: OpenMath, A Technical Overview
 
 ## Introduction
 
-OpenMath is a standard for representing mathematical data in as unambiguous a way as possible. It can be used to exchange mathematical objects between software packages or via email, or as a persistent data format in a database. It is tightly focussed on representing semantic information and is not intended to be used directly for presentation, although tools exist to facilitate this.
+OpenMath is a standard for representing mathematical data in as unambiguous a way as possible. It can be used to exchange mathematical objects between software packages or via email, or as a persistent data format in a database. It is tightly focussed on representing semantic information and is not intended to be used directly for presentation, although it is often a fair starting point for generating one; tools exist which take an OpenMath object as input and generate a presentation (e.g. LaTeX or MathML) as output. The formal definition of OpenMath is contained within [_The OpenMath Standard_](../standard/) and its accompanying documents.
 
-The original motivation for OpenMath came from the Computer Algebra community. Computer Algebra packages were getting bigger and more unwieldy, and it seemed reasonable to adopt a generic "plug and play" architecture to allow specialised programs to be used from general purpose environments. There were plenty of mechanisms for connecting software components together, but no common format for representing the underlying data objects. It quickly became clear that any standard had to be vendor-neutral and that objects encoded in OpenMath should not be too verbose. This has led to the design outlined below.
+The original motivation for OpenMath came from the Computer Algebra community, in particular the need to transfer data between different computer algebra systems. There were plenty of mechanisms for connecting software components, but when expressing mathematical objects, computer algebra systems will typically fall back to producing expressions (possibly prettyprinted) in their built-in command language for constructing said objects — a method that is both expensive to support externally and fragile when targeting software being actively developed.
 
-In 1998, the [Worldwide Web Consortium (W3C)](http://www.w3.org) produced its first recommendation for the [Extensible Markup Language (XML)](http://www.w3.org/xml), intended to be a universal format for representing structured information on the worldwide web. It was swiftly followed by the first [MathML](http://www.w3.org/math) recommendation which is an XML application oriented mainly towards the presentation (i.e. the rendering) of mathematical expressions.
-
-The formal definition of OpenMath is contained within [_The OpenMath Standard_](../standard/) and its accompanying documents, and the reader is referred there for more details.
+OpenMath rather favours the **math in the middle** paradigm, where the expression of a mathematical object being communicated is founded directly on mathematical logic, even if encoded in a machine-readable way. This means OpenMath can be used to speak about mathematical objects regardless of whether these are effectively computable or not; as with human mathematicians, ultimately the reader bears the burden of interpreting the content, and not all readers are expected to understand all that has been written, although the writer shares a responsibility to express those objects in an understandable way. When what is in the middle is purely mathematics, authors of software need not worry about what other programs their users might want to make a connection to, as their concern is rather that import or export of objects preserves the mathematical meaning of the data.
 
 ## The OpenMath Architecture
 
-The OpenMath representation of a mathematical structure is referred to as an _OpenMath object_. This is an abstract structure which is represented concretely via an _OpenMath encoding_. These encoded objects are what an OpenMath application would read and write, and in practice the OpenMath objects themselves almost never exist, except on paper. The advantage of this is that OpenMath is not tied to any one underlying mechanism: in the past we have used functional, SGML and binary encodings. The current favourite is XML, as described below, and we will tend to use XML notation when describing OpenMath objects (even though strictly speaking the XML representation is an encoding).
+The primary concept in the OpenMath standard is that of an _OpenMath object_: these are used for encoding both the mathematical objects themselves and statements about mathematical objects. Philosophically the OpenMath objects constitute a formal language, essentially equivalent to the language of terms and well-formed formulae that constitutes the basis of formalised mathematical logic (although not as minimalistic as the formal languages typically used in textbooks on mathematical logic, since OpenMath aims to support all contemporary mathematics, not just some fraction needed to formalise one approach to its foundations). OpenMath objects are abstract entities, for which there exist a number of concrete encodings as strings of characters or sequences of bytes; tutorials typically use the OpenMath-XML encoding in examples, but the standard also blesses OpenMath-binary and Content MathML as concrete encodings for OpenMath objects.
+
+A second important concept is that of a _content dictionary_ (CD): these are documents defining one or more _symbols_ for use in OpenMath objects. Symbols are used to denote pretty much any mathematical concept that one would ordinarily write a definition of: the sine function, the constant pi, the addition operation, the set of integers, the property of being continuous, the universal quantifier ∀, and so on. The definitions in a content dictionary may be anywhere from fully formal to informal handwaving; the point is mainly that the knowledgeable mathematician, after looking up a symbol in its content dictionary, should be able to tell _exactly_ what this symbol denotes (sometimes by identifying one among several different conventions that exist in the literature). Anyone may create a new content dictionary (although the set of dictionaries under the default `cdbase` is managed by the OpenMath Society). Different content dictionaries may define different symbols for the same mathematical concept; finding alignments between different symbols, and using those to translate between different dialects of "OpenMath object", are concerns at a higher level than that which is addressed by the OpenMath standard.
+
+Given the evolutionary nature of mathematics, it is clear that the set of CDs should be forever growing and never complete. Currently there are CDs for high-school mathematics, linear algebra, polynomials and group theory to name a few, and new contributions are always welcome. There is no requirement that applications use the standard set of CDs and it is often very useful to design a "private" CD for a specific purpose.
 
 ### OpenMath Objects
 
-Formally, an OpenMath object is a labelled tree whose leaves are the _basic_ OpenMath objects integers, IEEE double precision floats, unicode strings, byte arrays, variables or symbols. Of these, symbols are the most interesting since they consist of a name and a reference to a definition in an external document called a _content dictionary_ (or CD). Using XML notation where the element name <tt>OMS</tt> indicates an OpenMath symbol, the following: `<OMS name="sin" cd="transc1"/>` represents the usual sine function, as defined in the CD "transc1". A basic OpenMath object is an OpenMath object, although its XML representation will be:
+An OpenMath object is either _compound_ (built up from other OpenMath objects) or _basic_ (not possible to decompose that way). Among the basic objects we find _symbols_, _variables_, and _integers_, which should be familiar from mathematical logic textbooks (although integers are often treated as informal shorthands there, since they are not necessarily fundamental). There are three additional cases of basic OpenMath object, which were perhaps motivated more by computational practicality — _floating-point numbers_ (IEEE "doubles"), (Unicode) _character strings_, and _bytearrays_ — but mathematically these are no less well-defined than integers, and they have uses beyond the obvious programming ones.
+
+The most common case of compound object is an _application object_, which is built from a sequence of (one or) several objects, the first of which is called the _head_ of the application. Applications include both function application (head is the function, remaining objects are the arguments of the function) and relation/predicate application (head is the relation/predicate, remaining objects are the arguments). The second most common case of compound object is a _binding object_, which differs from application objects primarily in that it _binds_ one or several variables within the body of the binding; the way that quantifiers control variables is by being the heads of binding objects that bind said variables. OpenMath also has _attribution_ and _error_ as cases of compound object, but application and binding cover everything one tends to find in mathematical logic textbooks.
 
+### Object examples in OpenMath-XML encoding
+
+In the OpenMath-XML encoding, every OpenMath object (and subobject of an OpenMath object) is encoded as an XML element. Compound objects have one or several child elements as their content, whereas basic objects are leaves in the object tree. The tags of the OpenMath object elements all begin with <tt>OM</tt>, followed by a one- (or several) letter abbreviation of the type of object: `OMS` is for symbol objects, `OMV` is for variable objects, `OMI` is for integer objects, `OMF` is for floating-point objects, `OMSTR` is for string objects, and `OMB` is for bytearray objects. For the compound objects we similarly have `OMA` for application objects, `OMBIND` for binding objects, `OMATTR` for attribution objects, and `OME` for error objects.
+
+A symbol is characterised by three things: its _name_, the name of its _content dictionary_, and its _cdbase_; in the OpenMath-XML encoding these are encoded as the three attributes `name`, `cd`, and `cdbase` of an `OMS` element (although `cdbase` is rarely specified explicitly, since most symbols defined to date are in the default cdbase). The sine function is for example named `sin` in the `transc1` (transcendental functions 1) content dictionary, so the XML element encoding that OpenMath symbol object is
 ```XML
-<OMOBJ>
-   <OMS name="sin" cd="transc1"/>
+  <OMS cd="transc1" name="sin"/>
+```
+(Strictly speaking, the OpenMath-XML encoding defines a _document type_ for "OpenMath object", so it describes a class of XML documents where the contents are one OpenMath object. For the above symbol object to become a valid OpenMath-XML document, it must be given a slight wrapper, like so:
+```XML
+<OMOBJ xmlns="http://www.openmath.org/OpenMath">
+  <OMS cd="transc1" name="sin"/>
 </OMOBJ>
 ```
+See [_Introduction to XML_](../xml-for-om/) for more explanation of XML. For now we'll mostly ignore that wrapper since it only occurs at the outermost level.)
 
-OpenMath objects can be built up recursively in a number of ways. The simplest is function application, for example the expression _sin_(_x_) can be represented by the XML:
-
+A variable is fully characterised by its name alone, which similarly is specified using the `name` attribute, but of an `OMV` element. The formula sin(_x_) is thus encoded as
 ```XML
-<OMOBJ>
   <OMA>
-    <OMS name="sin" cd="transc1"/>
+    <OMS cd="transc1" name="sin"/>
     <OMV name="x"/>
   </OMA>
-</OMOBJ>
 ```
+Names are like identifiers in computer languages; technically OpenMath uses the same rules for these as XML (even in the non-XML encodings), so we're a bit more generous than traditional programming languages: hyphen (<tt>-</tt>) and period (<tt>.</tt>) are allowed in names (as long as they're not the first character), and many thousands of non-ASCII characters (for example from the Greek alphabet) are allowed as well.
 
-where <tt>OMV</tt> introduces a variable and <tt>OMA</tt> is the application element. Another straightforward method is _attribution_ which as the name suggests can be used to add additional information (for example "the AXIOM command which generated me was ...") to an object without altering its fundamental meaning. More interesting are _binding_ objects which are used to represent an expression containing bound variables, for example:
-
+The formula cos(π)=–1 can be encoded as
 ```XML
-<OMOBJ>
-  <OMA>
-    <OMS cd="calculus1" name="int"/>
-    <OMS cd="transc1" name="sin"/>
+  <OMA><OMS cd="relation1" name="eq"/>
+    <OMA>
+      <OMS cd="transc1" name="cos"/>
+      <OMS cd="nums1" name="pi"/>
+    </OMA>
+    <OMI> -1 </OMI>
   </OMA>
-</OMOBJ>
 ```
+Here we see that `OMI` elements, unlike `OMS` and `OMV` (but like `OMSTR` and `OMB`) elements, carry their "value" in the element contents (material between start-tag and end-tag) rather than in attributes of the element. Apart from these basic element contents, _there is no character data in OpenMath documents_ — whitespace before and after the element encoding an OpenMath object is irrelevant and can be ignored. Thus there is no difference between putting an opening `<OMA>` on the same line as the `<OMS>` which serves as its head, or putting them on separate lines, but some authors stylistically prefer doing it one way and others prefer doing it the other way. Nor does the indentation have any significance, but a consistent indentation scheme tends to make the code easier to read.
 
-represents the integral of the _sin_ function, but the encoding:
+What about bindings? The formula ∀_x_: sin<sup>2</sup>(_x_)+cos<sup>2</sup>(_x_)=1 becomes
+```XML
+  <OMBIND>
+    <OMS cd="quant1" name="forall"/>
+    <OMBVAR> <OMV name="x"/> </OMBVAR>
+    <OMA> <OMS cd="relation1" name="eq"/>
+      <OMA> <OMS cd="arith1" name="plus"/>
+        <OMA> <OMS cd="arith1" name="power"/>
+          <OMA> <OMS cd="transc1" name="sin"/> <OMV name="x"/> </OMA>
+          <OMI>2</OMI>
+        </OMA>
+        <OMA> <OMS cd="arith1" name="power"/>
+          <OMA> <OMS cd="transc1" name="cos"/> <OMV name="x"/> </OMA>
+          <OMI>2</OMI>
+        </OMA>
+      </OMA>
+      <OMI>1</OMI>
+    </OMA>
+  </OMBIND>
+```
+The `OMBVAR` element here does not correspond to an OpenMath (sub)object; it is rather a syntax-required wrapper around the sequence of variables being bound in this binding, because there may be more than one. 
 
+A more practical example of something one might want to transport from one Computer Algebra System to another is perhaps a polynomial. How would one then encode _x_<sup>5</sup> – 3_x_<sup>4</sup> + (3/14)_x_<sup>2</sup> – 7? A literal translation would be
 ```XML
-<OMOBJ>
-  <OMA>
-    <OMS cd="calculus1" name="int"/>
-    <OMBIND>
-      <OMS cd="fns1" name="lambda"/>
-      <OMBVAR> <OMV name="x"/> </OMBVAR>
-      <OMA>
-        <OMS name="sin" cd="transc1"/>
-        <OMV name="x"/>
+  <OMA> <OMS cd="arith1" name="plus"/>
+    <OMA> <OMS cd="arith1" name="power"/> <OMV name="x"/> <OMI>5</OMI> </OMA>
+    <OMA> <OMS cd="arith1" name="times"/>
+      <OMI>-3</OMI>
+      <OMA> <OMS cd="arith1" name="power"/> <OMV name="x"/> <OMI>4</OMI> </OMA>
+    </OMA>
+    <OMA> <OMS cd="arith1" name="times"/>
+      <OMA> <OMS cd="nums1" name="rational"/>
+        <OMI>3</OMI> <OMI>14</OMI>
       </OMA>
-    </OMBIND>
+      <OMA> <OMS cd="arith1" name="power"/> <OMV name="x"/> <OMI>2</OMI> </OMA>
+    </OMA>
+    <OMI>-7</OMI>
   </OMA>
-</OMOBJ>
 ```
+although it could be argued that this is not the polynomial as such, but rather the corresponding polynomial expression in _x_. To encode "the function mapping _x_ to _x_<sup>5</sup> – 3_x_<sup>4</sup> + (3/14)_x_<sup>2</sup> – 7", one can use the `lambda` binder as follows:
+```XML
+  <OMBIND>
+    <OMS cd="fns1" name="lambda"/>
+    <OMBVAR> <OMV name="x"/> </OMBVAR>
+    <OMA> <OMS cd="arith1" name="plus"/>
+      <OMA> <OMS cd="arith1" name="power"/> <OMV name="x"/> <OMI>5</OMI> </OMA>
+      <OMA> <OMS cd="arith1" name="times"/>
+        <OMI>-3</OMI>
+        <OMA> <OMS cd="arith1" name="power"/> <OMV name="x"/> <OMI>4</OMI> </OMA>
+      </OMA>
+      <OMA> <OMS cd="arith1" name="times"/>
+        <OMA> <OMS cd="nums1" name="rational"/>
+          <OMI>3</OMI> <OMI>14</OMI>
+        </OMA>
+        <OMA> <OMS cd="arith1" name="power"/> <OMV name="x"/> <OMI>2</OMI> </OMA>
+      </OMA>
+      <OMI>-7</OMI>
+    </OMA>
+  </OMBIND>
+```
+A practical advantage of using the latter is that the variable is now contained within the object, whereas with the former its interpretation might well depend on the context (a computer algebra system where _x_ has been assigned a numerical value would probably evaluate the former expression for that value of _x_ rather than keep it as a polynomial, which is not necessarily what one wants). It is fairly common to see idioms involving lambda for encoding things like "integral of expression" or "limit of expression", and presentation-generating tools usually need to recognise a number of these.
 
-represents ∫sin(_x_)dx. This may appear overly complicated but it is useful, for example when searching in a database for expressions which match ∫sin(_y_)dy . The definition of a symbol in the CD specifies whether or not it may be used to bind variables, which is why <tt><OMS cd="calculus1" name="int"/></tt> cannot be used as a binding symbol.
+### Other encodings
 
-The final kind of OpenMath object is an _error_ which is built up from a symbol describing the error and a sequence of OpenMath objects. For example:
+One criticism that can be raised is that the above looks like an awful lot of text to write for such a small polynomial; a Computer Algebra System might let you get away with just `x->x^5-3*x^4+3/14*x^2-7` for the same thing! A factor causing this difference is that OpenMath isn't playing favourites; certainly ordinary addition, multiplication, and exponentiation of numbers are common operations in elementary mathematics, but there is plenty of mathematics which is about very different things and using very different operations, so why should arithmetic get a lot of special shorthands? (Introducing alternative meanings of + etc. in special contexts works fine in print, but not so well for machine-readable content.) Another factor is however that XML was designed to be unambiguous and robust rather than compact – even an extra factor 10 in encoding size is usually not that big a deal if both reader and writer are computers, as long as the structural complexity is not increased – so even though it is probably well suited for some purposes (such as archiving your results), it need not be optimal for all. That is one reason we have alternative encodings.
+
+The binary encoding mostly replaces the spacious tags of XML with single bytes, and packs identifiers and numerical data tight. A line-by-line conversion of the above to binary, writing bytes either in hexadecimal or as `"string"` (meaning the UTF-8 encoding of the quoted characters), with spaces inserted for clarity, is just
+```
+  1a
+    080406 "fns1" "lambda"
+    1c 0501 "x" 1d
+    10 080604 "arith1" "plus"
+      10 080605 "arith1" "power" 0501 "x" 0105 11
+      10 080605 "arith1" "times" 
+        01fd
+        10 080605 "arith1" "power" 0501 "x" 0104 11
+      11
+      10 080605 "arith1" "times" 
+        10 080508 "nums1" "rational"
+          0103 010e
+        11
+        10 080605 "arith1" "power" 0501 "x" 0102 11
+      11
+      01f9
+    11
+  1b
+```
+Now the most spacious parts are the symbol names, due to not playing favourites. Even these may be further abbreviated by using _references_ to repeated subobjects:
+```
+  1a
+    080406 "fns1" "lambda"
+    1c 8501 "x" 1d
+    10 080604 "arith1" "plus"
+      10 880605 "arith1" "power" 1e00 0105 11
+      10 880605 "arith1" "times" 
+        01fd
+        10 1e01 1e00 0104 11
+      11
+      10 1e02 
+        10 080508 "nums1" "rational"
+          0103 010e
+        11
+        10 1e01 1e00 0102 11
+      11
+      01f9
+    11
+  1b
+```
+Here we are down to 61 bytes for the structure and another 56 bytes for names (from 506 significant characters in the XML encoding) — certainly still longer than `x->x^5-3*x^4+3/14*x^2-7`, although a fairer comparison would be with the fully parenthesised `x->((x^5)+((-3)*(x^4))+((3/14)*(x^2))+(-7))` since OpenMath makes no presumptions about operation priorities.
 
+The third standardised OpenMath encoding is (strict) Content MathML (cMML). This is a sibling of the more well-known _Presentation_ MathML (pMML) that is sometimes used for math formulae in web settings, but as the name says encoding _content_ (semantics) rather than _presentation_ (looks). Being another XML encoding, it is in most ways similar to OpenMath-XML, but the tag names are different, and cMML prefers to put names in element contents rather than attributes, so our initial `<OMS cd="transc1" name="sin"/>` example becomes
 ```XML
-<OMOBJ>
-  <OME>
-    <OMS name="unexpected_symbol" cd="error1">
-    <OMS name="sine" cd="transc1">
-  </OME>
-</OMOBJ>
+  <csymbol cd="transc1">sin</csymbol>
 ```
+in cMML.
 
-represents the error which might be generated when an application sees a symbol it doesn't recognise from a CD it thought it knew about.
+Historically there was also an encoding called _prefix_ or _functional_, with a syntax based on LISP. This was never codified in the standard, but the content dictionary pages on the OpenMath website still offer to show OpenMath objects in prefix notation.
 
-### OpenMath Encodings
+### Future encodings
+
+A candidate for becoming the fourth standard-codified OpenMath encoding is called [PopCorn](http://java.symcomp.org/FormalPopcorn.html); its niche is being a format that humans actually might stand to write. In PopCorn, "ordinary programming identifiers" are interpreted as symbol objects, whereas identifiers prefixed by a `$` are interpreted as variable objects, and "ordinary programming literals" are interpreted as the corresponding kind of basic OpenMath object (possibly a quoted symbol). Application objects are written like function application, with the head followed by a parenthesis containing the arguments, separated by commas.
+
+Unlike the currently standard-codified encodings, PopCorn _does_ play favourites, by defining punctuation shorthands for a small set of symbols, with associated priority rules. In PopCorn, the above polynomial _can_ be written
+```
+  fns1.lambda[$x->$x^5-3*$x^4+3//14*$x^2-7]
+```
+(although `fns1.lambda[$x->$x^5+-3*$x^4+3//14*$x^2+-7]` is technically more faithful to the object as originally encoded).
+
+There are also some schemes around that use LaTeX to encode OpenMath objects. Here there is often a dual goal of both encoding mathematical semantics (content) and typesetting a mathematical formula for print (presentation), with different schemes putting different weight on these goals. One point of including semantic information in documents aimed at print can be to improve searchability; knowing that `\Phi(x)` denotes the same thing as `\frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp(-u^2/2) du` typically makes a huge difference for whether formulae involving that `\Phi` constitute matches for a specific search. The most impressive (though not necessarily the most formalised) use of such semantic LaTeX macros to date is the [Digital Library of Mathematical Functions (DLMF)](http://dlmf.nist.gov).
 
-We have already seen some examples of the XML encoding, but it is by no means the only encoding. In the past there was a functional encoding (which looked like Lisp) and an SGML encoding which evolved into the current XML. Both of these are now obsolete, but there is still a binary encoding described in the [standard](../standard/) , which is much more compact than the XML one.
 
-In fact the XML encoding is not 100% XML. When XML was in its infancy the developers of OpenMath realised that it might become significant and decided to add some XML-like features to the SGML encoding so that an an OpenMath object could be encoded as valid XML. Thus it is currently the case that any well-formed OpenMath object encoded using the XML encoding as described in the [standard](../standard/) is a valid XML document. However, if one uses standard XML tools to generate an OpenMath object in the XML encoding from the DTD given in chapter 4 of the [standard](../standard/), it is possible that the result will not be valid OpenMath, although in practice this is highly unlikely. To cover all the possibilities allowed by XML would make it much more complicated to write an application to read any OpenMath object from scratch. Whether to adopt XML completely remains a hot topic of debate within the OpenMath community!
 
-Generally speaking, it is not intended that the existing encodings should be readable by a human user or writable by hand. It is desirable that they be compact and it is also desirable that they be linear, but neither of these is a requirement. It is a property of encodings that it is possible to convert between them with no loss of information.
+## More on Content Dictionaries
 
-### Content Dictionaries
+Every symbol has a corresponding Uniform Resource Identifier (URI), formed from the _cdbase_, content dictionary name _cd_, and symbol name _name_ as follows: <i>cdbase</i><tt>/</tt><i>cd</i><tt>#</tt><i>name</i>. Hence the symbol `<OMS cd="nums1" name="rational"/>` above, under the default cdbase `http://www.openmath.org/cd`, has the URI [http://www.openmath.org/cd/nums1#rational](http://www.openmath.org/cd/nums1#rational). Though not a formal requirement, it is a good idea to arrange things so that fetching the URI of a symbol will get you the content dictionary document defining that symbol; if nothing else, this helps decide who would win if two different content dictionary documents claim the same cdbase and content dictionary name.
 
-Content Dictionaries (or CDs for short) are the most important, and the most interesting, aspect of OpenMath because they define the meaning of the objects being transmitted. A CD is a collection of related symbols and their definitions, encoded in an XML format. Defining the meaning of a symbol is not a trivial task, and even referring to well-known references can be fraught with pitfalls Formal definitions and properties can be very useful but time-consuming to produce and verbose, not to mention difficult to get right. A symbol definition in an OpenMath CD consists of the following pieces of information:
+In practice, what one typically gets if entering a symbol URI into a web browser is not the formal document that constitutes the content dictionary, but an HTML rendering of it, scrolled to the symbol of interest. (This is because of content negotiation in the communication protocol: the web browser says it prefers to get the resource as HTML, so that is what the server delivers, if it can. More low-level user agents might ask for the information in another form.) Formal content dictionary documents are rather designed to be machine-readable (for example so that they could be used to generate on-line help in an integrated development environment); the concept is abstract, but the standard only codifies an XML encoding for content dictionaries.
+
+
+### Symbol definitions
+
+Most of the material in a CD consists of _symbol definitions_. A symbol definition in an OpenMath CD consists of the following pieces of information:
 
 *   the symbol name;
 *   a description in plain text;
+*   optionally, a _role_ (constant, application, binder, attribution, semantic-attribution, or error) of the symbol;
 *   optionally, a set of this symbol's properties in plain text (_Commented Mathematical Properties_, or _CMPs_);
 *   optionally, a set of this symbol's properties encoded in OpenMath (_Formal Mathematical Properties_, or _FMPs_);
-*   optionally, one or more examples of its use (encoded in OpenMath).
+*   optionally, one or more examples of its use (encoded in OpenMath and/or plain text).
 
 In practice the CMPs and FMPs can come as pairs, and often serve in the place of examples.
 
@@ -159,28 +271,189 @@ log 100 to base 10 (which is 2).
 </CDDefinition>
 ```
 
-This provides a symbol to represent the _log_ function by giving a pointer to a standard reference book. It provides the property that:
+This provides a symbol to represent the `log` function by giving a pointer to a standard reference book. It provides the property that:
 
 <center>a<sup>b</sup>=c  →  log<sub>a</sub>(c)=b</center>
 
 both as plain text and as OpenMath, and also gives an example of how the symbol is used.
 
-CDs usually consist of related symbols and collections of related CDs can be grouped together, for convenience, as _CD Groups_. One very important CD Group is that corresponding to the content part of MathML. The set of CDs produced by the OpenMath Society can be browsed [here](../cd/).
+A role provides a coarse kind of type declaration for a symbol (a symbol that is the head of an application object should be an application, a symbol that is the head of a binding object should be a binder, etc.), although this is advisory rather than normative. More detailed type information would be provided in a separate document parallel to the content dictionary. Simple signatures can be encoded using the [Simple Type System](../standard/sts.pdf), while more type-theoretically formal definitions are possible using the [Extended Calculus of Constructions](../standard/ecc.pdf); the OpenMath community does not mandate any single system. (Paraphrasing Shakespeare, we've found that there are more things in mathematics than are dreamt of in the philosophy of any one type theorist.)
 
-It is possible to associate extra information with CDs, in particular type information. Since there are many type systems available, each of which has its own strengths and advocates, the OpenMath community does not mandate any single system. Simple signatures can be encoded using the [Simple Type System](../standard/sts.pdf), while more formal definitions are possible using the [Extended Calculus of Constructorss](../standard/ecc.pdf). Other associated information can include style sheets for rendering OpenMath symbols in MathML, and mathematical definitions to be used by formal logic systems.
+The same approach of having documents parallel to the content dictionary that provide additional information about the symbols it defines can be used for many other purposes. There can be files providing notations for symbols, as one would need if generating presentations (e.g. MathML or LaTeX) for an OpenMath object. There can be files providing definitions to be used by formal logic systems (Agda, Coq, Isabelle/HOL, Mizar, Theorema, etc.). There is in most cases not even a common wrapper format that these systems could agree upon, so it is necessary to have separate files for each.
+
+As a rule of thumb, OpenMath symbols should not be regarded as verbs, since they are used to construct objects rather than to send commands. (A reader may _choose_ to interpret them as commands, but that is then a choice made by that particular party.) The presence of both nouns and verbs in a CD (e.g. "integral" and "integrate") is strongly discouraged.
+
+
+### Global information in content dictionaries
+
+In addition to the symbol definitions, content dictionaries contain the following global pieces of information:
+
+* A _name_, that becomes the _cd_ attribute of the symbols.
+* Optionally a _cdbase_ URI for the content dictionary. If no cdbase is specified, the content dictionary has the default cdbase.
+* Optionally (but highly recommended) a _description_ of the content dictionary, used to present the content dictionary in lists and tables.
+* A _status_, that is one of `official` (attaining this requires a formal decision by the OpenMath Society), `experimental`, `private`, and `obsolete`. The most common status is `experimental`; in practice it covers everything from "something I just cobbled together" to "candidate to become official".
+* _Revision metadata_, consisting of: _date_, _version_, _revision_ (minor version), and (optionally) a _review date_.
+
+Along with these, the "header" part of a content dictionary document typically also contains a license statement, wrapped up in a `CDComment` element. These statements are non-functional (in the sense that the computer doesn't care) and not mandatory, but they may make a difference for whether the content dictionary can be distributed with other pieces of software.
+
+Other uses for `CDComment` elements include providing a bibliography in the content dictionary document, and inserting section headings, if there are meaningful sections of symbols in the document. (Alternatively, a common style is to organise a content dictionary as a flat list of symbols in alphabetical order.)
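+
+As an illustration, the global information listed above might appear in a content dictionary file roughly as follows. This is a sketch only: the CD name `example1`, the cdbase `http://example.org/cd`, the license text, and all dates and version numbers are hypothetical, and the element names are assumed to follow those used in the `.ocd` files of the official CDs; the standard remains the authority on the exact schema.
+```XML
+  <CD xmlns="http://www.openmath.org/OpenMathCD">
+    <CDComment> Hypothetical license statement goes here. </CDComment>
+    <CDName> example1 </CDName>                 <!-- hypothetical name -->
+    <CDBase> http://example.org/cd </CDBase>    <!-- hypothetical cdbase -->
+    <Description> Symbols for a made-up example. </Description>
+    <CDStatus> experimental </CDStatus>
+    <CDDate> 2017-09-28 </CDDate>
+    <CDVersion> 1 </CDVersion>
+    <CDRevision> 0 </CDRevision>
+    <CDReviewDate> 2018-09-28 </CDReviewDate>   <!-- optional -->
+    <!-- CDDefinition elements for the individual symbols follow here. -->
+  </CD>
+```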
+
+
+### CD groups
+
+CDs usually consist of related symbols. Collections of related CDs can be grouped together, for convenience, as _CD Groups_. One very important CD Group is that corresponding to the [content part of MathML](https://www.w3.org/TR/MathML3/chapter4.html#contm.opel).
 
-Given the evolutionary nature of mathematics, it is clear that the set of CDs should be forever growing and never complete. Currently there are CDs for high-school mathematics, linear algebra, polynomials and group theory to name a few, and new contributions are always welcome. There is no requirement that applications use the standard set of CDs and it is often very useful to design a "private" CD for a specific purpose.
 
 ## OpenMath in Action
 
-There is no definitive way in which OpenMath should be used, as the protocol has been designed to be as flexible as possible. Nevertheless many OpenMath applications share common characteristics which we shall discuss here.
+A list of applications of OpenMath can be found in the [software & tools](../software/) section of this website. There is no definitive way in which OpenMath should be used, as the standard has been designed to be as flexible as possible. Nevertheless many OpenMath applications share common characteristics which we shall discuss here.
+
+The simplest use of OpenMath would be for recording the results of some (presumably nontrivial) calculation you've done; grant agencies and journals are increasingly demanding that research data are deposited in open databases. The simplest way to do so is to have your program output the results encoded in OpenMath, since that allows you to mathematically state what your result is. If for example you calculated the chromatic polynomials of all graphs on 10 vertices (hardly a remarkable feat today, but it serves as an example), then you can make a self-contained record of these results by encoding in OpenMath a statement of the form
+
+> The chromatic polynomial of the graph … is …, and the chromatic polynomial of the graph … is …, and the chromatic polynomial of the graph … is …, …
+
+and so on, until you've listed all graphs (up to isomorphism). A custom file format could no doubt be more compact, or allow for fast look-up, but then it would also need to be documented (which could be a minor math paper of its own) and reading it would require specialised tools (which need developing, debugging, and maintenance). Better then to keep it straightforward.
+
+Slightly more complicated is to use OpenMath as an import and export format for an application. It is unlikely that the internal data structures used by the application will be OpenMath (but neither do traditional computer algebra systems exclusively use their command languages as internal data structures), and so translation between the internal representations and (some encoding of) OpenMath will have to take place. The piece of software which does this is usually referred to as a _phrase-book_. The export part of a phrase-book can often be a straightforward recursion over the layers of the internal data structures being exported, whereas the import part may be trickier, depending on how flexible the application should be regarding how the data to import is structured.
+
+The next step up from import and export is to use OpenMath for interactive communication between two applications, e.g. a client program and a computational server, or a generalist computer algebra system and a domain-specific system; the two applications may communicate by sending OpenMath objects to each other, thereby removing the need for one system to be dependent upon the details of the data structures employed by the other system. A notable mechanism here is [SCSCP](../standard/index.html#symbolic-computation-software-composability-protocol-scscp), which is a remote procedure call protocol that uses OpenMath objects as messages. Remote procedure calls in general merely mean that one application can tell another to do something, but the level of integration can be low, especially if the data involved is nontrivial. With SCSCP, published functions in the remote application really become available as if they were defined locally.
+
+### Writing phrase-books
+
+It is possible to write a generic phrase-book which can handle any piece of OpenMath, but applications where this makes sense are few and far between, because few _applications_ can do something sensible with arbitrary mathematical formulae. If restricting attention to cases where the application can deal with the mathematics being handed to it, a phrase-book still faces challenges both in the area of vocabulary and in the area of grammar.
+
+In OpenMath, the vocabulary of a phrase-book is simply the set of symbols that it understands. Encountering an unknown symbol may raise an error, or it may trigger some default behaviour (e.g. treating the symbol as an identifier that has not been assigned any value). OpenMath carefully avoids defining one behaviour as "right" in every given circumstance, and leaves it up to the phrase-book writer to decide what to do. Systems with a built-in command language typically allow extending the phrase-book vocabulary with new symbol definitions, whereas simpler systems tend to be more limited.
+
+The challenges in the area of grammar are not with the OpenMath encodings, even though these technically are defined by way of a formal language grammar, but in the higher level issue of how the abstract OpenMath objects are structured. One way of saying that _S_ is the squaring function would be ∀_x_: _S_(_x_) = _x_<sup>2</sup>, i.e.,
+```XML
+  <OMBIND> <OMS cd="quant1" name="forall"/>
+    <OMBVAR> <OMV name="x"/> </OMBVAR>
+    <OMA> <OMS cd="relation1" name="eq"/>
+      <OMA> <OMV name="S"/> <OMV name="x"/> </OMA>
+      <OMA> <OMS cd="arith1" name="power"/>
+        <OMV name="x"/> <OMI> 2 </OMI>
+      </OMA>
+    </OMA>
+  </OMBIND>
+```
+but another would be
+```XML
+  <OMA> <OMS cd="relation1" name="eq"/>
+    <OMV name="S"/>
+    <OMBIND> <OMS cd="fns1" name="lambda"/>
+      <OMBVAR> <OMV name="x"/> </OMBVAR>
+      <OMA> <OMS cd="arith1" name="power"/>
+        <OMV name="x"/> <OMI> 2 </OMI>
+      </OMA>
+    </OMBIND>
+  </OMA>
+```
+In this case, it is entirely reasonable for a phrase-book to understand one but not the other (although it is also perfectly possible to make a phrase-book which understands both). Being able to understand _x_<sup>2</sup>–_x_–2 but not (_x_+1)(_x_–2), or more formally
+```XML
+  <OMA> <OMS cd="arith1" name="times"/>
+    <OMA> <OMS cd="arith1" name="plus"/>
+      <OMV name="x"/> <OMI>1</OMI>
+    </OMA>
+    <OMA> <OMS cd="arith1" name="minus"/>
+      <OMV name="x"/> <OMI>2</OMI>
+    </OMA>
+  </OMA>
+```
+(on account of the operations in the latter not being nested in the expected order plus–times–power) may on the other hand seem unreasonable, yet a simplistic phrase-book design could well end up exhibiting precisely this behaviour. A trick that is useful for avoiding many trivial limitations in the grammar supported by a phrase-book is to implement parsing as evaluation (for appropriate interpretations of application and binder heads); this means that the internal representation of a mathematical object is constructed step by step from simpler objects using mathematical operations, rather than allocated at the start and then filled in with data as it is encountered. Remember, though, that OpenMath symbols should primarily not be regarded as verbs, since they are used to construct objects rather than to send commands.
+
+This kind of limitation in application phrase-books could be a reason for having used the `nums1#rational` symbol rather than `arith1#divide` to encode the coefficient 3/14 in the polynomial _x_<sup>5</sup> – 3 _x_<sup>4</sup> + 3/14 _x_<sup>2</sup> – 7 used as example above: an application for (say) associative algebra need not be comfortable with the concept of division, since that in its area of expertise is undefined more often than not, but it will probably have to deal with rational numbers since there isn't any smaller field of characteristic zero. A strict reflection of that in the phrase-book would be to support `nums1#rational` (which can only be used to build rational numbers) but not `arith1#divide` (since the application typically doesn't know how to divide). A perfectly feasible pragmatic alternative for that associative algebra application would however be to treat `arith1#divide` as a synonym of `nums1#rational`, throwing an error whenever either operand isn't an integer. Either way, it may well be most natural for the phrase-book to use `nums1#rational` when exporting data, since the internal representation of a rational number is likely something more specialised than "quotient of two expressions".
+
+Writing a phrase-book may be non-trivial, and requires an understanding of the semantics of the underlying software. An OpenMath object need not map directly into a private object and vice-versa; for example, in some systems a rational number might have to be represented by a float, or a sparse matrix by a dense one. Indeed it is quite possible that a piece of software could have multiple phrase-books associated with it for different purposes.
+
+
+## Additional OpenMath object details
+
+### Basic objects
+
+The XML encoding of `OMI`s and `OMF`s can be in hexadecimal as well as decimal, if one wishes to avoid a radix conversion. The binary encoding is base-2 (well, perhaps rather base-256) only. In the case of integers, hexadecimal is signalled with an `x` before the first digit (but after the minus sign, if there is one). White space is ignored, so digits may be grouped. In the case of floating-point numbers, the value is given as an attribute `dec` or `hex` of the `OMF` tag, and the name of the attribute determines which format is used. A `hex` value consists of the 16 hexdigits needed to encode the 8 bytes of a 64-bit IEEE floating-point number (colloquially known as a "double"), whereas a `dec` value is an ordinary decimal fraction with optional exponent part, e.g. `-3.743e4`.
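+
+A few small fragments may make these rules concrete. They are a sketch based on the description above rather than a complete catalogue of permitted forms; the particular values are chosen only for illustration.
+```XML
+  <OMI> 255 </OMI>                 <!-- decimal integer -->
+  <OMI> xFF </OMI>                 <!-- the same integer in hexadecimal -->
+  <OMI> -xFF </OMI>                <!-- minus sign first, then the x -->
+  <OMI> 1 000 000 </OMI>           <!-- white space ignored, digits grouped -->
+  <OMF dec="-3.743e4"/>            <!-- decimal fraction with exponent -->
+  <OMF dec="1.0"/>                 <!-- the float 1.0 in decimal ... -->
+  <OMF hex="3FF0000000000000"/>    <!-- ... and as 16 hexdigits (IEEE double) -->
+```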
+
+Strings are encoded as explicit characters in the XML encoding (the `<`, `>`, and `&` characters need to be expressed using character entities, since they have special roles in XML syntax), whereas in the binary encoding they are encoded using UTF-8. Bytearrays are raw bytes in the binary encoding, whereas in the XML encoding they are encoded using base64 (so 3 bytes of data are encoded into 4 ASCII characters).
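+
+For instance, a string containing the three special characters, and a bytearray holding the three bytes that spell `Man` in ASCII (a sketch with values chosen only for illustration), would in the XML encoding be written as
+```XML
+  <OMSTR>x &lt; y &amp; y &gt; z</OMSTR>   <!-- the string: x < y & y > z -->
+  <OMB>TWFu</OMB>                          <!-- 3 bytes 4D 61 6E as 4 base64 characters -->
+```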
+
+### Attribution objects
+
+An _attribution object_ is a kind of compound object that takes a core object (which may be compound or basic) and decorates it with additional information, in the form of one or several _attribution pairs_. Each attribution pair consists of one symbol that serves as key specifying the meaning of this attribution, and one value which can be an arbitrary OpenMath object (or derived object, see below). Attributions are either _semantic_ (attached using a symbol with role `semantic-attribution`) or _adornment_ (preferred symbol role is `attribution`); adornment attributions can be removed without changing the meaning of an object.
+
+One common use of adornment attributions is to include hints about what might be a suitable presentation form of an OpenMath object; the [altenc](http://www.openmath.org/cd/altenc) content dictionary has symbols `MathML_encoding` and `LaTeX_encoding` with which one may attach suggestions for presentation forms of an object.
+```XML
+  <OMATTR>
+    <OMATP>
+      <OMS cd="altenc" name="LaTeX_encoding"/>
+      <OMSTR>J_0</OMSTR>
+    </OMATP>
+    <OMBIND> <OMS cd="fns1" name="lambda"/>
+      <OMBVAR> <OMV name="z"/> </OMBVAR>
+      <OMA> <OMS cd="hypergeo2" name="besselJ"/> 
+        <OMI>0</OMI> <OMV name="z"/>
+      </OMA>
+    </OMBIND>
+  </OMATTR>
+```
+suggests that `J_0` is a good LaTeX formula for the reduction of the two-variable [`hypergeo2#besselJ`](http://www.openmath.org/cd/hypergeo2#besselJ) function to a one-variable function by fixing the first (parameter) argument to 0.
+
+Adornment attributions do however not need to be non-mathematical. Another use can be to provide additional information that was computed during a calculation, but technically not asked for by the user. For example, if one asks a computer algebra system whether one polynomial is in the ideal generated by four others, then the answer is either [yes](http://www.openmath.org/cd/logic1#true) or [no](http://www.openmath.org/cd/logic1#false), which is just one bit of information. The standard algorithm for computing that answer would however be to first compute a Gröbner basis for the ideal, a step that (on average) is both computationally expensive and likely to be useful for future calculations involving that ideal. Hence it would make sense to attach an attribution stating "By the way, a Gröbner basis of that ideal is …" to that boolean result.
+
+Other uses for adornment attributions can be to attach information about the origin of an object (for example "the AXIOM/Maple/Mathematica/Maxima/Sage command which generated me was …") or how expensive it was to compute ("the calculation of this took 5031s and used 382MiB of RAM"). Yet another can be to record intermediate results of a computation by structural recursion, e.g. "this evaluates to 5.138" or "this subexpression is positive, and continuous as a function of _x_." Adornment attributions are allowed to carry information that influences the course of a computation; it is only the _meaning_ of an object that they cannot change.
+
+Semantic attributions do change the meaning of an object, although in most cases merely by providing "contextual" information to clarify some detail. This is not a mechanism usually found in textbooks on mathematical logic (since a richer language means more cases in every proof by structural induction, which is awkward) and it would be possible to do mathematics without it, but practical mathematics tends to have quite a lot of this. One example can be parameters of certain functions or relations (for which the alternative formalisation would be to make them explicit arguments), especially if the parameters typically can be ignored or left with default values, and only in a minority of cases need adjustment.
+
+One notable use of semantic attributions is type attributions on variables being quantified, as attributions can appear in the `OMBVAR` element. Printed mathematics often handles that kind of thing via conventions (e.g. boldface variables are vectors, the letters _i_ through _n_ denote integers), but when formalising it is better to make such assumptions explicit.
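+
+A minimal sketch of such a typed quantifier, stating that _n_<sup>2</sup> ≥ 0 for every integer _n_, is given below. The choice of `sts#type` as the attribution key and of `setname1#Z` as the value denoting the integers is an assumption made for illustration; other type systems would use other symbols.
+```XML
+  <OMBIND> <OMS cd="quant1" name="forall"/>
+    <OMBVAR>
+      <OMATTR>
+        <OMATP>
+          <OMS cd="sts" name="type"/>      <!-- assumed attribution key -->
+          <OMS cd="setname1" name="Z"/>    <!-- assumed type: the integers -->
+        </OMATP>
+        <OMV name="n"/>
+      </OMATTR>
+    </OMBVAR>
+    <OMA> <OMS cd="relation1" name="geq"/>
+      <OMA> <OMS cd="arith1" name="power"/>
+        <OMV name="n"/> <OMI> 2 </OMI>
+      </OMA>
+      <OMI> 0 </OMI>
+    </OMA>
+  </OMBIND>
+```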
+
+### Derived objects and error objects
+
+The value of an attribution can be an OpenMath object of arbitrary complexity, but it can also be an OpenMath _derived object_, containing foreign data; in the OpenMath-XML encoding, derived objects appear as `OMFOREIGN` elements. The most common use for derived objects is to adorn OpenMath objects with Presentation MathML renderings of these as formulae, but the data is by no means restricted to just that format. Indeed, the `encoding` attribute of an `OMFOREIGN` element permits using any XML namespace or MIME type to identify the format of the data.
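+
+As a sketch of what this can look like, the following adorns _x_<sup>2</sup> with a Presentation MathML rendering, using the `altenc#MathML_encoding` symbol as key. The `encoding` value `MathML-Presentation` is an assumption made for illustration; note how the default namespace is reset inside the `OMFOREIGN` element.
+```XML
+  <OMATTR xmlns="http://www.openmath.org/OpenMath">
+    <OMATP>
+      <OMS cd="altenc" name="MathML_encoding"/>
+      <OMFOREIGN encoding="MathML-Presentation">  <!-- assumed encoding value -->
+        <math xmlns="http://www.w3.org/1998/Math/MathML">
+          <msup> <mi>x</mi> <mn>2</mn> </msup>
+        </math>
+      </OMFOREIGN>
+    </OMATP>
+    <OMA> <OMS cd="arith1" name="power"/>
+      <OMV name="x"/> <OMI> 2 </OMI>
+    </OMA>
+  </OMATTR>
+```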
+
+From a purely logical perspective, arbitrary data may equally well (in particular, without undue overhead) be embedded into an OpenMath document as a bytearray or string object, but then one would have to rely on the mathematical context around that basic object for providing the meaning of it — a JPEG image embedded as a bytearray would probably have to be placed as an argument of some "decode JPEG data" function symbol for the interpretation to be clear — and whereas this can certainly be comprehended it does require any reader to have a fair understanding of the mathematical structure of the OpenMath object into which the data is embedded. Foreign objects that specify their own format allow tools with a more shallow understanding of the mathematics to access the embedded data; in the case of the XML encodings, this is further facilitated by having just one tree of XML elements which contains both OpenMath and non-OpenMath portions. Conversely there is also the matter that data encoded as an OpenMath object convey an expectation of having a mathematical interpretation, in a way that foreign objects do not; a JPEG image can reasonably be mathematised as a height by width by number of channels array of numbers, but a paragraph of HTML is much further from having a mathematical interpretation, so it makes correspondingly less sense to encode it as an OpenMath object proper.
+
+The final kind of OpenMath object is an _error_ which is built up from a symbol describing the error and a sequence of OpenMath objects. For example:
+```XML
+  <OME>
+    <OMS name="unexpected_symbol" cd="error1"/>
+    <OMS name="sine" cd="transc1"/>
+  </OME>
+```
+represents the error which might be generated when an application sees a symbol it doesn't recognise from a CD it thought it knew about. Error OpenMath objects are similar to application objects, having a head object followed by zero or more operands, but there are some differences. First, it represents an error, so a phrase-book encountering it may (but is not _required_ by the standard to) respond by raising an error/exception; further levels of compound object built around an error subobject may end up being ignored. Second, the head of an error object must be a symbol object (preferably with a role of error); if you need a parametrised family of errors, then use one or several operands for that.
+```XML
+  <OME>
+    <OMS cd="error2" name="oserror"/> <!-- Not an existing CD. -->
+    <OMI> -1728 </OMI> <!-- Presumably an OS-specific error code,
+       which the program didn't know what it meant and is just passing
+       on to the user. -->
+  </OME>
+```
+Third, the operands need not all be OpenMath objects; an operand of an error may alternatively be a derived object. This is practical, since pieces of data describing an error need not be mathematical in nature.
+
+### References
+
+Though not corresponding to another kind of OpenMath object, there is one more element type that can be used where an OpenMath object is expected, namely that of `OMR` references. The mechanism is that any OpenMath-XML element encoding an OpenMath object may carry an `id` attribute that associates an identifier with that element, and when the `href` attribute of an `OMR` element references this identifier it means "the same object as that one". References may be forward as well as backward, and even external (to OpenMath objects in other documents, not necessarily employing the same encoding of OpenMath), but they must not be cyclic, not even indirectly. (Someone who wishes to describe a periodic stream or some other kind of infinite data structure in OpenMath should employ explicitly stream-constructing operations or the like to do so; this is anyway more flexible than abusing references would be.) An OpenMath-XML document with an `OMR` element is always equivalent to the same document with the `OMR` replaced by a copy of the element it references (although some adjusting of the attributes is needed: the copy must have the `id` attribute removed, and may need to have a `cdbase` attribute inserted), and therefore a document with references is always equivalent to some document without references.
+
+The way it looks is that
+```XML
+  <OMA> <OMS cd="arith1" name="times"/>
+    <OMI id="three"> 3 </OMI>
+    <OMR href="#three"/>
+  </OMA>
+```
+encodes the formula 3*3, using a reference to the first 3 (carrying an identifier of `three`) to express the second. Note that there is a `#` in the `href` value but not in the `id` value; this is because the identifier becomes a "fragment identifier" in the URI, and thus comes after the `#`.
+
+
+## Links to more information
+
+* [The OpenMath standard](../standard/)
+* The set of CDs produced by _or contributed to_ the OpenMath Society can be browsed [here](../cd/).
+
+### Various related standards and specifications
 
-Suppose that we wish to have two applications communicating by sending OpenMath objects to each other, e.g. a client program and a computational server. It is unlikely that the internal data structures used by the applications will be OpenMath, and so translation between the internal representations and OpenMath (almost certainly OpenMath encodings rather than objects) will have to take place. The piece of software which does this is usually referred to as a _phrase-book_.
+* [MathML](https://www.w3.org/TR/MathML3/Overview.html)
+* [PopCorn](http://java.symcomp.org/FormalPopcorn.html)
+* [SCSCP](../standard/index.html#symbolic-computation-software-composability-protocol-scscp)
 
-It is possible to write a generic phrase-book which can handle any piece of OpenMath, but applications where this makes sense are few and far between. In practice an OpenMath phrase book will usually only handle a fixed set of CDs (and hence a fixed set of symbols). What "handle" means will vary from case to case: a computer algebra system will usually try and evaluate its input and return a result or an error, while a typesetter will print its input according to some rendering rules and not return anything. OpenMath carefully avoids defining what the "right" behaviour is in a given circumstance, and leaves that up to the phrase-book writer. Indeed it is quite possible that a piece of software could have multiple phrase-books associated with it for different purposes. OpenMath symbols should not be regarded as verbs since they are used to construct objects rather than to send commands, and the presence of both nouns and verbs in a CD (e.g. "integral" and "integrate") is strongly discouraged.
+### Tutorials and introductions
 
-Writing a phrase-book may be non-trivial, and requires an understanding of the semantics of the underlying software. An OpenMath object may not map directly into a private object and vice-versa, for example in some systems a rational number might have to be represented by a float, or a sparse matrix by a dense one.
+* [An OpenMath-oriented introduction to XML](../xml-for-om/)
 
-The OpenMath standard includes a section on compliance, which describes the behaviour of an OpenMath application when certain errors occur. It also insists that all compliant software has the capability to use the XML encoding, to guarantee a degree of interoperability. This is an area where the standard is expected to evolve as more OpenMath applications become available.
 
-A list of applications of OpenMath can be found in the [software & tools](../software/) section of this website.
diff --git a/xml-for-om/index.md b/xml-for-om/index.md
new file mode 100644
index 00000000..27e6f973
--- /dev/null
+++ b/xml-for-om/index.md
@@ -0,0 +1,100 @@
+---
+layout: page 
+title: A Tutorial on XML for OpenMath
+---
+
+OpenMath as such does not require using XML — there is the binary encoding, if one prefers that — but there is quite a lot of XML in the OpenMath software ecosystem, so a user of OpenMath will probably be better off knowing XML than not. However, since XML is typically not a topic that features much in mathematics curricula, it is also something that mathematicians rarely find themselves already familiar with. Hence this OpenMath-oriented tutorial on XML.
+
+## The basics
+
+XML, like HTML, belongs to the [SGML](https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language) family of computer languages, so a lot of the basics will probably look familiar if you've ever edited HTML code. There are some differences, which we'll summarise below, but familiarity with HTML is not a prerequisite to learning XML.
+
+Fundamentally, XML files consist of text interspersed with tags. A _tag_ begins with `<` and ends with `>`; this is similar to how LaTeX uses the `\` character to mark the beginning of a control sequence. The `<` and `>` are meant to be thought of as left and right angle brackets, but technically they are the ASCII less-than and greater-than characters. A third character with a special role in XML syntax is `&`, which heads an "entity"; these are mainly used to express characters that have special meaning in XML syntax without invoking those meanings: the entity for `<` is `&lt;`, that for `>` is `&gt;`, and that for `&` itself is `&amp;`.
+
+Tags are used to encode _elements_, which may be nested. The range of an XML file that encodes a specific element begins with a start-tag, e.g. `<OMA>`, ends with the matching end-tag, i.e. `</OMA>`, and everything between these two constitutes the _contents_ of the element. When the contents are empty, there is also a shorter empty-element form of the tag that has the `/` just before the closing `>`; thus `<OMA/>` would be equivalent to `<OMA></OMA>`, although neither of these two occurs in practice since the contents of an OpenMath `OMA` are the elements encoding its subobjects, of which there must be at least one: the head. The contents of the `OMS`, `OMV`, and `OMF` elements are on the other hand always empty, so their tags are typically seen in empty-element form (although a start-tag end-tag combination is technically a possible alternative).
+
+Elements have a _type_, which is the name that follows the opening `<` of a start-tag or empty-tag, and which is repeated after the `</` of an end-tag. Elements may also carry zero or more _attributes_, in the form of equations <i>name</i><tt>="</tt><i>value</i><tt>"</tt> that come after the type in the start-tag or empty-tag. The element type controls which attributes are allowed, and whether they are required or optional. The order of the attributes is not significant. In an `OMS` element the `cd` and `name` attributes are required, whereas the `cdbase` and `id` attributes are optional. Putting these things together, we see that in
+```XML
+  <OMA>
+    <OMS cd="arith1" name="power"/>
+    <OMS cd="nums1" name="e"/>
+    <OMA>
+      <OMS cd="arith1" name="times"/>
+      <OMS cd="nums1" name="i"/>
+      <OMS cd="nums1" name="pi"/>
+    </OMA>
+  </OMA>
+```
+there are 7 elements, 5 of which have type OMS and 2 of which have type OMA. Each of the OMA elements has 3 _child_ elements in its contents, and the outer OMA element has 3 "grandchildren", namely the 3 children of the inner OMA element that is the last child of the outer OMA element. (Oh, and in OpenMath that XML snippet encodes the Euler e<sup>iπ</sup> formula, but you probably guessed that one.)
+
+So far the tags, but what of the text between them? This is a point where different _applications_ of XML (roughly: specific file formats based on XML) can go in different directions. One extreme is the _markup-oriented_, where the text is viewed as the primary contents of a document, whereas the tags merely serve to structure and enrich the text; XML is an abbreviation of eXtensible Markup Language, so this is from where it is coming. The other extreme is the _data-oriented_, where the information is all in the element structure, and the only characters outside the tags are ignored whitespace (for newlines and indentation). OpenMath-XML is highly data-oriented (the only significant data outside tags are the character data in the contents of `OMI`, `OMSTR`, and `OMB` elements), whereas the Content Dictionary XML format (.ocd files) is fairly markup-oriented. The Content MathML encoding is more markup-oriented than the OpenMath-XML encoding, but still predominantly on the data side of the scale.
+
+### XML vs. HTML
+
+Though belonging to the same language family, there are some points on which the basic syntaxes of XML and HTML differ:
+* XML distinguishes upper and lower case, whereas HTML does not. `<OMA>`, `<OMa>` and `<oma>` are thus different tags in XML, but the same tag in HTML.
+* Some attribute values in HTML may appear bare (without quotes), but in XML all attribute values must be quoted. The quotes can be double (`"`, as in the examples above) or single (`'`), but they may not be omitted.
+* HTML uses the start form of tags (no slash) even for empty tags, which in principle can lead to ambiguities, but in XML every start-tag must have a matching end-tag.
+
+None of these is a major concern programming-wise, but some authors may need to adjust their typing habits for writing XML rather than HTML.
+
+### XML namespaces
+
+The part of basic XML that is the least intuitive is the namespace handling, mostly because it was added as an afterthought. The problem that namespaces solve is that different communities are likely to want to use the same tag name to mean different things; for example, MathML's `<ci>` (i = identifier = variable) and `<cn>` (n = number) element types are so short that one must expect many other applications of XML to have used the same names to denote something completely different. A _namespace_ mechanism solves this by having every element type (and likewise element attribute) be a pair of namespace and name, so that only the `ci` in the MathML namespace denotes a Content MathML identifier; a similarly named element type in another namespace will be completely unrelated, so there is no risk of confusion even if one were to use both in the same document.
+
+The unintuitive part is the way that namespaces are specified syntactically. XML namespaces are URIs, so they tend to look like web addresses — the namespace for MathML is for example `http://www.w3.org/1998/Math/MathML`, the namespace for OpenMath-XML is `http://www.openmath.org/OpenMath`, and the namespace for OpenMath CDs is `http://www.openmath.org/OpenMathCD` — which are a bit too long to repeat in every tag. Instead one defines a _prefix_ as denoting the namespace, and uses that prefix in the actual tag types; the prefix and name proper are separated by a colon. To define a prefix, one sets (in any element) an attribute which has prefix `xmlns` and the prefix as name, to the wanted namespace URI; then that prefix denotes that namespace throughout that element (its descendants included, unless one of them redefines the prefix). Thus in defining `bar` as being a prefix for the OpenMath-XML namespace, one might write
+```XML
+  <bar:OMA xmlns:bar="http://www.openmath.org/OpenMath">
+    <bar:OMS cd="arith1" name="plus"/>
+    <bar:OMI> 2 </bar:OMI>
+    <bar:OMI> 2 </bar:OMI>
+  </bar:OMA>
+```
+for the formula 2+2. Note that the attributes `cd` and `name` carry no prefix: OpenMath-XML attributes are not in any namespace, and an unprefixed attribute never picks one up from its element. The reason we usually don't see any such prefix in examples of OpenMath-XML is that the prefix may also be _empty_, in which case even the colon is omitted, so to a proper XML parser the above is completely equivalent to
+```XML
+  <OMA xmlns="http://www.openmath.org/OpenMath">
+    <OMS cd="arith1" name="plus"/>
+    <OMI> 2 </OMI>
+    <OMI> 2 </OMI>
+  </OMA>
+```
+Since all element types involved in an OpenMath object live in the same namespace, one normally has an `xmlns="http://www.openmath.org/OpenMath"` on the root element of an OpenMath-XML document and ignores namespaces thereafter (possibly resetting it in the contents of an `OMFOREIGN`), whereas in an XHTML+MathML document the namespaces are more mixed and therefore a prefix makes more sense. XSLT (a technology which we will get to later) has heavy mixing of namespaces, and therefore employing one or several prefixes is pretty much unavoidable there.
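+
+A quick way to convince oneself of this equivalence is to let a namespace-aware parser read both forms. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the variable and function names are made up for the illustration, and the attributes are written without a prefix in both documents, since the default namespace never applies to attributes.
+```python
+import xml.etree.ElementTree as ET
+
+OM_NS = "http://www.openmath.org/OpenMath"
+
+# The formula 2+2, once with the (arbitrary) prefix "bar" ...
+prefixed = """
+<bar:OMA xmlns:bar="http://www.openmath.org/OpenMath">
+  <bar:OMS cd="arith1" name="plus"/>
+  <bar:OMI> 2 </bar:OMI>
+  <bar:OMI> 2 </bar:OMI>
+</bar:OMA>
+"""
+
+# ... and once with the empty prefix (default namespace).
+default = """
+<OMA xmlns="http://www.openmath.org/OpenMath">
+  <OMS cd="arith1" name="plus"/>
+  <OMI> 2 </OMI>
+  <OMI> 2 </OMI>
+</OMA>
+"""
+
+def expanded_names(xml_text):
+    """Return (tag, attributes) for every element, in document order."""
+    root = ET.fromstring(xml_text)
+    return [(elem.tag, dict(elem.attrib)) for elem in root.iter()]
+
+names_prefixed = expanded_names(prefixed)
+names_default = expanded_names(default)
+
+# ElementTree writes an expanded name as "{namespace}localname".
+root_is_om_oma = names_prefixed[0][0] == "{" + OM_NS + "}OMA"
+print(root_is_om_oma)
+print(names_prefixed == names_default)
+```
+The prefix itself is gone after parsing; only the pair of namespace and local name remains, which is why the choice of prefix is immaterial.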
+
+The downside of namespaces is that they make it harder for humans to read XML code: two fragments of XML can look the same but mean different things, or look different but actually parse as the same thing, due to differences in namespace prefix definitions. Copying a block of code from some other document _can_ therefore be trickier than one thinks. Another known pitfall is that namespace URIs are long and look very structured, but in practice are often rather arbitrary strings; the only reliable way to get one right is to look up what it should be and copy the full string (in particular, trying to produce the right string by modifying the name of a related namespace fails surprisingly often).
+
+### The full story, and what parts to skip
+
+For those who like to go to the source, there is the [Extensible Markup Language (XML)](https://www.w3.org/TR/xml/) specification. There are however a number of features in that specification which in practice turn out to be of little or no relevance, at least within the OpenMath technological ecosystem.
+
+XML files can start with an _XML declaration_, which looks like
+```XML
+<?xml version="1.0"?>
+```
+(i.e., like a tag with extra `?` at the start and end), but they don't have to have a declaration. The main extra functionality of providing one is that you can specify the character encoding used for the document, which is useful if there are explicit non-ASCII characters there, but less of an issue now than in years past.
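+
+As a small illustration of the encoding point, the sketch below (Python's standard `xml.etree.ElementTree`; the element name `word` and the sample text are made up) stores the same document as bytes in two encodings. Only the non-UTF-8 one needs a declaration.
+```python
+import xml.etree.ElementTree as ET
+
+# ISO-8859-1 bytes: the declaration tells the parser how to decode them.
+latin1_bytes = (
+    '<?xml version="1.0" encoding="ISO-8859-1"?><word>Möbius</word>'
+).encode("iso-8859-1")
+
+# UTF-8 bytes: no declaration needed, since UTF-8 is the default.
+utf8_bytes = "<word>Möbius</word>".encode("utf-8")
+
+text_latin1 = ET.fromstring(latin1_bytes).text
+text_utf8 = ET.fromstring(utf8_bytes).text
+print(text_latin1 == text_utf8)
+```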
+
+In the prologue of an XML file, there can also be a _document type declaration_, which contains or points to a document type definition (DTD) and can look like
+```XML
+<!DOCTYPE foo [
+   <!ELEMENT foo (foo | bar)+ >
+   <!ELEMENT bar (#PCDATA) >
+   <!ATTLIST foo  baz CDATA #IMPLIED >
+] >
+```
+The point of these is to give a grammar for the application of XML that the present document exercises, in particular which elements may be children of which elements, whether there are restrictions on their order, and which attributes elements may have. Most of the DTD can be placed in a separate file (and reused by many documents), so the grammar does not itself have to be included in every document, _but it can be_.
+
+Document type declarations were practical necessities in SGML, because that has such a flexible syntax that documents often cannot be parsed without reference to the DTD for resolving ambiguities, but XML is more rigid and therefore eminently parsable without this kind of grammatical assistance. (It is still possible to use DTDs to obfuscate XML, but this is seldom practical or useful.) DTDs can also be used to _validate_ XML documents, which basically means checking that they are structured in the way that they themselves claim they are structured, but for that there are also various other technologies available. The OpenMath standard uses RELAX NG schemata for defining what it means to be valid OpenMath-XML and content dictionary documents.
+
+DTDs may also define various kinds of _entities_ (general entities, which may be parsed or unparsed, and parameter entities), which are a kind of macro mechanism, but that mostly seems to be used for creating more complex DTDs (which OpenMath doesn't really use anyway). The only entity references commonly seen in practice are those to the five predefined entities (`&lt;`, `&gt;`, `&amp;`, `&quot;`, and `&apos;`); alongside them one meets the numeric character references (<tt>&#</tt>_decnum_<tt>;</tt> and <tt>&#x</tt>_hexnum_<tt>;</tt>), which look similar but strictly speaking are not entities.
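+
+To see that these notations are only different spellings of the same characters, one can feed them to a parser. The following sketch uses Python's standard `xml.etree.ElementTree`; the element name `p` and the sample text are arbitrary.
+```python
+import xml.etree.ElementTree as ET
+
+# "<" written three ways: predefined entity, decimal and hexadecimal
+# character reference; "&" written as a predefined entity.
+doc = "<p>a &lt; b &#60; c &#x3C; d &amp; e</p>"
+
+text = ET.fromstring(doc).text
+print(text)
+```
+After parsing, nothing remains of how each character was originally written.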
+
+Throughout an XML file, there may be _processing instructions_, which look like
+```XML
+<?target pseudoattribute="value"?>
+```
+These may in principle be used to pass instructions to the processor reading an XML document, but there are no notable examples of this in the context of OpenMath.
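+
+The term "pseudoattribute" is apt, because an XML parser does not interpret the content of a processing instruction at all. A minimal sketch with Python's standard `xml.dom.minidom` (the root element name `root` is made up) shows that everything after the target is handed over as one opaque string.
+```python
+from xml.dom import minidom
+
+doc = minidom.parseString('<?target pseudoattribute="value"?><root/>')
+
+# The processing instruction precedes the root element in the document.
+pi = doc.firstChild
+is_pi = pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
+
+# target is the name after "<?"; data is the rest, kept as plain text.
+print(is_pi, pi.target, pi.data)
+```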
+
+
+
+## Doing stuff
+
+(This section is under construction.)
+