Tokenizing

Tokenization is the first phase of parsing. It's the point where characters are extracted from the source string and grouped. There are four kinds of tokens:

  1. numbers
  2. functions
  3. variables
  4. operators
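
For illustration, these four kinds can be pictured as a simple enum (a hypothetical Swift sketch, not the library's actual types; the sketches in later sections reuse it):

    enum Token {
        case number(Double)      // e.g. 3, 0.5, 2e10
        case function(String)    // e.g. "sin", "max"
        case variable(String)    // e.g. "x", "Inigo Montoya"
        case op(String)          // e.g. "+", "*", "(", ")"
    }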

Numbers

All numbers extracted from the source string are positive, and are defined as anything that matches the following regular expression:

\d*(\.\d*)?([eE][-+]?\d+)?

Even though "." is technically recognized by this regular expression, it is not evaluated as a number.
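
As a rough sketch (assuming a hypothetical numberToken helper, not the library's implementation), anchored matching against that pattern could look like this:

    import Foundation

    // Matches are always positive; a leading "-" would be tokenized as an
    // operator, so "-3" becomes the "-" operator followed by the number 3.
    let numberPattern = #"\d*(\.\d*)?([eE][-+]?\d+)?"#

    func numberToken(from source: Substring) -> Double? {
        guard let range = source.range(of: numberPattern,
                                       options: [.regularExpression, .anchored]),
              !range.isEmpty else { return nil }
        return Double(source[range])   // "." matches the pattern but converts to nil
    }

    numberToken(from: "2.5e-3 + 1")   // 0.0025
    numberToken(from: ".")            // nil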

No effort is made to allow for locale-sensitive numbers. Allowing things like thousands groupings or the comma as the decimal separator would introduce ambiguity into the parser. For example, if "," were recognized as the decimal separator, then this is ambiguous:

max(1,2,3)

Should that be parsed as the maximum of three numbers (1, 2, and 3), or the maximum of two (1.2 and 3, or 1 and 2.3)? Similar problems arise when dealing with thousands groupings. As such, numbers are not locale-sensitive.

Functions

Function tokens are strictly the name of a function. For example, given the string "sin(0)", the extracted function token is "sin".

Function names can contain letters (both uppercase and lowercase), decimal digits, and underscores.

The exceptions to this are three functions that are special-cased during recognition: "π", "Φ", and "τ". These correspond to the mathematical constants pi, phi, and tau, respectively.
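
A minimal sketch of that recognition (assuming a hypothetical functionToken helper, not the library's API):

    // π, Φ, and τ are special-cased as single-character function names;
    // everything else is a run of letters, decimal digits, and underscores.
    let constantFunctions: Set<Character> = ["π", "Φ", "τ"]

    func functionToken(from source: Substring) -> String? {
        if let first = source.first, constantFunctions.contains(first) {
            return String(first)
        }
        let name = source.prefix { $0.isLetter || ("0"..."9").contains($0) || $0 == "_" }
        return name.isEmpty ? nil : String(name)
    }

    functionToken(from: "sin(0)")   // "sin"
    functionToken(from: "τ")        // "τ"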

Variables

Variables follow the same rules as functions, except that they must be prefixed with a "$" character. Thus, the following are all legal variable names:

  • $a
  • $_
  • $0xdeadbeef

In addition, variables may also be quoted strings:

  • 'a'
  • "hello"
  • '\''
  • "Inigo Montoya"

Operators

Operators are essentially all other characters in the string: anything not recognized as a number, function, or variable is tokenized as an operator.

Parentheses are parsed as operator tokens, even though they are not listed as part of the built-in operators. Parentheses used to denote order of operations and function arguments are eliminated during term grouping.
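
For example, assuming the hypothetical Token enum sketched earlier, "sin(0)+1" would tokenize roughly as:

    [.function("sin"), .op("("), .number(0), .op(")"), .op("+"), .number(1)]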

Regarding Whitespace

Whitespace is treated as a logical break in the token stream. That means "3 4" will be parsed as the 3 token followed by the 4 token. Because of the logic that recognizes implicit multiplication, a multiplication operator is then injected into the stream. Thus, "3 4" is recognized as "3*4" and evaluates to 12.
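
A sketch of that injection (assumed logic, reusing the hypothetical Token enum from above; the library's actual rules cover more cases than two adjacent numbers):

    func injectImplicitMultiplication(_ tokens: [Token]) -> [Token] {
        var result: [Token] = []
        for token in tokens {
            if case .number(_) = token, case .number(_)? = result.last {
                result.append(.op("*"))   // e.g. "3 4" becomes "3 * 4"
            }
            result.append(token)
        }
        return result
    }

    injectImplicitMultiplication([.number(3), .number(4)])
    // [.number(3), .op("*"), .number(4)], which evaluates to 12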