C++20 Coroutine Synchronous Parser Combinator Library.
This library contains a monadic parser type and associated combinators that can be composed to create parsers using C++20 Coroutines.
PRs are welcome. 🎉🎉🎉
As a first example, let's say that you want to create a parser that parses any two-character string where both characters are the same (e.g. "==", "aa", etc.) and then returns that character on success. You could do that like this:
parsco::parser<char> parse_two_same() {
char c = co_await parsco::any_chr();
co_await parsco::chr( c );
co_return c;
}
Another way to write it, which will provide a better error message to the user when parsing fails, would be:
parsco::parser<char> parse_two_same() {
char c1 = co_await parsco::any_chr();
char c2 = co_await parsco::any_chr();
if( c1 != c2 )
// Provides better error message.
co_await parsco::fail( "expected two equal chars" );
co_return c1;
}
Running this parser can result in three possible outcomes:
- The parser finds two characters and they are both the same, e.g. "xx". In that case, the parser will yield 'x'.
- The parser finds two characters and they are not the same. In that case, the parser will fail with an error message. The particular error message depends on which of the above two implementations is used.
- The parser encounters the end of the input stream (EOF) before it can parse both characters. In this case, the parser function will fail and return.
The parsers used above (chr
and any_chr
) are functions that
return parser<T>
s (in this case, T
is char
). As can be
seen, they have been combined to yield a higher-level function
parse_two_same
which is itself a parser in that it returns a
parser<T>
. This parse_two_same
parser can thus can be used to
build even more complex parsers in a similar way, namely by
using the C++20 co_await
keyword on the parser object that
it returns.
With the first example out of the way, let's discuss what is happening under the hood.
The term "parser" in this library is a bit
overloaded. It can refer to an object of type parsco::parser<T>
or to a coroutine function that returns one. It is ok to think of
them as equivalent for now. In fact, the parser object owns (in
RAII sense) the coroutine.
Typically, a parser is created by calling a coroutine function
that returns a parsco::parser<T>
and then that parser is run
by applying C++20's co_await
keyword to it. A parser object
does nothing on its own; it is only when it is co_await
ed
that it is given access to the input string and begins to
parse. If it succeeds, then it will result in an object of
type T
. If it fails, it will automagically unwind the entire
parser call stack and deliver an error messages to the original
caller of the overall parse operation.
In this library, parsers are "glued together" using C++20 Coroutines. An explanation of how C++20 coroutines work in general is beyond the scope of this README; for an in-depth look into how C++ coroutines work, see this blog. However, you do not need to fully understand them to use this library, so feel free to continue on in this tutorial either way.
For the purposes of using this library, a coroutine is a function that
- Returns a
parsco::parser<T>
type. - Uses one or both of the keywords
co_await
andco_return
in its body (this library does not make use ofco_yield
). - Has the ability (unlike a conventional function) to suspend and resume execution.
When a coroutine is first called, the C++ language "runtime" will
setup a coroutine frame consisting of a "promise" object and enough
space to hold any local variables in the function. Then,
while executing the coroutine, each time the co_await
or
co_return
keywords are used, the compiler will query the
promise object via some extension points to ask it what to do in
response. This promise object (and the hidden calls to query it)
are leveraged by this library to automate the following tasks
which must be done to parse a string:
- Passing around the input buffer.
- Keeping track of the current position in the input buffer.
- Detecting a premature EOF (end of input stream) and potentially terminating the parse.
- Detecting and handle parsing errors (i.e., syntax errors in the input), which can result in either cancelling the entire parse (unwinding the call stack) or in backtracking to reattempt with another parser.
- Providing error message and input buffer location to the user in the case of a parsing error.
- Returning of the result of parsing in a type-safe way.
All of this happens automatically when you co_await
on a parser
object.
Since the coroutines thread the input buffer pointers and state through the coroutine call stack automatically, there is no global state in the library, hence we have nice properties of being re-entrant and thread safe (though only in that sense).
A parser combinator is a function (or
higher-order function) that can transform a given parser into a
different or more complex parser. Combinators are used to glue
together primitive (basic) parsers into more complex ones.
They typically take as parameters either parser objects or
functions that return parser objects, and then construct a new
parser and return it as a new parsco::parser<T>
object.
In this library, you will build up your parsers by using combinators to combine simple parsers into more complex ones. In fact, in many cases, a parser can be constructed using existing combinators alone without requiring any coroutines. The coroutines will still be needed though, in cases where a parser requires some new logic not provided by an existing combinator.
As a second example, let us define a parser that can parse
four-component IPv4 addresses, possibly with a subnet mask on the
end. An IP address must consist of four integers between 0
and
255
(inclusive) separated by dots, with no spaces, e.g.
"123.234.22.33".
The trailing subnet mask, if present, must immediately follow the
fourth number and must consist of a slash (/
) followed by an
integer between 0
and 32
(inclusive), e.g.:
"123.234.22.33/24".
To help write our parser, we will first define a struct
in
which to hold the result. This is good practice, since you will
ideally want your parsers to return results in the form of your
own type-safe data structures:
#include <optional>
struct ipv4_address {
// Each of the four integer components.
int n1 = 0;
int n2 = 0;
int n3 = 0;
int n4 = 0;
// Optional trailing subnet mask.
std::optional<int> subnet_mask = std::nullopt;
};
First we define a parser for the main integral components:
parsco::parser<int> parse_ip_number() {
int n = co_await parsco::parse_int();
if( n > 255 )
co_await parsco::fail( "ip values must be <= 255" );
co_return n;
}
which will be called once for each of the four components. Note
that this parses the int
and does validation, failing the parse
(all the way up to the root or the nearest try_
combinator) if
validation fails.
Finally, we define the IPv4 parser that produces the result in
the form of an ipv4_address
data structure:
parsco::parser<ipv4_address> parse_ip_address() {
ipv4_address result;
result.n1 = co_await parse_ip_number();
co_await parsco::chr( '.' );
result.n2 = co_await parse_ip_number();
co_await parsco::chr( '.' );
result.n3 = co_await parse_ip_number();
co_await parsco::chr( '.' );
result.n4 = co_await parse_ip_number();
// The try_ combinator returns the result wrapped in a
// `result_t` in order to signal (in a type-safe way) that the
// parser may or may not succeed and that it will backtrack if
// not successful.
parsco::result_t<char> slash =
co_await parsco::try_{ parsco::chr( '/' ) };
if( slash.has_value() ) {
// We have the start of the subnet mask.
int mask = co_await parsco::parse_int();
if( mask > 32 )
co_await parsco::fail( "subnet mask must be <= 32" );
result.subnet_mask = mask;
}
co_return result;
}
You can find a working version of this example in the examples
folder (ip-address-parser.cpp
). When you run it, you should see
the following output:
test "123.234.123.99" succeeded to parse: 123.234.123.99
test "123.234.123.99/23" succeeded to parse: 123.234.123.99/23
test "123.234.123.99 /23" succeeded to parse: 123.234.123.99
test "123.234.123.99/" failed to parse:
tests:error:1:15
test "123,234.123.99" failed to parse:
tests:error:1:4 expected '.'
test "123.234.xxx.99" failed to parse:
tests:error:1:9
test "123.234.123.99/33" failed to parse:
tests:error:1:17 subnet mask must be <= 32
test "123.234.123" failed to parse:
tests:error:1:11 EOF
test "123.234.123.990" failed to parse:
tests:error:1:15 ip values must be <= 255
test "123.234.123/8" failed to parse:
tests:error:1:12 expected '.'
Some of the inputs parse successfully, while others fail and give the location of the parsing failure. Note that some of those that successfully parse (such as the third one) do not consume all input; in those cases they successfully parses an IP address with no subnet mask and leave the remainder of the input buffer unparsed.
As a final note for this example, we could have terminated the
parse_ip_address
function with a co_await parsco::eof()
statement, which would have ensured that we consumed all of the
input. That would then have the benefit of preventing inputs such
as "123.234.123.234 /23" from parsing successfully (currently
that input parses successfully because the parser parses an IP
address with no subnet mask and simply ignores the remainder of
the input). However, the downside to adding eof()
into a parser
is that it prevents that parser from being composed with other
parsers to produce more complex parsers, since it always insists
that it be the only consumer of the input.
As a third example, let us define a simple grammar that we will refer to as the "hello world" grammar. It is defined as follows:
- The string may start or end with any number of spaces.
- It must contain the two words 'hello' and 'world' in that order.
- The two words must have their first letters either both lowercase or both uppercase.
- Subsequent letters in each word must always be lowercase.
- The second word may have an arbitrary number of exclamation marks after it, but they must begin immediately after the second word (no spaces).
- The number of exclamation marks, if any, must be even.
- The two words can be separated by spaces or by a comma. If separated by a comma, spaces are optional after the comma. If there is a comma, it must come immediately after the first word.
Examples:
"Hello, World!!", // should pass, and consume all input.
" hello , world!! ", // should fail (#7 violated)
" hello, world!!!! ", // should pass, and consume all input.
" hello, world!!! ", // should fail (#6 violated)
"hEllo, World", // should fail (#4 violated)
"hello world", // should pass, and consume all input.
"HelloWorld", // should fail (#7 violated)
"hello,world", // should pass, and consume all input.
"hello, World", // should fail (#3 violated)
"hello, world!!!!!!", // should pass, and consume all input.
"hello, world !!!!", // should pass, but not consume all input.
"hello, world ", // should pass, and consume all input.
"hello, world!! x", // should pass, but not consume all input.
This is clearly a contrived grammar, but it will allow us to demonstrate some of the basic facilities in this library. The following is a Parsco parser that parses this grammar:
parsco::parser<string> parse_hello_world() {
// Eat spaces.
co_await blanks();
// Parse one char exactly, and it must be either 'h' or 'H'.
// Our grammar dictates that the "hello" can start with either
// a lowercase of capital h.
char h = co_await one_of( "hH" );
// But the rest of the letters in the word must be lowercase.
// Parse the given string exactly.
co_await str( "ello" );
// We can have a comma followed by zero-or-more-spaces, or we
// can have no comma followed by one-or-more spaces.
//
// The >> operator is a sequencing operator that will run
// the first parser, ensuring that it succeeds, and will
// throw away the result. Then it will invoke the second
// parser, again ensuring that it succeeds and returning its
// result.
//
// The | operator runs the first parser and if it succeeds,
// returns its result; otherwise runs the second parser and
// returns its result, or fails if it fails.
co_await (( chr( ',' ) >> blanks()) | many1( space ));
// Our grammar rules say that the two words must have the same
// capitalization.
( h == 'h' ) ? co_await str( "world" )
: co_await str( "World" );
// Parse zero or more exclamation marks.
string excls = co_await many( chr, '!' );
// Grammar says number of exclamation marks must be even.
if( excls.size() % 2 != 0 )
co_await fail( "must have even # of !s" );
co_await blanks();
// This is optional, since we're just validating the input, but
// it demonstrates how to return a result from the parser.
co_return "Hello, World!";
}
See the file hello-world-parser.cpp
in the examples folder for
a runnable demo of calling this parser on the above input data.
Essentially, we do this:
// This should conform to the above grammar.
std::string_view test = " hello, world!!! ";
// Filename is specified only to improve error messages. In
// this case, there is no filename.
parsco::result_t<string> hw =
parsco::run_parser( "fake-filename.txt", test, parse_hello_world() );
if( !hw ) {
cout << "test \"" << s
<< "\" failed to parse: " << hw.get_error().what()
<< "\n";
} else {
cout << "test \"" << s << "\" succeeded to parse.\n";
}
which, when run on each of the test cases above, yields:
test "Hello, World!!" succeeded to parse.
test " hello , world!! " failed to parse:
fake-filename.txt:error:1:9 expected 'w'
test " hello, world!!!! " succeeded to parse.
test " hello, world!!! " failed to parse:
fake-filename.txt:error:1:18 must have even # of !s
test "hEllo, World" failed to parse:
fake-filename.txt:error:1:2 expected 'e'
test "hello world" succeeded to parse.
test "HelloWorld" failed to parse:
fake-filename.txt:error:1:6
test "hello,world" succeeded to parse.
test "hello, World" failed to parse:
fake-filename.txt:error:1:8 expected 'w'
test "hello, world!!!!!!" succeeded to parse.
test "hello, world !!!!" succeeded to parse.
test "hello, world " succeeded to parse.
test "hello, world!! x" succeeded to parse.
Note that in those cases where we provided an error message
ourselves via the fail
combinator, it gives it to the user
and this can result in a better experience.
Before closing this example, let's extend it a bit more. Let's
say that we want to allow more than one occurrence of the
"hello world" in the input buffer (i.e., one or more) and then
have the parser return the number that were found. To do this,
we can just use our existing parse_hello_world
parser and
wrap it in a combinator:
parsco::parser<int> parse_hello_worlds() {
std::vector<std::string> hello_worlds =
co_await parsco::many1( parse_hello_world );
co_return hello_worlds.size();
}
where the parsco::many1
combinator parses one or more of the
given parser. That means that it will fail unless there is at
least one correctly-parsed "hello world." Notice that by wrapping
our parser in parsco::many1
, the result is now a
std::vector<std::string>
, since we are parsing multiple copies
of "hello world." In this case our parser is simply validating
the input, so the contents of the vector are not of much use to
use, but in most other cases they will be useful. In our case,
we're just interested in the size.
The parse_hello_worlds
, just as with the parse_hello_world
parser may not consume the entire input. In the face of a syntax
error in the first occurrence it will fail, but syntax errors in
subsequent occurrences will simply cause parsco::many1
to stop
parsing (reporting success) and to return whatever it has
successfully parsed, potentially leaving unparsed characters in
the input buffer. If we want to ensure that the entire input
buffer is consumed, we can use either the parsco::eof
combinator or the parsco::exhaust
combinator; an example of the
latter is:
parsco::parser<int> parse_hello_worlds() {
std::vector<std::string> hello_worlds =
co_await parsco::exhaust( parsco::many1( parse_hello_world ) );
co_return hello_worlds.size();
}
The parsco::exhaust
combinator returns a parser that only
succeeds if the parser given to it succeeds and consumes all
input; otherwise, the parser fails.
See the section called "Combinator Reference" for a complete reference of all parsers and combinators.
To see a more realistic example, see the json-parser.cpp
file
in the examples folder which contains a JSON parser constructed
using the combinators in this library. Note that the JSON
parser makes use of the ADL extension point mechanism of the
library, which will be described in the next section.
Something to note in the implementation of the JSON parser is
that many of the functions are actually not coroutines because
they do not use the co_await
or co_return
keywords (and
co_yield
is not used by this library). For example, the parser
for a JSON boolean:
parser<boolean> parser_for( lang<Json>, tag<boolean> ) {
return (str( "true" ) >> ret( boolean{ true })) |
(str( "false" ) >> ret( boolean{ false }));
}
This is possible because it is frequently the case that a parsing operation can be described entirely in terms of existing combinators without applying any logic to the intermediate results. In those cases your parsers may end up looking like the above, namely a single expression describing the parser in a declarative way. Note: in order to support this style of writing parsers, the combinators take their parameters by value in order to avoid dangling references; see the section "Laziness and Parameter Lifetime" for more information.
Writing a function in this declarative/expression-based style when possible is likely a good idea because it will prevent the function from being compiled as a coroutine, which means that there will be less overhead in invoking it (invoking a coroutine typically involves at least one heap allocation to create the coroutine state; theoretically, the compiler can elide the heap allocation, but that does not seem to currently happen with this library, see the section on runtime performance below).
The Parsco library provides an ADL-based extension point so that
you can define parsers for user-defined types in a consistent
manner and that will be discoverable by the library. For example,
the library provides a parser for std::variant
which will
invoke the extension point to attempt to parse the types that it
contains, which may be user-defined types. To define a parser for
some user type MyType
which lives in your namespace your_ns
,
do the following:
#incldue "parsco/ext.hpp"
#incldue "parsco/parser.hpp"
#incldue "parsco/promise.hpp"
#incldue "parsco/ext-basic.hpp"
namespace your_ns {
// For exposition only.
using parsco::lang;
using parsco::tag;
using parsco::parser;
struct MyType {
int x;
int y;
};
// A type tag representing the (custom) language that your
// parser parses. All of your parsers for a given language will
// use the same type tag to guarantee at compile-time that they
// are not mixed. That means that you can safely have multiple
// parsers for different languages for the same user type My-
// Type and they won't interfere.
struct MyLang {};
parser<MyType> parser_for( lang<MyLang>, tag<MyType> ) {
// As an example, we're parse a point-like structure
// consisting of two integers separated by a comma.
int x = co_await parsco::parse<MyLang, int>();
co_await parsco::chr( ',' );
int y = co_await parsco::parse<MyLang, int>();
co_return MyType{ x, y };
}
}
Now, having defined that, any other code, including code internal to the Parsco library, can parse your type by running the following:
MyType mt = co_await parsco::parse<MyLang, MyType>();
and your parser will be discovered through ADL. You can use this
to easily build generic parsers for e.g. std::vector<T>
which
then invoke parsco::parse<SomeLang, T>()
to parse any type.
Using this ADL extension point is entirely optional; you may not need it or opt not to use it, and that is fine.
When a co_await
ed parser fails, it aborts the entire parsing
call stack in the sense that all coroutines in the call stack
will be suspended (and likely destroyed shortly thereafter when
the original parser<T>
object goes out of scope).
Sometimes, however, we want to attempt a parser and allow for it
to fail. For this, there is a special combinator called try_
:
using namespace parsco;
result_t<int> maybe_int = co_await try_{ parse_int() };
This will propagate parse errors via a return value that can be
checked in the parser. Upon a failure, the maybe_int
will
contain the error and the parser will internally have backtracked
in the input buffer so that any subsequent parsers can retry the
same section of input.
Generally, the combinators in the library that allow for failing
parsers will all use the try_
combinator internally and thus
will automatically backtrack, so that the internal position in
the buffer will never extend beyond what has been successfully
parsed and returned by your parsers. That said, the framework
does keep track of the farthest position that was attempted to
be parsed, in order to deliver accurate error messages to users
in response to syntax errors.
Although coroutines have important applications in concurrent or
asynchronous programming, they are not being used in that
capacity here. In this library they are used simply as a "glue"
for parsers and combinators, all of which operate in a synchronous
way and that expect to have the entire input buffer in memory.
In fact, parsco::parser<T>
coroutines will only suspend
when they fail (apart from their initial suspend point, since
they are lazy; see below), and will then never resume, similar to
how a std::optional<T>
coroutine might work.
This can be contrasted with other uses of coroutines where e.g. the coroutine will suspend while waiting for more input to arrive asynchronously, and will then be scheduled to resume when it does.
The parsco::parser<T>
type can be thought of as the C++
equivalent to Haskell's Parsec
monad, for example.
The coroutines in this library follow the practice of Structured Concurrency in that:
- The are lazy: they suspend at their initial suspend point.
- A parser object return by a coroutine owns the coroutine in RAII fashion and will destroy the coroutine it when the parser object goes out of scope.
This guarantees (in most use cases) that memory and lifetimes will be managed properly and automatically without any special consideration from the library user.
It is also hoped that this ownership model will aid the optimizer in understanding that it can elide all of the coroutine state heap allocations (which I believe should be theoretically possible in all of the common use cases). This unfortunately does not seem to happen currently; see the section on runtime performance for more information.
Parsco parser coroutines are lazy. That means that when you
invoke a parsco::parser<T>
coroutine, it immediately suspends.
In fact it must suspend, because it does not yet have access to
the input data initially. The parser is only resumed and run when
it is co_await
ed, at which point it is given access to the
input data by its caller.
For that reason, if combinator functions were to accept
parameters by reference, it could lead to thorny lifetime issues
where the references become dangling when a combinator object is
returned from a function without co_await
ing it, which is
convenient to be able to do (the JSON parser example does this
often).
In order to systematically avoid this problem, this library opts
for the safe route and thus all combinators accept their
arguments by value. This may lead to slight inefficiencies, but
it guarantees that there will not be any dangling references. The
developer can thus freely create arbitrary combinator creations
(co_await
ing the results immediately or at a later time)
without having to reason about lifetime or dangling references.
When designing new combinators, you may want to consider doing
the same for consistency. In practice, this should not add much
overhead, because typical parameters to coroutines include a)
stateless lambdas, b) primitive types such as ints, and
parsco::parser<T>
objects which are always moved.
Theoretically, it seems that any such lifetime issues could be
systematically avoided by constructing parsers and wrapping
combinators in one expression and then co_await
ing the result
immediately before the expression goes out of scope, as opposed
to putting the result into a parser<T>
object and then
co_await
ing on it later. However, to save the developer having
to think about that and to give them flexibility, the combinators
just take their arguments by value.
The examples in this library have been tested with ASan and they do come out clean, for what that is worth.
Since C++20 coroutines are a relatively new feature, compiler support is currently not complete or perfect.
As of August, 2021, this library has only been tested successfully with Clang trunk and (partially successfully) with GCC 11.1. The library runs well with Clang, but unfortunately GCC's coroutine support still seems too lacking to run this library without crashing, though it does seem to be able to run the "hello world" parser example.
I am told that this library now builds successfully with MSVC, so any MSVC build breakages should be considered a regression.
To build and run, you will need a recent build of Clang trunk (which at the time of writing should be the as-yet-unreleased Clang 12) along with a standard library that supports enough of C++20 and Coroutines. For the standard library I use libstdc++ from gcc 11.1.0, though more recent ones should work as well.
The build is done with CMake, and here is a typical build process
that I use on Ubuntu Linux 20.04 on an x86_64
system that will
use a build of LLVM located at /path/to/llvm
and a libstdc++
11.1.0 located in /path/to/gcc
:
$ git clone https://github.com/dpacbach/parsco
$ cd parsco
# From the parsco directory:
$ mkdir build
$ cd build
$ cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_COMPILER="/path/to/llvm/bin/clang++" \
-DCMAKE_CXX_FLAGS_INIT="-nostdinc++ -I/path/to/gcc/include/c++/11.1.0 -I/path/to/gcc/include/c++/11.1.0/x86_64-pc-linux-gnu" \
-DCMAKE_EXE_LINKER_FLAGS_INIT="-Wl,-rpath,/path/to/gcc/lib64 -L/path/to/gcc/lib64 -fuse-ld=lld"
$ make -j8
# From the build directory:
$ ./src/example/hello-world-parser
$ ./src/example/json-parser
Though you may well have to tweak the CMake command to work with your setup.
When using Clang with optimizations enabled (-O3
), I have
observed that a complex parser made of many combinators using
this library can be optimized down by the compiler to a point
where it runs around 15x slower than a parser hand-coded in C, in
the benchmarks that I ran.
Given how incredibly fast a hand-coded C parser can be, and given the high level of abstraction that can be attained using this library, it is perhaps not so bad. That said, performance is very important in the C++ ecosystem and thus a goal of this library is to eventually attain C-level performance without sacrificing any of the abstraction.
Theoretically, it should be the case when using this library that all coroutine frame heap allocations can/should be elided by the compiler. But, unfortunately, that does not seem to happen in the instances where I have looked at the generated code. I am not sure why this is at the moment, but I am hopeful that this situation will improve in the future due to either a) an increase in my own understanding of coroutine library optimization techniques, b) help from contributors who have expertise in this area, or c) an increase in compiler's ability to optimize coroutines.
If we could get the compiler to elide the coroutine frames and allocations, subsequent inlining I believe could allow this library to reach C-level performance levels. Help with that would be much appreciated.
In non-optimized builds, the performance (relative to said C parser) is even worse unfortunately, and this is another problem that would be nice to improve upon (any help is appreciated).
Upon parse failure, the parsco parser framework is always able to
give the location (line and column) that the error occurred in
the input text. However, it is not generally able to provide a
human-readable error message describing what the parser was
expecting, unless the failure was initiated by use of the fail
combinator as demonstrated in the Hello World example above.
It may be possible to enhance the library so that a given parser can be inspected to automatically determine what characters it was expecting in order to automatically generate more useful error messages. Some parser frameworks in some libraries are able to do this to some extent, but it is difficult.
That said, even if that were possible, it seems that the most
user-intelligible error message are still going to be provided
mainly by the programmer via the fail( "..." )
combinator.
As a quick implementation note on the combinators, if you
look at the source code (mainly in combinators.hpp
) you'll
notice that many of them are implemented as niebloids instead of
function templates. This is actually not to do with ADL but
instead is to work around an issue that Clang seems to have with
function template coroutines (it yields strange runtime errors
when such coroutines is called). Hopefully that will be fixed
eventually, at which point the niebloids in this library can be
changed back to function templates.
This section lists any other C++ coroutine parser libraries that are available or in development for comparison:
Below is documentation for each parser and combinator function in this library. As you will see, there are many more combinators available than those that have been demonstrated in the examples thus far. Hence, to get the most out of this library, you are encouraged to read through these to get a sense of the tools that will be in your toolbox.
This is the official documentation for the combinators. I am better at maintaining the combinator descriptions in this README than in the code comments, so I'd recommend looking here instead of to the comments above each function.
This parser consumes a char that must be c
, otherwise it
fails. When it fails, it will produce an error message to the
effect of "expected character 'x'".
parser<char> chr( char c );
This parser consumes any char, fails only at EOF (end-of-file a.k.a. end-of-input).
parser<char> any_chr();
This parser parses a single character for which the predicate returns true, fails otherwise.
template<typename T>
parser<char> pred( T func );
The predicate function should be callable with a char
and
should return something that is implicitly convertible to bool
.
This combinator returns a parser that always succeeds and produces the given value.
template<typename T>
parser<T> ret( T );
This parser consumes one space character (specifically, an ASCII
0x20
). It Will fail if it does not find one.
parser<> space();
If you want to parse zero-or-more spaces (as is common in
parsing) then instead use the parsco::blanks
parser.
This parser consumes one character that must be either a CR (carriage return) or LF (line feed), and will fail if it does not find one.
parser<> crlf();
This parser consumes one tab character. Will fail if it does not find one.
parser<> tab();
This parser consumes one character that must be either a space, tab, CR, or LF, and fails otherwise.
parser<> blank();
This parser consumes one digit char ([0-9]
) or fails.
parser<char> digit();
This parser parses one lowercase letter ([a-z]
) or fails.
parser<char> lower();
This parser parses one uppercase letter ([A-Z]
) or fails.
parser<char> upper();
This parser parses one letter ([a-zA-Z]
) or fails.
parser<char> alpha();
This parser parses one alphanumeric character ([a-zA-Z0-9]
) or
fails.
parser<char> alphanum();
This parser consumes one char if it is one of the ones in s
,
otherwise fails.
parser<char> one_of( std::string s );
As is discussed in the section on parameter lifetime above, this
function takes the string by value in order to avoid dangling
references that might result in some cases if it were to use
either a const std::string&
or a std::string_view
or the
like.
This parser consumes one char if it is not one of the ones in
s
, otherwise fails.
parser<char> not_of( std::string s );
As is discussed in the section on parameter lifetime above, this
function takes the string by value in order to avoid dangling
references that might result in some cases if it were to use
either a const std::string&
or a std::string_view
or the
like.
This parser succeeds if the input stream is finished, and fails otherwise. Can be used to test if all input has been consumed.
parser<> eof();
See also the exhaust
parser below which is related.
The try_*
family of combinators allow attempting a parser which
may not succeed and backtracking if it fails.
This combinator simply wraps another parser and signals
that it is to be allowed to fail. Moreover, the try_
wrapper
will wrap the return type of the parser in a
parsco::result_t<T>
, which is analogous to a std::expected
(we will use std::expected
when it becomes available).
template<Parser P> // note Parser here is a C++20 Concept.
parser<parsco::result_t<R>> try_( P p );
where R
is P::value_type
, i.e., the result of running parser
p
.
In other words, if p
is a parser of type parser<T>
, then
try_{ p }
yields a parser of type
parser<parsco::result_t<T>>
. When it is run, the parser p
will be run and, if it fails, the resulting result_t
will
contain an error and the parser will have backtracked to restore
the current position in the input buffer to what it was before
the parser began, thus giving subsequent parsers the opportunity
to parse the same input. If p
succeeds, then the result_t
contains its result.
Hence, a parser wrapped in try_
never fails in the sense that
parsers normally fail in this library (by aborting the entire
parse); instead, a failure is communicated via return value.
This parser will try running the given parser but ignore the result. This is useful if you want to run a parser but a) you don't care if it succeeds, and b) you don't care what its result is if it succeeds. This can sometimes help to get rid of "unused return value" compiler warnings.
template<Parser P>
parser<> try_ignore( P p );
This parser attempts to consume the exact string given at the current position in the input string, and fails otherwise.
parser<> str( std::string sv );
This parser attempts to parse a valid identifier,
which must match the regex [a-zA-Z_][a-zA-Z0-9_]*
.
parser<std::string> identifier();
This parser consumes zero or more blank spaces (including newlines and tabs). It will never fail.
parser<> blanks();
This parser parses "..."
and returns the characters
inside the quotes, which may contain newlines. Note that these
return string views into the buffer because they are implemented
using the parser primitives defined in the file magic.hpp
,
i.e., combinators for which there is special support within the
parsco::promise_type
object, just for efficiency.
parser<std::string_view> double_quoted_str();
This parser parses '...'
and returns the characters
inside the quotes, which may contain newlines. Note that these
return string views into the buffer because they are implemented
using the parser primitives defined in the file magic.hpp
,
i.e., combinators for which there is special support within the
parsco::promise_type
object, just for efficiency.
parser<std::string_view> single_quoted_str();
This parser parses a string that is either in double quotes or single quotes. Does not allow escaped quotes within the string.
parser<std::string> quoted_str();
Many of the combinators in this section are actually higher-order functions. They take functions that, when called (which is typically done in some repeated fashion), produce parser objects to be run.
A single parser<T>
object represents a live coroutine, and so
it itself cannot be run multiple times; instead, if a combinator
needs to run a user-specified parser multiple times, it must
accept a function that produces those parsers when called.
This parser parses zero or more of the given parser.
template<typename Func, typename... Args>
parser<R> many( Func f, Args... args );
where R
is a std::vector
if the element parser returns
something other than a character, or a std::string
otherwise.
As mentioned in the introduction to this section, the argument to
the combinator is not actually a parser, but rather a function
which, when called with the given arguments, produces a parser
object. The function will be invoked on the arguments multiple
times to repeatedly generate parsers until a parser fails to
parse, at which point the many
parser finishes (always
successfully) and backtracks over any fragment of input that
failed parsing in the last iteration.
This parser parses zero or more of the given type for
the given language tag using the parsco ADH extension point
mechanism. That is, it will call parsco::parse<Lang, T>()
to
parse a T
and may do so multiple times.
template<typename Lang, typename T>
parser<R> many_type();
where R
is a std::vector
if the element parser returns
something other than a character, or a std::string
otherwise.
See the JSON example in the examples folder or the above section on user types for how to make user-defined types parsable using the ADL extension point of the parsco library.
This parser parses one or more of the given parser (technically it's the parser returned by invoking the given function on the given arguments).
template<typename Func, typename... Args>
auto many1( Func f, Args... args ) const -> parser<R>;
where R
is a std::vector
if the element parser returns
something other than a character, or a std::string
otherwise.
This parser runs multiple parsers in sequence, and only succeeds if all of them succeed. Returns all results in a tuple.
template<typename... Parsers>
parser<std::tuple<...>> seq( Parsers... );
This combinator takes parser objects directly (as opposed to functions-that-return-parsers) because each parser only needs to be run at most once. Although note that if a parser fails, the subsequent ones will not be run.
This parser runs multiple parsers in sequence, and only succeeds if all of them succeed. Returns the result of the last parser.
template<typename... Parsers>
parser<R> seq_last( Parsers... ps );
where R
is the value_type
of the last parser in the argument
list. As above, this combinator also takes parser objects
directly as opposed to functions.
This parser runs multiple parsers in sequence, and only succeeds if all of them succeed. Returns first result.
template<typename... Parsers>
parser<R> seq_first( Parsers... ps );
where R
is the value_type
of the first parser in the argument
list. As above, this combinator also takes parser objects
directly as opposed to functions.
This parser parses "g f g f g f" and returns the f's.
template<typename F, typename G>
parser<R> interleave_first( F f, G g, bool sep_required = true );
where R
is the value_type
of the parser f
. F
and G
are
nullary functions that return parser objects.
This parser parses "f g f g f g" and returns the f's.
template<typename F, typename G>
parser<R> interleave_last( F f, G g, bool sep_required = true );
where R
is the value_type
of the parser f
. F
and G
are
nullary functions that return parser objects.
This parser parses "f g f g f" and returns the f's.
template<typename F, typename G>
parser<R> interleave( F f, G g, bool sep_required = true );
where R
is the value_type
of the parser f
. F
and G
are
nullary functions that return parser objects.
This parser runs multiple string-yielding parsers in sequence and concatenates the results into one string.
template<typename... Parsers>
parser<std::string> cat( Parsers... ps );
In other words, each of the Parsers
must be a parser<std::string>
.
This operator runs the parsers in sequence (all must succeed) and returns the result of the final one.
template<Parser T, Parser U>
parser<typename U::value_type> operator>>( T l, U r );
// Example
co_await (blanks() >> identifier());
This operator runs the parsers in sequence (all must succeed) and returns the result of the first one.
template<Parser T, Parser U>
parser<typename T::value_type> operator<<( T l, U r );
// Example
co_await (identifier() << blanks());
Note: despite the direction that the operator is pointing (<<
),
the parsers are still run from left-to-right order, and the
result of the left-most one is returned, assuming of course that
they all succeed.
This means that we give a set of possible parsers, only one of
which needs to succeed. These parsers use the try_
combinator
internally, and so therefore they will do backtracking after each
failed parser before invoking the next one.
This parser runs the parsers in sequence until the first
one succeeds, then returns its result (all of the parsers must
return the same result type). If none of them succeed then the
parser fails. Another way to use this combinator is to use the
pipe (|
) operator defined below.
template<typename P, typename... Ps>
requires( std::is_same_v<typename P::value_type,
typename Ps::value_type> && ...)
parser<R> first( P fst, Ps... rest );
where R
is P::value_type
. When one parser succeeds,
subsequent parsers will not be run.
This operator runs the parser in the order given and returns the result of the first one that succeeds. The parsers must all return the same type.
template<Parser T, Parser U>
parser<typename U::value_type> operator|( T l, U r ) {
// Example
co_await (identifier() | quoted_str());
This is equivalent to the first
parser above.
The combinators in this section have to do with invoking functions on the results of parsers.
This parser calls the given function with the results of the parsers as arguments (which must all succeed).
NOTE: the parsers are guaranteed to be run in the order they appear in the parameter list, and that is one of the benefits of using this helper.
template<typename Func, typename... Parsers>
parser<R> invoke( Func f, Parsers... ps );
where R
is the result type of the function f
when invoked
with the results of all of the parsers are arguments.
More explicitly, this combinator will run all of the parsers ps
in the order they are passed to the function, and then use their
results as the arguments to the function f
, which must itself
produce a parser. So in other wods, if we do this:
auto r = co_await parsco::invoke( f, any_char(), any_char() );
that is equivalent to doing:
char c1 = co_await any_char();
char c2 = co_await any_char();
auto r = co_await f( c1, c2 );
and thus as can be seen, this parser (like many others) is for convenience; it does not do anything that couldn't be done manually.
This parser calls the constructor of the given type T
with the results of the parsers as arguments (which must all
succeed).
template<typename T, typename... Parsers>
parser<T> emplace( Parsers... ps );
The parsers will be run in the order they are passed to the function. Example:
struct Point {
Point( int x_, int y_ ) : x( x_ ), y( y_ ) {}
int x;
int y;
};
// Assuming that we have a parser for integers called parse_int:
Point p = co_await parsco::emplace<Point>( parse_int(),
parse_int() );
In particular, note that we are not co_await
ing on the results
of the parse_int()
calls; we just give the combinators to
emplace
and it does the rest. This can reduce some verbosity in
parsers from time to time. Think of this as an analog to
Haskell's <$>
operator for Applicatives.
The venerable fmap
combinator runs the parser p
and applies
the given function to the result, if successful. The function
typically does not return a parser; it is just a normal function.
template<Parser P, typename Func>
requires( std::is_invocable_v<Func, typename P::value_type> )
parser<R> fmap( Func f, P p );
where R
is the result of invoking the function on the
value_type
of the parser p
. This can be seen as a
single-parameter version of the invoke
combinator above.
This combinator runs the given parser and if it fails it will return the error message given (as opposed to any error message that was produced by the parser).
template<typename Parser>
Parser on_error( Parser p, std::string err_msg );
This is used to provide more meaningful error messages to users in response to a given parser having failed. If you use this combinator at all, then generally you'll want to use it on lower-level (leaf) parsers as opposed to using it to wrap higher level parsers; if you do the latter then it will effectively suppress error messages from all lower parsers, replacing them with a single error message, which is probably not useful to the user.
A better use would be with a parser like chr
:
using namespace parsco;
char c = co_await on_error( chr( '=' ), "assignment operator is required because..." );
which will give the user a meaningful error message; had the
on_error
combinator not been used, then the error message would
have defaulted to that produced by chr
, which, at best, would
be limited to something like "expected =" (that is the best that
the chr
combinator can do on its own).
This parser runs the given parser and then checks that the input buffers has been exhausted (if not, it fails). Returns the result from the parser on success (i.e., when all input has been consumed).
template<typename Parser>
Parser exhaust( Parser p );
This parser is not really a parser, it just takes a
nullable entity (such as a std::optional
, std::expected
,
std::unique_ptr
, parsco::result_t<T>
, etc.), and it will
return a parser object that, when run, will try to get the value
from inside of the nullable object. If the object does not
contain a value, then the parser will fail. If it does contain an
object, the parser will yield it as its result.
template<Nullable N>
parser<typename N::value_type> lift( N n );
Conceptually this is analogous to a monadic "lift" operation as
you'd have in functional languages where a monadic value is
lifted from an inner monad (e.g. std::optional<T>
) to the
transformed monad (parsco::parser<T>
).
This parser runs the given parser p
between
characters l
and r
(or parsers l
and r
, depending on the
overload chosen) and, if all parsers are successful, returns only
the result of the parser p
(i.e., what is inside the
delimiters).
template<typename T>
parser<T> bracketed( char l, parser<T> p, char r );
template<typename L, typename R, typename T>
parser<T> bracketed( parser<L> l, parser<T> p, parser<R> r );
For example, to parse an identifier between two curly braces, you could do:
using namespace parsco;
using namespace std;
string ident = co_await bracketed( '{',
identifier(),
'}' )
Given an input of "{hello}" that would yield "hello".