Mark Papadakis edited this page Mar 25, 2017 · 1 revision

See common.h, where some core data types are defined, and limits.h, where limits are specified. You may change those declarations and limits to suit your needs.

Strings

str8_t and str32_t are strings that carry their length. They differ only in the type of the length field: uint8_t len for str8_t and uint32_t len for str32_t. You may want to use a different string type to support multi-byte encodings. If you want to keep using Switch's strings with, for example, a 2 bytes/character encoding, you can define them as

using str8_t = strwithlen<uint8_t, uint16_t>;
using str32_t = strwithlen<uint32_t, uint16_t>;

and define terms_cmp() and default_token_parser_impl() appropriately.
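To make the effect of the two template arguments concrete, here is a minimal sketch of a length-prefixed string template. It is not Switch's actual strwithlen definition: the name strwithlen_sketch, the member p, and the helpers set(), max_length() and size_in_bytes() are assumptions made for illustration only. It shows that the first argument bounds how many characters a string can hold, while the second sets the width of each character.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for Switch's strwithlen; NOT the real definition.
// LenT is the type of the length field, CharT the type of one character.
template <typename LenT, typename CharT = char>
struct strwithlen_sketch {
    const CharT *p{nullptr}; // not owned; points at the first character
    LenT len{0};             // number of characters, not bytes

    // Largest number of characters the length field can represent.
    static constexpr std::size_t max_length() {
        return std::numeric_limits<LenT>::max();
    }

    // Point at `n` characters starting at `data`; `n` must fit in LenT.
    void set(const CharT *data, std::size_t n) {
        assert(n <= max_length());
        p = data;
        len = static_cast<LenT>(n);
    }

    // Storage occupied by the characters themselves.
    std::size_t size_in_bytes() const {
        return static_cast<std::size_t>(len) * sizeof(CharT);
    }
};

// The 2 bytes/character variants from the text, using the stand-in template.
using str8_sketch = strwithlen_sketch<std::uint8_t, std::uint16_t>;
using str32_sketch = strwithlen_sketch<std::uint32_t, std::uint16_t>;
```

With these aliases, a str8_sketch still holds at most 255 characters, but each character now occupies two bytes, so any code that walks the string byte by byte (such as a comparison or tokenisation routine) has to be adapted.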

Other Types

A document ID is, by default, an unsigned 32-bit integer. You can change docid_t to a 64-bit type, or even a 16-bit one. Unless there is a good reason not to, I recommend that you stick with 32-bit document IDs.

A document word position, tokenpos_t, is by default an unsigned 16-bit integer. This means that the maximum document position supported by default is 65535. You can change it to a wider or narrower integer type depending on your needs but, again, unless you really need to, I recommend that you stick with 16 bits.

If you change either of those, you will need to recompile Trinity and re-index your documents.
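The default widths and the 65535 limit can be checked with a short sketch. The aliases below only mirror the defaults described above; the real declarations are the ones in common.h, and the names docid_sketch_t, tokenpos_sketch_t, fits_tokenpos() and to_tokenpos() are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Illustrative aliases mirroring the documented defaults (assumed names).
using docid_sketch_t = std::uint32_t;    // document ID: unsigned 32-bit
using tokenpos_sketch_t = std::uint16_t; // word position: unsigned 16-bit

// Largest word position representable with the default type.
constexpr std::size_t max_tokenpos =
    std::numeric_limits<tokenpos_sketch_t>::max();
static_assert(max_tokenpos == 65535, "16-bit positions top out at 65535");

// True if `pos` can be stored in a position without truncation.
constexpr bool fits_tokenpos(std::size_t pos) {
    return pos <= max_tokenpos;
}

// Narrowing conversion that refuses positions that would wrap around.
inline tokenpos_sketch_t to_tokenpos(std::size_t pos) {
    assert(fits_tokenpos(pos));
    return static_cast<tokenpos_sketch_t>(pos);
}
```

A check like fits_tokenpos() is one way an indexer could decide what to do with words beyond the limit in very long documents, for example by skipping them or by switching tokenpos_t to a wider type.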

Token Parser and String Comparisons

Trinity::default_token_parser_impl() is the default parser function. It is used when no other parser function is specified in either the Trinity::Query constructor or Trinity::Query::parse().

That function, or any alternative with the same signature, should parse the next token from the string provided as its sole argument and return the token's length. If no token can be parsed, it should return 0.

The default implementation of default_token_parser_impl() uses a very basic heuristic to parse tokens. You probably want to roll your own in order to support multi-byte characters and fancier tokenisation schemes.
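The contract described above (return the length of the token at the start of the input, or 0 if there is none) can be sketched as follows. This is not Trinity's implementation and does not use its real signature: std::string_view (C++17) stands in for Trinity's string type, and the names ascii_token_parser_sketch() and tokenize_sketch() are hypothetical. The driver loop is likewise only an assumption about how a caller might use such a parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical parser: returns the length of the run of ASCII letters and
// digits at the very start of `s`, or 0 if `s` does not start with one.
std::size_t ascii_token_parser_sketch(std::string_view s) {
    std::size_t n = 0;
    while (n < s.size() && std::isalnum(static_cast<unsigned char>(s[n]))) {
        ++n;
    }
    return n;
}

// Hypothetical driver: repeatedly asks the parser for the next token and
// skips a single character whenever the parser reports 0.
std::vector<std::string> tokenize_sketch(std::string_view s) {
    std::vector<std::string> tokens;
    while (!s.empty()) {
        const std::size_t len = ascii_token_parser_sketch(s);
        if (len == 0) {
            s.remove_prefix(1); // separator or unsupported character
            continue;
        }
        tokens.emplace_back(s.substr(0, len));
        s.remove_prefix(len);
    }
    return tokens;
}
```

A parser for multi-byte text would follow the same shape, but would decode whole characters instead of testing single bytes with std::isalnum().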

terms_cmp() accepts two strings and should return -1, 0, or 1 depending on the result of a lexicographic comparison between them. You probably want to change it to support multi-byte strings, or to compare strings case-insensitively.
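As an illustration of the case-insensitive variant, here is a minimal sketch of a comparison that returns -1, 0, or 1. It handles ASCII only, and it does not use Trinity's real terms_cmp() signature: the name terms_cmp_ci_sketch() and the std::string_view (C++17) parameters are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical case-insensitive lexicographic comparison (ASCII only).
// Returns -1 if a < b, 0 if they are equal ignoring case, 1 if a > b.
int terms_cmp_ci_sketch(std::string_view a, std::string_view b) {
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
    }
    // Common prefix is equal: the shorter string sorts first.
    if (a.size() == b.size()) {
        return 0;
    }
    return a.size() < b.size() ? -1 : 1;
}
```

With this sketch, "Apple" and "apple" compare equal, and "pea" sorts before "pear". Multi-byte text would need proper case folding for its encoding rather than std::tolower().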
