Secondary Indexing

Preliminary

Examples in this document refer to a feature type with the following attributes

ATTRIBUTE NAME	ATTRIBUTE TYPE
LOCATION	Geometry
PERSONS	Numeric (Long)
RECORD_DATE	Date
INCOME_CATEGORY	String (specific set of values)
AFFILIATION	String (mixed text)
ID	String

Problem Description

Find the optimal set of one ore more indices to answer a query. A query requests a data store to find features matching the query’s search criteria. A feature is a single instance of data (e.g record, entity, etc.).

Query Syntax: ECQL http://docs.geoserver.org/stable/en/user/filter/ecql_reference.html Query Example: INTERSECTS(LOCATION, "Polygon-109.0448 37.0004, -102.0424 36.9949, -102.0534 41.0006, -109.0489 40.9996, -109.0448 37.0004") AND RECORD_DATE DURING 2006-11-30T00:30:00Z/2006-12-31T00:30:00Z AND PERSONS BETWEEN 1000000 AND 3000000 AND INCOME_CATEGORY in ('10-15', '15-22')

Possible Indices

Primary Geospatial Index on LOCATION
Primary Geospatial-Temporal Index on LOCATION and RECORD_DATE
Secondary Time or Numeric Index on RECORD_DATE
Secondary Text Index on INCOME_CATEGORY
Secondary Numeric Index on PERSONS

Type of Secondary Indices

Text
1. Optimized for Regular Expression
2. Optimized for equality
Numeric
1. Range or Single Value
2. Numeric, Floating Point, Big Integer/Decimal?
Date/Times
1. Treat as Numeric (GMT milliseconds from epoch) or Alternate Representation (e.g. YYYYMMDDHHMISSMMMMM)
2. Range or Single Value
Geometry (Alternate from primary)

Role of the Primary Index

The primary index maintains a copy of each feature. A primary index may have more than copy of feature. Each copy is given its own unique index identifier (ROW ID). A feature may be stored in multiple primary indices.

For GeoWave, the primary index is range based index (e.g. bounding rectangle, bounding time).

Compound Indices

The generalization of an index should include the ability to create compound keys. Geo-temporal is an example. There may be some limitations. Numeric indices can be managed using techniques for primary indices (ie. Space Filling Curve). Text may be limited to equality. The design should not impede further future development.

Numeric ranges derived from two attributes (e.g. start and end) are considered a type of compound key for this discussion. For the moment, this document will not derive a distinction from a single attribute representative of a range, such as start+end or start+duration.

Persistence Design

The design in this section assumes Accumulo.

A secondary index is a key derived from an attribute value (or attributes in a compound scenario) associated with a primary index identifier. For example, attribute PERSONS with value 29383 is associated with a list of primary index identifiers of those features that have a PERSONS attribute value matching exactly 29383 for a specific primary index.

Table Design

A each primary index is a separate table. A primary index table is named according to name space and index type. For example, geospatial and geo-temporal indices are separate tables.

Does each type of secondary index reside in its own table?

Yes. The approach is consistent with primary index separation. It can be potentially easier manage multiple tables from the perspective of writes and backups.

The alternative is to maintain a single table using the column family with locality configured to isolate each type of secondary index. This approach reduces the growth of secondary index tables, one per each type of secondary index. It is unclear if there are any performance concerns, as locality groups organize column families in separate files.

What is the basic structure of a table?

ROW ID	CF	CQ	VIS	VALUE
Attribute Value	Attribute Type	Attribute Name	Attribute Visibility	Primary Index identifier(ROW ID)

The table design reflects a separate row for each Primary Index identifier to attribute value association. This permits rapid ingest. Storing a set of index identifiers in a column value would require encoding and decoding the set for each addition/subtraction.

How are updates and deletions manager?

In Accumulo, updates and deletions are managed using the versioning iterator. For updates, versioning iterating selects each unique row ID with the latest time stamp. For deletions, a deletion flag is marked on a specific ROW ID and timestamp. All rows with the same ROW ID and older timestamp are deleted and eventually removed in a compaction.

Appending some unique identifier to the ROW ID enables deletion and modification. This assumes either the attribute value has a uniform size or range scans can be polluted with false positives requiring additional filtering.

Which primary index ROW ID is referenced in a secondary index for features indexed in more than one primary index?

Consistent with approaches used for primary index and statistic selection, as communicated by the data adapter, a default option is used. The default option uses the index with the least number if dimensions (e.g. geo-spatial over geo-temporal). In cases where the decision requires a change or there is a conflict (two primaries with same number of dimensions), the adapter provides additional guidance.

Should secondary indices reference primary index identifiers from all primary indices?

In this solution, only one ID is required to look up a feature. For the purposes of extracting a feature associated with a secondary index, the extra IDs are redundant, requiring more space and logic.

The chosen approach precludes intersecting joins of look-ups between secondary and primary indices. Consider the example data stored in both geospatial and geotemporal indices. Suppose, to efficiently answer the query, the system chooses a geotemporal primary index (for LOCATION and RECORD_DATE) and a secondary index on PERSONS. If the secondary index only maintains primary index identifiers for the geospatial index, then intersecting join between index inquiries cannot be processed, as the identifier do not match between the indices.

To support primary and secondary index joins, each primary identifier along with the primary index type must be maintained with the secondary index. The answer to the next question provides additional details.

A primary index may have more than one identifier per feature. Does the secondary index reference all index identifiers?

The secondary index references only one selected index identifier. The index references the first identifier (ROW ID), lexicographically. As part of a robust solution for intersecting joins when inquiring multiple secondary indices, the feature ID is extracted from the index identifier. This is discussed in more detail later in this document.

This approach precludes intersecting joins between primary indices and secondary indices. A Cost Based Optimizer (CBO) fulfills a query criteria choosing between a single secondary index, intersecting joins between a set of secondary, or a primary index. To reason the benefit/complexity analysis of this decision, consider each action separately.

In a primary index scan, the query is decomposed into a set of ranges. Distributed filtering is applied to each discovered row to eliminate features that do not meet the query’s criteria. Using the example query in a geospatial primary index, the LOCATION attribute is decomposed into a set of ROW ranges, provided to the table scanner. The distributed filter (iterator) further eliminates rows that fail to meet the query’s criteria (e.g. INCOME_CATEGORY IN ('10-15', '15-22')).

In a secondary index scan, the CBO may choose any or all of the secondary indices. The final result of ROW IDs is the intersection across the set of ROW IDS from each secondary index inquiry: Index(INCOME_CATEGORY) ∩ Index(PERSONS) ∩ Index(RECORD_DATE)

For example, consider the returned primary index identifiers from each index. . Index(INCOME_CATEGORY) = { 0178_xc456, 0199_xt356, 0394_jj593} . Index(RECORD_DATE) = { 0394_jj593, 0142_yh636, 0199_xt356, 0393_uh383} . Index(PERSONS) = { 0142_yh636,0394_jj593, 0178_xc456, 0384_kf0394,0199_xt356} The intersection set of identifiers is {0199_xt356, 0394_jj593}.

Suppose decomposition of the LOCATION attribute using the primary index strategy provided ranges (prefixes) 0197 - 0200 and 0201 - 0202. Intersecting with the secondary index, the only valid identifier with is 0199_xt356.

Assuming some statistics can be applied to provide some evidence on the actual number of rows that meet a set of ranges provided through primary index decomposition, a CBO can make a clear choice on using either a primary index or a set of one or more secondary indices to resolve features for a given query.

One can intuit that there is not a clear benefit of applying distributed filtering during a primary index range scan in lieu of merging primary index range decomposition with secondary index results to select the set of rows to fetch from a primary index. Secondary index query processing involves a preprocessing step to find a select set of specific ROW IDs. This step involves range scans on tables, one for each secondary index. The secondary indices do not preclude distributed filtering. They do introduce some additional measurable overhead.

==

  Consider the example query in the beginning of this document.
Suppose a feature (ID='xc456',LOCATION=POINT((-109.0448 37.0004)), PERSONS=39474, INCOME_CATEGORY='10-15',RECORD_DATE=2006-12-12T13:04:11Z, AFFILIATION='AFLCIO') is indexed to ROW_ID '0178_xc456'.

Concerns on efficiently managing the a single table with respect to writes, backups and table level functions D secondary index - how we persist them - (simplest case is what we do with feature id now - basically another table where the row id is the feature id, and the value is the row-id in the main geowave table) - how we index has an impact on the ability to accelerate some conditions (LIKE/ILIKE (case sensitive/insensitive prefix matching). We don’t have to necessarily support every possible condition - just need to know what we are and aren’t supporting.

statistics - information about cardinality (number of unique values), distribution (histogram), ranges (min/max), etc. for each indexed attribute. - how statistics are set, managed, modified, etc. - making sure the whole workflow "works" - also, exposed in geoserver? - Eric has already done a good bit of this

query planning - I think what we want is basically a cost based optimizer - Basically this is going to take the ECQL portion of a query and determine the best use of indexes to satisfy the query - http://docs.geoserver.org/latest/en/user/filter/ecql_reference.html - no joins for us to optimize across, so problem set is much simpler - typical decision might be something like "where carmake = "toyota" and carcolor = "white"" — What’s the cardinality of carmake - 100 out of 200 (total count) features? That means, on average, a particular value migh only comprise 2 rows. (if we have histogram values we can do better here). So say a cost of 2. — What’s the cardinality of carcolor = 2 out of 200? Typical value might still cost us 100 rows? Might be a threshold where it’s not worth using an index - have to experimentally determine where that’s at. — So in this case the optimizer might decide to do an index lookup for carmake, but just do a full scan on the results of that for carcolor. - Typical choices here, since we don’t have to do joins, are basically going to be — Do I use an index or not — Which stats are the best to use to answer the question above. - I think we start out with some sort of cost factor, but will have to run some benchmarks to determine what the final cost factors are (could even turn this into an auto-tune system for different installations) - Would be nice to have a log output optionally available (at some log level) for the results of the optimizer.

predicate pushdown - for the temporal and spatial predicates - (before, during, after, DWITHIN, TOUCHES, etc.) - implementing logic to accelerate these with indexes already in place

Provide feedback

Saved searches

Use saved searches to filter your results more quickly