Skip to content

Secondary Indexing

Eric Robertson edited this page Jun 26, 2015 · 24 revisions

Preliminary

Examples in this document refer to a feature type with the following attributes

ATTRIBUTE NAME

ATTRIBUTE TYPE

LOCATION

Geometry

PERSONS

Numeric (Long)

RECORD_DATE

Date

INCOME_CATEGORY

String (specific set of values)

AFFILIATION

String (mixed text)

ID

String

Problem Description

Find the optimal set of one ore more indices to answer a query. A query requests a data store to find features matching the query’s search criteria. A feature is a single instance of data (e.g record, entity, etc.).

Query Syntax: ECQL http://docs.geoserver.org/stable/en/user/filter/ecql_reference.html Query Example: INTERSECTS(LOCATION, "Polygon-109.0448 37.0004, -102.0424 36.9949, -102.0534 41.0006, -109.0489 40.9996, -109.0448 37.0004") AND RECORD_DATE DURING 2006-11-30T00:30:00Z/2006-12-31T00:30:00Z AND PERSONS BETWEEN 1000000 AND 3000000 AND INCOME_CATEGORY in ('10-15', '15-22') Example Feature: (ID='xc456',LOCATION=POINT-109.0448 37.0004, PERSONS=39474, INCOME_CATEGORY='10-15',RECORD_DATE=2006-12-12T13:04:11Z, AFFILIATION='AFLCIO') Example Primary Index Identifier (ROW ID): '0178_xc456'

Possible Indices

  1. Primary Geospatial Index on LOCATION

  2. Primary Geospatial-Temporal Index on LOCATION and RECORD_DATE

  3. Secondary Time or Numeric Index on RECORD_DATE

  4. Secondary Text Index on INCOME_CATEGORY

  5. Secondary Numeric Index on PERSONS

Type of Secondary Indices

  1. Text

    1. Optimized for Regular Expression

    2. Optimized for equality

  2. Numeric

    1. Range or Single Value

    2. Numeric, Floating Point, Big Integer/Decimal?

  3. Date/Times

    1. Treat as Numeric (GMT milliseconds from epoch) or Alternate Representation (e.g. YYYYMMDDHHMISSMMMMM)

    2. Range or Single Value

  4. Geometry (Alternate from primary)

Role of the Primary Index

The primary index maintains a copy of each feature. A primary index may have more than copy of feature. Each copy is given its own unique index identifier (ROW ID). A feature may be stored in multiple primary indices.

For GeoWave, the primary index is range based index (e.g. bounding rectangle, bounding time).

Compound Indices

The generalization of an index should include the ability to create compound keys. Geo-temporal is an example. There may be some limitations. Numeric indices can be managed using techniques for primary indices (ie. Space Filling Curve). Text may be limited to equality. The design should not impede further future development.

Numeric ranges derived from two attributes (e.g. start and end) are considered a type of compound key for this discussion. For the moment, this document will not derive a distinction from a single attribute representative of a range, such as start+end or start+duration.

Persistence Design

The design in this section assumes Accumulo as a data store.

A secondary index is a key derived from an attribute value (or attributes in a compound scenario) associated with a primary index identifier. For example, attribute PERSONS with value 29383 is associated with a list of primary index identifiers of those features that have a PERSONS attribute value matching exactly 29383 for a specific primary index.

Table Design

A each primary index is a separate table. A primary index table is named according to name space and index type. For example, geospatial and geo-temporal indices are separate tables.

Does each type of secondary index reside in its own table?

Yes. The approach is consistent with primary index separation. It can be potentially easier manage multiple tables from the perspective of writes and backups.

The alternative is to maintain a single table using the column family with locality configured to isolate each type of secondary index. This approach reduces the growth of secondary index tables, one per each type of secondary index. It is unclear if there are any performance gains over using locality groups. Locality groups organize column families in separate files.

What is the basic structure of a table?

ROW ID

CF

CQ

VIS

VALUE

Attribute Value

Attribute Type

Attribute Name

Attribute Visibility

Primary Index identifier(ROW ID)

The table design reflects a separate row for each Primary Index identifier to attribute value association. This permits rapid ingest. Storing a set of index identifiers in a column value would require encoding and decoding the set for each addition/subtraction.

The example below demonstrates three features indexed on INCOME_CATEGORY in ('10-15').

ROW ID

CF

CQ

VIS

VALUE

10-15

String

INCOME_CATEGORY

-

0492_xc284

10-15

String

INCOME_CATEGORY

-

0912_ks942

10-15

String

INCOME_CATEGORY

-

0894_kl038

The structure requires versioning be disabled since the ROW_ID is not unique between the different features.

How are updates and deletions manager?

In Accumulo, updates and deletions are managed using the versioning iterator. For updates, versioning iterating selects each unique row ID with the latest time stamp. For deletions, a deletion flag is marked on a specific ROW ID and timestamp. All rows with the same ROW ID and older timestamp are deleted and eventually removed in a compaction.

Appending some unique identifier to the ROW ID enables deletion and modification. This assumes either the attribute value has a uniform size or range scans can be polluted with false positives requiring additional filtering.

There are several options here to explore:
Forcing uniformity of size for the secondary index.

Research into text indexing may find this is indeed possible and preferable.

Since the column family is redundant in the multi-table approach, move the attribute name to column family and the primary index identifier into the column qualifier. The column value is unused. Updates and deletions involve removing the column. Accumulo does support wide-rows permitting such an approach.

ROW ID

CF

CQ

VIS

VALUE

Attribute Value

Attribute Name

Primary Index identifier(ROW ID)

Attribute Visibility

-

Do not adjust index for updates and deletions.

This results in false positives filtered during query processing. Consider that, at present, with the exception of count, updates are not considered separately from inserts in statistics. Treating updates as inserts in the secondary index yields false positives filtered during query processing.

Off-line processing of updates and deletions.

A separate special 'visibility' can be used to flag attribute values associated with one or more features that have changed due to updates or deletions. Periodically, the associated row ids can be reprocessed in a batch job. Reprocessing involves inserting a delete flag for the ROW ID (attribute value) and then reinserting the valid rows(with newer timestamps), omitting invalid rows.

The example below uses 'xxchangedxx' as the flag.

ROW ID

CF

CQ

VIS

VALUE

Attribute Value

Attribute Type

Attribute Name

'xxchangedxx'

'-'

Which primary index ROW ID is referenced in a secondary index for features indexed in more than one primary index?

Consistent with approaches used for primary index and statistic selection, as communicated by the data adapter, the adapter provides a default option and permits customization. The default option uses the index with the least number if dimensions (e.g. geo-spatial over geo-temporal). In cases where the decision requires a change or there is a conflict (e.g. two primary index with same number of dimensions), the adapter provides additional guidance.

Should secondary indices reference primary index identifiers from all primary indices?

In this solution, only one ID is required to look up a feature. For the purposes of extracting a feature associated with a secondary index, the extra IDs are redundant, requiring more space and logic.

The chosen approach precludes intersecting joins of look-ups between secondary and primary indices. Consider the example data stored in both geospatial and geotemporal indices. Suppose, to efficiently answer the query, the system chooses a geotemporal primary index (for LOCATION and RECORD_DATE) and a secondary index on PERSONS. If the secondary index only maintains primary index identifiers for the geospatial index, then intersecting join between index inquiries cannot be processed, as the identifier does not match between the two indices.

To support primary and secondary index joins, each primary identifier along with the primary index type must be maintained with the secondary index. The answer to the next question provides additional details.

A primary index may have more than one identifier per feature. Does the secondary index reference all index identifiers?

The secondary index references only one selected index identifier. The index references the first identifier (ROW ID), lexicographically. As part of a robust solution for intersecting joins when inquiring multiple secondary indices, the feature ID is extracted from the index identifier. This is discussed in more detail later in this document.

This approach precludes intersecting joins between primary indices and secondary indices. A Cost Based Optimizer (CBO) fulfills a query criteria choosing between a single secondary index, intersecting joins between a set of secondary index, or a primary index. To reason the benefit/complexity analysis of this decision, consider each action separately.

In a primary index scan, the query is decomposed into a set of ranges. Distributed filtering is applied to each discovered row to eliminate features that do not meet the query’s criteria. Using the example query in a geospatial primary index, the LOCATION attribute is decomposed into a set of ROW ranges, provided to the index table scanner. The distributed filter (iterator) further eliminates rows that fail to meet the query’s criteria (e.g. INCOME_CATEGORY IN ('10-15', '15-22')).

In a secondary index scan, the CBO may choose any or all of the secondary indices. The final result of ROW IDs is the intersection across the set of ROW IDS from each secondary index inquiry: Index(INCOME_CATEGORY) ∩ Index(PERSONS) ∩ Index(RECORD_DATE)

For example, consider the returned primary index identifiers from each index.

  1. Index(INCOME_CATEGORY) = { 0178_xc456, 0199_xt356, 0394_jj593}

  2. Index(RECORD_DATE) = { 0394_jj593, 0142_yh636, 0199_xt356, 0393_uh383}

  3. Index(PERSONS) = { 0142_yh636,0394_jj593, 0178_xc456, 0384_kf0394,0199_xt356}

The intersection set of identifiers is {0199_xt356, 0394_jj593}.

Suppose decomposition of the LOCATION attribute using the primary index strategy provided ranges (prefixes) 0197 - 0200 and 0201 - 0202. Intersecting with the secondary index, the only valid identifier with is 0199_xt356.

Assuming some statistics can be applied to provide some evidence on the actual number of rows that meet a set of ranges provided through primary index decomposition, a CBO can make a clear choice on using either a primary index or a set of one or more secondary indices to resolve features for a given query.

One can intuit that there is not a clear benefit of intersecting joins between secondary and primary indices. A sufficiently constrained primary index query eliminates a small set of non-matching feature using distributed filtering. Similarly, secondary indices, when used appropriately, sufficiently reduces the number of specific features to inspect based on a query’s constraints. The scenario for using a secondary index set join with a primary index result set occurs when statistics indicate that set a select set of secondary of indices and a primary index are, in isolation, insufficiently constraining. However, attribute level statistics cannot determine a measurable outcome of a intersecting join. Consider statistics indicate, for a given query, secondary index results in 5000 features and a primary index results in 6000 features. Without compound indexing with its own statistics, a CBO cannot determine the overlap between the two sets of 5000 and 6000 features. Rather, the CBO determines the size of final result set is less than 5000. Engaging the secondary indexing, incurring the associated overhead, cannot guarantee consistent results when the secondary indices do not sufficient eliminate the result set.

Statistics

The following statistics are used for cost based appraisal of indices for query processing.

Statistic

Type

Application

Histogram

Numeric

Provides an estimate of the number of features that have an attribute with a specific value

Range

Numeric

Provides evidence of the range of values in an index

Cardinality

Non-Numeric

Provides the number of unique values in an index.

Count MinSketch

Non-Numeric

Provides the number of matching features for a specific value to an attribute. Not recommended for attributes that do not have have a constrained set of values (e.g. descriptions, comments, etc.).

Distribution

Any

Generalization of a histogram (applicable to text, geometries, etc.)

Query Planning

Query planning ascertains, probabilistically, selectivity of an index given query constraints. The selectivity can be in terms of the average number of features. For example, if the cardinality of the attribute for AFFILIATION is one thousand and there are ten thousand features, the planner could assume a uniform distribution, estimating ten features per each unique value AFFILIATION. Worst case, 9001 features may be discovered in the secondary index. Query execution could abandon a secondary index scan once a threshold has been met if the results turn out much worse than the estimate.

[NOTE] Since the cardinality of AFFILIATION is low, a count min-sketch statistic can be used give a more accurate indication for a specific value.

Cost factors and thresholds per index will be part of the discovery process to determine the optimal use of indexing. Cost factors include considerations for overhead when processing a secondary index, as the results are individual primary index identifiers, not a range.

Limitations

This section will include limitations of secondary index, such as effectiveness of LIKE and ILIKE filtering on an indexed text attribute.

Features in Consideration

  1. Log of optimizer output to evaluate and tune.

  2. Benchmarks to evaluate optimizer.

  3. Auto-tune capabilities

  4. predicate pushdown

Clone this wiki locally