Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Array Support POC #282

Draft
wants to merge 5 commits into
base: integ-array-support-poc
Choose a base branch
from

Conversation

forestmvey
Copy link

@forestmvey forestmvey commented Jul 6, 2023

Description

Add support for indexing arrays using square parenthesis. This implementation gives users the responsibility on how to handle their data. If array indexing parenthesis are supplied then the plugin assumes an array is to be found with supplied field name. In the event of non-array values or invalid indexes used by the user, an empty value will be returned.

The goal of adding the syntax option for returning and indexing arrays is to provide users the ability to utilize arrays without the need for introducing a breaking change with the SQL plugin. By offloading the responsibility for how to handle stored array data we can avoid the need for additional data mapping.

Syntax

  • array[:]
  • array[index]<[index]>

Examples

Dataset:

"array": [[1], [2, [3, 4]], 5]

Query:

SELECT array FROM arrays;

Result:
1


Query:

SELECT array[:] FROM arrays;

Result:
[[1], [2, [3, 4]], 5]


Query:

SELECT array[0] FROM arrays;

Result:
[1]


Query:

SELECT array[1] FROM arrays;

Result:
[2, [3, 4]]


Query:

SELECT array[1][1] FROM arrays;

Result:
[3, 4]


Query:

SELECT array[1][1][0] FROM arrays;

Result:
3


Query:

SELECT array[2] FROM arrays;

Result:
5


Query:

SELECT array[3] FROM arrays;

Result:
null


Object Arrays

The V2 engine has the current functionality for object arrays.
Dataset:

"objectArray": [{"id": [1, 2]}, {"id": 2}],
"nestedArray": [{"id": [1, 2]}, {"id": 2}]

Query:

SELECT objectArray FROM arrays;

Result
{"id": 1}


Query:

SELECT nestedArray FROM arrays;

Result
[{"id": [1, 2]}, {"id": 2}]


Added Functionality


Query:

SELECT objectArray[:] FROM arrays;
SELECT nestedArray[:] FROM arrays;

Result
[{"id": [1, 2]}, {"id": 2}]


Query:

SELECT objectArray[1] FROM arrays;
SELECT nestedArray[1] FROM arrays;

Result
{"id": 2}


Query:

SELECT objectArray[1].id FROM arrays;
SELECT nestedArray[1].id FROM arrays;

Result
[1, 2]


Query:

SELECT objectArray[1].id[1] FROM arrays;
SELECT nestedArray[1].id[1] FROM arrays;

Result
2


Edge Cases

  • Index out of bounds returns empty
  • Indexing non-array value returns empty

Additional Functionality

  • Add support for ranges: array[<firstIndex>:<lastIndex>]

Limitations

Array support is limited to the support that OpenSearch has for arrays. Arrays are not mapped as array type but the type defined within the arrays. As well arrays are stored as a single document and will return entire arrays if a portion of the array matches a filter. For example an array consisting of [1, 2] would return the whole array when filtering for values of 1 or 2.

Clause Support

  • ✅ SELECT Clause
    • ❌ Function Support
  • ❌ ORDER BY
  • ❌ WHERE
  • ❌ GROUP BY
  • ❌ HAVING

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link

codecov bot commented Jul 6, 2023

Codecov Report

Merging #282 (41d7bf5) into integ-array-support-poc (e186cf7) will decrease coverage by 0.69%.
The diff coverage is 49.36%.

@@                      Coverage Diff                      @@
##             integ-array-support-poc     #282      +/-   ##
=============================================================
- Coverage                      99.98%   99.30%   -0.69%     
- Complexity                      2624     2637      +13     
=============================================================
  Files                            205      206       +1     
  Lines                           5955     6025      +70     
  Branches                         378      392      +14     
=============================================================
+ Hits                            5954     5983      +29     
- Misses                             1       38      +37     
- Partials                           0        4       +4     
Flag Coverage Δ
sql-engine 99.30% <49.36%> (-0.69%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ensearch/sql/expression/ExpressionNodeVisitor.java 95.45% <0.00%> (-4.55%) ⬇️
...org/opensearch/sql/analysis/QualifierAnalyzer.java 69.23% <11.11%> (-30.77%) ⬇️
...rg/opensearch/sql/analysis/ExpressionAnalyzer.java 92.09% <36.36%> (-7.91%) ⬇️
...earch/sql/expression/ArrayReferenceExpression.java 43.33% <43.33%> (ø)
...nsearch/sql/analysis/SelectExpressionAnalyzer.java 100.00% <100.00%> (ø)
...opensearch/sql/expression/HighlightExpression.java 96.29% <100.00%> (-3.71%) ⬇️
...opensearch/sql/expression/ReferenceExpression.java 100.00% <100.00%> (ø)
...h/sql/expression/function/OpenSearchFunctions.java 100.00% <100.00%> (ø)
...nsearch/sql/storage/bindingtuple/BindingTuple.java 100.00% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@forestmvey forestmvey requested a review from a team July 6, 2023 16:39
@GumpacG
Copy link

GumpacG commented Jul 6, 2023

Regarding future support for features like geo_shape, consider cases like:
Data:

{
    "borders": [
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        },
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        },
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        },
        {
            "border": {
                "type" : "linestring",
                "coordinates" : [[74.0060, 40.7128], [71.0589, 42.3601]]
            }
        }
    ]  
}
  • accessing coordinates with borders[0].border.coordinates
  • accessing starting points of border with borders[0].border.coordinates[0]

Note: this example may be an edge case and may not represent typical real world use cases.

Copy link

@Yury-Fridlyand Yury-Fridlyand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.

Please, add more tests:

  • Array of objects
  • Array in object
  • Array in nested

Consider different tricky combinations, like this for the first test:

[
  {
    "id" : 1,
    "name" : "one"
  },
  {
    "id" : 2,
    "name" : "two"
  },
  null
]

2.

What if I do?

select int0[0] from calcs

Please, add a test for this.

3.

Please add a design doc.

4.

Please, fix column headers/names.

@@ -254,7 +254,7 @@ public void nested_function_and_field_with_order_by_clause() {
rows("a", 4),
rows("b", 2),
rows("c", 3),
rows("zz", new JSONArray(List.of(3, 4))));
rows("zz", 3));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't a breaking change? Please, compare this with nested behavior in V1 and V2 @ 2.8.

Copy link
Author

@forestmvey forestmvey Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is V1 behaviour, but my updates make this consistent with V2 behaviour. I believe this should be considered a bug and not a breaking change because if we were to do a select on this value without the nested function, a 3 would be returned. Very weird case here but this new change aligns with whats expected with V2.

@@ -814,6 +815,11 @@ columnName
: qualifiedName
;

arrayColumnName
: qualifiedName LT_SQR_PRTHS COLON_SYMB RT_SQR_PRTHS
| qualifiedName LT_SQR_PRTHS decimalLiteral RT_SQR_PRTHS

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could decimalLiteral be negative?

@Override
public UnresolvedExpression visitArrayColumnName(ArrayColumnNameContext ctx) {
UnresolvedExpression qualifiedName = visit(ctx.qualifiedName());
if (ctx.decimalLiteral() == null) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make decimalLiteral named value in ANTLR grammar, so you can reference the name here.

@@ -105,8 +107,17 @@ public List<NamedExpression> visitAllFields(AllFields node,
AnalysisContext context) {
TypeEnvironment environment = context.peek();
Map<String, ExprType> lookupAllFields = environment.lookupAllFields(Namespace.FIELD_NAME);
return lookupAllFields.entrySet().stream().map(entry -> DSL.named(entry.getKey(),
new ReferenceExpression(entry.getKey(), entry.getValue()))).collect(Collectors.toList());
return lookupAllFields.entrySet().stream().map(entry ->
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an example in the PR description for SELECT *? Will this return the full array for a column when SELECT * is queried? Since SELECT array still returns just the first value, is this still the case for SELECT * as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the functionality for SELECT * will stay the same, otherwise would be a breaking change.

@@ -107,6 +107,6 @@ private static boolean isQuoted(String text, String mark) {
}

public static String removeParenthesis(String qualifier) {
return qualifier.replaceAll("\\[.+\\]", "");
return qualifier.replaceAll("\\[\\d+\\]", "");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a regex? Do you want to match : or \d+:\d+ (or even \d*:\d*)?
Add a comment for this function please.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants