Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QueryBuilder: 'contains' operator not implemented for SQLite storage backends #6494

Closed
Tracked by #6435
danielhollas opened this issue Jun 29, 2024 · 3 comments · Fixed by #6619
Closed
Tracked by #6435

QueryBuilder: 'contains' operator not implemented for SQLite storage backends #6494

danielhollas opened this issue Jun 29, 2024 · 3 comments · Fixed by #6619

Comments

@danielhollas
Copy link
Collaborator

# if operator == 'contains':

Some discussion about implementation:
sqlalchemy/sqlalchemy#7836

@giovannipizzi
Copy link
Member

giovannipizzi commented Jul 25, 2024

Here I put my investigations on how to possibly implement this.
This follows the reasoning #6552: so, before reading below, see there first for a general reasoning.

First attempt (with SUM/GROUP BY)

Like in #6552, I work on a test DB that looks like this:
Screenshot 2024-07-25 at 12 20 12

Here is a query that works (NOTE: this is only for arrays - see next comment to extend also to dictionaries), explanation below:

SELECT 
json_type(test_1.extras, '$') as type,
test_1.id, test_1.extras,
sum(filter_cond.value is not NULL) as num_found
FROM test AS test_1, json_each(test_1.extras, '$') AS anon_1
LEFT JOIN (SELECT value from json_each(json_array('a', 'c'))) AS filter_cond
ON anon_1.value = filter_cond.value 
GROUP BY test_1.id
HAVING  num_found = 2 /* this should be equal to the length of the json_array used above */

Here is the result:
Screenshot 2024-07-25 at 12 23 37

Explanation

Here, I am filtering for an array at the top level, containing the two elements 'a' and 'c' (via json_each(json_array('a', 'c'))). The query groups by all items in the list with those elements, and counts those found, thus finding - for each node or entry in the test table - the list of items that match the values you are querying for. NOTE: The last value num_found=2 must be adapted to be the length of the elements in the contains query, this should be done when generating the query. Only if the number is equal to 2 (in this example), it means that both are included, so the query returns the expected results.

Also negation should work correctly (replacing = 2 with != 2 at the end), the result:
Screenshot 2024-07-25 at 12 24 27

Note on extending to different types

One probably needs to add additional filters on the type, and use the CASE statement. An example (incorrect, but to remember the syntax) would be:

[...]
LEFT JOIN (SELECT value from json_each(json_array('a', 'c'))) AS filter_cond
ON 
CASE 
 WHEN json_type(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) = 'object' THEN anon_1.key
 WHEN json_type(JSON_QUOTE(JSON_EXTRACT(test_1.extras, '$'))) = 'array' THEN anon_1.value
END = filter_cond.value 
GROUP BY test_1.id
[...]

Note next comment on why this is NOT the correct query.

How to implement in AiiDA (SQLAlchemy and QueryBuilder)

The same notes/discussion apply as those in #6552. In particular, using an aggregation is not ideal.

I honestly don't know yet if this is possible with a simple filter, rather than with joins - I report my findings here because maybe someone has some idea, while I continue thinking about it. Pinging @danielhollas @sphuber
[EDIT]
Actually, this for instance seems to be working, with a correlated subquery:

select test_1.*, (SELECT COUNT(*) FROM (
	  SELECT * from json_each(test_1.extras, '$') AS anon_1 
	  WHERE anon_1.key IN ('c', 'b')
	)) as count from test as test_1
where count = 2

Screenshot 2024-07-25 at 13 07 34

or, even better, this query where the subquery only appears in the where statement:

select test_1.* from test as test_1
where (SELECT COUNT(*) FROM (
	  SELECT * from json_each(test_1.extras, '$') AS anon_1 
	  WHERE anon_1.key IN ('c', 'b')
	)) = 2

I didnt' think yet if this is easy to implement currently (maybe one just puts the subquery in the filter, and SQLA uses the right syntax - to test)

@giovannipizzi
Copy link
Member

Important note on feature parity with PostgeSQL

I am not sure neither the docs of SQLAlchemy on the PSQL dialect, nor our QueryBuilder docs, really explain what contains exactly does for a dictionary (or for strings, ...).

For a list, it should return "all lists where all elements are included". For elements, I don't know, needs to be checked.
For a dictionary, from the SQLAlchemy docs I was thinking that it just checks if the elements of the list are fully included in the keys of the dict. However, I think this is not the case. I think it requires a dict, and checks that the key-value pairs all exist.

Here some quick examples (not to forget), but I suggest to do proper testing before implementing the more complex query with CASE etc, to make sure we implement the same logic as in PSQL, and then we should clarify very well with examples also in our QueryBuilder docs.

Data generation

I first generate some simple data (in a PSQL backend) - note this should be done only once, or you will get multiple copies of the results!

node = orm.Int(1).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2, 'c': 3}})

node = orm.Int(2).store()
node.base.extras.set_many({'sub': {'b': 2, 'c': 3}})

node = orm.Int(3).store()
node.base.extras.set_many({'sub': ['a', 'b', 'c']})

node = orm.Int(4).store()
node.base.extras.set_many({'sub': ['b', 'c']})

node = orm.Int(5).store()
node.base.extras.set_many({'sub': 1})

Queries

"Contains" with a list

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"contains": ["a"]}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
print(qbuild.all())

returns

[[3, ['a', 'b', 'c']]]

"Does not contain" with a list

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"!contains": ["a"]}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
qbuild.all()

returns

[[1, {'a': 1, 'b': 2, 'c': 3}],
 [2, {'b': 2, 'c': 3}],
 [4, ['b', 'c']],
 [5, 1]]

"Contains" with a dictionary, key exists, values does not match

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"contains": {"a": 'whatever'}}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
qbuild.all()

returns

[]

"Contains" with a dictionary, key-value pair exists

qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"contains": {"a": 1}}}
qbuild.append(
    orm.Int,
    project=['attributes.value', 'extras.sub'],
    filters=filters,
)
qbuild.all()

returns

[[1, {'a': 1, 'b': 2, 'c': 3}]]

@giovannipizzi
Copy link
Member

giovannipizzi commented Jul 25, 2024

Possible solution for AiiDA

After some testing (and the correlate_except is the key part of this query), this seems to be working!

        if operator == 'contains':
            from sqlalchemy import select

            if isinstance(value, (list, set)):
                subq = select(database_entity).where(func.json_each(database_entity).table_valued('value', joins_implicitly=True).c.value.in_(value)).correlate_except()
                subsubq = select(func.count()).select_from(subq).scalar_subquery()
                return subsubq == len(value)
            elif isinstance(value, dict):
                raise NotImplementedError
            else:
                raise TypeError("contains filters can only have as a parameter a list (when matching against lists) or dictionaries (when matching against dictionaries)")

If useful, here is an example script I was using for testing:

from aiida import orm, load_profile
from aiida.storage.sqlite_temp import SqliteTempBackend
temp_profile = SqliteTempBackend.create_profile('temp-profile')
load_profile(temp_profile, allow_switch=True)

node = orm.Int(1).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2, 'c': 3}})
node = orm.Int(2).store()
node.base.extras.set_many({'sub': {'b': 2, 'c': 3}})
node = orm.Int(3).store()
node.base.extras.set_many({'sub': {'a': 1, 'b': 2}})
node = orm.Int(4).store()
node.base.extras.set_many({'sub': "a"})
node = orm.Int(5).store()
node.base.extras.set_many({'sub': 1})
node = orm.Int(6).store()
node.base.extras.set_many({'sub': ["a", "b"]})
node = orm.Int(7).store()
node.base.extras.set_many({'sub': ["c", "d"]})
node = orm.Int(8).store()
node.base.extras.set_many({'sub': ["a", "c"]})
node = orm.Int(9).store()
node.base.extras.set_many({'sub': ["b", "d", "a"]})


qbuild = orm.QueryBuilder()
filters = {"extras.sub": {"~contains": ["a", "b"]}}
#filters = {"extras.sub": {"contains": ["c"]}}

qbuild.append(
    orm.Int,
    project=['attributes.value'],
    filters=filters,
)
print(qbuild.all())
print(qbuild.as_sql(True))

Still TODOs (for whoever wants to pick this up):

  • check that the logic with is instance is OK, and instead of (list, set) or dict one could use other baseclasses
  • implement similarly the case of dict, should be quite easy, one has to match on pairs for (key, value) inside the query
  • check the edge cases (behavior for strings, floats, ...) and also for the list case against the dict case and vice versa (should we add an additional filter on the type?), to match the PSQL case (e.g. I think that the list case, currently implemented above, would match against the dict.values() list in the DB, I don't know if this is the PSQL case).
  • add many tests!! (including for dicts at the very top level in extras, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants