Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sets support #206

Merged
merged 19 commits into from
Feb 22, 2022
Merged

sets support #206

merged 19 commits into from
Feb 22, 2022

Conversation

rekt-hard
Copy link
Contributor

@rekt-hard rekt-hard commented Jan 18, 2022

adds sets support

closes #178 , #186 , #191, #210

@rekt-hard rekt-hard marked this pull request as draft January 18, 2022 15:37
@rekt-hard rekt-hard changed the title WIP: Sets sets support Jan 18, 2022
@rekt-hard rekt-hard marked this pull request as ready for review January 19, 2022 08:30
rerowep and others added 2 commits January 23, 2022 19:12
* Corrects resumption token with from and until dates.
* Closes inveniosoftware#178.
@rekt-hard rekt-hard requested a review from jma January 25, 2022 09:24
invenio_oaiserver/models.py Outdated Show resolved Hide resolved
tests/test_percolator.py Outdated Show resolved Hide resolved
Copy link
Member

@ppanero ppanero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance it LGTM, I am going to test it now... The only thing where I slightly disagree are all the styling changes... as much as I would love to use formatting.... I think is good for new files but we should leave old files untouched until we manage to do it organization wide (keeping the history)....

invenio_oaiserver/percolator.py Show resolved Hide resolved
invenio_oaiserver/percolator.py Outdated Show resolved Hide resolved
invenio_oaiserver/percolator.py Outdated Show resolved Hide resolved
@ppanero
Copy link
Member

ppanero commented Jan 31, 2022

@slint IT WORKS 🚀!

List of tasks to complete before merge:

  • Take a decision on styling changes (do we revert? do we merge them?)
  • Remove commented code and make into issues

Screenshots

List sets:

Screenshot 2022-01-31 at 09 39 28

No records:

Screenshot 2022-01-31 at 10 11 48

Set with records (e.g. search_pattern=metadata.description:seat):

Screenshot 2022-01-31 at 10 10 12

@ppanero ppanero requested a review from slint January 31, 2022 09:22
@ppanero
Copy link
Member

ppanero commented Jan 31, 2022

Removed Python 2 support 🚀

@ppanero ppanero removed their assignment Jan 31, 2022
# Create the percolator doc_type in the existing index for >= ES5
for index, mapping_path in current_search.mappings.items():
percolator_doc_type = _get_percolator_doc_type(index)
_create_percolator_mapping(index, percolator_doc_type)
Copy link
Member

@ppanero ppanero Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we create in a delete function? the inner code of the function basically creates it if doesnt exists... seems to simply be there to avoid a potential "cannot delete what's not there" error... but that should be taken care of by the 404 below.

for index in current_search.mappings.keys():
# Create the percolator doc_type in the existing index for >= ES5
percolator_doc_type = _get_percolator_doc_type(index)
_create_percolator_mapping(index, percolator_doc_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking with @rekt-hard we have deemed this line unnecesary... any ideas why it was there @slint @lnielsen

search_pattern="(title_statement.title:foo AND genre:math)")
create_oaiset(spec="nonmath",
search_pattern="(title_statement.title:foo AND -genre:math)")
def test_set_with_no_records(without_oaiset_signals, schema):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous tests were hard to read and I think had multiple checks for the same thing. Plus they make use of add_record and remove_record for manually managed sets, which we cannot do now with the dynamic ones.
I tried to bring back the tests that make sense and split them by use case.

jma
jma previously requested changes Feb 7, 2022
# raise error that no matches can be found ?
query = Q('match_none')
else:
query = Q(query_string_parser(set.search_pattern))
Copy link

@jma jma Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The application crashes if the search_pattern is empty. Why the default behaviour is not to search in the _oai.set fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, an empty search_pattern does not make sense, if there is no way to manually manage sets.

The idea is to not store the information of the sets a record belongs to within the record itself. Instead, the record is matched against the queries of the sets (ElasticSearch percolate query) and this information is generated dynamically on request.

If the set argument is provided (which is the case if the code above is to be executed), then only records matching the specified set will be returned. Thus it is necessary to lookup the set and get its search pattern. After fetching the records belonging to this set, the above step is performed (i.e. the records are matched against all the sets queries).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your reply.

Here is some suggestions:

  • If the search_pattern is required, I suggest to add a no null constraint in the db model
  • to speed up the ListIdentifiers, you can use source for the elasticsearch during the get_records
  • the source fields can be a configuration for ListRecords
  • using a search_query as insitution:test raise the following error: TypeError: expected string or bytes-like object, replacing by institution:"test" works (bug)

Thanks

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to not store the information of the sets a record belongs to within the record itself. Instead, the record is matched against the queries of the sets (ElasticSearch percolate query) and this information is generated dynamically on request.

Probably I misunderstand something. If during the request we use the search_pattern, is the percolator still useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably I misunderstand something. If during the request we use the search_pattern, is the percolator still useful?

The percolate query is used to get all sets a record belongs to (needed for the header of the response).
The additional query above is used, when a setSpec is specified in the request (f.e. https://127.0.0.1:5000/oai2d?verb=ListRecords&metadataPrefix=oai_dc&set=my-set). It is used during record querying so only records belonging to this set are retrieved. Then we take these records and do a percolate query to get all the sets the records belong to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jma First of all thanks for reviewing!

As @rekt-hard explained, the big change here is that we want to get away from storing OAI sets information inside the record. It's highly very inefficient to do it this way (e.g. define a new set, or modify an existing set and you may need to reindex every single record in the entire repository. The only thing we're loosing is the ability to say that a record was deleted from a set. A harvest can obtain that information by running ListIdentifiers for a set and compare against a local set.

On a side note, can I ask you not to use the "Request changes" when reviewing a PR. Instead, simply prefix your comment with Comment/Question/Minor//Major according (see definitions here). Basically idea is that lack of a green approve, means it's not approved. Just wanted to explain why I'm gonna press the "dismiss review"

* Adds a check to prevent creating percolator mappings for indices not
  used by OAI-PMH.
* Removes the unusded `_percolate_query` function.
@slint slint merged commit 3048847 into inveniosoftware:master Feb 22, 2022
@slint
Copy link
Member

slint commented Feb 22, 2022

Will backport some of the fixes and release in 1.3.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

uncorrect resumtion token
6 participants