-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sets support #206
sets support #206
Conversation
* Corrects resumption token with from and until dates. * Closes inveniosoftware#178.
refactor: pydocstyle
test: update resumptionToken test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first glance it LGTM, I am going to test it now... The only thing where I slightly disagree are all the styling changes... as much as I would love to use formatting.... I think is good for new files but we should leave old files untouched until we manage to do it organization wide (keeping the history)....
@slint IT WORKS 🚀!
List of tasks to complete before merge:
Screenshots List sets: No records: Set with records (e.g. |
Removed Python 2 support 🚀 |
invenio_oaiserver/percolator.py
Outdated
# Create the percolator doc_type in the existing index for >= ES5 | ||
for index, mapping_path in current_search.mappings.items(): | ||
percolator_doc_type = _get_percolator_doc_type(index) | ||
_create_percolator_mapping(index, percolator_doc_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we create
in a delete
function? the inner code of the function basically creates it if doesnt exists... seems to simply be there to avoid a potential "cannot delete what's not there" error... but that should be taken care of by the 404
below.
for index in current_search.mappings.keys(): | ||
# Create the percolator doc_type in the existing index for >= ES5 | ||
percolator_doc_type = _get_percolator_doc_type(index) | ||
_create_percolator_mapping(index, percolator_doc_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking with @rekt-hard we have deemed this line unnecesary... any ideas why it was there @slint @lnielsen
search_pattern="(title_statement.title:foo AND genre:math)") | ||
create_oaiset(spec="nonmath", | ||
search_pattern="(title_statement.title:foo AND -genre:math)") | ||
def test_set_with_no_records(without_oaiset_signals, schema): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous tests were hard to read and I think had multiple checks for the same thing. Plus they make use of add_record
and remove_record
for manually managed sets, which we cannot do now with the dynamic ones.
I tried to bring back the tests that make sense and split them by use case.
# raise error that no matches can be found ? | ||
query = Q('match_none') | ||
else: | ||
query = Q(query_string_parser(set.search_pattern)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The application crashes if the search_pattern
is empty. Why the default behaviour is not to search in the _oai.set
fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, an empty search_pattern
does not make sense, if there is no way to manually manage sets.
The idea is to not store the information of the sets a record belongs to within the record itself. Instead, the record is matched against the queries of the sets (ElasticSearch percolate query) and this information is generated dynamically on request.
If the set argument is provided (which is the case if the code above is to be executed), then only records matching the specified set will be returned. Thus it is necessary to lookup the set and get its search pattern. After fetching the records belonging to this set, the above step is performed (i.e. the records are matched against all the sets queries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your reply.
Here is some suggestions:
- If the
search_pattern
is required, I suggest to add a no null constraint in the db model - to speed up the
ListIdentifiers
, you can usesource
for the elasticsearch during theget_records
- the source fields can be a configuration for
ListRecords
- using a
search_query
asinsitution:test
raise the following error:TypeError: expected string or bytes-like object
, replacing byinstitution:"test"
works (bug)
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is to not store the information of the sets a record belongs to within the record itself. Instead, the record is matched against the queries of the sets (ElasticSearch percolate query) and this information is generated dynamically on request.
Probably I misunderstand something. If during the request we use the search_pattern
, is the percolator still useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably I misunderstand something. If during the request we use the
search_pattern
, is the percolator still useful?
The percolate query is used to get all sets a record belongs to (needed for the header of the response).
The additional query above is used, when a setSpec is specified in the request (f.e. https://127.0.0.1:5000/oai2d?verb=ListRecords&metadataPrefix=oai_dc&set=my-set). It is used during record querying so only records belonging to this set are retrieved. Then we take these records and do a percolate query to get all the sets the records belong to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jma First of all thanks for reviewing!
As @rekt-hard explained, the big change here is that we want to get away from storing OAI sets information inside the record. It's highly very inefficient to do it this way (e.g. define a new set, or modify an existing set and you may need to reindex every single record in the entire repository. The only thing we're loosing is the ability to say that a record was deleted from a set. A harvest can obtain that information by running ListIdentifiers for a set and compare against a local set.
On a side note, can I ask you not to use the "Request changes" when reviewing a PR. Instead, simply prefix your comment with Comment/Question/Minor//Major according (see definitions here). Basically idea is that lack of a green approve, means it's not approved. Just wanted to explain why I'm gonna press the "dismiss review"
Prefix with comment/question/minor/moderate/major instead
* Adds a check to prevent creating percolator mappings for indices not used by OAI-PMH. * Removes the unusded `_percolate_query` function.
Will backport some of the fixes and release in |
adds sets support
closes #178 , #186 , #191, #210