UNIMARC Thompson-Traill Completeness #416
Dear @gegic
but later the code contains something that is similar to the content of the removed method:
    private boolean validateRepeatableCode(String code) {
        List<String> nonPatternCodes = codes.stream()
            .filter(e -> !e.isRegex())
            .map(EncodedValue::getCode)
            .collect(Collectors.toList());
        for (int i = 0; i < code.length(); i += unitLength) {
            String unit = code.substring(i, i + unitLength);
            if (!nonPatternCodes.contains(unit)) {
                return false;
            }
        }
        return true;
    }
    private boolean validateNonRepeatableCode(String code) {
        return codes.stream()
            .filter(e -> !e.isRegex())
            .map(EncodedValue::getCode)
            .anyMatch(e -> e.equals(code));
    }
extractValidCodes caches the valid codes in a variable at the object initialization phase. Since these are singleton classes, this extraction happens only once during the lifecycle of the application, whereas validate() runs millions of times in that lifecycle. I do not see the advantage, only the disadvantage, of refactoring the code this way. What was your intention with it (other than using the Stream API instead of a for loop, which is a valid reason, however it might have been done in the
When it comes to the size, since the majority of those changed files were related to removing exactly that line that you referred to, I really was hoping that it wouldn't take much time :D About the removal of the
I also ran a quick profiling session twice for each variant. The results oscillated a little, but in all cases the validation of complex control fields took around 70-80 ms, whereas the validation of subfield positions took around 200-250 ms. Both figures are totals over all calls during the entire run, not the time for a single call. For reference, the validation as a whole took around 9 minutes, and to me a difference of 10-20 ms wasn't really significant given that the total was almost ten minutes.
Since this is not conclusive at all (I only ran the profiling four times, so the measurements are probably not very precise), I will revert that part of the changes today. Please let me know if there's anything else I should've paid attention to. Thank you :D
EDIT[27.02.]: @pkiraly The
4db988a to cc8040d
cc8040d to 349491c
@gegic Is it OK to review the PR again, or are you still working on it?
@pkiraly Yes, I believe it can be reviewed.
@gegic Seems fine.
First of all, I apologize for so many changed files. That comes from changing the validation part and the removal of the validCodes field, which came as a consequence of the introduction of the regex pattern to the Avram schema.
When it comes to the Thompson-Traill analysis, the following changes were made:
Other than the TT completeness analysis, this pull request also includes changes to the UnimarcSchemaReader in order to conform to the new Avram schema specification, which allows for pattern objects. However, it's not clear how group objects would be used in this context, so I omitted that part. Accordingly, the validation process has been slightly modified.
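For readers following the validation discussion, here is a minimal sketch of one way to validate positional codes against both literal codes and regex patterns while doing the expensive work (collecting literals, compiling patterns) only once at construction. It is not the code from this pull request: the class name PositionValidatorSketch, its constructor parameters, and the choice to reject input whose length is not a multiple of unitLength are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a position validator; not a class from the project. */
final class PositionValidatorSketch {
    // Literal codes are collected once, so each lookup is a hash probe.
    private final Set<String> literalCodes = new HashSet<>();
    // Regex codes are compiled once instead of on every validate() call.
    private final List<Pattern> patterns = new ArrayList<>();
    private final int unitLength;
    private final boolean repeatable;

    PositionValidatorSketch(List<String> literals, List<String> regexes,
                            int unitLength, boolean repeatable) {
        if (unitLength <= 0) {
            throw new IllegalArgumentException("unitLength must be positive");
        }
        this.literalCodes.addAll(literals);
        for (String regex : regexes) {
            this.patterns.add(Pattern.compile(regex));
        }
        this.unitLength = unitLength;
        this.repeatable = repeatable;
    }

    boolean validate(String code) {
        if (!repeatable) {
            return isValidUnit(code);
        }
        // Assumption: a trailing partial unit makes the whole value invalid.
        if (code.length() % unitLength != 0) {
            return false;
        }
        for (int i = 0; i < code.length(); i += unitLength) {
            if (!isValidUnit(code.substring(i, i + unitLength))) {
                return false;
            }
        }
        return true;
    }

    // A unit is valid if it equals a literal code or fully matches a pattern.
    private boolean isValidUnit(String unit) {
        if (literalCodes.contains(unit)) {
            return true;
        }
        for (Pattern pattern : patterns) {
            if (pattern.matcher(unit).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With literals "a" and "b", the pattern "[0-9]", a unit length of 1 and repeatable set to true, validate("ab7") accepts and validate("abx") rejects. Whether caching like this is worth the extra field depends on profiling in the real application, as discussed in the conversation.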