
UNIMARC Thompson-Traill Completeness #416

Merged — 10 commits, Mar 4, 2024

Conversation

@gegic (Contributor) commented Feb 13, 2024

First of all, I apologize for the large number of changed files. Most of them come from changing the validation logic and removing the validCodes field, which followed from the introduction of the regex pattern to the Avram schema.

When it comes to the Thompson-Traill analysis, the following changes were made:

  • Removal of the fields from their respective Record classes and creation of dedicated ThompsonTraillAnalysis classes for the MARC21, PICA and UNIMARC formats. This was done primarily to enable a different analysis process for each format, which in turn allowed us to calculate some fields in UNIMARC in a custom manner. That is also the main reason I didn't use a crosswalk similar to what was done for PICA: the idea was to check for more than just whether certain fields exist.
  • Tests have been slightly modified to check whether the files are as expected.

Other than the TT completeness analysis, this pull request also includes changes to the UnimarcSchemaReader in order to conform to the new Avram schema specification which allows for pattern objects. However, it's not clear how group objects would be used in this context, so I omitted that part. Accordingly, the validation process has been slightly modified.
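To illustrate the kind of change described above, here is a minimal, hypothetical sketch of validating a code against a mix of literal codes and regex-based codes, as the pattern objects in the Avram schema allow. The class and method names are illustrative stand-ins, not the project's actual API:

```java
import java.util.List;
import java.util.regex.Pattern;

public class CodeValidatorSketch {

  // A code definition is either a literal value or a regex pattern.
  record CodeDef(String value, boolean isRegex) {}

  // A code is valid if it equals a literal code or matches a pattern code.
  static boolean isValid(String code, List<CodeDef> defs) {
    for (CodeDef def : defs) {
      if (def.isRegex()) {
        if (Pattern.matches(def.value(), code))
          return true;
      } else if (def.value().equals(code)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<CodeDef> defs = List.of(
        new CodeDef("a", false),          // literal code
        new CodeDef("[0-9]{2}", true));   // pattern object
    System.out.println(isValid("a", defs));   // true
    System.out.println(isValid("42", defs));  // true
    System.out.println(isValid("zz", defs));  // false
  }
}
```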

@pkiraly (Owner) commented Feb 26, 2024

Dear @gegic
Sorry for the delay; I was involved in too many things recently, and the size of your PR did not help. My main comment:

  private Leader00() {
    initialize();
    extractValidCodes(); // you removed this line and the method
  }

but later the code contains something similar to the content of the removed method:

  private boolean validateRepeatableCode(String code) {
    List<String> nonPatternCodes = codes.stream()
      .filter(e -> !e.isRegex())
      .map(EncodedValue::getCode)
      .collect(Collectors.toList());

    for (int i = 0; i < code.length(); i += unitLength) {
      String unit = code.substring(i, i + unitLength);
      if (!nonPatternCodes.contains(unit)) {
        return false;
      }
    }
    return true;
  }

  private boolean validateNonRepeatableCode(String code) {
    return codes.stream()
      .filter(e -> !e.isRegex())
      .map(EncodedValue::getCode)
      .anyMatch(e -> e.equals(code));
  }

extractValidCodes caches the valid codes into a variable at object initialization. Since these are singleton classes, this extraction happens only once during the lifecycle of the application, but validate() runs millions of times in that lifecycle. I see only the disadvantage of refactoring the code this way, not the advantage. What was your intention with it, other than using the Stream API instead of a for loop (which is a valid reason, but that could also have been done inside extractValidCodes())?
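The caching approach described above can be sketched roughly as follows: the non-regex codes are extracted once at initialization and reused on every validate() call. EncodedValue and the field names here are simplified stand-ins for the real classes, not the project's actual code:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ValidCodeCache {

  record EncodedValue(String code, boolean isRegex) {}

  // Cached once at initialization, reused on every validation call.
  private final List<String> validCodes;

  ValidCodeCache(List<EncodedValue> codes) {
    this.validCodes = extractValidCodes(codes);
  }

  // Runs once per singleton, at object initialization.
  private static List<String> extractValidCodes(List<EncodedValue> codes) {
    return codes.stream()
        .filter(e -> !e.isRegex())
        .map(EncodedValue::code)
        .collect(Collectors.toList());
  }

  // Hot path: no stream allocation per call, just a lookup.
  boolean validateNonRepeatableCode(String code) {
    return validCodes.contains(code);
  }

  public static void main(String[] args) {
    ValidCodeCache cache = new ValidCodeCache(List.of(
        new EncodedValue("a", false),
        new EncodedValue("[0-9]", true)));
    System.out.println(cache.validateNonRepeatableCode("a")); // true
    System.out.println(cache.validateNonRepeatableCode("b")); // false
  }
}
```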

@gegic (Contributor, Author) commented Feb 27, 2024

When it comes to the size: since the majority of the changed files were only about removing exactly the line you referred to, I was really hoping the review wouldn't take much time :D

About the removal of the validCodes field, my primary intention was to avoid having both the codes list and the validCodes list, which looked a bit redundant to me: a list plus its own subset. The idea was to reduce repetition and improve readability a little. I thought that made sense because, given the complexity of the validation process as a whole, this part would make a difference of no more than a dozen milliseconds, since the code lists usually aren't that large.

I also ran a quick profiling session twice for each variant. The results oscillated a little, but in all cases the validation of complex control fields took around 70-80 ms in total over the entire runtime (not per control field), whereas the validation of subfield positions took around 200-250 ms in total across all calls during the execution. For reference, the validation as a whole took around 9 minutes, so a difference of 10-20 ms didn't seem significant to me against a total of almost ten minutes.

Since this is not conclusive at all (I only ran the profiling four times, so it probably can't give very precise measurements), I will revert that part of the changes today.

Please let me know if there's anything else I should've paid attention to. Thank you :D


EDIT[27.02.]: @pkiraly The validCodes part has been reverted. I added a short comment to the field to briefly explain why it's there :)

@gegic gegic force-pushed the unimarc-tt-completeness branch from 4db988a to cc8040d on February 27, 2024 20:13
@gegic gegic force-pushed the unimarc-tt-completeness branch from cc8040d to 349491c on February 27, 2024 20:17
@pkiraly (Owner) commented Feb 29, 2024

@gegic Is it OK to review the PR again, or do you still work on it?

@gegic (Contributor, Author) commented Feb 29, 2024

@pkiraly Yes, I believe it can be reviewed.

@pkiraly pkiraly merged commit 5df3c2b into pkiraly:main Mar 4, 2024
1 of 2 checks passed
@pkiraly (Owner) commented Mar 4, 2024

@gegic Seems fine.
