
UNIMARC Thompson-Traill Completeness #416

Merged — 10 commits, Mar 4, 2024

Conversation

@gegic (Contributor) commented Feb 13, 2024

First of all, I apologize for the large number of changed files. Most of them come from changing the validation logic and removing the validCodes field, which followed from the introduction of the regex pattern to the Avram schema.

When it comes to the Thompson-Traill analysis, the following changes were made:

  • Removal of the fields from their respective Record classes and creation of dedicated ThompsonTraillAnalysis classes for the MARC21, PICA and UNIMARC formats. This was done primarily to enable a different analysis process for each format, which in turn allowed us to calculate some fields in UNIMARC in a custom manner. That is also the main reason I didn't use a crosswalk similar to what was done for PICA: the idea was to check for more than just whether certain fields exist.
  • Tests have been slightly modified to check whether the files are as expected.

Other than the TT completeness analysis, this pull request also includes changes to the UnimarcSchemaReader in order to conform to the new Avram schema specification which allows for pattern objects. However, it's not clear how group objects would be used in this context, so I omitted that part. Accordingly, the validation process has been slightly modified.
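To illustrate the kind of change described above, here is a minimal, hypothetical sketch of validating a code against a mix of literal codes and regex-based codes, as the pattern objects in the Avram schema allow. The class and method names are illustrative stand-ins, not the project's actual API:

```java
import java.util.List;
import java.util.regex.Pattern;

public class CodeValidatorSketch {

  // A code definition is either a literal value or a regex pattern.
  record CodeDef(String value, boolean isRegex) {}

  // A code is valid if it equals a literal code or matches a pattern code.
  static boolean isValid(String code, List<CodeDef> defs) {
    for (CodeDef def : defs) {
      if (def.isRegex()) {
        if (Pattern.matches(def.value(), code))
          return true;
      } else if (def.value().equals(code)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<CodeDef> defs = List.of(
        new CodeDef("a", false),          // literal code
        new CodeDef("[0-9]{2}", true));   // pattern object
    System.out.println(isValid("a", defs));   // true
    System.out.println(isValid("42", defs));  // true
    System.out.println(isValid("zz", defs));  // false
  }
}
```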

@pkiraly (Owner) commented Feb 26, 2024

Dear @gegic
Sorry for the delay; I was involved in too many things recently, and the size of your PR did not help. My main comment:

  private Leader00() {
    initialize();
    extractValidCodes(); // you removed this line and the method
  }

but later the code contains something similar to the content of the removed method:

  private boolean validateRepeatableCode(String code) {
    List<String> nonPatternCodes = codes.stream()
      .filter(e -> !e.isRegex())
      .map(EncodedValue::getCode)
      .collect(Collectors.toList());

    for (int i = 0; i < code.length(); i += unitLength) {
      String unit = code.substring(i, i + unitLength);
      if (!nonPatternCodes.contains(unit)) {
        return false;
      }
    }
    return true;
  }

  private boolean validateNonRepeatableCode(String code) {
    return codes.stream()
      .filter(e -> !e.isRegex())
      .map(EncodedValue::getCode)
      .anyMatch(e -> e.equals(code));
  }

extractValidCodes caches the valid codes into a variable at object initialization. Since these are singleton classes, this extraction happens only once during the lifecycle of the application, but validate() runs millions of times in that lifecycle. I see only the disadvantage of refactoring the code this way, not the advantage. What was your intention with it, other than using the Stream API instead of a for loop (which is a valid reason, but that could also have been done inside extractValidCodes())?
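The caching approach described above can be sketched roughly as follows: the non-regex codes are extracted once at initialization and reused on every validate() call. EncodedValue and the field names here are simplified stand-ins for the real classes, not the project's actual code:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ValidCodeCache {

  record EncodedValue(String code, boolean isRegex) {}

  // Cached once at initialization, reused on every validation call.
  private final List<String> validCodes;

  ValidCodeCache(List<EncodedValue> codes) {
    this.validCodes = extractValidCodes(codes);
  }

  // Runs once per singleton, at object initialization.
  private static List<String> extractValidCodes(List<EncodedValue> codes) {
    return codes.stream()
        .filter(e -> !e.isRegex())
        .map(EncodedValue::code)
        .collect(Collectors.toList());
  }

  // Hot path: no stream allocation per call, just a lookup.
  boolean validateNonRepeatableCode(String code) {
    return validCodes.contains(code);
  }

  public static void main(String[] args) {
    ValidCodeCache cache = new ValidCodeCache(List.of(
        new EncodedValue("a", false),
        new EncodedValue("[0-9]", true)));
    System.out.println(cache.validateNonRepeatableCode("a")); // true
    System.out.println(cache.validateNonRepeatableCode("b")); // false
  }
}
```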

@gegic (Contributor, Author) commented Feb 27, 2024

When it comes to the size: since the majority of the changed files were only about removing exactly the line you referred to, I was really hoping the review wouldn't take much time :D

About the removal of the validCodes field, my primary intention was to avoid having both the codes list and the validCodes list, which looked a bit redundant to me: a list plus its own subset. The idea was to reduce repetition and improve readability a little. I thought that made sense because, given the complexity of the validation process as a whole, this part would make a difference of no more than a dozen milliseconds, since the code lists usually aren't that large.

I also ran a quick profiling session twice for each variant. The results oscillated a little, but in all cases the validation of complex control fields took around 70-80 ms in total over the entire runtime (not per control field), whereas the validation of subfield positions took around 200-250 ms in total across all calls during the execution. For reference, the validation as a whole took around 9 minutes, so a difference of 10-20 ms didn't seem significant to me against a total of almost ten minutes.

Since this is not conclusive at all (I only ran the profiling four times, so it probably can't give very precise measurements), I will revert that part of the changes today.

Please let me know if there's anything else I should've paid attention to. Thank you :D


EDIT[27.02.]: @pkiraly The validCodes part has been reverted. I added a short comment to the field to briefly explain why it's there :)

@gegic gegic force-pushed the unimarc-tt-completeness branch from 4db988a to cc8040d on February 27, 2024 20:13
@gegic gegic force-pushed the unimarc-tt-completeness branch from cc8040d to 349491c on February 27, 2024 20:17
@pkiraly (Owner) commented Feb 29, 2024

@gegic Is it OK to review the PR again, or do you still work on it?

@gegic (Contributor, Author) commented Feb 29, 2024

@pkiraly Yes, I believe it can be reviewed.

@pkiraly pkiraly merged commit 5df3c2b into pkiraly:main Mar 4, 2024
1 of 2 checks passed
@pkiraly (Owner) commented Mar 4, 2024

@gegic Seems fine.
