Releases: pkiraly/qa-catalogue
Version 0.7.0
The major features of this release
Improved PICA handling
PICA is an alternative bibliographic metadata schema used in Germany, the Netherlands and France. The development of PICA-related features was done in cooperation with K10plus, the largest union catalogue of Germany. The analyses of PICA records now cover completeness, validation, subject heading and authority name analyses, as well as searching and displaying individual records.
Handling union catalogues
Union catalogues cover the collections of multiple libraries. QA catalogue can now display the results of completeness, validation, searching and term lists both for the catalogue as a whole and for any individual library.
SHACL4bib
The Shapes Constraint Language (SHACL) has been adapted to MARC and PICA records. It provides a customized analysis: a library can write a configuration file to check records against its own conventions and rules which are not part of the core standard. This feature was partly developed by Jean Michel Nzi Mba as part of his Bachelor thesis.
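As a hypothetical sketch of such a configuration (the field paths, rule identifiers and keywords below are illustrative assumptions, not the tool's documented syntax), a library-specific ruleset might look like this:

```yaml
# Hypothetical SHACL4bib-style ruleset (illustrative only):
# require a cataloguing source field and constrain a local note field.
fields:
  - name: "040$a"            # cataloguing agency
    rules:
      - id: "040a.required"
        minCount: 1          # must occur at least once
  - name: "591$a"            # local note, not part of the core standard
    rules:
      - id: "591a.pattern"
        pattern: "^[A-Z]{2}-\\d+$"   # a local identifier convention
```

Each rule that fails would then be reported per record, analogous to the built-in validation output.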
Other features
Improved command line interface and documentation. The code base has become more robust thanks to hints from the code quality assessment framework Sonar.
Contributors
In the creation of this release Jakob Voß (VZG) and Jean Michel Nzi Mba (University of Göttingen) provided important contributions. Special thanks to Verbundzentrale des GBV (VZG), GWDG and JetBrains for supporting the development.
Details
Group values by library
- #199: Group results in completeness
- #200: Group results in issues
- #246: Filter results in data tab
- #254: Fixing performance issue for grouping validation
- #253: Creation of id-groupid.csv required for validation
PICA changes
- #163: PICA: general changes
- #190: Extend PICA subject fields
- #215: Completeness: check occurrence numbers
- #232: Adding XML serialization for PICA
- #234: Making occurrence a first class citizen of PICA data fields
- #247: Uniqueness of PICA field ranges reported wrongly
- #251: PICA: fixing reading of gzipped files
- #250: Copy Avram schema to output directory
- Adjust K10plus Avram schema
Shacl4bib
Command line interface
- common-script: die if input files don't exist
- common-script: disable colors if not run via terminal
- common-script: emit DONE only for processing steps
- common-script: show UPDATE on config
- Add default settings to setdir.sh
- Add configuration variable UPDATE and summarize configuration
- Add configuration variable ANALYSES for all-analyses
- Refactor common-script
- Allow globs in MASK
- Fixing parameter removal from catalogue specific params
- Ignore default input/output also when they are symlinks
- Improve downloaders
- Improve KB downloader
- Update ONB downloader
- Improve output of common-script
- Add input directory to ONB downloader
- #223: Create a configuration file for Zentralbibliothek Zürich
- masking ZB
- #265: 'all' command should run only the selected tasks if schema is PICA
- Update catalogue scripts
- Update catalogues
- Make common-script more robust
- Make setdir.sh optional
- Make sqlite more robust
- Remove unnecessary ; chars
- Simplify bash scripts
- Simplify catalogues/k10plus_*.sh
- Remove duplicated DONE in catalog scripts
- Remove unused parts
- Support setting MASK in setdir.sh (k10plus_pica only)
Documentation
- README.md: Adjust path to run helper script
- Create CONTRIBUTING
- Better definition of the tool in the README
- Adding sponsors section
- Adding Binghampton University Libraries to the list of users
- Add SonarCloud badge
- #196: Update README
- #244: Document dependencies (close #244)
- Rename CONTRIBUTING to CONTRIBUTING.md
- Update test schema README file
CSV generation
- #216: Completeness: use proper CSV library to generate .csv
- #242: Validation: use proper CSV library to generate .csv
other
- #227: Data fields (without subfields) are categorized as "unknown origin" in marc-elements.csv
Dependency updates
- upgrade com.fasterxml.jackson.core from 2.13.4 to 2.15.0
- upgrade org.apache.logging.log4j from 2.19.0 to 2.20.0
- upgrade org.apache.solr from 9.1.0 to 9.2.0
- upgrade org.apache.spark from 3.3.1 to 3.3.2
- upgrade org.mongodb:bson from 4.7.2 to 4.9.1
- upgrade org.mongodb:mongo-java-driver from 3.12.11 to 3.12.13
- upgrade org.xerial:sqlite-jdbc from 3.39.3.0 to 3.41.2.1
Debugging, refactoring, performance improvement
- Implement Sonar suggestions.
- #269: Build failure: testing
- Add coveralls report integration
- Improve performance of classification analysis
- Improve test coverage
- Improving performance
- Fix a missing character from the Docker description.
Files
- qa-catalogue-0.7.0-release.zip: all the files needed to run the software. Download, unzip and go!
- qa-catalogue-0.7.0-jar-with-dependencies.jar: the Java library file with all the dependencies
- qa-catalogue-0.7.0.jar: the Java library file without dependencies
Version 0.7.0 release candidate (1)
The current release has the following major features:
- in the case of union catalogues the main analyses (validation, completeness) and the data are displayed both for the catalogue as a whole and for individual libraries
- improved support for PICA records
- improved command line interface
- a new beta feature: validation against SHACL-like problem patterns
The detailed list of changes is identical to that of the final 0.7.0 release above.
Release v0.6.0
The main focus of the current release is to support the basic analyses of PICA records, i.e.
- validation
- completeness
- indexing
- subject indexing
- authority names
- cataloguing history
PICA field definitions are not hardcoded as in the case of MARC, but come from an external Avram schema, so customization to a library's needs is flexible. If no other schema is provided, QA catalogue uses the metadata schema of the North German union catalogue K10plus, which can be downloaded from https://format.k10plus.de/avram.pl?profile=k10plus-title. The work on PICA is sponsored by the Verbundzentrale (VZG) des Gemeinsamen Bibliotheksverbundes (GBV).
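For illustration, a catalogue configuration that points QA catalogue at a locally saved copy of the K10plus Avram schema might look like the following sketch (the file name `catalogues/mylib.sh`, the library code `mylib`, the file mask and the exact option names are assumptions here; consult the configuration guide for the authoritative variables):

```shell
# catalogues/mylib.sh — hypothetical configuration (the library code "mylib",
# the file mask and the option names are illustrative assumptions).
NAME=mylib
MASK='mylib*.dat.gz'
# Point the analysis at a locally downloaded K10plus Avram schema
# instead of the built-in default:
TYPE_PARAMS='--schemaType PICA --picaSchemaFile k10plus-title.json'
```

The schema file referenced in `TYPE_PARAMS` would be the JSON saved from the URL above.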
The release also contains other bug fixes and improvements.
The artefacts of the release are also available in Maven Central: https://central.sonatype.dev/artifact/de.gwdg.metadataqa/metadata-qa-marc/0.6.0.
PICA related changes:
- #137 filter out records
- #138 parsing PICA Plain file
- #140 parsing PICA records
- #142 completeness of PICA records
- #144 filter out internal fields
- #145 Implement PICA Path
- #151 validate PICA records
- #152 `ignorableFields` parameter should support masking
- #153 indexing PICA records
- #154 subject indexing analysis for PICA
- #155 name authority analysis
- #161 cataloguing history
- #164 Parsing PICA Plain with $ in field values
- #174 FRBR functions
- #187 add parameter to exclude issue types
Other changes:
- #188 Move validators to distinct classes
- #128 Implement incremental timeline
- #127 Include version specific subfields to the JSON schema representation and completeness
Many thanks to @nichtich for being an excellent committer and product owner of this release!
Release v0.5.0
The highlights of this release:
- the British Library and KBR, the national library of Belgium, started to use this tool, and both of them as well as Ghent University Library sent important feedback, bug reports and feature requests
- the underlying Java version has been changed to Java 11, and several other technical changes have been implemented
- improved documentation
In the future we will issue releases more frequently.
The list of important changes:
- #94 Change to Java 11
- #89 Check definitions against MARC updates
  - Catalogue versions
  - Avram schema files related developments
  - Completeness
  - General parameters
- #126 Skip records without errors from issue-details.csv
- #125 ignorableFields should not be mentioned in undefined fields
- #116 Create an INSTALL.md file with installation instructions
- #113 Reading MARCMaker format
- #107 Reorganize scripts directory. Right now there is a `catalogue` directory for the catalogue specific configuration files, and a `scripts` directory with some subdirectories for the analyses
Release v0.4
The main features of the current release:
- Full Solr index
- Completeness calculation
- MARC validation
- Support of FRBR functions
- Subject analysis
- Authority names analysis
- Serials analysis
- Thompson-Traill completeness (ebook analysis)
- Shelf-Ready completeness
- Field frequency distribution
- History of cataloging
Other features:
- docker version
- the corresponding web user interface is available at https://github.com/pkiraly/metadata-qa-marc-web/releases/tag/0.4
- MARC data elements from British Library
v0.3
Installation
wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.3/metadata-qa-marc-0.3-release.zip
unzip metadata-qa-marc-0.3-release.zip
cd metadata-qa-marc-0.3/
Configuration
cp setdir.sh.template setdir.sh
nano setdir.sh
Set the paths to your root MARC directories:
# the input directory, where your MARC dump files exist
BASE_INPUT_DIR=
# the output directory, where the generated CSV files will land
BASE_OUTPUT_DIR=
- Create configuration based on some existing config files:
- cp scripts/loc.sh scripts/[abbreviation-of-your-library].sh
- edit scripts/[abbreviation-of-your-library].sh according to configuration guide
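The copied script mainly sets a handful of variables; a minimal edited version might look like this sketch (the variable names and values below are illustrative assumptions based on the template, so follow the configuration guide for the authoritative list):

```shell
# scripts/mylib.sh — hypothetical minimal configuration after copying
# scripts/loc.sh (all names and variables here are illustrative assumptions).
NAME=mylib                            # short code of your library
MARC_DIR=${BASE_INPUT_DIR:-.}/mylib   # where the MARC dump files live
MASK='mylib*.mrc'                     # file mask selecting the dump files
```

With such a script in place, the commands in the "Use" section below run the analyses for that library.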
Use
scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr
For a catalogue with around 1 million records, the first command takes 5-10 minutes, the latter 1-2 hours.
Release for the SWIB19 conference
Installation
wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.2.1/metadata-qa-marc-0.2-SNAPSHOT-release.zip
unzip metadata-qa-marc-0.2-SNAPSHOT-release.zip
cd metadata-qa-marc-0.2-SNAPSHOT/
Configuration
cp setdir.sh.template setdir.sh
nano setdir.sh
Set the paths to your root MARC directories:
# the input directory, where your MARC dump files exist
BASE_INPUT_DIR=
# the output directory, where the generated CSV files will land
BASE_OUTPUT_DIR=
- Create configuration based on some existing config files:
- cp scripts/loc.sh scripts/[abbreviation-of-your-library].sh
- edit scripts/[abbreviation-of-your-library].sh according to configuration guide
Use
scripts/[abbreviation-of-your-library].sh all-analyses
scripts/[abbreviation-of-your-library].sh all-solr
For a catalogue with around 1 million records, the first command takes 5-10 minutes, the latter 1-2 hours.