A add-on for the Apache Solr Data Import Handler.
This project adds an entity processor that handles bibliographic records. It supports the metamorph DSL for data extraction.
mvn package
Produces solr-metamorph-entity-processor-VERSION-jar-with-dependencies.jar in target .
Assuming a fresh Solr installation.
-
Download the latest version
-
Unzip
-
Directory of your Solr installation is
solr-VERSION
(e.g. solr-7.4.0)
A list of Solr directories:
Name | Location | Example |
---|---|---|
SOLR_ROOT |
Path to the unpacked solr distribution |
/srv/solr-7.4.0 |
SOLR_SERVER_DIR |
SOLR_ROOT/server |
/srv/solr-7.4.0/server |
SOLR_HOME |
SOLR_ROOT/server/solr |
/srv/solr-7.4.0/server/solr |
Create the directory SOLR_ROOT/lib
:
mkdir -p SOLR_ROOT/lib
mkdir -p SOLR_ROOT/lib/metafacture
Copy all Metafacture Module JARs into SOLR_ROOT/lib/metafacture
and into SOLR_ROOT/server/solr-webapp/webapp/WEB-INF/lib
.
cd SOLR_ROOT/lib/metafacture repo="http://central.maven.org/maven2/org/metafacture" modules="metafacture-biblio metafacture-commons metafacture-flowcontrol metafacture-framework metafacture-io metafacture-mangling metamorph metamorph-api" for module in $modules; do wget -q -P $(realpath ${SOLR})/lib/metafacture ${repo}/${module}/${METAFACTURE_VERSION}/${module}-${METAFACTURE_VERSION}.jar done
mkdir -p SOLR_ROOT/lib/dih
Copy the latest release JAR into SOLR_ROOT/lib/dih
.
Note
|
Assumes a existing core (you may use a default core). Edit the solrconfig.xml of your core. |
Enable the Data Import Handler and the processor by adding the following
lib statements to the solrconfig.xml
of your config set:
<!-- Data Import Handler --> <lib dir="\${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" /> <!-- Metafacture --> <lib dir="\${solr.install.dir:../../../..}/lib/metafacture" regex="metafacture-.*\.jar" /> <!-- Data Import Handler Add-Ons --> <lib dir="\${solr.install.dir:../../../..}/lib/dih" regex="solr-metamorph-entity-processor-.*\.jar" />
Add the /dataimport request handle to the solrconfig.xml
:
<requestHandler name="/dataimport" class="solr.DataImportHandler"> <lst name="defaults"> <str name="config">solr-data-config.xml</str> </lst> </requestHandler>
Tip
|
A example solr-data-config.xml is located in example/solr-data-config.xml .
|
Note
|
Test data are located in example/testdata.mrc . The solr-data-config.xml expects them in /tmp .
|
This MetamorphEntityProcessor reads all content from the data source on a record by record basis. This processor may handle compressed input streams, if the consumed data source is a BinFileDataSource.
Each record is processed by a metafacture pipeline that uses metamorph to extract fields.
The Metamorph Entity Processor has the following attributes:
- url
-
Required. A attribute that specifies the location of the input file in a way that is compatible with the configured data source.
- format
-
Required. The format supplied by the data source.
- Supported Formats
-
-
marc21
-
Pre-processing records by replacing newline and carriage return with a space
-
-
marcxml
-
Pre-processing records by converting marcxml into marc21 and using the marc21 pre-processing (see above).
-
if includeFullRecord=true, the implicit field fullRecord contains the MARC21 representation of the record.
-
-
- morphDef
-
Required. The metamorph definition files that are used for field extraction. Each extracted field is added as a implicit field. If the input is a list of files (separated by a comma), the data get passed from one metamorph file to another. Those files are located inside the config set’s conf directory. :: Make sure that your metamorph definition xml has the following properties:
-
The encoding of the file should be UTF-8
-
Validate the file encoding with a text editor
-
-
Check for control characters, if you use XML 1.0
-
ASCII control characters are not legally encodeable in XML 1.0
-
-
- includeFullRecord
-
An optional attribute that adds the received record to the implicit field
fullRecord
. The attribute is a boolean value (true or false), that is false by default. - onError
-
By default the MetamorphEntityProcessor will stop processing documents, if it finds one that generates an error. If you set onError to "skip", the MetamorphEntityProcessor will instead skip documents that fail processing. A debug message will be created that contains the record and the cause of the failure.
For example:
<entity name="morph"
processor="org.culturegraph.solr.handler.dataimport.MetamorphEntityProcessor"
url="path/to/file.marc21"
inputFormat="marc21"
morphDef="morph.xml,morph2.xml"
includeFullRecord="true"
onError="skip">
<field column="identifier" name="id"/>
<field column="fullRecord" name="fullRecord_s"/>
</entity>
The used metamorph definitions:
<?xml version="1.0" encoding="UTF-8"?>
<!-- morph.xml -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
<rules>
<data name="idn" source="001"/>
</rules>
</metamorph>
<?xml version="1.0" encoding="UTF-8"?>
<!-- morph2.xml -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
<rules>
<data name="identifier" source="idn"/>
</rules>
</metamorph>
Run a full-import:
curl -s http://localhost:1111/solr/demo/dataimport?command=full-import
Check status:
curl -s http://localhost:1111/solr/demo/dataimport?command=status
Commit:
curl -s http://localhost:1111/solr/demo/update?commit=true
- NOTE
-
The admin UI provides a Dataimport Screen .
A record processed by metamorph will be transformed into a intermediate representation (IR) that consists of the following elements:
-
Record
-
Entity
-
Literal
A row processed by Solr is a map that consists of key-value or key-list pairs.
startRecord("001") literal("date", "20181001") startEntity("person") literal("lastname", "Unknown") endEntity() literal("cat", "human") literal("cat", "person") endRecord()
{ "cat": ["human", "person"] "date": "20181001" "personLastname": "Unknown" }
The following rules are applied to convert a IR to a Row:
-
Record id will be ignored
-
Literals with the same name form a list
-
Literal names in entities are prefixed with the entity name in CamelCase