batch-reset creates an empty record at the end when writing and many empty records with encode marcxml #543

Open
TobiasNx opened this issue Jun 21, 2024 · 3 comments · May be fixed by #560

TobiasNx commented Jun 21, 2024

See: TobiasNx/metafacture_workflows@64ba3e8

Running the workflow with Metafix-Runner 1.0.0

When using the Flux module batch-reset with the batch size set to "1" (| batch-reset(batchsize="1")), MF creates an empty record after the last transformation.

batch-reset should not output empty records.

e.g.:

infile
| open-file
| decode-xml
| handle-generic-xml
| fix(FLUX_DIR + "test.fix")
| batch-reset(batchsize="1")
| encode-xml
| write(FLUX_DIR + "test-output-${i}.xml")
;

or

infile
| open-file
| decode-xml
| handle-generic-xml
| fix(FLUX_DIR + "test.fix")
| batch-reset(batchsize="1")
| encode-yaml
| write(FLUX_DIR + "test-output-${i}.xml")
;

@TobiasNx TobiasNx added the Bug label Jul 3, 2024

TobiasNx commented Jul 3, 2024

After fixing #525, some strange behavior occurs when using the combination of batch-reset and encode-marcxml:

With encode-marcxml it creates an additional empty record with every single record.

See: https://gitlab.com/oersi/oersi-marc/-/commit/2abe46321aefc650fe85e5a88a43eba0add3b649

It must be related to the combination of the batch-reset error and the changes made to encode-marcxml.

With Metafix Runner 1.1.1

directory
| read-dir
| open-file
| as-records
| decode-json
| fix(FLUX_DIR + "oersiToMarc.fix", *)
| encode-marcxml
| write(FLUX_DIR + "test-output-${i}.xml")
;

@TobiasNx TobiasNx changed the title from "batch-reset creates an empty record at the end when writing" to "batch-reset creates an empty record at the end when writing and many empty records with encode marcxml" Jul 3, 2024
@dr0i dr0i self-assigned this Sep 5, 2024
dr0i added a commit that referenced this issue Sep 27, 2024
This commit fails and so shows that, although the record is empty,
the footer is written.
dr0i added a commit that referenced this issue Sep 27, 2024
dr0i added a commit that referenced this issue Sep 27, 2024
Test fails because the stream is reset twice although the reset
is called only once.
dr0i added a commit that referenced this issue Sep 27, 2024
By calling the receiver directly instead of the pipe (aka wrapper),
the stream is reset only once when the reset is called once.

(In conjunction with ObjectFileWriter and StreamBatchResetter this bug
had resulted in as many empty files as non-empty ones.)

Complements 04f8410fddceeefce5e228eb5d1866a82dff1687.
dr0i added a commit that referenced this issue Sep 27, 2024
Test fails because a line break is inserted although no data
was processed.

dr0i commented Sep 27, 2024

I think two issues are mixed up here, and one of them is not an issue at all (see a) and b)).

a) IMO it's OK that a new file is opened when the stream is explicitly reset. The code states exactly this:

writer.resetStream(); // increments count, starts new file

b) You use batch-reset in combination with write-files to determine how many records go into one file before the next file is created (named by incrementing the number i). With batchsize=1, a) always applies, i.e. the last file will be empty. If $countOfRecordsInInput modulo $batchsize > 0, there will be no empty file at the end (see

)
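
The arithmetic in a) and b) can be sketched in a few lines of Java. This is a hypothetical model, not Metafacture code: it assumes that one file is open from the start, that the stream is reset after every full batch, and that every reset opens a new file. The names `BatchFileCount`, `filesOpened`, and `lastFileEmpty` are made up for illustration.

```java
// Hypothetical model of the behaviour described in a) and b); not Metafacture code.
// Assumptions: one file is open from the start, the stream is reset after every
// full batch, and every reset starts a new file.
final class BatchFileCount {

    private BatchFileCount() {
    }

    // Number of files opened: the initial file plus one per completed batch.
    static int filesOpened(final int records, final int batchSize) {
        return 1 + records / batchSize;
    }

    // The last file stays empty when the input ends exactly on a batch boundary.
    static boolean lastFileEmpty(final int records, final int batchSize) {
        return records % batchSize == 0;
    }

    public static void main(final String[] args) {
        System.out.println(filesOpened(8, 5) + " files, last empty: " + lastFileEmpty(8, 5));
        System.out.println(filesOpened(3, 1) + " files, last empty: " + lastFileEmpty(3, 1));
    }
}
```

Under these assumptions, 8 records with a batch size of 5 yield two files and no empty one, while a batch size of 1 always leaves a trailing empty file.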

c) It was not https://github.com/metafacture/metafacture-core/pull/532/files that changed the behaviour. It does indeed have something to do with encode-marcxml: it's a consequence of #527 where a wrapper is used. The wrapper also calls resetStream(), so two files were created every time the stream was reset. Fixed with 04a6312.
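
One way to picture the double reset from c) is the following Java sketch. It is an illustration under assumptions, not the actual encoder code: `CountingWriter`, `Pipe`, and `filesAfterOneReset` are hypothetical names, and the only behaviour taken from this thread is that every `resetStream()` reaching the writer starts a new file.

```java
// Hypothetical illustration of the double reset; not the Metafacture classes.
final class DoubleResetSketch {

    private DoubleResetSketch() {
    }

    // Stands in for the file writer: every reset starts a new file.
    static final class CountingWriter {
        int filesStarted = 1; // assumption: one file is open from the start

        void resetStream() {
            filesStarted++;
        }
    }

    // Stands in for the wrapper (pipe), which forwards the reset to the writer.
    static final class Pipe {
        private final CountingWriter receiver;

        Pipe(final CountingWriter receiver) {
            this.receiver = receiver;
        }

        void resetStream() {
            receiver.resetStream();
        }
    }

    // Returns the number of files started after a single logical reset.
    static int filesAfterOneReset(final boolean resetViaPipeToo) {
        final CountingWriter writer = new CountingWriter();
        final Pipe pipe = new Pipe(writer);
        if (resetViaPipeToo) {
            pipe.resetStream(); // first path: the wrapper forwards the reset
        }
        writer.resetStream(); // second path: the receiver is reset directly
        return writer.filesStarted;
    }
}
```

In this model, one logical reset that travels along both paths starts two new files, and one of them stays empty. Resetting only the receiver starts exactly one.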

I also discovered that ObjectFileWriter inserted a line break even when it was triggered to process empty input. So sometimes (modulo!) there were files containing a single byte (the line break). This is fixed in 16b9349.

Another bug surfaced: if a record was empty, a footer was still written. This is fixed in 0ea6d23.
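
The line-break and footer problems can both be read as a missing guard for empty input. A minimal Java sketch of such a guard follows. The names `EmptyOutputSketch` and `render` are hypothetical, and the code does not claim to mirror ObjectFileWriter or the encoder.

```java
// Hypothetical guard against writing anything for empty input; not Metafacture code.
final class EmptyOutputSketch {

    private EmptyOutputSketch() {
    }

    // Renders records between header and footer. Empty records produce no
    // separator, and header and footer appear only if something was written.
    static String render(final java.util.List<String> records, final String header,
            final String footer, final String separator) {
        final StringBuilder out = new StringBuilder();
        boolean wroteRecord = false;
        for (final String record : records) {
            if (record.isEmpty()) {
                continue; // no data, so no line break either
            }
            if (!wroteRecord) {
                out.append(header);
                wroteRecord = true;
            }
            out.append(record).append(separator);
        }
        if (wroteRecord) {
            out.append(footer);
        }
        return out.toString();
    }
}
```

With this guard, input consisting only of empty records yields an empty string, so a file written from it would stay at zero bytes.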

A good way to test this is the following Flux workflow (useful with the CLI, but not in the Playground, because you cannot see the written files in that web app):

"http://lobid.org/download/marcXml-8-records.xml"
| open-http(accept="application/xml")
| decode-xml
| handle-marcxml
| batch-reset(batchsize="5")
| encode-marcxml
| write("test-output-${i}.xml")
;

@dr0i dr0i linked a pull request Sep 27, 2024 that will close this issue
@blackwinter
Member

it's a consequence of #527 where a wrapper is used

You mean #524, right? The wrapper was introduced in #539, not in #538.

dr0i added a commit that referenced this issue Sep 30, 2024
By calling the receiver directly instead of the pipe (aka wrapper),
the stream is reset only once when the reset is called once.

(In conjunction with ObjectFileWriter and StreamBatchResetter this bug
had resulted in as many empty files as non-empty ones.)
@dr0i dr0i assigned dr0i and unassigned dr0i Sep 30, 2024
Status: Review