feat: Add logging to crawl.log for metadata records created by ExtractorYoutubeDL #593

Merged

adam-miller
Contributor

No description provided.

Comment on lines 613 to 626
// Use the recorder object to calculate the content digest and store it on the curi.
// Must be calculated now, before the warc writer closes the file stream.
// We don't need an extra copy, so just write to NullOutputStream.
String recorderBaseName = "ExtractorYoutubeDL-" + nextRecorderId.getAndIncrement();
Recorder recorder = new Recorder(new File(controller.getScratchDir().getFile(), recorderBaseName),
        controller.getRecorderOutBufferBytes(), controller.getRecorderInBufferBytes());
recorder.getRecordedInput().setDigest("sha1");
getLocalTempFile().seek(0);
recorder.inputWrap(Channels.newInputStream(getLocalTempFile().getChannel()));
recorder.getRecordedInput().startDigest();
recorder.outputWrap(new NullOutputStream());
recorder.getRecordedInput().readFully();
curi.getData().put(YDL_JSON_FILE_DIGEST, recorder.getRecordedInput().getDigestValue());
recorder.getRecordedOutput().close();
Collaborator

Since we don't call recorder.cleanup(), would this leak temp files when the resource is larger than the buffer size and Recorder spills to disk?

Are we only using Recorder here for the side effect of calculating the digest? Since we have DigestUtils from Commons Codec available, maybe we could just do:

getLocalTempFile().seek(0);
InputStream inputStream = Channels.newInputStream(getLocalTempFile().getChannel());
curi.getData().put(YDL_JSON_FILE_DIGEST, DigestUtils.sha1(inputStream));
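
For readers without Commons Codec on hand, here is a rough standard-library sketch of the same idea: rewind the temp file, stream it through a SHA-1 MessageDigest, and return the raw digest bytes. The class name TempFileDigest and the helpers sha1 and toHex are hypothetical and not part of the PR; the snippet swaps in java.security.MessageDigest for DigestUtils and assumes the temp file is a RandomAccessFile, as the seek(0) and getChannel() calls above suggest.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the PR: hashes a temp file without a Recorder.
final class TempFileDigest {

    // Returns the raw 20-byte SHA-1 digest of the whole file.
    static byte[] sha1(RandomAccessFile file) throws IOException {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        // Rewind so the digest covers the file from its first byte.
        file.seek(0);
        // The stream is deliberately not closed: closing it would also close
        // the underlying channel and file, which the caller still owns.
        InputStream in = Channels.newInputStream(file.getChannel());
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return md.digest();
    }

    // Formats digest bytes as lowercase hex, for logging or comparison.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Because nothing is buffered beyond the fixed 8 KB array, this approach creates no scratch files, so there is nothing to clean up afterwards. A call such as TempFileDigest.sha1(getLocalTempFile()) would stand in for the DigestUtils line, assuming getLocalTempFile() returns a RandomAccessFile.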

Contributor Author

Yeah, my intention was to reuse existing code paths when possible, but this would simplify things, and sha1 should be pretty safe. Thanks for taking a look.

Contributor Author

The hashing portion has been refactored. Much simpler now.

@adam-miller adam-miller marked this pull request as ready for review July 24, 2024 20:46
@adam-miller adam-miller merged commit b22d6ce into master Aug 7, 2024
5 checks passed