feat: Add logging to crawl.log for metadata records created by ExtractorYoutubeDL #593
//Use the recorder object to calculate the content digest and store it on the curi.
//Must be calculated now, before the warc writer closes the file stream.
//We don't need an extra copy, so just write to NullOutputStream.
String recorderBaseName = "ExtractorYoutubeDL-" + nextRecorderId.getAndIncrement();
Recorder recorder = new Recorder(new File(controller.getScratchDir().getFile(), recorderBaseName),
        controller.getRecorderOutBufferBytes(), controller.getRecorderInBufferBytes());
recorder.getRecordedInput().setDigest("sha1");
getLocalTempFile().seek(0);
recorder.inputWrap(Channels.newInputStream(getLocalTempFile().getChannel()));
recorder.getRecordedInput().startDigest();
recorder.outputWrap(new NullOutputStream());
recorder.getRecordedInput().readFully();
curi.getData().put(YDL_JSON_FILE_DIGEST, recorder.getRecordedInput().getDigestValue());
recorder.getRecordedOutput().close();
Since we don't call recorder.cleanup(), would this leak temp files when the resource is larger than the buffer size and Recorder spills to disk?
Are we only using Recorder here for the side effect of calculating the digest? Since we have DigestUtils from Commons Codec available, maybe we could just do:
getLocalTempFile().seek(0);
InputStream inputStream = Channels.newInputStream(getLocalTempFile().getChannel());
curi.getData().put(YDL_JSON_FILE_DIGEST, DigestUtils.sha1(inputStream));
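For comparison, here is a rough, dependency-free sketch of the same idea using the JDK's MessageDigest in place of DigestUtils, so it can be run on its own. The class name TempFileDigest and the method sha1Of are hypothetical and not part of the PR or of Heritrix; the RandomAccessFile parameter stands in for whatever getLocalTempFile() returns. Like DigestUtils.sha1, it returns the raw digest bytes rather than a hex string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not code from the PR.
public class TempFileDigest {

    /** Rewinds the file and returns the SHA-1 of its full contents as raw bytes. */
    static byte[] sha1Of(RandomAccessFile file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        // Start from the beginning, as the seek(0) call in the review suggestion does.
        file.seek(0);
        // The stream is deliberately left open: closing it would close the
        // channel and with it the underlying file, which the caller still needs.
        InputStream in = Channels.newInputStream(file.getChannel());
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        return md.digest();
    }
}
```

Nothing is buffered to disk here, so there is no scratch file to clean up afterwards. The file position is left at the end of the file, so a caller that reads the file again would need to seek back first.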
Yeah, my intention was to reuse existing code paths when possible, but this would simplify things, and sha1 should be pretty safe. Thanks for taking a look.
The hashing portion has been refactored. Much simpler now.