EPUB import #12457

Draft
wants to merge 6 commits into
base: main

Conversation

InAnYan
Member

@InAnYan InAnYan commented Feb 3, 2025

EPUB files carry some metadata, so why not import it?

Mandatory checks

  • I own the copyright of the code submitted and I licence it under the MIT license
  • [?] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
    - [ ] Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

Contributor

@github-actions github-actions bot left a comment

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@InAnYan
Member Author

InAnYan commented Feb 4, 2025

I would appreciate it if anyone can give me places where I can download ePUB books and include them in JabRef tests.

I only took books in the public domain from Project Gutenberg, but they all appear to follow exactly one layout.

However, on their websites I found other ePUBs whose ZIP archives are laid out differently. (This is why I search for any .opf file.)
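
The "search for any .opf file" idea mentioned above could look roughly like the following. This is only a sketch: the names `OpfLocator` and `findOpfEntryName` are hypothetical and not taken from the PR, and it assumes the first `.opf` entry found is the package document. The EPUB container format itself names the package document in `META-INF/container.xml`, so a scan like this is a fallback rather than the canonical lookup.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the PR: locates an OPF package document in an ePUB.
final class OpfLocator {

    private OpfLocator() {
    }

    // Returns the name of the first non-directory ZIP entry ending in ".opf", if any.
    static Optional<String> findOpfEntryName(Path epub) throws IOException {
        try (ZipFile zip = new ZipFile(epub.toFile())) {
            return zip.stream()
                      .filter(entry -> !entry.isDirectory())
                      .map(ZipEntry::getName)
                      .filter(name -> name.toLowerCase(Locale.ROOT).endsWith(".opf"))
                      .findFirst();
        }
    }
}
```

Because the scan ignores the directory layout, it copes with archives that keep the package document under `OEBPS/`, `EPUB/`, or the archive root.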

Contributor

@github-actions github-actions bot left a comment

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

In case of issues with the import order, double check that you activated Auto Import.
You can trigger fixing imports by pressing Ctrl+Alt+O to trigger Optimize Imports.

addField(StandardField.URL, identifier);
}

addField(StandardField.AUTHOR, Optional.of(String.join(" and ", authors)));
Member

Use the AuthorList parser.

Member Author

What does it do? Parsing authors? Why do I need it there?

Authors are specified separately in <dc:creator> elements. Or should be...

UPDATE: Ah, crap, I remember seeing that sometimes they are not. I should investigate a bit.

Member

I remember that there is a fetcher which uses a similar schema. Take a look at that one.

Member Author

@InAnYan InAnYan Feb 5, 2025

Oh yes! There is the Dublin Core schema (which OPF uses internally)!

Well, anyway, it was an interesting experience with parsing XML files and using `XPath`s.

Member Author

Terrible, it's tied to the XMP format.

I can't find a way to pass ordinary XML nodes to it.

Member

Ah damn. Can you try the Dublin Core Extractor?

public class DublinCoreExtractor {

Member Author

Yes! It is this class that relies on XMP

Member

Yes! It is this class that relies on XMP

And thus this refs #12457 (comment)

Contributor

@github-actions github-actions bot left a comment

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

@koppor
Member

koppor commented Feb 7, 2025

I would appreciate it if anyone can give me places where I can download ePUB books and include them in JabRef tests.

Google pointed me to the German page https://allesebook.de/kostenlose-ebooks/?fwp_anmeldung=nein, which is nice. At the end, you find some epub files.

Put them into a separate directory - and add a README.md stating the source.

Maybe we need to start a separate repository for import test files? If it grows to more than 50 MB, it really should be a submodule rather than versioned in JabRef directly.

Comment on lines +52 to +58
private final XPathExpression titlePath = xpath.compile("/package/metadata/title");
private final XPathExpression creatorPath = xpath.compile("/package/metadata/creator");
private final XPathExpression identifierPath = xpath.compile("/package/metadata/identifier");
private final XPathExpression languagePath = xpath.compile("/package/metadata/language");
private final XPathExpression sourcePath = xpath.compile("/package/metadata/source");
private final XPathExpression descriptionPath = xpath.compile("/package/metadata/description");
private final XPathExpression subjectPath = xpath.compile("/package/metadata/subject");
Member

Normally, one uses a StAX parser to parse XML. (Not DOM, not SAX, not XPath.)

Member Author

Yes, I remember it was used. But it's not as if I parse the whole XML file: I only extract tiny bits of data.

I tried not to include any third-party libraries and to use what we already have (module info must be unchanged). If there is a specific reason why it's better to use StAX here, I'll rewrite.

StAX is used for importing libraries in other formats. I guess its core strength is its parsing interface, which doesn't load the whole file into memory. However, this PR parses an OPF file, which contains just metadata (all content, sources, and chapters are in separate XML files in the ePUB). Maybe I can write this as a comment to justify using XPath there?

I'll also look into whether the types generated by StAX could somehow be transformed into DublinCoreSchema...
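
For the "tiny bits of data" case described above, an XPath lookup over the OPF package document might be sketched as follows. The class `OpfXPathSketch` and its methods are hypothetical and are not the PR's code. The sketch assumes a namespace-aware parse and matches on `local-name()`, so that prefixed Dublin Core elements such as `dc:title` and `dc:creator` are found whatever prefix the file uses. Repeated creators are joined with " and ", mirroring the `String.join` call in the diff.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not the PR's importer: reads Dublin Core values from an OPF string.
final class OpfXPathSketch {

    private OpfXPathSketch() {
    }

    // Builds a prefix-independent path such as /package/metadata/<element>.
    private static String path(String element) {
        return "/*[local-name()='package']/*[local-name()='metadata']/*[local-name()='" + element + "']";
    }

    // Returns the trimmed text of every matching metadata element, in document order.
    static List<String> readValues(String opfXml, String element) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, since ePUB files come from untrusted sources.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(opfXml.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(path(element), document, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    // Joins all creators in the BibTeX "A and B" style.
    static String authors(String opfXml) throws Exception {
        return String.join(" and ", readValues(opfXml, "creator"));
    }
}
```

The whole document is still loaded into a DOM here, which is acceptable only because the OPF file holds metadata rather than book content.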

Member

I tried not to include any third-party libraries and to use what we already have (module info must be unchanged). If there is a specific reason why it's better to use StAX here, I'll rewrite.

Just add a JavaDoc comment stating that you used XPath because you only parse a fragment of the file.

I think XPath loads the whole DOM nevertheless, so StAX would be more efficient. Still, StAX is more imperative, and we as SQL guys like declarative (which XPath is).
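
For comparison, the StAX route discussed above might look like this. It is a minimal sketch with hypothetical names (`OpfStaxSketch`, `readCreators`), not code from the PR or from the other importers. It pulls events from the stream without building a tree and stops at the end of `<metadata>`, at the cost of the more imperative loop mentioned in the comment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not the PR's importer: streams dc:creator values out of an OPF string.
final class OpfStaxSketch {

    private static final String DC_NAMESPACE = "http://purl.org/dc/elements/1.1/";

    private OpfStaxSketch() {
    }

    static List<String> readCreators(String opfXml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Disable DTD processing, since ePUB files come from untrusted sources.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(opfXml));

        List<String> creators = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && DC_NAMESPACE.equals(reader.getNamespaceURI())
                        && "creator".equals(reader.getLocalName())) {
                    // getElementText() consumes the element up to its end tag.
                    creators.add(reader.getElementText().trim());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "metadata".equals(reader.getLocalName())) {
                    // Nothing of interest follows the metadata block, so stop reading.
                    break;
                }
            }
        } finally {
            reader.close();
        }
        return creators;
    }
}
```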

Member

I'll also look into whether the types generated by StAX could somehow be transformed into DublinCoreSchema...

StAX is more a "nice" way to walk around the DOM tree.

I commented for the sake of consistency with the other importers.

@koppor koppor marked this pull request as draft February 16, 2025 16:23
@koppor
Member

koppor commented Feb 16, 2025

Converted to draft to make the PR overview easier. @InAnYan As soon as you feel ready, please convert it to ready 😅

3 participants