EPUB import #12457

InAnYan · 2025-02-03T19:08:45Z

EPUB has some bits of metadata, so why not import it?

Mandatory checks

I own the copyright of the code submitted and I licence it under the MIT license
[?] Change in CHANGELOG.md described in a way that is understandable for the average user (if change is visible to the user)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
~~- [ ] Screenshots added in PR description (for UI changes)~~
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

github-actions

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

InAnYan · 2025-02-04T09:00:16Z

I would appreciate it if anyone can give me places where I can download ePUB books and include them in JabRef tests.

I only took books in public domain from Project Gutenberg, but they appear to be strictly in one format.

However, on their websites I found other ePUBs whose ZIP archives have different contents (this is the reason why I search for any .opf file).
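
The "search for any .opf file" idea can be sketched as below. This is only an illustration using `java.util.zip` from the JDK, not the code from this PR; the class name `OpfLocator` and the method `findOpfEntry` are hypothetical. The EPUB container format also records the package document's path in `META-INF/container.xml`; the sketch deliberately follows the simpler approach described above and returns the first matching entry.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of JabRef: locates the OPF package document in an EPUB.
public class OpfLocator {

    /** Returns the name of the first non-directory ZIP entry ending in ".opf", if any. */
    public static Optional<String> findOpfEntry(Path epub) throws IOException {
        // An EPUB is a ZIP archive, so the standard ZipFile API can list its entries.
        try (ZipFile zip = new ZipFile(epub.toFile())) {
            return zip.stream()
                      .filter(entry -> !entry.isDirectory())
                      .map(ZipEntry::getName)
                      // Case-insensitive match, since archive layouts differ between sources.
                      .filter(name -> name.toLowerCase(Locale.ROOT).endsWith(".opf"))
                      .findFirst();
        }
    }
}
```

An archive without any `.opf` entry yields `Optional.empty()`, which an importer could treat as "not an EPUB I can read".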

github-actions

Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.

In case of issues with the import order, double check that you activated Auto Import.
You can fix the imports by pressing Ctrl+Alt+O to trigger Optimize Imports.

src/main/java/org/jabref/logic/util/io/FileUtil.java

Siedlerchr · 2025-02-04T11:55:08Z

src/main/java/org/jabref/logic/importer/fileformat/EpubImporter.java

+                addField(StandardField.URL, identifier);
+            }
+
+            addField(StandardField.AUTHOR, Optional.of(String.join(" and ", authors)));


Use Authorlist parser

What does it do? Parsing authors? Why do I need it there?

Authors are specified separately in <dc:author>. Should be...

UPDATE: Ah, crap, I remember seeing sometimes it's not. Should investigate a bit.

I remember that there is a fetcher which has a similar schema, take a look at that one

Oh yes! There is Dublin Core scheme (which OPF internally uses)!

Well, anyway, it was an interesting experience with parsing XML files and using `XPath`s.

Terrible, it's tied to XMP format.

I can't find a way to pass ordinary XML nodes

Ah damn. Can you try the Dublin Core Extractor?

jabref/src/main/java/org/jabref/logic/xmp/DublinCoreExtractor.java

Line 37 in beefbaa

public class DublinCoreExtractor {

Yes! It is this class that relies on XMP

And thus this refs #12457 (comment)

src/main/java/org/jabref/logic/util/io/XMLUtil.java

src/main/java/org/jabref/model/strings/StringUtil.java

src/test/java/org/jabref/logic/importer/fileformat/EpubImporterFilesTest.java

github-actions

JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.

You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.

koppor · 2025-02-07T08:13:56Z

> I would appreciate it if anyone can give me places where I can download ePUB books and include them in JabRef tests.

Google pointed me to the German page https://allesebook.de/kostenlose-ebooks/?fwp_anmeldung=nein, which is nice. At the end, you find some epub files.

Put them into a separate directory - and add a README.md stating the source.

Maybe we need to start a separate repository for import test files? If it grows to more than 50 MB, it really should be a submodule rather than versioned in JabRef directly.

koppor · 2025-02-07T08:16:46Z

src/main/java/org/jabref/logic/importer/fileformat/EpubImporter.java

+    private final XPathExpression titlePath = xpath.compile("/package/metadata/title");
+    private final XPathExpression creatorPath = xpath.compile("/package/metadata/creator");
+    private final XPathExpression identifierPath = xpath.compile("/package/metadata/identifier");
+    private final XPathExpression languagePath = xpath.compile("/package/metadata/language");
+    private final XPathExpression sourcePath = xpath.compile("/package/metadata/source");
+    private final XPathExpression descriptionPath = xpath.compile("/package/metadata/description");
+    private final XPathExpression subjectPath = xpath.compile("/package/metadata/subject");


Normally, one uses a StAX parser to parse XML. (Not DOM, not SAX, not XPath)

Yes, I remember it was used. But it's not like I parse the whole XML file: I only extract tiny bits of data.

I tried not to include any third-party libraries and to use what we already have (module info must be unchanged). If there is a specific reason why it's better to use StAX here, I'll rewrite.

StAX is used for importing libraries in other formats. I guess its core strength is its parsing interface, which doesn't load the whole file into memory. However, in this PR an OPF file is parsed, which contains just metadata (all content, sources, and chapters are in separate XML files in the ePUB). Maybe I can write this as a comment to justify using XPaths there?

I'll also look into whether the types generated by StAX could somehow be transformed into DublinCoreSchema...

> I tried not to include any third-party libraries and to use what we already have (module info must be unchanged). If there is a specific reason why it's better to use StAX here, I'll rewrite.

Just add a JavaDoc comment that you used XPath because you only parse a fragment of the file.
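
A minimal sketch of what such a documented XPath extraction could look like, using only `javax.xml` from the JDK. It is not the code from this PR: the class name `OpfMetadataXPath` and the method `values` are hypothetical, and the sketch matches elements by `local-name()` on a namespace-aware DOM so that the `dc:` prefix does not have to be bound to a namespace context.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/**
 * Hypothetical example: reads Dublin Core fields from an OPF package document.
 * XPath is used because only a small fragment of a small, metadata-only file is needed.
 */
public class OpfMetadataXPath {

    // Builds a namespace-agnostic path such as /package/metadata/title.
    private static String path(String element) {
        return "/*[local-name()='package']/*[local-name()='metadata']/*[local-name()='" + element + "']";
    }

    /** Returns the trimmed text of every matching metadata element, in document order. */
    public static List<String> values(String opfXml, String element) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for local-name() to return "title" for dc:title
        Document document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(opfXml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(path(element), document, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }
}
```

Calling `values(opf, "creator")` returns one list item per `dc:creator` element, which keeps separately specified authors separate.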

I think XPath goes through the whole DOM nevertheless, and StAX would be more efficient. Nevertheless, StAX is more imperative, and we as SQL guys like declarative (which is XPath).

> I'll also look into whether the types generated by StAX could somehow be transformed into DublinCoreSchema...

StAX is more a "nice" way to walk around the DOM tree.

I commented for consistency with the other importers.
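
For comparison, here is a rough sketch of the same extraction with StAX, the pull-based streaming API in `javax.xml.stream`. It is an illustration only, not how the other JabRef importers are written; the class name `OpfMetadataStax` and the method `values` are hypothetical. Unlike an XPath with an explicit path, this loop matches the element name anywhere in the document, and `getElementText()` assumes the element contains text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: collects the text of all elements with the given local name.
public class OpfMetadataStax {

    public static List<String> values(String opfXml, String element) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTD processing needed here
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(opfXml));
        List<String> result = new ArrayList<>();
        try {
            // Pull events one at a time; no in-memory tree is built.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && element.equals(reader.getLocalName())) {
                    // Reads up to the matching end tag; throws if child elements are present.
                    result.add(reader.getElementText().trim());
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }
}
```

The loop is more imperative than the XPath expressions, which is the trade-off discussed in this thread.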

koppor · 2025-02-16T16:24:01Z

Converted to draft to make the PR overview easier. @InAnYan As soon as you feel ready, please convert it to ready 😅

InAnYan added 3 commits February 3, 2025 16:23

Start working on

d647a14

Finish

cfdc0b7

Fix checkers

eea8299

github-actions bot reviewed Feb 3, 2025

Fix

339c1a6

github-actions bot reviewed Feb 4, 2025

Siedlerchr reviewed Feb 4, 2025

src/main/java/org/jabref/logic/util/io/FileUtil.java (outdated)

src/main/java/org/jabref/logic/util/io/XMLUtil.java (outdated)

src/main/java/org/jabref/model/strings/StringUtil.java

src/test/java/org/jabref/logic/importer/fileformat/EpubImporterFilesTest.java (outdated)

InAnYan added 2 commits February 5, 2025 16:26

Update from code review

e0032f3

Update comment

f00d472

github-actions bot reviewed Feb 5, 2025

koppor reviewed Feb 7, 2025

koppor marked this pull request as draft February 16, 2025 16:23
