-
-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EPUB import #12457
base: main
Are you sure you want to change the base?
EPUB import #12457
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
I would appreciate it if anyone can give me places where I can download ePUB books and include them in JabRef tests. I only took books in public domain from Project Gutenberg, but they appear to be strictly in one format. However, on their websites I found other ePUBs that have different contents of ZIP-archive (This is the reason why I search for any |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code currently does not meet JabRef's code guidelines.
We use Checkstyle to identify issues.
Please carefully follow the setup guide for the codestyle.
Afterwards, please run checkstyle locally and fix the issues.
In case of issues with the import order, double check that you activated Auto Import.
You can trigger fixing imports by pressing Ctrl+Alt+O to trigger Optimize Imports.
addField(StandardField.URL, identifier); | ||
} | ||
|
||
addField(StandardField.AUTHOR, Optional.of(String.join(" and ", authors))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Authorlist parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does it do? Parsing authors? Why do I need it there?
Authors are specified separately in <dc:author>
. Should be...
UPDATE: Ah, crap, I remember seeing sometimes it's not. Should investigate a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that there is a fetcher which has a similar schema, take a look at that one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes! There is Dublin Core scheme (which OPF internally uses)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Terrible, it's tied to XMP format.
I can't find a way to pass ordinary XML nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah damn. Can you try the Dublin Core Extractor
public class DublinCoreExtractor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! It is this class that relies on XMP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! It is this class that relies on XMP
And thus this refs #12457 (comment)
src/test/java/org/jabref/logic/importer/fileformat/EpubImporterFilesTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JUnit tests are failing. In the area "Some checks were not successful", locate "Tests / Unit tests (pull_request)" and click on "Details". This brings you to the test output.
You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide.
Google pointed me to the German page https://allesebook.de/kostenlose-ebooks/?fwp_anmeldung=nein, which is nice. At the end, you find some epub files. Put them into a separate directory - and add a README.md stating the source. Maybe, we need to start a separate repository for import test files? If it grows to more than 50 MB, it really should be a sub module rather than versioned in JabRef directly. |
private final XPathExpression titlePath = xpath.compile("/package/metadata/title"); | ||
private final XPathExpression creatorPath = xpath.compile("/package/metadata/creator"); | ||
private final XPathExpression identifierPath = xpath.compile("/package/metadata/identifier"); | ||
private final XPathExpression languagePath = xpath.compile("/package/metadata/language"); | ||
private final XPathExpression sourcePath = xpath.compile("/package/metadata/source"); | ||
private final XPathExpression descriptionPath = xpath.compile("/package/metadata/description"); | ||
private final XPathExpression subjectPath = xpath.compile("/package/metadata/subject"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, one uses a stax parser to parse XML. (Not DOM, not SAX, not XPath)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I remember it was used. But it's not like I parse whole XML file: I only extract tiny bits of data.
I tried to not include any third-party libraries and use what we alredy have (module info must be unchanged). If there is a specifc reason why it's better to use stax here, I'll rewrite.
Stax is used for importing libraries for other formats, I guess it's core strength in it's parsing interface, that it doesn't load all files into memory. However in this PR an OPF file is parsed, which contains just metadata (all content, source, chapters are in separate XML files in ePUB). Maybe I can write this comment to justify using XPath
s there?
I'll also look, if types generated by stax could somehow be transformed into DublinCoreSchema
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to not include any third-party libraries and use what we alredy have (module info must be unchanged). If there is a specifc reason why it's better to use stax here, I'll rewrite.
Just add a JavaDoc comment that you used XPath because you only parse a fragment of the file.
I think, XPath goes into the while DOM nevertheless and StAX would be more efficient. Nevertheless, StAX is more imperative and we as SQL guys like declarative (which is XPath)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll also look, if types generated by stax could somehow be transformed into DublinCoreSchema...
StAX is more a "nice" way to walk around the DOM tree.
I commented because of consistency to the other importers
Converted to draft to make the PR overview easier. @InAnYan As soon as you feel ready, please convert it to ready 😅 |
EPUB has some bits of metadata, so why not import it
Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if change is visible to the user)- [ ] Screenshots added in PR description (for UI changes)