Add support for regex labeled APNX entries. #2745
Why regex rather than XPath for this? XPath is much more robust than using a regex to parse HTML. For example, your default regex as XPath would be something like
As for generating APNX files without a device, why? For such things the best way is to use calibre-debug -c to call whatever function in the calibre code base you like.
Is there any example in the code of how I can use XPath? My knowledge of Python and the calibre source code is limited, so I would appreciate help here. I used a regex because I saw an example in the PagebreakPageGenerator generator. Also, I just noticed that the labels are all lowercased. Can I keep the letter case as well? The reason I would like to generate APNX without a device is that I want to share the APNX file together with AZW8 files, so the person who reads the book can just copy it to their device and have the exact page break locations.
On Sun, May 11, 2025 at 06:05:50AM -0700, Vaso Peras-Likodrić wrote:
Vasolik left a comment (kovidgoyal/calibre#2745)
Is there any example in the code of how I can use XPath? My knowledge of Python and the calibre source code is limited, so I would appreciate help here. I used a regex because I saw an example in the PagebreakPageGenerator generator.
XPath is used all over calibre, just grep for xpath and you will see
plenty of examples.
Also, I just noticed that the labels are all lowercased. Can I keep the letter case as well?
Yes, makes sense.
The reason I would like to generate APNX without a device is that I want to share the APNX file together with AZW8 files, so the person who reads the book can just copy it to their device and have the exact page break locations.
The calibre-debug -c route should work fine for that.
I was searching the code base. I have found a lot of examples of XPath use, but what I am missing is how to search the entire document. In all the examples I saw, there is already some root from which the search is done, but I do not see how to get a root for the entire MOBI (AZW3) document.
Use parse_html() from parse_utils.py on any html file to get the root on
But there is more than one HTML file. And I think I need the binary location inside the file, not the location in one XML document. Now I am wondering whether XPath can even work.
If you need binary locations then XPath can work by adding an attribute
Multiple HTML documents means there is more than one root. And by your suggestion I now need to provide both an XPath and a regex (to find nearby). I do not see what the problem is with just using a regex. The regex version is more versatile in my opinion anyway. Someone else can make an XPath version as well.
Is there a way to auto generate
I see XPaths used here as well?
On Thu, May 22, 2025 at 03:16:37AM -0700, VasoLiAbbyyNew wrote:
VasoLiAbbyyNew left a comment (kovidgoyal/calibre#2745)
Multiple HTML documents means there is more than one root.
And you need to run your regex over more than one document as well, so
what's the difference?
And by your suggestion I now need to provide both an XPath and a regex (to find nearby).
No, you just need to provide the XPath; using a regex is an internal
implementation detail.
I do not see what the problem is with just using a regex. The regex version is more versatile in my opinion anyway.
Regexes are hard to use for most people and prone to fragility when used
against markup languages like HTML.
Someone else can make an XPath version as well.
calibre is not a grab bag of random ill thought out features.
I have told you what needs to happen to get your PR accepted. If you
don't want to do that, that's fine.
--
_____________________________________
Dr. Kovid Goyal
https://www.kovidgoyal.net
https://calibre-ebook.com
_____________________________________
I thought this is searching over all of the HTML, not just one document. For XPath, the user needs to modify the document (to add what hxml is in every tag, which does not make sense) and then write one more regex to find it inside the document. The XPath solution is inferior in my opinion. Also, I do not know how to implement it so that it works properly. I want this feature because proper marking of pages is very important for documents that are used in academia, so that quoting certain books is consistent. That is why I am pushing for this change.
On Thu, May 22, 2025 at 03:34:25AM -0700, VasoLiAbbyyNew wrote:
VasoLiAbbyyNew left a comment (kovidgoyal/calibre#2745)
I thought this is searching over all of the html, not just one document ```html = mobi_html(mobi_file_path)```
For XPath, the user needs to modify the document (to add what hxml is in every tag, which does not make sense) and then write one more regex to find it inside the document.
No, that's not what the user has to do. The user has to just specify the
XPath. You have to use a regex to calculate the offset of the matching
element in your implementation.
The XPath solution is inferior in my opinion.
Also, I do not know how to implement it so that it works properly.
I have told you how it can be done.
I want this feature because proper marking of pages is very important for documents that are used in academia, so that quoting certain books is consistent. That is why I am pushing for this change.
You are welcome to implement it, but it needs to be done properly.
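The approach described in this reply (the user supplies only an XPath, and the implementation uses a regex internally to find the offset of each match) could be sketched roughly as below. This is not calibre code: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and it assumes well-formed XHTML without namespaces. The names `page_offsets` and `data-pgmark` are hypothetical, and the offsets are relative to this sketch's own serialization, not to the bytes of a real MOBI file.

```python
import re
import xml.etree.ElementTree as ET

MARKER = "data-pgmark"  # hypothetical attribute name, used only inside this sketch


def page_offsets(xhtml, path, label_attr="aria-label"):
    """Return (byte_offset, label) pairs for every element matching `path`.

    Offsets refer to ElementTree's marker-free serialization of the document,
    not to the bytes of the original file.
    """
    root = ET.fromstring(xhtml)
    labels = []
    # Step 1: the user-supplied path selects the elements; tag each match.
    for i, el in enumerate(root.iterfind(path)):
        el.set(MARKER, str(i))
        labels.append(el.get(label_attr, ""))  # label case is kept as written

    # Step 2: serialize once and locate each tagged start tag with a regex.
    data = ET.tostring(root, encoding="unicode").encode("utf-8")
    tag_rx = re.compile(rb'<[^<>]*\s' + MARKER.encode() + rb'="(\d+)"')

    results, added = [], 0
    for m in tag_rx.finditer(data):
        i = int(m.group(1))
        # Subtract the bytes contributed by markers that precede this tag.
        results.append((m.start() - added, labels[i]))
        added += len(' %s="%d"' % (MARKER, i))
    return results


doc = (
    '<html><body><p>one<span role="doc-pagebreak" aria-label="iv"/>two</p>'
    '<p><span role="doc-pagebreak" aria-label="12"/></p></body></html>'
)
pages = page_offsets(doc, ".//span[@role='doc-pagebreak']")
```

With several HTML documents, the same function would simply be called once per document, which mirrors the point that a regex also has to be run over each document.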
About 3 years ago I wanted to submit page label generation based on the ARIA (Accessible Rich Internet Applications) specification for encoding page breaks: https://kb.daisy.org/publishing/docs/html/dpub-aria/doc-pagebreak.html
That was rejected as too specific.
This time I am submitting a general regex page generator (with one regex group). Whenever the regex matches, it looks for a group in the regex and sets that as the page location. This time the ARIA page break is just the default value for the regex.
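As a minimal sketch of the idea described above, the snippet below scans HTML with a regex that has exactly one capture group and records each match. The default pattern here is only a hypothetical stand-in for an ARIA page break matcher (the PR's actual default regex is not shown in this thread), and the function name `regex_page_labels` is likewise invented for illustration. It assumes the match position is the page location and the group's text is the page label.

```python
import re

# Hypothetical default pattern; the PR's real default is not shown in this thread.
DEFAULT_PATTERN = r'<[^>]*\brole="doc-pagebreak"[^>]*\baria-label="([^"]*)"'


def regex_page_labels(html, pattern=DEFAULT_PATTERN):
    """Return (offset, label) pairs, one per regex match in `html`."""
    rx = re.compile(pattern)
    if rx.groups != 1:
        raise ValueError("pattern must contain exactly one capture group")
    # The match start is taken as the location, the group as the label.
    # The label is not lowercased, so "IV" stays "IV".
    return [(m.start(), m.group(1)) for m in rx.finditer(html)]


html = (
    '<p>a<span role="doc-pagebreak" aria-label="IV"></span>'
    'b<span role="doc-pagebreak" aria-label="12"></span></p>'
)
pages = regex_page_labels(html)
```

Note that this particular pattern only matches when `role` appears before `aria-label` inside the tag, which is one instance of the fragility raised in the conversation.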
I hope this modification can be accepted as additional generation of APNX files.
PS: Is it ok to make generation of APNX files possible without attached device? Maybe like additional conversion option? Or something similar?