Add support for regex-labeled APNX entries. #2745

Vasolik · 2025-05-09T22:57:41Z

About three years ago I wanted to submit page label generation based on the ARIA (Accessible Rich Internet Applications) specification for encoding page breaks: https://kb.daisy.org/publishing/docs/html/dpub-aria/doc-pagebreak.html

That was rejected as too specific.

This time I am submitting a general regex-based page generator, where the regex has one group. Whenever the regex matches, the generator takes the group from the match and uses it as the label for that page location. The ARIA page break is now just the default value for the regex.
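To illustrate the idea (this is not the code from the PR), here is a minimal sketch of a one-group regex page finder. The pattern, the function name `find_labeled_pages`, and the sample markup are assumptions made for illustration; in particular the pattern assumes `role` appears before `aria-label` inside the tag.

```python
import re

# Hypothetical default pattern with exactly one capture group (the page label).
# Assumes role="doc-pagebreak" comes before aria-label within the same tag.
DEFAULT_PAGEBREAK_RE = re.compile(
    r'<[^>]*\brole="doc-pagebreak"[^>]*\baria-label="([^"]*)"[^>]*>'
)


def find_labeled_pages(html, pattern=DEFAULT_PAGEBREAK_RE):
    """Return (offset, label) pairs, one per match, in document order."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(html)]


sample = (
    '<p>Intro</p>'
    '<span role="doc-pagebreak" id="p1" aria-label="iv"></span>'
    '<p>Text</p>'
    '<span role="doc-pagebreak" id="p2" aria-label="A-1"></span>'
)
pages = find_labeled_pages(sample)
```

Each offset is the position of the matching tag in the searched text, and the label is taken from the group unchanged, so letter case is preserved.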

I hope this modification can be accepted as an additional way of generating APNX files.

PS: Would it be OK to make generation of APNX files possible without an attached device? Maybe as an additional conversion option, or something similar?

kovidgoyal · 2025-05-11T04:55:54Z

Why regex rather than XPath for this? Much more robust than using a regex to parse HTML. For example, your default regex as XPath would be something like

//*[@role="doc-pagebreak"]/@aria-label
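As a rough sketch of what this XPath selects, the snippet below uses Python's standard-library `xml.etree.ElementTree` as a stand-in for the lxml-based parsing calibre uses. ElementTree supports only a subset of XPath and has no `/@attribute` step, so the sketch selects the elements and reads the attribute from each; with lxml, the expression above can be passed to `root.xpath()` directly. The sample markup and the function name `pagebreak_labels` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML sample.
XHTML = '''<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>Intro</p><span role="doc-pagebreak" id="p1" aria-label="iv"/>
<p>Text</p><span role="doc-pagebreak" id="p2" aria-label="A-1"/>
</body></html>'''


def pagebreak_labels(xhtml):
    """Return the aria-label of every element with role="doc-pagebreak"."""
    root = ET.fromstring(xhtml)
    # ElementTree cannot select attributes directly, so select the
    # elements and read the attribute from each one.
    matches = root.iterfind('.//*[@role="doc-pagebreak"]')
    return [el.get('aria-label') for el in matches if el.get('aria-label') is not None]


labels = pagebreak_labels(XHTML)
```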

As for generating APNX files without a device, why? For such things the best way is to use

calibre-debug -c

to call whatever function in the calibre code base you like.

Vasolik · 2025-05-11T13:05:20Z

Is there any example in the code of how I can use XPath? My knowledge of Python and the calibre source code is limited, so I would appreciate help here. I used a regex because I saw an example in the PagebreakPageGenerator generator.

Also, I just noticed that labels are all lowercased. Can I preserve letter case as well?

The reason I would like to generate APNX without a device is that I want to share the APNX file together with AZW8 files, so that a person who reads the book can just copy it onto their device and have the exact page break locations.

kovidgoyal · 2025-05-11T14:08:36Z

On Sun, May 11, 2025 at 06:05:50AM -0700, Vaso Peras-Likodrić wrote:

> Is there any example in the code of how I can use XPath? My knowledge of Python and the calibre source code is limited, so I would appreciate help here. I used a regex because I saw an example in the PagebreakPageGenerator generator.

XPath is used all over calibre, just grep for xpath and you will see plenty of examples.

> Also, I just noticed that labels are all lowercased. Can I preserve letter case as well?

Yes, makes sense.

> The reason I would like to generate APNX without a device is that I want to share the APNX file together with AZW8 files, so that a person who reads the book can just copy it onto their device and have the exact page break locations.

The calibre-debug -c route should work fine for that.

Vasolik · 2025-05-13T17:04:45Z

I searched the code base and found a lot of examples of XPath use, but what I am missing is how to run a search over the entire document. In all the examples I saw, there is already some root from which the search is done, and I do not see how to get a root for the entire MOBI (AZW3) document.

kovidgoyal · 2025-05-14T02:50:43Z

Use parse_html() from parse_utils.py on any HTML file to get the root on which to run XPath.

Vasolik · 2025-05-21T19:26:39Z

But there is more than one HTML file. And I think I need the binary location inside the file, not the location within one XML document.

Now I am wondering whether XPath can even work.

kovidgoyal · 2025-05-22T04:30:07Z

If you need binary locations, then XPath can work by adding an attribute to the tag with a uuid or similar and then using a regex to find the binary offset to it. Not sure what multiple HTML files have to do with it? That is the case regardless of whether you use regex or XPath or anything else.
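A minimal sketch of that marker technique, under stated assumptions: it uses the standard-library `xml.etree.ElementTree` in place of calibre's own parser, and the attribute name `data-apnx-mark` and function name `mark_and_locate` are hypothetical. The offsets it returns are into the re-serialized, marked bytes; mapping them back to offsets in the book's real byte stream (for example by accounting for the inserted attributes) is not shown.

```python
import re
import uuid
import xml.etree.ElementTree as ET

MARK_ATTR = 'data-apnx-mark'  # hypothetical marker attribute name


def mark_and_locate(xhtml, path='.//*[@role="doc-pagebreak"]',
                    label_attr='aria-label'):
    """Tag each element matched by path with a unique token, serialize,
    then find the byte offset of each tagged element with a regex.

    Returns (marked_bytes, [(offset, label), ...]).
    """
    root = ET.fromstring(xhtml)
    labels = {}
    for el in root.iterfind(path):
        token = uuid.uuid4().hex  # 32 lowercase hex characters
        el.set(MARK_ATTR, token)
        labels[token] = el.get(label_attr, '')
    marked = ET.tostring(root, encoding='utf-8')
    # The regex is an internal detail: it only looks for the marker tokens.
    pat = re.compile(rb'<[^<>]*\b' + MARK_ATTR.encode('ascii') + rb'="([0-9a-f]{32})"')
    pages = [(m.start(), labels[m.group(1).decode('ascii')])
             for m in pat.finditer(marked)]
    return marked, pages


doc = ('<html><body><p>Intro</p>'
       '<span role="doc-pagebreak" aria-label="iv"/>'
       '<p>Text</p>'
       '<span role="doc-pagebreak" aria-label="A-1"/>'
       '</body></html>')
marked, pages = mark_and_locate(doc)
```

In this arrangement the user would supply only the XPath; the tokens and the regex that locates them stay inside the implementation.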

VasoLiAbbyyNew · 2025-05-22T10:16:15Z

Multiple HTML documents means there is more than one root. And by your suggestion I now need to provide both an XPath and a regex (to find the nearby offset).

I do not see what the problem is with just using a regex. The regex version is more versatile in my opinion anyway.

Someone else can make an XPath version as well.

VasoLiAbbyyNew · 2025-05-22T10:20:01Z

Is there a way to auto generate

<nav epub:type="page-list">
  <h2>Page List</h2>
  <ol>
    <li><a href="chapter1.xhtml#p1">1</a></li>
    <li><a href="chapter1.xhtml#p2">2</a></li>
  </ol>
</nav>

I see XPaths are used here as well?

kovidgoyal · 2025-05-22T10:24:00Z

On Thu, May 22, 2025 at 03:16:37AM -0700, VasoLiAbbyyNew wrote:

> Multiple HTML documents means there is more than one root.

And you need to run your regex over more than one document as well, so what's the difference?

> And by your suggestion I now need to provide both an XPath and a regex (to find the nearby offset).

No, you just need to provide the XPath; using a regex is an internal implementation detail.

> I do not see what the problem is with just using a regex. The regex version is more versatile in my opinion anyway.

Regexes are hard to use for most people and prone to fragility when used against markup languages like HTML.

> Someone else can make an XPath version as well.

calibre is not a grab bag of random ill-thought-out features. I have told you what needs to happen to get your PR accepted; if you don't want to do that, that's fine.


VasoLiAbbyyNew · 2025-05-22T10:34:04Z

I thought this searches over all of the HTML, not just one document: html = mobi_html(mobi_file_path)

For XPath, the user needs to modify the document (to add what hxml is in every tag, which does not make sense) and then write one more regex to find it inside the document.

The XPath solution is inferior in my opinion.

Also, I do not know how to implement it so that it works properly.

I want this feature because proper marking of pages is very important for documents used in academia, so that quoting certain books is consistent. That is why I am pushing for this change.

kovidgoyal · 2025-05-22T10:49:00Z

On Thu, May 22, 2025 at 03:34:25AM -0700, VasoLiAbbyyNew wrote:

> I thought this searches over all of the HTML, not just one document: html = mobi_html(mobi_file_path)
>
> For XPath, the user needs to modify the document (to add what hxml is in every tag, which does not make sense) and then write one more regex to find it inside the document.

No, that's not what the user has to do. The user just has to specify the XPath. You have to use a regex to calculate the offset of the matching element in your implementation.

> The XPath solution is inferior in my opinion. Also, I do not know how to implement it so that it works properly.

I have told you how it can be done.

> I want this feature because proper marking of pages is very important for documents used in academia, so that quoting certain books is consistent. That is why I am pushing for this change.

You are welcome to implement it, but it needs to be done properly.


Add for support fo regex labeled APNX entries.

7d26583
