Vasolik
Contributor

@Vasolik Vasolik commented May 9, 2025

About three years ago I tried to submit a page label generator based on the ARIA (Accessible Rich Internet Applications) specification for encoding page breaks: https://kb.daisy.org/publishing/docs/html/dpub-aria/doc-pagebreak.html

That was rejected as too specific.

This time I am submitting a general regex page generator (with one regex group). Whenever the regex matches, the text captured by its group is used as the page label at that location. The ARIA page break pattern is now just the default value for the regex.
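A minimal sketch of the idea described above, assuming the book content is available as raw bytes. The pattern and the `find_pages` helper are hypothetical illustrations, not the code submitted in this pull request:

```python
import re

# Assumed default pattern: exactly one capture group holding the page label.
# It expects role="doc-pagebreak" to appear before aria-label inside the tag,
# which is one of the fragilities of matching HTML with a regex.
ARIA_PAGEBREAK = re.compile(
    rb'<[^>]*\brole="doc-pagebreak"[^>]*\baria-label="([^"]*)"[^>]*>'
)

def find_pages(data, pattern=ARIA_PAGEBREAK):
    """Return (byte_offset, label) pairs for every match of pattern in data."""
    pages = []
    for m in pattern.finditer(data):
        # m.start() is the offset of the tag; group(1) is the page label.
        pages.append((m.start(), m.group(1).decode("utf-8", "replace")))
    return pages

sample = (
    b'<p>Intro</p>'
    b'<span role="doc-pagebreak" id="p1" aria-label="1"></span>'
    b'<p>Text</p>'
    b'<span role="doc-pagebreak" id="p2" aria-label="ii"></span>'
)
pages = find_pages(sample)
```

Because the label comes straight from the capture group, its letter case is preserved unless the caller lowercases it.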

I hope this modification can be accepted as an additional method of generating APNX files.

PS: Is it OK to make it possible to generate APNX files without an attached device? Maybe as an additional conversion option, or something similar?

@kovidgoyal
Owner

Why regex rather than XPath for this? Much more robust than using a regex to parse HTML. For example, your default regex as XPath would be something like

//*[@role="doc-pagebreak"]/@aria-label
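As a rough illustration of the XPath approach, the sketch below uses Python's standard-library `xml.etree.ElementTree` as a stand-in, which is an assumption made to keep the example self-contained. ElementTree supports only a subset of XPath and cannot select attributes directly, so the label is read with `.get()`; with lxml, the full expression above can be passed to `root.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; note the attribute order differs between
# the two page breaks, which does not matter for an XPath query.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<p>Intro</p>
<span role="doc-pagebreak" id="p1" aria-label="1"/>
<p>Text</p>
<span id="p2" aria-label="2" role="doc-pagebreak"/>
</body></html>"""

root = ET.fromstring(XHTML)

# Select every element whose role attribute is doc-pagebreak,
# then read its aria-label attribute.
labels = [
    el.get("aria-label")
    for el in root.iterfind(".//*[@role='doc-pagebreak']")
]
```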

As for generating APNX files without a device, why? For such things the best way is to use

calibre-debug -c

to call whatever function in the calibre code base you like.

@Vasolik
Contributor Author

Vasolik commented May 11, 2025

Is there any example in the code of how I can use XPath? My knowledge of Python and the calibre source code is limited, so I would appreciate help here. I used regex because I saw an example in the PagebreakPageGenerator generator.

Also, I just noticed that the labels are all lowercased. Can I keep the letter case as well?

The reason I would like to generate APNX files without a device is that I want to share the APNX file together with AZW8 files, so that the person who reads the book can just copy both to their device and have the exact page break locations.

@kovidgoyal
Owner

kovidgoyal commented May 11, 2025 via email

@Vasolik
Contributor Author

Vasolik commented May 13, 2025

I was searching the code base. I found a lot of examples of XPath use, but what I am missing is how to search the entire document. In all the examples I saw, there is already some root from which the search is done, but I do not see how to get a root for the entire MOBI (AZW3) document.

@kovidgoyal
Owner

Use parse_html() from parse_utils.py on any html file to get the root on
which to run xpath

@Vasolik
Contributor Author

Vasolik commented May 21, 2025

But there is more than one HTML file. And I think I need the binary location inside the file, not a location in one XML document.

Now I am wondering: can XPath even work?

@kovidgoyal
Owner

If you need binary locations, then XPath can work by adding an attribute
to the tag with a uuid or similar and then using regex to find the
binary offset to it. Not sure what multiple HTML files have to do with
it? That is the case regardless of whether you use regex or XPath or
anything else.
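One way to read this suggestion is sketched below, again using the standard-library `xml.etree.ElementTree` in place of lxml. The attribute name `data-apnx-marker` and the helper `tagged_offsets` are hypothetical. Note that the offsets refer to the re-serialized bytes, not to the original file, so a real implementation would need to work on whatever byte stream the APNX locations are computed against.

```python
import re
import uuid
import xml.etree.ElementTree as ET

def tagged_offsets(xhtml, xpath=".//*[@role='doc-pagebreak']"):
    """Tag matching elements with a unique marker, serialize, and return
    (serialized_bytes, [(byte_offset, label), ...])."""
    root = ET.fromstring(xhtml)
    marker_to_label = {}
    for el in root.iterfind(xpath):
        marker = uuid.uuid4().hex  # 32 lowercase hex characters
        el.set("data-apnx-marker", marker)
        marker_to_label[marker] = el.get("aria-label", "")

    raw = ET.tostring(root, encoding="utf-8")

    result = []
    for m in re.finditer(rb'data-apnx-marker="([0-9a-f]{32})"', raw):
        label = marker_to_label[m.group(1).decode("ascii")]
        # Walk back to the '<' that opens the tag carrying the marker.
        result.append((raw.rfind(b"<", 0, m.start()), label))
    return raw, result

doc = (
    '<body><p>Intro</p>'
    '<span role="doc-pagebreak" aria-label="1"/>'
    '<p>Text</p>'
    '<span aria-label="2" role="doc-pagebreak"/></body>'
)
raw, offsets = tagged_offsets(doc)
```

The regex here only locates a unique marker string, so it does not depend on attribute order or tag shape the way a regex over arbitrary HTML does.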

@VasoLiAbbyyNew

Multiple HTML documents means there is more than one root. And with your suggestion I now need to provide both an XPath and a regex (to then locate the match).

I do not see what the problem is with just using regex. The regex version is more versatile in my opinion anyway.

Someone else can make an XPath version as well.

@VasoLiAbbyyNew

VasoLiAbbyyNew commented May 22, 2025

Is there a way to auto generate

<nav epub:type="page-list">
  <h2>Page List</h2>
  <ol>
    <li><a href="chapter1.xhtml#p1">1</a></li>
    <li><a href="chapter1.xhtml#p2">2</a></li>
  </ol>
</nav>

I see XPath being used here as well?
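The thread does not record whether calibre can generate such a list automatically. As a hedged sketch only, a page list like the one above could be assembled from `doc-pagebreak` elements along these lines. The helper `page_list_nav` and the sample file names are hypothetical, and the standard-library `xml.etree.ElementTree` is used here rather than calibre's own parser:

```python
import xml.etree.ElementTree as ET
from html import escape

def page_list_nav(files):
    """Build a page-list <nav> from a mapping of file name -> XHTML text.

    Only page breaks that carry both an id and an aria-label are listed.
    """
    items = []
    for name, xhtml in files.items():
        root = ET.fromstring(xhtml)
        for el in root.iterfind(".//*[@role='doc-pagebreak']"):
            pid, label = el.get("id"), el.get("aria-label")
            if pid and label:
                items.append(
                    f'    <li><a href="{escape(name)}#{escape(pid)}">'
                    f'{escape(label)}</a></li>'
                )
    return (
        '<nav epub:type="page-list">\n'
        '  <h2>Page List</h2>\n'
        '  <ol>\n' + "\n".join(items) + '\n  </ol>\n'
        '</nav>'
    )

nav = page_list_nav({
    "chapter1.xhtml": (
        '<body><span role="doc-pagebreak" id="p1" aria-label="1"/>'
        '<p>Text</p>'
        '<span role="doc-pagebreak" id="p2" aria-label="2"/></body>'
    ),
})
```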

@kovidgoyal
Owner

kovidgoyal commented May 22, 2025 via email

@VasoLiAbbyyNew

I thought this searches over all of the HTML, not just one document: html = mobi_html(mobi_file_path)

For XPath, the user needs to modify the document (to add what hxml is in every tag, which does not make sense) and then write one more regex to find it inside the document.

The XPath solution is inferior in my opinion.

Also, I do not know how to implement it so that it works properly.

I want this feature because proper marking of pages is very important for documents used in academia, so that quoting certain books is consistent. That is why I am pushing for this change.

@kovidgoyal
Owner

kovidgoyal commented May 22, 2025 via email


3 participants