RFC: Books API Pagination #1580
I'll note that this does not take into account changes needed for pagination on the UI side of things. This is because we did have pagination implemented on the UI and removed it, as the backend was not operating as expected and did not support what was needed for the facet counts. However, if there are open questions on that side I'd be happy to add them and work through issues.
-
I like the book_search_data table idea. Since it would already contain some of the core info, I wonder if this could be beefed up with other fields (Book ID instead of a separate ID, library ID, primary file ID and related primary file fields, etc.)? And then use this as the source for the page list response itself to avoid some of the memory-intensive backend mapping, e.g. loading the full book result and stripping fields back out to send to the FE.
-
How would series collapsing work? Right now it's "load everything on the FE, check which books belong to a series and then filter them out in the browser", which is horrible. Pagination will definitely break that, and a backend solution would be much cleaner all round.
-
How do you imagine book selection in the browser UI working, both individual and select all? I think the former would be fine, but select all would be flaky with pagination. I guess you'd need a quick endpoint to fetch all book IDs and return them to the FE?
-
All of this looks great, thanks for the in-depth breakdown and for scoping it out. The only thing that sticks out to me is that I don't think we'll have much long-term need for the concept of a primary book file, given some of the changes proposed in the editions work, but that's an easy change to make in the future. The most important thing is getting alignment on this direction. I'm curious what you see as the next steps here, and what the individual ticket breakdown and scoping of those tickets would look like?
-
Out of curiosity, how was the decision made to do subsequent sort queries for multiple stacked sorts ( |
-
Problem statement
One of the most visible issues with supporting any decently sized library
today is the Books API endpoint, and we want to support larger libraries.
Today, any time the dashboard is loaded, a request for every book in the library is made.
For extremely large libraries, this response can be (as expected) extremely large - some
users have reported responses of over a gigabyte that take multiple minutes to arrive. The data
retrieved is then processed, sorted, and filtered in the browser. This is non-ideal.
There were previous attempts to use the API endpoints originally designed for a first
party application which does have pagination in its endpoints. However, we found that these
were missing features from the original implementation.
It would be a big win for larger libraries to support pagination for the books endpoint.
Goals
Non-goals
Proposed solution
Add facet and sort query parameters to the `/v1/books/page` endpoint, create a
`/v1/books/facets` endpoint to support faceting, and add a table specific to
searching and sorting that is associated with each book.
Supporting sorting in the books endpoint
A query parameter for sorting MUST be added to the `/v1/books/page` endpoint as
a `sort` query parameter. The `sort` query parameter SHOULD be used to add an additional
layer of sorting by including a comma (`,`) and concatenating another sort value. The value of
the `sort` MAY have a `-` (dash) prepended to signify that it is a descending sort. The default
MUST be an ascending sort when the `-` is omitted.
For example, a sort by series name ascending and then series number descending
may be supported via `?sort=seriesName,-seriesNumber`.
The sort MUST always have an implicit final sort of the book primary key
`id` ascending as a tie breaker to ensure consistent sorting.
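As a rough illustration of the rules above, here is a minimal Python sketch of parsing the `sort` value. The function name `parse_sort`, the `ALLOWED_SORTS` allow-list, and the choice to reject unknown fields with an error are assumptions made for this example; the RFC does not define them.

```python
from typing import List, Tuple

# Hypothetical allow-list for the example; the RFC names many more sort keys.
ALLOWED_SORTS = {"title", "seriesName", "seriesNumber", "addedOn"}


def parse_sort(raw: str) -> List[Tuple[str, str]]:
    """Turn 'seriesName,-seriesNumber' into [(field, direction), ...]."""
    result: List[Tuple[str, str]] = []
    tokens = (part.strip() for part in raw.split(","))
    for token in filter(None, tokens):
        # A leading dash marks a descending sort; ascending is the default.
        descending = token.startswith("-")
        field = token[1:] if descending else token
        if field not in ALLOWED_SORTS:
            raise ValueError(f"unsupported sort field: {field!r}")
        result.append((field, "desc" if descending else "asc"))
    # Implicit final tie breaker on the primary key, always ascending.
    result.append(("id", "asc"))
    return result


print(parse_sort("seriesName,-seriesNumber"))
```

An empty `sort` value falls back to the tie breaker alone, so results are still ordered consistently by `id`.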
Simple Sorts Implementation
Most sorts are straightforward to implement. This includes:
- `title`
- `seriesName`
- `seriesNumber`
- `addedOn`
- `publisher`
- `publishedDate`
- `amazonRating`
- `amazonReviewCount`
- `goodreadsRating`
- `goodreadsReviewCount`
- `hardcoverRating`
- `hardcoverReviewCount`
- `ranobedbRating`
- `narrator`
- `pageCount`
These MAY be implemented with the existing models and simplified sort definitions.
Sorts not possible via standard specifications
Many of the sorts reference multiple other records - or are shorthand
for a variety of other fields. In both of these cases this is difficult
because we either need to sort by an aggregate or we need to perform expensive
compute for each row before sorting.
For example, for many of the file-related sorts we want to sort on the primary file - which
is currently defined via logic that cannot be easily expressed in SQL.
These include:
- `fileSizeKb`
- `fileName`
- `filePath`
- `bookType`
- `author`
- `authorSurnameVorname`
- `locked`
An easy approach to handle these is to create a per-book "search"
record table which we write to any time there is a known change to the book
or book files. While this is an extra hit on writes, it improves our reads
significantly and sets us up to more readily support external search mechanisms
like Elasticsearch.
The schema for this table (name pending) would be something like:
While we could index these fields as needed to limit full table scans, it's unlikely that we will
improve performance significantly. The way we search reduces the likelihood of index searches.
This data should be considered ephemeral. In the situation where we have changes
to the logic that indexes into the table - such as via configuration change or
a software update - we can flush the table with a `TRUNCATE TABLE` (or similar)
and re-index fields again. An example of changes would be a software update adding
the `title_file_as` field, which may omit `The`, `A`, or other text based on
configuration.
The re-index SHOULD be added as a task to the "Tasks" settings page. It MUST be
executed as part of the initial migration so that the table is filled.
Note
This MAY be extended to support features like omitting articles from titles
via a `title_file_as` column.
Sorts that are per-user
Sorts for per-user values are going to incur a hit in most cases.
However, the logic is simple enough to handle in JPA specifications
and other related expressions.
These include:
- `personalRating`
- `lastReadTime`
- `readStatus`
- `dateFinished`
- `readingProgress`
This can be seen as part of the app books service logic. The only item which
is somewhat confusing to support is "reading progress". We should only support
the Grimmory % reading progress, but how that is kept up to date with
other systems is outside the scope of this document.
Random Sort
With pagination, the "random" sort is a big question mark.
There are a number of ways to support this, all with their own tradeoffs.
For our use cases, we just need "perceivably random" for humans without any
requirement on cryptographic randomness, and to support the same random values
across pages.
Normally, a `SELECT * FROM book ORDER BY random()` could be used, but it would
not survive the pagination process, because each page request produces a fresh
ordering, and it is not very performant because it needs to be executed for every record.
The proposed approach, sort during index, is to have a handful of random sorts indexed into our search
index table (`random_1`, `random_2`, `random_3`, `random_4`, `random_5`) and
then sort via one or more of those fields (at random). Then, store the chosen
fields in the cursor value so that pages are consistent.
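A minimal Python sketch of that idea follows. It picks a random selection of the `random_*` columns and round-trips the choice through an opaque cursor. The cursor layout (base64-encoded JSON with `sort` and `page` keys) and the function names are assumptions for illustration only; the RFC leaves the cursor format open.

```python
import base64
import json
import random
from typing import List

RANDOM_COLUMNS = ["random_1", "random_2", "random_3", "random_4", "random_5"]


def choose_random_sort(rng: random.Random) -> List[str]:
    """Pick 1..5 random_* columns in random order, each asc or desc ('-' prefix)."""
    count = rng.randint(1, len(RANDOM_COLUMNS))
    columns = rng.sample(RANDOM_COLUMNS, count)
    return [("-" if rng.random() < 0.5 else "") + column for column in columns]


def encode_cursor(sort: List[str], page: int) -> str:
    """Pack the chosen sort and page into an opaque, URL-safe string."""
    payload = json.dumps({"sort": sort, "page": page}, separators=(",", ":"))
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str) -> dict:
    """Recover the sort and page so the next request reuses the same ordering."""
    return json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))


chosen = choose_random_sort(random.Random())
cursor = encode_cursor(chosen, page=0)
print(cursor, decode_cursor(cursor))
```

Because the chosen columns travel inside the cursor, every subsequent page is sorted the same way without the server storing per-user state.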
With 5 "random" fields available, each sortable ascending or descending, we have
2^5 = 32 possible random sorts if we always use all 5 in a fixed order. However, we
can actually choose any of them in any order. Counting every non-empty ordered
selection of fields with a direction for each gives 6,330 distinct sort specifications
(3,840 of them use all 5 fields). There are enough permutations
that people will not be able to notice the pattern.
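To make the counting concrete, the short Python sketch below counts the ordered, non-empty selections of the five columns with an ascending or descending direction for each. The function name is made up for the example, and the count is of sort specifications, not of guaranteed-distinct row orderings.

```python
from math import perm

COLUMNS = 5  # random_1 .. random_5


def count_sort_specs(columns: int = COLUMNS) -> int:
    """Count ordered, non-empty selections of columns, each asc or desc."""
    # Choose k columns in order (perm), then a direction for each (2**k).
    return sum(perm(columns, k) * 2**k for k in range(1, columns + 1))


print(count_sort_specs())                   # 6330
print(perm(COLUMNS, COLUMNS) * 2**COLUMNS)  # 3840 when all five are always used
print(2**COLUMNS)                           # 32 when the column order is fixed
```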
Facets Implementation
A query parameter for faceting MUST be added to the `/v1/books/page` endpoint as
the `facet` query parameter. Each subsequent instance of the `facet` query parameter
SHOULD be used to add an additional "and" facet. However, this behavior MAY be modified
by passing a value to `facet_logic`: `and`, `or`, or `not`.
Facets MAY be supported via the book index table and JPA specification definitions.
Search and advanced Faceting
A query parameter called `query` should be added to the endpoint which allows for
free form queries to be applied.
Bare Search Terms
Terms that are not otherwise matched by any query language or shortcuts should be
considered bare search terms and should be used as query values against the following
fields:
However, this RFC does not define how these must be matched, just that they should be.
Series Collapse
The "Collapse Series" feature may be supported by adding `is_first_in_series` to the search
table. This could be handled during index time to "select" the first book in a series when
the book browser has series collapse enabled. This could be exposed as a facet just like
any other.
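A minimal Python sketch of computing such a flag at index time is below. It assumes that "first" means the lowest series number (with the lowest `id` as a tie breaker and unnumbered books last), and that books outside any series are flagged so they stay visible when collapsing. Neither rule is specified by this RFC, and the row shape is hypothetical.

```python
from typing import Dict, List, Optional, Tuple, TypedDict


class BookRow(TypedDict):
    id: int
    series_name: Optional[str]
    series_number: Optional[float]


def _rank(book: BookRow) -> Tuple[bool, float, int]:
    """Lower rank wins: numbered books first, then lowest number, then lowest id."""
    number = book["series_number"]
    return (number is None, number if number is not None else 0.0, book["id"])


def flag_first_in_series(books: List[BookRow]) -> Dict[int, bool]:
    """Map book id -> is_first_in_series."""
    first_by_series: Dict[str, BookRow] = {}
    for book in books:
        series = book["series_name"]
        if series is None:
            continue
        current = first_by_series.get(series)
        if current is None or _rank(book) < _rank(current):
            first_by_series[series] = book
    first_ids = {book["id"] for book in first_by_series.values()}
    # Assumption: standalone books are flagged so a collapsed view keeps them.
    return {
        book["id"]: book["series_name"] is None or book["id"] in first_ids
        for book in books
    }
```

With a flag like this stored per book, a collapsed view becomes an ordinary filter on `is_first_in_series`, which is what lets it behave like any other facet.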
Query Language
A simple DSL for querying data should be available within this query field
at some point in the future, but not at release. This query language would
support the same flexibility of magic shelves across multiple domains - such
as series or author.
The definition of the query language is outside the scope of this RFC,
but is intended to allow power users to explore their library, should be human readable,
human writable, and expressive enough to support everything within magic shelves.
As an example query language, GitHub's Issues filtering should be used as inspiration.
Magic Shelf Support
We should surface magic shelves in the "shelf" facet (though we may want to omit
the count for now) and when a magic shelf is selected in a facet, short circuit
the facet-to-specification code to apply all of the facets normally applied by
the magic shelf.
This makes magic shelves "feel" more like real shelves.
Books Endpoint
Any changes to the books `page` endpoint MUST NOT be breaking to existing
user agents. The request should have new (optional) query parameters added,
and the response should have new fields added that match closely with OPDS 2.0
fields.
We should continue to support the `page` request parameter to keep previous behaviors.
However, we must add a `cursor` parameter which embeds the page information
in an opaque cursor - such as how to handle randomness between pages, or other
information that supports stable pagination.
Add a `cursor` to the `page` object in the response which exposes the
current page's cursor. Add a `links` property to the response object
which follows the OPDS pattern for links to other relevant pages.
Links MUST have a rel for self, and MAY have a rel for first, previous, and next.
When present, each of these MUST use the `cursor` parameter.
The links SHOULD have a descriptive `rel` which MUST be a string array:
- `self` is the canonical current page.
- `first` is the canonical first page if we were to reset pagination, and may be the same as `self`.
- `previous` is the previous page, if it exists. This MUST be omitted if we are on the first page.
- `next` is the next page, if it exists. This MUST be omitted if we are on the last page.
- `facet` is a link to view the available facets, which is described in the section Books Facets Endpoint.
Note
This means that there is no way to go to a specific page.
Given our current application, we do not need to support users selecting pages in that way.
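The link rules above could be assembled roughly as in the following Python sketch. The `build_links` helper, the `href` key, and the bare facets URL are assumptions for illustration; only the `rel` values, the `cursor` parameter, and the two endpoint paths come from this RFC.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

BOOKS_PATH = "/v1/books/page"
FACETS_PATH = "/v1/books/facets"


def build_links(
    current: str,
    first: str,
    previous: Optional[str],
    next_: Optional[str],
) -> List[Dict[str, object]]:
    """Build the `links` array; each argument is an opaque cursor string."""

    def link(rel: str, cursor: str) -> Dict[str, object]:
        query = urlencode({"cursor": cursor})
        # `rel` is a string array, as the RFC requires.
        return {"rel": [rel], "href": f"{BOOKS_PATH}?{query}"}

    links = [link("self", current), link("first", first)]
    if previous is not None:  # omitted on the first page
        links.append(link("previous", previous))
    if next_ is not None:  # omitted on the last page
        links.append(link("next", next_))
    links.append({"rel": ["facet"], "href": FACETS_PATH})
    return links


# First page of a multi-page result: no `previous` link is emitted.
for entry in build_links("c1", "c1", None, "c2"):
    print(entry)
```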
Books Facets Endpoint
To support features that require the values and counts for each facet,
the facets endpoint exposes what "options" are available to a user for
the books endpoint. This accepts the same facet & sort parameters as the
books endpoint, and applies the facets to each of the values.
Note that to get the correct values, each facet should omit itself
when calculating facets and should limit to the top 100 distinct values.
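To illustrate the "omit itself" rule, here is a small in-memory Python sketch. It assumes "and" semantics between selected facets and exact-match values, and the `facet_counts` name and book shape are hypothetical; a real implementation would run against the search table instead of a list.

```python
from collections import Counter
from typing import Dict, List, Mapping, Tuple

Book = Mapping[str, str]


def facet_counts(
    books: List[Book],
    selected: Dict[str, str],
    facet_names: List[str],
    limit: int = 100,
) -> Dict[str, List[Tuple[str, int]]]:
    """Return the top `limit` (value, count) pairs for each facet."""
    result: Dict[str, List[Tuple[str, int]]] = {}
    for name in facet_names:
        # Apply every selected facet except the one being counted, so the
        # user still sees the alternatives to their current choice.
        others = {k: v for k, v in selected.items() if k != name}
        matching = [
            book for book in books
            if all(book.get(k) == v for k, v in others.items())
        ]
        counts = Counter(book[name] for book in matching if book.get(name) is not None)
        result[name] = counts.most_common(limit)
    return result


library = [
    {"author": "A", "publisher": "X"},
    {"author": "A", "publisher": "Y"},
    {"author": "B", "publisher": "X"},
]
print(facet_counts(library, {"author": "A"}, ["author", "publisher"]))
```

With `author=A` selected, the author facet still lists author `B`, because the author filter is skipped while counting authors, whereas the publisher counts only cover books by `A`.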
Risks
Are there any backwards-incompatible changes?
To my knowledge, there are no backwards-incompatible changes being introduced.
Does this project have special implications for security and data privacy?
No, this should not be introducing any access that users did not have before.
Could this change significantly increase load on any of our backend systems?
Yes. This is a change in our sorting, so the behavior we choose could
take us from one painful query to get all records to multiple painful
queries, one for each page of data.
Does this project have any dependencies?
No, this project does not have any dependencies.
However, this will be the blueprint for other paginated endpoints
and should be designed with that in mind.
Alternative solutions
App API endpoints
It's possible to use the app API endpoints for a subset of these use cases.
However:
They can work but will require us to accept previous design decisions that may not make sense for our use case.
Sort Random with Modulo
The two options considered for random sorting were sort during index and sort with modulo.
Sort with modulo is to have a set of prime numbers (e.g., `2`, `3`, `5`, `7`, `11`),
pick one at random, and then sort by `id % selected_prime`. This is slower but
can provide really great results as far as randomness.
This was opted against for performance and complexity reasons.
Magic Shelves calculated on change
Magic shelves could be supported by listening to changes and updating a concrete shelf
when those changes update the applicability of a book to a magic shelf. Given
the nature of there being more reads than writes of most libraries, this could be
an improvement on resource utilization and would allow for us to simplify many
endpoints.
I believe this would be the right move. However, this is a pretty significant change in
behavior and could be risky.