API results can skip pages #5326

Open
obulat opened this issue Jan 9, 2025 · 7 comments
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • 🛠 goal: fix (Bug fix)
  • 🟨 priority: medium (Not blocking but should be addressed soon)
  • 🧱 stack: api (Related to the Django API)

Comments

@obulat
Contributor

obulat commented Jan 9, 2025

Description

The Openverse API shows inconsistent pagination when searching for the term bamberg:

  • Page 1 indicates multiple pages of results (e.g., page_count = 12).
  • Page 2 returns no results.
  • Page 3 unexpectedly contains results, breaking the logical flow of paginated data.

Reproduction

  1. GET https://api.openverse.org/v1/images/?page=1&q=bamberg
    • Observe that results are returned and page_count suggests multiple pages (e.g., 12 pages).
  2. GET https://api.openverse.org/v1/images/?q=bamberg&page=2
    • Observe that no results are returned.
  3. GET https://api.openverse.org/v1/images/?q=bamberg&page=3
    • Observe that results are returned, despite page 2 being empty.
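
The reproduction steps above can be automated with a small script. This is a minimal sketch, not part of the Openverse codebase: it assumes the JSON response carries the items under a `results` key (the issue itself only mentions `page_count`), and the helper names `fetch_page`, `find_skipped_pages`, and `check_query` are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.openverse.org/v1/images/"


def fetch_page(query: str, page: int) -> dict:
    """Fetch one page of image search results as parsed JSON."""
    params = urllib.parse.urlencode({"q": query, "page": page})
    with urllib.request.urlopen(f"{API_URL}?{params}", timeout=30) as resp:
        return json.load(resp)


def find_skipped_pages(result_counts: dict[int, int]) -> list[int]:
    """Return pages that are empty although a later page has results."""
    non_empty = [page for page, count in result_counts.items() if count > 0]
    if not non_empty:
        return []
    last_non_empty = max(non_empty)
    return sorted(
        page
        for page, count in result_counts.items()
        if count == 0 and page < last_non_empty
    )


def check_query(query: str, max_pages: int = 3) -> list[int]:
    """Request pages 1..max_pages and report the ones that were skipped."""
    counts = {}
    for page in range(1, max_pages + 1):
        data = fetch_page(query, page)
        # Assumption: result items live under the "results" key.
        counts[page] = len(data.get("results", []))
    return find_skipped_pages(counts)


# Example call (performs real HTTP requests):
# print(check_query("bamberg"))
```

If the behavior described in this issue still occurs, `check_query("bamberg")` would be expected to include page 2 in its output. The pure function `find_skipped_pages` can be checked without network access.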

Expected vs. Actual Behavior

  • Expected: The API should either return results consistently across each page or accurately reflect the number of valid pages in page_count.
  • Actual: Page 2 is empty while Page 3 contains results, contradicting the pagination info from Page 1.

Additional Information

This is probably related to the dead link filtering of the results.
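
To illustrate why dead link filtering is a plausible suspect, here is a toy model of the failure mode. It is deliberately naive and is not how the Openverse API actually implements filtering: it simply slices a page out of the full result list and then drops dead links, so a page whose items are all dead comes back empty while later pages still have results. The `alive` flag and the function name are hypothetical.

```python
def paginate_then_filter(items, page, page_size):
    """Slice one page from the full list first, then drop dead links."""
    start = (page - 1) * page_size
    page_items = items[start:start + page_size]
    return [item for item in page_items if item["alive"]]


# Toy index: two live items, two dead items, two live items.
items = (
    [{"id": i, "alive": True} for i in range(0, 2)]
    + [{"id": i, "alive": False} for i in range(2, 4)]
    + [{"id": i, "alive": True} for i in range(4, 6)]
)

# With page_size=2, page 2 contains only dead links and ends up empty.
pages = {p: paginate_then_filter(items, p, page_size=2) for p in (1, 2, 3)}
for number, results in pages.items():
    print(number, [item["id"] for item in results])
```

In this model the page count would still be derived from the unfiltered total (three pages), which matches the symptom reported above: page 1 advertises more pages, page 2 is empty, and page 3 has results.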

@obulat obulat added 💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API labels Jan 9, 2025
@openverse-bot openverse-bot moved this to 📋 Backlog in Openverse Backlog Jan 9, 2025
@madewithkode
Collaborator

Hi @obulat, is there a way to reproduce this locally? I tried the exact query on the local data, but there doesn't seem to be any matching data.

@obulat
Contributor Author

obulat commented Jan 12, 2025

Hi @madewithkode! I haven't tried reproducing this locally, but if this is indeed related to dead link filtering, it should be possible to reproduce by changing the local data.

This is what I would try in order to reproduce it:

  1. For all but the first image that is returned for "cat" in sample_image.csv, replace the URL to make it invalid (change one symbol of the id?).
  2. Re-index the data using just init.
  3. Make requests with a low page_size (e.g., 2) and add the filter_dead=true parameter to the query: http://localhost:50280/v1/images/?q=cat&page_size=2&filter_dead=true

Using the API debugging guidelines would be really helpful when looking at which results are being filtered.
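
Step 1 of the list above could be scripted roughly as follows. This is only a sketch under several assumptions: that sample_image.csv is comma-separated with a header row, and that it has columns named `url` and `title` (both column names are guesses and are exposed as parameters). It also keeps the first matching row in file order, which is not necessarily the first image the search returns for "cat".

```python
import csv
import io


def break_urls(csv_text, term="cat", url_column="url", search_column="title"):
    """Invalidate the URL of every row matching `term` except the first match."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    seen_first = False
    for row in rows:
        if term.lower() not in (row.get(search_column) or "").lower():
            continue  # row does not match the search term
        if not seen_first:
            seen_first = True  # leave the first matching row untouched
            continue
        # Appending a suffix is one simple way to make the URL invalid.
        row[url_column] = row[url_column] + "-broken"
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Example usage (the path is an assumption; adjust to the local checkout):
# from pathlib import Path
# path = Path("sample_image.csv")
# path.write_text(break_urls(path.read_text()))
```

After rewriting the file, the data would still need to be re-indexed (step 2) before the requests in step 3 show any effect.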

@madewithkode
Collaborator

Thanks for providing further guidelines. I won't be picking this up immediately, as I'm currently exploring another task, but if it's still open when I'm freed up, I'll try to give it a shot.

@madewithkode
Collaborator

Hi @obulat, I've been taking a look at this and trying to reproduce it locally. A few observations/questions so far:

  • If this issue is indeed related to dead link filtering, how can it occur in production on queries (like the one in the issue description) without filter_dead=true? Or is the setting settings.FILTER_DEAD_LINKS_BY_DEFAULT perhaps set to True in production?

  • I have observed a weird behavior while reproducing locally, whereby image links that I can directly reach via my browser are marked as dead and removed from the result set in check_dead_links. In fact, all the links marked as dead during the course of my reproduction were indeed reachable from within the browser. Is this a known issue or perhaps an isolated one?

@obulat
Contributor Author

obulat commented Jan 15, 2025

  • If this issue is indeed related to dead link filtering, how can it occur in production on queries (like the one in the issue description) without filter_dead=true? Or is the setting settings.FILTER_DEAD_LINKS_BY_DEFAULT perhaps set to True in production?

Yes, you're right: in production, dead link filtering is on by default.

  • I have observed a weird behavior while reproducing locally, whereby image links that I can directly reach via my browser are marked as dead and removed from the result set in check_dead_links. In fact, all the links marked as dead during the course of my reproduction were indeed reachable from within the browser. Is this a known issue or perhaps an isolated one?

Did you try to debug what the response from the dead link check is? Using either the debugger, or even a log statement here:

@madewithkode
Collaborator

Did you try to debug what the response from the dead link check is?

I'd say yes, though not exactly where you pointed. I did try logging the actual URLs being deleted here:

I then tried visiting them directly from my browser, and they were all live. Perhaps the HEAD request considers something else in order to determine link validity?
I'll now try logging the actual request responses where you suggested and see what happens. However, it appears the responses are being cached, so it's kind of hard to debug once the first request is cached. Can you recommend a simple way to invalidate the cache and run the requests without it for debugging purposes?

@obulat
Contributor Author

obulat commented Jan 16, 2025

I'll now try logging the actual request responses where you suggested and see what happens. However, it appears the responses are being cached, so it's kind of hard to debug once the first request is cached. Can you recommend a simple way to invalidate the cache and run the requests without it for debugging purposes?

I don't know off the top of my head how to invalidate the cache, but I used Redis Insight to view the cached data and edit it.
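
As a possible alternative to editing entries by hand, cached keys could be removed by pattern from a script. This is a hedged sketch, not a documented Openverse procedure: it assumes a redis-py style client (one that offers `scan_iter(match=...)` and `delete(...)`), and the key pattern, host, port, and database number in the usage comment are placeholders that would have to be looked up in the local setup, for example in Redis Insight.

```python
def delete_cached_keys(client, pattern):
    """Delete every key matching `pattern`; return the number of keys removed."""
    deleted = 0
    # scan_iter walks the keyspace incrementally instead of blocking like KEYS.
    for key in client.scan_iter(match=pattern):
        deleted += client.delete(key)
    return deleted


# Example usage with redis-py (all connection values and the pattern are
# placeholders, not taken from the Openverse configuration):
# import redis
# client = redis.Redis(host="localhost", port=6379, db=0)
# print(delete_cached_keys(client, "<dead-link-key-prefix>*"))
```

Deleting only the keys that match a narrow pattern keeps unrelated cached data intact, which is usually preferable to flushing the whole database while debugging.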

Projects
Status: 📋 Backlog
Development

No branches or pull requests

2 participants