[production] Production site is intermittently timing out #722

Closed
hancush opened this issue Feb 16, 2021 · 11 comments

Comments

@hancush

hancush commented Feb 16, 2021

We've set up downtime notifications for our properties and it looks like WWIC is timing out from time to time. (We're at 99+ percent uptime over the last 30 days, FWIW.)

I started to look into this and noticed we were getting exceptions like this every time a download was requested:

Error handling request
Traceback (most recent call last):
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
psycopg2.ProgrammingError: relation "sfm_pc_membershiporganizationdownload" does not exist
LINE 1: ...ershiporganizationdownload"."member_realend" FROM "sfm_pc_me...
                                                             ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 139, in handle_request
    for item in respiter:
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/djqscsv/djqscsv.py", line 154, in _iter_csv
    for record in values_qs:
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/query.py", line 250, in __iter__
    self._fetch_all()
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/query.py", line 1121, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/query.py", line 106, in __iter__
    for row in compiler.results_iter(chunked_fetch=self.chunked_fetch):
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 841, in results_iter
    results = self.execute_sql(MULTI, chunked_fetch=chunked_fetch)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
    cursor.execute(sql, params)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/raven/contrib/django/client.py", line 127, in execute
    return real_execute(self, sql, params)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: relation "sfm_pc_membershiporganizationdownload" does not exist
LINE 1: ...ershiporganizationdownload"."member_realend" FROM "sfm_pc_me...

Also, download files were empty. Looking at #678, it looked like the materialized views backing downloads weren't created, so I ran the make_materialized_views command manually. That addressed the errors and fixed downloads. We should double-check that this management command runs whenever it needs to, so the views exist when we expect them to.
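
One way to double-check this could be a small guard that compares the views the app expects against what Postgres reports in its `pg_matviews` system view. This is only a sketch, assuming the download relations are materialized views and that a DB-API cursor is available. The helper name `missing_materialized_views` and the idea of passing in an expected list are hypothetical; the only view name taken from the traceback above is `sfm_pc_membershiporganizationdownload`.

```python
def missing_materialized_views(cursor, expected):
    """Return the expected materialized views that Postgres does not report.

    `cursor` is any DB-API cursor connected to the app database (assumption).
    `expected` is a list of view names the downloads depend on (hypothetical).
    """
    # pg_matviews is the Postgres system view that lists materialized views.
    cursor.execute("SELECT matviewname FROM pg_matviews")
    existing = {row[0] for row in cursor.fetchall()}
    # Anything expected but not present would need make_materialized_views.
    return sorted(set(expected) - existing)
```

A deploy script or health check could call this and fail loudly if the returned list is not empty, rather than letting the first download request surface the problem.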

Even with that fix, download requests are pretty sluggish. So, I looked at the config for the process that runs the app, and it seems like we're only using one worker to fulfill requests: gunicorn -w 1 -t 180 --log-level debug -b 127.0.0.1:8000 sfm_pc.wsgi:application

If there's sufficient traffic when a download is requested, this could lead to a bottleneck and some requests may time out. I think we should address this in the Heroku migration rather than investing additional time into our current setup (especially since we're up 99 percent of the time). We might also consider ways of making the downloads more efficient or pushing them off into asynchronous work so they don't block other requests.
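
For reference, a rough sketch of what a multi-worker setup might look like as a Gunicorn config file. The file name `gunicorn.conf.py` is an assumption (the current setup passes flags on the command line), and the worker formula is the starting point suggested in Gunicorn's documentation, not something tuned for this server. The timeout, log level, and bind address mirror the existing command.

```python
# gunicorn.conf.py (hypothetical file, not part of the current deployment)
import multiprocessing

# Gunicorn's docs suggest (2 x cores) + 1 workers as a starting point.
# More workers also means more memory, so this may need lowering.
workers = multiprocessing.cpu_count() * 2 + 1

# Same values as the existing command line flags.
timeout = 180            # -t 180
loglevel = "debug"       # --log-level debug
bind = "127.0.0.1:8000"  # -b 127.0.0.1:8000
```

It would be loaded with `gunicorn -c gunicorn.conf.py sfm_pc.wsgi:application`. With more than one worker, a slow download would tie up a single worker instead of the whole site.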

@tlongers
Member

tlongers commented Feb 16, 2021

Thanks for this report. In the past (see #489), we have also had issues with bots/scrapers hitting the download links that have caused performance issues. May be worth checking for this. Generally, if users are trying to download a lot of data and it breaks, we get a polite email asking if we're ok (our users are lovely!).

I agree we need to rethink how we handle downloads as this has caused huge headaches. Our basic use case, which could need updating, is that users want all the data for a country and aren't particularly interested in per-entity slices. My inclination is to sidestep the programmatic aspects and simply offer up a copy of the spreadsheet used to create the data in WWIC. So, after import is complete, place the import spreadsheet as a static asset the user can just grab. That needs some thinking about as well though.

@hancush
Author

hancush commented Feb 17, 2021

I've disabled downloads, but performance is still suffering. Consulting the access logs, we're getting a lot of traffic to search pages from Bing and Petal bots, specifically with assorted parameters for sort and number of rows. Short of barring crawling of search results entirely, I think adding the nofollow attribute to the sort and rows links could help, though the bots may take a little time to recognize the changes.
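
For anyone repeating this kind of log check, a minimal sketch of tallying requests per user agent. It assumes the access log uses the common "combined" format, where the user agent is the last quoted field on each line; the function name `count_user_agents` and the log path in the usage note are hypothetical.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
USER_AGENT = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per user agent string across access log lines."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Something like `count_user_agents(open("access.log")).most_common(10)` would then list the ten busiest clients, which makes heavy crawlers easy to spot.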

@tlongers
Member

Thanks, we're fine being indexed of course but this seems quite excessive. If the additional nofollow tags don't change much by tomorrow, PetalBot and Bing get a ban, per robots.txt.

User-agent: PetalBot
Disallow: /

I'm beginning to wonder whether #357 (making ourselves more discoverable and offering better metadata to search engines) is starting to bite as well.
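
Before deploying rules like the ones above, they can be sanity-checked with Python's standard `urllib.robotparser`. A small sketch follows; the `bingbot` entry is an assumption about how a Bing block would be written (only the PetalBot rule appears above), and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# The PetalBot rule is from this thread; the bingbot rule is an assumed equivalent.
ROBOTS_TXT = """\
User-agent: PetalBot
Disallow: /

User-agent: bingbot
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed("PetalBot", "/en/search/")` is false while crawlers not named in the file remain allowed. Note that robots.txt is advisory, so this only confirms what well-behaved bots should do.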

@hancush
Author

hancush commented Feb 17, 2021

Yes to revisiting SEO. In addition to improving the metadata, I assume we're allowing crawling of the search results in order to expose links to and thereby index personnel, units, and incidents, but we could achieve that without the performance hit by adding a site map.

In the meantime, I've added the nofollow directive to most links on the search page. I'll give the bots in question a little time to act right and, if they don't comply by tomorrow morning (8 a.m. Central), I'll go ahead and block them from crawling the site.
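
On the site map idea, a bare-bones sketch of what generating one could look like using only the standard library. The function name `build_sitemap` and any URLs passed to it are hypothetical, and in a Django project the built-in `django.contrib.sitemaps` framework may well be the more natural route.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Feeding it the detail page URLs for personnel, units, and incidents would give crawlers a direct route to those pages without walking the search results.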

@tlongers
Member

tlongers commented Feb 18, 2021

Looks to me like the bot-caused performance issue is resolved by the addition of nofollow attributes to the table sorts. That done, what do we have to take forward from this issue that will require further work? I can see three things:

  • the mechanism by which we deliver downloads;
  • site behaviour and performance monitoring, and the attendant privacy and regulatory compliance (GDPR etc); and
  • SEO and sitemap/crawler-friendly indexing stuff.

What else do you see in this issue?

@tlongers tlongers changed the title Production site is intermittently timing out [production] Production site is intermittently timing out Feb 18, 2021
@hancush
Author

hancush commented Feb 18, 2021

The big task I see is performance tuning across the site. That would entail setting up automated load testing based on expected number of users (analytics should give us a good idea of what's normal, as well as some signal re: extremes, e.g., traffic spikes after you promote a launch) and making improvements, such as caching, to accommodate those patterns of use. N.b., SEO improvements can also net performance benefits if they allow us to, e.g., disallow crawling of the search pages.

There are some changes in progress, namely the search refactor and upcoming migration to Heroku, that will affect site performance, so I would wait to tune until after those are complete.

I'll also add that I wonder if the issue with downloads was a red herring and it isn't that downloads are particularly heavy / sluggish but that heavy bot traffic was causing sluggish requests across the site. With that said, I do like your idea of making the source sheets available, especially now that a canonical version of location data will be available and per-country slices would be more useful to users than the current entity-level slices.

@hancush
Author

hancush commented Feb 18, 2021

We're getting alerts of timeouts again, and I'm still seeing the problematic bot traffic. I'm thinking it's time to block Bing and PetalBot, at least temporarily. (We might decide to re-allow them if we disable search result crawling.)

@tlongers
Member

Okay. Bing and PetalBot begone!

@hancush
Author

hancush commented Feb 18, 2021

Done and done. Looking snappy! Given the huge difference blocking the bots has made, @tlongers, I'm even more convinced downloads were not the culprit. I've turned them back on and confirmed that they're just as snappy as the rest of the site. That's not to say we can't improve upon them later. 🙂

If the site is still stable in a few days, say Monday AM my time, I'll go ahead and close this issue. Meanwhile, I think we have a few things to spin off here:

Have I missed anything?

@tlongers
Member

Thanks. I've commented on #357, and created #729, #730 and #731 to develop these follow-on issues. Can you check over them quickly, add any other thoughts or text that's needed, and then close the present issue?

@hancush
Author

hancush commented Feb 23, 2021

These look great, @tlongers, thank you so much for your issue farming. 🐮🤠

@hancush hancush closed this as completed Feb 23, 2021