[production] Production site is intermittently timing out #722

hancush · 2021-02-16T14:19:54Z

We've set up downtime notifications for our properties and it looks like WWIC is timing out from time to time. (We're at 99+ percent uptime over the last 30 days, FWIW.)

I started to look into this and noticed we were getting exceptions like this every time a download was requested:

Error handling request
Traceback (most recent call last):
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
psycopg2.ProgrammingError: relation "sfm_pc_membershiporganizationdownload" does not exist
LINE 1: ...ershiporganizationdownload"."member_realend" FROM "sfm_pc_me...
                                                             ^


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/gunicorn/workers/sync.py", line 139, in handle_request
    for item in respiter:
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/djqscsv/djqscsv.py", line 154, in _iter_csv
    for record in values_qs:
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/query.py", line 250, in __iter__
    self._fetch_all()
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/query.py", line 1121, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/query.py", line 106, in __iter__
    for row in compiler.results_iter(chunked_fetch=self.chunked_fetch):
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 841, in results_iter
    results = self.execute_sql(MULTI, chunked_fetch=chunked_fetch)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
    cursor.execute(sql, params)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/raven/contrib/django/client.py", line 127, in execute
    return real_execute(self, sql, params)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/datamade/.virtualenvs/sfm/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: relation "sfm_pc_membershiporganizationdownload" does not exist
LINE 1: ...ershiporganizationdownload"."member_realend" FROM "sfm_pc_me...

Also, download files were empty. Looking at #678, it looked like the materialized views backing downloads weren't created, so I ran the make_materialized_views command manually. That addressed the errors and fixed downloads. We should double check that the management command gets run when it needs to be, so views exist when we expect them to.

Even with that fix, download requests are pretty sluggish. So, I looked at the config for the process that runs the app, and it seems like we're only using one worker to fulfill requests: gunicorn -w 1 -t 180 --log-level debug -b 127.0.0.1:8000 sfm_pc.wsgi:application

If there's sufficient traffic when a download is requested, this could lead to a bottleneck and some requests may time out. I think we should address this in the Heroku migration rather than investing additional time into our current setup (especially since we're up 99 percent of the time). We might also consider ways of making the downloads more efficient or pushing them off into asynchronous work so they don't block other requests.
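For illustration, here is a minimal sketch of what a multi-worker setup could look like as a Gunicorn config file. The file name gunicorn.conf.py and the helper suggested_workers are assumptions for the example, not part of the current deployment; the (2 x cores) + 1 figure is the starting point the Gunicorn docs suggest, and the right number would need testing against the actual server.

```python
# gunicorn.conf.py: hypothetical config file, not the project's actual setup
import multiprocessing


def suggested_workers(cpu_count: int) -> int:
    """Return the Gunicorn docs' suggested starting point: (2 x cores) + 1."""
    return 2 * cpu_count + 1


bind = "127.0.0.1:8000"  # same address as the current command
timeout = 180            # same value as the current -t 180
loglevel = "debug"       # same as the current --log-level debug
workers = suggested_workers(multiprocessing.cpu_count())
```

With a file like this in place, the process could be started with something like gunicorn -c gunicorn.conf.py sfm_pc.wsgi:application. More workers means more memory, so this is a trade-off rather than a free win.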


tlongers · 2021-02-16T14:58:39Z

Thanks for this report. In the past (see #489), we have also had issues with bots/scrapers hitting the download links that have caused performance issues. May be worth checking for this. Generally, if users are trying to download a lot of data and it breaks, we get a polite email asking if we're ok (our users are lovely!).

I agree we need to rethink how we handle downloads as this has caused huge headaches. Our basic use case, which could need updating, is that users want all the data for a country and aren't particularly interested in per-entity slices. My inclination is to sidestep the programmatic aspects and simply offer up a copy of the spreadsheet used to create the data in WWIC. So, after import is complete, place the import spreadsheet as a static asset the user can just grab. That needs some thinking about as well though.

hancush · 2021-02-17T18:14:00Z

I've disabled downloads, but performance is still suffering. Consulting the access logs, we're getting a lot of traffic to search pages from Bing and Petal bots, specifically with assorted parameters for sort and number of rows. Short of barring crawling of search results entirely, I think adding the nofollow attribute to the sort and rows links could help, though the bots may take a little time to recognize the changes.
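For anyone wanting to repeat the access log check, a rough sketch of tallying bot hits by user agent follows. It assumes the logs are in the common "combined" format, where the user agent is the last quoted field; the sample lines, paths, and the BOT_MARKERS list are made up for illustration and are not taken from the real logs.

```python
import re
from collections import Counter

# In the combined log format, each line ends with: "referer" "user agent"
UA_PATTERN = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"\s*$')

# Substrings to look for in the user agent (assumed, case-insensitive)
BOT_MARKERS = ("bingbot", "petalbot")


def count_bot_hits(lines):
    """Count requests per bot marker found in each line's user agent."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that don't look like combined-format entries
        user_agent = match.group("ua").lower()
        for marker in BOT_MARKERS:
            if marker in user_agent:
                counts[marker] += 1
    return counts


# Fabricated sample lines, for illustration only
sample = [
    '203.0.113.5 - - [17/Feb/2021:18:00:01 +0000] "GET /search/?sort=a&rows=50 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '203.0.113.6 - - [17/Feb/2021:18:00:02 +0000] "GET /search/?sort=b&rows=25 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; PetalBot)"',
    '203.0.113.5 - - [17/Feb/2021:18:00:03 +0000] "GET /search/?sort=c&rows=10 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0)"',
    '198.51.100.7 - - [17/Feb/2021:18:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(count_bot_hits(sample))
```

Against a real log, the same function could be fed an open file handle instead of the sample list.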

tlongers · 2021-02-17T19:11:09Z

Thanks, we're fine being indexed of course but this seems quite excessive. If the additional nofollow tags don't change much by tomorrow, PetalBot and Bing get a ban, per robots.txt.

User-agent: PetalBot
Disallow: /
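As a sanity check before deploying rules like the above, the standard library's urllib.robotparser can confirm which crawlers a given robots.txt would turn away. This is only a sketch: the bingbot entry and the /search/ path are assumptions added for the example, and robots.txt is advisory, so it only affects crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# PetalBot rule as quoted above; the bingbot entry is an assumed addition
ROBOTS_TXT = """\
User-agent: PetalBot
Disallow: /

User-agent: bingbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent, path="/search/"):
    """Return True if the parsed rules permit user_agent to fetch path."""
    return parser.can_fetch(user_agent, path)


for agent in ("PetalBot", "bingbot", "Googlebot"):
    print(agent, "allowed" if is_allowed(agent) else "blocked")
```

Crawlers without a matching entry (Googlebot here) stay allowed, since there is no catch-all rule.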

Beginning to wonder if #357 isn't starting to bite a bit as well: making ourselves more discoverable, and offering better metadata, to search engines.

hancush · 2021-02-17T20:16:08Z

Yes to revisiting SEO. In addition to improving the metadata, I assume we're allowing crawling of the search results in order to expose links to and thereby index personnel, units, and incidents, but we could achieve that without the performance hit by adding a site map.
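To make the site map idea concrete, here is a minimal sketch that builds a sitemap.xml document from a list of detail page URLs using only the standard library. The example.org URLs and the build_sitemap helper are placeholders, not real WWIC paths; in a Django project, the built-in django.contrib.sitemaps framework would probably be the more natural route.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs standing in for personnel, unit, and incident pages
example_urls = [
    "https://example.org/person/1/",
    "https://example.org/organization/2/",
    "https://example.org/violation/3/",
]

print(build_sitemap(example_urls))
```

With entity pages listed this way, crawlers would not need to walk the search results to find them.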

In the meantime, I've added the nofollow directive to most links on the search page. I'll give the bots in question a little time to act right and, if they don't comply by tomorrow morning (8 a.m. Central), I'll go ahead and block them from crawling the site.

tlongers · 2021-02-18T10:27:14Z

Looks to me like the bot-caused performance issue is resolved by the addition of nofollow attributes to the table sorts. That done, what do we have to take forward from this issue that will require further work? I can see three things:

the mechanism by which we deliver downloads;
site behaviour and performance monitoring, and the attendant privacy and regulatory compliance (GDPR etc); and
SEO and sitemap/crawler-friendly indexing stuff.

What else do you see in this issue?

hancush · 2021-02-18T14:31:07Z

The big task I see is performance tuning across the site. That would entail setting up automated load testing based on expected number of users (analytics should give us a good idea of what's normal, as well as some signal re: extremes, e.g., traffic spikes after you promote a launch) and making improvements, such as caching, to accommodate those patterns of use. N.b., SEO improvements can also net performance benefits if they allow us to, e.g., disallow crawling of the search pages.
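As a very rough sketch of what automated load testing could look like, the snippet below fires a batch of concurrent requests and summarises the timings. Everything here is an assumption for illustration: run_load_test and fake_fetch are hypothetical names, the path is a placeholder, and fake_fetch just sleeps instead of hitting a server. A real test would swap in an HTTP call and traffic numbers drawn from analytics, or use a dedicated load testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_load_test(fetch, paths, concurrency=4):
    """Call fetch(path) for every path, using up to `concurrency` threads."""

    def timed(path):
        start = time.perf_counter()
        fetch(path)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        durations = list(pool.map(timed, paths))

    return {
        "requests": len(durations),
        "median_s": median(durations),
        "max_s": max(durations),
    }


def fake_fetch(path):
    """Stand-in for a real HTTP request; sleeps to simulate latency."""
    time.sleep(0.01)


summary = run_load_test(fake_fetch, ["/search/?rows=50"] * 20)
print(summary)
```

Comparing the median and max before and after a change such as caching would give a simple signal of whether the change helped.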

There are some changes in progress, namely the search refactor and upcoming migration to Heroku, that will affect site performance, so I would wait to tune until after those are complete.

I'll also add that I wonder if the issue with downloads was a red herring and it isn't that downloads are particularly heavy / sluggish but that heavy bot traffic was causing sluggish requests across the site. With that said, I do like your idea of making the source sheets available, especially now that a canonical version of location data will be available and per-country slices would be more useful to users than the current entity-level slices.

hancush · 2021-02-18T16:59:15Z

We're getting alerts of timeouts again, and I'm still seeing the problematic bot traffic. I'm thinking it's time to block Bing and PetalBot, at least temporarily. (We might decide to re-allow them if we disable search result crawling.)

tlongers · 2021-02-18T17:36:07Z

Okay. Bing and PetalBot begone!

hancush · 2021-02-18T21:39:18Z

Done and done. Looking snappy! Given the huge difference blocking the bots has made, @tlongers, I'm even more convinced downloads were not the culprit. I've turned them back on and confirmed that they're just as snappy as the rest of the site. That's not to say we can't improve upon them later. 🙂

If the site is still stable in a few days, say Monday AM my time, I'll go ahead and close this issue. Meanwhile, I think we have a few things to spin off here:

Update General page titles, improve page metadata, and miscellaneous SEO improvements #357 to cover general SEO improvements
Open new issue for load testing and performance tuning
Open new issue for revision to download
Open new issue for GDPR compliance and disclosure (will respond to your email about the current setup, but would be nice to have a place to track any related work)

Have I missed anything?

tlongers · 2021-02-23T10:09:50Z

Thanks. I've commented on #357, and created #729, #730 and #731 to develop these follow-on issues. Can you check over them quickly and add in any other thoughts or text that's needed, and then close the present issue?

hancush · 2021-02-23T15:41:34Z

These look great, @tlongers, thank you so much for your issue farming. 🐮🤠

tlongers added the performance label Feb 16, 2021

hancush mentioned this issue Feb 17, 2021

Redirect downloads, update message files to accommodate new text #723

Merged

hancush mentioned this issue Feb 17, 2021

Add rel="nofollow" to filter, sort, n result, and download links in search results #725

Merged

tlongers added the high priority label Feb 18, 2021

tlongers changed the title ~~Production site is intermittently timing out~~ [production] Production site is intermittently timing out Feb 18, 2021

hancush added a commit that referenced this issue Feb 18, 2021

Disallow crawling by Bingbot and PetalBot, connects #722

fe88954

This was referenced Feb 18, 2021

Disallow crawling by Bingbot and PetalBot #726

Merged

Revert PR adding download redirect #727

Merged

This was referenced Feb 23, 2021

General page titles, improve page metadata, and miscellaneous SEO improvements #357

Open

Load test and performance tuning WWIC #729

Open

Revisions to download mechanism #730

Closed

hancush closed this as completed Feb 23, 2021
