new production system is twitchy about running the monitor status job #27

elrayle opened this issue May 13, 2022 · 1 comment

elrayle commented May 13, 2022

Normal process

See Monitoring Connections to Authorities for a description of the normal processing.

Current problem

This process has been twitchy: it runs sometimes and not others, and we were not able to pin down why. It worked reliably in the previous production system for years.

Potential future work

  • Production is currently configured to run jobs with the :async adapter. If you set up Sidekiq or some other job system, you can update the production environment to run jobs in the background (see the sketch after this list).
  • If you switch to background jobs, the Pingdom process should still work. Instead of timing out, the page will return an error on the 4:10 am load if one of the authorities is failing. The slight change is that you will no longer see the down/up pattern; you will only see a down when an authority is failing.
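
A minimal sketch of that configuration change, assuming the app queues work through Active Job and that the sidekiq gem has been added to the Gemfile with a worker process deployed alongside the web tier:

```ruby
# config/environments/production.rb
Rails.application.configure do
  # Hand Active Job work to Sidekiq instead of the in-process :async
  # adapter, so queued jobs survive restarts and don't compete with
  # web requests for threads.
  config.active_job.queue_adapter = :sidekiq
end
```

With a real queue backend, a job that dies mid-run is at least visible in the Sidekiq Web UI, which may also help pin down why the monitor status job runs only intermittently.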

elrayle commented May 13, 2022

I tried the following while debugging:

  • Redeploy the system. This seemed to work sometimes and have no effect other times. Redeploying also has its own issues: sometimes the system doesn't come back up.

  • Turn off the performance calculations by disabling the performance-related displays in the production env file in S3. The calculations are very processor-intensive.

```
DISPLAY_PERFORMANCE_GRAPH=false
DISPLAY_PERFORMANCE_DATATABLE=false
```

NOTE: I've never gotten graph generation to work in Elastic Beanstalk, even though it runs fine on my laptop. I wasn't able to test it in the new setup because of this issue.

  • I reset the preferred time zone so that the cache expired during the day, when I could watch it fail. Only it never failed when I did that, which could mean something unique is happening during the night that causes the failure.
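
The mechanism under test in that last step is a cache entry that expires once a day at an hour tied to the preferred time zone. A minimal sketch of that pattern using ActiveSupport, where the expiration hour, zone name, and cache key are illustrative assumptions rather than qa_server's actual settings:

```ruby
require "active_support/all"

# Hypothetical sketch of a daily-expiring status cache; the hour,
# zone, and key below are assumptions for illustration only.
EXPIRATION_HOUR = 3
zone = ActiveSupport::TimeZone["Eastern Time (US & Canada)"]

now = zone.now
next_expiration = now.change(hour: EXPIRATION_HOUR) # today at 3:00 am
next_expiration += 1.day if next_expiration <= now  # already past? roll to tomorrow

cache = ActiveSupport::Cache::MemoryStore.new
cache.fetch("monitor_status", expires_in: next_expiration - now) do
  "result of the expensive authority checks" # placeholder payload
end
```

Shifting EXPIRATION_HOUR into the daytime moves the expensive refresh to a window where it can be watched, without changing anything else about the job.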
