Events collected, but not triggering next process #23

Open
Tracked by #12
franTarkenton opened this issue Apr 24, 2023 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@franTarkenton
Member

The listener has been deployed to OpenShift and is listening to events successfully.

I downloaded the database from the server and can see the events being recorded in the db cache.

Need to figure out why subsequent events are not being triggered properly, either by a restarted process or by the long-running process once all the data becomes available.

What is happening:

  • events are streaming in and being recorded to the SQLite db.

What should be happening:

  • when all the expected events are available, a message is emitted.

TODO:

  • use the db pulled from OpenShift to diagnose whether all the data is actually available.
  • figure out if all the data is available
  • if not, diagnose the listener config
  • if yes, diagnose the method that detects completion
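
The first two TODO items could be approached with a small script like the one below. It is only a sketch: the table name `events`, the column `event_key`, and the example keys are hypothetical, since the issue does not describe the cache schema, and they would need to be swapped for the real ones.

```python
import sqlite3


def missing_events(conn: sqlite3.Connection, expected: set[str]) -> list[str]:
    """Return the expected event keys that are absent from the cache, sorted."""
    # 'events' and 'event_key' are hypothetical names; adjust to the real schema.
    rows = conn.execute("SELECT event_key FROM events").fetchall()
    received = {row[0] for row in rows}
    return sorted(expected - received)


if __name__ == "__main__":
    # In-memory stand-in for the db file pulled from OpenShift.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_key TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO events VALUES (?)", [("file_a",), ("file_b",)])
    print(missing_events(conn, {"file_a", "file_b", "file_c"}))  # prints ['file_c']
```

An empty result would point at the completion-detection method; a non-empty one would point at the listener config.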

Def of Done:

  • events are collected
  • events are cached
  • events are acknowledged to message queue
  • detects that all data is now available
  • triggers next step in pipeline
  • crash recovery process checks cached events and recovers correctly
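
One way the caching, completion-detection, and crash-recovery items might fit together is sketched below. This is not the project's implementation: the `EventCache` class, the `events` and `emitted` tables, and the `emit` callback are all hypothetical names introduced for illustration, and acknowledging messages to the queue is left out.

```python
import sqlite3
from typing import Callable, Iterable


class EventCache:
    """SQLite-backed cache that fires a callback once all expected events arrive."""

    def __init__(self, conn: sqlite3.Connection, expected: Iterable[str],
                 emit: Callable[[], None]) -> None:
        self.conn = conn
        self.expected = set(expected)
        self.emit = emit
        # Hypothetical schema; the project's real tables may differ.
        conn.execute("CREATE TABLE IF NOT EXISTS events (event_key TEXT PRIMARY KEY)")
        conn.execute("CREATE TABLE IF NOT EXISTS emitted (id INTEGER PRIMARY KEY)")
        conn.commit()

    def record(self, event_key: str) -> bool:
        """Cache one event (duplicates are ignored), then check for completion."""
        self.conn.execute(
            "INSERT OR IGNORE INTO events (event_key) VALUES (?)", (event_key,)
        )
        self.conn.commit()
        return self.check_complete()

    def check_complete(self) -> bool:
        """Return True when every expected event is cached.

        Also serves as the recovery entry point: calling it right after a
        restart triggers the next step if the data completed before the crash.
        """
        rows = self.conn.execute("SELECT event_key FROM events").fetchall()
        received = {row[0] for row in rows}
        if not self.expected <= received:
            return False
        if self.conn.execute("SELECT 1 FROM emitted").fetchone() is None:
            # Emit before persisting the flag: a crash between the two lines
            # causes a repeat emit after restart rather than a lost one.
            self.emit()
            self.conn.execute("INSERT INTO emitted (id) VALUES (1)")
            self.conn.commit()
        return True
```

Emitting before persisting the flag gives at-least-once behaviour, which assumes the next pipeline step tolerates a repeated trigger; reversing the two statements would give at-most-once instead.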
franTarkenton added the bug label on Apr 24, 2023
franTarkenton self-assigned this on Apr 24, 2023
@franTarkenton
Member Author

The current process is set up to start the listener, after which it just monitors and logs the messages it receives. The process is getting rebooted every 45 minutes because it does not have a health check or a liveness probe configured. Working on adding a FastAPI endpoint that services the health and liveness probes. Once this is implemented, all the message events should show up in the logs, and I can then start debugging why some messages don't seem to be received. ATM the message events are lost when the pod dies (every 45 minutes).

@franTarkenton
Member Author

Listener now runs with readiness and liveness checks, which should eliminate the reboot of the pod every 45 minutes. Hoping that this will result in the expected events now showing up in the queue. The specific events missing from the database are the ones for the datasets in this directory:

https://hpfx.collab.science.gc.ca/20230529/WXO-DD/model_gem_global/15km/grib2/lat_lon/00/090/

Projects
Status: Review / QA
Development

No branches or pull requests

1 participant