
Show when runs are being processed #322

Open · wants to merge 8 commits into master

Conversation

takluyver (Member)

Put an icon next to the run number while a processing job is working on a run; hovering over that cell shows more details in a tooltip. This also means that new runs appear in the table when the backend starts processing them, instead of when the first job finishes.

[Screencast attached: 2024-08-29 12-35-34]

The status column would be another option, but we recently discussed hiding that by default.
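For illustration, a minimal sketch of how such an in-progress indicator could be attached to a run-number cell, assuming PyQt5 (the GUI toolkit mentioned later in this thread); the icon path and helper names are hypothetical, not the actual code in this PR:

```python
from PyQt5.QtGui import QIcon, QStandardItem

def mark_run_processing(run_item: QStandardItem, job_details: str):
    """Show a 'processing' icon on the run-number cell, with details in the tooltip."""
    run_item.setIcon(QIcon("icons/processing.svg"))   # hypothetical icon resource
    run_item.setToolTip(f"Processing: {job_details}")

def mark_run_finished(run_item: QStandardItem):
    """Clear the indicator once a 'finished' message arrives for this run."""
    run_item.setIcon(QIcon())
    run_item.setToolTip("")
```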

takluyver added the enhancement label on Aug 29, 2024

codecov bot commented Aug 29, 2024

Codecov Report

Attention: Patch coverage is 67.25146% with 56 lines in your changes missing coverage. Please review.

Project coverage is 74.50%. Comparing base (c295f8d) to head (987f966).

Files with missing lines                Patch %   Lines
damnit/backend/extraction_control.py    40.42%    28 Missing ⚠️
damnit/gui/table.py                     74.50%    13 Missing ⚠️
damnit/backend/extract_data.py          83.58%    11 Missing ⚠️
damnit/gui/main_window.py                0.00%     4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #322      +/-   ##
==========================================
- Coverage   74.81%   74.50%   -0.32%     
==========================================
  Files          32       32              
  Lines        4892     5028     +136     
==========================================
+ Hits         3660     3746      +86     
- Misses       1232     1282      +50     


@JamesWrigley (Member) commented Aug 29, 2024

I confess I don't really like that the implementation depends on the GUI running throughout the processing time; it would be quite confusing to see some runs being processed in one window but not in another (particularly important for long-running jobs like those at SPB and MID). I'd prefer to do this properly and save the job state in the database.

Other things:

  • If I read the code correctly, this will show a job as still running even if the Slurm job was preempted, which is quite possible on exfel. We should periodically (every ~2 minutes or so) check Slurm to see whether the job is still alive (a rough sketch of such a check follows this list).
  • This also doesn't account for multiple jobs running simultaneously for the same run, but I think we can overlook that for now since we don't really support it anyway. (oops, no that's not true)
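A rough sketch of what such a liveness check could look like, shelling out to squeue; how job IDs are tracked and when the check runs are assumptions, not code from this PR:

```python
import subprocess

def slurm_job_alive(job_id: str) -> bool:
    """Heuristic: return True while Slurm still lists the job in the queue."""
    # `squeue -h -j <id> -o %T` prints the job state (PENDING, RUNNING, ...)
    # and prints nothing once the job has left the queue.
    result = subprocess.run(
        ["squeue", "-h", "-j", job_id, "-o", "%T"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() != ""
```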

@JamesWrigley (Member)

We should also have at least some tests 😅

@takluyver (Member, Author)

I know it doesn't work correctly for GUIs launched while a run is already processing, but I chose to do it this way because:

  • I'm trying to avoid writing to the DB when every extractor process starts, because the thundering herd can hit database timeouts waiting for the lock.
  • We probably want the Kafka messages anyway, for prompt updates, so I don't think this makes it any harder to do a better implementation later.
    • It might even be possible to use Kafka's own persistence, rewinding a few hours and replaying messages from before the GUI started. Not sure about this yet.
    • Another option would be for the extract_data machinery to send out 'still running' messages every couple of minutes while the child process runs the context file, so any newly launched GUI gets the correct state after a while (sketched after this list).
  • From what I've seen at FXE, it's normal to leave windows open for long periods, so the simple implementation is already useful.
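To illustrate the 'still running' heartbeat option from the list above: a minimal sketch assuming kafka-python as the messaging client and a JSON payload; the topic name, message fields, and the `proc` handle (the subprocess running the context file) are illustrative rather than DAMNIT's actual schema:

```python
import json
import threading

from kafka import KafkaProducer  # assumed client, not necessarily what DAMNIT uses

def heartbeat_while_running(proc, producer: KafkaProducer, topic: str, run: int,
                            interval: float = 120.0):
    """Send a 'running' message every `interval` seconds until `proc` exits."""
    stop = threading.Event()

    def beat():
        while not stop.is_set():
            producer.send(topic, json.dumps({"run": run, "status": "running"}).encode())
            stop.wait(interval)

    threading.Thread(target=beat, daemon=True).start()
    proc.wait()            # block until the context-file child process finishes
    stop.set()
    producer.send(topic, json.dumps({"run": run, "status": "finished"}).encode())
```

With something like this, a newly launched GUI would pick up the correct state within one heartbeat interval, which is what the discussion of time-limited inconsistency below is about.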

> this will show a job as still running even if the Slurm job was preempted, which is quite possible on exfel. We should periodically (every ~2 minutes or so) check Slurm to see whether the job is still alive.

Yup, makes sense.

@JamesWrigley (Member)

The main reason I'm against this implementation is that it can lie to the user. That's a hard blocker for me: we should avoid displaying wrong information at all costs. A secondary reason is that we need the same functionality for the web interface, so I'd prefer we didn't implement something that will only work for the Qt GUI.

> From what I've seen at FXE, it's normal to leave windows open for long periods, so the simple implementation is already useful.

I would rather not have the feature at all than display potentially wrong information 🤷 Even apart from that, other users doing analysis sometimes open the GUI via FastX, so it doesn't really matter whether the session on the control hutch PC is always running.

@takluyver (Member, Author)

Would you be OK with this if any incorrect information was time-limited, so it got the right state within, say, 60 seconds after you launch the GUI? Or would that still be unacceptable?

> we need the same functionality for the web interface, so I'd prefer we didn't implement something that will only work for the Qt GUI.

I'd still build that around the backend sending out messages similar to these, just that some server piece would have to pass them on to the frontend. I don't see this as something that only works for Qt (besides the parts in the GUI code, of course).

@JamesWrigley (Member)

Not a fan of eventual consistency in interfaces 😛 It's also kinda confusing that people opening a fresh session will think that certain runs have just started reprocessing. But as long as it's not too long then it's OK, though I'd lightly prefer a max of around 30s rather than 60s.

@takluyver (Member, Author)

OK, runs that are already being processed should now show up within 10 seconds, i.e. the jobs send a 'running' message every 10 seconds.

The GUI checks every 2 minutes, as you suggested, for Slurm jobs that have exited without sending a 'finished' message. I think it should also be feasible to catch most types of failure & cancellation and send the message before the job exits, but this is a good backstop in any case.
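On the GUI side, that two-minute reconciliation could look roughly like the sketch below (PyQt5 QTimer); the tracked-jobs mapping and the callbacks are hypothetical stand-ins for whatever the table code actually provides:

```python
from PyQt5.QtCore import QTimer

class ProcessingTracker:
    """Clear 'processing' indicators for Slurm jobs that died without a 'finished' message."""

    def __init__(self, parent, job_alive, clear_indicator):
        # job_alive(job_id) -> bool, e.g. an squeue check like the one sketched earlier;
        # clear_indicator(run) removes the icon from that run's row.
        self.jobs = {}                       # run number -> Slurm job ID (hypothetical layout)
        self.job_alive = job_alive
        self.clear_indicator = clear_indicator
        self.timer = QTimer(parent)
        self.timer.timeout.connect(self.reconcile)
        self.timer.start(2 * 60 * 1000)      # every 2 minutes, as discussed above

    def reconcile(self):
        for run, job_id in list(self.jobs.items()):
            if not self.job_alive(job_id):
                self.clear_indicator(run)
                del self.jobs[run]
```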

Resolved review thread on damnit/backend/extract_data.py
Resolved (outdated) review thread on damnit/gui/table.py
@JamesWrigley (Member) commented Aug 30, 2024

Just for context, I would still like to refactor this in the future to save the job statuses in the database, so we can alert users if a job was preempted or failed for some reason (e.g. timeouts).
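A possible shape for that refactor, assuming the project's database is SQLite; the table and column names are purely illustrative:

```python
import sqlite3

def record_job_status(db_path: str, run: int, job_id: str, status: str):
    """Upsert the latest status ('running', 'finished', 'preempted', 'failed', ...) for a job."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS processing_jobs (
                run INTEGER, slurm_job_id TEXT, status TEXT, updated REAL,
                PRIMARY KEY (run, slurm_job_id)
            )
        """)
        conn.execute(
            "INSERT OR REPLACE INTO processing_jobs VALUES (?, ?, ?, strftime('%s', 'now'))",
            (run, job_id, status),
        )
```

With the status persisted, a newly opened GUI or the web interface could read the same state from the database rather than relying on Kafka messages it may have missed.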
