What happened
While reading through the job_pipeline task in intel_owl/tasks.py (lines 247-266), I noticed that when job.execute() throws an exception, the except block marks individual plugin reports as FAILED but it never actually updates the Job object itself.
Here's the relevant code:
```python
@shared_task(base=FailureLoggedTask, name="job_pipeline", soft_time_limit=100)
def job_pipeline(job_id: int):
    from api_app.models import Job

    job = Job.objects.get(pk=job_id)
    try:
        job.execute()
    except Exception as e:
        logger.exception(e)
        for report in (
            list(job.analyzerreports.all())
            + list(job.connectorreports.all())
            + list(job.pivotreports.all())
            + list(job.visualizerreports.all())
        ):
            report.status = report.STATUSES.FAILED.value
            report.save()
```
The issue is that inside Job.execute() (line 714 in models.py), the very first thing it does is set self.status = self.STATUSES.RUNNING and save. So by the time an exception is raised (say, during _get_pipeline() or _get_signatures(), or because the broker is temporarily down), the job is already marked as RUNNING in the DB.
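Paraphrased (not a verbatim quote), the top of execute() looks roughly like this:

```python
# api_app/models.py, around line 714 (paraphrased, not verbatim)
def execute(self):
    self.status = self.STATUSES.RUNNING
    self.save()
    # anything raised from here on, e.g. in _get_pipeline(), _get_signatures(),
    # or when runner() hits the broker, leaves the row stuck at RUNNING
    ...
```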
But the except block above never sets it to FAILED. It also doesn't:
- Set job.finished_analysis_time
- Send a WebSocket notification via JobConsumer.serialize_and_send_job(job)
- Update the parent Investigation status
So the job just stays stuck in RUNNING forever. The job_set_final_status task (line 224) is supposed to handle this cleanup, but it's chained as the last step of the Celery pipeline; if the pipeline never starts, that task never runs.
I noticed there's already a check_stuck_analysis periodic task (line 124) that catches these after ~25 minutes and marks them as failed, but that's more of a safety net. It also doesn't send any WebSocket notifications, so the user still sees a stuck spinner until they manually refresh.
Worth noting: the run_plugin task (line 268) does call JobConsumer.serialize_and_send_job(job) after its exception handling, so there's already a pattern for this in the codebase; job_pipeline just doesn't follow it.
Environment
- OS: Ubuntu (Docker)
- IntelOwl version: develop (v6.6.0)
What did you expect to happen
If job.execute() fails, the except block should also clean up the job state properly. Something like:
- Set job.status to FAILED
- Append the error to job.errors
- Set job.finished_analysis_time = now()
- Call JobConsumer.serialize_and_send_job(job) so the frontend knows right away
- Update the Investigation status if the job belongs to one
Basically the same cleanup job_set_final_status would have done if the pipeline had completed normally.
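A rough sketch of what that could look like, reusing the names from the snippet above. The Investigation handling is the part I'm least sure about: the investigation field and set_correct_status() are assumptions, and the real fix should reuse whatever job_set_final_status actually does.

```python
from celery import shared_task
from django.utils.timezone import now

# JobConsumer, FailureLoggedTask and logger as already imported in intel_owl/tasks.py

@shared_task(base=FailureLoggedTask, name="job_pipeline", soft_time_limit=100)
def job_pipeline(job_id: int):
    from api_app.models import Job

    job = Job.objects.get(pk=job_id)
    try:
        job.execute()
    except Exception as e:
        logger.exception(e)
        # existing behavior: fail the individual plugin reports
        for report in (
            list(job.analyzerreports.all())
            + list(job.connectorreports.all())
            + list(job.pivotreports.all())
            + list(job.visualizerreports.all())
        ):
            report.status = report.STATUSES.FAILED.value
            report.save()
        # missing cleanup: move the Job itself to a terminal state
        job.status = job.STATUSES.FAILED.value
        job.errors.append(str(e))
        job.finished_analysis_time = now()
        job.save(update_fields=["status", "errors", "finished_analysis_time"])
        # notify the frontend right away instead of leaving the spinner
        JobConsumer.serialize_and_send_job(job)
        # propagate to the parent Investigation, if any (field and method
        # names assumed; reuse whatever job_set_final_status calls here)
        if job.investigation:
            job.investigation.set_correct_status(save=True)
```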
How to reproduce your issue
- Cause an exception during job.execute(): for example, a DB error in _get_signatures(), a broken PythonConfig import, or the broker being momentarily unreachable when runner() is called
- The job will be stuck as RUNNING in the database indefinitely
- Frontend shows an infinite spinner with no error feedback
- Eventually check_stuck_analysis cleans it up after ~25 min, but without notifying the frontend
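If you don't want to wait for a real broker hiccup, you can force it from a Django shell. This is illustrative only: it assumes _get_signatures is a method on Job (per the call sites above) and that the STATUSES enum has a PENDING member:

```python
from unittest import mock

from api_app.models import Job
from intel_owl.tasks import job_pipeline

# grab any queued job (status enum value assumed)
job = Job.objects.filter(status=Job.STATUSES.PENDING.value).first()

# make execute() fail *after* it has flipped the status to RUNNING
with mock.patch.object(Job, "_get_signatures", side_effect=RuntimeError("boom")):
    job_pipeline(job.pk)

job.refresh_from_db()
print(job.status)                  # "running": never reaches a terminal state
print(job.finished_analysis_time)  # None
```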
You can also just read the code to confirm:
- intel_owl/tasks.py lines 247-266 → the incomplete except block
- intel_owl/tasks.py lines 224-232 → job_set_final_status, which does the proper cleanup
- api_app/models.py lines 714-715 → where execute() sets status to RUNNING before building the pipeline
Error messages and logs
No error on the frontend; that's the whole problem. The user just sees a job stuck in "Running" forever. Server-side, the exception is logged via logger.exception(e), but from the user's perspective nothing happens.
DB state of a stuck job:
- job.status = running (never reaches a terminal state)
- job.finished_analysis_time = NULL
- Because finished_analysis_time is NULL, remove_old_jobs won't clean these up either; they just accumulate
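For reference, the accumulating rows are easy to spot from a Django shell (status string taken from the DB state above):

```python
from api_app.models import Job

# stuck rows: non-terminal status and no finish timestamp
stuck = Job.objects.filter(status="running", finished_analysis_time__isnull=True)
print(stuck.count(), "jobs stuck in RUNNING")
```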