
[BUG] job_pipeline task does not set job's final status on exception; jobs get stuck in RUNNING forever #3653

@Abhishek9639

Description

What happened

While reading through the job_pipeline task in intel_owl/tasks.py (lines 247-266), I noticed that when job.execute() throws an exception, the except block marks individual plugin reports as FAILED but it never actually updates the Job object itself.

Here's the relevant code:

@shared_task(base=FailureLoggedTask, name="job_pipeline", soft_time_limit=100)
def job_pipeline(job_id: int):
    from api_app.models import Job

    job = Job.objects.get(pk=job_id)
    try:
        job.execute()
    except Exception as e:
        logger.exception(e)
        for report in (
            list(job.analyzerreports.all())
            + list(job.connectorreports.all())
            + list(job.pivotreports.all())
            + list(job.visualizerreports.all())
        ):
            report.status = report.STATUSES.FAILED.value
            report.save()

The issue is that inside Job.execute() (line 714 in models.py), the very first thing the method does is set self.status = self.STATUSES.RUNNING and save it. So by the time an exception is raised (say, during _get_pipeline() or _get_signatures(), or if the broker is temporarily down), the job is already marked as RUNNING in the DB.

But the except block above doesn't set it back to FAILED. It also doesn't:

  • Set job.finished_analysis_time
  • Send a WebSocket notification via JobConsumer.serialize_and_send_job(job)
  • Update the parent Investigation status

So the job just stays stuck in RUNNING forever. The job_set_final_status task (line 224) is supposed to handle this cleanup, but it's chained as the last step of the Celery pipeline; if the pipeline never starts, that task never runs.
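The chaining problem can be illustrated without Celery at all: a final-status callback attached to the end of a chain only runs if the chain actually starts. The names below are illustrative stand-ins, not the real Celery API.

```python
def run_chain(steps, final_cb):
    """Toy stand-in for a Celery chain: final_cb runs only after every step."""
    for step in steps:
        step()
    final_cb()

events = []

def build_pipeline():
    # Analogous to job.execute() raising before the chain is dispatched
    raise RuntimeError("broker down")

try:
    run_chain([build_pipeline], lambda: events.append("final_status"))
except RuntimeError:
    pass

print(events)  # [] -- the final-status step never ran
```

This is exactly the failure mode in job_pipeline: the cleanup step exists, but it is downstream of the step that failed.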

I noticed there's already a check_stuck_analysis periodic task (line 124) that catches these after ~25 minutes and marks them as failed, but that's more of a safety net. It also doesn't send any WebSocket notifications, so the user still sees a stuck spinner until they manually refresh.

Worth noting: the run_plugin task (line 268) does call JobConsumer.serialize_and_send_job(job) after exception handling, so there's already a pattern for this in the codebase; job_pipeline just doesn't follow it.

Environment

  1. OS: Ubuntu (Docker)
  2. IntelOwl version: develop (v6.6.0)

What did you expect to happen

If job.execute() fails, the except block should also clean up the job state properly. Something like:

  1. Set job.status to FAILED
  2. Append the error to job.errors
  3. Set job.finished_analysis_time = now()
  4. Call JobConsumer.serialize_and_send_job(job) so the frontend knows right away
  5. Update the Investigation status if the job belongs to one

Basically the same stuff job_set_final_status would have done if the pipeline had completed normally.
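The steps above can be sketched as a small cleanup helper. StubJob and the notify callable are stand-ins for the real Django Job model and JobConsumer.serialize_and_send_job; the actual fix would live in the except block of job_pipeline and call the existing IntelOwl helpers instead.

```python
from datetime import datetime, timezone

class StubJob:
    """Stand-in for api_app.models.Job, just enough to show the cleanup."""
    def __init__(self):
        self.status = "running"          # execute() set this before failing
        self.errors = []
        self.finished_analysis_time = None

def fail_job(job, exc, notify):
    """Mirror of the cleanup job_set_final_status would have done."""
    job.status = "failed"                        # reach a terminal state
    job.errors.append(str(exc))                  # surface the error to the user
    job.finished_analysis_time = datetime.now(timezone.utc)
    notify(job)                                  # e.g. JobConsumer.serialize_and_send_job

notified = []
job = StubJob()
fail_job(job, RuntimeError("broker unreachable"), notified.append)
print(job.status, job.finished_analysis_time is not None, len(notified))
# failed True 1
```

The Investigation-status update (step 5) is omitted here since it depends on the parent relation; in the real fix it would follow the same call job_set_final_status already makes.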

How to reproduce your issue

  1. Cause an exception during job.execute(): for example, a DB error in _get_signatures(), a broken PythonConfig import, or even just the broker being momentarily unreachable when runner() is called
  2. The job will be stuck as RUNNING in the database indefinitely
  3. Frontend shows an infinite spinner with no error feedback
  4. Eventually check_stuck_analysis cleans it up after ~25 min, but without notifying the frontend

You can also just read the code to confirm:

  • intel_owl/tasks.py line 247-266 → the incomplete except block
  • intel_owl/tasks.py line 224-232 → job_set_final_status that does the proper cleanup
  • api_app/models.py line 714-715 → where execute() sets status to RUNNING before building the pipeline

Error messages and logs

No error on the frontend; that's the whole problem. The user just sees a job stuck in "Running" forever. Server-side, the exception gets logged via logger.exception(e), but from the user's perspective nothing happens.

DB state of a stuck job:

  • job.status = running (never reaches a terminal state)
  • job.finished_analysis_time = NULL
  • Because finished_analysis_time is NULL, remove_old_jobs won't clean these up either; they just accumulate
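The accumulation can be spotted with a filter along these lines. The in-memory rows below are a stand-in for the Job table; in the real app this would be a queryset filter such as Job.objects.filter(status="running", finished_analysis_time__isnull=True).

```python
# Stand-in rows mimicking the Job table state described above
jobs = [
    {"id": 1, "status": "failed", "finished_analysis_time": "2024-01-01T10:00Z"},
    {"id": 2, "status": "running", "finished_analysis_time": None},   # stuck
    {"id": 3, "status": "running", "finished_analysis_time": None},   # stuck
]

# Stuck = non-terminal status with no finish timestamp
stuck = [j["id"] for j in jobs
         if j["status"] == "running" and j["finished_analysis_time"] is None]
print(stuck)  # [2, 3]
```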

Metadata

Labels

bug (Something isn't working)