Zombie running job blocks all subsequent task calls when process crashes #202

@2488583886

Summary

When a Codex task process crashes or is killed externally, its job record remains in status: "running" permanently. All subsequent task calls in the same Claude session are rejected with:

Task {job-id} is still running. Use /codex:status before continuing it.

This effectively blocks all Codex usage in that session until the user manually fixes the state file.

Steps to Reproduce

  1. Start a Codex task via /codex:rescue (foreground or background)
  2. The task process crashes or is killed (e.g., OOM, signal, Codex app-server timeout)
  3. Job record in state.json remains status: "running" with a dead PID
  4. All subsequent task calls in the same session fail immediately with "Task is still running"

Root Cause

In codex-companion.mjs, resolveLatestTrackedTaskThread() checks for active tasks:

const activeTask = visibleJobs.find(
  (job) => job.jobClass === "task" && 
  (job.status === "queued" || job.status === "running")
);
if (activeTask) {
  throw new Error(`Task ${activeTask.id} is still running.`);
}

This only checks the status field in the job record. It does not verify whether the process (stored in job.pid) is actually alive.
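A liveness probe for the recorded PID could look like the minimal sketch below. It relies on Node's documented behavior that process.kill(pid, 0) delivers no signal and only tests whether the process exists. The helper name isProcessAlive is hypothetical and is not part of codex-companion.mjs.

```javascript
// Hypothetical helper (not from codex-companion.mjs): reports whether a PID
// refers to a live process, using signal 0 as an existence test.
function isProcessAlive(pid) {
  // Reject missing or malformed PIDs; 0 and negatives address process groups.
  if (!Number.isInteger(pid) || pid <= 0) {
    return false;
  }
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or any other error): treat the process as not running.
    return err.code === "EPERM";
  }
}
```

A PID can be reused by an unrelated process after the original exits, so a true result is only a hint. Ruling that out would need an extra check, such as comparing a recorded start time.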

Evidence from Production

Observed in plugin version 1.0.3. A session had 6 consecutive failed task attempts, all blocked by the same zombie job:

task-mntm7f2j  status=running  pid=82311  (process DEAD)
task-mntmbrox  status=failed   error="Task task-mntm7f2j is still running"
task-mntmbvx4  status=failed   error="Task task-mntm7f2j is still running"
task-mntmclwl  status=failed   error="Task task-mntm7f2j is still running"
task-mntme0qz  status=failed   error="Task task-mntm7f2j is still running"
task-mntmg2an  status=failed   error="Task task-mntm7f2j is still running"
task-mntnft5z  status=failed   error="Task task-mntm7f2j is still running"

The PID was confirmed dead: kill(82311, 0) raised ProcessLookupError.

Suggested Fix

In resolveLatestTrackedTaskThread(), before throwing, verify the PID is alive. If the process is dead, mark the job as failed and continue:

const activeTask = visibleJobs.find(
  (job) => job.jobClass === "task" && 
  (job.status === "queued" || job.status === "running")
);

if (activeTask) {
  // Check if the process is actually alive
  const pid = activeTask.pid;
  let processAlive = false;
  if (pid) {
    try {
      process.kill(pid, 0);
      processAlive = true;
    } catch (e) {
      processAlive = (e.code === 'EPERM'); // exists but no permission
    }
  }

  if (processAlive) {
    throw new Error(`Task ${activeTask.id} is still running.`);
  }

  // Process is dead — mark as failed (zombie cleanup)
  upsertJob(workspaceRoot, {
    id: activeTask.id,
    status: "failed",
    phase: "failed",
    pid: null,
    errorMessage: "Process exited without updating job status.",
    completedAt: new Date().toISOString()
  });
}
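
The cleanup step can also be factored into a pure function that is easy to unit-test, sketched below. The names reconcileStaleJobs and isAlive are hypothetical. The job fields mirror those used in the snippet above (jobClass, status, pid, phase, errorMessage, completedAt). One deliberate difference, stated as an assumption: a job with no recorded PID is left untouched rather than marked failed, since a queued job may not have spawned a process yet.

```javascript
// Hypothetical sketch: returns a copy of `jobs` in which active task jobs
// whose recorded process is gone are marked as failed.
// `isAlive` is an injected predicate (pid -> boolean), e.g. a signal-0 probe.
function reconcileStaleJobs(jobs, isAlive, now = new Date()) {
  return jobs.map((job) => {
    const isActiveTask =
      job.jobClass === "task" &&
      (job.status === "queued" || job.status === "running");
    // Assumption: without a recorded PID there is nothing to verify, so skip.
    if (!isActiveTask || !Number.isInteger(job.pid)) {
      return job;
    }
    if (isAlive(job.pid)) {
      return job;
    }
    // The recorded process no longer exists: treat the job as a zombie.
    return {
      ...job,
      status: "failed",
      phase: "failed",
      pid: null,
      errorMessage: "Process exited without updating job status.",
      completedAt: now.toISOString(),
    };
  });
}

// Usage with made-up records: PID 111 is treated as dead, PID 222 as alive.
const jobs = [
  { id: "task-a", jobClass: "task", status: "running", pid: 111 },
  { id: "task-b", jobClass: "task", status: "running", pid: 222 },
  { id: "task-c", jobClass: "task", status: "queued", pid: null },
];
const reconciled = reconcileStaleJobs(jobs, (pid) => pid === 222);
console.log(reconciled.map((job) => `${job.id}=${job.status}`).join(" "));
// prints "task-a=failed task-b=running task-c=queued"
```

Injecting the predicate keeps the function free of side effects. The caller would still need to persist the changed records, for instance through upsertJob as in the snippet above.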

Environment

  • Plugin version: 1.0.3
  • Codex CLI: @openai/codex (latest)
  • Claude Code: 2.1.89
  • Platform: macOS (Darwin 24.6.0, arm64)
  • Node.js: v22.12.0
