
Evaluate options for background task processing #18

@quevon24

Description

Context

We need background processing for three use cases:

  1. PDF compression - Run Ghostscript to compress uploaded scan PDFs
  2. Blackletter OCR pipeline - Run the blackletter ML pipeline to add OCR and metadata to scans
  3. Blackletter split pipeline - Run the blackletter ML pipeline to split a full book scan into individual opinions

Currently the app only supports uploading files. None of these processing features exist yet. We need to choose a background processing approach before implementing them, since these tasks (PDF compression, ML inference) are too slow to run in the request/response cycle.

Constraints:

  • ~10 concurrent users max
  • Upload rate limited by physical scanning speed (vflat)
  • Low volume workload, not a high-throughput pipeline
  • Already have PostgreSQL, want to avoid adding Redis if possible
  • Running on Docker Compose (dev) / K8s (prod)

Option 1: django.tasks + django-tasks DatabaseBackend

Django 6.0 ships a native django.tasks framework (DEP 14) with a standard API (@task decorator, .enqueue(), TaskResult). The django-tasks package provides a DatabaseBackend that stores tasks in PostgreSQL using SELECT ... FOR UPDATE SKIP LOCKED for safe concurrent pickup.

Infrastructure: +1 Docker service (worker running manage.py db_worker, reuses the same Django image)

Pros:

  • First-party Django API. Task definitions use django.tasks from Django core. Backends are swappable without changing task code
  • No Redis needed, uses existing PostgreSQL
  • Only one extra container for the worker (reuses the same image, just a different entrypoint command)
  • Clean, idiomatic code: @task decorator + .enqueue()
  • Transaction-safe enqueueing via transaction.on_commit()
  • Async support: aenqueue(), aget_result()
  • Future-proof: this is the direction Django is heading

Cons:

  • No built-in automatic retry. If the worker crashes mid-task, the task stays in RUNNING state forever
  • No built-in scheduling or delayed execution
  • Youngest ecosystem (~1 year), fewer community resources
  • DatabaseBackend polls for new tasks (slight latency vs push-based brokers)
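
Wiring this option up is essentially one setting. The backend path below follows the django-tasks README and should be verified against the installed version; this is a sketch, not tested config:

```python
# settings.py — route django.tasks through django-tasks' DatabaseBackend,
# which stores queued tasks in PostgreSQL and claims them with
# SELECT ... FOR UPDATE SKIP LOCKED.
TASKS = {
    "default": {
        "BACKEND": "django_tasks.backends.database.DatabaseBackend",
    }
}
```

Task code itself stays backend-agnostic: `@task`-decorated functions are invoked via `.enqueue()`, so swapping to a different backend later should only touch this setting.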

Stale task recovery (manual retry):
Since django.tasks has no automatic retry, we can implement a recovery pattern:

  • Add a processing_started_at timestamp to the Scan model
  • Write a management command that resets scans stuck in COMPRESSING/SCANNING status for longer than X minutes
  • Run it via a K8s CronJob every 15 minutes
  • This effectively gives us retry behavior without needing it built into the task framework

Option 2: subprocess.Popen + K8s CronJob

Spawn a detached OS process from the view that runs a Django management command. The child process is fully independent of the web worker (survives Gunicorn worker recycling).

Infrastructure: No new services. Management commands run as child processes of the web worker, or via K8s CronJobs for batch/scheduled work.

Pros:

  • Zero dependencies, zero configuration
  • Survives web worker restarts (start_new_session=True fully detaches the child process)
  • True parallelism (separate process, own GIL)
  • Management commands are testable and can be run manually from the shell
  • Good stepping stone before committing to a task queue

Cons:

  • 1-3 second startup overhead per task (loads Python + Django each time)
  • 50-200MB memory per subprocess
  • No concurrency control (must build a limiter yourself)
  • No task persistence or result tracking (must track status in the model)
  • Orphan process risk if commands hang
  • No visibility into running tasks without custom tooling
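
The detachment described above is a few lines of stdlib; the management-command invocation in the comment is a hypothetical example:

```python
import subprocess
import sys


def spawn_detached(args: list[str]) -> subprocess.Popen:
    """Spawn a fully detached child process.

    start_new_session=True puts the child in its own session (POSIX setsid),
    so it is not killed when Gunicorn recycles the web worker that spawned it.
    """
    return subprocess.Popen(
        args,
        start_new_session=True,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


# From a view, something like (command name hypothetical):
# spawn_detached([sys.executable, "manage.py", "compress_pdf", str(scan.pk)])
```

Note that redirecting stdio to DEVNULL means any output is lost unless the command does its own logging, which is exactly the "no visibility without custom tooling" con above.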

Option 3: Django-Q2 (ORM/PostgreSQL backend)

Third-party task queue (actively maintained fork of django-q). Can use the Django ORM as a broker, so no Redis needed.

Infrastructure: +1 Docker service (worker running manage.py qcluster)

Pros:

  • No Redis needed, uses existing PostgreSQL as broker
  • Best Django admin integration of all options (tasks, results, failures, and schedules all visible)
  • Built-in retry with configurable timeout
  • Supports task chains, schedules, result tracking
  • Mature codebase (original django-q is ~8 years old, Q2 fork is actively maintained)

Cons:

  • Third-party package, not on the Django standardization path (Django chose DEP 14 instead)
  • Multiprocessing model uses more memory per worker than threading
  • Smaller community than Celery
  • API uses string-based task references ("scanning.tasks.compress_pdf") rather than typed function calls
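
Pointing Django-Q2 at the ORM broker is a single settings dict. Key names are per the django-q2 configuration docs; the values below are guesses sized for this workload, not recommendations:

```python
# settings.py — Django-Q2 using the Django ORM (PostgreSQL) as its broker.
Q_CLUSTER = {
    "name": "scanning",   # hypothetical cluster name
    "workers": 2,         # low-volume workload, ~10 users
    "timeout": 600,       # kill tasks running longer than 10 minutes
    "retry": 700,         # re-queue unacknowledged tasks (must exceed timeout)
    "orm": "default",     # use the 'default' database connection as the broker
}
```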

Option 4: Huey + Redis

Lightweight task queue designed as a simpler Celery alternative. Requires Redis as a broker.

Infrastructure: +2 Docker services (Redis container + worker running manage.py run_huey)

Pros:

  • Clean, simple decorator API, less boilerplate than Celery
  • Single package with built-in Django integration
  • Supports periodic tasks, task pipelines, task locking
  • Lightweight and fast
  • Well-maintained (10+ years, single primary author)

Cons:

  • Requires Redis (new infrastructure dependency we don't currently have)
  • Smaller ecosystem, no built-in web dashboard
  • Single-process consumer model limits horizontal scaling
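
For comparison, Huey's Django integration is configured from settings. The dict form below follows the huey documentation for `huey.contrib.djhuey`; the `redis` hostname assumes a Redis service in Docker Compose, and all values are illustrative:

```python
# settings.py — Huey with a Redis broker via huey.contrib.djhuey.
HUEY = {
    "huey_class": "huey.RedisHuey",
    "name": "scanning",  # hypothetical queue name
    "connection": {"host": "redis", "port": 6379},
    "consumer": {"workers": 2},
}
```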

Option 5: Celery + Redis

The industry-standard distributed task queue for Python/Django.

Infrastructure: +2 Docker services (Redis + Celery worker). Optional +1 for Celery Beat (scheduled tasks).

Pros:

  • Most mature and well-documented option (15+ years, used by Instagram, Mozilla, Adyen)
  • Excellent retry support: automatic retries, exponential backoff, dead letter queues
  • Excellent monitoring: Flower web dashboard, django-celery-results admin
  • Handles complex workflows (chains, groups, chords)
  • Easy to scale horizontally

Cons:

  • Requires Redis (new infrastructure dependency)
  • Most boilerplate to set up (celery.py, broker config, worker management)
  • Overkill for our low-volume, staff-triggered workload
  • Configuration can be finicky (serializers, timezones, result backends)
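
The boilerplate con refers to the app bootstrap every Celery + Django project carries. The shape below follows Celery's "First steps with Django" guide; the `config` package name is an assumption about this project's layout:

```python
# config/celery.py — standard Celery application bootstrap.
import os

from celery import Celery

# Make Django settings importable before the app is configured.
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings")

app = Celery("config")
# Read CELERY_* keys (broker URL, serializers, ...) from Django settings.
app.config_from_object("django.conf:settings", namespace="CELERY")
# Find tasks.py modules in all installed apps.
app.autodiscover_tasks()
```

On top of this you still need broker settings (e.g. `CELERY_BROKER_URL = "redis://redis:6379/0"`) and an import of `app` from the package's `__init__.py`, which is the setup cost the other options avoid.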

Comparison

| Criteria | django.tasks | subprocess + CronJob | Django-Q2 | Huey | Celery |
|---|---|---|---|---|---|
| New deps | 1 (django-tasks) | 0 | 1 (django-q2) | 1 (huey) | 2-3 |
| New Docker services | +1 (worker) | 0 | +1 (worker) | +2 (Redis + worker) | +2-3 (Redis + worker + beat) |
| Requires Redis | No | No | No | Yes | Yes |
| Task persistence | Yes (DB) | No | Yes (DB) | Yes (Redis) | Yes (Redis) |
| Retry support | Manual (CronJob) | Manual | Built-in | Built-in | Excellent |
| Admin UI | Basic | None | Best | Basic | Good (Flower) |
| Setup complexity | Low | Trivial | Low | Low-medium | Medium-high |
| Maturity | New (~1 year) | N/A | Good (~8 years) | Good (10+ years) | Excellent (15+ years) |

Recommendation: django.tasks + django-tasks DatabaseBackend

For our workload (staff-triggered, low volume, ~10 users, PostgreSQL already in place, no Redis), django.tasks is the best fit:

  1. No Redis. We already have PostgreSQL and don't need a new infrastructure dependency for our throughput level
  2. First-party API. django.tasks is in Django core. Task code is portable across backends, so if we ever need to swap to a Redis-backed backend, we change one setting
  3. Minimal overhead. Only one extra container running the same Django image with a different command
  4. Future-proof. This is where Django is heading (DEP 14). Investing in this API now means we're aligned with the ecosystem as it matures
  5. The retry gap is solvable. A management command + K8s CronJob that resets stuck tasks every 15 minutes gives us effective retry behavior. This pattern works regardless of which backend we choose
  6. Right-sized. Celery and Huey solve problems we don't have (high throughput, complex routing, distributed workers). Django-Q2 is solid but is a third-party library betting against Django's own standardization direction. subprocess works but has no persistence or visibility

If the ecosystem proves too young or we hit limitations, Django-Q2 is the natural fallback (also PostgreSQL-based, no Redis, good admin UI).
