Description
Context
We need background processing for three use cases:
- PDF compression - Run Ghostscript to compress uploaded scan PDFs
- Blackletter OCR pipeline - Run the blackletter ML pipeline to add OCR and metadata to scans
- Blackletter split pipeline - Run the blackletter ML pipeline to split a full book scan into individual opinions
Currently the app only supports uploading files. None of these processing features exist yet. We need to choose a background processing approach before implementing them, since these tasks (PDF compression, ML inference) are too slow to run in the request/response cycle.
Constraints:
- ~10 concurrent users max
- Upload rate limited by physical scanning speed (vflat)
- Low volume workload, not a high-throughput pipeline
- Already have PostgreSQL, want to avoid adding Redis if possible
- Running on Docker Compose (dev) / K8s (prod)
Option 1: django.tasks + django-tasks DatabaseBackend
Django 6.0 ships a native `django.tasks` framework (DEP 14) with a standard API (`@task` decorator, `.enqueue()`, `TaskResult`). The django-tasks package provides a `DatabaseBackend` that stores tasks in PostgreSQL using `SELECT ... FOR UPDATE SKIP LOCKED` for safe concurrent pickup.
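As an illustration of this API, a minimal sketch (the task name `compress_scan_pdf`, its body, and the `scan_id` argument are our hypothetical use case, not part of Django):

```python
# tasks.py — sketch of the Django 6.0 django.tasks API. The task name and
# the Ghostscript step are hypothetical placeholders for our compression job.
from django.tasks import task

@task
def compress_scan_pdf(scan_id: int) -> None:
    ...  # look up the Scan, run Ghostscript, store the compressed PDF

# In the upload view, enqueue after the transaction commits so the worker
# never sees a Scan row that hasn't been written yet:
#
#   from django.db import transaction
#   transaction.on_commit(lambda: compress_scan_pdf.enqueue(scan.id))
#
# .enqueue() returns a TaskResult whose id can be stored for status polling.
```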
Infrastructure: +1 Docker service (worker running manage.py db_worker, reuses the same Django image)
Pros:
- First-party Django API. Task definitions use `django.tasks` from Django core. Backends are swappable without changing task code
- No Redis needed, uses existing PostgreSQL
- Only one extra container for the worker (reuses the same image, just a different entrypoint command)
- Clean, idiomatic code: `@task` decorator + `.enqueue()`
- Transaction-safe enqueueing via `transaction.on_commit()`
- Async support: `aenqueue()`, `aget_result()`
- Future-proof: this is the direction Django is heading
Cons:
- No built-in automatic retry. If the worker crashes mid-task, the task stays in RUNNING state forever
- No built-in scheduling or delayed execution
- Youngest ecosystem (~1 year), fewer community resources
- DatabaseBackend polls for new tasks (slight latency vs push-based brokers)
Stale task recovery (manual retry):
Since django.tasks has no automatic retry, we can implement a recovery pattern:
- Add a `processing_started_at` timestamp to the Scan model
- Write a management command that resets scans stuck in COMPRESSING/SCANNING status for longer than X minutes
- Run it via a K8s CronJob every 15 minutes
- This effectively gives us retry behavior without needing it built into the task framework
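The core of that management command is a simple staleness check. A minimal sketch in plain Python (field names, status values, and the 15-minute default are assumptions; in the real command this would be an ORM filter plus a bulk `update()`):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical status values for the Scan model's processing states.
STUCK_STATUSES = {"COMPRESSING", "SCANNING"}

def is_stuck(status, processing_started_at, now, threshold=timedelta(minutes=15)):
    """Return True when a scan has sat in a processing status past the threshold.

    The management command would apply the same predicate as an ORM filter
    (status__in=..., processing_started_at__lt=now - threshold) and reset
    matching rows so they can be re-enqueued.
    """
    if status not in STUCK_STATUSES or processing_started_at is None:
        return False
    return now - processing_started_at > threshold

# Quick demonstration of the cutoff behavior:
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(is_stuck("COMPRESSING", now - timedelta(minutes=20), now))  # stuck
print(is_stuck("COMPRESSING", now - timedelta(minutes=5), now))   # still fresh
```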
Option 2: subprocess.Popen + K8s CronJob
Spawn a detached OS process from the view that runs a Django management command. The child process is fully independent of the web worker (survives Gunicorn worker recycling).
Infrastructure: No new services. Management commands run as child processes of the web worker, or via K8s CronJobs for batch/scheduled work.
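A minimal sketch of the detached spawn (the management command name would be ours; here a trivial child process stands in for it):

```python
import subprocess
import sys

def spawn_detached(argv):
    """Launch a child process in its own session so it survives web-worker
    restarts (the Option 2 pattern). In a real view, argv would be something
    like ["python", "manage.py", "compress_scan", str(scan_id)] — that
    command name is hypothetical.
    """
    return subprocess.Popen(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # new session: detached from our process group
    )

# Demo: spawn a trivial child; the parent can exit without killing it.
proc = spawn_detached([sys.executable, "-c", "pass"])
```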
Pros:
- Zero dependencies, zero configuration
- Survives web worker restarts (`start_new_session=True` fully detaches the child process)
- True parallelism (separate process, own GIL)
- Management commands are testable and can be run manually from the shell
- Good stepping stone before committing to a task queue
Cons:
- 1-3 second startup overhead per task (loads Python + Django each time)
- 50-200MB memory per subprocess
- No concurrency control (must build a limiter yourself)
- No task persistence or result tracking (must track status in the model)
- Orphan process risk if commands hang
- No visibility into running tasks without custom tooling
Option 3: Django-Q2 (ORM/PostgreSQL backend)
Third-party task queue (actively maintained fork of django-q). Can use the Django ORM as a broker, so no Redis needed.
Infrastructure: +1 Docker service (worker running manage.py qcluster)
Pros:
- No Redis needed, uses existing PostgreSQL as broker
- Best Django admin integration of all options (tasks, results, failures, and schedules all visible)
- Built-in retry with configurable timeout
- Supports task chains, schedules, result tracking
- Mature codebase (original django-q is ~8 years old, Q2 fork is actively maintained)
Cons:
- Third-party package, not on the Django standardization path (Django chose DEP 14 instead)
- Multiprocessing model uses more memory per worker than threading
- Smaller community than Celery
- API uses string-based task references (`"scanning.tasks.compress_pdf"`) rather than typed function calls
Option 4: Huey + Redis
Lightweight task queue designed as a simpler Celery alternative. Requires Redis as a broker.
Infrastructure: +2 Docker services (Redis container + worker running manage.py run_huey)
Pros:
- Clean, simple decorator API, less boilerplate than Celery
- Single package with built-in Django integration
- Supports periodic tasks, task pipelines, task locking
- Lightweight and fast
- Well-maintained (10+ years, single primary author)
Cons:
- Requires Redis (new infrastructure dependency we don't currently have)
- Smaller ecosystem, no built-in web dashboard
- Single-process consumer model limits horizontal scaling
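For reference, Huey's Django integration is also decorator-based; a sketch (the task name is hypothetical):

```python
# tasks.py — sketch of a Huey task via its Django contrib module; the
# manage.py run_huey consumer picks it up. compress_pdf is a hypothetical name.
from huey.contrib.djhuey import db_task

@db_task()  # db_task also manages the Django DB connection around the task
def compress_pdf(scan_id):
    ...  # run Ghostscript for this scan
```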
Option 5: Celery + Redis
The industry-standard distributed task queue for Python/Django.
Infrastructure: +2 Docker services (Redis + Celery worker). Optional +1 for Celery Beat (scheduled tasks).
Pros:
- Most mature and well-documented option (15+ years, used by Instagram, Mozilla, Adyen)
- Excellent retry support: automatic retries, exponential backoff, dead letter queues
- Excellent monitoring: Flower web dashboard, django-celery-results admin
- Handles complex workflows (chains, groups, chords)
- Easy to scale horizontally
Cons:
- Requires Redis (new infrastructure dependency)
- Most boilerplate to set up (celery.py, broker config, worker management)
- Overkill for our low-volume, staff-triggered workload
- Configuration can be finicky (serializers, timezones, result backends)
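For comparison, the retry behavior listed in the pros is declared per task; a sketch (the task name is hypothetical):

```python
# tasks.py — sketch of Celery's per-task retry configuration; compress_pdf
# is a hypothetical task name for our workload.
from celery import shared_task

@shared_task(
    bind=True,
    autoretry_for=(Exception,),  # retry automatically on any error
    retry_backoff=True,          # exponential backoff between attempts
    max_retries=3,
)
def compress_pdf(self, scan_id):
    ...  # run Ghostscript for this scan
```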
Comparison
| Criteria | django.tasks | subprocess + CronJob | Django-Q2 | Huey | Celery |
|---|---|---|---|---|---|
| New deps | 1 (django-tasks) | 0 | 1 (django-q2) | 1 (huey) | 2-3 |
| New Docker services | +1 (worker) | 0 | +1 (worker) | +2 (Redis + worker) | +2-3 (Redis + worker + beat) |
| Requires Redis | No | No | No | Yes | Yes |
| Task persistence | Yes (DB) | No | Yes (DB) | Yes (Redis) | Yes (Redis) |
| Retry support | Manual (CronJob) | Manual | Built-in | Built-in | Excellent |
| Admin UI | Basic | None | Best | Basic | Good (Flower) |
| Setup complexity | Low | Trivial | Low | Low-medium | Medium-high |
| Maturity | New (~1 year) | N/A | Good (~8 years) | Good (10+ years) | Excellent (15+ years) |
Recommendation: django.tasks + django-tasks DatabaseBackend
For our workload (staff-triggered, low volume, ~10 users, PostgreSQL already in place, no Redis), django.tasks is the best fit:
- No Redis. We already have PostgreSQL and don't need a new infrastructure dependency for our throughput level
- First-party API. `django.tasks` is in Django core. Task code is portable across backends, so if we ever need to swap to a Redis-backed backend, we change one setting
- Minimal overhead. Only one extra container running the same Django image with a different command
- Future-proof. This is where Django is heading (DEP 14). Investing in this API now means we're aligned with the ecosystem as it matures
- The retry gap is solvable. A management command + K8s CronJob that resets stuck tasks every 15 minutes gives us effective retry behavior. This pattern works regardless of which backend we choose
- Right-sized. Celery and Huey solve problems we don't have (high throughput, complex routing, distributed workers). Django-Q2 is solid but is a third-party library betting against Django's own standardization direction. subprocess works but has no persistence or visibility
If the ecosystem proves too young or we hit limitations, Django-Q2 is the natural fallback (also PostgreSQL-based, no Redis, good admin UI).