Description
API workers (PulpApiWorker / gunicorn SyncWorker) exhibit continuous RSS growth (~1 kB/request) even under minimal load (health probes only). The growth is caused by glibc heap fragmentation — Django's request cycle allocates and frees many small C-level objects (ORM query compilers, SQL strings, psycopg cursor state), and glibc's malloc retains the freed pages in the process heap rather than returning them to the OS.
Evidence
Profiling on a live Ansible Automation Platform 2.6 deployment (pulpcore 3.49.49, Django 4.2.27, Python 3.12, OpenShift):
- Python object counts are stable: the gc.get_objects() delta is ~0 after initial lazy initialization
- gc.collect() recovers 0 bytes, so there are no reference cycles
- malloc_trim(0) reclaims ~2 MB immediately, which confirms heap fragmentation
- RSS grows linearly at ~1 kB/request without trimming, with no upper bound
- Master process RSS is flat — only forked workers are affected
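For anyone who wants to repeat these checks, a probe along the following lines can be run inside a worker process. This is a minimal sketch and not the exact script used for the numbers above: the helper names (rss_bytes, malloc_trim, probe) are made up for illustration, and it assumes Linux with glibc, returning None where /proc or malloc_trim is unavailable.

```python
import ctypes
import ctypes.util
import gc
import os


def rss_bytes():
    """Return this process's resident set size in bytes, or None if unavailable."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None  # not Linux, or /proc is not mounted
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def malloc_trim():
    """Call glibc's malloc_trim(0); return its result, or None if unavailable."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None  # no glibc here (musl, macOS, Windows)
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim(0)  # 1 if memory was released to the OS, 0 otherwise


def probe():
    """Collect the measurements discussed above in one pass."""
    object_count = len(gc.get_objects())  # compare across calls for a delta
    unreachable = gc.collect()            # > 0 would hint at reference cycles
    rss_before = rss_bytes()
    trim_result = malloc_trim()
    rss_after = rss_bytes()
    return {
        "object_count": object_count,
        "gc_unreachable": unreachable,
        "rss_before": rss_before,
        "rss_after": rss_after,
        "trim_result": trim_result,
    }


if __name__ == "__main__":
    print(probe())
```

A stable object_count between calls, together with rss_after dropping below rss_before, matches the fragmentation pattern described above.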
Impact
Over hours, worker RSS climbs from ~150 MB to multiple GB, leading to:
- Gunicorn worker timeout (SIGKILL)
- Health probe failures
- Pod OOM kills and restarts
This is observed even with zero user activity — Kubernetes liveness/readiness probes alone drive the growth.
Proposed fix
PR #7481 adds periodic gc.collect() + malloc_trim(0) in PulpApiWorker.handle_request(), configurable via PULP_MEMORY_TRIM_INTERVAL env var (default: every 1024 requests, set to 0 to disable). Linux-only, graceful no-op on other platforms.
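The general shape of the approach can be sketched as below. This is an illustration only and not the code from the PR: the MemoryTrimmer class and its after_request() method are hypothetical names, and the only details taken from the description above are the PULP_MEMORY_TRIM_INTERVAL variable, its default of 1024, and the "0 disables" behaviour.

```python
import ctypes
import ctypes.util
import gc
import os


def _load_malloc_trim():
    """Return a callable wrapping glibc's malloc_trim, or None if unavailable."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")
        trim = libc.malloc_trim
    except (OSError, AttributeError):
        return None  # non-glibc platform: trimming becomes a no-op
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim


class MemoryTrimmer:
    """Hypothetical helper: run gc.collect() + malloc_trim(0) every N requests."""

    def __init__(self, interval=None):
        if interval is None:
            # Env var name and default follow the issue text.
            interval = int(os.environ.get("PULP_MEMORY_TRIM_INTERVAL", "1024"))
        self.interval = max(interval, 0)  # 0 disables trimming
        self._count = 0
        self._trim = _load_malloc_trim() if self.interval else None

    def after_request(self):
        """Count one request; return True if a trim pass ran on this call."""
        if not self.interval:
            return False
        self._count += 1
        if self._count < self.interval:
            return False
        self._count = 0
        gc.collect()  # free collectable Python objects first
        if self._trim is not None:
            self._trim(0)  # then ask glibc to return free heap memory to the OS
        return True
```

In a worker, something like this could be called once per request from an overridden handle_request(), after the response has been sent, so that the cost of a trim pass is paid only once every interval requests.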