Skip to content

configd: race condition in script_output.py cached_results cleanup causes KeyError #9864

@beltashazzar

Description

@beltashazzar

Important notices

Describe the bug

The caching mechanism added in #7515 has a race condition in script_output.py's execute() method. The class-level Action.cached_results dict is shared across all configd worker threads with no synchronization. When multiple API calls arrive concurrently, two threads can race on the cache cleanup loop:

# Thread A iterates cached keys
for key in list(Action.cached_results.keys()):
    if Action.cached_results[key]['expire'] < now:          # Thread A reads key
        os.remove(Action.cached_results[key]['filename'])    # Thread B deletes key first → KeyError
        del Action.cached_results[key]

Thread A snapshots the keys and begins iterating. Thread B concurrently expires and deletes the same key. Thread A then hits a KeyError on line 71 or 72.

To reproduce

Any frequent API poller triggers this reliably. In my case, opnsense-exporter scraped by Prometheus every 15 seconds. Each scrape fires multiple concurrent OPNsense API calls (gateways, services, firewall stats), each spawning a configd thread that enters the cache cleanup path.

Within a few hours of a service configd restart, the cache re-corrupts and tracebacks recur every few minutes:

unable to sendback response for <uuid>, message was Traceback (most recent call last):
  File "/usr/local/opnsense/service/modules/processhandler.py", line 188, in run
    result = self.action_handler.execute(...)
  File "/usr/local/opnsense/service/modules/processhandler.py", line 355, in execute
    return action_obj.execute(action_params, message_uuid, connection)
  File "/usr/local/opnsense/service/modules/actions/script_output.py", line 71, in execute
    os.remove(Action.cached_results[key]['filename'])
KeyError: '<sha256_hash>'

Also seen on line 72 (del Action.cached_results[key]) and occasionally FileNotFoundError on /tmp/tmpcfd_* when two threads race to os.remove() the same temp file.

Expected behavior

Cache cleanup should be thread-safe. Concurrent API calls should not corrupt the shared cache dict.

Suggested fix

Wrap the cleanup loop in a try/except:

for key in list(Action.cached_results.keys()):
    try:
        if Action.cached_results[key]['expire'] < now:
            try:
                os.remove(Action.cached_results[key]['filename'])
            except (KeyError, FileNotFoundError):
                pass
            del Action.cached_results[key]
    except KeyError:
        pass

Or use a threading.Lock around all access to Action.cached_results.

Environment

  • OPNsense 25.1 (also reproducible on 24.7)
  • Triggered by opnsense-exporter with 15s Prometheus scrape interval
  • Workaround: increased scrape interval to 60s, which reduces but doesn't eliminate the race

Metadata

Metadata

Assignees

Labels

cleanupLow impact changes

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions