Important notices
- I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
The caching mechanism added in #7515 has a race condition in script_output.py's execute() method. The class-level Action.cached_results dict is shared across all configd worker threads with no synchronization. When multiple API calls arrive concurrently, two threads can race on the cache cleanup loop:
```python
# Thread A iterates cached keys
for key in list(Action.cached_results.keys()):
    if Action.cached_results[key]['expire'] < now:         # Thread A reads key
        os.remove(Action.cached_results[key]['filename'])  # Thread B deletes key first → KeyError
        del Action.cached_results[key]
```

Thread A snapshots the keys and begins iterating. Thread B concurrently expires and deletes the same key. Thread A then hits a KeyError on line 71 or 72.
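The interleaving can be shown deterministically by simulating thread B's delete between thread A's key snapshot and its dict access. A minimal sketch, with an illustrative cache entry rather than the actual configd state:

```python
# Simplified stand-in for Action.cached_results; key and contents are illustrative.
cached_results = {'abc123': {'expire': 10, 'filename': '/tmp/tmpcfd_example'}}

keys = list(cached_results.keys())   # thread A snapshots the keys
del cached_results['abc123']         # thread B expires and deletes the same entry

raised = False
try:
    # thread A's next access on the snapshotted key blows up
    expired = cached_results[keys[0]]['expire'] < 20
except KeyError:
    raised = True

print('KeyError on snapshotted key:', raised)
```

The GIL makes each individual dict operation atomic, but the check-then-remove-then-delete sequence spans several operations, so a thread switch anywhere inside it produces exactly this error.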
To reproduce
Any frequent API poller triggers this reliably. In my case, opnsense-exporter scraped by Prometheus every 15 seconds. Each scrape fires multiple concurrent OPNsense API calls (gateways, services, firewall stats), each spawning a configd thread that enters the cache cleanup path.
Within a few hours of a `service configd restart`, the race recurs and tracebacks like the following appear every few minutes:
```
unable to sendback response for <uuid>, message was Traceback (most recent call last):
  File "/usr/local/opnsense/service/modules/processhandler.py", line 188, in run
    result = self.action_handler.execute(...)
  File "/usr/local/opnsense/service/modules/processhandler.py", line 355, in execute
    return action_obj.execute(action_params, message_uuid, connection)
  File "/usr/local/opnsense/service/modules/actions/script_output.py", line 71, in execute
    os.remove(Action.cached_results[key]['filename'])
KeyError: '<sha256_hash>'
```

Also seen on line 72 (`del Action.cached_results[key]`), and occasionally a FileNotFoundError on `/tmp/tmpcfd_*` when two threads race to `os.remove()` the same temp file.
Expected behavior
Cache cleanup should be thread-safe. Concurrent API calls should not corrupt the shared cache dict.
Suggested fix
Wrap the cleanup loop in a try/except:

```python
for key in list(Action.cached_results.keys()):
    try:
        if Action.cached_results[key]['expire'] < now:
            try:
                os.remove(Action.cached_results[key]['filename'])
            except (KeyError, FileNotFoundError):
                pass
            del Action.cached_results[key]
    except KeyError:
        pass
```

Or use a `threading.Lock` around all access to `Action.cached_results`.
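A minimal sketch of the lock-based alternative, assuming a single module-level lock guards the cache. The lock and helper names (`_cache_lock`, `cleanup_expired`) are hypothetical, not upstream code:

```python
import os
import threading

_cache_lock = threading.Lock()  # hypothetical: one lock for all cache access


def cleanup_expired(cached_results, now):
    """Drop expired entries and their temp files; safe under concurrent callers."""
    with _cache_lock:
        for key in list(cached_results.keys()):
            entry = cached_results[key]
            if entry['expire'] < now:
                try:
                    os.remove(entry['filename'])
                except FileNotFoundError:
                    pass  # file may already be gone, e.g. external /tmp cleanup
                del cached_results[key]
```

For this to be a complete fix, every other reader and writer of `Action.cached_results` would need to acquire the same lock; the try/except variant above is a smaller patch but only suppresses the symptoms of the race rather than removing it.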
Environment
- OPNsense 25.1 (also reproducible on 24.7)
- Triggered by opnsense-exporter with 15s Prometheus scrape interval
- Workaround: increased scrape interval to 60s, which reduces but doesn't eliminate the race