Description
The http_endpoint input (health reporting added in PR #44310 per #44281) currently calls status.Degraded when a client request fails validation (e.g., wrong HTTP method, failed auth). The same status.Degraded mechanism is used for these client validation failures and component level failures.
Because Fleet uses worst-case aggregation across components, a single rejected request can mark the entire agent as "Unhealthy" in the Fleet UI.
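To illustrate why one input matters so much, here is a minimal sketch of worst-case aggregation. It is not Fleet's or Elastic Agent's actual code; the `UnitState` type, its constants, and `worstState` are hypothetical names chosen for this example, and the only assumption carried over from the issue is that the agent-level state is the worst state of any unit.

```go
package main

// UnitState is a hypothetical health state, ordered from best to worst.
type UnitState int

const (
	StateHealthy UnitState = iota
	StateDegraded
	StateFailed
)

// worstState returns the worst state among all units. With this kind of
// aggregation, one DEGRADED input is enough to make the whole agent
// report a non-healthy state, no matter how many other units are fine.
func worstState(units []UnitState) UnitState {
	worst := StateHealthy
	for _, u := range units {
		if u > worst {
			worst = u
		}
	}
	return worst
}
```

Under this assumption, an agent with many healthy units and one http_endpoint input that rejected a stray GET would aggregate to the degraded state.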
The input stays DEGRADED until the next valid request arrives. There is no background health check or periodic reset. This is particularly troublesome for sporadic webhook sources that may go hours or days between events: a single stray GET from a health check probe or network scanner leaves the agent showing Unhealthy the entire time.
Could this behavior be improved so that client validation failures do not affect the input's health status? Rejecting an invalid request is expected, healthy behavior, not a sign of degradation. Logging the rejection rather than changing health status might be a better fit.
Background
The original spec in #44281 only considered infrastructure-level conditions, such as sustained backpressure, as potential DEGRADED triggers. Client validation failures were not discussed. The implementation in PR #44310 went beyond this by marking all validation failures as DEGRADED.
Current behavior
- Deploy an Elastic Agent with an http_endpoint integration (default config accepts POST only)
- Send one valid POST. Agent shows Healthy in Fleet
- Send a GET request: curl http://host:port/path
- Agent transitions to DEGRADED (HTTP 405)
- Agent stays DEGRADED until the next valid POST arrives
From a real agent diagnostic bundle, this cycle is visible in the event logs:
07:25:54Z Unit state changed (HEALTHY->DEGRADED): request did not validate: only POST requests are allowed
07:29:04Z Unit state changed (DEGRADED->HEALTHY): Healthy ← triggered by a valid POST ~4 mins later
Affected validation paths
All of these trigger status.Degraded in handler.go:
- Wrong HTTP method (405)
- Failed basic auth (401)
- Invalid secret header (401)
- Wrong content-type (415)
- Missing, malformed, or mismatched HMAC signature (401)
- Malformed JSON body
- CRC validation failure
- Invalid query parameters
Thank you for considering this. Happy to provide additional detail, diagnostics, or help test a fix.