CEL input: Add OTel tracing#48440

Open
chrisberkhout wants to merge 81 commits intoelastic:mainfrom
chrisberkhout:cel-otel-tracing

Conversation

@chrisberkhout chrisberkhout commented Jan 16, 2026

Proposed commit message

CEL input: Add OTel tracing (#)

Instruments the CEL Input with OpenTelemetry tracing. Sampling is 100%, so
all operations are covered. By default no exporter is set up and traces
will not be exported. Export can be configured to go to the console or
to an OTLP endpoint using the `grpc` (default) or `http/protobuf`
protocols.

Typically OTel tracing considers the whole process to be the "resource".
However, in this case the resource is the input instance. For that
reason a trace provider is created specifically for the input instance
and it is not explicitly set as the global tracer provider.

There is an extra environment variable to override any other
configuration and disable export for a specific input:
`BEATS_OTEL_TRACES_DISABLE=cel`.

Spans covering HTTP requests are enriched with attributes for request
and response headers, with values automatically (but configurably)
redacted to protect sensitive data.

Normal request logging and Filebeat logs will include span and trace IDs
that allow correlation with the OTel data. This is done in any location
to which we can pass a logger from the trace creation site. Other
Filebeat logging will lack the IDs. Because logger attributes are
append-only we pass around a logger with modified attributes rather than
modify attributes in a global logger.

Normal request logging had unused functionality for including a
`trace.id` field. That has been removed in favor of an OTel-specific
implementation that adds `trace.id` and `span.id` if there is a current,
valid span.

Requests initiated by CEL will have spans added by `otelhttp` and will
identify the correct parent span using trace data from the request
context. Since the relevant eval-time context is not propagated to those
requests by mito, cel-go[1] or oauth2[2], `ContextInjector` is used to
rewrite each request to include the current context as it is processed.

[1]: https://github.com/google/cel-go/issues/557
[2]: https://github.com/golang/oauth2/issues/262

There were a couple of things for which the initial approach changed:

  • Use of https://pkg.go.dev/go.opentelemetry.io/contrib/exporters/autoexport to interpret OTel environment variables and set up the exporter was removed in favor of manual handling, which seems to be standard when using the Go SDK (unlike implementations in some other languages).
  • The context with OTel tracing data needs to be propagated to the HTTP client used by CEL so that HTTP spans are attached to the correct parent span. That was initially done with a change in Mito (mito#118, "Add HTTPWithContextFnOpts so requests can have eval-time context"), which has been closed to avoid changing Mito. Now it is done in the CEL Input by having ContextInjector rewrite requests in the client used by CEL, which also solves the problem for OAuth2 requests.

There are some differences from the attribute and other names given in the planning document:

  • cel.periodic.program_count
    → Changed to cel.periodic.execution_count to match cel.program.execution.
  • cel.program.batch_count
    → Removed. It would only indicate whether an execution returned any events or not. Any other batching is internal to the CEL evaluation.
  • cel.{periodic,program}.success
    → Removed, in favor of span status.
  • cel.program.error_message
    → Not set. Uses SetStatus and RecordError instead.
  • BEATS_OTEL_TRACING_DISABLE
    → Changed to BEATS_OTEL_TRACES_DISABLE to match OTEL_TRACES_EXPORTER and OTEL_EXPORTER_OTLP_TRACES_*.

Handling of span-specific context and loggers is somewhat cumbersome. Refactoring to extract separate functions from run for separate stages of processing will help to tidy this up and is planned as follow-up work: #48464.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

How to test this PR locally

You can use otel-desktop-viewer as a simple receiver and viewer of OTel traces:

# Install it
go install github.com/CtrlSpice/otel-desktop-viewer@latest

# Run it. It will open its web UI
otel-desktop-viewer

# In another terminal, set it as the destination for OTel traces
export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317

In the terminal with those environment variables set, you can run the input with an example that includes OAuth2 and multiple requests per period, like this:

(cd x-pack/filebeat && go build) && ./x-pack/filebeat/filebeat run -c <(echo '
filebeat.inputs:
- type: cel
  enabled: true
  id: cel-1
  interval: 5s
  resource.url: https://api.ipify.org/?format=json&passwd=mysecretword
  program: |
    get(state.url).Body.as(body, state.with({
        "events": [body.decode_json()],
        "want_more": int(state.?runcount.orValue(1)) % 3 != 0,
        "runcount": int(state.?runcount.orValue(1)) + 1,
    }))
  resource.tracer.enable: true
  resource.tracer.filename: "x-pack/filebeat/logs/cel/http-request-trace-cel-*.ndjson"
  auth.oauth2.enabled: true
  auth.oauth2.client.id: someclientid
  auth.oauth2.client.secret: someclientsecret
  auth.oauth2.scopes: scope.me
  auth.oauth2.token_url: https://oauth-mock.mock.beeceptor.com/oauth/token/github
  auth.oauth2.endpoint_params:
    grant_type: client_credentials
  otel.trace.redacted:
    - User-Agent
  otel.trace.unredacted:
    - Authorization
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
  username: "elastic"
  password: "changeme"
  protocol: "https"
  ssl.verification_mode: "none"
  preset: balanced
logging.level: debug
logging.to_stderr: true
')

You can also use Elastic Observability to receive and view OTel traces, but it involves a bit more setup.

Bring up the Elastic Stack:

elastic-package stack up -v

In Kibana, go to "Management > Integrations" and go to the "APM" integration page. Click "Manage APM integration in Fleet", then "Add Elastic APM". Under "Configure integration > Integration settings > General > Server configuration", change the Host and URL settings to use '0.0.0.0' instead of 'localhost'. Under "Where to add this integration?", choose "Existing hosts > Elastic Agent (elastic-package)". Then click "Save and continue".

Now, back in the terminal, find the IP address of the agent container.

docker ps # confirm the agent container name is elastic-package-stack-elastic-agent-1
AGENT="elastic-package-stack-elastic-agent-1"
AGENT_IP=$(docker inspect "$AGENT" \
  --format '{{ (index .NetworkSettings.Networks "elastic-package-stack_default").IPAddress }}')
echo "$AGENT_IP" # confirm the IP was found

Use that as the destination for OTel traces:

export OTEL_TRACES_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://$AGENT_IP:8200"

Then, from the terminal with those settings, you can run the input using the example Filebeat configuration above.

To view the exported traces in Kibana, go to "Observability > Applications > Traces".

Related

Use cases

This tracing is to be used for troubleshooting, particularly for Agentless.

Screenshots

OTel traces for the CEL Input in Elastic Observability:
Screenshot 2026-02-06 at 15-46-40 cel periodic run - Transactions - unknown - Service inventory - APM - Observability - Elastic

@chrisberkhout self-assigned this Jan 16, 2026
@chrisberkhout added the enhancement, Filebeat, and Team:Security-Service Integrations labels Jan 16, 2026
@botelastic bot added and removed the needs_team label Jan 16, 2026
@mergify
Contributor

mergify bot commented Jan 16, 2026

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b cel-otel-tracing upstream/cel-otel-tracing
git merge upstream/main
git push upstream cel-otel-tracing

@mergify
Contributor

mergify bot commented Jan 16, 2026

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @chrisberkhout? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@elasticmachine
Contributor

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@pierrehilbert added the Team:Elastic-Agent-Data-Plane label Feb 3, 2026
chrisberkhout and others added 10 commits March 13, 2026 16:00
… GetExporterTypeFromEnv() for the case of metrics off, traces on".

The change in "Tidy up GetExporterTypeFromEnv() for the case of metrics
off, traces on" was to require OTEL_METRICS_EXPORTER to be set and not
default it to otlp because an OTEL_EXPORTER_OTLP_ENDPOINT was set. This
is important because otherwise configuration to send traces to an OTLP
endpoint would activate the metrics export, which the endpoint may not
be able to handle.

Make the tracer provider injectable via an optional field on the
input struct so tests can capture spans with a SpanRecorder. Add
three tests covering the happy path (single execution), the
want_more loop (two executions per period), and an evaluation
error, asserting on span names, parent-child relationships,
attributes, and status codes.

The trace span tests were waiting for a fixed 5s timeout even when
the first periodic run had already completed.

This updates the tests to cancel when the root periodic run span
(cel.periodic.run) is observed as ended in the span recorder. The
timeout remains as a safety guard for regressions, but no longer
drives normal test completion.

Why this is safe:
- assertions in these tests target spans and attributes produced
  inside that completed periodic run
- cancellation now happens after the root run trace is complete,
  avoiding the earlier cancellation race during publish

Result: tests keep the same tracing assertions while running in
~0.01s per test instead of ~5s.

	if n < 0 {
		return 0
	}
	return time.Duration(n) * time.Millisecond
Contributor

This looks a bit odd to me. Why are you converting n to time.Duration if n is the number of milliseconds?

You can multiply a number directly by time.Millisecond and you'll have a time.Duration

Suggested change
- return time.Duration(n) * time.Millisecond
+ return n * time.Millisecond

Contributor Author

Thanks. Yeah, there's no hidden reason.

Contributor Author

Actually, it's required to make the types work, so I'm reverting to leave it as it was.

Something like 55 * time.Millisecond would work because 55 is an untyped constant.

Apparently time.Duration(n) * time.Millisecond is idiomatic and consistent with the library design.

Contributor

This is language behaviour; n is not a time.Duration while time.Millisecond is, and Go does not have C-like implicit arithmetic type coercion.

	found = true
	b, err := strconv.ParseBool(strings.TrimSpace(raw))
	if err != nil {
		return false, true, fmt.Errorf("%s must be boolean: %w", k, err)
Contributor

[Question]

Does it make sense to return found = true on an error? The key is present, but the value is invalid... Could you elaborate more on the intended behaviour here?

Contributor Author

I think not. I'll change that.

Contributor Author

Done.

}

func TestNewExporterCfgFromEnv_ExporterDefaultsToNone(t *testing.T) {
	// unset BEATS_OTEL_TRACES_DISABLE
Contributor

I don't get this comment, should it be removed?

Suggested change
// unset BEATS_OTEL_TRACES_DISABLE

Contributor Author

The tests above and below set environment variables and make assertions about how the configuration is read from them.

For this test no setup is necessary. The comment is meant to indicate that the test makes assertions about the behavior when this variable is not set.

I don't mind removing it. What do you think?


func TestNewExporterCfgFromEnv_NotInsecureByDefault(t *testing.T) {
	t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "otlp-receiver.example.com:4317")
	// unset OTEL_EXPORTER_OTLP_INSECURE
Contributor

Again, I'm confused by this comment.

Contributor Author

Same again. Could make it look less magical, like this:

// With OTEL_EXPORTER_OTLP_INSECURE not set.

Co-authored-by: Tiago Queiroz <github@queiroz.life>

Labels

enhancement, Filebeat, Team:Docs, Team:Elastic-Agent-Data-Plane, Team:Security-Service Integrations
