feat(experimental): Retina in Rust#2079

Draft
nddq wants to merge 13 commits into main from rust-experiment

Conversation


@nddq nddq commented Feb 25, 2026

Description

An experimental rewrite of Retina in Rust for better performance, specifically targeting the advanced metrics path, with added Hubble support.

What's working:

  • packetparser and dropreason plugins with their metrics: parity with the Go implementation, plus some additional metrics
  • Hubble flow logs and UI

Related Issue

If this pull request is related to any issue, please mention it here. Additionally, make sure that the issue is assigned to you before submitting this pull request.

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

Please add any relevant screenshots or GIFs to showcase the changes made.

Additional Notes

Add any additional notes or context about the pull request here.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

nddq added 13 commits February 20, 2026 19:56
…bserver

Introduce an experimental Rust implementation of the Retina agent and
operator under experimental/. This is a from-scratch rewrite using
aya-rs for eBPF and tonic for gRPC, targeting feature parity with the
Go Hubble control plane.

Components:
- retina-agent: DaemonSet binary with eBPF packet capture, Hubble
  gRPC observer (GetFlows), Prometheus metrics, health probes, and
  pprof debug endpoints
- retina-operator: watches Pods/Services/Nodes, streams IP cache
  updates to agents via gRPC
- retina-core: shared library with Plugin trait, flow enrichment,
  IP cache, metrics, and flow store
- retina-proto: Hubble-compatible protobuf definitions
- packetparser plugin: eBPF TC classifiers for host and pod-level
  packet capture with conntrack

Key features:
- Cilium-compatible numeric identities via label hashing (range
  256-65535) with reserved identities for Host, World, RemoteNode,
  and KubeAPIServer
- Flow Summary field with TCP flags and UDP protocol display
- Pod CIDR gateway IP registration for correct host identity
- Distroless container images for both agent and operator
- xtask build system with kind cluster deployment support

Project layout follows cmd/crates/plugins convention.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
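The Cilium-compatible numeric-identity scheme above (label hashing into the 256-65535 range, with reserved identities below that) can be sketched as follows. This is a minimal std-only illustration, not the PR's actual code: the hash choice (FNV-1a) and the constant names are assumptions, though the reserved values shown match Cilium's well-known identities.

```rust
use std::collections::BTreeMap;

// Reserved identities (subset of Cilium's well-known values).
const IDENTITY_HOST: u32 = 1;
const IDENTITY_WORLD: u32 = 2;
const IDENTITY_REMOTE_NODE: u32 = 6;
const IDENTITY_KUBE_APISERVER: u32 = 7;

// Local identities are hashed into this range, leaving 0-255 reserved.
const IDENTITY_MIN: u32 = 256;
const IDENTITY_MAX: u32 = 65535;

/// Hash a pod's label set into the local identity range.
/// BTreeMap iterates labels in sorted key order, so the same label set
/// always produces the same identity regardless of insertion order.
fn identity_for_labels(labels: &BTreeMap<String, String>) -> u32 {
    // FNV-1a over "key=value" pairs (illustrative hash choice).
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for (k, v) in labels {
        for b in format!("{k}={v}\n").bytes() {
            h ^= u64::from(b);
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    let span = u64::from(IDENTITY_MAX - IDENTITY_MIN + 1);
    IDENTITY_MIN + (h % span) as u32
}
```

Deterministic hashing means agents on different nodes assign the same identity to pods with the same labels without coordinating, at the cost of possible collisions within the 16-bit range.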
Implement all remaining Hubble Observer RPCs and enhance existing ones:

- GetFlows: add flow filtering engine (whitelist/blacklist with IP, pod,
  label, verdict, protocol, port, TCP flags, node_name glob, identity),
  since/until time-range, first flag, node_name in responses
- ServerStatus: add version, flows_rate, num_connected/unavailable_nodes
- GetNodes: return single node with name, version, state, uptime, stats
- GetNamespaces: extract unique namespaces from flow store
- GetAgentEvents: full implementation with follow + historical modes,
  emit AGENT_STARTED and IPCACHE_UPSERTED/DELETED events
- GetDebugEvents: return empty stream instead of Unimplemented

Also fix eBPF verifier rejection by using zeroed struct init and correct
PacketEvent padding, and fix LinkAttribute::NetNsId rename in newer
netlink-packet-route.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
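The whitelist/blacklist semantics in the GetFlows filtering engine follow Hubble's convention: a flow is delivered when some whitelist filter matches (an empty whitelist matches everything) and no blacklist filter matches. A minimal sketch, with a deliberately tiny filter (pod prefix, verdict, port) rather than the full set listed above; the struct and field names are illustrative:

```rust
struct Flow {
    src_pod: String,
    verdict: &'static str, // "FORWARDED" | "DROPPED"
    dst_port: u16,
}

#[derive(Default)]
struct FlowFilter {
    pod_prefix: Option<String>,
    verdict: Option<&'static str>,
    port: Option<u16>,
}

impl FlowFilter {
    /// An unset field matches anything; set fields must all match.
    fn matches(&self, f: &Flow) -> bool {
        self.pod_prefix
            .as_ref()
            .map_or(true, |p| f.src_pod.starts_with(p.as_str()))
            && self.verdict.map_or(true, |v| f.verdict == v)
            && self.port.map_or(true, |p| f.dst_port == p)
    }
}

/// Hubble semantics: pass when the whitelist matches (empty = match all)
/// and the blacklist does not.
fn allowed(f: &Flow, whitelist: &[FlowFilter], blacklist: &[FlowFilter]) -> bool {
    let wl = whitelist.is_empty() || whitelist.iter().any(|fl| fl.matches(f));
    let bl = blacklist.iter().any(|fl| fl.matches(f));
    wl && !bl
}
```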
Replace raw deploy YAMLs with a Helm chart under deploy/ with
templated values for images, ports, and feature gates (operator,
hubble relay, hubble UI). Update xtask deploy commands to use
helm upgrade --install with --reuse-values and --wait instead of
kubectl apply + set image + rollout restart.

Add Hubble Peer gRPC service (peer.proto) so relay can discover
agent nodes. Implement Peer.Notify in the agent's gRPC server to
stream the node list from the operator's IP cache.

Add debug HTTP endpoints: /debug/ipcache on the agent, and a
full debug server on the operator (ipcache dump, service/node
state, pod watcher state).

Extend IP cache with dump() and service/node tracking in the
operator's watchers.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
… and gRPC resilience

- Add dual eBPF event delivery: BPF ring buffer (kernel >= 5.8) with
  automatic fallback to perf event array on older kernels. The eBPF
  crate uses a cfg feature flag to compile both variants; xtask builds
  both and the loader auto-selects at runtime via kernel version check.
  Ring buffer size is configurable via --ring-buffer-size CLI flag.

- Port packet sampling from the Go agent: probabilistic 1/N sampling
  using bpf_get_prandom_u32(), controlled via RETINA_CONFIG BPF array
  map and --sampling-rate CLI flag. Control flags (SYN/FIN/RST),
  timeouts, and periodic reports always fire regardless of sampling.

- Add Criterion benchmark suite (8 targets) for retina-core: flow,
  enricher, ipcache, filter, metrics, store, pipeline, and contention
  benchmarks with shared test data factories.

- Improve gRPC resilience: HTTP/2 keepalive on both client (agent) and
  server (operator), transient error classification in retry loop
  (WARN + reset backoff instead of ERROR + exponential backoff), and
  graceful shutdown via serve_with_shutdown so operator drains streams
  cleanly on SIGTERM instead of crashing agent connections.

- Refactor core crates: consolidate AgentEventStore into generic store,
  improve IpCache identity hashing, add FlowFilterSet, and expose
  select internals for benchmark access.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
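The runtime auto-selection between ring buffer and perf event array hinges on a kernel version check, since BPF ring buffers landed in 5.8. A small sketch of that decision, assuming the loader parses the release string from `uname -r` (function and enum names are illustrative, not the PR's code):

```rust
/// Parse a release string like "5.15.0-91-generic" into (major, minor).
fn kernel_version(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.split(|c: char| !c.is_ascii_digit());
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

#[derive(Debug, PartialEq)]
enum EventChannel {
    RingBuf,
    PerfEventArray,
}

/// BPF ring buffers require kernel >= 5.8; otherwise fall back to the
/// per-CPU perf event array variant of the eBPF program.
fn select_channel(release: &str) -> EventChannel {
    match kernel_version(release) {
        Some(v) if v >= (5, 8) => EventChannel::RingBuf,
        _ => EventChannel::PerfEventArray,
    }
}
```

An unparseable release string conservatively falls back to the perf event array, which works on both old and new kernels.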
…cher

- Batch IP cache snapshot into single gRPC frame (IpCacheBatch proto
  message) instead of sending N individual IpCacheUpdate messages
- Add gzip compression for operator-agent gRPC streams
- Add change detection to skip redundant upserts when pod metadata
  hasn't changed (tracked via upserts_skipped stat)
- Fix stale connection leak by detecting client disconnect via
  tx.closed() in tokio::select alongside broadcast recv
- Add graceful operator shutdown: sends SHUTDOWN marker so agents
  preserve their cache during rolling updates instead of clearing it
- Add operator debug endpoints (/debug/ipcache, /debug/stats) for
  live introspection of cache state and connection metrics
- Wait for IP cache sync before attaching veth eBPF programs so the
  enricher has identity data from the first packet
- Resolve pod names via neighbor table + IP cache lookup and log them
  alongside veth attachments for easier debugging
- Add retry with exponential backoff for agent IP cache sync connection
- Remove dead code: IDENTITY_UNKNOWN, resolve_identity(), IpCache::is_empty(),
  RingBuffer::is_empty(), OperatorState::len(), ETH_P_IP constant
- Add Criterion benchmark suite (8 targets) with shared test helpers

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
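The change-detection upsert above can be sketched as a compare-before-write on the cache entry: when the incoming pod metadata equals what is already stored, the update is counted as skipped and never broadcast to agents. A std-only sketch with illustrative type names:

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq)]
struct PodMeta {
    namespace: String,
    name: String,
    labels: Vec<(String, String)>,
}

#[derive(Default)]
struct IpCache {
    entries: HashMap<String, PodMeta>,
    upserts: u64,
    upserts_skipped: u64,
}

impl IpCache {
    /// Upsert that skips the broadcast when metadata is unchanged, so
    /// watch-driven resyncs don't fan out redundant updates to agents.
    /// Returns true when the caller should broadcast this update.
    fn upsert(&mut self, ip: &str, meta: PodMeta) -> bool {
        if self.entries.get(ip) == Some(&meta) {
            self.upserts_skipped += 1;
            return false; // nothing changed; nothing to send
        }
        self.entries.insert(ip.to_string(), meta);
        self.upserts += 1;
        true
    }
}
```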
- Use Arc::clone(&x) instead of x.clone() for Arc types to distinguish
  refcount bumps from deep clones
- Replace wildcard imports (use retina_common::*) with explicit named
  imports in flow.rs and bench_helpers.rs
- Convert 19 verbose match-with-early-return patterns to let-else
- Use is_some_and for Option matching with predicates in filter.rs
- Use if-let-else instead of 2-arm match on Option in watcher.rs
- Rename DebugState.state to .operator to eliminate state.state stutter
- Normalize .unwrap() to .expect("lock poisoned") on all lock
  acquisitions for consistent, descriptive panic messages

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
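The match-to-let-else conversion mentioned above looks like this in miniature (a made-up example, not one of the 19 actual sites):

```rust
// Before: a two-arm match whose failure arm only returns early.
fn parse_port_match(s: &str) -> Option<u16> {
    let port = match s.parse::<u16>() {
        Ok(p) => p,
        Err(_) => return None,
    };
    (port != 0).then_some(port)
}

// After: let-else (stable since Rust 1.65) flattens the happy path
// and keeps the early return visible at the binding site.
fn parse_port_let_else(s: &str) -> Option<u16> {
    let Ok(port) = s.parse::<u16>() else {
        return None;
    };
    (port != 0).then_some(port)
}
```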
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
…stence

Prefer TCX with LinkOrder::first() (kernel >= 6.6) so Retina's TC
classifiers run before any other programs on the interface. Falls back
to legacy TC with priority 1 on older kernels.

All eBPF programs already return TC_ACT_UNSPEC, which maps to TCX_NEXT
in TCX and continue in legacy cls_bpf, so Retina passively observes
packets without affecting subsequent program verdicts.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
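The TCX-then-legacy fallback shape can be modeled as a try-and-degrade attach. Note this is purely structural: the functions below are hypothetical stand-ins, and aya's real attach API (and its error types) differs.

```rust
#[derive(Debug, PartialEq)]
enum Attachment {
    Tcx { first: bool },
    LegacyTc { priority: u16 },
}

/// Hypothetical TCX attach with first-link ordering; only succeeds on
/// kernels >= 6.6, which is when TCX became available.
fn attach_tcx_first(kernel: (u32, u32)) -> Result<Attachment, &'static str> {
    if kernel >= (6, 6) {
        Ok(Attachment::Tcx { first: true })
    } else {
        Err("tcx unsupported on this kernel")
    }
}

/// Prefer TCX so Retina's classifiers run before any other programs on
/// the interface; fall back to legacy TC with priority 1 otherwise.
fn attach_classifier(kernel: (u32, u32)) -> Attachment {
    attach_tcx_first(kernel).unwrap_or(Attachment::LegacyTc { priority: 1 })
}
```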
Add build.rs scripts to both agent and operator that capture git commit
SHA, git tag (or "dev"), and rustc version at compile time. This enables
--version output and logs version info at startup.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
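A build.rs that captures git state at compile time typically shells out to git and falls back gracefully when the information is unavailable (e.g. building from a source tarball). A minimal sketch of that pattern; the env variable names are illustrative:

```rust
use std::process::Command;

/// Run a command and capture trimmed stdout, or None on any failure
/// (binary missing, non-zero exit, empty or non-UTF-8 output).
fn cmd_output(program: &str, args: &[&str]) -> Option<String> {
    let out = Command::new(program).args(args).output().ok()?;
    if !out.status.success() {
        return None;
    }
    let s = String::from_utf8(out.stdout).ok()?;
    let s = s.trim().to_string();
    if s.is_empty() { None } else { Some(s) }
}

/// In a real build.rs these would be emitted as
/// `println!("cargo:rustc-env=RETINA_GIT_SHA={sha}")` and read back in
/// the binary with `env!("RETINA_GIT_SHA")` for --version output.
fn version_info() -> (String, String) {
    let sha = cmd_output("git", &["rev-parse", "--short", "HEAD"])
        .unwrap_or_else(|| "unknown".to_string());
    let tag = cmd_output("git", &["describe", "--tags", "--exact-match"])
        .unwrap_or_else(|| "dev".to_string());
    (sha, tag)
}
```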
…ring

Add a new dropreason plugin that attaches fexit probes to 5 kernel
functions (nf_hook_slow, nf_nat_inet_fn, __nf_conntrack_confirm,
tcp_v4_connect, inet_csk_accept) to capture packet drop events with
full 5-tuple metadata.

Key design decisions:
- Runtime BTF offset resolution: kernel struct field offsets are
  resolved from /sys/kernel/btf/vmlinux at startup via btf-rs and
  passed to eBPF through a BPF array map, eliminating hardcoded
  offsets that break across kernel versions.
- inet_csk_accept EAGAIN filtering: non-blocking accept() with empty
  queue returns -EAGAIN, which is normal behavior. The probe reads the
  kernel *err pointer and skips EAGAIN to avoid false positives.
- tcp_v4_connect uaddr fallback: for early connect failures where the
  sock doesn't have addresses yet, reads the destination from the
  sockaddr_in argument (stable UAPI layout).
- Human-readable errno in flow summary: maps kernel return codes to
  names (EHOSTUNREACH, EACCES, etc.) in Hubble flow extensions.
- Auto-selects ring buffer (kernel >= 5.8) or perf event array.
- Enabled via --enable-dropreason flag, with Helm chart support.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
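Two of the design points above (human-readable errno names and EAGAIN filtering) reduce to small pure functions in userspace. A sketch covering only an illustrative subset of codes; the values are the standard Linux errno numbers:

```rust
/// Map a negative kernel return code to an errno name for the flow
/// summary. Illustrative subset of the codes the plugin reports.
fn errno_name(ret: i32) -> &'static str {
    match -ret {
        11 => "EAGAIN",
        13 => "EACCES",
        101 => "ENETUNREACH",
        110 => "ETIMEDOUT",
        111 => "ECONNREFUSED",
        113 => "EHOSTUNREACH",
        _ => "UNKNOWN",
    }
}

/// inet_csk_accept returning -EAGAIN from a non-blocking accept() on an
/// empty queue is normal operation, not a drop: filter it out to avoid
/// false positives.
fn is_real_drop(ret: i32) -> bool {
    ret < 0 && -ret != 11 // 11 == EAGAIN
}
```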
…drops

When tcp_v4_connect fails at route lookup (EHOSTUNREACH, ENETUNREACH),
the kernel hasn't assigned a source IP yet, so the drop event shows
src=0.0.0.0 which can't be enriched with K8s metadata.

Fix by capturing the calling process's PID via bpf_get_current_pid_tgid()
in eBPF, then resolving the pod's IP in userspace by reading
/proc/{pid}/net/fib_trie to find the LOCAL addresses in the process's
network namespace. This correctly identifies both regular pods (pod IP)
and hostNetwork pods (node IP).

Also adds hostPID:true to the agent DaemonSet so the agent can access
/proc entries for all processes on the node.

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
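The fib_trie lookup can be sketched as a small text parser: in the `/proc/{pid}/net/fib_trie` dump, a leaf line like `|-- 10.244.0.5` is followed by an indented `/32 host LOCAL` line when that address is local to the process's network namespace. This assumes the usual fib_trie layout; the function name is illustrative:

```rust
/// Extract LOCAL host addresses from the text of /proc/{pid}/net/fib_trie.
fn local_addrs(fib_trie: &str) -> Vec<String> {
    let mut addrs = Vec::new();
    let mut last_leaf: Option<&str> = None;
    for line in fib_trie.lines() {
        let t = line.trim();
        if let Some(addr) = t.strip_prefix("|-- ") {
            // Remember the most recent leaf; the qualifier follows it.
            last_leaf = Some(addr);
        } else if t.starts_with("/32 host LOCAL") {
            if let Some(addr) = last_leaf.take() {
                let addr = addr.to_string();
                if !addrs.contains(&addr) {
                    addrs.push(addr);
                }
            }
        }
    }
    addrs
}
```

In the agent this would run against the file for the PID captured in eBPF, yielding the pod IP for regular pods and the node IP for hostNetwork pods, as described above.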
… workspace hardening

Enrich flows with Kubernetes service metadata so Hubble tooling can
filter and display service-level traffic:

- Populate source_service/destination_service on flows when the IP
  resolves to a ClusterIP or LoadBalancer IP in the operator's cache
- Add source_service/destination_service matching in FlowFilterSet so
  hubble observe --to-service and --from-service work end-to-end
- Set is_reply=false on drop flows so the Hubble UI's default
  reply:[false] whitelist no longer silently filters out all drops

Additional changes in this batch:

- Enable clippy pedantic+nursery workspace-wide with targeted exceptions
- Rewrite README with architecture overview and full command reference
- Add per-flow drop metrics (source/dest pod context) alongside the
  existing per-reason aggregate counters
- Expand dropreason eBPF plugin: BTF-based kernel drop reason enum
  resolution, fexit hooks for nf_hook_slow/tcp_v4_connect/inet_csk_accept,
  tracepoints for tcp_retransmit/tcp_send_reset/tcp_receive_reset
- Add ebpf::poll_readable helper (epoll-based) shared across plugins
- Operator debug endpoint improvements and health probes
- Helm chart: configurable resource limits, suppress list, sampling rate
- Use serde_yaml for YAML config parsing in the agent

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
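The service enrichment step above amounts to a lookup of each flow endpoint IP in a service-IP map fed by the operator. A std-only sketch; the type and field names are illustrative, not the PR's actual flow types:

```rust
use std::collections::HashMap;

#[derive(Clone, PartialEq, Debug)]
struct ServiceRef {
    namespace: String,
    name: String,
}

/// ClusterIP / LoadBalancer IP -> service, maintained by the operator.
type ServiceCache = HashMap<String, ServiceRef>;

struct EnrichedFlow {
    src_ip: String,
    dst_ip: String,
    source_service: Option<ServiceRef>,
    destination_service: Option<ServiceRef>,
}

/// Populate source/destination service when an endpoint IP resolves to
/// a known service IP, so `hubble observe --to-service` and
/// `--from-service` can match the flow.
fn enrich_services(flow: &mut EnrichedFlow, cache: &ServiceCache) {
    flow.source_service = cache.get(&flow.src_ip).cloned();
    flow.destination_service = cache.get(&flow.dst_ip).cloned();
}
```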
Remove the custom retry_with_backoff helper in retina-core and use the
backon crate's Retryable trait directly at each call site. This
eliminates the internal retry module while gaining jitter support to
avoid thundering-herd reconnection storms.

Each caller now uses an outer loop (for stream-ended reconnects) with
backon's ExponentialBuilder for error backoff (1s-60s with jitter).

Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
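The delay schedule backon's ExponentialBuilder produces (1s-60s with jitter) can be sketched without the crate. To keep the sketch deterministic, jitter is taken here as a caller-supplied fraction rather than drawn randomly, which is what backon actually does:

```rust
use std::time::Duration;

/// Exponential backoff for attempt n: min(1s * 2^n, 60s), then scaled
/// by (1 + jitter) with jitter in [0, 1), clamped back to the max.
/// backon draws the jitter fraction randomly; here it is a parameter.
fn backoff_delay(attempt: u32, jitter: f64) -> Duration {
    let base = Duration::from_secs(1);
    let max = Duration::from_secs(60);
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    let exp = base.checked_mul(factor).unwrap_or(max).min(max);
    exp.mul_f64(1.0 + jitter).min(max)
}
```

Each call site then wraps this in an outer loop: the inner retry handles transport errors with this backoff, while the outer loop handles streams that end cleanly and need a fresh reconnect.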
@nddq nddq changed the title from "[experimental] Retina in Rust" to "feat(experimental): Retina in Rust" on Feb 25, 2026
@github-actions

Retina Code Coverage Report

Total coverage no change
