…bserver
Introduce an experimental Rust implementation of the Retina agent and operator under experimental/. This is a from-scratch rewrite using aya-rs for eBPF and tonic for gRPC, targeting feature parity with the Go Hubble control plane.
Components:
- retina-agent: DaemonSet binary with eBPF packet capture, Hubble gRPC observer (GetFlows), Prometheus metrics, health probes, and pprof debug endpoints
- retina-operator: watches Pods/Services/Nodes, streams IP cache updates to agents via gRPC
- retina-core: shared library with Plugin trait, flow enrichment, IP cache, metrics, and flow store
- retina-proto: Hubble-compatible protobuf definitions
- packetparser plugin: eBPF TC classifiers for host and pod-level packet capture with conntrack
Key features:
- Cilium-compatible numeric identities via label hashing (range 256-65535) with reserved identities for Host, World, RemoteNode, and KubeAPIServer
- Flow Summary field with TCP flags and UDP protocol display
- Pod CIDR gateway IP registration for correct host identity
- Distroless container images for both agent and operator
- xtask build system with kind cluster deployment support
Project layout follows cmd/crates/plugins convention.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
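To illustrate the label-hashing idea above, here is a minimal sketch that maps a label set into the 256-65535 range. The hash function (FNV-1a), the `;` separator, and the name `identity_from_labels` are assumptions made for this example; the commit message does not say which hash retina-core uses, and reserved identities are not modeled here.

```rust
use std::collections::BTreeSet;

/// Bounds of the identity range cited in the commit message.
const MIN_IDENTITY: u32 = 256;
const MAX_IDENTITY: u32 = 65_535;

/// 64-bit FNV-1a. Chosen for this sketch only because it is deterministic
/// across processes; the real implementation may use a different hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Hypothetical helper: derive a numeric identity from a set of labels.
fn identity_from_labels(labels: &[&str]) -> u32 {
    // Sort and deduplicate so that label order does not change the result.
    let sorted: BTreeSet<&str> = labels.iter().copied().collect();
    let joined = sorted.into_iter().collect::<Vec<_>>().join(";");

    // Fold the hash into the inclusive range [MIN_IDENTITY, MAX_IDENTITY].
    let span = u64::from(MAX_IDENTITY - MIN_IDENTITY + 1);
    MIN_IDENTITY + (fnv1a(joined.as_bytes()) % span) as u32
}
```

Because the range holds only about 65k values, two different label sets can collide under any scheme of this shape; the sketch makes no attempt to detect that.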
Implement all remaining Hubble Observer RPCs and enhance existing ones:
- GetFlows: add flow filtering engine (whitelist/blacklist with IP, pod, label, verdict, protocol, port, TCP flags, node_name glob, identity), since/until time-range, first flag, node_name in responses
- ServerStatus: add version, flows_rate, num_connected/unavailable_nodes
- GetNodes: return single node with name, version, state, uptime, stats
- GetNamespaces: extract unique namespaces from flow store
- GetAgentEvents: full implementation with follow + historical modes, emit AGENT_STARTED and IPCACHE_UPSERTED/DELETED events
- GetDebugEvents: return empty stream instead of Unimplemented
Also fix eBPF verifier rejection by using zeroed struct init and correct PacketEvent padding, and fix LinkAttribute::NetNsId rename in newer netlink-packet-route.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
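The whitelist/blacklist evaluation in GetFlows can be sketched as follows. This assumes the usual Hubble semantics (fields inside one filter are ANDed, a flow passes if any whitelist entry matches or the whitelist is empty, and any blacklist match rejects it). The `Flow` and `FlowFilter` types are simplified stand-ins with only three fields, not the types from retina-core or the Hubble protobufs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Forwarded,
    Dropped,
}

/// Simplified flow record for illustration only.
#[derive(Debug, Clone)]
struct Flow {
    source_pod: String,
    destination_port: u16,
    verdict: Verdict,
}

/// Simplified filter: every field that is set must match (logical AND).
#[derive(Debug, Clone, Default)]
struct FlowFilter {
    source_pod: Option<String>,
    destination_port: Option<u16>,
    verdict: Option<Verdict>,
}

impl FlowFilter {
    fn matches(&self, flow: &Flow) -> bool {
        // An unset field places no constraint on the flow.
        self.source_pod
            .as_deref()
            .map_or(true, |pod| pod == flow.source_pod.as_str())
            && self
                .destination_port
                .map_or(true, |port| port == flow.destination_port)
            && self.verdict.map_or(true, |verdict| verdict == flow.verdict)
    }
}

/// A flow is returned when it matches at least one whitelist filter
/// (or the whitelist is empty) and matches no blacklist filter.
fn is_allowed(flow: &Flow, whitelist: &[FlowFilter], blacklist: &[FlowFilter]) -> bool {
    let included = whitelist.is_empty() || whitelist.iter().any(|f| f.matches(flow));
    let excluded = blacklist.iter().any(|f| f.matches(flow));
    included && !excluded
}
```

Keeping `matches` per filter and the set logic in a separate function means new fields (labels, TCP flags, node_name globs) only touch the former.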
Replace raw deploy YAMLs with a Helm chart under deploy/ with templated values for images, ports, and feature gates (operator, hubble relay, hubble UI). Update xtask deploy commands to use helm upgrade --install with --reuse-values and --wait instead of kubectl apply + set image + rollout restart.
Add Hubble Peer gRPC service (peer.proto) so relay can discover agent nodes. Implement Peer.Notify in the agent's gRPC server to stream the node list from the operator's IP cache.
Add debug HTTP endpoints: /debug/ipcache on the agent, and a full debug server on the operator (ipcache dump, service/node state, pod watcher state). Extend IP cache with dump() and service/node tracking in the operator's watchers.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
… and gRPC resilience
- Add dual eBPF event delivery: BPF ring buffer (kernel >= 5.8) with automatic fallback to perf event array on older kernels. The eBPF crate uses a cfg feature flag to compile both variants; xtask builds both and the loader auto-selects at runtime via kernel version check. Ring buffer size is configurable via --ring-buffer-size CLI flag.
- Port packet sampling from the Go agent: probabilistic 1/N sampling using bpf_get_prandom_u32(), controlled via RETINA_CONFIG BPF array map and --sampling-rate CLI flag. Control flags (SYN/FIN/RST), timeouts, and periodic reports always fire regardless of sampling.
- Add Criterion benchmark suite (8 targets) for retina-core: flow, enricher, ipcache, filter, metrics, store, pipeline, and contention benchmarks with shared test data factories.
- Improve gRPC resilience: HTTP/2 keepalive on both client (agent) and server (operator), transient error classification in retry loop (WARN + reset backoff instead of ERROR + exponential backoff), and graceful shutdown via serve_with_shutdown so operator drains streams cleanly on SIGTERM instead of crashing agent connections.
- Refactor core crates: consolidate AgentEventStore into generic store, improve IpCache identity hashing, add FlowFilterSet, and expose select internals for benchmark access.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
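The sampling rule described above can be sketched in plain userspace Rust. This is an illustration of the decision logic only, not the eBPF program: the `random` argument stands in for the value of bpf_get_prandom_u32(), the modulo comparison is one common way to get a 1/N probability and is assumed here, and the timeout and periodic-report paths are left out.

```rust
// TCP flag bits as laid out in the TCP header.
const TCP_FIN: u8 = 0x01;
const TCP_SYN: u8 = 0x02;
const TCP_RST: u8 = 0x04;

/// Decide whether a packet event should be reported.
///
/// `sampling_rate` is N in "1/N"; values of 0 or 1 are treated as
/// "sampling disabled" in this sketch, which is an assumption.
fn should_report(tcp_flags: u8, sampling_rate: u32, random: u32) -> bool {
    // Connection control packets are always reported, regardless of sampling.
    if tcp_flags & (TCP_SYN | TCP_FIN | TCP_RST) != 0 {
        return true;
    }
    // With sampling disabled, every packet is reported.
    if sampling_rate <= 1 {
        return true;
    }
    // Otherwise report roughly one packet out of every `sampling_rate`.
    random % sampling_rate == 0
}
```

Checking the control flags before the random draw keeps connection setup and teardown visible even at aggressive sampling rates.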
…cher
- Batch IP cache snapshot into single gRPC frame (IpCacheBatch proto message) instead of sending N individual IpCacheUpdate messages
- Add gzip compression for operator-agent gRPC streams
- Add change detection to skip redundant upserts when pod metadata hasn't changed (tracked via upserts_skipped stat)
- Fix stale connection leak by detecting client disconnect via tx.closed() in tokio::select alongside broadcast recv
- Add graceful operator shutdown: sends SHUTDOWN marker so agents preserve their cache during rolling updates instead of clearing it
- Add operator debug endpoints (/debug/ipcache, /debug/stats) for live introspection of cache state and connection metrics
- Wait for IP cache sync before attaching veth eBPF programs so the enricher has identity data from the first packet
- Resolve pod names via neighbor table + IP cache lookup and log them alongside veth attachments for easier debugging
- Add retry with exponential backoff for agent IP cache sync connection
- Remove dead code: IDENTITY_UNKNOWN, resolve_identity(), IpCache::is_empty(), RingBuffer::is_empty(), OperatorState::len(), ETH_P_IP constant
- Add Criterion benchmark suite (8 targets) with shared test helpers
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
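The change-detection item above amounts to comparing incoming metadata with the cached entry before writing or broadcasting. A minimal sketch, assuming a hypothetical `PodMeta` struct and `Cache` type (only the `upserts_skipped` counter name comes from the commit message):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Hypothetical pod metadata; the real cache entry likely carries more fields.
#[derive(Debug, Clone, PartialEq, Eq)]
struct PodMeta {
    namespace: String,
    name: String,
    labels: Vec<String>,
}

#[derive(Debug, Default)]
struct Cache {
    entries: HashMap<IpAddr, PodMeta>,
    /// Number of upserts dropped because nothing changed.
    upserts_skipped: u64,
}

impl Cache {
    /// Insert or update an entry. Returns `true` when the cache changed and
    /// the update should be streamed to agents, `false` when it was redundant.
    fn upsert(&mut self, ip: IpAddr, meta: PodMeta) -> bool {
        if self.entries.get(&ip) == Some(&meta) {
            self.upserts_skipped += 1;
            return false;
        }
        self.entries.insert(ip, meta);
        true
    }
}
```

A caller would only publish to the broadcast channel when `upsert` returns `true`, which is what keeps periodic resyncs from flooding agents with identical updates.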
- Use Arc::clone(&x) instead of x.clone() for Arc types to distinguish
refcount bumps from deep clones
- Replace wildcard imports (use retina_common::*) with explicit named
imports in flow.rs and bench_helpers.rs
- Convert 19 verbose match-with-early-return patterns to let-else
- Use is_some_and for Option matching with predicates in filter.rs
- Use if-let-else instead of 2-arm match on Option in watcher.rs
- Rename DebugState.state to .operator to eliminate state.state stutter
- Normalize .unwrap() to .expect("lock poisoned") on all lock
acquisitions for consistent, descriptive panic messages
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
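For readers unfamiliar with the idioms named in this commit, the following standalone sketch shows each one. The functions are hypothetical examples written for this note and are not taken from flow.rs, filter.rs, or watcher.rs.

```rust
use std::sync::{Arc, Mutex};

/// let-else: bind on the happy path, return early otherwise.
/// Replaces `match x { Some(v) => v, None => return None }`.
fn parse_port(endpoint: &str) -> Option<u16> {
    let Some((_host, port)) = endpoint.rsplit_once(':') else {
        return None;
    };
    port.parse().ok()
}

/// is_some_and: test an Option against a predicate without a match.
fn is_privileged(port: Option<u16>) -> bool {
    port.is_some_and(|p| p < 1024)
}

/// Arc::clone(&x) makes it obvious that only the refcount is bumped.
fn share(config: &Arc<String>) -> Arc<String> {
    Arc::clone(config)
}

/// expect("lock poisoned") gives a descriptive panic message if another
/// thread panicked while holding the lock.
fn increment(counter: &Mutex<u64>) -> u64 {
    let mut guard = counter.lock().expect("lock poisoned");
    *guard += 1;
    *guard
}
```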
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
…stence
Prefer TCX with LinkOrder::first() (kernel >= 6.6) so Retina's TC classifiers run before any other programs on the interface. Falls back to legacy TC with priority 1 on older kernels.
All eBPF programs already return TC_ACT_UNSPEC, which maps to TCX_NEXT in TCX and continue in legacy cls_bpf, so Retina passively observes packets without affecting subsequent program verdicts.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
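The TCX-or-legacy choice can be illustrated with a small version gate. This is a sketch under assumptions: the `AttachMode` enum and both function names are hypothetical, and the release string is passed in by the caller (for example from `uname -r`). The actual loader may detect TCX support differently, for instance by attempting the attach and falling back on error.

```rust
/// Hypothetical description of how a TC program would be attached.
#[derive(Debug, PartialEq, Eq)]
enum AttachMode {
    /// TCX link, ordered first on the interface.
    Tcx,
    /// Legacy cls_bpf filter with an explicit priority.
    LegacyTc { priority: u16 },
}

/// Extract (major, minor) from a release string such as "6.8.0-41-generic".
fn parse_kernel_release(release: &str) -> Option<(u32, u32)> {
    // Split on every non-digit character; the first two pieces are
    // the major and minor version numbers.
    let mut parts = release.split(|c: char| !c.is_ascii_digit());
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Pick TCX on kernels >= 6.6, otherwise legacy TC with priority 1.
/// Unparseable release strings conservatively select the legacy path.
fn choose_attach_mode(release: &str) -> AttachMode {
    match parse_kernel_release(release) {
        // Tuples compare lexicographically: major first, then minor.
        Some(version) if version >= (6, 6) => AttachMode::Tcx,
        _ => AttachMode::LegacyTc { priority: 1 },
    }
}
```

The same comparison pattern applies to the ring buffer versus perf event array choice mentioned in other commits, with (5, 8) as the threshold.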
Add build.rs scripts to both agent and operator that capture git commit SHA, git tag (or "dev"), and rustc version at compile time. This enables --version output and logs version info at startup. Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
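A build script of the kind described might look roughly like the sketch below. The environment variable names (`GIT_COMMIT`, `GIT_TAG`, `RUSTC_VERSION`), the "unknown" fallback, and the exact git invocations are assumptions for illustration; only the "dev" fallback for the tag comes from the commit message. In a real build.rs, a `main` function would print each directive returned by `cargo_directives`.

```rust
use std::process::Command;

/// Run a command and return its trimmed stdout, or None on any failure
/// (binary missing, non-zero exit, non-UTF-8 or empty output).
fn command_output(program: &str, args: &[&str]) -> Option<String> {
    let output = Command::new(program).args(args).output().ok()?;
    if !output.status.success() {
        return None;
    }
    let text = String::from_utf8(output.stdout).ok()?;
    let text = text.trim();
    (!text.is_empty()).then(|| text.to_owned())
}

/// Collect build metadata with fallbacks so the build never fails
/// just because git is unavailable.
fn build_metadata() -> Vec<(&'static str, String)> {
    let commit = command_output("git", &["rev-parse", "HEAD"]);
    let tag = command_output("git", &["describe", "--tags", "--exact-match"]);
    let rustc = command_output("rustc", &["--version"]);
    vec![
        ("GIT_COMMIT", commit.unwrap_or_else(|| "unknown".to_owned())),
        ("GIT_TAG", tag.unwrap_or_else(|| "dev".to_owned())),
        ("RUSTC_VERSION", rustc.unwrap_or_else(|| "unknown".to_owned())),
    ]
}

/// Format the metadata as `cargo:rustc-env` directives, which make the
/// values readable with `env!` in the compiled crate.
fn cargo_directives(metadata: &[(&'static str, String)]) -> Vec<String> {
    metadata
        .iter()
        .map(|(key, value)| format!("cargo:rustc-env={key}={value}"))
        .collect()
}
```

The binary can then read a value at compile time with `env!("GIT_COMMIT")` and include it in its --version string and startup log line.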
…ring
Add a new dropreason plugin that attaches fexit probes to 5 kernel functions (nf_hook_slow, nf_nat_inet_fn, __nf_conntrack_confirm, tcp_v4_connect, inet_csk_accept) to capture packet drop events with full 5-tuple metadata.
Key design decisions:
- Runtime BTF offset resolution: kernel struct field offsets are resolved from /sys/kernel/btf/vmlinux at startup via btf-rs and passed to eBPF through a BPF array map, eliminating hardcoded offsets that break across kernel versions.
- inet_csk_accept EAGAIN filtering: non-blocking accept() with empty queue returns -EAGAIN, which is normal behavior. The probe reads the kernel *err pointer and skips EAGAIN to avoid false positives.
- tcp_v4_connect uaddr fallback: for early connect failures where the sock doesn't have addresses yet, reads the destination from the sockaddr_in argument (stable UAPI layout).
- Human-readable errno in flow summary: maps kernel return codes to names (EHOSTUNREACH, EACCES, etc.) in Hubble flow extensions.
- Auto-selects ring buffer (kernel >= 5.8) or perf event array.
- Enabled via --enable-dropreason flag, with Helm chart support.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
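Two of the design decisions above, errno naming and EAGAIN filtering, reduce to small pure functions in userspace terms. The sketch below uses the errno numbers from Linux's asm-generic headers (as used on x86 and arm64; some other architectures number them differently). The function names and the selection of codes are assumptions for this example, and in the plugin itself the EAGAIN check runs inside the eBPF probe rather than in userspace.

```rust
// Linux errno values from the asm-generic headers.
const EAGAIN: i32 = 11;

/// Map a negative kernel return code to its symbolic name.
/// Returns None for non-negative values and for codes not listed here.
fn errno_name(ret: i32) -> Option<&'static str> {
    if ret >= 0 {
        return None;
    }
    // checked_neg avoids overflow on i32::MIN.
    match ret.checked_neg()? {
        1 => Some("EPERM"),
        11 => Some("EAGAIN"),
        13 => Some("EACCES"),
        101 => Some("ENETUNREACH"),
        104 => Some("ECONNRESET"),
        110 => Some("ETIMEDOUT"),
        111 => Some("ECONNREFUSED"),
        113 => Some("EHOSTUNREACH"),
        _ => None,
    }
}

/// An accept() failure counts as a drop unless it is -EAGAIN, which only
/// means a non-blocking socket had nothing queued.
fn is_accept_drop(err: i32) -> bool {
    err < 0 && err != -EAGAIN
}
```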
…drops
When tcp_v4_connect fails at route lookup (EHOSTUNREACH, ENETUNREACH),
the kernel hasn't assigned a source IP yet, so the drop event shows
src=0.0.0.0 which can't be enriched with K8s metadata.
Fix by capturing the calling process's PID via bpf_get_current_pid_tgid()
in eBPF, then resolving the pod's IP in userspace by reading
/proc/{pid}/net/fib_trie to find the LOCAL addresses in the process's
network namespace. This correctly identifies both regular pods (pod IP)
and hostNetwork pods (node IP).
Also adds hostPID:true to the agent DaemonSet so the agent can access
/proc entries for all processes on the node.
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
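The fib_trie lookup described in this commit can be sketched as a small text parser. It assumes the usual /proc/net/fib_trie layout, where each leaf address appears on a `|-- <ip>` line followed by a line such as `/32 host LOCAL`. The function name, the loopback filtering, and the deduplication across the Main and Local tables are choices made for this example, not details confirmed by the commit message.

```rust
use std::net::Ipv4Addr;

/// Extract non-loopback LOCAL IPv4 addresses from fib_trie text, e.g. the
/// contents of /proc/<pid>/net/fib_trie for the process that hit the drop.
fn local_addresses(fib_trie: &str) -> Vec<Ipv4Addr> {
    let mut found = Vec::new();
    // Most recent leaf address seen; the route type follows on a later line.
    let mut candidate: Option<Ipv4Addr> = None;

    for line in fib_trie.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("|-- ") {
            candidate = rest.trim().parse().ok();
        } else if line.starts_with("/32 host LOCAL") {
            if let Some(ip) = candidate {
                // Skip 127.0.0.0/8 and addresses already recorded, since
                // the same entry shows up in both routing tables.
                if !ip.is_loopback() && !found.contains(&ip) {
                    found.push(ip);
                }
            }
        }
    }
    found
}
```

For a regular pod this would typically yield the pod IP, and for a hostNetwork pod the node's addresses, matching the behavior the commit describes.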
… workspace hardening
Enrich flows with Kubernetes service metadata so Hubble tooling can filter and display service-level traffic:
- Populate source_service/destination_service on flows when the IP resolves to a ClusterIP or LoadBalancer IP in the operator's cache
- Add source_service/destination_service matching in FlowFilterSet so hubble observe --to-service and --from-service work end-to-end
- Set is_reply=false on drop flows so the Hubble UI's default reply:[false] whitelist no longer silently filters out all drops
Additional changes in this batch:
- Enable clippy pedantic+nursery workspace-wide with targeted exceptions
- Rewrite README with architecture overview and full command reference
- Add per-flow drop metrics (source/dest pod context) alongside the existing per-reason aggregate counters
- Expand dropreason eBPF plugin: BTF-based kernel drop reason enum resolution, fexit hooks for nf_hook_slow/tcp_v4_connect/inet_csk_accept, tracepoints for tcp_retransmit/tcp_send_reset/tcp_receive_reset
- Add ebpf::poll_readable helper (epoll-based) shared across plugins
- Operator debug endpoint improvements and health probes
- Helm chart: configurable resource limits, suppress list, sampling rate
- Use serde_yaml for YAML config parsing in the agent
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
Remove the custom retry_with_backoff helper in retina-core and use the backon crate's Retryable trait directly at each call site. This eliminates the internal retry module while gaining jitter support to avoid thundering-herd reconnection storms.
Each caller now uses an outer loop (for stream-ended reconnects) with backon's ExponentialBuilder for error backoff (1s-60s with jitter).
Signed-off-by: Quang Nguyen <nguyenquang@microsoft.com>
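To show what "1s-60s with jitter" means in terms of delays, here is a dependency-free sketch of the schedule. It deliberately does not use backon's API: the function name, the doubling factor, and the way jitter is added on top of the capped base delay are all assumptions for illustration, and backon's own schedule may differ in detail.

```rust
use std::time::Duration;

const MIN_DELAY_SECS: u64 = 1;
const MAX_DELAY_SECS: u64 = 60;

/// Compute the delay before retry number `attempt` (0-based).
///
/// `jitter` is a caller-supplied random fraction in [0.0, 1.0); it is a
/// parameter here so the function stays deterministic and testable.
fn backoff_delay(attempt: u32, jitter: f64) -> Duration {
    // Double the delay on each attempt, saturating instead of overflowing.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let base_secs = MIN_DELAY_SECS.saturating_mul(factor).min(MAX_DELAY_SECS);
    let base = Duration::from_secs(base_secs);

    // Add a random extra of up to one base delay so that many agents
    // reconnecting at once do not retry in lockstep.
    base + base.mul_f64(jitter.clamp(0.0, 1.0))
}
```

Without the jitter term, every agent that lost its stream at the same moment (for example during an operator restart) would retry on the same schedule, which is the thundering-herd pattern the commit aims to avoid.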
Retina Code Coverage Report: total coverage, no change.
Description
An experimental rewrite of Retina in Rust for better performance, specifically targeting the advanced metrics path, with added Hubble support.
What's working: