Replies: 2 comments
- Built a proof-of-concept validating the context propagation problem and the proposed fix: https://gist.github.com/jsxs0/e07793c4f62b746f3f7e14aa5a9f86d4. Key finding:
- Prototyped the capture/restore mechanism with benchmarks: https://gist.github.com/jsxs0/920ee6463a422faa50707744c18dac76. Results:
Making Rage Fully Observable: Metrics & Context Propagation
Google Summer of Code 2026 Proposal | Ruby Organization | Mentored by @cuneyter and @daz-codes
1. Abstract
Rage collapses HTTP APIs, background jobs, WebSockets, and domain events into a single fiber-based process — but today, the moment work crosses a fiber boundary, request context vanishes and the runtime becomes a black box. This proposal closes both gaps. First, it adds automatic ActiveSupport::CurrentAttributes propagation across Rage::Deferred task boundaries by extending the existing Context capture/restore system. Second, it builds a metrics layer into Rage::Telemetry — event loop lag, fiber utilization, queue depth, and connection pressure — exportable via OpenTelemetry. The result: Rage applications become fully observable in production without bolting on external instrumentation.
2. Problem Statement
2.1 The Vanishing Context
Every Rage request begins with fiber-local state. At lib/rage/log_processor.rb:107-109, three keys are set.
When a developer uses ActiveSupport::CurrentAttributes (e.g., Current.user = authenticated_user), those values also live in fiber-local storage — because Rage explicitly configures this at lib/rage/ext/setup.rb.
The problem appears at lib/rage/deferred/queue.rb:38:
Fiber.schedule creates a fresh fiber. Every fiber-local variable — including all CurrentAttributes — is wiped clean. The context parameter partially restores logger tags via Rage::Deferred::Context.build (context.rb:9-19), but this only captures two specific keys: the logger tags and the logger context.
Current.user, Current.request_id, Current.tenant — all gone. The same pattern repeats in SSE (lib/rage/sse/application.rb:19,37) and in the FiberScheduler itself (lib/rage/fiber_scheduler.rb:119-138): parent fiber context never propagates to child fibers.
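The loss is easy to reproduce outside Rage with plain Ruby: fiber-local variables (reads and writes through Thread.current[...] are scoped per fiber) set in a parent fiber are simply absent in a freshly created child fiber. A minimal, Rage-independent demonstration:

```ruby
# Fiber-local storage (Thread.current[...]) is scoped to each fiber and is
# NOT inherited by fibers created afterwards — the vanishing-context problem.
Thread.current[:request_id] = "req-123" # set in the "request" fiber

child = Fiber.new do
  Thread.current[:request_id] # read back in the "task" fiber
end
result = child.resume

puts Thread.current[:request_id].inspect # "req-123" — the parent still sees it
puts result.inspect                      # nil — wiped in the child fiber
```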
2.2 Why This Matters
2.3 The Invisible Engine
Rage::Telemetry (added in v1.20.0) provides 13 event-based spans covering fiber dispatch, controller actions, WebSocket lifecycle, deferred tasks, events, and SSE streaming. These answer "what happened" — but not "how is the system feeling."
What's completely missing:
The ecosystem includes a dedicated opentelemetry-instrumentation-rage gem (rage-rb/opentelemetry-instrumentation-rage) that delivers span-level observability. Yet essential runtime metrics are exported neither through this integration nor through any other channel.
3. Proposed Solution
3.1 Architecture Overview
3.2 Context Propagation: Fiber Storage Approach
To avoid hardcoding ActiveSupport::CurrentAttributes logic directly into Context.build (which would tightly couple Rage internals to an external dependency), this proposal encapsulates the state behind the fiber boundary, reusing the mechanism already in place for logger tag propagation.
During the boot sequence, Rage would intercept ActiveSupport::CurrentAttributes so that attribute writes are mirrored into a dedicated fiber-storage key (e.g., Fiber[:__rage_current_attributes]). This lets Context.build capture the state with near-zero overhead, treating CurrentAttributes identically to logger tags and avoiding subclass reflection at capture time:
```ruby
# lib/rage/deferred/context.rb — capture is architecturally streamlined
Fiber[:__rage_logger_tags],        # [4] logger tags
Fiber[:__rage_logger_context],     # [5] logger context
nil,                               # [6] user context
Fiber[:__rage_current_attributes], # [7] CurrentAttributes (mirrored pattern)
```
Restoration is equally direct—requiring only a simple assignment of the fiber-local key within the new fiber. The ActiveSupport::CurrentAttributes.subclasses collection is memoized at boot since it remains static after system initialization.
The specific patching mechanism will be rigorously prototyped during the community bonding period, evaluating two primary strategies:
Both candidates ensure that CurrentAttributes data persists across fiber boundaries and propagates through the same high-performance path as existing logger metadata.
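The mirroring idea can be sketched in plain Ruby. This is illustrative only: a stand-in `Current` class replaces ActiveSupport::CurrentAttributes, `Thread.current` stands in for the proposed Fiber[:__rage_current_attributes] key, and none of these names are existing Rage API:

```ruby
# Stand-in for ActiveSupport::CurrentAttributes; every write is mirrored into
# fiber-local storage so a generic capture/restore path can carry it along.
MIRROR_KEY = :__rage_current_attributes # illustrative key name

class Current
  def self.user=(value)
    (Thread.current[MIRROR_KEY] ||= {})[:user] = value # mirrored write
  end

  def self.user
    (Thread.current[MIRROR_KEY] || {})[:user]
  end
end

Current.user = "alice"

# Capture in the parent fiber: a plain Hash copy, no subclass reflection.
captured = Thread.current[MIRROR_KEY].dup

restored_user = Fiber.new do
  Thread.current[MIRROR_KEY] = captured # restore is a single assignment
  Current.user
end.resume

puts restored_user # "alice"
```

The key property is that capture and restore never need to know which CurrentAttributes subclasses exist: they move one opaque value, exactly like the logger tag keys.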
3.3 Decoupling Observability: Rage::Telemetry Raw Data & Instrumentation Tooling
Following the separation of concerns between Rage core and instrumentation packages, this proposal splits the metrics work across two repositories:
This keeps Rage core instrumentation-agnostic. A Datadog or Prometheus instrumentation gem could use the same raw data interface without any changes to Rage itself.
3.4 Runtime Metrics: What to Measure
Event Loop Lag — the most critical metric:
Fiber Utilization — hooked into existing core.fiber.dispatch and core.fiber.spawn spans:
Deferred Queue Depth — exposes existing @backlog_size from queue.rb:13:
Connection Count and GC Pressure — additional gauges using Iodine.run_every probes.
Socket Backlog — the volume of concurrent connections awaiting acceptance. This metric is exposed via Rage::Telemetry as a raw accessor to be consumed by downstream instrumentation packages.
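The loop-lag measurement reduces to "how late did a scheduled tick actually fire." A minimal sketch under stated assumptions — LoopLagProber is a name proposed here, not an existing class, and a plain sleep loop stands in for Iodine.run_every:

```ruby
# Sketch of the proposed LoopLagProber: schedule a tick every `interval`
# seconds and record how late each tick fires past its deadline. In Rage the
# timer would be Iodine.run_every; a sleep loop stands in here.
class LoopLagProber
  attr_reader :samples

  def initialize(interval)
    @interval = interval
    @samples  = []
  end

  def run(ticks)
    deadline = now + @interval
    ticks.times do
      sleep @interval
      @samples << [now - deadline, 0.0].max # lag: lateness past the deadline
      deadline = now + @interval
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

prober = LoopLagProber.new(0.02)
prober.run(3)
puts prober.samples.length                    # 3
puts prober.samples.all? { |lag| lag >= 0.0 } # true
```

Under a healthy event loop the samples sit near zero; a fiber that blocks the reactor shows up immediately as a spike.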
3.5 OpenTelemetry Export
The opentelemetry-instrumentation-rage gem (rage-rb/opentelemetry-instrumentation-rage) consumes Rage::Telemetry’s raw accessors and tools to construct OTel instruments.
It uses the LoopLagProber to emit a Histogram for event loop lag, and reads Rage::Telemetry.raw[:active_fibers], raw[:deferred_queue_depth], and raw[:socket_backlog] to emit native Gauges.
This architecture keeps the Rage core entirely instrumentation-agnostic, with the gem acting as the observability bridge, consistent with the patterns used by Rails, Sidekiq, and Puma.
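The raw-accessor surface the instrumentation gem would consume can be sketched as a thread-safe registry of probes. The module and method names below follow this proposal's Rage::Telemetry.raw naming but are illustrative, not existing Rage API:

```ruby
# Illustrative sketch of the proposed raw-metrics registry. Probes are
# registered once at boot; Telemetry.raw evaluates them on demand, so the
# core ships no exporter code and stays instrumentation-agnostic.
module Telemetry
  @probes = {}
  @mutex  = Mutex.new

  class << self
    def register(name, &probe)
      @mutex.synchronize { @probes[name] = probe }
    end

    # Snapshot every registered probe: { metric_name => current_value }
    def raw
      @mutex.synchronize { @probes.transform_values(&:call) }
    end
  end
end

backlog = [1, 2, 3] # stand-in for the Deferred queue's @backlog_size source
Telemetry.register(:deferred_queue_depth) { backlog.size }

puts Telemetry.raw[:deferred_queue_depth] # 3
```

A downstream gem (OTel, Datadog, Prometheus) then polls `raw` on its own schedule and maps each key to whatever instrument type it prefers.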
4. Project Deliverables
5. Timeline / Milestones
Community Bonding (before coding starts)
Phase 1: Context Propagation (Weeks 1-3, ~45 hours)
Week 1: Extend Context.build to capture CurrentAttributes. Write failing tests first: enqueue a task with Current.user = X, assert Current.user == X inside the task.
Week 2: Apply same fix to SSE::Application and FiberScheduler. Handle edge cases: serialization of complex attribute values, nested CurrentAttributes subclasses, classes with reset_callbacks.
Week 3: Integration tests across all three fiber boundaries. Ensure existing logger tag propagation still works (zero regressions). Handle the case where ActiveSupport is not loaded (graceful no-op).
Milestone: CurrentAttributes automatically propagate across all fiber boundaries. All existing tests pass.
Phase 2: Metrics Engine (Weeks 4-6, ~45 hours)
Week 4: Implement raw metric accessors and LoopLagProber tool in Rage::Telemetry. Unit tests for each accessor. Ensure LoopLagProber correctly wraps Iodine.run_every with accurate lag calculation.
Week 5: Build event loop lag probe using Iodine.run_every. Build fiber utilization gauge hooked into FiberWrapper. Build queue depth gauge from @backlog_size.
Week 6: Build connection pressure gauge. Build GC pressure probe. Integration test: start a Rage app, make requests, verify metrics update correctly.
Milestone: All five runtime metrics measured and accessible via Rage::Telemetry.raw.
Midterm Evaluation
Context propagation complete. Metrics engine functional with five runtime metrics.
Phase 3: OpenTelemetry Export (Weeks 7-9, ~45 hours)
Week 7: Implement OTelExporter bridging Rage metrics to opentelemetry-ruby SDK. Test with OTel Collector.
Week 8: Build Grafana dashboard template with panels for loop lag, fiber utilization, queue depth, connections, GC pressure. Write configuration guide.
Week 9: Buffer week for integration issues, mentor feedback, and edge cases (high-cardinality labels, metric reset on process fork).
Milestone: Metrics exportable to Prometheus/Grafana via OpenTelemetry. Dashboard template ready.
Phase 4: Documentation & Hardening (Weeks 10-12, ~40 hours)
Week 10: YARD documentation for all public APIs. Usage guide with examples: "How to monitor Rage in production."
Week 11: Benchmark impact of context propagation (overhead per task enqueue). Benchmark metric collection (overhead per request). Ensure < 1% performance impact.
Week 12: Final cleanup, PR review incorporation, submission.
Milestone: All deliverables complete. PRs ready for merge.
6. Related Work
Rails CurrentAttributes Propagation
Rails' Active Job automatically propagates CurrentAttributes via serialization into the job payload. Sidekiq implements this via middleware. Rage's approach is more direct: since tasks run in-process (same Ruby VM), we can capture live object references rather than serializing/deserializing — making propagation both faster and more complete.
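The difference is observable in plain Ruby: an in-process capture keeps the exact same object (identity preserved, nothing lost), while the serialization route fails outright for values like open IO handles. A small demonstration with illustrative names:

```ruby
# In-process capture vs. serialization: a struct holding an open pipe.
User = Struct.new(:id, :pipe)
reader, _writer = IO.pipe
user = User.new(42, reader)

captured = user            # live reference — same object, nothing copied
puts captured.equal?(user) # true

begin
  Marshal.dump(user)       # the round-trip a serializing job queue relies on
rescue TypeError => e
  puts "cannot serialize: #{e.class}" # "cannot serialize: TypeError"
end
```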
Node.js AsyncLocalStorage
Node.js solved the same problem with AsyncLocalStorage (added in v12.17.0), which automatically propagates context across async boundaries. Rage's fiber-local storage serves the same purpose, but lacks the automatic propagation that AsyncLocalStorage provides. This project adds that missing piece.
Puma Metrics
Puma exposes thread pool stats, backlog, and request queue depth via Puma.stats. Rage needs equivalent visibility for its fiber-based model. The key difference: Puma's thread pool is bounded and well-understood; Rage's fiber pool is dynamic, making metrics even more critical for capacity planning.
7. Risk Mitigation
8. Previous Contributions
Rage Framework
Proof of Concept
Ruby Core (ruby/csv)
Conference Speaking
9. About Me
I am a software engineer based in Japan building production Server-Sent Events systems for industrial IoT equipment monitoring. My daily work involves exactly the observability challenges this project addresses: tracking request context across async boundaries, measuring event loop health under load, and correlating background processing with the HTTP requests that triggered it.
I chose this project because I have firsthand production experience with the pain of losing context across fiber boundaries. In my IoT monitoring systems, a sensor alert that triggers a background notification must carry the device ID, tenant, and request trace through every async hop — or the notification is useless. Rage's Deferred system has the same gap, and I know exactly how to fix it because I have already read the code where context is captured (context.rb), where it is lost (queue.rb:38), and where it is partially restored (task.rb:62).
GitHub: jsxs0
10. Availability
11. Communication Plan
Google Doc with commenting enabled
@cuneyter @daz-codes — happy to adjust scope or approach based on your guidance!