Replies: 2 comments
- Built a proof-of-concept validating the context propagation problem and the proposed fix: https://gist.github.com/jsxs0/e07793c4f62b746f3f7e14aa5a9f86d4. Key finding:
- Prototyped the capture/restore mechanism with benchmarks: https://gist.github.com/jsxs0/920ee6463a422faa50707744c18dac76. Results:
Making Rage Fully Observable: Metrics & Context Propagation
Google Summer of Code 2026 Proposal | Ruby Organization | Mentored by @cuneyter and @daz-codes
1. Abstract
Rage collapses HTTP APIs, background jobs, WebSockets, and domain events into a single fiber-based process — but today, the moment work crosses a fiber boundary, request context vanishes and the runtime becomes a black box. This proposal closes both gaps. First, it adds automatic ActiveSupport::CurrentAttributes propagation across Rage::Deferred task boundaries by extending the existing Context capture/restore system. Second, it builds a metrics layer into Rage::Telemetry — event loop lag, fiber utilization, queue depth, and connection pressure — exportable via OpenTelemetry. The result: Rage applications become fully observable in production without bolting on external instrumentation.
2. Problem Statement
2.1 The Vanishing Context
Every Rage request begins with fiber-local state. At lib/rage/log_processor.rb:107-109, three keys are set.
When a developer uses ActiveSupport::CurrentAttributes (e.g., Current.user = authenticated_user), those values also live in fiber-local storage — because Rage explicitly configures this at lib/rage/ext/setup.rb.
The problem appears at lib/rage/deferred/queue.rb:38:
Fiber.schedule creates a fresh fiber. Every fiber-local variable — including all CurrentAttributes — is wiped clean. The context parameter partially restores logger tags via Rage::Deferred::Context.build (context.rb:9-19), but this only captures two specific keys: the logger tags and the logger context.
Current.user, Current.request_id, Current.tenant — all gone. The same pattern repeats in SSE (lib/rage/sse/application.rb:19,37) and in the FiberScheduler itself (lib/rage/fiber_scheduler.rb:119-138): parent fiber context never propagates to child fibers.
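The loss is easy to reproduce outside Rage with plain Ruby: fiber-local variables (reads and writes through Thread.current[...] are scoped per fiber) set in a parent fiber are simply absent in a freshly created child fiber. A minimal, Rage-independent demonstration:

```ruby
# Fiber-local storage (Thread.current[...]) is scoped to each fiber and is
# NOT inherited by fibers created afterwards — the vanishing-context problem.
Thread.current[:request_id] = "req-123" # set in the "request" fiber

child = Fiber.new do
  Thread.current[:request_id] # read back in the "task" fiber
end
result = child.resume

puts Thread.current[:request_id].inspect # "req-123" — the parent still sees it
puts result.inspect                      # nil — wiped in the child fiber
```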
2.2 Why This Matters
2.3 The Invisible Engine
Rage::Telemetry (added in v1.20.0) provides 13 event-based spans covering fiber dispatch, controller actions, WebSocket lifecycle, deferred tasks, events, and SSE streaming. These answer "what happened" — but not "how is the system feeling."
What's completely missing:
The ecosystem includes a dedicated opentelemetry-instrumentation-rage gem (rage-rb/opentelemetry-instrumentation-rage) that delivers span-level observability. Yet essential runtime metrics are exported neither through this integration nor through any other channel.
3. Proposed Solution
3.1 Architecture Overview
3.2 Context Propagation: Fiber Storage Approach
To avoid hardcoding ActiveSupport::CurrentAttributes logic directly into Context.build (which would tightly couple Rage internals to an external dependency), this proposal encapsulates the state behind the fiber boundary, reusing the mechanism already in place for logger tag propagation.
During the boot sequence, Rage would intercept ActiveSupport::CurrentAttributes so that attribute writes are mirrored into a dedicated fiber-storage key (e.g., Fiber[:__rage_current_attributes]). This lets Context.build capture the state with near-zero overhead, treating CurrentAttributes identically to logger tags and avoiding subclass reflection at capture time:
```ruby
# lib/rage/deferred/context.rb — capture is architecturally streamlined
Fiber[:__rage_logger_tags],        # [4] logger tags
Fiber[:__rage_logger_context],     # [5] logger context
nil,                               # [6] user context
Fiber[:__rage_current_attributes], # [7] CurrentAttributes (mirrored pattern)
```
Restoration is equally direct—requiring only a simple assignment of the fiber-local key within the new fiber. The ActiveSupport::CurrentAttributes.subclasses collection is memoized at boot since it remains static after system initialization.
The specific patching mechanism will be rigorously prototyped during the community bonding period, evaluating two primary strategies:
Both candidates ensure that CurrentAttributes data persists across fiber boundaries and propagates through the same high-performance path as existing logger metadata.
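The mirroring idea can be sketched in plain Ruby. This is illustrative only: a stand-in `Current` class replaces ActiveSupport::CurrentAttributes, `Thread.current` stands in for the proposed Fiber[:__rage_current_attributes] key, and none of these names are existing Rage API:

```ruby
# Stand-in for ActiveSupport::CurrentAttributes; every write is mirrored into
# fiber-local storage so a generic capture/restore path can carry it along.
MIRROR_KEY = :__rage_current_attributes # illustrative key name

class Current
  def self.user=(value)
    (Thread.current[MIRROR_KEY] ||= {})[:user] = value # mirrored write
  end

  def self.user
    (Thread.current[MIRROR_KEY] || {})[:user]
  end
end

Current.user = "alice"

# Capture in the parent fiber: a plain Hash copy, no subclass reflection.
captured = Thread.current[MIRROR_KEY].dup

restored_user = Fiber.new do
  Thread.current[MIRROR_KEY] = captured # restore is a single assignment
  Current.user
end.resume

puts restored_user # "alice"
```

The key property is that capture and restore never need to know which CurrentAttributes subclasses exist: they move one opaque value, exactly like the logger tag keys.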
3.3 Decoupling Observability: Rage::Telemetry Raw Data & Instrumentation Tooling
Following the separation of concerns between Rage core and instrumentation packages, this proposal splits the metrics work across two repositories:
This keeps Rage core instrumentation-agnostic. A Datadog or Prometheus instrumentation gem could use the same raw data interface without any changes to Rage itself.
3.4 Runtime Metrics: What to Measure
Event Loop Lag — the most critical metric:
Fiber Utilization — hooked into existing core.fiber.dispatch and core.fiber.spawn spans:
Deferred Queue Depth — exposes existing @backlog_size from queue.rb:13:
Connection Count and GC Pressure — additional gauges using Iodine.run_every probes.
Socket Backlog — the volume of concurrent connections awaiting acceptance. This metric is exposed via Rage::Telemetry as a raw accessor to be consumed by downstream instrumentation packages.
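The loop-lag measurement reduces to "how late did a scheduled tick actually fire." A minimal sketch under stated assumptions — LoopLagProber is a name proposed here, not an existing class, and a plain sleep loop stands in for Iodine.run_every:

```ruby
# Sketch of the proposed LoopLagProber: schedule a tick every `interval`
# seconds and record how late each tick fires past its deadline. In Rage the
# timer would be Iodine.run_every; a sleep loop stands in here.
class LoopLagProber
  attr_reader :samples

  def initialize(interval)
    @interval = interval
    @samples  = []
  end

  def run(ticks)
    deadline = now + @interval
    ticks.times do
      sleep @interval
      @samples << [now - deadline, 0.0].max # lag: lateness past the deadline
      deadline = now + @interval
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

prober = LoopLagProber.new(0.02)
prober.run(3)
puts prober.samples.length                    # 3
puts prober.samples.all? { |lag| lag >= 0.0 } # true
```

Under a healthy event loop the samples sit near zero; a fiber that blocks the reactor shows up immediately as a spike.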
3.5 OpenTelemetry Export
The opentelemetry-instrumentation-rage gem (rage-rb/opentelemetry-instrumentation-rage) consumes Rage::Telemetry’s raw accessors and tools to construct OTel instruments.
It uses the LoopLagProber to emit a Histogram for event loop lag, and reads Rage::Telemetry.raw[:active_fibers], raw[:deferred_queue_depth], and raw[:socket_backlog] to emit native Gauges.
This architecture keeps the Rage core entirely instrumentation-agnostic, with the gem acting as the observability bridge, consistent with the patterns used by Rails, Sidekiq, and Puma.
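The raw-accessor surface the instrumentation gem would consume can be sketched as a thread-safe registry of probes. The module and method names below follow this proposal's Rage::Telemetry.raw naming but are illustrative, not existing Rage API:

```ruby
# Illustrative sketch of the proposed raw-metrics registry. Probes are
# registered once at boot; Telemetry.raw evaluates them on demand, so the
# core ships no exporter code and stays instrumentation-agnostic.
module Telemetry
  @probes = {}
  @mutex  = Mutex.new

  class << self
    def register(name, &probe)
      @mutex.synchronize { @probes[name] = probe }
    end

    # Snapshot every registered probe: { metric_name => current_value }
    def raw
      @mutex.synchronize { @probes.transform_values(&:call) }
    end
  end
end

backlog = [1, 2, 3] # stand-in for the Deferred queue's @backlog_size source
Telemetry.register(:deferred_queue_depth) { backlog.size }

puts Telemetry.raw[:deferred_queue_depth] # 3
```

A downstream gem (OTel, Datadog, Prometheus) then polls `raw` on its own schedule and maps each key to whatever instrument type it prefers.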
4. Project Deliverables
5. Timeline / Milestones
Community Bonding (before coding starts)
Phase 1: Context Propagation (Weeks 1-3, ~45 hours)
Week 1: Extend Context.build to capture CurrentAttributes. Write failing tests first: enqueue a task with Current.user = X, assert Current.user == X inside the task.
Week 2: Apply same fix to SSE::Application and FiberScheduler. Handle edge cases: serialization of complex attribute values, nested CurrentAttributes subclasses, classes with reset_callbacks.
Week 3: Integration tests across all three fiber boundaries. Ensure existing logger tag propagation still works (zero regressions). Handle the case where ActiveSupport is not loaded (graceful no-op).
Milestone: CurrentAttributes automatically propagate across all fiber boundaries. All existing tests pass.
Phase 2: Metrics Engine (Weeks 4-6, ~45 hours)
Week 4: Implement raw metric accessors and LoopLagProber tool in Rage::Telemetry. Unit tests for each accessor. Ensure LoopLagProber correctly wraps Iodine.run_every with accurate lag calculation.
Week 5: Build event loop lag probe using Iodine.run_every. Build fiber utilization gauge hooked into FiberWrapper. Build queue depth gauge from @backlog_size.
Week 6: Build connection pressure gauge. Build GC pressure probe. Integration test: start a Rage app, make requests, verify metrics update correctly.
Milestone: All five runtime metrics measured and accessible via Rage::Telemetry.raw.
Midterm Evaluation
Context propagation complete. Metrics engine functional with five runtime metrics.
Phase 3: OpenTelemetry Export (Weeks 7-9, ~45 hours)
Week 7: Implement OTelExporter bridging Rage metrics to opentelemetry-ruby SDK. Test with OTel Collector.
Week 8: Build Grafana dashboard template with panels for loop lag, fiber utilization, queue depth, connections, GC pressure. Write configuration guide.
Week 9: Buffer week for integration issues, mentor feedback, and edge cases (high-cardinality labels, metric reset on process fork).
Milestone: Metrics exportable to Prometheus/Grafana via OpenTelemetry. Dashboard template ready.
Phase 4: Documentation & Hardening (Weeks 10-12, ~40 hours)
Week 10: YARD documentation for all public APIs. Usage guide with examples: "How to monitor Rage in production."
Week 11: Benchmark impact of context propagation (overhead per task enqueue). Benchmark metric collection (overhead per request). Ensure < 1% performance impact.
Week 12: Final cleanup, PR review incorporation, submission.
Milestone: All deliverables complete. PRs ready for merge.
6. Related Work
Rails CurrentAttributes Propagation
Rails' Active Job automatically propagates CurrentAttributes via serialization into the job payload. Sidekiq implements this via middleware. Rage's approach is more direct: since tasks run in-process (same Ruby VM), we can capture live object references rather than serializing/deserializing — making propagation both faster and more complete.
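The difference is observable in plain Ruby: an in-process capture keeps the exact same object (identity preserved, nothing lost), while the serialization route fails outright for values like open IO handles. A small demonstration with illustrative names:

```ruby
# In-process capture vs. serialization: a struct holding an open pipe.
User = Struct.new(:id, :pipe)
reader, _writer = IO.pipe
user = User.new(42, reader)

captured = user            # live reference — same object, nothing copied
puts captured.equal?(user) # true

begin
  Marshal.dump(user)       # the round-trip a serializing job queue relies on
rescue TypeError => e
  puts "cannot serialize: #{e.class}" # "cannot serialize: TypeError"
end
```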
Node.js AsyncLocalStorage
Node.js solved the same problem with AsyncLocalStorage (added in v12.17.0), which automatically propagates context across async boundaries. Rage's fiber-local storage serves the same purpose, but lacks the automatic propagation that AsyncLocalStorage provides. This project adds that missing piece.
Puma Metrics
Puma exposes thread pool stats, backlog, and request queue depth via Puma.stats. Rage needs equivalent visibility for its fiber-based model. The key difference: Puma's thread pool is bounded and well-understood; Rage's fiber pool is dynamic, making metrics even more critical for capacity planning.
7. Risk Mitigation
8. Previous Contributions
Rage Framework
Proof of Concept
Ruby Core (ruby/csv)
Conference Speaking
9. About Me
I am a software engineer based in Japan building production Server-Sent Events systems for industrial IoT equipment monitoring. My daily work involves exactly the observability challenges this project addresses: tracking request context across async boundaries, measuring event loop health under load, and correlating background processing with the HTTP requests that triggered it.
I chose this project because I have firsthand production experience with the pain of losing context across fiber boundaries. In my IoT monitoring systems, a sensor alert that triggers a background notification must carry the device ID, tenant, and request trace through every async hop — or the notification is useless. Rage's Deferred system has the same gap, and I know exactly how to fix it because I have already read the code where context is captured (context.rb), where it is lost (queue.rb:38), and where it is partially restored (task.rb:62).
GitHub: jsxs0
10. Availability
11. Communication Plan
Google Doc with commenting enabled
@cuneyter @daz-codes — happy to adjust scope or approach based on your guidance!