
Add telemetry metrics infrastructure #103

Open
1ntEgr8 wants to merge 1 commit into main from
users/elton/telemetry-infrastructure

Conversation

@1ntEgr8
Collaborator

@1ntEgr8 1ntEgr8 commented Feb 27, 2026

Summary

  • Add CUDA event timing in NCCLWorker and compilation timing in Planner/PassManager
  • Add a ZMQ-based MetricsSink for streaming metrics from workers and coordinator
  • Add Python telemetry module with MetricsServer, deserialization, and CSV sink
  • Planner::Compile now returns CompileResult with metrics alongside the plan
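
To make the deserialization and CSV sink bullet concrete, here is a minimal sketch using only the Python standard library. The record fields, the JSON wire format, and the names `PassTiming`, `deserialize`, and `CsvSink` are all assumptions for illustration; the PR's actual MetricsData layout and telemetry module API are not shown on this page and may differ.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class PassTiming:
    """Hypothetical record shape; not the PR's actual MetricsData type."""
    source: str          # e.g. "coordinator" or a worker id
    pass_name: str
    duration_ms: float


def deserialize(payload: bytes) -> PassTiming:
    """Decode one metrics message. JSON is an assumed wire format."""
    raw = json.loads(payload.decode("utf-8"))
    return PassTiming(
        source=str(raw["source"]),
        pass_name=str(raw["pass_name"]),
        duration_ms=float(raw["duration_ms"]),
    )


class CsvSink:
    """Write records as CSV rows to a text stream, emitting the header once."""

    def __init__(self, stream):
        self._writer = csv.DictWriter(
            stream,
            fieldnames=[f.name for f in fields(PassTiming)],
            lineterminator="\n",
        )
        self._writer.writeheader()

    def write(self, record: PassTiming) -> None:
        self._writer.writerow(asdict(record))


def demo() -> str:
    """Push two sample messages through deserialize() into an in-memory sink."""
    buf = io.StringIO()
    sink = CsvSink(buf)
    messages = (
        b'{"source": "worker-0", "pass_name": "fuse", "duration_ms": 1.5}',
        b'{"source": "coordinator", "pass_name": "schedule", "duration_ms": 0.25}',
    )
    for msg in messages:
        sink.write(deserialize(msg))
    return buf.getvalue()
```

In a real deployment the payloads would presumably arrive over the ZMQ socket rather than from a hard-coded tuple; the sink only needs a writable text stream, so it can target a file opened with `newline=""` as easily as a `StringIO`.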

@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 8e5c75f to 6072324 Compare February 27, 2026 02:34
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 59cf112 to 8afcfbe Compare February 27, 2026 02:34
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 6072324 to a47ec50 Compare February 27, 2026 15:20
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 8afcfbe to c2fc1b8 Compare February 27, 2026 15:20
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from a47ec50 to 8e9ff84 Compare February 27, 2026 15:23
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from c2fc1b8 to 24089c6 Compare February 27, 2026 15:23
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 8e9ff84 to a105bfa Compare February 27, 2026 15:26
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 24089c6 to 253705b Compare February 27, 2026 15:26
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from a105bfa to 15563ef Compare February 27, 2026 15:49
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 253705b to 75615e9 Compare February 27, 2026 15:49
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 15563ef to 1bc6146 Compare February 27, 2026 15:55
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 75615e9 to c9b9866 Compare February 27, 2026 15:55
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch 2 times, most recently from ba5a895 to 10c7b31 Compare February 27, 2026 15:59
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from c9b9866 to 5a4ea3a Compare February 27, 2026 15:59
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 10c7b31 to 1bc6146 Compare February 27, 2026 16:00
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 5a4ea3a to c9b9866 Compare February 27, 2026 16:00
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 1bc6146 to 1bde980 Compare February 27, 2026 16:07
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from c9b9866 to 513c2e4 Compare February 27, 2026 16:10
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 1bde980 to 1e51ac0 Compare February 27, 2026 16:20
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 513c2e4 to 785abed Compare February 27, 2026 16:20
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 1e51ac0 to 78e320e Compare February 27, 2026 17:26
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 785abed to 934ddf6 Compare February 27, 2026 17:26
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 78e320e to ab4876c Compare February 27, 2026 17:27
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 934ddf6 to 6891e02 Compare February 27, 2026 17:27
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from ab4876c to b766140 Compare February 27, 2026 17:32
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 6891e02 to a9ca420 Compare February 27, 2026 17:32
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 826713b to b766140 Compare February 27, 2026 18:44
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 9164902 to a9ca420 Compare February 27, 2026 18:44
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from b766140 to 7a2fe57 Compare February 27, 2026 18:51
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from a9ca420 to 1e612fe Compare February 27, 2026 18:51
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 7a2fe57 to a42304a Compare February 27, 2026 19:08
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 1e612fe to fb5e8ec Compare February 27, 2026 19:08
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from a42304a to e128158 Compare February 27, 2026 20:27
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from fb5e8ec to a4d12f2 Compare February 27, 2026 20:32
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from e128158 to 217114d Compare February 27, 2026 20:57
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from a4d12f2 to 4efae8a Compare February 27, 2026 20:57
@1ntEgr8 1ntEgr8 force-pushed the users/elton/null-topo-shortest-path branch from 217114d to 2162b44 Compare February 27, 2026 20:59
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 4efae8a to 3994c0b Compare February 27, 2026 20:59
Base automatically changed from users/elton/null-topo-shortest-path to main February 27, 2026 21:06
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch 2 times, most recently from 3aeb805 to 3994c0b Compare February 27, 2026 21:06
Add CUDA event timing in NCCLWorker, compilation stage timing in
Planner and PassManager, ZMQ-based MetricsSink for streaming metrics
from workers to a central collector, HTTP MetricsServer for querying
metrics, and CSV export sinks.

Key changes:
- Add MetricsData types for compilation, NCCL worker, and pass timing
- Add MetricsSink with ZMQ PUB/SUB for streaming metrics
- Add NCCLWorkerMetrics with CUDA event-based timing
- Add per-pass timing via PassManager::RunTimed()
- Add Planner::Compile() with CompileResult including metrics
- Add Python telemetry module with HTTP server and CSV export
- Wire metrics collection through Coordinator and NodeAgent
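
The per-pass timing and "result plus metrics" pattern in the list above can be sketched as follows. This is a Python analogue written for illustration only: the PR implements `PassManager::RunTimed()` and `CompileResult` in C++, and the names `run_timed`, `pass_durations_ms`, and the list-of-strings `Plan` stand-in are hypothetical. It measures host wall-clock time and does not model the CUDA event timing used in NCCLWorker.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Plan = List[str]                 # stand-in for the real plan type
Pass = Callable[[Plan], Plan]    # a pass maps a plan to a new plan


@dataclass
class CompileResult:
    """Hypothetical shape: the compiled plan plus per-pass timings."""
    plan: Plan
    pass_durations_ms: Dict[str, float] = field(default_factory=dict)


def run_timed(
    plan: Plan,
    passes: List[Tuple[str, Pass]],
    clock: Callable[[], float] = time.perf_counter,
) -> CompileResult:
    """Run passes in order, recording each pass's duration in milliseconds.

    `clock` is injectable so tests can supply a deterministic time source.
    """
    durations: Dict[str, float] = {}
    for name, fn in passes:
        start = clock()
        plan = fn(plan)
        durations[name] = (clock() - start) * 1000.0
    return CompileResult(plan=plan, pass_durations_ms=durations)


def dedupe(plan: Plan) -> Plan:
    """Toy pass: drop repeated steps, keeping first occurrences in order."""
    return list(dict.fromkeys(plan))


def sort_steps(plan: Plan) -> Plan:
    """Toy pass: order steps alphabetically."""
    return sorted(plan)
```

Calling `run_timed(["b", "a", "b"], [("dedupe", dedupe), ("sort", sort_steps)])` returns the transformed plan together with a duration per pass name. Returning timings next to the plan, rather than logging them as a side effect, lets the caller decide where the metrics go.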
@1ntEgr8 1ntEgr8 force-pushed the users/elton/telemetry-infrastructure branch from 3994c0b to 031bcf9 Compare February 27, 2026 21:11


1 participant