Conversation
This commit covers: ✔ Chunk 1: Add axum deps to workspace, migrate core layer ✔ Chunk 2: Migrate aggregator_core instrumented() to axum middleware and part of ◼ Chunk 3: Migrate problem_details and error handling to axum Resolves #4283 Authored by Claude
Claude's plan file
Migrate Janus from Trillium to Axum (Issue #4283)ContextTrillium is a less widely adopted web framework with a smaller ecosystem. Axum (built on Scope Assessment42 Rust files reference trillium. 6 Cargo.toml files have trillium dependencies. The key areas are:
Migration Strategy: Chunk-by-Chunk (Bottom-Up)Each chunk is a standalone PR that compiles and passes tests. Chunk 1: Dependencies & Core LayerGoal: Add axum deps, migrate lowest-level shared utilities that don't break anything. Files to modify:
Chunk 2: Aggregator Core InstrumentationGoal: Migrate the Files to modify:
Chunk 3: Problem Details & Error HandlingGoal: Migrate error types from Files to modify:
Key mapping:
Chunk 4: Request Queue MiddlewareGoal: Migrate Files to modify:
Chunk 5: DAP HTTP Handlers (Main Aggregator)Goal: Migrate all DAP endpoint handlers and the Files to modify:
Key mappings:
Chunk 6: Aggregator APIGoal: Migrate the admin API. Files to modify:
Chunk 7: Server InfrastructureGoal: Migrate server startup, signal handling, zpages. Files to modify:
Chunk 8: Interop BinariesGoal: Migrate test/interop servers. Files to modify:
Chunk 9: Test InfrastructureGoal: Migrate all test utilities and test files. Files to modify:
Key mapping:
Chunk 10: CleanupGoal: Remove all trillium dependencies.
VerificationAfter each chunk:
After full migration: Notes
|
… axum - Add `Stopper` type wrapping `CancellationToken` to replace `trillium_tokio::Stopper` throughout the codebase - Add `CloneCounterObserver`/`CloneCounter` to replace `trillium_tokio::CloneCounterObserver` for tracking spawned tasks - Rewrite `setup_server` to use `axum::serve` with graceful shutdown - Rewrite `zpages_server`/`zpages_handler` as axum Router - Rewrite `prometheus_metrics_server` to use axum and `prometheus::TextEncoder` directly - Migrate `binaries/aggregator.rs` to use axum Router and `Router::nest` for aggregator API path prefix - Extract `Error::to_response(&self)` method so both `Error` and `ArcError` can implement `IntoResponse` without requiring `Clone` - Migrate `#[cfg(feature = "test-util")]` modules in `aggregation_job_init.rs` and `aggregation_job_continue.rs` to use axum Router + `tower::ServiceExt::oneshot` - Update all `Stopper` imports across job_driver, garbage_collector, aggregation_job_creator, and integration_tests
Migrate all test code from trillium_testing patterns to axum/tower: - Replace `get/post/put/delete().run_async(&handler)` with `handler.oneshot(Request::builder()...)` using tower::ServiceExt - Replace trillium assertion macros (assert_status!, assert_headers!, assert_response!, assert_body_contains!) with direct assertions on http::Response - Convert handler types from `Box<dyn Handler>` to `Router` - Update auth token test helpers for http::HeaderMap - Migrate trillium_tokio server tests to axum::serve with graceful shutdown - Update aggregator_api middleware and route patterns for axum Remove all trillium dependencies except trillium-rustls and trillium-tokio which are still needed transitively by divviup-client in integration tests.
- Fix CloneCounterObserver::into_future race condition: register the Notified future before checking the count so a concurrent notify_one() cannot be lost between the check and the await. - Stream upload body instead of buffering: replace to_bytes(body, usize::MAX) with into_data_stream().into_async_read(), restoring the streaming behavior from the old trillium handler and eliminating the unbounded memory DoS vector. - Remove duplicate CORS handling: the CorsLayer already sets Access-Control-Allow-Origin: *, so the manual origin-echoing code in hpke_config() and upload() was redundant and contradictory. - Wire LIFO queue into aggregation job routes: the queue middleware was constructed but silently discarded. Now applied via a sub-router with from_fn_with_state, restoring back-pressure and load-shedding for helper aggregators. - Remove dead _aggregation_job_post variable.
- Fix Error::Db(User(...)) losing response body by using downcast() instead of downcast_ref() + status_code(), and remove now-unused status_code() - Reuse state.http_client in ready_endpoint instead of allocating per request - Use append instead of insert for proxy response headers to preserve multi-valued headers - Clean up put_aggregation_job test helper's dead .header() builder call - Add comment explaining axum layer ordering (last = outermost) - Remove unused State parameter and generic from replace_mime_types middleware - Return error response instead of panicking on metrics encode failure - Log zpages server errors instead of silently discarding them - Restrict AggregatorState and check_content_type_value visibility - Improve content-type error messages to include expected type - Update stale draft-07 doc reference to draft-18 - Add TODO for aggregator API metrics wiring
…to LIFO queue The interop aggregator's proxy_handler was constructing responses via `(status, body).into_response()`, which sets a default content-type of application/octet-stream. Backend headers were then appended, resulting in duplicate content-type headers. The client saw the default one first, causing all Docker integration tests to fail with "unexpected content type in server response". Fix by building an empty Response and using insert (not append) for backend headers. Also update cached_resource.rs to use MIME-aware content-type validation via check_content_type instead of raw string comparison, and add CancellationToken extension support to lifo_queue_middleware for future client-disconnect detection.
Use `append` instead of `insert` when copying backend headers in the interop proxy so multi-valued headers (e.g. Set-Cookie) are preserved. This is safe now that the response is constructed via Response::builder() which starts with no default headers. Add inline TODO(#4402) noting upload-resp should be upload-errors per DAP-18.
…aggregator API Move `HttpMetrics`, `ErrorCode`, and `http_metrics_middleware` from the aggregator's http_handlers into `janus_aggregator_core` so both the DAP aggregator and the aggregator API can share the same metrics plumbing. `HttpMetrics::new(meter, counter_name)` takes a configurable counter name so each crate gets its own metric (`janus_aggregator_responses` vs `janus_aggregator_api_responses`). Wire up the middleware in `aggregator_api` by adding Extension and middleware layers to the router, replacing the previously unused `_meter` parameter.
CloneCounter::drop used notify_one(), which could miss a concurrent drop if two counters dropped between the observer registering its Notified future and awaiting it. Switch to notify_waiters() so the observer is always woken regardless of how many drops occur in that window.
The Trillium-to-axum migration inadvertently changed CORS from echoing the request Origin header to returning wildcard `*`. While both allow cross-origin access, wildcard disallows credentialed requests. Use `AllowOrigin::mirror_request()` to restore the original behavior.
- This adds an `axum_hpke_config` handler function, wiring it into Axum. - It leaves behind the existing `hpke_config` and `hpke_config_cors_preflight` methods to support other unit tests that want them for now, but we'll clean that up in part 6. - Strips out part 1's demo test, and instead routes `"/hpke_config"` via `axum::routing::get(axum_hpke_config::<C>).layer(hpke_cors)`, roughly like I do in Relay. - Pulled `impl Error` and `into_response_with_retry_after` out of #4409. Partly authored by Claude
- This adds an `axum_hpke_config` handler function, wiring it into Axum. - It leaves behind the existing `hpke_config` and `hpke_config_cors_preflight` methods to support other unit tests that want them for now, but we'll clean that up in part 6. - Strips out part 1's demo test, and instead routes `"/hpke_config"` via `axum::routing::get(axum_hpke_config::<C>).layer(hpke_cors)`, roughly like I do in Relay. - Pulled `impl Error` and `into_response_with_retry_after` out of #4409. Partly authored by Claude
- This adds an `axum_hpke_config` handler function, wiring it into Axum. - It leaves behind the existing `hpke_config` and `hpke_config_cors_preflight` methods to support other unit tests that want them for now, but we'll clean that up in part 6. - Strips out part 1's demo test, and instead routes `"/hpke_config"` via `axum::routing::get(axum_hpke_config::<C>).layer(hpke_cors)`, roughly like I do in Relay. - Pulled `impl Error` and `into_response_with_retry_after` out of #4409. Partly authored by Claude WIP
- This adds an `axum_hpke_config` handler function, wiring it into Axum. - It leaves behind the existing `hpke_config` and `hpke_config_cors_preflight` methods to support other unit tests that want them for now, but we'll clean that up in part 6. - Strips out part 1's demo test, and instead routes `"/hpke_config"` via `axum::routing::get(axum_hpke_config::<C>).layer(hpke_cors)`, roughly like I do in Relay. - Pulled `impl Error` and `into_response_with_retry_after` out of #4409. Partly authored by Claude
* Migrate to Axum [part 2]: Migrate hpke_config to Axum - This adds an `axum_hpke_config` handler function, wiring it into Axum. - It leaves behind the existing `hpke_config` and `hpke_config_cors_preflight` methods to support other unit tests that want them for now, but we'll clean that up in part 6. - Strips out part 1's demo test, and instead routes `"/hpke_config"` via `axum::routing::get(axum_hpke_config::<C>).layer(hpke_cors)`, roughly like I do in Relay. - Pulled `impl Error` and `into_response_with_retry_after` out of #4409. Partly authored by Claude
Resolves #4283
Authored by Claude with a bunch of hand-holding by me.
Review roadmap
Review Roadmap
Migrates from Trillium to Axum across the entire server stack (~4700 lines changed, 55 files, 13 commits). The first 9 commits perform the migration; the last 4 fix regressions and address review feedback.
Commits
e902c8aecore/src/http.rs,problem_details3521a7edStopper,CloneCounter,setup_server,zpages, Prometheus764859ceproblem_detailsd2df7d12daa2d6b024a6e17d10bbd1beccdf423accc833cf68a1ef4eCloneCounterObserver, stream upload body, wire LIFO queue, remove duplicate CORS703c013cError::Dbdowncast, proxy headers, visibility, error messages23fc6634CancellationTokento LIFO queue54809101appendnotinsertfor multi-valued headers;upload-respTODOGroup 1: Dependencies (skim)
Trillium crates out,
axum 0.8/tower 0.5/tower-http 0.6in.Group 2: Core utilities — read first
Small, self-contained changes the rest depends on.
extract_bearer_token()takes&HeaderMap;check_content_type()generic overMediaTypeWithAuthenticationTokenimplemented forHeaderMapinstrumented()becomes Axum middleware; addsBYTES_HISTOGRAM_BOUNDARIESGroup 3: Error handling and response types
ProblemDocumentimplementsIntoResponse; removed Trillium conn-ext traitsError::Db(User(...))fixed to usedowncast()so response body is preservedGroup 4: Main handler file — review in sections
aggregator/src/aggregator/http_handlers.rs (1440 lines changed) — read in order:
EncodedBody<T>,EmptyBodyimplementingIntoResponseAggregatorState; newHttpMetricsstructbuild()—Routerwith routes, CORS layer, metrics; comment explains layer orderingtrillium-opentelemetry::Metrics+StatusCounterconn→ extractors, returnResult<Response, Error>. Exception:upload()usesBodystream (notBytes) for disconnect detectionvalidate_content_typedelegates to genericcheck_content_type::<M>()Group 5: Server lifecycle
Stopper(wrapsCancellationToken),CloneCounter/CloneCounterObserver(race-condition fix in68a1ef4e),setup_server()usesaxum::serve+ graceful shutdownRouter::nest()compositionGroup 6: Aggregator API
Router; metrics layer; removedReplaceMimeTypeshandler and unused State genericGroup 7: Metrics
tower::ServiceExt::oneshot()Group 8: LIFO queue middleware
CancellationTokenextension support to race acquisition against client disconnectGroup 9: Interop binaries and integration proxy
Response::builder()+append(not defaultinto_response())Group 10: Test migrations (largest by line count, most mechanical)
Pattern:
trillium_testing::TestConn→Request::builder()+tower::ServiceExt::oneshot(). Spot-check a few, then skim.upload_client_early_disconnectandupload_client_http11_bulktestshttp_handlers/tests/— all follow the same patternWhat to look for
.layer()= outermost). See comment inhttp_handlers.rsbuild().upload()usesBodystream +into_data_stream().into_async_read(), notBytes. This restores streaming and enables disconnect detection. Fixed in68a1ef4e.into_future()now registersNotifiedbefore checking count (68a1ef4e). Subtle ordering dependency.68a1ef4e; now applied via sub-router withfrom_fn_with_state.CorsLayer::permissive()returns*; manual origin-echoing removed. Intentional.Response::builder()(no default headers) +append. Fixed across703c013c→23fc6634→54809101.upload-respmedia type should beupload-errorsper DAP-18 (deferred).703c013c.cached_resource.rs— switched from raw string comparison to MIME-awarecheck_content_type(23fc6634).