Implementing a PID per pod#931

Draft
kris-gaudel wants to merge 139 commits into main from kris-gaudel/poc-pid-per-pod
@kris-gaudel
Contributor

Building off of the architecture discussed here and former work here

This PR is a proof of concept for sharing a centralized PID state service among the adaptive circuit breakers in a pod. The service aggregates observations from all workers and broadcasts rejection-rate updates, so every worker has a consistent, pod-wide view of service health.
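To make the aggregate-then-broadcast idea concrete, here is a minimal in-process sketch. It is not the code in this PR: the class name `PodPidState`, its methods, the gains, and the target error rate are all hypothetical, and the real service talks to workers over a bus rather than through Ruby blocks. It only illustrates one control step computed from pooled observations and pushed to every subscriber.

```ruby
# Hypothetical sketch; none of these names or defaults come from the PR.
class PodPidState
  attr_reader :rejection_rate

  def initialize(kp: 1.0, ki: 0.1, target_error_rate: 0.01)
    @kp = kp
    @ki = ki
    @target_error_rate = target_error_rate
    @integral = 0.0
    @rejection_rate = 0.0
    @successes = 0
    @errors = 0
    @subscribers = []
  end

  # A worker registers a block that receives each new rejection rate.
  def subscribe(&block)
    @subscribers << block
  end

  # Any worker in the pod reports a batch of request outcomes.
  def observe(successes:, errors:)
    @successes += successes
    @errors += errors
  end

  # Runs one proportional-integral step over everything observed since the
  # previous step, then broadcasts the result to every subscriber.
  def tick
    total = @successes + @errors
    error_rate = total.zero? ? 0.0 : @errors.to_f / total
    deviation = error_rate - @target_error_rate

    # Clamp the integral so it cannot wind up without bound.
    @integral = (@integral + deviation).clamp(0.0, 1.0)
    @rejection_rate = (@kp * deviation + @ki * @integral).clamp(0.0, 1.0)

    @successes = 0
    @errors = 0
    @subscribers.each { |subscriber| subscriber.call(@rejection_rate) }
    @rejection_rate
  end
end
```

Because every worker receives the same value from `tick`, no worker has to estimate service health from its own slice of traffic alone.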

Notable features:

  • WAL and binary format
  • Messages are encoded in MessagePack format
  • Centralized PID state service communication to clients over async-bus
  • Unit tests for various components of this feature
  • Example where multiple clients are simulated as individual processes communicating back and forth with the PID state service
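The list above mentions a WAL with a binary format but does not spell out the record layout, so the following is only a sketch of one common approach: a length-prefixed, checksummed record. The layout, the `WalSketch` module, and its method names are assumptions, not the format in this PR. JSON from the standard library stands in for MessagePack so the block runs without extra gems.

```ruby
require "json"
require "stringio"
require "zlib"

# Hypothetical record layout (not taken from the PR):
#   4 bytes  payload length, unsigned 32-bit big-endian
#   4 bytes  CRC32 of the payload, unsigned 32-bit big-endian
#   N bytes  payload (JSON here as a stand-in for MessagePack)
module WalSketch
  HEADER_SIZE = 8

  # Appends one record to an IO-like object opened for binary writing.
  def self.append(io, record)
    payload = JSON.generate(record)
    io.write([payload.bytesize, Zlib.crc32(payload)].pack("NN"))
    io.write(payload)
  end

  # Replays records in order, stopping at the first truncated or corrupt
  # entry, which is how a torn write at the tail of the log would look.
  def self.replay(io)
    records = []
    loop do
      header = io.read(HEADER_SIZE)
      break if header.nil? || header.bytesize < HEADER_SIZE

      length, checksum = header.unpack("NN")
      payload = io.read(length)
      break if payload.nil? || payload.bytesize < length
      break unless Zlib.crc32(payload) == checksum

      records << JSON.parse(payload)
    end
    records
  end
end

# Usage: write two observations, simulate a torn write, then replay.
log = StringIO.new("".b)
WalSketch.append(log, { "worker" => 1, "errors" => 2 })
WalSketch.append(log, { "worker" => 2, "errors" => 0 })
log.write("\x00\x00".b)
log.rewind
recovered = WalSketch.replay(log)
```

Stopping at the first bad record keeps recovery simple: everything before the damage is trusted, and the partial tail is ignored.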

References:

AbdulRahmanAlHamali and others added 30 commits October 22, 2025 17:08
Update variable names

Fill sliding window with 1 hr worth data

Add comment

Update experiment resource to be deterministic

Change deterministic default value to false

Cleanup

Remove unused variable

Make initial seed error rate more customizable

Add seed_error_rate as a property
* Prefilling added

* Change initial duration to 900 s
* testing different circuit breaking scenarios

* adding concurrency

* adds more puts to get further information during phases

* Fixing concurrency, unprotected ping, extras

* update classic sustained test

* cleaning up outdated tests, and testing without ping rate

* modify ki instead of dividing by window size

---------

Co-authored-by: Abdulrahman Alhamali <abdulrahman.alhamali@shopify.com>
Aguasvivas22 and others added 22 commits November 20, 2025 13:25
* switch algorithm to a sliding window instead of discrete

* clean up AI slop PR

* provide observations per minute to the smoother from the pid controller

* fix tests

* remove unused smoother in setup

* fix alpha value

* fix tests and run experiments

* fix max size for sliding window and rerun experiments
Remove unnecessary comments and fix smoother initialization
* add elastic defensiveness

* remove kd

* update docs and run experiments

* fix tests

---------

Co-authored-by: Fernando Aguasvivas <fernando.aguasvivas@shopify.com>
…d classic modes (#808)

* use lambda to implement dual circuit breaker

* fix tests

* class method to allow setting a global selector for adaptive config

* remove unneeded methods that are no longer used
use CircuitBreakerBehaviour

* allow adaptive_circuit_breaker_selector to be unset

* make dynamic configuration call only on acquire

* require resource to be provided

* fix: avoid an extra lookup to use_adaptive in metrics

* various fixes

* nit: cleanup

* fix logging to both circuit breakers at the same time

* use logic from both acquire methods in dual_circuit_breaker

* fix last error tracking test by using properly scoped exceptions variable

* fix demos

* remove unused attr_reader for name
use symbols for active_breaker_type instead of strings
create separate acquire methods for readability
use active_breaker_type consistently to determine breaker type
remove useless AI-generated tests
fix useful tests

* nit: typo fix

* nit: remove TODO comment

* don't return if variables are nil (dangerous)
use_adaptive proc no longer exists at initialization (proc is set at runtime)

* address PR comments

* make active_circuit_breaker readable for tests
clean up unneeded tests and AI fluff

* improve track both breakers test with assert_equal

* replace "legacy" with "classic", notify on change in cb type

* replace legacy with classic

* notify on change in cb type, remove verbose comments, and refactor `handle_adaptive_acquire`

* metrics functions for dcb example

* have error handling parity between classic and adaptive

* fix exceptions parameter passing

* reference as instance variable

* use acquire directly from circuit breakers

* fix dcb test and remove verbose content

* fix double counting of errors

* undo the move of method to public

* make active_breaker_type public

* rerun experiments

---------

Co-authored-by: Kristofer Gaudel <kris.gaudel@shopify.com>
Co-authored-by: Abdulrahman Alhamali <abdulrahman.alhamali@shopify.com>

  # Configure Semian with dual circuit breaker
- Semian::NetHTTP.semian_configuration = proc do |host, port|
+ Semian::NetHTTP.semian_configuration = proc do |host, _port|
Contributor Author

This was changed by the linter and is not related to the functionality of this PR.

Base automatically changed from pid-take-2 to main March 19, 2026 15:19
6 participants