
Add adaptive circuit breaker #760

Merged
AbdulRahmanAlHamali merged 184 commits into main from pid-take-2
Mar 19, 2026



@AbdulRahmanAlHamali AbdulRahmanAlHamali commented Sep 23, 2025

Add a new adaptive circuit breaker that:

  1. Works without need for configuration
  2. Has the ability to open partially/fully depending on the severity of the incident

The circuit breaker has two main components:

Ideal Error Rate Estimator

This estimator tries to find the error rate to expect from a healthy dependency. It uses simple exponential smoothing, i.e. an exponentially weighted moving average of the observed error rates, with a few extra domain-specific hints:

  1. Ignore any observations that are obviously too high
  2. Converge slowly towards higher values, and quickly towards lower values
  3. Be more receptive to signals for the first 30 minutes after boot (because we start with a random guess for the value), and less receptive afterwards
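
To make the three hints concrete, here is a minimal sketch of such an estimator. The class name, parameter names, and every constant except the 30-minute warm-up are assumptions for illustration and are not taken from the code in this PR.

```ruby
# Hypothetical sketch: exponential smoothing with the three hints above.
# All names and default values here are illustrative assumptions.
class IdealErrorRateEstimator
  attr_reader :estimate

  def initialize(seed: 0.01, cap: 0.1, alpha_up: 0.01, alpha_down: 0.1,
                 warmup_seconds: 1800, warmup_boost: 5.0)
    @estimate = seed               # initial guess for the healthy error rate
    @cap = cap                     # observations above this are ignored
    @alpha_up = alpha_up           # small weight: converge slowly upwards
    @alpha_down = alpha_down       # larger weight: converge quickly downwards
    @warmup_seconds = warmup_seconds
    @warmup_boost = warmup_boost
  end

  # observed: error rate in [0, 1] for the latest window
  # uptime:   seconds since boot
  def update(observed, uptime:)
    # Hint 1: ignore observations that are obviously too high.
    return @estimate if observed > @cap

    # Hint 2: asymmetric smoothing factor.
    alpha = observed > @estimate ? @alpha_up : @alpha_down

    # Hint 3: be more receptive during the warm-up period after boot.
    alpha = [alpha * @warmup_boost, 1.0].min if uptime < @warmup_seconds

    @estimate = alpha * observed + (1.0 - alpha) * @estimate
  end
end
```

With these assumed defaults, an observation of 0.5 leaves the estimate untouched, while an observation of 0.0 pulls it down ten times faster than a higher observation would pull it up.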

PID (Proportional, Integral, Derivative) Controller

See the Wikipedia article on PID controllers.

This controller increases/decreases the rejection rate of the circuit breaker based on the value of:

kP * P + kI * Integral + kD * Derivative

Where P is:

ErrorDelta - (1 - ErrorDelta) * RejectionRate

The intuition of the value of P:

  • Increases when error rate increases
  • Decreases when rejection rate increases
  • The rejection rate has a (1 - ErrorDelta) multiplier, that we call the "defensiveness" multiplier. This allows rejection to increase more aggressively if the error rate is high
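
A small sketch of the P formula, with a few sample values to illustrate the intuition. It assumes ErrorDelta is the observed error rate minus the estimated ideal error rate; the method name `p_value` is hypothetical.

```ruby
# Hypothetical helper: P = ErrorDelta - (1 - ErrorDelta) * RejectionRate.
# error_delta is assumed to be (observed error rate - ideal error rate).
def p_value(error_delta, rejection_rate)
  defensiveness = 1.0 - error_delta
  error_delta - defensiveness * rejection_rate
end

p_value(0.5, 0.0) # high errors, nothing rejected yet: P is positive (0.5)
p_value(0.5, 0.5) # same errors, half already rejected: P is smaller (0.25)
p_value(0.0, 0.5) # errors back to ideal: P is negative (-0.5)
```

P reaches zero when RejectionRate = ErrorDelta / (1 - ErrorDelta), so under this formula an ErrorDelta of 0.2 settles around a rejection rate of 0.25, and an ErrorDelta of 0.5 or more pushes towards full rejection.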

Note: The derivative component is currently set to 0. The Integral component mostly contributes as a history to prevent the circuit breaker from fluctuating too much.
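
Putting the pieces together, one control step could look roughly like the sketch below. The gains match the defaults in the `initialize` signature in this PR (kp: 1.0, ki: 0.1, kd: 0.0); the class name, the plain running-sum integral, and the clamping to [0, 1] are assumptions and may differ from the actual implementation.

```ruby
# Hypothetical sketch of one PID update step. Integral bookkeeping and
# clamping are assumptions, not the PR's actual implementation.
class PidSketch
  attr_reader :rejection_rate

  def initialize(kp: 1.0, ki: 0.1, kd: 0.0)
    @kp = kp
    @ki = ki
    @kd = kd
    @integral = 0.0
    @previous_p = 0.0
    @rejection_rate = 0.0
  end

  # error_delta: observed error rate minus ideal error rate (assumed meaning)
  # dt:          seconds since the previous update
  def update(error_delta, dt: 1.0)
    p_term = error_delta - (1.0 - error_delta) * @rejection_rate
    @integral += p_term * dt
    derivative = (p_term - @previous_p) / dt
    @previous_p = p_term

    output = @kp * p_term + @ki * @integral + @kd * derivative
    # The output adjusts the rejection rate, kept within [0, 1].
    @rejection_rate = (@rejection_rate + output).clamp(0.0, 1.0)
  end
end
```

Starting from zero, a step with an ErrorDelta of 0.5 raises the rejection rate to about 0.55 in this sketch, and a following step with an ErrorDelta of 0.0 brings it back down to 0.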

attr_reader :name, :pid_controller, :ping_thread

def initialize(name:, kp: 1.0, ki: 0.1, kd: 0.0,
window_size: 10, history_duration: 3600,

Regarding the calculation of online mean, Welford's algorithm was discussed as a solution to use constant space
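
For reference, a minimal sketch of Welford's algorithm, which keeps a running mean (and variance) in constant space without storing the observations. The class name `RunningStats` is hypothetical and is not part of this PR.

```ruby
# Welford's online algorithm: constant-space running mean and variance.
# The class name and interface are illustrative assumptions.
class RunningStats
  attr_reader :count, :mean

  def initialize
    @count = 0
    @mean = 0.0
    @m2 = 0.0 # running sum of squared deviations from the mean
  end

  def add(value)
    @count += 1
    delta = value - @mean
    @mean += delta / @count
    @m2 += delta * (value - @mean) # uses the already-updated mean
    self
  end

  # Sample variance; 0.0 until at least two values have been added.
  def variance
    @count > 1 ? @m2 / (@count - 1) : 0.0
  end
end
```

Each `add` call is O(1) in time and memory, so the sliding window would not need to retain an hour of raw samples just to compute the mean.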


Also, do we want to allow developers to configure the interval of integration (history_duration)?

AbdulRahmanAlHamali and others added 24 commits October 22, 2025 17:08
Update variable names

Fill sliding window with 1 hr worth data

Add comment

Update experiment resource to be deterministic

Change deterministic default value to false

Cleanup

Remove unused variable

Make initial seed error rate more customizable

Add seed_error_rate as a property
* Prefilling added

* Change initial duration to 900 s
* testing different circuit breaking scenarios

* adding concurrency

* adds more puts to get further information during phases

* Fixing concurrency, unprotected ping, extras

* update classic sustained test

* cleaning up outdated tests, and testing without ping rate

* modify ki instead of dividing by window size

---------

Co-authored-by: Abdulrahman Alhamali <abdulrahman.alhamali@shopify.com>
adriangudas and others added 27 commits January 23, 2026 10:37
…hread

using a single thread for all PID controller statuses
* Add dead zone ratio to PID controller for noise suppression

- Introduced `dead_zone_ratio` parameter in the PID controller to suppress noise from small deviations in error rates.
- Updated the `calculate_p_value` method to implement dead zone logic, allowing for more stable control responses.
- Enhanced tests to validate the behavior of the dead zone in various scenarios, ensuring it does not impede recovery while effectively filtering noise.

* Added experiment tests

* Refine dead zone logic in PID controller

- Updated the handling of the dead zone in the PID controller to allow full signal response above the dead zone, improving control accuracy.
- Adjusted comments for clarity regarding the purpose of the dead zone in noise suppression.

* Added experiment results
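
One possible reading of the dead zone described in the commits above, as a sketch only: deviations within `dead_zone_ratio` of the ideal error rate are treated as zero, and anything above the dead zone passes through at full strength. The method name, the default ratio, and the exact comparison are assumptions; the real `calculate_p_value` may differ.

```ruby
# Hypothetical sketch of dead zone logic applied before computing P.
# The default ratio of 0.2 is an illustrative assumption.
def dead_zone_p_value(error_rate, ideal_error_rate, rejection_rate, dead_zone_ratio: 0.2)
  delta = error_rate - ideal_error_rate
  # Suppress noise: small deviations around the ideal rate count as zero.
  delta = 0.0 if delta.abs <= dead_zone_ratio * ideal_error_rate
  # Above the dead zone, the full (unshifted) signal is used.
  delta - (1.0 - delta) * rejection_rate
end
```

Because only the error deviation is zeroed, a non-zero rejection rate still yields a negative P inside the dead zone, so recovery is not held back in this sketch.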
@AbdulRahmanAlHamali

Note: we are going to ignore the CLA check. It is complaining about commits made by people who were at Shopify and have since left.

@AbdulRahmanAlHamali AbdulRahmanAlHamali merged commit 74be612 into main Mar 19, 2026
34 of 35 checks passed
@AbdulRahmanAlHamali AbdulRahmanAlHamali deleted the pid-take-2 branch March 19, 2026 15:19