chore: add disaster simulation script #259
Adds a script to simulate losing a host. This script has three different ways of simulating that loss to enable us to develop recovery steps for Swarm and Control Plane/Etcd in parallel.
Adds an option to reset the Lima E2E fixture back to its initial state without tearing it down entirely. This can save a significant amount of time between tests.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hack/simulate-disaster.sh`:
- Around line 171-173: The dispatch case labeled "full" calls a nonexistent
function simulate_full_node_loss; update that case to call the actual function
simulate_full_loss (replace simulate_full_node_loss with simulate_full_loss) so
the "full" branch invokes the defined function and won't fail at runtime.
- Around line 121-125: Update the usage text in the simulate-disaster.sh header:
fix the duplicated word "different different" and include the missing `reset`
option in the synopsis string (change "Usage: $1 <swarm|etcd|full> <host-id>
[host-id ...]" to include `reset`, e.g. "Usage: $1 <swarm|etcd|full|reset>
<host-id> [host-id ...]") and adjust the descriptive paragraph to remove the
duplicate word so it reads "three different types of disasters" (or similar).
Ensure you update both the usage line and the description near that header.
- Line 17: Fix the unquoted $@ expansions in the for-loops and correct the
misnamed function call: change the three loops that iterate host_id (the ones
using for host_id in $@) to use quoted expansion (for host_id in "$@") to
prevent word-splitting and globbing, update the calls that forward args to use
quoted slices where shown (simulate_swarm_node_loss "${@:2}",
simulate_etcd_node_loss "${@:2}"), replace the invalid simulate_full_node_loss
call with the actual function name simulate_full_loss and pass quoted args
(simulate_full_loss "${@:2}"), and ensure the script's entry call uses main "$@"
instead of unquoted arguments.
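For reference, the quoting difference described in the last finding can be reproduced in isolation. This is a minimal sketch, not code from `hack/simulate-disaster.sh`: the helper names `count_unquoted`, `count_quoted`, and `forward_hosts` are hypothetical and exist only to show how `$@`, `"$@"`, and `"${@:2}"` treat an argument that contains a space.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; none of these exist in simulate-disaster.sh.

# Unquoted $@: every argument is re-split on whitespace (and glob-expanded).
count_unquoted() {
  local n=0 host_id
  # shellcheck disable=SC2068
  for host_id in $@; do
    n=$((n + 1))
  done
  echo "$n"
}

# Quoted "$@": every argument is passed through exactly as received.
count_quoted() {
  local n=0 host_id
  for host_id in "$@"; do
    n=$((n + 1))
  done
  echo "$n"
}

# "${@:2}" forwards everything after the first (mode) argument, still quoted.
forward_hosts() {
  count_quoted "${@:2}"
}

count_unquoted "host 1" host-2        # prints 3: "host 1" was split in two
count_quoted "host 1" host-2          # prints 2
forward_hosts swarm "host 1" host-2   # prints 2: the mode argument is dropped
```

Host IDs are unlikely to contain spaces in practice, but the quoted forms also prevent accidental glob expansion, so they are the safer default.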
- Fix etcd simulation for client-mode servers
- Remove database services in etcd simulation
- Rebuild control-plane in reset
- Remove `set -x`
07d41dc to decc80d
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hack/simulate-disaster.sh`:
- Line 121: Update the usage synopsis in simulate-disaster.sh to show
mode-specific forms rather than a single line that implies every mode needs
<host-id>; replace the current "Usage: $1 <swarm|etcd|full|reset> <host-id>
[host-id ...]" with two clearer variants (e.g. "Usage: $1 <swarm|etcd|full>
<host-id> [host-id ...]" and " $1 reset") or an equivalent multi-line help
message so that "reset" is shown as not requiring host-id(s); ensure the updated
usage string appears where the script prints help/usage so callers see the
correct mode-specific argument requirements.
- Around line 163-175: The case branches for "swarm", "etcd", and "full" should
validate that at least one host ID argument was passed before calling
simulate_swarm_node_loss, simulate_etcd_node_loss, or simulate_full_loss; if no
host IDs are provided, print a clear error to stderr (e.g., "error: <mode>
requires at least one <host-id>") and exit with a non-zero status instead of
silently returning success. Update the case block to check the argument count or
"${@:2}" emptiness before invoking those functions and call exit 1 on failure so
the script fails fast and surfaces the misuse.
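One possible shape for both findings, the mode-specific usage text and the host-ID check, is sketched below. It rests on stated assumptions: the `simulate_*` bodies are stubs that only echo their arguments, and `usage`, `require_hosts`, and `reset_fixture` are hypothetical names, not taken from the script. The sketch returns non-zero from `main` rather than calling `exit` directly so it can be exercised from a test; the real entry point could be `main "$@" || exit 1`.

```shell
#!/usr/bin/env bash
# Sketch only. The simulate_* bodies are stubs; usage, require_hosts and
# reset_fixture are hypothetical names used for illustration.

simulate_swarm_node_loss() { echo "swarm: $*"; }
simulate_etcd_node_loss()  { echo "etcd: $*"; }
simulate_full_loss()       { echo "full: $*"; }
reset_fixture()            { echo "reset"; }

# Mode-specific synopsis: reset is shown without host IDs.
usage() {
  cat <<EOF
Usage: $1 <swarm|etcd|full> <host-id> [host-id ...]
       $1 reset
EOF
}

# Fails with a message on stderr when no host ID follows the mode.
require_hosts() {
  local mode=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "error: ${mode} requires at least one <host-id>" >&2
    return 1
  fi
}

main() {
  local mode=${1:-}
  case "$mode" in
    swarm) require_hosts "$@" || return 1; simulate_swarm_node_loss "${@:2}" ;;
    etcd)  require_hosts "$@" || return 1; simulate_etcd_node_loss "${@:2}" ;;
    full)  require_hosts "$@" || return 1; simulate_full_loss "${@:2}" ;;
    reset) reset_fixture ;;
    *)     usage "$0" >&2; return 1 ;;
  esac
}

main swarm host-1 host-2   # prints "swarm: host-1 host-2"
```

Calling `main etcd` with no host IDs prints the error to stderr and returns 1 instead of silently succeeding.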
📒 Files selected for processing (1)
hack/simulate-disaster.sh

Summary
Adds a script to simulate losing a host. This script has three different ways of simulating that loss to enable us to develop recovery steps for Swarm and Control Plane/Etcd in parallel.
Testing
NOTE - You can get this script without checking out this whole branch by doing:
Then to use the script: