For documentation in Chinese, see README_zh.md.
DT (Distributed Testing) is a lightweight fault-injection framework for validating distributed databases and distributed systems under process and network failures.
It uses a Controller-Agent architecture to orchestrate actions across multiple nodes, and can optionally call a Probe service after each step to verify correctness, availability, or recovery behavior.
DT is designed for scenarios where you want to:
- bring up a multi-node test environment and control each node remotely
- inject failures such as process stop/restart, pause/resume, and network disruption
- verify system behavior after each action with application-specific checks
- reproduce distributed failure scenarios in a repeatable, config-driven way
- Controller: orchestrates the test flow, tracks registered agents, and dispatches commands in order
- Agent: runs on each target machine, manages local processes, and exposes an HTTP API
- Probe: an optional external service that validates expected behavior after each test step
```
┌─────────────┐     Register / Heartbeat       ┌─────────────┐
│             │ ◄────────────────────────────► │             │
│  Controller │        HTTP API calls          │   Agent 1   │
│             │ ─────────────────────────────► │ (Instance)  │
│             │                                └─────────────┘
│             │        HTTP API calls
│             │ ─────────────────────────────► ┌─────────────┐
│             │                                │   Agent 2   │
│             │                                │ (Instance)  │
└─────────────┘                                └─────────────┘
       │
       │ Probe callback (validation)
       ▼
┌─────────────┐
│    Probe    │  (external checker / client)
└─────────────┘
```
DT currently supports:
- distributed multi-node orchestration through Controller-Agent
- process lifecycle control:
  - init
  - start
  - stop
  - restart
  - pause
  - continue
- network fault injection based on iptables:
  - dropport
  - recoverport
  - droppkg
  - limitspeed
- data operations:
  - backup
  - cleanup
- optional probe-based validation after each step
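The iptables-based fault commands above presumably translate to iptables rules under the hood. The mapping below is an assumption for illustration, not DT's actual implementation; `ruleFor` is a hypothetical helper that only builds the rule strings (a real agent would execute them with root privileges):

```go
package main

import "fmt"

// ruleFor sketches a plausible mapping from DT network-fault commands
// to iptables rules. Argument order follows the command table in this
// README: droppkg takes chain/port/percent, limitspeed takes
// chain/port/interval/rate.
func ruleFor(cmd string, args ...string) string {
	switch cmd {
	case "dropport": // block all inbound TCP traffic to a port
		return fmt.Sprintf("iptables -A INPUT -p tcp --dport %s -j DROP", args[0])
	case "recoverport": // delete the matching DROP rule
		return fmt.Sprintf("iptables -D INPUT -p tcp --dport %s -j DROP", args[0])
	case "droppkg": // drop packets randomly; "10" means probability 0.10
		return fmt.Sprintf("iptables -A %s -p tcp --dport %s -m statistic --mode random --probability 0.%s -j DROP",
			args[0], args[1], args[2])
	case "limitspeed": // accept at most rate/interval packets; a trailing
		// DROP rule for the same match is also needed to enforce the cap
		return fmt.Sprintf("iptables -A %s -p tcp --dport %s -m limit --limit %s/%s -j ACCEPT",
			args[0], args[1], args[3], args[2])
	}
	return ""
}

func main() {
	fmt.Println(ruleFor("dropport", "8080"))
	fmt.Println(ruleFor("droppkg", "INPUT", "8080", "10"))
	fmt.Println(ruleFor("limitspeed", "INPUT", "8080", "second", "100"))
}
```

Note that `-m statistic` and `-m limit` are standard iptables match extensions; the percent-to-probability conversion shown here only handles two-digit percentages.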
- Go
- Linux
- iptables
- root / sudo privileges for network fault injection
```shell
make
```

This builds:
- bin/agent
- bin/ctrl
- bin/probe
- bin/instance
The build targets are defined in the project Makefile.
Example agent config (etc/agent_cfg.toml):

```toml
ip = "127.0.0.1"
port = "54320"
ctrl_addr = "127.0.0.1:54321"
data_dir = "./dt/agent"
```

Example controller config (etc/ctrl_example_cfg.toml):
```toml
addr = "127.0.0.1:54321"
instance_count = 1

[instance]
[instance.server]
count = 1

[[test_cmd]]
cmd = "init"
instances = ["server1"]
args = ""
probe = "http://127.0.0.1:8080/probe/server/init"

[[test_cmd]]
cmd = "start"
instances = ["server1"]
dir = "./dt/instance"
args = "nohup ../bin/instance &"
probe = "http://127.0.0.1:8080/probe/server/start"
```

The example configs in the repo use exactly this style.
You can start the sample components in sample/:
```shell
cd sample
./start_probe.sh
./start_ctrl.sh
./start_agent.sh
```

Or run the binaries directly:

```shell
./bin/agent -role=agent -cfg=etc/agent_cfg.toml
./bin/ctrl -role=controller -cfg=etc/ctrl_example_cfg.toml
```

The agent and controller binaries are both built from cmd/main.go and selected by the -role flag.
To remove build output:

```shell
make clean
```

| Command | Description | Example Args |
|---|---|---|
| init | initialize an instance | ./cockroach init --stores=... |
| start | start an instance | nohup ./cockroach start ... & |
| stop | stop an instance (SIGKILL) | - |
| restart | restart an instance | same as start |
| pause | pause an instance (SIGSTOP) | - |
| continue | resume an instance (SIGCONT) | - |
| dropport | block a port via iptables | 8080 |
| recoverport | recover blocked port connectivity | 8080 |
| droppkg | randomly drop packets | INPUT/8080/10 |
| limitspeed | limit traffic rate | INPUT/8080/second/100 |
| sleep | sleep for N seconds | 10 |
| backup | back up instance data | dir |
| cleanup | clean instance data | - |
| shutdown | shut down the agent | - |
If `probe` is set for a `test_cmd`, DT calls that probe endpoint after the step completes.
A typical workflow looks like this:
- start a controller
- start agents on target nodes
- define a sequence of `test_cmd` steps in the controller config
- execute actions such as:
- start a node
- pause a node
- drop a service port
- recover connectivity
- restart the node
- run a probe after each step to validate expected behavior
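A fault sequence like the one above could be expressed as `test_cmd` steps. This is a sketch reusing the field names and port values from the example config earlier in this README; the probe URL here is hypothetical:

```toml
[[test_cmd]]
cmd = "pause"
instances = ["server1"]

[[test_cmd]]
cmd = "dropport"
instances = ["server1"]
args = "8080"

[[test_cmd]]
cmd = "recoverport"
instances = ["server1"]
args = "8080"

[[test_cmd]]
cmd = "restart"
instances = ["server1"]
probe = "http://127.0.0.1:8080/probe/server/start"
```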
The repo includes a CockroachDB example config under etc/ctrl_cockroach_cfg.toml for reference.
```
.
├── cmd/        # main entrypoint
├── pkg/
│   ├── agent/       # agent logic: process control, API, network faults
│   ├── controller/  # controller logic: scheduling, agent management
│   └── util/        # shared utilities
├── sample/     # sample probe, sample instance, helper scripts
├── etc/        # config examples
└── bin/        # build output (generated by make)
```
You can implement custom validation logic in sample/probe/probe.go.
Typical probe extensions include:
- registering HTTP routes such as `/probe/server/start`
- performing health checks or client operations
- verifying correctness after a fault action
- returning probe results back to the controller flow
Then reference the probe URL in the controller config:

```toml
probe = "http://host:port/probe/path"
```

- Linux only
- network fault injection depends on iptables
- root / sudo privileges are required for network fault operations
- probe logic is application-specific and must be implemented separately
- best suited for controlled test environments rather than production systems
See the LICENSE file in the repository root.