For documentation in Chinese, see README_zh.md.
DT (Distributed Testing) is a lightweight fault-injection framework for validating distributed databases and distributed systems under process and network failures.
It uses a Controller-Agent architecture to orchestrate actions across multiple nodes, and can optionally call a Probe service after each step to verify correctness, availability, or recovery behavior.
DT is designed for scenarios where you want to:
- bring up a multi-node test environment and control each node remotely
- inject failures such as process stop/restart, pause/resume, and network disruption
- verify system behavior after each action with application-specific checks
- reproduce distributed failure scenarios in a repeatable, config-driven way
- Controller: orchestrates the test flow, tracks registered agents, and dispatches commands in order
- Agent: runs on each target machine, manages local processes, and exposes an HTTP API
- Probe: an optional external service that validates expected behavior after each test step
```
┌─────────────┐     Register / Heartbeat       ┌─────────────┐
│             │ ◄────────────────────────────► │             │
│  Controller │        HTTP API calls          │   Agent 1   │
│             │ ─────────────────────────────► │ (Instance)  │
│             │                                └─────────────┘
│             │        HTTP API calls
│             │ ─────────────────────────────► ┌─────────────┐
│             │                                │   Agent 2   │
│             │                                │ (Instance)  │
└─────────────┘                                └─────────────┘
       │
       │ Probe callback (validation)
       ▼
┌─────────────┐
│    Probe    │  (external checker / client)
└─────────────┘
```
DT currently supports:
- distributed multi-node orchestration through Controller-Agent
- process lifecycle control:
  - init
  - start
  - stop
  - restart
  - pause
  - continue
- network fault injection based on iptables:
  - dropport
  - recoverport
  - droppkg
  - limitspeed
- data operations:
  - backup
  - cleanup
- optional probe-based validation after each step
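The iptables-based fault commands above presumably translate to iptables rules under the hood. The mapping below is an assumption for illustration, not DT's actual implementation; `ruleFor` is a hypothetical helper that only builds the rule strings (a real agent would execute them with root privileges):

```go
package main

import "fmt"

// ruleFor sketches a plausible mapping from DT network-fault commands
// to iptables rules. Argument order follows the command table in this
// README: droppkg takes chain/port/percent, limitspeed takes
// chain/port/interval/rate.
func ruleFor(cmd string, args ...string) string {
	switch cmd {
	case "dropport": // block all inbound TCP traffic to a port
		return fmt.Sprintf("iptables -A INPUT -p tcp --dport %s -j DROP", args[0])
	case "recoverport": // delete the matching DROP rule
		return fmt.Sprintf("iptables -D INPUT -p tcp --dport %s -j DROP", args[0])
	case "droppkg": // drop packets randomly; "10" means probability 0.10
		return fmt.Sprintf("iptables -A %s -p tcp --dport %s -m statistic --mode random --probability 0.%s -j DROP",
			args[0], args[1], args[2])
	case "limitspeed": // accept at most rate/interval packets; a trailing
		// DROP rule for the same match is also needed to enforce the cap
		return fmt.Sprintf("iptables -A %s -p tcp --dport %s -m limit --limit %s/%s -j ACCEPT",
			args[0], args[1], args[3], args[2])
	}
	return ""
}

func main() {
	fmt.Println(ruleFor("dropport", "8080"))
	fmt.Println(ruleFor("droppkg", "INPUT", "8080", "10"))
	fmt.Println(ruleFor("limitspeed", "INPUT", "8080", "second", "100"))
}
```

Note that `-m statistic` and `-m limit` are standard iptables match extensions; the percent-to-probability conversion shown here only handles two-digit percentages.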
- Go
- Linux
- iptables
- root / sudo privileges for network fault injection
```shell
make
```

This builds:
- bin/agent
- bin/ctrl
- bin/probe
- bin/instance
The build targets are defined in the project Makefile.
Example agent config (etc/agent_cfg.toml):

```toml
ip = "127.0.0.1"
port = "54320"
ctrl_addr = "127.0.0.1:54321"
data_dir = "./dt/agent"
```

Example controller config (etc/ctrl_example_cfg.toml):
```toml
addr = "127.0.0.1:54321"
instance_count = 1

[instance]
[instance.server]
count = 1

[[test_cmd]]
cmd = "init"
instances = ["server1"]
args = ""
probe = "http://127.0.0.1:8080/probe/server/init"

[[test_cmd]]
cmd = "start"
instances = ["server1"]
dir = "./dt/instance"
args = "nohup ../bin/instance &"
probe = "http://127.0.0.1:8080/probe/server/start"
```

The example configs in the repo use exactly this style.
You can start the sample components in sample/:
```shell
cd sample
./start_probe.sh
./start_ctrl.sh
./start_agent.sh
```

Or run the binaries directly:

```shell
./bin/agent -role=agent -cfg=etc/agent_cfg.toml
./bin/ctrl -role=controller -cfg=etc/ctrl_example_cfg.toml
```

The agent and controller binaries are both built from cmd/main.go and selected by the -role flag.
To remove build output:

```shell
make clean
```

| Command | Description | Example Args |
|---|---|---|
| init | initialize an instance | ./cockroach init --stores=... |
| start | start an instance | nohup ./cockroach start ... & |
| stop | stop an instance (SIGKILL) | - |
| restart | restart an instance | same as start |
| pause | pause an instance (SIGSTOP) | - |
| continue | resume an instance (SIGCONT) | - |
| dropport | block a port via iptables | 8080 |
| recoverport | recover blocked port connectivity | 8080 |
| droppkg | randomly drop packets | INPUT/8080/10 |
| limitspeed | limit traffic rate | INPUT/8080/second/100 |
| sleep | sleep for N seconds | 10 |
| backup | back up instance data | dir |
| cleanup | clean instance data | - |
| shutdown | shut down the agent | - |
If `probe` is set for a `test_cmd`, DT calls that probe endpoint after the step completes.
A typical workflow looks like this:
- start a controller
- start agents on target nodes
- define a sequence of `test_cmd` steps in the controller config
- execute actions such as:
- start a node
- pause a node
- drop a service port
- recover connectivity
- restart the node
- run a probe after each step to validate expected behavior
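A fault sequence like the one above could be expressed as `test_cmd` steps. This is a sketch reusing the field names and port values from the example config earlier in this README; the probe URL here is hypothetical:

```toml
[[test_cmd]]
cmd = "pause"
instances = ["server1"]

[[test_cmd]]
cmd = "dropport"
instances = ["server1"]
args = "8080"

[[test_cmd]]
cmd = "recoverport"
instances = ["server1"]
args = "8080"

[[test_cmd]]
cmd = "restart"
instances = ["server1"]
probe = "http://127.0.0.1:8080/probe/server/start"
```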
The repo includes a CockroachDB example config under etc/ctrl_cockroach_cfg.toml for reference.
```
.
├── cmd/        # main entrypoint
├── pkg/
│   ├── agent/       # agent logic: process control, API, network faults
│   ├── controller/  # controller logic: scheduling, agent management
│   └── util/        # shared utilities
├── sample/     # sample probe, sample instance, helper scripts
├── etc/        # config examples
└── bin/        # build output (generated by make)
```
You can implement custom validation logic in sample/probe/probe.go.
Typical probe extensions include:
- registering HTTP routes such as `/probe/server/start`
- performing health checks or client operations
- verifying correctness after a fault action
- returning probe results back to the controller flow
Then reference the probe URL in the controller config:

```toml
probe = "http://host:port/probe/path"
```

- Linux only
- network fault injection depends on iptables
- root / sudo privileges are required for network fault operations
- probe logic is application-specific and must be implemented separately
- best suited for controlled test environments rather than production systems
See the LICENSE file in the repository root.