DT

For the Chinese version of this document, see README_zh.md.

DT (Distributed Testing) is a lightweight fault-injection framework for validating distributed databases and distributed systems under process and network failures.

It uses a Controller-Agent architecture to orchestrate actions across multiple nodes, and can optionally call a Probe service after each step to verify correctness, availability, or recovery behavior.

Why DT

DT is designed for scenarios where you want to:

  • bring up a multi-node test environment and control each node remotely
  • inject failures such as process stop/restart, pause/resume, and network disruption
  • verify system behavior after each action with application-specific checks
  • reproduce distributed failure scenarios in a repeatable, config-driven way

Core Concepts

  • Controller: orchestrates the test flow, tracks registered agents, and dispatches commands in order
  • Agent: runs on each target machine, manages local processes, and exposes an HTTP API
  • Probe: an optional external service that validates expected behavior after each test step

Architecture

┌─────────────┐         Register / Heartbeat      ┌─────────────┐
│             │ ◄──────────────────────────────► │             │
│ Controller  │          HTTP API calls          │   Agent 1   │
│             │ ───────────────────────────────► │ (Instance)  │
│             │                                  └─────────────┘
│             │          HTTP API calls
│             │ ───────────────────────────────► ┌─────────────┐
│             │                                  │   Agent 2   │
│             │                                  │ (Instance)  │
└─────────────┘                                  └─────────────┘
       │
       │ Probe callback (validation)
       ▼
┌─────────────┐
│    Probe    │   (external checker / client)
└─────────────┘
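The Register/Heartbeat leg of the diagram can be sketched as a plain HTTP POST from agent to controller. The payload shape and the `/register` endpoint path below are assumptions for illustration; DT's actual wire format lives in pkg/agent and pkg/controller.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// registration is a hypothetical payload an agent might send when it
// announces itself to the controller.
type registration struct {
	IP   string `json:"ip"`
	Port string `json:"port"`
}

// register POSTs the agent's address to a controller /register endpoint
// (the endpoint path is an assumption for illustration).
func register(ctrlURL string, reg registration) error {
	body, err := json.Marshal(reg)
	if err != nil {
		return err
	}
	resp, err := http.Post(ctrlURL+"/register", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("register failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Stand-in controller that accepts any registration.
	ctrl := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer ctrl.Close()

	err := register(ctrl.URL, registration{IP: "127.0.0.1", Port: "54320"})
	fmt.Println("registered:", err == nil)
}
```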

What DT Can Do

DT currently supports:

  • distributed multi-node orchestration through Controller-Agent
  • process lifecycle control:
    • init
    • start
    • stop
    • restart
    • pause
    • continue
  • network fault injection based on iptables:
    • dropport
    • recoverport
    • droppkg
    • limitspeed
  • data operations:
    • backup
    • cleanup
  • optional probe-based validation after each step

Quick Start

Requirements

  • Go
  • Linux
  • iptables
  • root / sudo privileges for network fault injection

Build

make

This builds:

  • bin/agent
  • bin/ctrl
  • bin/probe
  • bin/instance

The build targets are defined in the project Makefile.

Configure

Example agent config (etc/agent_cfg.toml):

ip = "127.0.0.1"
port = "54320"
ctrl_addr = "127.0.0.1:54321"
data_dir = "./dt/agent"

Example controller config (etc/ctrl_example_cfg.toml):

addr = "127.0.0.1:54321"
instance_count = 1

[instance]
  [instance.server]
  count = 1

[[test_cmd]]
cmd = "init"
instances = ["server1"]
args = ""
probe = "http://127.0.0.1:8080/probe/server/init"

[[test_cmd]]
cmd = "start"
instances = ["server1"]
dir = "./dt/instance"
args = "nohup ../bin/instance &"
probe = "http://127.0.0.1:8080/probe/server/start"

The example configs in the repo use exactly this style.

Run

You can start the sample components in sample/:

cd sample
./start_probe.sh
./start_ctrl.sh
./start_agent.sh

Or run the binaries directly:

./bin/agent -role=agent -cfg=etc/agent_cfg.toml
./bin/ctrl -role=controller -cfg=etc/ctrl_example_cfg.toml

The agent and controller binaries are both built from cmd/main.go and selected by the -role flag.
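A minimal sketch of that single-binary dispatch, assuming a simple switch on the flag value (the real cmd/main.go may structure this differently):

```go
package main

import (
	"flag"
	"fmt"
)

// runRole selects the component to start based on the -role flag value.
func runRole(role string) (string, error) {
	switch role {
	case "agent":
		return "starting agent", nil
	case "controller":
		return "starting controller", nil
	default:
		return "", fmt.Errorf("unknown role %q", role)
	}
}

func main() {
	role := flag.String("role", "agent", "agent or controller")
	cfg := flag.String("cfg", "etc/agent_cfg.toml", "path to TOML config")
	flag.Parse()

	msg, err := runRole(*role)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s with config %s\n", msg, *cfg)
}
```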

Clean

make clean

Supported Commands

Command      Description                          Example Args
init         initialize an instance               ./cockroach init --stores=...
start        start an instance                    nohup ./cockroach start ... &
stop         stop an instance (SIGKILL)           -
restart      restart an instance                  same as start
pause        pause an instance (SIGSTOP)          -
continue     resume an instance (SIGCONT)         -
dropport     block a port via iptables            8080
recoverport  recover blocked port connectivity    8080
droppkg      randomly drop packets                INPUT/8080/10
limitspeed   limit traffic rate                   INPUT/8080/second/100
sleep        sleep for N seconds                  10
backup       back up instance data                dir
cleanup      clean instance data                  -
shutdown     shut down the agent                  -

If probe is specified in a test_cmd, DT will call the probe endpoint after that step.

Example Scenario

A typical workflow looks like this:

  1. start a controller
  2. start agents on target nodes
  3. define a sequence of test_cmd steps in controller config
  4. execute actions such as:
    • start a node
    • pause a node
    • drop a service port
    • recover connectivity
    • restart the node
  5. run a probe after each step to validate expected behavior
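Steps 4 and 5 map directly onto a sequence of test_cmd entries. A hypothetical fragment in the same style as the example configs (probe URLs are illustrative):

```toml
[[test_cmd]]
cmd = "pause"
instances = ["server1"]
probe = "http://127.0.0.1:8080/probe/server/pause"

[[test_cmd]]
cmd = "dropport"
instances = ["server1"]
args = "8080"
probe = "http://127.0.0.1:8080/probe/server/dropport"

[[test_cmd]]
cmd = "recoverport"
instances = ["server1"]
args = "8080"

[[test_cmd]]
cmd = "restart"
instances = ["server1"]
dir = "./dt/instance"
args = "nohup ../bin/instance &"
probe = "http://127.0.0.1:8080/probe/server/start"
```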

The repo includes a CockroachDB example config under etc/ctrl_cockroach_cfg.toml for reference.

Repository Layout

.
├── cmd/              # main entrypoint
├── pkg/
│   ├── agent/        # agent logic: process control, API, network faults
│   ├── controller/   # controller logic: scheduling, agent management
│   └── util/         # shared utilities
├── sample/           # sample probe, sample instance, helper scripts
├── etc/              # config examples
└── bin/              # build output (generated by make)

Extending Probes

You can implement custom validation logic in sample/probe/probe.go.

Typical probe extensions include:

  • registering HTTP routes such as /probe/server/start
  • performing health checks or client operations
  • verifying correctness after a fault action
  • returning probe results back to the controller flow

Then reference the probe URL in controller config:

probe = "http://host:port/probe/path"

Current Limitations

  • Linux only
  • network fault injection depends on iptables
  • root / sudo privileges are required for network fault operations
  • probe logic is application-specific and must be implemented separately
  • best suited for controlled test environments rather than production systems

License

See the LICENSE file in the repository root.
