A generic, multi-threaded, and ergonomic Rust crate for Multi-Armed Bandit (MAB) algorithms—designed for engineers who need extensibility, custom reward modeling, and a clear path to contextual bandits.
- Support key MAB algorithms:
  - MVP: Epsilon-Greedy, UCB1, Thompson Sampling
  - Future: LinUCB, etc.
- Built-in multi-threading for computational bottlenecks (e.g., regret calculation, reward evaluation).
- Clean, extensible abstraction for stateless and contextual bandits, using Rust traits and generics.
- User-defined reward types and flexible action/context typing via trait bounds.
- Target audience: ML engineers & backend engineers running real-time services in Rust.
- Not a general-purpose RL framework (e.g., no DQN, A3C, etc.)
- Not a rich visualization tool
```rust
// Action: Represents an arm/action in the bandit problem
pub trait Action: Clone + Eq + Hash + Send + Sync + 'static {
    type ValueType;
    fn id(&self) -> usize;
    fn name(&self) -> String { ... }
    fn value(&self) -> Self::ValueType;
}

// Reward: Represents the reward signal
pub trait Reward: Clone + Send + Sync + 'static {
    fn value(&self) -> f64;
}

// Context: Represents contextual information (for contextual bandits)
pub trait Context: Clone + Send + Sync + 'static {
    type DimType: ndarray::Dimension;
    fn to_ndarray(&self) -> ndarray::Array<f64, Self::DimType>;
}

// BanditPolicy: Core trait for all bandit algorithms
pub trait BanditPolicy<A, R, C>: Send + Sync + 'static
where
    A: Action,
    R: Reward,
    C: Context,
{
    fn choose_action(&self, context: &C) -> A;
    fn update(&mut self, context: &C, action: &A, reward: &R);
    fn reset(&mut self);
}

// Environment: Simulated environment for running experiments
pub trait Environment<A, R, C>: Send + Sync + 'static { ... }
```

- Extensibility: Implement these traits for your own types to plug into the framework (see the sketch after this list).
- Action/Reward/Context: Highly generic, must satisfy trait bounds above.
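For example, a minimal `Action`/`Reward` implementation might look like the following. This is a sketch only: the `ArmId` and `ClickReward` types are hypothetical, shown purely to illustrate the trait bounds above.

```rust
use octopus::traits::entities::{Action, Reward};

// Hypothetical arm type: one arm per content variant.
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct ArmId {
    pub id: usize,
    pub label: &'static str,
}

impl Action for ArmId {
    type ValueType = &'static str;

    fn id(&self) -> usize {
        self.id
    }

    fn name(&self) -> String {
        self.label.to_string()
    }

    fn value(&self) -> Self::ValueType {
        self.label
    }
}

// Hypothetical binary reward: 1.0 for a click, 0.0 otherwise.
#[derive(Clone)]
pub struct ClickReward(pub bool);

impl Reward for ClickReward {
    fn value(&self) -> f64 {
        if self.0 { 1.0 } else { 0.0 }
    }
}
```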
- `EpsilonGreedyPolicy` parameters: `epsilon: f64` and the initial actions
- Tracks the average reward and pull count per action (see the decision-rule sketch after this list)
- Generic over action, reward, and context types
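The core decision rule can be sketched as follows. This is illustrative only: the actual `EpsilonGreedyPolicy` internals may differ, and the snippet assumes the `rand` crate.

```rust
use rand::Rng;

// Illustrative decision rule: with probability `epsilon` explore a random
// arm, otherwise exploit the arm with the highest running average reward.
fn choose_index(epsilon: f64, avg_rewards: &[f64]) -> usize {
    let mut rng = rand::thread_rng();
    if rng.gen::<f64>() < epsilon {
        // Explore: pick an arm uniformly at random.
        rng.gen_range(0..avg_rewards.len())
    } else {
        // Exploit: pick the arm with the best estimate so far.
        avg_rewards
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap_or(0)
    }
}

// Incremental running-average update: new_avg = old_avg + (r - old_avg) / n.
fn update_average(avg: &mut f64, count: &mut u64, reward: f64) {
    *count += 1;
    *avg += (reward - *avg) / *count as f64;
}
```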
- The `Simulator` struct orchestrates the interaction between a bandit policy and an environment (see the loop sketch below).
- Collects cumulative rewards, regret, and per-step metrics via `SimulationResults`.
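Conceptually, the orchestration loop looks roughly like this. It is a sketch against the traits above, not the crate's actual implementation: the environment is stood in for by a closure here, because the `Environment` trait body is elided, and the `BanditPolicy` import path is not shown in this README.

```rust
use octopus::traits::entities::{Action, Context, Reward};
// (BanditPolicy import path omitted; it is not shown in this README.)

// Sketch of the policy/environment interaction per step.
fn run_sketch<A: Action, R: Reward, C: Context, P: BanditPolicy<A, R, C>>(
    policy: &mut P,
    context: &C,
    mut observe: impl FnMut(&A) -> R, // stand-in for the environment
    steps: usize,
) -> f64 {
    let mut cumulative_reward = 0.0;
    for _ in 0..steps {
        let action = policy.choose_action(context); // pick an arm
        let reward = observe(&action);              // environment feedback
        cumulative_reward += reward.value();        // record metrics
        policy.update(context, &action, &reward);   // learn from feedback
    }
    cumulative_reward
}
```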
Example:

```rust
use octopus::algorithms::epsilon_greedy::EpsilonGreedyPolicy;
use octopus::simulation::simulator::Simulator;
use octopus::traits::entities::{Action, Reward, Context, DummyContext};

// Define your own Action, Reward, and Environment types implementing the required traits
// ...

let actions = vec![/* your actions here */];
let policy = EpsilonGreedyPolicy::new(0.1, &actions).unwrap();
let environment = /* your environment here */;
let mut simulator = Simulator::new(policy, environment);
let results = simulator.run(1000, &actions);
println!("Cumulative reward: {}", results.cumulative_reward);
```

- Internal parallelism (e.g., for regret calculation) uses `rayon` and thread-safe primitives (`Mutex`); see the sketch after this list.
- User-facing API is single-threaded for simplicity; internal operations are parallelized where beneficial.
- No explicit builder pattern or thread-safe wrappers in the current API.
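As an illustration, a per-step regret aggregation with `rayon` might look like this (a sketch, not the crate's actual internals):

```rust
use rayon::prelude::*;

// Sketch: per-step regret is the gap between the optimal arm's expected
// reward and the reward actually received; summed in parallel with rayon.
fn total_regret(optimal_mean: f64, step_rewards: &[f64]) -> f64 {
    step_rewards.par_iter().map(|r| optimal_mean - r).sum()
}
```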
- Uses `ndarray` for context features
- Uses `rayon` for parallelism
- Error handling via `thiserror`
- (Planned) Optional `serde` for serialization
- Add more algorithms: `UCB1`, `LinUCB`, etc.
- Benchmark suite comparing with Python implementations
- Async/streaming reward update support
- Optional logging/tracing integration
- Unit tests for all algorithms and simulation logic (see the sketch after this list)
- Integration tests with simulated reward distributions
- Stress tests for multi-threaded scenarios
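A unit test in this spirit, reusing the hypothetical `choose_index` helper from the epsilon-greedy sketch above:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn greedy_picks_best_known_arm() {
        // With epsilon = 0.0 the policy always exploits, so the arm with
        // the highest running average reward must be chosen.
        let avg_rewards = vec![0.1, 0.9, 0.3];
        assert_eq!(choose_index(0.0, &avg_rewards), 1);
    }
}
```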