llm-db/MLKV

MLKV+ [Under Development]

MLKV+ is a general-purpose, distributed, heterogeneous, and modular key-value data framework for GPU applications (e.g., embedding model training). It integrates two complementary bindings: a GPU-resident layer for high-throughput in-memory access and a CPU/disk layer for large-scale persistent storage. Between these bindings, MLKV+ employs application-aware data migration and multiple optimized transfer paths, including GPU High-Bandwidth Memory (HBM) ↔ DRAM ↔ SSD and direct HBM ↔ SSD pipelines.
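To make the idea of two storage layers with data migration concrete, here is a toy sketch in plain Python. It is not MLKV+ code and does not use its API: the class name `TwoTierStore`, the LRU eviction policy, and the in-memory dictionaries standing in for GPU memory and the CPU/disk layer are all assumptions made for illustration.

```python
from collections import OrderedDict


class TwoTierStore:
    """Toy key-value store: a small LRU 'fast' tier over an unbounded 'slow' tier.

    Hypothetical illustration only; both tiers are ordinary Python dicts.
    """

    def __init__(self, fast_capacity):
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for the GPU-resident layer
        self.slow = {}             # stands in for the CPU/disk layer

    def put(self, key, value):
        self.slow.pop(key, None)   # drop any stale copy in the slow tier
        self.fast[key] = value
        self.fast.move_to_end(key) # mark as most recently used
        self._evict()

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        value = self.slow.pop(key) # raises KeyError if the key is unknown
        self.fast[key] = value     # promote the entry to the fast tier
        self._evict()
        return value

    def _evict(self):
        # Demote least recently used entries until the fast tier fits.
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)
            self.slow[old_key] = old_value
```

With `fast_capacity=2`, inserting three keys demotes the oldest one to the slow tier, and reading it back promotes it again while demoting another entry. A real system would choose what to migrate based on the application's access pattern rather than plain LRU.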

MLKV+ Dataflow

  • What is GDS? GDS (GPUDirect Storage) enables direct data transfers between GPU memory and storage, bypassing the CPU to significantly accelerate I/O operations. As part of the NVIDIA CUDA Toolkit, GDS is supported on NVIDIA GPUs with Volta architecture or newer. For more details, refer to the official documentation. In MLKV+, we leverage GDS to optimize storage access performance.

How to build MLKV+

The following steps build both the PyTorch extension and the libmlkvplus library.

# clone submodule
git submodule update --init --recursive

# create conda envs
conda env create -f env.yml
conda activate mlkv_plus

# build MLKV+(PyTorch)
MAX_JOBS=$(($(nproc)-1)) CUDA_SM="86" pip install -e .
  • Set CUDA_SM to the Compute Capability of your GPU (e.g., "86" for compute capability 8.6).
  • Set MAX_JOBS to the number of parallel compile jobs you want.
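If you are unsure which values to use, the following sketch derives them in Python. It assumes PyTorch is already importable in the active environment; the helper names (`sm_from_capability`, `default_max_jobs`, `detect_cuda_sm`) are hypothetical and not part of MLKV+.

```python
import os


def sm_from_capability(major, minor):
    """Format a compute capability such as (8, 6) as the string "86"."""
    return f"{major}{minor}"


def default_max_jobs(cpu_count=None):
    """Leave one core free, but never go below one job."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count - 1)


def detect_cuda_sm():
    """Return CUDA_SM for GPU 0, or None if PyTorch or a GPU is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return sm_from_capability(major, minor)
```

Calling `detect_cuda_sm()` and `default_max_jobs()` gives values you can pass as `CUDA_SM` and `MAX_JOBS` on the `pip install -e .` command line. Unlike `$(($(nproc)-1))`, which evaluates to 0 on a single-core machine, `default_max_jobs` is clamped to at least 1.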

Playground of MLKV+

Warning: The playground is still a work in progress and may raise CUDA errors in some cases.

  • Run the single-node playground:
    python playground/mlkvp_playground.py
  • Run the distributed playground:
    torchrun --nproc_per_node=4 playground/dist_mlkvp_playground.py
    • Set --nproc_per_node to the number of GPUs you want to use.
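To launch one process per visible GPU instead of hard-coding 4, the command can be assembled as in the sketch below. The helper names are hypothetical, and `visible_gpu_count` assumes PyTorch is installed; it returns 0 otherwise.

```python
import shlex


def torchrun_command(num_gpus, script="playground/dist_mlkvp_playground.py"):
    """Build the torchrun argument list for a single-node run."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ["torchrun", f"--nproc_per_node={num_gpus}", script]


def visible_gpu_count():
    """Number of CUDA devices PyTorch can see, or 0 if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


# Example: print the command line for a 4-GPU run.
print(shlex.join(torchrun_command(4)))
```

The returned list can be passed directly to `subprocess.run`, for example with `torchrun_command(visible_gpu_count())` on a machine that has at least one GPU.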

How to build libmlkvplus

# clone submodule
git submodule update --init --recursive

# create conda envs
conda env create -f env.yml
conda activate mlkv_plus

# build libmlkvplus
mkdir -p build && cd ./build
cmake .. -Dsm=86 && make -j$(($(nproc)-1)) && cmake --install . --component gycsb_python_binding

Playground of libmlkvplus

  • Run the simple example:
    ./test/mlkv_plus_simple_example

Benchmark

We use the gYCSB framework to benchmark MLKV+ performance.

  • Make sure you have cloned the gYCSB submodule and built either libmlkvplus or MLKV+ (PyTorch + libmlkvplus).
  • Install gYCSB from the repository root:
    pip install -e ./gYCSB
  • Run a simple benchmark:
    gycsb singlerun --runner_config gycsb_running_config.yaml --running_name mlkv_plus

How to install GPUDirect Storage

Please refer to the official documentation to install GPUDirect Storage.
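After installation, one quick sanity check on Linux is whether the `nvidia_fs` kernel module is loaded. The sketch below only inspects `/proc/modules`, and the function names are hypothetical. A missing module does not by itself prove that GDS is unusable, since GDS can also run in a compatibility mode; NVIDIA's `gdscheck` tool, shipped with GDS, remains the authoritative check.

```python
def module_loaded(proc_modules_text, name="nvidia_fs"):
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def nvidia_fs_loaded(path="/proc/modules"):
    """Check the running kernel; returns False if the file cannot be read."""
    try:
        with open(path) as f:
            return module_loaded(f.read())
    except OSError:
        return False
```

`nvidia_fs_loaded()` returns a plain boolean, so it can serve as a pre-flight check before running a GDS-enabled benchmark.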

Known Issues

  • G-Page Cache IO Errors: The G-Page Cache may encounter IO errors during Get operations, such as:

    Failed to get from SST files: IO error: GDS read failed: Incomplete GDS read: 
    requested 262144 bytes (aligned), got 262144 bytes at offset 10223616, 
    need at least 265268 bytes for requested range
    
  • MultiGet Operation: The MultiGet logic has known limitations and may degrade performance in some cases.

  • PyTorch Binding: The PyTorch binding may occasionally raise CUDA errors:

    torch.AcceleratorError: CUDA error: __global__ function call is not configured
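For the G-Page Cache error listed first above, the numbers in the message show a read of 262144 bytes at offset 10223616, while the requested range needed 265268 bytes, i.e. 3124 bytes more than were read. The sketch below is not MLKV+ code and does not diagnose the bug. It only shows how to compute an aligned read span that fully covers a byte range, assuming 4096-byte alignment (an assumption; the alignment MLKV+ actually uses is not stated here).

```python
def aligned_span(offset, length, alignment=4096):
    """Return (start, size) of the smallest aligned span covering
    the byte range [offset, offset + length)."""
    if alignment <= 0 or offset < 0 or length < 0:
        raise ValueError("alignment must be positive; offset and length non-negative")
    start = (offset // alignment) * alignment           # round start down
    end = -(-(offset + length) // alignment) * alignment  # round end up
    return start, end - start
```

With the values from the message, `aligned_span(10223616, 265268)` returns `(10223616, 266240)`: the offset is already a multiple of 4096, and covering 265268 bytes takes 65 blocks of 4096 bytes rather than the 64 blocks (262144 bytes) that were read.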
    

About

MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage (ICDE 2025 Industry Track)
