Ultar

Ultra-scale tar / webdataset

Note

No guarantees whatsoever. We are just trying to make webdataset random-accessible, and make the indexing fast as frick.

Contribute / Develop

The project is built with zig. You should be able to build it with a zig>=0.15 install.

zig build -Doptimize=ReleaseSafe

Install

CI should provide a build for you. It should be fully static, so there are no glibc requirements.

Performance

Remote NFS storage (hosted by lambdalabs), scanning 32 tar files (~1.4GB each) with a single instance / process of indexer

        User time (seconds): 3.20
        System time (seconds): 72.28
        Percent of CPU this job got: 183%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:41.09
        Maximum resident set size (kbytes): 10624

On this particular system, a single instance of indexer saturates at around 10 Gibps, with most of the time spent on Linux's NFS server.
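As a rough sanity check, the figures above can be turned into an average throughput. This is only a sketch: it assumes each tar is exactly 1.4 GB (the run above only says ~1.4GB) and divides by the full wall-clock time, so it gives an average over the run and not a peak rate.

```python
# Back-of-the-envelope average throughput from the figures quoted above.
NUM_TARS = 32
TAR_SIZE_GB = 1.4      # assumption: "~1.4GB each" taken as exactly 1.4
WALL_SECONDS = 41.09   # elapsed wall-clock time reported for the run


def average_throughput(num_files, size_gb, seconds):
    """Return (GB/s, Gbit/s) averaged over the whole run."""
    total_gb = num_files * size_gb
    gb_per_s = total_gb / seconds
    return gb_per_s, gb_per_s * 8


gb_per_s, gbit_per_s = average_throughput(NUM_TARS, TAR_SIZE_GB, WALL_SECONDS)
print(f"{gb_per_s:.2f} GB/s, about {gbit_per_s:.1f} Gbit/s on average")
# prints "1.09 GB/s, about 8.7 Gbit/s on average"
```

The average lands a little under the quoted saturation figure, which is what you would expect if the run includes startup and per-file overhead.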

It runs slightly too fast on local NVMe storage, so I didn't bother with an instrumented test.

Methodology

Simple single-process, event-loop-based IO provided by libxev, thus wielding the full power of io_uring.

Have I mentioned it's written with zig?

Python Bindings

The python/ directory contains ABI3-compatible Python bindings for the Lua dataloader.

# Build
zig build python-bindings -Doptimize=ReleaseSafe

# Build wheel
python -m build --wheel --no-isolation python/

# Install
pip install python/dist/*.whl

See python/README.md for usage details.

Lua Scripting API

The dataloader uses Lua scripts for flexible data loading pipelines. Scripts use standard Lua require() to import modules:

local loader = require("ultar.loader")
local utix = require("ultar.utix")

return {
    -- Build the context that row_generator receives from the supplied config.
    init_ctx = function(rank, world_size, config)
        return {
            tar_path = config.tar_path,
            idx_path = config.idx_path,
        }
    end,

    -- Walk the index and queue every non-empty entry of each row.
    row_generator = function(ctx)
        local tar = loader:open_file(ctx.tar_path)
        local idx = utix.open(ctx.idx_path)

        for row in idx:iter() do
            for i = 1, #row.keys do
                -- Skip zero-sized entries.
                if row.sizes[i] > 0 then
                    -- Entry position: row offset plus the entry's own offset.
                    loader:add_entry(tar, row.keys[i],
                        row.offset + row.offsets[i], row.sizes[i])
                end
            end
            loader:finish_row()
        end

        loader:close_file(tar)
    end,
}
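The script above hands the loader a byte offset and a size for every entry, which is what makes a tar random-accessible: once the offsets are known, a reader can seek straight to the payload. The sketch below illustrates that idea with Python's standard `tarfile` module only. It does not use ultar or the `.utix` format; the helper names (`build_tar`, `build_index`, `read_entry`) and the plain-dict index are hypothetical stand-ins.

```python
import io
import tarfile


def build_tar(files):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(tar_bytes):
    """Scan the archive once, recording (offset, size) of each member's data."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_entry(tar_bytes, offset, size):
    """Random access: seek to the payload and read it, no header parsing."""
    f = io.BytesIO(tar_bytes)
    f.seek(offset)
    return f.read(size)


# Toy sample with two entries, in the spirit of a webdataset row.
tar_bytes = build_tar({"0001.txt": b"a caption", "0001.jpg": b"\xff\xd8fake"})
index = build_index(tar_bytes)
offset, size = index["0001.jpg"]
print(read_entry(tar_bytes, offset, size))
```

The expensive part is the single sequential scan in `build_index`; every later read is a seek plus a read of exactly `size` bytes.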

Available Modules

| Module | Description |
| --- | --- |
| `ultar.loader` | Async data loading interface - open files, add entries, finish rows |
| `ultar.utix` | Read `.utix` (msgpack) index files |
| `ultar.scandir` | Directory scanning utilities |

LSP Integration

We ship type stubs for LuaLS (the standard Lua language server). This provides:

Autocompletion for all ultar modules
Hover documentation with function signatures
Type checking for parameters
Go to definition support

Quick Setup (Recommended)

If you've installed ultar-dataloader via pip, use the CLI to set up LSP:

cd your-project/
ultar-dataloader init-lsp

This creates a .luarc.json pointing to the type stubs shipped with the package.

Manual Setup

For development or custom setups, add .luarc.json to your project root:

{
  "$schema": "https://raw.githubusercontent.com/LuaLS/vscode-lua/master/setting/schema.json",
  "workspace.library": [
    "/path/to/ultar/lua-types"
  ],
  "runtime.version": "LuaJIT"
}

Or get the path programmatically:

ultar-dataloader types-path
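If you would rather script the manual setup, a minimal sketch along these lines writes the same `.luarc.json` keys shown above. The `write_luarc` helper is hypothetical and not part of ultar; it assumes you pass in the stubs directory yourself, for example the path printed by the command above.

```python
import json
import tempfile
from pathlib import Path


def write_luarc(project_root, types_path):
    """Write a .luarc.json with the same keys as the manual example."""
    config = {
        "$schema": "https://raw.githubusercontent.com/LuaLS/vscode-lua/master/setting/schema.json",
        "workspace.library": [str(types_path)],
        "runtime.version": "LuaJIT",
    }
    target = Path(project_root) / ".luarc.json"
    target.write_text(json.dumps(config, indent=2) + "\n")
    return target


# Demo in a throwaway directory; the types path here is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    luarc = write_luarc(tmp, "/path/to/ultar/lua-types")
    print(luarc.read_text())
```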

