Python bindings for Cactus Engine via FFI. Auto-installed when you run `source ./setup`.
```bash
# Setup environment
source ./setup

# Build shared library for Python
cactus build --python

# Download models
cactus download LiquidAI/LFM2-VL-450M
cactus download openai/whisper-small

# Optional: set your Cactus Cloud API key for automatic cloud fallback
cactus auth
```

```python
from cactus import cactus_init, cactus_complete, cactus_destroy
import json

model = cactus_init("weights/lfm2-vl-450m")
messages = [{"role": "user", "content": "What is 2+2?"}]
response = json.loads(cactus_complete(model, messages))
print(response["response"])
cactus_destroy(model)
```

`cactus_init` initializes a model and returns its handle.
| Parameter | Type | Description |
|---|---|---|
| `model_path` | str | Path to model weights directory |
| `corpus_dir` | str | Optional path to RAG corpus directory for document Q&A |
```python
model = cactus_init("weights/lfm2-vl-450m")
rag_model = cactus_init("weights/lfm2-rag", corpus_dir="./documents")
```

`cactus_complete` runs a chat completion. Returns a JSON string with the response and metrics.
| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Model handle from `cactus_init` |
| `messages` | list \| str | List of message dicts or JSON string |
| `tools` | list | Optional tool definitions for function calling |
| `temperature` | float | Sampling temperature |
| `top_p` | float | Top-p sampling |
| `top_k` | int | Top-k sampling |
| `max_tokens` | int | Maximum tokens to generate |
| `stop_sequences` | list | Stop sequences |
| `include_stop_sequences` | bool | Include matched stop sequences in output (default: False) |
| `force_tools` | bool | Constrain output to tool call format |
| `tool_rag_top_k` | int | Select top-k relevant tools via Tool RAG (default: 2; 0 = use all tools) |
| `confidence_threshold` | float | Minimum confidence for local generation (default: 0.7; triggers `cloud_handoff` when below) |
| `callback` | fn | Streaming callback `fn(token, token_id, user_data)` |
```python
# Basic completion
messages = [{"role": "user", "content": "Hello!"}]
response = cactus_complete(model, messages, max_tokens=100)
print(json.loads(response)["response"])
```

```python
# With tools
tools = [{
    "name": "get_weather",
    "description": "Get weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
    }
}]
response = cactus_complete(model, messages, tools=tools)
```
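Tool calls come back in the `function_calls` field of the response JSON rather than as free text, so your application is responsible for dispatching them. A minimal dispatch sketch; the `get_weather` implementation and the hand-written `response` are illustrative stand-ins, and the per-call `name`/`arguments` keys are an assumption about the entry shape:

```python
import json

# Illustrative local implementation of the declared tool
def get_weather(location):
    return f"Sunny in {location}"

TOOL_IMPLS = {"get_weather": get_weather}

# Stand-in for the JSON string returned by cactus_complete(model, messages, tools=tools)
response = json.dumps({
    "success": True,
    "response": "",
    "function_calls": [
        {"name": "get_weather", "arguments": {"location": "Paris"}}
    ],
})

result = json.loads(response)
for call in result["function_calls"]:
    impl = TOOL_IMPLS.get(call["name"])
    if impl:
        print(impl(**call["arguments"]))
```

Looking up implementations in a dict keeps dispatch safe when the model names a tool you did not register.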
```python
# Streaming
def on_token(token, token_id, user_data):
    print(token, end="", flush=True)

cactus_complete(model, messages, callback=on_token)
```

Response format (all fields always present):
```json
{
  "success": true,
  "error": null,
  "cloud_handoff": false,
  "response": "Hello! How can I help?",
  "function_calls": [],
  "confidence": 0.85,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 163.7,
  "prefill_tps": 619.5,
  "decode_tps": 168.4,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
```

Cloud handoff response (when the model detects low confidence):
```json
{
  "success": false,
  "error": null,
  "cloud_handoff": true,
  "response": null,
  "function_calls": [],
  "confidence": 0.18,
  "time_to_first_token_ms": 45.2,
  "total_time_ms": 45.2,
  "prefill_tps": 619.5,
  "decode_tps": 0.0,
  "ram_usage_mb": 245.67,
  "prefill_tokens": 28,
  "decode_tokens": 0,
  "total_tokens": 28
}
```

When `cloud_handoff` is true, the model's confidence dropped below `confidence_threshold` (default: 0.7) and it recommends deferring to a cloud-based model for better results. Handle this in your application:
```python
result = json.loads(cactus_complete(model, messages))
if result["cloud_handoff"]:
    # Defer to a cloud API (e.g., OpenAI, Anthropic)
    response = call_cloud_api(messages)
else:
    response = result["response"]
```

`cactus_transcribe` transcribes audio using a Whisper model. Returns a JSON string.
| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Whisper model handle |
| `audio_path` | str | Path to audio file (WAV) |
| `prompt` | str | Whisper prompt for language/task |
```python
whisper = cactus_init("weights/whisper-small")
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"
response = cactus_transcribe(whisper, "audio.wav", prompt=prompt)
print(json.loads(response)["response"])
cactus_destroy(whisper)
```

`cactus_embed` returns text embeddings as a list of floats.
| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Model handle |
| `text` | str | Text to embed |
| `normalize` | bool | L2-normalize embeddings (default: False) |
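A returned embedding is just a vector, so comparing texts is up to the caller. A cosine-similarity helper in plain Python (the vectors are stand-ins for real `cactus_embed` outputs); note that for vectors obtained with `normalize=True`, cosine similarity reduces to a plain dot product:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for cactus_embed(model, text) outputs
a = [0.1, 0.3, 0.5]
b = [0.1, 0.29, 0.52]
print(f"similarity: {cosine(a, b):.3f}")
```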
```python
embedding = cactus_embed(model, "Hello world")
print(f"Dimension: {len(embedding)}")
```

`cactus_image_embed` returns image embeddings from a VLM as a list of floats.

```python
embedding = cactus_image_embed(model, "image.png")
```

`cactus_audio_embed` returns audio embeddings from a Whisper model as a list of floats.

```python
embedding = cactus_audio_embed(whisper, "audio.wav")
```

`cactus_reset` resets model state (clears the KV cache). Call it between unrelated conversations.

```python
cactus_reset(model)
```

`cactus_stop` stops an ongoing generation (useful with streaming callbacks).

```python
cactus_stop(model)
```

`cactus_destroy` frees model memory. Always call it when you are done with a handle.

```python
cactus_destroy(model)
```

`cactus_get_last_error` returns the last error message, or None if there was no error.
```python
error = cactus_get_last_error()
if error:
    print(f"Error: {error}")
```

`cactus_tokenize` tokenizes text. Returns a list of token IDs.

```python
tokens = cactus_tokenize(model, "Hello world")
print(tokens)  # [1234, 5678, ...]
```

`cactus_rag_query` queries the RAG corpus for relevant text chunks. Requires a model initialized with `corpus_dir`.
| Parameter | Type | Description |
|---|---|---|
| `model` | handle | Model handle (must have `corpus_dir` set) |
| `query` | str | Query text |
| `top_k` | int | Number of chunks to retrieve (default: 5) |
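Retrieved chunks are typically stitched into the completion prompt so the model answers from the corpus. A sketch of that wiring; `build_rag_prompt` is a hypothetical helper and the example chunk stands in for real `cactus_rag_query` output:

```python
def build_rag_prompt(question, chunks):
    # Concatenate retrieved chunk texts into a single grounded user message
    context = "\n\n".join(c["text"] for c in chunks)
    return [{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }]

# Stand-in for cactus_rag_query(model, question, top_k=3)
chunks = [{"score": 0.91, "text": "Machine learning is a subfield of AI."}]
messages = build_rag_prompt("What is machine learning?", chunks)
# messages can now be passed to cactus_complete(model, messages)
```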
```python
model = cactus_init("weights/lfm2-rag", corpus_dir="./documents")
chunks = cactus_rag_query(model, "What is machine learning?", top_k=3)
for chunk in chunks:
    print(f"Score: {chunk['score']:.2f} - {chunk['text'][:100]}...")
```

Pass images in the messages for vision-language models:

```python
vlm = cactus_init("weights/lfm2-vl-450m")
messages = [{
    "role": "user",
    "content": "Describe this image",
    "images": ["path/to/image.png"]
}]
response = cactus_complete(vlm, messages)
print(json.loads(response)["response"])
```

See `python/example.py` for a complete example covering:
- Text completion
- Text/image/audio embeddings
- Vision (VLM)
- Speech transcription
```bash
python python/example.py
```