Local LLM Setup & Performance (GPU / Tokens/sec) #2
Rami8612 started this conversation in Local LLMs
Local Model Performance Benchmarks
Performance when running RamiBot with local models varies significantly depending on the GPU, the model size, and the inference settings.
This discussion is intended to collect performance benchmarks from the community.
Please share your setup using the following format:
Hardware: GPU (and any other relevant specs)
Model: model name and size
Inference: inference settings and measured tokens/sec

Example:

Hardware: …
Model: …
Inference: …
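If you are unsure how to measure tokens/sec, the sketch below is one way to do it. It is a minimal example, assuming a local OpenAI-compatible completion server (for example, a llama.cpp server or Ollama's OpenAI-compatible API); the endpoint URL, model name, and prompt are placeholders for your own setup.

```python
import time

import requests

# Hypothetical endpoint and model name: substitute your own local setup.
ENDPOINT = "http://localhost:8080/v1/completions"
MODEL = "your-local-model"


def measure_tokens_per_sec(prompt: str, max_tokens: int = 128) -> float:
    """Time one completion request and estimate generation speed."""
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    # Most OpenAI-compatible servers report token counts under "usage";
    # adjust this lookup if your server's response differs.
    generated = resp.json()["usage"]["completion_tokens"]
    return generated / elapsed


if __name__ == "__main__":
    speed = measure_tokens_per_sec("Explain KV caching in one paragraph.")
    print(f"~{speed:.1f} tokens/sec (timing includes prompt processing)")
```

Note that this measures end-to-end request time, so prompt processing is included; if your backend reports its own generation-only stats, those are a cleaner number to post.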
These benchmarks help others choose the best setup for running RamiBot locally.
If you test multiple models on the same GPU, feel free to post comparisons.
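For same-GPU comparisons, a small loop over the same endpoint produces a quick side-by-side table. This is a sketch under the same assumptions as above; the model names are hypothetical placeholders.

```python
import time

import requests

# Hypothetical endpoint and model names: substitute your own local setup.
ENDPOINT = "http://localhost:8080/v1/completions"
MODELS = ["model-a-7b-q4", "model-b-8b-q5"]
PROMPT = "Write one paragraph about GPU inference."

for model in MODELS:
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={"model": model, "prompt": PROMPT, "max_tokens": 128},
        timeout=300,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    tokens = resp.json()["usage"]["completion_tokens"]
    print(f"{model:<16} ~{tokens / elapsed:6.1f} tokens/sec")
```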