A command-line chat interface for a local LLM, designed for asking questions about computer science and engineering. It was created and trained with a standard GPU-equipped home computer (6GB of VRAM or more) in mind. The system uses 4-bit quantization to reduce memory usage while maintaining good inference quality.
Future updates will include expanded training databases for better subject coverage, and fully commented source code so the project can be easily understood and modified by anyone.
- Local LLM inference (no API required)
- 4-bit quantization (BitsAndBytes)
- Persistent memory system (`fact_unit.json`)
- Context-aware conversation
- CLI-based interface
- Lightweight design for 6GB VRAM GPUs
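A minimal sketch of how a persistent memory system like this can work: facts are kept as a JSON list in `fact_unit.json` and reloaded between sessions. The function names here are illustrative assumptions, not necessarily the project's own:

```python
# Hypothetical helpers for a JSON-backed fact store (fact_unit.json).
import json
from pathlib import Path

MEMORY_FILE = Path("fact_unit.json")

def load_facts() -> list[str]:
    """Load remembered facts; return an empty list on first run."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text(encoding="utf-8"))
    return []

def save_fact(fact: str) -> None:
    """Append one fact and persist the whole list back to disk."""
    facts = load_facts()
    facts.append(fact)
    MEMORY_FILE.write_text(json.dumps(facts, indent=2), encoding="utf-8")
```

Loaded facts can then be prepended to the prompt to give the model context from earlier sessions.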
Default model: `Qwen/Qwen2.5-3B-Instruct`
Loaded with:
- 4-bit quantization (nf4)
- float16 compute
- automatic device mapping (CPU/GPU)
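The settings above can be sketched with the Hugging Face `transformers` and `bitsandbytes` APIs. Variable names are illustrative and the project's actual code may differ; running this downloads the model weights, so treat it as a configuration sketch:

```python
# Sketch: load the default model in 4-bit nf4 with float16 compute
# and automatic CPU/GPU placement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # float16 compute
)

model_id = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                     # automatic CPU/GPU mapping
)
```

`device_map="auto"` lets `accelerate` spill layers to CPU RAM when the GPU cannot hold the whole model.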
- Linux (recommended)
- Python 3.10 or 3.11 (3.14 not recommended)
- NVIDIA GPU with 6GB+ VRAM
- CUDA properly installed
- 15GB+ free disk space (first install downloads large CUDA wheels)
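A quick way to check the requirements above before installing: the snippet below is an assumed helper (not part of the project) that reports the Python version and, if `torch` is already installed, whether a CUDA GPU with enough VRAM is visible:

```python
# Hypothetical environment check for the requirements listed above.
import sys

def check_environment() -> str:
    """Return a short report on Python/CUDA suitability."""
    lines = []
    ok = (3, 10) <= sys.version_info[:2] <= (3, 11)
    lines.append(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
                 + ("supported" if ok else "untested"))
    try:
        import torch
        if torch.cuda.is_available():
            vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
            lines.append(f"CUDA GPU detected with {vram_gb:.1f} GB VRAM")
        else:
            lines.append("No CUDA GPU detected: CPU-only mode")
    except ImportError:
        lines.append("torch not installed yet")
    return "\n".join(lines)

print(check_environment())
```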
```
git clone https://github.com/PGFerraz/AM_Eng.git
cd AM_Eng
```

Create and activate a virtual environment:

```
python3.10 -m venv .venv
source .venv/bin/activate
```

GPU version (CUDA 12.1 example):

```
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

CPU version (if no GPU):

```
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

Install the remaining dependencies manually:

```
pip install transformers accelerate bitsandbytes
```

If you see:

```
torch.OutOfMemoryError
```

Try:
- Closing other GPU applications
- Reducing max_new_tokens in the code
- Restarting your system
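One way to apply the `max_new_tokens` suggestion automatically is to retry generation with a smaller token budget after an out-of-memory error. This is a hedged sketch: `generate_fn` stands in for whatever wraps `model.generate()` in your code, and it is not part of the project:

```python
# Sketch: halve the max_new_tokens budget on OOM until generation fits.
def generate_with_backoff(generate_fn, max_new_tokens=512, floor=64):
    """Retry `generate_fn` with a smaller budget after each OOM."""
    while True:
        try:
            return generate_fn(max_new_tokens=max_new_tokens)
        except RuntimeError:  # torch's OutOfMemoryError subclasses RuntimeError
            if max_new_tokens // 2 < floor:
                raise         # give up below the floor
            max_new_tokens //= 2
```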
Clear the pip cache:

```
pip cache purge
rm -rf ~/.cache/pip
```