This is an example on how to set up a coding agent using only free, open source tools, on your local machine.
Although it proved to be useful for the author, it is not meant to be a solution for every use case.
It is not meant to be better than your favourite, commercial agent.
Instead, it provides a solution which is independent from commercial actors and licenses.
As licenses for commercial software are often slow to be retrieved, this should allow experimenting freely without a particularly high cost.
Finally, using local agents provides useful details on how coding agents work under the hood. Potentially, this could lead to more cost-efficient prompts development.
Before starting experimenting with any of the tools presented in this document, it is your responsibility to do you own risk and security assessment. The author takes no responsibility.
As it is generally good practice to know what is happening under the hood. The author suggest to take a look at IBM's series on AI. Most importantly, the following video.
The solution provided is composed of multiple different layers of tools, organised as follows:
| Tool | Description | License |
|---|---|---|
| VSCode | A highly customisable, free and open-source IDE | MIT |
| continue.dev | A VSCode extension that allows to interact with AI coding agents and provides tools to them | Apache 2.0 |
| llama-swap | Allows to execute multiple models on the same machine, in parallel and on demand | MIT |
| llama.cpp | Provides access to LLMs through HTTP APIs | MIT |
| LLM Models | Models to provide answers to user prompts and implement agentic behavior | - |
| Model | Description | License |
|---|---|---|
| Qwen3.5 9B | Latest model in the Qwen family by Alibaba Cloud. 9 billion parameters. Trained for reasoning, vision and tool usage. | Apache 2.0 |
| Qwen2.5 Coder 3B Instruct | Older Qwen release with 1/3 parameters of Qwen3.5. Faster inference. Used for autocompletion. | Non-commercial |
| Nomic Embed Text V1.5 | Small model for text embeddings. Converts text into vectors for comparison. | Apache 2.0 |
The whole stack is being used proficiently with the following hardware specifications.
Chip: Apple M3
Memory: 16GBThe experience may vary on different hardware. With fewer resources, largest models might not be fully offloaded to GPU for inference purposes. This might make inference itself slower.
The following steps will guide you on how to set a proper AI assistant in VSCode. All these steps have alternatives. The ones proposed here showcase the solution which better suits the author's requirements, to the best of his knowledge at the time the document has been written.
-
Install VSCode in case you don't have it installed already. See further instructions.
-
Install continue.dev from the extensions marketplace in VSCode. See further instructions.
-
Install llama.cpp. See further instructions.
llama.cpp can start a single server to provide access to a given LLM on its own.
Since we would need multiple servers running on demand, then we will use llama-swap to execute multiple
_llama.cpp_servers. -
Select models based on hardware compatibility
Before downloading models, it's important to choose the right quantization level based on your hardware specifications. Hugging Face provides information on model sizes according to quantization levels, typically displayed as a card on the right-hand side of the model page.
Guidelines for model selection:
- For Apple M3 with 16GB RAM: You can comfortably run models with up to 9B parameters at Q4_K_M quantization
- For smaller devices (8GB RAM or less): Consider using smaller models (3B parameters) or higher quantization levels (Q3_K_M, Q4_0)
- For maximum speed: Use lower parameter counts or higher quantization (Q2_K, Q3_K_S)
- For best quality: Use lower quantization levels (Q4_K_M, Q5_K_M, Q8_0) but be aware of memory requirements
The three models used in this setup are:
- Qwen3.5 9B (Agent model): Requires ~6-7GB RAM at Q4_K_M quantization
- Qwen2.5 Coder 3B Instruct (Autocomplete model): Requires ~2-3GB RAM at Q4_K_M quantization
- Nomic Embed Text V1.5 (Embedding model): Requires ~1-2GB RAM at Q4_K_M quantization
Always check the hardware compatibility information on the Hugging Face model page before downloading.
-
Download the LLM models. Here, we will use llama.cpp to download models from huggingface to the local machine.
llama-swap does not allow to download models from hugging face on demand. Hence, this step must be done manually for each model.
However, this is useful to assess whether the model is working or not. Some downloaded model are not suitable for llama.cpp as they do not fulfill the LLaMa specifications.
In order to download a the three models described in
Tools > Models, execute the following commands:# Download embedding model llama-cli -hf nomic-ai/nomic-embed-text-v1.5-GGUF:Q4_K_M # Download autocomplete model llama-cli -hf unsloth/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M # Download agent model llama-cli -hf bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M
The
-hfargument allows to specify a model on HuggingFace. The:Q4_K_Mpostfix define quantisation. This has a direct impact on the size of the model. Please, refer to the rightmost part of the huggingface page of a specific model to check hardware compatibility.Using
llama.cppinstalled through homebrew, the download folder is set to~/Library/Caches/llama.cpp. This could vary. Another alternative would be to manually download the models to a user-defined folder. -
Install llama-swap. See further instructions.
-
Configure llama_swap to serve the models downloaded in step 5. The configuration file in
llama-swap.yamlallows to load all the three models on demand.Execute llama-swap with the following command:
llama-swap --config ./llama-swap.yaml --listen localhost:8080
See further configuration options here.
-
Configure continue.dev extension for VScode to use various models provided by the llama-swap server, for different roles.
A sample configuration is provided in .continue/config.yaml. To use this as global, default configuration, copy it in /Users/$USER/.continue/config.yaml, assuming a unix system is used.
NOTE Qwen3.5 model has got provider: openai, while the others have provider: llama.cpp. Although the second is the expected one in this configuration, it has provided unsuccessfull in tool calling. This is probably due to how the model was trained for this task. To overcome this, issue, the former setting has been preferred. Due to compatibility issues related to tool calling, the .continue/rules/TOOLS.md has been defined, as well.
It is also possible to instruct AI agent to always rely on rules defined per project. For example, clearly instructing an AI agent on always searching for a given file in src/ would prevent him from scanning the whole root folder, making the whole searching process faster.
To do so, just create one or more markdown files in the ./continue/rules folder detailing the desired ruleset. See further information..
The llama-swap.config files defines the --context parameter to 48000 tokens. Good results have been obtained with smaller context of 12000 tokens as well.
In order to prevent filling up the context too quickly, it is to:
-
Instruct the model to be as concise as possible (Occam Razor).
-
Preferring targeted actions with limited scope, rather than broader, complex actions.
When an agent hits context limits, it throws an error. As generated tokens enter the context, then it there would be no place for newest ones.
Rather than setting a larger context, it is possible to set the --context-shift argument in llama-swap.config configuration for Qwen3.5 model.
As the model would then "forget" earlier tokens, it might tend allucinate. See further information on model parameters.
Tool calls might be mis-interpreted by continue. Few times, the extension failed to represent diffs in the chat when streaming is enabled (default).
This seem to have been mitigated by introducing the --jinja argument in the llama-swap.config configuration.
Another a slight improvement was achieved by setting .continue/rules/TOOLS.md. Such a ruleset attempts to mitigate interference between tools calling, thinking and output streaming.
Another solution, would be to disable the stream function. However, this might have inpact on usability (user won't see the result of the agentic action until done).
Instead of the VSCode + continue.dev setup, you can use Claude Code, a commercial AI-powered coding assistant.
Getting Started with Claude Code:
- Install the Claude Code CLI tool
- Follow the official documentation for configuration
- You can still use the same LLM models mentioned above if you want to run them locally and connect them to Claude Code.
Note: This document focuses on the open-source, self-hosted solution, but Claude Code can be a viable alternative depending on your needs and preferences.
