Llama-swap is a lightweight, transparent proxy server designed to automate model swapping for llama.cpp's server. It allows users to run multiple large language models (LLMs) locally and switch between them dynamically without restarting applications.
Key Features:
Automatic model switching based on API requests
Compatibility with any OpenAI-compatible local server (llama.cpp, vllm, tabbyAPI)
Real-time web UI for monitoring model activity and logs
Support for Docker and Podman containerization
Zero external dependencies
Ideal for developers and machine learning enthusiasts who want to experiment with different models or add flexible LLM capabilities to their applications.
llama-swap
Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.
Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
Features:
✅ Easy to deploy and configure: one binary, one configuration file, no external dependencies
✅ On-demand model switching
✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, stable-diffusion.cpp, etc.)
✅ Future proof: upgrade your inference servers at any time
Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.) including non-root variants with improved security.
The stable-diffusion.cpp server is also included for the musa and vulkan platforms.
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda
# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \
ghcr.io/mostlygeek/llama-swap:cuda
# configuration hot reload supported with a
# directory volume mount
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/config:/config \
ghcr.io/mostlygeek/llama-swap:cuda -config /config/config.yaml -watch-config
More examples:
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa
# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
When a request is made to an OpenAI-compatible endpoint, llama-swap extracts the model value and loads the appropriate server configuration to serve it. If the wrong upstream server is running, it is automatically swapped for the correct one before the request is handled; this is where the "swap" in llama-swap comes from.
In the most basic configuration, llama-swap handles one model at a time. For more advanced use cases, the groups feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
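For illustration, a minimal config.yaml might look something like the sketch below. The model names, file paths, ports, and group settings are placeholders, and the options shown are not exhaustive; consult the project's configuration reference for the authoritative list of settings.

# illustrative sketch -- model names, paths, ports and group settings are placeholders
models:
  "llama-8b":
    cmd: |
      /app/llama-server
      --model /models/llama-8b.Q4_K_M.gguf
      --port 9999
    proxy: "http://127.0.0.1:9999"
  "qwen-coder":
    cmd: |
      /app/llama-server
      --model /models/qwen-coder.Q4_K_M.gguf
      --port 9998
    proxy: "http://127.0.0.1:9998"

# optional: keep both models loaded together instead of swapping between them
groups:
  "coding":
    swap: false
    members:
      - "llama-8b"
      - "qwen-coder"

The model field of a request selects which entry is served, e.g. against the Docker example above (listening on host port 9292):

curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "messages": [{"role": "user", "content": "hello"}]}'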
Reverse Proxy Configuration (nginx)
If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses, which breaks Server-Sent Events (SSE) and streaming chat completions. (#236)
As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.
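As a sketch, the relevant part of an nginx site configuration might look like the following; the upstream address and location path are placeholders for your deployment:

location / {
    # placeholder upstream: wherever llama-swap is listening
    proxy_pass http://127.0.0.1:9292;
    # nginx buffers responses by default, which breaks SSE / streaming
    proxy_buffering off;
    proxy_cache off;
    # keep connections open for long-running generations
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_read_timeout 600s;
}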
Monitoring Logs on the CLI
# sends up to the last 10KB of logs
curl http://host/logs
# streams combined logs
curl -Ns http://host/logs/stream
# stream llama-swap's proxy status logs
curl -Ns http://host/logs/stream/proxy
# stream logs from upstream processes that llama-swap loads
curl -Ns http://host/logs/stream/upstream
# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}
# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'
# appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history'
Do I need to use llama.cpp's server (llama-server)?
Any OpenAI-compatible server will work. llama-swap was originally designed for llama-server, and it remains the best supported.
For Python-based inference servers like vllm or tabbyAPI, it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to SIGTERM signals for proper shutdown.
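As an illustrative sketch (the image, model, ports, and flags are placeholders and depend on your setup), such a server can be wrapped in a containerized cmd so that llama-swap manages the container's lifetime as it swaps models:

models:
  "qwen-vllm":
    # run the container in the foreground so llama-swap controls its lifetime;
    # --rm removes the container once it exits
    cmd: |
      docker run --rm --name qwen-vllm --gpus all
      -p 9797:8000
      vllm/vllm-openai:latest
      --model Qwen/Qwen2.5-7B-Instruct
    proxy: "http://127.0.0.1:9797"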
Star History
> [!NOTE]
> ⭐️ Star this project to help others discover it!