Llama-swap is a lightweight, transparent proxy server designed to automate model swapping for llama.cpp's server. It allows users to run multiple large language models (LLMs) locally and switch between them dynamically without restarting applications.
Key Features:
Automatic model switching based on API requests
Compatibility with any OpenAI-compatible local server (llama.cpp, vllm, tabbyAPI)
Real-time web UI for monitoring model activity and logs
Support for Docker and Podman containerization
Zero external dependencies
Ideal for developers and machine learning enthusiasts who want to experiment with different models or add flexible LLM capabilities to their applications.
llama-swap
Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.
Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
Features:
✅ Easy to deploy and configure: one binary, one configuration file, no external dependencies
✅ On-demand model switching
✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, stable-diffusion.cpp, etc.)
✅ Future proof: upgrade your inference servers at any time
Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.) including non-root variants with improved security.
The stable-diffusion.cpp server is also included for the musa and vulkan platforms.
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda
# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \
ghcr.io/mostlygeek/llama-swap:cuda
# configuration hot reload supported with a
# directory volume mount
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/config:/config \
ghcr.io/mostlygeek/llama-swap:cuda -config /config/config.yaml -watch-config
More examples:
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa
# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
When a request is made to an OpenAI-compatible endpoint, llama-swap extracts the model value and loads the appropriate server configuration to serve it. If the wrong upstream server is running, it is automatically swapped for the correct one before the request is handled; this is where the "swap" in llama-swap comes from.
In the most basic configuration, llama-swap handles one model at a time. For more advanced use cases, the groups feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
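For illustration, a minimal config.yaml might look something like the sketch below. The model names, file paths, ports, and group settings are placeholders, and the options shown are not exhaustive; consult the project's configuration reference for the authoritative list of settings.

# illustrative sketch -- model names, paths, ports and group settings are placeholders
models:
  "llama-8b":
    cmd: |
      /app/llama-server
      --model /models/llama-8b.Q4_K_M.gguf
      --port 9999
    proxy: "http://127.0.0.1:9999"
  "qwen-coder":
    cmd: |
      /app/llama-server
      --model /models/qwen-coder.Q4_K_M.gguf
      --port 9998
    proxy: "http://127.0.0.1:9998"

# optional: keep both models loaded together instead of swapping between them
groups:
  "coding":
    swap: false
    members:
      - "llama-8b"
      - "qwen-coder"

The model field of a request selects which entry is served, e.g. against the Docker example above (listening on host port 9292):

curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-8b", "messages": [{"role": "user", "content": "hello"}]}'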
Reverse Proxy Configuration (nginx)
If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses, which breaks Server-Sent Events (SSE) and streaming chat completions. (#236)
As a safeguard, llama-swap also sets X-Accel-Buffering: no on SSE responses. However, explicitly disabling proxy_buffering at your reverse proxy is still recommended for reliable streaming behavior.
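As a sketch, the relevant part of an nginx site configuration might look like the following; the upstream address and location path are placeholders for your deployment:

location / {
    # placeholder upstream: wherever llama-swap is listening
    proxy_pass http://127.0.0.1:9292;
    # nginx buffers responses by default, which breaks SSE / streaming
    proxy_buffering off;
    proxy_cache off;
    # keep connections open for long-running generations
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_read_timeout 600s;
}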
Monitoring Logs on the CLI
# sends up to the last 10KB of logs
curl http://host/logs
# streams combined logs
curl -Ns http://host/logs/stream
# stream llama-swap's proxy status logs
curl -Ns http://host/logs/stream/proxy
# stream logs from upstream processes that llama-swap loads
curl -Ns http://host/logs/stream/upstream
# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}
# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'
# appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history'
Do I need to use llama.cpp's server (llama-server)?
Any OpenAI-compatible server will work. llama-swap was originally designed for llama-server, and it remains the best supported.
For Python-based inference servers like vllm or tabbyAPI, it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to SIGTERM signals for proper shutdown.
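As an illustrative sketch (the image, model, ports, and flags are placeholders and depend on your setup), such a server can be wrapped in a containerized cmd so that llama-swap manages the container's lifetime as it swaps models:

models:
  "qwen-vllm":
    # run the container in the foreground so llama-swap controls its lifetime;
    # --rm removes the container once it exits
    cmd: |
      docker run --rm --name qwen-vllm --gpus all
      -p 9797:8000
      vllm/vllm-openai:latest
      --model Qwen/Qwen2.5-7B-Instruct
    proxy: "http://127.0.0.1:9797"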
Star History
> [!NOTE]
> ⭐️ Star this project to help others discover it!