# inferrs
A TurboQuant LLM inference server: a conservative-memory inference engine for LLMs.

To install with winget:

```shell
winget install --id=ericcurtin.inferrs -e
```
Most LLM serving stacks force a trade-off between features and resource usage. inferrs targets both:
| | inferrs | vLLM | llama.cpp |
|---|---|---|---|
| Language | Rust | Python/C++ | C/C++ |
| Streaming (SSE) | ✓ | ✓ | ✓ |
| KV cache management | TurboQuant, Per-context alloc, PagedAttention | PagedAttention | Per-context alloc |
| Memory friendly | ✓ — lightweight | ✗ — claims most GPU memory | ✓ — lightweight |
| Binary footprint | Single binary | Python environment + deps | Single binary |
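On the client side, a streamed completion arrives as server-sent events. Below is a minimal sketch of parsing such a stream in Python, assuming OpenAI-style `data:` framing with a `[DONE]` sentinel; the exact chunk shape is an assumption modeled on the OpenAI API, not captured from inferrs:

```python
import json

def parse_sse_events(raw: str):
    """Parse an SSE response body into JSON chunk payloads.

    Assumes OpenAI-style framing: each event is a line beginning
    with "data: ", and the stream ends with "data: [DONE]".
    """
    chunks = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunks.append(json.loads(payload))
    return chunks

# Illustrative stream body (not real inferrs output):
body = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n'
    'data: [DONE]\n'
)
text = "".join(c["choices"][0]["delta"].get("content", "")
               for c in parse_sse_events(body))
print(text)  # → Hello
```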
## API endpoints

- OpenAI-compatible: `/v1/completions`, `/v1/chat/completions`, `/v1/models`, plus `/health`
- Anthropic-compatible: `/v1/messages` (streaming and non-streaming)
- Ollama-compatible: `/api/generate`, `/api/chat`, `/api/tags`, `/api/ps`, `/api/show`, `/api/version`

## Install

### macOS / Linux
```shell
brew tap ericcurtin/inferrs
brew install inferrs
```
### Windows
```shell
scoop bucket add inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs
```
## Usage

```shell
# Chat with a model interactively
inferrs run google/gemma-4-E2B-it

# Serve a model with PagedAttention KV cache management
inferrs serve --paged-attention google/gemma-4-E2B-it

# Serve a model with the default KV cache management
inferrs serve google/gemma-4-E2B-it

# Start the server without specifying a model
inferrs serve
```
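Once a server is running, the OpenAI-compatible endpoint can be exercised with any HTTP client. A minimal sketch that builds a chat-completions request body; the port `8080` is an assumption, not a documented default:

```python
import json

# Minimal OpenAI-compatible chat request for a local inferrs server.
payload = {
    "model": "google/gemma-4-E2B-it",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,  # ask for a server-sent-events response
}
body = json.dumps(payload).encode()
print(body.decode())

# To send the request (uncomment once the server is running;
# the URL and port are assumptions):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     for line in resp:
#         print(line.decode().rstrip())
```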