inferrs ericcurtin

Use this command to install inferrs:

winget install --id=ericcurtin.inferrs -e

Inferrs is a high-performance inference engine designed for large language models (LLMs), optimized for memory efficiency and resource utilization. It enables users to run LLMs with minimal overhead while maintaining performance and compatibility across various hardware backends.

Key Features:

Hardware Support: Inferrs supports multiple hardware acceleration options, including CUDA, ROCm, Metal, Hexagon, OpenVino, MUSA, CANN, Vulkan, and CPU.
API Compatibility: It offers OpenAI-compatible endpoints (/v1/completions, /v1/chat/completions), Anthropic-compatible APIs (/v1/messages for streaming and non-streaming), and Ollama-compatible interfaces (/api/generate, /api/chat).
TurboQuant Optimization: Inferrs leverages TurboQuant to improve inference efficiency on lower-end hardware while maintaining accuracy.
Single-Binary Deployment: The tool is distributed as a single binary, simplifying setup and deployment across different environments.
Model Flexibility: It supports running models in various formats, including vLLM-style and llama.cpp-style execution.

Audience & Benefit:

Ideal for developers, data scientists, and organizations seeking to deploy LLMs efficiently. Inferrs provides a lightweight yet powerful solution that minimizes resource usage while maintaining compatibility with popular AI frameworks and APIs. It is particularly beneficial for users working with constrained hardware environments or those requiring fast, efficient inference without compromising on model performance.

Inferrs can be installed via winget for seamless setup and integration into existing workflows.

inferrs

A TurboQuant LLM inference server.

Why inferrs?

Most LLM serving stacks force a trade-off between features and resource usage. inferrs targets both:

	inferrs	vLLM	llama.cpp
Language	Rust	Python/C++	C/C++
Streaming (SSE)	✓	✓	✓
KV cache management	TurboQuant, Per-context alloc, PagedAttention	PagedAttention	Per-context alloc
Memory friendly	✓ — lightweight	✗ — claims most GPU memory	✓ — lightweight
Binary footprint	Single binary	Python environment + deps	Single binary

Features

OpenAI-compatible API — /v1/completions, /v1/chat/completions, /v1/models, /health
Anthropic-compatible API — /v1/messages (streaming and non-streaming)
Ollama-compatible API — /api/generate, /api/chat, /api/tags, /api/ps, /api/show, /api/version
Hardware backends — CUDA, ROCm, Metal, Hexagon, OpenVino, MUSA, CANN, Vulkan and CPU

Quick start

Install

macOS / Linux

brew tap ericcurtin/inferrs
brew install inferrs

Windows

scoop bucket add inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs

Run

inferrs run google/gemma-4-E2B-it

Serve

Serve a specific model vLLM-style

inferrs serve --paged-attention google/gemma-4-E2B-it