BitLlama imonoonoko

ai cli inference llm machine-learning rust

Use this command to install BitLlama:

winget install --id=imonoonoko.BitLlama -e

BitLlama is a high-performance LLM inference engine built in Rust, designed to enable efficient and scalable machine learning model inference. It leverages advanced techniques like 1.58-bit ternary quantization, Test-Time Training (TTT), and the Soul learning system to optimize model performance while maintaining accuracy.

Key Features:

1.58-bit Ternary Quantization: Reduces memory usage and computational overhead without compromising model quality.
Test-Time Training (TTT): Dynamically adjusts models during inference to improve accuracy in real-world scenarios.
Soul Learning System: Enhances adaptability by enabling continuous learning and fine-tuning of models on-the-fly.
MCP Server/Client Architecture: Supports distributed processing for scalable and efficient model deployment.
Model Compatibility: Works with popular architectures like Llama, Mistral, Gemma, Qwen, BitNet, and more.
OpenAI-Compatible API: Offers seamless integration with existing workflows through an API server.

Ideal for developers, researchers, and organizations working in AI and machine learning, BitLlama provides a powerful toolset to deploy efficient, scalable, and adaptable language models. It is particularly beneficial for applications requiring low-resource usage, high performance, or private RAG (Retrieve-and-Generate) setups. Installable via winget, BitLlama empowers users to build and deploy advanced AI solutions with ease and precision.

BitLlama

Pure Rust LLM inference engine with Soul learning and hierarchical memory.

> Status: v1.0.0 — Development Complete. This project is fully functional and no longer under active development.

What is BitLlama?

A local LLM inference engine written entirely in Rust. It runs GGUF and safetensors models on your PC, with a unique Soul system that lets the AI learn and remember across conversations.

Key features:

GGUF + safetensors model inference (Llama 2/3, Gemma 2/3, Qwen2.5, Mistral, BitNet)
Soul learning — teach the AI via LoRA fine-tuning from conversations
Memory system — 4-layer hierarchical memory (Episodes/Facts/Concepts/Worldview)
Sleep consolidation — background memory organization (Tidy/Fold/Merge/Elevate/Dream)
Desktop GUI (Tauri 2.0 + Svelte 5) with Japanese/English i18n
CLI with chat, learning, API server, RAG, and MCP support
CUDA acceleration + Q8 KV Cache
1096+ tests, Pure Rust single binary

Quick Start

Install

# Homebrew (macOS / Linux)
brew tap imonoonoko/bitllama &amp;&amp; brew install bitllama

# Windows (winget)
winget install imonoonoko.BitLlama

# Or download from GitHub Releases

Run

bitllama pull bartowski/gemma-2-2b-it-GGUF
bitllama run ~/.bitllama/models/gemma-2-2b-it-Q4_K_M.gguf

Teach

bitllama learn "My name is Onoko" --model model.gguf --save onoko.soul
bitllama run model.gguf --soul onoko.soul

API Server

bitllama serve model.gguf --port 8000
# OpenAI-compatible: POST /v1/chat/completions

Desktop GUI

BitLlama Desktop — built with Tauri 2.0 + Svelte 5.

Model download, management, and auto-recommendation
Streaming chat with conversation history
Soul learning (chat, drag & drop, correction)

Model	Format	Chat Template
Llama-2 7B/13B	GGUF	llama2
Llama-3 8B	GGUF	llama3
Gemma-2 2B/9B	GGUF	gemma
Gemma-3	GGUF	gemma
Qwen2.5 0.5B-7B	GGUF	chatml
Mistral 7B	GGUF	mistral
BitNet 2B4T	safetensors	bitnet

Model	Speed	vs llama.cpp
Llama-2 7B	45.4 tok/s	90%
Mistral 7B	42.1 tok/s	89%
Gemma-2 2B	75.1 tok/s	74%

BitLlama imonoonoko

README

BitLlama

What is BitLlama?

Quick Start

Install

Run

Teach

API Server

Desktop GUI

Supported Models

Performance

Architecture

Soul & Memory Architecture

Build from Source

What Was Built

Acknowledgments

License