llamafile is a framework designed to simplify the distribution and execution of large language models (LLMs) using single-file executables. It combines llama.cpp with Cosmopolitan Libc into one cohesive tool, enabling users to run LLMs locally without complex installations.
Key Features:
Creates single-file executables that run on most computers.
Supports various models like LLaVA and Mistral for diverse applications.
Offers both a web UI chat interface and an OpenAI-compatible API endpoint (see the example after this list).
Compatible across platforms, including Windows, macOS, and Linux.
Provides command-line interfaces for direct interaction with models.
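As a sketch of the API feature: once a llamafile is running in server mode, it can be queried with any OpenAI-style client. This assumes the default listen address of http://localhost:8080; the model name in the request body is a placeholder, since the local server serves whichever model is embedded in the file.

# Query the OpenAI-compatible endpoint of a running llamafile
# (assumes the default listen address http://localhost:8080)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'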
Audience & Benefits:
Ideal for developers who want to deploy models without complex setups, and for end users who want easy access to powerful LLMs. Both benefit from reduced complexity, streamlined workflows, enhanced privacy through local execution, and broad model support for varied applications.
On Windows, llamafile can also be installed via winget.
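For example (the package ID Mozilla.Llamafile is an assumption here; confirm it with the search first):

# Find the package, then install it
winget search llamafile
winget install Mozilla.Llamafile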
llamafile
llamafile lets you distribute and run LLMs with a single file.
Our goal is to make open LLMs much more
accessible to both developers and end users. We're doing that by
combining llama.cpp with Cosmopolitan Libc into one
framework that collapses all the complexity of LLMs down to
a single-file executable (called a "llamafile") that runs
locally on most operating systems and CPU architectures, with no installation.
llamafile also includes whisperfile, a single-file speech-to-text tool built on whisper.cpp and the same Cosmopolitan packaging. It supports transcription and translation of audio files across all the same platforms, with no installation required.
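A rough sketch of whisperfile usage (the file name below is an example, and the flags assume whisper.cpp's command-line conventions carry over):

# Make an example whisperfile executable, then transcribe an audio file
chmod +x whisper-tiny.en.llamafile
./whisper-tiny.en.llamafile -f recording.wav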
v0.10.0
llamafile versions starting from 0.10.0 use a new build system, aimed at keeping our code more easily
aligned with the latest versions of llama.cpp. This means they support more recent models and functionality,
but they might also be missing some of the features you were accustomed to (check out this doc for a
high-level description of what has been done). If you liked the "classic experience" more, you will
always be able to access the previous versions from our releases page. Our pre-built llamafiles always
show which version of the server they have been bundled with, so you will always know
which version of the software you are downloading.
> We want to hear from you!
Whether you are a new user or a long-time fan, please share what you find most valuable about llamafile and what would make it more useful for you.
Read more via the blog and add your voice to the discussion here.
Quick Start
Download and run your first llamafile in minutes:
# Download an example model (Qwen3.5 0.8B)
curl -LO https://huggingface.co/mozilla-ai/llamafile_0.10.0/resolve/main/Qwen3.5-0.8B-Q8_0.llamafile
# Make it executable (macOS/Linux/BSD)
chmod +x Qwen3.5-0.8B-Q8_0.llamafile
# Run it
./Qwen3.5-0.8B-Q8_0.llamafile
We chose this model because it is the smallest one we have
built a llamafile for, so it is the most likely to work out of the box for you.
If you have powerful hardware and/or GPUs, feel free to choose
larger and more expressive models, which should provide more accurate
responses.
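If a GPU is available, llama.cpp-style flags can be passed to request GPU offloading. This is a sketch; exact flag support depends on the version you downloaded:

# Ask the runtime to offload all layers to the GPU
# (-ngl is llama.cpp's n-gpu-layers flag; support depends on your hardware and build)
./Qwen3.5-0.8B-Q8_0.llamafile -ngl 999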
Windows users: Rename the file to add the .exe extension before running.
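For example, in PowerShell (using the same example file as above):

# Rename to add the .exe extension, then run it
Rename-Item Qwen3.5-0.8B-Q8_0.llamafile Qwen3.5-0.8B-Q8_0.exe
.\Qwen3.5-0.8B-Q8_0.exe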
Documentation
Check the full documentation in the docs/ folder.
While the llamafile project is Apache 2.0-licensed, our changes
to llama.cpp and whisper.cpp are licensed under MIT (just like the projects
themselves) so as to remain compatible and upstreamable in the future,
should that be desired.
The llamafile logo on this page was generated with the assistance of DALL·E 3.