llama.cpp ggml

Use this command to install llama.cpp:

winget install --id=ggml.llamacpp -e
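
After the install finishes, it can be useful to confirm that the command-line tools are reachable from your shell. The following is a minimal sketch in Python, assuming the package places the llama-cli and llama-server executables on PATH (the helper name find_llama_tools is made up for this illustration, and a new terminal session may be needed before PATH changes take effect):

```python
import shutil


def find_llama_tools(names=("llama-cli", "llama-server")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    The default tool names are an assumption about what the package installs.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_llama_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a tool is reported as not found, reopening the terminal or checking the winget install location is a reasonable first step.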

llama.cpp is an inference library for Large Language Models (LLMs) written in C/C++. Its main goal is to run LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware, both locally and in the cloud.

Key Features:

Integer quantization at 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit for reduced memory use and faster inference.
GPU acceleration via CUDA, HIP (for AMD GPUs), and MUSA (Moore Threads GPUs).
CPU optimizations leveraging ARM NEON, Apple's Accelerate framework, and x86 instruction sets (AVX, AVX2, AVX512, AMX).
Support for a wide range of models, including LLaMA, Mistral, Falcon, Alpaca, and others.
Minimal runtime dependencies, ensuring ease of deployment.
Hybrid CPU+GPU inference to partially accelerate models that are larger than the total available GPU memory.
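
To make the quantization and hybrid-inference points concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of llama.cpp: the function names, the equal-sized-layer assumption, and the 1 GiB reserve are all illustrative. It counts weight storage only, ignoring the KV cache, activations, and the per-block scale data that real quantization formats add, so actual usage will be somewhat higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Assumption: every weight is stored at exactly `bits_per_weight` bits.
    """
    return n_params * bits_per_weight / 8 / 2**30


def layers_on_gpu(n_layers: int, model_gib: float, vram_gib: float,
                  reserve_gib: float = 1.0) -> int:
    """Estimate how many layers fit in GPU memory; the rest stay on the CPU.

    Assumptions: all layers are the same size, and `reserve_gib` of VRAM is
    held back for everything that is not weights.
    """
    per_layer = model_gib / n_layers
    usable = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable // per_layer))


if __name__ == "__main__":
    n_params = 7e9  # a hypothetical 7-billion-parameter model
    for bits in (16, 8, 4, 2):
        size = weight_memory_gib(n_params, bits)
        fit = layers_on_gpu(n_layers=32, model_gib=size, vram_gib=8.0)
        print(f"{bits:>2}-bit: ~{size:.1f} GiB of weights, ~{fit}/32 layers in 8 GiB VRAM")
```

In llama.cpp itself, the number of layers placed on the GPU is chosen with the -ngl (--n-gpu-layers) option; an estimate like the one above is only a starting point for picking that value.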

Audience & Benefit: Aimed at developers and researchers who want a lightweight way to integrate LLM inference into their applications. llama.cpp runs on hardware ranging from mobile devices to data centers with little resource overhead, and its broad model support and optimization options make it suitable for many kinds of deployments.

Available via winget for easy installation.

llama.cpp

LLM inference in C/C++


Quick start

A few options to get llama.cpp installed on your machine:

Visit https://llama.app and follow the instructions
Run with Docker - see our Docker documentation

Supported backends

Backend	Target devices
BLAS	All
BLIS	All
CANN	Ascend NPU
CUDA	Nvidia GPU
HIP	AMD GPU
Hexagon [In Progress]	Snapdragon
IBM zDNN	IBM Z & LinuxONE
MUSA	Moore Threads GPU
Metal	Apple Silicon
OpenCL	Adreno GPU
OpenVINO [In Progress]	Intel CPUs, GPUs, and NPUs
RPC	All
SYCL	Intel GPU
VirtGPU	VirtGPU APIR
Vulkan	GPU
WebGPU	All
ZenDNN	AMD CPU
