vLLM
By vLLM (UC Berkeley Sky Computing Lab)
Open-source high-throughput inference and serving engine using PagedAttention, supporting 200+ model architectures.
Best for
- self-hosted serving
- high-throughput inference
- production deployment
- Jun 20, 2026: Added. Open-source high-throughput inference and serving engine using PagedAttention.
Other Local & on-device AI
Ollama
Run open-weight LLMs locally with a single command. Bundles model weights, quantizations, and an OpenAI-compatible HTTP API into a clean CLI.
LM Studio
Desktop GUI for downloading and chatting with local LLMs. The friendly way to try open-weight models without touching a terminal.
llama.cpp
C/C++ inference engine for LLaMA-family models. The library that quietly powers most local AI apps — fast, low-level, runs on almost anything.
Jan
Open-source ChatGPT alternative that runs entirely offline. Built on llama.cpp with a clean desktop UI and an OpenAI-compatible API.
MLX
Apple's array framework for Apple Silicon. Designed to run ML workloads natively on M-series Macs with unified memory between CPU and GPU.
GPT4All
Open-source desktop app for running LLMs locally with a chat UI, document RAG, and a browsable model catalog.
Open WebUI
Self-hosted, extensible ChatGPT-style web interface for local and remote models, with offline operation and RAG.
AnythingLLM
All-in-one desktop/self-hosted app for document chat (RAG) and agents over local or cloud models.