omlx: Turning Apple Silicon into an LLM Inference Server from the macOS Menu Bar

Anyone running local LLMs on a Mac probably shares one pain point: model loading is slow, and switching models is even slower. Especially when you need to run multiple models simultaneously for comparison testing.

omlx tries to solve this problem in a somewhat unorthodox way: use the SSD as a cache.

What It Does

omlx is an LLM inference server running on Apple Silicon, built on the MLX framework. Two core features:

Continuous Batching: Multiple requests can enter the inference pipeline simultaneously - the model doesn't have to finish one request before accepting the next. This directly impacts throughput in multi-user scenarios.

SSD Caching: Model weights can be cached to the SSD, so switching models doesn't require reloading from disk to memory. For Mac users, SSD speed isn't as fast as unified memory, but it's significantly faster than a full reload.

The entire service is managed from the macOS menu bar - select models, check status, adjust parameters, no terminal needed.

Details Worth Noting

13K stars, 1.1K forks, Apache 2.0 license. Written in Python, homepage at omlx.ai. Last update was May 9th, maintenance frequency is steady.

OpenAI API compatible, which means you can plug omlx directly as a local OpenAI-compatible endpoint into Cursor, Claude Code, OpenClaw, and other tools. This compatibility layer is the key to local inference tools being practically usable - otherwise you'd need to write additional adapters.

322 open issues for a 13K star project is not small. It means a large user base, but also that some rough edges haven't been smoothed out yet.

Is It Usable?

If you have an M-series Mac and want to run local inference for development testing or daily use, omlx is one of the more mature options in the ecosystem now. Its SSD caching shines in multi-model switching scenarios - no more waiting for model loads every time you switch.

Continuous batching isn't very noticeable for individual users (usually only one request at a time), but if you're using a Mac for small-scale serving or multi-Agent parallel testing, this feature shows real value.

The limitations are clear: Apple Silicon's unified memory is the ceiling. M2 Max with 96GB is already consumer-grade max, barely enough for a quantized 70B parameter model, anything larger is unrealistic. omlx has no magic - it just squeezes efficiency to the limit within existing hardware constraints.

How It Differs from Competitors

Mac local inference tools already exist - MLX's official mlx-lm, Ollama, LM Studio, etc. omlx differentiates in two areas:

Menu bar management: Lightweight, doesn't occupy a window, always visible. More convenient for daily users than opening a terminal or standalone app.
SSD caching + continuous batching: This combination is rare in the Mac ecosystem. Especially SSD caching, which is a real efficiency boost for developers with frequent model switching needs.

If you only occasionally run a chat model, Ollama might be simpler. But if you use your Mac as a local inference server, omlx is worth trying.

The next version would benefit from a proper Web UI. The current pure menu bar approach has a steep learning curve for newcomers.

What It Does

Details Worth Noting

Is It Usable?

How It Differs from Competitors

Related

9Router: Route Claude Code, Cursor, Codex to 40+ Free Model Sources, RTK Saves 40% Tokens, Auto-Fallback Never Stops

AiToEarn: An Open Source Framework for Making Money with AI, But Don't Be Fooled by the Name

bolt.diy: Open Source Bolt.new, Bringing AI Full-Stack Dev from Cloud to Local