Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
vLLM-MLX – Run LLMs on Mac at 464 tok/s (github.com/waybarrios)
33 points by waybarrios 40 days ago | hide | past | favorite | 3 comments


Hey HN! I built vLLM-MLX alike framework on macOS, which is painfully slow on Apple Silicon machines.

vLLM-MLX brings native GPU acceleration using Apple's MLX framework, with:

  • OpenAI-compatible API (drop-in replacement)
  • Multimodal: Text, Images, Video, Audio in one server
  • Continuous batching for concurrent users (3.4x speedup)
  • TTS in 10+ languages (Kokoro, Chatterbox)
  • MCP tool calling support

  Performance on M4 Max:
  - Llama-3.2-1B-4bit: 464 tok/s
  - Qwen3-0.6B: 402 tok/s
  - Whisper STT: 197x real-time
Quick start: pip install -e . vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

Works with standard OpenAI SDK. Happy to answer questions!

GitHub: https://github.com/waybarrios/vllm-mlx


What’s the recommended RAM for running some of these? There is a “Memory” section but the numbers look low compared to what I was expecting - maybe this is right but they are heavily quantised.

Basically trying to work out what I get to play with on my 16Gb M1.


same here




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: