Key Findings
The most interesting takeaways from the community benchmarks
Memory bandwidth still dominates generational leaps. The M1 Ultra (800 GB/s) posts a median of 36.5 tok/s on large models while the newer M4 Pro (273 GB/s) manages 45.9 tok/s median — a modest edge despite three generations of architectural gains. The M3 Max (400 GB/s, 36GB) pulls far ahead with a 155 tok/s median, showing that bandwidth plus memory capacity together set the ceiling.
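The bandwidth ceiling described above follows from decode being memory-bound: each generated token must stream the full set of weights from memory once. A minimal back-of-the-envelope sketch (illustrative numbers, not from the dataset):

```python
def peak_decode_tps(bandwidth_gbs: float, params_b: float, bits_per_weight: float) -> float:
    """Rough upper bound on decode tok/s for a memory-bound dense model:
    memory bandwidth divided by the bytes of weights read per token.
    Ignores KV-cache traffic and compute, so real numbers land lower."""
    model_gb = params_b * bits_per_weight / 8  # weight footprint in GB
    return bandwidth_gbs / model_gb

# e.g. 800 GB/s (M1 Ultra class) on a 70B model at 4-bit quantization
print(round(peak_decode_tps(800, 70, 4), 1))  # → 22.9 tok/s ceiling
```

This also explains why MoE models punch above their weight class: only the activated expert parameters count toward the per-token bytes streamed.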
The M4 leads on efficiency by a wide margin. At 0.34 watts per token, the base M4 is the most power-efficient Mac chip in the dataset — well below the M4 Pro (0.54 W/tok) and less than half the M4 Max (0.76 W/tok). The A18 Pro edges it out at 0.19 W/tok but is constrained to 8GB and sub-25 tok/s median throughput.
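The W/tok figures above are average system power divided by sustained output throughput (dimensionally, joules per token). A small sketch of that calculation, using made-up power and throughput numbers rather than values from the dataset:

```python
def watts_per_token(avg_system_watts: float, tokens_per_sec: float) -> float:
    """Energy cost per generated token (J/token, reported as W/tok):
    mean system power draw during generation / output throughput."""
    return avg_system_watts / tokens_per_sec

# illustrative: a machine drawing 15 W while generating 44 tok/s
print(round(watts_per_token(15, 44), 2))  # → 0.34
```

Lower is better on this metric, which is why an efficient low-power chip can beat a faster one that draws disproportionately more power.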
MoE models now run at serious speed on 128GB machines. GPT-OSS-120B hits 74.3 tok/s on M4 Max 128GB and Qwen3.5-122B-A10B reaches 59.5 tok/s on M5 Max 128GB — both activating only ~10–12B parameters per token. Even the M2 Max 96GB runs Qwen3.5-122B at 36.9 tok/s, making 120B+ MoE models broadly accessible across the high-RAM Mac lineup.
The M5 Max makes a strong debut in the dataset. With 42 runs logged, the M5 Max posts a median of 100 tok/s and an average of 173 tok/s — the highest average of any chip. It hits 1,014 tok/s on a 104M Qwen pretrain (LM Studio / GGUF) and sustains 102 tok/s output on gemma4:e4b, with prefill on the same gemma run measuring 1,012 tok/s.
The A18 Pro has a hidden prefill superpower. Running Llama-3.2-3B-Instruct-4bit via MLX, the A18 Pro recorded a prefill speed of 18,600 tok/s — more than 19× faster than the M4’s 942 tok/s on the same model. Output generation remains limited to ~17–21 tok/s due to 8GB RAM and 100 GB/s bandwidth, but prompt ingestion is extraordinarily fast for an edge device.
Findings last updated: 2026-05-05 · Data above is live; narrative is human-curated and refreshed on a monthly cadence.
Reasoning Models — Three Throughput Metrics (v3.1+)
For DeepSeek-R1, Qwen3-thinking, and other reasoning models, output tok/s now excludes thinking time. Prefill (input) and reasoning rates are tracked separately so you can see where each model spends its time.
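The three-way split can be sketched as follows. The field names (`prompt_tokens`, `think_secs`, etc.) are illustrative assumptions, not the app's actual submission schema:

```python
def reasoning_metrics(run: dict) -> dict:
    """Split a reasoning-model run into three separate rates.
    Field names are hypothetical stand-ins for the app's schema."""
    return {
        # prompt ingestion rate
        "prefill_tps": run["prompt_tokens"] / run["prompt_secs"],
        # hidden chain-of-thought generation rate
        "reasoning_tps": run["think_tokens"] / run["think_secs"],
        # visible answer rate — thinking time is excluded entirely
        "output_tps": run["answer_tokens"] / run["answer_secs"],
    }
```

Separating the rates this way stops a long thinking phase from dragging down the reported output tok/s, while still showing where the model actually spends its time.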
Prefill Speed Leaderboard (v3.1+)
Top input tokens/sec rates — the prefill phase often dominates total latency on smaller prompts. Computed from prompt_tokens / prompt_eval_duration on submissions from app v3.1+.
| # | Model | Chip | RAM | prefill tok/s | output tok/s | Backend | User |
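The prefill rate described above is a one-line calculation. A minimal sketch, assuming the backend reports the duration in nanoseconds as Ollama-style APIs do (an assumption — other backends may report seconds or milliseconds):

```python
def prefill_tps(prompt_tokens: int, prompt_eval_duration_ns: int) -> float:
    """prompt_tokens / prompt_eval_duration, with the duration assumed
    to be in nanoseconds (Ollama-style); convert before dividing."""
    return prompt_tokens / (prompt_eval_duration_ns / 1e9)

# 1,000 prompt tokens ingested in 0.5 s
print(prefill_tps(1000, 500_000_000))  # → 2000.0
```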
Throughput by Apple Silicon Chip
Average, median, and max tokens/sec across all tested models per chip
Memory Bandwidth vs Throughput
Each dot is a benchmark run — bandwidth is the primary driver of LLM inference speed
Power Efficiency
Tokens per Watt of system power by chip
Time to First Token
Median TTFT by chip (lower is better)
Backend Showdown
Average throughput by inference backend (backends with fewer than 2 runs excluded)
Leaderboard: Top 15 Fastest Runs
Highest output tokens/sec across all submissions
| # | Model | Chip | RAM | tok/s | TTFT | W/tok | Backend | User |
Big Model Club: 100B+ Parameters
Running frontier-class models locally on a Mac
| Model | Chip | RAM | tok/s | Quant | Backend |
Model Format Distribution
Submissions by model format (GGUF vs MLX)
Memory Tier Performance
Average tokens/sec grouped by unified memory size
Community Contributors
Top contributors by number of benchmark submissions