Key Findings
The most interesting takeaways from the community benchmarks
Bandwidth still beats generation count. The M1 Ultra (800 GB/s) posts a 36.5 tok/s median against 45.9 tok/s for the M4 Pro (273 GB/s): three generations of newer silicon buy only about a 26% lead over a chip with far more memory bandwidth. The M3 Max (400 GB/s, 36GB) passes both at a 164.6 tok/s median, showing that capacity and bandwidth together set the real ceiling.
The M5 Pro debuts as the new efficiency king. Its single logged run posts 0.16 W/tok — below even the A18 Pro's 0.19 W/tok — while delivering 64.4 tok/s and a 550ms TTFT. If the sample holds up, the M5 Pro is roughly a third as power-hungry as the M4 Pro (0.54 W/tok) at comparable throughput.
172B MoE now runs on a laptop. minimax-m2.5-reap-172b-mlx hits 51.6 tok/s on the M5 Max 128GB at 4-bit, joining gpt-oss-120b (74.3 tok/s on M4 Max) and Qwen3.5-122B-A10B (59.5 tok/s on M5 Max) in the comfortably-interactive tier. Even the M2 Max 96GB sustains 36.9 tok/s on the 122B Qwen MoE.
Reasoning models hold throughput surprisingly well. On the M5 Max 128GB, gemma4:e4b generates 695 reasoning tokens at 89.2 tok/s — only ~15% slower than its 105 tok/s output rate. The M5 Pro running qwen3.6-35b actually reasons faster than it generates final output (67.9 vs 64.4 tok/s), suggesting the thinking phase isn't the bottleneck many feared.
The A18 Pro's prefill is still untouchable. Llama-3.2-3B-Instruct-4bit on MLX hits 18,600 tok/s prefill on the iPhone chip — nearly 20× the M4's 942 tok/s and roughly 10× the M5 Max's best of 1,911 tok/s on gemma4:e4b. Output stays modest at 16.6 tok/s, but for prompt-heavy workloads, nothing in the dataset ingests faster.
Findings last updated: 2026-05-22 · Data above is live; narrative is human-curated and refreshed on a monthly cadence.
New: Repeated-Run Benchmarks (v3.5+)
Submissions from app v3.5+ can now bundle N consecutive runs into a single benchmark group, with mean and 95% bootstrap confidence intervals computed across reps.
Why it matters. A single benchmark run captures one snapshot of the system — thermal state, scheduler choices, sampler seed, background load all bleed into the result. An N-rep group runs the same configuration multiple times and reports the mean plus a 95% CI, so you can see the actual spread, not just one point. That CI is what tells you whether a 5% gap between two configs is real or noise.
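To make the "real or noise" check concrete, here is a minimal sketch of a 95% bootstrap confidence interval on the mean tok/s of a rep group. It assumes a plain percentile bootstrap; the exact method the app uses is not documented here, and the function names and sample readings are hypothetical.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=10_000, confidence=0.95, seed=0):
    """Return (mean, lower, upper) from a percentile bootstrap of the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two reps to estimate a spread")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(samples)
    # Resample the reps with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(samples), lower, upper


def intervals_overlap(a, b):
    """True when two (mean, lower, upper) results have overlapping intervals."""
    return a[1] <= b[2] and b[1] <= a[2]


# Hypothetical tok/s readings from two 5-rep groups (not real submissions).
config_a = [45.1, 46.3, 44.8, 45.9, 45.5]
config_b = [47.6, 48.4, 47.1, 48.0, 47.9]

result_a = bootstrap_ci(config_a)
result_b = bootstrap_ci(config_b)
print("A:", result_a)
print("B:", result_b)
print("overlap:", intervals_overlap(result_a, result_b))
```

In this made-up case the two groups sit about 5% apart and their intervals do not overlap, which is the situation where the gap is likely to be real. Overlapping intervals would suggest collecting more reps before drawing a conclusion.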
How to spot one. On the leaderboard, group reps now show a small “±X.XX · N reps” line under the tok/s value, and the expand panel reveals a Group Context section with the full mean ± CI for tok/s, TTFT, and J/Tok. The explorer exposes the group aggregates as sortable columns.
How to run them yourself. Open Anubis OSS v3.5 or later, set the Repetitions stepper in the benchmark toolbar to anything from 2–20, and pick a seed strategy — Random (default) captures both hardware and sampler variance, Fixed isolates hardware-only variance for tight reproducibility comparisons. Each rep streams normally; the group aggregate is computed at completion.
Reasoning Models: Three Throughput Metrics (v3.1+)
For DeepSeek-R1, Qwen3-thinking, and other reasoning models, output tok/s now excludes thinking time. Prefill (input) and reasoning rates are tracked separately so you can see where each model spends its time.
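As a rough illustration of how the three rates separate, the sketch below divides each phase's token count by that phase's own duration. The record layout and field names are hypothetical (only the prefill ratio of prompt tokens to prompt evaluation time is described on this page), durations are assumed to be in seconds, and the sample numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class RunTimings:
    # Hypothetical field names; all durations assumed to be in seconds.
    prompt_tokens: int
    prompt_eval_duration: float
    reasoning_tokens: int
    reasoning_duration: float
    output_tokens: int
    output_duration: float


def rate(tokens, seconds):
    """Tokens per second, or None when the phase did not run."""
    if tokens <= 0 or seconds <= 0:
        return None
    return tokens / seconds


def throughput(run):
    """Split one run into prefill, reasoning, and output rates."""
    return {
        "prefill_tok_s": rate(run.prompt_tokens, run.prompt_eval_duration),
        "reasoning_tok_s": rate(run.reasoning_tokens, run.reasoning_duration),
        # Output rate uses only final-answer tokens and time, so thinking
        # time does not drag it down.
        "output_tok_s": rate(run.output_tokens, run.output_duration),
    }


# Invented numbers for illustration only.
example = RunTimings(
    prompt_tokens=1200, prompt_eval_duration=0.75,
    reasoning_tokens=600, reasoning_duration=7.5,
    output_tokens=300, output_duration=3.0,
)
metrics = throughput(example)
print(metrics)

# A non-reasoning model simply reports no reasoning rate.
plain = RunTimings(1200, 0.75, 0, 0.0, 300, 3.0)
print(throughput(plain))
```

Returning None for a phase that never ran, instead of zero, keeps non-reasoning models from being read as having a reasoning rate of 0 tok/s.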
Prefill Speed Leaderboard (v3.1+)
Top input tokens/sec rates — the prefill phase often dominates total latency on smaller prompts. Computed from prompt_tokens / prompt_eval_duration on submissions from app v3.1+.
| # | Model | Chip | RAM | prefill tok/s | output tok/s | Backend | User |
Throughput by Apple Silicon Chip
Average, median, and max tokens/sec across all tested models per chip
Memory Bandwidth vs Throughput
Each dot is a benchmark run — bandwidth is the primary driver of LLM inference speed
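One way to see why bandwidth matters so much is a back-of-the-envelope ceiling: if generating each token requires reading all model weights from memory once, decode speed cannot exceed bandwidth divided by weight size. The sketch below applies that simplification to the bandwidth figures quoted in the findings. The 4.5 GB weight size is an assumption (roughly an 8B dense model at 4-bit), and the bound ignores KV-cache traffic, compute limits, and MoE models, which read only their active experts per token.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode tok/s if every token reads all weights once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Bandwidth figures as quoted in the findings on this page (GB/s).
chips = {"M1 Ultra": 800, "M3 Max": 400, "M4 Pro": 273}

# Assumed weight footprint, not a measured value.
weights_gb = 4.5

ceilings = {
    name: decode_ceiling_tok_s(bandwidth, weights_gb)
    for name, bandwidth in chips.items()
}
for name, ceiling in ceilings.items():
    print(f"{name}: at most ~{ceiling:.0f} tok/s for {weights_gb} GB of weights")
```

Measured runs usually land below this ceiling, and medians across different model mixes are not directly comparable to it, so treat it as a sanity check and not a prediction.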
Power Efficiency
Tokens per Watt of system power by chip
Time to First Token
Median TTFT by chip (lower is better)
Backend Showdown
Average throughput by inference backend (backends with fewer than 2 runs excluded)
Leaderboard: Top 15 Fastest Runs
Highest output tokens/sec across all submissions
| # | Model | Chip | RAM | tok/s | TTFT | W/tok | Backend | User |
Big Model Club: 100B+ Parameters
Running frontier-class models locally on a Mac
| Model | Chip | RAM | tok/s | Quant | Backend |
Model Format Distribution
Submissions by model format (GGUF vs MLX)
Memory Tier Performance
Average tokens/sec grouped by unified memory size
Community Contributors
Top contributors by number of benchmark submissions