Benchmark Analysis — Anubis OSS

Key Findings

The most interesting takeaways from the community benchmarks

Memory bandwidth still dominates generational leaps. The M1 Ultra (800 GB/s) posts a median of 36.5 tok/s on large models while the newer M4 Pro (273 GB/s) manages 45.9 tok/s median — a modest edge despite three generations of architectural gains. The M3 Max (400 GB/s, 36GB) pulls far ahead with a 155 tok/s median, showing that bandwidth plus memory capacity together set the ceiling.

The M4 leads the M-series on efficiency by a wide margin. At 0.34 watts per token, the base M4 is the most power-efficient M-series chip in the dataset: about 37% below the M4 Pro (0.54 W/tok) and less than half the M4 Max (0.76 W/tok). The A18 Pro beats it at 0.19 W/tok but is constrained to 8GB and a sub-25 tok/s median throughput.
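The efficiency comparison can be checked directly from the quoted figures. The sketch below only divides the W/tok values stated in this section; the helper name `relative_cost` is illustrative and not part of Anubis.

```python
# W/tok figures quoted in the findings above.
watts_per_token = {
    "A18 Pro": 0.19,
    "M4": 0.34,
    "M4 Pro": 0.54,
    "M4 Max": 0.76,
}


def relative_cost(chip: str, baseline: str) -> float:
    """Return the chip's W/tok as a fraction of the baseline's W/tok."""
    return watts_per_token[chip] / watts_per_token[baseline]


for other in ("M4 Pro", "M4 Max"):
    ratio = relative_cost("M4", other)
    print(f"M4 uses {ratio:.2f}x the W/tok of the {other}")
```

A ratio below 0.5 means the M4 uses less than half the W/tok of the baseline chip.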

MoE models now run at serious speed on 128GB machines. GPT-OSS-120B hits 74.3 tok/s on M4 Max 128GB and Qwen3.5-122B-A10B reaches 59.5 tok/s on M5 Max 128GB, with each model activating only a small fraction of its weights per token (roughly 5B parameters for GPT-OSS-120B and 10B for Qwen3.5-122B-A10B). Even the M2 Max 96GB runs Qwen3.5-122B at 36.9 tok/s, making 120B+ MoE models broadly accessible across the high-RAM Mac lineup.

The M5 Max makes a strong debut in the dataset. With 42 runs logged, the M5 Max posts a median of 100 tok/s and an average of 173 tok/s — the highest average of any chip in the dataset. It hits 1,014 tok/s on a 104M Qwen pretrain (LM Studio / GGUF) and sustains 102 tok/s output on gemma4:e4b, with prefill on the same gemma run measuring 1,012 tok/s.

The A18 Pro has a hidden prefill superpower. Running Llama-3.2-3B-Instruct-4bit via MLX, the A18 Pro recorded a prefill speed of 18,600 tok/s — more than 19× faster than the M4’s 942 tok/s on the same model. Output generation remains limited to ~17–21 tok/s due to 8GB RAM and 100 GB/s bandwidth, but prompt ingestion is extraordinarily fast for an edge device.
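To put the prefill gap in wall-clock terms, here is a minimal sketch that converts the two quoted prefill rates into ingestion time. The 4,000-token prompt is a hypothetical example, and the calculation assumes the prefill rate stays constant as the prompt grows, which may not hold in practice.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to ingest a prompt at a constant prefill rate."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill rate must be positive")
    return prompt_tokens / prefill_tok_per_s


PROMPT_TOKENS = 4000  # hypothetical prompt length, not from the dataset

a18_pro_s = prefill_seconds(PROMPT_TOKENS, 18_600)  # rate quoted above
m4_s = prefill_seconds(PROMPT_TOKENS, 942)          # rate quoted above
speedup = m4_s / a18_pro_s

print(f"A18 Pro: {a18_pro_s:.2f} s, M4: {m4_s:.2f} s, ratio: {speedup:.1f}x")
```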

Findings last updated: 2026-05-05 · Data above is live; narrative is human-curated and refreshed on a monthly cadence.

Reasoning Models: Three Throughput Metrics (v3.1+)

For DeepSeek-R1, Qwen3-thinking, and other reasoning models, output tok/s now excludes thinking time. Prefill (input) and reasoning rates are tracked separately so you can see where each model spends its time.
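A minimal sketch of why the three rates are kept apart. The class names, field names, and sample numbers are hypothetical and do not reflect the Anubis submission schema; the point is only that folding thinking time into the output rate understates how fast the final answer is generated.

```python
from dataclasses import dataclass


@dataclass
class PhaseTiming:
    """Token count and wall-clock time for one phase of a run."""
    tokens: int
    seconds: float

    @property
    def tok_per_s(self) -> float:
        return self.tokens / self.seconds if self.seconds > 0 else 0.0


@dataclass
class ReasoningRun:
    prefill: PhaseTiming    # prompt ingestion
    reasoning: PhaseTiming  # thinking tokens
    output: PhaseTiming     # final answer tokens

    def blended_output_rate(self) -> float:
        """Answer tokens over thinking + answer time (thinking NOT excluded)."""
        total = self.reasoning.seconds + self.output.seconds
        return self.output.tokens / total if total > 0 else 0.0


# Hypothetical run, for illustration only.
run = ReasoningRun(
    prefill=PhaseTiming(tokens=512, seconds=0.5),
    reasoning=PhaseTiming(tokens=1200, seconds=30.0),
    output=PhaseTiming(tokens=300, seconds=7.5),
)

print("prefill  :", run.prefill.tok_per_s)
print("reasoning:", run.reasoning.tok_per_s)
print("output   :", run.output.tok_per_s)
print("blended  :", run.blended_output_rate())
```

In this made-up run the model emits answer tokens as fast as thinking tokens, yet the blended figure is far lower because the thinking time is charged against the answer.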

Prefill Speed Leaderboard (v3.1+)

Top input tokens/sec rates. The prefill phase often dominates total latency when the prompt is long relative to the generated output. Computed as prompt_tokens / prompt_eval_duration on submissions from app v3.1+.
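The formula can be sketched as follows. The unit of prompt_eval_duration is not stated here; this example assumes nanoseconds, which is how Ollama reports its field of the same name, so treat that as an assumption rather than the app's confirmed behavior. The function name is illustrative.

```python
from typing import Optional

NANOSECONDS_PER_SECOND = 1_000_000_000


def prefill_tok_per_s(prompt_tokens: int,
                      prompt_eval_duration_ns: int) -> Optional[float]:
    """prompt_tokens / prompt_eval_duration, with the duration assumed in ns.

    Returns None when either value is missing or non-positive, so a
    malformed submission does not produce a bogus leaderboard entry.
    """
    if prompt_tokens <= 0 or prompt_eval_duration_ns <= 0:
        return None
    return prompt_tokens * NANOSECONDS_PER_SECOND / prompt_eval_duration_ns


# Hypothetical submission: 2,048 prompt tokens ingested in 2 seconds.
print(prefill_tok_per_s(2048, 2_000_000_000))
```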

#	Model	Chip	RAM	prefill tok/s	output tok/s	Backend	User

Throughput by Apple Silicon Chip

Average, median, and max tokens/sec across all tested models per chip
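A sketch of the per-chip aggregation, using Python's standard `statistics` module. The chip labels and tok/s values are placeholders, not dataset entries; they are chosen to show how one very fast small-model run can pull a chip's average well above its median.

```python
from collections import defaultdict
from statistics import mean, median

# Placeholder (chip, tok/s) pairs, not real submissions.
runs = [
    ("chip-a", 20.0),
    ("chip-a", 40.0),
    ("chip-a", 300.0),  # one very fast run on a tiny model
    ("chip-b", 50.0),
    ("chip-b", 70.0),
]


def summarize_by_chip(rows):
    """Group tok/s values by chip and compute average, median, and max."""
    grouped = defaultdict(list)
    for chip, tok_per_s in rows:
        grouped[chip].append(tok_per_s)
    return {
        chip: {
            "avg": mean(values),
            "median": median(values),
            "max": max(values),
            "runs": len(values),
        }
        for chip, values in grouped.items()
    }


summary = summarize_by_chip(runs)
for chip, stats in summary.items():
    print(chip, stats)
```

Reading the median alongside the average is what makes this kind of skew visible.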

Memory Bandwidth vs Throughput

Each dot is a benchmark run — bandwidth is the primary driver of LLM inference speed
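One common back-of-the-envelope way to see why bandwidth matters: during decoding, each generated token requires reading roughly all active weights from memory once, so bandwidth divided by bytes read per token gives an upper bound on tok/s. The sketch below applies that approximation to the bandwidth figures quoted in Key Findings and a hypothetical 8B-parameter model at 4-bit quantization. It ignores KV-cache traffic, compute limits, and overhead, so real runs land below these ceilings.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float,
                             active_params_billion: float,
                             bits_per_param: float) -> float:
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumes every active parameter is read from memory once per token.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_param / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Bandwidths quoted in Key Findings; the 8B / 4-bit model is hypothetical.
for chip, bandwidth in (("M4 Pro", 273), ("M3 Max", 400), ("M1 Ultra", 800)):
    ceiling = decode_ceiling_tok_per_s(bandwidth, 8, 4)
    print(f"{chip}: about {ceiling:.0f} tok/s ceiling")
```

For MoE models, `active_params_billion` would be the parameters activated per token rather than the total count, which is why large MoE models can decode faster than their size suggests.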

Backend Showdown

Average throughput by inference backend (backends with fewer than 2 runs excluded)
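The exclusion rule can be sketched as a filter on run count before averaging. Backend labels and values here are placeholders, and `MIN_RUNS` simply mirrors the threshold named in the caption.

```python
from collections import defaultdict
from statistics import mean

MIN_RUNS = 2  # backends with fewer runs are excluded, per the caption

# Placeholder (backend, tok/s) pairs, not real submissions.
backend_runs = [
    ("backend-a", 40.0),
    ("backend-a", 60.0),
    ("backend-b", 90.0),  # single run, so it is dropped
]


def backend_averages(rows, min_runs=MIN_RUNS):
    """Average tok/s per backend, skipping backends with too few runs."""
    grouped = defaultdict(list)
    for backend, tok_per_s in rows:
        grouped[backend].append(tok_per_s)
    return {
        backend: mean(values)
        for backend, values in grouped.items()
        if len(values) >= min_runs
    }


averages = backend_averages(backend_runs)
print(averages)
```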

Leaderboard: Top 15 Fastest Runs

Highest output tokens/sec across all submissions

#	Model	Chip	RAM	tok/s	TTFT	W/tok	Backend	User

Big Model Club: 100B+ Parameters

Running frontier-class models locally on a Mac

Model	Chip	RAM	tok/s	Quant	Backend

Community Contributors

Top contributors by number of benchmark submissions
