| Local LLM Testing & Benchmarking for Apple Silicon | Community Leaderboard |
🚨 Benchmark analysis is live! More than 375 community-submitted runs analyzed - check out the results in the Benchmark Report.
Anubis is a native macOS app for benchmarking, comparing, and managing local large language models using any OpenAI-compatible endpoint - Ollama, MLX, oMLX, LM Studio Server, OpenWebUI, Docker Models, etc. Built with SwiftUI for Apple Silicon, it provides real-time hardware telemetry correlated with inference performance, with full history saved - something no CLI tool or chat wrapper offers. Export benchmarks directly without screenshotting, and export the raw data as Markdown or CSV from the history. You can even `ollama pull` models from within the app.
Tri-state control over Ollama's `think` request parameter, exposed in the Benchmark Performance disclosure when the Ollama backend is selected (a request sketch follows the list below):
- `think: true` to enable reasoning where supported
- `think: false` to disable reasoning on models that default it on (e.g. recent DeepSeek-R1 builds)
- The choice persists across launches.
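For illustration, the request shape looks roughly like this - a sketch assuming Ollama's `/api/chat` JSON body, with `OllamaChatRequest` as a hypothetical name:

```swift
import Foundation

// Hypothetical request type; `think` mirrors Ollama's /api/chat parameter:
// omit it (nil) for the model default, or set true/false explicitly.
struct OllamaChatRequest: Encodable {
    let model: String
    let messages: [[String: String]]
    var stream = true
    var think: Bool? = nil   // nil = model default - the third state

}

let request = OllamaChatRequest(
    model: "deepseek-r1:8b",
    messages: [["role": "user", "content": "Why is the sky blue?"]],
    think: false             // disable reasoning on a model that defaults it on
)
let body = try! JSONEncoder().encode(request)  // a nil `think` is simply omitted
```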
Output tokens/sec now measures visible throughput only for reasoning models. Previously, thinking time was charged against TTFT and thinking tokens were counted as output, inflating the numbers. Fixes #17 and #18.
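With made-up numbers, and assuming throughput is computed over the visible generation window, the fix changes the figure like this:

```swift
// Made-up numbers to show why the old accounting inflated throughput.
let thinkingTokens = 420.0, thinkingSeconds = 6.0   // hidden reasoning phase
let visibleTokens  = 180.0, visibleSeconds  = 3.0   // user-visible output

let inflated    = (thinkingTokens + visibleTokens) / (thinkingSeconds + visibleSeconds)
let visibleOnly = visibleTokens / visibleSeconds
print(inflated, visibleOnly)   // ~66.7 vs 60.0 tok/s
```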
Reasoning output is detected in any of the common shapes (`reasoning_content`, `reasoning`, or inline `<think>…</think>` tags) and surfaced wrapped in `<think>…</think>` markers in the response.

Anubis now benchmarks Apple's on-device Foundation Model alongside Ollama, MLX, and the rest - no server, no network, no setup. If your Mac supports Apple Intelligence (macOS 26+), it shows up in the backend menu automatically.
Select Apple Intelligence from the backend selector and run; Anubis talks directly to the on-device model via Apple's FoundationModels framework (a minimal sketch follows below).

Export the per-model performance summary directly from the Reports tab.
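Outside Anubis, talking to the same on-device model takes only a few lines. A minimal sketch, assuming the FoundationModels API Apple introduced at WWDC25 (`SystemLanguageModel`, `LanguageModelSession`):

```swift
import FoundationModels

// Minimal sketch - assumes macOS 26+ with Apple Intelligence enabled.
let model = SystemLanguageModel.default
if case .available = model.availability {
    let session = LanguageModelSession()
    do {
        let response = try await session.respond(to: "Summarize why the sky is blue.")
        print(response.content)
    } catch {
        print("Generation failed: \(error)")
    }
} else {
    print("On-device model unavailable on this Mac.")
}
```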
Push your Apple Silicon to its limits and observe power draw, thermal throttling, and frequency scaling under controlled load - all from within the Monitor.
- CPU stress: spawns `yes` processes per core. Choose All Cores, P-Cores only, E-Cores only, or Single Core.
- Memory bandwidth: uses `memcpy` to saturate the memory bus and reports measured bandwidth in GB/s, directly comparable to your chip's theoretical max (see the sketch after this list).
- Memory pressure: three pressure levels (Light 25% / Moderate 50% / Heavy 75% of free memory).
- Floating HUD: a compact, frameless, always-on-top overlay showing live system metrics - launchable from any tab via the sidebar or from the Monitor's Float button.
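For intuition, here is roughly what a `memcpy`-based bandwidth probe looks like - a self-contained sketch, not Anubis's actual implementation:

```swift
import Foundation

// Rough memcpy bandwidth probe: copy a large buffer repeatedly, report GB/s.
func measureMemcpyBandwidth(megabytes: Int = 512, iterations: Int = 20) -> Double {
    let bytes = megabytes * 1024 * 1024
    let src = UnsafeMutableRawPointer.allocate(byteCount: bytes, alignment: 1 << 14)
    let dst = UnsafeMutableRawPointer.allocate(byteCount: bytes, alignment: 1 << 14)
    defer { src.deallocate(); dst.deallocate() }
    memset(src, 1, bytes)   // fault pages in before timing

    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { memcpy(dst, src, bytes) }
    let seconds = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9

    // Counts bytes copied once per iteration; double it for read+write accounting.
    return Double(bytes) * Double(iterations) / seconds / 1e9
}

print(String(format: "%.1f GB/s", measureMemcpyBandwidth()))
```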
Five new built-in prompts covering causal reasoning, system design, dialogue writing, historical analysis, and constrained writing - bringing the total to 15 across five categories.
The local LLM ecosystem on macOS is fragmented.
Anubis fills that gap - all in a native macOS app.
Real-time performance dashboard for single-model testing.
Anubis strips a trailing `/v1` suffix from backend URLs to prevent double-pathing errors.

Side-by-side A/B model comparison with the same prompt.
Standalone real-time hardware monitoring dashboard - no benchmark required.
Upload your benchmark results to the community leaderboard and see how your Mac stacks up against other Apple Silicon machines.
Unified model management across all backends.
Scans `~/.lmstudio/models/` and `~/.cache/huggingface/hub/` for disk size, quantization, and path.

Anubis checks for updates automatically via Sparkle and notifies you when a new version is available.
GPU Core detail
Arena Mode
Settings (add connections with quick presets)
Vault - view model details, unload models, and pull models directly for Ollama
| Backend | Type | Default Port | Setup |
|---|---|---|---|
| Apple Intelligence | On-device (Foundation Models) | N/A | macOS 26+ with Apple Intelligence enabled. No setup; appears in the backend menu when supported. |
| Ollama | Native support | 11434 | Install from ollama.com - auto-detected on launch |
| LM Studio | OpenAI-compatible | 1234 | Enable local server in LM Studio settings |
| mlx-lm | OpenAI-compatible | 8080 | `pip install mlx-lm && mlx_lm.server --model <model>` |
| vLLM | OpenAI-compatible | 8000 | Add in Settings |
| LocalAI | OpenAI-compatible | 8080 | Add in Settings |
| Docker ModelRunner | OpenAI-compatible | User-selected | Add in Settings |
Any OpenAI-compatible server can be added through Settings > Add OpenAI-Compatible Server with a name, URL, and optional API key.
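Before adding a server, you can sanity-check that it speaks the protocol: `GET /v1/models` should return a JSON model list. A quick probe (URL and port are examples):

```swift
import Foundation

// Probe an OpenAI-compatible endpoint: /v1/models should return
// something like {"object":"list","data":[{"id": ...}, ...]}.
let url = URL(string: "http://localhost:1234/v1/models")!
do {
    let (data, response) = try await URLSession.shared.data(from: url)
    let status = (response as? HTTPURLResponse)?.statusCode ?? -1
    print("HTTP \(status): \(String(decoding: data, as: UTF8.self))")
} catch {
    print("Server not reachable: \(error)")
}
```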
Anubis captures Apple Silicon telemetry during inference via IOReport and system APIs:
| Metric | Source | Description |
|---|---|---|
| GPU Utilization | IOReport | GPU active residency percentage |
| CPU Utilization | `host_processor_info` | Usage across all cores |
| GPU Power | IOReport Energy Model | GPU power consumption in watts |
| CPU Power | IOReport Energy Model | CPU (E-cores + P-cores) power in watts |
| ANE Power | IOReport Energy Model | Neural Engine power consumption |
| DRAM Power | IOReport Energy Model | Memory subsystem power |
| GPU Frequency | IOReport GPU Stats | Weighted average from P-state residency |
| Process Memory | `proc_pid_rusage` | Backend process `phys_footprint` (includes Metal/GPU allocations) |
| Thermal State | `ProcessInfo.thermalState` | System thermal pressure level |
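Most of these sources require IOReport, a private framework, but the last row is plain public API. A minimal sketch of reading and observing thermal state:

```swift
import Foundation

// Read the current thermal pressure level (the table's last row).
switch ProcessInfo.processInfo.thermalState {
case .nominal:  print("nominal")
case .fair:     print("fair - mild pressure")
case .serious:  print("serious - throttling likely")
case .critical: print("critical")
@unknown default: print("unknown")
}

// Get notified when thermal pressure changes during a long run.
NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil, queue: .main
) { _ in
    print("Thermal state changed: \(ProcessInfo.processInfo.thermalState)")
}
```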
Anubis automatically detects which process is serving your model:
- `lsof` finds the PID listening on the inference port (called once per benchmark start); see the sketch after this list
- Memory is read as `phys_footprint` (same as Activity Monitor), which includes Metal/GPU buffer allocations - critical for MLX and other GPU-accelerated backends

Metrics degrade gracefully - if IOReport access is unavailable (e.g., in a VM), Anubis still shows inference-derived metrics.
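A sketch of the port-to-PID step in the same spirit (assumed shape, not Anubis's literal code):

```swift
import Foundation

// Ask lsof for the PID listening on a TCP port (-t = PID-only output).
func pidListening(onPort port: Int) throws -> Int32? {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/sbin/lsof")
    process.arguments = ["-nP", "-t", "-iTCP:\(port)", "-sTCP:LISTEN"]
    let pipe = Pipe()
    process.standardOutput = pipe
    try process.run()
    process.waitUntilExit()
    let output = String(decoding: pipe.fileHandleForReading.readDataToEndOfFile(),
                        as: UTF8.self)
    return output.split(separator: "\n").first
        .flatMap { Int32($0.trimmingCharacters(in: .whitespaces)) }
}

// e.g. try pidListening(onPort: 11434) for Ollama
```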
```bash
# macOS - install Ollama
brew install ollama

# Start the server
ollama serve

# Pull a model
ollama pull llama3.2:3b
```
```bash
git clone https://github.com/uncSoft/anubis-oss.git
cd anubis-oss/anubis
open anubis.xcodeproj
```
In Xcode, build and run (Cmd+R). Anubis will auto-detect Ollama on launch. Other backends can be added in Settings.
After a benchmark completes, click the Upload button in the benchmark toolbar to submit your results to the community leaderboard. Enter a display name and your run will appear in the rankings - no account required. Only performance metrics and hardware info are submitted; response text is never uploaded.
```bash
# Clone
git clone https://github.com/uncSoft/anubis-oss.git
cd anubis-oss/anubis

# Build via command line
xcodebuild -scheme anubis-oss -configuration Debug build

# Run tests
xcodebuild -scheme anubis-oss -configuration Debug test

# Or just open in Xcode
open anubis.xcodeproj
```
Resolved automatically by Swift Package Manager on first build:
| Package | Purpose | License |
|---|---|---|
| GRDB.swift | SQLite database | MIT |
| Sparkle | Auto-update framework | MIT |
| Swift Charts | Data visualization | Apple |
Anubis follows MVVM with a layered service architecture:
```
┌────────────────────────────────────────────────────────────────────────────┐
│                          PRESENTATION LAYER                                │
│       BenchmarkView   ArenaView   MonitorView   VaultView   Settings       │
├────────────────────────────────────────────────────────────────────────────┤
│                             SERVICE LAYER                                  │
│         MetricsService   InferenceService   ModelService   Export          │
├────────────────────────────────────────────────────────────────────────────┤
│                           INTEGRATION LAYER                                │
│  OllamaClient   OpenAICompatibleClient   IOReportBridge   ProcessMonitor   │
├────────────────────────────────────────────────────────────────────────────┤
│                           PERSISTENCE LAYER                                │
│                      SQLite (GRDB)        File System                      │
└────────────────────────────────────────────────────────────────────────────┘
```
Views display data and delegate to ViewModels. ViewModels coordinate Services. Services are stateless and use async/await. Integrations are thin adapters wrapping external systems (Ollama API, IOReport, etc.).
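A hypothetical, pared-down version of that flow (all names here are stand-ins, not Anubis's actual types):

```swift
import SwiftUI

// ViewModel: observed by the View, delegates real work to a service.
@MainActor
final class BenchmarkViewModel: ObservableObject {
    @Published var tokensPerSecond: Double = 0

    func run() async {
        // Stateless async service call (stand-in for InferenceService).
        tokensPerSecond = await FakeInferenceService.measureThroughput()
    }
}

// Service: stateless, async/await, no UI knowledge.
enum FakeInferenceService {
    static func measureThroughput() async -> Double { 42.0 }
}

// View: displays data and delegates to the ViewModel.
struct BenchmarkView: View {
    @StateObject private var viewModel = BenchmarkViewModel()

    var body: some View {
        Text("\(viewModel.tokensPerSecond, specifier: "%.1f") tok/s")
            .task { await viewModel.run() }
    }
}
```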
```
anubis/
├── App/                # Entry point, app state, navigation
├── Features/
│   ├── Benchmark/      # Performance dashboard
│   ├── Arena/          # A/B model comparison
│   ├── Monitor/        # System monitor, stress tests, floating HUD
│   ├── Vault/          # Model management
│   └── Settings/       # Backend config, about, help, contact
├── Services/           # MetricsService, InferenceService, ExportService
├── Integrations/       # OllamaClient, OpenAICompatibleClient, IOReportBridge, ProcessMonitor
├── Models/             # Data models (BenchmarkSession, ModelInfo, etc.)
├── Database/           # GRDB setup & migrations
├── DesignSystem/       # Theme, colors, reusable components
├── Demo/               # Demo mode for App Store review
└── Utilities/          # Formatters, constants, logger, benchmark prompts
```
All inference backends implement a shared protocol, making it straightforward to add new ones:
```swift
protocol InferenceBackend {
    var id: String { get }               // stable identifier, e.g. "ollama"
    var displayName: String { get }      // name shown in the backend menu
    var isAvailable: Bool { get async }  // probed asynchronously
    func listModels() async throws -> [ModelInfo]
    func generate(prompt: String, parameters: GenerationParameters)
        -> AsyncThrowingStream<InferenceChunk, Error>  // streams tokens as they arrive
}
```
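To make the shape concrete, here is a toy conformance - a sketch with hypothetical stand-ins for `ModelInfo`, `GenerationParameters`, and `InferenceChunk` (the real definitions live in `Models/`):

```swift
import Foundation

// Hypothetical stand-ins for the real types in Models/ - illustration only.
struct ModelInfo { let id: String; let displayName: String }
struct GenerationParameters { var temperature: Double = 0.7; var maxTokens: Int = 256 }
struct InferenceChunk { let text: String; let isFinal: Bool }

// A toy backend that streams the prompt back word by word.
struct EchoBackend: InferenceBackend {
    let id = "echo"
    let displayName = "Echo (demo)"

    var isAvailable: Bool {
        get async { true }
    }

    func listModels() async throws -> [ModelInfo] {
        [ModelInfo(id: "echo-1", displayName: "Echo v1")]
    }

    func generate(prompt: String, parameters: GenerationParameters)
        -> AsyncThrowingStream<InferenceChunk, Error>
    {
        AsyncThrowingStream { continuation in
            for word in prompt.split(separator: " ") {
                continuation.yield(InferenceChunk(text: String(word) + " ", isFinal: false))
            }
            continuation.finish()
        }
    }
}
```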
All data is stored locally - nothing leaves your machine.
| Data | Location |
|---|---|
| Database | ~/Library/Application Support/Anubis/anubis.db |
| Exports | Generated on demand (CSV, Markdown) |
| Preferences | UserDefaults |
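Since the database is plain SQLite, you can inspect your own history outside the app using the same GRDB dependency. A read-only sketch (table names vary by schema version, so list them first):

```swift
import Foundation
import GRDB

// Open the app's database read-only and list its tables.
var config = Configuration()
config.readonly = true
let path = ("~/Library/Application Support/Anubis/anubis.db" as NSString)
    .expandingTildeInPath
let dbQueue = try DatabaseQueue(path: path, configuration: config)

try dbQueue.read { db in
    let tables = try String.fetchAll(
        db, sql: "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    print("Tables:", tables)
}
```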
```bash
# Make sure Ollama is running
ollama serve

# Verify it's accessible
curl http://localhost:11434/api/tags

# Pull the model if it isn't present locally
ollama pull <model-name>
```

Contributions are welcome. A few guidelines:
- Errors should surface `errorDescription` and `recoverySuggestion`
- New backends go in `Integrations/`, implementing `InferenceBackend`; register them in `InferenceService` and expose them in `Settings/`

If Anubis is useful to you, consider buying me a coffee on Ko-fi or sponsoring on GitHub. It helps fund continued development and new features.
A sandboxed, less feature-rich version is also available on the Mac App Store if you prefer a managed install.
GPL-3.0 License - see LICENSE for details.