Mac Studio as Homelab AI Node
Overview
A used Mac Studio M1 Max with 64GB unified memory running Ollama, powering all LLM inference in the homelab — the wiki pipeline, the monitoring pipeline, and interactive sessions. Lives at 192.168.1.45, answers to Legolas.
Why I Built It
I was running LLM inference on an HP Pavilion with CPU only: about 15 tokens per second on a reasonably-sized model, which meant several minutes per wiki pipeline run. When you’re doing batch processing across dozens of documents, that compounds, and it kills the flow.
The Apple Silicon case comes down to memory bandwidth. A 31B model doesn’t fit in GPU memory on most consumer hardware, but the M1 Max has ~400 GB/s of unified memory bandwidth and 64GB that the CPU and GPU both access. You get 25+ tokens per second on 27-31B models without quantization tricks. Not a data center, but fast enough that you stop noticing the wait.
I bought a used 2022 Mac Studio M1 Max for $2,299. New ones with 64GB were backordered to late June at the time, and the used market was thin, so I paid close to new price for a three-year-old machine. The form factor makes up for it: small, silent, and low-power even under inference load.
How It Works
| Component | Details |
|---|---|
| Machine | Mac Studio 2022, M1 Max, 64GB unified memory |
| Hostname | Legolas (192.168.1.45) |
| Inference backend | Ollama, listening on 0.0.0.0:11434 |
| Text cleaning | gemma4:e2b |
| Crystallization / generation | qwen3.6:35b-a3b-coding-nvfp4 |
| PDF and image OCR | minicpm-v:8b |
| Embeddings | nomic-embed-text |
Legolas is the only inference node in the homelab — everything routes to 192.168.1.45:11434. Ollama runs with OLLAMA_HOST=0.0.0.0 via a launchd plist, accessible to any machine on the server VLAN.
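A quick sanity check from another host on the VLAN is to hit Ollama's /api/tags endpoint, which lists the models the node has pulled. A minimal sketch in Python — the requests dependency and the five-second timeout are my choices here, not part of the setup:

```python
import requests

OLLAMA = "http://192.168.1.45:11434"

# /api/tags returns the models available on the node; a 200 response
# from another VLAN host confirms OLLAMA_HOST=0.0.0.0 took effect.
resp = requests.get(f"{OLLAMA}/api/tags", timeout=5)
resp.raise_for_status()
for model in resp.json()["models"]:
    print(model["name"])
```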
64GB means multiple models stay resident at once with no reload delays. The wiki pipeline runs three models in sequence — gemma4:e2b for cleaning, qwen3.6:35b for crystallization, minicpm-v:8b for PDF pages — and all three stay in memory between calls. First call after startup takes a moment; after that it’s fast.
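In practice the pipeline's calls look roughly like the sketch below: one blocking /api/generate request per stage, with keep_alive asking Ollama to hold each model in memory between calls. The prompts and the keep_alive value are illustrative, not the pipeline's actual settings.

```python
import requests

OLLAMA = "http://192.168.1.45:11434/api/generate"

def generate(model, prompt):
    # stream=False returns the whole completion in one JSON object;
    # keep_alive=-1 asks Ollama to keep the model loaded indefinitely.
    resp = requests.post(OLLAMA, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": -1,
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]

cleaned = generate("gemma4:e2b", "Clean this text: ...")
page = generate("qwen3.6:35b-a3b-coding-nvfp4",
                f"Crystallize into a wiki page:\n{cleaned}")
```

By default Ollama unloads an idle model after a few minutes, which is why keep_alive matters when there are gaps between pipeline stages.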
nomic-embed-text also runs on Legolas. Every time a wiki page is written, it gets embedded and stored in pgvector on the Postgres VM. Semantic search across the full wiki runs against those embeddings — how I find related pages without just grepping through files.
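The write path is roughly the following sketch, using Ollama's /api/embeddings endpoint and psycopg2; the connection string, table, and column names are placeholders rather than the real schema:

```python
import requests
import psycopg2

OLLAMA = "http://192.168.1.45:11434/api/embeddings"

def embed(text):
    # nomic-embed-text returns one embedding vector per request.
    resp = requests.post(OLLAMA, json={"model": "nomic-embed-text",
                                       "prompt": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["embedding"]

def store_page(path, body):
    vec = embed(body)
    literal = "[" + ",".join(str(v) for v in vec) + "]"  # pgvector input format
    # Connection string, table, and column names are placeholders.
    with psycopg2.connect("dbname=wiki host=postgres.lan") as conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO wiki_pages (path, embedding) VALUES (%s, %s::vector)",
            (path, literal),
        )
```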
The monitoring pipeline runs through n8n on the Proxmox node and calls Legolas for all synthesis. Every hour, n8n collects metrics from Prometheus, Uptime Kuma, UniFi, and the Synology, hands them to Ollama in parallel for summarization, and writes the results to Postgres. Legolas handles all of it without noticeable load.
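The fan-out n8n does is roughly equivalent to this sketch: one summarization request per metrics source, fired concurrently at the same endpoint. The prompts, the worker count, and the choice of model for summarization are assumptions here, not pulled from the actual workflow.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

OLLAMA = "http://192.168.1.45:11434/api/generate"

def summarize(source, metrics_text):
    # Model choice is an assumption; the workflow's actual model isn't pinned down here.
    resp = requests.post(OLLAMA, json={
        "model": "qwen3.6:35b-a3b-coding-nvfp4",
        "prompt": f"Summarize the last hour of {source} metrics:\n{metrics_text}",
        "stream": False,
    }, timeout=600)
    resp.raise_for_status()
    return source, resp.json()["response"]

# Placeholder metrics payloads for each source.
sources = {"prometheus": "...", "uptime-kuma": "...", "unifi": "...", "synology": "..."}
with ThreadPoolExecutor(max_workers=4) as pool:
    summaries = dict(pool.map(lambda kv: summarize(*kv), sources.items()))
```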
Challenges
- qwen3.6:35b silently freezes on macOS. The model shows as loaded in `ollama ps`, requests are accepted, and nothing comes back. The fix is `pkill -f ollama` followed by `brew services restart ollama` (not systemctl; Legolas runs macOS). After the restart, clear any stale lock files on wiki-llm before relaunching.
- MLX wasn’t ready for Gemma 4. mlx-community 4-bit quantizations wouldn’t load, LM Studio’s MLX backend didn’t support Gemma 4, and chat templates needed manual handling. Ollama was the only reliable path when I set this up.
- 10-second settle delay after vision calls. After consecutive minicpm-v:8b requests, the next gemma4:e2b stream closes early without `done=true`. Root cause undiagnosed; a sleep(10) between the last vision call and the first text call fixes it in the pipeline (sketched after this list).
- Used market timing. Bought this when new 64GB Mac Studios were backordered to late June. Thin inventory meant paying close to new price for a three-year-old machine. It arrived in a week; a new one wouldn’t have shipped until summer.
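For reference, the settle-delay workaround amounts to a fixed pause between the vision pass and the first text call. A sketch under the same API assumptions as above, with placeholder prompts and helper names:

```python
import time
import requests

OLLAMA = "http://192.168.1.45:11434/api/generate"

def call(payload):
    resp = requests.post(OLLAMA, json={**payload, "stream": False}, timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]

def ocr_then_clean(page_images_b64):
    # Vision pass: one minicpm-v:8b request per base64-encoded page image.
    pages = [call({"model": "minicpm-v:8b",
                   "prompt": "Transcribe the text on this page.",
                   "images": [img]}) for img in page_images_b64]
    # Without this pause, the next gemma4:e2b stream can close early
    # without done=true after consecutive vision calls.
    time.sleep(10)
    return call({"model": "gemma4:e2b",
                 "prompt": "Clean this OCR output:\n" + "\n".join(pages)})
```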
Result
Running since April 2026. It replaced the Pavilion for all AI workloads: the wiki pipeline runs in a fraction of the previous time, the monitoring pipeline hits every hourly cycle, and interactive sessions with 30B+ models are fast enough that the wait never registers.
The Pavilion now does what it’s actually suited for: it has an NVIDIA MX550 and runs Jellyfin with hardware transcoding. Cleaner than trying to make it do both.