Raspberry Pi 5 + AI HAT+ 2: Build a Low-cost On-call Assistant for Remote Ops
Stop drowning in noisy pagers: build an on-call assistant that triages alerts, summarizes incidents, and frees up your async time
If you’re a remote SRE or sysadmin, you know the pain: pages trigger anywhere, at any hour, and most of them are noise. What if a $200–$300 home-lab node could filter, enrich, and summarize alerts before they hit your primary pager — and do the heavy context-gathering for you? In this guide (2026 edition) I’ll walk you through a complete, practical build using the Raspberry Pi 5 and the new AI HAT+ 2 to run an edge LLM that monitors systems, triages alerts, and produces human-ready incident summaries for remote teams.
What you’ll build — at a glance
By the end you’ll have a small, resilient service running in your home lab that:
- Receives webhooks from Alertmanager (Prometheus) or other monitoring tools
- Queries relevant time-series and logs to enrich each alert
- Runs on-device inference on the AI HAT+ 2 to classify & triage
- Generates a concise incident summary and actionable next steps
- Notifies the right channel (Slack, PagerDuty, SMS) based on severity and timezone
Why Raspberry Pi 5 + AI HAT+ 2 in 2026?
Late 2025 and early 2026 saw major momentum in edge AI for observability. The AI HAT+ 2 unlocked affordable on-device generative models for the Raspberry Pi 5, enabling low-latency, privacy-friendly inference without cloud egress. For remote ops teams this translates to:
- Lower alert-to-decision latency — useful for time-sensitive incidents
- Reduced cloud costs and data exposure because sensitive logs stay on-prem
- Offline resilience when your cloud consoles are unreachable
ZDNET and other outlets highlighted how the AI HAT+ 2 made generative AI practical on the Pi 5 — making on-device SRE assistants feasible outside of enterprise lab setups.
Hardware & parts (budget-friendly)
- Raspberry Pi 5 (8GB recommended; 16GB ideal for heavier models)
- AI HAT+ 2 — the official accelerator board that runs optimized LLMs on-device
- Fast microSD (or NVMe adapter + SSD for durability)
- Active cooling (case + fan) and a decent power supply
- Optional: small USB speaker or a status LED panel for local alerts
Software stack overview
Keep the stack lightweight and modular so you can iterate:
- Base OS: Raspberry Pi OS 64-bit or Ubuntu 24.04 (server)
- Container runtime: Docker + Docker Compose
- Monitoring: Prometheus (node-exporter), Grafana (optional on-device), Alertmanager
- Triage service: Python FastAPI app that calls the on-device LLM
- Storage: SQLite (local) or lightweight timeseries/log store
- Notification sinks: Slack webhook, PagerDuty integration, SMS via Twilio (optional)
High-level architecture
Alert flow:
- Prometheus Alertmanager -> webhook to Pi triage service
- Triage service queries Prometheus HTTP API + recent logs (Loki/Fluent Bit) to collect context
- Context is fed to the on-device LLM on the AI HAT+ 2, which returns a classification, a priority, and a 3–5 line incident summary
- Triage service decides whether to escalate, snooze, or acknowledge, then posts to Slack or PagerDuty with summary + runbook links
Step 1 — OS & base setup
Flash a 64-bit image (Ubuntu 24.04 or Raspberry Pi OS) and complete the first-boot updates. Enable SSH and configure a static IP or DHCP reservation so Alertmanager can reach the device reliably.
Install basic packages and Docker:
sudo apt update && sudo apt upgrade -y
sudo apt install -y docker.io docker-compose git python3-venv python3-pip
Step 2 — install AI HAT+ 2 runtime
The AI HAT+ 2 vendor provides an SDK and drivers. Use the vendor instructions to install firmware and runtime bindings. Typical steps:
- Enable required interfaces (SPI/I2C) via raspi-config or Ubuntu device config
- Install the SDK (pip wheel or apt package provided by vendor)
- Verify with the vendor-supplied sample inference script
After installation, run a simple inference test to ensure your model runs and the NPU responds.
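Vendor sample scripts vary, so a portable way to smoke-test is to wrap whatever completion call the SDK exposes in a small timing harness. A sketch under that assumption (the `generate` callable stands in for the vendor API; the lambda in the usage comment adapts the `LocalModel` interface used later in this guide, which is itself a placeholder):

```python
import time

def smoke_test(generate, prompt: str = "Reply with the single word: ready",
               max_tokens: int = 8):
    """Run one short inference and time it; `generate` is whatever
    text-completion callable the vendor SDK provides."""
    start = time.monotonic()
    out = generate(prompt, max_tokens)
    elapsed = time.monotonic() - start
    return out, elapsed

# Usage with a hypothetical SDK interface:
# from ai_hat_sdk import LocalModel
# model = LocalModel(model_path='/models/triage-ggml.bin')
# print(smoke_test(lambda p, n: model.generate(p, max_tokens=n)))
```

If the call returns a completion in well under a second for a short prompt, the NPU path is working; multi-second latencies usually mean the model is falling back to CPU.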
Step 3 — deploy the monitoring stack (Docker Compose)
Use Docker Compose to keep services isolated. Here’s a concise example compose file structure (trimmed):
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus
    ports:
      - 9090:9090
  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - 9093:9093
  node-exporter:
    image: prom/node-exporter
    network_mode: host
  triage:
    build: ./triage
    ports:
      - 8080:8080
    devices:
      - /dev/ai_hat_npu:/dev/ai_hat_npu  # vendor-specific device
    volumes:
      - ./data:/data
This runs Prometheus + Alertmanager locally and exposes a triage FastAPI service on port 8080.
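For Alertmanager to deliver alerts to the triage service, its config needs a webhook receiver pointing at the container. A minimal ./alertmanager/alertmanager.yml sketch (the receiver name and single-route tree are assumptions — fold this into your existing routing rules):

```yaml
route:
  receiver: pi-triage
receivers:
  - name: pi-triage
    webhook_configs:
      - url: http://triage:8080/alert
        send_resolved: true
```

The `triage` hostname resolves inside the Compose network; `send_resolved` lets the service auto-close incidents when the alert clears.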
Step 4 — write the triage service
Core responsibilities:
- Authenticate incoming webhooks (HMAC verification)
- Query Prometheus API for the last N datapoints for the affected metric
- Fetch log snippets (Loki/Fluent Bit) for the window around the alert
- Call the on-device LLM with a structured prompt to: classify, assign severity, and produce a summary
- Decide the action: escalate, snooze, or acknowledge; then send notifications
Minimal FastAPI webhook (concept)
from fastapi import FastAPI, Request
import requests
import sqlite3

# placeholder import - use the AI HAT+ 2 SDK module
from ai_hat_sdk import LocalModel

app = FastAPI()
model = LocalModel(model_path='/models/triage-ggml.bin')
PROM_URL = 'http://prometheus:9090'

@app.post('/alert')
async def alert_hook(req: Request):
    payload = await req.json()
    # 1) verify the HMAC signature here before trusting the payload
    # 2) extract alert fields
    alert = payload['alerts'][0]
    metric = alert['labels'].get('metric')
    # 3) query Prometheus for the last 5 minutes of the metric
    # (pass the query via params so it is URL-encoded correctly)
    res = requests.get(f"{PROM_URL}/api/v1/query",
                       params={'query': f"{metric}[5m]"}).json()
    timeseries = res['data']['result']
    # 4) build prompt
    prompt = build_prompt(alert, timeseries)
    # 5) run local inference
    result = model.generate(prompt, max_tokens=256)
    # 6) parse and send notification
    summary, severity, action = parse_model_output(result)
    post_to_slack(summary, severity)
    # 7) store to SQLite for audit
    conn = sqlite3.connect('/data/incidents.db')
    conn.execute('INSERT INTO incidents (summary, severity) VALUES (?, ?)',
                 (summary, severity))
    conn.commit()
    conn.close()
    return {'status': 'ok'}
Note: replace ai_hat_sdk with the official SDK exposed by AI HAT+ 2. The pattern is the same for any local LLM runtime.
Step 5 — prompt engineering for triage
Design prompts to be explicit and prescriptive. Your prompt should:
- Start with a short instruction: classify the alert as noise, warning, or incident
- Include compact context: metric name, recent datapoints, top log snippets, service owner, runbook URL
- Limit output to structured JSON: {"severity":"","summary":"","next_steps":[...],"confidence":0.0}
Instruction: You are an SRE assistant. Given the metric and logs, return a JSON with severity, a 2-line summary, 2 suggested next steps, and confidence.
Metric: http_request_duration_seconds
Recent points: [0.23,0.25,0.95,1.8,2.3]
Logs: "timeout while connecting to upstream", "upstream 10.0.0.5 reset"
Output JSON:
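Deterministic parsing is what makes the strict JSON output worth the effort. One way to sketch the `parse_model_output` helper referenced in the webhook (the return shape and the 'warning' fallback are design assumptions, not a fixed API — the key idea is that a malformed reply must never silently drop an alert):

```python
import json

VALID_SEVERITIES = {'noise', 'warning', 'incident'}

def parse_model_output(raw: str):
    """Extract and validate the JSON object from a model completion.
    Falls back to a conservative 'warning' on any parse failure."""
    try:
        # tolerate stray prose around the JSON by slicing the outermost braces
        data = json.loads(raw[raw.index('{'): raw.rindex('}') + 1])
    except ValueError:  # covers both missing braces and JSONDecodeError
        return 'model output unparseable, needs human review', 'warning', 'review'
    severity = data.get('severity')
    if severity not in VALID_SEVERITIES:
        severity = 'warning'
    action = 'escalate' if severity == 'incident' else 'review'
    return data.get('summary', ''), severity, action
```

Clamping unknown severities to 'warning' means a hallucinated label gets a human look rather than an automatic page or an automatic dismissal.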
Step 6 — secure the pipeline
- HMAC-sign Alertmanager webhooks using a shared secret
- Run the triage service behind a reverse proxy (NGINX) with TLS
- Limit SSH access and use firewall rules to permit only the monitoring stack
- Scrub sensitive log content before storing or sending off-device
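Alertmanager has no native HMAC signing, so the usual pattern is to add the signature at a small proxy (or a custom sender) and verify it in the triage service. A verification sketch, assuming a GitHub-style `sha256=<hexdigest>` header convention (the header name and prefix are assumptions):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a hex-encoded HMAC-SHA256 of the raw request body against the
    value sent in the signature header, using a constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header.removeprefix('sha256=')
    return hmac.compare_digest(expected, received)
```

Always verify against the raw body bytes, not the re-serialized JSON, and reject the request with a 401 before doing any Prometheus queries or inference.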
Step 7 — notification and routing logic
Simple routing rules you can implement:
- severity == incident -> create PagerDuty event and post a Slack #oncall message
- severity == warning && confidence < 0.6 -> create a low-priority Slack thread for review
- noise -> auto-acknowledge in Alertmanager and log the event for analyst review
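These rules reduce to a small pure function, which keeps them easy to unit-test before they ever touch a pager. A sketch (sink names are illustrative; the high-confidence-warning fallback is an assumption the rules above leave unspecified):

```python
def route_alert(severity: str, confidence: float) -> list[str]:
    """Map a triage decision to notification sinks, mirroring the
    routing rules above."""
    if severity == 'incident':
        return ['pagerduty', 'slack#oncall']
    if severity == 'warning' and confidence < 0.6:
        return ['slack#triage-review']          # low-priority thread
    if severity == 'noise':
        return ['alertmanager-ack']             # auto-ack + log for review
    return ['slack#oncall']                     # e.g. high-confidence warning
```

Keeping routing separate from inference also lets you replay historical decisions against a new ruleset without re-running the model.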
Step 8 — test with simulated alerts
Before you flip this into production, run a simulation harness that:
- Generates Prometheus alerts with different signatures (CPU spikes, network errors, flapping)
- Measures triage latency and model inference time
- Checks false-positive and false-negative triage decisions and logs corrective labels
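A harness can replay canned payloads straight at the webhook. A sketch that builds a minimal Alertmanager-v4-style payload (field layout follows Alertmanager's documented webhook format; the `pi-triage` receiver name matches the earlier examples and is an assumption):

```python
from datetime import datetime, timezone

def make_test_alert(alertname: str, metric: str,
                    severity: str = 'warning') -> dict:
    """Build a minimal Alertmanager v4-style webhook payload for
    replaying against the triage endpoint."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        'version': '4',
        'status': 'firing',
        'receiver': 'pi-triage',
        'alerts': [{
            'status': 'firing',
            'labels': {'alertname': alertname, 'metric': metric,
                       'severity': severity},
            'annotations': {'summary': f'simulated {alertname}'},
            'startsAt': now,
        }],
    }

# replay with, e.g.:
# requests.post('http://localhost:8080/alert',
#               json=make_test_alert('CPUSpike', 'node_cpu_seconds_total'))
```

Timestamp the send and the triage decision in your incident table so the same harness run yields both latency numbers and labeled decisions for precision/recall tracking.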
Tuning for performance and cost
- Choose smaller, quantized models for the AI HAT+ 2 to keep inference under 1–2 seconds for short prompts
- Cache runbook snippets and frequently used patterns to avoid repeated inference for rerouted duplicates
- Use a CPU-only fallback when the NPU is busy (and re-queue the inference)
- Monitor Pi 5 thermal and CPU stats; add heatsink/active cooling to prevent throttling
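On Linux the SoC temperature is exposed in sysfs as millidegrees Celsius, so a thermal check before queueing inference is a few lines. A sketch (the thermal-zone path is the standard Linux location — verify which zone maps to the SoC on your image, and the 80 °C limit is a conservative assumption below the Pi's throttle point):

```python
def soc_temp_c(path: str = '/sys/class/thermal/thermal_zone0/temp') -> float:
    """Read the SoC temperature; the kernel reports millidegrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

def should_defer_inference(temp_c: float, limit_c: float = 80.0) -> bool:
    """Back off (re-queue or fall back to CPU) when nearing throttling."""
    return temp_c >= limit_c
```

Exporting this reading via node-exporter's textfile collector lets the same Prometheus stack alert you when your alert-triage node itself is overheating.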
Advanced features to add later
- Active learning loop: capture human corrections and periodically fine-tune a small local model or maintain a ruleset
- Distributed Pi cluster for high availability — use leader election and persistent storage for incident state
- Voice summaries for overnight on-call: TTS generated from summaries on the Pi speaker
- Integrate with runbook automation (Ansible/Playbooks) to run pre-approved remediation steps with gating
Operational best practices
- Keep a human-in-the-loop for high-impact incidents — the assistant should recommend, not unilaterally remediate
- Track precision/recall of triage decisions and put a cadence on review (weekly)
- Rotate keys and audit logs for your triage node regularly
- Define an override process so engineers can bypass the assistant during drills or incident swarms
2026 trends and how this fits your remote ops workflow
In 2026, more teams adopted hybrid observability: cloud-native monitoring with edge inference for fast pre-triage. The benefits are especially important for remote teams who rely on async context — a short, crisp summary from your on-call assistant reduces the need for immediate synchronous check-ins. Expect LLM model runtimes to get smaller and more optimized for accelerators like the AI HAT+ 2; plan to upgrade models annually and keep your prompt templates under version control.
Risks and tradeoffs
Edge LLMs are powerful but imperfect. Key tradeoffs:
- False negatives can delay critical incidents if the assistant misclassifies — always enforce human override and alert escalation timeouts
- On-device models may miss subtle cross-service correlations visible only in centralized analytics — use your assistant for first-pass triage, not final root-cause
- Hardware failure of your Pi must be planned for — have a cloud fallback or secondary Pi in HA configuration
Example: sample alert lifecycle (concise)
- Alertmanager sends webhook for increased 5xx rate to /alert
- Triage service queries last 5m of 5xx rate and pulls two relevant logs
- LLM outputs severity=incident, summary="Upstream 10.0.0.5 returning 502, likely connection resets", next_steps=["Notify backend owner", "Check upstream host 10.0.0.5 network"]
- Service creates a PagerDuty incident, posts the summary to Slack, and attaches runbook link
Actionable takeaways
- Start small: run the triage service in passive mode (post-only to a review channel) before automatic escalation
- Design strict JSON output prompts to keep parsing deterministic
- Secure webhooks with HMAC and TLS to prevent spoofed alerts
- Monitor Pi health and enforce HA or cloud fallback for critical on-call paths
Closing — why build this now
Edge AI on accessible hardware like the Raspberry Pi 5 + AI HAT+ 2 makes practical, low-cost on-call assistants a reality in 2026. For remote SREs and admins, that means fewer wakeups for noisy alerts, faster context, and more time for async response and automation. Start with the patterns above, tune prompts to your stack, and keep humans in control for high-stakes decisions.
Ready to build? Clone a starter triage repo, hook it to a lab Prometheus instance, and run simulations tonight. Share your results with your team and iterate: the fastest path to value is small, measurable improvements in pager noise and mean time to acknowledge.
Call to action
Try this in your home lab this week: set up a Pi 5 + AI HAT+ 2, deploy the minimal triage stack in passive mode, and run 50 simulated alerts. Track how many alerts the assistant filters or summarizes correctly, then tune prompts for 1–2 hours. If you want a reviewed starter repo and a pre-built prompt set tuned for Prometheus + Loki, sign up on remotejob.live or share your build in the community — I’ll review and suggest improvements based on your telemetry.