Raspberry Pi 5 + AI HAT+ 2: Build a Low-cost On-call Assistant for Remote Ops
Stop drowning in noisy pagers: build an on-call assistant that triages alerts, summarizes incidents, and frees up your async time
If you’re a remote SRE or sysadmin, you know the pain: pages trigger anywhere, at any hour, and most of them are noise. What if a $200–$300 home-lab node could filter, enrich, and summarize alerts before they hit your primary pager — and do the heavy context-gathering for you? In this guide (2026 edition) I’ll walk you through a complete, practical build using the Raspberry Pi 5 and the new AI HAT+ 2 to run an edge LLM that monitors systems, triages alerts, and produces human-ready incident summaries for remote teams.
What you’ll build — at a glance
By the end you’ll have a small, resilient service running in your home lab that:
- Receives webhooks from Alertmanager (Prometheus) or other monitoring tools
- Queries relevant time-series and logs to enrich each alert
- Runs on-device inference on the AI HAT+ 2 to classify & triage
- Generates a concise incident summary and actionable next steps
- Notifies the right channel (Slack, PagerDuty, SMS) based on severity and timezone
Why Raspberry Pi 5 + AI HAT+ 2 in 2026?
Late 2025 and early 2026 saw major momentum in edge AI for observability. The AI HAT+ 2 unlocked affordable on-device generative models for the Raspberry Pi 5, enabling low-latency, privacy-friendly inference without cloud egress. For remote ops teams this translates to:
- Lower alert-to-decision latency — useful for time-sensitive incidents
- Reduced cloud costs and data exposure because sensitive logs stay on-prem
- Offline resilience when your cloud consoles are unreachable
ZDNET and other outlets highlighted how the AI HAT+ 2 made generative AI practical on the Pi 5 — making on-device SRE assistants feasible outside of enterprise lab setups.
Hardware & parts (budget-friendly)
- Raspberry Pi 5 (8GB recommended; 16GB ideal for heavier models)
- AI HAT+ 2 — the official accelerator board that runs optimized LLMs on-device
- Fast microSD (or NVMe adapter + SSD for durability)
- Active cooling (case + fan) and a decent power supply
- Optional: small USB speaker or a status LED panel for local alerts
Software stack overview
Keep the stack lightweight and modular so you can iterate:
- Base OS: Raspberry Pi OS 64-bit or Ubuntu 24.04 (server)
- Container runtime: Docker + Docker Compose
- Monitoring: Prometheus (node-exporter), Grafana (optional on-device), Alertmanager
- Triage service: Python FastAPI app that calls the on-device LLM
- Storage: SQLite (local) or lightweight timeseries/log store
- Notification sinks: Slack webhook, PagerDuty integration, SMS via Twilio (optional)
High-level architecture
Alert flow:
- Prometheus Alertmanager -> webhook to Pi triage service
- Triage service queries Prometheus HTTP API + recent logs (Loki/Fluent Bit) to collect context
- Context is fed to the on-device LLM on the AI HAT+ 2, which returns a classification, a priority, and a 3–5 line incident summary
- Triage service decides whether to escalate, snooze, or acknowledge, then posts to Slack or PagerDuty with summary + runbook links
Step 1 — OS & base setup
Flash a 64-bit image (Ubuntu 24.04 or Raspberry Pi OS) and complete the first-boot updates. Enable SSH and configure a static IP or DHCP reservation so Alertmanager can reach the device reliably.
Install basic packages and Docker:
sudo apt update && sudo apt upgrade -y
sudo apt install -y docker.io docker-compose git python3-venv python3-pip
Step 2 — install AI HAT+ 2 runtime
The AI HAT+ 2 vendor provides an SDK and drivers. Use the vendor instructions to install firmware and runtime bindings. Typical steps:
- Enable required interfaces (SPI/I2C) via raspi-config or Ubuntu device config
- Install the SDK (pip wheel or apt package provided by vendor)
- Verify with the vendor-supplied sample inference script
After installation, run a simple inference test to ensure your model runs and the NPU responds.
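Vendor sample scripts vary, so a portable way to smoke-test is to wrap whatever completion call the SDK exposes in a small timing harness. A sketch under that assumption (the `generate` callable stands in for the vendor API; the lambda in the usage comment adapts the `LocalModel` interface used later in this guide, which is itself a placeholder):

```python
import time

def smoke_test(generate, prompt: str = "Reply with the single word: ready",
               max_tokens: int = 8):
    """Run one short inference and time it; `generate` is whatever
    text-completion callable the vendor SDK provides."""
    start = time.monotonic()
    out = generate(prompt, max_tokens)
    elapsed = time.monotonic() - start
    return out, elapsed

# Usage with a hypothetical SDK interface:
# from ai_hat_sdk import LocalModel
# model = LocalModel(model_path='/models/triage-ggml.bin')
# print(smoke_test(lambda p, n: model.generate(p, max_tokens=n)))
```

If the call returns a completion in well under a second for a short prompt, the NPU path is working; multi-second latencies usually mean the model is falling back to CPU.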
Step 3 — deploy the monitoring stack (Docker Compose)
Use Docker Compose to keep services isolated. Here’s a concise example compose file structure (trimmed):
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus:/etc/prometheus
    ports:
      - 9090:9090
  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    ports:
      - 9093:9093
  node-exporter:
    image: prom/node-exporter
    network_mode: host
  triage:
    build: ./triage
    ports:
      - 8080:8080
    devices:
      - /dev/ai_hat_npu:/dev/ai_hat_npu  # vendor-specific device
    volumes:
      - ./data:/data
This runs Prometheus + Alertmanager locally and exposes a triage FastAPI service on port 8080.
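For Alertmanager to deliver alerts to the triage service, its config needs a webhook receiver pointing at the container. A minimal ./alertmanager/alertmanager.yml sketch (the receiver name and single-route tree are assumptions — fold this into your existing routing rules):

```yaml
route:
  receiver: pi-triage
receivers:
  - name: pi-triage
    webhook_configs:
      - url: http://triage:8080/alert
        send_resolved: true
```

The `triage` hostname resolves inside the Compose network; `send_resolved` lets the service auto-close incidents when the alert clears.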
Step 4 — write the triage service
Core responsibilities:
- Authenticate incoming webhooks (HMAC verification)
- Query Prometheus API for the last N datapoints for the affected metric
- Fetch log snippets (Loki/Fluent Bit) for the window around the alert
- Call the on-device LLM with a structured prompt to: classify, assign severity, and produce a summary
- Decide the action: escalate, snooze, or acknowledge; then send notifications
Minimal FastAPI webhook (concept)
from fastapi import FastAPI, Request
import requests
import sqlite3

# placeholder import - use the AI HAT+ 2 SDK module
from ai_hat_sdk import LocalModel

app = FastAPI()
model = LocalModel(model_path='/models/triage-ggml.bin')
PROM_URL = 'http://prometheus:9090'

@app.post('/alert')
async def alert_hook(req: Request):
    payload = await req.json()
    # 1) verify the HMAC signature here before trusting the payload
    # 2) extract alert fields
    alert = payload['alerts'][0]
    metric = alert['labels'].get('metric')
    # 3) query Prometheus for the last 5 minutes of the metric
    # (pass the query via params so it is URL-encoded correctly)
    res = requests.get(f"{PROM_URL}/api/v1/query",
                       params={'query': f"{metric}[5m]"}).json()
    timeseries = res['data']['result']
    # 4) build prompt
    prompt = build_prompt(alert, timeseries)
    # 5) run local inference
    result = model.generate(prompt, max_tokens=256)
    # 6) parse and send notification
    summary, severity, action = parse_model_output(result)
    post_to_slack(summary, severity)
    # 7) store to SQLite for audit
    conn = sqlite3.connect('/data/incidents.db')
    conn.execute('INSERT INTO incidents (summary, severity) VALUES (?, ?)',
                 (summary, severity))
    conn.commit()
    conn.close()
    return {'status': 'ok'}
Note: replace ai_hat_sdk with the official SDK exposed by AI HAT+ 2. The pattern is the same for any local LLM runtime.
Step 5 — prompt engineering for triage
Design prompts to be explicit and prescriptive. Your prompt should:
- Start with a short instruction: classify the alert as noise, warning, or incident
- Include compact context: metric name, recent datapoints, top log snippets, service owner, runbook URL
- Limit output to structured JSON: {"severity":"","summary":"","next_steps":[...],"confidence":0.0}
Instruction: You are an SRE assistant. Given the metric and logs, return a JSON with severity, a 2-line summary, 2 suggested next steps, and confidence.
Metric: http_request_duration_seconds
Recent points: [0.23,0.25,0.95,1.8,2.3]
Logs: "timeout while connecting to upstream", "upstream 10.0.0.5 reset"
Output JSON:
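Deterministic parsing is what makes the strict JSON output worth the effort. One way to sketch the `parse_model_output` helper referenced in the webhook (the return shape and the 'warning' fallback are design assumptions, not a fixed API — the key idea is that a malformed reply must never silently drop an alert):

```python
import json

VALID_SEVERITIES = {'noise', 'warning', 'incident'}

def parse_model_output(raw: str):
    """Extract and validate the JSON object from a model completion.
    Falls back to a conservative 'warning' on any parse failure."""
    try:
        # tolerate stray prose around the JSON by slicing the outermost braces
        data = json.loads(raw[raw.index('{'): raw.rindex('}') + 1])
    except ValueError:  # covers both missing braces and JSONDecodeError
        return 'model output unparseable, needs human review', 'warning', 'review'
    severity = data.get('severity')
    if severity not in VALID_SEVERITIES:
        severity = 'warning'
    action = 'escalate' if severity == 'incident' else 'review'
    return data.get('summary', ''), severity, action
```

Clamping unknown severities to 'warning' means a hallucinated label gets a human look rather than an automatic page or an automatic dismissal.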
Step 6 — secure the pipeline
- HMAC-sign Alertmanager webhooks using a shared secret
- Run the triage service behind a reverse proxy (NGINX) with TLS
- Limit SSH access and use firewall rules to permit only the monitoring stack
- Scrub sensitive log content before storing or sending off-device
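Alertmanager has no native HMAC signing, so the usual pattern is to add the signature at a small proxy (or a custom sender) and verify it in the triage service. A verification sketch, assuming a GitHub-style `sha256=<hexdigest>` header convention (the header name and prefix are assumptions):

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a hex-encoded HMAC-SHA256 of the raw request body against the
    value sent in the signature header, using a constant-time comparison."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header.removeprefix('sha256=')
    return hmac.compare_digest(expected, received)
```

Always verify against the raw body bytes, not the re-serialized JSON, and reject the request with a 401 before doing any Prometheus queries or inference.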
Step 7 — notification and routing logic
Simple routing rules you can implement:
- severity == incident -> create PagerDuty event and post a Slack #oncall message
- severity == warning && confidence < 0.6 -> create a low-priority Slack thread for review
- noise -> auto-acknowledge in Alertmanager and log the event for analyst review
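These rules reduce to a small pure function, which keeps them easy to unit-test before they ever touch a pager. A sketch (sink names are illustrative; the high-confidence-warning fallback is an assumption the rules above leave unspecified):

```python
def route_alert(severity: str, confidence: float) -> list[str]:
    """Map a triage decision to notification sinks, mirroring the
    routing rules above."""
    if severity == 'incident':
        return ['pagerduty', 'slack#oncall']
    if severity == 'warning' and confidence < 0.6:
        return ['slack#triage-review']          # low-priority thread
    if severity == 'noise':
        return ['alertmanager-ack']             # auto-ack + log for review
    return ['slack#oncall']                     # e.g. high-confidence warning
```

Keeping routing separate from inference also lets you replay historical decisions against a new ruleset without re-running the model.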
Step 8 — test with simulated alerts
Before you flip this into production, run a simulation harness that:
- Generates Prometheus alerts with different signatures (CPU spikes, network errors, flapping)
- Measures triage latency and model inference time
- Checks false-positive and false-negative triage decisions and logs corrective labels
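A harness can replay canned payloads straight at the webhook. A sketch that builds a minimal Alertmanager-v4-style payload (field layout follows Alertmanager's documented webhook format; the `pi-triage` receiver name matches the earlier examples and is an assumption):

```python
from datetime import datetime, timezone

def make_test_alert(alertname: str, metric: str,
                    severity: str = 'warning') -> dict:
    """Build a minimal Alertmanager v4-style webhook payload for
    replaying against the triage endpoint."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        'version': '4',
        'status': 'firing',
        'receiver': 'pi-triage',
        'alerts': [{
            'status': 'firing',
            'labels': {'alertname': alertname, 'metric': metric,
                       'severity': severity},
            'annotations': {'summary': f'simulated {alertname}'},
            'startsAt': now,
        }],
    }

# replay with, e.g.:
# requests.post('http://localhost:8080/alert',
#               json=make_test_alert('CPUSpike', 'node_cpu_seconds_total'))
```

Timestamp the send and the triage decision in your incident table so the same harness run yields both latency numbers and labeled decisions for precision/recall tracking.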
Tuning for performance and cost
- Choose smaller, quantized models for the AI HAT+ 2 to keep inference under 1–2 seconds for short prompts
- Cache runbook snippets and frequently used patterns to avoid repeated inference for rerouted duplicates
- Use a CPU-only fallback when the NPU is busy (and re-queue the inference)
- Monitor Pi 5 thermal and CPU stats; add heatsink/active cooling to prevent throttling
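On Linux the SoC temperature is exposed in sysfs as millidegrees Celsius, so a thermal check before queueing inference is a few lines. A sketch (the thermal-zone path is the standard Linux location — verify which zone maps to the SoC on your image, and the 80 °C limit is a conservative assumption below the Pi's throttle point):

```python
def soc_temp_c(path: str = '/sys/class/thermal/thermal_zone0/temp') -> float:
    """Read the SoC temperature; the kernel reports millidegrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

def should_defer_inference(temp_c: float, limit_c: float = 80.0) -> bool:
    """Back off (re-queue or fall back to CPU) when nearing throttling."""
    return temp_c >= limit_c
```

Exporting this reading via node-exporter's textfile collector lets the same Prometheus stack alert you when your alert-triage node itself is overheating.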
Advanced features to add later
- Active learning loop: capture human corrections and periodically fine-tune a small local model or maintain a ruleset
- Distributed Pi cluster for high availability — use leader election and persistent storage for incident state
- Voice summaries for overnight on-call: TTS generated from summaries on the Pi speaker
- Integrate with runbook automation (Ansible/Playbooks) to run pre-approved remediation steps with gating
Operational best practices
- Keep a human-in-the-loop for high-impact incidents — the assistant should recommend, not unilaterally remediate
- Track precision/recall of triage decisions and put a cadence on review (weekly)
- Rotate keys and audit logs for your triage node regularly
- Define an override process so engineers can bypass the assistant during drills or incident swarms
2026 trends and how this fits your remote ops workflow
In 2026, more teams adopted hybrid observability: cloud-native monitoring with edge inference for fast pre-triage. The benefits are especially important for remote teams who rely on async context — a short, crisp summary from your on-call assistant reduces the need for immediate synchronous check-ins. Expect LLM model runtimes to get smaller and more optimized for accelerators like the AI HAT+ 2; plan to upgrade models annually and keep your prompt templates under version control.
Risks and tradeoffs
Edge LLMs are powerful but imperfect. Key tradeoffs:
- False negatives can delay critical incidents if the assistant misclassifies — always enforce human override and alert escalation timeouts
- On-device models may miss subtle cross-service correlations visible only in centralized analytics — use your assistant for first-pass triage, not final root-cause
- Hardware failure of your Pi must be planned for — have a cloud fallback or secondary Pi in HA configuration
Example: sample alert lifecycle (concise)
- Alertmanager sends webhook for increased 5xx rate to /alert
- Triage service queries last 5m of 5xx rate and pulls two relevant logs
- LLM outputs severity=incident, summary="Upstream 10.0.0.5 returning 502, likely connection resets", next_steps=["Notify backend owner", "Check upstream host 10.0.0.5 network"]
- Service creates a PagerDuty incident, posts the summary to Slack, and attaches runbook link
Actionable takeaways
- Start small: run the triage service in passive mode (post-only to a review channel) before automatic escalation
- Design strict JSON output prompts to keep parsing deterministic
- Secure webhooks with HMAC and TLS to prevent spoofed alerts
- Monitor Pi health and enforce HA or cloud fallback for critical on-call paths
Closing — why build this now
Edge AI on accessible hardware like the Raspberry Pi 5 + AI HAT+ 2 makes practical, low-cost on-call assistants a reality in 2026. For remote SREs and admins, that means fewer wakeups for noisy alerts, faster context, and more time for async response and automation. Start with the patterns above, tune prompts to your stack, and keep humans in control for high-stakes decisions.
Ready to build? Clone a starter triage repo, hook it to a lab Prometheus instance, and run simulations tonight. Share your results with your team and iterate: the fastest path to value is small, measurable improvements in pager noise and mean time to acknowledge.
Call to action
Try this in your home lab this week: set up a Pi 5 + AI HAT+ 2, deploy the minimal triage stack in passive mode, and run 50 simulated alerts. Track how many alerts the assistant filters or summarizes correctly, then tune prompts for 1–2 hours. If you want a reviewed starter repo and a pre-built prompt set tuned for Prometheus + Loki, sign up on remotejob.live or share your build in the community — I’ll review and suggest improvements based on your telemetry.