Monitoring & Alerts

System monitoring, alerting, and performance tracking for Proxmox infrastructure.

📊 Monitoring Overview

Comprehensive monitoring strategy:

System Metrics: CPU, memory, disk, network usage
Service Health: VM/container status and performance
Storage Monitoring: Disk health, ZFS status, backup verification
Network Monitoring: Connectivity, bandwidth, latency
Alert Management: Proactive notifications for issues

🔧 Built-in Proxmox Monitoring

Proxmox Web Interface Monitoring

System Status Dashboard:
- Node summary with resource usage
- VM/container status overview
- Storage utilization
- Network interface statistics

Performance Graphs:
- CPU usage over time
- Memory utilization trends
- Network traffic patterns
- Storage I/O statistics

Command Line Monitoring

# System resource usage
htop
iotop
iftop

# Proxmox-specific commands
pvesh get /nodes/$(hostname)/status
pvesh get /nodes/$(hostname)/storage
pvesh get /cluster/resources

# VM/Container status
qm list
pct list

# Storage status
zpool status
df -h

📈 Advanced Monitoring Stack

Prometheus + Grafana Setup

Deploy monitoring stack in LXC container:

# Create monitoring container
pct create 300 \
  local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname monitoring \
  --memory 4096 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1 \
  --storage local-lvm \
  --rootfs local-lvm:20

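Install Prometheus (run these commands inside the monitoring container, for example after pct start 300 and pct enter 300):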

# Update system
apt update && apt upgrade -y

# Create prometheus user
useradd --no-create-home --shell /bin/false prometheus

# Download and install Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvf prometheus-2.40.0.linux-amd64.tar.gz

# Install binaries
cp prometheus-2.40.0.linux-amd64/prometheus /usr/local/bin/
cp prometheus-2.40.0.linux-amd64/promtool /usr/local/bin/

# Set permissions
chown prometheus:prometheus /usr/local/bin/prometheus
chown prometheus:prometheus /usr/local/bin/promtool

# Create directories
mkdir /etc/prometheus
mkdir /var/lib/prometheus
chown prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus

Prometheus Configuration:

# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['192.168.1.240:9100']  # Proxmox host

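  - job_name: 'pve-exporter'
    metrics_path: /pve          # PVE metrics are served at /pve, not /metrics
    params:
      module: [default]         # matches the "default" block in pve.yml
    static_configs:
      - targets: ['192.168.1.240:9221']  # Proxmox PVE exporter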
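
The install steps above place the binaries and the configuration file, but they do not define a service, while the troubleshooting commands later in this page assume a prometheus unit exists. A minimal sketch of such a unit follows. The helper name write_prometheus_unit is hypothetical, and the restart and reload settings are assumptions rather than part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: writes a systemd unit for Prometheus to the given path
# (defaults to the standard location for locally administered units).
write_prometheus_unit() {
    local dest="${1:-/etc/systemd/system/prometheus.service}"
    cat > "$dest" << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root, inside the monitoring container):
#   write_prometheus_unit
#   systemctl daemon-reload
#   systemctl enable --now prometheus
```

Running as the prometheus account created earlier keeps the server unprivileged; the data directory /var/lib/prometheus is the one already chowned in the install steps.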

Install Grafana:

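# Add Grafana repository (apt-key is deprecated; use a dedicated keyring)
apt install -y gnupg wget
mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor -o /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" > /etc/apt/sources.list.d/grafana.list

# Install Grafana
apt update
apt install -y grafana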

# Enable and start Grafana
systemctl enable grafana-server
systemctl start grafana-server

Node Exporter Setup

Install on Proxmox host:

# Create node_exporter user
useradd --no-create-home --shell /bin/false node_exporter

# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
cp node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

Create systemd service:

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
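
Once the service is running, it is worth confirming that metrics are actually exposed before pointing Prometheus at the host. The sketch below assumes the exporter listens on its default port 9100; the helper name has_metric is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# succeeds if at least one sample line for the named metric is present.
# Comment lines (# HELP / # TYPE) are ignored because they start with '#'.
has_metric() {
    local name="$1"
    grep -Eq "^${name}(\{| )"
}

# Usage (on the Proxmox host):
#   curl -fsS http://localhost:9100/metrics | has_metric node_cpu_seconds_total \
#       && echo "node_exporter is serving CPU metrics"
```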

PVE Exporter for Proxmox

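# Install PVE exporter into a virtual environment
# (Debian 12 / Proxmox VE 8 blocks system-wide pip installs)
apt install -y python3-venv
python3 -m venv /opt/prometheus-pve-exporter
/opt/prometheus-pve-exporter/bin/pip install prometheus-pve-exporter
ln -s /opt/prometheus-pve-exporter/bin/pve_exporter /usr/local/bin/pve_exporter

# Create configuration (the file holds a password, so restrict access)
mkdir -p /etc/prometheus
install -m 600 /dev/null /etc/prometheus/pve.yml
cat > /etc/prometheus/pve.yml << 'EOF'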
default:
  user: monitoring@pve
  password: your-monitoring-password
  verify_ssl: false
EOF

# Create systemd service
cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Proxmox VE Exporter

[Service]
ExecStart=/usr/local/bin/pve_exporter --config.file /etc/prometheus/pve.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
systemctl daemon-reload
systemctl enable pve-exporter
systemctl start pve-exporter
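
The configuration above refers to a monitoring@pve account, which has to exist in Proxmox before the exporter can log in. A minimal sketch of creating it with read-only rights follows. The helper name create_monitoring_user is hypothetical, and granting the built-in PVEAuditor role on the root path is an assumption about the access level wanted.

```shell
#!/bin/bash
# Hypothetical helper: creates a PVE-realm user for the exporter and grants
# it the read-only PVEAuditor role on the whole tree ("/").
# Note: the password is passed on the command line and is therefore visible
# in the process list while pveum runs.
create_monitoring_user() {
    local user="${1:-monitoring@pve}"
    local password="$2"
    pveum user add "$user" --password "$password" --comment "Prometheus exporter" &&
    pveum acl modify / --users "$user" --roles PVEAuditor
}

# Usage (on the Proxmox host, as root):
#   create_monitoring_user monitoring@pve 'your-monitoring-password'
```

The password given here must match the one written to /etc/prometheus/pve.yml.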

🚨 Alert Configuration

Alertmanager Setup

# Download and install Alertmanager
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xvf alertmanager-0.25.0.linux-amd64.tar.gz
cp alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
cp alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/
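
The commands above install the binaries only, yet the troubleshooting section runs systemctl status alertmanager, which presumes a service. One possible unit is sketched below. The helper name write_alertmanager_unit, the reuse of the prometheus account, and the /var/lib/alertmanager data directory are all assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: writes a systemd unit for Alertmanager to the given path.
write_alertmanager_unit() {
    local dest="${1:-/etc/systemd/system/alertmanager.service}"
    cat > "$dest" << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
    --config.file=/etc/prometheus/alertmanager.yml \
    --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root, on the machine where Alertmanager was installed):
#   mkdir -p /var/lib/alertmanager
#   chown prometheus:prometheus /var/lib/alertmanager
#   write_alertmanager_unit
#   systemctl daemon-reload
#   systemctl enable --now alertmanager
```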

Alertmanager Configuration:

# /etc/prometheus/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-admin'

receivers:
- name: 'email-admin'
  email_configs:
  - to: 'admin@yourdomain.com'
    headers:
      Subject: 'Proxmox Alert: {{ .GroupLabels.alertname }}'
    text: |
      {{ range .Alerts }}
      Alert: {{ .Annotations.summary }}
      Description: {{ .Annotations.description }}
      {{ end }}
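
Before relying on this configuration, it helps to validate the file and push one synthetic alert through the pipeline. The sketch below uses amtool, which was installed alongside Alertmanager. The helper name send_test_alert, the TestAlert label values, and the default URL http://localhost:9093 are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: validates the Alertmanager config, then submits a
# synthetic alert so the configured receiver can be checked end to end.
send_test_alert() {
    local config="${1:-/etc/prometheus/alertmanager.yml}"
    local url="${2:-http://localhost:9093}"

    # Stop early if the configuration does not parse
    amtool check-config "$config" || return 1

    amtool alert add \
        --alertmanager.url="$url" \
        --annotation=summary="Test alert from $(hostname)" \
        --annotation=description="Verifying the notification pipeline" \
        alertname=TestAlert severity=warning
}

# Usage (with Alertmanager running):
#   send_test_alert
```

With the route shown above, the notification should arrive after the configured group_wait of 10 seconds.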

Alert Rules

# /etc/prometheus/alert_rules.yml
groups:
- name: proxmox_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for more than 5 minutes"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90%"

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs|overlay|squashfs"} / node_filesystem_size_bytes) * 100 < 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Disk space is below 10% on {{ $labels.mountpoint }}"

  - alert: VMDown
    expr: pve_up == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "VM/Container {{ $labels.id }} is down"
      description: "{{ $labels.id }} has been reported down by the PVE exporter for more than 2 minutes"
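
After editing the rules file, Prometheus has to re-read it, and a syntax error is easier to catch before the reload than after. The sketch below checks the file with promtool and only then signals the server. It assumes Prometheus runs under a systemd unit named prometheus; the helper name reload_rules is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: validates the rules file and, only if it passes,
# asks Prometheus to reload its configuration (Prometheus reloads on SIGHUP).
reload_rules() {
    local rules="${1:-/etc/prometheus/alert_rules.yml}"
    if promtool check rules "$rules"; then
        systemctl kill --signal=HUP prometheus
    else
        echo "Rule check failed; Prometheus not reloaded" >&2
        return 1
    fi
}

# Usage:
#   reload_rules
```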

📱 Notification Channels

Email Notifications

Configure SMTP in Proxmox:

Datacenter → Notifications
Add → SMTP Endpoint
Configure SMTP settings:
- Server: smtp.gmail.com
- Port: 587
- Username/Password: Your credentials
- Enable TLS

Test email notifications:

# Test email from command line
echo "Test message" | mail -s "Proxmox Test" admin@yourdomain.com

Slack Integration

# Add to alertmanager.yml
receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
    channel: '#alerts'
    title: 'Proxmox Alert'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Discord Integration

# Discord webhook script
cat > /usr/local/bin/discord-alert.sh << 'EOF'
#!/bin/bash

WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/WEBHOOK/URL"
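MESSAGE="${1:?usage: discord-alert.sh MESSAGE}"

# Escape backslashes and double quotes so the payload stays valid JSON
MESSAGE="${MESSAGE//\\/\\\\}"
MESSAGE="${MESSAGE//\"/\\\"}"

# -f makes curl exit non-zero on HTTP errors; -sS hides progress but shows errors
curl -fsS -H "Content-Type: application/json" \
     -X POST \
     -d "{\"content\": \"🚨 Proxmox Alert: $MESSAGE\"}" \
     "$WEBHOOK_URL"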
EOF

chmod +x /usr/local/bin/discord-alert.sh

📊 Custom Monitoring Scripts

System Health Monitor

cat > /usr/local/bin/system-health.sh << 'EOF'
#!/bin/bash

# System Health Monitoring Script

ALERT_EMAIL="admin@yourdomain.com"
CPU_THRESHOLD=80
MEMORY_THRESHOLD=90
DISK_THRESHOLD=90

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

send_alert() {
    local subject="$1"
    local message="$2"
    echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
    log "ALERT SENT: $subject"
}

# Check CPU usage (100 minus the idle percentage from a 1-second vmstat sample)
cpu_usage=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
if awk -v u="$cpu_usage" -v t="$CPU_THRESHOLD" 'BEGIN {exit !(u > t)}'; then
    send_alert "High CPU Usage Alert" "CPU usage is ${cpu_usage}% (threshold: ${CPU_THRESHOLD}%)"
fi

# Check memory usage
memory_usage=$(free | awk '/^Mem:/ {printf("%.1f", ($3/$2) * 100.0)}')
if awk -v u="$memory_usage" -v t="$MEMORY_THRESHOLD" 'BEGIN {exit !(u > t)}'; then
    send_alert "High Memory Usage Alert" "Memory usage is ${memory_usage}% (threshold: ${MEMORY_THRESHOLD}%)"
fi

# Check disk usage (skip tmpfs/devtmpfs pseudo-filesystems)
df -P -x tmpfs -x devtmpfs | awk 'NR>1 {print $5, $6}' | while read -r usage partition; do
    usage=${usage%\%}
    if [ "$usage" -ge "$DISK_THRESHOLD" ]; then
        send_alert "Low Disk Space Alert" "Disk usage on $partition is ${usage}% (threshold: ${DISK_THRESHOLD}%)"
    fi
done

# Check ZFS pool health
if command -v zpool >/dev/null 2>&1; then
    zpool_status=$(zpool status | grep -E "DEGRADED|FAULTED|OFFLINE|UNAVAIL")
    if [ -n "$zpool_status" ]; then
        send_alert "ZFS Pool Health Alert" "ZFS pool issues detected: $zpool_status"
    fi
fi

log "System health check completed"
EOF

chmod +x /usr/local/bin/system-health.sh
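
When this script runs from cron every few minutes, a condition that persists (a full disk, for instance) produces one email per run. A simple cooldown keyed by alert type can limit that. The sketch below is one way to do it; the names should_alert, STATE_DIR and COOLDOWN_SECONDS are hypothetical, as are the state directory and the one-hour default.

```shell
#!/bin/bash
# Hypothetical settings: where timestamps are kept and how long to stay quiet.
STATE_DIR="${STATE_DIR:-/var/tmp/health-alerts}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-3600}"

# Succeeds (exit 0) if no alert with this key was recorded within the
# cooldown window, and records the current time as the last alert.
# Fails (exit 1) while the key is still inside its cooldown window.
should_alert() {
    local key="$1"
    local stamp="$STATE_DIR/$key.last"
    local now last
    now=$(date +%s)
    mkdir -p "$STATE_DIR"
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
        if [ $(( now - last )) -lt "$COOLDOWN_SECONDS" ]; then
            return 1    # still inside the cooldown window
        fi
    fi
    echo "$now" > "$stamp"
    return 0
}

# Usage inside system-health.sh:
#   should_alert cpu && send_alert "High CPU Usage Alert" "..."
```

Using a separate key per check (cpu, memory, one per mount point) keeps a noisy alert from silencing an unrelated one.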

Service Monitoring

cat > /usr/local/bin/service-monitor.sh << 'EOF'
#!/bin/bash

# Service Monitoring Script

SERVICES=(
    "pveproxy"
    "pvedaemon"
    "pve-cluster"
    "docker"
)

ALERT_EMAIL="admin@yourdomain.com"

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

send_alert() {
    local subject="$1"
    local message="$2"
    echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
    log "ALERT SENT: $subject"
}

for service in "${SERVICES[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        send_alert "Service Down Alert" "Service $service is not running on $(hostname)"
        log "ERROR: Service $service is down"

        # Attempt to restart service
        systemctl restart "$service"
        sleep 5

        if systemctl is-active --quiet "$service"; then
            send_alert "Service Recovered" "Service $service has been restarted successfully on $(hostname)"
            log "INFO: Service $service restarted successfully"
        else
            send_alert "Service Restart Failed" "Failed to restart service $service on $(hostname)"
            log "ERROR: Failed to restart service $service"
        fi
    else
        log "OK: Service $service is running"
    fi
done
EOF

chmod +x /usr/local/bin/service-monitor.sh

⏰ Monitoring Schedule

Cron Configuration

# Edit root crontab
crontab -e

# Add monitoring schedules
# System health check every 5 minutes
*/5 * * * * /usr/local/bin/system-health.sh

# Service monitoring every 2 minutes
*/2 * * * * /usr/local/bin/service-monitor.sh

# Backup verification daily at 6 AM
0 6 * * * /usr/local/bin/backup-verify.sh

# Generate daily status report at 8 AM
0 8 * * * /usr/local/bin/backup-status.sh | mail -s "Daily Proxmox Status" admin@yourdomain.com

📱 Dashboard Setup

Grafana Dashboard Configuration

Import Proxmox Dashboard:

Access Grafana: http://monitoring-ip:3000
Login: admin/admin (change password)
Add Prometheus data source: http://localhost:9090
Import dashboard: Use dashboard ID 10347 for Proxmox

Custom Dashboard Panels:
- CPU usage by VM/container
- Memory utilization trends
- Storage I/O performance
- Network traffic patterns
- Backup job status
- Alert summary

Web-based Status Page

# Simple status page generator
cat > /usr/local/bin/generate-status.sh << 'EOF'
#!/bin/bash

STATUS_FILE="/var/www/html/status.html"

cat > "$STATUS_FILE" << EOL
<!DOCTYPE html>
<html>
<head>
    <title>Proxmox Status</title>
    <meta http-equiv="refresh" content="60">
</head>
<body>
    <h1>Proxmox Infrastructure Status</h1>
    <p>Last updated: $(date)</p>

    <h2>System Resources</h2>
    <pre>$(df -h)</pre>

    <h2>Running VMs</h2>
    <pre>$(qm list)</pre>

    <h2>Running Containers</h2>
    <pre>$(pct list)</pre>

    <h2>Recent Alerts</h2>
    <pre>$(tail -20 /var/log/syslog | grep -i alert || echo "No recent alerts")</pre>
</body>
</html>
EOL
EOF

chmod +x /usr/local/bin/generate-status.sh

📋 Monitoring Checklist

Daily Monitoring Tasks:

[ ] Review dashboard for anomalies
[ ] Check alert notifications and resolve issues
[ ] Verify backup completion status
[ ] Monitor resource usage trends
[ ] Check service health status

Weekly Monitoring Tasks:

[ ] Review performance trends over the week
[ ] Update alert thresholds if needed
[ ] Test notification channels
[ ] Clean up old monitoring data
[ ] Review and tune monitoring rules

Monthly Monitoring Tasks:

[ ] Capacity planning based on trends
[ ] Update monitoring tools and dashboards
[ ] Review alert effectiveness
[ ] Document any monitoring changes
[ ] Test disaster recovery monitoring

🚨 Troubleshooting

Common Monitoring Issues

Prometheus Not Scraping:

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check service status
systemctl status prometheus

# Check configuration
promtool check config /etc/prometheus/prometheus.yml

Grafana Connection Issues:

# Check Grafana logs
journalctl -u grafana-server

# Test data source connection
curl 'http://localhost:9090/api/v1/query?query=up'

Alert Not Firing:

# Check alert rules
promtool check rules /etc/prometheus/alert_rules.yml

# Check Alertmanager status
systemctl status alertmanager