Monitoring & Alerts
System monitoring, alerting, and performance tracking for Proxmox infrastructure.
📊 Monitoring Overview
The monitoring strategy covers five areas:
System Metrics: CPU, memory, disk, network usage
Service Health: VM/container status and performance
Storage Monitoring: Disk health, ZFS status, backup verification
Network Monitoring: Connectivity, bandwidth, latency
Alert Management: Proactive notifications for issues
🔧 Built-in Proxmox Monitoring
Proxmox Web Interface Monitoring
System Status Dashboard:
- Node summary with resource usage
- VM/container status overview
- Storage utilization
- Network interface statistics
Performance Graphs:
- CPU usage over time
- Memory utilization trends
- Network traffic patterns
- Storage I/O statistics
Command Line Monitoring
# System resource usage (if missing, install with: apt install htop iotop iftop)
htop
iotop
iftop
# Proxmox-specific commands
pvesh get /nodes/$(hostname)/status
pvesh get /nodes/$(hostname)/storage
pvesh get /cluster/resources
# VM/Container status
qm list
pct list
# Storage status
zpool status
df -h
📈 Advanced Monitoring Stack
Prometheus + Grafana Setup
Deploy the monitoring stack in an LXC container:
# Create monitoring container
pct create 300 \
  local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
  --hostname monitoring \
  --memory 4096 \
  --cores 2 \
  --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1 \
  --storage local-lvm \
  --rootfs local-lvm:20
Install Prometheus:
# Update system
apt update && apt upgrade -y
# Create prometheus user
useradd --no-create-home --shell /bin/false prometheus
# Download and install Prometheus
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvf prometheus-2.40.0.linux-amd64.tar.gz
# Install binaries
cp prometheus-2.40.0.linux-amd64/prometheus /usr/local/bin/
cp prometheus-2.40.0.linux-amd64/promtool /usr/local/bin/
# Set permissions
chown prometheus:prometheus /usr/local/bin/prometheus
chown prometheus:prometheus /usr/local/bin/promtool
# Create directories
mkdir /etc/prometheus
mkdir /var/lib/prometheus
chown prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus
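The steps above install the binaries and create the directories but do not define a service, although the troubleshooting section runs systemctl status prometheus. A minimal sketch of a unit file, modeled on the node_exporter unit in this guide, might look like the following. The function name write_prometheus_unit and its optional directory argument are hypothetical, added only so the file can be written to a scratch directory for review first.

```shell
#!/bin/bash
# Sketch: write a systemd unit for Prometheus.
# Assumes the binary and directories created in the install steps above.
write_prometheus_unit() {
    local dir="${1:-/etc/systemd/system}"   # hypothetical override for dry runs
    cat > "$dir/prometheus.service" << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# On the monitoring container, once prometheus.yml exists:
#   write_prometheus_unit
#   systemctl daemon-reload
#   systemctl enable --now prometheus
```

Writing to a scratch directory first (for example write_prometheus_unit /tmp) lets you inspect the unit before installing it.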
Prometheus Configuration:
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['192.168.1.240:9100'] # Proxmox host
  - job_name: 'pve-exporter'
    metrics_path: /pve # the exporter serves Proxmox metrics on /pve
    params:
      module: [default] # section name in /etc/prometheus/pve.yml
    static_configs:
      - targets: ['192.168.1.240:9221'] # Proxmox PVE exporter
Install Grafana:
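# Add Grafana repository (apt-key is deprecated, so store the key in a dedicated keyring)
apt install -y wget gnupg
mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" > /etc/apt/sources.list.d/grafana.list
# Install Grafana
apt update
apt install -y grafana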
# Enable and start Grafana
systemctl enable grafana-server
systemctl start grafana-server
Node Exporter Setup
Install on the Proxmox host:
# Create node_exporter user
useradd --no-create-home --shell /bin/false node_exporter
# Download and install
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
cp node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
Create systemd service:
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
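PVE Exporter for Proxmox
# Install PVE exporter
pip3 install prometheus-pve-exporter
# Create configuration (mkdir -p in case /etc/prometheus does not exist on this host)
mkdir -p /etc/prometheus
cat > /etc/prometheus/pve.yml << 'EOF'
default:
  user: monitoring@pve
  password: your-monitoring-password
  verify_ssl: false
EOF
# The file contains a password, so restrict it to root
chmod 600 /etc/prometheus/pve.yml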
# Create systemd service
cat > /etc/systemd/system/pve-exporter.service << 'EOF'
[Unit]
Description=Proxmox VE Exporter
[Service]
ExecStart=/usr/local/bin/pve_exporter --config.file /etc/prometheus/pve.yml
Restart=always
[Install]
WantedBy=multi-user.target
EOF
# Reload systemd so it picks up the new unit, then enable and start it
systemctl daemon-reload
systemctl enable pve-exporter
systemctl start pve-exporter
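The exporter configuration refers to a monitoring@pve account that this guide never creates. A minimal sketch of creating it with read-only access is shown below, assuming the built-in PVEAuditor role is sufficient for the exporter. The run and create_monitoring_user helpers and the DRY_RUN variable are hypothetical names used here so the commands can be previewed before they touch the cluster.

```shell
#!/bin/bash
# Sketch: create a read-only Proxmox user for the PVE exporter.

# Print the command instead of executing it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

create_monitoring_user() {
    local user="${1:-monitoring@pve}"
    # Add the user in the pve realm
    run pveum user add "$user" --comment "Read-only Prometheus account"
    # Grant the read-only auditor role on the whole tree
    run pveum acl modify / --users "$user" --roles PVEAuditor
}

# Preview, then apply on a Proxmox node:
#   DRY_RUN=1 create_monitoring_user
#   create_monitoring_user
#   pveum passwd monitoring@pve    # set the password used in pve.yml
```

The password is set separately and interactively with pveum passwd so that it does not end up in shell history.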
🚨 Alert Configuration
Alertmanager Setup
# Download and install Alertmanager
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar xvf alertmanager-0.25.0.linux-amd64.tar.gz
cp alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
cp alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/
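As with Prometheus, the install steps copy the binaries but define no service, while the troubleshooting section checks systemctl status alertmanager. The sketch below writes a unit file under two assumptions that are not in the original: Alertmanager runs as the existing prometheus user, and its state lives in /var/lib/alertmanager. The function name write_alertmanager_unit is hypothetical.

```shell
#!/bin/bash
# Sketch: write a systemd unit for Alertmanager.
# Assumptions: runs as the prometheus user, state kept in /var/lib/alertmanager.
write_alertmanager_unit() {
    local dir="${1:-/etc/systemd/system}"   # hypothetical override for dry runs
    cat > "$dir/alertmanager.service" << 'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/prometheus/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# On the monitoring container, once alertmanager.yml exists:
#   mkdir -p /var/lib/alertmanager
#   chown prometheus:prometheus /var/lib/alertmanager
#   write_alertmanager_unit
#   systemctl daemon-reload
#   systemctl enable --now alertmanager
```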
Alertmanager Configuration:
# /etc/prometheus/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@yourdomain.com'
  smtp_auth_username: 'alerts@yourdomain.com'
  smtp_auth_password: 'your-app-password'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@yourdomain.com'
        # email_configs has no subject/body keys: the subject is a header, the body is text
        headers:
          Subject: 'Proxmox Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
Alert Rules
# /etc/prometheus/alert_rules.yml
groups:
  - name: proxmox_alerts
    rules:
      - alert: HighCPUUsage
        # rate() averages over the whole window; irate() uses only the last two samples
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90%"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on {{ $labels.mountpoint }}"
      - alert: VMDown
        # pve_up identifies the node or guest in its id label (for example qemu/100)
        expr: pve_up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "VM/Container {{ $labels.id }} is down"
          description: "{{ $labels.id }} (reported by {{ $labels.instance }}) has been down for more than 2 minutes"
📱 Notification Channels
Email Notifications
Configure SMTP in Proxmox:
Datacenter → Notifications
Add → SMTP Endpoint
Configure SMTP settings:
- Server: smtp.gmail.com
- Port: 587
- Username/Password: Your credentials
- Enable TLS (STARTTLS on port 587)
Test email notifications:
# Test email from command line
echo "Test message" | mail -s "Proxmox Test" admin@yourdomain.com
Slack Integration
# Add to alertmanager.yml, then set a route's receiver to 'slack-notifications'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Proxmox Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Discord Integration
# Discord webhook script
cat > /usr/local/bin/discord-alert.sh << 'EOF'
#!/bin/bash
WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/WEBHOOK/URL"
MESSAGE="$1"
curl -H "Content-Type: application/json" \
  -X POST \
  -d "{\"content\": \"🚨 Proxmox Alert: $MESSAGE\"}" \
  "$WEBHOOK_URL"
EOF
chmod +x /usr/local/bin/discord-alert.sh
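The script above interpolates $MESSAGE straight into the JSON body, so a message containing a double quote, backslash, or newline produces invalid JSON. One way to guard against that in plain bash is sketched below. The helper names json_escape and build_payload are hypothetical, and the escaping covers only the characters listed in the comments, not every control character.

```shell
#!/bin/bash
# Sketch: escape a string for use inside a JSON string literal.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}      # backslash    -> \\
    s=${s//\"/\\\"}      # double quote -> \"
    s=${s//$'\n'/\\n}    # newline      -> \n
    s=${s//$'\t'/\\t}    # tab          -> \t
    printf '%s' "$s"
}

# Build the webhook body for a given message.
build_payload() {
    printf '{"content": "%s"}' "$(json_escape "$1")"
}

# In discord-alert.sh the curl call would then become:
#   curl -H "Content-Type: application/json" -X POST \
#     -d "$(build_payload "Proxmox Alert: $MESSAGE")" "$WEBHOOK_URL"
```

If jq is installed, building the payload with jq is the more robust choice; the pure-bash version avoids the extra dependency.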
📊 Custom Monitoring Scripts
System Health Monitor
cat > /usr/local/bin/system-health.sh << 'EOF'
#!/bin/bash
# System Health Monitoring Script
ALERT_EMAIL="admin@yourdomain.com"
CPU_THRESHOLD=80
MEMORY_THRESHOLD=90
DISK_THRESHOLD=90
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}
send_alert() {
local subject="$1"
local message="$2"
echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
log "ALERT SENT: $subject"
}
# Check CPU usage: 100 minus the idle column (field 15) of a one-second vmstat sample
cpu_usage=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
    send_alert "High CPU Usage Alert" "CPU usage is ${cpu_usage}% (threshold: ${CPU_THRESHOLD}%)"
fi
# Check memory usage
memory_usage=$(free | grep Mem | awk '{printf("%.1f", ($3/$2) * 100.0)}')
if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
    send_alert "High Memory Usage Alert" "Memory usage is ${memory_usage}% (threshold: ${MEMORY_THRESHOLD}%)"
fi
# Check disk usage (POSIX output format, skipping tmpfs and devtmpfs pseudo-filesystems)
df -P -x tmpfs -x devtmpfs | awk 'NR>1 {print $5 " " $6}' | while read -r usage partition; do
    usage=${usage%\%}
    if [ "$usage" -ge "$DISK_THRESHOLD" ]; then
        send_alert "Low Disk Space Alert" "Disk usage on $partition is ${usage}% (threshold: ${DISK_THRESHOLD}%)"
    fi
done
# Check ZFS pool health
if command -v zpool >/dev/null 2>&1; then
zpool_status=$(zpool status | grep -E "DEGRADED|FAULTED|OFFLINE|UNAVAIL")
if [ -n "$zpool_status" ]; then
send_alert "ZFS Pool Health Alert" "ZFS pool issues detected: $zpool_status"
fi
fi
log "System health check completed"
EOF
chmod +x /usr/local/bin/system-health.sh
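The health script compares fractional values by piping an expression to bc, which fails if bc is not installed on the host. A minimal sketch of the same comparison using awk, which the script already depends on, is shown below. The function name exceeds is hypothetical.

```shell
#!/bin/bash
# Sketch: succeed (exit 0) when value is strictly greater than threshold.
# Both arguments may be fractional, e.g. 85.3.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example use in system-health.sh:
#   if exceeds "$memory_usage" "$MEMORY_THRESHOLD"; then
#       send_alert "High Memory Usage Alert" "Memory usage is ${memory_usage}%"
#   fi
```

Adding 0 to each variable forces a numeric rather than string comparison in awk.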
Service Monitoring
cat > /usr/local/bin/service-monitor.sh << 'EOF'
#!/bin/bash
# Service Monitoring Script
SERVICES=(
"pveproxy"
"pvedaemon"
"pve-cluster"
"docker"
)
ALERT_EMAIL="admin@yourdomain.com"
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}
send_alert() {
local subject="$1"
local message="$2"
echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
log "ALERT SENT: $subject"
}
for service in "${SERVICES[@]}"; do
    if ! systemctl is-active --quiet "$service"; then
        send_alert "Service Down Alert" "Service $service is not running on $(hostname)"
        log "ERROR: Service $service is down"
        # Attempt to restart service
        systemctl restart "$service"
        sleep 5
        if systemctl is-active --quiet "$service"; then
            send_alert "Service Recovered" "Service $service has been restarted successfully on $(hostname)"
            log "INFO: Service $service restarted successfully"
        else
            send_alert "Service Restart Failed" "Failed to restart service $service on $(hostname)"
            log "ERROR: Failed to restart service $service"
        fi
    else
        log "OK: Service $service is running"
    fi
done
EOF
chmod +x /usr/local/bin/service-monitor.sh
⏰ Monitoring Schedule
Cron Configuration
# Edit root crontab
crontab -e
# Add monitoring schedules
# System health check every 5 minutes
*/5 * * * * /usr/local/bin/system-health.sh
# Service monitoring every 2 minutes
*/2 * * * * /usr/local/bin/service-monitor.sh
# Backup verification daily at 6 AM
0 6 * * * /usr/local/bin/backup-verify.sh
# Generate daily status report at 8 AM
0 8 * * * /usr/local/bin/backup-status.sh | mail -s "Daily Proxmox Status" admin@yourdomain.com
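With the schedule above, system-health.sh runs every 5 minutes and sends a new email on every run for as long as a threshold stays exceeded. One way to limit repeats is a per-alert cooldown, sketched below. The function should_alert, the STATE_DIR location, and the one-hour COOLDOWN default are all assumptions for illustration, not part of the original scripts.

```shell
#!/bin/bash
# Sketch: allow each alert subject at most once per cooldown period.
STATE_DIR="${STATE_DIR:-/var/tmp/monitor-alerts}"   # hypothetical state directory
COOLDOWN="${COOLDOWN:-3600}"                        # seconds between repeats

# Returns 0 if an alert for this key may be sent now, 1 if it is still cooling down.
should_alert() {
    local key="$1" stamp now last
    # Turn the alert subject into a safe file name
    stamp="$STATE_DIR/${key//[^A-Za-z0-9_-]/_}"
    mkdir -p "$STATE_DIR"
    now=$(date +%s)
    last=0
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    if (( now - last >= COOLDOWN )); then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}

# Example use inside send_alert:
#   should_alert "$subject" || return 0
#   echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
```

The timestamp file is written before the mail is sent, so a failed delivery still starts the cooldown; swap the order if missed alerts are the bigger concern.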
📱 Dashboard Setup
Grafana Dashboard Configuration
Import Proxmox Dashboard:
Access Grafana: http://monitoring-ip:3000
Login: admin/admin (change the password at first login)
Add Prometheus data source: http://localhost:9090
Import dashboard: Use dashboard ID 10347 for Proxmox
Custom Dashboard Panels:
- CPU usage by VM/container
- Memory utilization trends
- Storage I/O performance
- Network traffic patterns
- Backup job status
- Alert summary
Web-based Status Page
# Simple status page generator
cat > /usr/local/bin/generate-status.sh << 'EOF'
#!/bin/bash
STATUS_FILE="/var/www/html/status.html"
cat > "$STATUS_FILE" << EOL
<!DOCTYPE html>
<html>
<head>
<title>Proxmox Status</title>
<meta http-equiv="refresh" content="60">
</head>
<body>
<h1>Proxmox Infrastructure Status</h1>
<p>Last updated: $(date)</p>
<h2>System Resources</h2>
<pre>$(df -h)</pre>
<h2>Running VMs</h2>
<pre>$(qm list)</pre>
<h2>Running Containers</h2>
<pre>$(pct list)</pre>
<h2>Recent Alerts</h2>
<pre>$(tail -20 /var/log/syslog | grep -i alert || echo "No recent alerts")</pre>
</body>
</html>
EOL
EOF
chmod +x /usr/local/bin/generate-status.sh
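The generator above pastes raw command output into pre elements, so any <, >, or & in a VM name or log line is interpreted as markup. A small filter, sketched below under the hypothetical name html_escape, could be placed in each pipeline.

```shell
#!/bin/bash
# Sketch: escape &, < and > on stdin so the text is safe inside <pre>...</pre>.
# The ampersand must be replaced first so later entities are not double-escaped.
html_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Example use inside the status page template:
#   <pre>$(qm list | html_escape)</pre>
```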
📋 Monitoring Checklist
Daily Monitoring Tasks:
[ ] Review dashboard for anomalies
[ ] Check alert notifications and resolve issues
[ ] Verify backup completion status
[ ] Monitor resource usage trends
[ ] Check service health status
Weekly Monitoring Tasks:
[ ] Review performance trends over the week
[ ] Update alert thresholds if needed
[ ] Test notification channels
[ ] Clean up old monitoring data
[ ] Review and tune monitoring rules
Monthly Monitoring Tasks:
[ ] Capacity planning based on trends
[ ] Update monitoring tools and dashboards
[ ] Review alert effectiveness
[ ] Document any monitoring changes
[ ] Test disaster recovery monitoring
🚨 Troubleshooting
Common Monitoring Issues
Prometheus Not Scraping:
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Check service status
systemctl status prometheus
# Check configuration
promtool check config /etc/prometheus/prometheus.yml
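The targets endpoint returns a long JSON document, and the field that matters when scraping fails is each target's health value. A rough sketch for counting targets that are not up, using only grep, is shown below. The function name count_unhealthy_targets is hypothetical, and the pattern assumes the compact JSON the API normally emits; a JSON parser such as jq is more robust if available.

```shell
#!/bin/bash
# Sketch: read the /api/v1/targets response on stdin and print how many
# targets report a health value other than "up".
count_unhealthy_targets() {
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -o '"health":"[a-z]*"' | grep -vc '"health":"up"' || true
}

# Example:
#   curl -s http://localhost:9090/api/v1/targets | count_unhealthy_targets
```

A non-zero count points at a target problem (wrong address, firewall, exporter down) rather than at Prometheus itself.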
Grafana Connection Issues:
# Check Grafana logs
journalctl -u grafana-server
# Test data source connection
curl 'http://localhost:9090/api/v1/query?query=up'
Alert Not Firing:
# Check alert rules
promtool check rules /etc/prometheus/alert_rules.yml
# Check Alertmanager status
systemctl status alertmanager