===================
Monitoring & Alerts
===================

.. highlight:: console

System monitoring, alerting, and performance tracking for Proxmox infrastructure.

📊 Monitoring Overview
======================

The monitoring strategy covers five areas:

- **System Metrics**: CPU, memory, disk, network usage
- **Service Health**: VM/container status and performance
- **Storage Monitoring**: Disk health, ZFS status, backup verification
- **Network Monitoring**: Connectivity, bandwidth, latency
- **Alert Management**: Proactive notifications for issues

🔧 Built-in Proxmox Monitoring
==============================

Proxmox Web Interface Monitoring
--------------------------------

**System Status Dashboard**:

- Node summary with resource usage
- VM/container status overview
- Storage utilization
- Network interface statistics

**Performance Graphs**:

- CPU usage over time
- Memory utilization trends
- Network traffic patterns
- Storage I/O statistics

Command Line Monitoring
-----------------------

.. code-block:: bash

   # System resource usage
   htop
   iotop
   iftop

   # Proxmox-specific commands
   pvesh get /nodes/$(hostname)/status
   pvesh get /nodes/$(hostname)/storage
   pvesh get /cluster/resources

   # VM/Container status
   qm list
   pct list

   # Storage status
   zpool status
   df -h

📈 Advanced Monitoring Stack
============================

Prometheus + Grafana Setup
--------------------------

**Deploy monitoring stack in LXC container**:

.. code-block:: bash

   # Create monitoring container
   pct create 300 \
       local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
       --hostname monitoring \
       --memory 4096 \
       --cores 2 \
       --net0 name=eth0,bridge=vmbr0,ip=192.168.1.50/24,gw=192.168.1.1 \
       --storage local-lvm \
       --rootfs local-lvm:20

**Install Prometheus**:

.. code-block:: bash

   # Update system
   apt update && apt upgrade -y

   # Create prometheus user
   useradd --no-create-home --shell /bin/false prometheus

   # Download and install Prometheus
   cd /tmp
   wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
   tar xvf prometheus-2.40.0.linux-amd64.tar.gz

   # Install binaries
   cp prometheus-2.40.0.linux-amd64/prometheus /usr/local/bin/
   cp prometheus-2.40.0.linux-amd64/promtool /usr/local/bin/

   # Set permissions
   chown prometheus:prometheus /usr/local/bin/prometheus
   chown prometheus:prometheus /usr/local/bin/promtool

   # Create directories
   mkdir -p /etc/prometheus
   mkdir -p /var/lib/prometheus
   chown prometheus:prometheus /etc/prometheus
   chown prometheus:prometheus /var/lib/prometheus

**Prometheus Configuration**:

.. code-block:: yaml

   # /etc/prometheus/prometheus.yml
   global:
     scrape_interval: 15s
     evaluation_interval: 15s

   rule_files:
     - "alert_rules.yml"

   alerting:
     alertmanagers:
       - static_configs:
           - targets:
               - localhost:9093

   scrape_configs:
     - job_name: 'prometheus'
       static_configs:
         - targets: ['localhost:9090']

     - job_name: 'node-exporter'
       static_configs:
         - targets: ['192.168.1.240:9100']  # Proxmox host

     - job_name: 'pve-exporter'
       metrics_path: /pve  # pve_exporter serves the Proxmox metrics under /pve
       params:
         module: [default]  # matches the "default" section in pve.yml
       static_configs:
         - targets: ['192.168.1.240:9221']  # Proxmox PVE exporter

**Install Grafana**:

.. code-block:: bash

   # Add Grafana repository
   wget -q -O - https://packages.grafana.com/gpg.key | apt-key add -
   echo "deb https://packages.grafana.com/oss/deb stable main" > /etc/apt/sources.list.d/grafana.list

   # Install Grafana
   apt update
   apt install -y grafana

   # Enable and start Grafana
   systemctl enable grafana-server
   systemctl start grafana-server

Node Exporter Setup
-------------------

**Install on Proxmox host**:

.. code-block:: bash

   # Create node_exporter user
   useradd --no-create-home --shell /bin/false node_exporter

   # Download and install
   cd /tmp
   wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
   tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
   cp node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
   chown node_exporter:node_exporter /usr/local/bin/node_exporter

**Create systemd service**:

.. code-block:: bash

   cat > /etc/systemd/system/node_exporter.service << 'EOF'
   [Unit]
   Description=Node Exporter
   Wants=network-online.target
   After=network-online.target

   [Service]
   User=node_exporter
   Group=node_exporter
   Type=simple
   ExecStart=/usr/local/bin/node_exporter

   [Install]
   WantedBy=multi-user.target
   EOF

   # Enable and start service
   systemctl daemon-reload
   systemctl enable node_exporter
   systemctl start node_exporter

PVE Exporter for Proxmox
------------------------

.. code-block:: bash

   # Install PVE exporter
   pip3 install prometheus-pve-exporter

   # Create configuration
   cat > /etc/prometheus/pve.yml << 'EOF'
   default:
     user: monitoring@pve
     password: your-monitoring-password
     verify_ssl: false
   EOF

   # Create systemd service
   cat > /etc/systemd/system/pve-exporter.service << 'EOF'
   [Unit]
   Description=Proxmox VE Exporter

   [Service]
   ExecStart=/usr/local/bin/pve_exporter --config.file /etc/prometheus/pve.yml
   Restart=always

   [Install]
   WantedBy=multi-user.target
   EOF

   # Reload systemd so it picks up the new unit, then enable and start it
   systemctl daemon-reload
   systemctl enable pve-exporter
   systemctl start pve-exporter

🚨 Alert Configuration
======================

Alertmanager Setup
------------------

.. code-block:: bash

   # Download and install Alertmanager
   cd /tmp
   wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
   tar xvf alertmanager-0.25.0.linux-amd64.tar.gz
   cp alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/
   cp alertmanager-0.25.0.linux-amd64/amtool /usr/local/bin/

**Alertmanager Configuration**:

.. code-block:: yaml

   # /etc/prometheus/alertmanager.yml
   global:
     smtp_smarthost: 'smtp.gmail.com:587'
     smtp_from: 'alerts@yourdomain.com'
     smtp_auth_username: 'alerts@yourdomain.com'
     smtp_auth_password: 'your-app-password'

   route:
     group_by: ['alertname']
     group_wait: 10s
     group_interval: 10s
     repeat_interval: 1h
     receiver: 'web.hook'

   receivers:
     - name: 'web.hook'
       email_configs:
         - to: 'admin@yourdomain.com'
           # email_configs sets the subject through headers and the plain-text body through text
           headers:
             Subject: 'Proxmox Alert: {{ .GroupLabels.alertname }}'
           text: |
             {{ range .Alerts }}
             Alert: {{ .Annotations.summary }}
             Description: {{ .Annotations.description }}
             {{ end }}

Alert Rules
-----------

.. code-block:: yaml

   # /etc/prometheus/alert_rules.yml
   groups:
     - name: proxmox_alerts
       rules:
         - alert: HighCPUUsage
           # rate() smooths over the window, which suits alerting better than irate()
           expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "High CPU usage on {{ $labels.instance }}"
             description: "CPU usage is above 80% for more than 5 minutes"

         - alert: HighMemoryUsage
           expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
           for: 5m
           labels:
             severity: critical
           annotations:
             summary: "High memory usage on {{ $labels.instance }}"
             description: "Memory usage is above 90%"

         - alert: DiskSpaceLow
           expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
           for: 5m
           labels:
             severity: critical
           annotations:
             summary: "Low disk space on {{ $labels.instance }}"
             description: "Disk space is below 10% on {{ $labels.mountpoint }}"

         - alert: VMDown
           expr: pve_up == 0
           for: 2m
           labels:
             severity: critical
           annotations:
             summary: "VM/Container is down"
             # pve_up identifies the guest in its id label (for example qemu/100)
             description: "{{ $labels.id }} has been down for more than 2 minutes"

📱 Notification Channels
========================

Email Notifications
-------------------

**Configure SMTP in Proxmox**:

1. **Datacenter** → **Notifications**
2. **Add** → **SMTP Endpoint**
3. Configure SMTP settings:

   - Server: smtp.gmail.com
   - Port: 587
   - Username/Password: Your credentials
   - Enable TLS

**Test email notifications**:

.. code-block:: bash

   # Test email from command line
   echo "Test message" | mail -s "Proxmox Test" admin@yourdomain.com

Slack Integration
-----------------

.. code-block:: yaml

   # Add to alertmanager.yml
   receivers:
     - name: 'slack-notifications'
       slack_configs:
         - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
           channel: '#alerts'
           title: 'Proxmox Alert'
           text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Discord Integration
-------------------

.. code-block:: bash

   # Discord webhook script
   cat > /usr/local/bin/discord-alert.sh << 'EOF'
   #!/bin/bash
   WEBHOOK_URL="https://discord.com/api/webhooks/YOUR/WEBHOOK/URL"
   MESSAGE="$1"

   curl -H "Content-Type: application/json" \
        -X POST \
        -d "{\"content\": \"🚨 Proxmox Alert: $MESSAGE\"}" \
        "$WEBHOOK_URL"
   EOF

   chmod +x /usr/local/bin/discord-alert.sh

📊 Custom Monitoring Scripts
============================

System Health Monitor
---------------------

.. code-block:: bash

   cat > /usr/local/bin/system-health.sh << 'EOF'
   #!/bin/bash
   # System Health Monitoring Script

   ALERT_EMAIL="admin@yourdomain.com"
   CPU_THRESHOLD=80
   MEMORY_THRESHOLD=90
   DISK_THRESHOLD=90

   log() {
       echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
   }

   send_alert() {
       local subject="$1"
       local message="$2"
       echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
       log "ALERT SENT: $subject"
   }

   # Check CPU usage (the user-space share reported by top)
   cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
   if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
       send_alert "High CPU Usage Alert" "CPU usage is ${cpu_usage}% (threshold: ${CPU_THRESHOLD}%)"
   fi

   # Check memory usage
   memory_usage=$(free | grep Mem | awk '{printf("%.1f", ($3/$2) * 100.0)}')
   if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
       send_alert "High Memory Usage Alert" "Memory usage is ${memory_usage}% (threshold: ${MEMORY_THRESHOLD}%)"
   fi

   # Check disk usage (df -P keeps every filesystem on a single line)
   df -P | awk 'NR>1 {print $5 " " $6}' | while read -r usage partition; do
       usage=${usage%\%}
       # Skip entries that report no numeric usage value
       [[ "$usage" =~ ^[0-9]+$ ]] || continue
       if [ "$usage" -ge "$DISK_THRESHOLD" ]; then
           send_alert "Low Disk Space Alert" "Disk usage on $partition is ${usage}% (threshold: ${DISK_THRESHOLD}%)"
       fi
   done

   # Check ZFS pool health
   if command -v zpool >/dev/null 2>&1; then
       zpool_status=$(zpool status | grep -E "DEGRADED|FAULTED|OFFLINE|UNAVAIL")
       if [ -n "$zpool_status" ]; then
           send_alert "ZFS Pool Health Alert" "ZFS pool issues detected: $zpool_status"
       fi
   fi

   log "System health check completed"
   EOF

   chmod +x /usr/local/bin/system-health.sh

Service Monitoring
------------------

.. code-block:: bash

   cat > /usr/local/bin/service-monitor.sh << 'EOF'
   #!/bin/bash
   # Service Monitoring Script

   SERVICES=(
       "pveproxy"
       "pvedaemon"
       "pve-cluster"
       "docker"
   )

   ALERT_EMAIL="admin@yourdomain.com"

   log() {
       echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
   }

   send_alert() {
       local subject="$1"
       local message="$2"
       echo "$message" | mail -s "$subject" "$ALERT_EMAIL"
       log "ALERT SENT: $subject"
   }

   for service in "${SERVICES[@]}"; do
       if ! systemctl is-active --quiet "$service"; then
           send_alert "Service Down Alert" "Service $service is not running on $(hostname)"
           log "ERROR: Service $service is down"

           # Attempt to restart service
           systemctl restart "$service"
           sleep 5

           if systemctl is-active --quiet "$service"; then
               send_alert "Service Recovered" "Service $service has been restarted successfully on $(hostname)"
               log "INFO: Service $service restarted successfully"
           else
               send_alert "Service Restart Failed" "Failed to restart service $service on $(hostname)"
               log "ERROR: Failed to restart service $service"
           fi
       else
           log "OK: Service $service is running"
       fi
   done
   EOF

   chmod +x /usr/local/bin/service-monitor.sh

⏰ Monitoring Schedule
======================

Cron Configuration
------------------

.. code-block:: bash

   # Edit root crontab
   crontab -e

   # Add monitoring schedules
   # System health check every 5 minutes
   */5 * * * * /usr/local/bin/system-health.sh

   # Service monitoring every 2 minutes
   */2 * * * * /usr/local/bin/service-monitor.sh

   # Backup verification daily at 6 AM
   0 6 * * * /usr/local/bin/backup-verify.sh

   # Generate daily status report at 8 AM
   0 8 * * * /usr/local/bin/backup-status.sh | mail -s "Daily Proxmox Status" admin@yourdomain.com

📱 Dashboard Setup
==================

Grafana Dashboard Configuration
-------------------------------

**Import Proxmox Dashboard**:

1. **Access Grafana**: http://monitoring-ip:3000
2. **Login**: admin/admin (change the password on first login)
3. **Add Prometheus data source**: http://localhost:9090
4. **Import dashboard**: Use dashboard ID 10347 for Proxmox

**Custom Dashboard Panels**:

- CPU usage by VM/container
- Memory utilization trends
- Storage I/O performance
- Network traffic patterns
- Backup job status
- Alert summary

Web-based Status Page
---------------------

.. code-block:: bash

   # Simple status page generator
   cat > /usr/local/bin/generate-status.sh << 'EOF'
   #!/bin/bash

   STATUS_FILE="/var/www/html/status.html"

   cat > "$STATUS_FILE" << EOL
   <!DOCTYPE html>
   <html>
   <head>
       <title>Proxmox Status</title>
   </head>
   <body>
       <h1>Proxmox Infrastructure Status</h1>
       <p>Last updated: $(date)</p>

       <h2>System Resources</h2>
       <pre>$(df -h)</pre>

       <h2>Running VMs</h2>
       <pre>$(qm list)</pre>

       <h2>Running Containers</h2>
       <pre>$(pct list)</pre>

       <h2>Recent Alerts</h2>
       <pre>$(tail -20 /var/log/syslog | grep -i alert || echo "No recent alerts")</pre>
   </body>
   </html>
   EOL
   EOF

   chmod +x /usr/local/bin/generate-status.sh

📋 Monitoring Checklist
=======================

Daily Monitoring Tasks:

- [ ] **Review dashboard** for anomalies
- [ ] **Check alert notifications** and resolve issues
- [ ] **Verify backup completion** status
- [ ] **Monitor resource usage** trends
- [ ] **Check service health** status

Weekly Monitoring Tasks:

- [ ] **Review performance trends** over the week
- [ ] **Update alert thresholds** if needed
- [ ] **Test notification channels**
- [ ] **Clean up old monitoring data**
- [ ] **Review and tune** monitoring rules

Monthly Monitoring Tasks:

- [ ] **Capacity planning** based on trends
- [ ] **Update monitoring tools** and dashboards
- [ ] **Review alert effectiveness**
- [ ] **Document any monitoring changes**
- [ ] **Test disaster recovery** monitoring

🚨 Troubleshooting
==================

Common Monitoring Issues
------------------------

**Prometheus Not Scraping**:

.. code-block:: bash

   # Check Prometheus targets
   curl http://localhost:9090/api/v1/targets

   # Check service status
   systemctl status prometheus

   # Check configuration
   promtool check config /etc/prometheus/prometheus.yml

**Grafana Connection Issues**:

.. code-block:: bash

   # Check Grafana logs
   journalctl -u grafana-server

   # Test data source connection (quoted so the shell does not interpret "?")
   curl 'http://localhost:9090/api/v1/query?query=up'

**Alert Not Firing**:

.. code-block:: bash

   # Check alert rules
   promtool check rules /etc/prometheus/alert_rules.yml

   # Check Alertmanager status
   systemctl status alertmanager

📚 Additional Resources
=======================

- Prometheus Documentation
- Grafana Documentation
- Proxmox Monitoring Best Practices
- Node Exporter Metrics
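
The troubleshooting commands in this guide run ``systemctl status prometheus``, but the installation steps never create a Prometheus unit. The following is a minimal sketch of one, modeled on the Node Exporter unit and assuming the paths used earlier in this guide (``/usr/local/bin/prometheus``, ``/etc/prometheus/prometheus.yml``, ``/var/lib/prometheus``). The unit file name ``/etc/systemd/system/prometheus.service`` and the ``Restart`` policy are assumptions, not part of the original setup:

.. code-block:: ini

   # /etc/systemd/system/prometheus.service (assumed location)
   [Unit]
   Description=Prometheus
   Wants=network-online.target
   After=network-online.target

   [Service]
   User=prometheus
   Group=prometheus
   Type=simple
   # Paths match the directories created in the Prometheus install step
   ExecStart=/usr/local/bin/prometheus \
       --config.file=/etc/prometheus/prometheus.yml \
       --storage.tsdb.path=/var/lib/prometheus
   # Assumed policy: restart only after an abnormal exit
   Restart=on-failure

   [Install]
   WantedBy=multi-user.target

After saving the file, the unit would be loaded the same way as the other services here: ``systemctl daemon-reload``, then ``systemctl enable prometheus`` and ``systemctl start prometheus``. An Alertmanager unit could probably follow the same pattern with its own binary and ``--config.file=/etc/prometheus/alertmanager.yml``.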