Skip to content

Auto-restart

DUMB includes an automatic restart system that monitors service health and restarts failed services to maintain system stability without manual intervention.


Overview

The auto-restart system provides:

  • Health monitoring - Periodic health checks for each service
  • Automatic recovery - Restart services that become unhealthy
  • Exponential backoff - Increasing delays between restart attempts
  • Restart limits - Prevent infinite restart loops
  • Grace periods - Allow services time to initialize

Auto-restart status Auto-restart indicators


How it works

%%{ init: { "flowchart": { "curve": "basis" } } }%%
flowchart TD
    A([Service running])
    B{Health check}
    C{Threshold exceeded?}
    D{Restart limit reached?}
    E[Restart service]
    F([Stop retrying])
    G[Wait grace period]

    A ==> B
    B -- Healthy --> A
    B -- Unhealthy --> C
    C -- No --> B
    C -- Yes --> D
    D -- Not reached --> E
    D -- Reached --> F
    E ==> G
    G ==> B
  1. Health check - Service is periodically checked for responsiveness
  2. Unhealthy detection - Multiple consecutive failures trigger action
  3. Restart attempt - Service is stopped and restarted
  4. Grace period - Wait for service to initialize
  5. Repeat - Continue monitoring after restart

Configuration

Auto-restart is configured globally in dumb.auto_restart:

"dumb": {
  "auto_restart": {
    "enabled": false,
    "restart_on_unhealthy": true,
    "healthcheck_interval": 30,
    "unhealthy_threshold": 3,
    "max_restarts": 3,
    "window_seconds": 300,
    "backoff_seconds": [5, 15, 45, 120],
    "grace_period_seconds": 30,
    "services": []
  }
}

Configuration options

Option Default Description
enabled false Enable auto-restart globally
restart_on_unhealthy true Restart when health checks fail
healthcheck_interval 30 Seconds between health checks
unhealthy_threshold 3 Consecutive failures before restart
max_restarts 3 Maximum restarts within the window
window_seconds 300 Time window in seconds
backoff_seconds [5, 15, 45, 120] Backoff delays between restarts
grace_period_seconds 30 Seconds to wait after restart before health checks
services [] Limit auto-restart to these process names

Exponential backoff

To prevent rapid restart loops, delays between restarts increase exponentially:

Attempt Delay
1 5 seconds
2 10 seconds
3 20 seconds
4 40 seconds
5 80 seconds
6+ 120 seconds (max)

The formula: delay = min(initial_delay * (backoff_multiplier ^ attempt), max_delay)


Restart limits

Services have a maximum number of restart attempts within a time window:

  • Default: 5 restarts per hour
  • After reaching the limit, auto-restart pauses for that service
  • The counter resets after the window expires
  • Manual restart resets the counter

Restart limit reached

If a service keeps failing, investigate the root cause rather than increasing limits. Check logs for error messages.


Health checks

Services are monitored using health check endpoints or process status:

HTTP health checks

For services with web interfaces:

"health_check": {
  "type": "http",
  "url": "http://127.0.0.1:8080/health",
  "timeout": 10,
  "interval": 30
}

Process health checks

For services without HTTP endpoints:

"health_check": {
  "type": "process",
  "interval": 30
}

Monitoring restart status

Dashboard indicators

The dashboard shows auto-restart status for each service:

  • Restart count - Number of restarts in current window
  • Last restart - Timestamp of most recent restart
  • Health status - Current healthy/unhealthy state

API endpoints

Query restart status via the API:

# Get service status including restart info
curl http://localhost:8000/api/process/service-status?process_name=Riven%20Backend

Response includes:

{
  "process_name": "Riven Backend",
  "status": "running",
  "healthy": true,
  "restart": {
    "count": 2,
    "last_restart": "2025-01-15T10:30:00Z",
    "enabled": true
  }
}

WebSocket updates

Real-time restart events via /ws/status:

{
  "type": "status",
  "processes": [
    {
      "process_name": "Riven Backend",
      "status": "running",
      "healthy": true,
      "restart": {
        "count": 2,
        "last_restart": "2025-01-15T10:30:00Z",
        "enabled": true
      }
    }
  ]
}

Disabling auto-restart

Per-service

Disable for a specific service:

"riven_backend": {
  "auto_restart": {
    "enabled": false
  }
}

Globally

To disable auto-restart for all services, set enabled: false in each service's auto_restart configuration, or use the Settings page in the frontend.


Best practices

Appropriate thresholds

  • Critical services (Plex, rclone): Lower threshold (2-3)
  • Background services (Zilean, NeutArr): Higher threshold (3-5)

Grace periods

  • Fast-starting services: 10-15 seconds
  • Database-dependent services: 30-60 seconds
  • Services with startup tasks: 60-120 seconds

Monitoring

  • Review restart counts regularly
  • Investigate services with frequent restarts
  • Check logs after restart events

Troubleshooting

Service keeps restarting

  1. Check service logs for errors
  2. Verify configuration is valid
  3. Ensure dependencies are running
  4. Check for port conflicts

Auto-restart not working

  1. Verify auto_restart.enabled is true
  2. Check if restart limit was reached
  3. Ensure health check is configured correctly

Restart delay too long

  • Reduce backoff_multiplier
  • Lower max_delay
  • Reset counter with manual restart