Auto-restart¶

DUMB includes an automatic restart system that monitors service health and restarts failed services to maintain system stability without manual intervention.

Overview¶

The auto-restart system provides:

Health monitoring - Periodic health checks for each service
Automatic recovery - Restart services that become unhealthy
Exponential backoff - Increasing delays between restart attempts
Restart limits - Prevent infinite restart loops
Grace periods - Allow services time to initialize

[Screenshot: Auto-restart status indicators]

How it works¶

%%{ init: { "flowchart": { "curve": "basis" } } }%%
flowchart TD
    A([Service running])
    B{Health check}
    C{Threshold exceeded?}
    D{Restart limit reached?}
    E[Restart service]
    F([Stop retrying])
    G[Wait grace period]

    A ==> B
    B -- Healthy --> A
    B -- Unhealthy --> C
    C -- No --> B
    C -- Yes --> D
    D -- Not reached --> E
    D -- Reached --> F
    E ==> G
    G ==> B

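Health check - Each service is checked every healthcheck_interval seconds
Unhealthy detection - unhealthy_threshold consecutive failed checks mark the service unhealthy
Restart attempt - If the restart limit has not been reached, the service is stopped and restarted; otherwise auto-restart stops retrying
Grace period - Health checks wait grace_period_seconds while the service initializes
Repeat - Monitoring resumes after the grace period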
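The loop above can be sketched as a small state machine. The following is a minimal Python sketch under stated assumptions, not DUMB's own code: `MonitorState` and `on_health_result` are hypothetical names, and the time window, backoff delay and grace period are deliberately left out so that only the threshold and restart-limit decisions from the flowchart remain.

```python
from dataclasses import dataclass

# Hypothetical sketch of the flowchart's decisions; not DUMB's implementation.


@dataclass
class MonitorState:
    """Per-service bookkeeping for the hypothetical monitor."""
    consecutive_failures: int = 0
    restarts: int = 0
    stopped: bool = False


def on_health_result(state: MonitorState, healthy: bool,
                     unhealthy_threshold: int = 3, max_restarts: int = 3) -> str:
    """Apply one health-check result and return the action the flowchart takes."""
    if state.stopped:
        return "stopped"            # restart limit was already reached
    if healthy:
        state.consecutive_failures = 0
        return "ok"                 # back to "Service running"
    state.consecutive_failures += 1
    if state.consecutive_failures < unhealthy_threshold:
        return "check_again"        # threshold not reached yet
    if state.restarts >= max_restarts:
        state.stopped = True
        return "stop_retrying"      # restart limit reached
    state.restarts += 1
    state.consecutive_failures = 0  # count failures afresh after the restart
    return "restart"                # DUMB would then wait out the grace period
```

With `unhealthy_threshold=2` and `max_restarts=1`, two failures yield `restart`, two more yield `stop_retrying`, and every later result returns `stopped`.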

Configuration¶

Auto-restart is configured globally in dumb.auto_restart:

"dumb": {
  "auto_restart": {
    "enabled": false,
    "restart_on_unhealthy": true,
    "healthcheck_interval": 30,
    "unhealthy_threshold": 3,
    "max_restarts": 3,
    "window_seconds": 300,
    "backoff_seconds": [5, 15, 45, 120],
    "grace_period_seconds": 30,
    "services": []
  }
}

Configuration options¶

Option	Default	Description
`enabled`	`false`	Enable auto-restart globally
`restart_on_unhealthy`	`true`	Restart when health checks fail
`healthcheck_interval`	`30`	Seconds between health checks
`unhealthy_threshold`	`3`	Consecutive failures before restart
`max_restarts`	`3`	Maximum restarts within the window
`window_seconds`	`300`	Time window in seconds
`backoff_seconds`	`[5, 15, 45, 120]`	Backoff delays between restarts
`grace_period_seconds`	`30`	Seconds to wait after restart before health checks
`services`	`[]`	Limit auto-restart to these process names
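To see how the defaults in the table combine with a partial configuration, here is a hypothetical Python loader. It assumes the `auto_restart` fragment sits under a top-level `dumb` key of a JSON config file; `AutoRestartConfig` and `from_config` are illustrative names, not part of DUMB. Keys missing from the file fall back to the documented defaults, and unknown keys are ignored.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class AutoRestartConfig:
    """Hypothetical mirror of dumb.auto_restart with the documented defaults."""
    enabled: bool = False
    restart_on_unhealthy: bool = True
    healthcheck_interval: int = 30
    unhealthy_threshold: int = 3
    max_restarts: int = 3
    window_seconds: int = 300
    backoff_seconds: List[int] = field(default_factory=lambda: [5, 15, 45, 120])
    grace_period_seconds: int = 30
    services: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.backoff_seconds:
            raise ValueError("backoff_seconds must contain at least one delay")

    @classmethod
    def from_config(cls, config: dict) -> "AutoRestartConfig":
        """Read config["dumb"]["auto_restart"]; missing keys keep their defaults."""
        raw = config.get("dumb", {}).get("auto_restart", {})
        known = {f.name for f in fields(cls)}
        return cls(**{key: value for key, value in raw.items() if key in known})


cfg = AutoRestartConfig.from_config(
    json.loads('{"dumb": {"auto_restart": {"enabled": true, "max_restarts": 5}}}')
)
```

Here `cfg` has auto-restart enabled and five restarts per window, while every other option keeps its default.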

Exponential backoff¶

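To prevent rapid restart loops, the delay before each restart attempt grows with every attempt. The delays are read from backoff_seconds; with the default [5, 15, 45, 120], each delay is roughly three times the previous one:

Attempt	Delay
1	5 seconds
2	15 seconds
3	45 seconds
4	120 seconds

That is, attempt n waits backoff_seconds[n - 1] seconds. To change how quickly the delays grow, or the longest delay, edit the backoff_seconds list.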
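Reading `backoff_seconds` as a lookup table, the delay for a given attempt can be sketched in Python as below. `backoff_delay` is a hypothetical helper, and reusing the last entry once attempts run past the end of the list is an assumption, not documented behavior.

```python
from typing import Sequence


def backoff_delay(attempt: int, backoff_seconds: Sequence[float]) -> float:
    """Delay in seconds before restart attempt `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    if not backoff_seconds:
        raise ValueError("backoff_seconds must not be empty")
    # Assumption: attempts past the end of the list reuse the last (largest) entry.
    index = min(attempt - 1, len(backoff_seconds) - 1)
    return backoff_seconds[index]
```

For example, `backoff_delay(3, [5, 15, 45, 120])` returns 45.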

Restart limits¶

Services have a maximum number of restart attempts within a time window:

Default: 3 restarts (max_restarts) within a 300-second window (window_seconds)
After reaching the limit, auto-restart pauses for that service
The counter resets after the window expires
Manual restart resets the counter
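A restart limit of this kind can be sketched with a queue of restart timestamps. This Python sketch uses a sliding window, which is an assumption: DUMB may instead reset a fixed counter when the window expires. `RestartLimiter` and its methods are hypothetical names; passing `now` explicitly keeps the example deterministic.

```python
import time
from collections import deque
from typing import Deque, Optional


class RestartLimiter:
    """Sliding-window limit (an assumption): max_restarts per window_seconds."""

    def __init__(self, max_restarts: int = 3, window_seconds: float = 300) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restart_times: Deque[float] = deque()

    def _prune(self, now: float) -> None:
        # Forget restarts that are at least window_seconds old.
        while self._restart_times and now - self._restart_times[0] >= self.window_seconds:
            self._restart_times.popleft()

    def count(self, now: Optional[float] = None) -> int:
        """Restarts recorded within the current window."""
        self._prune(time.monotonic() if now is None else now)
        return len(self._restart_times)

    def try_restart(self, now: Optional[float] = None) -> bool:
        """Record a restart and return True, or return False if the limit is reached."""
        now = time.monotonic() if now is None else now
        if self.count(now) >= self.max_restarts:
            return False
        self._restart_times.append(now)
        return True

    def reset(self) -> None:
        """Called on a manual restart, which resets the counter."""
        self._restart_times.clear()
```

With the defaults, restarts at t = 0, 10 and 20 seconds exhaust the limit, and another restart is allowed only once the t = 0 restart is at least 300 seconds old.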

Restart limit reached

If a service keeps failing, investigate the root cause rather than increasing limits. Check logs for error messages.

Health checks¶

Services are monitored using health check endpoints or process status:

HTTP health checks¶

For services with web interfaces:

"health_check": {
  "type": "http",
  "url": "http://127.0.0.1:8080/health",
  "timeout": 10,
  "interval": 30
}
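An HTTP check like the one configured above can be approximated with Python's standard library. This sketch assumes a 2xx status means healthy and that any error, including a timeout, counts as a failed check; `http_healthy` is a hypothetical name, and DUMB's own check may apply different rules.

```python
import urllib.error
import urllib.request


def http_healthy(url: str, timeout: float = 10) -> bool:
    """Return True if GET url answers with a 2xx status within timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Refused connections, timeouts, 4xx/5xx responses (HTTPError is a
        # URLError subclass) and malformed URLs all count as a failed check.
        return False
```

Usage: `http_healthy("http://127.0.0.1:8080/health", timeout=10)`, matching the `url` and `timeout` values in the example configuration.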

Process health checks¶

For services without HTTP endpoints:

"health_check": {
  "type": "process",
  "interval": 30
}
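For a process check, a supervisor that launched the service itself can simply ask whether the child is still running. This sketch assumes services are started as `subprocess.Popen` children; `process_healthy` is a hypothetical helper. `Popen.poll()` returns `None` while the child runs and its exit code once it has exited.

```python
import subprocess


def process_healthy(proc: subprocess.Popen) -> bool:
    """Process-type health check: healthy while the child has not exited."""
    # poll() returns None while the child is running; after it exits, poll()
    # reaps it and returns the exit code, so a crashed service reads as unhealthy.
    return proc.poll() is None
```

Unlike an HTTP check, this only detects a process that has exited, not one that is running but unresponsive.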

Monitoring restart status¶

Dashboard indicators¶

The dashboard shows auto-restart status for each service:

Restart count - Number of restarts in current window
Last restart - Timestamp of most recent restart
Health status - Current healthy/unhealthy state

API endpoints¶

Query restart status via the API:

# Get service status including restart info
curl "http://localhost:8000/api/process/service-status?process_name=Riven%20Backend"

Response includes:

{
  "process_name": "Riven Backend",
  "status": "running",
  "healthy": true,
  "restart": {
    "count": 2,
    "last_restart": "2025-01-15T10:30:00Z",
    "enabled": true
  }
}
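The same query can be made from Python. In this sketch, `service_status_url`, `fetch_service_status` and `restart_summary` are hypothetical helpers. The base URL assumes the API listens on `localhost:8000` as in the curl example, and the code reads only the response fields shown above.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the DUMB API is reachable here, as in the curl example.
BASE_URL = "http://localhost:8000"


def service_status_url(process_name: str, base_url: str = BASE_URL) -> str:
    """Build the service-status URL; quote_via=quote encodes spaces as %20."""
    query = urllib.parse.urlencode({"process_name": process_name},
                                   quote_via=urllib.parse.quote)
    return f"{base_url}/api/process/service-status?{query}"


def fetch_service_status(process_name: str, base_url: str = BASE_URL,
                         timeout: float = 10) -> dict:
    """GET the status JSON for one process (needs a running DUMB instance)."""
    with urllib.request.urlopen(service_status_url(process_name, base_url),
                                timeout=timeout) as response:
        return json.load(response)


def restart_summary(status: dict) -> str:
    """One-line summary built from the fields shown in the example response."""
    restart = status.get("restart", {})
    last = restart.get("last_restart") or "never"
    return (f"{status['process_name']}: {status['status']}, "
            f"healthy={status['healthy']}, restarts={restart.get('count', 0)}, "
            f"last restart {last}")
```

Against a running instance: `print(restart_summary(fetch_service_status("Riven Backend")))`.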

WebSocket updates¶

Real-time restart events via /ws/status:

{
  "type": "status",
  "processes": [
    {
      "process_name": "Riven Backend",
      "status": "running",
      "healthy": true,
      "restart": {
        "count": 2,
        "last_restart": "2025-01-15T10:30:00Z",
        "enabled": true
      }
    }
  ]
}
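To react to restarts, a client can compare restart counts between consecutive status messages. The sketch below only parses message text, so it works with any WebSocket client library. `restart_counts` and `new_restarts` are hypothetical names, and the logic assumes the message shape shown above.

```python
import json
from typing import Dict, List


def restart_counts(message_text: str) -> Dict[str, int]:
    """Map process name -> restart count from one /ws/status message."""
    message = json.loads(message_text)
    if message.get("type") != "status":
        return {}  # ignore message types this sketch does not handle
    return {
        process["process_name"]: process.get("restart", {}).get("count", 0)
        for process in message.get("processes", [])
    }


def new_restarts(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Names whose restart count went up since the previous message."""
    return sorted(name for name, count in current.items()
                  if count > previous.get(name, 0))
```

Because the count covers only the current window, it can fall when old restarts expire; comparing with `>` ignores such drops.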

Disabling auto-restart¶

Per-service¶

Disable for a specific service:

"riven_backend": {
  "auto_restart": {
    "enabled": false
  }
}

Globally¶

To disable auto-restart for all services, set dumb.auto_restart.enabled to false (the default), or use the Settings page in the frontend.
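One plausible way the global switch, the `services` list and a per-service block could combine is shown in the Python sketch below. The precedence is an assumption: the global `enabled` flag must be true, a non-empty `services` list must contain the process name, and a per-service `enabled: false` opts that service out. `auto_restart_applies` is a hypothetical helper, so verify these rules against your DUMB version.

```python
from typing import Optional


def auto_restart_applies(process_name: str, global_cfg: dict,
                         service_cfg: Optional[dict] = None) -> bool:
    """Hypothetical precedence between dumb.auto_restart and a per-service block."""
    # 1. Assumed master switch: dumb.auto_restart.enabled, false by default.
    if not global_cfg.get("enabled", False):
        return False
    # 2. A non-empty services list limits auto-restart to the listed process names.
    allowed = global_cfg.get("services", [])
    if allowed and process_name not in allowed:
        return False
    # 3. A per-service "auto_restart": {"enabled": false} opts that service out.
    override = (service_cfg or {}).get("auto_restart", {})
    return bool(override.get("enabled", True))
```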

Best practices¶

Appropriate thresholds¶

Critical services (Plex, rclone): Lower threshold (2-3)
Background services (Zilean, NeutArr): Higher threshold (3-5)

Grace periods¶

Fast-starting services: 10-15 seconds
Database-dependent services: 30-60 seconds
Services with startup tasks: 60-120 seconds
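Thresholds and grace periods add up. The following is a rough estimate. It assumes the first backoff delay runs before the restart and ignores check timeouts and the time the restart itself takes. Under those assumptions, the worst case from a failure to the first post-restart health check is `healthcheck_interval * unhealthy_threshold + backoff_seconds[0] + grace_period_seconds`. With the defaults, that is 30 * 3 + 5 + 30 = 125 seconds. The hypothetical helper below computes the same sum.

```python
def worst_case_recovery_seconds(healthcheck_interval: float = 30,
                                unhealthy_threshold: int = 3,
                                first_backoff: float = 5,
                                grace_period_seconds: float = 30) -> float:
    """Rough upper bound from a failure to the first post-restart health check."""
    # A failure that starts just after a passing check needs
    # unhealthy_threshold failed checks, one per interval, to be detected.
    detection = healthcheck_interval * unhealthy_threshold
    # Assumption: the first backoff delay runs before the restart, then the
    # grace period passes before health checks resume.
    return detection + first_backoff + grace_period_seconds
```

For a fast-starting critical service, a threshold of 2 and a grace period of 15 seconds bring this to 60 + 5 + 15 = 80 seconds.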

Monitoring¶

Review restart counts regularly
Investigate services with frequent restarts
Check logs after restart events

Troubleshooting¶

Service keeps restarting¶

Check service logs for errors
Verify configuration is valid
Ensure dependencies are running
Check for port conflicts

Auto-restart not working¶

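Verify dumb.auto_restart.enabled is true
Verify restart_on_unhealthy is true
If services is not empty, check that the service's process name is listed
Check whether the restart limit was reached
Ensure the health check is configured correctly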

Restart delay too long¶

Use smaller values in backoff_seconds
Lower the largest (last) entry in backoff_seconds
Reset the counter with a manual restart

Dashboard - View restart status
Process Management API - API controls
WebSocket API - Real-time updates