Automatic failover

Automatic failover ensures high availability by promoting a standby instance when the primary PostgreSQL node in a RayDB cluster becomes unavailable. This helps minimize downtime and maintain business continuity.

How Automatic Failover Works

Primary Health Checks: The system continuously monitors the health of the primary instance.
Failover Trigger: If the primary node fails due to hardware issues, network disruptions, or crashes, the system automatically promotes a standby instance.
Replication Switchover: The newly promoted instance takes over as the primary, and replication is reconfigured accordingly.
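
The promotion and switchover steps above can be illustrated with a small in-memory model. This is only a sketch, not RayDB's actual implementation: the Node class, its fields, and the rule of promoting the healthy standby with the highest replayed WAL position are assumptions made for illustration. The only PostgreSQL-specific detail is the LSN text format (two hexadecimal numbers separated by a slash).

```python
from dataclasses import dataclass
from typing import List, Optional


def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


@dataclass
class Node:                      # hypothetical model of a cluster member
    name: str
    role: str                    # "primary", "standby", or "failed"
    healthy: bool
    replay_lsn: str              # last WAL position this node has replayed
    follows: Optional[str] = None  # node this one replicates from


def fail_over(nodes: List[Node]) -> Node:
    """Promote the most up-to-date healthy standby and repoint the others."""
    candidates = [n for n in nodes if n.role == "standby" and n.healthy]
    if not candidates:
        raise RuntimeError("no healthy standby available for promotion")

    # Assumed selection rule: the standby that has replayed the most WAL.
    new_primary = max(candidates, key=lambda n: parse_lsn(n.replay_lsn))

    # Mark the old primary as failed so it no longer accepts the role.
    for node in nodes:
        if node.role == "primary":
            node.role = "failed"
            node.follows = None

    new_primary.role = "primary"
    new_primary.follows = None

    # Replication switchover: remaining standbys follow the new primary.
    for node in nodes:
        if node.role == "standby":
            node.follows = new_primary.name
    return new_primary
```

Choosing the standby with the highest replayed position keeps the amount of lost or replayed work as small as possible, which is why it is a common selection rule in failover managers.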

When Failover Occurs

Failover is triggered under these conditions:

The primary instance becomes unresponsive for a sustained period.
Hardware or network failures affect the primary instance.
The database process crashes and does not recover automatically.
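
The "sustained period" condition matters because a single missed health check should not trigger a promotion. A minimal sketch of that rule follows; the FailureDetector class and the grace_seconds parameter are hypothetical names, and the actual thresholds RayDB uses are not stated here.

```python
import time
from typing import Callable, Optional


class FailureDetector:
    """Report a failure only after the primary has been unresponsive for at
    least `grace_seconds` with no successful check in between."""

    def __init__(self, grace_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        # Start time of the current outage, or None while the primary is up.
        self.first_failure_at: Optional[float] = None

    def record(self, check_ok: bool) -> bool:
        """Record one health-check result; return True if failover should start."""
        now = self.clock()
        if check_ok:
            # Any successful check ends the outage and resets the timer.
            self.first_failure_at = None
            return False
        if self.first_failure_at is None:
            self.first_failure_at = now
        return now - self.first_failure_at >= self.grace_seconds
```

A caller would run its health probe on a fixed interval and pass each result to record(). The injectable clock exists only to make the timing logic testable.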

Recovery After Failover

Once failover is completed:

The system re-establishes replication with a new standby instance.
Applications reconnect automatically to the new primary using the updated connection string.
The previous primary instance can be analyzed and recovered if needed.

Considerations

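Failover Time: Failover typically takes from a few seconds to a few minutes, depending on how long failure detection and promotion take.
Data Integrity: With synchronous replication, a transaction is acknowledged only after a standby has confirmed it, so committed data survives failover. With asynchronous replication, transactions that were committed on the primary but had not yet reached the standby can be lost; the exposure grows with replication lag.
Application Handling: Connections that are open when failover occurs are dropped, so applications should retry failed connections and transactions.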
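
One way to implement the retry advice is exponential backoff with jitter around the connection call. The sketch below is driver-agnostic: connect_with_retry and its parameters are hypothetical, and it assumes the caller's connect function raises ConnectionError on failure. Real drivers raise their own exception types, so the except clause would need adapting.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds, backing off exponentially with jitter."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:  # assumption: failures surface as ConnectionError
            last_error = exc
            if attempt == attempts - 1:
                break  # no point sleeping after the final attempt
            # Delay doubles each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out reconnects from many clients at once.
            sleep(delay * random.uniform(0.5, 1.0))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

With the defaults, the waits between attempts add up to roughly 15 seconds at most, so the attempt count and delays should be sized to cover the failover time expected for the cluster.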

Best Practices

Use Read Replicas: Keep at least one standby available so that there is always an instance to promote.
Monitor Failover Events: Set up alerts for automatic failover occurrences.
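
If the platform's built-in alerts are not sufficient, a failover can also be detected from the outside by tracking which node reports itself as primary. The sketch below assumes each snapshot maps a node name to the result of PostgreSQL's pg_is_in_recovery() on that node (false on a primary, true on a standby). The function names and alert messages are hypothetical, and collecting the snapshots is left to the caller.

```python
from typing import Dict, List, Optional


def find_primary(in_recovery: Dict[str, bool]) -> Optional[str]:
    """Return the one node that is not in recovery, or None if there is
    not exactly one such node."""
    primaries = [name for name, recovering in in_recovery.items() if not recovering]
    return primaries[0] if len(primaries) == 1 else None


def failover_alerts(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Compare two snapshots of reachable nodes and describe any role change."""
    before = find_primary(previous)
    after = find_primary(current)
    alerts: List[str] = []
    if after is None:
        # Zero primaries (outage) or several (possible split brain).
        alerts.append("no single primary detected")
    elif before is not None and before != after:
        alerts.append(f"failover: primary changed from {before} to {after}")
    return alerts
```

Unreachable nodes are simply left out of a snapshot, so a crashed primary followed by a promotion still shows up as a change of primary.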