Automatic failover

Automatic failover ensures high availability by promoting a standby instance when the primary PostgreSQL node in a RayDB cluster becomes unavailable. This helps minimize downtime and maintain business continuity.

How Automatic Failover Works

  • Primary Health Checks: The system continuously monitors the health of the primary instance.
  • Failover Trigger: If the primary node fails due to hardware issues, network disruptions, or crashes, the system automatically promotes a standby instance.
  • Replication Switchover: The newly promoted instance takes over as the primary, and replication is reconfigured accordingly.
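The three steps above can be sketched as a small in-memory simulation. This is an illustrative model only, not RayDB's implementation: the `Node` class, its fields, and the `run_failover` function are hypothetical names introduced for this example, and promotion simply picks the first healthy standby.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    role: str                      # "primary", "standby", or "failed"
    healthy: bool = True
    replicates_from: Optional[str] = None


def run_failover(nodes: List[Node]) -> Node:
    """Promote a healthy standby if the primary is unhealthy; return the current primary."""
    primary = next(n for n in nodes if n.role == "primary")
    if primary.healthy:
        return primary  # health check passed, nothing to do

    candidates = [n for n in nodes if n.role == "standby" and n.healthy]
    if not candidates:
        raise RuntimeError("no healthy standby available to promote")

    # Failover trigger: promote the first healthy standby (assumed policy).
    new_primary = candidates[0]
    new_primary.role = "primary"
    new_primary.replicates_from = None
    primary.role = "failed"

    # Replication switchover: remaining standbys now follow the new primary.
    for n in nodes:
        if n.role == "standby" and n.healthy:
            n.replicates_from = new_primary.name
    return new_primary
```

A real system would also need to fence the old primary so it cannot accept writes after promotion; that step is omitted from this sketch.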

When Failover Occurs

Failover is triggered under these conditions:

  • The primary instance becomes unresponsive for a sustained period.
  • Hardware or network failures affect the primary instance.
  • The database process crashes and does not recover automatically.
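One common way to express "unresponsive for a sustained period" is to require several consecutive failed health checks before triggering failover, so that a single dropped probe does not cause a promotion. The sketch below assumes that approach; the `FailureDetector` class and its default threshold of 3 are hypothetical and do not reflect RayDB's actual settings.

```python
class FailureDetector:
    """Triggers only after `threshold` consecutive failed health checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_ok: bool) -> bool:
        """Record one health-check result; return True when failover should trigger."""
        if check_ok:
            # Any successful check resets the counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A higher threshold reduces false failovers during brief network blips, at the cost of a longer detection time.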

Recovery After Failover

Once failover is completed:

  1. The system re-establishes replication with a new standby instance.
  2. The application reconnects automatically using the updated connection string.
  3. The previous primary instance can be analyzed and recovered if needed.

Considerations

  • Failover Time: Failover typically takes from a few seconds to a few minutes, depending on how quickly the failure is detected and the standby is promoted.
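  • Data Integrity: With synchronous replication, committed transactions are not lost during failover. With asynchronous replication, transactions that were committed on the primary but not yet replicated to the standby may be lost.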
  • Application Handling: Applications should implement retry mechanisms for seamless reconnection.
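A retry mechanism for reconnection might look like the following sketch, which uses exponential backoff with jitter. The `connect_with_retry` function and its parameters are hypothetical; `connect` stands in for whatever call your database driver uses to open a connection, and this example assumes failures surface as Python's built-in `ConnectionError` (real drivers usually raise their own exception types, which you would catch instead).

```python
import random
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call `connect()` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            if attempt == attempts - 1:
                break  # no point sleeping after the final attempt
            # Exponential backoff, capped at max_delay, plus random jitter
            # so many clients do not all reconnect at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ConnectionError(
        f"could not connect after {attempts} attempts"
    ) from last_error
```

The total retry window should be longer than the expected failover time; otherwise the application gives up before the new primary is ready.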

Best Practices

  • Use Read Replicas: Keep at least one standby available at all times so there is always an instance to promote during failover.
  • Monitor Failover Events: Set up alerts for automatic failover occurrences.
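If you monitor the cluster yourself, one way to detect a failover is to compare which node reports itself as primary between polls. In PostgreSQL, `SELECT pg_is_in_recovery()` returns true on a standby and false on a primary. The sketch below assumes you have already collected that value for each node; the `detect_failover` function and the alert message format are hypothetical, and how RayDB exposes failover events is not covered here.

```python
def detect_failover(previous_primary, current_roles):
    """Compare the current primary against the previously observed one.

    current_roles maps node name -> result of pg_is_in_recovery()
    (True means standby, False means primary).
    Returns (current_primary_or_None, alert_message_or_None).
    """
    primaries = [name for name, in_recovery in current_roles.items()
                 if not in_recovery]
    if len(primaries) != 1:
        # Zero primaries (outage) or several (split brain) both need attention.
        return None, f"ALERT: expected exactly one primary, found {len(primaries)}"

    current = primaries[0]
    if previous_primary is not None and current != previous_primary:
        return current, (
            f"ALERT: failover detected, primary moved from "
            f"{previous_primary} to {current}"
        )
    return current, None
```

Call it on each polling cycle, passing the primary returned by the previous call, and forward any non-empty alert message to your notification channel.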
