Automatic failover
Automatic failover ensures high availability by promoting a standby instance when the primary PostgreSQL node in a RayDB cluster becomes unavailable. This helps minimize downtime and maintain business continuity.
How Automatic Failover Works
- Primary Health Checks: The system continuously monitors the health of the primary instance.
- Failover Trigger: If the primary node fails due to hardware issues, network disruptions, or crashes, the system automatically promotes a standby instance.
- Replication Switchover: The newly promoted instance takes over as the primary, and replication is reconfigured accordingly.
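The promotion and switchover steps above can be pictured with a toy model. This is only an illustrative sketch, not RayDB's implementation: the `Cluster` class, the node names, and the rule of promoting the first standby in the list are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of a cluster: one primary plus standbys that replicate from it."""

    primary: str
    standbys: list = field(default_factory=list)

    def replication_sources(self):
        """Map each standby to the node it replicates from (the current primary)."""
        return {standby: self.primary for standby in self.standbys}

    def fail_over(self):
        """Promote a standby to primary and return the name of the failed node."""
        if not self.standbys:
            raise RuntimeError("no standby available to promote")
        failed = self.primary
        # Assumption: the first standby is promoted; a real system may choose differently.
        self.primary = self.standbys.pop(0)
        return failed


cluster = Cluster(primary="node-a", standbys=["node-b", "node-c"])
failed_node = cluster.fail_over()
print(cluster.primary)                # node-b
print(cluster.replication_sources())  # {'node-c': 'node-b'}
print(failed_node)                    # node-a
```

After promotion, the remaining standby replicates from the new primary, which mirrors the replication reconfiguration described above.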
When Failover Occurs
Failover is triggered under these conditions:
- The primary instance becomes unresponsive for a sustained period.
- Hardware or network failures affect the primary instance.
- The database process crashes and does not recover automatically.
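One common way to turn "unresponsive for a sustained period" into a concrete rule is to require several consecutive failed health checks before declaring the primary down. The sketch below shows that idea only; the `FailureDetector` class and the threshold of three checks are hypothetical, and RayDB's actual detection settings are not described here.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Flags the primary as failed after several consecutive missed health checks."""

    # Hypothetical threshold, chosen for illustration only.
    max_consecutive_failures: int = 3
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True when failover should trigger."""
        if healthy:
            # Any successful check resets the streak, so brief blips are ignored.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures


detector = FailureDetector(max_consecutive_failures=3)
checks = [True, False, False, True, False, False, False]
results = [detector.record(ok) for ok in checks]
print(results)  # [False, False, False, False, False, False, True]
```

Requiring a streak of failures trades slightly slower detection for fewer unnecessary failovers caused by momentary network hiccups.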
Recovery After Failover
Once failover is completed:
- The system re-establishes replication with a new standby instance.
- The application reconnects automatically using the updated connection string.
- The previous primary instance can be analyzed and recovered if needed.
Considerations
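- Failover Time: Failover typically takes from a few seconds to a few minutes, depending on how quickly the failure is detected and the standby is promoted.
- Data Integrity: Synchronous replication prevents the loss of committed transactions during failover. With asynchronous replication, transactions committed on the primary but not yet replicated to the standby may be lost.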
- Application Handling: Applications should implement retry mechanisms for seamless reconnection.
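A retry mechanism for the application side might look like the sketch below, which retries a failed operation with exponential backoff. The `with_retries` helper and its parameters are hypothetical, and the built-in `ConnectionError` stands in for whatever exception the database driver actually raises (for example, psycopg's `OperationalError`).

```python
import time


def with_retries(operation, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            # Delay doubles on each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))


# Stand-in for a database call that fails twice while failover is in progress.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("primary unavailable")
    return "ok"


delays = []  # Capture the delays instead of actually sleeping in this demo.
result = with_retries(flaky_query, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Retrying is only safe for operations that can be repeated without side effects, so a write that failed mid-transaction should be re-run as a whole transaction rather than resumed.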
Best Practices
- Use Read Replicas: Keep at least one standby available at all times; without one, there is no instance to promote when the primary fails.
- Monitor Failover Events: Set up alerts for automatic failover occurrences.
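If no built-in failover notification is available, one simple approach is to poll which node is currently the primary and raise an alert when it changes. The sketch below assumes the monitoring job already knows the current primary's name on each poll; the `detect_failover` function and the node names are hypothetical. In PostgreSQL, `SELECT pg_is_in_recovery()` returns false on a primary and true on a standby, which is one way to obtain that information.

```python
def detect_failover(previous_primary, current_primary):
    """Return an alert message if the primary changed between two polls, else None."""
    if previous_primary is not None and current_primary != previous_primary:
        return (
            f"failover detected: primary changed "
            f"from {previous_primary} to {current_primary}"
        )
    return None


# Simulated sequence of polls; a real monitor would query the cluster instead.
observations = ["node-a", "node-a", "node-b", "node-b"]
alerts = []
previous = None
for current in observations:
    message = detect_failover(previous, current)
    if message:
        alerts.append(message)  # A real monitor would page or notify here.
    previous = current

print(alerts)  # ['failover detected: primary changed from node-a to node-b']
```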