# Overview
Faults are guaranteed to occur in any system, especially in a distributed system that relies on an unreliable network. Fault tolerance is a system's ability to continue operating when it encounters these faults. It is achieved by designing for resiliency, for example by using [[timeouts]] and retries with backoff during [[Exception Handling]].

# Key Considerations
## Levels of Fault Tolerance
- **[[Byzantine fault-tolerant]]** - the system continues to operate correctly even if some nodes malfunction and do not obey the protocol, or if malicious attackers interfere with the network.

# Implementation Details
## Fault Tolerance in [[Stream Processing]]
- [[Microbatching]]
- [[Checkpointing]]
- [[Database Transactions]]
- [[idempotency]]

# Useful Links

# Related Topics

## Reference
#### Working Notes
#### Sources
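#### Example: Retries with Backoff
The overview mentions retries and backoffs as a way to absorb transient faults. The following is a minimal sketch of that idea, not a definitive implementation. The function name `retry_with_backoff`, its parameters, the default delays, and the choice of jitter are all assumptions made for illustration; none of them come from these notes.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,        # hypothetical default
    base_delay: float = 0.1,      # seconds; hypothetical default
    max_delay: float = 5.0,       # seconds; hypothetical default
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last fault to the caller
            # Exponential backoff capped at max_delay, with random jitter so
            # that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Because a retried operation may run more than once, this pattern is only safe when the operation is idempotent or otherwise protected against duplicate effects, which is why [[idempotency]] appears alongside the other techniques in these notes.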