Exception Handling - Ryan Lynch's Hub

# Overview

# Key Considerations

## [[Idempotency]]

## Retry (failure) Mechanisms

Retries are a great approach for addressing transient failures. Various approaches are shown below, but keep in mind that any retry approach can cause problems of its own:

From: ([Timeouts, retries, and backoff with jitter](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/))

- Handling retries across layers. Consider a system where the customer's call causes a five-deep stack of service calls. It ends with a query to a database, and three retries at each layer. What happens when the database starts failing queries under load? If each layer retries independently, the load on the database will increase 243x, making it unlikely to ever recover. This is because the retries at each layer multiply -- first three tries, then nine tries, and so on. On the contrary, retrying at the highest layer of the stack may waste work from previous calls, which reduces efficiency. In general, for low-cost control-plane and data-plane operations, our best practice is to retry at a single point in the stack.
- Load. Even with a single layer of retries, traffic still significantly increases when errors start. _Circuit breakers_, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a [token bucket](https://en.wikipedia.org/wiki/Token_bucket). This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted. AWS added this behavior to the AWS SDK in 2016, so customers using the SDK have [this throttling behavior](https://aws.amazon.com/blogs/developer/introducing-retry-throttling/) built in.
- Deciding when to retry. In general, our view is that APIs with side effects aren't safe to retry unless they provide idempotency. This guarantees that the side effects happen only once no matter how often you retry. Read-only APIs are typically idempotent, while resource creation APIs may not be. Some APIs, like the Amazon Elastic Compute Cloud (Amazon EC2) RunInstances API, provide explicit token-based mechanisms to provide idempotency and make them safe to retry. Good API design, and care when implementing clients, is needed to prevent duplicate side effects.
- Knowing which failures are worth retrying. HTTP provides a clear distinction between _client_ and _server_ errors. It indicates that client errors should not be retried with the same request because they aren't going to succeed later, while server errors may succeed on subsequent tries. Unfortunately, eventual consistency in systems significantly blurs this line. A client error one moment may change into a success the next moment as state propagates.

| Method | Description | Pros | Cons | Use Cases |
| -------------------------- | ----------------------------- | ------------------------- | ------------------------- | ------------------------------ |
| [[Simple Retry]] | ![[Simple Retry#Overview]] | ![[Simple Retry#Pros]] | ![[Simple Retry#Cons]] | ![[Simple Retry#Use Cases]] |
| [[Delayed Retry]] | ![[Delayed Retry#Overview]] | ![[Delayed Retry#Pros]] | ![[Delayed Retry#Cons]] | ![[Delayed Retry#Use Cases]] |
| [[Circuit Breaker]] | ![[Circuit Breaker#Overview]] | ![[Circuit Breaker#Pros]] | ![[Circuit Breaker#Cons]] | ![[Circuit Breaker#Use Cases]] |
| Linear Backoff | | | | |
| [[Linear Jitter Backoff]] | | | | |
| [[Exponential Backoff]] | | | | |
| Exponential Jitter Backoff | | | | |

# Implementation Details

# Useful Links

# Related Topics

#### Topics to Cover

#### Related Topics

-
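#### Sketch: token-bucket-limited retries with jittered backoff

The retry notes under Retry (failure) Mechanisms describe two ideas that are easier to see in code: retries at each layer multiply (3 tries at each of 5 layers gives 3^5 = 243x load), and a local token bucket can cap how many retries a client issues. The Python below is a minimal sketch under stated assumptions, not the AWS SDK's implementation. `RetryTokenBucket`, `TransientError`, `backoff_delay`, `call_with_retries`, and `amplification` are hypothetical names, and the capacity, refill rate, base delay, and cap are illustrative values only.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures assumed to be worth retrying."""


class RetryTokenBucket:
    """Local retry budget: each retry spends one token; tokens refill at a fixed rate."""

    def __init__(self, capacity=10.0, refill_per_second=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def try_acquire(self):
        # Top up the bucket for the time elapsed since the last call, then spend a token.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Capped exponential backoff with full jitter: rng() * min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, bucket, max_attempts=3, sleep=time.sleep):
    """Call operation(); retry on TransientError only while the bucket has tokens."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            is_last_attempt = attempt == max_attempts - 1
            if is_last_attempt or not bucket.try_acquire():
                raise  # out of attempts or out of retry budget: surface the failure
            sleep(backoff_delay(attempt))


def amplification(tries_per_layer, layers):
    """Worst-case load multiplier when every layer retries independently."""
    return tries_per_layer ** layers
```

Sharing one `RetryTokenBucket` across all calls made by a client is what bounds the extra load: once the bucket is empty, retries proceed only as fast as tokens refill. Only operations that are idempotent should be passed to `call_with_retries`.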