Resilience & Distributed Design Patterns
Resilience patterns are essential for building fault-tolerant distributed systems that handle failures, network issues, and service degradation gracefully. These patterns prevent cascading failures and keep the system available even when individual components fail.
Circuit Breaker
The Circuit Breaker pattern prevents cascading failures by failing fast when a service is unavailable, protecting system resources from exhaustion.
How It Works
The circuit breaker monitors failure rates and operates in three states: Closed (normal operation), Open (blocking requests), and Half-Open (testing recovery).
Python Implementation
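A minimal sketch of a three-state breaker. The class and parameter names (CircuitBreaker, failure_threshold, recovery_timeout) are illustrative assumptions, not from any specific library:

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, calls pass through
    OPEN = "open"            # failing fast, calls are rejected
    HALF_OPEN = "half_open"  # probing whether the service has recovered


class CircuitBreakerOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = State.CLOSED
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            # After the recovery timeout, allow a single trial call.
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = State.HALF_OPEN
            else:
                raise CircuitBreakerOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failure_count = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        if self.state is State.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = time.monotonic()
```

A caller would wrap each outbound request, e.g. breaker.call(payment_client.charge, order) with a hypothetical payment_client, and serve a degraded response whenever CircuitBreakerOpenError is raised.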
Use Cases
- E-commerce Payment Gateway: Prevent overwhelming a failing payment processor by opening the circuit after multiple failures, allowing the service to recover.
- Microservices Communication: Service A stops calling Service B when B is down, preventing thread exhaustion and resource depletion.
Best Practices
- Set appropriate failure thresholds based on service SLAs and expected failure rates
- Use exponential backoff for recovery timeout to avoid thundering herd problems
- Implement fallback mechanisms to provide degraded functionality when the circuit is open
- Monitor circuit breaker state changes and alert on frequent state transitions
Retry with Exponential Backoff
Retry with Exponential Backoff handles transient failures by retrying operations with increasing delays, preventing system overload.
How It Works
Each retry waits progressively longer (exponentially increasing delay with random jitter) to avoid synchronized retries and thundering herd effects.
Python Implementation
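A minimal sketch using capped exponential backoff with full jitter. The helper name retry_with_backoff and its defaults are illustrative assumptions:

```python
import random
import time


def retry_with_backoff(func, *, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       retryable=(ConnectionError, TimeoutError)):
    """Retry `func` on transient errors, doubling the delay each attempt with full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retryable:
            if attempt == max_attempts:
                raise  # give up and propagate after the final attempt
            # Exponential backoff capped at max_delay, with full random jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))


# Example (hypothetical call): retry a flaky fetch, treating only network errors as transient.
# result = retry_with_backoff(lambda: fetch_account("a-42"), max_attempts=4)
```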
Use Cases
- Database Connection Failures: Retry transient connection failures with exponential backoff to handle temporary network issues
- API Rate Limiting: Automatically retry requests that fail due to rate limits with increasing delays.
- Message Processing: Retry failed message processing operations in event-driven architectures.
Best Practices
- Only retry idempotent operations to avoid duplicate side effects
- Add random jitter to prevent synchronized retries (thundering herd)
- Set maximum retry limits to prevent infinite retry loops
- Distinguish retryable from non-retryable errors, e.g. with separate exception types, and only retry the former
- Combine with circuit breaker to stop retrying when service is consistently failing
Bulkhead & Rate Limiting
The Bulkhead pattern isolates resources to prevent cascading failures, while rate limiting controls request throughput to protect services from overload.
How It Works
Bulkhead partitions resources (threads, connections, memory) into isolated pools, ensuring one failing operation doesn’t exhaust all resources.
Python Implementation
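A minimal sketch covering both halves of this section: a semaphore-based bulkhead and a token-bucket rate limiter. The class names and limits are illustrative assumptions:

```python
import threading
import time


class Bulkhead:
    """Isolate a dependency behind a bounded pool of concurrent slots."""

    def __init__(self, max_concurrent=10):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def execute(self, func, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call to protect other work")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


class TokenBucket:
    """Token-bucket rate limiter: refill at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate=10.0, capacity=20):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self._lock = threading.Lock()

    def allow(self):
        with self._lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

In practice each downstream dependency (or tenant) would get its own Bulkhead and TokenBucket instance, so one noisy or failing consumer cannot starve the rest.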
Use Cases
- API Gateway: Limit concurrent requests per tenant to prevent noisy neighbor problems.
- Database Connection Pools: Isolate connection pools per service to prevent one service exhausting all connections.
- Microservices: Dedicate thread pools per downstream dependency to contain failures.
Best Practices
- Size bulkheads based on resource capacity and expected load
- Use token bucket or leaky bucket algorithms for smooth rate limiting
- Implement per-user and per-service rate limits for fair resource allocation
- Monitor resource utilization and adjust limits dynamically
- Combine with circuit breaker for comprehensive protection
Saga Pattern
The Saga pattern manages distributed transactions across microservices using compensating transactions instead of distributed locks.
How It Works
Orchestration uses a central coordinator to manage the saga flow, while choreography relies on event-driven communication between services without central control.
Python Implementation
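A minimal orchestration sketch: a coordinator runs the steps in order and runs their compensations in reverse when a step fails. The SagaStep/SagaOrchestrator names and the example order steps are illustrative assumptions:

```python
class SagaStep:
    def __init__(self, name, action, compensation):
        self.name, self.action, self.compensation = name, action, compensation


class SagaOrchestrator:
    """Run steps in order; on failure, compensate completed steps in reverse order."""

    def __init__(self, steps):
        self.steps = steps

    def execute(self, context):
        completed = []
        try:
            for step in self.steps:
                step.action(context)
                completed.append(step)
        except Exception:
            for step in reversed(completed):
                step.compensation(context)  # compensations should be idempotent
            raise
        return context


# Hypothetical order saga: each action/compensation would call the owning service.
saga = SagaOrchestrator([
    SagaStep("create_order", lambda ctx: ctx.update(order_id="o-1"),
             lambda ctx: ctx.pop("order_id", None)),
    SagaStep("charge_payment", lambda ctx: ctx.update(payment_id="p-1"),
             lambda ctx: ctx.pop("payment_id", None)),
    SagaStep("reserve_inventory", lambda ctx: ctx.update(reservation_id="r-1"),
             lambda ctx: ctx.pop("reservation_id", None)),
])
# saga.execute({}) returns the populated context, or compensates and re-raises on failure.
```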
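A minimal choreography sketch, using an in-process event bus as a stand-in for a real message broker; the event names and handlers are illustrative assumptions:

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process stand-in for a message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._handlers[event_type]:
            handler(payload)


bus = EventBus()

# Each service reacts to the previous service's event and emits its own.
def on_order_created(order):
    # Payment service: charge the customer, then emit success (or a failure event).
    bus.publish("PaymentCompleted", order)

def on_payment_completed(order):
    # Inventory service: reserve stock; on failure it would emit a compensating
    # event such as "PaymentRefundRequested" instead.
    bus.publish("InventoryReserved", order)

bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("PaymentCompleted", on_payment_completed)
bus.publish("OrderCreated", {"order_id": "o-1"})
```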
Use Cases
- E-commerce Order Processing: Coordinate order creation, payment, inventory reservation, and shipping across multiple services.
- Travel Booking: Book flights, hotels, and car rentals as a single transaction with rollback capability.
- Financial Transactions: Manage multi-step financial operations with compensation for failed steps
Best Practices
- Use orchestration for complex workflows with many steps requiring central control
- Use choreography for loose coupling and autonomous services
- Design idempotent compensating transactions to handle duplicate compensation requests
- Store saga state persistently to handle coordinator failures
- Implement timeout handling for long-running sagas
Transaction Outbox
The Outbox pattern makes database updates and message publishing atomic without unsafe dual writes, enabling reliable event-driven architectures.
How It Works
Write business data and outbox event in a single database transaction, then asynchronously publish events from the outbox table using change data capture (CDC).
Python Implementation
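A minimal sketch against an in-memory SQLite database; the table layout and the polling relay are illustrative assumptions. A production relay would typically use CDC (e.g. Debezium) or SELECT ... FOR UPDATE SKIP LOCKED on PostgreSQL instead of this simple poll:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, customer_id TEXT, total REAL);
    CREATE TABLE outbox (id TEXT PRIMARY KEY, event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")


def place_order(customer_id, total):
    """Write the order row and its OrderCreated event in one atomic transaction."""
    order_id = str(uuid.uuid4())
    with conn:  # both inserts commit together, or not at all
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, customer_id, total))
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "OrderCreated",
             json.dumps({"order_id": order_id, "total": total})))
    return order_id


def publish_pending(publish):
    """Relay: push unpublished events to the broker, then mark them as published."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, payload)  # e.g. produce to Kafka / RabbitMQ
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
```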
Use Cases
- Order Processing: Atomically save order and publish OrderCreated event without risking inconsistency.
- Saga Orchestration: Combine with the saga pattern to ensure reliable saga step execution and event publishing.
- Event Sourcing: Persist events reliably before publishing to event stream.
Best Practices
- Use change data capture (CDC) tools like Debezium for production reliability
- Implement idempotent consumers to handle duplicate event delivery
- Add event versioning to support schema evolution
- Use SELECT FOR UPDATE SKIP LOCKED to enable concurrent publishers
- Regularly archive published events to prevent table growth
Idempotency Keys & Deduplication
Idempotency ensures operations can be safely retried without duplicating side effects, which is critical for reliable distributed systems.
How It Works
Clients provide unique idempotency keys with requests; servers store processed keys and return cached responses for duplicate requests.
Python Implementation
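A minimal sketch assuming a redis-py style client; the key format, TTL, and IN_PROGRESS sentinel are illustrative assumptions:

```python
import json


class IdempotentHandler:
    """Cache responses by idempotency key so retried requests return the first result."""

    def __init__(self, redis_client, ttl_seconds=86400):
        self.redis = redis_client
        self.ttl = ttl_seconds

    def handle(self, idempotency_key, process):
        # Reserve the key atomically (SET NX); only the first request gets to process.
        reserved = self.redis.set(f"idem:{idempotency_key}", "IN_PROGRESS",
                                  nx=True, ex=self.ttl)
        if not reserved:
            cached = self.redis.get(f"idem:{idempotency_key}")
            if cached == b"IN_PROGRESS":
                raise RuntimeError("duplicate request is still being processed")
            return json.loads(cached)  # replay the original response
        result = process()
        self.redis.set(f"idem:{idempotency_key}", json.dumps(result), ex=self.ttl)
        return result


# Example (hypothetical): handler = IdempotentHandler(redis.Redis())
# handler.handle(request_headers["Idempotency-Key"], lambda: charge_card(order))
```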
Use Cases
- Payment Processing: Prevent double-charging users when they retry failed payment requests.
- Order Submission: Ensure users can safely retry order submission without creating duplicate orders.
- Message Processing: Deduplicate messages in event-driven systems to handle at-least-once delivery.
Best Practices
- Generate idempotency keys on the client side using UUIDs or request content hashes
- Set appropriate TTL for idempotency records based on retry window requirements
- Store idempotency keys before processing to handle concurrent duplicate requests
- Use database constraints or Redis SETNX for atomic deduplication
- Include request fingerprint in idempotency key (method, path, key parameters)
Dead Letter Queue
Dead Letter Queue (DLQ) captures failed messages that cannot be processed after multiple retry attempts, enabling manual investigation and reprocessing.
How It Works
Messages that fail processing after maximum retries are moved to a separate queue for analysis and manual intervention.
Python Implementation
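A minimal sketch of a consumer that parks messages on a DLQ after repeated failures. The message shape and the dlq object (anything with a put method, e.g. queue.Queue) are illustrative assumptions; a real consumer would read the attempt count from broker metadata rather than mutating the message:

```python
import json
import time


class QueueConsumer:
    """Process messages; after `max_retries` failures, move the message to a DLQ."""

    def __init__(self, handler, dlq, max_retries=3):
        self.handler = handler
        self.dlq = dlq                # any object with a .put(message) method
        self.max_retries = max_retries

    def consume(self, message):
        attempts = message.get("attempts", 0)
        try:
            self.handler(message["body"])
        except Exception as exc:
            if attempts + 1 >= self.max_retries:
                # Enrich the message with error context before parking it for analysis.
                self.dlq.put(json.dumps({
                    "body": message["body"],
                    "error": repr(exc),
                    "attempts": attempts + 1,
                    "failed_at": time.time(),
                }))
            else:
                message["attempts"] = attempts + 1
                raise  # let the broker redeliver for another attempt
```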
Use Cases
- Order Processing: Capture orders that fail validation or payment processing for manual investigation
- Event Processing: Store events that cannot be processed due to schema changes or missing dependencies.
- Integration Failures: Handle third-party API failures that require manual intervention.
Best Practices
- Set appropriate max retry counts based on failure type (transient vs permanent)
- Include rich error context in DLQ messages for debugging
- Implement DLQ monitoring and alerting to detect systemic issues
- Create replay mechanisms to reprocess messages after fixes
- Use separate DLQs per error type for better organization
Timeouts & Fallback
Timeouts prevent indefinite waiting, while fallbacks provide graceful degradation when services are unavailable.
How It Works
Set maximum wait times for operations and define alternative responses or behaviors when timeouts occur or services fail.
Python Implementation
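A minimal sketch that bounds a call with a thread-pool future and falls back on timeout or error; the function names and the 500 ms budget are illustrative assumptions. Note that the abandoned call keeps running in the background, so real clients should also set socket-level timeouts:

```python
import concurrent.futures
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, fallback):
    """Run `func` with a hard deadline; return `fallback()` if it times out or fails."""
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except Exception:  # concurrent.futures.TimeoutError or a downstream error
        return fallback()


def slow_profile_lookup():
    time.sleep(2)  # stands in for a slow downstream service call
    return {"tier": "gold"}


# Returns the default profile after ~500 ms instead of waiting the full 2 s.
profile = call_with_timeout(slow_profile_lookup, 0.5, lambda: {"tier": "standard"})
```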
Use Cases
- External API Calls: Set timeouts for third-party APIs and return cached data as fallback
- Database Queries: Timeout long-running queries and return default values for non-critical data.
- Microservices: Provide degraded functionality when downstream services are slow or unavailable.
Best Practices
- Set realistic timeouts based on service SLAs and p99 latency
- Implement cascading timeouts where caller timeout > callee timeout + overhead
- Design meaningful fallbacks that maintain core functionality
- Use cached data as primary fallback strategy
- Monitor fallback usage rates to detect degraded services
Combining Patterns
Real-world resilient systems combine multiple patterns for comprehensive protection: a typical stack bounds each outbound call with a timeout, retries transient failures with backoff, wraps the dependency in a circuit breaker, and falls back to cached or default data, as in the sketch below.
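A sketch of how the pieces compose, reusing the illustrative CircuitBreaker, retry_with_backoff, and thread-pool timeout from the sketches above; inventory_client and cache are hypothetical:

```python
import concurrent.futures

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def fetch_inventory(item_id, inventory_client, cache):
    def bounded_call():
        # Timeout innermost: never wait on the downstream call for more than 500 ms.
        return pool.submit(inventory_client.get, item_id).result(timeout=0.5)

    try:
        # Retry transient failures with backoff, and let the breaker fail fast
        # when the dependency is consistently unhealthy.
        return breaker.call(
            retry_with_backoff, bounded_call, max_attempts=3,
            retryable=(concurrent.futures.TimeoutError, ConnectionError),
        )
    except Exception:
        # Last line of defense: degrade gracefully with cached or default data.
        return cache.get(item_id, {"item_id": item_id, "stock": "unknown"})
```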