# S3 resilience design
Every S3 operation sits behind two protections: exponential-backoff retry and a circuit breaker. Together they handle two different failure modes. For the recipe to wire a new operation up, see [how to add a new S3 operation](../how-to/add-s3-operations.md).
## Retry handles transient errors
S3 has a well-documented set of "try again in a moment" errors. Retry attempts those up to 3 times with 1s / 2s / 4s exponential backoff:
- **Retriable errors**: `ServiceUnavailable`, `SlowDown`, `RequestTimeout`, `InternalError`, `ThrottlingException`, `RequestThrottled`, `TooManyRequestsException`, `ProvisionedThroughputExceededException`
- **Retriable HTTP codes**: 408, 429, 500, 502, 503, 504
- **Non-retriable errors**: `NoSuchKey`, `NoSuchBucket`, `AccessDenied`, `InvalidRequest`, `InvalidArgument`, `MalformedXML`, `InvalidBucketName`. Retrying these won't help; fail fast.
Boto3's adaptive retry is also enabled, which adds client-side rate limiting when S3 pushes back.
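The classification and backoff described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual retry helper: `S3Error`, `is_retriable`, and `call_with_retry` are hypothetical names, and it assumes "up to 3 times" means three retries after the initial attempt.

```python
import time

# Error codes and HTTP statuses listed above as retriable.
RETRIABLE_CODES = {
    "ServiceUnavailable", "SlowDown", "RequestTimeout", "InternalError",
    "ThrottlingException", "RequestThrottled", "TooManyRequestsException",
    "ProvisionedThroughputExceededException",
}
RETRIABLE_HTTP = {408, 429, 500, 502, 503, 504}


class S3Error(Exception):
    """Hypothetical stand-in for the error a real S3 client would raise."""

    def __init__(self, code, http_status):
        super().__init__(f"{code} (HTTP {http_status})")
        self.code = code
        self.http_status = http_status


def is_retriable(err):
    return err.code in RETRIABLE_CODES or err.http_status in RETRIABLE_HTTP


def call_with_retry(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run operation; retry retriable failures after 1s, 2s, 4s delays."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except S3Error as err:
            # Fail fast on non-retriable errors or once retries are exhausted.
            if not is_retriable(err) or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter is only there so the delays can be observed without actually waiting.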
## Circuit breakers handle sustained degradation
Retry is pointless when S3 is actually down. Three consecutive 3-retry chains of `SlowDown` isn't a transient blip; it's a signal to stop hammering and back off. That's the circuit breaker's job.
**Thresholds** (tuned for S3's usual behaviour; tweak in `src/utils/circuit_breaker.py`):
- 5 consecutive failures → circuit opens
- 60 seconds in open state before we try again
- 2 successes in half-open → circuit closes
- Half-open allows 1 concurrent request (probe, not resume)
**States:**
- **CLOSED**: normal, requests pass through
- **OPEN**: failing fast, every request raises `CircuitBreakerOpen` without attempting S3
- **HALF_OPEN**: probing recovery with one request at a time
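To make the state transitions concrete, here is a minimal single-threaded sketch using the thresholds listed above. It is not the code in `src/utils/circuit_breaker.py`: the class layout, parameter names, and injectable `clock` are assumptions for illustration, as is the choice to reopen the circuit when a half-open probe fails. The one-concurrent-probe limit is omitted because it only matters with concurrent callers.

```python
import time


class CircuitBreakerOpen(Exception):
    """Raised when a call is rejected without attempting S3."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitBreakerOpen("failing fast")
            # Timeout elapsed: start probing.
            self.state = "HALF_OPEN"
            self.successes = 0
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # Assumption: any failed probe in HALF_OPEN reopens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
                self.failures = 0
        else:
            self.failures = 0  # the failure count is consecutive, so reset it
```

A production breaker would also need a lock around the state changes so that concurrent requests see a consistent state.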
## Why StorageService has breakers but S3CleanupService does not
Two different reliability requirements:
- **StorageService** handles critical-path uploads and downloads. If it fails, a job fails and the user sees an error. We want fast-fail behaviour so the user gets a retry-able 503 in hundreds of milliseconds, not a 30-second timeout after 3 retry attempts. Circuit breakers give us that.
- **S3CleanupService** handles best-effort deletes (temp files, expired results). If a delete fails, the file lingers briefly and gets picked up on the next sweep. There's no user waiting for it. Adding a circuit breaker here would just add dead code, because nothing downstream cares about the failure.
Putting a circuit breaker on cleanup would pretend the delete matters more than it does. The absence of one is deliberate.
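The best-effort pattern can be sketched as below. This is an illustration only, not the actual `S3CleanupService`: `sweep` and the `delete_object` callable are hypothetical names. The point is that a failed delete is logged and left for the next pass rather than escalated.

```python
import logging

logger = logging.getLogger("s3_cleanup")


def sweep(keys, delete_object):
    """Best-effort delete: log failures and leave them for the next sweep."""
    leftover = []
    for key in keys:
        try:
            delete_object(key)
        except Exception as exc:
            # Nothing downstream depends on this delete, so do not re-raise.
            logger.warning("cleanup delete failed for %s: %s", key, exc)
            leftover.append(key)
    return leftover
```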
## Available metrics
The metrics service exposes these to Prometheus:
- `s3_operations_total{operation, bucket, result}`: `result` is `success` / `error` / `circuit_open`
- `s3_operation_duration_seconds{operation, bucket}`: latency histogram
- `s3_circuit_breaker_state{circuit_name}`: 0=closed, 1=half-open, 2=open
- `s3_retry_attempts_total{operation, attempt}`: retry attempt counters
Prometheus scrapes `/metrics` every 15 seconds, so these values are queryable in Grafana's Explore view. For a quick look without a chart, hit the endpoint directly:
```bash
curl http://localhost:8080/metrics | grep s3_operations_total
curl http://localhost:8080/metrics | grep s3_circuit_breaker_state
```
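If the numeric gauge values are hard to read at a glance, a small script can translate them. The sketch below parses the text that `/metrics` returns and maps each `s3_circuit_breaker_state` sample to a state name. It assumes the standard Prometheus text exposition format with `circuit_name` as the only label on that metric; `breaker_states` and the circuit names in the usage note are hypothetical.

```python
STATE_NAMES = {0: "closed", 1: "half-open", 2: "open"}


def breaker_states(metrics_text):
    """Map each circuit_name label to its state name."""
    states = {}
    for line in metrics_text.splitlines():
        if not line.startswith("s3_circuit_breaker_state{"):
            continue  # skips HELP/TYPE comments and other metrics
        labels, _, rest = line.rpartition("} ")
        name = labels.split('circuit_name="', 1)[1].split('"', 1)[0]
        value = rest.split()[0]  # ignore an optional trailing timestamp
        states[name] = STATE_NAMES[int(float(value))]
    return states
```

Feeding it the body returned by the `curl` command above would yield a dictionary such as `{"upload": "open", "download": "closed"}`, depending on which circuits the service registers.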
## Monitoring in practice
The common failure modes, in order of likelihood:
1. **Floci container restart in local dev**: circuit opens, 60s later closes as Floci recovers. `s3_circuit_breaker_state` tracks this in Prometheus; the `circuit` log lines on the api-gateway container tell the same story with more detail.
2. **ECS IAM role drift in production**: manifests as `AccessDenied` (non-retriable), surfaces as user-visible errors. Check CloudWatch for the IAM event.
3. **Region-wide S3 throttling**: very rare but catastrophic. Circuit breakers keep the service up in read-only mode while you coordinate with AWS support.
For the recipe to wire new S3 operations into this system, see [how to add a new S3 operation](../how-to/add-s3-operations.md).