The Origin Story
It started, like most things do, as a quick hack.
We needed a handful of third-party APIs to whitelist our IP address. Our main application runs on Google Cloud Run behind Cloud NAT — which means the egress IP can change at any time. Not great when a vendor says “give us a static IP or we won’t talk to you.”
The solution was dead simple: deploy a tiny proxy on Cloud Run, route its traffic through a VPC connector with a static IP, and have the main app call the proxy instead of the vendor directly. One afternoon, a Flask app, a Dockerfile, and a gcloud run deploy later — done.
Main App → Proxy (static IP via VPC connector) → Third-party API
And it worked. So we moved on.
Over the next two and a half years, every time a new vendor asked us to whitelist an IP, we’d add a line to the service URL map and redeploy. New 3PL provider? Add it. Payment API? Add it. The proxy was invisible infrastructure — it just worked.
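Conceptually, the URL map was nothing more than a dict from service name to vendor base URL. A minimal sketch (names and URLs below are invented, not our real vendors):

```python
# Invented names and URLs; the real map just grew one line per vendor
SERVICE_URL_MAP = {
    "3pl": "https://api.example-3pl.com/v2",
    "payments": "https://api.example-payments.com",
}

def resolve(service: str) -> str:
    # Onboarding a new vendor = one new entry here + a redeploy
    return SERVICE_URL_MAP[service]
```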
Until it didn’t.
The Problem
One morning, requests to completely unrelated third-party APIs started timing out. The 3PL API was fine. The payment API was fine. But our proxy was returning 504s across the board. The culprit? A single downstream service had gone unresponsive.
Here’s what was actually happening under the hood:
```
                  Cloud Run (concurrency=80)
                           │
                           ▼
                  ┌─────────────────┐
                  │    Container    │
                  │                 │
80 concurrent ──▶ │    gunicorn     │
    requests      │ (1 sync worker) │
                  │        │        │
                  │        ▼        │
                  │ requests.get()  │ ◀── BLOCKING: one request at a time
                  │                 │
                  │   79 requests   │
                  │   sitting in    │
                  │   the queue     │
                  └─────────────────┘
```
Cloud Run’s default concurrency setting is 80. That means the load balancer happily sends up to 80 simultaneous requests to a single container instance. It assumes the container can actually handle them.
Our app couldn’t. Gunicorn was running with a single sync worker. It could process exactly one request at a time. The other 79 requests would queue up inside the container — invisible to Cloud Run’s autoscaler. As far as Cloud Run was concerned, the instance wasn’t overloaded; it had accepted the connections.
When that one downstream service started hanging with 5+ second response times, the math got ugly fast. One worker, processing requests serially, each taking 5 seconds: that’s 80 × 5 = 400 seconds to drain the queue. Well beyond any reasonable timeout. Requests to perfectly healthy services were stuck waiting behind the slow one.
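The drain-time arithmetic can be written out directly, using the numbers from the incident:

```python
# Back-of-envelope numbers from the incident
concurrency = 80       # Cloud Run default: requests routed to one instance
workers = 1            # gunicorn sync workers actually doing the work
slow_response_s = 5.0  # response time of the hung downstream

drain_time_s = concurrency * slow_response_s / workers
print(drain_time_s)  # 400.0 seconds to drain one full queue
```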
The whole proxy went down because of one bad downstream.
Why Not Just Add More Workers?
The obvious first instinct: bump gunicorn workers. We actually tried this as an intermediate fix — switching to 1 worker with 8 threads and lowering the concurrency to 12 to buy ourselves time. But this was a band-aid.
Here’s why more workers isn’t the real answer:
- Memory cost. Each gunicorn worker is a full OS process (and threads, though lighter, still contend on the GIL). More workers = more memory per instance = bigger Cloud Run instances = higher cost.
- Diminishing returns. Even 8 threads only handle 8 concurrent requests. Cloud Run’s default is 80. You’re still 10x short.
- Linear scaling. Want to handle 80 concurrent? You’d need 80 threads. At that point, you’re burning resources on thread context switching instead of doing actual work.
- The proxy is I/O-bound. It does zero computation. It receives a request, forwards it, waits for a response, and sends it back. This is exactly the problem async I/O was designed for.
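The event-loop win is easy to demonstrate in isolation. A minimal asyncio sketch, where `fake_downstream` stands in for a slow API call (this is illustrative, not our proxy code):

```python
import asyncio
import time

async def fake_downstream(delay: float) -> float:
    # Stand-in for awaiting a slow third-party API over the network
    await asyncio.sleep(delay)
    return delay

async def handle_batch(n: int, delay: float) -> float:
    # n "requests" all waiting on the same slow upstream, concurrently
    start = time.monotonic()
    await asyncio.gather(*(fake_downstream(delay) for _ in range(n)))
    return time.monotonic() - start

elapsed = asyncio.run(handle_batch(80, 0.1))
# 80 overlapping 0.1s waits complete together in roughly 0.1s,
# where a single sync worker would need 80 * 0.1 = 8s
```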
The Solution: Async All The Way
The fix wasn’t a config change — it was an architecture change. We replaced the entire sync stack with an async one:
| Component | Before | After |
|---|---|---|
| Framework | Flask | FastAPI |
| Server | gunicorn (sync) | uvicorn (async) |
| HTTP client | requests (sync) | httpx (async) |
| Python | 3.11 | 3.13 |
| Package manager | poetry | uv |
With async, a single process can handle hundreds of concurrent requests. When one request is waiting on a downstream API, the event loop just… picks up the next one. No threads, no extra processes, no wasted memory.
Before: Flask + requests
```python
from flask import Flask, request
import requests

app = Flask(__name__)

@app.route("/proxy", methods=["POST"])
def proxy():
    payload = request.json
    url = SERVICE_URL_MAP[payload["service"]]
    response = requests.request(
        method=payload["method"],
        url=url,
        data=payload.get("data"),
        json=payload.get("json"),
        params=payload.get("params"),
        headers=payload.get("headers"),
        timeout=payload.get("timeout", 30),
    )
    return response.content, response.status_code
```
Every call to requests.request() blocks the entire worker until the downstream responds. Nothing else runs.
After: FastAPI + httpx
```python
import httpx
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request, Response

@asynccontextmanager
async def lifespan(app: FastAPI):
    # One shared client for the app's lifetime: pooled, keep-alive connections
    app.state.http_client = httpx.AsyncClient(
        limits=httpx.Limits(max_connections=500, max_keepalive_connections=50)
    )
    yield
    await app.state.http_client.aclose()

app = FastAPI(lifespan=lifespan)

@app.post("/proxy")
async def proxy(request: Request):
    payload = await request.json()
    url = SERVICE_URL_MAP[payload["service"]]
    client = request.app.state.http_client
    response = await client.request(
        method=payload["method"],
        url=url,
        data=payload.get("data"),
        json=payload.get("json"),
        params=payload.get("params"),
        headers=payload.get("headers"),
        timeout=httpx.Timeout(
            connect=5.0, read=payload.get("timeout", 30), write=5.0, pool=5.0
        ),
    )
    return Response(
        content=response.content,
        status_code=response.status_code,
        media_type=response.headers.get("content-type"),
    )
```
Every await is a point where the event loop can switch to another request. One slow downstream? Doesn’t matter — the other 79 requests keep making progress concurrently.
Key Design Decisions
1. Shared httpx.AsyncClient via lifespan
We create a single httpx.AsyncClient at startup and reuse it across all requests. This gives us connection pooling for free — TCP connections and TLS sessions are reused instead of being set up and torn down on every request. The client is gracefully closed on shutdown.
This uses FastAPI’s lifespan events, which replaced the older @app.on_event("startup") pattern.
2. Fixed connect timeout, caller-controlled read timeout
```python
CONNECT_TIMEOUT = 5.0  # hardcoded — infrastructure concern

timeout = httpx.Timeout(
    connect=CONNECT_TIMEOUT,
    read=payload.get("timeout", 30),  # caller decides
    write=5.0,
    pool=5.0,
)
```
Connect timeout is an infrastructure concern — if we can’t reach a service in 5 seconds, something is wrong at the network level. Read timeout varies wildly depending on the downstream API (a bulk export might take 60 seconds, a health check takes 1 second), so we let the caller specify it with a 30-second default.
3. data= vs content= in httpx
This was a subtle gotcha during migration. In requests, data= accepts both dicts (form-encodes them) and raw bytes. In httpx, there’s a distinction:
- `data=` — accepts dicts (form-encodes them) or strings/bytes
- `content=` — accepts raw bytes only; raises `TypeError` on dicts
We needed data= to preserve the same behavior as the Flask version. Using content= would have broken any caller passing form-encoded dicts — and not silently, either: it raises a TypeError at runtime, not at import time. The kind of bug you only catch in production.
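For intuition, the form-encoding that data= performs on a dict is the same transformation the standard library's urlencode does (the fields here are made up):

```python
from urllib.parse import urlencode

# Hypothetical form fields a caller might pass through the proxy
form = {"service": "demo", "plan": "pro"}
body = urlencode(form)
print(body)  # service=demo&plan=pro
```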
4. Granular error handling
The old version let exceptions bubble up as generic 500s. The new version distinguishes:
```python
except httpx.ConnectTimeout:
    return JSONResponse(status_code=504, content={"detail": "Connect timeout to upstream service"})
except httpx.ReadTimeout:
    return JSONResponse(status_code=504, content={"detail": "Read timeout from upstream service"})
except httpx.ConnectError:
    return JSONResponse(status_code=502, content={"detail": "Failed to connect to upstream service"})
except httpx.HTTPError:
    return JSONResponse(status_code=502, content={"detail": "Upstream request failed"})
```
504 for timeouts, 502 for connection failures. Callers checking response.ok still work since all are non-2xx, but now they can distinguish between “the service is slow” and “the service is down.”
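A caller-side sketch of how that distinction might be used; `classify` is hypothetical, not part of the proxy or its clients:

```python
# Hypothetical caller-side handling of the proxy's status codes
def classify(status: int) -> str:
    if status == 504:
        return "slow"   # upstream timed out: a retry with backoff may help
    if status == 502:
        return "down"   # upstream unreachable: alert or circuit-break
    return "ok" if 200 <= status < 300 else "error"

print(classify(504), classify(502), classify(200))
```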
5. Connection pool limits
```python
httpx.Limits(max_connections=500, max_keepalive_connections=50)
```
We set max_connections=500 to give headroom above Cloud Run’s concurrency setting (100). Without this, under burst traffic the pool could be exhausted and requests would queue waiting for a connection — defeating the purpose of going async. The max_keepalive_connections=50 keeps idle connections reasonable.
The Dockerfile
Before
```dockerfile
FROM python:3.11-slim
RUN pip3 install poetry
RUN poetry config virtualenvs.create false
ADD . /app
WORKDIR /app
RUN poetry install --without dev
CMD exec gunicorn --bind :$PORT -t 600 static_ip_proxy.app:app
```
Note the -t 600 — a 10-minute worker timeout. This was another duct-tape fix. Gunicorn would kill workers that didn’t respond in time, and since everything was serial, slow downstream requests would regularly trip the default 30-second timeout. So we cranked it to 10 minutes. Problem “solved.”
After
```dockerfile
FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
ADD . /app
WORKDIR /app
RUN uv venv && uv pip install . && rm -rf /root/.cache/uv
ENV PATH="/app/.venv/bin:$PATH"
CMD exec uvicorn static_ip_proxy.app:app --host 0.0.0.0 --port $PORT
```
Two things to note:
- `--host 0.0.0.0` is required for uvicorn, which binds to `127.0.0.1` by default (our old gunicorn command bound `:$PORT`, i.e. all interfaces). In a container, listening on loopback only means Cloud Run’s health checks and traffic can’t reach it.
- uv instead of poetry. Not strictly necessary for the async migration, but uv is significantly faster at dependency resolution and installs, which cuts container build times.
Benchmarking with ApacheBench
To quantify the improvement, we benchmarked both versions on Cloud Run with identical default configuration (concurrency 80, auto-scaling). We used httpbin.org’s /delay/5 endpoint as a controlled slow upstream — it waits exactly 5 seconds before responding.
Test command
```shell
echo '{"service": "httpbin", "method": "GET", "path": "delay/5"}' > /tmp/ab-payload.json

ab -n 500 -c 100 -T 'application/json' -p /tmp/ab-payload.json \
  https://<service-url>/proxy
```
500 requests, 100 concurrent — intentionally exceeding Cloud Run’s default concurrency of 80 to stress the queuing behavior.
Results
| Metric | Flask/gunicorn | FastAPI/uvicorn |
|---|---|---|
| Requests attempted | 500 | 500 |
| Requests completed | 343 | 500 |
| Failed requests | 157+ (ab timed out) | 1 |
| Total time | 19+ min (timed out) | 39.8 seconds |
| Requests/sec | ~0.3 | 12.55 |
| Median response time | N/A (timeouts) | 6.0s |
The math
Flask: gunicorn processes 1 request at a time. 80 queued requests × 5s each = 400 seconds to drain one batch. ab’s default timeout (30s) expires long before then. Requests fail, ab eventually gives up, and we sat there for 19 minutes watching it struggle.
FastAPI: All 100 concurrent requests are processed in parallel by the event loop. 500 requests / 100 concurrency = 5 batches × ~5.7s (5s delay + network overhead) = ~28.5s + TLS handshake overhead = ~39.8s total. One failure out of 500 — likely a transient network blip.
That’s a ~40x improvement in throughput on identical infrastructure.
Cloud Run Configuration
The final production deployment:
```shell
gcloud run deploy static-ip-proxy \
  --concurrency 100 \
  --min-instances 1 \
  --max-instances 10
```
- `--concurrency 100`: slightly above the default 80. The app can handle it now.
- `--min-instances 1`: no cold starts. Other services depend on this proxy — a cold start adds seconds of latency to their requests.
- `--max-instances 10`: safety cap. 10 instances × 100 concurrency = 1,000 max concurrent requests. More than enough headroom.
Migration Gotchas
A few things that bit us (or almost did) during the migration:
1. data= vs content= in httpx
Already covered above, but worth repeating: passing a dict to content= raises a TypeError at runtime, not import time. Your tests might pass if they only exercise JSON payloads; form-encoded data breaks in production.
2. FastAPI returns JSON by default
```python
@app.get("/")
async def home():
    return "Hello, World!"  # Returns: "Hello, World!" (JSON string with quotes)
```
If you’re returning plain text, use PlainTextResponse("Hello, World!") explicitly. Otherwise FastAPI’s default JSONResponse wraps your string in JSON — quotes included.
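The extra quotes are just plain JSON encoding of a bare string, which you can reproduce with the standard library:

```python
import json

# JSON-encoding a bare string is what FastAPI's default JSONResponse
# does to the returned "Hello, World!"
encoded = json.dumps("Hello, World!")
print(encoded)  # the body contains literal double quotes
```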
3. TestClient needs a context manager
```python
# Wrong — lifespan events don't fire
client = TestClient(app)
response = client.get("/ip")  # AttributeError: app.state has no attribute 'http_client'

# Right
with TestClient(app) as client:
    response = client.get("/ip")  # Works
```
TestClient(app) without with doesn’t trigger FastAPI’s lifespan events. Any app.state attributes you set in your lifespan will be missing. This is easy to miss because the error doesn’t mention lifespan at all — it just looks like a missing attribute.
4. uvicorn binds to 127.0.0.1 by default
Our old gunicorn command bound `:$PORT` (all interfaces), but uvicorn binds to 127.0.0.1 by default. In a container, this means it’s unreachable from outside. Your Cloud Run service will deploy successfully, but health checks will fail and traffic won’t route. Add --host 0.0.0.0 to your CMD.
5. .dockerignore matters
If you have a local .venv/ directory and your Dockerfile uses ADD . /app, it gets copied into the container. Then uv venv fails with “virtual environment already exists.” Add .venv/ to your .dockerignore.
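A minimal .dockerignore for this layout might look like the following (entries are illustrative; add whatever else your repo generates locally):

```
# .dockerignore — keep local build artifacts out of ADD . /app
.venv/
__pycache__/
*.pyc
.git/
```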
Conclusion
The proxy ran fine for two and a half years — through dependency bumps, new vendors, infrastructure changes. It was the kind of service you forget about because it just works. The problem was hiding in plain sight: Cloud Run thought it was distributing load across 80 concurrent slots. Gunicorn was processing them one at a time. As long as downstream APIs responded quickly, the queue drained fast enough that nobody noticed.
The fix wasn’t about Cloud Run configuration or scaling policies. It was about making the application actually honor the concurrency that Cloud Run expects. The entire migration was about 150 lines of code change:
- Flask → FastAPI
- requests → httpx
- gunicorn → uvicorn
- Sync handlers → async handlers
The result: a 40x throughput improvement on identical infrastructure, with a single process handling 100+ concurrent requests instead of serializing them.
If you’re running a sync Python service on Cloud Run (or any container platform with concurrent request routing), check whether your app can actually handle the concurrency it’s being given. The gap between “accepts connections” and “processes them concurrently” is where bottlenecks hide.