InfoQ: How GitHub Copilot Serves 400 Million Completion Requests a Day
2025.Mar https://www.infoq.com/presentations/github-copilot/
🎤 Speaker & Context
- David Cheney: open‑source contributor for Go, tech lead on “copilot‑proxy” at GitHub.
- The talk focuses on the backend architecture of Copilot’s code‑completion service: serving hundreds of millions of requests a day with low latency (< 200 ms mean) at globally distributed scale.
Speaker: David Cheney
- https://github.com/davecheney
- https://au.linkedin.com/in/davecheney
🧱 Key Requirements & Challenges
- The product: GitHub Copilot provides code completions in IDEs (VSCode, Visual Studio, IntelliJ, Neovim, Xcode).
- Scale: At the time of the talk, >400 million completion requests/day (about 4,600 requests/sec averaged over 24 hours), peaking at ~8,000 requests/sec when European and US working hours overlap. Mean response time under ~200 ms.
- Latency & user experience: Because IDE built‑in completions run locally (no network overhead), the remote service must minimize network latency, connection setup cost, etc. to feel comparable.
- Variability: Request and response sizes vary (completion context is variable), so streaming responses help deliver the first tokens sooner (illustrated below).
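
A hedged illustration of the streaming point, as a minimal Go sketch (Go is assumed here and in the sketches below, given the speaker’s background; none of this is Copilot’s actual code): the handler flushes each completion chunk as the model emits it, so the IDE can render the start of a suggestion before the full response exists. generateTokens is a hypothetical stand‑in for the model client.

```go
package main

import (
	"fmt"
	"net/http"
)

// generateTokens is a hypothetical model client that yields completion
// chunks as the model produces them.
func generateTokens(prompt string) <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch)
		for _, tok := range []string{"func ", "add(a, b int) ", "int { return a + b }"} {
			ch <- tok
		}
	}()
	return ch
}

func completionHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	// Write each chunk as soon as it arrives; with variable-sized
	// completions, the first tokens reach the IDE much earlier than
	// the end of the response would.
	for tok := range generateTokens(r.URL.Query().Get("prompt")) {
		fmt.Fprint(w, tok)
		flusher.Flush()
	}
}

func main() {
	http.HandleFunc("/v1/completions", completionHandler)
	http.ListenAndServe(":8080", nil)
}
```
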
🚀 Architectural Solutions
- Proxy + Authentication
- They built a copilot‑proxy layer: IDEs authenticate via OAuth to GitHub and get a short‑lived token (valid for minutes), which the proxy validates and exchanges for the real API key (see the sketch below). This avoids embedding keys in clients.
- Because the token is short‑lived, abuse is mitigated and accounts can be shut down quickly.
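
A minimal sketch of that token exchange, assuming a Go reverse proxy: the client presents its short‑lived token, the proxy validates it and substitutes the real API key, which clients never see. validateToken, the model host URL, and MODEL_API_KEY are illustrative assumptions, not GitHub’s implementation.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"strings"
	"time"
)

// tokenInfo is what the proxy learns by validating the short-lived token.
type tokenInfo struct {
	user    string
	expires time.Time
}

// validateToken stands in for a call to GitHub's auth service. Real
// validation would check the signature and expiry; because tokens live
// only minutes, a disabled account stops working quickly.
func validateToken(tok string) (tokenInfo, bool) {
	if tok == "" {
		return tokenInfo{}, false
	}
	return tokenInfo{user: "octocat", expires: time.Now().Add(10 * time.Minute)}, true
}

func main() {
	model, _ := url.Parse("https://model-host.example.com") // placeholder upstream
	proxy := httputil.NewSingleHostReverseProxy(model)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		tok := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if _, ok := validateToken(tok); !ok {
			http.Error(w, "invalid or expired token", http.StatusUnauthorized)
			return
		}
		// Swap the short-lived client token for the real API key,
		// which only the proxy ever holds.
		r.Header.Set("Authorization", "Bearer "+os.Getenv("MODEL_API_KEY"))
		proxy.ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil) // a real deployment would terminate TLS
}
```
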
- Connection Management & HTTP/2
- Connection setup (TCP/TLS) has significant latency: 5-6 round trips at ~50-100 ms each, especially over long distances.
- Use HTTP/2 between client ↔ proxy and proxy ↔ model host: multiplexed streams over a single connection, with the ability to cancel an individual stream while keeping the connection open (sketched below).
- Keeping connections long‑lived keeps TCP warm (handshakes already paid, congestion window grown) and reduces latency.
- GLB (GitHub Load Balancer): GitHub’s in‑house load balancer, used for connection handling in front of the proxy instances.
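
A sketch of the client‑side half of this, assuming Go’s standard‑library HTTP/2 support: one warm connection carries many multiplexed requests, and cancelling a request’s context resets only that stream (RST_STREAM) while the connection stays open for the next completion. The proxy URL is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The transport negotiates HTTP/2 over TLS and reuses the
	// connection across requests, so the multi-round-trip TCP/TLS
	// handshake is paid once, not per completion.
	client := &http.Client{Transport: &http.Transport{ForceAttemptHTTP2: true}}

	ctx, cancel := context.WithCancel(context.Background())
	req, _ := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://proxy.example.com/v1/completions", nil) // placeholder URL

	go func() {
		// If the user keeps typing, the IDE abandons this completion.
		// Cancelling resets only this stream; the shared HTTP/2
		// connection stays warm for the next request.
		time.Sleep(50 * time.Millisecond)
		cancel()
	}()

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request cancelled or failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```
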
- Global Deployment & Routing
- Models are hosted in multiple Azure regions (via the OpenAI partnership). Proxy instances are colocated in those regions.
- They use octoDNS, a DNS/routing configuration system, to direct users to the optimal region based on geography and region health (illustrated below).
- They chose not to run a “point of presence” (PoP) model with many small edge caches, because every completion must reach a model host anyway: traffic would have “tromboned” through the PoP, and the operational burden was too high.
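
The actual routing happens in DNS via octoDNS configuration, but the decision it encodes can be illustrated with a small Go sketch (region names, RTTs, and health states are invented): prefer the nearest region, and skip regions failing health checks, so a regional failure degrades latency for some users rather than causing an outage.

```go
package main

import (
	"fmt"
	"sort"
)

type region struct {
	name    string
	rttMS   int  // estimated proximity to the user
	healthy bool // result of active health checks
}

// pickRegion returns the closest healthy region, so a regional failure
// becomes a latency degradation instead of a total outage.
func pickRegion(regions []region) (region, bool) {
	sort.Slice(regions, func(i, j int) bool { return regions[i].rttMS < regions[j].rttMS })
	for _, r := range regions {
		if r.healthy {
			return r, true
		}
	}
	return region{}, false
}

func main() {
	regions := []region{
		{"eastus", 40, false}, // nearest, but failing health checks
		{"westeurope", 90, true},
		{"australiaeast", 200, true},
	}
	if r, ok := pickRegion(regions); ok {
		fmt.Println("routing to:", r.name) // westeurope
	}
}
```
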
- Client Diversity & Long Tail of Versions
- Many IDEs, many client versions: fixing things in just the client is slow (20% of users may remain on old versions indefinitely). Proxy allows fixes/degradation logic server‑side for old clients.
- The proxy also enables quick mitigation and debugging: attaching a diagnostic parameter to requests, or checking whether a request has already been cancelled before forwarding it to the model host.
- Example: a model version broke when a certain parameter was sent; the proxy mutated the request on the fly rather than waiting for a full client rollout (see the sketch below).
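
A hedged sketch of that kind of mitigation, assuming a Go proxy handler: requests from client versions known to send the breaking field are rewritten in flight before being forwarded. The field name bad_param, the X-Client-Version header, and the version list are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
)

// brokenClients lists hypothetical client versions that still send a
// field the current model version rejects.
var brokenClients = map[string]bool{"vscode/1.80.0": true}

// stripBreakingParam removes the offending field from the JSON body so
// old clients keep working without waiting for a client update.
func stripBreakingParam(r *http.Request) error {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return err
	}
	var payload map[string]any
	if err := json.Unmarshal(body, &payload); err != nil {
		return err
	}
	delete(payload, "bad_param") // hypothetical field the model host rejects
	fixed, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	r.Body = io.NopCloser(bytes.NewReader(fixed))
	r.ContentLength = int64(len(fixed))
	return nil
}

func main() {
	http.HandleFunc("/v1/completions", func(w http.ResponseWriter, r *http.Request) {
		// Only mutate requests from versions known to be broken;
		// up-to-date clients pass through untouched.
		if brokenClients[r.Header.Get("X-Client-Version")] {
			if err := stripBreakingParam(r); err != nil {
				http.Error(w, "bad request body", http.StatusBadRequest)
				return
			}
		}
		// ...then forward to the model host as in the proxy sketch above.
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```
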
📌 Key Takeaways & Lessons
- Use HTTP/2 (or another protocol that improves on HTTP/1.1) for low‑latency services.
- Spend your engineering budget deliberately: invest in the bespoke part (for Copilot, the proxy and connection management) rather than relying only on off‑the‑shelf components.
- To reduce latency globally, bring your application (or model) closer to the user (geo‑distribute) rather than assuming the network backbone will do it for free.
- Having an intermediary layer (proxy) gives you powerful flexibility: routing, cancelling, versioning, observing metrics, handling legacy clients — all without requiring client changes.
✅ Why It Was Worth It
- The engineering effort to build copilot‑proxy and the HTTP/2 long‑lived connection architecture paid off: it achieved competitive latency (approaching local IDE completions) at global scale.
- Operational benefits: regional failure becomes degradation instead of a total outage; client diversity is managed; health checks, routing, and load balancing are handled centrally.