InfoQ: How GitHub Copilot Serves 400 Million Completion Requests a Day
2025.Mar https://www.infoq.com/presentations/github-copilot/
🎤 Speaker & Context
- David Cheney: open‑source contributor for Go, tech lead on “copilot‑proxy” at GitHub.
- The talk focuses on the backend architecture of Copilot’s code‑completion service: serving hundreds of millions of requests a day with low latency (< 200 ms mean) at globally distributed scale.
Speaker: David Cheney
- https://github.com/davecheney
- https://au.linkedin.com/in/davecheney
🧱 Key Requirements & Challenges
- The product: GitHub Copilot provides code completions in IDEs (VSCode, Visual Studio, IntelliJ, Neovim, Xcode).
- Scale: At the time of the talk, >400 million completion requests/day (about 4,600 requests/sec averaged over 24 hours), peaking at ~8,000 requests/sec when European and US working hours overlap. Mean response time under ~200 ms.
- Latency & user experience: Because IDE built‑in completions run locally (no network overhead), the remote service must minimize network latency, connection setup cost, etc. to feel comparable.
- Variability: Request and response sizes vary (completion context is variable), so streaming responses help deliver the first tokens sooner (illustrated below).
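
A hedged illustration of the streaming point, as a minimal Go sketch (Go is assumed here and in the sketches below, given the speaker’s background; none of this is Copilot’s actual code): the handler flushes each completion chunk as the model emits it, so the IDE can render the start of a suggestion before the full response exists. generateTokens is a hypothetical stand‑in for the model client.

```go
package main

import (
	"fmt"
	"net/http"
)

// generateTokens is a hypothetical model client that yields completion
// chunks as the model produces them.
func generateTokens(prompt string) <-chan string {
	ch := make(chan string)
	go func() {
		defer close(ch)
		for _, tok := range []string{"func ", "add(a, b int) ", "int { return a + b }"} {
			ch <- tok
		}
	}()
	return ch
}

func completionHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	// Write each chunk as soon as it arrives; with variable-sized
	// completions, the first tokens reach the IDE much earlier than
	// the end of the response would.
	for tok := range generateTokens(r.URL.Query().Get("prompt")) {
		fmt.Fprint(w, tok)
		flusher.Flush()
	}
}

func main() {
	http.HandleFunc("/v1/completions", completionHandler)
	http.ListenAndServe(":8080", nil)
}
```
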
🚀 Architectural Solutions
- Proxy + Authentication
- They built a copilot‑proxy layer: IDEs authenticate via OAuth to GitHub and get a short‑lived token (valid for minutes), which the proxy validates and exchanges for the real API key (see the sketch below). This avoids embedding keys in clients.
- Because the token is short‑lived, abuse is mitigated and accounts can be shut down quickly.
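
A minimal sketch of that token exchange, assuming a Go reverse proxy: the client presents its short‑lived token, the proxy validates it and substitutes the real API key, which clients never see. validateToken, the model host URL, and MODEL_API_KEY are illustrative assumptions, not GitHub’s implementation.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"strings"
	"time"
)

// tokenInfo is what the proxy learns by validating the short-lived token.
type tokenInfo struct {
	user    string
	expires time.Time
}

// validateToken stands in for a call to GitHub's auth service. Real
// validation would check the signature and expiry; because tokens live
// only minutes, a disabled account stops working quickly.
func validateToken(tok string) (tokenInfo, bool) {
	if tok == "" {
		return tokenInfo{}, false
	}
	return tokenInfo{user: "octocat", expires: time.Now().Add(10 * time.Minute)}, true
}

func main() {
	model, _ := url.Parse("https://model-host.example.com") // placeholder upstream
	proxy := httputil.NewSingleHostReverseProxy(model)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		tok := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if _, ok := validateToken(tok); !ok {
			http.Error(w, "invalid or expired token", http.StatusUnauthorized)
			return
		}
		// Swap the short-lived client token for the real API key,
		// which only the proxy ever holds.
		r.Header.Set("Authorization", "Bearer "+os.Getenv("MODEL_API_KEY"))
		proxy.ServeHTTP(w, r)
	})
	http.ListenAndServe(":8080", nil) // a real deployment would terminate TLS
}
```
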
- Connection Management & HTTP/2
- Connection setup (TCP/TLS) has significant latency: 5-6 round trips at ~50-100 ms each, especially over long distances.
- Use HTTP/2 between client ↔ proxy and proxy ↔ model host: multiplexed streams over a single connection, with the ability to cancel an individual stream while keeping the connection open (sketched below).
- Keeping connections long‑lived keeps TCP warm (handshakes already paid, congestion window grown) and reduces latency.
- GLB (GitHub Load Balancer): GitHub’s in‑house load balancer, used for connection handling in front of the proxy instances.
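
A sketch of the client‑side half of this, assuming Go’s standard‑library HTTP/2 support: one warm connection carries many multiplexed requests, and cancelling a request’s context resets only that stream (RST_STREAM) while the connection stays open for the next completion. The proxy URL is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// The transport negotiates HTTP/2 over TLS and reuses the
	// connection across requests, so the multi-round-trip TCP/TLS
	// handshake is paid once, not per completion.
	client := &http.Client{Transport: &http.Transport{ForceAttemptHTTP2: true}}

	ctx, cancel := context.WithCancel(context.Background())
	req, _ := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://proxy.example.com/v1/completions", nil) // placeholder URL

	go func() {
		// If the user keeps typing, the IDE abandons this completion.
		// Cancelling resets only this stream; the shared HTTP/2
		// connection stays warm for the next request.
		time.Sleep(50 * time.Millisecond)
		cancel()
	}()

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request cancelled or failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```
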
- Global Deployment & Routing
- Models are hosted in multiple Azure regions (via the OpenAI partnership). Proxy instances are colocated in those regions.
- They use octoDNS, a DNS/routing configuration system, to direct users to the optimal region based on geography and region health (illustrated below).
- They chose not to run a “point of presence” (PoP) model with many small edge caches, because every completion must reach a model host anyway: traffic would have “tromboned” through the PoP, and the operational burden was too high.
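
The actual routing happens in DNS via octoDNS configuration, but the decision it encodes can be illustrated with a small Go sketch (region names, RTTs, and health states are invented): prefer the nearest region, and skip regions failing health checks, so a regional failure degrades latency for some users rather than causing an outage.

```go
package main

import (
	"fmt"
	"sort"
)

type region struct {
	name    string
	rttMS   int  // estimated proximity to the user
	healthy bool // result of active health checks
}

// pickRegion returns the closest healthy region, so a regional failure
// becomes a latency degradation instead of a total outage.
func pickRegion(regions []region) (region, bool) {
	sort.Slice(regions, func(i, j int) bool { return regions[i].rttMS < regions[j].rttMS })
	for _, r := range regions {
		if r.healthy {
			return r, true
		}
	}
	return region{}, false
}

func main() {
	regions := []region{
		{"eastus", 40, false}, // nearest, but failing health checks
		{"westeurope", 90, true},
		{"australiaeast", 200, true},
	}
	if r, ok := pickRegion(regions); ok {
		fmt.Println("routing to:", r.name) // westeurope
	}
}
```
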
- Client Diversity & Long Tail of Versions
- Many IDEs, many client versions: fixing things in just the client is slow (20% of users may remain on old versions indefinitely). Proxy allows fixes/degradation logic server‑side for old clients.
- The proxy also enables quick mitigation and debugging: attaching a diagnostic parameter to requests, or checking whether a request has already been cancelled before forwarding it to the model host.
- Example: a model version broke when a certain parameter was sent; the proxy mutated the request on the fly rather than waiting for a full client rollout (see the sketch below).
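
A hedged sketch of that kind of mitigation, assuming a Go proxy handler: requests from client versions known to send the breaking field are rewritten in flight before being forwarded. The field name bad_param, the X-Client-Version header, and the version list are all hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"net/http"
)

// brokenClients lists hypothetical client versions that still send a
// field the current model version rejects.
var brokenClients = map[string]bool{"vscode/1.80.0": true}

// stripBreakingParam removes the offending field from the JSON body so
// old clients keep working without waiting for a client update.
func stripBreakingParam(r *http.Request) error {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return err
	}
	var payload map[string]any
	if err := json.Unmarshal(body, &payload); err != nil {
		return err
	}
	delete(payload, "bad_param") // hypothetical field the model host rejects
	fixed, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	r.Body = io.NopCloser(bytes.NewReader(fixed))
	r.ContentLength = int64(len(fixed))
	return nil
}

func main() {
	http.HandleFunc("/v1/completions", func(w http.ResponseWriter, r *http.Request) {
		// Only mutate requests from versions known to be broken;
		// up-to-date clients pass through untouched.
		if brokenClients[r.Header.Get("X-Client-Version")] {
			if err := stripBreakingParam(r); err != nil {
				http.Error(w, "bad request body", http.StatusBadRequest)
				return
			}
		}
		// ...then forward to the model host as in the proxy sketch above.
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```
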
📌 Key Takeaways & Lessons
- Use HTTP/2 (or another protocol that improves on HTTP/1.1) for low‑latency services.
- Spend your engineering budget deliberately: invest in the bespoke part (for Copilot, the proxy and connection management) rather than relying only on off‑the‑shelf components.
- To reduce latency globally, bring your application (or model) closer to the user (geo‑distribute) rather than assuming the network backbone will do it for free.
- Having an intermediary layer (proxy) gives you powerful flexibility: routing, cancelling, versioning, observing metrics, handling legacy clients — all without requiring client changes.
✅ Why It Was Worth It
- The engineering effort to build copilot‑proxy and the HTTP/2 long‑lived connection architecture paid off: it achieved competitive latency (approaching local IDE completions) at global scale.
- Operational benefits: regional failure becomes degradation instead of a total outage; client diversity is managed; health checks, routing, and load balancing are handled centrally.