InfoQ: How GitHub Copilot Serves 400 Million Completion Requests a Day

March 2025 · https://www.infoq.com/presentations/github-copilot/

🎤 Speaker & Context

  • David Cheney: Go open‑source contributor; tech lead on “copilot‑proxy” at GitHub.
  • The talk focuses on the backend architecture of Copilot’s code‑completion service: handling hundreds of millions of daily requests with low latency (< 200 ms) at globally distributed scale.

Speaker: David Cheney

  • https://github.com/davecheney
  • https://au.linkedin.com/in/davecheney

🧱 Key Requirements & Challenges

  • The product: GitHub Copilot provides code completions in IDEs (VSCode, Visual Studio, IntelliJ, Neovim, Xcode).
  • Scale: at the time of the talk, over 400 million completion requests per day, peaking at ~8,000 requests/sec during the hours when the European and US working days overlap; mean response time under ~200 ms.
  • Latency & user experience: IDE built‑in completions run locally with no network overhead, so the service must minimize network latency, connection setup cost, etc., to feel comparably fast.
  • Variability: request and response sizes vary (the completion context differs per request), so streaming responses help keep perceived latency low (a minimal sketch follows this list).
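
As a rough illustration of the streaming point, here is a minimal Go sketch (the URL is a placeholder, not a real endpoint): the consumer renders chunks as they arrive, so perceived latency tracks the first token rather than the last.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
)

func main() {
	// Placeholder URL; Copilot's real endpoint and wire format are not
	// described in the talk summary above.
	resp, err := http.Get("https://copilot-proxy.example.com/v1/completions")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Consume the body incrementally: each chunk can be shown to the
	// user immediately instead of waiting for the full completion.
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		fmt.Println("chunk:", sc.Text())
	}
}
```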

🚀 Architectural Solutions

  1. Proxy + Authentication

    • They built a copilot‑proxy layer: IDEs authenticate via OAuth to GitHub and receive a short‑lived token (valid for minutes) which the proxy validates and exchanges for an API key, so keys are never embedded in clients (see the first sketch after this list).
    • Because the token is short‑lived, abuse is mitigated and offending accounts can be shut down quickly.
  2. Connection Management & HTTP/2

    • Connection setup (TCP + TLS handshakes) costs 5‑6 round trips; at ~50‑100 ms per round trip over long distances, that is several hundred milliseconds before the first byte is sent.
    • HTTP/2 is used on both hops, client ↔ proxy and proxy ↔ model host: many streams are multiplexed over a single connection, and an individual stream can be cancelled while the connection stays open.
    • Keeping connections long‑lived keeps TCP warm, so the setup cost is paid once rather than per request (see the second sketch after this list).
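
The first sketch below shows the token‑exchange idea from item 1 in Go. The endpoint path, the header handling, and the validateToken stub are illustrative assumptions, not Copilot’s actual API; the point is that the real model API key stays server‑side.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
	"time"
)

// In practice the key would be injected from secret storage.
var modelAPIKey = "..."

// validateToken stands in for verifying the short-lived token's signature
// and expiry; per the talk, tokens live for minutes, which limits abuse.
func validateToken(tok string) bool {
	return tok != "" // real code would check signature and expiry
}

func main() {
	modelHost, _ := url.Parse("https://model.example.internal") // hypothetical
	proxy := httputil.NewSingleHostReverseProxy(modelHost)

	http.HandleFunc("/v1/completions", func(w http.ResponseWriter, r *http.Request) {
		tok := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		if !validateToken(tok) {
			http.Error(w, "token expired or invalid", http.StatusUnauthorized)
			return
		}
		// Exchange: swap the client's short-lived token for the real
		// API key, which never leaves the server side.
		r.Header.Set("Authorization", "Bearer "+modelAPIKey)
		proxy.ServeHTTP(w, r)
	})

	srv := &http.Server{Addr: ":8443", ReadHeaderTimeout: 5 * time.Second}
	srv.ListenAndServe() // TLS configuration omitted for brevity
}
```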
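
The second sketch shows the HTTP/2 behaviour item 2 relies on: Go’s standard client negotiates HTTP/2 over TLS and pools connections, so a warm connection is reused across requests, and cancelling a context aborts only that stream (RST_STREAM) while the connection stays open. The URL is a placeholder.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// One shared client: its transport speaks HTTP/2 over TLS and keeps the
// connection warm, so later requests skip the 5-6 setup round trips.
var client = &http.Client{Timeout: 30 * time.Second}

func complete(ctx context.Context, prompt string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"https://copilot-proxy.example.com/v1/completions", nil) // placeholder
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err // a cancelled ctx kills only this stream
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// Typical editor behaviour: the user keeps typing, so the in-flight
	// completion is cancelled and a fresh one is fired immediately.
	ctx, cancel := context.WithCancel(context.Background())
	go complete(ctx, "func main() {")
	cancel() // aborts that stream; the warm connection stays open
	out, _ := complete(context.Background(), "func main() {\n\tfmt.")
	fmt.Println(out)
}
```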

  3. Global Deployment & Routing

    • GLB, the GitHub Load Balancer, provides the load‑balancing layer in front of the service.
    • Models are hosted in multiple Azure regions (via the OpenAI partnership); proxy instances are colocated in those regions.
    • A routing configuration system (octoDNS) steers users to the optimal region based on geography and region health (a toy sketch of this decision follows this list).
    • They chose not to run a “point of presence” (PoP) model with many small edge caches: every completion must travel back to a model host anyway, so traffic would have “tromboned” through the PoP while adding operational burden.
  4. Client Diversity & Long Tail of Versions

    • Many IDEs and many client versions mean fixing things purely in the client is slow (20% of users may stay on old versions indefinitely). The proxy lets fixes and degradation logic for old clients live server‑side.
      • It also enables quick mitigations and debugging (e.g., attaching a parameter to requests) and can check whether a request has already been cancelled before forwarding it to the model.
    • Example: a model version broke when a certain parameter was sent; the proxy mutated the request on the fly rather than waiting for a full client rollout (see the second sketch after this list).
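
The first sketch below is a toy Go illustration of the routing decision in item 3: prefer the lowest‑latency region, skip regions failing health checks, and degrade rather than fail when nothing is healthy. In reality this happens at the DNS layer via octoDNS configuration, not application code; the regions and latencies here are invented.

```go
package main

import "fmt"

type region struct {
	name    string
	healthy bool
	rttMS   int // estimated round-trip time from this user
}

// pickRegion returns the lowest-latency healthy region, falling back to
// the overall nearest one so a bad region degrades service rather than
// causing a total outage.
func pickRegion(regions []region) region {
	best := region{rttMS: 1 << 30}
	bestAny := best
	for _, r := range regions {
		if r.rttMS < bestAny.rttMS {
			bestAny = r
		}
		if r.healthy && r.rttMS < best.rttMS {
			best = r
		}
	}
	if best.name == "" {
		return bestAny // no healthy region: degrade, don't fail
	}
	return best
}

func main() {
	regions := []region{
		{"eu-west", true, 25},
		{"us-east", true, 90},
		{"us-west", false, 150},
	}
	fmt.Println(pickRegion(regions).name) // eu-west
}
```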
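
The second sketch shows the in‑flight mutation pattern from the example in item 4, assuming a JSON request body. The offending parameter name ("logprobs") and the helper are hypothetical; the talk summary does not name the actual field.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// mutateLegacyRequest rewrites a request from an old client before it is
// forwarded: here it strips a field a newer model version rejects.
func mutateLegacyRequest(r *http.Request) error {
	// Bail out early if the editor already cancelled this completion;
	// there is no point forwarding work the user no longer wants.
	if err := r.Context().Err(); err != nil {
		return err
	}
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return err
	}
	var payload map[string]any
	if err := json.Unmarshal(body, &payload); err != nil {
		return err
	}
	delete(payload, "logprobs") // hypothetical offending field
	fixed, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	r.Body = io.NopCloser(bytes.NewReader(fixed))
	r.ContentLength = int64(len(fixed))
	return nil
}

func main() {
	req, _ := http.NewRequest(http.MethodPost, "https://example.com",
		strings.NewReader(`{"logprobs":2,"prompt":"x"}`))
	if err := mutateLegacyRequest(req); err == nil {
		out, _ := io.ReadAll(req.Body)
		fmt.Println(string(out)) // {"prompt":"x"}
	}
}
```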

📌 Key Takeaways & Lessons

  • Use HTTP/2 (or another protocol better than HTTP/1.1) for low‑latency services.
  • Decide where your engineering budget is best spent: invest in the bespoke part (for Copilot, the proxy and connection management) rather than relying only on off‑the‑shelf components.
  • To reduce latency globally, bring your application (or model) closer to the user (geo‑distribute) rather than assuming the network backbone will do it for free.
  • Having an intermediary layer (proxy) gives you powerful flexibility: routing, cancelling, versioning, observing metrics, handling legacy clients — all without requiring client changes.

✅ Why It Was Worth It

  • The engineering effort to build copilot‑proxy and the HTTP/2 long‑lived‑connection architecture paid off: it delivered latency approaching that of local IDE completions at global scale.
  • Operational benefits: a regional failure becomes degradation rather than a total outage; client diversity is managed server‑side; health checks, routing, and load balancing are handled centrally.