    Scaling PostgreSQL to Power 800 Million ChatGPT Users

By Samuel Alejandro | February 22, 2026

    PostgreSQL has long served as a critical backend data system for core products such as ChatGPT and OpenAI’s API. With rapid user growth, database demands have escalated significantly. Over the last year, PostgreSQL load has increased by over 10x and continues its rapid ascent.

Efforts to advance the production infrastructure to sustain this growth revealed a new insight: PostgreSQL can be scaled to reliably support much larger read-heavy workloads than many previously thought possible. PostgreSQL, a system originally created at the University of California, Berkeley, now supports this massive global traffic with a single primary Azure PostgreSQL flexible server instance and nearly 50 read replicas spread across multiple regions worldwide. This is the story of how PostgreSQL has been scaled to support millions of queries per second for 800 million users through rigorous optimization and solid engineering, along with key takeaways learned along the way.

    Cracks in the initial design

    After ChatGPT’s launch, traffic grew at an unprecedented rate. To support this, extensive optimizations were rapidly implemented at both the application and PostgreSQL database layers. Scaling involved increasing instance sizes and adding more read replicas. This architecture has performed well for a long time, and with ongoing improvements, it continues to provide ample runway for future growth.

It may seem surprising that a single-primary architecture can meet the demands of OpenAI’s scale; however, making this work in practice is not simple. Several service-impacting events (SEVs) have been observed due to Postgres overload, often following a consistent pattern: an upstream issue causes a sudden spike in database load. Examples include widespread cache misses from a caching-layer failure, a surge of expensive multi-way joins saturating CPU, or a write storm from a new feature launch. As resource utilization climbs, query latency rises and requests begin to time out. Retries then amplify the load further, triggering a vicious cycle that can degrade ChatGPT and the API as a whole.

[Diagram: scaling load]

    Although PostgreSQL scales well for read-heavy workloads, challenges still arise during periods of high write traffic. This is largely due to PostgreSQL’s multiversion concurrency control (MVCC) implementation, which makes it less efficient for write-heavy workloads. For instance, when a query updates a tuple or even a single field, the entire row is copied to create a new version. Under heavy write loads, this results in significant write amplification. It also increases read amplification, as queries must scan through multiple tuple versions (dead tuples) to retrieve the latest one. MVCC introduces additional challenges such as table and index bloat, increased index maintenance overhead, and complex autovacuum tuning. A deep-dive on these issues can be found in a blog co-written with Prof. Andy Pavlo at Carnegie Mellon University, titled The Part of PostgreSQL We Hate the Most, which is cited in the PostgreSQL Wikipedia page.
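
Because dead-tuple buildup is the practical symptom of MVCC overhead, it helps to watch it directly. Below is a minimal monitoring sketch in Python using psycopg2 and PostgreSQL's built-in pg_stat_user_tables statistics view; the DSN and the 20% attention threshold are illustrative assumptions, not values from the system described here.

```python
import psycopg2

# Sketch: watch dead-tuple accumulation caused by MVCC row versioning.
# The DSN and the 20% "needs attention" threshold are illustrative assumptions.
BLOAT_QUERY = """
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;
"""

def report_dead_tuples(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOAT_QUERY)
        for relname, live, dead, last_autovacuum in cur.fetchall():
            total = live + dead
            ratio = dead / total if total else 0.0
            flag = "  <-- high dead-tuple ratio" if ratio > 0.2 else ""
            print(f"{relname}: {dead} dead / {live} live "
                  f"(last autovacuum: {last_autovacuum}){flag}")
```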

    Scaling PostgreSQL to millions of QPS

    To mitigate these limitations and reduce write pressure, shardable (i.e., horizontally partitionable), write-heavy workloads have been, and continue to be, migrated to sharded systems such as Azure Cosmos DB. Application logic is optimized to minimize unnecessary writes. Adding new tables to the current PostgreSQL deployment is no longer permitted, with new workloads defaulting to sharded systems.

    Even as the infrastructure has evolved, PostgreSQL has remained unsharded, with a single primary instance serving all writes. The primary rationale is that sharding existing application workloads would be highly complex and time-consuming, requiring changes to hundreds of application endpoints and potentially taking months or even years. Since workloads are primarily read-heavy, and extensive optimizations have been implemented, the current architecture still provides ample headroom to support continued traffic growth. While sharding PostgreSQL in the future is not ruled out, it is not a near-term priority given the sufficient runway for current and future growth.

    The following sections detail the challenges faced and the extensive optimizations implemented to address them and prevent future outages, pushing PostgreSQL to its limits and scaling it to millions of queries per second (QPS).

    Reducing load on the primary

    Challenge: With only one writer, a single-primary setup cannot scale writes. Heavy write spikes can quickly overload the primary and impact services like ChatGPT and the API.

    Solution: Load on the primary is minimized as much as possible—both reads and writes—to ensure sufficient capacity for write spikes. Read traffic is offloaded to replicas wherever feasible. However, some read queries must remain on the primary as they are part of write transactions; for these, the focus is on efficiency and avoiding slow queries. For write traffic, shardable, write-heavy workloads have been migrated to sharded systems such as Azure CosmosDB. Workloads that are harder to shard but still generate high write volume take longer to migrate, and that process is ongoing. Applications are also aggressively optimized to reduce write load; for example, application bugs causing redundant writes have been fixed, and lazy writes introduced where appropriate to smooth traffic spikes. Additionally, when backfilling table fields, strict rate limits are enforced to prevent excessive write pressure.
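
One plausible shape of the read/write routing described above is sketched below in Python with psycopg2: reads go to a regional replica unless they belong to a write transaction, in which case they stay on the primary. The DSNs and the in_write_txn flag are hypothetical; at this scale the routing would normally live in a shared data-access layer or driver rather than per-call code.

```python
import random
import psycopg2

# Hypothetical connection strings; real routing would come from service configuration.
PRIMARY_DSN = "host=pg-primary.internal dbname=app"
REPLICA_DSNS = [
    "host=pg-replica-eastus-1.internal dbname=app",
    "host=pg-replica-eastus-2.internal dbname=app",
]

def get_connection(read_only: bool, in_write_txn: bool = False):
    """Route reads to a replica unless they are part of a write transaction."""
    if read_only and not in_write_txn:
        dsn = random.choice(REPLICA_DSNS)  # offload reads from the primary
    else:
        dsn = PRIMARY_DSN                  # writes, and reads inside write transactions
    return psycopg2.connect(dsn)
```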

    Query optimization

    Challenge: Several expensive queries were identified in PostgreSQL. In the past, sudden volume spikes in these queries would consume large amounts of CPU, slowing both ChatGPT and API requests.

Solution: A few expensive queries, such as those joining many tables together, can significantly degrade or even bring down the entire service. Continuous optimization of PostgreSQL queries is necessary to ensure efficiency and avoid common Online Transaction Processing (OLTP) anti-patterns. For example, an extremely costly query that joined 12 tables was once identified, and spikes in this query were responsible for past high-severity SEVs. Complex multi-table joins should be avoided whenever possible; if joins are necessary, consider breaking down the query and moving complex join logic to the application layer instead. Many problematic queries are generated by Object-Relational Mapping (ORM) frameworks, so careful review of the SQL they produce is important to ensure expected behavior. Long-running idle-in-transaction sessions are also common in PostgreSQL; configuring timeouts such as idle_in_transaction_session_timeout is essential to prevent them from blocking autovacuum.
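
As a concrete illustration of the timeout guidance above, the sketch below opens a connection with session-level guardrails. The idle_in_transaction_session_timeout setting is the one named in this section; statement_timeout and both values (5s / 60s) are added assumptions for the sketch, and the same limits could instead be applied per role with ALTER ROLE ... SET.

```python
import psycopg2

# Illustrative session guardrails; the 5s / 60s values are assumptions.
SESSION_GUARDRAILS = [
    "SET statement_timeout = '5s'",                     # cancel runaway queries
    "SET idle_in_transaction_session_timeout = '60s'",  # drop sessions left idle in a transaction
]

def open_guarded_connection(dsn: str):
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        for statement in SESSION_GUARDRAILS:
            cur.execute(statement)
    conn.commit()
    return conn
```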

    Single point of failure mitigation

    Challenge: If a read replica goes down, traffic can still be routed to other replicas. However, relying on a single writer means having a single point of failure—if it goes down, the entire service is affected.

Solution: Most critical requests involve only read queries. To mitigate the single point of failure at the primary, those reads were offloaded from the writer to replicas, ensuring that these requests can continue to be served even if the primary goes down. While write operations would still fail, the impact is reduced; it is no longer a SEV-0 since reads remain available.

    To mitigate primary failures, the primary runs in High-Availability (HA) mode with a hot standby, a continuously synchronized replica that is always ready to take over serving traffic. If the primary goes down or needs to be taken offline for maintenance, the standby can be quickly promoted to minimize downtime. The Azure PostgreSQL team has done significant work to ensure these failovers remain safe and reliable even under very high load. To handle read replica failures, multiple replicas are deployed in each region with sufficient capacity headroom, ensuring that a single replica failure does not lead to a regional outage.
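
On the client side, tolerating the loss of a single replica can be as simple as trying the next one. The sketch below is an illustrative Python example of that pattern; real deployments would more likely rely on the load balancer or proxy layer described later, and the two-second connect timeout is an assumption.

```python
import psycopg2
from psycopg2 import OperationalError

def read_with_failover(replica_dsns, sql, params=None):
    """Try each replica in turn; a single replica failure should not fail the read."""
    last_error = None
    for dsn in replica_dsns:
        try:
            conn = psycopg2.connect(dsn, connect_timeout=2)  # fail fast on a dead replica
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except OperationalError as exc:
            last_error = exc  # remember the failure and move on to the next replica
    raise last_error
```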

    Workload isolation

    Challenge: Situations often arise where certain requests consume a disproportionate amount of resources on PostgreSQL instances. This can lead to degraded performance for other workloads running on the same instances. For example, a new feature launch can introduce inefficient queries that heavily consume PostgreSQL CPU, slowing down requests for other critical features.

    Solution: To mitigate the “noisy neighbor” problem, workloads are isolated onto dedicated instances to ensure that sudden spikes in resource-intensive requests do not impact other traffic. Specifically, requests are split into low-priority and high-priority tiers and routed to separate instances. This way, even if a low-priority workload becomes resource-intensive, it will not degrade the performance of high-priority requests. The same strategy is applied across different products and services, so that activity from one product does not affect the performance or reliability of another.
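
A minimal sketch of the priority-tier routing idea follows: requests are classified into tiers, and each tier only ever touches its own pool of instances. The tier names, endpoint classification, and DSNs are all hypothetical.

```python
# Hypothetical pools: high-priority traffic never shares instances with low-priority traffic.
INSTANCE_POOLS = {
    "high": ["host=pg-replica-hi-1.internal", "host=pg-replica-hi-2.internal"],
    "low": ["host=pg-replica-lo-1.internal"],
}

# Illustrative classification; in practice this would be driven by service metadata.
LOW_PRIORITY_ENDPOINTS = {"/batch/export", "/analytics/summary"}

def pool_for_request(endpoint: str) -> list[str]:
    """Return the isolated instance pool that this endpoint is allowed to use."""
    tier = "low" if endpoint in LOW_PRIORITY_ENDPOINTS else "high"
    return INSTANCE_POOLS[tier]
```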

    Connection pooling

    Challenge: Each instance has a maximum connection limit (5,000 in Azure PostgreSQL). It is easy to run out of connections or accumulate too many idle ones. Incidents have previously occurred due to connection storms that exhausted all available connections.

Solution: PgBouncer was deployed as a proxy layer to pool database connections. Running it in statement or transaction pooling mode allows efficient reuse of connections, greatly reducing the number of connections the database itself must hold open. This also cuts connection setup latency: in benchmarks, the average connection time dropped from 50 milliseconds (ms) to 5 ms. Inter-region connections and requests can be expensive, so the proxy, clients, and replicas are co-located in the same region to minimize network overhead and connection hold time. Moreover, PgBouncer must be configured carefully; settings such as idle timeouts are critical to preventing connection exhaustion.

[Diagram: PostgreSQL proxy layer]

    Each read replica has its own Kubernetes deployment running multiple PgBouncer pods. Multiple Kubernetes deployments operate behind the same Kubernetes Service, which load-balances traffic across pods.
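
From the application's point of view, pooling mostly changes where it connects. The sketch below assumes a regional PgBouncer endpoint on its default port 6432 running in transaction pooling mode; the hostname and database/user names are hypothetical.

```python
import psycopg2

# Connect to the regional PgBouncer endpoint (default port 6432), not to PostgreSQL directly.
# The pooler is assumed to run in transaction pooling mode (pool_mode = transaction), so
# session state such as temporary tables or session-level SET should not be relied on.
PGBOUNCER_DSN = "host=pgbouncer.eastus.internal port=6432 dbname=app user=app_ro"

def run_read(sql: str, params=None):
    with psycopg2.connect(PGBOUNCER_DSN) as conn, conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()
```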

    Caching

    Challenge: A sudden spike in cache misses can trigger a surge of reads on the PostgreSQL database, saturating CPU and slowing user requests.

    Solution: To reduce read pressure on PostgreSQL, a caching layer is used to serve most of the read traffic. However, when cache hit rates drop unexpectedly, the burst of cache misses can push a large volume of requests directly to PostgreSQL. This sudden increase in database reads consumes significant resources, slowing down the service. To prevent overload during cache-miss storms, a cache locking (and leasing) mechanism is implemented so that only a single reader that misses on a particular key fetches the data from PostgreSQL. When multiple requests miss on the same cache key, only one request acquires the lock and proceeds to retrieve the data and repopulate the cache. All other requests wait for the cache to be updated rather than all hitting PostgreSQL at once. This significantly reduces redundant database reads and protects the system from cascading load spikes.
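
The sketch below illustrates the lock-and-lease idea using Redis as a stand-in cache: on a miss, only the request that wins a short-lived lock queries PostgreSQL, while the others briefly poll the cache. The key naming, TTLs, wait loop, and the fetch_from_postgres callable are all illustrative assumptions rather than the production implementation.

```python
import json
import time
import redis

r = redis.Redis()  # assumes a reachable Redis; used here purely for illustration

def cached_read(key: str, fetch_from_postgres, ttl: int = 300, lock_ttl: int = 5):
    """On a cache miss, only the lock holder queries PostgreSQL; others wait and retry."""
    value = r.get(key)
    if value is not None:
        return json.loads(value)

    # NX + EX: exactly one of the concurrent missers acquires this short-lived lock.
    if r.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):
        try:
            result = fetch_from_postgres()          # the single database read
            r.set(key, json.dumps(result), ex=ttl)  # repopulate the cache for everyone else
            return result
        finally:
            r.delete(f"lock:{key}")

    # Lost the race: poll briefly for the winner to repopulate the cache.
    for _ in range(20):
        time.sleep(0.05)
        value = r.get(key)
        if value is not None:
            return json.loads(value)
    return fetch_from_postgres()  # fallback if the lock holder was too slow
```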

    Scaling read replicas

    Challenge: The primary streams Write Ahead Log (WAL) data to every read replica. As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on both network bandwidth and CPU. This causes higher and more unstable replica lag, which makes the system harder to scale reliably.

Solution: Nearly 50 read replicas operate across multiple geographic regions to minimize latency. However, with the current architecture, the primary must stream WAL to every replica. Although this currently scales well with very large instance types and high network bandwidth, replicas cannot be added indefinitely without eventually overloading the primary. To address this, collaboration with the Azure PostgreSQL team is underway on cascading replication, where intermediate replicas relay WAL to downstream replicas. This approach allows scaling to potentially over a hundred replicas without overwhelming the primary, though it also introduces additional operational complexity, particularly around failover management. The feature is still in testing and will be rolled out to production only once it has proven robust and capable of safe failover.

[Diagram: PostgreSQL cascading replication]
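
Whether replicas hang directly off the primary or off an intermediate relay, replication lag is the signal to watch. The sketch below queries PostgreSQL's pg_stat_replication view on the upstream server to report per-standby replay lag; the ordering and output format are illustrative.

```python
import psycopg2

LAG_QUERY = """
    SELECT application_name, client_addr, state, replay_lag
    FROM pg_stat_replication
    ORDER BY replay_lag DESC NULLS LAST;
"""

def report_replica_lag(upstream_dsn: str) -> None:
    # Run against the primary (or an intermediate relay): one row per attached standby.
    with psycopg2.connect(upstream_dsn) as conn, conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        for name, addr, state, replay_lag in cur.fetchall():
            print(f"{name} ({addr}) state={state} replay_lag={replay_lag}")
```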

Rate limiting

    Challenge: A sudden traffic spike on specific endpoints, a surge of expensive queries, or a retry storm can quickly exhaust critical resources such as CPU, I/O, and connections, which causes widespread service degradation.

    Solution: Rate-limiting is implemented across multiple layers—application, connection pooler, proxy, and query—to prevent sudden traffic spikes from overwhelming database instances and triggering cascading failures. It is also crucial to avoid overly short retry intervals, which can trigger retry storms. The ORM layer was enhanced to support rate limiting and, when necessary, fully block specific query digests. This targeted form of load shedding enables rapid recovery from sudden surges of expensive queries.
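
Two of the ideas in this section are sketched below: a token bucket keyed by query digest that can shed or fully block a specific query shape, and retries with exponential backoff plus jitter so that clients do not synchronize into a retry storm. The class, limits, and retry parameters are illustrative, not the production rate limiter.

```python
import random
import time

class DigestRateLimiter:
    """Illustrative token bucket keyed by query digest; a limit of 0 blocks the digest."""

    def __init__(self, limits: dict[str, float]):
        self.limits = limits  # digest -> allowed queries per second
        self.tokens = dict(limits)
        self.last = {digest: time.monotonic() for digest in limits}

    def allow(self, digest: str) -> bool:
        rate = self.limits.get(digest)
        if rate is None:
            return True  # digests without an explicit limit are not throttled here
        now = time.monotonic()
        self.tokens[digest] = min(rate, self.tokens[digest] + (now - self.last[digest]) * rate)
        self.last[digest] = now
        if self.tokens[digest] >= 1:
            self.tokens[digest] -= 1
            return True
        return False  # shed this query

def retry_with_backoff(operation, attempts: int = 4, base_delay: float = 0.1):
    """Exponential backoff with jitter so synchronized retries do not amplify load."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```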

Schema management

    Challenge: Even a small schema change, such as altering a column type, can trigger a full table rewrite. Therefore, schema changes are applied cautiously—limiting them to lightweight operations and avoiding any that rewrite entire tables.

Solution: Only lightweight schema changes are permitted, such as adding or removing columns in ways that do not trigger a full table rewrite. A strict 5-second timeout on schema changes is enforced. Creating and dropping indexes concurrently is allowed. Schema changes are restricted to existing tables: if a new feature requires additional tables, they must be created in sharded systems such as Azure Cosmos DB rather than PostgreSQL. When backfilling a table field, strict rate limits are applied to prevent write spikes. Although this process can sometimes take over a week, it ensures stability and avoids production impact.
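
The sketch below shows one way to apply a lightweight change under the guardrails described here: a short lock_timeout so DDL fails fast instead of queuing behind long-running queries, a nullable column addition that does not rewrite the table, and a concurrent index build. The 5-second value mirrors the timeout mentioned above; the table and index names, and relaxing statement_timeout for the concurrent build, are assumptions.

```python
import psycopg2

def apply_lightweight_change(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block
    with conn.cursor() as cur:
        # Fail fast instead of queuing behind long-running queries (mirrors the 5s rule).
        cur.execute("SET lock_timeout = '5s'")
        cur.execute("SET statement_timeout = '5s'")

        # Lightweight change: adding a nullable column does not rewrite the table.
        cur.execute("ALTER TABLE messages ADD COLUMN IF NOT EXISTS archived_at timestamptz")

        # A CONCURRENTLY build takes only brief locks, but the statement itself can run
        # long, so the statement timeout is relaxed for this step (an assumption).
        cur.execute("SET statement_timeout = 0")
        cur.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_messages_archived_at "
            "ON messages (archived_at)"
        )
    conn.close()
```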

    Results and the road ahead

    This effort demonstrates that with the right design and optimizations, Azure PostgreSQL can be scaled to handle the largest production workloads. PostgreSQL now handles millions of QPS for read-heavy workloads, powering critical products like ChatGPT and the API platform. Nearly 50 read replicas have been added, while keeping replication lag near zero, maintaining low-latency reads across geo-distributed regions, and building sufficient capacity headroom to support future growth.

This scale has been achieved while still minimizing latency and improving reliability. Low double-digit millisecond p99 client-side latency and five-nines availability are consistently delivered in production. Over the past 12 months, only one SEV-0 PostgreSQL incident occurred (during the viral launch of ChatGPT ImageGen, when write traffic suddenly surged by more than 10x as over 100 million new users signed up within a week).

    While the current PostgreSQL setup has proven highly effective, efforts continue to push its limits to ensure sufficient runway for future growth. Shardable write-heavy workloads have already been migrated to sharded systems like CosmosDB. The remaining write-heavy workloads are more challenging to shard, and active migration of those is underway to further offload writes from the PostgreSQL primary. Collaboration with Azure is also ongoing to enable cascading replication, allowing for safe scaling to significantly more read replicas.

    Looking ahead, additional approaches to further scale will continue to be explored, including sharded PostgreSQL or alternative distributed systems, as infrastructure demands continue to grow.
