You Don't Need Redis

Chapter 16: Everything to Try Before You Shard

The Waiter of Gold Lapel · Published Mar 11, 2026 · Updated Mar 30, 2026 · 13 min

The question is not whether PostgreSQL can handle your scale. The question is whether you have exhausted everything PostgreSQL offers before concluding that it cannot.

Sharding — distributing your data across multiple database servers — is the nuclear option of scaling. It touches every query. It complicates every migration. It transforms every backup and every disaster recovery procedure. It requires routing logic in your application that must be maintained, tested, and debugged for the lifetime of the system. Joins across shards become expensive or impossible. Schema changes must be applied to every shard individually. The complexity accrues interest, and the payments never stop.

OpenAI estimated that sharding their PostgreSQL primary would require modifying hundreds of application endpoints and take months to years to complete. They chose not to. Their single PostgreSQL primary — unsharded — serves 800 million ChatGPT users today.

Before you shard, try everything in this chapter. Most teams will find they never need to.

The Proof

In January 2026, OpenAI published a detailed account of their PostgreSQL architecture. The numbers deserve to be stated plainly.

One Azure Database for PostgreSQL Flexible Server handles all write operations. Approximately 50 read replicas, distributed across multiple regions, handle reads. The system processes millions of queries per second. Availability is five nines. P99 read latency is in the low double-digit milliseconds. The database load grew more than tenfold in the preceding year.

The architecture is unsharded. One primary. One source of truth for writes. Every write in the system — every ChatGPT message, every API call, every user preference change — funnels through a single PostgreSQL instance.

At PGConf.Dev 2025, OpenAI infrastructure engineer Bohan Zhang stated it directly: "At OpenAI, we've proven that PostgreSQL can scale to support massive read-heavy workloads — even without sharding — using a single primary writer."

How? The workload pattern makes it possible. ChatGPT is overwhelmingly read-heavy. For every message a user sends, dozens of reads fire — conversation history, user settings, model configurations, interface rendering. Reads scale horizontally by adding replicas. Writes are the hard part, and OpenAI attacked write pressure from every angle: eliminating redundant writes at the application level, using lazy writes to smooth traffic spikes, rate-limiting backfills even when the process took over a week, and offloading shardable write-heavy workloads to Azure Cosmos DB. New tables are forbidden on the main PostgreSQL cluster — new features default to sharded systems.

Connection management is handled by PgBouncer running within their Kubernetes infrastructure, pooling connections from potentially thousands of application instances down to a manageable number of database connections — the exact pattern from Chapter 15.

If your application serves fewer than 800 million users — and I suspect it does — the ceiling is rather higher than you feared.

Scale Up, Then Scale Out

Vertical scaling

Modern hardware is extraordinary, and I find it is often underestimated by teams reaching for distributed architectures. A single high-end cloud instance — an AWS db.r6g.16xlarge with 64 vCPUs and 512GB of RAM, for example — can handle workloads that would have required a cluster a decade ago. CYBERTEC benchmarks demonstrate 32,000 simple read transactions per second per core on well-tuned PostgreSQL with an in-memory dataset.

Vertical scaling is unfashionable. It is also effective, requires zero application code changes, and can be completed in an afternoon. Before distributing your data across multiple servers, ensure your current server is appropriately sized.

Read replicas

PostgreSQL's streaming replication creates real-time copies of your database. The primary streams its write-ahead log to each replica, which applies the changes continuously. Route reads to replicas, writes to the primary, and the read capacity of your system scales linearly with the number of replicas.

Most web applications are between 80% and 95% reads, so replicas can absorb the majority of traffic. Start with one replica — it also serves as your failover target. Add replicas as read traffic grows. OpenAI runs approximately 50. Most applications need one to five.

Replication lag under normal load is near zero — milliseconds at most; monitor it with pg_stat_replication. PgCat provides built-in read/write splitting with replica failover. For operations that require read-after-write consistency — reading data immediately after writing it — route those specific reads to the primary. Most application frameworks and PostgreSQL proxies support this pattern, and Patroni can provide similar routing for PgBouncer-based deployments.
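A quick way to check lag is a query against pg_stat_replication on the primary. A minimal sketch — the replay_lag column requires PostgreSQL 10 or later:

replication_lag.sql
-- Run on the primary. One row per connected replica.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

If replay_lag climbs from milliseconds into seconds under steady load, investigate before adding more replicas — the bottleneck is usually a long-running query on the replica or WAL volume on the primary, not replica count.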

The Optimization Checklist

Before you shard, exhaust this list. The items are ordered by a combination of impact and ease of implementation.

1. Materialized views

Pre-compute expensive aggregations. This is the thesis of Parts I through III. A dashboard query that aggregates millions of rows in 7 seconds becomes an index scan that returns pre-computed results in 7 milliseconds. The read load drops by three orders of magnitude. The connection hold time drops by three orders of magnitude. The database has capacity to spare.
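A minimal sketch of the pattern — the table and column names are illustrative, not from any real schema:

dashboard_mv.sql
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT created_at::date AS day, sum(amount) AS revenue
FROM orders
GROUP BY created_at::date;

-- A unique index is required for REFRESH ... CONCURRENTLY,
-- which rebuilds the view without blocking readers.
CREATE UNIQUE INDEX ON daily_revenue (day);

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

The dashboard now reads from daily_revenue — an indexed scan over a few hundred rows — instead of re-aggregating millions of order rows on every page load.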

2. Connection pooling

PgBouncer or PgCat in transaction mode. Convert 1,000 application connections into 20 database connections. Chapter 15's core recommendation. Without pooling, you hit max_connections before you hit any scaling limit.
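To verify the pooler is actually doing its job, compare what the database sees against your application fan-out. A sketch using PostgreSQL's pg_stat_activity view:

connection_count.sql
-- Connections as seen by PostgreSQL, grouped by state.
-- With transaction-mode pooling in front, this should be a
-- small, stable number regardless of how many application
-- instances are running.
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state;

A large count of "idle" connections here usually means applications are connecting directly, bypassing the pooler.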

3. Query optimization

Run EXPLAIN ANALYZE on every slow query. Look for sequential scans on large tables, nested loop joins on unindexed columns, and sorts that spill to disk. The N+1 problem from Chapter 3 reappears here with particular urgency — a single API endpoint generating 100 queries instead of 1 is a scaling problem masquerading as a traffic problem. Fix the query before scaling the infrastructure.
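The diagnostic loop looks like this — a sketch with an illustrative query:

explain_analyze.sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 20;
-- Watch for: "Seq Scan" on a large table (missing index),
-- "Sort Method: external merge  Disk" (work_mem too small),
-- and actual row counts wildly different from the planner's
-- estimates (stale statistics -- run ANALYZE).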

4. Indexes

B-tree for equality and range conditions. GIN for full-text search, JSONB, and array operations. GiST for geometric and range types. Covering indexes with INCLUDE to enable index-only scans — the query is answered entirely from the index without touching the table. Partial indexes to index only the rows that matter:

partial_index.sql
CREATE INDEX idx_orders_active ON orders (created_at)
    WHERE status = 'active';

If 90% of your orders are archived and you only ever query active ones, do not index the 90% you will never read. The partial index is smaller, faster to maintain, and faster to scan.
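Covering indexes deserve the same treatment. A hedged sketch, assuming the hot query filters on customer_id and needs only status and total back:

covering_index.sql
-- INCLUDE stores extra columns in the index leaf pages so the
-- query can be answered by an index-only scan, never touching
-- the heap (provided the visibility map is reasonably current).
CREATE INDEX idx_orders_customer ON orders (customer_id)
    INCLUDE (status, total);

The included columns are payload only — they cannot be used in the WHERE clause, but they let the planner skip the table entirely.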

5. Table partitioning

Not sharding. Partitioning splits a table into child tables on the same server, organized by a partition key. Range partitioning by date for time-series data is the classic case:

table_partitioning.sql
CREATE TABLE events (
    id SERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    data JSONB
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2026_01 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01');

CREATE TABLE events_2026_02 PARTITION OF events
    FOR VALUES FROM ('2026-02-01') TO ('2026-03-01');

A query for January 2026 events touches only the January partition. PostgreSQL's partition pruning makes this automatic — the planner examines the WHERE clause and excludes irrelevant partitions from the scan. For tables with tens or hundreds of millions of rows where queries consistently filter on a specific column, partitioning can improve query performance dramatically without any application code changes.

Materialized views on partitioned tables work naturally — the MV query reads across partitions transparently.
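You can confirm pruning is happening with EXPLAIN against the events table above:

partition_pruning.sql
EXPLAIN SELECT count(*)
FROM events
WHERE created_at >= '2026-01-15' AND created_at < '2026-01-20';
-- The plan should scan only events_2026_01; the February
-- partition never appears in the plan at all.

If the plan still scans every partition, the WHERE clause is probably wrapping the partition key in a function or casting it, which defeats pruning.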

6. UNLOGGED tables

For ephemeral data — sessions, temporary caches, job queues — UNLOGGED tables provide approximately 2.9x faster writes than regular tables by skipping write-ahead log entries. Data survives clean shutdowns but is lost on crash. If the data is regenerable or disposable, UNLOGGED eliminates write overhead from a WAL that was never needed to protect it. See Chapter 6 for the full treatment.
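A minimal sketch for a session store — table and column names are illustrative:

unlogged_sessions.sql
-- Session data: regenerable on loss, so WAL protection buys nothing.
CREATE UNLOGGED TABLE sessions (
    token      UUID PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    data       JSONB,
    expires_at TIMESTAMPTZ NOT NULL
);

-- An existing table can be converted in place (PostgreSQL 9.5+):
ALTER TABLE sessions SET UNLOGGED;

Note that UNLOGGED tables are not replicated to streaming replicas — on a replica they exist but are empty — so keep them out of any read path you route to replicas.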

7. Write offloading

The OpenAI strategy. Move shardable, write-heavy workloads to specialized systems while keeping PostgreSQL as the read-optimized source of truth. Counters and rate limiters to Redis. Event streams to Kafka. Analytics ingestion to a separate PostgreSQL instance or a columnar store. The primary stays lean, handling only the relational workloads that require strong consistency.

8. Vacuum tuning

PostgreSQL's MVCC model creates dead tuples on every UPDATE and DELETE. The autovacuum daemon reclaims them, but its default settings are conservative for tables with heavy write traffic. Tune those tables aggressively:

vacuum_tuning.sql
ALTER TABLE hot_table SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

This triggers vacuum when 1% of the table's rows are dead, rather than the default 20%, and prevents bloat from degrading read performance — a common cause of "the database got slow" that has nothing to do with traffic volume.
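To find the tables that need this treatment, a query sketch against PostgreSQL's pg_stat_user_tables view:

dead_tuples.sql
-- Which tables are accumulating dead tuples fastest, and when
-- did autovacuum last get to them?
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

A table where n_dead_tup is a large fraction of n_live_tup and last_autovacuum is hours old is a candidate for the per-table settings above.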

Partitioning Is Not Sharding

The distinction matters and is frequently confused.

Partitioning: One PostgreSQL server. One database. The table is split into child tables organized by a partition key. Queries are automatically routed to the correct partition by the planner. No application code changes. No routing logic. No distributed transactions.

Sharding: Multiple PostgreSQL servers. Data is distributed by a shard key. Queries must be routed to the correct server by the application or a middleware layer. Joins across shards are expensive. Transactions across shards require distributed coordination. Application code must be shard-aware.

Partitioning gives you the performance benefits of organized data — smaller indexes, faster scans, efficient pruning — without any of the operational complexity of distributed data. For tables that grow continuously (logs, events, time-series), partitioning is often the optimization that makes sharding unnecessary.

When You Actually Need to Shard

I am not opposed to sharding. I am opposed to premature sharding.

Shard when the write volume exceeds the capacity of a single well-optimized PostgreSQL server — after you have exhausted vertical scaling, write offloading, vacuum tuning, and every other optimization in this chapter — and the workload is partitionable by a natural shard key (tenant_id, user_id, region).

If you cannot identify a natural shard key, sharding will make your system worse, not better. Cross-shard queries will be slow. Cross-shard transactions will be fragile. The routing logic will be a permanent source of bugs.

Citus is the most mature PostgreSQL sharding solution. Now part of Azure Cosmos DB for PostgreSQL, it extends PostgreSQL with distributed tables, distributed queries, and reference tables. Queries are rewritten transparently for distributed execution. For teams that have genuinely reached the write ceiling, Citus is the least disruptive path forward — your application still speaks PostgreSQL, and many queries work without modification.
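In Citus terms, distributing a table is a single call. A sketch assuming a tenant-keyed schema — the table names are illustrative:

citus_distribute.sql
-- Requires the citus extension to be installed and loaded.
-- Hash-distribute the big table across worker nodes by tenant.
SELECT create_distributed_table('events', 'tenant_id');

-- Small lookup tables can be replicated to every node instead,
-- so joins against them stay local to each shard.
SELECT create_reference_table('plans');

The catch is visible in the shard key: queries that filter on tenant_id route to one worker and stay fast, while queries that do not must fan out to every shard.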

Application-level routing is the simpler alternative for clean multi-tenant architectures. Shard by tenant_id: route each tenant to a dedicated PostgreSQL instance. Each shard is a complete, independent database. Simpler to reason about than Citus, but more work for any operation that crosses tenant boundaries.

The cost of either approach is permanent. Every query must account for the shard topology. Every migration must be applied to every shard. Every monitoring dashboard must aggregate across shards. Every backup and recovery procedure must handle partial failures. This complexity does not decrease over time — it increases as the shard count grows.

Exhaust the checklist first. If you have done everything in this chapter and the write volume still exceeds your primary's capacity, then — and only then — is it time to distribute the data.

The pattern throughout this book has been the same: the solution was already inside PostgreSQL, hidden in plain sight. Materialized views for caching. Connection pooling for concurrency. Read replicas for read scaling. Table partitioning for data organization. Partial indexes for targeted optimization. UNLOGGED tables for ephemeral writes. The tools are not new. They are merely underemployed. For a structured decision framework covering this entire progression, see the PostgreSQL scaling decision framework.

In the next chapter, we examine the connection pooler landscape in detail — PgBouncer, PgCat, Supavisor, and Odyssey — and how to choose among them.