You Don't Need Redis

Chapter 2: The Infrastructure You Were Told You Needed

The Waiter of Gold Lapel · Published Mar 11, 2026 · Updated Mar 30, 2026 · 16 min read

I should like to conduct a brief inventory, if you don't mind.

Your application runs PostgreSQL for its primary data. Redis for caching and, most likely, background job processing. There may be an Elasticsearch instance for search — or its open-source successor, OpenSearch, which amounts to the same operational commitment under a different name. Perhaps a message queue: RabbitMQ or Kafka for event-driven workflows. A monitoring stack — Prometheus and Grafana, or Datadog if the budget permits. A log aggregator. Each service runs in its own container, with its own configuration file, its own health checks, its own failure modes, and its own entry in the on-call runbook that nobody has updated since the engineer who wrote it left the company.

I count seven services. Your application serves — and I say this with no judgment whatsoever — four thousand users.

Every service in your infrastructure is a commitment. It requires monitoring, alerting, backup strategies, security patches, capacity planning, upgrade scheduling, and at least one engineer who understands its failure modes intimately enough to diagnose them at 3 AM in a state of diminished patience and insufficient caffeine. The question is not whether each service is capable. Each one is, I have no doubt, an excellent piece of engineering built by talented people solving real problems.

The question is whether each one is necessary.

I should like to trace how this arrangement came to be considered normal, examine what it actually costs — in dollars, in engineering hours, and in the quiet erosion of your team's ability to ship features — and then suggest, with all due courtesy, that several of your services may be ready for retirement.

How We Arrived at This Arrangement

This is not a story about bad decisions. It is a story about decisions that were correct at the time, and an industry that forgot to revisit them.

Redis appeared in 2009. PostgreSQL was at version 8.4. I should like you to consider, for a moment, what PostgreSQL lacked in 2009. Materialized views — the feature around which this entire book is organized — would not arrive until version 9.3 in 2013. JSONB, the column type that made PostgreSQL a credible document store, came with 9.4 in 2014. Declarative table partitioning appeared in version 10, in 2017. The full-text search capabilities that exist today were a fraction of their current sophistication. Connection pooling remained entirely the province of external tools.

In 2009, if you needed to cache database query results, you needed an external cache. If you needed a background job queue, you needed an external queue. If you needed pub/sub messaging, you needed an external message broker. Redis filled genuine gaps — gaps that existed in the actual PostgreSQL of that era, not the PostgreSQL of today.

The advice to use "the right tool for the right job" was not wrong. It was responsive to the tools available. The difficulty is that the tools changed, and the advice did not.

Bootcamps and introductory courses calcified the 2009-era architecture into gospel: PostgreSQL for relational data, Redis for caching, Elasticsearch for search, MongoDB for documents. Each clause of this rule was reasonable in 2010. By 2026, each is partially or fully wrong for most applications. But curricula do not update at the speed of PostgreSQL releases. A developer completing a bootcamp this year learns substantially the same architecture a developer learned in 2015. The tutorials haven't changed. The starter templates haven't changed. The "production-ready" boilerplate repositories on GitHub still include Redis in the docker-compose.yml as though it were as essential as the database itself.

I call this the curriculum effect, and it is remarkably persistent. An architecture decision made in one era becomes an industry assumption in the next, not because anyone reaffirmed it, but because nobody questioned it.

The ORM layer made matters worse — not through malice, but through abstraction. Django's ORM, Rails' ActiveRecord, Spring's JPA, Laravel's Eloquent, Prisma — these tools were designed to make database interaction simpler for application developers. They succeeded admirably at that goal. They also inadvertently hid PostgreSQL's most powerful features behind a wall of convenience. When your ORM presents the database as a collection of objects to be loaded and saved, features like materialized views, LISTEN/NOTIFY, advisory locks, and UNLOGGED tables become invisible. They exist below the abstraction layer, in territory that the ORM's design suggests you should never need to visit.

The ORM didn't cause the problem. It prevented the discovery of the solution.

And then there is the matter of marketing. Every specialized database has a company behind it. Redis Ltd. (formerly Redis Labs). Elastic. MongoDB Inc. Confluent. Each employs developer advocates, sponsors conferences, publishes benchmark posts carefully designed to show their product in the most favorable light, and offers generous free tiers engineered to embed their service in your architecture before you've had the opportunity to consider whether you need it. Each produces content — blog posts, tutorials, conference talks, certification programs — that reinforces the narrative that your application needs their product.

PostgreSQL has no equivalent marketing operation. It has a mailing list, a wiki, and forty years of engineering excellence that speaks — rather quietly, I'm afraid — for itself.

One does not begrudge these companies their enterprise. One merely observes that the advice to add more services to your architecture arrives most enthusiastically from those who sell them.

The Cost of a Full Household

Allow me to speak in numbers, since numbers are rather more difficult to argue with than opinions.

The SLA arithmetic

Three services, each operating at 99.9% uptime — a figure most teams would consider excellent. The combined availability is not 99.9%. It is 99.7%, which works out to roughly 26 hours of downtime per year instead of the 8.8 hours a single service would cost you — about 17 additional hours, distributed unpredictably across your on-call rotation. Five services at the same individual reliability: the combined figure drops to 99.5%, or roughly 44 hours of annual downtime. And 99.9% is optimistic for most self-managed services — it assumes zero operator error, zero configuration drift, and zero unfortunate interactions between version upgrades.
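The compounding is easy to verify. A quick sketch of the arithmetic, assuming the failures are independent and an 8,760-hour year:

```python
# Combined availability of N independent services, each at 99.9% uptime.
# Illustrative arithmetic only; real services fail in correlated ways.

HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_hours(availability: float) -> float:
    """Expected downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR

single = 0.999
three = single ** 3   # ~0.9970, i.e. "99.7%"
five = single ** 5    # ~0.9950, i.e. "99.5%"

print(f"one service:    {downtime_hours(single):5.1f} h/yr down")  # ~8.8
print(f"three services: {downtime_hours(three):5.1f} h/yr down")   # ~26.3
print(f"five services:  {downtime_hours(five):5.1f} h/yr down")    # ~43.7
```

The model assumes independence; in practice a Redis failure that stampedes your database makes the combined figure worse, not better.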

Each service failure triggers its own cascade of dependent failures, its own investigation, its own postmortem. The failure of your Redis cache may not bring down your application entirely — but it will degrade performance, trigger a wave of cache misses that overwhelm your database, and produce a PagerDuty alert at whatever hour of the night the universe finds most inconvenient. The complexity of your failure modes is not additive. It is multiplicative. Three services don't produce three categories of incidents. They produce the interactions between three services, which is a considerably larger number.

A well-run establishment does not employ three waiters where one suffices. The coordination overhead alone would be untenable.

The dollar cost

A production Redis instance on a managed hosting platform runs $135 per month at minimum — that figure comes from Render's production tier, and similar prices apply across Heroku, Railway, and comparable platforms. AWS ElastiCache for a modest workload begins at roughly the same figure and scales rapidly with memory requirements. Production Elasticsearch — or its equivalent, Amazon OpenSearch Service — starts at approximately $150 per month for the smallest viable production configuration and reaches $400 or more with moderate data volumes and multi-availability-zone redundancy. These are the service fees alone, before anyone on your team has spent a single minute configuring, monitoring, or maintaining them.

The time cost

A 2025 survey of 300 IT professionals by Port, published in their State of Internal Developer Portals report, found that three-quarters of developers lose 6 to 15 hours per week to tool sprawl — the aggregate overhead of navigating, configuring, debugging, and context-switching between the services in their stack. The average developer interacts with 7.4 distinct tools in the course of building an application. Nearly half of all teams juggle 10 or more DevOps tools with significant functional overlap.

Gloria Mark, Chancellor's Professor of Informatics at UC Irvine, has spent two decades studying how technology affects attention. Her research, published initially in 2008 and expanded in her 2023 book Attention Span, established a finding that I suspect will be familiar to anyone who has been interrupted during a debugging session: it takes an average of 23 minutes and 15 seconds to fully regain focus after an interruption. Not the duration of the interruption — the recovery time after it. Every time your attention shifts from PostgreSQL query analysis to Redis configuration to Elasticsearch index management, you pay that recovery cost.

One technology executive, who requested anonymity in the survey, calculated that 40% of their engineering team's time was devoted to infrastructure maintenance rather than product development. I shall let that figure settle for a moment.

Forty percent. Of a team hired and paid to build product. Maintaining infrastructure.

The cognitive cost

Your team needs SQL for PostgreSQL. Redis commands for caching. Possibly Elasticsearch Query DSL for search. MongoDB aggregation pipelines if a document store crept into the stack. Kafka consumer patterns if someone introduced event streaming. That is five query languages, five error models, five debugging paradigms, five sets of documentation, and five separate Stack Overflow ecosystems — each with its own conventions, its own gotchas, and its own version-specific quirks.

Tiger Data — the company formerly known as Timescale — captured this with admirable precision in a 2026 blog post: that is not specialization. That is fragmentation. Your team is not becoming expert in five technologies. It is becoming superficially competent in five technologies, which is a meaningfully different outcome.

I have found, in my experience, that a household runs most efficiently when its staff speak the same language.

The Complexity Tax — A Running Total

For a typical SaaS application serving 5,000–50,000 users, with a 5-person engineering team at $150K average salary:

Infrastructure costs (monthly):

  • PostgreSQL (managed, e.g., RDS db.r6g.large): ~$200
  • Redis (managed, e.g., ElastiCache): ~$135–$270
  • Elasticsearch/OpenSearch (managed): ~$150–$400

Engineering overhead (monthly):

  • Infrastructure maintenance at 15–40% of team time: $9,375–$25,000
  • On-call burden: 3 services × rotation coverage × incident investigation time

With Redis + Elasticsearch: ~$9,860–$25,870/month total

With PostgreSQL alone: ~$9,575–$25,200/month total

The infrastructure savings from eliminating two services are modest: $285–$670/month in hosting fees. The engineering time savings are not modest. A 5-person team spending 25% of its time on infrastructure burns roughly 2,600 hours per year; reclaiming even a tenth of that is 260 hours per year — more than six full engineering weeks — redirected from maintaining services to building features. The hours compound. The features compound. The complexity tax, once you stop paying it, stays stopped.
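For the skeptical, the running total reduces to a few lines of arithmetic:

```python
# Recomputing the complexity-tax table: 5-person team, $150K average salary.
team, salary = 5, 150_000
payroll_monthly = team * salary / 12                  # $62,500

infra_low  = {"postgres": 200, "redis": 135, "search": 150}
infra_high = {"postgres": 200, "redis": 270, "search": 400}

overhead_low  = payroll_monthly * 0.15                # $9,375
overhead_high = payroll_monthly * 0.40                # $25,000

with_all_low  = sum(infra_low.values())  + overhead_low   # $9,860
with_all_high = sum(infra_high.values()) + overhead_high  # $25,870
pg_only_low   = infra_low["postgres"]  + overhead_low     # $9,575
pg_only_high  = infra_high["postgres"] + overhead_high    # $25,200

savings_low  = with_all_low  - pg_only_low    # $285/month in hosting
savings_high = with_all_high - pg_only_high   # $670/month in hosting
print(savings_low, savings_high)
```

The hosting delta is small precisely because the real cost sits in the overhead terms, not the infrastructure lines.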

A Frank Assessment of Redis

I shall now speak directly about Redis, and I intend to be fair. A waiter who overstates the specials is not worth trusting with the wine.

Where Redis genuinely wins

Redis is an excellent piece of engineering. For specific categories of work, it is the correct choice, and I shall not pretend otherwise. Simple key-value operations at extraordinary throughput. Sub-millisecond latency on hot-path lookups serving tens of thousands of concurrent requests. Data structures — sorted sets, HyperLogLog, streams — that have no direct PostgreSQL equivalent and that solve real problems elegantly. Pub/sub at very high message rates where latency is measured in microseconds rather than milliseconds.

If your application performs that category of work, keep Redis. This book has no quarrel with Redis deployed for the problems Redis was designed to solve.

The difficulty — and I shall be candid — is that most applications using Redis are not performing that category of work. They are caching database query results. They are storing user sessions. They are managing background job queues. They are using Redis as a general-purpose supplement to a PostgreSQL database that could handle these tasks itself, if anyone had thought to introduce the two.

Where the advantage evaporates

CYBERTEC, one of Europe's foremost PostgreSQL consultancies, published latency benchmarks that I consider essential context for any Redis-versus-PostgreSQL discussion. The numbers are widely cited. They are also widely misunderstood. Allow me to present them properly.

A Redis GET operation completes in approximately 0.095 milliseconds. A PostgreSQL SELECT against cached data takes approximately 0.65 milliseconds. On the surface, Redis is roughly seven times faster. This is the number that appears in every "Redis vs PostgreSQL" comparison article, and it is accurate as far as it goes.

It does not go far enough.

The actual execution time for an indexed PostgreSQL read — the time the database engine spends doing computational work, not waiting — is 0.016 milliseconds. The remaining 97% of the response time is network overhead: the time it takes for the query to travel from your application to the database server and for the result to travel back.

When you add Redis to your architecture, you add another network hop. Your application talks to Redis over the network. If the cache misses, your application then talks to PostgreSQL over the network. The theoretical speed advantage of Redis — real in absolute terms — is largely consumed by the network overhead that both systems share equally. For the vast majority of application workloads, where individual query latency of 1 to 5 milliseconds is perfectly acceptable, the difference between Redis and PostgreSQL is a rounding error buried inside a network round trip.

If you'll forgive a rather pointed analogy: hiring a courier to carry a message across the room, when the recipient is already standing beside you.

The Latency Truth

| Operation | Latency | Source |
| --- | --- | --- |
| Redis GET | 0.095 ms | CYBERTEC benchmarks |
| PostgreSQL SELECT (cached data) | 0.65 ms | CYBERTEC benchmarks |
| PostgreSQL actual execution (indexed read) | 0.016 ms | CYBERTEC benchmarks |
| Network overhead (typical share of response time) | ~97% | CYBERTEC analysis |

The database is not the bottleneck. The network is the bottleneck. Adding a second network hop to avoid a 0.55-millisecond difference does not improve matters.
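To see where the advantage goes, here is a toy latency model: an assumed 0.5 ms in-datacenter round trip, with the benchmark figures above standing in for each system's service time. Under those assumptions, a direct PostgreSQL read is already competitive with a Redis cache hit — and a cache miss pays for both hops:

```python
# End-to-end latency = network round trip + service time. Toy model only;
# the 0.5 ms RTT is an assumption, not a measurement.

RTT_MS = 0.5          # assumed same-datacenter round trip
REDIS_GET_MS = 0.095  # Redis service time (benchmark figure)
PG_EXEC_MS = 0.016    # PostgreSQL execution only, indexed read

redis_hit  = RTT_MS + REDIS_GET_MS                   # one hop to Redis
pg_direct  = RTT_MS + PG_EXEC_MS                     # one hop to PostgreSQL
redis_miss = 2 * RTT_MS + REDIS_GET_MS + PG_EXEC_MS  # a miss pays both hops

print(f"cache hit via Redis:  {redis_hit:.3f} ms")
print(f"direct PostgreSQL:    {pg_direct:.3f} ms")
print(f"cache miss via Redis: {redis_miss:.3f} ms")
```

The sub-millisecond gap between the two systems disappears inside the round trip both of them must pay.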

The cache invalidation trap

Phil Karlton of Netscape observed, circa 1996, that there are only two hard problems in computer science: cache invalidation and naming things. He was not joking. He was identifying a structural difficulty that has not diminished in the intervening thirty years.

When your cache and your database are separate systems, every write to the database potentially invalidates an unknown number of cached entries. Consider a modest example: a user profile page that displays the user's name, their order count, their loyalty tier, and their three most recent orders. The cache key is user:profile:{user_id}.

When must this cache entry be invalidated? When the user updates their name. When a new order is placed — because the order count changes and the recent orders list changes. When an existing order is canceled or refunded. When the user's loyalty tier is recalculated — which happens when their lifetime spend crosses a threshold, which happens when a payment is confirmed, which may happen asynchronously hours after the order was placed.

That is five invalidation triggers for one cache key, spanning three database tables and an asynchronous payment workflow. Miss any one of them and the user sees stale data. Implement all of them and you've written a distributed consistency mechanism that must be maintained every time your data model changes.

Now multiply this across every cached entity in your application. The complexity scales nonlinearly with the number of relationships in your data model.
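The trap is easy to reproduce in a few lines. The sketch below is a toy cache with only two of the five write paths implemented — one that remembers to invalidate, and one (a hypothetical recalc_tier) that forgets:

```python
# Toy external cache for the user:profile:{user_id} example above.
# Every write path must independently remember to call invalidate().

cache: dict = {}
db = {"name": "Ada", "orders": 2, "tier": "silver"}  # stand-in "database" row

def profile(user_id: int) -> dict:
    key = f"user:profile:{user_id}"
    if key not in cache:
        cache[key] = dict(db)  # rebuild the cached entry from the database
    return cache[key]

def invalidate(user_id: int) -> None:
    cache.pop(f"user:profile:{user_id}", None)

def update_name(user_id: int, name: str) -> None:
    db["name"] = name
    invalidate(user_id)        # trigger 1: this path remembered

def recalc_tier(user_id: int, tier: str) -> None:
    db["tier"] = tier          # trigger 5: this path forgot to invalidate

profile(1)                     # warm the cache
update_name(1, "Grace")
assert profile(1)["name"] == "Grace"   # fresh — the write invalidated

recalc_tier(1, "gold")
assert profile(1)["tier"] == "silver"  # stale — the write did not
```

One forgotten call, and the database says gold while the user sees silver. No exception is raised; nothing logs; the bug is silent by construction.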

When your "cache" is a materialized view inside the same database as your data, this problem does not arise. You call REFRESH MATERIALIZED VIEW. One command. Atomic. Transactional. Every row is consistent with every other row. No stale entries. No orphaned keys. No distributed coordination. Chapter 6 is devoted entirely to this insight.

Redis in production: the operational reality

I mention the following not to disparage Redis but because the decision to add any service to your infrastructure should be made with full knowledge of its operational characteristics — the unpleasant ones as well as the pleasant.

Redis executes commands on a single thread by design — recent versions offload some network I/O to helper threads, but command processing itself does not parallelize. A server provisioned with 128 CPU cores will run that processing on exactly one of them. This is an architectural choice with genuine benefits — no lock contention, no race conditions, predictable latency — but it means that Redis cannot scale vertically in the way most developers assume.

During RDB snapshot persistence, Redis forks its process to create a point-in-time copy. The fork operation itself is fast, but the resulting child process shares memory pages with the parent via copy-on-write. Under heavy write load, modified pages must be duplicated, and an 8-gigabyte dataset can briefly require 15 gigabytes or more of physical memory. If the system lacks the headroom, the operating system's out-of-memory killer intervenes with its characteristic lack of subtlety.
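A rough model of the headroom arithmetic: the dirty fraction is whatever share of pages the parent rewrites while the snapshot is running, and every dirty page must be duplicated. Figures are illustrative, not measured:

```python
# Rough copy-on-write memory model for an RDB snapshot under write load.
# Illustrative only: real page-level accounting is messier than this.

def peak_memory_gb(dataset_gb: float, dirty_fraction: float) -> float:
    """Parent keeps the full dataset; COW duplicates the dirty share."""
    return dataset_gb * (1 + dirty_fraction)

print(peak_memory_gb(8, 0.25))  # light write load:  10.0 GB peak
print(peak_memory_gb(8, 0.90))  # heavy write load:  15.2 GB peak
```

An 8 GB instance provisioned on a 12 GB machine survives the first scenario and summons the OOM killer in the second — which is precisely the class of failure described above.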

One technology executive recounted discovering at 2 AM that three months of analytics data had been lost to a Redis persistence failure. The failure mode was not exotic. It was a foreseeable consequence of the fork-and-copy-on-write mechanism operating without sufficient memory headroom — a configuration error that Redis's documentation warns about, buried in a section that most operators never read because Redis's reputation for simplicity suggests such reading is unnecessary.

A January 2026 postmortem published on Medium documented a 20-minute production outage caused by an unexpected Redis Sentinel failover. The failover mechanism — designed to provide high availability — promoted a replica that was behind the primary, causing data loss. The infrastructure designed to prevent downtime became the cause of it.

These are not arguments against Redis. They are arguments for making the decision to run Redis with your eyes open, and for questioning whether the workloads you're assigning to Redis genuinely require its capabilities.

The Movement That's Already Underway

If what I have described sounds like a fringe position — one author's contrarian opinion, suitable for a Hacker News debate but not for production architecture — allow me to introduce you to the rather considerable company you would be keeping.

Rails 8, released by the framework that arguably did more than any other to popularize Redis-backed background jobs, removed Redis from its default application stack entirely. In its place: SolidQueue for job processing, SolidCache for caching, and SolidCable for WebSocket connections. All backed by PostgreSQL or SQLite. This was not a tentative experiment. 37signals, the company where Rails was created, processes 20 million background jobs per day with SolidQueue. Twenty million jobs. Every day. Without Redis. When the framework that popularized Redis-backed jobs decides Redis is optional, the landscape has shifted.

The identity provider authentik — used by thousands of organizations for single sign-on and identity management — removed Redis as a required dependency across four releases in 2024 and 2025. Their engineering team's assessment was characteristically understated: moving off Redis "simplified our architecture; one fewer piece to manage." Not a performance decision. An operational simplicity decision, made by a team whose product depends on reliability above all else.

Simple Thread, a Rails consultancy with extensive production experience, published what amounts to both a love letter and a breakup notice in a single blog post titled "I Love You, Redis, But I'm Leaving You for SolidQueue." Their accounting of production Redis costs on platforms like Render — $135 per month minimum, plus the engineering time for monitoring, debugging eviction policies, configuring persistence, and maintaining failover — led them to a straightforward conclusion: for most applications, Redis solves a problem that PostgreSQL already handles.

The practitioners arrived at the same conclusion independently. Stephan Schmidt, a CTO coach with four decades of experience building production systems, published the crystallizing statement of this movement — updated as recently as December 2025: "If you have more systems than developers, just use Postgres." Matt Nunogawa, who adopted the approach as a solo technical founder, updated his account in January 2026 after scaling to a team of five. His assessment: "No regrets. It's been a very positive decision in hindsight."

The grassroots evidence is equally compelling. A dev.to post titled "I Replaced Redis with PostgreSQL (And It's Faster)" generated hundreds of comments and debate across multiple platforms. A companion website, postgresisenough.dev, catalogs the full ecosystem of PostgreSQL-native replacements for the services developers typically outsource to specialized databases. The ecosystem it documents is broad and growing: PostgreSQL for queues, for caching, for pub/sub, for search, for time-series, for document storage. Each replacement eliminates a service. Each eliminated service eliminates a category of operational overhead.

This is not a fringe position. It is an emerging consensus, arrived at independently by framework maintainers, identity providers, consultancies, solo founders, and CTO coaches across three continents. The evidence points in one direction, and it points there with increasing conviction.

I should, however, be honest about the boundary — because a responsible advisor does not recommend change for its own sake.

PostgreSQL should not replace Redis when sub-millisecond latency is genuinely required on hot paths serving tens of thousands of concurrent requests per second. It should not replace Redis when Redis-specific data structures — sorted sets, HyperLogLog, streams — are central to the application's functionality. And it should not replace Redis when existing Redis infrastructure is working well, the team has deep Redis expertise, and the operational overhead is genuinely manageable.

In my experience, these conditions describe perhaps one in ten of the applications currently running Redis. The other nine are paying a complexity tax for a service they do not need.

What This Book Will Attend To

I have been speaking in rather broad terms about simplification. Allow me to be specific.

The single feature of PostgreSQL that eliminates the most external infrastructure is the materialized view — a stored query result that serves at the speed of a table read and refreshes on your schedule. It is the feature that transforms "just use PostgreSQL" from an aspirational philosophy into a practical architecture. It pre-computes what your dashboard queries compute on every request. It pre-aggregates what your reporting endpoints aggregate under pressure. It pre-joins what your ORM joins inefficiently.
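The pattern looks like this. PostgreSQL's native statements appear in the comments; the runnable sketch emulates a materialized view with a plain table in SQLite so it executes anywhere, and the schema (orders, order_totals) is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 20.0), (1, 35.0), (2, 10.0);
""")

# In PostgreSQL this would be:
#   CREATE MATERIALIZED VIEW order_totals AS
#     SELECT user_id, COUNT(*) AS n, SUM(amount) AS total
#     FROM orders GROUP BY user_id;
# and later:  REFRESH MATERIALIZED VIEW order_totals;
# SQLite lacks materialized views, so we emulate one with a plain table.
def refresh_order_totals(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        DROP TABLE IF EXISTS order_totals;
        CREATE TABLE order_totals AS
          SELECT user_id, COUNT(*) AS n, SUM(amount) AS total
          FROM orders GROUP BY user_id;
    """)

refresh_order_totals(conn)
# Dashboards now read a precomputed row instead of re-aggregating:
print(conn.execute(
    "SELECT total FROM order_totals WHERE user_id = 1").fetchone())  # (55.0,)

conn.execute("INSERT INTO orders VALUES (1, 5.0)")
refresh_order_totals(conn)  # one command brings every row back in sync
print(conn.execute(
    "SELECT total FROM order_totals WHERE user_id = 1").fetchone())  # (60.0,)
```

Note what is absent: no cache keys, no invalidation triggers, no second service. The "cache" lives beside the data it summarizes, and one refresh makes every row consistent at once.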

Chapter 4 is devoted entirely to materialized views — how they work, how to create and index them, what they can and cannot do. Chapter 5 addresses how to keep them fresh. Chapter 6 explains why they make cache invalidation a solved problem rather than an ongoing nightmare.

But before we arrive at the solution, there is one more aspect of the problem that warrants examination: the instrument through which most developers interact with their database — the ORM — and the rather unfortunate SQL it produces.

How do I convince my team to remove a service that's already working?

You don't begin by removing anything. You begin by adding a materialized view for your most expensive dashboard query. The improvement — typically three orders of magnitude — speaks for itself. Once your team has seen a 7-second query serve in 7 milliseconds without Redis, without a new service, without a cache invalidation protocol, the conversation about whether Redis is earning its keep becomes rather more productive. The evidence does the persuading. I merely provide the introduction.

What about the 75% of developers losing 6–15 hours per week to tool sprawl?

That figure, from Port's 2025 survey, measures what most teams feel but cannot articulate: the aggregate cost of context-switching between services, debugging their interactions, and maintaining their configurations. The number is large because the cost is distributed — no single service is obviously expensive, but the sum is devastating. The remedy is not to stop using tools. It is to stop using tools that duplicate capabilities your database already provides.

If you'll follow me to the next chapter, I should like to examine the instrument that sits between your application and your database — the ORM — and the queries it submits on your behalf when you aren't looking.

I should warn you: what your framework does to your queries is not always pleasant to discover. But then, the first step in resolving any domestic matter is seeing it clearly.

After you.