How Notion Solved Their Database Crisis and Scaled for Millions of Users

In 2021, Notion, the popular productivity and collaboration tool, faced a critical challenge. As their user base exploded, their service slowed to a crawl. The culprit? A massive Postgres database bursting at the seams. Let’s dive into how Notion tackled this crisis and emerged stronger than ever.

The Root of the Problem: Notion’s Unique Data Model

At the heart of Notion’s struggle was its innovative yet data-intensive model. Everything in Notion is a “block” - be it text, an image, or an entire page. Each block is a row in the database with a unique ID, and blocks can nest within each other, creating complex hierarchies.

While this structure offers incredible flexibility, it comes at a cost. A single document could spawn hundreds or thousands of database entries. Multiply that by millions of users, and you’ve got a recipe for database overload.
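
To make that concrete, here is a minimal sketch of what a block record might look like. The field names are illustrative assumptions, not Notion's actual schema:

```typescript
// Illustrative sketch only -- field names are assumptions, not Notion's actual schema.
interface Block {
  id: string;                          // every block is a row with a unique ID
  workspaceId: string;                 // the workspace that owns the block
  type: string;                        // e.g. "text", "image", "page"
  parentId: string | null;             // blocks nest: pages contain blocks, blocks contain blocks
  properties: Record<string, unknown>; // the block's content and attributes
}

// A single document is a tree of such rows: one "page" block plus hundreds or
// thousands of child blocks, each stored as its own row in Postgres.
```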

The Symptoms of a Database in Distress

As Notion’s user base grew, so did the signs of trouble:

  1. Increased latency when requesting page data
  2. A stalling VACUUM process, crucial for database maintenance
  3. The looming threat of transaction ID wraparound, which could force the database into read-only mode

Notion’s database had become a ticking time bomb, ready to explode at any moment.
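
Transaction ID wraparound in particular can be watched directly from Postgres's own statistics. As a minimal sketch (using the node-postgres `pg` client, an assumption about tooling rather than anything Notion has described), a team might poll `age(datfrozenxid)` for each database and alert long before it approaches the roughly 2-billion limit at which Postgres protects itself by refusing writes:

```typescript
import { Client } from "pg";

// Check how close each database is to transaction ID wraparound.
// Postgres forces protective read-only behavior as age(datfrozenxid)
// nears ~2 billion, so alerting well before that threshold is essential.
async function checkWraparoundRisk(connectionString: string): Promise<void> {
  const client = new Client({ connectionString });
  await client.connect();
  const { rows } = await client.query(
    `SELECT datname, age(datfrozenxid) AS xid_age
       FROM pg_database
      ORDER BY xid_age DESC`
  );
  for (const row of rows) {
    console.log(`${row.datname}: ${row.xid_age} transactions since last freeze`);
  }
  await client.end();
}
```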

The Journey to a Solution

Attempt 1: Vertical Scaling

The first instinct was to beef up the Postgres instance. However, a single machine can only get so big, costs climb steeply with instance size, and query performance often degrades well before the hardware limits are reached.

The Winning Strategy: Horizontal Scaling and Sharding

Notion decided to shard their database, splitting data across multiple instances. Here’s how they did it:

  1. Sharded the block table along with every table linked to it via foreign keys.
  2. Chose workspace ID as the partition key, keeping related data in the same shard.
  3. Created 32 physical database instances, each with 15 logical shards (schemas), totaling 480 shards.
  5. Implemented a routing mechanism at the application level to determine where each workspace’s data lives (see the sketch after this list).
  5. Used PgBouncer as a connection pooler to manage database connections efficiently.
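
To illustrate the routing idea, here is a minimal sketch that hashes a workspace ID to one of the 480 logical shards and maps it to a physical instance and schema. The hashing scheme, host names, and schema names are assumptions, not Notion's actual implementation:

```typescript
import { createHash } from "crypto";

const PHYSICAL_INSTANCES = 32;  // physical Postgres hosts
const SHARDS_PER_INSTANCE = 15; // logical schemas per host
const TOTAL_SHARDS = PHYSICAL_INSTANCES * SHARDS_PER_INSTANCE; // 480 logical shards

// Map a workspace ID to a stable logical shard number (illustrative hashing only).
function logicalShardFor(workspaceId: string): number {
  const digest = createHash("md5").update(workspaceId).digest();
  return digest.readUInt32BE(0) % TOTAL_SHARDS;
}

// Resolve the logical shard to a physical host and a schema on that host.
function routeWorkspace(workspaceId: string): { host: string; schema: string } {
  const shard = logicalShardFor(workspaceId);
  const instance = Math.floor(shard / SHARDS_PER_INSTANCE);
  return {
    host: `pg-shard-${instance}.internal`,                  // hypothetical host naming
    schema: `schema_${shard.toString().padStart(3, "0")}`,  // hypothetical schema naming
  };
}

// Every row for a given workspace hashes to the same shard, so related data stays together.
console.log(routeWorkspace("workspace-uuid-123"));
```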

The Migration Process

Migrating to the new system was no small feat:

  1. They created an audit log for writes to the old database.
  2. Used a catch-up script to apply the logged writes (and any schema changes) to the new databases (a sketch of this step follows the list).
  3. Migrated existing data over three days using a powerful m5.24xlarge instance.
  4. Carefully orchestrated the switch to the new sharded system.
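
The catch-up step can be pictured as a loop that replays each logged write against whichever shard now owns the corresponding workspace. The sketch below is a simplified assumption of how such a script might look, not Notion's actual migration code:

```typescript
import { Client } from "pg";

// Hypothetical shape of an audit-log entry recorded for each write to the old database.
interface AuditLogEntry {
  workspaceId: string;
  sql: string;       // the statement to replay
  params: unknown[];
}

// Replays logged writes against whichever shard now owns each workspace.
// `resolveShard` stands in for the routing logic sketched earlier; it is an assumption.
async function catchUp(
  entries: AuditLogEntry[],
  resolveShard: (workspaceId: string) => { client: Client; schema: string }
): Promise<void> {
  for (const entry of entries) {
    const { client, schema } = resolveShard(entry.workspaceId);
    // Point the session at the right logical schema, then replay the write.
    await client.query(`SET search_path TO ${schema}`);
    await client.query(entry.sql, entry.params);
  }
}
```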

Growing Pains and Further Optimization

By late 2022, even the 32-shard setup was showing signs of strain. Notion observed:

  1. Some shards reaching 90% CPU utilization during peak traffic.
  2. Many shards approaching full disk bandwidth utilization.
  3. Connection limits in the PgBouncer setup being reached.

The Next Evolution: 96-Shard System

To address these issues, Notion further sharded their system:

  1. Expanded from 32 to 96 database instances.
  2. Reduced logical schemas per instance from 15 to 5.
  3. Used Postgres logical replication for data synchronization.
  4. Optimized the process by delaying index creation until after the initial copy, reducing sync time from 3 days to 12 hours (see the sketch after this list).
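
In Postgres, logical replication is driven by a publication on the source and a subscription on the target. The sketch below shows the general shape of that setup, with index creation deferred until after the initial copy; the object names, table, and connection string are hypothetical, not Notion's actual commands:

```typescript
import { Client } from "pg";

// General shape of the shard sync: publish from the old database, subscribe on the
// new one, and defer index creation until after the initial copy (names are hypothetical).
async function syncShard(source: Client, target: Client): Promise<void> {
  // 1. On the source database: publish the tables to be copied.
  await source.query(`CREATE PUBLICATION shard_move FOR ALL TABLES`);

  // 2. On the target database (tables created WITHOUT secondary indexes): subscribe.
  //    Copying into index-free tables is much faster -- this is the optimization that
  //    cut the initial sync from about three days to roughly twelve hours.
  await target.query(
    `CREATE SUBSCRIPTION shard_move_sub
       CONNECTION 'host=old-shard.internal dbname=notion user=replicator'
       PUBLICATION shard_move`
  );

  // 3. Once the copy has caught up, build the indexes on the target.
  await target.query(
    `CREATE INDEX CONCURRENTLY idx_block_workspace ON block (workspace_id)`
  );
}
```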

Refining the Connection Management

Notion also had to rethink their PgBouncer setup:

  1. Created four new PgBouncer clusters, each managing 24 databases.
  2. Increased connections per PgBouncer per shard to 8.
  3. Limited total connections per Postgres instance to 200.
  4. Enhanced system resilience by isolating potential PgBouncer issues.

Testing and Rollout

Before going live, Notion implemented a “Dark Read” testing strategy:

  1. Fetched data from both new and old databases on requests.
  2. Compared results for consistency, logging discrepancies (see the sketch after this list).
  3. Limited comparisons to small queries and sampled a portion of requests.
  4. Introduced a 1-second delay for replication catch-up.
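
In code, a dark read boils down to serving the request from the old database while quietly comparing against the new shards in the background. The sketch below is an illustration under those assumptions; the helper functions are hypothetical stand-ins for the real read paths:

```typescript
// Hypothetical data-access helpers standing in for the real read paths.
declare function fetchFromOldDatabase(workspaceId: string): Promise<unknown[]>;
declare function fetchFromNewShards(workspaceId: string): Promise<unknown[]>;

// Dark read sketch: serve every request from the old database, fetch the same data
// from the new shards for a sample of requests, and log any mismatch.
async function getBlocksWithDarkRead(workspaceId: string): Promise<unknown[]> {
  const primaryResult = await fetchFromOldDatabase(workspaceId);

  // Sample a small fraction of requests and only compare small result sets,
  // so the comparison adds negligible load.
  if (Math.random() < 0.01 && primaryResult.length <= 100) {
    // Wait roughly a second so logical replication can catch up before comparing.
    setTimeout(async () => {
      const shadowResult = await fetchFromNewShards(workspaceId);
      if (JSON.stringify(shadowResult) !== JSON.stringify(primaryResult)) {
        console.warn(`Dark read mismatch for workspace ${workspaceId}`);
      }
    }, 1000);
  }

  return primaryResult; // users always see the old database's answer
}
```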

The final failover process involved:

  1. Pausing traffic and allowing ongoing queries to complete.
  2. Ensuring new databases were fully caught up (a catch-up check is sketched after this list).
  3. Updating configurations and reversing replication direction as a precaution.
  4. Resuming traffic to the new shards.
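
Step 2, confirming the new databases were fully caught up, can be checked from the source side using Postgres's replication statistics. The query below is one common way to do that; it is an illustration, not Notion's actual failover tooling:

```typescript
import { Client } from "pg";

// Check replication lag from the source shard's point of view: how many bytes
// each subscriber still has to replay. Zero lag for every subscriber means the
// new databases are fully caught up and the switch can proceed.
async function replicationLagBytes(source: Client): Promise<void> {
  const { rows } = await source.query(
    `SELECT application_name,
            pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
       FROM pg_stat_replication`
  );
  for (const row of rows) {
    console.log(`${row.application_name}: ${row.lag_bytes} bytes behind`);
  }
}
```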

The Results

Notion’s resharding project was a resounding success:

  1. Dramatically increased capacity.
  2. Improved performance, with CPU and IOPS utilization dropping from 90% to around 20% during peak traffic.
  3. Future-proofed their architecture to handle continued growth.

Through careful planning, innovative problem-solving, and meticulous execution, Notion transformed a potential disaster into a triumph of database engineering. Their journey offers valuable lessons for any growing tech company facing the challenges of scale.