The Problem with Naive Multi-Tenancy
Most early-stage SaaS products handle multi-tenancy the same way: add a tenant_id column to every table and filter by it in every query. It works — until it doesn't.
We inherited a codebase that had been built this way. Three years and 800 tenants later, the engineers were terrified of every migration, one accidental query had served tenant A's data to tenant B, and the analytics pipeline was running a 40-minute job every hour.
Here's what we rebuilt it into — and the architectural decisions that made it possible to handle 10M+ events per day without breaking a sweat.
Tenant Isolation Strategy: Choosing Your Model
There are three common approaches to database-level tenant isolation:
| Model | Isolation | Cost | Migration Complexity |
|---|---|---|---|
| Shared schema + tenant_id | Low | Low | Low |
| Separate schemas per tenant | Medium | Medium | Medium |
| Separate database per tenant | High | High | High |
We went with separate schemas per tenant in PostgreSQL. Here's why:
- Compliance requirements (SOC2, GDPR) often require data isolation — separate schemas make this demonstrable
- Migrations can be run per-tenant, enabling gradual rollouts
- Schema-level isolation is comprehensible to auditors and engineers alike
- Connection pooling via PgBouncer still works well with search_path switching
-- On login or API key authentication, set the tenant schema:
SET search_path TO tenant_acme, public;
-- All subsequent queries now resolve to tenant_acme.users,
-- tenant_acme.events, etc. without any WHERE clauses.
SELECT * FROM users WHERE id = $1;
The key insight: the application layer never needs to filter by tenant_id. The schema search path handles isolation entirely.
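One practical wrinkle: most drivers won't accept a bound parameter in SET search_path, so the schema name ends up interpolated into the statement text. That makes strict validation worthwhile before any interpolation happens. A minimal sketch — the tenant_ prefix and slug rules here are our convention, not anything PostgreSQL requires:

```typescript
// Derive a schema name from a tenant slug, rejecting anything that
// could not be safely interpolated into a SET search_path statement.
function schemaForTenant(slug: string): string {
  if (!/^[a-z][a-z0-9_]{0,58}$/.test(slug)) {
    throw new Error(`invalid tenant slug: ${slug}`);
  }
  return `tenant_${slug}`;
}
```

The validated name can then be spliced into SET search_path TO ... directly, since anything that fails the whitelist regex never reaches the database.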
Row-Level Security as a Safety Net
Even with separate schemas, we layer on PostgreSQL's Row-Level Security (RLS) as a defense-in-depth measure. RLS is enforced at the database level — a misconfigured query at the application layer cannot leak cross-tenant data.
-- Enable RLS on the events table
ALTER TABLE events ENABLE ROW LEVEL SECURITY;
-- Policy: tenants can only see their own events
CREATE POLICY tenant_isolation ON events
USING (tenant_id = current_setting('app.current_tenant')::uuid);
-- In the application, set this at the start of each transaction.
-- The final argument (is_local = true) scopes the setting to the
-- current transaction — important with transaction-level pooling,
-- where a session-scoped setting would leak across clients:
SELECT set_config('app.current_tenant', $1, true);
This is particularly valuable during incident response. When an engineer runs an ad-hoc query in production (it happens), RLS ensures they can't accidentally expose cross-tenant data, even with SELECT * — provided they connect as a regular role, since superusers and roles with BYPASSRLS skip policies entirely.
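One more caveat worth wiring in: by default, RLS policies do not apply to the table's owner. If ad-hoc queries might run as the owning role, force the policy:

```sql
-- Apply RLS policies even to the table's owner (superusers still bypass)
ALTER TABLE events FORCE ROW LEVEL SECURITY;
```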
The Events Pipeline: Why We Moved Off Direct Writes
The original architecture wrote every event synchronously to PostgreSQL. At low volume this was fine. At 10M events/day (roughly 115 events/second average, with spikes to 800/second), it created:
- Write amplification on the primary
- Replication lag causing stale reads
- API response times tied to database write latency
Our solution was a write-through queue using BullMQ + Redis:
// API handler — writes to queue, returns immediately
export async function POST(req: Request) {
const event = await req.json();
await eventsQueue.add('ingest', {
tenantId: req.headers.get('x-tenant-id'),
...event,
}, {
attempts: 3,
backoff: { type: 'exponential', delay: 1000 },
});
return Response.json({ ok: true }); // ~2ms response
}
// Worker — BullMQ hands the handler one job at a time (Worker from
// 'bullmq'), so we buffer events and flush them in bulk per tenant
const buffer: Record<string, object[]> = {};

new Worker('events', async (job) => {
  const { tenantId, ...event } = job.data;
  (buffer[tenantId] ??= []).push(event);
}, { concurrency: 50 });

// Flush once a second: one bulk insert per tenant schema
setInterval(async () => {
  for (const [tenantId, events] of Object.entries(buffer)) {
    if (!events.length) continue;
    buffer[tenantId] = [];
    // tenantId maps to a validated schema name before hitting SQL
    await db.execute(sql`SET search_path TO ${sql.raw(tenantId)}, public`);
    await db.insertInto('events').values(events).execute();
  }
}, 1000);
This single change reduced our API p99 latency from 340ms to 8ms and eliminated the replication lag problem entirely.
Zero-Downtime Schema Migrations
With separate schemas, migrations are more flexible — but also more complex. We run migrations with the expand-contract pattern:
- Expand: Add new columns as nullable, deploy new code that writes to both old and new columns
- Backfill: Background job populates the new column for existing rows
- Contract: Remove old column once backfill is verified
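During the expand phase, the write path populates both columns until the backfill completes. A sketch of the dual-write shim — the column names here are hypothetical, not from our actual schema:

```typescript
// Expand phase: every write fills both the legacy column and its
// replacement, so old and new code can run side by side.
type EventRow = { payload: string };                       // legacy shape
type ExpandedEventRow = EventRow & { payload_v2: string }; // during migration

function toDualWriteRow(row: EventRow): ExpandedEventRow {
  // payload_v2 mirrors payload until the contract step drops payload
  return { ...row, payload_v2: row.payload };
}
```

Once the backfill is verified, new code reads only payload_v2, and the contract step removes the legacy column.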
// migration tool runs this across all tenant schemas
async function runMigrationAcrossTenants(migrationSql: string) {
  const tenants = await getTenantSchemas();
  for (const tenant of tenants) {
    // quote the identifier — schema names come from our own registry,
    // but defense in depth is cheap
    await db.execute(`SET search_path TO "${tenant}"`);
    await db.execute(migrationSql);
    console.log(`✓ Migrated tenant ${tenant}`);
  }
}
This approach has let us ship schema changes during business hours, to 1,000+ tenants, with zero downtime.
Observability: Per-Tenant Metrics
The final piece was making per-tenant performance visible. We emit structured logs with tenantId on every operation and route them to a time-series store:
// Request middleware creates a child logger that carries tenant
// context on every subsequent log line
const log = logger.child({ tenantId, requestId });
log.info('event.ingested', {
  eventType: event.type,
  processingMs: Date.now() - startTime,
});
This lets us alert on per-tenant anomalies (a tenant's event volume suddenly drops to zero — probably an integration bug on their side), and helps us identify which tenants are generating disproportionate load before it becomes an incident.
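The zero-volume alert reduces to comparing each tenant's trailing baseline against the most recent window. A sketch of that check — the thresholds and data shapes are illustrative, not our production values:

```typescript
// Flag tenants with steady historical traffic that went silent in the
// latest window — usually a broken integration on the tenant's side.
function silentTenants(
  baseline: Record<string, number>, // avg events/hour, trailing week
  lastHour: Record<string, number>, // events in the most recent hour
  minBaseline = 100,                // ignore tenants with little traffic
): string[] {
  return Object.entries(baseline)
    .filter(([id, avg]) => avg >= minBaseline && (lastHour[id] ?? 0) === 0)
    .map(([id]) => id);
}
```

In practice this runs against aggregates from the time-series store, so a single query feeds the check for all tenants at once.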
Key Takeaways
- Schema-per-tenant is the sweet spot between isolation and operational complexity for most SaaS products
- RLS is your safety net — never rely on application-layer filtering alone
- Decouple writes from your API response time — queues are not premature optimization at scale
- Expand-contract migrations let you ship schema changes continuously without maintenance windows
- Per-tenant observability catches problems before your users do
The architecture described here handles our client's current 10M events/day with room to 10x. More importantly, the engineering team can reason about it clearly — which is the most underrated property of any good architecture.