The Problem with Naive Multi-Tenancy
Most early-stage SaaS products handle multi-tenancy the same way: add a tenant_id column to every table and filter by it in every query. It works — until it doesn't.
We inherited a codebase that had been built this way. Three years and 800 tenants later, the engineers were terrified of every migration, one accidental query had served tenant A's data to tenant B, and the analytics pipeline was running a 40-minute job every hour.
Here's what we rebuilt it into — and the architectural decisions that made it possible to handle 10M+ events per day without breaking a sweat.
Tenant Isolation Strategy: Choosing Your Model
There are three common approaches to database-level tenant isolation:
| Model | Isolation | Cost | Migration Complexity |
|---|---|---|---|
| Shared schema + tenant_id | Low | Low | Low |
| Separate schemas per tenant | Medium | Medium | Medium |
| Separate database per tenant | High | High | High |
We went with separate schemas per tenant in PostgreSQL. Here's why:
- Compliance requirements (SOC2, GDPR) often require data isolation — separate schemas make this demonstrable
- Migrations can be run per-tenant, enabling gradual rollouts
- Schema-level isolation is comprehensible to auditors and engineers alike
- Connection pooling via PgBouncer still works well with search_path switching
-- On login or API key authentication, set the tenant schema:
SET search_path TO tenant_acme, public;
-- All subsequent queries now resolve to tenant_acme.users,
-- tenant_acme.events, etc. without any WHERE clauses.
SELECT * FROM users WHERE id = $1;
The key insight: the application layer never needs to filter by tenant_id. The schema search path handles isolation entirely.
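One practical wrinkle: most drivers won't accept a bound parameter in SET search_path, so the schema name ends up interpolated into the statement text. That makes strict validation worthwhile before any interpolation happens. A minimal sketch — the tenant_ prefix and slug rules here are our convention, not anything PostgreSQL requires:

```typescript
// Derive a schema name from a tenant slug, rejecting anything that
// could not be safely interpolated into a SET search_path statement.
function schemaForTenant(slug: string): string {
  if (!/^[a-z][a-z0-9_]{0,58}$/.test(slug)) {
    throw new Error(`invalid tenant slug: ${slug}`);
  }
  return `tenant_${slug}`;
}
```

The validated name can then be spliced into SET search_path TO ... directly, since anything that fails the whitelist regex never reaches the database.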
Row-Level Security as a Safety Net
Even with separate schemas, we layer on PostgreSQL's Row-Level Security (RLS) as a defense-in-depth measure. RLS is enforced at the database level — a misconfigured query at the application layer cannot leak cross-tenant data.
-- Enable RLS on the events table
ALTER TABLE events ENABLE ROW LEVEL SECURITY;
-- Policy: tenants can only see their own events
CREATE POLICY tenant_isolation ON events
USING (tenant_id = current_setting('app.current_tenant')::uuid);
-- In the application, set this at the start of each transaction.
-- The final argument (is_local = true) scopes the setting to the
-- current transaction — important with transaction-level pooling,
-- where a session-scoped setting would leak across clients:
SELECT set_config('app.current_tenant', $1, true);
This is particularly valuable during incident response. When an engineer runs an ad-hoc query in production (it happens), RLS ensures they can't accidentally expose cross-tenant data, even with SELECT * — provided they connect as a regular role, since superusers and roles with BYPASSRLS skip policies entirely.
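One more caveat worth wiring in: by default, RLS policies do not apply to the table's owner. If ad-hoc queries might run as the owning role, force the policy:

```sql
-- Apply RLS policies even to the table's owner (superusers still bypass)
ALTER TABLE events FORCE ROW LEVEL SECURITY;
```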
The Events Pipeline: Why We Moved Off Direct Writes
The original architecture wrote every event synchronously to PostgreSQL. At low volume this was fine. At 10M events/day (roughly 115 events/second average, with spikes to 800/second), it created:
- Write amplification on the primary
- Replication lag causing stale reads
- API response times tied to database write latency
Our solution was a write-through queue using BullMQ + Redis:
// API handler — writes to queue, returns immediately
export async function POST(req: Request) {
const event = await req.json();
await eventsQueue.add('ingest', {
tenantId: req.headers.get('x-tenant-id'),
...event,
}, {
attempts: 3,
backoff: { type: 'exponential', delay: 1000 },
});
return Response.json({ ok: true }); // ~2ms response
}
// Worker — BullMQ hands the handler one job at a time (Worker from
// 'bullmq'), so we buffer events and flush them in bulk per tenant
const buffer: Record<string, object[]> = {};

new Worker('events', async (job) => {
  const { tenantId, ...event } = job.data;
  (buffer[tenantId] ??= []).push(event);
}, { concurrency: 50 });

// Flush once a second: one bulk insert per tenant schema
setInterval(async () => {
  for (const [tenantId, events] of Object.entries(buffer)) {
    if (!events.length) continue;
    buffer[tenantId] = [];
    // tenantId maps to a validated schema name before hitting SQL
    await db.execute(sql`SET search_path TO ${sql.raw(tenantId)}, public`);
    await db.insertInto('events').values(events).execute();
  }
}, 1000);
This single change reduced our API p99 latency from 340ms to 8ms and eliminated the replication lag problem entirely.
Zero-Downtime Schema Migrations
With separate schemas, migrations are more flexible — but also more complex. We run migrations with the expand-contract pattern:
- Expand: Add new columns as nullable, deploy new code that writes to both old and new columns
- Backfill: Background job populates the new column for existing rows
- Contract: Remove old column once backfill is verified
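During the expand phase, the write path populates both columns until the backfill completes. A sketch of the dual-write shim — the column names here are hypothetical, not from our actual schema:

```typescript
// Expand phase: every write fills both the legacy column and its
// replacement, so old and new code can run side by side.
type EventRow = { payload: string };                       // legacy shape
type ExpandedEventRow = EventRow & { payload_v2: string }; // during migration

function toDualWriteRow(row: EventRow): ExpandedEventRow {
  // payload_v2 mirrors payload until the contract step drops payload
  return { ...row, payload_v2: row.payload };
}
```

Once the backfill is verified, new code reads only payload_v2, and the contract step removes the legacy column.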
// migration tool runs this across all tenant schemas
async function runMigrationAcrossTenants(migrationSql: string) {
  const tenants = await getTenantSchemas();
  for (const tenant of tenants) {
    // quote the identifier — schema names come from our own registry,
    // but defense in depth is cheap
    await db.execute(`SET search_path TO "${tenant}"`);
    await db.execute(migrationSql);
    console.log(`✓ Migrated tenant ${tenant}`);
  }
}
This approach has let us ship schema changes during business hours, to 1,000+ tenants, with zero downtime.
Observability: Per-Tenant Metrics
The final piece was making per-tenant performance visible. We emit structured logs with tenantId on every operation and route them to a time-series store:
// Request middleware creates a child logger that carries tenant
// context on every subsequent log line
const log = logger.child({ tenantId, requestId });
log.info('event.ingested', {
  eventType: event.type,
  processingMs: Date.now() - startTime,
});
This lets us alert on per-tenant anomalies (a tenant's event volume suddenly drops to zero — probably an integration bug on their side), and helps us identify which tenants are generating disproportionate load before it becomes an incident.
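The zero-volume alert reduces to comparing each tenant's trailing baseline against the most recent window. A sketch of that check — the thresholds and data shapes are illustrative, not our production values:

```typescript
// Flag tenants with steady historical traffic that went silent in the
// latest window — usually a broken integration on the tenant's side.
function silentTenants(
  baseline: Record<string, number>, // avg events/hour, trailing week
  lastHour: Record<string, number>, // events in the most recent hour
  minBaseline = 100,                // ignore tenants with little traffic
): string[] {
  return Object.entries(baseline)
    .filter(([id, avg]) => avg >= minBaseline && (lastHour[id] ?? 0) === 0)
    .map(([id]) => id);
}
```

In practice this runs against aggregates from the time-series store, so a single query feeds the check for all tenants at once.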
Key Takeaways
- Schema-per-tenant is the sweet spot between isolation and operational complexity for most SaaS products
- RLS is your safety net — never rely on application-layer filtering alone
- Decouple writes from your API response time — queues are not premature optimization at scale
- Expand-contract migrations let you ship schema changes continuously without maintenance windows
- Per-tenant observability catches problems before your users do
The architecture described here handles our client's current 10M events/day with room to 10x. More importantly, the engineering team can reason about it clearly — which is the most underrated property of any good architecture.