The 200ms Promise: How to Rescue an API That's Bleeding Latency

latency performance profiling caching database-optimization

The Demo That Almost Killed the Company

It was a Thursday afternoon, and the conference room smelled like dry-erase markers and anxiety.

Your PM was live-demoing the app to a table full of Series A investors. Fourteen months of engineering effort, three all-nighters, one rewrite. This was the moment.

She clicked Load Dashboard.

The spinner appeared.

It kept spinning.

Two seconds passed. Then two more. The investor in the corner glanced at his phone.

“Sorry, our demo environment is a bit slow,” your PM said, smiling.

But it wasn’t the demo environment. It was production. You knew this because you were watching the metrics from the hallway, and the same API call that just took 4.2 seconds in that room was averaging 3.8 seconds for every real user.

You had one weekend. Not to rewrite the system — to find the actual problem and fix it.

This is that story.

Why Fast Systems Go Slow

Before you can fix latency, you need to understand what creates it. APIs don’t randomly become slow — they slow down for specific, diagnosable reasons.

Here is the typical failure chain:

User request arrives
  → No cache → hits application layer
  → Application fires 1 query per item (N+1)
  → Each query hits unindexed columns
  → DB does full table scans
  → Connection pool is saturated
  → Requests queue behind each other
  → User sees 4-second response

Every step multiplies the problem. One missing index is annoying. One missing index combined with an N+1 loop on a table with 2 million rows is a disaster.

The cruel part: the system looked fine in staging. You had 500 rows of test data. The N+1 problem fired 500 queries in 80ms. With 2 million real rows, those same 500 queries each took 400ms. You multiplied by 2,000 and got a 200-second equivalent problem, compressed into the 4 seconds your users actually waited.

Step 1: Measure Everything Before You Touch Anything

The worst thing you can do is guess. Engineers who “know” the bottleneck without measuring almost always fix the wrong thing.

The right approach is distributed tracing. Every request gets a trace ID that follows it through every layer:

Trace: GET /api/dashboard (total: 3,820ms)
├── Auth middleware        12ms
├── Handler setup          3ms
├── getUserProfile()      18ms
├── getDashboardItems()   3,640ms    ← HERE
│   ├── query 1           4ms
│   ├── query 2           4ms
│   ├── ... (repeated 482 more times)
│   └── query 484         4ms
└── serialize response     147ms

With this view, you stop guessing. The 3,640ms is almost entirely spent inside getDashboardItems(), firing 484 individual queries in a loop. This is textbook N+1.

Tools to reach for:

Datadog APM, New Relic, or Jaeger for traces
EXPLAIN ANALYZE in PostgreSQL to inspect individual query plans
Slow query log to catch anything over a threshold (100ms is a good start)

Step 2: Understand the Latency Budget

A 200ms response time is achievable for most API calls. But you can’t achieve it by accident. You have to allocate it deliberately.

Total budget: 200ms
├── Network/TLS:           10ms
├── Auth/middleware:       15ms
├── Business logic:        25ms
├── Primary DB query:      80ms
├── Cache lookups:         10ms
├── Serialization:         20ms
└── Buffer:                40ms

Once you write this down, you immediately see the constraints. Your single DB query gets 80ms. Not 400ms. Not 4,000ms. If any layer is burning its allocation and spilling into the buffer, you have found a problem worth solving.

Compare this budget against your trace. The delta between budget and reality is your prioritized work queue.

Step 3: The N+1 Kill

The N+1 query problem is responsible for more production latency incidents than any other single cause. It looks harmless in code:

// This looks innocent. It is not.
const users = await db.query('SELECT * FROM users WHERE active = true');

for (const user of users) {
  user.posts = await db.query(
    'SELECT * FROM posts WHERE user_id = ?', [user.id]
  );
}

If users returns 200 rows, you just fired 201 queries. Fix it with a JOIN or a batched load:

// One query. Always.
const usersWithPosts = await db.query(`
  SELECT u.*, p.id as post_id, p.title, p.created_at
  FROM users u
  LEFT JOIN posts p ON p.user_id = u.id
  WHERE u.active = true
`);

// Or: fetch users, then batch-fetch all posts in one IN() query
const users = await db.query('SELECT * FROM users WHERE active = true');
const userIds = users.map(u => u.id);
const posts = await db.query(
  'SELECT * FROM posts WHERE user_id = ANY(?)', [userIds]
);
const postsByUser = groupBy(posts, 'user_id');
users.forEach(u => u.posts = postsByUser[u.id] ?? []);

The second version fires exactly 2 queries regardless of whether you have 5 users or 500,000.

Step 4: Index Surgery

After N+1 elimination, the next stop is the query plan. Run EXPLAIN ANALYZE on your slowest queries and look for the word Seq Scan:

EXPLAIN ANALYZE
SELECT * FROM events
WHERE user_id = 12345
  AND created_at > NOW() - INTERVAL '30 days';

-- Bad output:
-- Seq Scan on events  (cost=0.00..48921.00 rows=2,446,000)
--   Filter: (user_id = 12345)
-- Rows Removed by Filter: 2,443,554
-- Execution Time: 892ms

-- After adding the index:
-- Index Scan using idx_events_user_created on events
-- Index Cond: (user_id = 12345 AND created_at > ...)
-- Execution Time: 4ms

A composite index on (user_id, created_at) turned a 892ms scan into a 4ms lookup. That is a 223x improvement on a single query.

Index design rules of thumb:

Situation	Index type
Filtering by one column	Single-column index
Filtering by two columns together	Composite index (put equality first)
Sorting + filtering	Include the ORDER BY column in the composite
Full-text search	Use GIN index or dedicated search service
Low cardinality (e.g., status field)	Consider partial index instead

Step 5: Add Caching Layers Strategically

Once your queries are fast, you protect them with cache. The key word is strategically — not everything should be cached, and not everything should have the same TTL.

Request
  │
  ▼
CDN Cache (Cloudflare/CloudFront)
  │ HIT → return in < 5ms
  │ MISS ↓
  ▼
Redis / Memcached (Application Cache)
  │ HIT → return in 1–3ms
  │ MISS ↓
  ▼
Application Layer
  │
  ▼
Database (one query, indexed)
  │
  └──→ Store result in Redis (TTL: 60s)
       └──→ Return response

What to cache:

User profiles (TTL: 5 minutes, invalidate on update)
Configuration and feature flags (TTL: 60 seconds)
Aggregated dashboard data (TTL: 30 seconds, stale-while-revalidate)
Product catalogs and static reference data (TTL: 10 minutes)

What not to cache:

Anything that must be real-time accurate (live order status, auth tokens)
Write operations
Per-user data with privacy implications unless namespaced correctly

The Architecture After the Weekend

Here is what the system looked like after 48 hours of targeted work:

┌─────────────────────────────────────────────────────────┐
│                     CDN (Edge Cache)                     │
│              Static: images, JS, CSS bundles             │
└─────────────────────────┬───────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────┐
│                  Load Balancer (ALB)                      │
└──────────┬──────────────────────────┬────────────────────┘
           │                          │
┌──────────▼──────────┐   ┌───────────▼─────────┐
│   API Server (1)    │   │   API Server (2)     │
│   Node.js / Python  │   │   Node.js / Python   │
└──────────┬──────────┘   └───────────┬──────────┘
           │                          │
           └──────────┬───────────────┘
                      │
┌─────────────────────▼───────────────────────────────────┐
│                   Redis Cluster                           │
│           User profiles, dashboards, config              │
│           HIT rate target: > 85%                         │
└─────────────────────┬───────────────────────────────────┘
                      │ MISS only
                      │
┌─────────────────────▼───────────────────────────────────┐
│               PostgreSQL Primary                          │
│   Indexed queries, connection pool (PgBouncer: 20 conns) │
│   Read replicas for analytics and reports                 │
└─────────────────────────────────────────────────────────┘

Results after the fix:

Metric	Before	After
p50 latency	3,200ms	48ms
p99 latency	8,400ms	190ms
DB query count per request	484	3
Cache hit rate	0%	91%
DB CPU utilization	94%	12%

The Monday demo went fine. The investors didn’t ask about performance.

The Systematic Approach (Don’t Guess)

Here is the process condensed into a repeatable playbook:

1. Instrument first
   Add tracing. Identify the slowest span.

2. Find the category
   N+1 query? Missing index? No cache? Serialization?

3. Fix the highest-impact problem only
   Resist fixing everything at once.

4. Measure again
   Confirm the improvement. Look for the next bottleneck.

5. Repeat
   Each fix reveals the next constraint.

This is how you avoid the trap of “we optimized for 2 days and p99 improved by 8%.” Targeted fixes on the actual bottleneck produce order-of-magnitude improvements, not percentage improvements.

Key Takeaways

Measure before you fix. Distributed tracing tells you exactly where time is spent. Never optimize based on intuition alone.
N+1 queries are the most common cause of API latency regressions. A single JOIN or batch query can replace hundreds of round trips.
Indexes on the right columns are the highest-leverage database fix. A missing index on a filtered column can mean the difference between 4ms and 900ms.
Cache the read path, not the write path. Caching protects your database from repetitive reads. It does not help with writes.
Latency budgeting makes hidden problems visible. Assign each layer a time allowance. Anything exceeding its budget is a prioritized fix.

FAQ

Q: How do I know if I have an N+1 problem without distributed tracing? Enable query logging at the database level and look for large numbers of nearly-identical queries fired in rapid succession. A pattern like 200 queries all selecting by different user IDs is the signature of N+1.

Q: Should I cache everything in Redis to make the API fast? No. Cache only data that is expensive to recompute, read frequently, and tolerates some staleness. Over-caching creates consistency problems that are harder to debug than the original latency.

Q: What is a realistic p99 latency target for a typical REST API? For user-facing reads: under 200ms. For write operations: under 500ms. For analytics or report endpoints: 1 to 2 seconds is often acceptable. Define your targets based on user expectations for each endpoint type, not a single global number.

Q: How do I prevent this from happening again after the fix? Add a latency regression alert to your CI pipeline or monitoring system. Alert when p95 exceeds a threshold. Review slow query logs weekly. Run EXPLAIN ANALYZE on new queries before they ship.

Interview Questions

“Your API is slow. Walk me through how you’d diagnose it.” Add distributed tracing to identify the slowest span. Check for N+1 queries. Run EXPLAIN ANALYZE on slow DB queries. Look for missing indexes. Check cache hit rates. Measure each layer against the latency budget.

“What is the N+1 problem and how do you fix it?” It occurs when you fetch N parent records and then fire one additional query per record to fetch children. Fix it with a JOIN (one query) or by fetching all children in a single batched IN() query and grouping them in application code.

“When would you not add a cache?” When the data must be strongly consistent (live inventory, auth state), when cache invalidation logic is complex enough to introduce bugs, or when the data changes so frequently that the cache hit rate would be negligible.

“How do you design a composite index?” Put equality predicates first, then range predicates, then sort columns. For a query WHERE user_id = ? AND created_at > ?, the index should be (user_id, created_at).

“What is the difference between p50, p95, and p99 latency?” p50 is the median — half of requests are faster, half are slower. p95 means 95% of requests complete within that time. p99 captures the experience of your slowest 1%, which is often your most impactful users (paying customers, high-traffic users). Optimizing p99 matters because the slowest requests often represent the most complex or data-heavy operations.

This article is premium

One-time payment · Lifetime access to all premium content

Get Premium Access

Already have access? Sign in