Day 21: Week 3 retrospective - stress testing our limits

Day 21 of 30. End of week 3. Time to see if we can scale.

#performance #load-testing

We’ve been running at a pretty low volume so far, but we’re getting a bit of traction. What we don’t know is our actual capacity, and it’s time to change that. Just like we’ve done for most of our other tests, it’s time to collect some real data!

There are several ways to do this, but an easy one is k6. We’ve used it in the past, it’s a solid tool, and even though it’s a little lacking in how it visualises test results, we can still get useful insights out of the test data.

Load testing setup

As mentioned, we used k6 for the load test, and this is our initial script:

import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 10 },  // Ramp up
    { duration: '5m', target: 10 },  // Stay at 10 concurrent
    { duration: '2m', target: 20 },  // Push higher
    { duration: '5m', target: 20 },  // Hold
    { duration: '2m', target: 0 },   // Ramp down
  ],
};

export default function () {
  const payload = JSON.stringify({
    url: 'https://example.com',
    device: 'desktop'
  });

  http.post('https://api.ourservice.com/v1/screenshots', payload, {
    headers: {
      'Content-Type': 'application/json',
      'X-API-Key': __ENV.API_KEY,
    },
  });

  sleep(1);
}
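
If you want to run something similar yourself, the API key can be passed in on the command line, for example k6 run --env API_KEY=your-key load-test.js (where your-key and load-test.js are placeholders for whatever your key and script file are called).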

Of course, we tested locally first, but to get a real number out of the test, it’s important to have a production-like environment. And what better production-like system than… well, production, of course!

Interactive load test results

Click through the different concurrency levels to see how our system performed:

[Interactive widget: k6 load test results. Pick a requests-per-second level from 1 to 50 to see response time, CPU, memory and error rate, plus the recommended capacity of ~10 safe req/s (~1K/hour, ~24K/day, ~720K/month).]

The results

After some tense minutes, our results came in.

At 10 concurrent users

  • Screenshots created: 600 over 7 minutes
  • Average job completion: 3.2 seconds
  • Error rate: 0%
  • CPU: 45%
  • Memory: 2.1GB / 4GB

This is pretty okay. It works out to about 5,000 screenshots/hour, and we know we have room for simple improvements without having to rebuild the whole system.

At 20 concurrent users

  • Screenshots created: 800 over 7 minutes
  • Average job completion: 5.1 seconds
  • Error rate: 2%
  • CPU: 78%
  • Memory: 3.4GB / 4GB

We’re getting a bit warm. The completion time increased 60%, and some jobs started failing. This most likely wouldn’t be a real problem in production since we retry, and it wouldn’t be noticeable for batch processing, so it’s something we can live with at the moment.

At 30 concurrent users (pushed further)

  • Average job completion: 12+ seconds
  • Error rate: 15%
  • CPU: 95%
  • Memory: 3.9GB / 4GB

We’re reaching the boundary of our system. The system is struggling, jobs are timing out, and we’ve even seen an out of memory exception. Now we’re sweating a bit.

What’s the bottleneck?

Memory is the killer. As expected when dealing with browsers, memory is the hard part. Each Playwright browser context uses ~500MB. At 30 concurrent screenshots we’d need 15GB+ of RAM. We have 4GB.

CPU is secondary. Chrome rendering can be CPU-intensive, but it looks like memory pressure causes the real failures.

Database is fine. Postgres barely notices. Queries take less than 10ms even under load.

Storage is fine. R2 uploads complete quickly. Neither the network nor the storage is a bottleneck.
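
To make the memory math concrete, here’s a back-of-the-envelope version. The ~500MB per browser context is what we observed; the ~2GB set aside for the JVM, the app itself and the OS is an assumption rather than something we measured, so treat it as a sketch:

fun maxSafeContexts(
    totalMemoryMb: Int = 4096,   // the 4GB box
    overheadMb: Int = 2048,      // assumed JVM + app + OS overhead (not measured)
    perContextMb: Int = 500      // observed per Playwright browser context
): Int = maxOf(0, (totalMemoryMb - overheadMb) / perContextMb)

fun main() {
    println(maxSafeContexts())   // 4 -> roughly the 3-4 concurrent screenshots below
}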

Our actual capacity

With 4GB RAM, we can safely run:

  • 3-4 concurrent screenshots
  • ~1,000 screenshots/hour sustained
  • ~24,000 screenshots/day
  • ~720,000 screenshots/month

Wait, that’s actually more than we need right now. However, the problem isn’t total capacity - it’s burst capacity. If a user sends 1,000 requests in 5 minutes, we’ll queue up and slow down.

Improvements made

1. Better queue management

We’ve limited concurrent workers to 3:

import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor

@Configuration
class WorkerConfig {
    @Bean
    fun screenshotExecutor(): ThreadPoolTaskExecutor {
        return ThreadPoolTaskExecutor().apply {
            corePoolSize = 3
            maxPoolSize = 3
            queueCapacity = 500  // Buffer for bursts
            setThreadNamePrefix("screenshot-worker-")
        }
    }
}

This way, jobs queue instead of overwhelming the system.
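
For completeness, this is roughly how jobs end up on that executor. ScreenshotService, submitJob and ServiceBusyException are placeholder names for illustration, not our exact code; the point is that once the 3 workers are busy and 500 jobs are queued, execute() throws, which is what feeds the service_busy error in point 4 below:

import org.springframework.core.task.TaskRejectedException
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor
import org.springframework.stereotype.Service

class ServiceBusyException : RuntimeException("screenshot queue is full")

@Service
class ScreenshotService(
    private val screenshotExecutor: ThreadPoolTaskExecutor  // the bean defined above
) {
    fun submitJob(job: ScreenshotJob) {
        try {
            // processJob is the worker function shown in the cleanup section below
            screenshotExecutor.execute { processJob(job) }
        } catch (e: TaskRejectedException) {
            // Queue full: surface an explicit error instead of silently dropping the job
            throw ServiceBusyException()
        }
    }
}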

2. Resource cleanup

We’ve also applied a more aggressive browser context cleanup:

fun processJob(job: ScreenshotJob) {
    val context = browser.newContext(contextOptions)
    try {
        // ... capture screenshot
    } finally {
        context.close()  // Always close
        System.gc()      // Hint to JVM (doesn't guarantee anything)
    }
}

3. Health checks

To monitor the system better, we’ve added a memory-based health check:

@GetMapping("/health")
fun health(): ResponseEntity<*> {
    val runtime = Runtime.getRuntime()
    val usedMemory = runtime.totalMemory() - runtime.freeMemory()
    val maxMemory = runtime.maxMemory()
    val usagePercent = usedMemory.toDouble() / maxMemory * 100

    return if (usagePercent < 90) {
        ResponseEntity.ok(mapOf("status" to "healthy"))
    } else {
        ResponseEntity.status(503).body(mapOf(
            "status" to "degraded",
            "reason" to "memory pressure"
        ))
    }
}

I think we can all see the challenges with a solution like this, so tomorrow we’ll look at a better one. For now, it gives a quick insight into how the application is performing, but once we’re completely out of memory, this endpoint most likely won’t return anything at all.

4. Graceful degradation

When the queue is full, return a proper error:

{
  "error": {
    "code": "service_busy",
    "message": "High demand. Please retry in a few seconds.",
    "retry_after": 5
  }
}
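
On the server side, one way to produce that response is a small exception handler that turns the full-queue case into a 503 with a Retry-After header. This is a sketch using the same placeholder ServiceBusyException as above, not necessarily how the real controllers are wired:

import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.ExceptionHandler
import org.springframework.web.bind.annotation.RestControllerAdvice

@RestControllerAdvice
class ApiErrorHandler {
    @ExceptionHandler(ServiceBusyException::class)
    fun handleBusy(e: ServiceBusyException): ResponseEntity<*> =
        ResponseEntity.status(503)
            .header("Retry-After", "5")   // mirrors the retry_after field in the body
            .body(mapOf(
                "error" to mapOf(
                    "code" to "service_busy",
                    "message" to "High demand. Please retry in a few seconds.",
                    "retry_after" to 5
                )
            ))
}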

We try to fail explicitly so that we (and our users) can recover quickly. Proper errors are better than silent failures.

Week 3 retrospective

Let’s look at the whole week:

Day   What we did
15    Pricing strategy
16    Stripe integration
17    Documentation
18    Error handling
19    Soft launch, outreach
20    First customer call
21    Stress testing

Progress:

  • Product is complete and sellable
  • Documentation exists
  • Payments work
  • First real customer conversation
  • One pilot in progress

Numbers:

  • Registered users: 31 → 38 (this week)
  • Revenue: Still $0
  • Paying customers: Still 0

Honest assessment:

We’re close but not there. Our pilot is our current best shot. If it goes well, we have our first customer. If not, we need to keep grinding.

We’ve proved that the product works. The bottleneck is now distribution, not development.

What’s left

  • Days 22-30: Nine days remaining
  • Goal: First paying customer
  • Strategy: Nurture our pilot, keep outreach going, improve based on feedback

Tomorrow: monitoring and alerts

On day 22 we’ll set up proper monitoring. When things break (and they will, as we’ve seen above), we need to know before customers tell us.

Book of the day

The Art of Capacity Planning by John Allspaw

Allspaw (former Etsy CTO) wrote the book on capacity planning. How to measure, predict, and provision for load.

A key insight: capacity isn’t about handling averages, it’s about handling peaks. Our system handles average load fine. It’s the bursts that kill us.

The book also covers queueing theory - why adding capacity has diminishing returns and why backpressure matters. Directly applicable to today’s work.


Day 21 stats

  • Hours: 62h
  • Code: 4,100
  • Revenue: $0
  • Customers: 0
  • Hosting: $5.5/mo

Achievements:
[✓] Load testing complete [✓] Capacity limits identified [✓] Queue management improved