Day 9: Sync vs async - two ways to capture screenshots

Day 9 of 30. Today we make the API actually work - with both instant and queued responses.

#postgres #queues

Yesterday we designed the interface; today we build the engine. While designing the async interface, we found use cases that also call for a sync one. Our key question: should a screenshot request block until complete, or return immediately with a job ID?

Our answer? Both!

Two modes, one API

Our API supports two capture modes:

Synchronous - Request blocks until screenshot is ready. Response contains the image directly.

POST /v1/screenshots
→ Wait 2-4 seconds
→ Returns PNG/JPEG bytes

Asynchronous - Request returns immediately with a job ID. Poll or use webhook for result.

POST /v1/screenshots/async
→ Returns job ID in 50ms
→ Poll GET /v1/screenshots/{id} for result
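
For example, the async round-trip might look like this (hypothetical values):

POST /v1/screenshots/async
{ "url": "https://example.com", "format": "png" }

→ 202 Accepted
{ "id": "scr_abc123", "status": "pending", "statusUrl": "/v1/screenshots/scr_abc123" }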

Supporting both modes is slightly harder to implement, but the two paths share a lot of code, and we think it’s an important feature to have.

When to use which

Use sync when:

  • You need one screenshot at a time
  • The client can wait 2-5 seconds
  • You want the simplest integration
  • You’re building interactive tools (dashboard previews, live captures)

Use async when:

  • You’re capturing many URLs in batch
  • Your HTTP client has short timeouts
  • You need guaranteed delivery with retries
  • You want webhook notifications
  • You’re building automation pipelines

Our guess: most users will start with sync and switch to async when they hit scale or more complex use cases.

The sync path

Sync is straightforward. A request comes in, we capture the screenshot, and we respond:

POST /v1/screenshots:

    user = getCurrentUser()
    validateRequest(request)
    reserveCredit(user)

    if async:  // the /async endpoint shares this handler
        return createAsyncJob(request, user)

    try:
        screenshot = captureScreenshot(request)
        consumeCredit(user)
        return screenshot as image
    catch error:
        releaseCredit(user)
        throw error

For a synchronous call, the response is the image itself: no job IDs to manage, no polling. This keeps the API simple.

Unfortunately, a sync API has limits. If the site takes 10 seconds to load, your request takes 10 seconds. We will optimize the capture and provide several ways to keep the wait to a minimum, but a slow site still means a slow screenshot, and you might run into timeouts.

Why async needs a queue

For async, we need a queue, and ideally a persistent one. Without a queue, async would be fire-and-forget. We’d potentially lose jobs if the server restarts, and it would be harder to retry failures. Also, we couldn’t distribute work across multiple workers, which would limit our scaling options.

The queue gives us:

Durability. Our jobs survive server restarts, since they’re in the database.

Retries. Any transient failures get another chance, and network blips don’t mean permanent failure.

Backpressure. If requests spike, the queue absorbs them; workers keep processing at a sustainable pace.

Scalability. When we hit real scale, we can add more workers to process jobs faster.

The queue in action

See how jobs flow through our Postgres-based queue:

[Interactive queue monitor: jobs flow from Incoming through workers (W1, W2) to Completed, with live Pending / Processing / Completed / Failed counters.]

Why Postgres as the queue?

As we mentioned in Day 2: Just use Postgres.

We could use Redis, RabbitMQ, or a dozen other queue systems. But we’re using Postgres because:

One less thing to run. We already have Postgres. Adding Redis means another container to run, more memory to budget for, and one more thing that can break.

Transactional safety. When we create a user request and queue a job, they’re in the same transaction - no distributed coordination needed (see the sketch below).

Good enough performance. We’re not processing thousands of jobs per second (yet, though we have a few plans - stay tuned!). Either way, Postgres can easily handle our scale.

SKIP LOCKED is magic. Postgres has SELECT ... FOR UPDATE SKIP LOCKED which is perfect for job queues. Workers grab jobs without blocking each other.

Simplicity. One system for both queue and storage means we always know where the data is, and we don’t have to hop between systems to diagnose issues.
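
To make the transactional-safety point concrete, here’s a minimal SQL sketch. The credits column on users is hypothetical; the point is that the job insert and the credit bookkeeping commit or roll back together:

BEGIN;

INSERT INTO screenshot_jobs (id, user_id, url)
VALUES ('scr_abc123', :user_id, 'https://example.com');

UPDATE users
SET credits = credits - 1  -- hypothetical credits column
WHERE id = :user_id;

COMMIT;  -- both changes land, or neither does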

So, what’s the trade-off? Postgres polling perhaps isn’t as efficient as Redis pub/sub. But we’re polling every 500ms, and the overhead should be negligible.

The job table

CREATE TABLE screenshot_jobs (
    id VARCHAR(32) PRIMARY KEY,  -- scr_abc123...
    user_id UUID REFERENCES users(id),

    -- Request parameters
    url TEXT NOT NULL,
    device VARCHAR(20) DEFAULT 'desktop',
    full_page BOOLEAN DEFAULT FALSE,
    width INTEGER DEFAULT 1920,
    height INTEGER DEFAULT 1080,
    format VARCHAR(10) DEFAULT 'png',
    wait_for VARCHAR(20) DEFAULT 'networkidle',
    webhook_url TEXT,  -- optional callback, used by async jobs

    -- Status tracking
    status VARCHAR(20) DEFAULT 'pending',
    attempts INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 3,

    -- Results
    image_url TEXT,
    file_size INTEGER,
    error_code VARCHAR(50),
    error_message TEXT,

    -- Timestamps
    created_at TIMESTAMP DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    failed_at TIMESTAMP,

    -- For queue processing
    locked_until TIMESTAMP,
    locked_by VARCHAR(50)
);

CREATE INDEX idx_jobs_pending ON screenshot_jobs(status, created_at)
    WHERE status = 'pending';  -- partial index: covers only pending rows, stays small
CREATE INDEX idx_jobs_user ON screenshot_jobs(user_id, created_at);

The locked_until and locked_by fields prevent multiple workers from grabbing the same job.

Creating async jobs

When using /async, we create a job and return immediately:

createAsyncJob(request, user):
    job = new Job(
        id = generateJobId(),      // scr_abc123...
        userId = user.id,
        url, device, fullPage, width, height, format, waitFor,
        webhookUrl
    )

    save(job)

    return 202 Accepted {
        id: job.id,
        status: "pending",
        statusUrl: "/v1/screenshots/" + job.id
    }

The response comes back in ~50ms. The client gets a job ID to track progress.

The worker

A background worker polls for pending jobs:

ScreenshotWorker:
    workerId = random short id

    every 500ms:
        job = claimNextJob()
        if no job: return  // nothing pending - wait for the next tick

        try:
            processJob(job)
        catch error:
            handleFailure(job, error)

    claimNextJob():
        return claimNextPending(workerId, lockFor: 2 minutes)

    processJob(job):
        updateStatus(job.id, "processing")
        screenshot = captureScreenshot(job.url, job.device, ...)
        imageUrl = uploadToStorage(job.id + "." + job.format, screenshot)
        markCompleted(job.id, imageUrl, screenshot.size)

        if job.webhookUrl:
            sendWebhook(job.webhookUrl, job.id, "completed", imageUrl)

    handleFailure(job, error):
        newAttempts = job.attempts + 1

        if newAttempts >= job.maxAttempts:
            markFailed(job.id, categorizeError(error), error.message)
            if job.webhookUrl:
                sendWebhook(job.webhookUrl, job.id, "failed")
        else:
            // exponential backoff: 20s after the first failure, 40s after the second
            backoff = 2^newAttempts * 10 seconds
            releaseForRetry(job.id, newAttempts, availableAfter: now + backoff)

    categorizeError(error):
        if "timeout" in error: return "timeout"
        if "ERR_NAME_NOT_RESOLVED" in error: return "dns_error"
        if "ERR_CONNECTION_REFUSED" in error: return "connection_refused"
        else: return "internal_error"
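
One way releaseForRetry could map onto the job table: set the job back to pending and reuse locked_until as a "not before" timestamp, since the claim query (next section) skips rows whose lock hasn’t expired. A sketch under that assumption:

-- hypothetical releaseForRetry: $1 = job id, $2 = new attempt count, $3 = backoff in seconds
UPDATE screenshot_jobs
SET status = 'pending',
    attempts = $2,
    locked_by = NULL,
    locked_until = NOW() + ($3 * INTERVAL '1 second')
WHERE id = $1;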

The claim query

The claimNextPending method uses SKIP LOCKED:

claimNextPending(workerId, lockDuration):
    // atomically find and lock the next available job

    find first job where:
        status = 'pending'
        and (not locked or lock expired)
    order by created_at

    lock it with SKIP LOCKED  // if already locked, try next one

    set locked_by = workerId
    set locked_until = now + lockDuration
    set started_at = now (if not already set)

    return job

FOR UPDATE SKIP LOCKED means: lock this row, but if it’s already locked, skip it and try the next one. Multiple workers can run this simultaneously without blocking.
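
In Postgres, the whole claim can be a single statement. A minimal sketch, assuming the schema above:

-- atomically claim the next available job ($1 = workerId)
UPDATE screenshot_jobs
SET locked_by = $1,
    locked_until = NOW() + INTERVAL '2 minutes',
    started_at = COALESCE(started_at, NOW())
WHERE id = (
    SELECT id
    FROM screenshot_jobs
    WHERE status = 'pending'
      AND (locked_until IS NULL OR locked_until < NOW())
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;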

Polling for results

Clients poll the job endpoint:

GET /v1/screenshots/{id}:
    job = findById(id)
    if not found: return 404

    if job.status == "completed":
        return { id, status, imageUrl, fileSize, completedAt }

    if job.status == "failed":
        return { id, status, error: { code, message }, failedAt }

    else:  // pending or processing
        return { id, status, createdAt, startedAt }

We recommend polling every 1-2 seconds; most jobs complete in 3-5 seconds. We will also offer webhooks, which may be a more convenient way to receive results.
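
A client-side polling loop might look like this (pseudocode, with a hypothetical 60-second give-up):

pollForResult(jobId):
    deadline = now + 60 seconds
    while now < deadline:
        sleep 1.5 seconds
        job = GET /v1/screenshots/{jobId}
        if job.status in ["completed", "failed"]:
            return job
    throw TimeoutError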

What we discovered

From talking to users (we don’t have hard data yet), we found that:

Sync is what most users want. We planned to build async first, thinking everyone would want it. Then we realized most integrations are simpler with sync.

Playwright needs memory. Each browser context uses 50-100MB. With a 4GB VPS and Postgres also running, we can safely run 2-3 concurrent browser instances. Sync requests share a browser pool; async workers have their own.

Some sites fail consistently. Sites with aggressive bot protection (Cloudflare challenges, CAPTCHAs) fail every time. We need to handle these gracefully: fail fast with a clear error message, or find a better way around the blocks.

The 500ms poll is fine. We worried a little about database load from job polling. In practice, the query is fast (indexed, and returns immediately when there are no jobs), and up to 500ms of pickup latency is imperceptible next to a multi-second capture.

What we built today

  • Implemented sync capture for simple use cases
  • Built an async job queue with Postgres for batch/pipeline use cases
  • Added retry logic with exponential backoff
  • Connected both paths to the same capture service
  • Added proper error categorization

The core product now works. A user can POST a URL and get a screenshot - either instantly or eventually.

Tomorrow: device emulation

On day 10 we’ll focus on mobile and tablet screenshots. Different viewports, user agents, and device scale factors.

Book of the day

Today we have two recommendations, inspired by some of the work we’re doing at the moment.

Designing Distributed Systems by Brendan Burns

Burns is a co-founder of Kubernetes, and this book covers patterns for building reliable distributed systems. Even though we’re running on a single VPS, the patterns apply.

The chapter on work queue systems directly informed our async design. He covers single-worker, multi-worker, and coordinated batch patterns. Our “claim with skip locked” approach is a variation of his mutex-based work queue.

What I appreciate: he acknowledges that not everything needs to be distributed. Sometimes sync is fine. Sometimes a single worker is fine. The patterns are tools, not mandates.

Another good (and quick) read could be:

Design Patterns for Container-Based Distributed Systems, also by Brendan Burns, with David Oppenheimer

In this paper, Burns and Oppenheimer note that containers are becoming the “objects” of distributed systems, and just as OOP gave us reusable design patterns, containers are doing the same. They identify three categories: single-container patterns for management interfaces, single-node patterns (Sidecar, Ambassador, Adapter) for cooperating containers, and multi-node patterns (Leader Election, Work Queue, Scatter/Gather) for distributed algorithms. The core insight: build complex infrastructure once, reuse everywhere.


Day 9 stats

Hours
███░░░░░░░░░░░░
21h
</> Code
███░░░░░░░░░░░░
950
$ Revenue
░░░░░░░░░░░░░░░
$0
Customers
░░░░░░░░░░░░░░░
0
Hosting
████░░░░░░░░░░░
$5.5/mo
Achievements:
[✓] Sync capture working [✓] Async queue built [✓] Retry logic added
