Day 9: Sync vs async - two ways to capture screenshots
Day 9 of 30. Today we make the API actually work - with both instant and queued responses.
Yesterday we designed the interface; today we build the engine. While designing the async interface, we also found use cases that call for a sync interface. The key question: should a screenshot request block until the capture completes, or return immediately with a job ID?
Our answer? Both!
Two modes, one API
Our API supports two capture modes:
Synchronous - Request blocks until screenshot is ready. Response contains the image directly.
POST /v1/screenshots
→ Wait 2-4 seconds
→ Returns PNG/JPEG bytes
Asynchronous - Request returns immediately with a job ID. Poll or use webhook for result.
POST /v1/screenshots/async
→ Returns job ID in 50ms
→ Poll GET /v1/screenshots/{id} for result
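To make the difference concrete, here's a rough sketch of what calling each mode could look like from Python. The host, auth header, and request body are placeholders, not final API details:

import requests

BASE = "https://api.example.com/v1"             # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_KEY"}  # placeholder auth

# Sync: the response body is the image itself.
resp = requests.post(f"{BASE}/screenshots",
                     json={"url": "https://example.com"},
                     headers=HEADERS, timeout=30)
resp.raise_for_status()
with open("shot.png", "wb") as f:
    f.write(resp.content)

# Async: the response is a job descriptor; the image comes later.
resp = requests.post(f"{BASE}/screenshots/async",
                     json={"url": "https://example.com"},
                     headers=HEADERS, timeout=10)
job = resp.json()
print(job["id"], job["status"])  # e.g. scr_abc123..., pending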
Supporting both modes is slightly harder to implement, but the two paths share most of their code, and we think it’s an important feature to have.
When to use which
Use sync when:
- You need one screenshot at a time
- The client can wait 2-5 seconds
- You want the simplest integration
- You’re building interactive tools (dashboard previews, live captures)
Use async when:
- You’re capturing many URLs in batch
- Your HTTP client has short timeouts
- You need guaranteed delivery with retries
- You want webhook notifications
- You’re building automation pipelines
Our expectation: most users will start with sync and switch to async when they hit scale or more complex use cases.
The sync path
Sync is straightforward. A request comes in, we capture the screenshot, and we respond:

POST /v1/screenshots:
    user = getCurrentUser()
    validateRequest(request)
    reserveCredit(user)
    if async:  // same entry point; async requests branch off here
        return createAsyncJob(request, user)
    try:
        screenshot = captureScreenshot(request)
        consumeCredit(user)
        return screenshot as image
    catch error:
        releaseCredit(user)
        throw error
In the case of a synchronous call, the response is the image itself. There are no job IDs to manage, and no polling. This keeps the API quite simple.
Unfortunately, a sync API has limits. If the site takes 10 seconds to load, your request takes 10 seconds. While we will optimize the call and provide several ways to keep the waiting to a minimum, a slow site can still result in a slow screenshot, and you might run into timeouts.
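For flavor, here's a minimal sketch of what captureScreenshot could look like with Playwright, which is what we use under the hood. The function shape, the networkidle wait, and the 30-second timeout are illustrative, not our production code:

from playwright.sync_api import sync_playwright

def capture_screenshot(url, width=1920, height=1080,
                       full_page=False, timeout_ms=30_000):
    # Load the page, wait for network idle, take the screenshot.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page(viewport={"width": width, "height": height})
            # A slow site makes the whole sync request slow - hence the timeout.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.screenshot(full_page=full_page)
        finally:
            browser.close()

In production the sync path shares a browser pool rather than launching a fresh browser per request; the sketch launches one only to stay self-contained.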
Why async needs a queue
For async, we need a queue, and ideally a persistent one. Without a queue, async would be fire-and-forget. We’d potentially lose jobs if the server restarts, and it would be harder to retry failures. Also, we couldn’t distribute work across multiple workers, which would limit our scaling options.
The queue gives us:
Durability. Our jobs survive server restarts, since they’re in the database.
Retries. Any transient failures get another chance, and network blips don’t mean permanent failure.
Backpressure. If requests spike, the queue absorbs them, since workers process items at a sustainable pace.
Scalability. If we reach a certain level of scale, we can add more workers to process faster.
The queue in action
Let's look at how jobs flow through our Postgres-based queue.
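In code form, the states and transitions look like this. The state names mirror the status column in the job table below; the transition map is our reading of the flow, not a formal spec:

# Job states, as stored in screenshot_jobs.status.
PENDING, PROCESSING, COMPLETED, FAILED = "pending", "processing", "completed", "failed"

# A worker claims a pending job; the job then either completes, goes back
# to pending for a retry with backoff, or fails after max_attempts.
TRANSITIONS = {
    PENDING:    {PROCESSING},                   # worker claims the job
    PROCESSING: {COMPLETED, PENDING, FAILED},   # done / retry later / gave up
}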
Why Postgres as the queue?
As we mentioned in Day 2: Just use Postgres.
We could use Redis, RabbitMQ, or a dozen other queue systems. But we’re using Postgres because:
One less thing to run. We already have Postgres. Adding Redis means another container, which requires more memory and potentially more things to break.
Transactional safety. When we create a user request and queue a job, they’re in the same transaction. No distributed coordination needed.
Good enough performance. We’re not processing thousands of jobs per second (yet, though we have a few plans - stay tuned!). Either way, Postgres can easily handle our scale.
SKIP LOCKED is magic. Postgres has SELECT ... FOR UPDATE SKIP LOCKED which is perfect for job queues. Workers grab jobs without blocking each other.
Less complexity. Using one system for both queue and storage keeps things simple: we know where the data is, and we don’t have to hop between systems to diagnose issues.
So, what’s the trade-off? Postgres polling perhaps isn’t as efficient as Redis pub/sub. But we’re polling every 500ms, and the overhead should be negligible.
The job table
CREATE TABLE screenshot_jobs (
    id VARCHAR(32) PRIMARY KEY,  -- scr_abc123...
    user_id UUID REFERENCES users(id),
    -- Request parameters
    url TEXT NOT NULL,
    device VARCHAR(20) DEFAULT 'desktop',
    full_page BOOLEAN DEFAULT FALSE,
    width INTEGER DEFAULT 1920,
    height INTEGER DEFAULT 1080,
    format VARCHAR(10) DEFAULT 'png',
    wait_for VARCHAR(20) DEFAULT 'networkidle',
    webhook_url TEXT,  -- optional callback for async jobs
    -- Status tracking
    status VARCHAR(20) DEFAULT 'pending',
    attempts INTEGER DEFAULT 0,
    max_attempts INTEGER DEFAULT 3,
    -- Results
    image_url TEXT,
    file_size INTEGER,
    error_code VARCHAR(50),
    error_message TEXT,
    -- Timestamps
    created_at TIMESTAMP DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    failed_at TIMESTAMP,
    -- For queue processing
    locked_until TIMESTAMP,
    locked_by VARCHAR(50)
);
CREATE INDEX idx_jobs_pending ON screenshot_jobs(status, created_at)
    WHERE status = 'pending';
CREATE INDEX idx_jobs_user ON screenshot_jobs(user_id, created_at);
The locked_until and locked_by fields prevent multiple workers from grabbing the same job.
Creating async jobs
When using /async, we create a job and return immediately:
createAsyncJob(request, user):
    job = new Job(
        id = generateJobId(),  // scr_abc123...
        userId = user.id,
        url, device, fullPage, width, height, format, waitFor,
        webhookUrl
    )
    save(job)
    return 202 Accepted {
        id: job.id,
        status: "pending",
        statusUrl: "/v1/screenshots/{id}"
    }
The response comes back in ~50ms. The client gets a job ID to track progress.
The worker
A background worker polls for pending jobs:
ScreenshotWorker:
    workerId = random short id
    every 500ms:
        job = claimNextJob()
        if no job: return
        try:
            processJob(job)
        catch error:
            handleFailure(job, error)

claimNextJob():
    return claimNextPending(workerId, lockFor: 2 minutes)

processJob(job):
    updateStatus(job.id, "processing")
    screenshot = captureScreenshot(job.url, job.device, ...)
    imageUrl = uploadToStorage(job.id + "." + job.format, screenshot)
    markCompleted(job.id, imageUrl, screenshot.size)
    if job.webhookUrl:
        sendWebhook(job.webhookUrl, job.id, "completed", imageUrl)

handleFailure(job, error):
    newAttempts = job.attempts + 1
    if newAttempts >= job.maxAttempts:
        markFailed(job.id, categorizeError(error), error.message)
        if job.webhookUrl:
            sendWebhook(job.webhookUrl, job.id, "failed")
    else:
        // exponential backoff: 20s, 40s, 80s
        backoff = 2^newAttempts * 10 seconds
        releaseForRetry(job.id, newAttempts, availableAfter: now + backoff)

categorizeError(error):
    if "timeout" in error: return "timeout"
    if "ERR_NAME_NOT_RESOLVED" in error: return "dns_error"
    if "ERR_CONNECTION_REFUSED" in error: return "connection_refused"
    else: return "internal_error"
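The sendWebhook calls above are plain HTTP POSTs. A quick sketch, with an illustrative payload shape:

import requests

def send_webhook(webhook_url, job_id, status, image_url=None):
    # Payload shape is illustrative; the real one may carry more fields.
    payload = {"id": job_id, "status": status}
    if image_url:
        payload["image_url"] = image_url
    try:
        # Short timeout so a slow receiver can't stall the worker loop.
        requests.post(webhook_url, json=payload, timeout=5)
    except requests.RequestException:
        # Best-effort delivery; the client can always fall back to polling.
        pass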
The claim query
The claimNextPending method uses SKIP LOCKED:
claimNextPending(workerId, lockDuration):
    // atomically find and lock the next available job
    find first job where:
        status = 'pending'
        and (not locked or lock expired)
    order by created_at
    lock it with SKIP LOCKED  // if already locked, try next one
    set locked_by = workerId
    set locked_until = now + lockDuration
    set started_at = now (if not already set)
    return job
FOR UPDATE SKIP LOCKED means: lock this row, but if it’s already locked, skip it and try the next one. Multiple workers can run this simultaneously without blocking.
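In actual SQL, the whole claim can be a single statement. Roughly this, using the column names from the table above (the exact query we ship may differ):

import psycopg  # assuming the psycopg 3 driver; any Postgres client works

CLAIM_SQL = """
UPDATE screenshot_jobs
SET locked_by = %(worker_id)s,
    locked_until = NOW() + %(lock_for)s::interval,
    started_at = COALESCE(started_at, NOW())
WHERE id = (
    SELECT id FROM screenshot_jobs
    WHERE status = 'pending'
      AND (locked_until IS NULL OR locked_until < NOW())
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED  -- if the row is locked, try the next one
)
RETURNING *;
"""

def claim_next_pending(conn, worker_id, lock_for="2 minutes"):
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, {"worker_id": worker_id, "lock_for": lock_for})
        return cur.fetchone()  # None when there's nothing to claim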
Polling for results
Clients poll the job endpoint:
GET /v1/screenshots/{id}:
    job = findById(id)
    if not found: return 404
    if job.status == "completed":
        return { id, status, imageUrl, fileSize, completedAt }
    if job.status == "failed":
        return { id, status, error: { code, message }, failedAt }
    else: // pending or processing
        return { id, status, createdAt, startedAt }
We recommend polling every 1-2 seconds, since most jobs complete in 3-5 seconds. We will also offer webhooks, which may be a better fit than polling for many integrations.
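A polling client following that advice might look something like this (placeholder host and auth header, as in the earlier sketch):

import time
import requests

BASE = "https://api.example.com/v1"             # placeholder host
HEADERS = {"Authorization": "Bearer YOUR_KEY"}  # placeholder auth

def wait_for_screenshot(job_id, timeout_s=60.0):
    # Poll every ~1.5 seconds until the job settles or we give up.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = requests.get(f"{BASE}/screenshots/{job_id}",
                           headers=HEADERS, timeout=10).json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(1.5)  # most jobs complete in 3-5 seconds
    raise TimeoutError(f"job {job_id} did not settle within {timeout_s}s")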
What we discovered
From talking to users (we don’t have usage data yet), we learned a few things:
Sync is what most users want. We planned to build async first, thinking everyone would want it. Then we realized most integrations are simpler with sync.
Playwright needs memory. Each browser context uses 50-100MB. With a 4GB VPS and Postgres also running, we can safely run 2-3 concurrent browser instances. Sync requests share a browser pool; async workers have their own.
Some sites fail consistently. Sites with aggressive bot protection (Cloudflare challenges, CAPTCHAs) fail every time. We need to handle these gracefully: fail fast, return a clear error message, or find a way around the protection.
The 500ms poll is fine. We worried a little about database load from async job polling. In practice, the query is fast (indexed, returns quickly when there are no jobs), and an extra 500ms of latency is imperceptible next to a multi-second capture.
What we built today
- Implemented sync capture for simple use cases
- Built an async job queue with Postgres for batch/pipeline use cases
- Added retry logic with exponential backoff
- Connected both paths to the same capture service
- Added proper error categorization
The core product now works. A user can POST a URL and get a screenshot - either instantly or eventually.
Tomorrow: device emulation
On day 10 we’ll focus on mobile and tablet screenshots. Different viewports, user agents, and device scale factors.
Book of the day
Today, we have 2 recommendations, inspired by some of the work we’re doing at the moment.
Designing Distributed Systems by Brendan Burns
Burns is a co-founder of Kubernetes, and this book covers patterns for building reliable distributed systems. Even though we’re running on a single VPS, the patterns apply.
The chapter on work queue systems directly informed our async design. He covers single-worker, multi-worker, and coordinated batch patterns. Our “claim with skip locked” approach is a variation of his mutex-based work queue.
What I appreciate: he acknowledges that not everything needs to be distributed. Sometimes sync is fine. Sometimes a single worker is fine. The patterns are tools, not mandates.
Another good (and quick) read could be:
Design Patterns for Container-Based Distributed Systems, also by Brendan Burns, with David Oppenheimer
In this paper, Burns and Oppenheimer note that containers are becoming the “objects” of distributed systems, and just as OOP gave us reusable design patterns, containers are doing the same. They identify three categories: single-container patterns for management interfaces, single-node patterns (Sidecar, Ambassador, Adapter) for cooperating containers, and multi-node patterns (Leader Election, Work Queue, Scatter/Gather) for distributed algorithms. The core insight: build complex infrastructure once, reuse everywhere.