Day 22: Monitoring & alerting - sleeping without worry

Day 22 of 30. Today we set up monitoring so we know when things break.

#alerting #devops

We have a pilot customer running tests. If something breaks at 3 AM, we need to know. Currently, we’d find out when we get angry emails.

Let’s do better than that.

What we need to monitor

  1. Uptime - Is the API responding?
  2. Error rate - Are screenshots failing?
  3. Response time - Is the API getting slow?
  4. Resources - CPU, memory, disk

The stack

We’re keeping it simple and cheap:

  • UptimeRobot (free) - External uptime monitoring
  • Application metrics - Built into Spring Boot Actuator
  • Simple alerting - Email/Slack when things break
  • Structured logging - JSON logs we can search

We’re not yet in a position to pay for Datadog or New Relic, so we make do with a slightly less ideal setup. Alternatively, we could run an ELK (Elasticsearch, Logstash, Kibana) stack, but that would mean more storage, more management overhead, and perhaps yet another thing to monitor. Turtles all the way down.

Interactive monitoring dashboard

Here’s what our monitoring dashboard shows right now:

  • Uptime: 99.9% (last 30 days)
  • Avg response: 2.3s (last hour)
  • Error rate: 0.2% (below threshold)
  • Memory: 2.4GB of 4GB

Alert rules:

  • Error rate > 10% -> Slack + Email
  • Response > 10s -> Slack
  • Memory > 90% -> Slack + PagerDuty
  • Any 500 error -> Slack

Recent events:

14:32:01 INFO Screenshot completed: scr_abc123 (2.1s)
14:31:58 INFO New request from user_xyz
14:31:45 WARN Slow response: 4.2s (threshold: 3s)
14:31:32 INFO Health check passed

UptimeRobot setup

UptimeRobot offers a generous free tier: 50 monitors with 5-minute checks.

We set up:

  • GET /health - Basic health check
  • POST /v1/screenshots with test payload - Full API check

If either fails 2 consecutive checks, we get an email and Slack notification.

Total setup time: 10 minutes. A good investment of our time.

Application metrics

We added custom metrics using Micrometer:

import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import org.springframework.stereotype.Component

@Component
class ScreenshotMetrics(
    private val meterRegistry: MeterRegistry
) {
    private val screenshotsTotal = Counter.builder("screenshots.total")
        .description("Total screenshots attempted")
        .register(meterRegistry)

    private val screenshotsSuccessful = Counter.builder("screenshots.successful")
        .description("Successful screenshots")
        .register(meterRegistry)

    private val screenshotsFailed = Counter.builder("screenshots.failed")
        .description("Failed screenshots")
        .register(meterRegistry)

    private val captureTimer = Timer.builder("screenshots.capture.duration")
        .description("Screenshot capture duration")
        .register(meterRegistry)

    fun recordAttempt() = screenshotsTotal.increment()
    fun recordSuccess() = screenshotsSuccessful.increment()
    fun recordFailure() = screenshotsFailed.increment()

    // recordCallable expects a Callable, so wrap the Kotlin lambda
    fun <T> recordCapture(block: () -> T): T {
        return captureTimer.recordCallable { block() }!!
    }
}
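
In the capture path, the metrics component gets used roughly like this. A minimal sketch, assuming a hypothetical BrowserCapturer component; the real ScreenshotService has more going on:

import org.springframework.stereotype.Service

// Hypothetical stand-in for the real browser automation component
interface BrowserCapturer {
    fun capture(url: String): ByteArray
}

@Service
class ScreenshotService(
    private val metrics: ScreenshotMetrics,
    private val capturer: BrowserCapturer
) {
    fun capture(url: String): ByteArray {
        metrics.recordAttempt()
        return try {
            // Only the capture itself is timed under screenshots.capture.duration
            val bytes = metrics.recordCapture { capturer.capture(url) }
            metrics.recordSuccess()
            bytes
        } catch (e: Exception) {
            metrics.recordFailure()
            throw e
        }
    }
}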

Exposed via Actuator at /actuator/metrics/screenshots.total:

{
  "name": "screenshots.total",
  "measurements": [{"statistic": "COUNT", "value": 1547}]
}
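
One gotcha: Actuator doesn’t expose the metrics endpoint over HTTP by default, so it has to be opted in with a one-liner in application.properties:

# application.properties – expose metrics alongside health over HTTP
management.endpoints.web.exposure.include=health,metrics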

Health endpoint

Enhanced health check that reports system status:

import java.lang.management.ManagementFactory
import org.springframework.boot.actuate.health.Health
import org.springframework.boot.actuate.health.HealthIndicator
import org.springframework.stereotype.Component

@Component
class SystemHealthIndicator : HealthIndicator {
    override fun health(): Health {
        val runtime = Runtime.getRuntime()
        val usedMemory = runtime.totalMemory() - runtime.freeMemory()
        val maxMemory = runtime.maxMemory()
        val memoryUsage = usedMemory.toDouble() / maxMemory

        return if (memoryUsage < 0.9) {
            Health.up()
                .withDetail("memory_usage", "%.1f%%".format(memoryUsage * 100))
                .withDetail("uptime_seconds", ManagementFactory.getRuntimeMXBean().uptime / 1000)
                .build()
        } else {
            Health.down()
                .withDetail("reason", "Memory pressure")
                .withDetail("memory_usage", "%.1f%%".format(memoryUsage * 100))
                .build()
        }
    }
}
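
With details enabled (management.endpoint.health.show-details=always), /actuator/health reports our indicator under a "system" component. Trimmed to just that component, the response looks roughly like this (values illustrative):

{
  "status": "UP",
  "components": {
    "system": {
      "status": "UP",
      "details": {
        "memory_usage": "61.2%",
        "uptime_seconds": 86400
      }
    }
  }
}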

Structured logging

We switched to JSON logging for better searchability, using logstash-logback-encoder:

<!-- logback-spring.xml -->
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
<root level="INFO">
    <appender-ref ref="JSON"/>
</root>
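
The extra fields in the example below (screenshot_id, duration_ms, and so on) come from structured arguments. A sketch of the logging call using the encoder’s StructuredArguments helper; the wrapper class here is just for illustration:

import net.logstash.logback.argument.StructuredArguments.kv
import org.slf4j.LoggerFactory

class ScreenshotLogging {
    private val log = LoggerFactory.getLogger("ScreenshotService")

    fun logCompleted(screenshotId: String, url: String, durationMs: Long, fileSize: Long) {
        // Each kv() pair becomes its own field in the JSON log line
        log.info(
            "Screenshot completed",
            kv("screenshot_id", screenshotId),
            kv("url", url),
            kv("duration_ms", durationMs),
            kv("file_size", fileSize)
        )
    }
}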

Now logs look like:

{
  "@timestamp": "2024-01-20T14:30:00.000Z",
  "level": "INFO",
  "logger": "ScreenshotService",
  "message": "Screenshot completed",
  "screenshot_id": "scr_abc123",
  "url": "https://example.com",
  "duration_ms": 2340,
  "file_size": 245678
}

This makes the logs greppable, and easier to aggregate into a proper logging system later.

Alerting

For simple alerting, we post to a Slack webhook and mirror the message to our Telegram chat:

import org.springframework.beans.factory.annotation.Value
import org.springframework.stereotype.Service
import org.springframework.web.client.RestTemplate

@Service
class AlertService(
    @Value("\${slack.webhook.url}") private val slackWebhookUrl: String,
    private val restTemplate: RestTemplate
) {
    fun sendAlert(title: String, message: String, severity: String = "warning") {
        // Map severity to Slack attachment colors
        val color = when (severity) {
            "critical" -> "#dc3545"
            "warning" -> "#ffc107"
            else -> "#17a2b8"
        }

        val payload = mapOf(
            "attachments" to listOf(mapOf(
                "color" to color,
                "title" to title,
                "text" to message,
                "ts" to System.currentTimeMillis() / 1000
            ))
        )

        restTemplate.postForEntity(slackWebhookUrl, payload, String::class.java)
    }
}

We trigger alerts for the following (see the check-loop sketch after this list):

  • Any error rate above 10% (over 5 minutes)
  • Response time above 10 seconds (average)
  • Memory usage above 90%
  • Any 500 errors
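
The check loop itself is nothing fancy. A minimal sketch of the error-rate rule, assuming @EnableScheduling is on and using the Micrometer counters from earlier; the AlertRules class and its window bookkeeping are illustrative, not our exact code:

import io.micrometer.core.instrument.MeterRegistry
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

@Component
class AlertRules(
    private val meterRegistry: MeterRegistry,
    private val alertService: AlertService
) {
    // Counters are cumulative, so keep the previous values to get a 5-minute window
    private var lastTotal = 0.0
    private var lastFailed = 0.0

    @Scheduled(fixedRate = 300_000L) // every 5 minutes
    fun checkErrorRate() {
        val total = meterRegistry.counter("screenshots.total").count()
        val failed = meterRegistry.counter("screenshots.failed").count()
        val attempts = total - lastTotal
        val failures = failed - lastFailed
        lastTotal = total
        lastFailed = failed

        if (attempts > 0 && failures / attempts > 0.10) {
            alertService.sendAlert(
                title = "High error rate",
                message = "%.0f%% of screenshots failed in the last 5 minutes"
                    .format(failures / attempts * 100),
                severity = "critical"
            )
        }
    }
}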

Status page

We’ve also created a status page with actual information about the system. It’s public, and you can view it at https://status.allscreenshots.com/.

Dashboard for ourselves

We built a simple internal dashboard showing:

  • Screenshots per hour (last 24h)
  • Error rate (last 24h)
  • Average capture time
  • Top errors by type
  • Active users today

Backed by a single internal endpoint:

@GetMapping("/internal/stats")
fun getStats(): StatsResponse {
    val last24h = Instant.now().minus(24, ChronoUnit.HOURS)

    return StatsResponse(
        screenshotsLast24h = screenshotRepository.countSince(last24h),
        errorsLast24h = screenshotRepository.countErrorsSince(last24h),
        avgCaptureTimeMs = screenshotRepository.avgCaptureTimeSince(last24h),
        topErrors = screenshotRepository.topErrorsSince(last24h, limit = 5),
        activeUsersToday = screenshotRepository.distinctUsersSince(last24h)
    )
}
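
StatsResponse isn’t shown above; it’s just a small DTO along these lines (field types assumed):

// Sketch of the DTO returned by /internal/stats – shapes assumed, not our exact types
data class StatsResponse(
    val screenshotsLast24h: Long,
    val errorsLast24h: Long,
    val avgCaptureTimeMs: Double,
    val topErrors: List<Pair<String, Long>>, // error type to count
    val activeUsersToday: Long
)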

What we built today

  • UptimeRobot monitoring (external)
  • Application metrics (internal)
  • Enhanced health checks
  • Structured JSON logging
  • Slack alerting
  • Public status page
  • Internal stats dashboard

We can now sleep at night. If something breaks, we’ll know.

Tomorrow: feature requests from the pilot

On day 23 we’ll address feedback from our pilot. Our users have been testing, and they have feature requests and a few questions. Exciting developments!

Book of the day

Site Reliability Engineering by Betsy Beyer et al.

The Google SRE book. It’s free to read online, but the physical copy is worth having.

Key concepts we applied today: monitoring should answer “is the system working?” not “is the server running?” Focus on symptoms (failed screenshots) not causes (high CPU). Alert on user-impacting issues, not internal metrics.

The chapter on monitoring and alerting is particularly relevant. Google’s approach: if an alert doesn’t require immediate human action, it shouldn’t page you.

The book is a little dense, but essential reading for anyone running production systems.


Day 22 stats

  • Hours: 65
  • Code: 4,300
  • Revenue: $0
  • Customers: 0
  • Hosting: $5.5/mo

Achievements:
[✓] Monitoring setup complete [✓] Status page live [✓] Alerting configured