Day 22: Monitoring & alerting - sleeping without worry
Day 22 of 30. Today we set up monitoring so we know when things break.
We have a pilot customer running tests. If something breaks at 3 AM, we need to know. Currently, we’d find out when we get angry emails.
Let’s do better than that.
What we need to monitor
- Uptime - Is the API responding?
- Error rate - Are screenshots failing?
- Response time - Is the API getting slow?
- Resources - CPU, memory, disk
The stack
We’re keeping it simple and cheap:
- UptimeRobot (free) - External uptime monitoring
- Application metrics - Built into Spring Boot Actuator
- Simple alerting - Email/Slack when things break
- Structured logging - JSON logs we can search
We’re not yet in a position to pay for Datadog or New Relic, so we need a slightly less ideal solution. Alternatively, we could run an ELK (Elasticsearch, Logstash, Kibana) stack ourselves, but that would mean more storage, more management overhead, and one more thing to monitor. Turtles all the way down.
UptimeRobot setup
UptimeRobot offers a generous free tier: 50 monitors with 5-minute checks.
We set up:
- GET /health - Basic health check
- POST /v1/screenshots with a test payload - Full API check
If either fails 2 consecutive checks, we get an email and Slack notification.
Total setup time: 10 minutes. A good investment of our time.
Application metrics
We added custom metrics using Micrometer:
import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import org.springframework.stereotype.Component

@Component
class ScreenshotMetrics(
    private val meterRegistry: MeterRegistry
) {
    private val screenshotsTotal = Counter.builder("screenshots.total")
        .description("Total screenshots attempted")
        .register(meterRegistry)

    private val screenshotsSuccessful = Counter.builder("screenshots.successful")
        .description("Successful screenshots")
        .register(meterRegistry)

    private val screenshotsFailed = Counter.builder("screenshots.failed")
        .description("Failed screenshots")
        .register(meterRegistry)

    private val captureTimer = Timer.builder("screenshots.capture.duration")
        .description("Screenshot capture duration")
        .register(meterRegistry)

    fun recordAttempt() = screenshotsTotal.increment()
    fun recordSuccess() = screenshotsSuccessful.increment()
    fun recordFailure() = screenshotsFailed.increment()

    // Times the block and records the duration against screenshots.capture.duration.
    fun <T> recordCapture(block: () -> T): T {
        return captureTimer.recordCallable { block() }!!
    }
}
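Here’s a sketch of how this gets wired into the capture path. ScreenshotService and captureWithBrowser() below are illustrative names, not the exact production code:

import org.springframework.stereotype.Service

// Illustrative only: record an attempt, time the capture, and record
// success or failure around the real browser call.
@Service
class ScreenshotService(private val metrics: ScreenshotMetrics) {

    fun capture(url: String): ByteArray {
        metrics.recordAttempt()
        return try {
            val bytes = metrics.recordCapture { captureWithBrowser(url) }
            metrics.recordSuccess()
            bytes
        } catch (e: Exception) {
            metrics.recordFailure()
            throw e
        }
    }

    private fun captureWithBrowser(url: String): ByteArray {
        TODO("the real browser capture lives elsewhere")
    }
}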
Each metric is exposed at /actuator/metrics/{name}, e.g. /actuator/metrics/screenshots.total:
{
"name": "screenshots.total",
"measurements": [{"statistic": "COUNT", "value": 1547}]
}
Health endpoint
Enhanced health check that reports system status:
import java.lang.management.ManagementFactory
import org.springframework.boot.actuate.health.Health
import org.springframework.boot.actuate.health.HealthIndicator
import org.springframework.stereotype.Component

@Component
class SystemHealthIndicator : HealthIndicator {

    override fun health(): Health {
        val runtime = Runtime.getRuntime()
        val usedMemory = runtime.totalMemory() - runtime.freeMemory()
        val maxMemory = runtime.maxMemory()
        val memoryUsage = usedMemory.toDouble() / maxMemory

        // Report DOWN once the JVM uses more than 90% of its max heap.
        return if (memoryUsage < 0.9) {
            Health.up()
                .withDetail("memory_usage", "%.1f%%".format(memoryUsage * 100))
                .withDetail("uptime_seconds", ManagementFactory.getRuntimeMXBean().uptime / 1000)
                .build()
        } else {
            Health.down()
                .withDetail("reason", "Memory pressure")
                .withDetail("memory_usage", "%.1f%%".format(memoryUsage * 100))
                .build()
        }
    }
}
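One gotcha: Spring Boot hides health details by default, so the memory and uptime fields only show up in the response once management.endpoint.health.show-details is set to always (or when-authorized).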
Structured logging
We switched to JSON logging for better searchability:
// logback-spring.xml
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
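The LogstashEncoder comes from the logstash-logback-encoder library, so the build needs that dependency. Roughly, in Gradle (Kotlin DSL), something like:

// build.gradle.kts - use whatever version is current when you read this
dependencies {
    implementation("net.logstash.logback:logstash-logback-encoder:7.4")
}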
Now logs look like:
{
"@timestamp": "2024-01-20T14:30:00.000Z",
"level": "INFO",
"logger": "ScreenshotService",
"message": "Screenshot completed",
"screenshot_id": "scr_abc123",
"url": "https://example.com",
"duration_ms": 2340,
"file_size": 245678
}
This makes the logs easy to grep for specific fields, and much easier to aggregate into a proper logging system later.
Alerting
For simple alerting, we post to a Slack webhook, and we also send a message to our Telegram chat:
import org.springframework.beans.factory.annotation.Value
import org.springframework.stereotype.Service
import org.springframework.web.client.RestTemplate

@Service
class AlertService(
    @Value("\${slack.webhook.url}") private val slackWebhookUrl: String,
    private val restTemplate: RestTemplate
) {
    fun sendAlert(title: String, message: String, severity: String = "warning") {
        // Map severity to Slack attachment colours.
        val color = when (severity) {
            "critical" -> "#dc3545"
            "warning" -> "#ffc107"
            else -> "#17a2b8"
        }

        val payload = mapOf(
            "attachments" to listOf(mapOf(
                "color" to color,
                "title" to title,
                "text" to message,
                "ts" to System.currentTimeMillis() / 1000
            ))
        )

        restTemplate.postForEntity(slackWebhookUrl, payload, String::class.java)
    }
}
We trigger alerts for the following:
- Any error rate above 10% (over 5 minutes)
- Response time above 10 seconds (average)
- Memory usage above 90%
- Any 500 errors
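The error-rate trigger is the only one that needs a bit of code on our side. Here’s a minimal sketch of how it could work using the Micrometer counters from earlier; AlertMonitor is an illustrative name, and the real thresholds live in config:

import io.micrometer.core.instrument.MeterRegistry
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

// Illustrative sketch - requires @EnableScheduling on the application class.
// Compares the cumulative counters against the previous run to approximate
// a 5-minute error-rate window.
@Component
class AlertMonitor(
    private val meterRegistry: MeterRegistry,
    private val alertService: AlertService
) {
    private var lastTotal = 0.0
    private var lastFailed = 0.0

    @Scheduled(fixedRate = 5 * 60 * 1000L)
    fun checkErrorRate() {
        val total = meterRegistry.counter("screenshots.total").count()
        val failed = meterRegistry.counter("screenshots.failed").count()
        val windowTotal = total - lastTotal
        val windowFailed = failed - lastFailed
        lastTotal = total
        lastFailed = failed

        if (windowTotal > 0 && windowFailed / windowTotal > 0.10) {
            alertService.sendAlert(
                title = "High error rate",
                message = "%.0f%% of screenshots failed in the last 5 minutes"
                    .format(windowFailed / windowTotal * 100),
                severity = "critical"
            )
        }
    }
}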
Status page
We’ve also created a status page with actual information about the system. It’s public, and you can view it at https://status.allscreenshots.com/.
Dashboard for ourselves
We built a simple internal dashboard showing:
- Screenshots per hour (last 24h)
- Error rate (last 24h)
- Average capture time
- Top errors by type
- Active users today
@GetMapping("/internal/stats")
fun getStats(): StatsResponse {
    val last24h = Instant.now().minus(24, ChronoUnit.HOURS)

    return StatsResponse(
        screenshotsLast24h = screenshotRepository.countSince(last24h),
        errorsLast24h = screenshotRepository.countErrorsSince(last24h),
        avgCaptureTimeMs = screenshotRepository.avgCaptureTimeSince(last24h),
        topErrors = screenshotRepository.topErrorsSince(last24h, limit = 5),
        activeUsersToday = screenshotRepository.distinctUsersSince(last24h)
    )
}
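StatsResponse itself is nothing fancy; roughly a data class along these lines (the exact field types may differ from what the repository queries return):

data class StatsResponse(
    val screenshotsLast24h: Long,
    val errorsLast24h: Long,
    val avgCaptureTimeMs: Double,
    val topErrors: List<String>,   // error type names, most frequent first
    val activeUsersToday: Long
)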
What we built today
- UptimeRobot monitoring (external)
- Application metrics (internal)
- Enhanced health checks
- Structured JSON logging
- Slack alerting
- Public status page
- Internal stats dashboard
We can now sleep at night. If something breaks, we’ll know.
Tomorrow: feature requests from the pilot
On day 23 we’ll work through feedback from the pilot. Our users have been testing, and they have feature requests and a few questions. Exciting developments!
Book of the day
Site Reliability Engineering by Betsy Beyer et al.
The Google SRE book. It’s free to read online, but the physical copy is worth having.
Key concepts we applied today: monitoring should answer “is the system working?” not “is the server running?” Focus on symptoms (failed screenshots) not causes (high CPU). Alert on user-impacting issues, not internal metrics.
The chapter on monitoring and alerting is particularly relevant. Google’s approach: if an alert doesn’t require immediate human action, it shouldn’t page you.
The book is a little dense, but essential reading for anyone running production systems.