Monitoring
Dashboards
- Platform Health — request rate, p95 latency, error ratio, CPU/mem per service.
- Attendance Funnel — punches ingested, processed, errored, flagged by code (per 5 min).
- Queue Health — Horizon queue depth + throughput per queue.
- Business KPIs — check-in completion rate by unit, late-check-in ratio, week-over-week trends.
Alerts
| Alert | Threshold | Severity |
|---|---|---|
/attendance/punch error ratio | > 2% over 5 min | page |
Queue depth punches | > 5,000 for 10 min | page |
| DB CPU | > 80% sustained 10 min | notify |
| Backup job failure | any | page |
| TLS cert expiry | < 14 days | notify |
| Login failure rate | > 20% over 10 min | notify (possible attack) |
SLOs
- Punch ingestion — 99.9% of requests in 2026 are accepted (HTTP 2xx) within 500 ms p95.
- Processing lag — 99% of punches produce an attendance row within 60 seconds.
- Export — 95% of exports delivered under 2 minutes.
Log Taxonomy
Structured fields on every log line:
{ "ts": "...", "level": "...", "trace_id": "...", "user_id": 42,
"org_id": 1, "unit_id": 12, "action": "attendance.punch",
"outcome": "accepted", "flags": ["LATE_CHECK_IN"] }
Synthetic Monitoring
A canary job runs every 5 minutes:
- Logs in as a seeded synthetic employee.
- Performs a check-in with simulated GPS.
- Verifies the attendance row appears within 30 s.
- Alerts if the end-to-end pipeline breaks, independent of per-host probes.