Monitoring

Dashboards

  1. Platform Health — request rate, p95 latency, error ratio, CPU/mem per service.
  2. Attendance Funnel — punches ingested, processed, errored, flagged by code (per 5 min).
  3. Queue Health — Horizon queue depth + throughput per queue.
  4. Business KPIs — check-in completion rate by unit, late-check-in ratio, week-over-week trends.

Alerts

Alert                          Threshold                Severity
/attendance/punch error ratio  > 2% over 5 min          page
Queue depth punches            > 5,000 for 10 min       page
DB CPU                         > 80% sustained 10 min   notify
Backup job failure             any                      page
TLS cert expiry                < 14 days                notify
Login failure rate             > 20% over 10 min        notify (possible attack)
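
As a rough illustration of how the thresholds above could be evaluated, the sketch below maps each alert to a predicate and a severity. The metric keys (such as `punch_error_ratio`) are hypothetical names chosen for this example, and the input values are assumed to be already aggregated over the window named in the table (for example, the error ratio over the last 5 minutes). Real alerting would normally live in the monitoring system's own rule language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str                          # alert name as listed in the table
    breached: Callable[[float], bool]  # True when the threshold is crossed
    severity: str                      # "page" or "notify"


# Hypothetical metric keys; ratios are expressed as fractions (2% -> 0.02).
RULES = {
    "punch_error_ratio": AlertRule("/attendance/punch error ratio", lambda v: v > 0.02, "page"),
    "queue_depth_punches": AlertRule("Queue depth punches", lambda v: v > 5000, "page"),
    "db_cpu": AlertRule("DB CPU", lambda v: v > 0.80, "notify"),
    "backup_failures": AlertRule("Backup job failure", lambda v: v > 0, "page"),
    "tls_days_left": AlertRule("TLS cert expiry", lambda v: v < 14, "notify"),
    "login_failure_rate": AlertRule("Login failure rate", lambda v: v > 0.20, "notify"),
}


def evaluate(samples: dict) -> list:
    """Return (alert name, severity) for every windowed sample that breaches its rule."""
    fired = []
    for key, value in samples.items():
        rule = RULES.get(key)
        if rule is not None and rule.breached(value):
            fired.append((rule.name, rule.severity))
    return fired
```

For instance, `evaluate({"punch_error_ratio": 0.03, "db_cpu": 0.5})` reports only the punch error ratio alert, with severity `page`.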

SLOs

  • Punch ingestion — in 2026, 99.9% of requests are accepted (HTTP 2xx), with p95 latency within 500 ms.
  • Processing lag — 99% of punches produce an attendance row within 60 seconds.
  • Export — 95% of exports are delivered in under 2 minutes.
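
To make the punch ingestion target concrete, here is a minimal sketch of the arithmetic, assuming the SLO is checked from a count of accepted requests plus a sample of latencies. The function names are illustrative, and the nearest-rank percentile is one common definition of p95, not necessarily the one the monitoring stack uses.

```python
from fractions import Fraction


def allowed_failures(total: int, target: str) -> int:
    """Error budget: how many requests may fail while still meeting the target.

    The target is passed as a decimal string (e.g. "0.999") so the
    arithmetic is exact rather than subject to float rounding.
    """
    return int(total * (1 - Fraction(target)))


def p95(values: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 requires at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def ingestion_slo_met(accepted: int, total: int, latencies_ms: list) -> bool:
    """Check both halves of the ingestion SLO: 99.9% accepted, p95 <= 500 ms."""
    failures = total - accepted
    return failures <= allowed_failures(total, "0.999") and p95(latencies_ms) <= 500
```

With one million requests, a 99.9% target leaves a budget of 1,000 failed requests; `allowed_failures(1_000_000, "0.999")` returns `1000`.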

Log Taxonomy

Structured fields on every log line:

{ "ts": "...", "level": "...", "trace_id": "...", "user_id": 42,
"org_id": 1, "unit_id": 12, "action": "attendance.punch",
"outcome": "accepted", "flags": ["LATE_CHECK_IN"] }
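
A minimal sketch of a helper that emits a line with these fields is shown below. The function name `log_line` is hypothetical, and the ISO 8601 UTC timestamp is an assumption, since the sample above elides the `ts` format.

```python
import json
from datetime import datetime, timezone


def log_line(level, trace_id, user_id, org_id, unit_id, action, outcome,
             flags=(), now=None):
    """Serialize one structured log record as a single JSON line."""
    # Assumption: ts is an ISO 8601 timestamp in UTC.
    ts = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "ts": ts,
        "level": level,
        "trace_id": trace_id,
        "user_id": user_id,
        "org_id": org_id,
        "unit_id": unit_id,
        "action": action,
        "outcome": outcome,
        "flags": list(flags),
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(log_line("info", "abc123", 42, 1, 12, "attendance.punch",
                   "accepted", ["LATE_CHECK_IN"]))
```

Keeping every field present on every line, even when `flags` is empty, makes the records easier to query consistently.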

Synthetic Monitoring

A canary job runs every 5 minutes:

  • Logs in as a seeded synthetic employee.
  • Performs a check-in with simulated GPS.
  • Verifies the attendance row appears within 30 s.
  • Alerts if the end-to-end pipeline breaks, independent of per-host probes.
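
The canary steps above can be sketched as a single function. The callables `login`, `check_in`, and `row_exists` are hypothetical stand-ins for the real client calls, and the one-second polling interval is an assumption; only the 30 s deadline comes from the list above. A scheduler would invoke this every 5 minutes and raise an alert when it returns `False`.

```python
import time


def run_canary(login, check_in, row_exists, timeout_s=30.0, poll_s=1.0,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Return True if a synthetic check-in yields an attendance row in time."""
    try:
        session = login()             # log in as the seeded synthetic employee
        punch_id = check_in(session)  # check in with simulated GPS
    except Exception:
        return False                  # any failure before polling counts as a break

    deadline = clock() + timeout_s
    while True:
        if row_exists(punch_id):      # the attendance row has appeared
            return True
        if clock() >= deadline:       # gave up after timeout_s seconds
            return False
        sleep(poll_s)
```

Passing `clock` and `sleep` as parameters lets the timeout logic be tested with a fake clock instead of waiting for real time to elapse.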