Chapter 12 Monitoring, Logging & Observability
I. Chapter Purpose & Scope
Fix pipeline monitoring, logging, and observability specifications: metrics & dimensions, logs & tracing, dashboards & alerts, SLA/SLO & error budgeting, runtime health & capacity trends, audit & exports; ensure alignment with data contracts, DQ gates, orchestration, and the Metrology chapter.
II. Terminology & Dependencies
- Terms: metrics, logs, traces, SLA/SLO, error_budget, alert_rules, blackbox/whitebox, p50/p95/p99, AIOps, RCA (root cause analysis).
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); DQ gates (DatasetCards v1.0); orchestration/scheduling/resources (this volume, Ch.10).
- Math & symbols: wrap inline symbols (e.g., QPS, T_inf, ρ, p99, ψ) in backticks; any division/integral/composite operator must use parentheses; if path quantities T_arr appear, register gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p50", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p95", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
      - {name: "utilization_rho", unit: "ratio", agg: "mean", window: "5m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
      - {name: "drift.psi", unit: "—", agg: "mean", window: "15m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
      - {name: "net_mbps", unit: "Mbps", agg: "mean", window: "1m"}
      - {name: "disk_io_mbps", unit: "Mbps", agg: "mean", window: "1m"}
  logs:
    level: "info|warn|error"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://.../logs/", "kafka://.../topic"]
    pii_redaction: true
  traces:
    enabled: true
    sampler: "parent|probabilistic"
    ratio: 0.05
    propagator: "w3c|b3"
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
    dq: ["grafana:/boards/dq_quality"]
    cost: ["grafana:/boards/costs"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
    - {name: "dq_drop", rule: "dq.pass_rate<0.98 for 15m", severity: "medium", channel: "slack"}
    - {name: "drift_alert", rule: "drift.psi>0.2 for 30m", severity: "low", channel: "email"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
      - {name: "availability", target: 0.999, window: "30d"}
      - {name: "dq_pass_rate", target: 0.99, window: "30d"}
    error_budget_policy: "freeze_releases|throttle|page_on_call"
IV. Metric System & Postures
- Performance: QPS (1/s), latency_ms.{p50,p95,p99}, queueing delay, and throughput–latency curves.
- Quality: dq.pass_rate, drift.psi (or KL/KS), leakage counts & rates.
- Resources: CPU/memory/network/disk I/O, cache hit rate, and origin fetch ratio.
- Aggregation & windows: one agg/window posture across all metrics; agg is one of sum|mean|quant|max|min.
- Units & dimensions: SI units everywhere; normalize units before composition; must pass check_dim.
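The quant aggregation for latency_ms.{p50,p95,p99} can be sketched as follows. This is a minimal illustration, assuming the nearest-rank percentile definition (the chapter does not say which estimator to use); the function names nearest_rank and summarize_window are hypothetical.

```python
def nearest_rank(samples, pct):
    """Return the pct-th percentile (integer 1..100) by the nearest-rank method."""
    if not samples:
        raise ValueError("empty window")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_window(latencies_ms):
    """Aggregate one window of latency samples into the three quant series."""
    return {
        "latency_ms.p50": nearest_rank(latencies_ms, 50),
        "latency_ms.p95": nearest_rank(latencies_ms, 95),
        "latency_ms.p99": nearest_rank(latencies_ms, 99),
    }


# Synthetic 1m window: 100 samples with latencies 1..100 ms.
window = list(range(1, 101))
summary = summarize_window(window)
```

Other estimators (for example, linear interpolation between ranks) give slightly different values on small windows, so whichever definition is chosen should be applied consistently across dashboards and alert rules.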
V. Logging & Tracing
- Logs: structured jsonl with ts, level, stage, run_id, trace_id, span_id, error_code, message, artifact_hash; enable PII redaction/masking.
- Tracing: distributed tracing with trace_id/span_id, stage name, I/O sizes, key parameter hashes; sampling strategy parent (inherit) or probabilistic with ratio.
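The log fields and sampling modes above can be illustrated with a short sketch. It makes several assumptions not stated in this chapter: e-mail addresses stand in for PII, the probabilistic decision is derived from the first 64 bits of a hex trace_id, and a "parent" sampler with no upstream decision falls back to the ratio. The helper names are hypothetical.

```python
import json
import re
from datetime import datetime, timezone

# Assumption: e-mail addresses stand in for PII; a real redactor covers more patterns.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Mask e-mail addresses in free text."""
    return EMAIL_RE.sub("[REDACTED]", text)


def make_log_line(level, stage, run_id, trace_id, span_id, message,
                  error_code=None, artifact_hash=None):
    """Build one jsonl line carrying the structured-log fields listed above."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "stage": stage,
        "run_id": run_id,
        "trace_id": trace_id,
        "span_id": span_id,
        "error_code": error_code,
        "message": redact(message),
        "artifact_hash": artifact_hash,
    }
    return json.dumps(record, ensure_ascii=False)


def should_sample(sampler, ratio, trace_id, parent_sampled=None):
    """Decide whether to record a trace under 'parent' or 'probabilistic'."""
    if sampler == "parent" and parent_sampled is not None:
        return parent_sampled  # inherit the upstream decision
    # Assumption: map the first 64 bits of the hex trace_id onto [0, 1]
    # and keep the trace when that value falls below ratio.
    bucket = int(trace_id[:16], 16) / float(1 << 64)
    return bucket < ratio


line = make_log_line("error", "ingest", "run-001", "0" * 32, "0" * 16,
                     "upload failed for alice@example.com", error_code="E_IO")
```

Deriving the decision from trace_id rather than a random draw keeps every stage of one trace on the same side of the sampling boundary, which is one reason to prefer it in a multi-stage pipeline.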
VI. Dashboards & Alerts
- Dashboards: system/quality/cost boards; each includes core time series, Top-K hotspots, and RCA links (jump to traces/logs).
- Alerts: "metric threshold + duration" syntax (e.g., latency_ms.p99>20000 for 10m); support suppression & aggregation to avoid alert storms; each rule carries a fixed severity & channel.
- SLO/error budget: on violations of latency_p99, availability, or dq_pass_rate, enforce error budget policy (freeze releases/throttle/page on-call).
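As a worked example of the error budget behind these objectives: an availability target of 0.999 over a 30d window allows (1 - 0.999) * 30 * 24 * 60 = 43.2 minutes of unavailability. The sketch below computes that budget and also checks a "threshold + duration" condition. It assumes one sample per minute and that a rule fires only when every sample in the duration breaches the threshold; the function names are hypothetical.

```python
def error_budget_minutes(target, window_days):
    """Allowed 'bad' minutes in the window for an availability-style target."""
    return (1.0 - target) * window_days * 24 * 60


def budget_remaining(target, window_days, bad_minutes):
    """Fraction of the error budget left; a negative value means the SLO is violated."""
    budget = error_budget_minutes(target, window_days)
    return (budget - bad_minutes) / budget


def sustained_breach(values, threshold, points):
    """True if the last `points` samples exist and all exceed `threshold`."""
    recent = values[-points:]
    return len(recent) == points and all(v > threshold for v in recent)


budget = error_budget_minutes(0.999, 30)           # about 43.2 minutes
half_left = budget_remaining(0.999, 30, 21.6)      # about 0.5
fires = sustained_breach([21000] * 10, 20000, 10)  # p99 above 20000 ms for 10m
```

A policy such as freeze_releases could then be keyed to budget_remaining dropping below a chosen fraction; that fraction is a team decision and is not specified here.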
VII. Observability & Health
- Health score: weighted composite over perf/quality/resources into health_score∈[0,1].
- Capacity trends: 30/90-day capacity & cost trends for scaling/budgeting decisions.
- Blackbox/whitebox: blackbox probes cover end-to-end paths; whitebox exposes internal stage metrics and thread/queue depths.
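One way to form the weighted composite health score is sketched below. The weights and the assumption that each component is already normalized to [0, 1] are illustrative only; the chapter does not fix either.

```python
def health_score(components, weights):
    """Weighted mean of component scores, clamped so the result stays in [0, 1]."""
    if set(components) != set(weights):
        raise ValueError("components and weights must share the same keys")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        weights[name] * min(1.0, max(0.0, components[name]))  # clamp each input
        for name in weights
    )
    return weighted / total


# Hypothetical weights; perf counts twice as much as quality or resources.
weights = {"perf": 0.5, "quality": 0.25, "resources": 0.25}
score = health_score({"perf": 1.0, "quality": 0.5, "resources": 0.0}, weights)
```

Dividing by the weight total means the weights need not sum to 1, which makes it easier to add or drop a component without rescaling the others.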
VIII. Metrology & Units (SI)
- Mandatory: metrology:{units:"SI", check_dim:true}.
- Perf/resources: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—), net_mbps, size_bytes.
- Path quantities: if monitoring/logging involves T_arr, register delta_form, path="gamma(ell)", and measure="d ell"; then use one of the equivalent forms below and pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
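The two forms agree whenever c_ref is constant along the path, since a constant factor can be moved inside or outside the integral. The sketch below checks this numerically with a trapezoid rule. The n_eff profile, the path length, and the value used for c_ref are all made-up inputs for illustration; with ell in meters, c_ref in m/s, and n_eff dimensionless, the result is in seconds.

```python
import math


def trapezoid(ys, xs):
    """Trapezoid-rule integral of samples ys over grid xs."""
    return sum(
        0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
        for i in range(len(xs) - 1)
    )


c_ref = 299_792_458.0                 # assumed reference speed, m/s
ell = [i * 10.0 for i in range(101)]  # path parameter along gamma(ell), 0..1000 m
n_eff = [1.0 + 1e-4 * math.sin(x / 100.0) for x in ell]  # hypothetical profile

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_form1 = (1.0 / c_ref) * trapezoid(n_eff, ell)
# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_form2 = trapezoid([n / c_ref for n in n_eff], ell)
```

If c_ref were allowed to vary along the path, only the second form would remain meaningful.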
IX. Machine-Readable Fragment (Drop-in)
monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
  logs:
    level: "info"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://eift/logs/", "kafka://obs/logs"]
    pii_redaction: true
  traces: {enabled: true, sampler: "probabilistic", ratio: 0.1, propagator: "w3c"}
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
    error_budget_policy: "freeze_releases"
X. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: MON.METRICS_UNIT_SI
    when: "$.monitoring.metrics..unit"
    assert: "all_units_in_SI(value)"
    level: error
  - id: LOG.STRUCTURED_JSONL
    when: "$.monitoring.logs.format"
    assert: "value == 'jsonl'"
    level: error
  - id: TRACE.SAMPLER_VALID
    when: "$.monitoring.traces"
    assert: "value.enabled == false or value.sampler in ['parent','probabilistic']"
    level: error
  - id: ALERT.SYNTAX_VALID
    when: "$.monitoring.alert_rules[*].rule"
    assert: "matches('^[a-z0-9_\\.]+[><=].+ for \\d+[smhd]$')"
    level: error
  - id: SLO.OBJECTIVES_DEFINED
    when: "$.monitoring.slo.objectives"
    assert: "len(value) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
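The pattern used by ALERT.SYNTAX_VALID can be exercised directly. The sketch below applies the same regular expression (with the YAML escaping removed) to a few rule strings; the helper name rule_is_valid is hypothetical, and how the lint engine itself evaluates matches(...) is not specified in this chapter.

```python
import re

# Same pattern as ALERT.SYNTAX_VALID, written as a Python raw string.
ALERT_RULE_RE = re.compile(r"^[a-z0-9_\.]+[><=].+ for \d+[smhd]$")


def rule_is_valid(rule):
    """True if the rule string has the 'metric threshold + duration' shape."""
    return ALERT_RULE_RE.match(rule) is not None


checks = {
    "latency_ms.p99>20000 for 10m": rule_is_valid("latency_ms.p99>20000 for 10m"),
    "dq.pass_rate<0.98 for 15m": rule_is_valid("dq.pass_rate<0.98 for 15m"),
    "latency_ms.p99>20000": rule_is_valid("latency_ms.p99>20000"),  # no duration
    "Latency>1 for 5m": rule_is_valid("Latency>1 for 5m"),          # uppercase name
}
```

The pattern only checks shape: it does not confirm that the metric name exists under monitoring.metrics or that the threshold is numeric, so a separate cross-reference check may still be useful.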
XI. Export Manifest & Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "monitoring/dashboards.json", sha256: "..."}
    - {path: "monitoring/alert_rules.yaml", sha256: "..."}
    - {path: "monitoring/slo_objectives.yaml", sha256: "..."}
    - {path: "logs/index.manifest.json", sha256: "..."}
    - {path: "traces/config.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
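The sha256 entries in the manifest could be filled in with a small helper like the one below. This is a sketch only: it assumes the artifacts live under a single root directory and that the digest is taken over the raw file bytes; build_manifest and sha256_file are hypothetical names, and the references list is omitted.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Hex SHA-256 of a file, read in chunks so large artifacts fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, rel_paths, version="v1.0"):
    """Return an export_manifest dict with one {path, sha256} entry per artifact."""
    root = Path(root)
    return {
        "export_manifest": {
            "version": version,
            "artifacts": [
                {"path": rel, "sha256": sha256_file(root / rel)}
                for rel in rel_paths
            ],
        }
    }
```

Calling build_manifest(root, ["monitoring/dashboards.json", "monitoring/alert_rules.yaml"]) returns a dict that can be serialized next to the artifacts; an auditor can re-run sha256_file on each path and compare.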
XII. Chapter Compliance Checklist
- Metrics, logs, and tracing configurations complete; SI units with check_dim; consistent agg/window.
- Dashboards cover system/quality/cost core views; alert rules valid with severity & channels.
- SLO targets and error-budget policy defined; violations trigger the policy (freeze releases/throttle/page on-call).
- Log redaction enabled; tracing sampler/propagator explicit; RCA links to traces/logs available.
- Export manifest lists dashboards/alerts/SLO/log index/trace config with sha256; citation anchors complete to meet release gates.