45-EFT.WP.Data.Pipeline v1.0 | Chapter 12 Monitoring, Logging & Observability

Chapter 12 Monitoring, Logging & Observability

I. Chapter Purpose & Scope

Fix the specifications for pipeline monitoring, logging, and observability: metrics & dimensions, logs & tracing, dashboards & alerts, SLA/SLO & error budgeting, runtime health & capacity trends, audit & exports; ensure alignment with data contracts, DQ gates, orchestration, and the Metrology chapter.

II. Terminology & Dependencies

Terms: metrics, logs, traces, SLA/SLO, error_budget, alert_rules, blackbox/whitebox, p50/p95/p99, AIOps, RCA (root cause analysis).
Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); DQ gates (DatasetCards v1.0); orchestration/scheduling/resources (this volume, Ch.10).
Math & symbols: wrap inline symbols (e.g., QPS, T_inf, ρ, p99, ψ) in backticks; any division/integral/composite operator must use parentheses; if path quantities T_arr appear, register gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.

III. Fields & Structure (Normative)

monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p50", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p95", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
      - {name: "utilization_rho", unit: "ratio", agg: "mean", window: "5m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
      - {name: "drift.psi", unit: "—", agg: "mean", window: "15m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
      - {name: "net_mbps", unit: "Mbps", agg: "mean", window: "1m"}
      - {name: "disk_io_mbps", unit: "Mbps", agg: "mean", window: "1m"}
  logs:
    level: "info|warn|error"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://.../logs/", "kafka://.../topic"]
    pii_redaction: true
  traces:
    enabled: true
    sampler: "parent|probabilistic"
    ratio: 0.05
    propagator: "w3c|b3"
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
    dq: ["grafana:/boards/dq_quality"]
    cost: ["grafana:/boards/costs"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
    - {name: "dq_drop", rule: "dq.pass_rate<0.98 for 15m", severity: "medium", channel: "slack"}
    - {name: "drift_alert", rule: "drift.psi>0.2 for 30m", severity: "low", channel: "email"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
      - {name: "availability", target: 0.999, window: "30d"}
      - {name: "dq_pass_rate", target: 0.99, window: "30d"}
    error_budget_policy: "freeze_releases|throttle|page_on_call"

IV. Metric System & Postures

Performance: QPS (1/s), latency_ms.{p50,p95,p99}, queueing delay, and throughput–latency curves.
Quality: dq.pass_rate, drift.psi (or KL/KS), leakage counts & rates.
Resources: CPU/memory/network/disk I/O, cache hit rate, and origin fetch ratio.
Aggregation & windows: one agg/window convention for all metrics; agg is one of sum|mean|quant|max|min.
Units & dimensions: SI units everywhere; normalize units before composing quantities; every composite must pass check_dim.
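
As an illustration of the agg:"quant" posture, the sketch below computes latency_ms.{p50,p95,p99} for one window. It assumes the nearest-rank percentile definition, which this chapter does not mandate; the function name nearest_rank and the sample values are hypothetical.

```python
def nearest_rank(samples, percent):
    """Nearest-rank percentile: the smallest sample such that at least
    `percent` percent of the window is at or below it."""
    if not samples:
        raise ValueError("empty window")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(samples)
    # Integer ceiling of percent * n / 100 gives the 1-based rank.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latency samples (ms) collected during one 1m window.
window_ms = [120, 95, 430, 88, 101, 2500, 140, 99, 110, 105]

summary = {
    f"latency_ms.p{p}": nearest_rank(window_ms, p) for p in (50, 95, 99)
}
print(summary)
```

With only ten samples, p95 and p99 both land on the largest value, which is one reason short windows give noisy tail percentiles.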

V. Logging & Tracing

Logs: structured jsonl with ts, level, stage, run_id, trace_id, span_id, error_code, message, artifact_hash; enable PII redaction/masking.
Tracing: distributed tracing with trace_id/span_id, stage name, I/O sizes, key parameter hashes; sampling strategy parent (inherit) or probabilistic with ratio.
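
A minimal sketch of the two mechanisms above, under stated assumptions: the log record carries the fields listed for jsonl logs, redaction is reduced to masking e-mail addresses (real PII rules would be broader), and the probabilistic sampler hashes trace_id so that every stage reaches the same decision. The names redact, log_line, and should_sample are hypothetical and not defined by this chapter.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Assumption: e-mail addresses stand in for PII in this sketch.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


def redact(text):
    """Mask e-mail addresses before the message is written."""
    return EMAIL.sub("[REDACTED]", text)


def log_line(level, stage, run_id, trace_id, span_id, message,
             error_code=None, artifact_hash=None):
    """Build one jsonl record with the fields listed in this section."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "stage": stage,
        "run_id": run_id,
        "trace_id": trace_id,
        "span_id": span_id,
        "error_code": error_code,
        "message": redact(message),
        "artifact_hash": artifact_hash,
    }
    return json.dumps(record, ensure_ascii=False)


def should_sample(trace_id, ratio, parent_sampled=None):
    """parent: inherit the upstream decision when one exists.
    probabilistic: keep roughly `ratio` of traces, chosen by trace_id hash."""
    if parent_sampled is not None:
        return parent_sampled
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < ratio * 2**64


print(log_line("error", "ingest", "run-1", "t-1", "s-1",
               "contact alice@example.com failed", error_code="E42"))
```

Hashing trace_id rather than drawing a random number keeps the decision reproducible across stages, which matters when a trace crosses process boundaries without a parent flag.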

VI. Dashboards & Alerts

Dashboards: system/quality/cost boards; each includes core time series, Top-K hotspots, and RCA links (jump to traces/logs).
Alerts: rules use a “metric threshold + duration” syntax (for example, latency_ms.p99>20000 for 10m); suppression and aggregation are supported to avoid alert storms; each rule declares a fixed severity and channel.
SLO/error budget: on violations of latency_p99, availability, or dq_pass_rate, enforce error budget policy (freeze releases/throttle/page on-call).
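
The alert and error-budget postures can be sketched as follows. This is an illustration only: it assumes a rule fires when every sample in the trailing duration breaches the threshold, handles only the > and < operators, and measures the error budget in request counts. The names parse_rule, is_firing, and error_budget_remaining are hypothetical.

```python
import re

RULE = re.compile(
    r"^(?P<metric>[a-z0-9_.]+)(?P<op>[><])(?P<threshold>\d+(?:\.\d+)?)"
    r" for (?P<n>\d+)(?P<unit>[smhd])$"
)
SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_rule(rule):
    """Split 'metric>threshold for 10m' into its parts (duration in seconds)."""
    m = RULE.match(rule)
    if m is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    duration_s = int(m["n"]) * SECONDS[m["unit"]]
    return m["metric"], m["op"], float(m["threshold"]), duration_s


def is_firing(rule, samples, now):
    """samples: (unix_ts, value) pairs for the rule's metric.
    Fires when the trailing window is non-empty and every sample breaches."""
    _, op, threshold, duration_s = parse_rule(rule)
    recent = [v for ts, v in samples if now - duration_s <= ts <= now]
    if not recent:
        return False
    if op == ">":
        return all(v > threshold for v in recent)
    return all(v < threshold for v in recent)


def error_budget_remaining(target, good, total):
    """Fraction of the error budget left; negative once overspent."""
    if not 0 < target < 1:
        raise ValueError("target must be in (0, 1)")
    allowed_bad = (1 - target) * total
    return 1 - (total - good) / allowed_bad


breach = [(t, 25000.0) for t in range(0, 601, 60)]
print(is_firing("latency_ms.p99>20000 for 10m", breach, now=600))
print(error_budget_remaining(0.999, good=99_950, total=100_000))
```

A policy such as freeze_releases would typically trigger when the remaining fraction reaches zero; the exact trigger point is a deployment decision this chapter leaves open.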

VII. Observability & Health

Health score: weighted composite over perf/quality/resources into health_score∈[0,1].
Capacity trends: 30/90-day capacity & cost trends for scaling/budgeting decisions.
Blackbox/whitebox: blackbox probes cover end-to-end paths; whitebox exposes internal stage metrics and thread/queue depths.
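
One way to form the weighted composite is a normalized weighted mean of per-domain scores. The sketch below assumes each domain score has already been scaled to [0,1]; the weights and the function name health_score are hypothetical, since this chapter does not prescribe them.

```python
def health_score(components, weights):
    """Weighted mean of per-domain scores, clamped so the result stays in [0, 1]."""
    if set(components) != set(weights):
        raise ValueError("components and weights must share the same keys")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        weights[name] * min(1.0, max(0.0, components[name]))
        for name in components
    )
    return weighted / total_weight


# Hypothetical domain scores and weights.
score = health_score(
    components={"perf": 0.90, "quality": 0.99, "resources": 0.70},
    weights={"perf": 0.4, "quality": 0.4, "resources": 0.2},
)
print(round(score, 3))
```

Dividing by the weight total means the weights need not sum to 1, and clamping each input guards against an upstream score drifting outside its range.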

VIII. Metrology & Units (SI)

Mandatory: metrology:{units:"SI", check_dim:true}.
Perf/resources: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—), net_mbps, size_bytes.
Path quantities: if monitoring/logging involves T_arr, register delta_form, path="gamma(ell)", and measure="d ell"; use one of the following equivalent forms and pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
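
The two forms agree whenever c_ref is constant along gamma(ell), because a constant factor can be moved across the integral. A small numerical sketch using the trapezoid rule on a sampled path makes this concrete; the sample values for ell, n_eff, and c_ref are hypothetical and chosen only for easy arithmetic.

```python
import math


def trapezoid(ys, xs):
    """Trapezoid-rule approximation of the integral of y over x."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
        for i in range(len(xs) - 1)
    )


# Hypothetical samples along the path gamma(ell).
ell = [0.0, 1.0, 2.0, 3.0]
n_eff = [1.0, 1.2, 1.1, 1.0]
c_ref = 2.0  # assumed constant along the path

# T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
form_a = (1 / c_ref) * trapezoid(n_eff, ell)
# T_arr = ( ∫ ( n_eff / c_ref ) d ell )
form_b = trapezoid([n / c_ref for n in n_eff], ell)

print(form_a, form_b, math.isclose(form_a, form_b))
```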

IX. Machine-Readable Fragment (Drop-in)

monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
  logs:
    level: "info"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://eift/logs/", "kafka://obs/logs"]
    pii_redaction: true
  traces: {enabled: true, sampler: "probabilistic", ratio: 0.1, propagator: "w3c"}
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
    error_budget_policy: "freeze_releases"

X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: MON.METRICS_UNIT_SI
    when: "$.monitoring.metrics..unit"
    assert: "all_units_in_SI(value)"
    level: error
  - id: LOG.STRUCTURED_JSONL
    when: "$.monitoring.logs.format"
    assert: "value == 'jsonl'"
    level: error
  - id: TRACE.SAMPLER_VALID
    when: "$.monitoring.traces"
    assert: "value.enabled == false or value.sampler in ['parent','probabilistic']"
    level: error
  - id: ALERT.SYNTAX_VALID
    when: "$.monitoring.alert_rules[*].rule"
    assert: "matches('^[a-z0-9_\\.]+[><=].+ for \\d+[smhd]$')"
    level: error
  - id: SLO.OBJECTIVES_DEFINED
    when: "$.monitoring.slo.objectives"
    assert: "len(value) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
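
Several of these rules can be checked with a few lines of code once the configuration has been loaded into a dictionary. The sketch below is an assumption-laden illustration, not the normative linter: it treats a missing key as a failure (stricter than the `when` selectors imply), omits MON.METRICS_UNIT_SI because all_units_in_SI is not defined in this chapter, and uses the hypothetical function name lint.

```python
import re

# Same pattern as ALERT.SYNTAX_VALID, written as a Python raw string.
ALERT_RULE = re.compile(r"^[a-z0-9_\.]+[><=].+ for \d+[smhd]$")


def lint(config):
    """Return the ids of the rules that the given config violates."""
    mon = config.get("monitoring", {})
    failures = []

    if mon.get("logs", {}).get("format") != "jsonl":
        failures.append("LOG.STRUCTURED_JSONL")

    traces = mon.get("traces", {})
    sampler_ok = traces.get("sampler") in ("parent", "probabilistic")
    if not (traces.get("enabled") is False or sampler_ok):
        failures.append("TRACE.SAMPLER_VALID")

    rules = [r.get("rule", "") for r in mon.get("alert_rules", [])]
    if any(ALERT_RULE.match(rule) is None for rule in rules):
        failures.append("ALERT.SYNTAX_VALID")

    if len(mon.get("slo", {}).get("objectives", [])) < 1:
        failures.append("SLO.OBJECTIVES_DEFINED")

    metrology = config.get("metrology", {})
    if not (metrology.get("units") == "SI" and metrology.get("check_dim") is True):
        failures.append("METROLOGY.SI_AND_CHECKDIM")

    return failures


good = {
    "monitoring": {
        "logs": {"format": "jsonl"},
        "traces": {"enabled": True, "sampler": "probabilistic", "ratio": 0.1},
        "alert_rules": [{"rule": "latency_ms.p99>20000 for 10m"}],
        "slo": {"objectives": [{"name": "latency_p99", "target_ms": 20000}]},
    },
    "metrology": {"units": "SI", "check_dim": True},
}
print(lint(good))
```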

XI. Export Manifest & Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "monitoring/dashboards.json", sha256: "..."}
    - {path: "monitoring/alert_rules.yaml", sha256: "..."}
    - {path: "monitoring/slo_objectives.yaml", sha256: "..."}
    - {path: "logs/index.manifest.json", sha256: "..."}
    - {path: "traces/config.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
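
The sha256 entries can be produced by hashing each artifact file as it is exported. The sketch below shows one way to fill the artifacts list; the helper names sha256_file and build_manifest are hypothetical, and the demo writes a throwaway file in a temporary directory rather than touching real exports.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 16):
    """Stream the file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, rel_paths, version="v1.0"):
    """Assemble the export_manifest structure for the given relative paths."""
    root = Path(root)
    artifacts = [
        {"path": rel, "sha256": sha256_file(root / rel)} for rel in rel_paths
    ]
    return {"export_manifest": {"version": version, "artifacts": artifacts}}


# Demo with a placeholder artifact in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "monitoring" / "alert_rules.yaml"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"abc")
    manifest = build_manifest(tmp, ["monitoring/alert_rules.yaml"])

print(manifest)
```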

XII. Chapter Compliance Checklist

Metrics, logs, and tracing configurations complete; SI units with check_dim; consistent agg/window.
Dashboards cover system/quality/cost core views; alert rules valid with severity & channels.
SLO targets and error-budget policy defined; violations freeze releases/throttle as needed.
Log redaction enabled; tracing sampler/propagator explicit; RCA links to traces/logs available.
Export manifest lists dashboards/alerts/SLO/log index/trace config with sha256; citation anchors complete to meet release gates.