Chapter 12 Monitoring, Logging & Observability


I. Chapter Purpose & Scope

Fix pipeline monitoring, logging, and observability specifications: metrics & dimensions, logs & tracing, dashboards & alerts, SLA/SLO & error budgeting, runtime health & capacity trends, audit & exports; ensure alignment with data contracts, DQ gates, orchestration, and the Metrology chapter.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p50", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p95", unit: "ms", agg: "quant", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
      - {name: "utilization_rho", unit: "ratio", agg: "mean", window: "5m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
      - {name: "drift.psi", unit: "—", agg: "mean", window: "15m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
      - {name: "net_mbps", unit: "Mbps", agg: "mean", window: "1m"}
      - {name: "disk_io_mbps", unit: "Mbps", agg: "mean", window: "1m"}
  logs:
    level: "info|warn|error"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://.../logs/", "kafka://.../topic"]
    pii_redaction: true
  traces:
    enabled: true
    sampler: "parent|probabilistic"
    ratio: 0.05
    propagator: "w3c|b3"
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
    dq: ["grafana:/boards/dq_quality"]
    cost: ["grafana:/boards/costs"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
    - {name: "dq_drop", rule: "dq.pass_rate<0.98 for 15m", severity: "medium", channel: "slack"}
    - {name: "drift_alert", rule: "drift.psi>0.2 for 30m", severity: "low", channel: "email"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
      - {name: "availability", target: 0.999, window: "30d"}
      - {name: "dq_pass_rate", target: 0.99, window: "30d"}
    error_budget_policy: "freeze_releases|throttle|page_on_call"
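
As a worked illustration of the slo objectives above, the following sketch converts an availability target into an error budget over the objective window and reports how much of it is left. The function names and the sample downtime figure are assumptions made for illustration; this chapter does not prescribe how the budget is computed.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time in the window for an availability-style target."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return window * (1.0 - target)


def budget_remaining(target: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(target, window)
    if budget == timedelta(0):
        # A 100% target leaves no budget at all.
        return 0.0 if downtime == timedelta(0) else float("-inf")
    return 1.0 - downtime / budget


window = timedelta(days=30)
budget = error_budget(0.999, window)  # availability objective: target 0.999, window 30d
remaining = budget_remaining(0.999, window, timedelta(minutes=10))  # hypothetical downtime
print(f"budget: {budget.total_seconds() / 60:.1f} min, remaining: {remaining:.1%}")
```

For the 0.999 target over 30 days the budget is about 43.2 minutes. A policy such as freeze_releases would plausibly trigger once the remaining fraction reaches zero, though the exact threshold is left to the implementer.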


IV. Metric System & Postures


V. Logging & Tracing


VI. Dashboards & Alerts


VII. Observability & Health


VIII. Metrology & Units (SI)

  1. Mandatory: metrology:{units:"SI", check_dim:true}.
  2. Perf/resources: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—), net_mbps, size_bytes.
  3. Path quantities: if monitoring or logging involves T_arr, register delta_form, path="gamma(ell)", and measure="d ell"; then use one of the equivalent forms below and ensure it passes check_dim:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
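
The two forms of T_arr above agree because c_ref is constant along the path and can be moved in or out of the integral. The sketch below checks this numerically with a simple trapezoidal rule. The n_eff profile, the path length, and the numeric value of c_ref are all assumptions chosen for illustration; none of them come from this chapter.

```python
import math


def trapezoid(ys, xs):
    """Composite trapezoidal rule over sample points xs with values ys."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0 for i in range(len(xs) - 1))


C_REF = 299_792_458.0  # m/s; placeholder value, assumed for illustration only

# Hypothetical path gamma(ell): 1000 m sampled every metre, with an assumed n_eff profile.
ell = [float(i) for i in range(1001)]
n_eff = [1.0 + 0.1 * math.sin(x / 200.0) for x in ell]

# Form 1: T_arr = (1 / c_ref) * (integral of n_eff d ell)
t_arr_factored = (1.0 / C_REF) * trapezoid(n_eff, ell)
# Form 2: T_arr = integral of (n_eff / c_ref) d ell
t_arr_inside = trapezoid([n / C_REF for n in n_eff], ell)

# Dimension check by hand: n_eff is dimensionless, d ell is in m, c_ref is in m/s,
# so both forms yield seconds.
assert math.isclose(t_arr_factored, t_arr_inside, rel_tol=1e-9)
print(f"T_arr = {t_arr_factored:.6e} s")
```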

IX. Machine-Readable Fragment (Drop-in)

monitoring:
  metrics:
    perf:
      - {name: "qps", unit: "1/s", agg: "sum", window: "1m"}
      - {name: "latency_ms.p99", unit: "ms", agg: "quant", window: "1m"}
    quality:
      - {name: "dq.pass_rate", unit: "ratio", agg: "mean", window: "5m"}
    resources:
      - {name: "cpu", unit: "cores", agg: "mean", window: "1m"}
      - {name: "mem_gb", unit: "GiB", agg: "mean", window: "1m"}
  logs:
    level: "info"
    format: "jsonl"
    retention: "P30D"
    sinks: ["s3://eift/logs/", "kafka://obs/logs"]
    pii_redaction: true
  traces: {enabled: true, sampler: "probabilistic", ratio: 0.1, propagator: "w3c"}
  dashboards:
    system: ["grafana:/boards/pipeline_overview"]
  alert_rules:
    - {name: "p99_latency_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high", channel: "pagerduty"}
  slo:
    objectives:
      - {name: "latency_p99", target_ms: 20000, window: "30d"}
    error_budget_policy: "freeze_releases"


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: MON.METRICS_UNIT_SI
    when: "$.monitoring.metrics..unit"
    assert: "all_units_in_SI(value)"
    level: error
  - id: LOG.STRUCTURED_JSONL
    when: "$.monitoring.logs.format"
    assert: "value == 'jsonl'"
    level: error
  - id: TRACE.SAMPLER_VALID
    when: "$.monitoring.traces"
    assert: "value.enabled == false or value.sampler in ['parent','probabilistic']"
    level: error
  - id: ALERT.SYNTAX_VALID
    when: "$.monitoring.alert_rules[*].rule"
    assert: "matches('^[a-z0-9_\\.]+[><=].+ for \\d+[smhd]$')"
    level: error
  - id: SLO.OBJECTIVES_DEFINED
    when: "$.monitoring.slo.objectives"
    assert: "len(value) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
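
To make the intent of these rules concrete, here is a minimal sketch that applies five of them to a configuration already parsed into a Python dict. It is not the reference linter: the lint function is hypothetical, and MON.METRICS_UNIT_SI is omitted because all_units_in_SI is not defined in this chapter. As a simplification, a missing field is treated as a violation rather than skipping the rule when its when path is absent.

```python
import re

# Pattern copied from ALERT.SYNTAX_VALID.
ALERT_RULE_RE = re.compile(r"^[a-z0-9_\.]+[><=].+ for \d+[smhd]$")


def lint(config: dict) -> list[str]:
    """Return the ids of violated rules; an empty list means the config passes."""
    violations = []
    mon = config.get("monitoring", {})
    if mon.get("logs", {}).get("format") != "jsonl":
        violations.append("LOG.STRUCTURED_JSONL")
    traces = mon.get("traces", {})
    if not (traces.get("enabled") is False or traces.get("sampler") in ("parent", "probabilistic")):
        violations.append("TRACE.SAMPLER_VALID")
    if not all(ALERT_RULE_RE.match(r.get("rule", "")) for r in mon.get("alert_rules", [])):
        violations.append("ALERT.SYNTAX_VALID")
    if len(mon.get("slo", {}).get("objectives", [])) < 1:
        violations.append("SLO.OBJECTIVES_DEFINED")
    met = config.get("metrology", {})
    if not (met.get("units") == "SI" and met.get("check_dim") is True):
        violations.append("METROLOGY.SI_AND_CHECKDIM")
    return violations


# Values mirror the drop-in fragment of this chapter, trimmed to the linted fields.
config = {
    "metrology": {"units": "SI", "check_dim": True},
    "monitoring": {
        "logs": {"format": "jsonl"},
        "traces": {"enabled": True, "sampler": "probabilistic", "ratio": 0.1},
        "alert_rules": [{"name": "p99_latency_breach", "rule": "latency_ms.p99>20000 for 10m"}],
        "slo": {"objectives": [{"name": "latency_p99", "target_ms": 20000, "window": "30d"}]},
    },
}
print(lint(config))  # prints []
```

Returning rule ids instead of raising on the first failure lets a caller report every violation in one pass, which suits rules that all carry level: error.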


XI. Export Manifest & Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "monitoring/dashboards.json", sha256: "..."}
    - {path: "monitoring/alert_rules.yaml", sha256: "..."}
    - {path: "monitoring/slo_objectives.yaml", sha256: "..."}
    - {path: "logs/index.manifest.json", sha256: "..."}
    - {path: "traces/config.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
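
The sha256 fields in the manifest are left as placeholders above. One way they might be populated is sketched below, using only the Python standard library. The helper names sha256_of and build_manifest are hypothetical, and the demo file is a stand-in written to a temporary directory, not a real artifact.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, rel_paths: list[str], version: str = "v1.0") -> dict:
    """Assemble an export_manifest-shaped dict for files under root."""
    return {
        "export_manifest": {
            "version": version,
            "artifacts": [{"path": p, "sha256": sha256_of(root / p)} for p in rel_paths],
        }
    }


# Demo against a throwaway directory; the file content is a stand-in.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "monitoring").mkdir()
    (root / "monitoring" / "alert_rules.yaml").write_bytes(b"alert_rules: []\n")
    manifest = build_manifest(root, ["monitoring/alert_rules.yaml"])
    print(json.dumps(manifest, indent=2))
```

An auditor can recompute each digest the same way and compare it with the recorded value to detect artifacts that changed after export.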


XII. Chapter Compliance Checklist