Chapter 6 Data Validation & Quality Gates
I. Chapter Purpose & Scope
Fix DQ gates and data validation specifications in pipelines: rule types, sampling & significance, blocking vs. warning levels, exception handling, auditing & exports; ensure alignment with Σ_in/Σ_out contracts, splits/coverage, metrology, and citation anchors.
II. Terminology & Dependencies
- Terms: dq_rules, pass_rate, shadow, quarantine, significance.alpha, blocking/warning, leakage_guard, drift.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); splits/quality (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0).
- Math & symbols: wrap inline symbols (α, QPS, T_inf, ρ, u_c) in backticks; any division/integral/composite operator must use parentheses; if path quantities T_arr appear, register gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
stage:
  name: "schema.check|dq.scan|leakage.audit"
  type: "validate.schema|validate.dq|validate.leakage"
  impl: "I16-2.schema_check|I16-7.dq_scan|I16-8.leakage_audit"
  inputs: ["<upstream_artifact>"]
  outputs: ["<clean_rows>|<dq_report>|<leakage_report>"]
  schema_ref: "contracts/<name>@vX.Y"
  dq:
    sample: {rows: 50000, strategy: "head|random|stratified"}
    significance: {alpha: 0.05}
    gates:
      - {id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}
      - {id: "DQ_002", kind: "unique", cols: [["id", "ts"]], level: "block"}
      - {id: "DQ_003", kind: "range", col: "value", rule: "[0,1e6]", unit: "<SI>", level: "block"}
      - {id: "DQ_004", kind: "enum", col: "status", values: ["ok", "warn", "err"], level: "block"}
      - {id: "DQ_005", kind: "distribution", col: "latency_ms", rule: "p99<=200", level: "warn"}
      - {id: "DQ_006", kind: "freshness", col: "updated_at", max_lag: "PT30M", level: "warn"}
      - {id: "DQ_007", kind: "drift", col: "feature_*", metric: "psi<=0.2", level: "warn"}
      - {id: "DQ_008", kind: "leakage", policy: ["per-object", "per-timewindow"], level: "block"}
  on_fail: "quarantine|skip|block"
  retries: {max: 2, backoff: "expo"}
  timeout_s: 1800
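To make the row-level gate kinds concrete, the following is a minimal Python sketch of how not_null, unique, range, and enum gates could be evaluated over rows held as dictionaries. The function names (parse_interval, in_interval, failing_rows, pass_rate) are hypothetical and not part of the specification. Two assumptions are made: each inner list in a unique gate's cols is treated as one composite key, and null values are skipped by range gates because not_null covers them.

```python
from typing import Any


def parse_interval(rule: str) -> tuple[float, float, bool, bool]:
    """Parse "[0,1e6]" or "(0,1]" into (lo, hi, lo_closed, hi_closed)."""
    rule = rule.strip()
    lo_closed, hi_closed = rule[0] == "[", rule[-1] == "]"
    lo, hi = (float(x) for x in rule[1:-1].split(","))
    return lo, hi, lo_closed, hi_closed


def in_interval(x: float, rule: str) -> bool:
    """Check membership while honoring the explicit interval closure."""
    lo, hi, lo_closed, hi_closed = parse_interval(rule)
    above = x >= lo if lo_closed else x > lo
    below = x <= hi if hi_closed else x < hi
    return above and below


def failing_rows(gate: dict[str, Any], rows: list[dict[str, Any]]) -> list[int]:
    """Return the indices of rows that violate one gate."""
    kind = gate["kind"]
    if kind == "not_null":
        return [i for i, r in enumerate(rows)
                if any(r.get(c) is None for c in gate["cols"])]
    if kind == "unique":
        bad: set[int] = set()
        for key_cols in gate["cols"]:  # assumption: one composite key per entry
            seen: set[tuple] = set()
            for i, r in enumerate(rows):
                key = tuple(r.get(c) for c in key_cols)
                if key in seen:
                    bad.add(i)  # second and later occurrences are flagged
                seen.add(key)
        return sorted(bad)
    if kind == "range":
        col = gate["col"]
        return [i for i, r in enumerate(rows)
                if r.get(col) is not None and not in_interval(r[col], gate["rule"])]
    if kind == "enum":
        allowed = set(gate["values"])
        return [i for i, r in enumerate(rows) if r.get(gate["col"]) not in allowed]
    raise ValueError(f"gate kind not covered by this sketch: {kind}")


def pass_rate(gate: dict[str, Any], rows: list[dict[str, Any]]) -> float:
    """Fraction of rows that satisfy the gate (1.0 for an empty input)."""
    if not rows:
        return 1.0
    return 1.0 - len(failing_rows(gate, rows)) / len(rows)
```

Returning row indices rather than a boolean keeps enough detail to populate quarantine records with mismatch reasons.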
IV. Rule Types & Decision Posture
- Integrity: not_null, unique, and primary-key consistency.
- Value & units: range (explicit interval closure), unit (SI check aligned with constraints.units); normalize units before composing derived metrics.
- Enums & semantics: stable enumerations with admission policy for unseen values (unknown|reject|map-to-other).
- Freshness & coverage: freshness.max_lag, sampling coverage and minimum sample counts.
- Distributional consistency: distribution (quantiles/p99/KS/AD); pair with significance level α and report p-values & intervals.
- Data drift: drift.psi/kl/ks; default thresholds are psi<=0.2 (violation warns) and psi<=0.3 (violation blocks); both can be overridden.
- Leakage audit: leakage.policy (per-object|per-timewindow|per-scene); any cross-split overlap is blocking.
- Contract consistency: schema_ref fields/types/units/key constraints aligned with Σ_in/Σ_out.
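The drift rule above relies on the population stability index (PSI). As an illustration only, the sketch below computes PSI with equal-width bins taken from the reference sample and maps the result to a severity using the default thresholds. The binning scheme, the eps floor for empty bins, and the names psi and drift_level are assumptions of this sketch; the specification does not prescribe them.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index: sum((a_i - e_i) * ln(a_i / e_i)) over bins."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def drift_level(value: float, warn_at: float = 0.2, block_at: float = 0.3) -> str:
    """Map a PSI value to 'pass', 'warn', or 'block' using overridable thresholds."""
    if value > block_at:
        return "block"
    if value > warn_at:
        return "warn"
    return "pass"
```

A value exactly at a threshold passes that threshold, matching the `<=` form of the rule strings.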
V. Sampling, Significance & Severity
- Sampling: sample.rows and strategy:"head|random|stratified"; for stratified sampling, declare strata keys & quotas.
- Significance: statistical tests at default α=0.05; report p-values and effect sizes; blocking requires dual conditions (threshold violation and p<α).
- Severity levels: level:"block|warn"; block triggers on_fail and quarantine exports; warn logs and alerts only.
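The sampling and dual-condition rules above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: stratified_sample and decide are hypothetical names, quotas are given as absolute row counts per stratum, and a block-level gate whose threshold is violated without statistical significance is downgraded to a warning (the text only states that blocking needs both conditions).

```python
import random
from collections import defaultdict
from typing import Any


def stratified_sample(rows: list[dict[str, Any]], key: str,
                      quotas: dict[Any, int], seed: int = 0) -> list[dict[str, Any]]:
    """Draw up to quotas[stratum] rows from each declared stratum."""
    rng = random.Random(seed)  # fixed seed keeps audits reproducible
    by_stratum: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        by_stratum[row[key]].append(row)
    sample: list[dict[str, Any]] = []
    for stratum, quota in quotas.items():
        pool = by_stratum.get(stratum, [])
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


def decide(level: str, threshold_violated: bool,
           p_value: float, alpha: float = 0.05) -> str:
    """Return 'pass', 'warn', or 'block' for one statistical gate."""
    if not threshold_violated:
        return "pass"
    if level == "block" and p_value < alpha:
        return "block"  # both conditions hold: violation and p < alpha
    return "warn"       # assumption: anything short of that only warns
```

For example, `decide("block", True, 0.2)` yields "warn" because the violation is not significant at the default alpha of 0.05.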
VI. Exception Handling & Audit Exports
- Handling: on_fail:"quarantine|skip|block"; quarantine artifacts record paths, hashes, and mismatch reasons.
- Audit: produce dq/report.jsonl (per-rule records), dq/summary.csv (rollup), dq/leakage_report.csv; register sha256 in export_manifest.artifacts[].
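As a sketch of the audit export step, the snippet below writes per-rule records to dq/report.jsonl and builds the matching entry for export_manifest.artifacts[]. The helper names (sha256_of, write_report, manifest_entry) and the record fields are hypothetical; only the file path and the sha256 registration come from this chapter.

```python
import hashlib
import json
from pathlib import Path
from typing import Any


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large reports do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_report(root: Path, records: list[dict[str, Any]]) -> Path:
    """Write one JSON object per line to <root>/dq/report.jsonl."""
    out = Path(root) / "dq" / "report.jsonl"
    out.parent.mkdir(parents=True, exist_ok=True)
    with open(out, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
    return out


def manifest_entry(root: Path, path: Path) -> dict[str, str]:
    """Build an artifacts[] entry with a path relative to the export root."""
    relative = Path(path).relative_to(root).as_posix()
    return {"path": relative, "sha256": sha256_of(Path(path))}
```

Sorting keys when serializing keeps the report byte-stable across runs, so the registered hash only changes when the content does.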
VII. Metrology & Units (SI)
- Perf/time metrics: QPS (1/s), T_inf (ms with {p50,p95,p99}), ρ (unitless); bandwidth net_mbps, volume size_bytes.
- metrology:{units:"SI", check_dim:true} is mandatory; range/unit/distribution rules must pass SI checks.
- For path quantities (e.g., T_arr), register in the rule or stage config: delta_form, path="gamma(ell)", measure="d ell", and validate via one of:
  - T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
  - T_arr = ( ∫ ( n_eff / c_ref ) d ell )
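The two validation forms agree whenever c_ref is constant along the path gamma(ell). The sketch below checks this numerically with a trapezoidal rule over sampled values of n_eff. The function names, the sample values, and the choice of quadrature are assumptions made for illustration; the specification does not mandate a numerical method.

```python
def trapezoid(ys: list[float], xs: list[float]) -> float:
    """Trapezoidal approximation of the integral of y over x."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


def t_arr_factored(n_eff: list[float], ell: list[float], c_ref: float) -> float:
    """T_arr = ( 1 / c_ref ) * ( integral of n_eff d ell )."""
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_inline(n_eff: list[float], ell: list[float], c_ref: float) -> float:
    """T_arr = ( integral of ( n_eff / c_ref ) d ell )."""
    return trapezoid([n / c_ref for n in n_eff], ell)
```

A validator could compute both forms and require them to match within a relative tolerance before accepting the registered path quantity.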
VIII. Machine-Readable Fragment (Drop-in)
layers:
  - name: "validate"
    stages:
      - name: "dq.scan"
        type: "validate.dq"
        impl: "I16-7.dq_scan"
        inputs: ["clean_rows"]
        outputs: ["dq_report"]
        schema_ref: "contracts/clean_rows@v1.3"
        dq:
          sample: {rows: 100000, strategy: "stratified"}
          significance: {alpha: 0.05}
          gates:
            - {id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}
            - {id: "DQ_003", kind: "range", col: "power_w", rule: "[0,2e3]", unit: "W", level: "block"}
            - {id: "DQ_005", kind: "distribution", col: "latency_ms", rule: "p99<=150", level: "warn"}
            - {id: "DQ_007", kind: "drift", col: "feature_*", metric: "psi<=0.2", level: "warn"}
            - {id: "DQ_008", kind: "leakage", policy: ["per-object", "per-timewindow"], level: "block"}
        on_fail: "quarantine"
        retries: {max: 2, backoff: "expo"}
        timeout_s: 1800
IX. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: DQ.SCHEMA_REF_REQUIRED
    when: "$.layers[*].stages[?(@.type=='validate.dq')]"
    assert: "has_key('schema_ref')"
    level: error
  - id: DQ.SAMPLE_DEFINED
    when: "$.layers[*].stages[?(@.type=='validate.dq')].dq.sample"
    assert: "value.rows > 0 and value.strategy in ['head','random','stratified']"
    level: error
  - id: DQ.LEVEL_ALLOWED
    when: "$.layers[*].stages[*].dq.gates[*].level"
    assert: "value in ['block','warn']"
    level: error
  - id: DQ.RANGE_UNIT_SI
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='range')]"
    assert: "is_SI_unit($.unit)"
    level: error
  - id: DQ.DRIFT_THRESHOLDS
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='drift')]"
    assert: "psi_threshold_ok($.metric)"
    level: warn
  - id: DQ.LEAKAGE_POLICY
    when: "$.layers[*].stages[*].dq.gates[?(@.kind=='leakage')]"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
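For readers without a JSONPath engine at hand, four of the lint rules above can be approximated with plain dictionary traversal, as in the sketch below. It assumes the pipeline config has already been loaded into nested Python dicts and lists, and it assumes contains_any applies to the gate's policy list. The function name lint and the tuple layout of findings are hypothetical. DQ.RANGE_UNIT_SI and DQ.DRIFT_THRESHOLDS are omitted because the helpers is_SI_unit and psi_threshold_ok are not defined in this chapter.

```python
from typing import Any

ALLOWED_STRATEGIES = {"head", "random", "stratified"}
ALLOWED_LEVELS = {"block", "warn"}
ALLOWED_POLICIES = {"per-object", "per-timewindow", "per-scene"}


def lint(config: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (rule_id, level, stage_name) for every violated lint rule."""
    findings: list[tuple[str, str, str]] = []
    for layer in config.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "<unnamed>")
            dq = stage.get("dq", {})
            if stage.get("type") == "validate.dq":
                if "schema_ref" not in stage:
                    findings.append(("DQ.SCHEMA_REF_REQUIRED", "error", name))
                # Like the JSONPath selector, this only fires when dq.sample exists.
                if "sample" in dq:
                    sample = dq["sample"]
                    ok = (sample.get("rows", 0) > 0
                          and sample.get("strategy") in ALLOWED_STRATEGIES)
                    if not ok:
                        findings.append(("DQ.SAMPLE_DEFINED", "error", name))
            for gate in dq.get("gates", []):
                if "level" in gate and gate["level"] not in ALLOWED_LEVELS:
                    findings.append(("DQ.LEVEL_ALLOWED", "error", name))
                if gate.get("kind") == "leakage":
                    if not ALLOWED_POLICIES & set(gate.get("policy", [])):
                        findings.append(("DQ.LEAKAGE_POLICY", "error", name))
    return findings
```

An empty result means the checked rules pass; a release gate could fail the build when any finding carries level "error".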
X. Export Manifest & Reports
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "dq/report.jsonl", sha256: "..."}
    - {path: "dq/summary.csv", sha256: "..."}
    - {path: "dq/leakage_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
XI. Chapter Compliance Checklist
- dq.sample/significance set; rules cover integrity/value/enum/freshness/distribution/drift/leakage.
- Severity & handling clear: block quarantines and stops; warn logs & alerts; audit artifacts with sha256 registered.
- schema_ref aligns with contracts; units in SI and check_dim=true; consistent units for range/distribution/perf metrics.
- Leakage guardrails effective; any cross-split overlap is blocking; path quantities (if any) registered & validated.
- export_manifest lists reports & citation anchors and meets release gates.