Chapter 5 Schema & Contract Management
I. Chapter Purpose & Scope
Fix the versioning, compatibility, evolution policy, and validation workflow of schemas and data contracts in pipelines; define reserved keys such as schema_ref/compat_mode/evolution_policy; standardize contract registration, shadow comparison, and release gates; ensure consistency with Dataset/Model Cards, the Metrology chapter, and citation anchors.
II. Terminology & Dependencies
- Terms: schema_ref (schema reference), contract (data contract), compat_mode (compatibility mode), evolution_policy (evolution strategy), shadow (shadow comparison), breaking (breaking change).
- Dependencies: data contracts & exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); splits/coverage/quality (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0).
- Math & symbols: wrap inline symbols with backticks; any expression with division, integral, or composite operators must use parentheses and, if path quantities are involved, declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
contract:
  schema_ref: "contracts/<name>@vX.Y"   # versioned schema reference (required)
  compat_mode: "forward|backward|both|break"
  evolution_policy:
    add_field: "optional-by-default|feature-flag"
    remove_field: "forbid|deprecate-then-remove"
    change_type: "coercible|forbid"
    change_sematic: "requires-shadow-and-signoff"
  constraints:
    primary_key: ["<col1>", "<col2?>"]
    partition_by: ["<pcol?>"]
    unique: [["<colA>", "<colB>"]]
    not_null: ["<colX>", "<colY>"]
    range:
      - {col: "<metric>", rule: "[lo,hi]"}
    enum:
      - {col: "<status>", values: ["A", "B", "C"]}
    units: {"<col>": "<SI-unit>"}   # aligned with Metrology
  validation:
    mode: "strict|lenient"
    sample: {rows: 10000, strategy: "head|random|stratified"}
    significance: {alpha: 0.05}
  shadow:
    enabled: true
    route: "percent:5"   # shadow ratio or selector
    compare_metrics: ["dq.pass_rate", "error_rate", "latency_ms.p95"]
  lineage_bind:
    produce: ["<artifact_path>"]
    consume: ["<upstream_schema_ref>"]
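As an illustration of how the not_null, range, and enum constraints above might be applied to a single row, here is a minimal Python sketch. The function names `parse_range` and `row_violations` are hypothetical, and reading "[lo,hi]" as a closed numeric interval is an assumption; the chapter does not prescribe an implementation.

```python
from typing import Any


def parse_range(rule: str) -> tuple[float, float]:
    """Parse an interval written as "[lo,hi]" (assumed closed) into numeric bounds."""
    lo, hi = rule.strip().strip("[]").split(",")
    return float(lo), float(hi)


def row_violations(row: dict[str, Any], constraints: dict[str, Any]) -> list[str]:
    """Return labels such as "not_null:ts" for every constraint the row violates."""
    found: list[str] = []
    # not_null: the column must be present and not None.
    for col in constraints.get("not_null", []):
        if row.get(col) is None:
            found.append(f"not_null:{col}")
    # range: non-null values must fall inside the declared interval.
    for rule in constraints.get("range", []):
        value = row.get(rule["col"])
        lo, hi = parse_range(rule["rule"])
        if value is not None and not (lo <= value <= hi):
            found.append(f"range:{rule['col']}")
    # enum: non-null values must be one of the listed values.
    for rule in constraints.get("enum", []):
        value = row.get(rule["col"])
        if value is not None and value not in rule["values"]:
            found.append(f"enum:{rule['col']}")
    return found
```

Null values are reported only by the not_null check, so a nullable column with a range rule does not produce two findings for the same missing value.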
IV. Contract Registration & Release Workflow
- Registration: record schema_ref in the schema registry with checksum and change summary; first release must include a minimal example and DQ baseline.
- Compatibility matrix:
- forward: downstream accepts upstream additive optional fields;
- backward: upstream can output a subset for older downstreams;
- both: bidirectional compatibility;
- break: breaking changes require shadow comparison and sign-off.
- Evolution policy: new fields default to optional; removals follow a two-step deprecate → remove process; type changes are allowed only when coercible, with a declared conversion rule.
- Release gates: schema validation = pass, DQ = pass, shadow diffs within thresholds, metrology.check_dim=true, citation anchors complete.
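The release gates listed above amount to a conjunction of five checks. A minimal Python sketch of that evaluation follows; the `GateInputs` type, its field names, and the gate labels are hypothetical and only mirror the wording of the bullet, not an API defined by this chapter.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    """Hypothetical container for the outcome of each release gate."""
    schema_valid: bool               # schema validation = pass
    dq_pass: bool                    # DQ = pass
    shadow_within_thresholds: bool   # shadow diffs within thresholds
    check_dim: bool                  # metrology.check_dim = true
    anchors_complete: bool           # citation anchors complete


def failed_gates(gates: GateInputs) -> list[str]:
    """Return the names of the gates that did not pass, in a fixed order."""
    checks = {
        "schema_validation": gates.schema_valid,
        "dq": gates.dq_pass,
        "shadow_diff": gates.shadow_within_thresholds,
        "metrology_check_dim": gates.check_dim,
        "citation_anchors": gates.anchors_complete,
    }
    return [name for name, ok in checks.items() if not ok]


def can_release(gates: GateInputs) -> bool:
    """A release is allowed only when every gate passes."""
    return not failed_gates(gates)
```

Returning the list of failed gates rather than a bare boolean keeps the reason for a blocked release visible in logs or reports.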
V. Schema Design Constraints
- Explicit units & dimensions: numeric columns must declare SI units under constraints.units; normalize units before any composition.
- Keys & indexing: primary_key must not include nullable columns; partition_by should match downstream bucketing; uniqueness must pair with dedupe keys.
- Time & timezone: timestamps in UTC (ISO 8601); windowing and lateness policy documented under validation.
- Enums & mapping: enums must include stable mapping and an admission policy for new values (unknown|reject|map-to-other).
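Two of the design constraints above lend themselves to mechanical checks: primary-key columns must be non-nullable, and timestamps must be UTC in ISO 8601. The sketch below shows one possible way to test both in Python; the helper names are hypothetical, and treating "listed under not_null" as the definition of non-nullable is an assumption.

```python
from datetime import datetime, timedelta


def pk_nullable_columns(constraints: dict) -> list[str]:
    """Return primary-key columns that are not also listed under not_null."""
    not_null = set(constraints.get("not_null", []))
    return [col for col in constraints.get("primary_key", []) if col not in not_null]


def is_utc_iso8601(ts: str) -> bool:
    """True when ts parses as ISO 8601 and carries a zero UTC offset."""
    # Older Python versions do not accept a trailing "Z", so normalize it first.
    normalized = ts[:-1] + "+00:00" if ts.endswith("Z") else ts
    try:
        parsed = datetime.fromisoformat(normalized)
    except ValueError:
        return False
    # Naive timestamps (no offset) are rejected because the zone is ambiguous.
    return parsed.tzinfo is not None and parsed.utcoffset() == timedelta(0)
```

An empty list from `pk_nullable_columns` means the key constraint holds; any returned column name identifies a key column that still needs a not_null declaration.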
VI. Shadow Comparison & Rollback
- Shadow: set shadow.enabled=true, select shadow traffic via shadow.route, and compare dq.pass_rate, error_rate, and key performance metrics (for example latency_ms.p95); any threshold breach triggers rollback.
- Rollback: when compat_mode!="break", prefer upstream rollback; for break, provide a compatibility layer or a dual-write window.
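The shadow and rollback rules above can be sketched as a comparison of metric deltas followed by a choice of rollback path. In the Python sketch below the function names, the use of an absolute-difference threshold per metric, and the returned action labels are all assumptions made for illustration; the chapter does not specify how thresholds are expressed.

```python
def shadow_breaches(
    baseline: dict[str, float],
    shadow: dict[str, float],
    max_abs_delta: dict[str, float],
) -> list[str]:
    """Return the metrics whose shadow value drifts past the allowed delta."""
    breaches = []
    for metric, limit in max_abs_delta.items():
        if abs(shadow[metric] - baseline[metric]) > limit:
            breaches.append(metric)
    return breaches


def rollback_action(compat_mode: str) -> str:
    """Pick the rollback path described in the Rollback bullet."""
    if compat_mode != "break":
        return "upstream_rollback"
    # Breaking changes cannot simply be reverted upstream.
    return "compat_layer_or_dual_write"


def decide(
    compat_mode: str,
    baseline: dict[str, float],
    shadow: dict[str, float],
    max_abs_delta: dict[str, float],
) -> str:
    """Return "proceed" when no metric breaches, otherwise the rollback path."""
    if shadow_breaches(baseline, shadow, max_abs_delta):
        return rollback_action(compat_mode)
    return "proceed"
```

A symmetric threshold is the simplest choice; a production check would likely treat an improvement in dq.pass_rate differently from a regression.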
VII. Machine-Readable (Normative Excerpt)
layers:
  - name: "validate"
    stages:
      - name: "schema.check"
        type: "validate.schema"
        impl: "I16-2.schema_check"
        inputs: ["raw_rows"]
        outputs: ["clean_rows"]
        contract:
          schema_ref: "contracts/raw_rows@v1.2"
          compat_mode: "both"
          evolution_policy:
            add_field: "optional-by-default"
            remove_field: "deprecate-then-remove"
            change_type: "coercible"
            change_sematic: "requires-shadow-and-signoff"
          constraints:
            primary_key: ["id"]
            not_null: ["id", "ts"]
            enum: [{col: "status", values: ["ok", "warn", "err"]}]
            units: {"lat": "deg", "lon": "deg", "power_w": "W"}
          validation:
            mode: "strict"
            sample: {rows: 50000, strategy: "stratified"}
            significance: {alpha: 0.05}
          shadow:
            enabled: true
            route: "percent:5"
            compare_metrics: ["dq.pass_rate", "error_rate", "latency_ms.p95"]
          lineage_bind:
            produce: ["lake/clean/2025/09/"]
            consume: ["contracts/raw_json@v1.2"]
VIII. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: SCHEMA.REF_FORMAT
    when: "$..schema_ref"
    assert: "matches('^contracts/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
    level: error
  - id: SCHEMA.COMPAT_ALLOWED
    when: "$..compat_mode"
    assert: "value in ['forward','backward','both','break']"
    level: error
  - id: SCHEMA.UNITS_DECLARED
    when: "$..constraints.units"
    assert: "all_units_in_SI(value)"
    level: error
  - id: SCHEMA.PK_NOT_NULL
    when: "$..constraints"
    assert: "primary_key != null and all_not_null(primary_key, not_null)"
    level: error
  - id: SCHEMA.SHADOW_REQUIRED_ON_BREAK
    when: "$..compat_mode"
    assert: "value != 'break' or $.shadow.enabled == true"
    level: error
  - id: SCHEMA.METROLOGY_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
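To make three of the rules above concrete, the following Python sketch re-expresses SCHEMA.REF_FORMAT, SCHEMA.COMPAT_ALLOWED, and SCHEMA.SHADOW_REQUIRED_ON_BREAK as plain checks. The regular expression and the allowed values are taken from the rules; the function `lint_contract` and its flat argument list are hypothetical and stand in for whatever rule engine actually evaluates the `when`/`assert` expressions.

```python
import re

# Same pattern as SCHEMA.REF_FORMAT, written as a Python raw string.
SCHEMA_REF_PATTERN = re.compile(r"^contracts/[a-z0-9_\-]+@v\d+\.\d+$")
ALLOWED_COMPAT_MODES = {"forward", "backward", "both", "break"}


def lint_contract(schema_ref: str, compat_mode: str, shadow_enabled: bool) -> list[str]:
    """Return the ids of the lint rules that the given values violate."""
    violations = []
    if not SCHEMA_REF_PATTERN.fullmatch(schema_ref):
        violations.append("SCHEMA.REF_FORMAT")
    if compat_mode not in ALLOWED_COMPAT_MODES:
        violations.append("SCHEMA.COMPAT_ALLOWED")
    # Equivalent to: value != 'break' or shadow.enabled == true
    if compat_mode == "break" and not shadow_enabled:
        violations.append("SCHEMA.SHADOW_REQUIRED_ON_BREAK")
    return violations
```

For the excerpt in section VII, `lint_contract("contracts/raw_rows@v1.2", "both", True)` reports no violations.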
IX. Contract Evolution & Notices
- Versioning: strict @vX.Y; minor (Y) for backward-compatible additions; major (X) for breaking changes.
- Notices: for break or semantic changes (change_sematic), add a notice anchor in export_manifest.references[] and update downstream subscriptions.
- Grace period: define grace_period and dual-write policy; during the period demote lint to warn, then restore to error after expiry.
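The versioning and grace-period rules above can be sketched in a few lines of Python. The helper names are hypothetical; resetting the minor number to 0 on a major bump and treating the last day of the grace period as still inside it are assumptions that the chapter does not state.

```python
import re
from datetime import date

# Captures the contract name, major (X), and minor (Y) parts of a schema_ref.
_REF = re.compile(r"(contracts/[a-z0-9_\-]+)@v(\d+)\.(\d+)")


def next_schema_ref(schema_ref: str, breaking: bool) -> str:
    """Bump major (X) for breaking changes, minor (Y) for compatible additions."""
    match = _REF.fullmatch(schema_ref)
    if match is None:
        raise ValueError(f"not a valid schema_ref: {schema_ref!r}")
    name, major, minor = match.group(1), int(match.group(2)), int(match.group(3))
    if breaking:
        return f"{name}@v{major + 1}.0"  # assumption: minor resets to 0
    return f"{name}@v{major}.{minor + 1}"


def lint_level(today: date, grace_period_end: date) -> str:
    """Lint is demoted to "warn" during the grace period and is "error" afterwards."""
    return "warn" if today <= grace_period_end else "error"
```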
X. Export Manifest & Audit Trail
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "contracts/raw_rows.schema.json", sha256: "..."}
    - {path: "contracts/changelog.md", sha256: "..."}
    - {path: "validate/dq.report.jsonl", sha256: "..."}
    - {path: "validate/shadow.diff.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.12"
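Each artifact entry in the manifest pairs a path with a sha256 digest. One way such entries might be produced is sketched below using Python's standard `hashlib`; the helper names `sha256_of` and `manifest_entry`, and the idea of resolving paths against a root directory, are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large reports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_entry(root: Path, rel_path: str) -> dict[str, str]:
    """Build one artifacts[] entry: the relative path plus its hex digest."""
    return {"path": rel_path, "sha256": sha256_of(root / rel_path)}
```

An auditor can recompute the digest for any listed artifact and compare it with the recorded value to confirm the file has not changed since export.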
XI. Chapter Compliance Checklist
- schema_ref matches the regex and resolves; compat_mode and evolution_policy are explicit.
- Constraints (keys/unique/not-null/range/enum/units) complete; units in SI and check_dim=true.
- Shadow comparison enabled and within thresholds; breaking changes have a compat layer or dual-write window.
- Contract changes carry a notice anchor in export_manifest.references[]; change artifacts and DQ/shadow reports carry sha256.
- Downstream schema compatibility and split/coverage alignment confirmed via sampling and significance tests.