Chapter 11 Versioning, Provenance & Lineage
I. Chapter Purpose & Scope
Fix pipeline versioning, provenance, and lineage specifications: version locking for objects and artifacts, hashing and traceability, lineage graphs and replay, change notices and compatibility policy, audit trail and export manifest; ensure consistency with data contracts, Dataset/Model Cards, the Metrology chapter, and citation anchors.
II. Terminology & Dependencies
- Terms: semver, artifact, digest/sha256, lineage.graph, provenance, repro (reproducibility), replay, dual-write, compat_mode.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); splits/coverage & quality (DatasetCards v1.0); evaluation/feature & I/O assumptions (ModelCards v1.0).
- Math & symbols: wrap inline symbols (e.g., QPS, T_inf, ρ) in backticks; any division/integral/composite operator must use parentheses and—if path quantities are involved—declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
versioning:
  scheme: "semver" # vMAJOR.MINOR.PATCH
  stability_line: "v1.*"
  compat_mode: "forward|backward|both|break"
  notice:
    type: "release|correction|withdrawal"
    summary: "<text>"
    date: "<YYYY-MM-DD>"
provenance:
  sources: ["<uri-or-ref>", "..."] # upstream references (reference-only)
  transforms: ["<stage-name>@vX.Y", "..."]
  environment:
    containers: ["<image@digest>", "..."]
    deps_lock: "locks/deps.lock.yaml"
    seeds: {global: 1701}
lineage:
  graph:
    nodes:
      - {id: "src.s3.pull", kind: "stage", version: "v1.0"}
      - {id: "schema.check", kind: "stage", version: "v1.2"}
      - {id: "feat.map", kind: "stage", version: "v1.1"}
      - {id: "train_pkg", kind: "artifact", digest: "sha256:..."}
    edges:
      - {from: "src.s3.pull", to: "schema.check"}
      - {from: "schema.check", to: "feat.map"}
      - {from: "feat.map", to: "train_pkg"}
  replay:
    enabled: true
    inputs_lock: "locks/inputs.manifest.json" # source list + offsets/watermarks
    policy: "strict|lenient"
artifacts:
  - {path: "pipeline.yaml", sha256: "<hex>"}
  - {path: "locks/inputs.manifest.json", sha256: "<hex>"}
  - {path: "locks/deps.lock.yaml", sha256: "<hex>"}
  - {path: "outputs/train_pkg.tgz", sha256: "<hex>"}
IV. Versioning Strategy & Stability Line
- Format: vMAJOR.MINOR.PATCH; MAJOR for breaking changes, MINOR for backward-compatible additions, PATCH for fixes and doc corrections.
- Stability line: public references should target v1.*; evaluation/release materials should pin to a specific minor version.
- Compatibility mode: one of forward|backward|both|break, aligned with the Schema/contract compat_mode; break requires a shadow comparison and a notice.
- Notices: record notices under versioning.notice and in export_manifest.references[].
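The version format and stability line above can be checked mechanically. The following is a minimal sketch, assuming that PATCH is always present and that a stability line is a simple prefix pattern such as "v1.*"; the names parse_version, on_stability_line, and bump_kind are hypothetical and not part of this specification.

```python
import re
from typing import NamedTuple

# Assumption of this sketch: all three components are required.
_SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'vMAJOR.MINOR.PATCH' into a comparable tuple."""
    match = _SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(group) for group in match.groups()))


def on_stability_line(version: str, line: str = "v1.*") -> bool:
    """True if `version` falls on a stability line such as 'v1.*'."""
    parse_version(version)        # reject malformed versions first
    prefix = line.rstrip("*")     # "v1.*" -> "v1."
    return version.startswith(prefix)


def bump_kind(old: str, new: str) -> str:
    """Classify an upgrade as 'major', 'minor', or 'patch'."""
    a, b = parse_version(old), parse_version(new)
    if b <= a:
        raise ValueError("new version must be greater than old version")
    if b.major != a.major:
        return "major"            # breaking change
    if b.minor != a.minor:
        return "minor"            # backward-compatible addition
    return "patch"                # fix or doc correction
```

A release gate could, for example, require a notice and a shadow comparison whenever bump_kind returns "major".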
V. Provenance & Reproducibility
- Sources: provenance.sources[] records source identifiers (reference-only); watermarks/offsets and time ranges are locked in inputs.manifest.json.
- Environment: container images use immutable digests; deps.lock.yaml lists dependency versions and hashes.
- Randomness: fix seeds; for non-deterministic operators, declare library versions and deterministic backends.
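The immutable-digest requirement for container images lends itself to a simple pre-flight check. The sketch below assumes references take the form "<image>@sha256:<64 hex characters>"; the helper name unpinned_containers is hypothetical.

```python
import re

# "<image>@sha256:<64 lowercase hex chars>". Tag-only references such as
# "repo/image:latest" are mutable and are therefore reported as unpinned.
_PINNED = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_containers(containers: list[str]) -> list[str]:
    """Return the container references that are not pinned by digest."""
    return [ref for ref in containers if not _PINNED.match(ref)]
```

An empty return value means every entry in provenance.environment.containers is pinned; any non-empty result could be treated as blocking.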
VI. Lineage Graph & Replay
- Lineage graph: lineage.graph contains nodes (stages/artifacts) and directed edges; nodes carry versions/hashes; the graph must have no dangling nodes.
- Replay: when replay.enabled=true, constrain input sets and order via inputs_lock; policy:"strict" demands byte-identical results, "lenient" allows bounded non-deterministic drift with a stated tolerance.
- Dual-write window: for breaking migrations, write to both old and new targets, compare the diffs, and cut over only once the diffs fall within the agreed thresholds.
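The structural requirements on lineage.graph can be sketched as a small validator. It assumes that "connected" means weakly connected (edge direction ignored) and that a "dangling" node is one declared under nodes but touched by no edge; both readings are assumptions of this sketch, as is the function name check_lineage_graph.

```python
from collections import defaultdict


def check_lineage_graph(graph: dict) -> list[str]:
    """Return a list of problems; an empty list means the graph passes."""
    ids = [node["id"] for node in graph.get("nodes", [])]
    known = set(ids)
    problems = []
    if len(ids) != len(known):
        problems.append("duplicate node ids")

    # Undirected adjacency, used for both checks below.
    neighbours = defaultdict(set)
    for edge in graph.get("edges", []):
        for end in (edge["from"], edge["to"]):
            if end not in known:
                problems.append(f"edge references unknown node: {end}")
        neighbours[edge["from"]].add(edge["to"])
        neighbours[edge["to"]].add(edge["from"])

    # Dangling node (as interpreted here): declared but touched by no edge.
    if len(known) > 1:
        for node in ids:
            if node not in neighbours:
                problems.append(f"dangling node: {node}")

    # Weak connectivity: flood-fill from the first node, ignoring direction.
    if known:
        seen, stack = set(), [ids[0]]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            stack.extend(neighbours[current] - seen)
        if known - seen:
            problems.append("graph is not connected")
    return problems
```

Applied to the four-node, three-edge graph in Section III, the function returns an empty list.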
VII. Artifact Hashing & Integrity
- Mandatory hashing: all critical artifacts (configs/locks/packages/reports) require sha256; integrity failures are blocking.
- Manifest parity: artifacts[] and export_manifest.artifacts[] must match (paths + hashes); optionally record SIZE/LASTMOD.
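Hash verification and manifest parity can be sketched with the Python standard library. The entry shape ({path, sha256}) follows the artifacts[] fields in Section III; the function names and the root parameter are hypothetical, and treating a missing file as a failure is an assumption of this sketch.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifacts(artifacts, root=".") -> list[str]:
    """Return integrity failures; any entry here should block the run."""
    failures = []
    for entry in artifacts:
        target = Path(root) / entry["path"]
        if not target.is_file():
            failures.append(f"missing: {entry['path']}")
        elif sha256_file(target) != entry["sha256"].lower():
            failures.append(f"hash mismatch: {entry['path']}")
    return failures


def manifest_parity(artifacts, exported) -> set:
    """Symmetric difference of (path, sha256) pairs; empty means parity."""
    left = {(a["path"], a["sha256"]) for a in artifacts}
    right = {(a["path"], a["sha256"]) for a in exported}
    return left ^ right
```

Streaming in chunks keeps memory use flat for large packages such as outputs/train_pkg.tgz.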
VIII. Metrology & Units (SI)
- Performance & resources: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—), net_mbps, size_bytes.
- Mandatory: metrology:{units:"SI", check_dim:true}; normalize units before any composition or conversion.
- Path quantities: if lineage covers arrival-time/correction chains, register delta_form, path="gamma(ell)", and measure="d ell", and use one of the following forms, each of which must pass check_dim:
  - T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
  - T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
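The two forms of T_arr agree whenever c_ref is constant along the path, which a short numeric sketch can confirm. It assumes n_eff is dimensionless, ell is sampled in metres, and c_ref is in metres per second, so that T_arr comes out in seconds; the trapezoidal rule and all function names are choices of this sketch, not of the specification.

```python
def trapezoid(values, ell):
    """Trapezoidal rule for the integral of `values` over sample points `ell`."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (ell[i + 1] - ell[i])
        for i in range(len(ell) - 1)
    )


def t_arr_outside(n_eff, ell, c_ref):
    # T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    return (1.0 / c_ref) * trapezoid(n_eff, ell)


def t_arr_inside(n_eff, ell, c_ref):
    # T_arr = ( ∫ ( n_eff / c_ref ) d ell )
    return trapezoid([n / c_ref for n in n_eff], ell)
```

If c_ref varied along gamma(ell), only the second form would remain meaningful, since the factor could no longer be moved outside the integral.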
IX. Machine-Readable Fragment (Drop-in)
versioning:
  scheme: "semver"
  stability_line: "v1.*"
  compat_mode: "both"
  notice: {type: "release", summary: "initial stable", date: "2025-09-21"}
provenance:
  sources: ["s3://eift-data/raw/2025/09/", "contracts/raw_rows@v1.2"]
  transforms: ["schema.check@v1.2", "feat.map@v1.1"]
  environment:
    containers: ["ghcr.io/eift/pipeline@sha256:abcdef..."]
    deps_lock: "locks/deps.lock.yaml"
    seeds: {global: 1701}
lineage:
  graph:
    nodes:
      - {id: "src.s3.pull", kind: "stage", version: "v1.0"}
      - {id: "schema.check", kind: "stage", version: "v1.2"}
      - {id: "feat.map", kind: "stage", version: "v1.1"}
      - {id: "train_pkg", kind: "artifact", digest: "sha256:1234..."}
    edges:
      - {from: "src.s3.pull", to: "schema.check"}
      - {from: "schema.check", to: "feat.map"}
      - {from: "feat.map", to: "train_pkg"}
  replay: {enabled: true, inputs_lock: "locks/inputs.manifest.json", policy: "strict"}
artifacts:
  - {path: "pipeline.yaml", sha256: "..."}
  - {path: "locks/inputs.manifest.json", sha256: "..."}
  - {path: "locks/deps.lock.yaml", sha256: "..."}
  - {path: "outputs/train_pkg.tgz", sha256: "..."}
X. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: VER.SEMVER
    when: "$.versioning.scheme"
    assert: "value == 'semver' and matches($.pipeline.version, '^v\\d+\\.\\d+(\\.\\d+)?$')"
    level: error
  - id: VER.COMPAT_ALLOWED
    when: "$.versioning.compat_mode"
    assert: "value in ['forward','backward','both','break']"
    level: error
  - id: LIN.GRAPH_CONNECTED
    when: "$.lineage.graph"
    assert: "graph_is_connected(value) and no_dangling_nodes(value)"
    level: error
  - id: LIN.REPLAY_INPUTS_LOCK
    when: "$.lineage.replay.enabled"
    assert: "value == false or has_key($.lineage.replay.inputs_lock)"
    level: error
  - id: ART.SHA256_REQUIRED
    when: "$.artifacts[*]"
    assert: "has_key('sha256') and len(value.sha256) > 0"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
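The rule engine that evaluates the when/assert expressions is not defined in this chapter. As a rough illustration only, four of the rules above can be hand-coded against an already-parsed document; the function lint and its return convention are hypothetical, and a rule is assumed to apply only when the path named in its "when" clause exists.

```python
ALLOWED_COMPAT = {"forward", "backward", "both", "break"}


def lint(doc: dict) -> list[str]:
    """Return the ids of the rules that fail for `doc` (a parsed mapping)."""
    failed = []

    # VER.COMPAT_ALLOWED: value must be one of the four allowed modes.
    versioning = doc.get("versioning", {})
    if "compat_mode" in versioning and versioning["compat_mode"] not in ALLOWED_COMPAT:
        failed.append("VER.COMPAT_ALLOWED")

    # LIN.REPLAY_INPUTS_LOCK: enabled replay requires an inputs_lock.
    replay = doc.get("lineage", {}).get("replay", {})
    if replay.get("enabled") and not replay.get("inputs_lock"):
        failed.append("LIN.REPLAY_INPUTS_LOCK")

    # ART.SHA256_REQUIRED: every artifact carries a non-empty sha256.
    if any(not artifact.get("sha256") for artifact in doc.get("artifacts", [])):
        failed.append("ART.SHA256_REQUIRED")

    # METROLOGY.SI_AND_CHECKDIM: SI units with dimension checking on.
    metrology = doc.get("metrology")
    if metrology is not None and not (
        metrology.get("units") == "SI" and metrology.get("check_dim") is True
    ):
        failed.append("METROLOGY.SI_AND_CHECKDIM")

    return failed
```

Since every rule in the excerpt has level error, a non-empty result would fail the check.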
XI. Export Manifest & Audit Trail
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "pipeline.yaml", sha256: "..."}
    - {path: "locks/inputs.manifest.json", sha256: "..."}
    - {path: "locks/deps.lock.yaml", sha256: "..."}
    - {path: "lineage/graph.json", sha256: "..."}
    - {path: "reports/replay.result.json", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
XII. Chapter Compliance Checklist
- Version follows semver; stability_line matches public references; compat_mode is explicit; breaking changes are shadow-compared and noticed.
- Provenance complete: sources, watermarks/offsets, environment & dependency locks, random seeds; no duplication of data facts.
- Lineage graph connected with no dangling nodes; when replay is enabled, provide inputs_lock and strict/lenient posture.
- All critical artifacts carry sha256 and are registered in export_manifest; performance/resource metrology uses SI with check_dim=true.
- For path quantities T_arr, delta_form/path/measure registered and validated; frozen splits and evaluation protocol aligned with Dataset/Model Cards.