Chapter 16 Machine-readable Schema & Lint
I. Chapter Purpose & Scope
Provide the normative JSON Schema and Lint ruleset for pipelines, covering structure/type/regex/dependencies/citation anchors/dimensional checks/idempotency & retries/frozen splits & leakage guardrails/minimal security & compliance checks; artifacts are used for pre-release blocking checks and portal auto-validation. Keys use snake_case; cross-volume citations use “Volume vX.Y:Anchor”; math uses backticks with parentheses and no Chinese.
II. Normative Artifacts (Release-Critical)
artifacts:
- path: "schema/pipeline.schema.json"
- path: "schema/lint_rules.yaml"
- path: "schema/examples/minimal.yaml"
- path: "schema/examples/full.yaml"
These artifacts must be listed in export_manifest.artifacts[] with sha256; citation anchors follow this volume’s convention (“Volume vX.Y:Anchor”).
III. Normative JSON Schema (Core Excerpt)
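As a rough illustration of the constraints this excerpt covers, the sketch below writes a hypothetical fragment of schema/pipeline.schema.json as a Python dict and checks a spec against it by hand. The fragment's layout, the helper name check_core, and the exact anchor pattern (which accepts lowercase anchors such as check_dim) are assumptions, not the normative schema; a real deployment would use a full JSON Schema validator.

```python
import re

# Hypothetical excerpt of schema/pipeline.schema.json, written as a Python dict.
# Only the constraints named in this section are modeled.
SCHEMA_EXCERPT = {
    "type": "object",
    "required": ["pipeline", "metrology", "export_manifest"],
    "properties": {
        "metrology": {
            "type": "object",
            "required": ["units", "check_dim"],
            "properties": {
                "units": {"const": "SI"},
                "check_dim": {"const": True},
            },
        },
        "export_manifest": {
            "type": "object",
            "properties": {
                "references": {
                    "type": "array",
                    # Assumed pattern for "Volume vX.Y:Anchor".
                    "items": {
                        "type": "string",
                        "pattern": r"^[^:]+ v\d+\.\d+:[A-Za-z].+$",
                    },
                },
            },
        },
    },
}


def check_core(spec: dict) -> list:
    """Hand-rolled check of the excerpt's required/const/pattern keywords.

    This is not a full JSON Schema validator; it only reads the three
    constraints out of SCHEMA_EXCERPT and applies them.
    """
    errors = []
    for key in SCHEMA_EXCERPT["required"]:
        if key not in spec:
            errors.append(f"$: missing required key '{key}'")

    props = SCHEMA_EXCERPT["properties"]
    metrology = spec.get("metrology", {})
    for key, rule in props["metrology"]["properties"].items():
        if metrology.get(key) != rule["const"]:
            errors.append(f"$.metrology.{key}: expected {rule['const']!r}")

    pattern = props["export_manifest"]["properties"]["references"]["items"]["pattern"]
    refs = spec.get("export_manifest", {}).get("references", [])
    for i, ref in enumerate(refs):
        if not re.search(pattern, ref):
            errors.append(f"$.export_manifest.references[{i}]: bad anchor format")
    return errors
```

Keeping the constraints in a data structure rather than in code means the same fragment could be serialized with json.dumps and handed to a standard validator later.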
The references[] regex enforces “Volume vX.Y:Anchor”; metrology.units="SI" and check_dim=true are mandatory.
IV. Lint Rules (Normative)
version: "v1.0"
rules:
  # Structure & versioning
  - id: STRUCT.REQUIRED
    when: "$"
    assert: "has_keys(pipeline, metrology, export_manifest)"
    level: error
  - id: VERSION.SEMVER
    when: "$.pipeline.version"
    assert: "matches('^v\\d+\\.\\d+(\\.\\d+)?$')"
    level: error
  # Topology & contracts
  - id: LAYERS.NOT_EMPTY
    when: "$.pipeline.layers"
    assert: "len(value) > 0"
    level: error
  - id: EDGES.COMPAT_SCHEMA
    when: "$.pipeline.edges[*]"
    assert: "schema_compat(edge.from.Σ_out, edge.to.Σ_in)"
    level: error
  # Sampling & splits
  - id: SPLIT.RATIO_SUM
    when: "$..stages[?(@.type=='export.splits')].splits"
    assert: "abs(train.ratio + validation.ratio + test.ratio - 1) <= 1e-6"
    level: error
  - id: SPLIT.FREEZE_REQUIRED
    when: "$..stages[?(@.type=='export.splits')].policy.freeze_indices"
    assert: "value == true"
    level: error
  - id: LEAKAGE.GUARDS_PRESENT
    when: "$..stages[?(@.type=='export.splits')].policy.leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  # Validation & DQ
  - id: DQ.SCHEMA_REF_REQUIRED
    when: "$..stages[?(@.type=='validate.dq')]"
    assert: "has_key('schema_ref')"
    level: error
  - id: DQ.SAMPLE_DEFINED
    when: "$..stages[?(@.type=='validate.dq')].dq.sample"
    assert: "value.rows > 0 and value.strategy in ['head','random','stratified']"
    level: error
  # Transform & feature
  - id: TF.IDEMPOTENT_REQUIRED
    when: "$..stages[?(@.type^='transform.')]"
    assert: "idempotent == true"
    level: error
  - id: FEAT.FS_REQUIRED
    when: "$..stages[?(@.type^='feature.')]"
    assert: "has_key('feature_space')"
    level: error
  # Security & compliance minimal checks
  - id: SEC.CREDENTIALS_REF
    when: "$..stages[?(@.type^='source.')].params"
    assert: "has_key('credentials_ref') and not has_key('plain_secret')"
    level: error
  - id: PRIV.MINIMIZATION_ON
    when: "$.privacy.data_minimization"
    assert: "value == true"
    level: error
  # Metrology
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
  # Citation anchors (anchor may start with a lowercase letter, e.g. check_dim)
  - id: REFERENCES.FORMAT
    when: "$.export_manifest.references[*]"
    assert: "matches('^[^:]+ v\\d+\\.\\d+:[A-Za-z].+$')"
    level: error
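To make the rule table more concrete, here is a minimal sketch of a table-driven evaluator for four of the rules above. The `when`/`assert` expression language itself is not modeled: each rule is replaced by a hypothetical Python selector and predicate, and the names iter_stages, RULES, and run_rules are illustrative only.

```python
import re


def iter_stages(spec: dict):
    """Yield (path, stage) for every stage in every pipeline layer."""
    layers = spec.get("pipeline", {}).get("layers", [])
    for i, layer in enumerate(layers):
        for j, stage in enumerate(layer.get("stages", [])):
            yield f"$.pipeline.layers[{i}].stages[{j}]", stage


# Each entry: (rule id, selector returning (path, value) pairs, predicate).
# Assumption: a missing pipeline.version is treated as a violation here.
RULES = [
    ("STRUCT.REQUIRED",
     lambda s: [("$", s)],
     lambda v: all(k in v for k in ("pipeline", "metrology", "export_manifest"))),
    ("VERSION.SEMVER",
     lambda s: [("$.pipeline.version", s.get("pipeline", {}).get("version", ""))],
     lambda v: re.match(r"^v\d+\.\d+(\.\d+)?$", str(v)) is not None),
    ("LAYERS.NOT_EMPTY",
     lambda s: [("$.pipeline.layers", s.get("pipeline", {}).get("layers", []))],
     lambda v: len(v) > 0),
    ("TF.IDEMPOTENT_REQUIRED",
     lambda s: [(p, st) for p, st in iter_stages(s)
                if str(st.get("type", "")).startswith("transform.")],
     lambda v: v.get("idempotent") is True),
]


def run_rules(spec: dict) -> list:
    """Apply every rule to every selected node; collect error-level findings."""
    findings = []
    for rule_id, select, check in RULES:
        for path, value in select(spec):
            if not check(value):
                findings.append({"rule": rule_id, "path": path, "level": "error"})
    return findings
```

Separating the selector from the predicate mirrors the `when`/`assert` split in the ruleset, so a rule that selects no nodes (for example, a pipeline without transform stages) produces no findings.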
Blocking rules include STRUCT.REQUIRED, VERSION.SEMVER, EDGES.COMPAT_SCHEMA, SPLIT.*, TF.IDEMPOTENT_REQUIRED, FEAT.FS_REQUIRED, SEC.CREDENTIALS_REF, METROLOGY.SI_AND_CHECKDIM, REFERENCES.FORMAT.
V. Failure Examples & Diagnostics (Excerpt)
fail_examples:
  - case: "bad reference format"
    input: {export_manifest: {references: ["Core.DataSpec:EXPORT"]}}
    expect: {rule: "REFERENCES.FORMAT", level: "error",
             fix: "Use 'EFT.WP.Core.DataSpec v1.0:EXPORT'"}
  - case: "split ratios sum != 1"
    input: {stages: [{type: "export.splits", splits: {train: {ratio: 0.7}, validation: {ratio: 0.2}, test: {ratio: 0.2}}}]}
    expect: {rule: "SPLIT.RATIO_SUM", level: "error",
             fix: "Normalize ratios so they sum to 1±1e-6"}
  - case: "no credentials_ref"
    input: {stages: [{type: "source.s3", params: {endpoint: "...", plain_secret: "abc"}}]}
    expect: {rule: "SEC.CREDENTIALS_REF", level: "error",
             fix: "Remove plaintext secret; reference a secrets manager via credentials_ref"}
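The last two failure cases can be replayed with a short sketch that emits diagnostics carrying rule, path, message, and fix. The function names, the message wording, and the `$.stages[i]` path layout are assumptions made for illustration; only the rule ids, tolerances, and fix strings come from the examples above.

```python
def check_split_ratios(stage: dict, path: str, tol: float = 1e-6) -> list:
    """SPLIT.RATIO_SUM: train + validation + test ratios must sum to 1 within tol."""
    splits = stage.get("splits", {})
    total = sum(splits.get(name, {}).get("ratio", 0.0)
                for name in ("train", "validation", "test"))
    if abs(total - 1) <= tol:
        return []
    return [{
        "rule": "SPLIT.RATIO_SUM",
        "path": f"{path}.splits",
        "level": "error",
        "message": f"ratios sum to {total:.6f}, expected 1",
        "fix": "Normalize ratios so they sum to 1±1e-6",
    }]


def check_credentials(stage: dict, path: str) -> list:
    """SEC.CREDENTIALS_REF: credentials_ref present and no plain_secret."""
    params = stage.get("params", {})
    if "credentials_ref" in params and "plain_secret" not in params:
        return []
    return [{
        "rule": "SEC.CREDENTIALS_REF",
        "path": f"{path}.params",
        "level": "error",
        "message": "source stage lacks credentials_ref or carries a plaintext secret",
        "fix": "Remove plaintext secret; reference a secrets manager via credentials_ref",
    }]


def lint_stages(stages: list) -> list:
    """Dispatch each stage to the checks that apply to its type."""
    diagnostics = []
    for i, stage in enumerate(stages):
        path = f"$.stages[{i}]"
        kind = str(stage.get("type", ""))
        if kind == "export.splits":
            diagnostics += check_split_ratios(stage, path)
        if kind.startswith("source."):
            diagnostics += check_credentials(stage, path)
    return diagnostics
```

Feeding the "split ratios sum != 1" input (0.7 + 0.2 + 0.2) through lint_stages yields a single SPLIT.RATIO_SUM diagnostic, and the "no credentials_ref" input yields a single SEC.CREDENTIALS_REF diagnostic.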
Lint outputs must include rule/path/message/fix.
VI. Minimal Working Example (Validates under Schema & Lint)
pipeline:
  id: "eift.ingest-validate-transform-export"
  version: "v1.0"
  layers:
    - name: "ingest"
      stages:
        - name: "src.s3.pull"
          type: "source.s3"
          impl: "I16-1.s3_pull"
          params: {endpoint: "https://s3.amazonaws.com", bucket_or_db: "eift-data",
                   prefix_or_table: "raw/2025/09/", query_or_pattern: "*.jsonl",
                   credentials_ref: "secrets://aws/ingest_ro", format: "json"}
          outputs: ["raw_blob"]
          idempotent: true
          retries: {max: 3, backoff: "expo", jitter_ms: 200}
          timeout_s: 1800
    - name: "validate"
      stages:
        - name: "dq.scan"
          type: "validate.dq"
          impl: "I16-7.dq_scan"
          inputs: ["raw_blob"]
          outputs: ["dq_report"]
          schema_ref: "contracts/raw_json@v1.2"
          dq: {sample: {rows: 100000, strategy: "stratified"}, significance: {alpha: 0.05},
               gates: [{id: "DQ_001", kind: "not_null", cols: ["id", "ts"], level: "block"}]}
  edges:
    - {from: "src.s3.pull:raw_blob", to: "dq.scan:raw_blob"}
metrology: {units: "SI", check_dim: true}
export_manifest:
  version: "v1.0"
  artifacts: [{path: "pipeline.yaml", sha256: "..."}]
  references: ["EFT.WP.Core.DataSpec v1.0:EXPORT", "EFT.WP.Core.Metrology v1.0:check_dim"]
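The retries block in the example (max, backoff "expo", jitter_ms) does not spell out the delay schedule, so the following is only one plausible reading: exponential backoff with additive uniform jitter. The base delay of 1 second, the interpretation of max as the number of retries after the first attempt, and the helper names are all assumptions. Retrying like this is only safe because the stage declares idempotent: true.

```python
import random
import time


def backoff_delays(max_retries: int, jitter_ms: int,
                   base_s: float = 1.0, rng=None) -> list:
    """Delay in seconds before each retry: base_s * 2**attempt plus jitter.

    Jitter is drawn uniformly from [0, jitter_ms] milliseconds.
    base_s is an assumed parameter; the example config does not define it.
    """
    rng = rng or random.Random()
    return [base_s * (2 ** attempt) + rng.uniform(0, jitter_ms) / 1000.0
            for attempt in range(max_retries)]


def run_with_retries(task, retries: dict, sleep=time.sleep):
    """Call task(); on failure, wait and retry up to retries['max'] times."""
    for delay in backoff_delays(retries["max"], retries["jitter_ms"]):
        try:
            return task()
        except Exception:
            sleep(delay)
    # Final attempt: any exception now propagates to the caller.
    return task()
```

With max: 3 and jitter_ms: 200 this gives waits of roughly 1 s, 2 s, and 4 s (each plus up to 0.2 s) before the fourth and last attempt.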
VII. Coupling with Export Manifest (Normative)
export_manifest:
  artifacts:
    - {path: "schema/pipeline.schema.json", sha256: "..."}
    - {path: "schema/lint_rules.yaml", sha256: "..."}
    - {path: "schema/examples/minimal.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
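The sha256 fields in the manifest can be produced and re-checked with the standard library. The sketch below is a minimal illustration, assuming artifact paths are relative to some root directory; the function names are hypothetical and not part of the normative interfaces.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifact_entries(root: Path, rel_paths: list) -> list:
    """Build export_manifest.artifacts[] entries for the given relative paths."""
    return [{"path": rel, "sha256": sha256_of(root / rel)} for rel in rel_paths]


def verify_artifact_entries(root: Path, entries: list) -> list:
    """Return the paths whose file is missing or whose digest differs."""
    failed = []
    for entry in entries:
        target = root / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"]:
            failed.append(entry["path"])
    return failed
```

A release gate could call verify_artifact_entries on the listed artifacts and block when the returned list is non-empty.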
Schema and Lint are blocking and must be listed and verifiable; references carry “Volume vX.Y:Anchor”.
VIII. Validation Interfaces (Implementation Binding Ixx-?; Unified Return)
def validate_pipeline(spec: dict) -> dict: ...
def lint_pipeline(spec: dict, rules: dict) -> dict: ...
def check_units(spec: dict) -> dict: ...  # uses Core.Metrology v1.0:check_dim
def verify_references(spec: dict) -> dict: ...  # regex + anchor reachability
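One way these interfaces could share a single result structure with ok, errors, warnings, and metrics keys is sketched below. The helpers make_result and merge_results, the error item layout, and the checks_run metric are assumptions for illustration; check_units here only inspects the two metrology fields and does not call the Core.Metrology check_dim implementation.

```python
def make_result(errors=None, warnings=None, metrics=None) -> dict:
    """Build the unified return dict; ok is True only when there are no errors."""
    errors = list(errors or [])
    return {
        "ok": not errors,
        "errors": errors,
        "warnings": list(warnings or []),
        "metrics": dict(metrics or {}),
    }


def check_units(spec: dict) -> dict:
    """Simplified stand-in: require metrology.units == 'SI' and check_dim == true."""
    metrology = spec.get("metrology", {})
    errors = []
    if metrology.get("units") != "SI":
        errors.append({"rule": "METROLOGY.SI_AND_CHECKDIM",
                       "path": "$.metrology.units",
                       "message": "units must be 'SI'"})
    if metrology.get("check_dim") is not True:
        errors.append({"rule": "METROLOGY.SI_AND_CHECKDIM",
                       "path": "$.metrology.check_dim",
                       "message": "check_dim must be true"})
    return make_result(errors, metrics={"checks_run": 1})


def merge_results(results: list) -> dict:
    """Combine several unified results into one for the portal or CI."""
    errors = [e for r in results for e in r["errors"]]
    warnings = [w for r in results for w in r["warnings"]]
    return make_result(errors, warnings, {"checks_run": len(results)})
```

Because every interface returns the same keys, a CI job only needs to inspect the merged ok flag to decide whether to block.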
Return shape: {"ok": bool, "errors":[...], "warnings":[...], "metrics":{...}} for portal/CI.
IX. Chapter Compliance Checklist
- pipeline.schema.json and lint_rules.yaml produced and registered in export_manifest with sha256.
- Schema enforces metrology.units="SI" & check_dim=true and the anchor regex in references[]; Lint blocks topology incompatibility, unfrozen splits, missing leakage guardrails, missing idempotency, and plaintext secrets.
- Sampling/splits/distribution align with Dataset Cards; feature & I/O contracts and units align with metrology.
- Minimal example validates once under Schema & Lint; validation interfaces integrated and returning the unified structure.
- All citations use “Volume vX.Y:Anchor”; no shortcodes/aliases/missing-version refs.