Chapter 9 Sampling, Splits & Distribution
I. Chapter Purpose & Scope
Fix pipeline specifications for sampling, splits, and distribution: split definitions & ratios, frozen indices & consistency, stratification and re-sampling/re-weighting, leakage prevention & audits, package building and mirrors/rate/region compliance, integrity checks and the export manifest; ensure consistency with data contracts, Dataset Card frozen splits, Model Card evaluation protocol, the Metrology chapter, and citation anchors.
II. Terminology & Dependencies
- Terms: sampling.strategy/strata/weights, splits.train/validation/test, freeze_indices, leakage_guard, shard, mirror, checksum.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); splits/coverage & quality (DatasetCards v1.0); evaluation protocol & frozen splits (ModelCards v1.0).
- Math & symbols: wrap inline symbols (e.g., QPS, T_inf, ρ) in backticks; any division/integral/composite operator must use parentheses; if path quantities such as T_arr appear, register gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
stage:
  name: "split.package|split.export"
  type: "export.splits"
  impl: "I16-5.split_package"
  inputs: ["<Σ_in_feat>"]
  outputs: ["train_pkg", "val_pkg", "test_pkg"]
  splits:
    train: {ratio: 0.8, count: null}
    validation: {ratio: 0.1, count: null}
    test: {ratio: 0.1, count: null}
  policy:
    sampling:
      strategy: "random|stratified|time-based|spatial-tiles|systematic"
      strata: [{by: "class|region|snr_bin", buckets: {"A": 100, "B": 200}}]
      weights: {class: "inverse_freq|none"}
    leakage_guard: ["per-object", "per-timewindow", "per-scene"]
    freeze_indices: true
  distribution:
    packaging:
      format: "tgz|zip|parquet|zarr"
      shard_bytes: 134217728
      layout: ["train", "validation", "test"]
    mirrors: ["https://mirror-a.example/foo/", "s3://bucket/foo/"]
    rate_limit: {mbps: 50}
    regional_compliance: ["EU-GDPR"]
  checksums:
    package: {sha256: "<hex>"}
    shards:
      - {path: "train-000.tgz", sha256: "<hex>"}
      - {path: "train-001.tgz", sha256: "<hex>"}
  on_fail: "block|quarantine|skip"
  timeout_s: 1800
IV. Sampling Strategies & Stratification Posture
- Strategies: random (global uniform), stratified (by class/region/SNR, etc.), time-based (windows/periodic), spatial-tiles (spatial tiling), systematic (fixed step).
- Stratification: explicitly list buckets & quotas in strata[]; align with Dataset Card sampling.strata; report realized shares and deviations.
- Re-sampling/Re-weighting: if the training side uses weights, record the policy and explain its impact and significance in the Model Card evaluation section.
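The stratification posture above can be sketched in Python as follows. This is a minimal illustration, not the normative implementation: it assumes the values in buckets are target counts (quotas) per bucket, that records are dicts carrying the stratification key, and the function names stratified_sample and realized_shares are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, by, quotas, seed=0):
    """Draw up to quotas[bucket] records per bucket with a fixed seed."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[by]].append(rec)
    sample = []
    for bucket in sorted(quotas):
        pool = groups.get(bucket, [])
        k = min(quotas[bucket], len(pool))  # a short pool yields fewer rows
        sample.extend(rng.sample(pool, k))
    return sample

def realized_shares(sample, by, quotas):
    """Return {bucket: (target_share, realized_share, deviation)}."""
    total_quota = sum(quotas.values())
    counts = Counter(rec[by] for rec in sample)
    n = len(sample)
    report = {}
    for bucket in sorted(quotas):
        target = quotas[bucket] / total_quota
        realized = counts[bucket] / n if n else 0.0
        report[bucket] = (target, realized, realized - target)
    return report

# Hypothetical input: 300 records of class A and 600 of class B.
records = [{"id": i, "class": "A" if i % 3 == 0 else "B"} for i in range(900)]
quotas = {"A": 100, "B": 200}
sample = stratified_sample(records, "class", quotas, seed=42)
report = realized_shares(sample, "class", quotas)
```

A non-zero deviation appears whenever a bucket holds fewer records than its quota; that deviation is the value to report alongside the realized shares.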
V. Split Definitions & Frozen Consistency
- Ratio sum: train.ratio + validation.ratio + test.ratio must equal 1 within an absolute tolerance of 1e-6.
- Frozen indices: set freeze_indices=true and emit index lists of files/rows/object IDs to ensure reproducible comparability.
- Cross-volume consistency: Model Card evaluation must use the frozen indices from this chapter and remain consistent with Dataset Card frozen splits.
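A possible way to produce frozen indices is sketched below in Python. It assumes object IDs are strings, that sorting the IDs before a seeded shuffle is an acceptable way to make the assignment independent of input order, and that the test split takes the rounding remainder. The names freeze_split and index_sha256 are hypothetical.

```python
import hashlib
import random

def check_ratio_sum(ratios, tol=1e-6):
    """True if the split ratios sum to 1 within the tolerance."""
    return abs(sum(ratios.values()) - 1.0) <= tol

def freeze_split(ids, ratios, seed=0):
    """Assign IDs to train/validation/test deterministically."""
    if not check_ratio_sum(ratios):
        raise ValueError("split ratios must sum to 1 within 1e-6")
    ordered = sorted(ids)                   # remove dependence on input order
    random.Random(seed).shuffle(ordered)    # seeded, hence reproducible
    n = len(ordered)
    n_train = round(n * ratios["train"])
    n_val = round(n * ratios["validation"])
    return {
        "train": sorted(ordered[:n_train]),
        "validation": sorted(ordered[n_train:n_train + n_val]),
        "test": sorted(ordered[n_train + n_val:]),  # remainder goes to test
    }

def index_sha256(index):
    """Digest of an index list serialized as one ID per line."""
    return hashlib.sha256("\n".join(index).encode("utf-8")).hexdigest()

ratios = {"train": 0.8, "validation": 0.1, "test": 0.1}
ids = [f"obj-{i:04d}" for i in range(1000)]
frozen = freeze_split(ids, ratios, seed=7)
again = freeze_split(reversed(ids), ratios, seed=7)  # different input order
digests = {name: index_sha256(index) for name, index in frozen.items()}
```

Registering the digest of each index list lets the Model Card evaluation confirm that it ran against the same frozen split.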
VI. Leakage Prevention & Audit
- Guard granularity: leakage_guard:["per-object","per-timewindow","per-scene"]; the same object, adjacent time windows, or the same scene must not appear in more than one split.
- Blocking: any leakage hit is blocking.
- Audit exports: produce splits/leakage_report.csv and summary stats; register sha256.
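The audit can be sketched as a grouping check, shown here in Python for the per-object guard. The row layout (object_id and split fields), the CSV column names, and the function names are assumptions made for illustration. The same check applies to per-scene with a scene key, while the per-timewindow guard additionally needs an adjacency rule that this sketch does not cover.

```python
import csv
import hashlib
import io
from collections import defaultdict

def find_leaks(assignments, guard_key):
    """Return [(guard_value, [splits])] for values seen in more than one split."""
    seen = defaultdict(set)
    for row in assignments:
        seen[row[guard_key]].add(row["split"])
    return sorted((value, sorted(splits))
                  for value, splits in seen.items() if len(splits) > 1)

def leakage_report_csv(leaks, guard_key):
    """Serialize leak hits as CSV text (hypothetical column layout)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["guard", "value", "splits"])
    for value, splits in leaks:
        writer.writerow([guard_key, value, "|".join(splits)])
    return buf.getvalue()

# Hypothetical assignments: object o1 appears in both train and test.
rows = [
    {"row_id": 1, "object_id": "o1", "split": "train"},
    {"row_id": 2, "object_id": "o1", "split": "test"},
    {"row_id": 3, "object_id": "o2", "split": "train"},
    {"row_id": 4, "object_id": "o3", "split": "validation"},
]
leaks = find_leaks(rows, "object_id")
report = leakage_report_csv(leaks, "object_id")
report_sha256 = hashlib.sha256(report.encode("utf-8")).hexdigest()
blocking = len(leaks) > 0  # any hit blocks the export
```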
VII. Distribution & Integrity
- Packaging: prefer stream/parallel-friendly formats; if tgz/zip archives are used, provide sharding and checksum tables.
- Mirrors & rate: provide at least two mirror endpoints; publish recommended concurrency and rate limits.
- Integrity: provide sha256 for top-level and each shard (optionally include SIZE/LASTMOD).
- Regional compliance: when privacy/geosensitivity applies, state allowable regions and legal bases.
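A checksum table for shards can be produced along the following lines. This Python sketch assumes shards sit in one flat directory; the helper names and the size_bytes column are illustrative, and the two shard files are written to a temporary directory only so that the example is self-contained.

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_table(directory):
    """Return one row per file: path, sha256, size_bytes (sorted by name)."""
    rows = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if os.path.isfile(full):
            rows.append({
                "path": name,
                "sha256": sha256_file(full),
                "size_bytes": os.path.getsize(full),
            })
    return rows

# Stand-in shard contents, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("train-000.tgz", b"shard zero"),
                          ("train-001.tgz", b"shard one")]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(payload)
    table = checksum_table(tmp)
```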
VIII. Metrology & Units (SI)
- Performance: QPS (1/s), T_inf (ms with {p50,p95,p99}), utilization ρ (—); network net_mbps; package size size_bytes.
- metrology:{units:"SI", check_dim:true} is mandatory; normalize units before composition/aggregation.
- If distribution/splits involve path quantities (e.g., T_arr), register delta_form, path="gamma(ell)", and measure="d ell", and use one of:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
and pass check_dim.
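The two forms agree when c_ref is constant along the path, which the following Python sketch checks numerically with a trapezoidal rule. All values are hypothetical: n_eff is assumed dimensionless and sampled at path positions ell in metres, and c_ref is an arbitrary constant in m/s, so the result carries seconds.

```python
def trapezoid(ys, xs):
    """Trapezoidal approximation of the integral of ys over xs."""
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total

c_ref = 2.0                      # hypothetical reference speed, m/s
ell = [0.0, 1.0, 2.5, 4.0]       # path positions along gamma(ell), m
n_eff = [1.0, 1.2, 1.1, 1.0]     # assumed dimensionless

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_form_a = (1.0 / c_ref) * trapezoid(n_eff, ell)
# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_form_b = trapezoid([n / c_ref for n in n_eff], ell)
```

If c_ref varied along the path, only the second form would remain meaningful, which is one reason to register which form is in use.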
IX. Machine-Readable Fragment (Drop-in)
layers:
  - name: "export"
    stages:
      - name: "split.package"
        type: "export.splits"
        impl: "I16-5.split_package"
        inputs: ["feat_rows"]
        outputs: ["train_pkg", "val_pkg", "test_pkg"]
        splits:
          train: {ratio: 0.8}
          validation: {ratio: 0.1}
          test: {ratio: 0.1}
        policy:
          sampling:
            strategy: "stratified"
            strata: [{by: "class", buckets: {"A": 520, "B": 2100, "C": 12380}}]
          leakage_guard: ["per-object", "per-timewindow"]
          freeze_indices: true
        distribution:
          packaging: {format: "tgz", shard_bytes: 134217728, layout: ["train", "validation", "test"]}
          mirrors: ["https://mirror-a.example/datasets/foo/", "s3://bucket/foo/"]
          rate_limit: {mbps: 50}
        checksums:
          package: {sha256: "…"}
          shards:
            - {path: "train-000.tgz", sha256: "…"}
            - {path: "train-001.tgz", sha256: "…"}
        on_fail: "block"
        timeout_s: 1800
X. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: SPLIT.RATIO_SUM
    when: "$.layers[*].stages[?(@.type=='export.splits')].splits"
    assert: "abs(train.ratio + validation.ratio + test.ratio - 1) <= 1e-6"
    level: error
  - id: SPLIT.FREEZE_REQUIRED
    when: "$.layers[*].stages[?(@.type=='export.splits')].policy.freeze_indices"
    assert: "value == true"
    level: error
  - id: SPLIT.LEAKAGE_GUARDS
    when: "$.layers[*].stages[?(@.type=='export.splits')].policy.leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: DIST.PACKAGING_ALLOWED
    when: "$.layers[*].stages[?(@.type=='export.splits')].distribution.packaging.format"
    assert: "value in ['tgz','zip','parquet','zarr']"
    level: error
  - id: DIST.CHECKSUMS_PRESENT
    when: "$.layers[*].stages[?(@.type=='export.splits')].checksums"
    assert: "has_key('package') and len(shards) >= 1"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
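The rules above can be approximated in plain Python as below. This sketch hand-codes each assertion against an already-parsed stage dict rather than evaluating the JSONPath and assert expressions, and it reads contains_any as "at least one listed guard is present"; both choices are assumptions, as is the function name lint_stage.

```python
ALLOWED_GUARDS = {"per-object", "per-timewindow", "per-scene"}
ALLOWED_FORMATS = {"tgz", "zip", "parquet", "zarr"}

def lint_stage(stage, metrology):
    """Return the IDs of the lint rules violated by one export.splits stage."""
    errors = []
    splits = stage.get("splits", {})
    total = sum(splits.get(name, {}).get("ratio", 0.0)
                for name in ("train", "validation", "test"))
    if abs(total - 1) > 1e-6:
        errors.append("SPLIT.RATIO_SUM")
    policy = stage.get("policy", {})
    if policy.get("freeze_indices") is not True:
        errors.append("SPLIT.FREEZE_REQUIRED")
    if not ALLOWED_GUARDS.intersection(policy.get("leakage_guard", [])):
        errors.append("SPLIT.LEAKAGE_GUARDS")
    fmt = stage.get("distribution", {}).get("packaging", {}).get("format")
    if fmt not in ALLOWED_FORMATS:
        errors.append("DIST.PACKAGING_ALLOWED")
    checksums = stage.get("checksums", {})
    if "package" not in checksums or len(checksums.get("shards", [])) < 1:
        errors.append("DIST.CHECKSUMS_PRESENT")
    if metrology.get("units") != "SI" or metrology.get("check_dim") is not True:
        errors.append("METROLOGY.SI_AND_CHECKDIM")
    return errors

good_stage = {
    "type": "export.splits",
    "splits": {"train": {"ratio": 0.8}, "validation": {"ratio": 0.1},
               "test": {"ratio": 0.1}},
    "policy": {"leakage_guard": ["per-object"], "freeze_indices": True},
    "distribution": {"packaging": {"format": "tgz"}},
    "checksums": {"package": {"sha256": "<hex>"},
                  "shards": [{"path": "train-000.tgz", "sha256": "<hex>"}]},
}
# Same stage with ratios summing to 0.9, no guards, and unfrozen indices.
bad_stage = dict(
    good_stage,
    splits={"train": {"ratio": 0.7}, "validation": {"ratio": 0.1},
            "test": {"ratio": 0.1}},
    policy={"leakage_guard": [], "freeze_indices": False},
)
metrology = {"units": "SI", "check_dim": True}
good_errors = lint_stage(good_stage, metrology)
bad_errors = lint_stage(bad_stage, metrology)
```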
XI. Export Manifest & Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "splits/train.index", sha256: "..."}
    - {path: "splits/validation.index", sha256: "..."}
    - {path: "splits/test.index", sha256: "..."}
    - {path: "packages/train-000.tgz", sha256: "..."}
    - {path: "packages/train-001.tgz", sha256: "..."}
    - {path: "splits/leakage_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
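Building and re-verifying such a manifest might look like the Python sketch below. Artifacts are held in memory as bytes to keep the example self-contained; the payloads, the build_manifest and verify_manifest names, and the choice to report both missing and mismatching paths in one list are all assumptions.

```python
import hashlib

def build_manifest(artifacts, references, version="v1.0"):
    """Build an export_manifest dict from {path: bytes} and citation anchors."""
    return {
        "export_manifest": {
            "version": version,
            "artifacts": [
                {"path": path,
                 "sha256": hashlib.sha256(artifacts[path]).hexdigest()}
                for path in sorted(artifacts)
            ],
            "references": list(references),
        }
    }

def verify_manifest(manifest, artifacts):
    """Return paths that are missing or whose digest no longer matches."""
    bad = []
    for entry in manifest["export_manifest"]["artifacts"]:
        data = artifacts.get(entry["path"])
        if data is None or hashlib.sha256(data).hexdigest() != entry["sha256"]:
            bad.append(entry["path"])
    return bad

# Stand-in payloads, for illustration only.
artifacts = {
    "splits/train.index": b"obj-0001\nobj-0002",
    "splits/test.index": b"obj-0003",
    "splits/leakage_report.csv": b"guard,value,splits\n",
}
references = ["EFT.WP.Core.DataSpec v1.0:EXPORT"]
manifest = build_manifest(artifacts, references)

clean = verify_manifest(manifest, artifacts)
tampered = dict(artifacts, **{"splits/test.index": b"obj-0004"})
mismatches = verify_manifest(manifest, tampered)
```

A release gate could treat any non-empty result from verify_manifest as blocking, in line with the on_fail: "block" setting.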
XII. Chapter Compliance Checklist
- splits ratios sum to 1±1e-6, freeze_indices=true; index lists exported and reproducible.
- sampling.strategy/strata/weights align with the Dataset Card; if training uses re-sampling/re-weighting, documented and explained in the Model Card evaluation.
- Leakage guardrails active; cross-split overlap is blocking; leakage audit report produced with sha256.
- Distribution packages/shards have sha256; mirrors and rate limits are explicit; regional compliance appears in references[] where applicable.
- metrology.units="SI" and check_dim=true; consistent units for QPS/T_inf/ρ/net_mbps/size_bytes.
- export_manifest lists indices/packages/reports and citation anchors, satisfying release gates.