Chapter 7 Transform & Preprocessing
I. Chapter Purpose & Scope
Fix pipeline preprocessing and transform specifications: operator types & parameters, input/output Σ_in/Σ_out contracts, idempotency & replayability, missing-value handling & standardization, time/frequency processing & resampling, feature construction alignment, exception handling & audit exports; ensure consistency with data contracts, Model Card feature space, the Metrology chapter, and citation anchors.
II. Terminology & Dependencies
- Terms: transform.*, preprocess.*, idempotent, resample, window/stride, stats_from:"train-only", feature_space.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); training data & splits (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0, Ch.6 & Ch.9).
- Math & symbols: wrap inline symbols (e.g., x, z, μ, σ, f_samp, T_arr) in backticks; any division, integral, or composite operator must use parentheses; if path quantities are involved, declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
stage:
  name: "<normalize|standardize|resample|impute|encode|tokenize|stft|specaugment|feature_map|aggregate|pca|custom>"
  type: "transform.<op>|preprocess.<op>"
  impl: "I16-3.<impl_id>"
  inputs: ["<Σ_in>"]
  outputs: ["<Σ_out>"]
  params:
    method: "<zscore|minmax|robust|unit-norm|...>"
    stats_from: "train-only|all"
    window: "<samples|ms>?"
    stride: "<samples|ms>?"
    f_samp: "<Hz>?"
    anti_alias: {enabled:true, cutoff:"<Hz>", order:5}?
    encode: {vocab_ref:"<path>", unk:"<token>", pad:"<token>"}?
    impute: {strategy:"mean|median|knn|model", value: null}?
    pca: {n_components:"<int|ratio>", whiten:false}?
  idempotent: true
  retries: {max: 2, backoff: "expo"}
  timeout_s: 1800
  on_fail: "quarantine|skip|block"
  schema_ref: "<contracts/after_transform@vX.Y>"
  feature_space:
    type: "<dense|sparse|sequence|image|audio_spec|tabular|embedding>"
    shape: "<(…)>"
    dtype: "<float32|int32|...>"
    normalization: "<zscore|minmax|robust|unit-norm|none>"
IV. Canonical Operators & Postures
- Standardization/Normalization (normalize|standardize)
- zscore: ( x - μ ) / σ; compute μ/σ from the training set only (stats_from:"train-only") and lock them in a config file.
- minmax/robust/unit-norm: explicit intervals and norms; document outlier truncation or robust stats.
- Missing-Value Handling (impute)
Mean/median/KNN/model inference; record impact on uncertainty and compose in relevant chapters.
- Time/Frequency Processing & Resampling (resample|stft|specaugment)
Declare f_samp, anti-aliasing (anti_alias), interpolation; for STFT, provide window/stride and window function (e.g., hann).
- Encoding/Tokenization (encode|tokenize)
vocab_ref, unk/pad, max length, truncation/slide rules; version & hash vocab artifacts.
- Feature Construction & Dimensionality Reduction (feature_map|aggregate|pca)
Feature functions/kernels & hyperparameters; for PCA, persist loadings and explained variance ratio; align aggregation windows and declare missing-policy.
- Custom Operator (custom)
Provide container/script reference and parameter hash; make I/O contracts explicit; declare rollback policy for failures.
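The zscore posture above (statistics from the training set only, then locked) can be sketched in Python as follows. This is an illustrative sketch using only the standard library; fit_zscore and apply_zscore are hypothetical names, population standard deviation is an assumption (the chapter does not say which estimator to use), and replacing a zero σ with 1.0 is an assumed guard for constant features.

```python
import statistics

def fit_zscore(train_rows):
    """Compute per-column mu/sigma from the training split only."""
    columns = list(zip(*train_rows))
    mu = [statistics.fmean(col) for col in columns]
    # assumption: population std; a zero sigma (constant column) becomes 1.0
    sigma = [statistics.pstdev(col) or 1.0 for col in columns]
    return {"mu": mu, "sigma": sigma}

def apply_zscore(rows, stats):
    """Apply ( x - mu ) / sigma using previously locked statistics."""
    return [
        [(x - m) / s for x, m, s in zip(row, stats["mu"], stats["sigma"])]
        for row in rows
    ]

train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
held_out = [[3.0, 12.0]]

stats = fit_zscore(train)                 # would be written to the lockfile
z_held_out = apply_zscore(held_out, stats)  # held-out data never updates stats
```

The point of the split into two functions is that only fit_zscore ever sees data, and only the training split; every other split goes through apply_zscore with the stored values.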
V. Idempotency, Replayability & Exception Handling
- Idempotency: identical inputs with the same parameter hash must produce byte-identical outputs; for non-deterministic ops, fix seed and record library versions.
- Replayability: produce lockfiles and execution logs (config.lock.yaml, logs/*.jsonl); support bypass replay where applicable.
- Handling: on_fail:"quarantine|skip|block"; export quarantined samples and record violated rules/thresholds with input-fragment hashes.
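One way to check the idempotency rule is to hash a canonical serialization of the parameters and compare output digests across runs. The sketch below assumes canonical JSON (sorted keys, compact separators) as the parameter-hash input and SHA-256 as the digest; neither is specified by this chapter, and param_hash, output_digest, and scale_bytes are hypothetical names.

```python
import hashlib
import json

def param_hash(params):
    """Hash parameters independent of key order (assumed canonical JSON form)."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def output_digest(transform, payload, params):
    """Digest of the raw output bytes, used to compare replays byte for byte."""
    return hashlib.sha256(transform(payload, params)).hexdigest()

def scale_bytes(payload, params):
    # toy deterministic transform, for illustration only
    return bytes((b * params["gain"]) % 256 for b in payload)

params_a = {"gain": 2, "method": "toy"}
params_b = {"method": "toy", "gain": 2}  # same content, different key order

same_param_hash = param_hash(params_a) == param_hash(params_b)
replay_identical = (
    output_digest(scale_bytes, b"\x01\x02\x03", params_a)
    == output_digest(scale_bytes, b"\x01\x02\x03", params_b)
)
```

A replay check would store both the parameter hash and the output digest in the execution log and flag any run where the former matches but the latter does not.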
VI. Alignment with Feature Space / Task I-O
- feature_space.type/shape/dtype/normalization must match Model Cards (Ch.6/Ch.9).
- For multi-modal/multi-task settings, provide per-mode feature_space subtrees and export routed artifacts downstream.
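The match requirement can be checked mechanically by comparing the four feature_space keys against the Model Card. The following Python sketch assumes the Model Card exposes the same four keys as a flat mapping, which this chapter does not confirm; feature_space_mismatches is a hypothetical helper.

```python
FEATURE_SPACE_KEYS = ("type", "shape", "dtype", "normalization")

def feature_space_mismatches(stage_fs, model_card_fs):
    """Return {key: (stage_value, card_value)} for every key that differs."""
    return {
        key: (stage_fs.get(key), model_card_fs.get(key))
        for key in FEATURE_SPACE_KEYS
        if stage_fs.get(key) != model_card_fs.get(key)
    }

stage_fs = {"type": "image", "shape": "(H,W,3)", "dtype": "float32",
            "normalization": "zscore"}
# hypothetical Model Card entry that disagrees on normalization
card_fs = {"type": "image", "shape": "(H,W,3)", "dtype": "float32",
           "normalization": "minmax"}

mismatches = feature_space_mismatches(stage_fs, card_fs)
```

An empty result means the stage and the Model Card agree; for multi-modal settings the same comparison would be run once per feature_space subtree.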
VII. Metrology & Units (SI)
- Resampling frequency f_samp (Hz), window/stride (ms or samples), latency T_inf (ms), throughput QPS (1/s) must use SI with metrology:{units:"SI", check_dim:true}.
- If transforms involve path quantities (e.g., T_arr), register delta_form, path="gamma(ell)", measure="d ell", and use one of:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell );
then pass check_dim.
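The two forms of T_arr agree when c_ref is constant along the path, since the constant factor can be moved inside the integral. The Python sketch below illustrates this numerically with a trapezoidal rule; the sample path, the n_eff values, and the value chosen for c_ref are made-up illustration data, not values from this specification.

```python
import math

def trapezoid(values, ell):
    """Trapezoidal approximation of the integral of values along path coordinate ell."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (ell[i + 1] - ell[i])
        for i in range(len(ell) - 1)
    )

c_ref = 299_792_458.0             # m/s, assumed constant reference speed
ell = [0.0, 100.0, 200.0, 300.0]  # m, sample points along gamma(ell)
n_eff = [1.0, 1.5, 1.5, 1.0]      # dimensionless, illustrative values

# Form 1: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
t_arr_form1 = (1.0 / c_ref) * trapezoid(n_eff, ell)
# Form 2: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
t_arr_form2 = trapezoid([n / c_ref for n in n_eff], ell)

forms_agree = math.isclose(t_arr_form1, t_arr_form2, rel_tol=1e-12)
```

Dimensionally, both forms reduce to ( 1 / (m/s) ) * m = s, which is what check_dim is expected to confirm.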
VIII. Machine-Readable Fragment (Drop-in)
layers:
  - name: "transform"
    stages:
      - name: "standardize.rgb"
        type: "transform.normalize"
        impl: "I16-3.standardize"
        inputs: ["raw_image"]
        outputs: ["img_std"]
        params: {method: "zscore", stats_from: "train-only"}
        idempotent: true
        schema_ref: "contracts/img_std@v1.0"
        feature_space: {type: "image", shape: "(H,W,3)", dtype: "float32", normalization: "zscore"}
      - name: "resample.audio"
        type: "transform.resample"
        impl: "I16-3.resample"
        inputs: ["waveform_48k"]
        outputs: ["waveform_16k"]
        params:
          f_samp: 16000
          anti_alias: {enabled: true, cutoff: 7600, order: 5}
        idempotent: true
        schema_ref: "contracts/waveform_16k@v1.0"
      - name: "stft.spec"
        type: "transform.stft"
        impl: "I16-3.stft"
        inputs: ["waveform_16k"]
        outputs: ["spec"]
        params: {window: 512, stride: 160, window_fn: "hann"}
        feature_space: {type: "audio_spec", shape: "(F,T)", dtype: "float32", normalization: "zscore"}
        idempotent: true
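The numbers in the audio stages above can be sanity-checked with a little arithmetic. The Python sketch below assumes that window and stride are given in samples, that the STFT is one-sided with FFT size equal to the window length, and that frames are taken without padding; none of these conventions is stated in the fragment, so F and T may differ under another implementation.

```python
f_samp = 16000            # Hz, from resample.audio
window, stride = 512, 160  # samples (assumed unit), from stft.spec
cutoff_hz = 7600          # Hz, anti_alias cutoff

nyquist_hz = f_samp / 2                # 8000.0 Hz
cutoff_below_nyquist = cutoff_hz < nyquist_hz

window_ms = 1000 * window / f_samp     # 32.0 ms
stride_ms = 1000 * stride / f_samp     # 10.0 ms

# F in shape "(F,T)": one-sided bins for a real signal, assuming n_fft == window
freq_bins = window // 2 + 1            # 257

def num_frames(n_samples, window, stride):
    """T in shape "(F,T)": number of full frames, assuming no padding."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // stride

frames_per_second = num_frames(f_samp, window, stride)  # 97 for 1 s of audio
```

The 7600 Hz cutoff sits just below the 8000 Hz Nyquist limit of the 16 kHz target, which is what the anti-aliasing filter needs before decimating from 48 kHz.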
IX. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: TF.IDEMPOTENT_REQUIRED
    when: "$.layers[*].stages[?(@.type^='transform.')]"
    assert: "idempotent == true"
    level: error
  - id: TF.STATS_FROM_TRAIN_ONLY
    when: "$.layers[*].stages[?(@.params.stats_from)]"
    assert: "value == 'train-only'"
    level: error
  - id: TF.FS_DECLARED
    when: "$.layers[*].stages[*].feature_space"
    assert: "has_keys(type,shape,dtype,normalization)"
    level: error
  - id: TF.RESAMPLE_SI
    when: "$.layers[*].stages[?(@.type=='transform.resample')]"
    assert: "is_number($.params.f_samp) and $.params.f_samp > 0"
    level: error
  - id: TF.UNITS_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
  - id: TF.PATH_TARR_FIELDS
    when: "$.layers[*].stages[*].params[?(@.delta_form)]"
    assert: "has_keys(delta_form) and has_keys(path) and has_keys(measure)"
    level: error
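To make the intent of the first four rules concrete, the Python sketch below evaluates them over an already-parsed pipeline document. It is an illustration only: it walks the dictionaries by hand instead of using a JSONPath engine, it interprets the `^=` operator as "starts with", and lint_pipeline is a hypothetical function, not the linter this chapter refers to.

```python
REQUIRED_FS_KEYS = {"type", "shape", "dtype", "normalization"}

def lint_pipeline(doc):
    """Return a list of (rule_id, stage_name) findings for four TF.* rules."""
    findings = []
    for layer in doc.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "<unnamed>")
            stage_type = str(stage.get("type", ""))
            params = stage.get("params") or {}

            # TF.IDEMPOTENT_REQUIRED (assumed: ^= means prefix match)
            if stage_type.startswith("transform.") and stage.get("idempotent") is not True:
                findings.append(("TF.IDEMPOTENT_REQUIRED", name))

            # TF.STATS_FROM_TRAIN_ONLY
            if "stats_from" in params and params["stats_from"] != "train-only":
                findings.append(("TF.STATS_FROM_TRAIN_ONLY", name))

            # TF.FS_DECLARED
            fs = stage.get("feature_space")
            if fs is not None and not REQUIRED_FS_KEYS <= set(fs):
                findings.append(("TF.FS_DECLARED", name))

            # TF.RESAMPLE_SI
            if stage_type == "transform.resample":
                f = params.get("f_samp")
                is_number = isinstance(f, (int, float)) and not isinstance(f, bool)
                if not is_number or f <= 0:
                    findings.append(("TF.RESAMPLE_SI", name))
    return findings

doc = {"layers": [{"name": "transform", "stages": [
    {"name": "standardize.rgb", "type": "transform.normalize", "idempotent": True,
     "params": {"method": "zscore", "stats_from": "all"},
     "feature_space": {"type": "image", "shape": "(H,W,3)", "dtype": "float32"}},
    {"name": "resample.audio", "type": "transform.resample",
     "params": {"f_samp": 16000}},
]}]}

findings = lint_pipeline(doc)
```

In this made-up document the first stage violates two rules (statistics from all data, feature_space missing normalization) and the second omits idempotent, so three findings are reported.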
X. Export Manifest & Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "transform/config.lock.yaml", sha256: "..."}
    - {path: "transform/logs/step-*.jsonl", sha256: "..."}
    - {path: "features/spec.yaml", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.9"
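The sha256 fields of the artifact entries could be filled in along the following lines. This Python sketch handles concrete file paths only; how a glob entry such as the step logs should be hashed (per file or combined) is not specified here and is left out. sha256_file and build_artifacts are hypothetical names, and the demo writes a placeholder file into a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_artifacts(root, relative_paths):
    """Build manifest artifact entries for concrete paths under root."""
    root = Path(root)
    return [{"path": rel, "sha256": sha256_file(root / rel)} for rel in relative_paths]

# Demo with a placeholder lockfile in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    lock_path = Path(tmp) / "transform" / "config.lock.yaml"
    lock_path.parent.mkdir(parents=True)
    lock_path.write_bytes(b"abc")  # placeholder content, not a real lockfile
    artifacts = build_artifacts(tmp, ["transform/config.lock.yaml"])
```

Keeping the manifest path relative to the export root, as in the entries above, lets the same manifest be verified after the artifacts are moved.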
XI. Chapter Compliance Checklist
- Operator type/impl/params complete; idempotent=true; retry/timeout and failure handling explicit; logs and lockfiles present.
- feature_space matches the Model Card; statistics sourced from the training set; vocab/encoding/PCA loadings are stored and hashed.
- Resampling/time–frequency settings declare f_samp/window/stride/anti_alias; SI units with check_dim=true.
- For path quantities like T_arr, delta_form/path/measure registered and validated.
- export_manifest lists transform-related artifacts and citation anchors, satisfying release gates.