Chapter 7 Transform & Preprocessing


I. Chapter Purpose & Scope

Fix pipeline preprocessing and transform specifications: operator types & parameters, input/output Σ_in/Σ_out contracts, idempotency & replayability, missing-value handling & standardization, time/frequency processing & resampling, feature construction alignment, exception handling & audit exports; ensure consistency with data contracts, the Model Card feature space, the Metrology chapter, and citation anchors.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

stage:
  name: "<normalize|standardize|resample|impute|encode|tokenize|stft|specaugment|feature_map|aggregate|pca|custom>"
  type: "transform.<op>|preprocess.<op>"
  impl: "I16-3.<impl_id>"
  inputs: ["<Σ_in>"]
  outputs: ["<Σ_out>"]
  params:
    method: "<zscore|minmax|robust|unit-norm|...>"
    stats_from: "train-only|all"
    window: "<samples|ms>?"
    stride: "<samples|ms>?"
    f_samp: "<Hz>?"
    anti_alias: {enabled:true, cutoff:"<Hz>", order:5}?
    encode: {vocab_ref:"<path>", unk:"<token>", pad:"<token>"}?
    impute: {strategy:"mean|median|knn|model", value: null}?
    pca: {n_components:"<int|ratio>", whiten:false}?
  idempotent: true
  retries: {max: 2, backoff: "expo"}
  timeout_s: 1800
  on_fail: "quarantine|skip|block"
  schema_ref: "<contracts/after_transform@vX.Y>"
  feature_space:
    type: "<dense|sparse|sequence|image|audio_spec|tabular|embedding>"
    shape: "<(…)>"
    dtype: "<float32|int32|...>"
    normalization: "<zscore|minmax|robust|unit-norm|none>"


IV. Canonical Operators & Postures

  1. Standardization/Normalization (normalize|standardize)
    • zscore: ( x - μ ) / σ; compute μ/σ from the training set only (stats_from:"train-only") and lock them in a config file.
    • minmax/robust/unit-norm: explicit intervals and norms; document outlier truncation or robust stats.
  2. Missing-Value Handling (impute)
    Strategies: mean, median, KNN, or model-based inference; record the imputation's impact on uncertainty and carry it into the relevant chapters.
  3. Time/Frequency Processing & Resampling (resample|stft|specaugment)
    Declare f_samp, the anti-aliasing filter (anti_alias), and the interpolation method; for STFT, provide window/stride and the window function (e.g., hann).
  4. Encoding/Tokenization (encode|tokenize)
    Declare vocab_ref, unk/pad tokens, max length, and truncation/sliding-window rules; version & hash vocab artifacts.
  5. Feature Construction & Dimensionality Reduction (feature_map|aggregate|pca)
    Feature functions/kernels & hyperparameters; for PCA, persist loadings and explained variance ratio; align aggregation windows and declare missing-policy.
  6. Custom Operator (custom)
    Provide container/script reference and parameter hash; make I/O contracts explicit; declare rollback policy for failures.
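
The train-only posture in item 1 can be sketched as follows. This is a minimal illustration, not the reference implementation: the helper names `fit_zscore` and `apply_zscore`, the use of the population standard deviation, and the JSON serialization (standing in for the locked config file) are all assumptions made for the example.

```python
import json
import statistics


def fit_zscore(train_values):
    """Compute mu/sigma from the training split only (stats_from: "train-only")."""
    mu = statistics.fmean(train_values)
    # Assumption: population standard deviation; the spec does not say which estimator to use.
    sigma = statistics.pstdev(train_values)
    if sigma == 0.0:
        raise ValueError("zero variance: z-score is undefined")
    return {"method": "zscore", "stats_from": "train-only", "mu": mu, "sigma": sigma}


def apply_zscore(values, locked):
    """Apply (x - mu) / sigma using previously locked statistics."""
    return [(x - locked["mu"]) / locked["sigma"] for x in values]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 9.0]

locked = fit_zscore(train)
# Serialize the stats once, then reuse them for every other split.
locked_text = json.dumps(locked, sort_keys=True)
restored = json.loads(locked_text)

# Held-out data is transformed with the training statistics, never its own.
z_test = apply_zscore(test, restored)
```

Because the statistics are read back from the serialized form rather than recomputed, replaying the stage on the same input yields the same output.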

V. Idempotency, Replayability & Exception Handling


VI. Alignment with Feature Space / Task I-O


VII. Metrology & Units (SI)

  1. Resampling frequency f_samp (Hz), window/stride (ms or samples), latency T_inf (ms), and throughput QPS (1/s) must be expressed in SI units, with metrology:{units:"SI", check_dim:true} declared.
  2. If transforms involve path quantities (e.g., T_arr), register delta_form, path="gamma(ell)", measure="d ell", and use one of:
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell );
      then pass check_dim.
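
The two forms of T_arr above agree whenever c_ref is constant along the path. A small numerical sketch can confirm this; the value of `C_REF`, the linear `n_eff` profile, and the 1000 m path are hypothetical and chosen only for illustration (with ell in metres, c_ref in m/s, and dimensionless n_eff, the result is in seconds).

```python
import math


def trapezoid(ys, xs):
    """Trapezoidal-rule integral of samples ys over grid xs."""
    return sum(0.5 * (ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))


# Hypothetical reference speed in m/s (an assumption, not taken from the spec).
C_REF = 299_792_458.0

# Path gamma(ell) sampled every 10 m from 0 to 1000 m.
ell = [i * 10.0 for i in range(101)]
# Hypothetical dimensionless n_eff rising linearly from 1.0 to 1.0001.
n_eff = [1.0 + 1e-4 * (s / 1000.0) for s in ell]

# Form 1: T_arr = (1 / c_ref) * ( ∫ n_eff d ell )
t_form1 = (1.0 / C_REF) * trapezoid(n_eff, ell)

# Form 2: T_arr = ∫ (n_eff / c_ref) d ell
t_form2 = trapezoid([n / C_REF for n in n_eff], ell)

forms_agree = math.isclose(t_form1, t_form2, rel_tol=1e-12)
```

If c_ref varied along the path, only the second form would remain meaningful, which is one reason to register the chosen form explicitly.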

VIII. Machine-Readable Fragment (Drop-in)

layers:
  - name: "transform"
    stages:
      - name: "standardize.rgb"
        type: "transform.normalize"
        impl: "I16-3.standardize"
        inputs: ["raw_image"]
        outputs: ["img_std"]
        params: {method:"zscore", stats_from:"train-only"}
        idempotent: true
        schema_ref: "contracts/img_std@v1.0"
        feature_space: {type:"image", shape:"(H,W,3)", dtype:"float32", normalization:"zscore"}
      - name: "resample.audio"
        type: "transform.resample"
        impl: "I16-3.resample"
        inputs: ["waveform_48k"]
        outputs: ["waveform_16k"]
        params:
          f_samp: 16000
          anti_alias: {enabled:true, cutoff: 7600, order: 5}
        idempotent: true
        schema_ref: "contracts/waveform_16k@v1.0"
      - name: "stft.spec"
        type: "transform.stft"
        impl: "I16-3.stft"
        inputs: ["waveform_16k"]
        outputs: ["spec"]
        params: {window:512, stride:160, window_fn:"hann"}
        feature_space: {type:"audio_spec", shape:"(F,T)", dtype:"float32", normalization:"zscore"}
        idempotent: true
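
As a sanity check on the numeric parameters in the fragment above, the following sketch verifies that the anti-alias cutoff sits below the Nyquist frequency of the target rate and converts the STFT window and stride from samples to milliseconds. The helper names are hypothetical, and the frame count assumes no padding, which is only one of several common conventions.

```python
def nyquist_hz(f_samp_hz):
    """Nyquist frequency for a given sampling rate."""
    return f_samp_hz / 2.0


def samples_to_ms(n_samples, f_samp_hz):
    """Convert a sample count to milliseconds at the given sampling rate."""
    return 1000.0 * n_samples / f_samp_hz


def num_frames(n_samples, window, stride):
    """Count full frames with no padding (an assumption; padding conventions vary)."""
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // stride


F_SAMP = 16000   # params.f_samp of resample.audio
CUTOFF = 7600    # params.anti_alias.cutoff
WINDOW = 512     # params.window of stft.spec, in samples
STRIDE = 160     # params.stride of stft.spec, in samples

cutoff_ok = CUTOFF < nyquist_hz(F_SAMP)       # 7600 Hz < 8000 Hz
window_ms = samples_to_ms(WINDOW, F_SAMP)     # 32.0 ms
stride_ms = samples_to_ms(STRIDE, F_SAMP)     # 10.0 ms
frames_1s = num_frames(F_SAMP, WINDOW, STRIDE)  # frames in one second of audio
```

Under these assumptions, one second of 16 kHz audio yields 97 frames, which fixes the T dimension of the declared "(F,T)" shape for that input length.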


IX. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: TF.IDEMPOTENT_REQUIRED
    when: "$.layers[*].stages[?(@.type^='transform.')]"
    assert: "idempotent == true"
    level: error
  - id: TF.STATS_FROM_TRAIN_ONLY
    when: "$.layers[*].stages[?(@.params.stats_from)]"
    assert: "value == 'train-only'"
    level: error
  - id: TF.FS_DECLARED
    when: "$.layers[*].stages[*].feature_space"
    assert: "has_keys(type,shape,dtype,normalization)"
    level: error
  - id: TF.RESAMPLE_SI
    when: "$.layers[*].stages[?(@.type=='transform.resample')]"
    assert: "is_number($.params.f_samp) and $.params.f_samp > 0"
    level: error
  - id: TF.UNITS_CHECKDIM
    when: "$.pipeline.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
  - id: TF.PATH_TARR_FIELDS
    when: "$.layers[*].stages[*].params[?(@.delta_form)]"
    assert: "has_keys(delta_form) and has_keys(path) and has_keys(measure)"
    level: error
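
To make the intent of three of the rules above concrete, here is a minimal sketch that evaluates them over an already-parsed pipeline document. It does not implement the rule expression language: the function `lint_pipeline` is hypothetical, the `^=` operator is assumed to mean "starts with", and the second stage in the sample document is deliberately invalid.

```python
def lint_pipeline(doc):
    """Return (rule_id, stage_name) pairs for every violated rule."""
    findings = []
    required_fs_keys = {"type", "shape", "dtype", "normalization"}
    for layer in doc.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "<unnamed>")
            # TF.IDEMPOTENT_REQUIRED: assumed to apply when type starts with "transform."
            if str(stage.get("type", "")).startswith("transform."):
                if stage.get("idempotent") is not True:
                    findings.append(("TF.IDEMPOTENT_REQUIRED", name))
            # TF.STATS_FROM_TRAIN_ONLY: applies only when stats_from is present
            params = stage.get("params") or {}
            if "stats_from" in params and params["stats_from"] != "train-only":
                findings.append(("TF.STATS_FROM_TRAIN_ONLY", name))
            # TF.FS_DECLARED: a declared feature_space must carry all four keys
            fs = stage.get("feature_space")
            if fs is not None and required_fs_keys - set(fs):
                findings.append(("TF.FS_DECLARED", name))
    return findings


doc = {"layers": [{"name": "transform", "stages": [
    {"name": "standardize.rgb", "type": "transform.normalize",
     "params": {"method": "zscore", "stats_from": "train-only"},
     "idempotent": True,
     "feature_space": {"type": "image", "shape": "(H,W,3)",
                       "dtype": "float32", "normalization": "zscore"}},
    # Hypothetical invalid stage: no idempotent flag, stats from all data,
    # incomplete feature_space.
    {"name": "bad.stage", "type": "transform.normalize",
     "params": {"stats_from": "all"},
     "feature_space": {"type": "image"}},
]}]}

findings = lint_pipeline(doc)
```

Each finding would map to `level: error` in the rule set, so a pipeline with a non-empty findings list should not be accepted.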


X. Export Manifest & Audit

export_manifest:
  version: "v1.0"
  artifacts:
    - {path:"transform/config.lock.yaml", sha256:"..."}
    - {path:"transform/logs/step-*.jsonl", sha256:"..."}
    - {path:"features/spec.yaml", sha256:"..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.ModelCards v1.0:Ch.9"
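
The sha256 fields of the artifact entries could be filled in along the following lines. This is a sketch under stated assumptions: `sha256_file` and `build_artifacts` are hypothetical helpers, the file content written here is a placeholder, and glob entries such as step-*.jsonl are not handled because the manifest does not say how multiple matching files are combined into one digest.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifacts(root, rel_paths):
    """Build manifest artifact entries for concrete (non-glob) relative paths."""
    return [{"path": rel, "sha256": sha256_file(Path(root) / rel)}
            for rel in rel_paths]


# Demonstration in a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "transform" / "config.lock.yaml"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"abc")
    artifacts = build_artifacts(tmp, ["transform/config.lock.yaml"])
```

Recomputing the digests at audit time and comparing them with the manifest gives a simple check that the locked configuration has not changed since export.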


XI. Chapter Compliance Checklist