Chapter 8 Feature Pipelines & Reuse
I. Chapter Purpose & Scope
specifications: feature extraction/aggregation/alignment, dictionary & embedding management, materialization & caching, cross-task/multi-modal reuse, versioning & dependency mapping; ensure consistency with data contracts, Model Card feature space & task I/O, the Metrology chapter, and citation anchors.feature pipelineFixII. Terminology & Dependencies
- Terms: feature_space, dict_ref, embedding_store, materialize, cache, ttl, point-in-time (PIT alignment), surrogate_key.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions (Core.Metrology v1.0); training data & splits (DatasetCards v1.0); feature & I/O assumptions (ModelCards v1.0, Ch.6 & Ch.9).
- Math & symbols: wrap inline symbols (e.g., x_t, μ, σ, Δt, T_arr) in backticks; any division/integral/composite operator must use parentheses and—if path quantities are involved—declare gamma(ell) and d ell; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
stage:
name: "<feat.map|feat.aggregate|feat.join|feat.encode|feat.embed|feat.materialize>"
type: "feature.<op>"
impl: "I16-4.<impl_id>"
inputs: ["<Σ_in>"]
outputs: ["<Σ_out>"]
params:
key: ["<entity_id>", "<ts?>"]
point_in_time:
enabled: true
lookback: "PT7D|P30D|N/A"
tolerance: "PT5M"
dict_ref: "dicts/<name>@vX.Y"
embed:
store: "faiss|annoy|milvus|custom"
dim: 768
metric: "cosine|l2"
index_ref: "embeddings/<name>@vX.Y"
aggregate:
window: "PT1H|P1D"
funcs: ["mean","max","count","std"]
fillna: {"method":"pad|zero|drop"}
join:
on: ["<entity_id>","<ts?>"]
how: "left|inner|asof"
materialize:
mode: "none|cache|persist"
cache: {ttl: "P7D", max_gb: 128}
idempotent: true
schema_ref: "contracts/feat_<name>@vX.Y"
feature_space:
type: "<tabular|sequence|image|audio_spec|embedding>"
shape: "<(…)>"
dtype: "<float32|int32|...>"
normalization: "<zscore|minmax|robust|unit-norm|none>"
IV. Feature Operators & Postures
- Mapping/Construction (feat.map): declare feature function/kernel & hyperparameters; emit feature_space with units/dimensions.
- Aggregation (feat.aggregate): sliding/tumbling windows, boundary inclusion, missing policy (fillna); list parameters for multi-window setups.
- Joins (feat.join|feat.asof): keys and temporal alignment; for asof, provide tolerance and direction (backward|forward).
- Encoding (feat.encode): dict_ref version hash, unk/pad policy, OOV handling, and sparse/dense representation.
- Embeddings (feat.embed): vector dimension/metric, index build params & index_ref; include latency/throughput in metrology.
- Materialization (feat.materialize): when mode:"cache|persist", declare store location, ttl, eviction & consistency (full/incremental/bypass).
V. Reuse & Dependency Mapping
- Cross-task reuse: define reusable sets via feature_view, record consumers[] and version constraints; set compat_mode:"forward|backward|both|break".
- Multi-modal reuse: maintain per-modality feature_view and feature_space; register artifacts separately in exports.
- Change impact: breaking changes require shadow comparison & dual-write window; threshold breaches trigger rollback or a backward-compat layer.
VI. Consistency & Point-in-Time (PIT) Alignment
- PIT: historical replay and online inference use the same alignment rules; lookback/tolerance are identical and locked in configs.
- Split consistency: features emitted for training/evaluation must use Dataset Card frozen splits; leakage audits (object/timewindow) are blocking.
VII. Dictionary & Embedding Management
- Dictionaries: dict_ref must carry version & hash; admission policy for new categories unknown|reject|map-to-other; provide stable alias keys for frequently changing vocabularies.
- Embeddings: version index_ref and record training-data time span; if vectors have physical meaning, declare units in feature_space and pass check_dim.
VIII. Metrology & Units (SI)
- Performance: QPS (1/s), T_inf (ms {p50,p95,p99}), ρ (—); bandwidth net_mbps; storage/index volume size_bytes.
- metrology:{units:"SI", check_dim:true} is mandatory; normalize units first before composition/aggregation.
- For path-quantity features (e.g., T_arr), register delta_form, path="gamma(ell)", measure="d ell", use one of the equivalences below, and pass check_dim:
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- T_arr = ( ∫ ( n_eff / c_ref ) d ell ).
IX. Machine-Readable Fragment (Drop-in)
layers:
- name: "feature"
stages:
- name: "feat.map.stats"
type: "feature.map"
impl: "I16-4.feature_map"
inputs: ["std_rows"]
outputs: ["feat_rows"]
params:
key: ["entity_id","ts"]
point_in_time: {enabled:true, lookback:"P30D", tolerance:"PT5M"}
aggregate: {window:"P1D", funcs:["mean","std","count"], fillna:{method:"pad"}}
idempotent: true
schema_ref: "contracts/feat_stats@v1.1"
feature_space: {type:"tabular", shape:"(N,D)", dtype:"float32", normalization:"zscore"}
- name: "feat.encode.cat"
type: "feature.encode"
impl: "I16-4.encode"
inputs: ["feat_rows"]
outputs: ["feat_enc"]
params:
dict_ref: "dicts/category_voc@v2.0"
encode: {vocab_ref:"dicts/category_voc@v2.0", unk:"<UNK>", pad:"<PAD>"}
idempotent: true
schema_ref: "contracts/feat_enc@v1.0"
- name: "feat.materialize"
type: "feature.materialize"
impl: "I16-4.materialize"
inputs: ["feat_enc"]
outputs: ["feat_pkg"]
params:
materialize: {mode:"cache", cache:{ttl:"P7D", max_gb:256}}
idempotent: true
schema_ref: "contracts/feat_pkg@v1.0"
X. Lint Rules (Excerpt, Normative)
lint_rules:
- id: FEAT.FS_REQUIRED
when: "$.layers[*].stages[?(@.type^='feature.')]"
assert: "has_key('feature_space')"
level: error
- id: FEAT.DICT_VERSIONED
when: "$.layers[*].stages[?(@.type=='feature.encode')].params.dict_ref"
assert: "matches('^dicts/[a-z0-9_\\-]+@v\\d+\\.\\d+$')"
level: error
- id: FEAT.PIT_PARAMS
when: "$.layers[*].stages[*].params.point_in_time"
assert: "value.enabled == true -> (has_key('lookback') and has_key('tolerance'))"
level: error
- id: FEAT.MATERIALIZE_POLICY
when: "$.layers[*].stages[?(@.type=='feature.materialize')].params.materialize"
assert: "value.mode in ['none','cache','persist']"
level: error
- id: FEAT.UNITS_CHECKDIM
when: "$.pipeline.metrology"
assert: "units == 'SI' and check_dim == true"
level: error
- id: FEAT.LEAKAGE_GUARDS_FOR_TRAIN_EXPORT
when: "$.layers[*].stages[*].outputs"
assert: "produces_train_eval(outputs) -> has_leakage_guards()"
level: error
XI. Export Manifest & Audit
export_manifest:
version: "v1.0"
artifacts:
- {path:"features/feat_view.yaml", sha256:"..."}
- {path:"features/dict_category_v2.hash", sha256:"..."}
- {path:"features/feat_pkg.manifest.json", sha256:"..."}
references:
- "EFT.WP.Core.DataSpec v1.0:EXPORT"
- "EFT.WP.Core.Metrology v1.0:check_dim"
- "EFT.WP.Data.ModelCards v1.0:Ch.6"
- "EFT.WP.Data.ModelCards v1.0:Ch.9"
XII. Chapter Compliance Checklist
- Feature operators type/impl/params complete; feature_space declared and aligned with Model Cards; PIT alignment reproducible.
- dict_ref/index_ref versioned & hashed; OOV/admission policies clear; materialization cache defines ttl and size cap.
- For training/evaluation outputs, splits match Dataset Card frozen splits; leakage guardrails active.
- Performance & units use SI with check_dim=true; if using path quantities T_arr, delta_form/path/measure registered & validated.
- export_manifest lists feature view/dictionaries/materialized package artifacts and anchors with sha256, satisfying release gates.