I. Chapter Purpose & Scope
- Establish the unified posture of benchmarks: how to construct suites, define tasks/subtasks, specify evaluation protocols and metrology, and publish reproducible experiments with leaderboard governance.
- Clarify applicability: offline/online/streaming/interactive evaluations; single-model and end-to-end systems; single-dataset and multi-dataset joint evaluations; cross‑modality and cross‑lingual settings.
- Align interfaces and anchors with companion volumes: DatasetCards, ModelCards, Pipeline, and SI units with dimensional checks.
II. Definitions & Terms
- Benchmark: a reproducible comparison of evaluation targets under a given dataset and protocol.
- Suite: an organizational unit composed of tasks and subtasks, with shared protocol, aggregation, and governance rules.
- Task/Subtask: an evaluation unit specifying io_mode, input assumptions, constraints, and target metrics.
- Track: branches under a task with different resources/tools/openness (e.g., “closed-book/open-book”, “no-tools/tools-allowed”).
- Submission: an accepted evaluation run and its artifacts (with run_id, environment lock, and metric report).
- Artifact: a verifiable file/object (bound by sha256).
- Frozen splits: index‑level immutable sets S_train/S_val/S_test preventing leakage.
- Statistical significance: a statistical decision about metric differences; report the p-value, CI_95, and the multiple-comparison correction method.
- Path quantities (e.g., arrival time): if T_arr appears, use
- T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
- T_arr = ∫ ( n_eff / c_ref ) d ell,
and declare the path gamma(ell) and the line element d ell, with dimensional consistency checks.
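The dimensional-consistency requirement on T_arr can be checked mechanically. A minimal sketch in Python (the exponent-map representation and all names are illustrative, not part of this spec), verifying that T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ) carries the dimension of time:

```python
from collections import Counter

def dim_mul(a, b):
    """Multiply two dimensions expressed as exponent maps over SI base units."""
    out = Counter(a)
    out.update(b)
    return {k: v for k, v in out.items() if v}  # drop zero exponents

def dim_inv(a):
    """Invert a dimension (negate all exponents)."""
    return {k: -v for k, v in a.items()}

M, S = {"m": 1}, {"s": 1}   # metre, second
DIMLESS = {}

c_ref = dim_mul(M, dim_inv(S))  # reference speed: m * s^-1
n_eff = DIMLESS                 # effective index: dimensionless
d_ell = M                       # line element along gamma(ell): metres

# T_arr = (1/c_ref) * ( integral of n_eff d ell )
t_arr = dim_mul(dim_inv(c_ref), dim_mul(n_eff, d_ell))
assert t_arr == S  # check_dim: arrival time must come out in seconds
```

The second form, ∫ ( n_eff / c_ref ) d ell, yields the same exponent map, so both declarations pass the same check.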
III. Background & Motivation
- Common pitfalls: uncontrolled leakage, ambiguous protocols, incomparable metrics, irreproducible environments, weak leaderboard governance.
- Objective: provide an end‑to‑end framework guided by protocol first, frozen data, unit unification, statistical rigor, and transparent governance.
IV. Design Principles (P01–P05)
- P01 Reproducible: inputs, environment, randomness, and implementation are lockable; seed and deps_lock are mandatory.
- P02 Measurable: all metrics use SI units; composite metrics are normalized before they are combined; check_dim=true.
- P03 Comparable: fixed protocols, frozen splits, unified aggregation (macro/micro/weighted) and confidence intervals.
- P04 Governable: submission workflow, review gates, retraction/correction, and versioning are open and transparent.
- P05 Extensible: tasks/data/metrics/protocols evolve via semantic versioning vMAJOR.MINOR.PATCH.
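P05 can be enforced with a small version gate. A sketch in Python; the policy shown (same MAJOR implies results remain comparable) is an assumed illustration, not a normative rule of this chapter:

```python
import re

SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def parse(version):
    """Parse a vMAJOR.MINOR.PATCH string into an integer triple."""
    m = SEMVER.match(version)
    if not m:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(x) for x in m.groups())

def comparable(submitted, suite):
    """Assumed policy: runs stay comparable while MAJOR versions agree."""
    return parse(submitted)[0] == parse(suite)[0]

assert parse("v1.2.3") == (1, 2, 3)
assert comparable("v1.2.0", "v1.0.0")       # MINOR bump: still comparable
assert not comparable("v2.0.0", "v1.0.0")   # MAJOR bump: new comparison line
```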
V. In Scope & Out of Scope
- In scope: classification/regression/ranking/retrieval/generation/multimodal; offline batch, online A/B, streaming, and interactive evaluation.
- Out of scope: optimization of training recipes per se; non‑public protocols bound to sensitive commercial data; datasets that cannot be frozen at index level.
- Cross‑volume dependencies:
- Data: see EFT.WP.Data.DatasetCards v1.0.
- Models: see EFT.WP.Data.ModelCards v1.0.
- Pipelines: see EFT.WP.Data.Pipeline v1.0.
VI. Deliverables & Release Gates
- Mandatory exports: benchmark.yaml/json, protocol.yaml, metrics.yaml, env.lock, splits/*.index, reports/*.jsonl, each with sha256.
- Gates:
- Frozen splits and leakage guardrails enabled;
- SI metrology and dimensional checks pass;
- Significance and uncertainty reports included;
- Privacy, residency, and third‑party processing registered.
- Leaderboard governance: stability line, shadow comparisons, submission cooldown, and arbitration process.
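The gates above can be wired into a blocking pre-release check. A sketch in Python; field names such as check_dim_passed and compliance_register are illustrative placeholders for the registered materials, not spec identifiers:

```python
def check_release_gates(manifest):
    """Return the names of failed gates; an empty list means release-ready.
    Gate names mirror Section VI; manifest field names are illustrative."""
    failures = []
    splits = manifest.get("splits") or {}
    if not splits or not all(s.get("frozen") for s in splits.values()):
        failures.append("frozen_splits")
    if not manifest.get("leakage_guard"):
        failures.append("leakage_guard")
    if not manifest.get("check_dim_passed"):
        failures.append("si_dimensional_checks")
    if not manifest.get("significance_report"):
        failures.append("significance_and_uncertainty")
    if not manifest.get("compliance_register"):
        failures.append("privacy_residency_register")
    return failures

ok = {
    "splits": {"train": {"frozen": True}, "val": {"frozen": True},
               "test": {"frozen": True}},
    "leakage_guard": ["per-object"],
    "check_dim_passed": True,
    "significance_report": "reports/significance.json",
    "compliance_register": "compliance/register.yaml",
}
assert check_release_gates(ok) == []
assert "frozen_splits" in check_release_gates({})
```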
VII. Cross‑References & Dependencies
- Evaluation protocol & metrics: see EFT.WP.Data.ModelCards v1.0, Chapter 11.
- Performance, cost & scaling: see EFT.WP.Data.Pipeline v1.0, Chapter 13.
- Units & dimensions: see EFT.WP.Core.Metrology v1.0:check_dim.
- Fixed cross‑volume phrasing example: “See companion white paper Energy Threads, Chapter x, S/P/M/I…”.
VIII. Machine‑Readable Overview (Normative)
suite:
  id: "eift.benchmarks.core"
  title: "EIFT Core Benchmarks"
  version: "v1.0.0"
  modalities: ["text", "image", "audio"]
  risks: ["leakage", "bias", "spurious_correlation"]
  tasks:
    - id: "cls.binary"
      io_mode: "offline"
      tracks: ["closed-book"]
      dataset_ref: "datasets/core_cls@v1.0"
      sampling: {strategy: "stratified", strata: [{by: "label"}]}
      splits:
        train: {frozen: true, index: "splits/train.index", sha256: "<hex>"}
        val: {frozen: true, index: "splits/val.index", sha256: "<hex>"}
        test: {frozen: true, index: "splits/test.index", sha256: "<hex>"}
      leakage_guard: ["per-object", "per-scene"]
      protocol:
        seed: 1701
        repeats: 5
        temperature: 0.0
        tools_allowed: false
        runtime_limits: {timeout_s: 3600}
      metrics:
        - {name: "Acc", unit: "—", higher_is_better: true, agg: "macro"}
        - {name: "ECE", unit: "—", higher_is_better: false}
aggregation:
  levels: ["task", "suite"]
  weights: {task: "uniform"}
  normalize: {scheme: "zscore", anchors: ["baseline.logreg", "baseline.rf"]}
significance:
  method: "bootstrap"
  B: 10000
  alpha: 0.05
  correction: "Holm-Bonferroni"
env:
  hardware: {cpu: "16c", mem_gb: 64, gpu: 0}
  os: "ubuntu-22.04"
  containers: ["ghcr.io/eift/runner@sha256:<hex>"]
  deps_lock: "env.lock"
baselines:
  - {id: "baseline.logreg", impl: "I15-1.logreg", params: {C: 1.0}}
  - {id: "baseline.rf", impl: "I15-2.rf", params: {n_trees: 200}}
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "benchmark.yaml", sha256: "<hex>"}
    - {path: "splits/train.index", sha256: "<hex>"}
    - {path: "reports/summary.json", sha256: "<hex>"}
  references:
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.11"
    - "EFT.WP.Data.ModelCards v1.0:Ch.11"
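The significance block (bootstrap with B resamples, alpha, Holm-Bonferroni correction) can be sketched as follows, assuming paired per-item scores for the two systems under comparison; this illustrates the declared method and is not a reference implementation:

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, B=10000, alpha=0.05, seed=1701):
    """Paired bootstrap CI for the mean metric difference A - B."""
    rng = random.Random(seed)  # seeded per protocol for reproducibility
    n = len(scores_a)
    diffs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # resample item indices
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * B)]
    hi = diffs[int((1 - alpha / 2) * B) - 1]
    return lo, hi

def holm_bonferroni(pvals, alpha=0.05):
    """Holm's step-down correction; returns per-hypothesis reject flags."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m = len(pvals)
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: once one fails, all larger p-values fail
    return reject
```

A comparison is reported as significant only if zero lies outside the CI and the Holm-corrected decision rejects; for example, holm_bonferroni([0.001, 0.04, 0.3]) rejects only the first hypothesis at alpha = 0.05.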
IX. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: SUITE.ID_FORMAT
    when: "$.suite.id"
    assert: "matches('^[a-z0-9_.\\-]+$')"
    level: error
  - id: SPLITS.FROZEN_REQUIRED
    when: "$..splits"
    assert: "train.frozen == true and val.frozen == true and test.frozen == true"
    level: error
  - id: LEAKAGE.GUARDS
    when: "$..leakage_guard"
    assert: "contains_any(['per-object','per-timewindow','per-scene'])"
    level: error
  - id: METRICS.UNITS_SI
    when: "$..metrics[*].unit"
    assert: "all_units_in_SI(value) or value == '—'"
    level: error
  - id: PROTOCOL.SEED_AND_REPEATS
    when: "$..protocol"
    assert: "has_keys(seed, repeats)"
    level: error
  - id: SIGNIFICANCE.PARAMS
    when: "$..significance"
    assert: "has_keys(method, B, alpha)"
    level: error
  - id: EXPORT.REFERENCES_FORMAT
    when: "$.export_manifest.references[*]"
    assert: "matches('^[^:]+ v\\d+\\.\\d+:[A-Za-z].+$')"
    level: error
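The matches(...) predicates can be realized directly as anchored regular expressions. A sketch in Python; the reference-anchor class is written as [A-Za-z] so that lowercase anchors such as check_dim (used in Section VIII) also validate, which is an assumption about intent rather than a quoted rule:

```python
import re

# SUITE.ID_FORMAT: lowercase alphanumerics plus '_', '.', '-'.
SUITE_ID = re.compile(r"^[a-z0-9_.\-]+$")

# EXPORT.REFERENCES_FORMAT: "<volume> vMAJOR.MINOR:<anchor>".
REFERENCE = re.compile(r"^[^:]+ v\d+\.\d+:[A-Za-z].+$")

assert SUITE_ID.match("eift.benchmarks.core")
assert not SUITE_ID.match("EIFT Benchmarks")          # uppercase and space
assert REFERENCE.match("EFT.WP.Data.ModelCards v1.0:Ch.11")
assert REFERENCE.match("EFT.WP.Core.Metrology v1.0:check_dim")
```

In portal/CI these run as blocking checks: any level: error rule that fails rejects the submission before metrics are computed.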
X. Chapter Compliance Checklist
- Conceptual posture and terminology unified; definitions of suite/tasks/subtasks/tracks are clear.
- Principles P01–P05 are actionable and aligned with anchors in DatasetCards/ModelCards/Pipeline.
- Applicability and boundaries are explicit; cross‑volume dependencies resolve.
- Deliverables and gates are verifiable (sha256, frozen splits, SI units, significance, compliance materials).
- Machine‑readable fragment is drop‑in; lint rules are enforceable as blocking checks in portal/CI.
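The sha256 verifiability called out in the checklist can be spot-checked in a few lines. A sketch in Python; verify_artifact is an illustrative helper, not a spec name:

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifact(path, expected_hex):
    """An artifact (e.g., splits/train.index) is intact iff digests match."""
    return sha256_of(path) == expected_hex

# Round-trip check on a throwaway index file.
with tempfile.TemporaryDirectory() as d:
    idx = Path(d) / "train.index"
    idx.write_text("0\n1\n2\n")
    digest = sha256_of(idx)
    assert verify_artifact(idx, digest)
    assert not verify_artifact(idx, "0" * 64)  # tampered digest is rejected
```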