Chapter 1 Overview & Scope


I. Chapter Purpose & Scope


II. Definitions & Terms

  1. Benchmark: a reproducible comparison of targets under given data and protocol.
  2. Suite: an organizational unit composed of tasks and subtasks, with shared protocol, aggregation, and governance rules.
  3. Task/Subtask: an evaluation unit specifying io_mode, input assumptions, constraints, and target metrics.
  4. Track: branches under a task with different resources/tools/openness (e.g., “closed-book/open-book”, “no-tools/tools-allowed”).
  5. Submission: an accepted evaluation run and its artifacts (with run_id, environment lock, and metric report).
  6. Artifact: a verifiable file/object (bound by sha256).
  7. Frozen splits: index‑level immutable sets S_train/S_val/S_test preventing leakage.
  8. Statistical significance: statistical decision on metric differences; report p, CI_95, and correction method.
  9. Path quantities (e.g., arrival time): if T_arr appears, use
    • T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell ), or
    • T_arr = ( ∫ ( n_eff / c_ref ) d ell ),
      and declare gamma(ell) and d ell, with dimensional consistency checks.

III. Background & Motivation


IV. Design Principles (P01–P05)


V. In Scope & Out of Scope

  1. In scope: classification/regression/ranking/retrieval/generation/multimodal; offline batch, online A/B, streaming, and interactive evaluation.
  2. Out of scope: optimization of training recipes per se; non‑public protocols bound to sensitive commercial data; datasets that cannot be frozen at index level.
  3. Cross‑volume dependencies:
    • Data: see EFT.WP.Data.DatasetCards v1.0.
    • Models: see EFT.WP.Data.ModelCards v1.0.
    • Pipelines: see EFT.WP.Data.Pipeline v1.0.

VI. Deliverables & Release Gates

  1. Mandatory exports: benchmark.yaml/json, protocol.yaml, metrics.yaml, env.lock, splits/*.index, reports/*.jsonl, each with sha256.
  2. Gates:
    • Frozen splits and leakage guardrails enabled;
    • SI metrology and dimensional checks pass;
    • Significance and uncertainty reports included;
    • Privacy, residency, and third‑party processing registered.
  3. Leaderboard governance: stability line, shadow comparisons, submission cooldown, and arbitration process.

VII. Cross‑References & Dependencies


VIII. Machine‑Readable Overview (Normative)

suite:

id: "eift.benchmarks.core"

title: "EIFT Core Benchmarks"

version: "v1.0.0"

modalities: ["text","image","audio"]

risks: ["leakage","bias","spurious_correlation"]

tasks:

- id: "cls.binary"

io_mode: "offline"

tracks: ["closed-book"]

dataset_ref: "datasets/core_cls@v1.0"

sampling: {strategy:"stratified", strata:[{by:"label"}]}

splits:

train: {frozen:true, index:"splits/train.index", sha256:"<hex>"}

val: {frozen:true, index:"splits/val.index", sha256:"<hex>"}

test: {frozen:true, index:"splits/test.index", sha256:"<hex>"}

leakage_guard: ["per-object","per-scene"]

protocol:

seed: 1701

repeats: 5

temperature: 0.0

tools_allowed: false

runtime_limits: {timeout_s: 3600}

metrics:

- {name:"Acc", unit:"—", higher_is_better:true, agg:"macro"}

- {name:"ECE", unit:"—", higher_is_better:false}

aggregation:

levels: ["task","suite"]

weights: {task:"uniform"}

normalize: {scheme:"zscore", anchors:["baseline.logreg","baseline.rf"]}

significance:

method: "bootstrap"

B: 10000

alpha: 0.05

correction: "Holm-Bonferroni"

env:

hardware: {cpu:"16c", mem_gb:64, gpu:0}

os: "ubuntu-22.04"

containers: ["ghcr.io/eift/runner@sha256:<hex>"]

deps_lock: "env.lock"

baselines:

- {id:"baseline.logreg", impl:"I15-1.logreg", params:{C:1.0}}

- {id:"baseline.rf", impl:"I15-2.rf", params:{n_trees:200}}

export_manifest:

version: "v1.0"

artifacts:

- {path:"benchmark.yaml", sha256:"<hex>"}

- {path:"splits/train.index", sha256:"<hex>"}

- {path:"reports/summary.json", sha256:"<hex>"}

references:

- "EFT.WP.Core.Metrology v1.0:check_dim"

- "EFT.WP.Data.DatasetCards v1.0:Ch.11"

- "EFT.WP.Data.ModelCards v1.0:Ch.11"


IX. Lint Rules (Excerpt, Normative)

lint_rules:

- id: SUITE.ID_FORMAT

when: "$.suite.id"

assert: "matches('^[a-z0-9_.\\-]+$')"

level: error

- id: SPLITS.FROZEN_REQUIRED

when: "$..splits"

assert: "train.frozen == true and val.frozen == true and test.frozen == true"

level: error

- id: LEAKAGE.GUARDS

when: "$..leakage_guard"

assert: "contains_any(['per-object','per-timewindow','per-scene'])"

level: error

- id: METRICS.UNITS_SI

when: "$..metrics[*].unit"

assert: "all_units_in_SI(value) or value == '—'"

level: error

- id: PROTOCOL.SEED_AND_REPEATS

when: "$..protocol"

assert: "has_keys(seed, repeats)"

level: error

- id: SIGNIFICANCE.PARAMS

when: "$..significance"

assert: "has_keys(method, B, alpha)"

level: error

- id: EXPORT.REFERENCES_FORMAT

when: "$.export_manifest.references[*]"

assert: "matches('^[^:]+ v\\d+\\.\\d+:[A-Z].+$')"

level: error


X. Chapter Compliance Checklist