15-EFT.WP.Methods.Falsification v1.0 | Chapter 7: Statistical Testing & Error Control

Chapter 7: Statistical Testing & Error Control

I. Scope & Objectives

Establish the statistical testing and error-control framework for falsification, covering significance & power, sample-size planning, equivalence & non-inferiority tests, multiple-testing control (FDR/FWER), and sequential/adaptive tests with significance-budget spending. Unify offline batch evaluation and online gating so that GateDecision ∈ {pass, hold, block} is driven by statistical evidence and risk budgets. All procedures execute under a locked environment EnvLock and the shared time base ts = alpha + beta * tau_mono.
Conflict-name disambiguation
To avoid confusion with the time-base mapping parameters alpha, beta, this chapter denotes significance and Type-II error as alpha_sig and beta_err, respectively; power is power = 1 - beta_err.

II. Terms & Symbols

Hypotheses & statistics
H0, H1, T(x) (test statistic), C_{alpha_sig} (rejection region at level alpha_sig), p_value.
Effect sizes: d = ( mu_1 - mu_0 ) / sigma_pooled, OR (odds ratio), ΔAUC, ΔECE, ΔNLL.
Equivalence / non-inferiority margins: delta_equiv, delta_noninf.
Errors & power
alpha_sig (Type-I error), beta_err (Type-II error), power = 1 - beta_err.
Multiple testing: m (number of tests), R (rejections), V (false rejections),
FDR = E[ V / max(R,1) ], FWER = P( V ≥ 1 ), q_star (target FDR).
Sequential testing & budgets
Likelihood-ratio sequence Lambda_n, thresholds A, B, spending function alpha_spend(t); family-wise budget alpha_family.
Sample size & quantiles
z_{p} (normal quantile), t_{p,df} (t quantile), n_per_group (per-group size), N_min (minimum total size).

III. Postulates & Minimal Equations

P51-10 (Familywise significance-budget postulate)
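For a family {H0_i} with allocations {alpha_i} satisfying Σ alpha_i ≤ alpha_family, testing each H0_i at its own level alpha_i (a Bonferroni-type single-step allocation), or using any adjustment at least as conservative, gives FWER ≤ Σ alpha_i ≤ alpha_family by the union bound.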
P51-11 (Consistency of significance spending)
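In sequential/adaptive testing, let alpha_spend(t) be the significance spent at look t. If Σ_{t=1..T} alpha_spend(t) ≤ alpha_family and the stopping rule depends only on the data observed up to each look (i.e., it is a stopping time with respect to the sample path), then global Type-I error is controlled: P_H0( reject ) ≤ alpha_family.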
S52-18 (p-value & rejection region)
One-sided: p_value = P( T ≥ T_obs | H0 ); two-sided:
p_value = 2 * min{ P( T ≥ T_obs | H0 ), P( T ≤ T_obs | H0 ) }; rule: p_value ≤ alpha_sig → reject H0.
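As a minimal sketch of S52-18, assuming the test statistic is standard normal under H0 (an assumption made only for this illustration; the function name z_p_value is hypothetical), the one- and two-sided p-values can be computed as follows. The two-sided value is capped at 1, which only matters for discrete statistics.

```python
from statistics import NormalDist


def z_p_value(t_obs, tail="two"):
    """p-value of an observed z statistic under a standard normal null."""
    z = NormalDist()              # standard normal null distribution (assumed)
    upper = 1 - z.cdf(t_obs)      # P( T >= T_obs | H0 )
    lower = z.cdf(t_obs)          # P( T <= T_obs | H0 )
    if tail == "one":
        return upper
    return min(1.0, 2 * min(upper, lower))


p_two_sided = z_p_value(1.959964, tail="two")
reject_h0 = p_two_sided <= 0.05 + 1e-6  # decision rule: p_value <= alpha_sig
```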
S52-19 (Power definition)
power = P( T ∈ C_{alpha_sig} | H1 ) = 1 - beta_err.
S52-20 (Two independent means, z-test, known variance)
n_per_group = ( ( z_{1 - alpha_sig/2} + z_{1 - beta_err} )^2 * 2 * sigma^2 ) / delta_min^2,
where delta_min = | mu_1 - mu_0 | is the minimal detectable effect.
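The formula in S52-20 can be evaluated directly with the standard library. The sketch below assumes a two-sided test with equal allocation and rounds up to the next integer; the function name is illustrative and is not the I50-12 binding.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_two_means(delta_min, sigma, alpha_sig=0.05, beta_err=0.20):
    """Per-group size for a two-sided, two-sample z-test with known sigma."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha_sig / 2)   # z_{1 - alpha_sig/2}
    z_beta = z.inv_cdf(1 - beta_err)         # z_{1 - beta_err}
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta_min ** 2
    return ceil(n)  # round up so the planned power is not below target


# Example: standardized effect d = 0.5, alpha_sig = 0.05, power = 0.80
n_example = n_per_group_two_means(delta_min=0.5, sigma=1.0)
```

Halving delta_min quadruples n_per_group, which is why the minimal detectable effect should be fixed at pre-registration rather than tuned after the fact.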
S52-21 (Two-proportions sample-size approximation)
With target proportions p1, p2, p_bar = ( p1 + p2 ) / 2:
n_per_group = ( z_{1 - alpha_sig/2} * sqrt( 2 * p_bar * ( 1 - p_bar ) ) +
z_{1 - beta_err} * sqrt( p1 * ( 1 - p1 ) + p2 * ( 1 - p2 ) ) )^2 / ( p1 - p2 )^2.
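A short sketch of the S52-21 approximation follows, assuming a two-sided test and equal group sizes. The function name and the example proportions are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group_two_proportions(p1, p2, alpha_sig=0.05, beta_err=0.20):
    """Per-group size for comparing two proportions (normal approximation)."""
    z = NormalDist()
    p_bar = (p1 + p2) / 2
    # Null-variance term uses the pooled proportion p_bar.
    term_null = z.inv_cdf(1 - alpha_sig / 2) * sqrt(2 * p_bar * (1 - p_bar))
    # Alternative-variance term uses each group's own proportion.
    term_alt = z.inv_cdf(1 - beta_err) * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return ceil((term_null + term_alt) ** 2 / (p1 - p2) ** 2)


n_prop = n_per_group_two_proportions(p1=0.10, p2=0.15)
```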
S52-22 (Benjamini–Hochberg, FDR control)
Sort p_(1) ≤ ... ≤ p_(m), take
k = max{ i : p_(i) ≤ ( i / m ) * q_star };
reject {H0_(1)..H0_(k)}, ensuring FDR ≤ q_star (independence or positive dependence).
S52-23 (Holm step-down, FWER control)
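Sort p_(1) ≤ ... ≤ p_(m); for i = 1, 2, ..., test p_(i) ≤ alpha_sig / ( m - i + 1 ) and reject H0_(i) while the inequality holds. On the first failure, stop and retain (do not reject) that hypothesis and all remaining ones; FWER ≤ alpha_sig, where alpha_sig here plays the role of the family budget.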
S52-24 (Hierarchical gatekeeping & priorities)
With tiers L1 → L2 → ... and budgets {alpha_l}, if Lk fails, do not release the budget of Lk+1; if Lk passes, roll unspent alpha_k into the next tier:
alpha_{k+1} ← alpha_{k+1} + unspent(alpha_k).
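One possible reading of the roll-over rule in S52-24 is sketched below. The callback evaluate(budget), which returns whether a tier passed and how much of its budget it spent, is a hypothetical interface introduced only for this illustration; the source does not define how "unspent" is measured.

```python
def run_gatekeeping(tiers):
    """Walk tiers in priority order, rolling unspent budget forward.

    tiers: list of (name, alpha_l, evaluate), where evaluate(budget)
    returns (passed, spent). This interface is an assumption.
    """
    carry = 0.0
    report = []
    for name, alpha_l, evaluate in tiers:
        budget = alpha_l + carry           # alpha_{k+1} + unspent(alpha_k)
        passed, spent = evaluate(budget)
        report.append({"tier": name, "budget": budget, "passed": passed})
        if not passed:
            break                          # later tiers are never released
        carry = max(budget - spent, 0.0)   # unspent budget rolls forward
    return report


example_report = run_gatekeeping([
    ("L1", 0.02, lambda b: (True, 0.01)),  # passes, spends half its budget
    ("L2", 0.02, lambda b: (False, b)),    # fails, so L3 is not released
    ("L3", 0.01, lambda b: (True, 0.0)),
])
```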
S52-25 (TOST equivalence test)
H0: | mu - mu0 | ≥ delta_equiv; H1: | mu - mu0 | < delta_equiv.
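Two one-sided tests on the estimate mu_hat with standard error SE:
T1 = ( ( mu_hat - mu0 ) - ( - delta_equiv ) ) / SE, with upper-tail p1 = P( T ≥ T1 ) at the boundary mu - mu0 = - delta_equiv,
T2 = ( ( mu_hat - mu0 ) - ( + delta_equiv ) ) / SE, with lower-tail p2 = P( T ≤ T2 ) at the boundary mu - mu0 = + delta_equiv;
conclude equivalence iff p1 ≤ alpha_sig and p2 ≤ alpha_sig.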
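A minimal TOST sketch follows, assuming the estimate of mu - mu0 and its standard error are already available and that a normal reference distribution is adequate (a t reference would be used for small samples). The function tost_z is illustrative and is not the I50-13 binding.

```python
from statistics import NormalDist


def tost_z(diff_hat, se, delta_equiv, alpha_sig=0.05):
    """Two one-sided z-tests for equivalence within +/- delta_equiv."""
    z = NormalDist()
    t1 = (diff_hat + delta_equiv) / se   # against H0: diff <= -delta_equiv
    t2 = (diff_hat - delta_equiv) / se   # against H0: diff >= +delta_equiv
    p1 = 1 - z.cdf(t1)                   # upper-tail p-value
    p2 = z.cdf(t2)                       # lower-tail p-value
    equivalent = p1 <= alpha_sig and p2 <= alpha_sig
    return {"p1": p1, "p2": p2, "equivalent": equivalent}


inside = tost_z(diff_hat=0.0, se=0.1, delta_equiv=0.3)
near_edge = tost_z(diff_hat=0.25, se=0.1, delta_equiv=0.3)
```

Both one-sided tests run at alpha_sig without further adjustment, because equivalence is declared only when both reject.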
S52-26 (Non-inferiority test)
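H0: mu_ref - mu_cand ≥ delta_noninf; H1: mu_ref - mu_cand < delta_noninf (taking larger metric values as better). Declare non-inferiority if
P( mu_ref - mu_cand < delta_noninf ) ≥ 1 - alpha_sig,
read as the one-sided upper ( 1 - alpha_sig ) confidence bound for mu_ref - mu_cand lying below delta_noninf, or if an equivalent one-sided test of H0 is significant at alpha_sig.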
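The confidence-bound and one-sided-test readings of S52-26 give the same decision. The sketch below assumes a normal reference distribution and that diff_hat estimates mu_ref - mu_cand with standard error se; all names are illustrative.

```python
from statistics import NormalDist


def non_inferiority_z(diff_hat, se, delta_noninf, alpha_sig=0.025):
    """One-sided z-test of H0: mu_ref - mu_cand >= delta_noninf."""
    z = NormalDist()
    # One-sided upper (1 - alpha_sig) confidence bound for mu_ref - mu_cand.
    upper_bound = diff_hat + z.inv_cdf(1 - alpha_sig) * se
    # Equivalent lower-tail p-value evaluated at the margin.
    p_value = z.cdf((diff_hat - delta_noninf) / se)
    return {
        "upper_bound": upper_bound,
        "p_value": p_value,
        "non_inferior": upper_bound < delta_noninf,
    }


ok = non_inferiority_z(diff_hat=0.01, se=0.01, delta_noninf=0.05)
not_ok = non_inferiority_z(diff_hat=0.04, se=0.01, delta_noninf=0.05)
```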
S52-27 (SPRT boundaries)
Lambda_n = Π_{i=1..n} ( f_1( x_i ) / f_0( x_i ) );
if Lambda_n ≥ A = ( 1 - beta_err ) / alpha_sig → reject H0;
if Lambda_n ≤ B = beta_err / ( 1 - alpha_sig ) → accept H0; otherwise continue sampling.
S52-28 (Significance spending functions)
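Given total budget alpha_family and per-look spends alpha_spend(i), ensure Σ_{i=1..t} alpha_spend(i) ≤ alpha_family. Example (O’Brien–Fleming type), stated as the cumulative spend at information fraction t ∈ (0, 1]:
alpha_cum^{OF}(t) = 2 - 2 * Phi( z_{1 - alpha_family/2} / sqrt(t) ),
so that alpha_cum^{OF}(1) = alpha_family; the spend at look k is alpha_spend(k) = alpha_cum^{OF}(t_k) - alpha_cum^{OF}(t_{k-1}), with alpha_cum^{OF}(t_0) = 0.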
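The O'Brien-Fleming-type curve can be sketched as below, assuming t is the information fraction in (0, 1], that the curve is cumulative, and that the quantile is the upper one, z_{1 - alpha_family/2}, so the curve reaches alpha_family at t = 1. Per-look spends are taken as increments of the cumulative curve; the function names are illustrative.

```python
from statistics import NormalDist


def obf_cumulative_spend(t, alpha_family):
    """Cumulative significance spent by information fraction t in (0, 1]."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha_family / 2)
    return 2 - 2 * z.cdf(z_crit / t ** 0.5)


def per_look_spend(info_fractions, alpha_family):
    """Increment of the cumulative curve at each look."""
    spends, prev = [], 0.0
    for t in info_fractions:
        cum = obf_cumulative_spend(t, alpha_family)
        spends.append(cum - prev)
        prev = cum
    return spends


# Four equally spaced looks with a total budget of 0.05.
spends = per_look_spend([0.25, 0.5, 0.75, 1.0], alpha_family=0.05)
```

The early looks receive almost none of the budget, which is the characteristic conservatism of this spending shape.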

IV. Data & Manifest Conventions

HypothesisRegistry (minimum fields)
{hid, H0, H1, effect_size_spec, delta_equiv?, delta_noninf?, metric, tail ∈ {one, two}, alpha_sig, beta_err, power_target, assumptions}.
TestPlan.card
{design ∈ {two-sample, paired, proportion, nonparam}, n_per_group|N_min, allocation_ratio, blocking, stratification, seeds, prereg_sig: alpha_sig, prereg_beta: beta_err}.
MultiTest.family
{scope, members[hid], control ∈ {BH, Holm, Bonferroni, gatekeeping}, q_star|alpha_family, dependency_assumption}.
SeqTest.rule
{type ∈ {SPRT, alpha-spending}, params{A,B|alpha_spend(•)}, stop ∈ {accept, reject, maxN}, monitoring_window}.
Provenance outputs
Each run emits {p_table.csv, adj_p.csv, decision.log, power_check.json, ci_table.csv, alpha_budget.yaml, hash(•), fingerprint}.

V. Algorithms & Implementation Bindings

Mapping to I50-*
Sequential testing: I50-6 sequential_test (when type = alpha-spending). Gating: I50-9 gate_release (consuming FDR/FWER reports & evidence bundle).
Statistical computation extensions
- I50-11 adjust_pvalues(p:list, method:str, q_or_alpha:float) -> {p_adj:list, reject:list}
- I50-12 plan_sample_size(spec:dict) -> {n_per_group:int, power:float}
- I50-13 tost_equivalence(x:any, y:any, delta_equiv:float, alpha_sig:float) -> Verdict
Reference flow (BH step-up)
- Input p[1..m], q_star; sort to p_(i).
- Compute thresholds tau_i = ( i / m ) * q_star.
- k = max{ i : p_(i) ≤ tau_i }, with k = 0 (no rejections) if no i qualifies; set reject[1..k] = true, others false.
- Produce adjusted p-values:
  p_adj_(i) = min{ 1, min_{j ≥ i} ( m / j ) * p_(j) }, then map back to original indices.
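The BH flow above can be sketched in a few lines. The function below loosely mirrors the return shape of I50-11 adjust_pvalues but is only an illustration, not that binding; adjusted p-values are capped at 1.

```python
def benjamini_hochberg(p, q_star):
    """BH step-up: returns adjusted p-values and rejections in input order."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # indices by ascending p

    # Largest rank k with p_(k) <= (k / m) * q_star; k = 0 means no rejection.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= rank / m * q_star:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True

    # Adjusted p-values: running minimum from the largest rank downward.
    p_adj = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running = min(running, m / rank * p[idx])
        p_adj[idx] = running
    return {"p_adj": p_adj, "reject": reject}


bh = benjamini_hochberg([0.001, 0.2, 0.03, 0.04], q_star=0.05)
```

A hypothesis is rejected exactly when its adjusted p-value is at most q_star, so adj_p.csv alone is enough to reproduce the decisions.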
Reference flow (Holm step-down)
- Sort p_(i); for i = 1..m, test
  p_(i) ≤ alpha_sig / ( m - i + 1 ).
- If the first failure occurs at i*, reject {1..i*-1} and retain (do not reject) {i*..m}; if none fail, reject {1..m}.
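A compact sketch of the Holm flow, returning rejections in the original input order (the function name is illustrative):

```python
def holm_step_down(p, alpha_sig):
    """Holm step-down: reject in ascending p order until the first failure."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # indices by ascending p
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if p[idx] <= alpha_sig / (m - rank + 1):
            reject[idx] = True
        else:
            break  # first failure: this and all larger p-values are retained
    return reject


holm = holm_step_down([0.01, 0.04, 0.03, 0.005], alpha_sig=0.05)
```

On the same inputs Holm never rejects more than BH run at q_star = alpha_sig, which reflects the stricter FWER guarantee.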
Reference flow (SPRT)
- Initialize A, B; update Lambda_n per observation.
- If Lambda_n ≥ A → reject; if Lambda_n ≤ B → accept; if n ≥ N_cap with neither boundary crossed → stop = maxN and GateDecision = hold.
- Output {decision, n_used, alpha_spent ≈ P_H0( reject )}.
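The SPRT flow can be sketched for Bernoulli observations, assuming simple hypotheses H0: p = p0 against H1: p = p1 (an assumption for this example; the source does not fix f_0 and f_1). The likelihood ratio is accumulated on the log scale for numerical stability, and reaching n_cap without crossing a boundary is reported as hold.

```python
from math import log


def sprt_bernoulli(xs, p0, p1, alpha_sig, beta_err, n_cap):
    """Wald SPRT for 0/1 observations with a sample-size cap."""
    log_a = log((1 - beta_err) / alpha_sig)   # upper boundary, log A
    log_b = log(beta_err / (1 - alpha_sig))   # lower boundary, log B
    log_lambda = 0.0
    n = 0
    for x in xs:
        n += 1
        # Add the log-likelihood ratio of this observation.
        log_lambda += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if log_lambda >= log_a:
            return {"decision": "reject", "n_used": n}
        if log_lambda <= log_b:
            return {"decision": "accept", "n_used": n}
        if n >= n_cap:
            break
    # Cap reached (or data exhausted) with no boundary crossed.
    return {"decision": "hold", "n_used": n}


all_success = sprt_bernoulli([1] * 20, 0.5, 0.8, 0.05, 0.20, n_cap=20)
mixed = sprt_bernoulli([1, 0, 1, 0], 0.5, 0.8, 0.05, 0.20, n_cap=4)
```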

VI. Metrology Flows & Run Diagram

Mx-59 Sample-size planning & pre-registration
From effect_size_spec, alpha_sig, beta_err compute n_per_group; produce TestPlan.card and alpha_budget.yaml; freeze seeds and analysis script hashes.
Mx-60 Multiple testing & family-wise control
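Define families and hierarchies; choose BH/Holm/gatekeeping; output adj_p.csv and decision.log. FDR and FWER are expectations over repeated runs and cannot be read off a single run, so compare their estimates (for example, from the null simulations in Section VII) against the budgets: if estimated FDR > q_star or estimated FWER > alpha_family, set GateDecision = hold.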
Mx-61 Sequential/online testing with gating
Configure SeqTest.rule and monitoring windows; run I50-6 sequential_test; integrate with TS.error / TS.latency, and upon block/hold record stopping evidence and cumulative Σ alpha_spend.

VII. Verification & Test Matrix

Type-I calibration (null simulations)
- Under H0, repeat n_sim times (n_sim ≥ 10^4; named n_sim to avoid confusion with the SPRT threshold B) to estimate P_hat( reject ); require | P_hat( reject ) - alpha_sig | ≤ tau_calib.
- In multiple-testing settings, estimate FDR/FWER; verify they do not exceed budgets.
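A null-simulation calibration check might look like the sketch below, which assumes a one-sample two-sided z-test with known sigma = 1 as the procedure under test. Here n_sim is the number of null repetitions, and the tolerance used in the final line stands in for tau_calib, whose value the source does not specify.

```python
import random
from math import sqrt
from statistics import NormalDist


def null_rejection_rate(n_sim=10_000, n=20, alpha_sig=0.05, seed=0):
    """Estimate P( reject ) under H0: mu = 0 for a two-sided z-test."""
    rng = random.Random(seed)                  # frozen seed for provenance
    z_crit = NormalDist().inv_cdf(1 - alpha_sig / 2)
    rejections = 0
    for _ in range(n_sim):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z_obs = mean * sqrt(n)                 # known sigma = 1
        if abs(z_obs) >= z_crit:
            rejections += 1
    return rejections / n_sim


rate = null_rejection_rate()
calibrated = abs(rate - 0.05) <= 0.01          # 0.01 is an assumed tau_calib
```

With 10^4 repetitions the Monte Carlo standard error of the estimate is about 0.002 at alpha_sig = 0.05, so tau_calib should not be set tighter than a few multiples of that.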
Power & sample-size backchecks
- Under H1, estimate power_hat; require power_hat ≥ power_target - tau_power.
- CI coverage: two-sided 1 - alpha_sig intervals cover at 1 - alpha_sig ± tau_cov.
Sequential robustness
Optional stopping / data-peeking simulations: under alpha_spend constraints, verify no Type-I inflation; compare expected sample size of SPRT against N_cap.
Assumption checks & robustness
When normality/homoscedasticity fail, use permutation or bootstrap for p_value and CIs; record deviations.
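As one illustration of the fallback named above, a two-sample permutation test on the difference in means can be sketched as follows. It assumes exchangeability of observations under H0 and uses Monte Carlo permutations with the usual +1 correction so the p-value is never exactly zero; the function name and defaults are illustrative.

```python
import random


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    n_x = len(x)
    pooled = list(x) + list(y)
    n_y = len(pooled) - n_x
    observed = abs(sum(x) / n_x - sum(y) / n_y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of group membership
        diff = abs(sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / n_y)
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


p_shifted = permutation_p_value(list(range(8)), [v + 100 for v in range(8)])
p_same = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
```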

VIII. Cross-References & Dependencies

Depends on: Core.Metrology (metrics & confidence), Core.Errors (error types & thresholds), Core.DataSpec (data conventions).
Cross-links: Chapter 3 (postulates; power, FDR, sequential tests), Chapter 8 (uncertainty; tests for ECE/MCE/NLL), Chapter 9 (online gating; linkage to GateDecision and budget spending).

IX. Risks, Limitations & Open Questions

Risks & limitations
Loss of the BH FDR guarantee when p-values are dependent in ways other than independence or positive dependence; assumption breakage under distribution shift; implicit multiplicity from multi-metric scanning; uncontrolled optional stopping inflating Type-I error; unreliable asymptotics under extreme sparsity.
Open questions
Investment-style online FDR fused with gatekeeping; cross-domain/device calibration of a shared alpha_budget; precise power analysis for complex metrics such as ΔECE/ΔNLL.

X. Deliverables & Versioning

Deliverables
HypothesisRegistry.json, TestPlan.card, alpha_budget.yaml, p_table.csv, adj_p.csv, decision.log, power_check.json, ci_table.csv, SeqTest.rule, SeqTest.log, Evidence.bundle (with hash(•) and fingerprint).
Versioning policy
- Adjusting alpha_sig / beta_err / power_target or the family-control method → minor bump; changing the significance-budgeting or sequential rules → major bump.
- All changes require updated signatures and Appendix C history entries.