Appendix B — Data Specification and I/O
I. One-Sentence Goal
Anchor all data objects and I/O for Early Objects to Template v0.1 (EFT Technical Whitepaper & Engineering Memos — Complete Checklist v0.1). Define schemas, units, serialization, directory layout, I/O contracts, and error semantics so that Catalog/Seeds/Trajectory, Phi_T/grad_Phi_T, L_nu/LC, n_eff, { ell_i }, Delta_T_sigma, {R_env,T_trans,A_sigma}, and both arrival-time forms T_arr/Delta_T_arr are operational, reproducible, and auditable.
II. Scope & Non-Goals
- Covered: object model & primary keys, field+unit rules, serialization & directory layout, I/O contracts, DQ checks & consistency tests, Template-family alignment, JSONL examples, workflow mapping.
- Not covered: physics/numerics re-derivations; instrument/pipeline-specific formats; opaque or unverifiable formats.
III. Global Constraints & Conventions
- Coords/metric/units are mandatory: coords_spec, metric_spec, units_spec must be present; normalize ingress to SI. If inputs arrive in km/ms, map to m/s and log the mapping.
- Inline symbols: always use backticks for T_arr, Delta_T_arr, n_eff, c_ref, gamma(ell), Sigma_env, Delta_T_sigma, etc.
- Naming isolation: T_fil ≠ T_trans; n ≠ n_eff.
- Dimensionality & lower bound: ingress must pass check_dimension. Enforce dim(T_arr)=[T], dim(n_eff)=1, dim(c_ref)=[L][T^-1]. Because n_eff ≥ 1, outputs must satisfy the lower bound T_arr ≥ L_path / c_ref; for the general form the equivalent bound is T_arr ≥ ∫ ( 1 / c_ref ) d ell.
- Energy consistency at interfaces: every event satisfies R_env + T_trans + A_sigma = 1, and must produce in-band curves with residuals.
- Two-form arrival time:
- Constant pull-out: T_arr = ( 1 / c_ref ) * ( ∫ n_eff d ell )
- General form: T_arr = ( ∫ ( n_eff / c_ref ) d ell )
Record mode ∈ {constant, general}.
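The two forms can be sketched on a discretized path as follows. This is a minimal illustration, assuming one n_eff value per line element and a simple sum over segments; the function names, the per-segment c_ref list, the n_eff values, and the relative-gap definition of eta_T are assumptions for illustration and are not Template APIs.

```python
from typing import Sequence


def t_arr_constant(n_eff: Sequence[float], d_ell: Sequence[float], c_ref: float) -> float:
    """Constant pull-out: T_arr = (1 / c_ref) * sum_k n_eff[k] * d_ell[k]."""
    if len(n_eff) != len(d_ell):
        raise ValueError("need one n_eff value per line element")
    return sum(n * dl for n, dl in zip(n_eff, d_ell)) / c_ref


def t_arr_general(n_eff: Sequence[float], d_ell: Sequence[float],
                  c_ref: Sequence[float]) -> float:
    """General form: T_arr = sum_k (n_eff[k] / c_ref[k]) * d_ell[k]."""
    if not (len(n_eff) == len(d_ell) == len(c_ref)):
        raise ValueError("need one n_eff and one c_ref value per line element")
    return sum(n / c * dl for n, c, dl in zip(n_eff, c_ref, d_ell))


# Line elements as in the Path example of Section X; n_eff values are made up.
d_ell = [2.0e2, 1.0e3]   # m
n_eff = [1.0, 1.0002]    # dimensionless, hypothetical
c_ref = 2.99792458e8     # m/s

t_const = t_arr_constant(n_eff, d_ell, c_ref)
t_gen = t_arr_general(n_eff, d_ell, [c_ref] * len(d_ell))

# Hypothetical relative two-form gap (the spec names eta_T without defining it here).
eta_T = abs(t_const - t_gen) / t_const
lower_bound = sum(d_ell) / c_ref  # L_path / c_ref
```

With a constant c_ref the two forms agree up to floating-point rounding, and both sit at or above L_path / c_ref because every n_eff value is at least 1.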
IV. Data Objects & Primary Keys (minimal fields)
Contract (measurement contract)
- Required: id, spec_version, coords_spec, units_spec, metric_spec, mode, gauge:{x_ref,t_ref}, boundary_config, tolerances:{eps_T,eta_T,eta_w,tau_switch}
- Dependencies: n_eff_dependencies (e.g., F(Phi_T, grad_Phi_T, rho, f))
- Hashes: hash(Catalog), hash(Seeds), hash(Trajectory), hash(SeaProfile) (if coupled), hash(Phi_T), hash(grad_Phi_T), hash(n_eff), hash(gamma), hash(code)
Catalog (object directory)
- Required: { id, type, z_form, z_obs, env_ref, seed_ref }, plus hash(Catalog)
Seeds/Triggers
- Required: priors, seed_samples (incl. seed_rng), triggers:[{event,type,time}], hash(Seeds)
Trajectory (state series)
- Required: state_series:[{t, M, R, J, a_bh, SFR, Z, …}], events:[…], hash(Trajectory)
Field (fields & refractive index)
- Required: name ∈ {Phi_T, grad_Phi_T, n_eff}, storage ∈ {grid, trajectory}, coords_spec, units_spec
- Grid: grid_axes:{x:[],y:[],z:[]}; Trajectory: samples:{path_id:[…]}
SeaProfile / Interfaces (optional)
- SeaProfile: layers:[{model, chi_k, Delta_k, sigma_k, …}], eta_w, hash(SeaProfile)
- Interfaces: sigma_id, type ∈ {continuous, jump_phi, jump_flux, anisotropic}, location (implicit function or grid)
- Optional events: C_sigma, J_sigma, R_env, T_trans, A_sigma
Path
- Required: path_id, gamma:[…] (node coordinates), Δell:[…] (line elements, one per segment between consecutive gamma nodes)
- Optional: t_hat:[…]
- Interfaces: interface_marks:[idx…] (discrete indices/interpolation locations for { ell_i })
Spectral/Obs
- L_nu(f) (intrinsic spectrum), F_nu(f) (observed spectrum), LC(t) (light curve)
- Observations:{ T_arr_obs_s, Delta_T_arr_obs_s, F_nu_obs, LC_obs } with uncertainties and ISO-8601 UTC timestamps
RTParams (energy triplet)
- Required: in-band curves & clamped intervals for R_env(f), T_trans(f), A_sigma(f)
CalibCref (reference speed calibration)
- Required: gamma_ref_id, T_arr_ref_s, n_eff_ref_hash, c_ref_est, u_stat, u_sys, env_block
Report/Log
- Required: run_id, contract_id, hashes, metrics:{eps_T,eta_T,eta_c,eta_w,tau_switch,GB,u_c}, notes
V. Serialization & Directory Layout
- Formats: static data in JSONL/Parquet; large grid fields in Zarr/NetCDF (field names still follow this spec).
- Suggested layout:
- /contracts/ *.contract.json
- /catalog/ *.catalog.json
- /seeds/ *.seeds.json
- /traj/ *.trajectory.jsonl
- /fields/ phi_t.*, grad_phi_t.*, neff.*
- /seaprofile/ *.sea.json
- /interfaces/ sigma_env.*
- /paths/ *.path.jsonl
- /spectra/ Lnu.*, Fnu.*, LC.*
- /obs/ *.obs.jsonl
- /rtparams/ rt.*
- /calib/ c_ref.*
- /artifacts/ reports, logs, hash manifests, replay scripts
- Naming: <object>-<id>-<hash8>.<ext>, where <hash8> is the first 8 characters of the object's content hash.
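The naming rule can be sketched as below. The choice of SHA-256 as the content hash is an assumption taken from the integrity guidance in Section XIII; the helper name hashed_name is hypothetical.

```python
import hashlib


def hashed_name(obj: str, obj_id: str, content: bytes, ext: str) -> str:
    """Build <object>-<id>-<hash8>.<ext> from the object's bytes.

    Only the content is hashed, so the name does not change when the file
    is moved or its timestamp changes.
    """
    hash8 = hashlib.sha256(content).hexdigest()[:8]  # assumption: SHA-256
    return f"{obj}-{obj_id}-{hash8}.{ext}"


# Illustrative payload only; real objects would be the serialized file bytes.
example_name = hashed_name("path", "p001", b"", "jsonl")
```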
VI. Field & Unit Rules (key fields)
- f_hz: Hz = s^-1; T_arr_obs_s / Delta_T_arr_obs_s: s; Δell: m; c_ref: m•s^-1
- n_eff, R_env, T_trans, A_sigma: dimensionless
- Phi_T may be non-dimensionalized; otherwise Phi_ref must be declared in Contract; grad_Phi_T unit is dim(Phi_T)[L^-1]
- L_nu: W•Hz^-1 (or Contract photometric system); F_nu: W•m^-2•Hz^-1; LC: declared in Contract
- Delta_T_sigma, tau_switch: s
- All coordinates/metric/units must match the Contract; cross-system data must include explicit mapping and logs.
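A minimal sketch of SI normalization at ingress with a logged mapping, as the last rule requires. The conversion table, the to_si helper, and the log record layout are hypothetical; a real pipeline would take its unit table from units_spec in the Contract.

```python
from typing import Dict, List, Tuple

# Hypothetical conversion table: unit label -> (SI unit, multiplicative factor).
TO_SI: Dict[str, Tuple[str, float]] = {
    "m": ("m", 1.0),
    "km": ("m", 1.0e3),
    "s": ("s", 1.0),
    "ms": ("s", 1.0e-3),
    "Hz": ("Hz", 1.0),
    "GHz": ("Hz", 1.0e9),
}


def to_si(field: str, value: float, unit: str, log: List[dict]) -> float:
    """Convert one value to SI and record any non-trivial mapping in `log`."""
    if unit not in TO_SI:
        # Missing or unknown units are a dimensional error (E-DIM-001: reject).
        raise ValueError(f"E-DIM-001: unknown or missing unit {unit!r} for field {field!r}")
    si_unit, factor = TO_SI[unit]
    if factor != 1.0:
        log.append({"field": field, "from": unit, "to": si_unit, "factor": factor})
    return value * factor


mapping_log: List[dict] = []
d_ell_m = to_si("Δell", 1.2, "km", mapping_log)           # 1.2 km -> m
t_arr_s = to_si("T_arr_obs_s", 6.2001, "ms", mapping_log)  # 6.2001 ms -> s
f_hz = to_si("f_hz", 1.0e9, "Hz", mapping_log)             # already SI, not logged
```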
VII. I/O Contracts (aligned to Template family)
This section anchors Template APIs (not the volume’s implementation). Engineering mappings may be appended as “Template → I70-*”.
End-to-end (object → spectrum → propagation)
- Input: Catalog/Seeds/Trajectory, Phi_T/grad_Phi_T or T_fil+G(•), optional SeaProfile/Sigma_env, Path, f_grid, c_ref or CalibCref
- API family: I.Build.*, I.Path.Capture|Segment, I.Arrival.Constant|General|Delta, (optional) I.Interface.ApplyMatching, I.Report.*
- Output: L_nu/F_nu/LC, T_arr/Delta_T_arr, and audit logs for consistency/energy/switching
Causation & triggers
- Input: priors, environmental slices (Phi_T/SeaProfile)
- API family: I.Build.* (seed sampling, trigger process)
- Output: Seeds/Triggers (with seed_rng and hashes)
Energy consistency & interface audit
- Input: Sigma_env/SeaProfile, Path, observations or simulation outputs
- API family: I.Interface.ApplyMatching, I.RT.Estimate, I.Report.Log
- Output: RTParams and residual curves
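The energy-consistency output can be illustrated with the RTParams example from Section X. This is a sketch only: the function name energy_residuals and the requirement that all three curves share one frequency grid are assumptions, and the frequencies are parsed from strings because that is how the example stores them.

```python
import json
from typing import Dict

RT_JSON = """
{"R_env":[["9.5e8",0.18],["1.0e9",0.20],["1.05e9",0.19]],
 "T_trans":[["9.5e8",0.77],["1.0e9",0.76],["1.05e9",0.78]],
 "A_sigma":[["9.5e8",0.05],["1.0e9",0.04],["1.05e9",0.03]]}
"""


def energy_residuals(rt: dict) -> Dict[float, float]:
    """Return R_env + T_trans + A_sigma - 1 for each frequency in band."""
    names = ("R_env", "T_trans", "A_sigma")
    curves = {name: {float(f): v for f, v in rt[name]} for name in names}
    freqs = set(curves["R_env"])
    if any(set(curve) != freqs for curve in curves.values()):
        raise ValueError("E-INTF-005: curves are not sampled on the same frequencies")
    return {f: sum(curves[name][f] for name in names) - 1.0 for f in sorted(freqs)}


residuals = energy_residuals(json.loads(RT_JSON))
worst = max(abs(r) for r in residuals.values())
```

For the example data every residual is zero up to floating-point rounding; a real audit would compare worst against a tolerance declared in the Contract.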
VIII. Data-Quality Checks (DQC, automated)
- DQC-1 Dimension check: check_dimension covers both arrival-time forms, discrete segmentation, and layer/interface terms (see Appendix A).
- DQC-2 Unit coherence: Δell, gamma, c_ref share consistent units; if remapped at ingress, record mapping.
- DQC-3 Lower bound: T_arr_obs ≥ L_path / c_ref; near-margin samples within −k•u_c must be flagged.
- DQC-4 Two-form consistency: if both forms are available, eta_T ≤ threshold.
- DQC-5 Energy consistency: for every interface/band, ensure R_env + T_trans + A_sigma = 1.
- DQC-6 Thin/thick coherence: tau_switch = | T_arr^{thick} − (T_arr^{thin}+Delta_T_sigma) | ≤ limit.
- DQC-7 Differential coherence: same gamma[k], Δell[k] and Delta_T_sigma setup for all frequency pairs on the same path.
- DQC-8 Clamping statistics: record n_eff ∈ [1,n_max] clamping rate and impact.
- DQC-9 Reproducibility: SolverCfg, random seed, hash(*), and replay commands are present.
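Two of the checks above, DQC-3 and DQC-6, can be sketched as plain functions. Several points here are assumptions rather than statements of the spec: u_c is taken as the root-sum-square of u_stat and u_sys, the "flag" outcome is read as a lower-bound violation no larger than k•u_c, and the default k = 2 is arbitrary.

```python
import math


def combined_uncertainty(u_stat: float, u_sys: float) -> float:
    """Root-sum-square combination (an assumption; the spec only names u_c)."""
    return math.hypot(u_stat, u_sys)


def check_lower_bound(t_arr_obs: float, l_path: float, c_ref: float,
                      u_c: float, k: float = 2.0) -> str:
    """DQC-3 sketch: 'pass', 'flag' (violation within k*u_c), or 'fail'."""
    margin = t_arr_obs - l_path / c_ref
    if margin >= 0.0:
        return "pass"
    return "flag" if margin >= -k * u_c else "fail"


def check_tau_switch(t_thick: float, t_thin: float,
                     delta_t_sigma: float, limit: float) -> bool:
    """DQC-6: |T_arr^thick - (T_arr^thin + Delta_T_sigma)| <= limit."""
    return abs(t_thick - (t_thin + delta_t_sigma)) <= limit


# Values from the Observations, Path, and CalibCref examples in Section X.
u_c = combined_uncertainty(2.0e-6, 3.0e-6)
status = check_lower_bound(6.2001e-3, 1.2e3, 2.99792458e8, u_c)
```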
IX. Error Semantics (aligned to Template error family)
- E-DIM-001: dimensional inconsistency or missing units (reject)
- E-GAUGE-002: unspecified/ambiguous gauge (request gauge completion)
- E-NEFF-003: n_eff < 1 or assembly failure (reject and log falsification sample)
- E-PATH-004: illegal path discretization or measure mismatch (request {gamma, Δell} rebuild)
- E-INTF-005: interface matching failure or parameter out of bounds (reject; attach Sigma_env/SeaProfile tags)
- E-QAD-006: quadrature non-convergence or unmet eps_T (return local error breakdown)
- E-CREF-007: c_ref calibration unsolved/unstable (return environment block)
- E-CONSIST-008: two-form consistency failure
- E-EO-010: thin/thick inconsistency or Delta_T_sigma vs. volume integral gap beyond threshold
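One possible way to carry these codes through a pipeline is an enumeration plus a single exception type. The class and function names below are hypothetical and are not part of the Template error family; only the code strings come from the list above.

```python
from enum import Enum
from typing import Sequence


class EoErrorCode(Enum):
    E_DIM_001 = "E-DIM-001"
    E_GAUGE_002 = "E-GAUGE-002"
    E_NEFF_003 = "E-NEFF-003"
    E_PATH_004 = "E-PATH-004"
    E_INTF_005 = "E-INTF-005"
    E_QAD_006 = "E-QAD-006"
    E_CREF_007 = "E-CREF-007"
    E_CONSIST_008 = "E-CONSIST-008"
    E_EO_010 = "E-EO-010"


class EoError(Exception):
    """Exception carrying a spec error code and a human-readable detail."""

    def __init__(self, code: EoErrorCode, detail: str) -> None:
        super().__init__(f"{code.value}: {detail}")
        self.code = code
        self.detail = detail


def require_n_eff_valid(n_eff: Sequence[float]) -> None:
    """Raise E-NEFF-003 if any sample is below 1, listing the offending samples."""
    bad = [(k, n) for k, n in enumerate(n_eff) if n < 1.0]
    if bad:
        raise EoError(EoErrorCode.E_NEFF_003, f"n_eff < 1 at (index, value): {bad}")
```

Keeping the offending samples in the detail string supports the "log falsification sample" requirement attached to E-NEFF-003.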
X. JSONL Examples (minimal viable)
Contract (/contracts/eo.contract.json)
{
"id": "ct-eo-001",
"spec_version": "EFT.WP.Cosmo.EarlyObjects v1.0",
"coords_spec": "Comoving-Spherical",
"units_spec": {"length":"m","time":"s","speed":"m•s^-1","frequency":"Hz"},
"metric_spec": {"type":"FLRW-like","S_k":"sin","a_ref":1.0},
"mode": "constant",
"gauge": {"x_ref":[0,0,0], "t_ref":"2025-01-01T00:00:00Z"},
"boundary_config": {"type":"Dirichlet","Phi_T_far":0},
"tolerances": {"eps_T":1e-9,"eta_T":5e-10,"eta_w":0.03,"tau_switch":5e-12},
"n_eff_dependencies": "F(Phi_T, grad_Phi_T, rho, f)",
"hashes": {
"hash(Catalog)":"aa22bb33",
"hash(SeaProfile)":"77cc11dd",
"hash(Phi_T)":"ab12cd34",
"hash(grad_Phi_T)":"de98fa76",
"hash(gamma)":"ef56ab78",
"hash(code)":"aa11bb22"
}
}
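A Contract like the one above can be checked against the required fields of Section IV with a few lines of code. This is a sketch under assumptions: the helper contract_problems and its message wording are hypothetical, and the sample dictionary repeats only the required fields of the example.

```python
from typing import Any, Dict, List

REQUIRED_TOP = ["id", "spec_version", "coords_spec", "units_spec", "metric_spec",
                "mode", "gauge", "boundary_config", "tolerances"]
REQUIRED_NESTED = {
    "gauge": ["x_ref", "t_ref"],
    "tolerances": ["eps_T", "eta_T", "eta_w", "tau_switch"],
}
ALLOWED_MODES = {"constant", "general"}


def contract_problems(contract: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the required fields are present."""
    problems = [f"missing field: {k}" for k in REQUIRED_TOP if k not in contract]
    for parent, keys in REQUIRED_NESTED.items():
        child = contract.get(parent)
        if isinstance(child, dict):
            problems += [f"missing field: {parent}.{k}" for k in keys if k not in child]
        elif parent in contract:
            problems.append(f"{parent} must be an object")
    if "mode" in contract and contract["mode"] not in ALLOWED_MODES:
        problems.append(f"invalid mode: {contract['mode']!r}")
    return problems


contract = {
    "id": "ct-eo-001",
    "spec_version": "EFT.WP.Cosmo.EarlyObjects v1.0",
    "coords_spec": "Comoving-Spherical",
    "units_spec": {"length": "m", "time": "s"},
    "metric_spec": {"type": "FLRW-like", "S_k": "sin", "a_ref": 1.0},
    "mode": "constant",
    "gauge": {"x_ref": [0, 0, 0], "t_ref": "2025-01-01T00:00:00Z"},
    "boundary_config": {"type": "Dirichlet", "Phi_T_far": 0},
    "tolerances": {"eps_T": 1e-9, "eta_T": 5e-10, "eta_w": 0.03, "tau_switch": 5e-12},
}
problems = contract_problems(contract)
```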
Catalog (/catalog/eo.catalog.json)
{"objects":[{"id":"obj001","type":"BHSeed","z_form":18.2,"z_obs":12.7,"env_ref":"sea_v1","seed_ref":"sd001"}]}
Seeds (/seeds/sd001.seeds.json)
{"id":"sd001","priors":{"M0":{"dist":"lognormal","mu":2e4,"sigma":0.3}},"seed_samples":[{"M0":2.3e4,"R0":1.5e15,"J0":1.0e50}],"seed_rng":20250905}
SeaProfile (/seaprofile/sea.v1.json)
{"layers":[{"model":"tanh","chi_k":1.2e3,"Delta_k":2.0e2,"sigma_k":1.0e2}],"eta_w":0.03,"hash(SeaProfile)":"77cc11dd"}
Path (/paths/p001.path.jsonl)
{"path_id":"p001","gamma":[[0,0,1.1e3],[0,0,1.3e3],[0,0,2.3e3]],"Δell":[2.0e2,1.0e3],"t_hat":[[0,0,1],[0,0,1]],"interface_marks":[1]}
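A structural check of the Path record above might look like the following sketch. It assumes, as the example suggests, one line element per segment between consecutive gamma nodes; the function name validate_path and the reuse of E-PATH-004 in the messages are illustrative choices.

```python
import json
from typing import Any, Dict

PATH_LINE = (
    '{"path_id":"p001","gamma":[[0,0,1.1e3],[0,0,1.3e3],[0,0,2.3e3]],'
    '"Δell":[2.0e2,1.0e3],"t_hat":[[0,0,1],[0,0,1]],"interface_marks":[1]}'
)


def validate_path(path: Dict[str, Any]) -> float:
    """Check array lengths and interface marks; return L_path = sum of line elements."""
    gamma, d_ell = path["gamma"], path["Δell"]
    n_seg = len(gamma) - 1
    if n_seg < 1 or len(d_ell) != n_seg:
        raise ValueError("E-PATH-004: expected one line element per segment")
    if any(dl <= 0 for dl in d_ell):
        raise ValueError("E-PATH-004: line elements must be positive")
    if "t_hat" in path and len(path["t_hat"]) != n_seg:
        raise ValueError("E-PATH-004: t_hat must match the number of segments")
    for idx in path.get("interface_marks", []):
        if not 0 <= idx < len(gamma):
            raise ValueError(f"E-PATH-004: interface mark {idx} out of range")
    return sum(d_ell)


path = json.loads(PATH_LINE)
l_path = validate_path(path)  # total path length in metres
```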
Observations (/obs/p001.obs.jsonl)
{"obs_id":"o001","path_id":"p001","f_hz":1.0e9,"T_arr_obs_s":6.2001e-3,"Delta_T_arr_obs_s":-7.0e-7,"u_stat_s":2.0e-6,"u_sys_s":3.0e-6,"timestamp":"2025-01-01T00:00:00Z"}
{"obs_id":"o002","path_id":"p001","f_hz":1.05e9,"T_arr_obs_s":6.2008e-3,"Delta_T_arr_obs_s":0.0,"u_stat_s":2.0e-6,"u_sys_s":3.0e-6,"timestamp":"2025-01-01T00:00:01Z"}
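The two records above can be cross-checked for internal consistency. The sketch assumes that Delta_T_arr_obs_s is measured relative to the record on the same path whose value is 0.0; that reading fits this example but is an interpretation, not something the spec states, and the helper name is hypothetical.

```python
import json
from typing import Dict, List

OBS_LINES = [
    '{"obs_id":"o001","path_id":"p001","f_hz":1.0e9,"T_arr_obs_s":6.2001e-3,'
    '"Delta_T_arr_obs_s":-7.0e-7,"u_stat_s":2.0e-6,"u_sys_s":3.0e-6}',
    '{"obs_id":"o002","path_id":"p001","f_hz":1.05e9,"T_arr_obs_s":6.2008e-3,'
    '"Delta_T_arr_obs_s":0.0,"u_stat_s":2.0e-6,"u_sys_s":3.0e-6}',
]


def delta_t_residuals(records: List[dict]) -> Dict[str, float]:
    """Residual of (T_arr - T_arr_ref) - Delta_T_arr for each record, in seconds."""
    refs = [r for r in records if r["Delta_T_arr_obs_s"] == 0.0]
    if len(refs) != 1:
        raise ValueError("expected exactly one reference record with Delta_T_arr_obs_s = 0")
    t_ref = refs[0]["T_arr_obs_s"]
    return {
        r["obs_id"]: (r["T_arr_obs_s"] - t_ref) - r["Delta_T_arr_obs_s"]
        for r in records
    }


records = [json.loads(line) for line in OBS_LINES]  # timestamps omitted for brevity
obs_residuals = delta_t_residuals(records)
```

Both residuals vanish up to floating-point rounding, since 6.2001e-3 s minus 6.2008e-3 s equals the recorded -7.0e-7 s.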
RTParams (/rtparams/rt.p001.json)
{"R_env":[["9.5e8",0.18],["1.0e9",0.20],["1.05e9",0.19]],
"T_trans":[["9.5e8",0.77],["1.0e9",0.76],["1.05e9",0.78]],
"A_sigma":[["9.5e8",0.05],["1.0e9",0.04],["1.05e9",0.03]]}
CalibCref (/calib/c_ref.json)
{"gamma_ref_id":"p_ref","T_arr_ref_s":6.2000e-3,"n_eff_ref_hash":"99aa33bb",
"c_ref_est":2.99792458e8,"u_stat":5.0e3,"u_sys":1.0e3,
"env_block":{"temp_C":20.0,"clock":"UTC"}}
XI. Typical I/O Workflow Alignment (Template family)
The Template family is authoritative; engineering may add a “Template → I70-*” mapping.
A. Object → spectrum → propagation (E2E)
- I.Build.Catalog|Seeds|Trajectory → produce Catalog/Seeds/Trajectory
- I.Build.Phi|Neff → assemble Phi_T/grad_Phi_T/n_eff (optionally with SeaProfile)
- I.Path.Capture|Segment → { gamma[k], Δell[k] }, { ell_i }
- I.Arrival.Constant|General|Delta → T_arr/Delta_T_arr
- I.Report.Log|Emit → persist hashes, thresholds, falsification samples, replay entrypoints
B. Energy consistency & interface audit
- I.Interface.ApplyMatching (if coupled to SeaProfile/Sigma_env)
- I.RT.Estimate → { R_env, T_trans, A_sigma }
- I.Report.Log → residual curves & side-limit checks
C. Causation & triggers
- I.Build.Seeds|Triggers → sampling & registry
- I.Report.Log → priors, random seeds, parameter hashes
XII. Data Quality & Audit Checklist (pre-publish self-check)
- DimReport present; Δell / c_ref units consistent; metric_spec explicit.
- { ell_i } endpoints explicit in integrals; no cross-interface interpolation.
- eta_T, tau_switch, lower-bound and energy-consistency margins pass.
- Differential measurements reuse the same path discretization & corrections; out-of-band leakage recorded.
- Clamping rate logged; hash(*), SolverCfg, seed, and replay command present.
XIII. Security & Integrity
- Read-only mounts: recommend read-only for /contracts, /obs, /interfaces.
- Content hashing: hash file content only (excluding filename/timestamp) so that hashes are invariant across environments.
- Minimal metadata: logs retain only necessary indicators & hashes to avoid exposing sensitive path info.
- Integrity checks: write SHA-256 and file length for critical objects; re-verify on import.
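The integrity rule can be sketched with the standard library as follows. The record layout (keys sha256 and length) and the function names are assumptions for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

IntegrityRecord = Dict[str, Union[str, int]]


def integrity_record(data: bytes) -> IntegrityRecord:
    """SHA-256 digest and byte length of an object's content."""
    return {"sha256": hashlib.sha256(data).hexdigest(), "length": len(data)}


def verify(data: bytes, record: IntegrityRecord) -> bool:
    """Re-verify on import: both digest and length must match the stored record."""
    return integrity_record(data) == record


def verify_file(path: Union[str, Path], record: IntegrityRecord) -> bool:
    """Convenience wrapper that reads the file and verifies its bytes."""
    return verify(Path(path).read_bytes(), record)


empty_record = integrity_record(b"")
```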
XIV. Cross-Volume Alignment (data side)
- With Propagation.TensionPotential v1.0: two-form fields, Path/Field names & units.
- With Cosmo.LayeredSea v1.0: SeaProfile/Interfaces fields and tau_switch semantics.
- With Core.Metrology v1.0: units_spec/coords_spec/metric_spec/traceability.
- With Core.Errors v1.0: naming and reporting for u_stat/u_sys/u_c.
XV. Deliverables
- Data architecture compendium: schemas + exemplars for Contract/Catalog/Seeds/Trajectory/Field/SeaProfile/Interfaces/Path/Spectral/Observations/RTParams/CalibCref/Report.
- I/O contract boilerplates: I/O fields, units, requiredness, and error-semantics mapping (per Template family).
- Audit-bundle template: hash manifest, DimReport, SolverCfg, run logs, and falsification sample list.