Chapter 4 Sources & Ingest


I. Chapter Purpose & Scope

Fix the Sources & Ingest layer's specifications and engineering practices: connector types, credentials & security, idempotency/retry/checkpointing, dedup & dedupe keys, throughput & latency metrology, contract alignment (Σ_in/Σ_out), exception handling, and audit exports; ensure consistency with Dataset/Model Cards, the Metrology chapter, and citation anchors.

II. Terminology & Dependencies


III. Fields & Structure (Normative)

stage:
  name: "<src.kind.name>"
  type: "source.<s3|gcs|fs|db|kafka|http|custom>"
  impl: "I16-1.<impl_id>"
  params:
    endpoint: "<url-or-bootstrap>"
    bucket_or_db: "<bucket|db>"
    prefix_or_table: "<prefix|schema.table>"
    query_or_pattern: "<sql|glob>"
    credentials_ref: "secrets://path/to/credential"
    format: "<json|parquet|csv|avro|binary>"
    watermark:
      field: "<updated_at|offset|lsn>"
      start: "<ISO8601|offset>"
      step: "<PT5M|1000>"
    checkpoint:
      path: "s3://.../chk/<stage>"
      mode: "exactly-once|at-least-once"
    dedupe_key: ["<pk>", "<ts>"]
  outputs: ["raw_blob|raw_rows|events"]
  idempotent: true
  retries: {max: 3, backoff: "expo", jitter_ms: 200}
  timeout_s: 1800
  on_fail: "quarantine|skip|block"
  schema_ref: "<contracts/raw@vX.Y>"
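The retries field above names a maximum count, an "expo" backoff, and a jitter in milliseconds, but this chapter does not state the base delay or how jitter is applied. The following is a minimal sketch under assumed semantics: the delay doubles on each attempt starting from a hypothetical base_s of 1 second, and jitter is a uniform extra wait in [0, jitter_ms]. The function name backoff_delays and the base_s parameter are illustrative and are not part of the specification.

```python
import random


def backoff_delays(max_retries, base_s=1.0, jitter_ms=200, rng=None):
    """Return the wait in seconds before each retry attempt.

    Assumptions (not stated in this chapter): "expo" doubles the delay on
    each attempt starting from base_s, and jitter adds a uniform extra
    delay in [0, jitter_ms] milliseconds.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        jitter_s = rng.uniform(0, jitter_ms) / 1000.0
        delays.append(base_s * (2 ** attempt) + jitter_s)
    return delays


# Values taken from the stage template: retries: {max: 3, backoff: "expo", jitter_ms: 200}
retries = {"max": 3, "backoff": "expo", "jitter_ms": 200}
delays = backoff_delays(
    retries["max"], jitter_ms=retries["jitter_ms"], rng=random.Random(0)
)
print([round(d, 3) for d in delays])
```

Under these assumptions the total wait stays far below timeout_s: 1800, so the timeout would bound the work itself rather than the retry schedule.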


IV. Connector Types & Specifications


V. Idempotency, Retry & Checkpointing


VI. Dedup & Ordering Guarantees


VII. Metrology & Units (SI)


VIII. Security, Credentials & Compliance


IX. Machine-Readable Fragment (Drop-in)

layers:
  - name: "ingest"
    stages:
      - name: "src.s3.pull"
        type: "source.s3"
        impl: "I16-1.s3_pull"
        params:
          endpoint: "https://s3.amazonaws.com"
          bucket_or_db: "eift-data"
          prefix_or_table: "raw/2025/09/"
          query_or_pattern: "*.jsonl"
          credentials_ref: "secrets://aws/ingest_ro"
          format: "json"
          watermark: {field: "updated_at", start: "2025-09-01T00:00:00Z", step: "PT5M"}
          checkpoint: {path: "s3://eift-meta/chk/src.s3.pull", mode: "at-least-once"}
          dedupe_key: ["id", "updated_at"]
        outputs: ["raw_blob"]
        idempotent: true
        retries: {max: 3, backoff: "expo", jitter_ms: 200}
        timeout_s: 1800
        on_fail: "quarantine"
        schema_ref: "contracts/raw_json@v1.2"
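The fragment above pairs an at-least-once checkpoint with dedupe_key ["id", "updated_at"], which implies that redelivered records are expected and are dropped by key. A minimal sketch of that idea follows. The in-memory set and the sample records are assumptions for illustration; a real stage would presumably keep the seen keys in durable state alongside the checkpoint.

```python
def dedupe(records, key_fields):
    """Yield each record once per distinct key, keeping the first one seen."""
    seen = set()
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key in seen:
            continue  # duplicate delivery of an already-seen record
        seen.add(key)
        yield rec


# Hypothetical batch: the second record is a redelivery of the first,
# the third is a newer version of the same id and must be kept.
batch = [
    {"id": 1, "updated_at": "2025-09-01T00:00:00Z", "v": "a"},
    {"id": 1, "updated_at": "2025-09-01T00:00:00Z", "v": "a"},
    {"id": 1, "updated_at": "2025-09-01T00:05:00Z", "v": "b"},
]
unique = list(dedupe(batch, ["id", "updated_at"]))
print([r["v"] for r in unique])
```

Including updated_at in the key means a later version of the same id is treated as a distinct record rather than as a duplicate.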


X. Lint Rules (Excerpt, Normative)

lint_rules:
  - id: SRC.TYPE_ALLOWED
    when: "$.layers[*].stages[*].type"
    assert: "value in ['source.s3','source.gcs','source.fs','source.db','source.kafka','source.http','source.custom']"
    level: error
  - id: SRC.CREDENTIALS_REF
    when: "$.layers[*].stages[?(@.type^='source.')].params"
    assert: "has_key('credentials_ref') and not has_key('plain_secret')"
    level: error
  - id: SRC.CHECKPOINT_DEFINED
    when: "$.layers[*].stages[?(@.type^='source.')].params"
    assert: "has_key('checkpoint') and has_key('watermark')"
    level: error
  - id: SRC.DEDUPE_OR_EXACTLY_ONCE
    when: "$.layers[*].stages[?(@.type^='source.')]"
    assert: "has_key('params.dedupe_key') or $.params.checkpoint.mode == 'exactly-once'"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units=='SI' and check_dim==true"
    level: error
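The four SRC.* rules above can be read as plain checks over a parsed configuration. The sketch below hand-codes them in Python instead of evaluating the when/assert expressions, because the expression language (including the ^= prefix operator) is not defined in this chapter. The function name lint_sources and both sample configurations are hypothetical.

```python
ALLOWED_TYPES = {
    "source.s3", "source.gcs", "source.fs", "source.db",
    "source.kafka", "source.http", "source.custom",
}


def lint_sources(config):
    """Return (rule_id, stage_name) pairs for every violated SRC.* rule."""
    violations = []
    for layer in config.get("layers", []):
        for stage in layer.get("stages", []):
            name = stage.get("name", "<unnamed>")
            stype = stage.get("type", "")
            # As written, the SRC.TYPE_ALLOWED selector covers every stage.
            if stype not in ALLOWED_TYPES:
                violations.append(("SRC.TYPE_ALLOWED", name))
            if not stype.startswith("source."):
                continue  # the remaining rules select source stages only
            params = stage.get("params", {})
            if "credentials_ref" not in params or "plain_secret" in params:
                violations.append(("SRC.CREDENTIALS_REF", name))
            if "checkpoint" not in params or "watermark" not in params:
                violations.append(("SRC.CHECKPOINT_DEFINED", name))
            mode = params.get("checkpoint", {}).get("mode")
            if "dedupe_key" not in params and mode != "exactly-once":
                violations.append(("SRC.DEDUPE_OR_EXACTLY_ONCE", name))
    return violations


good = {"layers": [{"name": "ingest", "stages": [{
    "name": "src.s3.pull", "type": "source.s3",
    "params": {
        "credentials_ref": "secrets://aws/ingest_ro",
        "watermark": {"field": "updated_at"},
        "checkpoint": {"mode": "at-least-once"},
        "dedupe_key": ["id", "updated_at"],
    },
}]}]}

# Hypothetical failing stage: inline secret, no watermark, no dedupe key.
bad = {"layers": [{"name": "ingest", "stages": [{
    "name": "src.db.bad", "type": "source.db",
    "params": {"plain_secret": "not-a-real-secret",
               "checkpoint": {"mode": "at-least-once"}},
}]}]}

print(lint_sources(good))
print(lint_sources(bad))
```

The METROLOGY.SI_AND_CHECKDIM rule is left out because it targets the top-level metrology block rather than source stages.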


XI. Export Manifest & Audit Trail

export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "ingest/pulled.manifest.json", sha256: "..."}
    - {path: "ingest/checkpoint.meta.json", sha256: "..."}
    - {path: "security/audit.log", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
    - "EFT.WP.Data.DatasetCards v1.0:Ch.6"
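Each artifact entry above carries a path and a sha256 digest. One way the digests might be filled in is sketched below with Python's standard hashlib. The helper names sha256_file and build_manifest, the temporary directory, and the placeholder file contents are assumptions for illustration; the references list is omitted for brevity.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, rel_paths, version="v1.0"):
    """Build an export_manifest mapping for artifacts stored under root."""
    root = Path(root)
    return {
        "export_manifest": {
            "version": version,
            "artifacts": [
                {"path": p, "sha256": sha256_file(root / p)} for p in rel_paths
            ],
        }
    }


# Demo with a placeholder artifact written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "ingest" / "pulled.manifest.json"
    artifact.parent.mkdir(parents=True)
    artifact.write_bytes(b"{}")
    manifest = build_manifest(tmp, ["ingest/pulled.manifest.json"])

print(json.dumps(manifest, indent=2))
```

Recording the path relative to the export root keeps the manifest portable, since an auditor can recompute each digest wherever the export is unpacked.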


XII. Chapter Compliance Checklist