Chapter 10 Orchestration, Scheduling & Resources
I. Chapter Purpose & Scope
Fix pipeline orchestration, scheduling, and resource specifications: orchestrator backends & DAG submission, priority & preemption, triggers & dependencies, retry & timeouts, SLA/SLO & alerts, resource profiling & quotas, autoscaling & cost metrology; ensure alignment with data contracts, DQ gates, monitoring, and the Metrology chapter.
II. Terminology & Dependencies
- Terms: orchestrator, dag, queue, priority, preempt, retry, timeout_s, cron, event trigger, sla/slo, qos, requests/limits, autoscale, budget/cost.
- Dependencies: contracts/exports (Core.DataSpec v1.0); units/dimensions & performance metrology (Core.Metrology v1.0); quality gates (DatasetCards v1.0); evaluation protocol (ModelCards v1.0).
- Math & symbols: wrap inline symbols (e.g., QPS, T_inf, ρ, p99) in backticks; any division/integral/composite operator must use parentheses; no Chinese in formulas/symbols/definitions.
III. Fields & Structure (Normative)
orchestration:
  orchestrator: "airflow|argo|ray|custom"
  dag:
    max_concurrency: 128
    backfill: {enabled: true, window: "P7D"}
  dependencies:
    - {from: "validate.schema", to: "transform.normalize"}
    - {from: "transform.normalize", to: "feature.map"}
  triggers:
    cron: "5 * * * *"
    event: {source: "kafka", topic: "ds.ready", group: "pipeline-consumer"} # optional
scheduling:
  queue: "high|default|low"
  priority: 5
  preempt: true
  retries: {max: 3, backoff: "expo", jitter_ms: 200}
  timeout_s: 3600
  sla:
    latency_ms: {p50: 5000, p95: 15000, p99: 30000}
    availability: 0.999
    error_rate: 0.01
  alert_rules:
    - {name: "sla_breach_p99", rule: "latency_ms.p99>30000 for 10m", severity: "high"}
resources:
  requests: {cpu: 4, mem_gb: 16, gpu: 0}
  limits: {cpu: 8, mem_gb: 32, gpu: 0}
  disk_gb: 200
  net_mbps: 800
  qos: "burstable|guaranteed|best-effort"
autoscale:
  enabled: true
  policy:
    metric: "qps|latency_ms.p95|cpu|custom"
    target: 0.7
    min_replicas: 2
    max_replicas: 64
    cooldown_s: 120
cost:
  budget:
    currency: "USD"
    monthly_cap: 2000
  pricing_refs:
    compute: "pricing/compute@v1.0"
    storage: "pricing/storage@v1.0"
    egress: "pricing/egress@v1.0"
metrology:
  units: "SI"
  check_dim: true
IV. Orchestrator Backend & Submission
- Backend choice: airflow|argo|ray|custom; declare concurrency caps, queues, and worker topology; support backfill with idempotency for historical reruns.
- DAG dependencies: list order and fan-out/fan-in in dependencies[]; cycles require lint exceptions and explicit convergence conditions.
- Triggers: support scheduled cron and event triggers; for events, declare source, topic, and consumer group.
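The dependency rules above can be checked mechanically. The following is a minimal sketch, assuming `dependencies[]` has been loaded into a list of dicts with `from` and `to` keys; the function name `execution_order` is hypothetical and not part of this specification. It derives one valid execution order and rejects cycles that have not been whitelisted.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return one valid task order for a list of {"from": ..., "to": ...} edges."""
    sorter = TopologicalSorter()
    for edge in dependencies:
        # "to" runs after "from", so "from" is registered as its predecessor.
        sorter.add(edge["to"], edge["from"])
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # The second element of exc.args lists the nodes that form the cycle.
        raise ValueError(f"cyclic dependencies: {exc.args[1]}") from exc


deps = [
    {"from": "validate.schema", "to": "transform.normalize"},
    {"from": "transform.normalize", "to": "feature.map"},
]
order = execution_order(deps)
```

A cyclic edge list raises `ValueError`, which is the point at which a lint exception and an explicit convergence condition would have to be supplied.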
V. Scheduling Strategy & Failure Semantics
- Queues & priority: queue/priority govern resource allocation; preempt enables preemption of lower-priority work.
- Retry & timeouts: exponential backoff + jitter; explicit timeout_s; classify errors (retryable/non-retryable/escalate).
- SLA/SLO: fix {latency_ms:{p50,p95,p99}, availability, error_rate}; SLA breaches trigger alert_rules and degradation policies.
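The retry policy can be illustrated with a short delay calculation. This is a sketch under stated assumptions: `retries.max` and `jitter_ms` come from the schema, while `base_ms` and `cap_ms` are hypothetical parameters that this chapter does not define, and uniform jitter is only one possible choice.

```python
import random


def retry_delays_ms(max_retries, base_ms=1000, jitter_ms=200, cap_ms=60_000, rng=None):
    """Delay before each retry: base_ms * 2**attempt, capped, plus uniform jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Exponential growth, bounded by cap_ms (hypothetical upper bound).
        backoff = min(cap_ms, base_ms * 2 ** attempt)
        # Jitter in [0, jitter_ms] spreads out retries from concurrent tasks.
        delays.append(backoff + rng.uniform(0, jitter_ms))
    return delays


# retries: {max: 3, backoff: "expo", jitter_ms: 200}
delays = retry_delays_ms(max_retries=3, jitter_ms=200, rng=random.Random(0))
```

The sum of the delays plus the task run time should stay below `timeout_s`; otherwise the timeout fires before the last retry is attempted.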
VI. Resource Profiling & Quotas
- Requests/limits: explicit CPU/Memory/GPU; disk and network bandwidth declared; qos indicates scheduler policy.
- Isolation & affinity: node selectors/affinity/taints can be specified at implementation level (Ixx-?); cross-AZ placement must include bandwidth and fault-domain posture.
- Metrics (SI-unified): `QPS` (1/s), `T_inf` (ms), `ρ` (dimensionless), net_mbps, size_bytes.
VII. Autoscaling & Elasticity
- Policy: HPA/custom scaling by metric (qps, latency_ms.p95, cpu); adjust toward target; cooldown_s for dampening.
- Bounds: min_replicas/max_replicas define limits; persistent SLA breaches trigger warnings or throttling.
- Cost coupling: scaling honors budget.monthly_cap to avoid overspend.
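As an illustration of "adjust toward target", the sketch below applies the proportional rule documented for the Kubernetes HPA, `ceil(current * (observed / target))`, and clamps the result to the replica bounds. Whether a given backend uses exactly this rule is an assumption; the function names are hypothetical.

```python
import math


def desired_replicas(current, observed, target, min_replicas, max_replicas):
    """Proportional scaling: ceil(current * (observed / target)), clamped to bounds."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current * (observed / target))
    return max(min_replicas, min(max_replicas, raw))


def in_cooldown(now_s, last_scale_s, cooldown_s):
    """True while a new scaling action should still be suppressed."""
    return (now_s - last_scale_s) < cooldown_s


# policy: {target: 0.7, min_replicas: 2, max_replicas: 64, cooldown_s: 120}
scaled_up = desired_replicas(current=4, observed=0.9, target=0.7,
                             min_replicas=2, max_replicas=64)
```

With 4 replicas at an observed value of 0.9 against a target of 0.7, the rule asks for 6 replicas; a very low observed value is held at `min_replicas` rather than scaling to zero.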
VIII. Cost Metrology & Budgeting
- Budget: currency and monthly_cap.
- Pricing refs: pricing_refs.* point to versioned price sheets.
- Reporting: itemized compute/storage/egress costs must be exported.
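The budgeting rules can be sketched as an itemized calculation followed by a cap check and a CSV export. All usage quantities and unit prices below are hypothetical placeholders; real values would come from the versioned price sheets named in `pricing_refs`, whose format this chapter does not specify.

```python
import csv
import io

ITEMS = ("compute", "storage", "egress")


def itemized_cost(usage, prices):
    """Cost per item: usage quantity multiplied by unit price."""
    return {item: usage[item] * prices[item] for item in ITEMS}


def within_budget(costs, monthly_cap):
    """True when the sum of all items does not exceed budget.monthly_cap."""
    return sum(costs.values()) <= monthly_cap


def to_csv(costs, currency):
    """Render the itemized costs as CSV text for export."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["item", "cost", "currency"])
    for item in ITEMS:
        writer.writerow([item, costs[item], currency])
    return buf.getvalue()


# Hypothetical usage (cpu-hours, GB-months, GB) and unit prices in USD.
usage = {"compute": 3000, "storage": 2000, "egress": 400}
prices = {"compute": 0.5, "storage": 0.25, "egress": 0.125}
costs = itemized_cost(usage, prices)
report = to_csv(costs, "USD")
```

An autoscaler that honors the budget would consult `within_budget` before scaling out and fall back to warnings or throttling when the cap would be exceeded.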
IX. Metrology & Units (SI)
- Mandatory: metrology:{units:"SI", check_dim:true}; normalize units before any composition or conversion.
- Perf/resources: `QPS` (1/s), `T_inf` (ms), `ρ` (dimensionless), net_mbps, size_bytes, power_w (if applicable).
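A minimal sketch of "normalize first, then check dimensions" follows. The unit table and the `normalize` helper are hypothetical; the actual `check_dim` behavior is defined by Core.Metrology v1.0 and may differ.

```python
# unit -> (dimension, factor to the unit used in this chapter); hypothetical table.
UNITS = {
    "ms": ("time", 1.0),
    "s": ("time", 1000.0),
    "min": ("time", 60_000.0),
    "bytes": ("size", 1),
    "KiB": ("size", 1024),
    "MiB": ("size", 1024 ** 2),
}


def normalize(value, unit, expected_dim):
    """Convert value to the chapter's unit for expected_dim, rejecting mismatches."""
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit}")
    dim, factor = UNITS[unit]
    if dim != expected_dim:
        # This is the dimension check: a time cannot stand in for a size.
        raise ValueError(f"{unit} has dimension {dim}, expected {expected_dim}")
    return value * factor


t_inf_ms = normalize(1.5, "s", "time")        # latency expressed in ms
size_bytes = normalize(2, "MiB", "size")      # size expressed in bytes
```

Values are only combined after both operands have passed through `normalize`, so a comparison such as `latency_ms.p99>30000` never mixes seconds and milliseconds.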
X. Machine-Readable Fragment (Drop-in)
orchestration:
  orchestrator: "argo"
  dag: {max_concurrency: 256, backfill: {enabled: true, window: "P3D"}}
  dependencies:
    - {from: "validate.schema", to: "transform.normalize"}
    - {from: "transform.normalize", to: "feature.map"}
  triggers:
    cron: "5 * * * *"
scheduling:
  queue: "high"
  priority: 8
  preempt: true
  retries: {max: 3, backoff: "expo", jitter_ms: 200}
  timeout_s: 5400
  sla:
    latency_ms: {p50: 3000, p95: 10000, p99: 20000}
    availability: 0.999
    error_rate: 0.005
  alert_rules:
    - {name: "p99_breach", rule: "latency_ms.p99>20000 for 10m", severity: "high"}
resources:
  requests: {cpu: 8, mem_gb: 32, gpu: 0}
  limits: {cpu: 16, mem_gb: 64, gpu: 0}
  disk_gb: 500
  net_mbps: 1200
  qos: "guaranteed"
autoscale:
  enabled: true
  policy: {metric: "qps", target: 0.7, min_replicas: 4, max_replicas: 64, cooldown_s: 120}
cost:
  budget: {currency: "USD", monthly_cap: 5000}
  pricing_refs: {compute: "pricing/compute@v1.0", storage: "pricing/storage@v1.0", egress: "pricing/egress@v1.0"}
metrology: {units: "SI", check_dim: true}
XI. Lint Rules (Excerpt, Normative)
lint_rules:
  - id: ORCH.ORCHESTRATOR_ALLOWED
    when: "$.orchestration.orchestrator"
    assert: "value in ['airflow','argo','ray','custom']"
    level: error
  - id: SCHED.TIMEOUT_DEFINED
    when: "$.scheduling.timeout_s"
    assert: "is_number(value) and value > 0"
    level: error
  - id: SCHED.RETRIES_VALID
    when: "$.scheduling.retries"
    assert: "value.max >= 0 and value.backoff in ['expo','linear']"
    level: error
  - id: SLA.METRICS_DEFINED
    when: "$.scheduling.sla"
    assert: "has_keys(latency_ms, availability, error_rate)"
    level: error
  - id: RES.REQUESTS_LIMITS
    when: "$.resources"
    assert: "has_keys(requests, limits) and requests.cpu <= limits.cpu and requests.mem_gb <= limits.mem_gb"
    level: error
  - id: AUTOSCALE.BOUNDS
    when: "$.autoscale"
    assert: "value.enabled == false or (value.policy.min_replicas >= 1 and value.policy.max_replicas >= value.policy.min_replicas)"
    level: error
  - id: METROLOGY.SI_AND_CHECKDIM
    when: "$.metrology"
    assert: "units == 'SI' and check_dim == true"
    level: error
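The lint rules can be read as predicates over the parsed configuration. The sketch below restates them in Python for illustration only. It assumes the configuration is already a nested dict, that a rule is skipped when its `when` path is absent, and that a malformed section counts as a violation. The rule expression language itself is not defined in this chapter, so none of these helpers are normative.

```python
def get_path(cfg, path):
    """Resolve a dotted path such as 'scheduling.retries'; None if absent."""
    node = cfg
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)


RULES = [
    ("ORCH.ORCHESTRATOR_ALLOWED", "orchestration.orchestrator",
     lambda v: v in ("airflow", "argo", "ray", "custom")),
    ("SCHED.TIMEOUT_DEFINED", "scheduling.timeout_s",
     lambda v: is_number(v) and v > 0),
    ("SCHED.RETRIES_VALID", "scheduling.retries",
     lambda v: v["max"] >= 0 and v["backoff"] in ("expo", "linear")),
    ("SLA.METRICS_DEFINED", "scheduling.sla",
     lambda v: {"latency_ms", "availability", "error_rate"} <= v.keys()),
    ("RES.REQUESTS_LIMITS", "resources",
     lambda v: {"requests", "limits"} <= v.keys()
     and v["requests"]["cpu"] <= v["limits"]["cpu"]
     and v["requests"]["mem_gb"] <= v["limits"]["mem_gb"]),
    ("AUTOSCALE.BOUNDS", "autoscale",
     lambda v: v["enabled"] is False
     or (v["policy"]["min_replicas"] >= 1
         and v["policy"]["max_replicas"] >= v["policy"]["min_replicas"])),
    ("METROLOGY.SI_AND_CHECKDIM", "metrology",
     lambda v: v["units"] == "SI" and v["check_dim"] is True),
]


def lint(cfg):
    """Return the ids of violated rules, in rule order."""
    failed = []
    for rule_id, path, predicate in RULES:
        value = get_path(cfg, path)
        if value is None:
            continue  # `when` path absent: rule does not apply (assumption)
        try:
            ok = predicate(value)
        except (KeyError, TypeError, AttributeError):
            ok = False  # malformed section is treated as a violation
        if not ok:
            failed.append(rule_id)
    return failed


good = {
    "orchestration": {"orchestrator": "argo"},
    "scheduling": {
        "timeout_s": 5400,
        "retries": {"max": 3, "backoff": "expo", "jitter_ms": 200},
        "sla": {"latency_ms": {"p99": 20000}, "availability": 0.999, "error_rate": 0.005},
    },
    "resources": {"requests": {"cpu": 8, "mem_gb": 32}, "limits": {"cpu": 16, "mem_gb": 64}},
    "autoscale": {"enabled": True, "policy": {"min_replicas": 4, "max_replicas": 64}},
    "metrology": {"units": "SI", "check_dim": True},
}
violations = lint(good)
```

Since every rule is declared at `level: error`, a release gate built on this sketch would block on any non-empty result.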
XII. Export Manifest & Audit
export_manifest:
  version: "v1.0"
  artifacts:
    - {path: "orchestration/dag.yaml", sha256: "..."}
    - {path: "scheduling/policies.yaml", sha256: "..."}
    - {path: "resources/usage.report.csv", sha256: "..."}
    - {path: "autoscale/history.csv", sha256: "..."}
    - {path: "cost/monthly_report.csv", sha256: "..."}
  references:
    - "EFT.WP.Core.DataSpec v1.0:EXPORT"
    - "EFT.WP.Core.Metrology v1.0:check_dim"
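The `sha256` fields in the manifest can be filled in with the standard library. This is a sketch, assuming the artifacts exist as files under a common root directory; the helper names `sha256_of` and `build_artifacts` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_artifacts(root, relative_paths):
    """Build the artifacts[] entries for the export manifest."""
    return [
        {"path": rel, "sha256": sha256_of(Path(root) / rel)}
        for rel in relative_paths
    ]
```

An auditor can rerun `build_artifacts` over the exported files and compare the result with the published manifest to confirm that nothing changed after release.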
XIII. Chapter Compliance Checklist
- Orchestrator backend, DAG concurrency & dependencies, and triggers are complete; cyclic topologies whitelisted in lint with convergence conditions.
- Queue/priority/preemption, retry/timeout, SLA & alert rules are explicit and active.
- Resource requests/limits and qos are reasonable; autoscaling bounds & targets set and coordinated with budget.
- SI metrology with check_dim=true; consistent units for performance/resources.
- Export manifest lists DAG, scheduling policy, usage/autoscaling/cost reports with sha256; citation anchors complete, satisfying release gates.