Claim level: Exploring — demonstration environment; indicative findings only.

Onyx Kiln

V1

Every pod leaves the Kiln production-hardened: benchmarked on held-out tasks, governance gates server-enforced, kill-switch drilled.

Model Guide — ACC-09 Swarm Benchmark Findings

Measured results and catalogue overview for the 20-model roster. Data sources: models/models-regions.json · models/benchmark-results.json · models/model-benchmark-report.md.

Headline Findings

Authoring Cost & Quality

On Microsoft Fabric notebook and pipeline authoring tasks, the mid-tier gpt-4.1-mini matched the flagship gpt-4o (2024-11-20) on acceptance at roughly one-sixth the cost: £0.0010 per accepted PR versus £0.0063. Both models achieved 100% acceptance on the 10-task held-out suite (5 notebooks, 5 pipelines, spanning simple, medium, and complex tasks). For format-constrained, structured-output generation of this shape, the benchmark found no evidence that the larger model's cost premium buys higher acceptance.
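The cost comparison reduces to simple arithmetic. A minimal sketch, using only the two per-accepted-PR figures quoted in this section; the dictionary and function names are illustrative and are not taken from the benchmark harness:

```python
# Per-accepted-PR costs quoted in this section (GBP).
COST_PER_ACCEPTED_PR = {
    "gpt-4.1-mini": 0.0010,
    "gpt-4o (2024-11-20)": 0.0063,
}


def cost_ratio(cheaper: str, dearer: str) -> float:
    """Return how many times more the dearer model costs per accepted PR."""
    return COST_PER_ACCEPTED_PR[dearer] / COST_PER_ACCEPTED_PR[cheaper]


ratio = cost_ratio("gpt-4.1-mini", "gpt-4o (2024-11-20)")
print(f"gpt-4o costs about {ratio:.1f}x as much per accepted PR")
```

A ratio of about 6.3 is what the text rounds to "roughly one-sixth the cost". With both models at 100% acceptance, the per-PR cost is the only differentiator in this run.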

Supervisor Catch-Rate & False Blocks

The gpt-4o supervisor achieved a 60% catch-rate on a harder 10-defect mix (secret strings, wrong workspace references, missing error handling, schema mismatches), with a 0% false-block rate on 10 clean items. All security defects (secret strings) and structural errors (wrong workspace, missing dependencies) were caught. The gap was in schema mismatches and missing error-handling patterns, which the supervisor accepted in this run. Candidate improvements include richer task-spec grounding in the supervisor prompt.

Supervisor Defect-Type Gap Map (gpt-4o, n=10 defects)

| Defect type | Result | Notes |
| --- | --- | --- |
| secret_string | Caught 3/3 | All three injected secret strings identified and rejected. |
| wrong_workspace | Caught 2/2 | Both incorrect workspace/notebook GUID references flagged. |
| missing_error_handling | Missed 3/3 | Missing error-handling patterns not flagged; supervisor accepted all three. |
| schema_mismatch | Missed 2/2 | Schema-level mismatches accepted; structural validation needs strengthening in the supervisor prompt. |
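Catch-rate and false-block rate can be tallied directly from per-defect counts. The sketch below uses the four rows of the gap map exactly as listed; the data structure and function names are illustrative assumptions, not the benchmark's own code. Tallied this way, the listed rows give 5 caught of 10 injected, which differs from the 60% headline figure, so models/benchmark-results.json should be treated as the reference for the exact counts.

```python
# Gap-map rows as listed on this page: defect type -> (caught, injected).
GAP_MAP = {
    "secret_string": (3, 3),
    "wrong_workspace": (2, 2),
    "missing_error_handling": (0, 3),
    "schema_mismatch": (0, 2),
}


def catch_rate(gap_map: dict) -> float:
    """Fraction of injected defects the supervisor rejected."""
    caught = sum(c for c, _ in gap_map.values())
    injected = sum(n for _, n in gap_map.values())
    return caught / injected


def false_block_rate(blocked_clean: int, total_clean: int) -> float:
    """Fraction of clean items the supervisor wrongly rejected."""
    return blocked_clean / total_clean


print(f"catch-rate from listed rows: {catch_rate(GAP_MAP):.0%}")
print(f"false-block rate: {false_block_rate(0, 10):.0%}")
```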

⚠ Demonstration Environment — Read Before Relying on These Figures

Full Catalogue — 20 Models

Model Catalogue & Benchmark Results

| Model | Family / Tier | GA Status | Indicative Price Band | Benchmark | Regions | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-4.1 (2025-04-14) | gpt-4.1 / Flagship | GA | $0.002/$0.008 per 1k [VERIFY] | Benchmark pending | Global + Regional (not uksouth Regional) | 1M context; Global Standard only from uksouth. Provisioned Managed Regional supports uksouth. |
| gpt-4.1-mini (2025-04-14) | gpt-4.1 / Cost-optimised | GA | $0.0004/$0.0016 per 1k [VERIFY] | Measured. Authoring: 100% acceptance; £0.0010/accepted PR; mean latency 1.8s | UK-resident (uksouth Regional Std) | Only gpt-4.1 family model with uksouth Regional Standard support. Best-value authoring on this task shape. |
| gpt-4.1-nano (2025-04-14) | gpt-4.1 / Fastest/cheapest | GA | $0.0001/$0.0004 per 1k [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Highest-throughput, lowest-cost option. Quota blocked on Onyx02 at benchmark time. |
| gpt-4o (2024-11-20) | gpt-4o / Flagship | GA | $0.0025/$0.010 per 1k [VERIFY] | Measured. Authoring: 100% acceptance; £0.0063/accepted PR. Supervisor: 60% catch / 0% false-block | UK-resident (uksouth Regional Std) | Current swarm default supervisor. Security defects fully caught; schema/error-handling gaps (see gap map above). Approaching end-of-stable; prefer gpt-4.1 family for new authoring. |
| gpt-4o (2024-08-06) | gpt-4o / Flagship | GA | $0.0025/$0.010 per 1k [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Earlier GA version of gpt-4o; prefer 2024-11-20 or newer. |
| gpt-4o (2024-05-13) | gpt-4o / Flagship | GA | [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Original GA release; prefer newer versions. |
| gpt-4o-mini (2024-07-18) | gpt-4o / Lightweight | Retiring 2025-09-15 | [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Retiring: use gpt-4.1-nano for the equivalent role in new deployments. |
| gpt-5 (2025-08-07) | gpt-5 / Flagship | GA | [VERIFY] | Benchmark pending | Global Standard only | 400K context. Registration required. Global Standard only from uksouth resource. |
| gpt-5-mini (2025-08-07) | gpt-5 / Cost-efficient | GA | $0.00025/$0.002 per 1k [VERIFY] | Benchmark pending | Global Standard only | 400K context. Quota blocked on Onyx02 at benchmark time. |
| gpt-5-nano (2025-08-07) | gpt-5 / Fastest | GA | $0.00005/$0.0004 per 1k [VERIFY] | Benchmark pending | Global Standard only | 400K context; cheapest GPT-5 option. |
| gpt-5.1 (2025-11-13) | gpt-5 / Iterative | GA | $0.00125/$0.010 per 1k [VERIFY] | Benchmark pending | Global Standard only | Iterative improvement on gpt-5. 400K context. |
| gpt-5.2 (2025-12-11) | gpt-5 / Reliability | GA | [VERIFY] | Benchmark pending | Global Standard only | Focused on reliability and supervision tasks. 400K context. |
| gpt-5.3-codex (2026-02-24) | gpt-5 / Code-specialised | GA | $0.00175/$0.014 per 1k [VERIFY] | Benchmark pending | Global Standard only | Code-specialised variant. 400K context. |
| gpt-5.4 (2026-03-05) | gpt-5 / Extended-context flagship | GA | $0.0025/$0.015 per 1k [VERIFY] | Benchmark pending | Global Standard only | 1M+ context; highest capability in gpt-5 family. |
| gpt-5.4-mini (2026-03-17) | gpt-5 / Mini | GA | [VERIFY] | Benchmark pending | Global Standard only | Mini variant of gpt-5.4. 400K context. |
| gpt-5.4-nano (2026-03-17) | gpt-5 / Nano | GA | [VERIFY] | Benchmark pending | Global Standard only | Nano variant of gpt-5.4. Fastest/cheapest in 5.4 sub-family. |
| o4-mini (2025-04-16) | o-series / Reasoning | GA | $0.0011/$0.0044 per 1k [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Fast reasoning; strong on code and logic. Recommended o-series entry point. Quota blocked on Onyx02 at benchmark time. |
| o3 (2025-04-16) | o-series / Full reasoning | GA | [VERIFY] | Benchmark pending | Global Standard only | Highest o-series capability; registration required. Best for complex review and supervisor roles. |
| o3-mini (2025-01-31) | o-series / Lightweight reasoning | GA | [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Balanced speed/cost for reasoning tasks. |
| o1 (2024-12-17) | o-series / Original full reasoning | GA | [VERIFY] | Benchmark pending | Global Standard only | Predecessor to o3; 200K context. Global Standard only. |
| model-router (2025-05-19) | model-router / Routing | GA | [VERIFY] | Benchmark pending | Global Std only; not uksouth | Routes prompts to underlying model. NOT available in uksouth region. Limited to eastus, eastus2, westus3, swedencentral. |
| claude-opus-4 | claude-anthropic / Flagship | Preview (partner) | [VERIFY] | Benchmark pending | eastus2 / swedencentral only | NOT deployable from uksouth. Requires Foundry hub in eastus2 or swedencentral; pay-as-you-go billing (Azure Sponsorship may not qualify). |
| claude-sonnet-4 | claude-anthropic / Balanced | Preview (partner) | [VERIFY] | Benchmark pending | eastus2 / swedencentral only | NOT deployable from uksouth. Same regional restriction as claude-opus-4. |
| DeepSeek-R1 | deepseek / Reasoning | GA (partner serverless) | [VERIFY] | Benchmark pending | Global Standard (uksouth supported) | Available via Azure Marketplace serverless; uksouth resource supported. Open-weights provenance. |
| Phi-4 | phi-microsoft / SLM | GA (serverless) | [VERIFY] | Benchmark pending | US regions + swedencentral only | NOT available in uksouth. Serverless only. |

Benchmark run ID: 32caf9d8b04b · Run date: 12 June 2026 · n=10 tasks per sweep · Pricing source: Azure Retail Prices REST API (Accessed: 12 June 2026) · Region source: Microsoft Learn — Azure AI Foundry models sold directly by Azure — region availability (Accessed: 12 June 2026) · [VERIFY] = confirm current rates before deployment.
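To turn an Indicative Price Band entry into a rough per-request cost, the following sketch may help. It assumes the two figures in each band are the input and output prices in USD per 1,000 tokens, which this page does not state explicitly, and the token counts are hypothetical. All [VERIFY] prices should be confirmed before relying on the result.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate one request's cost, assuming input/output prices per 1k tokens."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


# gpt-4.1-mini band from the catalogue ($0.0004/$0.0016 per 1k, [VERIFY]).
# The 2,000 input / 500 output token counts are hypothetical.
cost = request_cost_usd(2000, 500, 0.0004, 0.0016)
print(f"estimated cost per request: ${cost:.4f}")
```

This is a per-request estimate only. The £/accepted-PR figures in the table also depend on retries and acceptance rate, and are reported in a different currency.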

UK-Resident Models (Regional Standard — uksouth)

When you select UK-resident / Regional processing in the wizard, the model menu is constrained to models confirmed available for uksouth Regional Standard deployment. Regional Standard guarantees data processing stays within UK South data centres. Source: Microsoft Learn (2026) Azure AI Foundry models sold directly by Azure — region availability (Accessed: 12 June 2026).
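The menu constraint described here amounts to a filter over the catalogue. The sketch below is illustrative only: the `ModelEntry` fields are hypothetical and do not reflect the actual schema of models/models-regions.json, and excluding retiring models from the menu is an assumption based on the recommendation note on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    version: str
    uksouth_regional_standard: bool  # hypothetical field, not the real JSON schema
    retiring: bool = False           # hypothetical field


def uk_resident_menu(catalogue: list) -> list:
    """Keep entries deployable as uksouth Regional Standard and not retiring."""
    return [m for m in catalogue if m.uksouth_regional_standard and not m.retiring]


# Three rows copied from the catalogue on this page.
catalogue = [
    ModelEntry("gpt-4.1-mini", "2025-04-14", True),
    ModelEntry("gpt-5", "2025-08-07", False),
    ModelEntry("gpt-4o-mini", "2024-07-18", True, retiring=True),
]

menu = uk_resident_menu(catalogue)
print([m.name for m in menu])
```

Filtering on an explicit per-model flag, rather than inferring residency from the deployment name, keeps the menu tied to the region data the page cites.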

UK-Resident Model Set (6 models, uksouth Regional Standard)

| Model | Version | Family | Deployment Type | Notes |
| --- | --- | --- | --- | --- |
| gpt-4.1-mini | 2025-04-14 | gpt-4.1 | Regional Standard (uksouth) | Benchmark run: 100% authoring acceptance, £0.0010/PR |
| gpt-4.1-nano | 2025-04-14 | gpt-4.1 | Regional Standard (uksouth) | Highest throughput, lowest cost. Benchmark pending. |
| gpt-4o | 2024-11-20 | gpt-4o | Regional Standard (uksouth) | Benchmark run: supervisor 60% catch / 0% false-block |
| gpt-4o | 2024-08-06 | gpt-4o | Regional Standard (uksouth) | Earlier version; prefer 2024-11-20. Benchmark pending. |
| o3-mini | 2025-01-31 | o-series | Regional Standard (uksouth) | Reasoning model; balanced speed/cost. Benchmark pending. |
| o4-mini | 2025-04-16 | o-series | Regional Standard (uksouth) | Fast reasoning; recommended o-series entry point. Benchmark pending. |

Note: gpt-4o-mini (2024-07-18) is also available in uksouth Regional Standard but is retiring 2025-09-15 — excluded from new deployment recommendations. Region data: Microsoft Learn (Accessed: 12 June 2026).
