Onyx Kiln — Model Guide

Model Guide — ACC-09 Swarm Benchmark Findings

Measured results and catalogue overview for the 20-model roster. Data sources: models/models-regions.json · models/benchmark-results.json · models/model-benchmark-report.md.

Headline Findings

Authoring Cost & Quality

On Microsoft Fabric notebook and pipeline authoring tasks, the mid-tier gpt-4.1-mini matched the flagship gpt-4o (2024-11-20) on acceptance at roughly one-sixth the cost: £0.0010 per accepted PR versus £0.0063. Both models achieved 100% acceptance on the 10-task held-out suite (5 notebooks, 5 pipelines, spanning simple, medium, and complex tasks). For format-constrained, structured-output generation of this shape, the benchmark shows no quality return on the larger model's cost premium.
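The one-sixth figure follows from the two per-PR costs. A minimal sketch of the arithmetic, assuming cost per accepted PR is total spend divided by the number of accepted PRs (the function name is illustrative and is not taken from the benchmark harness):

```python
def cost_per_accepted_pr(total_cost_gbp: float, accepted: int) -> float:
    """Total spend divided by accepted PRs (assumed definition, not confirmed by the report)."""
    if accepted <= 0:
        raise ValueError("cost per accepted PR is undefined with no accepted PRs")
    return total_cost_gbp / accepted


# Per-PR figures reported in this guide (GBP per accepted PR).
MINI_COST = 0.0010   # gpt-4.1-mini 2025-04-14
GPT4O_COST = 0.0063  # gpt-4o 2024-11-20

ratio = GPT4O_COST / MINI_COST
print(f"gpt-4o costs {ratio:.1f}x gpt-4.1-mini per accepted PR")
```

Because both models accepted all 10 tasks, the ratio of per-PR costs is also the ratio of total spend on the suite.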

Supervisor Catch-Rate & False Blocks

The gpt-4o supervisor achieved a 60% catch-rate on a harder 10-defect mix (secret strings, wrong workspace references, missing error handling, schema mismatches) with a 0% false-block rate on 10 clean items. Security defects (secret strings) and structural errors (wrong workspace, missing dependencies) were all caught. The gap was in schema mismatches and missing error-handling patterns, which were missed in this run. Candidate improvements include richer task-spec grounding in the supervisor prompt.

Supervisor Defect-Type Gap Map (gpt-4o, n=10 defects)

Defect Type	Outcome	Count	Notes
secret_string	Caught	3/3	All three injected secret strings identified and rejected.
wrong_workspace	Caught	2/2	Both incorrect workspace/notebook GUID references flagged.
missing_error_handling	Missed	3/3	Missing error-handling patterns not flagged; supervisor accepted all three.
schema_mismatch	Missed	2/2	Schema-level mismatches accepted; structural validation needs strengthening in the supervisor prompt.
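The two supervisor metrics can be expressed as simple ratios. The sketch below assumes catch-rate means caught defects divided by injected defects, and false-block rate means clean items blocked divided by clean items reviewed. The class and function names are hypothetical; the counts are the ones listed in the gap map above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectTally:
    defect_type: str
    caught: int
    injected: int


# Counts as listed in the gap map (gpt-4o supervisor, n=10 defects).
GAP_MAP = [
    DefectTally("secret_string", 3, 3),
    DefectTally("wrong_workspace", 2, 2),
    DefectTally("missing_error_handling", 0, 3),
    DefectTally("schema_mismatch", 0, 2),
]


def catch_rate(caught: int, injected: int) -> float:
    """Share of injected defects that the supervisor rejected."""
    if injected <= 0:
        raise ValueError("at least one injected defect is required")
    return caught / injected


def false_block_rate(blocked_clean: int, clean_total: int) -> float:
    """Share of clean items that the supervisor wrongly rejected."""
    if clean_total <= 0:
        raise ValueError("at least one clean item is required")
    return blocked_clean / clean_total


per_type = {t.defect_type: catch_rate(t.caught, t.injected) for t in GAP_MAP}
for name, rate in per_type.items():
    print(f"{name:24s} {rate:.0%}")
print(f"false-block rate: {false_block_rate(0, 10):.0%}")
```

Summing the rows as listed gives 5 of 10 defects caught, so the headline catch-rate is worth reconciling against models/benchmark-results.json.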

⚠ Demonstration Environment — Read Before Relying on These Figures

Demonstration environment only. This benchmark ran against synthetic, internal Fabric authoring tasks (Gate 5 — no client data). Results are directional only and should not be taken as production performance guarantees.
Small sample (n=10). Both authoring and supervisor sweeps used a 10-task held-out suite. Variance is high; confidence intervals are wide. Rankings are directional, not definitive.
Two models measured so far. gpt-4.1-mini (authoring) and gpt-4o 2024-11-20 (authoring + supervisor) were the only models with live quota on the Onyx02 Azure Sponsorship subscription at benchmark time. All other catalogue models show "Benchmark pending" due to InsufficientQuota on the Sponsorship subscription.
Remaining models pending quota access. Benchmarks for gpt-4.1, gpt-4.1-nano, gpt-5 family, o4-mini and others will be added as quota is obtained. The table below is updated as new results come in.
Pricing [VERIFY]. All indicative price bands are drawn from the Azure Retail Prices REST API (productName = 'Azure OpenAI' / 'Azure OpenAI GPT5' / 'Azure OpenAI Reasoning'; Accessed: 12 June 2026). Rates are pay-as-you-go GlobalStandard tier, with USD converted to GBP at 0.79. Verify current rates before relying on them. Source: Microsoft Azure (2026) Azure OpenAI Service pricing. Microsoft Corporation. Available at: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/ (Accessed: 12 June 2026).
Region availability [VERIFY]. Regional Standard model availability in uksouth drawn from Microsoft Learn (2026) Azure AI Foundry models sold directly by Azure — region availability. Microsoft Corporation. Available at: https://learn.microsoft.com/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure-region-availability (Accessed: 12 June 2026). Region lists may change; verify before deployment.
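To make the small-sample caveat concrete, the sketch below computes a 95% Wilson score interval for a 10-out-of-10 result. The report does not say which interval method, if any, was used; the Wilson interval is chosen here only as a common option for small n, and the function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(10, 10)
print(f"10/10 accepted: 95% interval {low:.2f} to {high:.2f}")
```

Under this method a perfect score on 10 tasks is still compatible with a true acceptance rate in the low 70s percent, which is why the rankings are described as directional.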

Full Catalogue — 20 Models

Model Catalogue & Benchmark Results

Model	Family / Tier	GA Status	Indicative Price Band (USD, input/output per 1k tokens)	Benchmark	Regions	Notes
gpt-4.1 2025-04-14	gpt-4.1 / Flagship	GA	$0.002/$0.008 per 1k [VERIFY]	Benchmark pending	Global + Regional (not uksouth Regional)	1M context; Global Standard only from uksouth. Provisioned Managed Regional supports uksouth.
gpt-4.1-mini 2025-04-14	gpt-4.1 / Cost-optimised	GA	$0.0004/$0.0016 per 1k [VERIFY]	Measured. Authoring: 100% acceptance; £0.0010/accepted PR; mean latency 1.8s	UK-resident (uksouth Regional Std)	One of two gpt-4.1 family models with uksouth Regional Standard support (the other is gpt-4.1-nano). Best-value authoring on this task shape.
gpt-4.1-nano 2025-04-14	gpt-4.1 / Fastest/cheapest	GA	$0.0001/$0.0004 per 1k [VERIFY]	Benchmark pending	UK-resident (uksouth Regional Std)	Highest-throughput, lowest-cost option. Quota blocked on Onyx02 at benchmark time.
gpt-4o 2024-11-20	gpt-4o / Flagship	GA	$0.0025/$0.010 per 1k [VERIFY]	Measured. Authoring: 100% acceptance; £0.0063/accepted PR. Supervisor: 60% catch / 0% false-block	UK-resident (uksouth Regional Std)	Current swarm default supervisor. Security defects fully caught; schema/error-handling gaps (see gap map above). Approaching end-of-stable; prefer gpt-4.1 family for new authoring.
gpt-4o 2024-08-06	gpt-4o / Flagship	GA	$0.0025/$0.010 per 1k [VERIFY]	Benchmark pending	UK-resident (uksouth Regional Std)	Earlier GA version of gpt-4o; prefer 2024-11-20 or newer.
gpt-4o 2024-05-13	gpt-4o / Flagship	GA	[VERIFY]	Benchmark pending	UK-resident (uksouth Regional Std)	Original GA release; prefer newer versions.
gpt-4o-mini 2024-07-18	gpt-4o / Lightweight	Retiring 2025-09-15	[VERIFY]	Benchmark pending	UK-resident (uksouth Regional Std)	Retiring — use gpt-4.1-nano for equivalent role in new deployments.
gpt-5 2025-08-07	gpt-5 / Flagship	GA	[VERIFY]	Benchmark pending	Global Standard only	400K context. Registration required. Global Standard only from uksouth resource.
gpt-5-mini 2025-08-07	gpt-5 / Cost-efficient	GA	$0.00025/$0.002 per 1k [VERIFY]	Benchmark pending	Global Standard only	400K context. Quota blocked on Onyx02 at benchmark time.
gpt-5-nano 2025-08-07	gpt-5 / Fastest	GA	$0.00005/$0.0004 per 1k [VERIFY]	Benchmark pending	Global Standard only	400K context; cheapest GPT-5 option.
gpt-5.1 2025-11-13	gpt-5 / Iterative	GA	$0.00125/$0.010 per 1k [VERIFY]	Benchmark pending	Global Standard only	Iterative improvement on gpt-5. 400K context.
gpt-5.2 2025-12-11	gpt-5 / Reliability	GA	[VERIFY]	Benchmark pending	Global Standard only	Focused on reliability and supervision tasks. 400K context.
gpt-5.3-codex 2026-02-24	gpt-5 / Code-specialised	GA	$0.00175/$0.014 per 1k [VERIFY]	Benchmark pending	Global Standard only	Code-specialised variant. 400K context.
gpt-5.4 2026-03-05	gpt-5 / Extended-context flagship	GA	$0.0025/$0.015 per 1k [VERIFY]	Benchmark pending	Global Standard only	1M+ context; highest capability in gpt-5 family.
gpt-5.4-mini 2026-03-17	gpt-5 / Mini	GA	[VERIFY]	Benchmark pending	Global Standard only	Mini variant of gpt-5.4. 400K context.
gpt-5.4-nano 2026-03-17	gpt-5 / Nano	GA	[VERIFY]	Benchmark pending	Global Standard only	Nano variant of gpt-5.4. Fastest/cheapest in 5.4 sub-family.
o4-mini 2025-04-16	o-series / Reasoning	GA	$0.0011/$0.0044 per 1k [VERIFY]	Benchmark pending	UK-resident (uksouth Regional Std)	Fast reasoning; strong on code and logic. Recommended o-series entry point. Quota blocked on Onyx02 at benchmark time.
o3 2025-04-16	o-series / Full reasoning	GA	[VERIFY]	Benchmark pending	Global Standard only	Highest o-series capability; registration required. Best for complex review and supervisor roles.
o3-mini 2025-01-31	o-series / Lightweight reasoning	GA	[VERIFY]	Benchmark pending	UK-resident (uksouth Regional Std)	Balanced speed/cost for reasoning tasks.
o1 2024-12-17	o-series / Original full reasoning	GA	[VERIFY]	Benchmark pending	Global Standard only	Predecessor to o3; 200K context. Global Standard only.
model-router 2025-05-19	model-router / Routing	GA	[VERIFY]	Benchmark pending	Global Std only — not uksouth	Routes prompts to underlying model. NOT available in uksouth region. Limited to eastus, eastus2, westus3, swedencentral.
claude-opus-4	claude-anthropic / Flagship	Preview (partner)	[VERIFY]	Benchmark pending	eastus2 / swedencentral only	NOT deployable from uksouth. Requires Foundry hub in eastus2 or swedencentral; pay-as-you-go billing (Azure Sponsorship may not qualify).
claude-sonnet-4	claude-anthropic / Balanced	Preview (partner)	[VERIFY]	Benchmark pending	eastus2 / swedencentral only	NOT deployable from uksouth. Same regional restriction as claude-opus-4.
DeepSeek-R1	deepseek / Reasoning	GA (partner serverless)	[VERIFY]	Benchmark pending	Global Standard (uksouth supported)	Available via Azure Marketplace serverless; uksouth resource supported. Open-weights provenance.
Phi-4	phi-microsoft / SLM	GA (serverless)	[VERIFY]	Benchmark pending	US regions + swedencentral only	NOT available in uksouth. Serverless only.

Benchmark run ID: 32caf9d8b04b · Run date: 12 June 2026 · n=10 tasks per sweep · Pricing source: Azure Retail Prices REST API (Accessed: 12 June 2026) · Region source: Microsoft Learn — Azure AI Foundry models sold directly by Azure — region availability (Accessed: 12 June 2026) · [VERIFY] = confirm current rates before deployment.
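The price bands in the catalogue can be turned into a per-request estimate in GBP. The sketch below assumes each band reads as input price / output price in USD per 1k tokens and applies the 0.79 USD-to-GBP rate stated in this guide. The function name and the token counts are hypothetical and are not taken from the benchmark run.

```python
USD_TO_GBP = 0.79  # conversion rate stated in this guide


def request_cost_gbp(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_1k: float,
    output_usd_per_1k: float,
    usd_to_gbp: float = USD_TO_GBP,
) -> float:
    """Estimate one request's cost in GBP from per-1k-token USD prices."""
    usd = (input_tokens / 1000) * input_usd_per_1k \
        + (output_tokens / 1000) * output_usd_per_1k
    return usd * usd_to_gbp


# gpt-4.1-mini band from the catalogue: $0.0004/$0.0016 per 1k [VERIFY].
# Token counts below are made up for illustration.
cost = request_cost_gbp(2_000, 500, 0.0004, 0.0016)
print(f"Estimated cost: £{cost:.6f}")
```

Real per-PR cost also depends on retries and any supervisor calls, which this estimate leaves out.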

UK-Resident Models (Regional Standard — uksouth)

When you select UK-resident / Regional processing in the wizard, the model menu is constrained to models confirmed available in uksouth Regional Standard deployment. Regional Standard guarantees data processing stays within UK South data centres. Source: Microsoft Learn (2026) Azure AI Foundry models sold directly by Azure — region availability. Available at: https://learn.microsoft.com/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure-region-availability (Accessed: 12 June 2026).

UK-Resident Model Set (6 models, uksouth Regional Standard)

Model	Version	Family	Deployment Type	Notes
gpt-4.1-mini	2025-04-14	gpt-4.1	Regional Standard — uksouth	Benchmark run: 100% authoring acceptance, £0.0010/PR
gpt-4.1-nano	2025-04-14	gpt-4.1	Regional Standard — uksouth	Highest throughput, lowest cost. Benchmark pending.
gpt-4o	2024-11-20	gpt-4o	Regional Standard — uksouth	Benchmark run: supervisor 60% catch / 0% false-block
gpt-4o	2024-08-06	gpt-4o	Regional Standard — uksouth	Earlier version; prefer 2024-11-20. Benchmark pending.
o3-mini	2025-01-31	o-series	Regional Standard — uksouth	Reasoning model; balanced speed/cost. Benchmark pending.
o4-mini	2025-04-16	o-series	Regional Standard — uksouth	Fast reasoning; recommended o-series entry point. Benchmark pending.

Note: gpt-4o-mini (2024-07-18) is also available in uksouth Regional Standard but is retiring 2025-09-15 — excluded from new deployment recommendations. Region data: Microsoft Learn (Accessed: 12 June 2026).
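The menu constraint described in this section amounts to a filter over the catalogue. The sketch below is one way it could look; the record shape, field names, and function name are assumptions for illustration and do not reflect the actual schema of models/models-regions.json or the wizard's code. The sample rows are a subset of the catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    model: str
    version: str
    uksouth_regional_standard: bool
    retiring: bool = False


# A few rows from the catalogue, encoded with the hypothetical fields above.
CATALOGUE = [
    CatalogueEntry("gpt-4.1", "2025-04-14", False),
    CatalogueEntry("gpt-4.1-mini", "2025-04-14", True),
    CatalogueEntry("gpt-4o-mini", "2024-07-18", True, retiring=True),
    CatalogueEntry("gpt-5", "2025-08-07", False),
    CatalogueEntry("o4-mini", "2025-04-16", True),
]


def uk_resident_menu(entries, include_retiring: bool = False):
    """Keep models with uksouth Regional Standard; drop retiring ones by default."""
    return [
        e for e in entries
        if e.uksouth_regional_standard and (include_retiring or not e.retiring)
    ]


menu = uk_resident_menu(CATALOGUE)
print([f"{e.model} {e.version}" for e in menu])
```

Excluding retiring models by default mirrors the note above about gpt-4o-mini; passing include_retiring=True would list it again.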
