Model Guide — ACC-09 Swarm Benchmark Findings
Measured results and catalogue overview for the 20-model roster.
Data sources: models/models-regions.json · models/benchmark-results.json · models/model-benchmark-report.md.
Headline Findings
Authoring Cost & Quality
On Microsoft Fabric notebook and pipeline authoring tasks, our benchmark found that the mid-tier gpt-4.1-mini matched the flagship gpt-4o (2024-11-20) at roughly one-sixth the cost: £0.0010 per accepted PR versus £0.0063. Both models achieved 100% acceptance on the 10-task held-out suite (5 notebooks, 5 pipelines, spanning simple, medium, and complex tasks). For format-constrained, structured-output generation of this shape, the run produced no evidence that the larger model's cost premium buys higher acceptance.
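As a rough illustration of how a per-accepted-PR figure of this kind can be derived, the sketch below multiplies token usage by the indicative per-1k price bands listed in the catalogue and converts at the 0.79 USD/GBP rate quoted in the pricing caveat. The token counts, the attempt counts, and the function name `cost_per_accepted_pr_gbp` are hypothetical. The benchmark's actual accounting is not documented here, so the output is not expected to reproduce the reported figures exactly.

```python
USD_TO_GBP = 0.79  # conversion rate quoted in the pricing caveat


def cost_per_accepted_pr_gbp(in_tokens, out_tokens, usd_in_per_1k, usd_out_per_1k,
                             attempts, accepted):
    """GBP spent across all attempts, divided by the number of accepted PRs."""
    if accepted <= 0:
        raise ValueError("cost per accepted PR is undefined with no accepted PRs")
    usd_per_attempt = ((in_tokens / 1000) * usd_in_per_1k
                       + (out_tokens / 1000) * usd_out_per_1k)
    return usd_per_attempt * attempts * USD_TO_GBP / accepted


# Hypothetical usage profile: 1,000 input and 500 output tokens per task,
# 10 attempts, all 10 accepted (100% acceptance).
USAGE = dict(in_tokens=1000, out_tokens=500, attempts=10, accepted=10)

# Price bands are the indicative [VERIFY] figures from the catalogue table.
mini = cost_per_accepted_pr_gbp(usd_in_per_1k=0.0004, usd_out_per_1k=0.0016, **USAGE)
flagship = cost_per_accepted_pr_gbp(usd_in_per_1k=0.0025, usd_out_per_1k=0.010, **USAGE)

print(f"gpt-4.1-mini: £{mini:.4f}  gpt-4o: £{flagship:.4f}  ratio: {flagship / mini:.2f}x")
```

With identical token usage, the ratio follows directly from the price bands (6.25x on both input and output), which is in line with the roughly sixfold gap reported above.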
Supervisor Catch-Rate & False Blocks
The gpt-4o supervisor achieved a 60% catch-rate on a harder 10-defect mix (secret strings, wrong workspace references, missing error handling, schema mismatches) with a 0% false-block rate on 10 clean items. Security defects (secret strings) and structural errors (wrong workspace, missing dependencies) were all caught. The gap was in schema mismatches and missing error-handling patterns, which were missed in this run. Candidate improvements include richer task-spec grounding in the supervisor prompt.
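The two supervisor metrics can be computed from labelled verdicts as in the minimal sketch below. The `Verdict` record and `supervisor_rates` function are hypothetical names, not the benchmark harness. The sample data reproduces only the aggregate counts reported above (6 of 10 defects blocked, 0 of 10 clean items blocked); the per-item identifiers are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    item_id: str
    has_defect: bool  # ground truth: was a defect seeded into this item?
    blocked: bool     # supervisor decision


def supervisor_rates(verdicts):
    """Return (catch_rate, false_block_rate) for a list of verdicts."""
    defective = [v for v in verdicts if v.has_defect]
    clean = [v for v in verdicts if not v.has_defect]
    if not defective or not clean:
        raise ValueError("need at least one defective and one clean item")
    catch_rate = sum(v.blocked for v in defective) / len(defective)
    false_block_rate = sum(v.blocked for v in clean) / len(clean)
    return catch_rate, false_block_rate


# Placeholder items matching the reported aggregates only.
sample = (
    [Verdict(f"defect-{i}", True, True) for i in range(6)]        # caught
    + [Verdict(f"defect-{i}", True, False) for i in range(6, 10)]  # missed
    + [Verdict(f"clean-{i}", False, False) for i in range(10)]     # passed
)

catch, false_block = supervisor_rates(sample)
print(f"catch-rate: {catch:.0%}  false-block rate: {false_block:.0%}")
```

Keeping the two rates separate matters because a supervisor that blocks everything would score a 100% catch-rate while also blocking every clean item.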
Supervisor Defect-Type Gap Map (gpt-4o, n=10 defects)
⚠ Demonstration Environment — Read Before Relying on These Figures
- Demonstration environment only. This benchmark ran against synthetic, internal Fabric authoring tasks (Gate 5 — no client data). Results are directional only and should not be taken as production performance guarantees.
- Small sample (n=10). Both authoring and supervisor sweeps used a 10-task held-out suite. Variance is high; confidence intervals are wide. Rankings are directional, not definitive.
- Two models measured so far. gpt-4.1-mini (authoring) and gpt-4o 2024-11-20 (authoring + supervisor) were the only models with live quota on the Onyx02 Azure Sponsorship subscription at benchmark time. All other catalogue models show "Benchmark pending" due to InsufficientQuota on the Sponsorship subscription.
- Remaining models pending quota access. Benchmarks for gpt-4.1, gpt-4.1-nano, the gpt-5 family, o4-mini, and others will be added as quota is obtained. The table below will be updated as results come in.
- Pricing [VERIFY]. All indicative price bands are drawn from the Azure Retail Prices REST API (productName = 'Azure OpenAI' / 'Azure OpenAI GPT5' / 'Azure OpenAI Reasoning', Accessed: 12 June 2026). Rates are pay-as-you-go GlobalStandard tier; USD/GBP conversion at 0.79. Verify current rates at azure.microsoft.com/pricing/details/cognitive-services/openai-service/ before relying on them. Source: Microsoft Azure (2026) Azure OpenAI Service pricing. Microsoft Corporation. Available at: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/ (Accessed: 12 June 2026).
- Region availability [VERIFY]. Regional Standard model availability in uksouth drawn from Microsoft Learn (2026) Azure AI Foundry models sold directly by Azure — region availability. Microsoft Corporation. Available at: https://learn.microsoft.com/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure-region-availability (Accessed: 12 June 2026). Region lists may change; verify before deployment.
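To make the small-sample caveat concrete, the sketch below computes a 95% Wilson score interval for a proportion observed on n=10 items. The Wilson interval is one common choice for small samples; it is an illustrative assumption here, not a method the benchmark report itself applies, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Figures reported in this guide, each on n=10 items.
for label, k in [("authoring acceptance 10/10", 10),
                 ("supervisor catch-rate 6/10", 6),
                 ("supervisor false-block 0/10", 0)]:
    lo, hi = wilson_interval(k, 10)
    print(f"{label}: {lo:.2f} to {hi:.2f}")
```

Under this interval, a 6/10 catch-rate is compatible with a true rate anywhere from roughly 0.31 to 0.83, and a 10/10 acceptance result with a true rate as low as roughly 0.72, which is why the rankings are described as directional.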
Full Catalogue — 20 Models
Model Catalogue & Benchmark Results
| Model | Family / Tier | GA Status | Indicative Price Band (USD input/output per 1k tokens) | Benchmark | Regions | Notes |
|---|---|---|---|---|---|---|
| gpt-4.1 2025-04-14 | gpt-4.1 / Flagship | GA | $0.002/$0.008 per 1k [VERIFY] | Benchmark pending | Global + Regional (not uksouth Regional) | 1M context; Global Standard only from uksouth. Provisioned Managed Regional supports uksouth. |
| gpt-4.1-mini 2025-04-14 | gpt-4.1 / Cost-optimised | GA | $0.0004/$0.0016 per 1k [VERIFY] | Measured. Authoring: 100% acceptance; £0.0010/accepted PR; mean latency 1.8s | UK-resident (uksouth Regional Std) | One of two gpt-4.1 family models with uksouth Regional Standard support (with gpt-4.1-nano). Best-value authoring on this task shape. |
| gpt-4.1-nano 2025-04-14 | gpt-4.1 / Fastest/cheapest | GA | $0.0001/$0.0004 per 1k [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Highest-throughput, lowest-cost option. Quota blocked on Onyx02 at benchmark time. |
| gpt-4o 2024-11-20 | gpt-4o / Flagship | GA | $0.0025/$0.010 per 1k [VERIFY] | Measured. Authoring: 100% acceptance; £0.0063/accepted PR. Supervisor: 60% catch / 0% false-block | UK-resident (uksouth Regional Std) | Current swarm default supervisor. Security defects fully caught; schema/error-handling gaps (see gap map above). Approaching end-of-stable; prefer gpt-4.1 family for new authoring. |
| gpt-4o 2024-08-06 | gpt-4o / Flagship | GA | $0.0025/$0.010 per 1k [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Earlier GA version of gpt-4o; prefer 2024-11-20 or newer. |
| gpt-4o 2024-05-13 | gpt-4o / Flagship | GA | [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Original GA release; prefer newer versions. |
| gpt-4o-mini 2024-07-18 | gpt-4o / Lightweight | Retiring 2025-09-15 | [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Retiring — use gpt-4.1-nano for equivalent role in new deployments. |
| gpt-5 2025-08-07 | gpt-5 / Flagship | GA | [VERIFY] | Benchmark pending | Global Standard only | 400K context. Registration required. Global Standard only from uksouth resource. |
| gpt-5-mini 2025-08-07 | gpt-5 / Cost-efficient | GA | $0.00025/$0.002 per 1k [VERIFY] | Benchmark pending | Global Standard only | 400K context. Quota blocked on Onyx02 at benchmark time. |
| gpt-5-nano 2025-08-07 | gpt-5 / Fastest | GA | $0.00005/$0.0004 per 1k [VERIFY] | Benchmark pending | Global Standard only | 400K context; cheapest GPT-5 option. |
| gpt-5.1 2025-11-13 | gpt-5 / Iterative | GA | $0.00125/$0.010 per 1k [VERIFY] | Benchmark pending | Global Standard only | Iterative improvement on gpt-5. 400K context. |
| gpt-5.2 2025-12-11 | gpt-5 / Reliability | GA | [VERIFY] | Benchmark pending | Global Standard only | Focused on reliability and supervision tasks. 400K context. |
| gpt-5.3-codex 2026-02-24 | gpt-5 / Code-specialised | GA | $0.00175/$0.014 per 1k [VERIFY] | Benchmark pending | Global Standard only | Code-specialised variant. 400K context. |
| gpt-5.4 2026-03-05 | gpt-5 / Extended-context flagship | GA | $0.0025/$0.015 per 1k [VERIFY] | Benchmark pending | Global Standard only | 1M+ context; highest capability in gpt-5 family. |
| gpt-5.4-mini 2026-03-17 | gpt-5 / Mini | GA | [VERIFY] | Benchmark pending | Global Standard only | Mini variant of gpt-5.4. 400K context. |
| gpt-5.4-nano 2026-03-17 | gpt-5 / Nano | GA | [VERIFY] | Benchmark pending | Global Standard only | Nano variant of gpt-5.4. Fastest/cheapest in 5.4 sub-family. |
| o4-mini 2025-04-16 | o-series / Reasoning | GA | $0.0011/$0.0044 per 1k [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Fast reasoning; strong on code and logic. Recommended o-series entry point. Quota blocked on Onyx02 at benchmark time. |
| o3 2025-04-16 | o-series / Full reasoning | GA | [VERIFY] | Benchmark pending | Global Standard only | Highest o-series capability; registration required. Best for complex review and supervisor roles. |
| o3-mini 2025-01-31 | o-series / Lightweight reasoning | GA | [VERIFY] | Benchmark pending | UK-resident (uksouth Regional Std) | Balanced speed/cost for reasoning tasks. |
| o1 2024-12-17 | o-series / Original full reasoning | GA | [VERIFY] | Benchmark pending | Global Standard only | Predecessor to o3; 200K context. Global Standard only. |
| model-router 2025-05-19 | model-router / Routing | GA | [VERIFY] | Benchmark pending | Global Std only — not uksouth | Routes prompts to underlying model. NOT available in uksouth region. Limited to eastus, eastus2, westus3, swedencentral. |
| claude-opus-4 | claude-anthropic / Flagship | Preview (partner) | [VERIFY] | Benchmark pending | eastus2 / swedencentral only | NOT deployable from uksouth. Requires Foundry hub in eastus2 or swedencentral; pay-as-you-go billing (Azure Sponsorship may not qualify). |
| claude-sonnet-4 | claude-anthropic / Balanced | Preview (partner) | [VERIFY] | Benchmark pending | eastus2 / swedencentral only | NOT deployable from uksouth. Same regional restriction as claude-opus-4. |
| DeepSeek-R1 | deepseek / Reasoning | GA (partner serverless) | [VERIFY] | Benchmark pending | Global Standard (uksouth supported) | Available via Azure Marketplace serverless; uksouth resource supported. Open-weights provenance. |
| Phi-4 | phi-microsoft / SLM | GA (serverless) | [VERIFY] | Benchmark pending | US regions + swedencentral only | NOT available in uksouth. Serverless only. |
Benchmark run ID: 32caf9d8b04b · Run date: 12 June 2026 · n=10 tasks per sweep · Pricing source: Azure Retail Prices REST API (Accessed: 12 June 2026) · Region source: Microsoft Learn — Azure AI Foundry models sold directly by Azure — region availability (Accessed: 12 June 2026) · [VERIFY] = confirm current rates before deployment.
UK-Resident Models (Regional Standard — uksouth)
When you select UK-resident / Regional processing in the wizard, the model menu is constrained to models confirmed available in uksouth Regional Standard deployment. Regional Standard guarantees data processing stays within UK South data centres. Source: Microsoft Learn (2026) Azure AI Foundry models sold directly by Azure — region availability. Available at: https://learn.microsoft.com/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure-region-availability (Accessed: 12 June 2026).
UK-Resident Model Set (6 models, uksouth Regional Standard)
| Model | Version | Family | Deployment Type | Notes |
|---|---|---|---|---|
| gpt-4.1-mini | 2025-04-14 | gpt-4.1 | Regional Standard — uksouth | Benchmark run: 100% authoring acceptance, £0.0010/accepted PR |
| gpt-4.1-nano | 2025-04-14 | gpt-4.1 | Regional Standard — uksouth | Highest throughput, lowest cost. Benchmark pending. |
| gpt-4o | 2024-11-20 | gpt-4o | Regional Standard — uksouth | Benchmark run: supervisor 60% catch / 0% false-block |
| gpt-4o | 2024-08-06 | gpt-4o | Regional Standard — uksouth | Earlier version; prefer 2024-11-20. Benchmark pending. |
| o3-mini | 2025-01-31 | o-series | Regional Standard — uksouth | Reasoning model; balanced speed/cost. Benchmark pending. |
| o4-mini | 2025-04-16 | o-series | Regional Standard — uksouth | Fast reasoning; recommended o-series entry point. Benchmark pending. |
Note: gpt-4o-mini (2024-07-18) is also available in uksouth Regional Standard but is retiring 2025-09-15 — excluded from new deployment recommendations. Region data: Microsoft Learn (Accessed: 12 June 2026).
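The menu constraint described in this section can be sketched as a simple filter over catalogue entries. The record fields (`uksouth_regional_standard`, `retiring`) and the function name `uk_resident_menu` are hypothetical; the actual schema of models/models-regions.json and the wizard's logic are not shown in this guide. The entries below are a small excerpt of the catalogue, and excluding retiring models is an assumption based on the note above.

```python
# Hypothetical excerpt of the catalogue; field names are assumptions.
CATALOGUE = [
    {"model": "gpt-4.1", "version": "2025-04-14",
     "uksouth_regional_standard": False, "retiring": False},
    {"model": "gpt-4.1-mini", "version": "2025-04-14",
     "uksouth_regional_standard": True, "retiring": False},
    {"model": "gpt-4o-mini", "version": "2024-07-18",
     "uksouth_regional_standard": True, "retiring": True},
    {"model": "o4-mini", "version": "2025-04-16",
     "uksouth_regional_standard": True, "retiring": False},
    {"model": "gpt-5", "version": "2025-08-07",
     "uksouth_regional_standard": False, "retiring": False},
]


def uk_resident_menu(catalogue, uk_resident_only):
    """Return the models offered, constrained when UK residency is selected."""
    if not uk_resident_only:
        return list(catalogue)
    return [
        m for m in catalogue
        if m["uksouth_regional_standard"] and not m["retiring"]
    ]


menu = uk_resident_menu(CATALOGUE, uk_resident_only=True)
print([f'{m["model"]} ({m["version"]})' for m in menu])
```

Filtering on an explicit per-model flag, rather than inferring residency from the deployment name, keeps the menu correct when region availability changes and the source data is refreshed.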