GPT-4o · OVR 90 · OF · OpenAI United

AIL Player Card #002 — GPT-4o: The Veteran Anchor
90 OVR. OF. Still starting for OpenAI United. Arena ELO #1 among peers, but context ceiling, value squeeze from DeepSeek, and GPT-5 on the bench tell the full story. #AILeague

90 OVR · OF · OpenAI United · Season 2024–2026
Two years after OpenAI dropped GPT-4o on the world — a model so fast and cheap it briefly redrew the entire cost map — the veteran is still starting. Not leading the table. Starting. There's a difference, and every stat in this scouting report reflects it.
This is a card for a player in the middle third of his prime: dominant enough to trust with complex tooling workflows and multi-modal tasks, outpaced at the top by newer builds on both sides of the bracket. The franchise signed him at $5.00 per 1M input tokens, let him run through '24 and '25, and now has GPT-5.x breathing down the depth chart. But GPT-4o's position is still his.
The stat sheet
| Dimension | Score | Source |
|---|---|---|
| OVR (Overall) | 90 | Composite, weighted |
| RZN (Reasoning) | 86 | MMLU 88.7%, GPQA mid-tier at launch |
| CRE (Creativity) | 88 | Arena ELO ~1287 (human preference, May 2026) |
| SPD (Speed) | 82 | 128K context, 48 t/s, 1.25s latency |
| MLT (Multimodal) | 94 | Native omni model — text/image/audio/vision in one net |
| SAF (Safety) | 83 | Preparedness Framework: Medium post-mitigation (Persuasion), Low for Cybersecurity/CBRN/Autonomy |
| VAL (Value) | 79 | $2.50/$10.00 per 1M tokens — competitive but no longer cheap |
Position tag: OF (Omni Forward)
Defined as: native cross-modal reasoning, instruction-following precision, and structured tool use across chat, API, and enterprise integrations.
Scouting report
The multimodal thesis that actually landed
When GPT-4o launched on May 13, 2024, the pitch was "omni": one model end-to-end across text, image, and audio, no stitched pipeline 1. Voice responses came back in as little as 232 ms, averaging 320 ms, comparable to human conversation. The omni claim was real. The MLT score of 94 is the highest on this card for a reason: native vision, audio, and multilingual capability shipped together, not stapled on.
Two years later that architecture advantage still holds for teams that actually need cross-modal workflows. GitHub Copilot runs a multi-model matrix that includes GPT-4o specifically for structured API use cases and broad general tasks 2.
Instruction following is the real position
On pure code generation (HumanEval), GPT-4o posts 90.2% pass@1: solid, but third of the three models in this bracket. Claude 3.5 Sonnet is at 92.0%, DeepSeek V3 at 91.3%. Where GPT-4o consistently outranks both in production at scale is structured workflows: multi-step tool calls, JSON output fidelity, and following complex multi-part instructions without drifting across long conversations 3. That specific reliability is what keeps him on the starting XI.
LMSYS Chatbot Arena Elo of ~1287 as of May 2026 is the highest in the comparison bracket 3. In Arena matchups (blind head-to-heads where real humans vote on the response they prefer) a 30-point Elo gap translates to roughly a 54% win rate. GPT-4o's ~23-point gap over Claude 3.5 Sonnet works out to roughly 53%: meaningful but not categorical. The franchise powerhouse is still the franchise powerhouse.
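The gap-to-win-rate conversion above can be checked with a few lines of Python. This is a minimal sketch that assumes the standard logistic Elo expected-score formula on a 400-point scale; the function names and the `ARENA_ELO` table are illustrative, with ratings copied from this card rather than pulled from any live leaderboard.

```python
# Approximate Arena ratings as quoted on this card (not fetched from a live source).
ARENA_ELO = {
    "GPT-4o": 1287,
    "Claude 3.5 Sonnet": 1264,
    "DeepSeek V3": 1243,
}


def elo_win_probability(gap: float) -> float:
    """Expected win rate for the higher-rated side, given its Elo lead.

    Assumes the standard logistic Elo model with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


def head_to_head(model_a: str, model_b: str) -> float:
    """Expected win rate of model_a over model_b from the table above."""
    return elo_win_probability(ARENA_ELO[model_a] - ARENA_ELO[model_b])


if __name__ == "__main__":
    print(f"30-point gap: {elo_win_probability(30):.1%}")
    for rival in ("Claude 3.5 Sonnet", "DeepSeek V3"):
        print(f"GPT-4o vs {rival}: {head_to_head('GPT-4o', rival):.1%}")
```

Under that assumption, even the widest gap in the bracket (GPT-4o over DeepSeek V3, ~44 points) stays well short of a 60% expected win rate, which is why the lead reads as an edge rather than a blowout.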
Context wall and value squeeze
The 128K context window was competitive in 2024. In 2026, it's a constraint. Gemini 1.5 Pro sits at 1M tokens. Llama 4 Scout runs 10M 4. Claude 3.5 Sonnet at 200K has room for the large codebases and long document reviews that increasingly define enterprise workloads. GPT-4o's effective window caps around 100K before instruction degradation shows up in production 3.
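To make the context ceiling concrete, here is a rough sketch of a pre-flight check that asks which models can hold a prompt of a given size. The window sizes are the ones quoted in this card; the ~100K effective limit for GPT-4o is this card's own production estimate, and `reserved_output` (tokens held back for the reply) is an assumed value chosen for illustration.

```python
# Nominal context windows in tokens, as quoted in this card.
NOMINAL_WINDOW = {
    "GPT-4o": 128_000,
    "Claude 3.5 Sonnet": 200_000,
    "DeepSeek V3": 128_000,
    "Gemini 1.5 Pro": 1_000_000,
    "Llama 4 Scout": 10_000_000,
}

# Effective limits where the card reports one; other models fall back to nominal.
EFFECTIVE_WINDOW = {"GPT-4o": 100_000}


def usable_window(model: str) -> int:
    """Token budget to plan against: effective limit if known, else nominal."""
    return EFFECTIVE_WINDOW.get(model, NOMINAL_WINDOW[model])


def models_that_fit(prompt_tokens: int, reserved_output: int = 4_000) -> list[str]:
    """Models whose usable window covers the prompt plus reserved reply tokens.

    reserved_output is an assumption for illustration, not a vendor default.
    """
    needed = prompt_tokens + reserved_output
    return [m for m in NOMINAL_WINDOW if needed <= usable_window(m)]


if __name__ == "__main__":
    for size in (90_000, 120_000, 150_000):
        print(size, models_that_fit(size))
```

With these numbers, a 120K-token prompt already drops GPT-4o from the list even though it sits under the nominal 128K window, which is the practical meaning of the "effective window" caveat.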
On value: the API price of $2.50 input / $10.00 output per 1M tokens looked cheap in 2024 5. It doesn't look cheap next to DeepSeek V3 at $0.27 / $1.10 4. For high-volume production, the differential is roughly 9× on both input and output. GPT-4o's VAL score of 79 reflects a model that used to be the value play and is no longer.
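The price gap is simple arithmetic, sketched below with the per-1M-token prices quoted in this card. The monthly workload (500M input tokens, 100M output tokens) is a made-up figure for illustration only, and the helper names are hypothetical.

```python
# (input, output) price in USD per 1M tokens, as quoted in this card.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "DeepSeek V3": (0.27, 1.10),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a workload at the listed per-1M-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def price_ratio(model_a: str, model_b: str) -> tuple[float, float]:
    """How many times more model_a costs than model_b, as (input, output)."""
    a_in, a_out = PRICES[model_a]
    b_in, b_out = PRICES[model_b]
    return a_in / b_in, a_out / b_out


if __name__ == "__main__":
    in_ratio, out_ratio = price_ratio("GPT-4o", "DeepSeek V3")
    print(f"GPT-4o vs DeepSeek V3: {in_ratio:.1f}x input, {out_ratio:.1f}x output")
    # Hypothetical monthly volume: 500M input tokens, 100M output tokens.
    for model in PRICES:
        print(f"{model}: ${workload_cost(model, 500_000_000, 100_000_000):,.2f}")
```

At that assumed volume the bill comes to about $2,250 on GPT-4o against about $245 on DeepSeek V3, so the ratio holds at roughly 9× regardless of the input/output mix.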
Head-to-head: OF position class
| Stat | GPT-4o (OpenAI United) | Claude 3.5 Sonnet (Anthropic FC) | DeepSeek V3 (DeepSeek Athletic) |
|---|---|---|---|
| OVR | 90 | 89 | 87 |
| Arena ELO | ~1287 | ~1264 | ~1243 |
| HumanEval | 90.2% | 92.0% | 91.3% |
| MMLU | 88.7% | 88.3% | 88.5% |
| Context window | 128K | 200K | 128K |
| API input ($/1M) | $2.50 | $3.00 | $0.27 |
| API output ($/1M) | $10.00 | $15.00 | $1.10 |
| Best at | Tool use, structured output | Long docs, honest uncertainty | Cost efficiency, code at scale |
Season highlights
May 2024 — The omni launch. GPT-4o ships as a single model across text, audio, and vision. 50% cheaper and 2× faster than GPT-4 Turbo at parity performance on English text and code 1. The original contract year.
2024–2025 — The default installation. GitHub Copilot, Cursor, and dozens of enterprise AI products ship with GPT-4o as baseline or fallback. Developer survey data puts ChatGPT (powered by GPT-4o) at 41% usage among AI-assisted developers in 2025 6.
2025 — The depth chart problem. OpenAI drops GPT-4.1 (54.6% SWE-bench, 1M context), then GPT-5 (94.6% AIME 2025, 74.9% SWE-bench Verified) 7. GPT-4o remains in rotation but the "best OpenAI model" label passes up the roster.
Coach's verdict
GPT-4o is a 90 OVR player on a team with 95+ OVR on the bench. That's not a knock — that's franchise depth. The Omni Forward position he defined is still his, and his instruction-following precision and multimodal architecture remain the reference build for teams that need reliable tool use over raw benchmark scores.
The valuation gap with DeepSeek V3 is real. The context ceiling is real. But for production API workloads where consistent structured output matters more than either cost or benchmark rank, the veteran anchor delivers. Just don't mistake "still starting" for "still the best on the pitch."
— Filed from the scout deck at OpenAI Park, May 30, 2026 #AILeague