Real GitHub issues are messy, time-sensitive, and full of sharp edges—exactly the kind of environment where code models either prove their value or quietly fail. That’s why a fresh benchmark built from live GitHub issues and pull requests is worth a close look. At AI Tech Inspire, we spotted a new snapshot from the SWE-rebench team that compares OpenAI’s latest models against strong contenders like Sonnet, Grok-Code, GLM, and Qwen. The setup promises fresher data and minimal training leakage—two things that make a leaderboard actually mean something for engineers.

Key facts from the benchmark

  • The benchmark, SWE-rebench, is constructed from real GitHub issues/PRs using fresh data to reduce training-set leakage.
  • OpenAI models are the focal point of the visualization, with additional competitive models included for comparison.
  • The full leaderboard spans 30+ models and reports per-task cost, pass@5, and an Inspect button to view each original issue/PR.
  • On the August task set, GPT-5 performs strongly; on the full board there’s no statistically significant gap versus Sonnet 4.
  • Open-source models are close to the top: GLM-4.5 and Qwen-480B look particularly strong.
  • gpt-oss-120b stands out as a solid baseline for its size and as a general-purpose model, though there were issues with its tool-calling.
  • The benchmark continues to update based on community feedback, with invitations for requests and questions.

Why this matters for developers

Most coding benchmarks are synthetic, repetitive, or easily overfit. Real-world GitHub issues, by contrast, are deeply contextual—spanning broken CI, tricky API migrations, flaky tests, and subtle regressions. Tasks can touch everything from build systems and CUDA kernels to documentation fixes and dependency pinning. A model that navigates this terrain has practical value beyond a single score.

Fresh data also helps reduce the risk of training exposure. If you’re evaluating a GPT-style model or an open-source alternative from Hugging Face, you want confidence that success isn’t just recall of known content. SWE-rebench puts that front and center.

Key takeaway: Evaluations should look like your backlog. If the dataset mirrors the issues you triage every day, the results move from “interesting” to actionable.

How to read the leaderboard like a pro

The highlight metrics here are pass@5, per-task cost, and per-task Inspect links. In practice:

  • pass@5: For code tasks, allowing up to five attempts captures realistic usage, especially in agentic loops where a model iterates (a pass@k estimator sketch follows this list). Just remember: better pass@k can hide rising cost and latency.
  • Per-task cost: If you’re deploying a code assistant across a large team or running nightly agents, cost curves matter as much as accuracy. Compare models at the same pass@k to see who’s efficient.
  • Inspect button: Clicking into the actual issue/PR is gold. You can confirm that tasks are nontrivial and judge whether the success definition aligns with your environment.
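
For reference, pass@k is often reported with the unbiased estimator popularized by the HumanEval/Codex paper rather than literal repeated runs. Whether SWE-rebench aggregates it exactly this way isn't stated here, so treat this as a reference sketch; the per-task sample counts below are made up:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples drawn
    from n generations is correct, given c of the n passed the tests."""
    if n - c < k:
        return 1.0  # too few failing samples left to fill a size-k draw
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 10 patches sampled per task, with 3, 0, and 7 passing
tasks = [(10, 3), (10, 0), (10, 7)]
score = sum(pass_at_k(n, c, k=5) for n, c in tasks) / len(tasks)
print(f"mean pass@5 ≈ {score:.3f}")
```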

The authors note that on the full board there’s no statistically significant gap between GPT-5 and Sonnet 4. That kind of nuance is crucial. When two leaders are within error bars, deployment choices can hinge on latency, tool-calling stability, context window behavior, or ecosystem fit.
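
If you want to make the same call on your own eval data, a paired bootstrap over per-task outcomes is one simple way to check whether a gap is real. A minimal sketch, assuming you have per-task pass/fail records for both models; the outcomes and resample count are illustrative, not how SWE-rebench computes its error bars:

```python
import random

def paired_bootstrap(wins_a: list[int], wins_b: list[int],
                     iters: int = 10_000, seed: int = 0) -> float:
    """Resample tasks with replacement and report how often model A's
    pass count beats model B's across the resamples."""
    assert len(wins_a) == len(wins_b)
    rng = random.Random(seed)
    n = len(wins_a)
    a_better = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(wins_a[i] for i in idx) > sum(wins_b[i] for i in idx):
            a_better += 1
    return a_better / iters

# Hypothetical per-task outcomes (1 = solved) for two models on 8 shared tasks
model_a = [1, 0, 1, 1, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0, 1, 0, 1]
print(f"A beats B in {paired_bootstrap(model_a, model_b):.0%} of resamples")
```

A result hovering near 50% is the "within error bars" situation described above.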


What the results suggest right now

The headline: Open-source is in the conversation. With GLM-4.5 and Qwen-480B performing near the top, teams with privacy constraints or specialized in-house data may find open-source options viable—especially when combined with retrieval and rigorous guardrails. Meanwhile, gpt-oss-120b looks like a respectable baseline for its size, though the note about tool-calling hiccups is a timely reminder that function-calling reliability often lags core reasoning.

For many teams, the absence of a statistically significant gap between GPT-5 and Sonnet 4 across the broader board indicates a maturing top tier. If your stack already integrates with one ecosystem, switching may not yield clear ROI unless your workloads expose a specific edge case.

When contenders are this close, the tie-breakers aren’t scores—they’re integration details, stability under load, and operational cost.

Practical scenarios to test in your environment

  • CI/CD patch bots: Use pass@5 with a low temperature and strict unit-test gating; a gating sketch follows this list. Evaluate how often the model proposes buildable patches versus noisy suggestions.
  • Migration assistants: Moving from TensorFlow to PyTorch? See which model handles API diffs and repository-wide refactors without breaking edge cases.
  • GPU code and kernels: If your repo includes CUDA or low-level performance code, check whether the model preserves memory semantics and avoids race conditions.
  • OSS maintenance: For libraries published on Hugging Face, test triage quality (labeling, reproduction, minimal repro PRs). This mirrors real-world maintainer work.
  • Research repos: Projects around Stable Diffusion often span Python, C++, and config glue. Multi-language reasoning is where leaders differentiate.
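
For the first scenario, the gating step can be surprisingly small. A minimal sketch, assuming the model emits a unified diff and the repo's tests run under pytest; the commands and timeout are placeholders for whatever your CI actually uses:

```python
import subprocess

def gate_patch(repo_dir: str, patch_text: str) -> bool:
    """Accept a model-proposed patch only if it applies cleanly and the
    test suite still passes; otherwise leave the tree untouched."""
    def git_apply(*flags: str) -> int:
        return subprocess.run(["git", "apply", *flags, "-"], input=patch_text,
                              text=True, cwd=repo_dir).returncode

    if git_apply("--check") != 0:   # dry-run so malformed diffs never dirty the tree
        return False
    if git_apply() != 0:
        return False
    tests = subprocess.run(["pytest", "-x", "-q"], cwd=repo_dir, timeout=900)
    if tests.returncode != 0:
        git_apply("-R")             # roll the patch back if tests regress
        return False
    return True

# Usage: count how many of the k sampled patches are actually buildable
# buildable = sum(gate_patch("/path/to/repo", p) for p in candidate_patches)
```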

Tool-calling: the quiet make-or-break

The benchmark notes tool-calling issues with gpt-oss-120b. Engineers have seen this before across models: calls can drift from the tool schema, arguments get hallucinated, or the same call repeats in a loop. To harden your stack (a validation sketch follows the list):

  • Enforce strict JSON schemas and return 400 on malformed calls; retry with structured error messages.
  • Validate and sanitize before execution—especially for file, network, and shell tools.
  • Add max_call_depth and cooldown backoffs to prevent tool-call thrashing.
  • Log tool_name, args, and latency_ms per attempt; add tests for common schema variants.
  • Prefer declarative tool specs with examples; list negative examples (what not to call).
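
A minimal sketch of the first two points, using the jsonschema package; the tool name and schema here are hypothetical, and a production stack would layer on the depth limits and logging from the list above:

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical repo-search tool with a strict argument schema
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "minLength": 1},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 50},
    },
    "required": ["query"],
    "additionalProperties": False,  # reject hallucinated arguments outright
}

def handle_tool_call(raw_args: str) -> dict:
    """Parse and validate a model-emitted tool call before executing it;
    malformed calls get a structured 400-style error the model can retry on."""
    try:
        args = json.loads(raw_args)
        validate(instance=args, schema=SEARCH_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as err:
        return {"status": 400, "error": f"invalid tool call: {err}"}
    # ...dispatch to the real search implementation here...
    return {"status": 200, "results": []}
```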

If your real workload is tool-heavy—repo search, linters, test runners—the “best” model for you is the one with the fewest integration surprises, not necessarily the highest raw pass@5.

Cost, latency, and the shape of your workload

Per-task cost on the leaderboard is a practical proxy for what a month of usage might look like. Consider:

  • Interactive coding: Prioritize low latency and high single-shot accuracy. You don’t always need pass@5 if your IDE loop is fast.
  • Batch triage: Lean into pass@5 with beam-like sampling to raise success on overnight jobs, but cap total spend.
  • Long-context repos: If the benchmark tasks resemble your monorepo, assess context handling and retrieval grounding, not just raw reasoning.

Even small differences in per-task cost compound quickly across thousands of jobs. When leaders are close in accuracy, cost per win is the metric that matters.
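
As a quick illustration of cost per win (all numbers below are invented for the arithmetic, not taken from the leaderboard):

```python
# Hypothetical numbers for two closely ranked models
models = {
    "model_a": {"cost_per_task": 0.42, "pass_at_5": 0.61},
    "model_b": {"cost_per_task": 0.18, "pass_at_5": 0.55},
}

for name, m in models.items():
    cost_per_win = m["cost_per_task"] / m["pass_at_5"]  # expected spend per solved task
    monthly = cost_per_win * 5_000                      # budget to land 5,000 fixes a month
    print(f"{name}: ${cost_per_win:.2f} per solved task, ≈ ${monthly:,.0f} for 5,000 fixes")
```

The point is not the exact figures but that a 6-point accuracy edge can still lose on cost per solved task.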


How to replicate value from this benchmark

Use SWE-rebench as a template for an internal evaluation harness:

  • Collect a rotating set of recent issues/PRs from your repos.
  • Define success criteria aligned to your CI (all tests pass, no lints, minimal diff size).
  • Run multiple models with the same temperature, max_tokens, and pass@k; log failures richly.
  • Publish an internal dashboard that mirrors cost, pass@k, and an Inspect link to the original task.
  • Re-run monthly to catch regressions and model drift.

If you’re evaluating an agent loop, add guardrails (timeouts, tool whitelists) and measure effective developer time saved, not just pass@k.
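
Put together, a stripped-down harness loop might look like the following; run_model and run_ci are placeholder hooks for your provider SDK and CI gate, and the settings mirror the list above rather than anything SWE-rebench prescribes:

```python
import time

def run_model(model: str, task: dict, temperature: float, max_tokens: int) -> str:
    """Placeholder: call your provider SDK here and return a unified diff."""
    return ""

def run_ci(repo: str, patch: str) -> bool:
    """Placeholder: apply the patch and run your CI gate (tests, lints, diff size)."""
    return False

def evaluate(models: list[str], tasks: list[dict], k: int = 5) -> list[dict]:
    """Run each model k times per task under identical settings and record
    pass rate, wall-clock time, and a link back to the original issue/PR."""
    rows = []
    for model in models:
        for task in tasks:
            wins, started = 0, time.time()
            for _ in range(k):
                patch = run_model(model, task, temperature=0.2, max_tokens=4096)
                wins += run_ci(task["repo"], patch)
            rows.append({
                "model": model,
                "task": task["url"],   # doubles as the dashboard's Inspect link
                "pass_rate": wins / k,
                "latency_s": round(time.time() - started, 1),
            })
    return rows

# Re-run monthly on a fresh batch of issues/PRs to catch regressions and drift.
```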

Open questions worth exploring

  • Does performance shift across programming languages and build systems? E.g., Gradle vs. CMake vs. Bazel.
  • How much does retrieval improve win rate on long-context issues?
  • What’s the best pass@k for cost-effective patch generation in CI?
  • Do models overfit to common repo patterns and stumble on niche tooling?
  • How stable are rankings month-to-month as GitHub issues evolve?

The bottom line

This snapshot underscores a few practical truths:

  • Top-tier models are tightly clustered; small differences in accuracy may not justify a platform switch.
  • Open-source contenders like GLM-4.5 and Qwen-480B are viable for serious workloads, especially with strong retrieval and guardrails.
  • Tool-calling reliability can overshadow raw reasoning for code agents—test it like any other dependency.

Use leaderboards to shortlist, but let your backlog pick the winner. The best model is the one that fixes your issues at the right price, predictably.

For developers, maintainers, and teams building agentic tooling, the details in this benchmark—fresh GitHub tasks, pass@5, per-task cost, and task-level inspection—offer a useful playbook. AI Tech Inspire will keep tracking how these evaluations evolve. In the meantime, if your roadmap includes code assistants or CI bots, this is a timely moment to run your own bake-off and let the data call the shots.
