
What if an AI could reliably discover rules it has never seen before, explain its reasoning, and do it deterministically? At AI Tech Inspire, we spotted a new claim that aims squarely at that vision: a system called WD-AGI reporting strong results on ultra-hard reasoning benchmarks like ARC-AGI 2 and something dubbed the “True Detective” benchmark. The details are early, but the pitch is unusual enough to warrant a closer look, especially for developers and engineers hungry for more controllable, verifiable reasoning.
Key claims at a glance
- The team reports a system called WD-AGI that “discovers rules it’s never seen before,” spanning discrete and continuous domains, with hypothesis-driven, explainable, deterministic, goal-driven, and directive-driven behavior.
- Privately reported scores on ARC-AGI 2 training sets: consistently in the 40–50% range.
- Claimed “True Detective” benchmark score: 99.5%, compared to GPT-4 at ~38%.
- Stated ARC-AGI 2 test-day target: 70%+ (with a comparison claim to Grok at 16%).
- AMA planned; papers “coming to arXiv,” with early access/trials available via email inquiry.
- Founder background noted as JD/MBA (University of Chicago) with prior experience at Skadden; the approach is said to draw on ideas from biology and philosophy.
Key takeaway: the pitch centers on deterministic, explainable autonomous reasoning that can discover novel rules and solve benchmark puzzles—if independently verified, that’s a meaningful shift from purely stochastic LLM workflows.
What stands out about the approach
Most modern AI tools that developers handle day-to-day, from TensorFlow and PyTorch stacks to LLM-centric pipelines, lean heavily on probabilistic inference. That’s powerful for language and perception, but it often struggles with brittle, puzzle-like tasks where a system must infer consistent rules, not just likely tokens. The WD-AGI description puts emphasis on several properties many engineers have been asking for:
- Determinism: repeatable outputs make debugging sane. If a model hits or misses, you can instrument it.
- Explainability: hypothesis-driven steps create a trail: “Here’s the rule I think holds; here’s how I tested it.”
- Goal/directive-driven: sounds like a structured planner that can align actions to a task objective, potentially with explicit internal checks.
- Discrete + continuous domains: suggests the method is not confined to symbol manipulation or numeric optimization alone.
If true in practice, that combo would be useful anywhere developers need consistent, auditable reasoning rather than purely generative flair.
About those benchmarks
ARC-AGI 2: ARC (Abstraction and Reasoning Corpus) tasks are designed to test broad generalization through rule discovery. ARC-AGI 2 is a newer, reportedly tougher iteration. Training-set accuracy in the 40–50% band is noteworthy, but the real proof is in “test-day” performance—unseen tasks—where the team targets 70%+. That’s a high bar. The claim is juxtaposed with a cited Grok baseline at 16%, which should be interpreted cautiously until independent evaluations are public.
“True Detective” benchmark: Less standard, but the claim is eye-catching: 99.5% vs. ~38% for GPT-4. Without a public spec and replication, it’s hard to contextualize. Developers should ask about task format, leakage controls, time/computational budgets, and whether the benchmark captures generalizable reasoning or a narrower skill.
In short, the numbers are bold. The next step is verification—open tasks, reproducible runs, and apples-to-apples comparisons with fixed seeds, documented hardware, and time limits. Until then, treat the results as strong hypotheses rather than settled fact.
Why this matters for builders
Today’s LLM toolchains are versatile but can be fickle. Even with carefully engineered prompts, temperature settings, and evaluation harnesses, small perturbations can flip outcomes. If a system like WD-AGI delivers deterministic reasoning with clear hypotheses, you get a few immediate benefits:
- Debuggability: easier to instrument with logs and assertions. Think pytest-like reproducibility for cognition.
- Compliance and QA: explainable steps simplify audits for regulated use cases (finance, healthcare).
- Tool integration: a goal-driven planner can orchestrate calls to external tools, from Hugging Face models to custom microservices, while keeping a rational trace.
- Program synthesis and data wrangling: inferring consistent rules across examples is essentially what many ETL and code transformation tasks require.
Picture a system that examines half a dozen input-output examples and generates a precise transformation rule—then validates it against held-out examples. That’s the core spirit of ARC-style tasks but also a proxy for real workflows like spreadsheet refactoring, log normalization, and schema migration.
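For intuition only, here is a minimal sketch of that loop in Python, assuming a toy hypothesis library of string transformations (nothing here reflects WD-AGI's unpublished internals): propose candidate rules in a fixed order, and accept only one that explains the visible examples and also survives a hold-out check.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Rule = Callable[[str], str]
Example = Tuple[str, str]  # (input, expected output)

# A tiny, hypothetical hypothesis library; a real solver would search a far
# richer space of candidate transformations.
CANDIDATE_RULES: Dict[str, Rule] = {
    "identity": lambda s: s,
    "reverse": lambda s: s[::-1],
    "uppercase": lambda s: s.upper(),
    "strip_digits": lambda s: "".join(c for c in s if not c.isdigit()),
}

def fits(rule: Rule, examples: Sequence[Example]) -> bool:
    """A rule fits only if it reproduces every example exactly."""
    return all(rule(x) == y for x, y in examples)

def infer_rule(train: Sequence[Example], holdout: Sequence[Example]) -> Optional[str]:
    """Scan candidates in a fixed order (deterministic) and accept the first
    rule that explains the training examples AND survives the hold-out check."""
    for name, rule in CANDIDATE_RULES.items():
        if fits(rule, train) and fits(rule, holdout):
            return name  # the rule name doubles as a human-readable explanation
    return None  # no hypothesis survived; report failure rather than guess

if __name__ == "__main__":
    train = [("abc1", "ABC1"), ("x2y", "X2Y")]
    holdout = [("q9r", "Q9R")]
    print(infer_rule(train, holdout))  # -> "uppercase"
```

The point of the toy is the shape of the workflow: every accepted answer comes with a named hypothesis and a reproducible acceptance test, which is exactly what you would want to log in an ETL or schema-migration setting.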
Possible design elements (reading between the lines)
The team mentions inspiration from biology and philosophy, plus hypothesis-driven reasoning. Translating that to engineering, one might expect components like:
- Symbolic search or program induction: enumerate candidate rules/programs, score them against examples.
- Guided heuristics: priors for visual or structural motifs to prune search (common in ARC solvers).
- Deterministic planners: explicit state transitions and goal trees, potentially akin to classical planning with constraints.
- LLM assistance (optional): using an LLM for proposal generation or explanation, while the core solver ensures correctness—especially if integrated with CUDA-accelerated search or domain-specific kernels.
None of these are confirmed; they’re plausible building blocks a team might combine to achieve deterministic, explainable rule inference at scale. The interesting bit is the claimed breadth: discrete and continuous domains. If the method generalizes beyond grid puzzles into, say, numerical control or robotics planning, the impact could be wider than benchmark wins.
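To make the program-induction building block concrete, the sketch below (a generic illustration, not a description of WD-AGI) enumerates short compositions of grid primitives and returns the shortest program consistent with all examples, which is roughly how many published ARC-style solvers operate.

```python
from itertools import product
from typing import Callable, Dict, Optional, Sequence, Tuple

Grid = Tuple[Tuple[int, ...], ...]

# Hypothetical primitives over small integer grids (ARC-style).
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "rot90": lambda g: tuple(zip(*g[::-1])),            # rotate clockwise
    "flip_h": lambda g: tuple(row[::-1] for row in g),  # mirror left-right
    "flip_v": lambda g: tuple(g[::-1]),                 # mirror top-bottom
}

def run(program: Sequence[str], grid: Grid) -> Grid:
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def induce(examples: Sequence[Tuple[Grid, Grid]], max_len: int = 2) -> Optional[Tuple[str, ...]]:
    """Enumerate programs in a fixed order (deterministic) and return the
    shortest composition consistent with every example, or None."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

if __name__ == "__main__":
    g: Grid = ((1, 2), (3, 4))
    rotated = ((3, 1), (4, 2))  # g rotated 90 degrees clockwise
    print(induce([(g, rotated)]))  # -> ("rot90",)
```

Real systems replace brute-force enumeration with heuristics and learned priors to prune the search, but the contract stays the same: the output is a program you can inspect and rerun.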
How to kick the tires
Until the arXiv papers and public evals land, developers can prep a validation plan. A practical checklist:
- Data hygiene: ensure no overlap with training artifacts; use hidden test sets or freshly constructed tasks.
- Compute and time budgets: fix wall-clock limits and hardware specs for comparability.
- Ablations: measure the contribution of each subsystem (search, heuristics, verification).
- Determinism checks: run multiple times with the same seed; capture identical traces.
- Error taxonomy: request failure cases to see brittleness modes (off-by-one, symmetry misreads, overfitting to color/position).
- Cost profile: how does performance scale with problem size and available compute?
Hands-on experiments could mirror ARC mini-sets, small rule discovery tasks, or structured data transformations. Track both accuracy and time to first correct hypothesis. For day-to-day workflows, even a system that solves 40% of gnarly rule tasks quickly and reproducibly can be a win.
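A determinism check in particular is cheap to automate. The harness below assumes a hypothetical solve(task, seed=...) entry point that returns an answer plus a reasoning trace; it simply reruns the same input and confirms the outputs are byte-identical.

```python
import hashlib
import json

def result_digest(result: dict) -> str:
    """Hash the full result (answer + trace) so any divergence is visible."""
    return hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest()

def check_determinism(solve, task: dict, seed: int = 0, runs: int = 5) -> bool:
    """solve(task, seed=...) is a hypothetical entry point returning a dict
    with 'answer' and 'trace' keys; every run must produce identical output."""
    digests = {result_digest(solve(task, seed=seed)) for _ in range(runs)}
    return len(digests) == 1

if __name__ == "__main__":
    # Stand-in solver for demonstration; swap in the system under test.
    def fake_solve(task: dict, seed: int = 0) -> dict:
        return {"answer": sorted(task["examples"]), "trace": ["sorted inputs"]}

    print(check_determinism(fake_solve, {"examples": [3, 1, 2]}))  # -> True
```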
Where it could slot into your stack
Imagine a service that sits beside your LLM chain. The LLM handles free-form language tasks (summaries, specs), while the reasoning engine attempts deterministic rule discovery with proofs. In a tool like Airflow or a microservice mesh, you might expose an endpoint like:
{
  "task": "infer_rule",
  "examples": [...],
  "constraints": {...},
  "verification": "holdout"
}
Use the result to gate downstream actions: only proceed if the solver provides a validated rule and an explanation trace. This hybrid pattern—combining stochastic language models with deterministic solvers—has been a rising theme in the community, whether you deploy via PyTorch extensions, TensorFlow Serving, or model hubs like Hugging Face. It’s also compatible with media-generation pipelines (think Stable Diffusion) where a planner can enforce constraints before image or video synthesis.
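Here is a minimal client-side sketch of that gating pattern; the endpoint URL and the verified/trace/rule response fields are assumptions extrapolated from the example payload above, not a published API.

```python
from typing import Optional

import requests

SOLVER_URL = "http://localhost:8080/infer_rule"  # hypothetical deployment

def get_validated_rule(examples: list, constraints: dict) -> Optional[dict]:
    """Call the (hypothetical) rule-inference endpoint; return the result only
    if it reports hold-out verification plus an explanation trace."""
    payload = {
        "task": "infer_rule",
        "examples": examples,
        "constraints": constraints,
        "verification": "holdout",
    }
    resp = requests.post(SOLVER_URL, json=payload, timeout=30)
    resp.raise_for_status()
    result = resp.json()
    # 'verified', 'trace', and 'rule' are assumed response fields, not a spec.
    if result.get("verified") and result.get("trace"):
        return result
    return None  # fail closed: no verified rule, no downstream action

def run_step(examples: list, constraints: dict) -> None:
    result = get_validated_rule(examples, constraints)
    if result is None:
        print("No verified rule; escalating to human review.")
        return
    print("Proceeding with rule:", result.get("rule"))
```

Failing closed keeps the stochastic parts of the pipeline from acting on unverified rules, which is the whole point of pairing an LLM with a deterministic solver.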
Small UX touch: bind a keyboard shortcut like Ctrl + R to rerun the same seed and confirm determinism—simple, but symbolic of a different era of AI tooling.
Risks, caveats, and what to watch
Deterministic solvers can be brittle: if the hypothesis library is narrow, they excel on certain patterns and stumble elsewhere. High benchmark scores can also mask dataset quirks. Keep an eye on:
- Generalization: do rules transfer to fresh, adversarial tasks?
- Scalability: how do compute needs grow with problem complexity?
- Spec gaming: does the system exploit idiosyncrasies of a benchmark rather than the underlying concept?
- Transparency: will the upcoming papers contain enough detail for independent reproduction?
The founder’s note about influences from biology and philosophy is intriguing; if those ideas translate into a novel search or inference paradigm, the community will want rigorous empirical comparisons and well-documented ablations.
Access, timeline, and next steps
The team says an AMA is coming, papers are headed to arXiv, and early trials are available. Inquiries and preorders are directed to the provided contact email. If you’re evaluating, line up your test harness now and prepare to publish your findings—good, bad, or mixed. The field benefits when claims meet public benchmarks with reproducible scripts and clear logs.
Whether the final numbers hold or settle lower, the emphasis on deterministic, explainable rule discovery is exactly the kind of direction many practitioners have been hoping to see. If systems like WD-AGI can make reasoning more like software (inspectable, repeatable, and composable), then autonomous agents move from “cool demo” to “reliable teammate.”
AI Tech Inspire will track the AMA and papers, and we’ll test public releases as they appear. If you run your own experiments, send results and traces—this is one story where the most interesting plots will be in the logs.