If you’ve ever wondered how far a single gaming GPU and a disciplined research loop can take a neural chess engine, this one deserves a close look. At AI Tech Inspire, we spotted a project that blends an AlphaZero-style approach with a Karpathy-inspired “autoresearch” workflow—and packages the result into a clean, browser-playable demo.
Fast facts (neutral and concise)
- Browser-playable neural chess engine targeting roughly 2700 Elo on public tests.
- Built via an AI-assisted research loop: read papers, inspect ideas, prototype, ablate, optimize, repeat.
- Current public V3 model: residual CNN + transformer + learned thought tokens, ~16M parameters.
- Input encoding: 19-plane 8x8; output: 4672-move policy head + value head.
- Training: 100M+ positions via a pipeline: supervised pretraining on 2200+ rated Lichess data, Syzygy endgame fine-tuning, then self-play RL with search distillation.
- Inference: CPU-only with 1-ply lookahead and quiescence; reported <2 ms per evaluation.
- Web app features: play vs AI, board editor, PGN import/replay, puzzles, and move analysis with probability shifts before/after a “thinking” step.
- Claimed to be unusually compute-efficient for its strength (positioned as a hypothesis seeking community scrutiny).
- Next versions: V4 explores CNN + Transformer + Thought Tokens + DAB (Dynamic Attention Bias) at ~50M parameters; V5 proposes Temporal Look-Ahead as a speculative idea.
- Demo: games.jesion.pl • Project details: games.jesion.pl/about • Free to try; nickname/email only for a public leaderboard.
- Actively seeking feedback on: ablations for thought tokens/DAB, Elo-vs-compute methodology, whether Temporal Look-Ahead is novel/useful, and stronger evaluation vs classical engines.
Why this caught our eye
Neural chess engines are often discussed in the context of supercomputers or massive cloud runs. This one leans into a home-hardware constraint and a research cadence that many independent developers can actually emulate. The result is a model that reportedly reaches ~2700 Elo, runs with CPU inference and a shallow 1-ply lookahead, and still responds in under 2 ms per evaluation. That combination of responsiveness, accessibility, and transparency in a browser offers a refreshing testbed for both chess and ML experimentation.
Key takeaway: the interesting experiment here isn’t just “Can a small net play strong chess?” It’s “How far can an AI-assisted research loop push model quality and efficiency on home hardware?”
Under the hood, in plain terms
At a high level, the engine follows an AlphaZero-style recipe: a shared trunk feeds a policy head (the move distribution) and a value head (position evaluation). What’s new-ish here is the integration of a transformer on top of a residual CNN and the use of thought tokens—auxiliary latent steps that let the model refine its internal representation before predicting moves. The public V3 has:
- ~16M params with a 19x8x8 input encoding.
- A wide 4672-move policy head (a common trick to cover all legal moves as indexed slots).
- Training on 100M+ positions: supervised learning from high-rated Lichess data, Syzygy endgame fine-tuning, then self-play RL with search distillation.
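The 4672-slot policy head matches the width of the well-known AlphaZero move encoding (64 origin squares × 73 move-type planes). The project doesn’t document its exact indexing, so the sketch below is a hypothetical reconstruction of how such a head is commonly laid out:

```python
# Illustrative index math for a 4672-slot policy head, assuming the
# AlphaZero-style 8x8x73 move encoding (the project's exact scheme is
# not documented; treat this layout as a hypothetical reconstruction).

QUEEN_PLANES = 56        # 8 directions x up to 7 squares
KNIGHT_PLANES = 8        # the 8 knight jumps
UNDERPROMO_PLANES = 9    # 3 pieces (N/B/R) x 3 capture directions
PLANES = QUEEN_PLANES + KNIGHT_PLANES + UNDERPROMO_PLANES  # 73

def policy_index(from_square: int, plane: int) -> int:
    """Flatten (origin square, move-type plane) into a single policy slot."""
    assert 0 <= from_square < 64 and 0 <= plane < PLANES
    return from_square * PLANES + plane

TOTAL_SLOTS = 64 * PLANES  # 4672, matching the reported head width
```

Under this convention, every legal move maps to exactly one slot and illegal slots are simply masked out before the softmax.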
The web app exposes this in a developer-friendly way: load a position, see top-move probabilities, then watch how a brief “thinking” step (the extra latent refinement) shifts those probabilities. For anyone teaching ML or debugging model behavior, this level of visibility is gold.
Compute, timing, and the efficiency angle
The developer reports sub-2 ms CPU evaluations with a shallow 1-ply + quiescence lookahead. That works out to roughly 500+ NN evals/sec on modest hardware, which is appealing for browser play and for embedding into lightweight analysis workflows. Training was run on a consumer RTX 4090; the implementation stack isn’t specified, but a common choice would be PyTorch on CUDA. Either way, the general point stands: this is within reach for many hobbyists.
The bigger claim is compute efficiency at a given strength. Instead of raw Elo alone, we’d love to see benchmarks like:
- Energy-to-strength: Watt-hours per +100 Elo at fixed time controls.
- Time-to-strength: wall-clock per game to reach a target SPRT confidence against baselines.
- Eval-cost-to-strength: evaluations per decision vs final Elo at matched latencies.
Measuring these across home hardware would make comparisons fairer and more reproducible than “how strong is it” in isolation.
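For the strength side of these ratios, match scores can be converted into an Elo estimate using the standard logistic model. The `wh_per_100_elo` helper below is a hypothetical illustration of the first metric, not something the project reports:

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by an average match score (0 < score < 1),
    under the standard logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def wh_per_100_elo(watt_hours: float, elo_gain: float) -> float:
    """Hypothetical energy-to-strength metric: watt-hours per +100 Elo."""
    return watt_hours / (elo_gain / 100.0)

# A 64% match score implies roughly a +100 Elo edge.
edge = elo_diff(0.64)
```

The same conversion underpins SPRT-style testing, so publishing raw scores alongside derived Elo makes replication straightforward.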
Try it, break it, learn from it
Hands-on is easy: visit the live demo at games.jesion.pl. Beyond playing, the most educational features are:
- Move analysis UI that shows pre/post “thinking” probabilities, handy for intuiting what thought tokens might be doing.
- PGN import and replay; step with ←/→ to scan moves and watch the distribution evolve.
- Puzzles and board editor to stress specific tactical and endgame patterns; compare decisions with and without the extra latent step.
For researchers, this is a lightweight arena to probe ablations, try unusual encodings, or instrument calibration curves.
Ablations we’d love to see
The project is actively seeking feedback on ablation setups, especially around thought tokens and DAB (Dynamic Attention Bias). A few concrete ideas:
- Thought tokens count sweep: 0/1/2/4-step variants; plot Elo, move-quality (e.g., engine loss vs a strong baseline), and latency.
- Teacher-forcing vs free-running: If internal tokens are trained with a target, compare policy calibration and KL to a high-depth teacher.
- Selective activation: Enable tokens only on high-uncertainty positions (high entropy or low value-margin); measure strength and average latency.
- DAB ablations: Bias only on tactical motifs (pins, forks), only on king-safety channels, or only in endgame; compare gains vs added compute.
- Explainability probes: Attention rollout visualizations during token steps; correlation with known chess concepts (passed pawns, outposts).
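A sweep like the first two items could be organized as a small config grid. In the sketch below, `evaluate` is a stub standing in for a real match runner, and the parameter names are hypothetical:

```python
from itertools import product

# Hypothetical ablation harness: cross thought-token counts with DAB
# on/off. `evaluate` is a stub; a real run would launch matches (e.g.,
# a cutechess gauntlet) and return measured Elo and latency.
def evaluate(config: dict) -> dict:
    return {"elo": 0.0, "latency_ms": 0.0, **config}  # placeholder numbers

def sweep(thought_steps=(0, 1, 2, 4), dab=(False, True)):
    """Run every (thought_steps, dab) combination and collect results."""
    return [evaluate({"thought_steps": t, "dab": d})
            for t, d in product(thought_steps, dab)]

results = sweep()  # 4 token settings x 2 DAB settings = 8 configs
```

Keeping each config as a flat dict makes it trivial to dump results to CSV and plot Elo and latency against token count.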
How to measure Elo-vs-compute on home rigs
To validate the efficiency hypothesis, consistent protocol matters. A practical recipe:
- cutechess-cli matches against a panel of classical and NN engines at fixed budgets: e.g., `tc=0.2+0.2` and `tc=1+0.1`.
- Standard openings via a small balanced book; color-balanced and shuffled seeds.
- SPRT or BayesElo for significance; publish the full PGN set and logs for replication.
- Latency caps: Enforce per-move compute caps (e.g., eval count or ms) to decouple code optimizations from model quality.
- Power logging: If possible, record approximate system power to estimate energy per game.
Separately, run no-search vs 1-ply variants to attribute gains to the network versus the lookahead.
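A cutechess-cli invocation along these lines would implement the fixed-budget SPRT matches; the engine paths, opening book, and SPRT bounds below are placeholders, not the project’s actual setup:

```shell
# Illustrative cutechess-cli gauntlet; engine binaries, book, and
# SPRT bounds are placeholders for the protocol described above.
cutechess-cli \
  -engine cmd=./v3-engine name=V3 \
  -engine cmd=./baseline name=Baseline \
  -each proto=uci tc=1+0.1 \
  -openings file=book.pgn format=pgn order=random \
  -games 2 -rounds 500 -repeat \
  -sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 \
  -pgnout results.pgn
```

The `-repeat` flag plays each opening with colors reversed, and `-pgnout` captures the full game set for publication.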
Is “Temporal Look-Ahead” new or just good framing?
The proposed V5 idea—Temporal Look-Ahead, where the network represents future moves internally and propagates that signal backward—echoes threads seen in model-based RL (e.g., implicit rollouts), search distillation, and trajectory modeling. Related art includes imagination-augmented agents, the Predictron, and transformer-based sequence models that condition on hypothetical futures.
To assess novelty and utility:
- Counterfactual conditioning: Inject candidate future moves as latent tokens; measure whether the present policy shifts in line with deep-search teachers.
- Distill-from-MCTS: Supervise temporal tokens to mimic shallow internal rollouts from a strong MCTS and compare to baseline distillation.
- Ablate future-token masking: Randomly drop or misorder the future tokens to test whether the claimed signal genuinely informs current choices.
If it boosts strength at fixed latency, that’s useful—novel or not. If it mainly reframes known mechanisms, the framing could still help practitioners reason about architecture design.
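The masking/misordering probe could start from a utility like this; `tokens` here are placeholder identifiers rather than the project’s actual latent representation:

```python
import random

# Sketch of the future-token masking/misordering ablation. "tokens" are
# placeholder move identifiers, not the project's real latent tokens.
def corrupt_future_tokens(tokens, mode="mask", drop_p=0.5, seed=0):
    """Return a corrupted copy of a future-move token sequence."""
    rng = random.Random(seed)  # seeded for reproducible ablation runs
    if mode == "mask":
        return [t for t in tokens if rng.random() > drop_p]
    if mode == "shuffle":
        out = list(tokens)
        rng.shuffle(out)
        return out
    raise ValueError(f"unknown mode: {mode}")
```

If strength is unchanged when the future tokens are shuffled or dropped, the "temporal" signal is probably not doing the work the framing suggests.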
Stronger evaluation against classical engines (without overclaiming)
To pressure-test claims while keeping things fair:
- Run gauntlets versus Stockfish (classical), LCZero-lite (NN), and a few mid-strength engines at matched time and memory.
- Publish opening-book-free runs in addition to book-started runs; tactical and endgame suites (e.g., ECM, WAC) can supplement but not replace head‑to‑head.
- Report confidence intervals, not just point Elo; state hardware, threads, and hash size clearly.
- Include a “no-search” policy-only mode to quantify the net’s standalone quality.
What developers can reuse today
Two pieces stand out for engineers:
- The AI-assisted research loop. Treat an LLM (think GPT) as a literature scout and coding co-pilot: propose ablations, generate scripts, then run tight, repeatable experiments. It’s a pragmatic take on accelerating R&D.
- The browser UX. Exposing probabilities and pre/post-thinking shifts is a great pattern for any model where uncertainty and internal refinement matter—recommendation systems, game AIs, even small Hugging Face demos.
For anyone intrigued by efficient chess AI—or looking for a compact lab to test transformer variants, thought tokens, or new attention biases—the demo at games.jesion.pl is worth a spin. The thesis here isn’t that smaller always wins; it’s that careful iteration, good instrumentation, and a transparent evaluation story can get surprisingly far on everyday hardware.