Imagine getting a 97-piece neural net jigsaw where every piece is shuffled and only one arrangement yields an MSE of 0 and a matching SHA-256 hash. That’s the setup behind a recent puzzle that lit up our feeds at AI Tech Inspire. One team’s path to the solution is worth a closer look, not just for the win, but because it shows where common heuristics top out—and how a small twist in the move set (3-opt) makes all the difference.


Key facts at a glance

  • Puzzle: A 97-block sequential neural network was randomly permuted; the task was to recover the original block order.
  • Verification: Ground truth required a SHA-256 hash match and MSE = 0 on training data.
  • Compute: Track 2 solved in ~72 hours of wall clock using a 5-node home cluster; no cloud resources used.
  • Landscape: The search is over S_97 with correlated local structure but deep local minima.
  • Main techniques and progress:
    • Distributed simulated annealing (SA) with PMX crossover via a PostgreSQL pool: MSE 0.45 → 0.00275; 172 elite solutions clustered within a Kendall tau distance of ≈ 3.
    • Consensus disagreement analysis across top-50 permutations: ~90 positions unanimous; ~7 uncertain → targeted enumeration improved to 0.00173.
    • Swap cascade: exhaustive C(97,2) pairwise swaps applied greedily in two rounds → 0.000253.
    • 3-opt rotations: exhaustive C(97,3) cyclic rotations; three moves finished the job: 0.000253 → 0.000174 → 0.000111 → 0.000000.
  • Why SA and pairwise swaps fail at the end: Final errors required coordinated 3-cycles; any single swap was uphill. Only 3-opt crossed the barrier.
  • Practical notes: Consensus analysis delivered the highest ROI; difficulty appeared layered: roughly 90% of the problem yielded to stochastic search, ~8% to pairwise swaps, and the final ~2% only to k=3 moves.
  • Hardware: Per-thread throughput on the CPU-bound NumPy workload — M4 Max: 318 steps/s; Threadripper 7960X: 122 steps/s; M1 Max: 182 steps/s; i9-13900K: 132 steps/s.
  • Endgame runtime: 47 seconds on a single M4 Max once the right moves were known.
  • Stack: Python with PyTorch (inference only), NumPy, SciPy (linear_sum_assignment for Hungarian matching), PostgreSQL; ~4,500 lines across 15 scripts; no ML training.
  • What didn’t work: Jacobian chain conditioning (low SNR), un-targeted breakpoint surgery, running compute on DB nodes.
  • Attribution: Built by the Cherokee AI Federation, a community nonprofit, on consumer hardware across a home network.

The search space has gradients—but nasty traps

Recovering a permutation over S_97 looks hopeless at first blush—97! is astronomically large. The twist: the loss surface (MSE after reassembly) isn’t random. Neighboring permutations often produce correlated errors, which gives a gradient-like signal. That makes standard heuristics—simulated annealing, genetic recombination, and local search—viable starters.
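
To make that “gradient-like signal” concrete, here is a minimal sketch of the objective such a search evaluates: run the blocks in a candidate order on a batch and compare against reference outputs. The toy blocks, batch, and reference below are illustrative stand-ins, not the actual 97-block network.

```python
import numpy as np

def reassembly_mse(blocks, perm, x, y_true):
    """Score a candidate block order: compose the blocks, compare outputs."""
    h = x
    for idx in perm:
        h = blocks[idx](h)                      # run blocks in the proposed order
    return float(np.mean((h - y_true) ** 2))

# Toy stand-ins: four "blocks" that are just fixed nonlinear maps.
rng = np.random.default_rng(0)
mats = [rng.standard_normal((8, 8)) * 0.5 for _ in range(4)]
blocks = [lambda h, W=W: np.tanh(h @ W) for W in mats]

x = rng.standard_normal((16, 8))
true_order = [2, 0, 3, 1]
y_true = x
for idx in true_order:
    y_true = blocks[idx](y_true)

print(reassembly_mse(blocks, true_order, x, y_true))    # exactly 0.0
print(reassembly_mse(blocks, [0, 1, 2, 3], x, y_true))  # > 0: wrong order, but correlated error
```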

But local structure cuts both ways. Deep local minima emerge where many single-step neighbors are worse. The endgame here included a few “lock-step” islands where fixing any one position in isolation made everything else worse, so only coordinated moves of size k=3 could make progress.

“Only C(n,3) cyclic rotations reach the basin floor.”

If you’ve ever been stuck on a 2-opt plateau in a traveling salesman problem until a 3-opt pass suddenly unlocked the route, the analogy will feel familiar.

Distributed SA worked—until it didn’t

The team led with distributed SA augmented by PMX crossover, pooling elites in PostgreSQL. That combination is a workhorse on permutation problems: SA explores locally; PMX recombines structure from promising candidates without breaking permutation constraints.
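
PMX is a standard operator for permutation encodings. A compact sketch (independent of the team’s implementation) copies a slice from one parent, then places the displaced values by following the slice’s value mapping so the child is always a valid permutation:

```python
import numpy as np

def pmx(parent1, parent2, rng):
    """Partially mapped crossover on two permutations (given as lists)."""
    n = len(parent1)
    a, b = sorted(rng.choice(n + 1, size=2, replace=False))
    child = [None] * n
    child[a:b] = parent1[a:b]                     # copy the slice from parent1
    for i in range(a, b):
        e = parent2[i]
        if e in child[a:b]:
            continue                              # already placed by the slice copy
        j = i
        while a <= j < b:                         # follow the value mapping out of the slice
            j = parent2.index(parent1[j])
        child[j] = e
    for i in range(n):                            # everything else comes straight from parent2
        if child[i] is None:
            child[i] = parent2[i]
    return child

rng = np.random.default_rng(1)
p1, p2 = list(rng.permutation(10)), list(rng.permutation(10))
child = pmx(p1, p2, rng)
assert sorted(int(v) for v in child) == list(range(10))   # still a valid permutation
```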

It drove the error from 0.45 to 0.00275, a drop of more than two orders of magnitude. However, all five nodes ultimately converged into the same basin. A telling metric: 172 stored solutions, all within a Kendall tau distance of about 3 of one another. In other words, there was diversity on paper, but not enough structural diversity to jump the final barriers.

Takeaway for practitioners: distributed search is great for exploration, but watch for hidden collapse into a narrow basin. If your elites are tightly clustered in a rank metric, it’s time to change neighborhoods—not just turn the temperature knob.
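
One way to instrument that check (an illustrative helper, not the team’s code) is to compute pairwise Kendall tau distances across the elite pool and flag collapse when they are all tiny:

```python
from itertools import combinations

def kendall_tau_distance(perm_a, perm_b):
    """Count pairwise-order disagreements between two permutations (0 = identical)."""
    pos_b = {v: i for i, v in enumerate(perm_b)}
    mapped = [pos_b[v] for v in perm_a]           # express perm_a in perm_b's coordinates
    return sum(1 for i, j in combinations(range(len(mapped)), 2)
               if mapped[i] > mapped[j])          # count inversions

def elite_diversity(elites):
    """Max and mean pairwise Kendall tau distance over an elite pool."""
    dists = [kendall_tau_distance(a, b) for a, b in combinations(elites, 2)]
    return max(dists), sum(dists) / len(dists)

# If the max distance is single digits on a length-97 permutation, the pool has
# effectively collapsed into one basin and needs new neighborhoods, not more heat.
```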

Consensus analysis: squeeze signal from your failures

Instead of hammering the same basin with more randomness, the solvers switched modes: measure agreement among the best runs and focus only on the disagreements. Across the top-50 permutations, ~90 positions were unanimous. Around seven positions showed real uncertainty. By exhaustively enumerating only those uncertain slots, they broke the stalemate and pushed loss to 0.00173.

That’s a powerful general idea for engineers: if you have a population of near-optimal solutions, treat it as a noisy posterior over variables. Then, branch only where the posterior is wide. It’s an inexpensive way to slash combinatorial blowup while keeping the right degrees of freedom.
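
Here’s a minimal sketch of that consensus-plus-targeted-enumeration step; the helper names are hypothetical, and loss_fn stands in for the reassembly MSE:

```python
from itertools import permutations

def consensus_split(elites):
    """Split positions into unanimous slots and contested slots across an elite pool."""
    n = len(elites[0])
    unanimous, contested = {}, []
    for pos in range(n):
        votes = {perm[pos] for perm in elites}
        if len(votes) == 1:
            unanimous[pos] = next(iter(votes))
        else:
            contested.append(pos)
    return unanimous, contested

def enumerate_contested(base_perm, contested, loss_fn):
    """Brute-force every arrangement of the contested slots, keeping everything else fixed."""
    movable = [base_perm[i] for i in contested]
    best_perm, best_loss = list(base_perm), loss_fn(base_perm)
    for assignment in permutations(movable):       # ~7 contested slots -> only 5,040 trials
        trial = list(base_perm)
        for pos, val in zip(contested, assignment):
            trial[pos] = val
        loss = loss_fn(trial)
        if loss < best_loss:
            best_perm, best_loss = trial, loss
    return best_perm, best_loss
```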

From pairwise swaps to 3-opt: why the endgame needs bigger moves

After consensus, a greedy C(n,2) swap cascade (two passes) drove the MSE to 0.000253. Classic local search stuff—until it wasn’t. The final residue required rotating triples. Any single swap within those triples made the loss worse, so both SA (which proposes swaps) and pairwise exhaustive search were blind to the true descent direction.

Enter exhaustive C(n,3) three-element cyclic permutations. Just three 3-opt rotations landed the solution: 0.000253 → 0.000174 → 0.000111 → 0.000000. This is the textbook moment to expand your neighborhood. If your plateau persists and single-variable (or pairwise) moves are universally uphill, your problem likely hides a k > 2 lock-step structure.
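
Both move sets are short to express. The sketch below is illustrative rather than the team’s code; loss_fn again stands in for the reassembly MSE, and both rotation directions are tried for each triple:

```python
from itertools import combinations

def greedy_swap_cascade(perm, loss_fn, rounds=2):
    """Try every pairwise swap; keep any that lowers the loss; repeat for a few rounds."""
    perm, best = list(perm), loss_fn(perm)
    for _ in range(rounds):
        improved = False
        for i, j in combinations(range(len(perm)), 2):
            perm[i], perm[j] = perm[j], perm[i]
            loss = loss_fn(perm)
            if loss < best:
                best, improved = loss, True               # keep the improving swap
            else:
                perm[i], perm[j] = perm[j], perm[i]       # undo
        if not improved:
            break
    return perm, best

def three_opt_pass(perm, loss_fn):
    """Try both cyclic rotations of every position triple; keep improvements greedily."""
    perm, best = list(perm), loss_fn(perm)
    for i, j, k in combinations(range(len(perm)), 3):
        for src in ((j, k, i), (k, i, j)):                # the two non-trivial 3-cycles
            trial = list(perm)
            trial[i], trial[j], trial[k] = perm[src[0]], perm[src[1]], perm[src[2]]
            loss = loss_fn(trial)
            if loss < best:
                perm, best = trial, loss
    return perm, best
```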

Practical recipe for similar problems:

  • Use fast stochastic search to carve away the easy 80–90%.
  • Apply consensus disagreement to shrink the active set.
  • Run an ordered cascade: pairwise improvements, then 3-opt (or k-opt) moves.

For scheduling, routing, compiler pass ordering, layout synthesis, or even model-block repairs, this pattern generalizes surprisingly well.

Hardware notes: CPU-bound NumPy loves Apple Silicon

The numbers here will raise eyebrows for anyone running numeric kernels on desktops:

  • M4 Max: 318 steps/s
  • Threadripper 7960X: 122 steps/s
  • M1 Max: 182 steps/s
  • i9-13900K: 132 steps/s

On this CPU-bound NumPy-heavy workload, Apple Silicon delivered a ~2.5× per-thread advantage versus comparable x86 chips. It aligns with what AI Tech Inspire has seen on vectorized Python/BLAS routines when memory fits in cache and the code path doesn’t benefit from CUDA offload. The punchline: don’t assume a big x86 workstation is always faster. Benchmark your specific kernel.
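
A quick way to do that is to count raw objective evaluations per second on a single thread; the kernel below is only a toy stand-in for the real per-candidate evaluation:

```python
import time
import numpy as np

def steps_per_second(loss_fn, candidate, duration=3.0):
    """Measure single-thread objective evaluations per second."""
    loss_fn(candidate)                             # warm-up: settle caches and BLAS threads
    steps, start = 0, time.perf_counter()
    while time.perf_counter() - start < duration:
        loss_fn(candidate)
        steps += 1
    return steps / (time.perf_counter() - start)

# Toy stand-in kernel: a chain of matrix multiplies, loosely shaped like
# evaluating one candidate permutation end to end. Sizes are arbitrary.
rng = np.random.default_rng(0)
mats = [rng.standard_normal((256, 256)).astype(np.float32) for _ in range(20)]
batch = rng.standard_normal((64, 256)).astype(np.float32)

def toy_loss(perm):
    h = batch
    for idx in perm:
        h = np.tanh(h @ mats[idx])
    return float(np.mean(h ** 2))

print(f"{steps_per_second(toy_loss, list(range(20))):.1f} steps/s")
```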

Also notable: once the move vocabulary was correct, the endgame took just 47 seconds on an M4 Max. The cluster’s real value was in exploration, not raw solve time.

Tooling, stack, and the quiet power of simplicity

The stack was intentionally straightforward: Python; PyTorch for inference only; NumPy and SciPy for numerics (including linear_sum_assignment / Hungarian matching during seeding); and PostgreSQL as a coordination hub for elites. Around 4,500 lines across 15 scripts. No training, no specialized accelerators. Just careful orchestration and good heuristics.
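
The write-up doesn’t describe how the seeding cost matrix was built, so the sketch below shows only the assignment step: given some slot-versus-block compatibility cost, scipy.optimize.linear_sum_assignment returns a valid starting permutation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def seed_from_costs(cost):
    """Turn an (n_slots x n_blocks) cost matrix into an initial permutation.

    cost[i, j] should score how poorly block j fits slot i (lower is better);
    building that matrix is the problem-specific part and is not shown here.
    """
    rows, cols = linear_sum_assignment(cost)       # optimal one-to-one assignment
    perm = np.empty(len(rows), dtype=int)
    perm[rows] = cols                              # slot i is assigned block perm[i]
    return perm.tolist()

# Placeholder usage with a random cost matrix standing in for a real compatibility score.
rng = np.random.default_rng(0)
seed = seed_from_costs(rng.random((97, 97)))
assert sorted(seed) == list(range(97))
```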

That matters for teams working on tight budgets or with mixed hardware. You can do serious combinatorial work with approachable tools and consumer machines—as long as you’re smart about information reuse (consensus) and neighborhood design (k-opt).

What didn’t help—and why

A few discarded approaches are instructive:

  • Jacobian chain conditioning: The theory is appealing (propagating sensitivity through composed blocks), but the signal-to-noise was too low to beat simpler heuristics.
  • Breakpoint surgery without consensus targeting: Editing the wrong positions is still the wrong search. Precision matters more than brute force.
  • Running compute on DB nodes: A reminder that coordination services should stay snappy. Shared infra saturation kills search throughput.

Why it matters (and how to use it)

For developers and engineers, the bigger story isn’t this one puzzle—it’s a repeatable playbook for tough permutation and ordering problems:

  • Instrument diversity: Track rank distances (e.g., Kendall tau) among elites to detect convergence collapse.
  • Exploit consensus: Treat your solution pool as a distribution; branch only where it’s wide.
  • Escalate move size: If all 1- or 2-step neighbors are uphill, your landscape likely needs k=3 (or higher) moves.
  • Benchmark your hardware: For CPU-bound Python numerics, Apple Silicon may deliver outsized gains. Validate with your kernel, not assumptions.

At AI Tech Inspire, we look for patterns that travel. This one does. Whether you’re arranging compiler passes, optimizing a data pipeline, tuning a scheduler, or reassembling model components, consider adding a consensus step and a 3-opt (or k-opt) endgame to your toolbox. Sometimes the shortest path to zero isn’t a swap—it’s a rotation.
