Everyone wants a bigger context window. But what if the better move isn’t making a single model remember more — it’s coordinating a team so they share the right 10% at the right time? At AI Tech Inspire, we spotted a collaboration pitch called ClaudeFormer that tries exactly that: treat a group of Claude agents like the components of a transformer, and route information between them the way attention does inside a neural net.

Quick summary

  • Proposal: Build a multi-agent “ClaudeFormer” that maps transformer components onto a team of Claudes.
  • Roles: One leader acts as the attention head (router); multiple workers act as MLPs.
  • State: Each worker maintains a persistent residual.md file containing their full working memory (computations, conjectures, proof attempts). Agents can recover state by re-reading this file.
  • Communication: Workers publish short 2–5K token summaries with Keys (findings) and Queries (needs). The router reads summaries, not full files.
  • Dispatch: The router pairs sources and recipients (“X answers Y”) and instructs source workers to write tailored Values directly to recipients’ inbox files.
  • Verification: Proof claims are routed to an adversarial worker for attempted refutation during normal dispatch.
  • Hypothesis: Emulate a ~30M-token context by orchestrating ~30 agents with ~1M tokens each; assume only ~10% of prior context is relevant at any moment.
  • Two-level information hierarchy: Big private residuals (up to ~350K tokens) + short summaries keep routing within a single model’s window (e.g., 30 × 5K = ~150K).
  • Scaling dimensions: number of workers (breadth), number of dispatch cycles (depth), and multiple attention heads for large teams.
  • Collaboration request: The proponent (an agentic engineer nearing a CS master’s) and a math PhD collaborator seek contributors; they note they haven’t found prior literature exploring this exact architecture.

Why this framing might matter

Developers who’ve lived inside PyTorch or TensorFlow graphs know the transformer playbook: attention figures out what to read; MLPs transform; residual streams carry state; verification lives in training signals. ClaudeFormer borrows that mental model and asks: could an orchestrator do “attention” across agents’ notes instead of across tokens?

This twist targets a core pain: a single model’s context window is finite, yet deep research tasks want long-lived memory, branching exploration, and adversarial checks. Teams of agents already exist, but they often collapse under message sprawl because they lack a principled information bottleneck. The ClaudeFormer proposal insists on a crisp two-level hierarchy: giant private notes + compact public summaries. That’s the new angle worth studying.

Key takeaway: If most context is noise for the current step, route only the relevant 10% between agents and keep the rest parked in persistent files.

How the “ClaudeFormer” loop runs

The proposed cycle is refreshingly explicit and could be implemented with any capable LLM (e.g., GPT or API-accessible open models via Hugging Face):

for t in 1..T:  # depth (dispatch cycles)
  for each worker:
    read residual.md
    think, compute (e.g., Macaulay2), update residual.md
    emit summary: Keys (findings), Queries (needs)

  router reads all summaries (not full residuals)
  for each relevant pair (source -> target):
    if source answers target's query: instruct source to write Value to target inbox
    if source claims a proof: route to an adversary to attempt refutation

  workers integrate inbox Values into their residual.md and continue

Two design ideas stand out:

  • residual.md as “the state”: If a session resets, a worker rehydrates by rereading their file. No fragile chat history dependencies.
  • Source-written Values: Instead of the router repackaging info, the originator crafts the message for the recipient. In practice, that should reduce lossy handoffs.

Imagine number theory exploration: Worker A hunts for counterexamples; Worker B computes Hasse invariants; Worker C conjectures a local-to-global principle. A’s summary asks for invariants on certain instances; B’s summary advertises new invariant tables; the router pairs them and dispatches. When C claims a proof sketch, the router routes it to Worker D with a “try to break this” instruction. It’s lightweight, but it mirrors how attention and residuals cooperate in a transformer.
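For concreteness, Worker B’s published summary in that scenario might look something like this. The section names, token cap, and specific findings are illustrative, not part of the proposal:

```markdown
# summary.md — Worker B (target: ≤ 5K tokens)

## Keys (findings)
- Hasse invariant tables computed for the current batch of curves; entries
  verified in Macaulay2.
- No counterexamples found in the latest search range.

## Queries (needs)
- Need A's candidate counterexamples so invariants can be computed for them.
```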


Engineering realities and trade-offs

There’s a lot to like — and a lot to wrestle with:

  • Persistence and provenance: Files-as-state make runs reproducible. A simple Ctrl+F over residual.md beats scrolling through chat logs.
  • Context economy: Summaries cap at ~2–5K tokens; the router sees N summaries, not N full files. That’s how 30 workers can stay inside one model’s window (~150K tokens total summaries).
  • Scalability: Past ~60–90 workers, a single router may become a bottleneck. Multiple attention heads can cover subgroups, with overlaps so each pair of workers is “seen” by at least one head.
  • Verification-on-dispatch: Adversarial routing makes “review” part of the loop, not a bolted-on stage. That’s closer to continuous integration than a big-bang QA pass.
  • Tooling: If heavy computation is needed, consider offloading to compiled code or CUDA-backed GPU services. For orchestration glue, frameworks like LangChain or AutoGen can help, though a minimal event loop may suffice.

One more pragmatic note: cost and latency could bite. Every dispatch cycle entails a summary pass for each worker, a router pass, and value exchanges — that is many model calls per step. Batching, caching recent summaries, and rate-limiting dispatches will matter.
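One cheap optimization along those lines: only re-summarize a worker when its residual actually changed. A content hash is enough to detect that (a sketch, assuming the file-per-worker layout described above):

```python
import hashlib
import tempfile
from pathlib import Path

def needs_summary(worker_dir: Path, cache: dict) -> bool:
    """True if the worker's residual changed since the last cached summary."""
    digest = hashlib.sha256((worker_dir / "residual.md").read_bytes()).hexdigest()
    if cache.get(worker_dir.name) == digest:
        return False  # unchanged: reuse the cached summary, skip the LLM call
    cache[worker_dir.name] = digest
    return True       # changed: spend an LLM call on a fresh summary

# Demo against a throwaway worker folder.
worker = Path(tempfile.mkdtemp())
(worker / "residual.md").write_text("Lemma 3.1: holds for all n < 100.")
cache = {}
first = needs_summary(worker, cache)   # True: nothing cached yet
second = needs_summary(worker, cache)  # False: residual unchanged
```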

Where it could shine

  • Frontier math research: Long-lived exploration, competing conjectures, and frequent verification attempts match the adversarial-routing mechanic.
  • Systems design RFCs: Multiple workers draft modules, propose interfaces, and challenge each other’s assumptions. Adversarial review becomes a first-class citizen.
  • Code search + refactoring: Workers own subsystems and keep their residual.md as living docs; the router aligns cross-cutting concerns and propagates critical changes.
  • Scientific literature mapping: Each worker curates a subdomain; Keys/Queries become citation graphs. The router stitches connections and flags contradictions.

Open questions to stress-test

  • Summary drift: Are workers good at compressing their state into actionable Keys and Queries, or do summaries degrade and amplify noise over time?
  • Attention precision: Can one router (or a small team) consistently detect “X answers Y” from textual summaries without reading full state?
  • Value fidelity: Does having the source write Values reduce loss — or introduce source bias that hides caveats the router might have enforced?
  • Depth vs. forgetting: As residual.md grows, how much does a single agent truly “reuse” its own 100K–350K tokens? Will we need local indices, chunking strategies, or retrieval hooks?
  • Adversary dynamics: How often do adversaries overturn claims? What’s the escalation path when proofs deadlock?

For developers, these questions translate into metrics: precision/recall of routing decisions, time-to-proof or time-to-refutation, fraction of inbox Values that get integrated, and regression tests for known theorems or benchmarks.
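Routing precision/recall in particular is straightforward to instrument: log each cycle’s dispatched (source, target) pairs and compare against a hand-labeled set of pairs that should have been made. The labels and worker names below are hypothetical:

```python
def routing_metrics(dispatched: set, relevant: set):
    """Precision: fraction of dispatches that were actually relevant.
    Recall: fraction of relevant pairs the router actually dispatched."""
    true_pos = len(dispatched & relevant)
    precision = true_pos / len(dispatched) if dispatched else 1.0
    recall = true_pos / len(relevant) if relevant else 1.0
    return precision, recall

# Router made 4 dispatches; it caught 3 of 5 labeled-relevant pairs.
dispatched = {("B", "A"), ("C", "D"), ("B", "C"), ("A", "D")}
relevant = {("B", "A"), ("C", "D"), ("B", "C"), ("E", "A"), ("D", "B")}
p, r = routing_metrics(dispatched, relevant)  # p = 0.75, r = 0.6
```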

How to experiment today

For a weekend prototype, a minimal stack could look like this:

  • Workers as LLM calls with a local folder per worker: residual.md, inbox.md, summary.md.
  • A router script that ingests summary.md files, infers dispatch pairs, and posts instructions.
  • A small seed problem (e.g., classifying counterexamples in a toy algebra setting) with a truth oracle for test cases.
  • Optional: plug open models via Hugging Face Hub for cost control; compare against a single-LLM baseline with retrieval-augmented prompts.
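The router script in that stack can start very small. Here the on-disk summary format ("Keys:" / "Queries:" headers, one item per line) and the exact-match pairing are hypothetical stand-ins for whatever format and LLM-based pairing a real prototype would use:

```python
import tempfile
from pathlib import Path

def read_summaries(root: Path) -> dict:
    """Parse each worker's summary.md into keys/queries lists."""
    summaries = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        keys, queries, section = [], [], None
        for line in (folder / "summary.md").read_text().splitlines():
            line = line.strip()
            if line == "Keys:":
                section = keys
            elif line == "Queries:":
                section = queries
            elif line and section is not None:
                section.append(line)
        summaries[folder.name] = {"keys": keys, "queries": queries}
    return summaries

def dispatch(summaries: dict) -> list:
    """Pair sources with targets; a prototype would replace this
    exact-match stub with an LLM pass over the summaries."""
    return [(src, tgt, q)
            for tgt, s in summaries.items() for q in s["queries"]
            for src, t in summaries.items()
            if src != tgt and q in t["keys"]]

def deliver(root: Path, orders) -> None:
    """Source-written Values: append each Value to the target's inbox.md."""
    for src, tgt, value in orders:
        with (root / tgt / "inbox.md").open("a") as f:
            f.write(f"[from {src}] {value}\n")

# Demo with two throwaway workers: B's Key answers A's Query.
root = Path(tempfile.mkdtemp())
for name, body in {"A": "Queries:\ninvariant table\n",
                   "B": "Keys:\ninvariant table\n"}.items():
    (root / name).mkdir()
    (root / name / "summary.md").write_text(body)
    (root / name / "inbox.md").write_text("")
orders = dispatch(read_summaries(root))
deliver(root, orders)
```

Swapping the `dispatch` stub for a model call — and the demo folders for real worker state — is most of the distance to a working prototype.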

In other words, you can trial the ClaudeFormer pattern even if your day job is in PyTorch model training or TensorFlow serving. The orchestration logic is model-agnostic — the promise lives in the information bottleneck and the adversarial feedback loop.

Design principle worth stealing: model the workflow after the architecture you trust. Here, transformer metaphors become agent protocols: attention → routing, MLPs → workers, residual → persistent state.


Bottom line: ClaudeFormer is a thought-provoking proposal to trade raw window size for principled routing and durable memory. It’s also a collaboration call — the proponent brings engineering experience and a math PhD collaborator for verification, and they’re looking for hands. Whether or not this becomes a standard pattern, the core idea is worth pressure-testing: build the workflow that a bigger context would have made possible, then compress it into summaries and dispatches that stay within one model’s limits. If it works, developers get a new lever for complex, multi-day reasoning — without waiting for infinitely large contexts.
