Big Idea #5: Decentralized Training

The moonshot is real. What’s working, what’s missing, and what comes next.

Let’s begin with our view:

Decentralized training is the most ambitious moonshot in AI x Crypto right now.

It challenges the assumption that only a few well-funded labs can build and control large models.

If it works, it weaves cryptography and blockchains directly into the foundation of the AI stack. At that point, the rest of the world has to pay attention.

We will explore 2 core ideas here:

  1. How large AI models can be trained across decentralized networks, and why that matters

  2. Tokenization of AI models

We’re moving into a phase where you don’t just use an AI model. You can help train it. You can own a piece of it. On-chain, with others.

The chart above shows several decentralized training runs, each using different datasets and with different goals. It’s not a direct comparison (different model types and setups), but the overall trend is clear.

Model size is scaling steadily, and the curve is moving up and to the right.

Small note: Most of them still rely on whitelisted contributors, so they aren’t fully open or permissionless yet.

Part I: Decentralized Training

The core idea is simple: build frontier-scale models without relying on centralized infrastructure.

Instead of routing everything through a single, trusted compute cluster, training is distributed across a permissionless network, where coordination, communication, and trust become first-class problems.

Sam Lehman from Symbolic Capital makes the distinction clearly in his article on decentralized training:

“Truly decentralized training is training that can be done by non-trusting parties.”

So… “non-trusting parties” is the really important and complicated bit.

In true decentralized training, any node can join the training run. It doesn’t matter whether it’s a rack in a data center or a single GPU in your home basement.
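To make the trust problem concrete, here is a minimal sketch of one data-parallel round among untrusting contributors. All names and the toy “gradient” are illustrative assumptions, not any project’s actual protocol; the point is simply that the aggregator cannot blindly average whatever nodes submit, so it uses a coordinate-wise median that a single malicious node cannot skew.

```python
# Minimal sketch of one data-parallel round over untrusted nodes (illustrative only).
# Each node trains on its own data shard and submits an update; the aggregator uses a
# coordinate-wise median instead of a plain mean, so one dishonest node cannot poison
# the shared model.
import numpy as np

def local_gradient(weights: np.ndarray, shard_seed: int) -> np.ndarray:
    """Stand-in for a real backward pass on a node's local data shard."""
    rng = np.random.default_rng(shard_seed)
    return weights * 0.01 + rng.normal(scale=0.1, size=weights.shape)

def robust_aggregate(updates: list[np.ndarray]) -> np.ndarray:
    """Coordinate-wise median: tolerant to a minority of malicious updates."""
    return np.median(np.stack(updates), axis=0)

weights = np.zeros(8)
for step in range(100):
    # Any node can join: here, 5 honest nodes plus 1 adversarial one.
    updates = [local_gradient(weights, shard_seed=n) for n in range(5)]
    updates.append(np.full_like(weights, 1e6))   # poisoned update
    weights -= 0.1 * robust_aggregate(updates)   # the median ignores the outlier
```

A plain mean would be dragged far off course by the single poisoned update; the median is one of the simplest ways to keep “non-trusting parties” from breaking the run, though real networks layer on far more (verification, staking, reputation).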

A recent surge of research and builder energy is pushing the limits of what decentralized training can achieve.

What Has Happened in the Past 3 Months

  • Nous Research pre-trained a 15B parameter model in a distributed fashion and is now training a 40B model.

  • Prime Intellect fine-tuned a 32B Qwen base model over a distributed mesh, outperforming its Qwen baseline on math and code.

  • Templar trained a 1.2B model from scratch using token rewards. Early-stage loss was consistently lower than centralized baselines.

  • Pluralis showed that low-bandwidth, model-parallel training (once thought impossible) is actually quite feasible.

These wins remind us that decentralized training is no longer a thought experiment.

So far, progress has clustered in the 10 to 40 billion parameter range, which suggests we are hitting the limits of what data parallelism can efficiently achieve over open, decentralized networks.

Scaling beyond this range, toward 100B or 1T+ parameter models trained from scratch, will likely depend on model parallelism, which brings challenges that are an order of magnitude harder.
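A rough back-of-the-envelope sketch shows why. The numbers below are illustrative assumptions, not measurements from any of these projects: with data parallelism, the gradient exchange can be made infrequent (syncing every few hundred local steps, as in DiLoCo-style methods) and overlapped with compute, while with model parallelism activation traffic sits on the critical path of every single step.

```python
# Rough, assumed numbers (not measurements) comparing per-step communication for
# data-parallel vs. model-parallel training over a consumer internet link.
PARAMS       = 40e9     # 40B-parameter model
BYTES        = 2        # bf16
LOCAL_STEPS  = 500      # DiLoCo-style: sync pseudo-gradients every 500 local steps
BATCH_TOKENS = 1e6      # tokens processed per optimizer step
HIDDEN       = 8192     # hidden dimension at a model-parallel cut point
WAN_GBPS     = 1        # 1 Gbit/s connection

# Data parallel: one gradient-sized exchange, amortized over many local steps,
# and it can overlap with ongoing compute.
dp_gb_per_step = PARAMS * BYTES / LOCAL_STEPS / 1e9

# Model parallel: activations cross the cut on every forward pass and activation
# gradients on every backward pass, and the next stage must wait for them.
mp_gb_per_step = BATCH_TOKENS * HIDDEN * BYTES * 2 / 1e9

for name, gb in [("data parallel ", dp_gb_per_step), ("model parallel", mp_gb_per_step)]:
    seconds = gb * 8 / WAN_GBPS
    print(f"{name}: {gb:6.2f} GB per step -> {seconds:7.1f} s over a 1 Gbit/s link")
```

Under these assumptions the data-parallel exchange costs a second or two per step and can be hidden, while the model-parallel traffic costs minutes and stalls the pipeline, which is why low-bandwidth model parallelism (the problem Pluralis is attacking) was long considered out of reach.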

To understand what’s holding back larger runs, we need to unpack the three main constraints and how parallelism is used.

The Holy Trinity of Decentralized Training

To scale, decentralized networks need to solve what we call the “holy trinity” of design constraints:
