What happens if reinforcement learning gets cheap enough for a much larger set of teams to use seriously?

I’ve been thinking about that question a lot. What first pulled me toward Gradient was the possibility that RL might not have to remain the privilege of a few giant labs. When I first met the team several months ago, I was struck by how seriously they had engaged with one of the hardest infrastructure problems in AI.

Their bet is that RL becomes much more affordable once you stop cramming incompatible workloads onto the same expensive hardware. If that bet is right, the payoff is much bigger than lower cost. It means a much broader set of builders can afford to create AI products that would be out of reach today.

Here’s the story of how that might happen.


Seoul. March 2016. Game two of what was about to become the most-watched Go match in history.

Lee Sedol, a nine-dan professional, South Korea's national hero, widely considered one of the best players of the world's oldest and most complex strategy game, is staring at the board. He can't make sense of what just happened.

Neither can anyone else. One of the announcers finally breaks the silence:

"I think we're seeing an original move here."

An original move. In a game that's been played for 4,000 years. A game where professionals spend decades studying joseki (established sequences refined across millennia, the way chess players study openings). And here's a commentator calmly suggesting that what they're watching is new.

The opponent was not human. It was AlphaGo, Google DeepMind's AI system. And the move (now known as Move 37) was a creative leap so alien that even the engineers who built AlphaGo thought it was a bug. The system estimated a human would play it roughly one in ten thousand times.

It wasn't a bug. It was a move that connected pieces across the board in a pattern invisible to every human who had ever played the game. A strategy so strange, so beautiful, that it changed how professionals think about Go permanently.

AlphaGo won the game. And the match, four to one.

I finally got around to watching the full DeepMind documentary last week. (Highly recommend it.) The whole film is good, but the Move 37 sequence is the part that rewired my brain. Because this was the first time it felt like we were approaching real intelligence. Not mimicry. Not statistical pattern-matching. Something genuinely, irreducibly new.

The engine underneath it was reinforcement learning. AlphaGo went beyond its training data by playing thousands of games against itself, learning by trial and error and reward, until it developed intuitions no human had ever conceived.

That technique, reinforcement learning (RL), is now the single most important process in modern AI.

And it's still locked inside a handful of data centers.

This report is about Gradient Network, and the critical infrastructure they're building to break that pattern. When I first started looking into Gradient, I framed what they're doing as cheaper compute. But as I dug deeper, I realized that's true but incomplete.

Instead, my core insight is that Gradient is enabling a category of AI products that don't exist yet, because the step that would make them possible (RL post-training) is too expensive for anyone outside a handful of frontier labs to seriously attempt. If that cost drops by 30-80%, a whole new market opens up.

We'll build toward that argument. But first, let's understand the shift that made RL the new center of gravity.

The Post-Training Era of AI

Here's a mental model for how modern AI works. Three stages, each doing something different:

Pre-training is where a model gets its foundation. You feed it the internet (books, code, Wikipedia, scientific papers, conversations) and it learns the statistical structure of human knowledge. This is the stage that costs hundreds of millions of dollars and requires thousands of GPUs running for months. It's what gives a model like GPT-5.4 or Claude Opus its raw capability, its general sense of the world.

Post-training closes the loop. You take the model's outputs, evaluate them against a reward signal (human preferences, mathematical correctness, task completion), and feed that evaluation back to refine behavior. Through this cyclical process, models don't just perform. They improve. They get genuinely better at the things that matter. This is reinforcement learning applied after pre-training, and it's why the technique has names like RLHF and GRPO that keep showing up in every frontier lab's release notes.

Inference is where the model does its job. You ask a question, it generates an answer. You give it code to debug, it fixes the bug. Inference is training put into action in real-time contexts.

If pre-training is a model's education and inference is its job, then post-training is its career development. It's what turns a generally capable system into one that's specifically excellent.

And RL has gone from a nice-to-have to the most consequential step in the post-training pipeline.

The evidence is hard to argue with. OpenAI's o1 and o3, the first models that can actually reason through multi-step math and write working code, were shaped by massive RL runs. DeepSeek-R1 showed that clever RL alone could match frontier-level performance, prompting a real reckoning about whether raw scale matters as much as the industry assumed. Every major lab is now investing more in post-training relative to pre-training than they were two years ago.

This matters because the economics of AI are undergoing a structural inversion.

From 2020 to 2024, the dominant strategy was brute-force scaling. Bigger models, more data, predictable improvement along scaling laws. It was an arms race that favored deep pockets (OpenAI, Google, Meta, Anthropic) and left everyone else as spectators.

But the returns on raw pre-training scale are flattening while the costs keep climbing. The next order-of-magnitude improvement in pre-training may cost over a billion dollars.

Look at the compute numbers. From GPT-3 to GPT-4o, pre-training compute climbed steadily while post-training was an afterthought. By DeepSeek-R1, the two lines have nearly converged. Post-training compute now rivals pre-training. The investment is moving.

The new frontier is not going to be who can train the biggest model. It's who can take an already-capable base model and use RL to unlock reasoning and task-specific excellence that raw scale alone can't provide.

If the pre-training era favored whoever had the biggest GPU cluster, the post-training era should favor whoever can run RL most efficiently.

And right now, almost nobody can run it efficiently. That's the deadlock.

The Deadlock

Think of reinforcement learning as training by trial and error at massive scale. You have an AI that tries something, gets feedback on whether it did well or poorly, then adjusts itself slightly. Then it tries again. This happens over and over, sometimes millions of times.

So RL is computationally brutal. And the infrastructure it requires is stuck in the worst possible configuration.

In a typical RL training loop, two phases alternate:

  • Trying things out (sampling): the model generates outputs (solving problems, answering questions, playing out scenarios), and those outputs get scored by a reward function.

  • Learning from what happened (training): the scores flow back to update the model's weights, making it slightly better at the task.

It's how AlphaGo got to Move 37: by playing millions of games against itself and learning from every single one.
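The alternating loop above can be sketched in a few lines. This is an illustrative toy (a scalar "skill" standing in for model weights, random noise standing in for real rewards), not anyone's actual training code:

```python
import random

def sample_rollouts(policy, prompts):
    # Sampling phase: one attempt per prompt, each scored with a "reward".
    # (Illustrative stub -- real rollouts are LLM generations scored by a
    # reward function, not random numbers around a scalar skill.)
    return [(p, policy["skill"] + random.uniform(-1, 1)) for p in prompts]

def train_step(policy, rollouts, lr=0.01):
    # Training phase: nudge the policy toward higher average reward.
    avg_reward = sum(r for _, r in rollouts) / len(rollouts)
    policy["skill"] += lr * avg_reward
    return policy

policy = {"skill": 0.0}
prompts = [f"problem-{i}" for i in range(64)]

# The classic single-cluster loop: the two phases strictly alternate,
# so the same hardware keeps switching between bursty generation and
# sustained gradient math.
for step in range(100):
    rollouts = sample_rollouts(policy, prompts)   # latency-sensitive, irregular
    policy = train_step(policy, rollouts)         # batch-oriented, throughput-hungry
```

Even in this toy, the structural problem is visible: while `sample_rollouts` runs, the training hardware does nothing, and vice versa.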

Sampling is latency-sensitive and irregular. You need fast responses from diverse prompts, and the workload arrives in bursts. 

Training is batch-oriented and throughput-hungry. You're crunching gradients across large, structured datasets in long sustained runs. It's like running a fast-casual lunch counter and a fine-dining tasting menu out of the same kitchen simultaneously. Both involve cooking. But the equipment, the workflow, the rhythm: everything is different.

And yet, almost every RL framework today crams both phases onto the same GPU cluster. The result is constant context switching between two workloads that are fighting each other for resources.

Sampling interrupts training. Training delays inference. GPUs sit idle waiting for the other phase to finish. Utilization drops. Costs rise. And the whole thing is chained to centralized clusters, which are the most expensive and supply-constrained resources in all of technology right now.

This is the deadlock. RL is the most important capability multiplier in modern AI, and the infrastructure to run it at scale is a bottleneck controlled by a handful of cloud providers charging premium rates for hardware that's being used inefficiently by design.

The Obvious Answer (That Doesn't Work)

There are roughly 1.5 billion PCs in the world. Hundreds of millions of gaming GPUs. An entire generation of Apple Silicon machines with unified memory architectures that are shockingly good at inference.

The compute isn't scarce; it's misallocated. Locked inside people's desktops, running screensavers, sitting idle twenty hours a day.

The idea of harnessing this distributed consumer hardware isn't new. Several teams have been building exactly this, incentivizing small-scale GPU suppliers with token rewards. The pitch writes itself.

But for the most part it hasn’t yet worked well enough to matter. 

The internet is not a data center. Three walls stand between "distributed compute exists" and "distributed compute is actually useful for AI."

Wall 1: Latency. Data center GPUs communicate over InfiniBand at 400 Gb/s with microsecond latency. Your gaming rig in Seoul talks to a Mac Mini in Berlin through routers, firewalls, and NAT layers that introduce milliseconds of delay and constant variability. Most AI frameworks assume the network is fast and reliable. The real internet is neither.

Wall 2: Heterogeneity. A centralized cluster is uniform. Every GPU identical, every cable the same length, every driver the same version. A decentralized network is a zoo. An RTX 5090 in one node, an M4 Pro in another, a three-year-old A4000 in a third. Different memory capacities, different compute speeds, different everything. In a synchronous system, the entire pipeline moves at the speed of its slowest node.

Wall 3: Trust. In a Google data center, hardware is trusted by definition. On a permissionless network, nodes are anonymous. What stops someone from returning garbage results to save electricity? What stops a bad actor from poisoning training data? Without verification, the system collapses under the weight of its own openness.

These aren't minor engineering challenges. They're the reason "decentralized AI" has mostly been a better pitch than a product.

Okay, but here's where it gets really interesting.

Look at RL's computational structure one more time. The sampling phase and the training phase don't need to happen on the same hardware. They don't even need to happen at the same time.

Sampling is embarrassingly parallel. You can split a reasoning problem into thousands of independent attempts and scatter them across the globe. They don't need to talk to each other. They don't need to synchronize. Each one just takes the model's current best guess, runs it against a problem, and reports back whether it worked.

That's a fundamentally different workload profile than, say, distributed pre-training, where every GPU needs to stay in lockstep with every other GPU on every gradient update. Distributed pre-training over the open internet is genuinely hard (maybe impossible at frontier scale). Distributed sampling over the open internet is… actually plausible.
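To make "embarrassingly parallel" concrete, here's a toy sketch: every attempt is an independent function call, so the same dispatch pattern works whether the pool is eight local threads or ten thousand machines. The `attempt` stub and its reward formula are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def attempt(problem, seed):
    # One independent rollout attempt. Stub: a real worker would run the
    # current model snapshot against the problem and score the output.
    return {"problem": problem, "seed": seed, "reward": (seed * 7 + 3) % 5}

# No attempt depends on any other: no synchronization, no gradient
# exchange, no lockstep. Scatter, run, collect.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(attempt, "hard-reasoning-problem", s)
               for s in range(1000)]
    results = [f.result() for f in futures]

best = max(results, key=lambda r: r["reward"])
```

Swap the thread pool for machines on the open internet and the structure is unchanged; that is the property distributed pre-training lacks.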

This is the non-obvious part. RL doesn't need 10,000 H100s in a single data center. It needs 10,000 consumer GPUs handling the sampling side, plus one compact datacenter cluster grinding through gradient updates. The wide part runs on cheap, messy, distributed hardware. The dense part runs on premium iron. Each workload gets the hardware it actually needs.

The three walls still apply. You still need to solve latency, heterogeneity, and trust. All three, simultaneously. But RL's natural architecture turns a problem that's impossible in the general case into one that's merely very hard for this specific workload.

Gradient Network is building the infrastructure to exploit exactly this structural opening. The full stack spans peer-to-peer connectivity, distributed inference, distributed reinforcement learning, and verification. Each layer solves a specific wall from the problem above, each building on the one below it.

Section I: The Infrastructure

How Gradient is Building The Open Intelligence Stack

The stack has three core layers. Think of them as the nervous system, the engine, and the brain:

  1. Lattica: the connectivity layer. The nervous system that lets devices find each other, establish connections and move data efficiently across the open internet.

  2. Parallax: the distributed inference engine. The muscle that turns a swarm of heterogeneous consumer devices into a single, high-throughput machine for running large language models.

  3. Echo: the distributed reinforcement learning framework. The brain that ties everything together, decoupling RL into separate inference and training swarms to enable training at datacenter quality on distributed, consumer-grade hardware.

Each layer solves a specific problem. And each one becomes more powerful because of the layers beneath it.

#1 Echo: Distributed Reinforcement Learning

We talked about the RL deadlock: RL's two phases (sampling and training) want opposite things from the hardware, and today they're crammed onto the same expensive cluster, wasting premium GPU time on context switching. 

The structure of the problem suggests its own solution. Cheap distributed hardware for the wide, parallel sampling work, and premium datacenter GPUs for the dense training updates. In theory, this should work.

Echo is Gradient's attempt to make it work in practice.

The core idea is simple. Split the RL loop into two swarms that scale independently.

The inference swarm is a fleet of consumer devices connected through Parallax, Gradient's distributed inference engine. (We'll unpack how Parallax works in the next section.) Their only job is generating rollouts: thousands of practice attempts running in parallel. A GPU in Berlin doesn't need to coordinate with a Mac in Tokyo. Each one takes the model's current snapshot, runs it against different problems, and sends back results.

The training swarm is a smaller cluster of datacenter-grade GPUs (A100s, H100s) dedicated to one thing: taking scored rollouts and updating the model's weights. No rollout generation. No context switching. Just the sustained gradient math that makes the model smarter.

The economic logic makes sense. Rollout generation is the wide, embarrassingly parallel part of RL. It doesn't need premium hardware. Training is the dense part that needs fast interconnects and reliability. Echo offloads the wide part to cheap hardware and reserves the dense part for expensive hardware.

You stop paying H100 rates for work that doesn't need H100 reliability.

Simple idea. But it only works if you solve the catch.

The Catch: Staleness

The training swarm is constantly updating the model. Every batch of processed rollouts produces a slightly different version. Call them snapshots: Version 1, Version 2, Version 3.

But the rollout fleet is scattered across the internet. It can't receive every model update the moment it happens. Sending a fresh copy of a multi-billion-parameter model to hundreds of devices takes real time. So some devices are always working with a slightly older version.

That gap is called staleness. And past a certain point, it's destructive. Rollouts generated from Version 5 might be teaching the model things that Version 12 has already learned, or worse, things that are no longer true about how the model behaves. You end up learning from your own past in a way that's more confusing than helpful.

This is why decoupled RL is hard. It's a synchronization problem with no perfect answer, only a tradeoff between accuracy and speed.

Source: Gradient Blog

Echo offers two modes. Sequential mode waits for each batch of rollouts before updating, trading speed for perfectly clean data. Asynchronous mode never stops: the fleet continuously generates rollouts, tags each with the model version that produced it, and streams them back. The training swarm consumes them as fast as it can, using statistical corrections to adjust for slightly stale data. A lightweight coordinator watches the version gap and triggers a model refresh when it drifts too far.

Source: Gradient Blog

Asynchronous mode is the economic unlock. It keeps expensive training GPUs working continuously instead of sitting idle waiting for rollouts to arrive from across the internet. Both modes sit on top of existing RL tooling (the VERL family used by teams like DeepSeek) and support the algorithms practitioners already use: GRPO, with LoRA for shrinking model update sizes so new versions propagate faster.
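A minimal sketch of how version tagging, staleness down-weighting, and refresh triggering could fit together. This is my reconstruction from the description above, not Gradient's actual code; the class name, decay scheme, and thresholds are invented:

```python
from collections import deque

class AsyncCoordinator:
    """Toy model of asynchronous decoupled RL: rollout workers tag each
    result with the model version that produced it; the trainer
    down-weights stale data and forces a refresh when the gap drifts."""

    def __init__(self, max_version_gap=4, decay=0.8):
        self.trainer_version = 0
        self.max_version_gap = max_version_gap
        self.decay = decay            # per-version-of-staleness down-weighting
        self.queue = deque()
        self.refreshes = 0

    def submit_rollout(self, worker_version, reward):
        # Workers stream results back, tagged with their snapshot version.
        self.queue.append((worker_version, reward))

    def train_step(self):
        # Consume whatever has arrived, weighting stale rollouts less.
        weighted = [r * (self.decay ** (self.trainer_version - v))
                    for v, r in self.queue]
        self.queue.clear()
        self.trainer_version += 1     # every step yields a new snapshot
        return weighted

    def maybe_refresh(self, worker_version):
        # Coordinator: force a model download if the worker is too stale.
        if self.trainer_version - worker_version > self.max_version_gap:
            self.refreshes += 1
            return self.trainer_version
        return worker_version

coord = AsyncCoordinator()
wv = 0
for _ in range(10):
    coord.submit_rollout(wv, reward=1.0)
    coord.train_step()
    wv = coord.maybe_refresh(wv)
```

The key property: `train_step` never blocks waiting for rollouts, which is exactly what keeps the expensive training GPUs saturated.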

Echo-2: What Happens at Scale

Echo proved the architecture. But the Gradient team is already pushing further.

Echo-2 is the latest research from the team, which tackles what happens when you try to run the dual-swarm architecture at production scale.

At small scale, asynchronous mode works beautifully. At large scale, a new enemy appears: the weight distribution bottleneck.

Every time the model improves, the training swarm needs to push that updated version out to every device in the rollout fleet. That's often gigabytes of data going to hundreds or thousands of devices scattered around the world, each with different download speeds. If distribution is slow, everything downstream breaks. Stale rollouts pile up. Training quality degrades. The whole loop stalls.

Echo-2 attacks this with three interlocking solutions.

First, staleness becomes an explicit control knob. Instead of vaguely hoping the version gap stays reasonable, Echo-2 defines a staleness budget: a maximum allowable gap between the model version generating rollouts and the version being trained. Set it tight (budget of 1-2), and you get very fresh data with more coordination overhead. Set it loose (budget of 5-6) and you get more throughput with slightly older data. The research shows training remains stable up to a budget of 6. Only at 11 does learning start to diverge. That's a wide, practical operating range, and it turns a vague problem into a dial that operators can tune based on their specific task.

Source: Gradient Blog

Second, relay-based model distribution. Instead of the training cluster pushing the full model update to every device simultaneously (which chokes on the trainer's upload bandwidth), Echo-2 breaks the update into chunks and organizes the fleet into a relay chain. The first devices to receive a chunk immediately start forwarding it to their neighbors, who forward it to theirs. The more devices in the network, the faster the update propagates, because you're using the fleet's collective bandwidth rather than bottlenecking on a single source. That's a rare scaling property. Most distribution problems get harder as you add nodes. This one gets easier.
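A back-of-envelope model shows why the relay chain wins. Assuming (simplistically) that every worker's uplink matches the trainer's, naive push time grows linearly with fleet size while pipelined relay time stays nearly flat:

```python
def broadcast_time_single_source(model_gb, n_workers, uplink_gbs):
    # Naive push: the trainer uploads the full model to every worker,
    # serialized through its own uplink.
    return model_gb * n_workers / uplink_gbs

def broadcast_time_relay(model_gb, n_workers, uplink_gbs, n_chunks=64):
    # Relay chain: chunks are forwarded worker-to-worker, so total time
    # is roughly one full upload plus the pipeline fill of the remaining
    # (n_workers - 1) chunk hops.
    chunk_gb = model_gb / n_chunks
    return (model_gb + (n_workers - 1) * chunk_gb) / uplink_gbs

# A 60 GB snapshot to 100 workers over a 1.25 GB/s uplink:
naive = broadcast_time_single_source(60, 100, 1.25)   # grows linearly with N
relay = broadcast_time_relay(60, 100, 1.25)           # nearly flat in N
```

With these invented numbers, the naive push takes 80 minutes while the relay chain takes about two, and adding workers barely moves the relay figure at all.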

Third, smart worker selection. Not all devices in the fleet are equal. Some are fast and expensive. Some are slow and cheap. Some are spotty. Given a target throughput and a staleness budget, Echo-2 automatically selects the cheapest combination of workers that can meet the requirement, rather than naively turning everything on and paying for stragglers.
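The selection problem can be approximated greedily: rank devices by cost per unit of throughput and take the cheapest prefix that meets the target. Echo-2's actual optimizer isn't public at this level of detail, so treat this as a sketch of the idea; the fleet numbers are invented:

```python
def select_workers(workers, target_tps):
    # Greedy: cheapest tokens/sec first, stop once the target is met.
    ranked = sorted(workers, key=lambda w: w["cost_per_hr"] / w["tps"])
    chosen, total_tps = [], 0.0
    for w in ranked:
        if total_tps >= target_tps:
            break
        chosen.append(w)
        total_tps += w["tps"]
    if total_tps < target_tps:
        raise ValueError("fleet cannot meet the throughput target")
    return chosen

fleet = [
    {"name": "rtx5090", "tps": 120.0, "cost_per_hr": 0.60},
    {"name": "m4-pro",  "tps": 45.0,  "cost_per_hr": 0.20},
    {"name": "a4000",   "tps": 30.0,  "cost_per_hr": 0.25},
    {"name": "h100",    "tps": 400.0, "cost_per_hr": 3.00},
]
picked = select_workers(fleet, target_tps=150.0)
```

Note what the greedy pass does here: it skips the raw-fastest device (the H100) because its cost per token is worse, which is the whole point of not "naively turning everything on."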

The Results

The numbers come in two tiers, and the distinction matters.

In peer-reviewed research on 4B and 8B parameter models, Echo-2 achieves 33-36% cost reductions relative to centralized baselines while matching or slightly exceeding learning quality. These are real, verified results.

The team has also been running larger-scale internal experiments. In a benchmark training the same 30B parameter model across three platforms (Fireworks, Tinker, and Echo-2), Echo-2 reportedly achieved equivalent performance at roughly one-tenth the cost of comparable runs on commercial cloud providers, translating to 90% cost reductions.

Those numbers haven't been independently verified. But if they hold, the economic argument of the decentralized approach becomes very hard to dismiss.

Source: Gradient Docs

Then there are the results that change how you think about what RL can do.

A 7B model trained with Echo's distributed RL pipeline outperformed the much larger Qwen2.5-32B baseline across six math reasoning benchmarks, averaging a +12% gain (these were released earlier from Echo’s first iteration, pre–Echo-2). A model four times smaller, trained on distributed consumer hardware, beating a model that required datacenter-scale resources.

A 30B model hit 82.2% on the Sokoban planning task, surpassing both DeepSeek-R1 and GPT-OSS-120B. Models with dramatically more parameters, beaten by targeted RL post-training on a smaller base.

And maybe the most visceral result: the team took a tiny Qwen3-0.6B model, small enough to run on a phone, and put it in a No-Limit Texas Hold'em game against an LLM opponent. It was losing badly, hemorrhaging chips at -1.677 per hand. After training with Echo-2's distributed RL pipeline, that same model flipped to a +1.245 chip gain per hand. From incompetent to competitive through RL alone, on an adversarial, multi-step strategic task where you can't bluff your way through with pattern-matching.

(This next part: my worldview might turn out wrong, but I think it's the most important takeaway from these results.)

The AI world has been fixated on who can build the biggest model. These numbers suggest a different question matters more: who can make a small model excellent at a specific thing?

Targeted RL post-training on modest hardware is producing results that used to require orders of magnitude more compute. And the implications ripple outward. Right now, thousands of teams building vertical AI products (legal tech, financial agents, robotics controllers) are prompt-engineering their way around the limitations of general-purpose models. They know a model fine-tuned on their exact workflow would work better. They just can't afford the RL compute to do it.

In practical terms: a startup that currently spends $50K on a single RL post-training run could get equivalent results for $10-15K. At that price point, RL stops being something only frontier labs can afford and starts being something any well-funded vertical AI company can budget for quarterly.

But only if access is seamless. The gap between "this architecture works" and "anyone can use it" is the gap between a research result and a business. That's the gap Gradient is now trying to close.

What Makes the Inference Swarm Work

Echo's architecture depends on a fleet of consumer devices generating rollouts reliably and efficiently. That sounds straightforward until you remember what "consumer devices on the open internet" actually means: machines behind firewalls that can't accept incoming connections, a zoo of different hardware with different speeds and memory, and no guarantee that any given node stays online.

Two layers of infrastructure sit beneath Echo and handle this. Lattica solves connectivity. Parallax solves coordination.

#2 Lattica: The Connectivity Layer

The first problem is deceptively basic: can consumer devices scattered across the globe actually find each other and connect reliably enough to form a usable network?

Most consumer devices can't talk to each other directly.

Your laptop, your gaming rig, your Mac Mini… they sit behind firewalls and NATs (network address translators, basically digital bouncers that block unsolicited incoming traffic). Lattica is Gradient's peer-to-peer connectivity layer. It punches through those firewalls to establish direct device-to-device connections, falls back to relay nodes when it can't, and moves data between them efficiently. Everything encrypted, optimized for AI-specific payloads: model weights, intermediate states, training updates.

But in Echo’s world, “connectivity” isn’t the real point. Dissemination is. It’s one thing to get two devices to handshake; it’s another to keep hundreds (or thousands) of distributed rollout workers supplied with fresh policy snapshots without turning training into a bandwidth bottleneck.

This is where Lattica becomes essential infrastructure. Echo needs a way to push large, frequent updates (policy checkpoints, deltas, weights) across a messy consumer internet where most devices sit behind NATs, links are heterogeneous, and direct paths fail constantly. Lattica provides the underlying peer-to-peer transport + relay fabric that makes those devices reachable in the first place, and then treats model distribution as a first-class workload rather than an afterthought.

Without that fabric, a naive design collapses into centralization: every worker pulls from a single server or object store, and the learner’s uplink becomes the choke point. With Lattica, distribution can be peer-assisted and pipelined: once a node receives a snapshot (or chunk), it can forward it onward through available peers and relay routes, so the fleet’s aggregate bandwidth helps propagate updates and reduce tail latency. 

In distributed RL, that dissemination latency directly determines how stale rollout policies become and whether the learner stays utilized.

#3 Parallax: Making the Swarm Useful for Inference

Lattica answers: Can these machines talk to each other?

Parallax answers the harder question: can a messy collection of consumer devices (different architectures, different memory sizes, different speeds) behave like a single, high-throughput inference engine?

Hardware is abundant. Coordination is scarce. And coordination is where most previous attempts at decentralized inference have broken down.

To run a large model (say, 72B parameters) on consumer hardware, you have to split it across multiple machines. No single consumer device can hold the whole thing. The standard datacenter approach splits each individual operation across identical GPUs and stitches results back together constantly. That requires ultra-fast connections between machines. Over the public internet, it falls apart.

Parallax takes a different approach. Instead of splitting operations, it splits the model into sequential chunks of layers. Device A runs early layers, Device B runs middle layers, Device C runs later layers. A request flows through the pipeline step by step. Only a small intermediate state gets passed between stages, which ordinary broadband can handle. The result: inference that actually works over real internet connections, not just between racks in the same building.

But splitting a model across devices creates its own problem. If you divide the work evenly and one machine is slower than the rest, the entire pipeline runs at the speed of its weakest node. Your best GPU finishes instantly and sits idle, waiting. That straggler effect is what kills throughput in heterogeneous networks.

Parallax treats this as the core problem to solve. When devices join the swarm, Parallax profiles each one: compute speed, memory, network quality. It assigns layer slices proportionally. Stronger devices take more layers. Weaker ones take fewer. The goal is to make each stage take roughly the same time so nothing bottlenecks. On every request, it routes through the best available path in real time. If a node slows down or drops offline, Parallax reroutes instead of stalling.
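A speed-proportional partitioner is easy to sketch. This is an assumption about the shape of Parallax's placement logic based on the description above, not its real implementation; the `layers_per_sec` figures are invented:

```python
def assign_layers(n_layers, devices):
    # Give each device a slice of layers proportional to its measured
    # speed, so every pipeline stage takes roughly the same wall-clock
    # time and no single node becomes the straggler.
    total_speed = sum(d["layers_per_sec"] for d in devices)
    assignments, start = [], 0
    for i, d in enumerate(devices):
        if i == len(devices) - 1:
            count = n_layers - start          # last device absorbs rounding
        else:
            count = round(n_layers * d["layers_per_sec"] / total_speed)
        assignments.append((d["name"], start, start + count))
        start += count
    return assignments

devices = [
    {"name": "rtx5090", "layers_per_sec": 40.0},
    {"name": "m4-pro",  "layers_per_sec": 20.0},
    {"name": "a4000",   "layers_per_sec": 20.0},
]
plan = assign_layers(80, devices)
```

The twice-as-fast GPU gets twice the layers, so each stage finishes in about the same time; that equalization, not raw speed, is what keeps pipeline throughput high.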

This is where Lattica's topology data compounds. Parallax doesn't guess at network quality. It knows, because Lattica has been continuously mapping which connections are reliable and which aren't.

One strategic detail worth flagging: Parallax isn't CUDA-only. Most distributed inference systems are built exclusively for NVIDIA hardware. Parallax runs a single inference pipeline that can include both NVIDIA GPUs and Apple Silicon devices as first-class participants. For a system whose value depends on utilizing the long tail of underused hardware, being architecture-agnostic isn't a feature. It's the whole point. Every Mac Mini sitting in a research lab or home office becomes part of the usable supply pool.

How well does it actually perform?

In benchmarks against Petals (the previous baseline for decentralized inference), Parallax is meaningfully better on the metrics that shape user experience.

In a larger heterogeneous deployment under real internet conditions, Parallax sustains roughly 495 tokens/second at around 200 concurrent users. That tells you this isn't a two-node lab demo.

No, it doesn't match a dedicated H100 cluster. It's not supposed to. It's turning hardware people already own into usable inference capacity at a fraction of the cost.

In practical terms: if your team has a handful of consumer GPUs or Mac Minis, Parallax can turn them into a working inference cluster. It won't compete with a datacenter on raw speed. But for teams whose alternative is paying API rates for every inference call, the economics shift significantly.

And in the context of the full stack, this is the layer that makes Echo possible. Every rollout Echo generates during distributed RL training runs through Parallax. The inference throughput that Parallax extracts from consumer hardware directly determines how fast and how cheaply Echo can run its exploration phase.

The layers compound. Lattica provides connectivity and topology data. Parallax uses that data to coordinate heterogeneous devices into fast inference. Echo uses that inference capacity to run distributed RL at datacenter quality. Each layer is useful on its own. Together, they're more than the sum.

What Gradient Is Actually Shipping Today

Before the demand case, an honest accounting of where Gradient actually is.

Training and inference have been Gradient's focus, and a couple of things are live or in progress:

Parallax is now open-source. You can pull it from GitHub and self-host a distributed inference swarm across your own machines today. This is the most tangible product. Developers can use it, benchmark it, break it, and contribute back. It's been recognized by leading open-source model makers: Qwen, Kimi, Z.ai, MiniMax, LMSYS, and others.

Echo is being productized. The architecture works in research settings. The next step is turning it into a platform with real service-level agreements and seamless onboarding. The team is also exploring RL-as-a-Service: a distributed RL platform offering enterprise-grade GPUs for stable policy updates and decentralized consumer GPUs (via Parallax) for the inference-heavy exploration phase.

That's the product surface. I also want to be direct about what stage this represents.

Gradient is pre-product-market-fit. The stack is being proven, not fully commoditized. The near-term audience won’t be enterprise procurement teams. 

It'll be at the fringes: open-source model builders, RL researchers, and infrastructure engineers who know what "heterogeneity + trust" actually implies when you try to build on it. If Gradient can't win this crowd first, nothing else matters.

This is why the emphasis lands on papers, open-sourcing key pieces, and demonstrating that the architecture works under realistic conditions. And on that front, the output is hard to ignore.

The Research Portfolio

Beyond the core stack (Lattica, Parallax, Echo), Gradient has been publishing across the full surface area of what open, distributed AI requires to function. To be honest, the breadth is unusual for a seed-stage team.

Collective reasoning: A multi-LLM orchestration framework where diverse models debate, vote, and synthesize answers across multiple rounds. An ensemble of Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4 hit 87.4% on GPQA-Diamond and 88.0% on IFEval, outperforming every individual model. The insight isn't "more models = better." It's that structured disagreement among models produces outputs that none of them contain alone. For a system built on coordinating many devices, this research extends the coordination thesis from hardware to intelligence itself.

Trust and verification: VeriLLM is a protocol that makes decentralized inference publicly verifiable with roughly 1% overhead. This directly addresses the third wall (trust) from the problem setup. Without verification, permissionless inference networks can't be trusted. VeriLLM makes the cost of adding trust negligible.

Memory and persistence: SEDM (Self-Evolving Distributed Memory) tackles the fact that most AI systems are stateless: they forget everything between sessions. The research introduces a memory architecture with admission, scheduling, and diffusion mechanisms that let agents learn continuously, prune outdated knowledge, and transfer patterns across domains. If you want agents that get better over time rather than starting from zero every conversation, you need something like this.

Multi-agent coordination: Symphony provides the orchestration layer: a decentralized framework where lightweight LLMs on consumer-grade GPUs coordinate through dynamic task allocation and weighted voting.

Each paper targets a real gap. Collective reasoning makes distributed inference smarter. VeriLLM makes it trustworthy. SEDM makes it persistent. Symphony makes it coordinated. Together with the core stack, they outline a system where distributed AI is not only cheaper; it's capable of things centralized systems can't easily replicate.

Six peer-reviewed papers (and counting) across the full intelligence stack, all on a $10M seed round. Whatever you think about the market timing, the research-output-per-dollar ratio is exceptional.

Section II: The Business

Where the Demand Shows Up

The demand case for distributed RL-as-a-service is straightforward: RL works, everyone knows it works, and very few can afford to use it as aggressively as they want to. If Echo drops the cost by 33-80%, new categories of users become viable.

Here's how I'd sequence the likely demand, from nearest to furthest:

First wedge: open-source teams doing post-training. The cleanest entry point because the pain is immediate.

You can download a strong base model today, but making it reason better through RL is still expensive. Post-training workflows are rollout-hungry. They don't need perfection on day one. They need volume. If Echo becomes "rollouts on tap," these teams are the first to pay because it compresses their iteration cycle from weeks to days.

And the teams with real budgets are easy to identify. Labs shipping competitive open-source models (Nous Research, MiniMax) are already spending serious money on post-training runs. Their entire value proposition depends on closing the gap between open-source base models and frontier closed models, and RL post-training is increasingly how that gap gets closed.

DeepSeek showed the playbook. Every serious open-source lab is now trying to run some version of it. If Echo cuts their rollout costs by 33-80%, the ROI is immediate and measurable against existing spend.

Then there's a tier that's currently priced out of RL entirely: fine-tuning shops and model customization companies that specialize in taking base models and post-training them for specific verticals (medical, legal, finance). These teams do supervised fine-tuning because it's cheaper, even though they know RL would produce better results. Echo doesn't just save them money on what they're already doing. It unlocks a capability tier they currently can't access. That's a different kind of demand.. not cost reduction, but capability expansion.

Both groups are cost-sensitive by nature. Open-source culture resists paying for infrastructure. But the ones who convert are the ones where RL compute is already a line item (or would be, if it were affordable). The wedge is making existing spend cheaper for the first group and making previously impossible spend viable for the second. Classic developer-infrastructure pattern!

Second: agent builders. Most training runs are episodic. You train, you ship, you move on. Agents don't work that way. Agents live in loops.

Consider a company like Cognition, whose coding agent Devin needs to keep getting better at the specific tasks where it fails. Every batch of production failures becomes a training set. The builder collects those cases, runs an RL campaign with thousands of rollouts against those failure modes, updates the weights, redeploys. Then does it again next week. That's a recurring compute bill that scales with the agent's usage.

Rollout generation is especially expensive for agents because each rollout is long: dozens of tool calls, web interactions, or multi-step reasoning chains, not a one-shot text completion. The cost per rollout is higher, which means the savings from distributing that generation across cheap hardware are proportionally larger. And the use cases stack: reward model training (teaching a model to score agent actions in context) is itself rollout-hungry; self-play and adversarial testing map directly onto distributed sampling. Each is a recurring job, not a one-time expense.
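A back-of-envelope calculation makes the asymmetry concrete. All the numbers below are hypothetical — token counts and per-token prices chosen only to illustrate the shape of the argument, not measured figures:

```python
# Back-of-envelope: why long agent rollouts amplify the savings.
# Every number here is hypothetical and purely illustrative.

tokens_per_chat_rollout = 2_000      # one-shot text completion
tokens_per_agent_rollout = 60_000    # dozens of tool calls + reasoning
rollouts_per_campaign = 10_000

centralized_cost_per_mtok = 10.0     # $/M tokens, datacenter pricing
distributed_cost_per_mtok = 4.0      # $/M tokens, distributed pricing

def campaign_cost(tokens_per_rollout, cost_per_mtok):
    """Total cost of one RL campaign at a given per-token price."""
    return rollouts_per_campaign * tokens_per_rollout / 1e6 * cost_per_mtok

for name, tok in [("chat", tokens_per_chat_rollout),
                  ("agent", tokens_per_agent_rollout)]:
    central = campaign_cost(tok, centralized_cost_per_mtok)
    dist = campaign_cost(tok, distributed_cost_per_mtok)
    print(f"{name}: ${central:,.0f} -> ${dist:,.0f} "
          f"(saves ${central - dist:,.0f})")
```

The percentage discount is the same in both rows, but the absolute dollars saved per campaign scale with rollout length — and agent campaigns recur weekly, so the savings compound.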

To be honest, most agent builders today aren't running truly continuous RL. They're running periodic improvement cycles, weekly or biweekly. If Echo makes each cycle cheaper, teams run them more often, and agent improvement compounds faster.

This is the demand sector I'd watch most closely. Coding agents (Cognition, Factory AI), customer experience agents (Sierra), open-source agent projects (SWE-agent, OpenHands): they're all burning compute on exactly the workload Echo is designed to make cheaper.

Third: scientific research and simulation-heavy RL. RL has already produced breakthroughs in protein folding, materials science, mathematical optimization, and fusion reactor control, but only at labs with massive compute budgets.

A university team studying drug interactions or climate modeling faces the same RL loop (simulate, score, update, repeat) as DeepMind, just without the cluster. If Echo makes that loop affordable on distributed hardware, it expands who gets to do science. Robotics fits the same pattern: fundamentally trial-and-error, massive simulation requirements. Strong demand sector, but probably a later one since these users need more mature tooling and support.

One longer-term angle worth noting separately: organizations that can't send data to third-party clouds (healthcare, finance, defense) could deploy Parallax across their own on-premise hardware for sovereign inference and RL. That's a self-hosted enterprise play rather than a distributed consumer hardware play, and it requires a different go-to-market motion entirely. But the underlying technology transfers, and it's a market where willingness to pay is high.

The through-line: RL is too expensive for most of the people who would benefit from it. Drop the cost, and demand doesn't need to be created. It already exists. It's just priced out.

The Founders Behind The Vision

If you'd asked me to design a founding team for "distributed AI infrastructure that actually has to work," I'm not sure I could improve on what Gradient assembled.

Yuan Gao came up through the blockchain infrastructure world, first at Neo Blockchain, then at Helium starting in 2019, right as Helium was attempting something that sounded ridiculous: build a global wireless network by convincing regular people to plug hotspots into their homes and paying them in tokens for the coverage.

Yuan spent three years as Head of Growth running Asia, covering manufacturer relationships, the open-source builder community, exchange integrations, and the full operational playbook for turning a crypto experiment into legitimate telecommunications infrastructure.

You know how that story ends. Helium now operates 120K+ active mobile hotspots in the US and Mexico and offloads data for AT&T. It proved to an entire industry that token incentives could bootstrap physical infrastructure at a scale that would've taken a traditional telco decades and billions of dollars. Yuan helped make that happen.

Gradient Co-Founder, Yuan Gao

Eric Yang took a different path to the same destination. He studied computer science at UC Berkeley, the university behind many of the most important distributed compute projects in AI (SkyPilot, vLLM, Ray, among others).

He then became founding engineer at DLive, a blockchain-based streaming platform that signed an exclusive streaming deal with PewDiePie and grew to seven million active users. BitTorrent eventually acquired it. Eric built the decentralized infrastructure that handled real traffic at scale, the kind of system where uptime isn't academic.

After DLive, he crossed to the investment side at HSG (formerly Sequoia Capital China), spending time funding frontier tech startups and deploying capital across seed and venture stages.

Most founders have either the experience of building or the pattern-matching from investing. Eric has both.

The team around them reflects the same bias: heavy on research engineering, light on marketing. They describe themselves as "pretty techie" and "pretty shy," the kind of team that ships papers before press releases. Six peer-reviewed publications across the full intelligence stack on a seed budget tells you something about where the headcount goes.

Fundraising

In June 2025, Gradient closed a $10 million seed round led by Pantera Capital and Multicoin Capital, with participation from HSG (formerly Sequoia Capital China) and a group of angels spanning AI and crypto.

The investor list is worth reading carefully. Multicoin was an early backer of both Solana and Helium, they've been investing in the DePIN thesis since before it had a name, and they backed the specific project where Yuan learned the playbook.

$10 million isn't a lot by AI infrastructure standards. But the fact that Gradient is publishing peer-reviewed research, maintaining active collaborations with major model labs, and shipping working products across inference, RL, and more, all on a seed round, says something about the ratio of engineering output to capital consumed.

Section III: Our Thesis

The AI market has spent the last four years asking one question: who can build the biggest model? The assumption underneath it is that intelligence scales with size. More parameters, more data, more compute, more capability.

Echo's results suggest a different question matters more: who can make a small model excellent at a specific thing?

We covered the benchmarks in the Echo section, but step back and think about what they imply. A 7B model beating a 32B model on math reasoning. A 0.6B model learning strategic play from scratch. They're evidence that targeted RL post-training on a modest base can produce results that used to require orders of magnitude more parameters.

And the timing matters because the AI industry is shifting from general-purpose chatbots to vertical products that need to be very good at one thing. A legal tech company doesn't need GPT-5. It needs a 7B model that's exceptional at contract analysis. A robotics startup doesn't need a trillion-parameter foundation model. It needs a lightweight policy model shaped by millions of RL episodes in its specific environment.

Right now, most of these teams are prompt-engineering around the limitations of general models. They know that a model fine-tuned on their exact workflow, trained with RL on their specific reward signal, would perform significantly better. They just can't afford it. The compute cost of running serious RL post-training is still gated by datacenter cluster pricing, which puts it out of reach for anyone except the largest labs and best-funded startups.

If Echo reduces that cost by 30-80% (the range between published and unpublished results), it enables a category of AI products that currently don't exist because the post-training step is too expensive to attempt.

That's a much larger market than cheaper inference. It's every vertical AI company that would build differently if specialized training were accessible.

For investors, the mental model we'd use is this: the market is pricing Gradient (and projects like it) as a ‘cheaper compute’ play. But the real option value is in enabling a new class of AI products. If small-model-plus-targeted-RL becomes the dominant pattern for vertical AI (and the evidence increasingly points that direction), then whoever controls the cheapest RL pipeline captures a toll on an enormous long tail of products. That's a very different valuation framework than "discount cloud."

For builders, the implication is more immediate: if you're building a vertical AI product and your current approach is prompt engineering plus API calls to a frontier model, you should be tracking when RL-as-a-service becomes cheap and reliable enough to switch. The teams that move first to owning a specialized model (rather than renting a general one) will have a structural advantage that's hard for competitors to replicate, because the model gets better with use and the improvement compounds.

A counter-argument worth taking seriously

The biggest pushback on our thesis is that frontier model providers won't sit still. OpenAI, Google, and Anthropic are all investing heavily in making their own models more customizable through fine-tuning APIs, system prompts, and tool-use frameworks. If frontier APIs become sufficiently customizable that prompt engineering plus a good system prompt gets you 90% of what targeted RL would deliver, the incentive to run your own RL post-training drops significantly. The "build vs. rent" calculus only favors building if the gap between general and specialized is wide enough to justify the effort.

Our read: the gap is wide and getting wider, especially for tasks that involve tool use, long-horizon planning, and domain-specific reasoning where generic models plateau. But this is the key assumption to monitor.

What I'm Watching

Research credibility and product-market fit are different animals. Here's what would move our conviction in either direction, with rough time horizons.

  • Over the next 90 days: the first external RL workflow on Echo. An actual external team that runs a production RL workflow on Echo, gets results they couldn't afford on centralized infrastructure, and comes back for more.

  • Over the next 180 days: scale beyond the published sweet spot. If the team publishes peer-reviewed results at 70B+ scale with cost savings above 50%, the architecture's credibility extends to the model sizes that enterprise customers actually care about.

  • Over the next 12 months: developer adoption velocity. Infrastructure companies don't win on papers. They win on adoption. Gradient needs GitHub stars and paying API customers growing at a rate that suggests the developer community treats this as real infrastructure.

The lock was never the law

Every self-play game AlphaGo ran, every reward signal, every tiny improvement that eventually produced Move 37, all of it lived inside one Google data center. One company. One cluster. One permission set.

That was 2016. Since then, the method behind it has turned into maybe the defining training loop in modern AI. And the access pattern has barely moved. To run RL at real scale, you still need a cluster. To get a cluster, you still need absurd amounts of capital and access that only a small circle of labs can get. Different names hold the keys now. Same lock.

Gradient's bet is that this lock is chosen, not inevitable. RL has a two-phase shape, which means the most expensive part and the most parallel part do not have to live in the same place. That sounds obvious once you say it out loud, but the system implication is huge. The private results suggest it may work better than people assume. I’m not treating that as settled. But I also don’t think it can be waved away anymore.

The risks are still real. Above 30B, this is unproven. Benchmark wins are not product wins. And Gradient still has to survive the ugliest transition in startups, which is going from technically credible to commercially unavoidable.

But I think the center of gravity has shifted. Until recently, distributed RL matching centralized baselines was an open question. Now it’s a demonstrated one. A 7B model trained on consumer hardware beating models four times larger is not a vibes-based argument. It’s a published result.

So that’s where I land. Whether Gradient wins from here is execution. Whether this architecture matters is basically empirical now. And the empirics have moved.

Thanks for reading,

Teng Yan and 0xAce

Disclosure: This essay was supported by Gradient, which funded the research and writing.

Chain of Thought kept full editorial control. The sponsor was permitted to review the draft only for factual accuracy and confidential information. All insights and analysis reflect Chain of Thought’s independent views. Where tradeoffs or limitations exist, they are stated clearly.
