I first started writing this thesis in June-July 2025. This piece now reflects the state of play in November 2025.
Robotics is the next multi-trillion-dollar market hiding in plain sight.
It’s the highest-upside, most underpriced leg of the AI trade. And it’s coming faster than anyone expects. I expect the inflection point to begin in 2026. Most people will miss it until it's too late.
Founders working inside the field keep telling me the same thing: the tech is moving far faster than it looks from the outside. The technical hurdles are real, but solving them is a matter of time, not a question of possibility.
And when the constraints fall, the surface area of the market explodes.
How is this going to play out?
Earlier this month, I saw the 1X home robot making the rounds on X. It ships in 2026. The teaser looks very cool, until you notice that even simple tasks still require a company employee to teleoperate the robot.
Well, I don’t think I can ever get comfortable with a stranger peering into my home. It feels eerie.
If we want to arrive at humanoid robots that actually matter, we need to get past tele-op and into self-improving machines. After some thinking, I believe the roadmap will unfold in three main phases, closely aligned with robotics’ data needs.
Phase 1: Narrow Robots (2025-2027)
The current state of the art in robotics relies on well-defined state-space and inverse kinematics/dynamics models. In plain English, robots today require precise models of the environment with pre-mapped surfaces and fixed lighting to operate.
Even slight changes, like a cup placed two inches off or a shadow across a sensor, can throw the whole robot off.
And so over the next 12-24 months, we'll see humanoid robots deployed at an increasing pace for narrow, well-defined tasks: carrying boxes in an Amazon warehouse, making coffee, doing household chores at maybe a 60% success rate (and often failing, but the novelty of having a robot at home will appeal to early adopters).

It's hard to justify the economics for non-industrial use cases. A humanoid costs upwards of $13K (Unitree G1) and can only lift 2kg, which makes it basically a very expensive cosplayer. A real human is still cheaper and far more capable.
But a price that high will not stop the early adopters. Researchers and tinkerers will buy them anyway, poke at the limits, and try to figure out what these machines are actually good for.
Specialised robots will continue to scale and succeed, as the environments they operate in are heavily constrained.
Phase 2: Data Flywheel Hits Full Velocity (Late 2026 - 2028+)
This is the moment the system starts learning in earnest. The flywheel forms:
More real-world data → more edge cases → more simulation data → fewer failures and better capabilities → bigger rollout → even more data
The real unlock is deployment. We need many robots in the wild to generate the volume and variety of interactions that move us toward general intelligence. The first 100,000+ humanoid robots do not need to be very good at anything yet. They just need to exist, bump into the world, make mistakes, and try again. That is when the data starts compounding.

We are starting from almost zero. Only a tiny number of humanoids have been deployed to date, which means the upside is huge. Every failure becomes training data. Over time, the system starts to teach itself. Reinforcement learning from human feedback, sim-to-real training, and other adaptive tools become routine.
Once the loop tightens (deploy, observe, update, repeat), the flywheel accelerates. Each new deployment improves the model. Each improvement unlocks new work. Learning becomes continuous.
At the same time, we'll be supplementing this with two data streams:
Imitation data, especially video data. People will get paid to wear cameras on the job, lifting boxes, folding laundry, harvesting crops. Those recordings turn into training data.
Simulation data that multiplies the value of all the real-world data collected. (more on this later)
Robotics datasets are still tiny compared to the mountain of text used to train language models. I hear some founders throw around “a billion hours of robot video” as the magic number for us to reach the ChatGPT moment in robotics. It sounds neat, but it misses the point. The real goal is to collect enough high-quality data for the flywheel to catch.
Phase 3: Rapid task expansion (2028 onwards)
Here we start to see the payoffs as the data flywheel spins faster and faster.
By this point, the robot's foundation model becomes robust enough that a new task like "make a sandwich" is simply a composition of pre-learned skills ("locate bread," "pick up knife," "unscrew jar") and high-level reasoning.
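To make that composition concrete, here is a toy Python sketch of how a high-level reasoner might chain pre-learned skills. The skill names, planner, and interfaces are all illustrative, not any particular company's stack:

```python
# Toy sketch: a high-level planner composing pre-learned skills.
# Skill names and interfaces are illustrative, not any vendor's API.
from typing import Callable, Dict, List

SkillFn = Callable[[], bool]   # returns True if the skill succeeded

class SkillLibrary:
    """Registry of low-level skills the robot's foundation model already knows."""
    def __init__(self) -> None:
        self._skills: Dict[str, SkillFn] = {}

    def register(self, name: str, fn: SkillFn) -> None:
        self._skills[name] = fn

    def run(self, name: str) -> bool:
        return self._skills[name]()

def plan(task: str) -> List[str]:
    """Stand-in for the high-level reasoner: maps a task to a skill sequence."""
    return {
        "make a sandwich": ["locate bread", "pick up knife", "unscrew jar", "spread", "assemble"],
    }.get(task, [])

def execute(task: str, library: SkillLibrary) -> bool:
    for step in plan(task):
        if not library.run(step):      # a real system would retry or replan here
            return False
    return True

lib = SkillLibrary()
for skill in ["locate bread", "pick up knife", "unscrew jar", "spread", "assemble"]:
    lib.register(skill, lambda: True)  # dummy skills that always "succeed"
print(execute("make a sandwich", lib)) # True
```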
This is where the robot actually starts being a co-worker.
The key metric here is Zero-Shot Generalization + Few-Shot Adaptation.
Instead of needing 1000 hours of training for "make a sandwich," it only needs 5 minutes of human demonstration or a few text prompts. That’s the moment the robot turns into a true software-defined machine. That’s when I (and many others) will buy one in a jiffy, because it finally makes sense.

The Physical Turing Test will be the measure of our progress on this. And the test will break, the same way language models blew past the original Turing Test.
As capability jumps, cost drops. Wright’s Law kicks in: every doubling of production drops unit cost on a predictable curve (~20%). It happened with solar, it happened with semis, and it will absolutely happen here.
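Wright's Law is easy to sanity-check yourself. The sketch below is purely illustrative; the $100K starting cost and the volumes are made-up numbers, not a forecast:

```python
# Illustrative Wright's Law calculation: unit cost falls ~20% with every
# doubling of cumulative production. Numbers are made up, not a forecast.
import math

def wrights_law_cost(first_unit_cost: float, cumulative_units: float,
                     learning_rate: float = 0.20) -> float:
    """Cost of the nth unit: c(n) = c1 * n^(-b), where 2^(-b) = 1 - learning_rate."""
    b = -math.log2(1.0 - learning_rate)
    return first_unit_cost * cumulative_units ** (-b)

# Made-up example: if unit #1 costs $100K, what do later units cost?
for n in (1, 1_000, 1_000_000):
    print(f"unit {n:>9,}: ~${wrights_law_cost(100_000, n):,.0f}")
# unit 1: ~$100,000 | unit 1,000: ~$10,800 | unit 1,000,000: ~$1,200
```

At a 20% learning rate, every 10x in cumulative volume cuts unit cost roughly in half.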
Unlike autonomous driving (which took a decade of incremental progress), robotic manipulation will scale faster because mistakes are often recoverable and environments are more structured.
Hardware costs are already sliding down the curve. For instance, arms that cost ~$400K in 2014 (PR2) dropped to ~$30K by 2018 (UR arms) and are now around ~$3K per arm. Within a few years, they may fall below $1K.
One caveat: the timelines here are rough estimates based on my best guess. The direction is clear, but the exact timestamps are not. These three phases bleed into each other and will overlap, though at any given moment it should be clear which one is dominant.
Economic Pressure Is Driving Adoption
The forward path of progress I outlined here is inevitable, because the macroeconomic case for humanoid robots is very strong. Human labor is by far the largest economic sector globally, at over $30 trillion annually. Labor shortages are rising due to lower birth rates, reduced immigration, and earlier retirements across major economies.
Labor is becoming more expensive and less available. This is a structural trend, not a cyclical one.
In this context, even partial automation of physical work becomes economically significant. Humanoids that can take over repetitive or hazardous tasks could address a broad range of gaps without requiring new physical infrastructure. The potential market is vast, even if only a small fraction of tasks are eventually automated.
Traditional market research pegs the global robotics sector at around $82 billion in 2024, with some forecasts reaching $448 billion by 2034. However, I believe these figures likely understate the true potential. As is common with emerging technologies, early forecasts often underestimate market size and overlook the breadth of applications that human innovation can unlock over time.
I would place the 2034 market closer to $1 trillion, suggesting over $900 billion in new opportunity within this decade alone.

China Is Scaling Faster Than Anywhere Else
China is already installing more industrial robots than the rest of the world combined. In 2023 alone, it installed over 276,000 new industrial robots, 51% of all new installations worldwide. And it is now shifting that muscle into humanoids.
China is the world's manufacturing and engineering hub. It owns the supply chain and the factories, even if it still trails the US in cutting-edge hardware design and advanced software. Companies like Unitree Robotics (expected to IPO in 2026) are already scaling up production.
What matters most is pace: Shenzhen has become the global "Silicon Valley of Robotics."
Beijing is also backing the sector with record-scale investment. In March 2025, the National Development and Reform Commission (NDRC) launched a state-backed venture initiative targeting up to RMB 1 trillion (roughly US$138 billion) in funding over the next two decades. The focus is robotics, artificial intelligence, and advanced manufacturing.
This is an order of magnitude larger than any previous state fund dedicated to robotics and signals a clear national strategy to dominate the next era of industrial automation.
Physical Turing Test
When ChatGPT arrived, we all felt it. It could write, explain, joke, and help in ways that felt startlingly close to human. That moment was the turning point for AI. The Turing test was passed. ChatGPT became the fastest consumer app to that point to reach 100 million users (two months!).
Robotics is still waiting for its own version of that shift. Dr Jim Fan at NVIDIA gave it a name: the Physical Turing Test.
Imagine the morning after a house party you threw for a bunch of college friends. The night was loud and loose. Music, dancing, drinks flowing. Then everyone left. Now it's just you and the mess: cups scattered across the floor, bottles knocked over, unwashed plates.
Everything is a mess (damn)

Can a robot enter the post-party house, clean the clutter, load the dishwasher, wipe the counters, and reset the furniture, so convincingly that you can’t tell whether a human or a machine did the work?
So… basically like this:

Even on the best real-world robotic task benchmark, Stanford’s BEHAVIOR-1K, a scripted “optimal” policy completes only 40% of runs in simulation and 22% on real hardware. Nearly half of the real-world failures come from grasping problems, which require a lot of fine motor dexterity.
Just moving through clutter is a challenge. Robots cruise at roughly 0.5 m/s in cluttered environments, a third of the 1.4 m/s pace at which humans stroll without thinking.
We’re still at least an order of magnitude of improvement away.
The Rise of “Generalist” Robot Policies
Okay… then what gets us closer to passing the Physical Turing Test?
The frontier of the field is pushing toward generalist policies: models trained across diverse tasks, settings, and hardware. The path looks a lot like the one LLMs took: pre-train broadly, then fine-tune lightly for each new context.
In robotics, a policy is the brain. It is the model that maps perception to action. It takes the robot’s current understanding of the world (its state) and decides what to do next. A good policy defines behavior. A great one handles surprise.
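In code, a policy is nothing more exotic than a function from observation to action. A minimal sketch, with field names and shapes that are illustrative rather than any specific robot's interface:

```python
# Minimal sketch of a policy: a mapping from observation (state) to action.
# Field names and shapes are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray              # camera image, e.g. (H, W, 3)
    joint_positions: np.ndarray  # proprioception, e.g. (7,) for a 7-DoF arm

@dataclass
class Action:
    joint_targets: np.ndarray    # where to move each joint next

class Policy:
    """Maps perception to action. A learned policy would replace this logic
    with a neural network forward pass."""
    def act(self, obs: Observation) -> Action:
        # Placeholder behaviour: hold the current pose.
        return Action(joint_targets=obs.joint_positions.copy())

policy = Policy()
obs = Observation(rgb=np.zeros((224, 224, 3)), joint_positions=np.zeros(7))
action = policy.act(obs)
```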
The truly huge market unlock will be generalist humanoid robots. That TAM is much, much bigger than any specialized robot's. That's why everyone is eyeing this prize, including Elon / Tesla.
The challenge is that the real world is unstable. Objects shift. Lighting changes. Task-specific systems collapse under that kind of variability.
The goal is not to hardcode every edge case but to teach transferable physical intuition. If a robot can clean one apartment and handle the next without rewriting its logic, the economics shift.
Intelligence comes from diversity. For robots to be truly intelligent, they must live and learn among us. We need robots that can fail gracefully, so they can learn from mistakes.
Two capabilities make this possible: adaptability and autonomy.
Adaptability is the ability to learn from experience and generalize. If a robot can clean one sink, can it figure out another without starting from scratch?
Autonomy is about execution without supervision. Once the robot is in a new environment, can it operate end-to-end without human help?
Foundation models for robotics aim to encode physical common sense: how objects behave, how to manipulate them, and how to move through space. On top of that, they support higher-level reasoning by deciding what to do, not just how.
Companies like Physical Intelligence and Skild AI are chasing this vision with serious funding. Their approach centers on a simple idea: scale the data, and the model will generalize.
The Data Bottleneck

Source: Coatue (@coatuemgmt)
The catch is that physical AI has to climb a much steeper data hill than the language models we are used to.
Text-based models had an early advantage because text has already captured human-relevant knowledge in condensed form. Humanity spent centuries compressing knowledge into books, articles, and posts.
But for robots, the data is all new and has to be collected from scratch. A robot learns from vision, audio, touch, force, proprioception, and the messy physics of a 3D world. That is a far harder distribution.
And robots do not scale like software. The feedback loop is slow and expensive. You cannot run a thousand iterations an hour. Every trial burns hardware. Parts wear out. I don’t want a robot nanny practicing trial and error in my living room.
By one estimate, the largest robotics datasets today contain about 10⁶ to 10⁷ motion samples. Compare that to the 10¹² examples common in language or vision training, and the asymmetry becomes stark. That is five to six orders of magnitude (up to 1,000,000x) less.
A quick look at some of the available open datasets reveals just how little data we have, and how wide the gap still is:
| Dataset | Size / Details |
|---|---|
| Open X-Embodiment | 1M+ trajectories, 22 robot types, 527 skills |
| DROID | 76K trajectories, 350 hours of data |
|  | 3,700+ hours of perception video data |
|  | 1 billion synthetic demos for dexterous-hand tasks |
| FrodoBots-2K | 2,000 hours of teleoperated sidewalk-robot driving data from 10+ cities |
The largest open egocentric (first-person video) dataset I've seen was released just two weeks ago, with 10,000 video hours from 2,153 factory workers. Suffice it to say, it is still a drop in the ocean compared to what we need.
The talent pool is just as thin. Because the field is so early and access to capable robots is scarce, I estimate that probably only a few thousand people on the planet really know how to collect, clean, and use complex robotics datasets well.
Two Approaches to Robotics Data
It’s a very interesting period for robotics, because for the first time, there is a real consensus on the recipe for a general-purpose robotics model.
Most roboticists now agree that the path ahead will rely on (1) large and varied observation–action data, (2) diffusion-style or transformer-based action models, and (3) long sequence prediction instead of twitchy micro-step control.
This is the robotics equivalent of the moment when NLP researchers aligned around transformers in 2018-2020. Now every team is racing to build the first truly large-scale dataset that spans enough environments, objects, lighting conditions, and human styles to matter.
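To show what ingredients (2) and (3) of that recipe mean in practice, here is a minimal sketch of a transformer policy head that predicts a chunk of future actions rather than a single micro-step. The sizes and shapes are illustrative, not any lab's actual architecture:

```python
# Minimal sketch: a transformer policy that outputs a chunk of future actions
# instead of one twitchy micro-step. Dimensions are illustrative.
import torch
import torch.nn as nn

class ChunkedActionPolicy(nn.Module):
    def __init__(self, obs_dim: int = 512, act_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        layer = nn.TransformerEncoderLayer(d_model=obs_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(obs_dim, act_dim * horizon)

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        # obs_tokens: (batch, num_tokens, obs_dim), e.g. image patches + proprioception
        h = self.encoder(obs_tokens)
        pooled = h.mean(dim=1)               # summarize the observation
        chunk = self.action_head(pooled)     # (batch, act_dim * horizon)
        return chunk.view(-1, self.horizon, self.act_dim)

policy = ChunkedActionPolicy()
print(policy(torch.randn(1, 64, 512)).shape)   # torch.Size([1, 16, 7])
```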
What’s still missing for the breakout?
10–100 times more diverse manipulation data
cheaper hardware
training setups optimized for multi-hour, multi-room sequences
A. Simulations

Source: Dr Jim Fan’s presentation at Sequoia AI Ascent
The core idea is this:
If a robot has handled 1,000,000 different environments, the odds are that it'll do just fine in the 1,000,001st environment too.
Simulation is how robots learn without breaking. It’s the only place a machine can fall a thousand times and still get up. In simulations, robots can train faster than real-time, encounter rare or dangerous edge cases, and explore movements that are too slow, risky, or expensive to test on physical hardware.
Simulation gives us a way to multiply scarce real-world data. A single demonstration can be replayed across N environments and M motion variations, generating N × M new examples.
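A toy sketch of that N × M multiplication, with made-up randomization ranges, looks something like this:

```python
# Toy illustration of the N x M multiplication: replay one demonstration across
# N randomized environments and M motion perturbations. Parameters are made up.
import itertools
import random

def randomize_environment(seed: int) -> dict:
    rng = random.Random(seed)
    return {
        "friction": rng.uniform(0.4, 1.2),
        "table_height_cm": rng.uniform(70, 95),
        "light_intensity": rng.uniform(0.3, 1.0),
    }

def perturb_trajectory(trajectory: list, seed: int) -> list:
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, 0.01) for x in trajectory]  # small joint-space noise

def augment(demo: list, n_envs: int = 1000, m_variations: int = 100):
    """One real demo -> n_envs * m_variations simulated training examples."""
    for env_seed, motion_seed in itertools.product(range(n_envs), range(m_variations)):
        yield randomize_environment(env_seed), perturb_trajectory(demo, motion_seed)

demo = [0.0, 0.1, 0.2, 0.3]   # stand-in for a recorded joint trajectory
examples = augment(demo)      # 1,000 x 100 = 100,000 examples from one demo
```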
As neural world models and simulators improve, a new kind of scaling law is emerging, one where physical IQ rises with compute used.
More compute = more capable policies and smarter robots.
That’s how we scale.

Source: Dr Jim Fan’s presentation at Sequoia AI Ascent
But simulation has limits. It works for drones and basic locomotion because the physics are simple. Manipulation is messier. Friction, contact, deformation, and fine-grained sensing are hard to model accurately.
And the biggest challenge is translating success in simulation to actual performance in the real world: what we call the “sim-to-real gap”. A robot can ace the simulator and then fall apart in a real kitchen because the floor is slick or the glare blinds a camera.
The gap lies at the intersection of the two Ps: Physics and Perception.
Even the best simulators smooth out contact and friction. They do not capture the complexity of real light, texture, or sensor noise.
To bridge this gap, researchers rely on techniques like domain randomization (training across varied, slightly distorted conditions to encourage robustness) and domain adaptation (making simulated inputs look more like real ones).
This matters. If the sim-to-real gap is fundamentally unsolvable for certain dynamic tasks, then collecting a billion hours of simulated data might be useless or even harmful.
Simulation will never replace reality. But it’s how we might get there faster.
B. Real World Data for Imitation Learning
For physical AI to operate reliably in robots, it needs diverse real-world data that captures edge cases and unpredictability, none of which simulators model well.
Imitation learning is the most direct way to get that data. Robots learn by watching people. Instead of stumbling toward a solution through trial and error, they start from an example.
Diffusion policies (conceptually similar to diffusion-based image-generation models like Stable Diffusion) are becoming increasingly popular and have pushed this field forward because they thrive on diversity. When you train on videos of humans performing tasks paired with action data, you capture a huge spread of motions, objects, lighting conditions, and failure modes. Older imitation systems tended to average everything into a bland, unusable action.
Diffusion policies avoid that collapse. They predict a whole action sequence, then refine it step by step, which keeps movements smooth and stable instead of jittery.
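Conceptually, inference with a diffusion policy looks like the loop below: start from noise and refine the whole action sequence over many denoising steps. This is a simplified sketch; `noise_model` stands in for a trained network, and a real DDPM/DDIM sampler would use the learned noise schedule's coefficients rather than this crude update:

```python
# Conceptual sketch of diffusion-policy inference: refine a noisy action
# sequence step by step, conditioned on the current observation.
import torch

@torch.no_grad()
def sample_action_sequence(noise_model, obs_embedding: torch.Tensor,
                           horizon: int = 16, act_dim: int = 7, steps: int = 50):
    actions = torch.randn(1, horizon, act_dim)          # start from pure noise
    for k in reversed(range(steps)):
        t = torch.full((1,), k)
        predicted_noise = noise_model(actions, t, obs_embedding)
        # Simplified update: remove a fraction of the predicted noise each step.
        # A real sampler uses the trained noise schedule's coefficients.
        actions = actions - predicted_noise / steps
    return actions  # a full, smooth action sequence, not a single twitchy step
```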
Robotics is scaling real-world training data along two main paths.
Collect massive human demonstration video datasets with action labels. This gives broad coverage and teaches models general manipulation patterns.
Collect large robot-only datasets in controlled setups. These provide clean data but often lack variety, leading to poor transfer outside the original environment.
Diffusion policies shine on the first path because they can absorb complexity without falling apart. That is why many of the newest generalist systems lean heavily on human demonstrations.

1. Teleoperation
The fastest way to teach a robot a new skill is still the oldest: show it.
Recent research shows that well-constructed demonstration datasets, together with supervised or hybrid algorithms, allow robots to become competent with just tens to hundreds of real-world episodes.
Teleoperation is the most common method of demonstration. A human controls the robot remotely, creating high-quality motion data. Interfaces vary (VR headsets, motion-capture suits, etc.), but the idea is the same: people perform the task, robots watch and learn. Teleoperation is the easiest way to bootstrap a robot to a non-zero chance of success at a task, before you start refining it.
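Under the hood, learning from those teleop demonstrations is often plain supervised learning (behavior cloning). A minimal sketch, with made-up tensor sizes standing in for real logs:

```python
# Minimal behavior-cloning sketch: supervised learning on teleoperated
# (observation, action) pairs. Dataset contents and sizes are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Pretend we logged a few hundred teleop episodes, flattened to 10,000 pairs.
obs = torch.randn(10_000, 64)      # stand-in for encoded camera + proprioception
actions = torch.randn(10_000, 7)   # the human operator's joint commands
loader = DataLoader(TensorDataset(obs, actions), batch_size=256, shuffle=True)

policy = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for epoch in range(10):
    for o, a in loader:
        loss = nn.functional.mse_loss(policy(o), a)  # imitate the operator
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```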
Kinesthetic teaching goes one level deeper. Instead of remote control, you physically guide the robot’s limbs through the motion.

Tesla’s teleoperation team. Source: Electrek
Some teams are scaling this with crowdsourcing.
NRN Agents uses a web-based simulator. Players guide robots through tasks using simple controls, creating useful trajectories with no special gear.
Tesla is hiring operators to wear capture suits and act out specific behaviors for its Optimus robot. Human motion is streamed straight into training pipelines.
The hardware still bites. VR and motion-capture kits are drifting below $1K, but high-precision kinesthetic systems can easily run to $10K+.
Traditional teleoperation rigs are slow to set up and hard on operators, especially when tasks require fine control or specific force profiles. And the correspondence problem, mapping human motion onto bodies with different shapes and joint limits, is still an active research area.
So... teleoperation data is invaluable for bootstrapping specialized skills. But it hits a ceiling.
2. Annotated Video
Could a robot learn from YouTube, where people chop vegetables, fold laundry, fix bikes, and do almost every task imaginable?
Not really. Most internet video is useless for robotics because video alone never tells a robot how the action was executed. It does not reveal joint positions, tool trajectories, or the 3D geometry of the scene. A robot needs kinematics, not just pixels.
This is why paired video–action datasets have been so important. When you combine diverse human videos with aligned action data, imitation learning becomes far more stable and general.
Researchers are now working to convert raw footage into structured training data. Some teams add motion trackers or lightweight AR markers to capture kinematic labels as video is recorded.
At the University of Washington, the Unified World Models Project learns representations from both labeled robot actions and unlabeled video clips, and infers likely actions from videos. In simulations, UWM outperforms standard imitation learning models.
Another tool, URDFormer, takes a single image and reconstructs an entire simulation-ready scene.

Source: https://urdformer.github.io/
I can see a world soon where workers in every profession wear lightweight cameras, generating continuous footage of real tasks, from making coffee to harvesting crops. Once enough annotated footage exists, the path to automating that job becomes much clearer.
Ultimately, the strongest data strategies blend simulation with the real world. Simulation gives you scale in a way nothing else can: billions of interactions, fast iteration, zero risk. Real-world data keeps the model honest. It surfaces the weird edge cases that never show up in sim.
Robotics will likely mirror the pattern Waymo followed in self-driving. Billions of simulated miles paired with millions of real ones. The system begins to address its own weaknesses. After it crashes in a simulation, the world model produces related scenes, the driving model trains on them, and the failure disappears.
Now I can picture a distributed swarm of robots, each gathering its own experience and streaming it back into a shared model. Multiply that across thousands of machines and the learning flywheel becomes unstoppable.
Exciting times!
In Part II, I will dig into the teams doing the most interesting work today, and where crypto actually matters for robotics.
Thanks for reading,
Teng Yan
PS. Did you enjoy this? Forward this email to your friends so they can keep up. You’ll probably like the rest of what we do:
Chainofthought.xyz: Decentralized AI + Robotics newsletter & deep dives
Our Decentralized AI canon 2025: Our open library of industry reports
Prefer watching? Tune in on YouTube. You can also find me on X.


