Big Idea #1: The Great AI Bottleneck is Data

How decentralized networks are rebuilding AI’s most valuable resource.

GM 👋

I was testing a new AI agent and asked it a simple question about the Trump tariffs. It confidently replied with an answer from 2024. Because its training data stopped there.

That’s a failure of data, not intelligence.

Welcome to the first essay in our 30 Days of Chain of Thought.

We’re starting with the backbone: data networks. What makes them work, why many will fail, and what’s actually needed to build a living, evolving data economy.

Data networks are the 1st of our “Big Ideas for 2025,” specifically in the category of what we call a slow burn: a foundational shift with quiet but steady traction. The tech may still be maturing, but the direction is clear. These are long plays that compound over time.

If this clicks with you, share the essay on X or forward this on to a friend. If you’ve got a different take, post it and tag @cot_research (or tag me) and we’ll repost it.

Let’s make this a conversation worth having.

TL;DR

  • Public text is almost tapped out while the highest-signal private streams stay locked away behind paywalls, APIs, and privacy barriers.

  • Crypto data networks address this using three primary approaches: decentralized web scraping (e.g., Grass), user-consented private data aggregation (e.g., Vana), and on-demand synthetic data generation (e.g., Dria).

  • These networks leverage crypto primitives like tokens for incentives, blockchains for verifiable provenance, and DAOs for community governance.

  • Sustainable business models require focusing on utility-driven applications and moving up the value chain beyond raw data sales.

  • Key challenges include bootstrapping early networks, ensuring data quality, and overcoming enterprise skepticism.

  • The ultimate vision is a living data economy. Networks that secure credible, high-fidelity data now will dictate training speed, model performance, and capture the largest share of future AI value.

There’s a growing tension in AI: we’re racing toward ever more powerful models, yet running low on the resource that matters most, high-signal training data. The shortage isn’t one of quantity; it’s one of quality.

According to Epoch AI, the largest training runs could deplete the world’s supply of public human-generated text—around 300 trillion tokens—by 2028. Some forecasts suggest we hit that wall as early as 2026, especially with overtraining.

And as Ethan Mollick notes, even vast amounts of niche text (like terabytes of amateur fiction) barely move the needle. The easy data is gone. We’ve scraped Wikipedia, drained Reddit, and mined Common Crawl. What’s left offers diminishing returns.

So we hit a paradox: while model capabilities continue to leap ahead, the availability of the right kind of data narrows. And as it does, the price and importance of quality, high-fidelity data skyrocket.

This is where things get interesting.

The Data We Need Now (And Why It's Locked Away)

People often say, “Data is the new oil.”

But this oversimplifies the challenge.

Oil is static and interchangeable. Data is dynamic, contextual, and deeply tied to how it’s sourced and used.

Here’s the new hierarchy of AI-critical data:

| Category | Sources | Typical Barrier |
| --- | --- | --- |
| Private & niche | Hospital imaging archives, manufacturing telemetry | Institutional silos, privacy law |
| Net-new domains | Robot teleoperation, agent interaction logs | Needs bespoke collection pipelines |
| Real-time streams | Market order books, social firehoses, supply-chain IoT | Latency and licensing costs |
| Expert-annotated | Radiology scans with specialist labels | Expensive, slow, hard to scale |

  1. Private and Niche Datasets: The highest-signal data is locked behind institutional walls: health records, genomics, financial histories, factory telemetry, proprietary R&D. These datasets are fragmented and often siloed.

  2. Net New Data for Emerging Domains: You can’t train a household robot on Reddit. Robotics needs teleoperation, sensor data, and real-world context. Little of this exists in volume yet; it must be actively generated through purpose-built pipelines.

    Another key area for advancing agentic AI is capturing real action sequences: user clicks, navigation paths, and interaction logs. One example is the Wikipedia clickstream, an anonymized dataset that traces how users move from one article to the next (a minimal parsing sketch follows this list).

  3. Fresh, Real-Time Data: Intelligence needs a feed, not a snapshot. For models to adapt to live markets, we need real-time crawling and streaming.

  4. High-Quality, Expert-Annotated Data: In fields like radiology, law, and advanced science, accuracy depends on expert labeling. Crowd-sourced annotation won’t cut it. This kind of data is expensive, slow, and hard to scale, but critical for domain competence.
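As a concrete taste of the action-sequence data mentioned in item 2, here is a minimal sketch of reading a Wikipedia clickstream dump. The column layout (prev, curr, type, count in a gzipped TSV) follows Wikimedia’s published format; the file name and helper function are illustrative only.

```python
# Minimal sketch: reading a Wikipedia clickstream dump into (source, target, count) rows.
# Monthly dumps are published at https://dumps.wikimedia.org/other/clickstream/ as
# gzipped TSV files with four columns: prev, curr, type, n.
import gzip

def top_transitions(path: str, limit: int = 10):
    """Return the most frequent article-to-article transitions in one clickstream file."""
    rows = []
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            prev, curr, link_type, count = line.rstrip("\n").split("\t")
            # "link" rows are genuine article-to-article navigation;
            # "external" and "other" capture traffic arriving from outside Wikipedia.
            if link_type == "link":
                rows.append((prev, curr, int(count)))
    return sorted(rows, key=lambda r: r[2], reverse=True)[:limit]

# Example usage (hypothetical local file):
# print(top_transitions("clickstream-enwiki-2024-01.tsv.gz"))
```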

The era of just scraping the internet is ending.

Web2 Knows This

As AI valuations soared, platforms realized their most valuable asset was user data.

Reddit signed a $60M training deal with Google. X charges enterprises steep fees for API access. OpenAI is striking licensing agreements with publishers like The Atlantic and Vox Media, offering $1M–$5M per archive.

And the people, like you and me, who created that data? We get nothing.

Users generate the content. Platforms monetize it. The rewards accrue to a few centralized players, while the real contributors are left out. It’s a deeply extractive dynamic.

What if this changed?

Crypto x Data = Rebuilding Data Ownership From First Principles

We see three major aggregation strategies take shape around data:

  1. Scraping and labeling public web data

  2. Aggregating user-owned, private data

  3. Generating synthetic data on demand

1. Scrape Public Data, Repackage at Scale

This focuses on harvesting the open web (forums, social platforms, public websites) and turning that raw stream into structured, machine-readable data for AI developers.

The indexed internet holds roughly 10 petabytes of usable data (10,000 TB). When broader public databases are factored in, that figure swells to around 3 exabytes (3,000,000 TB). Add platforms like YouTube videos, and the total exceeds 10 exabytes.

So there’s a lot of data out there.

| Source | Estimated Size | Notes |
| --- | --- | --- |
| Indexed web pages | ~10 petabytes | Estimated from ~4.57 billion pages at ~2.2 MB each |
| Deep web pages | ~100 petabytes | Estimated at roughly 10x the indexed web |
| Public databases and APIs | ~1–10 exabytes | Genomics, astronomy, climate data, open government portals |
| Public file sharing and storage | ~1 exabyte | Data from platforms like GitHub, Dropbox, and public repositories |
| Public multimedia platforms | ~10+ exabytes | YouTube and similar; requires significant processing for AI use beyond transcripts |
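A quick back-of-envelope check of the indexed-web row (the page count and average page size are the table’s own assumptions, not independent measurements):

```python
# Back-of-envelope check of the indexed-web estimate in the table above.
pages = 4.57e9         # ~4.57 billion indexed pages (table assumption)
avg_page_mb = 2.2      # ~2.2 MB per page (table assumption)

total_pb = pages * avg_page_mb / 1e9   # 1 PB = 1e9 MB in decimal units
print(f"~{total_pb:.1f} PB of indexed web data")   # ~10.1 PB, consistent with the ~10 PB figure
```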

Data is sourced through distributed scraping infrastructure: often, networks of user-run nodes. Once collected, the data is cleaned, lightly annotated, and formatted into structured datasets. These are then sold to model developers looking for affordable data at a fraction of what centralized providers like Scale AI charge.
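As a rough illustration of that scrape-clean-package loop, here is a minimal sketch of what a single contributor node might do. The record schema, helper names, and user agent are hypothetical; this is not Grass’s or Masa’s actual software.

```python
# Hypothetical sketch of a contributor node's scrape-clean-package loop.
# URLs, record schema, and identifiers are illustrative only.
import json
import time
from urllib.request import urlopen, Request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strips tags and collects visible text from an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def scrape_page(url: str) -> dict:
    """Fetch one public page and package it as a structured record."""
    html = urlopen(Request(url, headers={"User-Agent": "example-node/0.1"})).read().decode("utf-8", "ignore")
    parser = TextExtractor()
    parser.feed(html)
    return {
        "url": url,
        "fetched_at": int(time.time()),
        "text": " ".join(parser.chunks),
    }

# A node would batch records like this into JSONL and hand them to the network
# for cleaning, annotation, and resale as a structured dataset.
if __name__ == "__main__":
    record = scrape_page("https://example.com")
    print(json.dumps(record)[:200])
```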

The competitive edge comes from decentralization, which reduces scraping costs. Projects like Grass and Masa are turning public web data into a permissionless, commoditized resource.

Grass launched in 2024 as a decentralized scraping network built on Solana. Within a year, it grew to over 2 million active nodes. Users install a lightweight desktop app that transforms their device into a Grass node, contributing idle bandwidth to crawl the web.

Each node handles a small chunk of the scraping workload, and together they pull in over 1,300 TB of data daily, a figure that is still growing. That data is bundled and sold as a continuous feed to AI companies.
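Dividing those two figures shows why the model works: each node only needs to contribute a sliver of bandwidth (this assumes an even split across nodes, which real networks won’t have):

```python
# Rough per-node contribution implied by the figures above (even split assumed).
nodes = 2_000_000        # ~2 million active nodes
daily_tb = 1_300         # ~1,300 TB scraped per day across the network

per_node_gb = daily_tb * 1_000 / nodes   # 1 TB = 1,000 GB in decimal units
print(f"~{per_node_gb:.2f} GB per node per day")   # ~0.65 GB/day, i.e. spare household bandwidth
```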

By late 2024, Grass was reportedly generating ~$33 million in annualized revenue from AI clients, which we hear includes some of the big AI research labs we’re all familiar with (speculation, not confirmed).

Over time, it plans to distribute revenue back to node operators and token stakers, essentially treating data monetization as a shared revenue stream.

The vision is bigger than scraping: Grass aims to become a decentralized API for real-time data. It plans to launch Live Context Retrieval, letting clients query real-time web data from across the network, which will require far more nodes than it has today.

Masa is taking a different route through the Bittensor ecosystem, running a dedicated data-scraping subnet (Subnet 42). Its “data miners” collect and annotate real-time web content, delivering data feeds to AI agents. Developers tap Masa to retrieve X/Twitter content to feed directly into LLM pipelines, bypassing costly APIs.

To scale, both Grass and Masa depend on a steady base of reliable node operators and contributors. That makes incentive design a core challenge. Other key challenges:

  • Very noisy data, prone to bias

  • Regulatory grey area

  • Lack of a real competitive moat since data is non-exclusive

2. Private Data, User-Controlled and Monetized

This focuses on unlocking high-value data that lives behind walls: personal, proprietary, and unavailable through public scraping. Think DMs, health records, financial transactions, codebases, app usage, smart device logs.

The core hypothesis: Private data contains deep, high-signal context that can dramatically improve AI performance, if it can be accessed securely and with user consent.
