Big Idea #1: The Great AI Bottleneck is Data

How decentralized networks are rebuilding AI’s most valuable resource.

GM 👋

I was testing a new AI agent and asked it a simple question about the Trump tariffs. It confidently replied with an answer from 2024. Because its training data stopped there.

That’s a failure of data, not intelligence.

Welcome to the first essay in our 30 Days of Chain of Thought.

We’re starting with the backbone: data networks. What makes them work, why many will fail, and what’s actually needed to build a living, evolving data economy.

Data networks are the 1st of our “Big Ideas for 2025,” specifically in the category of what we call a slow burn: a foundational shift with quiet but steady traction. The tech may still be maturing, but the direction is clear. These are long plays that compound over time.

If this clicks with you, share the essay on X or forward this on to a friend. If you’ve got a different take, post it and tag @cot_research (or tag me) and we’ll repost it.

Let’s make this a conversation worth having.

TL;DR

  • Public text is almost tapped out while the highest-signal private streams stay locked away behind paywalls, APIs, and privacy barriers.

  • Crypto data networks address this using three primary approaches: decentralized web scraping (e.g., Grass), user-consented private data aggregation (e.g., Vana), and on-demand synthetic data generation (e.g., Dria).

  • These networks leverage crypto primitives like tokens for incentives, blockchains for verifiable provenance, and DAOs for community governance.

  • Sustainable business models require focusing on utility-driven applications and moving up the value chain beyond raw data sales.

  • Key challenges include bootstrapping early networks, ensuring data quality, and overcoming enterprise skepticism.

  • The ultimate vision is a living data economy. Networks that secure credible, high-fidelity data now will dictate training speed, model performance, and capture the largest share of future AI value.

There’s a growing tension in AI: we’re racing toward ever more powerful models, yet running low on the resource that matters most, high-signal training data. The shortage isn’t one of quantity; it’s one of quality.

According to Epoch AI, the largest training runs could deplete the world’s supply of public human-generated text—around 300 trillion tokens—by 2028. Some forecasts suggest we hit that wall as early as 2026, especially with overtraining.

And as Ethan Mollick notes, even vast amounts of niche text (like terabytes of amateur fiction) barely move the needle. The easy data is gone. We’ve scraped Wikipedia, drained Reddit, and mined Common Crawl. What’s left offers diminishing returns.

So we hit a paradox: while model capabilities continue to leap ahead, the availability of the right kind of data narrows. And as it does, the price and importance of quality, high-fidelity data skyrocket.

This is where things get interesting.

The Data We Need Now (And Why It's Locked Away)

People often say, “Data is the new oil.”

But this oversimplifies the challenge.

Oil is static and interchangeable. Data is dynamic, contextual, and deeply tied to how it’s sourced and used.

Here’s the new hierarchy of AI-critical data:

| Category | Sources | Typical Barrier |
| --- | --- | --- |
| Private & niche | Hospital imaging archives, manufacturing telemetry | Institutional silos, privacy law |
| Net-new domains | Robot teleoperation, agent interaction logs | Needs bespoke collection pipelines |
| Real-time streams | Market order books, social firehoses, supply-chain IoT | Latency and licensing costs |
| Expert-annotated | Radiology scans with specialist labels | Expensive, slow, hard to scale |

  1. Private and Niche Datasets: The highest-signal data is locked behind institutional walls: health records, genomics, financial histories, factory telemetry, proprietary R&D. These datasets are fragmented and often siloed.

  2. Net New Data for Emerging Domains: You can’t train a household robot on Reddit. Robotics needs teleoperation, sensor data, and real-world context. Little of this exists in volume yet; it must be actively generated through purpose-built pipelines.

    Another key area for advancing agentic AI is capturing real action sequences: user clicks, navigation paths, and interaction logs. One example is the Wikipedia clickstream, an anonymized dataset that traces how users move from one article to the next (a minimal parsing sketch follows this list).

  3. Fresh, Real-Time Data: Intelligence needs a feed, not a snapshot. For models to adapt to live markets, we need real-time crawling and streaming.

  4. High-Quality, Expert-Annotated Data: In fields like radiology, law, and advanced science, accuracy depends on expert labeling. Crowd-sourced annotation won’t cut it. This kind of data is expensive, slow, and hard to scale, but critical for domain competence.
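As a concrete taste of the action-sequence data mentioned in item 2, here is a minimal sketch of reading a Wikipedia clickstream dump. The column layout (prev, curr, type, count in a gzipped TSV) follows Wikimedia’s published format; the file name and helper function are illustrative only.

```python
# Minimal sketch: reading a Wikipedia clickstream dump into (source, target, count) rows.
# Monthly dumps are published at https://dumps.wikimedia.org/other/clickstream/ as
# gzipped TSV files with four columns: prev, curr, type, n.
import gzip

def top_transitions(path: str, limit: int = 10):
    """Return the most frequent article-to-article transitions in one clickstream file."""
    rows = []
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            prev, curr, link_type, count = line.rstrip("\n").split("\t")
            # "link" rows are genuine article-to-article navigation;
            # "external" and "other" capture traffic arriving from outside Wikipedia.
            if link_type == "link":
                rows.append((prev, curr, int(count)))
    return sorted(rows, key=lambda r: r[2], reverse=True)[:limit]

# Example usage (hypothetical local file):
# print(top_transitions("clickstream-enwiki-2024-01.tsv.gz"))
```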

The era of just scraping the internet is ending.

Web2 Knows This

As AI valuations soared, platforms realized their most valuable asset was user data.

Reddit signed a $60M training deal with Google. X charges enterprises steep fees for API access. OpenAI is striking licensing agreements with publishers like The Atlantic and Vox Media, offering $1M–$5M per archive.

And the people, like you and me, who created that data? We get nothing.

Users generate the content. Platforms monetize it. The rewards accrue to a few centralized players, while the real contributors are left out. It’s a deeply extractive dynamic.

What if this changed?

Crypto x Data = Rebuilding Data Ownership From First Principles

We see three major aggregation strategies take shape around data:

  1. Scraping and labeling public web data

  2. Aggregating user-owned, private data

  3. Generating synthetic data on demand

1. Scrape Public Data, Repackage at Scale

This focuses on harvesting the open web (forums, social platforms, public websites) and turning that raw stream into structured, machine-readable data for AI developers.

The indexed internet holds roughly 10 petabytes of usable data (10,000 TB). When broader public databases are factored in, that figure swells to around 3 exabytes (3,000,000 TB). Add platforms like YouTube videos, and the total exceeds 10 exabytes.

So there’s a lot of data out there.

| Source | Estimated Size | Notes |
| --- | --- | --- |
| Indexed web pages | ~10 petabytes | Estimated from ~4.57 billion pages at ~2.2 MB each |
| Deep web pages | ~100 petabytes | Estimated at roughly 10x the indexed web |
| Public databases and APIs | ~1–10 exabytes | Genomics, astronomy, climate data, open government portals |
| Public file sharing and storage | ~1 exabyte | Data from platforms like GitHub, Dropbox, and public repositories |
| Public multimedia platforms | ~10+ exabytes | YouTube and similar; requires significant processing for AI use beyond transcripts |
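A quick back-of-envelope check of the indexed-web row (the page count and average page size are the table’s own assumptions, not independent measurements):

```python
# Back-of-envelope check of the indexed-web estimate in the table above.
pages = 4.57e9         # ~4.57 billion indexed pages (table assumption)
avg_page_mb = 2.2      # ~2.2 MB per page (table assumption)

total_pb = pages * avg_page_mb / 1e9   # 1 PB = 1e9 MB in decimal units
print(f"~{total_pb:.1f} PB of indexed web data")   # ~10.1 PB, consistent with the ~10 PB figure
```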

Data is sourced through distributed scraping infrastructure: often, networks of user-run nodes. Once collected, the data is cleaned, lightly annotated, and formatted into structured datasets. These are then sold to model developers looking for affordable data at a fraction of what centralized providers like Scale AI charge.
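As a rough illustration of that scrape-clean-package loop, here is a minimal sketch of what a single contributor node might do. The record schema, helper names, and user agent are hypothetical; this is not Grass’s or Masa’s actual software.

```python
# Hypothetical sketch of a contributor node's scrape-clean-package loop.
# URLs, record schema, and identifiers are illustrative only.
import json
import time
from urllib.request import urlopen, Request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strips tags and collects visible text from an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def scrape_page(url: str) -> dict:
    """Fetch one public page and package it as a structured record."""
    html = urlopen(Request(url, headers={"User-Agent": "example-node/0.1"})).read().decode("utf-8", "ignore")
    parser = TextExtractor()
    parser.feed(html)
    return {
        "url": url,
        "fetched_at": int(time.time()),
        "text": " ".join(parser.chunks),
    }

# A node would batch records like this into JSONL and hand them to the network
# for cleaning, annotation, and resale as a structured dataset.
if __name__ == "__main__":
    record = scrape_page("https://example.com")
    print(json.dumps(record)[:200])
```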

The competitive edge comes from decentralization, which reduces scraping costs. Projects like Grass and Masa are turning public web data into a permissionless, commoditized resource.

Grass launched in 2024 as a decentralized scraping network built on Solana. Within a year, it grew to over 2 million active nodes. Users install a lightweight desktop app that transforms their device into a Grass node, contributing idle bandwidth to crawl the web.

Each node handles a small chunk of the scraping workload, and together they pull in over 1,300 TB of data daily, a figure that is still growing. That data is bundled and sold as a continuous feed to AI companies.
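Dividing those two figures shows why the model works: each node only needs to contribute a sliver of bandwidth (this assumes an even split across nodes, which real networks won’t have):

```python
# Rough per-node contribution implied by the figures above (even split assumed).
nodes = 2_000_000        # ~2 million active nodes
daily_tb = 1_300         # ~1,300 TB scraped per day across the network

per_node_gb = daily_tb * 1_000 / nodes   # 1 TB = 1,000 GB in decimal units
print(f"~{per_node_gb:.2f} GB per node per day")   # ~0.65 GB/day, i.e. spare household bandwidth
```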

By late 2024, Grass was reportedly generating ~$33 million in annualized revenue from AI clients, which we hear includes some of the big AI research labs we’re all familiar with (speculation, not confirmed).

Over time, it plans to distribute revenue back to node operators and token stakers, essentially treating data monetization as a shared revenue stream.

The vision is bigger than scraping: Grass aims to become a decentralized API for real-time data. It plans to launch Live Context Retrieval, letting clients query real-time web data from across the network, which will require far more nodes than it has today.

Masa is taking a different route through the Bittensor ecosystem, running a dedicated data-scraping subnet (Subnet 42). Its “data miners” collect and annotate real-time web content, delivering data feeds to AI agents. Developers tap Masa to retrieve X/Twitter content to feed directly into LLM pipelines, bypassing costly APIs.

To scale, both Grass and Masa depend on a steady base of reliable node operators and contributors. That makes incentive design a core challenge. Other key challenges:

  • Very noisy data, prone to bias

  • Regulatory grey area

  • Lack of a real competitive moat since data is non-exclusive

2. Private Data, User-Controlled and Monetized

This focuses on unlocking high-value data that lives behind walls: personal, proprietary, and unavailable through public scraping. Think DMs, health records, financial transactions, codebases, app usage, smart device logs.

The core hypothesis: Private data contains deep, high-signal context that can dramatically improve AI performance, if it can be accessed securely and with user consent.
