AI Training Data Has a Provenance Problem

I. The Trust Crisis Built on Invisible Assumptions

The Internet Forgot to Sign Its Work

Every major frontier model, including GPT-4, Claude, Llama, and Gemini, was trained on data scraped from the web at a scale that makes individual creator consent not merely inconvenient but structurally impossible. The entire infrastructure of the modern AI industry was built on an unspoken assumption: that public availability equals permission, and that if you published something on the internet, you consented to whatever anyone might do with it downstream.

That assumption is now being litigated out of existence. And the companies that bet on it are scrambling.

The provenance problem is not merely a PR or legal problem. It is an architectural one that few anticipated. The internet systems that predated AI had no native way to track what was used, when it was used, who owned it, or under what terms.

II. The Mechanics of Mass Data Acquisition

How AI Companies Actually Source Training Data

To understand why the provenance problem is architectural rather than incidental, we need to understand how foundation model training actually works at the data layer.

The Common Crawl Dependency

Most training corpora for GPT-class models, Llama, Claude, and others are built substantially on Common Crawl. Datasets such as RefinedWeb, C4, and Dolma, which are used to train flagship AI models, all derive their data from Common Crawl, a nonprofit that has been archiving the web since 2008. Common Crawl is, effectively, the shared data utility of the AI industry. It is also where the consent problem begins.

While Common Crawl's mission to catalog and archive the World Wide Web may seem purely altruistic, its archives contain copyrighted content, pirated material, and adult websites, much of it collected without explicit consent from the underlying creators and some of it from behind paywalls meant to protect premium content. In 2023, OpenAI and Anthropic each donated $250,000 to Common Crawl, and Andreessen Horowitz donated $100,000 (the organization is technically a charity). These donations do not inherently change Common Crawl's mission of cataloging the web, but they indicate an alignment of interest that muddies the waters of provenance and makes its claim to be purely altruistic and neutral harder to accept at face value. The arrangement also creates a legal gray zone: if the data came from a nonprofit, it is easier to claim distance from the question of whether it should have been collected in the first place.

The robots.txt Fiction

Robots.txt was designed in 1994 with the aim of creating a symbiotic relationship between websites and search engines. Its application to AI training is a category error that the industry has treated as adequate consent.

The internet has always had a degree of voluntary compliance baked into it. Search engines drove users and traffic to a site in exchange for access to the data the site contained. The arrangement worked for search because it was a mutual exchange that gave websites an incentive not to block crawlers. AI training, however, is a one-way extraction with no traffic return: data is ingested and converted into the numerical weights that govern how the model operates. That inverts the incentive structure, and the compliance record has deteriorated accordingly.

TollBit reported that as of Q1 2025, 12.9% of bots now ignore robots.txt files entirely — up from 3.3% the year prior. Multiple AI developers have been accused of bypassing robots.txt opt-outs to scrape publisher websites. The pattern appears to involve AI systems distinguishing between crawling for training and crawling for "inference-time retrieval" — meaning even a company that registered one crawler to respect opt-outs may deploy a separate one that does not.
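To make the mechanics concrete, here is a minimal sketch, using Python's standard urllib.robotparser module, of how a compliant crawler consults robots.txt. The user-agent names ExampleTrainingBot and ExampleRetrievalBot, the rules, and the URL are all hypothetical and chosen only for illustration. It shows why an opt-out written against one crawler name does not bind a second crawler operating under a different name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the publisher opts out of one named training
# crawler but leaves the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return what a well-behaved crawler would conclude from robots.txt."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://publisher.example/essays/2018-essay"
training_allowed = is_allowed("ExampleTrainingBot", url)    # opted out by name
retrieval_allowed = is_allowed("ExampleRetrievalBot", url)  # falls under "*"
print(training_allowed, retrieval_allowed)
```

Nothing in the protocol enforces either answer. The check only runs if the crawler's operator chooses to run it, which is why robots.txt is a request rather than a control.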

In December 2025, OpenAI quietly removed language indicating that its ChatGPT-User crawler would comply with robots.txt rules. The documentation change went largely unnoticed but represented a significant policy shift. Common Crawl's CEO noted the implication: user-initiated AI browsing is no longer constrained by the consent signals or paywalls websites have set.

The Consent Collapse Hidden in Plain Sight

The Data Provenance Initiative's landmark 2024 study, "Consent in Crisis: The Rapid Decline of the AI Data Commons," quantified what many creators already sensed and experienced. In a single year (2023–2024), websites rapidly escalated their data restrictions, rendering approximately 5% or more of all tokens in C4 (one of the most widely used training datasets) fully restricted from use under robots.txt. For Terms of Service crawling restrictions, a full 45% of C4 is now restricted.

These numbers are more alarming than they first appear. That 5% robots.txt restriction concentrates almost entirely in the highest-quality, most actively maintained data on the web: premium journalism, academic content, and well-curated professional writing. The long tail of low-quality scraped content remains available. The signal is being locked down; the noise remains open. In the short term this may be considered a win, but these models are only the beginning, and AI is not going anywhere.

AI companies are pushed toward three degraded substitutes: synthetic data, social media slop, and jurisdictional arbitrage. Each carries a distinct failure mode. Synthetic data collapse now has theoretical backing: recent analyses suggest that even 1 synthetic sample in 1,000 can trigger model degradation that more compute cannot fix. Social media training produces what Grok demonstrates at scale. Partly trained on X posts rampant with misinformation and conspiracy theories, the model incorrectly blamed a trans pilot for a helicopter crash, claimed the Trump assassination attempt was staged, and conjured criminal histories for innocent people. Jurisdictional arbitrage produces fragmented, regionally degraded products, as GDPR pressure forced xAI to serve EU users a Grok trained on an entirely different, lower-quality corpus.

Grok's antisemitic outputs, its election misinformation, and its epistemic drift toward the ideological character of its training platform all originated upstream of any guardrail that could have caught them. If the same thing happened to an AGI-level model with agentic capabilities, one whose worldview was quietly shaped by poisoned training data, it could confidently reason its way toward conclusions that reflect the biases of a 2024 X feed rather than reality, while embedded in critical infrastructure, government systems, or financial decision-making before anyone thought to check what it had actually learned. The consequences could be catastrophic.

The 'Public = Permitted' Logic and Its Legal Status

The foundational legal argument AI companies have relied upon — that publicly available data is implicitly licensed for any use — is being directly challenged in courts. The new generation of AI-scraping cases uses robots.txt affirmatively: not as a privacy mechanism but as evidence that AI companies knew they lacked authorization, were told they lacked authorization, and proceeded anyway.

The threshold question for 2026 is no longer whether consent matters. Courts and regulators have settled that. The question is who bears the burden of obtaining it and what infrastructure must exist to make consent legible at scale.

III. The Anatomy of Consent Failure

What Creators Actually Consented To

Creator consent operates at a fundamentally different level of granularity than AI company data practices. These are not disputes dancing about the edges of well-defined law. They are disputes about whether the law's underlying consent framework can be applied to a system that was never designed to honor it.

A writer who published an essay in 2018 consented to readers reading it. They did not consent to it being weight-updated into a model that now competes with their livelihood, generates content in their style, and returns no economic value to them. These are not the same act. The legal system is only beginning to catch up to that distinction.

The Systematic Absence of Attribution

One of the most technically under-appreciated dimensions of this provenance problem is how attribution evaporates as data moves through the training pipeline. Creators are rarely credited as having contributed training data, despite the occasional artist signature or watermark that slips through into generated content. The absence of attribution is a consequence of how these systems were designed. The point was to extract statistical structure, not to preserve provenance chains.

The Unlearning Problem — Why Retroactive Consent Fails

Perhaps the most technically important dimension of the provenance crisis is that retroactive consent is nearly impossible to implement at the model level. Methods to retract, or "unlearn," data from a model after training is complete are currently of limited reliability. Machine unlearning methods often fail to fully remove the intended information, or can harm other aspects of a production model — which disincentivizes their use even when creators demand it.

The Common Crawl architecture compounds this: in November 2025, The Atlantic reported that despite publishers submitting takedown requests (including The New York Times in July 2023), Common Crawl's archives still contained the supposedly removed content. Common Crawl's executive director said the organization's file format (WARC) is "immutable": once data is archived, it is not designed to be removed. Deleting records from WARC files at the scale at which Common Crawl operates would be extremely costly and complex. Each crawl snapshot consists of thousands of WARC files, each holding records from many overlapping domains. Removal would involve decompressing, parsing, filtering, and recompressing every affected file, then rewriting it to storage.

Common Crawl maintains a separate URL index (the CDX index) that maps URLs to byte offsets within specific WARC files. If a record is removed from the CDX index, that content becomes effectively invisible to anyone using standard tooling to access the archive. The WARC bytes still exist on S3, but nothing points to them. De-indexing is not the same as deletion, however: anyone with direct access to the files and knowledge of their structure can still reach that content. AI labs downloading bulk data for training runs do not use the public CDX index the way a casual researcher would. They ingest raw WARC files directly. De-indexing a publisher's content from the public search interface therefore does nothing to prevent it from flowing into a training pipeline that operates at the file level.
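The gap between de-indexing and deletion can be illustrated with a toy model. The sketch below is not Common Crawl's tooling or the real WARC format. It assumes only the general shape described above (compressed records concatenated into one file, plus a separate index of byte offsets), and every name and URL in it is hypothetical.

```python
import gzip
import zlib

# Toy stand-in for a WARC file: one gzip member per captured page,
# concatenated back to back. Real WARC records carry far more metadata.
pages = {
    "https://blog.example/post": b"open blog post",
    "https://publisher.example/premium": b"paywalled article text",
}

archive = b""
url_index = {}  # toy stand-in for the URL index: url -> (offset, length)
for url, body in pages.items():
    member = gzip.compress(url.encode() + b"\n" + body)
    url_index[url] = (len(archive), len(member))
    archive += member


def fetch_via_index(url):
    """Index-based access: a ranged read at the recorded offset."""
    if url not in url_index:
        return None
    offset, length = url_index[url]
    record = gzip.decompress(archive[offset:offset + length])
    return record.split(b"\n", 1)[1]


def scan_raw_archive(blob):
    """File-level access: walk every gzip member and ignore the index."""
    found = {}
    pos = 0
    while pos < len(blob):
        member = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
        record = member.decompress(blob[pos:]) + member.flush()
        pos = len(blob) - len(member.unused_data)  # start of the next member
        url, body = record.split(b"\n", 1)
        found[url.decode()] = body
    return found


# "Takedown" by de-indexing: the pointer is removed, the bytes are not.
del url_index["https://publisher.example/premium"]
via_index = fetch_via_index("https://publisher.example/premium")
via_scan = scan_raw_archive(archive)["https://publisher.example/premium"]
print(via_index, via_scan)
```

In this toy model the index lookup comes back empty while the sequential scan still recovers the article, which is the same asymmetry a bulk training pipeline would exploit.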

This is an architectural failure. And it points directly to the only viable solution: provenance must be established at creation, before data enters any pipeline. The window for consent must be at the origin, not at the destination.

Retroactive permissioning is an unsolvable problem. Pre-registration is not. The only infrastructure that can honor creator consent at the scale AI requires is provenance embedded cryptographically at the moment a work is created.

The Transparency Gap as Structural Harm

The unsettled legal status of AI training data has led to compensation proposals over which creators have little control. Greater data transparency would allow creators to know which models trained on their work, give them the opportunity to provide (or deny) consent, verify proper credit, and seek compensation in applicable cases. None of this infrastructure currently exists in any coherent form. The ASCAP/BMI analogy is instructive: before those rights management organizations existed, the legal rights of songwriters were also acknowledged but unenforceable at scale because the infrastructure to track, register, and monetize those rights didn't exist. The rights and the infrastructure are different things. AI is in the rights-without-infrastructure phase.

IV. Courts, Congress, and Regulators

The Legal Architecture Is Catching Up — and Faster Than Anyone Expected

The courtroom landscape shifted dramatically in 2025 in ways that directly define the future. The direction of legal travel is unambiguous: documented provenance is becoming the decisive variable in AI training litigation, and the companies and creators without it are most exposed.

2025: The Year Courts Drew Real Lines

Three federal decisions in 2025 collectively formed the first coherent framework for how copyright interacts with machine learning. In Thomson Reuters v. Ross Intelligence (D. Del. 2025), the court found that using Westlaw headnotes to train a competing legal-research AI failed fair use because the training directly substituted for the original product's market function. In Bartz v. Anthropic (N.D. Cal. 2025), the court found Claude's training transformative because it extracted statistical patterns rather than storing expressive content. In Kadrey v. Meta (N.D. Cal. 2025), a similar analytical-use argument succeeded.

The emerging trend: transformation protects learning; substitution invites liability. But buried in both favorable opinions is that the defense depended on the ability to document what data was used and how. Training data transparency went from an ethical aspiration to a litigation necessity in a single year.

The Copyright Office Weighs In

On May 9, 2025, the U.S. Copyright Office released its comprehensive guidance on generative AI training — the third and final report in its Copyright and Artificial Intelligence series. The 108-page report presented a cautious but consequential interpretation: while analytical use may qualify as fair use, the wholesale ingestion of entire works "ordinarily weighs against fair use." The office explicitly flagged the importance of provenance documentation, metadata requirements, and transparency obligations as the framework within which future licensing standards will develop.

Legislative Pressure Toward Explicit Consent

Senator Hawley's proposed AI Accountability and Personal Data Protection Act represents the furthest legislative reach yet: a private right of action that would allow creators to sue AI developers for using their work without express, informed consent. The bill specifies that consent must be willfully given, cannot be obtained as a condition of using a product, and places the burden of obtaining consent on whoever exploits the data. If enacted in any form resembling the current proposal, it would make consent infrastructure (the ability to document, verify, and honor creator preferences at scale) legally mandatory rather than ethically aspirational.

California Already Moved

California's A.B. 2013, enacted in September 2024, requires developers of generative AI systems to disclose details about training datasets for any system released to California residents on or after January 1, 2022. It is the first U.S. mandate for training data transparency. The disclosure requirement is limited to a summary, not a full audit, but it establishes the principle that opacity about training data is a legally actionable position. More states are watching.

V. The Technical Architecture of a Solution

Why C2PA Is Necessary But Not Sufficient — and Where Stelais Fits

The provenance problem has attracted a serious technical response from the industry. The Coalition for Content Provenance and Authenticity (C2PA) represents the most significant attempt to build provenance infrastructure at scale. Understanding what C2PA does well and where it stops is essential to understanding the infrastructure gaps.

What C2PA Gets Right

At the heart of the C2PA specification is the Content Credential: a cryptographically bound structure that records an asset's provenance. Content Credentials contain assertions about origin, modifications, use of AI, and — critically — whether a creator wishes to allow their content to be used for AI training. The specification is tamper-evident: if someone modifies the associated content or metadata after signing, the modification is detectable.
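The tamper-evidence property can be sketched in a few lines. This is not the C2PA manifest format or any C2PA library: real Content Credentials use certificate-backed signatures, whereas this sketch substitutes a shared-secret HMAC from Python's standard library purely to stay self-contained. The field names (content_sha256, assertions, ai_training) and the key are assumptions made for illustration.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # assumption: stand-in for a signer's private key


def make_credential(content: bytes, assertions: dict) -> dict:
    """Bind assertions to the content hash, then sign the bundle."""
    claim = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "assertions": assertions,
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify(content: bytes, credential: dict) -> bool:
    """True only if neither the content nor the claim changed after signing."""
    payload = json.dumps(credential["claim"], sort_keys=True).encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, credential["signature"])
    content_hash = hashlib.sha256(content).hexdigest()
    hash_ok = credential["claim"]["content_sha256"] == content_hash
    return signature_ok and hash_ok


photo = b"original image bytes"
cred = make_credential(
    photo, {"creator": "Example Creator", "ai_training": "notAllowed"}
)

intact = verify(photo, cred)
edited_content = verify(photo + b" retouched", cred)

tampered = json.loads(json.dumps(cred))  # deep copy via a JSON round trip
tampered["claim"]["assertions"]["ai_training"] = "allowed"
edited_claim = verify(photo, tampered)
print(intact, edited_content, edited_claim)
```

Changing either the content or the training preference after signing makes verification fail. The sketch says nothing about whether the signer should be trusted in the first place, which is a separate question.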

The adoption trajectory is meaningful. Google joined C2PA's steering committee and integrated Content Credentials into Google Search and its Ads systems. OpenAI committed to attaching Content Credentials to Sora-generated video (a product it has since wound down). Adobe has implemented credentials across Photoshop, Lightroom, and Firefly. Leica released the first camera with native Content Credential support in hardware. The C2PA specification is expected to be adopted as an ISO international standard and is being examined by the W3C for browser-level implementation.

This is the signal layer working as intended. Major platforms and tools are beginning to mark content with verifiable provenance at the moment of creation.

What C2PA Cannot Do

C2PA is a signal layer. It verifies that provenance metadata is well-formed, cryptographically signed, and untampered at a point in time. It does not provide a permanent, decentralized, immutable record of that provenance.

The specification itself is explicit on this point: Content Credentials do not provide value judgments about whether a given set of provenance data is "true" — only whether it is well-formed and free from tampering, valid, and trusted relative to a specific trust list. The trust list is the critical dependency. Trust lists are maintained by organizations. Organizations can change their policies, lose funding, be acquired, or be shut down. The C2PA architecture ultimately rests on centralized certificate authorities and trust anchors — which means the permanence of any given provenance record is contingent on the continued operation and good behavior of those institutions.

There are also more fundamental stripping risks. C2PA metadata is attached to files. Files can be re-saved, re-encoded, or processed through pipelines that strip metadata. The credential may not survive the journey from creation to wherever the data ends up in a training corpus.

C2PA tells you that a credential exists and is well-formed. It cannot tell you that the underlying creator claim is permanently anchored and will remain verifiable regardless of what any company does next.

The Stelais Architecture: Proof, Not Signal

Stelais is built on Arweave's permanent storage protocol, which provides the foundation the C2PA signal layer lacks: an immutable, decentralized ledger of creator certification that exists independent of any trust list, certificate authority, or single company's continued operation. That independence is imperative given Common Crawl's poor handling of its data.

When a creator registers a work on Stelais, three things happen that C2PA alone cannot guarantee. First, a cryptographic hash of the work is written to Arweave's permanent ledger — creating a record that cannot be modified, deleted, or revised by any party, including Stelais itself. Second, the creator's identity assertion is anchored to that hash at a specific timestamp, establishing an unambiguous chain of custody from creation to registration. Third, the creator's explicit consent preferences — whether the work may be used for AI training, under what license terms, and with what compensation requirements — are encoded into that permanent record.
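The three steps can be sketched as a plain data record. What follows is a hypothetical illustration, not the Stelais schema or the Arweave API: the field names, the consent vocabulary, and the creator identifier are all assumptions, and the step that would write the serialized record to a permanent ledger is deliberately left out.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_registration(work: bytes, creator_id: str, consent: dict,
                       registered_at: datetime) -> dict:
    """Combine the work's hash, the identity claim, and the consent terms."""
    return {
        "work_sha256": hashlib.sha256(work).hexdigest(),
        "creator_id": creator_id,
        "registered_at": registered_at.isoformat(),
        "consent": consent,
    }


def serialize(record: dict) -> bytes:
    """Canonical bytes: the same record always yields the same payload."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode()


def matches(work: bytes, record: dict) -> bool:
    """Check that a file is the exact work the record refers to."""
    return hashlib.sha256(work).hexdigest() == record["work_sha256"]


def training_permitted(work: bytes, record: dict) -> bool:
    """A pipeline-side check: only registered, opted-in works pass."""
    return matches(work, record) and record["consent"]["ai_training"] == "allowed"


essay = b"An essay first published in 2018."
record = build_registration(
    essay,
    creator_id="creator:example-writer",  # hypothetical identifier
    consent={"ai_training": "denied", "license": None, "compensation": None},
    registered_at=datetime(2026, 1, 15, tzinfo=timezone.utc),
)
payload = serialize(record)  # these bytes are what would be anchored
print(matches(essay, record), training_permitted(essay, record))
```

Because only a hash of the work is recorded, the check identifies an exact copy of the registered file. A re-encoded or edited copy would produce a different hash and would need its own registration or a separate matching step.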

The result is not just a signal that can be verified against a trust list. It is a proof: a timestamped, permanent, publicly verifiable record of who created what, when, and under what terms. For AI companies navigating the legal landscape described above, that distinction matters enormously. A C2PA credential that relies on a certificate authority that no longer exists is not defensible in discovery. A hash anchored permanently to a decentralized ledger is.

The Positioning: Infrastructure for Both Sides of the Market

The non-obvious strategic insight is that Stelais's primary market is not just creators seeking protection. It is AI companies seeking defensible training pipelines. As the data commons closes (roughly 45% of C4 is now restricted under Terms of Service), the value of provenance-verified, opt-in data pipelines increases dramatically. An AI company that can demonstrate it trained on Stelais-registered works, with provenance chains that are permanently verifiable and consent records that are cryptographically anchored, has a qualitatively different litigation posture than one relying on Common Crawl and a fair use argument. Stelais's architecture also benefits both sides. Creators win because they are either reasonably compensated for their contributions to the training data or can rest assured that their work is protected and that AI companies are respecting their decision to opt out. AI companies win because they gain access to high-quality training data and avoid the ouroboros of continuous training on low-quality information or pre-existing AI-generated "slop."

Stelais is not just a rights protection tool for creators. It is the supply-chain infrastructure that allows AI companies to build legally defensible, regulatory-compliant models as the era of ambient consent ends.

VI. Conclusion

The Internet Forgot to Sign Its Work

The provenance problem is ultimately a design flaw in how the internet was built: open by default, attribution optional, consent implicit. AI training exploited that design at a scale the original architects never imagined. The legal system is now forcing a renegotiation of that implicit contract — through courts, through regulators, through legislation — and the companies on both sides of the creator-AI divide are discovering that the absence of provenance infrastructure is not a gap they can negotiate around. It is a gap they have to fill.

The creators who built the training data that made AI possible deserve a system that can honor their consent, track their contributions, and compensate them for their value. The AI companies that need defensible training pipelines deserve infrastructure that makes consent legible and legally durable. Both problems have the same solution: provenance at creation, permanent and verifiable, independent of any single institution's continued cooperation.

That is what Stelais builds.