Technical specification

Audience: technical evaluator at an AI lab (staff engineer or compliance architect).

Scope: This document specifies the registry's regulatory grounding, architecture, integration model, audit-defensibility properties, and governance.

The on-the-wire format is specified separately. This document describes one operator's deployment and its regulatory and audit framing. The open, vendor-independent declaration format it implements (subject scopes, reservation vocabulary, declarant signatures, anchor construction, and the normative verification procedure) is ADP-1, the Akaeon Declaration Protocol. Read ADP-1 for the format and the "verifiable without trusting Akaeon" core; read this document for how the registry deploys it and what regulatory obligations it helps discharge.

Companion documents: ADP-1 declaration format · Lab integration runbook · Concrete publisher opt-out example · Standalone verifier.

1. Executive summary

The Akaeon Registry is a domain-level content opt-out registry that produces cryptographic, publicly-anchored records of publisher preferences regarding machine learning training. It exists so that a lab's compliance process can answer one question, "did this domain opt out before our training cutoff?", with an artifact that survives later legal or auditor challenge.

The registry is not built greenfield. It is the second consumer of a production substrate that already operates at stelais.com, a creator product that has been anchoring per-record content provenance proofs to Arweave with explicit licensing terms. The substrate provides four primitives, packaged as workspace libraries (@akaeon/core-arweave, @akaeon/core-verification, @akaeon/core-fingerprinting, @akaeon/core-watermarking), each brand-neutral and parameterized at the API boundary so the same code paths serve both consumers without duplication. The registry adds the publisher- and lab-facing API surface, a DNS-based publisher verification flow, and a Merkle batching layer for higher-throughput opt-out submissions. Underneath, the record-level signing and anchoring path is identical to Stelais's.

The integration story for a lab is: one HTTPS GET per domain at training-data ingestion time, returning a JSON bundle whose every cryptographic claim is checkable in standard-library code against the public Arweave network — three independent checks, no proprietary cryptography, no required SDK, no required ongoing relationship with the registry beyond initial credentialing.

This specification covers what is built today (the cryptographic substrate, including the Ed25519 signing and Arweave canonical-payload anchoring), what is in active development (the publisher submission API, the Merkle batcher, the lab lookup endpoint), and what the audit-defensibility properties are for both halves of the system.

2. Regulatory context

A technical evaluator will ask two questions of this section: what specific obligations does the registry help me discharge, and would a record from this registry survive a regulator's or auditor's scrutiny in the jurisdictions I operate in? This section maps the live regulatory landscape onto the registry's mechanisms, and notes where the registry's design is defensible and where the obligation lies elsewhere.

The summary up-front: across the EU, UK, US, and Asia-Pacific, the trend in 2024–2026 has converged on a single question regardless of statutory framing: did the lab know the source had reserved its content, and what proof does the lab have of what it knew and when? The registry is built specifically to be the enduring artifact that answers that question. The legal frame around the question varies by jurisdiction; the evidentiary primitive the lab needs does not.

2.1 European Union: AI Act + CDSM Directive

The EU is the most prescriptive jurisdiction and is where the registry's design has the cleanest fit.

EU AI Act (Regulation 2024/1689), Article 53. General-purpose AI model providers must, under Article 53(1)(c), "put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790." Article 53(1)(d) separately requires "a sufficiently detailed summary about the content used for training of the general-purpose AI model" in line with the AI Office template. GPAI provider obligations became applicable from 2 August 2025; the registry is designed for a world in which these obligations are already operative and the supervisory authority is asking for documented compliance, not for a promise of future compliance.

The critical phrase is "identify and comply with." A lab cannot demonstrate it identified a reservation without a record of the identification, i.e. what was checked, against what registry, at what time, with what result. The General-Purpose AI Code of Practice (the Commission-endorsed soft-law instrument signatories adhere to) treats engagement with machine-readable reservation mechanisms as a state-of-the-art compliance signal; the registry is one of those mechanisms, with the cryptographic properties to make engagement provable rather than self-asserted.

Directive (EU) 2019/790 (CDSM), Article 4. Article 4(1)–(2) creates a text and data mining exception. Article 4(3) carves out an opt-out: the exception applies unless the use of works "has been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online." The CDSM is the underlying legal mechanism the AI Act Article 53(1)(c) refers back to.

The CDSM's "appropriate manner" and "machine-readable means" are deliberately technology-neutral. robots.txt, ai.txt, IPTC photo metadata, and a domain-level registry like Akaeon are all candidate channels. The registry's distinguishing properties for an Article 4(3) defense are: (i) the reservation is associated with the publisher by DNS challenge, not by assertion; (ii) the timestamp is provable against a public substrate, not against the publisher's or registry's own clock; (iii) the reservation persists in a form the rightholder can point to in court even if the registry, the publisher's website, or any intervening party becomes uncooperative.

2.2 United Kingdom

The UK government's Copyright and AI consultation ran from 17 December 2024 to 25 February 2025. The government's originally preferred option, a broad TDM exception coupled with a rightsholder opt-out mirroring the shape of EU CDSM Article 4(3), drew support from only ~3% of the ~11,500 respondents; creative-industry submissions strongly favoured mandatory licensing and transparency. The government's response, the Report on Copyright and Artificial Intelligence required under the Data (Use and Access) Act 2025, was published on 18 March 2026 and notably did not proceed with the opt-out model; the current UK posture is evidence-gathering and commissioned research, with no near-term statutory reform.

This leaves UK-facing labs in a litigation-driven posture closer to the US frame than to the EU's. The registry's value in the UK is therefore primarily evidentiary: a contemporaneous, third-party-timestamped record of the lab's sourcing decisions that supports a defence to direct or secondary infringement claims. The registry's design is forward-compatible with whatever reservation channel a future UK regime may adopt. The cryptographic properties (DNS-anchored authority, public-substrate timestamping, post-hoc verifiability without the registry's cooperation) are agnostic to the statutory phrasing.

2.3 United States

The US has no federal TDM exception, no federal mandatory opt-out channel, and a litigation-led rather than statute-led trajectory. The registry's value in the US frame is less about discharging a specific statutory obligation and more about providing the evidence that a fair use defense, a DMCA Section 1202 claim, or a state-law action can rely on.

Case law as of 2026-06. Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence (D. Del., Judge Bibas, partial summary judgment 11 February 2025) was the first US merits ruling rejecting a fair use defense for AI training in a commercial-substitute context, though the court expressly limited its reasoning to non-generative AI. Bartz et al. v. Anthropic (N.D. Cal., Judge Alsup, summary-judgment order 23 June 2025) split the question: training on legally-acquired books was "exceedingly transformative" and fair use, but Anthropic's downloading of ~7 million books from shadow libraries (LibGen, PiLiMi) was not. The case was certified as a class action, and on 25 August 2025 the parties announced a $1.5 billion settlement (preliminarily approved 25 September 2025; final approval hearing 14 May 2026) covering ~500,000 pirated works; it is the largest US copyright settlement on record. Kadrey v. Meta, NYT v. OpenAI, and a growing publisher docket are running in parallel.

The pattern across these cases is that the lab's sourcing record — provenance, licensing, opt-out signals at acquisition time — is critical in the fair use analysis and in damages calculations. A lab that can produce, per source, a cryptographically anchored record of opt-out status at the moment of acquisition is in a materially different evidentiary position than a lab whose claim is "we ran a robots.txt check" without a contemporaneous artifact.

17 U.S.C. § 1202 (DMCA copyright management information). Section 1202(b) prohibits the knowing removal or alteration of copyright management information. Several pending matters allege that training-data ingestion stripped CMI from in-corpus works. The registry does not handle CMI directly, but it produces a CMI-adjacent record: a publisher-authoritative, machine-readable assertion about training use of the publisher's content. Labs that ignore the registry while ingesting a publisher's content invite a parallel argument under § 1202(b)(3) (distributing works knowing CMI was removed or altered) that the registry's record helps the rightsholder construct.

Federal proposals. The NO FAKES Act (federal, advanced through Senate Judiciary in 2024–2025) creates a digital-replication-right framework for voice and likeness. The COPIED Act (federal, reintroduced 2025) proposes provenance and tamper-evidence requirements for AI-generated content. Neither is a TDM-reservation statute, but both push in the direction of "the lab must have a contemporaneous record of what it did with each source." The registry's design contemplates extending its canonical record types to cover non-domain scopes (per §10.5) as these frameworks formalize.

State action. Tennessee's ELVIS Act (2024) and California's analogous frameworks address replication of voice and likeness. New York, Illinois, and a growing set of states have introduced or passed disclosure-and-watermark statutes touching AI training and output. State-law obligations are jurisdictionally fragmented; the registry's value in this frame is the same as in the federal one — a single record format that a state-court litigator can interpret without the registry's cooperation.

2.4 Asia-Pacific

Japan. Article 30-4 of the Copyright Act permits use of works for "non-enjoyment" purposes including information analysis, but the exception does not apply where the act "would unreasonably prejudice the interests of the copyright owner." The Agency for Cultural Affairs (Bunka-cho) released its draft Approach to AI and Copyright in January 2024 and published the General Understanding on AI and Copyright in Japan in May 2024, both clarifying that the Article 30-4 carve-out has real limits — commercial-substitute uses, style-targeted fine-tuning (e.g. LoRA on a specific artist), and ingestion that prejudices the rightholder's existing markets fall outside the exception. A publisher's machine-readable opt-out, recorded contemporaneously, is direct evidence of a reserved interest the rightholder can later cite.

Singapore. Section 244 of the Copyright Act 2021 creates a computational data analysis (CDA) exception that permits training on copyrighted works for any purpose (commercial or non-commercial), subject to a lawful access condition and a prohibition on redistribution. Unlike the EU's CDSM Article 4(3), Singapore's CDA exception cannot be overridden by contract and contains no statutory opt-out mechanism, which makes the registry's value in Singapore less about discharging an Article-4-style obligation and more about (i) evidentiary record-keeping for any "lawful access" dispute, and (ii) cross-border compliance for Singapore-based labs whose models are deployed into EU or US markets where the registry's record does carry direct statutory or evidentiary weight.

Australia. Australia has no statutory TDM exception. On 5 August 2025 the Productivity Commission, in its interim report Harnessing data and digital technology, proposed introducing one. On 26 October 2025 Attorney-General Michelle Rowland publicly rejected that proposal, stating the government "will not entertain a text and data mining exception"; the government's stated direction is instead to develop a regime under which Australian creators are "fairly remunerated" via a Copyright and AI Reference Group consultation. Like the UK, this leaves Australian-facing labs in a litigation-and-licensing posture rather than an opt-out-statute posture. The registry's value is the same evidentiary primitive: a contemporaneous record of what was checked and what was found, which any future Australian remuneration or licensing regime is likely to require labs to produce.

2.5 Industry self-regulatory frameworks

The registry is designed to complement, not replace, the existing self-regulatory stack. The stack as of mid-2026:

robots.txt — the long-standing crawler-control convention. Honors a Disallow directive per user-agent; not cryptographically signed; mutable by anyone with write access to the site; no timestamp; no audit trail. Useful as the first signal in an ingestion pipeline; insufficient as the sole evidentiary record.
ai.txt (Spawning) — a more expressive AI-specific extension proposed by Spawning AI. Same evidentiary limitations as robots.txt.
C2PA (Coalition for Content Provenance and Authenticity) — a content-credential standard producing cryptographically signed manifests embedded in or sidecar to media files. C2PA addresses per-asset provenance and authorship; the registry addresses domain-level training preferences. The two are orthogonal and complementary — a lab can honor a C2PA manifest's do-not-train assertion at the file level and check the registry at the domain level, and the two records reinforce each other.
IPTC Photo Metadata — per-image embedded metadata. The 2023.1 standard added a Data Mining field (PLUS-Coalition vocabulary, with values including Prohibited for AI/ML training) for opt-out signalling, alongside the separate DigitalSourceType field (for tagging AI-generated content). Same per-asset vs. per-domain distinction as C2PA; complementary, not substitutive.
TDMRep (W3C Community Group Final Report) — a TDM reservation protocol with two properties (tdm-reservation, tdm-policy) conveyed via HTTP headers, a .well-known file, or HTML metadata. Same kind of channel as ai.txt, with a more formalized vocabulary. The registry's lookup API is straightforward to wire alongside a TDMRep crawler in a lab's ingestion pipeline.
The Spawning "Do Not Train" registry — the closest existing analog to Akaeon. Spawning maintains an opt-out list and a Have I Been Trained API. The registry's distinguishing properties relative to Spawning are the cryptographic anchoring, the public verifiability without the registry's cooperation, and the audit-defensibility properties detailed in §7.

The registry's design philosophy is belt-and-suspenders: a lab should run as many signal channels as it reasonably can (robots.txt, ai.txt, C2PA, IPTC, TDMRep, and the registry) and record the result of each at ingestion time. The registry's role is to be the channel with the strongest evidentiary properties when later challenged.

2.6 Why the registry satisfies an auditor or regulator

The above frameworks differ in statutory framing but converge on a shared evidentiary requirement: the lab must be able to produce, at audit time, a record of what reservation signals it observed at training time and how it acted on them. An auditor or regulator looking at such a record will ask three questions of it. The registry is designed so the answer to all three is yes.

1. Was the record contemporaneous with the lab's claimed action? The registry's records are anchored to Arweave block timestamps that the registry cannot manipulate (§7.1). The lab's audit log records the Arweave transaction id at training time; an auditor can resolve that id against the public network and confirm the block timestamp predates the training run. This addresses the auditor's worst-case suspicion — that a non-compliant lab manufactured a paper trail after the fact — with cryptographic evidence rather than process attestation.

2. Does the record reflect what the publisher actually said? The registry's DNS-challenge flow ties the opt-out to control of the publisher's domain, not to the publisher's claim of control (§5, §7.3). The canonical record's hash is committed to a Merkle tree whose root is on the public chain; a publisher disputing the registry's record can be confronted with cryptographic evidence of their own DNS state at the time of submission. A regulator can verify the chain without trusting the registry, the lab, or the publisher individually.

3. Can the record be re-verified by a party who has no relationship with the registry? The verification path uses only standard-library cryptography against a public substrate (§7.2). An auditor's technical team can re-verify any record using ~30 lines of code in any modern language, against any Arweave gateway. The registry is not on the critical path of verification — it can be offline, hostile, or extinct, and the records still verify. This is the property that converts a lab's record from a process attestation (which an auditor must trust the lab and the registry to honor) into an evidentiary artifact (which the auditor can independently confirm).

The registry does not, and cannot, discharge the lab's substantive obligation — that the lab respects opt-outs in training. It produces the artifact the lab needs to prove it did. Section 8.2 is explicit about the attacks the registry does not defend against, including the case of a lab that fabricates a check it never performed; the registry's job is to make the honest lab's record defensible, not to police the dishonest one.

The remainder of this specification details the cryptographic and architectural mechanisms by which the registry delivers the three properties above. A reader who wants the regulatory mapping with full detail by jurisdiction is invited to engage Akaeon's compliance counsel directly; this section is the technical evaluator's grounding, not legal advice.

3. What problem the registry solves

A lab training on web-sourced content faces a structurally complex compliance problem: at the moment of training, the lab needs to be able to prove what it knew, and when it knew it, about whether each source had opted out. Three properties matter for that proof to survive later challenge:

1. Independent timestamping. The lab cannot rely on a registry's self-reported timestamps. If the registry says "this opt-out was effective 2026-04-01," and the lab's training cutoff was 2026-04-15, the registry could be lying — either backdating opt-outs to make the lab look non-compliant, or forward-dating them to give the lab cover. The registry's records must be timestamped by a third party the registry does not control.

2. Independent verifiability. The lab cannot rely on the registry remaining available, honest, or even existing at the moment of audit. If the registry vanishes, gets compromised, or is replaced by a successor entity, every record it ever produced must still be cryptographically verifiable against the same public substrate that timestamped it.

3. Tamper-evident chain. The link between a publisher's stated intent, the registry's record of that intent, and the public timestamp must be cryptographically unbreakable. A challenger should not be able to manufacture a plausible alternative record after the fact.

The registry's design centers on these three properties. Section 7 maps each property to the specific mechanism that delivers it.

The status quo without a registry is that labs either:

Don't track opt-outs at all (and accept the compliance and litigation risk), or
Build their own per-source opt-out crawler (which produces records the lab itself generated, with no third-party timestamp and no defense against backdating by the lab), or
Rely on robots.txt / ai.txt (which have no cryptographic timestamp, no DNS challenge for authority, no audit trail, and are silently mutable by anyone with site write access).

The registry provides the missing primitive: a cryptographically-signed, publicly-timestamped, third-party-anchored record of publisher opt-out intent, accessible by a single API call.

4. System architecture

The architecture is three layers stacked vertically.

4.1 Layer 1 — Consumers

Two services live at the top layer. Both are application-shaped Express/Node services in the same monorepo. Neither knows about the other; they share only the core packages underneath.

Stelais (services/stelais-api + services/stelais-web) — the creator-facing product, in production today. Issues per-record content-provenance proofs for creators. Out of scope for the registry.
Akaeon Registry (services/akaeon-registry) — the subject of this document. The workspace scaffold for it exists today; the publisher/lab API surface is in active development.

The two services are deliberately peers, not parent/child. This shape was established specifically to make the registry a separable consumer.

4.2 Layer 2 — Core packages

Four workspace packages publish the brand-neutral primitives both consumers depend on:

Package	What it provides	Production status
`@akaeon/core-arweave`	Canonical payload builders; cost estimator; safety-limit evaluator	Production today
`@akaeon/core-verification`	Ed25519 keygen, sign, verify (RFC 8032); AES-256-GCM key encryption; canonical-message builder factory	Production today
`@akaeon/core-fingerprinting`	Text SimHash, image pHash (DCT-based), audio STFT + constellation hashing, normalized text hashing	Production today
`@akaeon/core-watermarking`	DCT and LSB watermarking primitives for image content	Production today, used by Stelais

The brand-coupling rule across all four packages: every brand-specific parameter — app identifier, network identifier, canonical message prefix — is a required argument with no default. The package will throw if called without one. Stelais passes app: 'stelais', network: 'arweave', prefix: 'stelais:proof:v1'; the registry passes app: 'akaeon-registry', network: 'arweave', prefix: 'akaeon-registry:optout:v1'. The packages contain no string literal of either consumer's brand. This is verifiable by grep in the core source tree.
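The rule above can be illustrated with a minimal sketch. The function name createCanonicalMessageBuilder and its exact signature are hypothetical and are not taken from the core packages; only the parameter names (app, network, prefix) and the example values come from the text.

```javascript
// Hypothetical sketch of the "required argument, no default" rule.
// The function name and shape are illustrative, not the package's real API.
function createCanonicalMessageBuilder({ app, network, prefix } = {}) {
  // Every brand-specific parameter must be supplied by the consumer.
  for (const [name, value] of Object.entries({ app, network, prefix })) {
    if (typeof value !== "string" || value.length === 0) {
      throw new TypeError(`${name} is required and has no default`);
    }
  }
  // The returned builder prepends the consumer's prefix to its fields.
  return function build(...fields) {
    return [prefix, ...fields.map(String)].join("|");
  };
}

// The registry's parameters, as quoted in the paragraph above.
const buildOptOut = createCanonicalMessageBuilder({
  app: "akaeon-registry",
  network: "arweave",
  prefix: "akaeon-registry:optout:v1",
});
```

Because the factory contains no brand literal, a call that omits the prefix fails loudly instead of silently signing under another consumer's namespace.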

4.3 Layer 3 — Arweave anchoring substrate

The Arweave network is the public, permanent, third-party-operated trust root. Records anchored through @akaeon/core-arweave end up as transactions on Arweave, retrievable by transaction id at https://arweave.net/<txid>, queryable by GraphQL against the network's gateway nodes, and timestamped by the network's own block production.
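As a rough sketch of how a verifier might use the gateway's GraphQL interface, the helpers below build a query for a transaction's block and compare the block timestamp with a training cutoff. The helper names are hypothetical, and the comparison rule (strictly earlier than the cutoff) is an assumption; the sketch performs no network call itself.

```javascript
// Builds a request body for an Arweave gateway GraphQL endpoint
// (for example a POST to https://arweave.net/graphql).
function blockQuery(txId) {
  return {
    query:
      "query($id: ID!) { transaction(id: $id) { id block { height timestamp } } }",
    variables: { id: txId },
  };
}

// True when the block timestamp (Unix seconds, as the gateway reports it)
// is strictly earlier than the cutoff given as an ISO 8601 string.
function anchoredBefore(blockTimestampSeconds, cutoffIso) {
  const cutoffMs = Date.parse(cutoffIso);
  if (Number.isNaN(cutoffMs) || !Number.isInteger(blockTimestampSeconds)) {
    throw new TypeError("invalid timestamp input");
  }
  return blockTimestampSeconds * 1000 < cutoffMs;
}
```

A caller would send JSON.stringify(blockQuery(txId)) to a gateway of its choice and pass the returned block timestamp to anchoredBefore.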

Stelais's existing anchoring model is one Arweave transaction per record. The registry's anchoring model is one Arweave transaction per batch of opt-out records (with each record in a Merkle tree whose root is on-chain and whose inclusion proof is served by the registry). The substrate is the same; the registry adds a thin batching layer on top.

The registry has the option to switch to a different content-addressable substrate (ar.io, Irys, an L1 commitment) by swapping the network parameter and the underlying upload adapter. Existing records remain verifiable against the original network indefinitely; the choice is per-batch, not retroactive.

4.4 Safety controls inherited from the substrate

The substrate ships with production-grade safety controls that the registry inherits, with two default-value overrides where registry batch payloads differ from Stelais's per-proof records:

Kill switches (two independent paths) — ANCHORING_ENABLED=false halts all anchoring on the next operation, no redeployment required (the incident-response lever). ANCHORING_MODE=OFF produces the same effect and is the long-term "intentionally off" signal. Both paths return a SKIPPED preflight decision; batches accumulate in pending_anchor status pending re-enablement.
Per-batch cost cap — ANCHORING_MAX_PER_PROOF_USD blocks any single batch whose pre-anchor cost estimate exceeds the cap.
Daily and monthly budget caps — ANCHORING_DAILY_BUDGET_USD and ANCHORING_MONTHLY_BUDGET_USD enforced against the rolling anchoring_spend_log table.
Maximum payload size — ANCHORING_MAX_PAYLOAD_BYTES hard-blocks any payload that would exceed Arweave's per-tx cost-efficient range.
Dry-run mode — TURBO_DRY_RUN=true swaps the Turbo SDK for a mock client that produces DRY_RUN_<random> Arweave tx IDs without spending real credits.
Fail-closed semantics — if cost estimation fails, the default is to block; if the spend log can't be queried, the default is to assume budget is exhausted; if the balance probe fails, the default is to treat the balance as insufficient.
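The controls listed above can be sketched as a single preflight function. This is an illustration under assumptions, not the substrate's code: the probes object, the BLOCKED and ALLOWED decision names, and the ordering of the checks are hypothetical. Only the environment variable names and the SKIPPED decision come from the list.

```javascript
// Hypothetical preflight sketch. `probes` supplies two callbacks that may throw:
// estimateCostUsd() and spentTodayUsd().
function preflight(env, probes) {
  // Either kill switch short-circuits before any cost work is done.
  if (env.ANCHORING_ENABLED === "false" || env.ANCHORING_MODE === "OFF") {
    return { decision: "SKIPPED", reason: "kill switch" };
  }
  const block = (reason) => ({ decision: "BLOCKED", reason });

  let costUsd;
  try {
    costUsd = probes.estimateCostUsd();
  } catch {
    return block("cost estimation failed"); // fail closed
  }
  // `!(a <= b)` also blocks when a cap is unset or unparsable (NaN).
  if (!(costUsd <= Number(env.ANCHORING_MAX_PER_PROOF_USD))) {
    return block("per-batch cost cap exceeded");
  }

  let spentUsd;
  try {
    spentUsd = probes.spentTodayUsd();
  } catch {
    return block("spend log unavailable, budget assumed exhausted");
  }
  if (!(spentUsd + costUsd <= Number(env.ANCHORING_DAILY_BUDGET_USD))) {
    return block("daily budget exceeded");
  }
  return { decision: "ALLOWED", reason: "within limits" };
}
```

Writing each comparison as a negated "within limit" test means a missing or malformed cap blocks anchoring rather than permitting it.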

5. The cryptographic substrate

This section specifies the cryptographic primitives the registry depends on. The reader should leave this section confident they could re-implement verification of any registry record in their language of choice, using only the standard library.

5.1 Hashing

SHA-256 is the hash function throughout — content hashes, canonical payload hashes, Merkle tree leaves, DNS challenge digests.
Implementations: any FIPS 180-4 SHA-256. Node's crypto.createHash('sha256'), Python's hashlib.sha256, Go's crypto/sha256, Rust's sha2 crate, the Web Crypto API's SubtleCrypto.digest('SHA-256', ...).
Hex encoding is lowercase, no leading 0x, no whitespace.
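A minimal sketch of the hashing and encoding rules above, using Node's standard library. The helper names sha256Hex and isCanonicalSha256Hex are illustrative and not part of any published package.

```javascript
const { createHash } = require("node:crypto");

// Lowercase hex SHA-256 of a Buffer or UTF-8 string, with no "0x" prefix.
function sha256Hex(input) {
  return createHash("sha256").update(input).digest("hex");
}

// Accepts only the encoding described above: exactly 64 lowercase hex characters.
function isCanonicalSha256Hex(value) {
  return typeof value === "string" && /^[0-9a-f]{64}$/.test(value);
}
```

A verifier can apply isCanonicalSha256Hex to every hash field it reads before comparing values, so that an uppercase or prefixed digest is rejected rather than silently mismatched.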

5.2 Signature scheme

Ed25519 per RFC 8032 — Curve25519 in Edwards form, deterministic signing, no per-signature randomness, 32-byte public keys, 64-byte signatures.
The registry's signing keypair is generated via Node's crypto.generateKeyPairSync('ed25519'). Public key is the raw 32 bytes (transported base64); private key is DER PKCS8 (encrypted-at-rest with AES-256-GCM and the service master key, never transmitted).
Signature production calls Node's crypto.sign(null, message, privateKey).
Signature verification by any third party calls crypto.verify(null, message, publicKey, signature). To reconstruct a verifier-side public key from the registry's 32-byte raw public key, prepend the 12-byte Ed25519 SPKI DER header 30 2a 30 05 06 03 2b 65 70 03 21 00. Other Ed25519 libraries accept raw 32-byte keys directly without the SPKI wrap.
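The verification path described above can be sketched with Node's crypto module alone. The helper names are illustrative, and the keypair generated at the end is a throwaway stand-in for the registry's key, used only to exercise the functions.

```javascript
const crypto = require("node:crypto");

// The 12-byte Ed25519 SPKI DER header listed above.
const ED25519_SPKI_HEADER = Buffer.from("302a300506032b6570032100", "hex");

// Wraps a raw 32-byte public key (transported as base64) into a Node KeyObject.
function publicKeyFromRawBase64(rawBase64) {
  const raw = Buffer.from(rawBase64, "base64");
  if (raw.length !== 32) {
    throw new RangeError("expected a 32-byte Ed25519 public key");
  }
  return crypto.createPublicKey({
    key: Buffer.concat([ED25519_SPKI_HEADER, raw]),
    format: "der",
    type: "spki",
  });
}

// Verifies a base64 signature over the UTF-8 bytes of `message`.
function verifyEd25519(message, signatureBase64, rawPublicKeyBase64) {
  return crypto.verify(
    null,
    Buffer.from(message, "utf8"),
    publicKeyFromRawBase64(rawPublicKeyBase64),
    Buffer.from(signatureBase64, "base64")
  );
}

// Demo only: a throwaway keypair, with the raw key taken as the bytes after the header.
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
const rawPublicKey = publicKey
  .export({ format: "der", type: "spki" })
  .subarray(ED25519_SPKI_HEADER.length)
  .toString("base64");
const demoMessage = "akaeon-registry:optout:v1|demo-id|example-publisher.com|no-training";
const demoSignature = crypto
  .sign(null, Buffer.from(demoMessage, "utf8"), privateKey)
  .toString("base64");
```

Calling verifyEd25519(demoMessage, demoSignature, rawPublicKey) exercises the same steps a third party would run against a real record's message, signature, and public key.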

5.3 Canonical message format

For each signed record, the registry signs a canonical message: a deterministic pipe-delimited UTF-8 string the verifier can reconstruct from the record's stored fields without trusting the registry's timestamps.

For opt-out records:

akaeon-registry:optout:v1|<submission_id>|<domain>|<policy>

For a future extension where the registry attests to its own observation of a verified DNS challenge, an alternate format is:

akaeon-registry:dns-verify:v1|<submission_id>|<domain>|<dns_challenge_record_sha256>

The format is intentionally timestampless. The signature attests to what the registry said, not when it said it — the timestamp is delegated to the Arweave block in which the batch root is anchored, which is the trust root for time. This is the same property as Stelais's existing canonical message format stelais:proof:v1|<userId>|<fileHash>.

The prefix is the brand-coupling. It's a required argument; the core package has no default. Stelais's prefix and the registry's prefix cannot collide because they're literally different strings, and the prefix is part of the signed bytes.
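A small sketch of reconstructing the opt-out message from a record's stored fields. The function name is hypothetical, and rejecting fields that contain the "|" delimiter is an assumption added here to keep the message unambiguous; the text does not state how the registry handles that case.

```javascript
const OPTOUT_PREFIX = "akaeon-registry:optout:v1";

// Rebuilds the signed message from a record's stored fields.
// Timestamp fields on the record are deliberately ignored.
function optOutCanonicalMessage(record) {
  const fields = [record.submission_id, record.domain, record.policy];
  for (const field of fields) {
    // ASSUMPTION: empty values and embedded delimiters are rejected.
    if (typeof field !== "string" || field.length === 0 || field.includes("|")) {
      throw new TypeError("field must be a non-empty string without '|'");
    }
  }
  return [OPTOUT_PREFIX, ...fields].join("|");
}
```

Two records that differ only in submitted_at or dns_verified_at produce the same message, which is the timestampless property in practice: the signature cannot be used to argue about time.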

5.4 Canonical record schema

An opt-out record's canonical form is a deterministic JSON document:

{
  "version": 1,
  "type": "domain_optout",
  "submission_id": "01J9XW...",
  "domain": "example-publisher.com",
  "policy": "no-training",
  "scope": "domain",
  "effective_from": "2026-05-11T00:00:00Z",
  "submitted_at": "2026-05-11T14:23:00Z",
  "dns_verified_at": "2026-05-11T14:31:00Z",
  "dns_challenge_record_sha256": "<hex>",
  "publisher_account_id": "01J9XW...",
  "app": "akaeon-registry",
  "network": "arweave"
}

The field set is deliberately minimal. Optional fields are included only when present; the canonical-serialization rule is:

UTF-8 encoding.
Sort keys lexicographically.
No insignificant whitespace (JSON.stringify with no space argument).
Reject any field not in the schema (no extension fields in v1).

The leaf hash that goes into the Merkle tree is computed with SHA-256 over the bytes of this canonical document, using the leaf construction defined in §5.5.
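The serialization rule can be sketched as follows. The sketch assumes v1 records are flat (every value is a string or number, as in the example above), and it does not enforce which fields are mandatory, since the text does not say which ones are optional. The function name canonicalize is illustrative.

```javascript
// The v1 field set, taken from the example record above.
const V1_FIELDS = new Set([
  "version", "type", "submission_id", "domain", "policy", "scope",
  "effective_from", "submitted_at", "dns_verified_at",
  "dns_challenge_record_sha256", "publisher_account_id", "app", "network",
]);

// Returns the canonical UTF-8 bytes of a flat v1 record.
function canonicalize(record) {
  const keys = Object.keys(record).sort(); // lexicographic for ASCII keys
  for (const key of keys) {
    if (!V1_FIELDS.has(key)) {
      throw new Error(`field not in v1 schema: ${key}`);
    }
  }
  // Rebuild the object in sorted key order; JSON.stringify preserves that
  // insertion order and emits no whitespace when no space argument is given.
  const sorted = {};
  for (const key of keys) sorted[key] = record[key];
  return Buffer.from(JSON.stringify(sorted), "utf8");
}
```

The output does not depend on the order in which the caller set the fields, so two independent implementations that follow the same four rules should arrive at byte-identical documents.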

5.5 Merkle tree construction

The registry batches opt-out leaves into a Merkle tree once per batch. The construction follows the structure of RFC 6962 (Certificate Transparency) to inherit its well-studied second-preimage resistance:

Leaf node hash: SHA-256(0x00 || canonical_record_bytes).
Internal node hash: SHA-256(0x01 || left_child || right_child).
Odd-count handling: when a level has an odd number of nodes, the last node is promoted unchanged to the next level (CT's approach), not duplicated.
Empty tree: not permitted; a batch with zero leaves is not anchored.
Inclusion proof: the standard sibling-hash path from leaf to root. The verifier requires three inputs alongside the leaf hash and claimed root: the leaf index (the leaf's 0-based position in the tree), the tree size (total leaf count of the batch the leaf belongs to), and the sibling-hash array (lowest-level first). The tree size is load-bearing: RFC 6962's odd-count promotion means the "is the current node on the right edge of its level" decision depends on tree size, not just on the index's low bit. Naive Bitcoin-style verifiers that omit tree size produce wrong intermediate hashes for any path that passes through a promoted node, and so fail to reconstruct the root for non-power-of-2 batches. The Lab Integration Runbook publishes the verifier algorithm for RFC 6962 §2.1.1 (Merkle Audit Paths) that labs use.

The 0x00 / 0x01 domain-separation prefixes prevent the second-preimage attack where a leaf could be reinterpreted as an internal node. This is the same rationale the Certificate Transparency log uses; reference verifiers exist in every major language.
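The construction above can be sketched in standard-library code. The verifier follows the inclusion-proof check written out in RFC 9162 §2.1.3.2, which uses the same tree shape as RFC 6962; the algorithm published in the Lab Integration Runbook may differ in detail. All function names are illustrative, and inclusionProof exists here only to exercise the verifier.

```javascript
const { createHash } = require("node:crypto");

const sha256 = (...parts) => createHash("sha256").update(Buffer.concat(parts)).digest();
const leafHash = (recordBytes) => sha256(Buffer.from([0x00]), recordBytes);
const nodeHash = (left, right) => sha256(Buffer.from([0x01]), left, right);

// Builds the next level up, promoting an unpaired last node unchanged.
function nextLevel(nodes) {
  const out = [];
  for (let i = 0; i < nodes.length; i += 2) {
    out.push(i + 1 < nodes.length ? nodeHash(nodes[i], nodes[i + 1]) : nodes[i]);
  }
  return out;
}

function merkleRoot(leafHashes) {
  if (leafHashes.length === 0) throw new RangeError("empty batches are not anchored");
  let level = leafHashes;
  while (level.length > 1) level = nextLevel(level);
  return level[0];
}

// Sibling path for one leaf, lowest level first; promoted levels add no entry.
function inclusionProof(leafHashes, index) {
  const path = [];
  let level = leafHashes;
  let i = index;
  while (level.length > 1) {
    const sibling = i ^ 1;
    if (sibling < level.length) path.push(level[sibling]);
    level = nextLevel(level);
    i >>= 1;
  }
  return path;
}

// Recomputes the root from a leaf hash, its index, the tree size, and the path.
function verifyInclusion(leaf, leafIndex, treeSize, path, root) {
  if (!Number.isInteger(leafIndex) || !Number.isInteger(treeSize)) return false;
  if (leafIndex < 0 || leafIndex >= treeSize || treeSize > 0x7fffffff) return false;
  let fn = leafIndex;
  let sn = treeSize - 1;
  let r = leaf;
  for (const p of path) {
    if (sn === 0) return false; // path is longer than the tree allows
    if ((fn & 1) === 1 || fn === sn) {
      r = nodeHash(p, r);
      // Skip levels where this node was promoted without a sibling.
      while ((fn & 1) === 0 && fn !== 0) { fn >>= 1; sn >>= 1; }
    } else {
      r = nodeHash(r, p);
    }
    fn >>= 1;
    sn >>= 1;
  }
  return sn === 0 && r.equals(root);
}
```

With seven leaves, the proof for index 6 has two entries rather than three because the last leaf is promoted at the lowest level. A verifier that ignores tree size would misplace the first sibling, which is the failure mode described above.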

5.6 Batch canonical payload

The on-chain payload for an opt-out batch is:

{
  "version": 1,
  "type": "optout_batch",
  "batch_id": "01J9YA...",
  "started_at": "2026-05-11T14:00:00Z",
  "closed_at": "2026-05-11T15:00:00Z",
  "merkle_root_sha256_hex": "f3a9...",
  "leaf_count": 2814,
  "tree_construction": "rfc6962-style",
  "registry_signature": {
    "canonical_message": "akaeon-registry:batch:v1|01J9YA...|f3a9...|2814",
    "signature": "<base64>",
    "public_key": "<base64-32-byte-raw>",
    "signature_scheme": "ed25519",
    "version": "v1"
  },
  "app": "akaeon-registry",
  "network": "arweave"
}

This is the only payload anchored on Arweave per batch. The leaves themselves are stored in the registry's database and served on-demand to verifiers who request an inclusion proof. (Section 10 discusses the alternative — anchoring the full list of leaf hashes — and its trade-offs.)

The batch payload is small, and one batch produces one Arweave transaction. At an hourly cadence with batches of up to ~10,000 leaves, the registry produces 24 transactions per day; the cadence can be adjusted according to demand.
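One consistency check a verifier might run on a fetched batch payload is to rebuild the batch message from the payload's own fields and compare it with the embedded canonical_message. The field order used below (prefix, batch_id, Merkle root, leaf count) is inferred from the example payload above and is not stated as a normative rule; the function name is hypothetical.

```javascript
// Rebuilds the batch message from payload fields and compares it to the
// message embedded under registry_signature.
function batchMessageMatches(payload) {
  const expected = [
    "akaeon-registry:batch:v1",
    payload.batch_id,
    payload.merkle_root_sha256_hex,
    String(payload.leaf_count),
  ].join("|");
  return payload.registry_signature.canonical_message === expected;
}
```

This check catches a payload whose signed message describes a different root or leaf count than the fields a verifier would otherwise read; the Ed25519 signature over canonical_message still has to be verified separately.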

5.7 Daily state commitment (sparse Merkle tree)

In addition to the per-batch trees, the registry anchors one state commitment per UTC day: a depth-256 sparse Merkle tree over the registry's entire per-domain opt-out state, keyed by SHA-256(domain), with SHA-256 throughout and the same 0x00/0x01 leaf/internal domain-separation prefixes as the batch trees. Each non-empty leaf commits to a canonical domain_state record listing the batch-tree leaf hashes of the domain's active anchored opt-outs — so a disclosed state chains down to the same Arweave transactions the per-record proofs use. The root is signed (akaeon-registry:state-root:v1|<epoch>|<root>|<domain_count>) and anchored as a type: "state_commitment" payload, chained day-to-day via prev_state_arweave_tx_id.

Against an anchored root the registry serves per-domain membership and non-membership proofs. The non-membership proof is the strong form of the negative attestation: it proves, against a root that was public before the question was asked, that a domain had no opt-out state as of the epoch. The full construction — empty-subtree constants, path convention, compressed proof wire format, verification algorithm, API surface, and the enumeration-resistance analysis — is specified in the state commitment addendum.

5.8 Evidence commitments and selective disclosure

The canonical record commits to verification evidence by hash rather than by content: dns_challenge_record_sha256 (§5.4) is the pattern. The registry generalizes it as policy: the evidence a verification produces — multi-resolver DNS transcripts, DNSSEC chain captures, resolver-disagreement records, verification metadata, content fingerprints — is retained in the registry's private evidence vault, never published, and committed to by SHA-256 inside the anchored canonical record. (Planned for canonical record schema v2: an optional evidence_bundle_sha256 field committing to the full evidence bundle for the verification; v1 records commit to the DNS challenge record only.)

Disclosure is selective: in a dispute, to a court, or to a credentialed lab under agreement, the registry produces the evidence bundle, and the recipient verifies it against the hash that was anchored — typically years — before the disclosure. Selective disclosure therefore carries the same trust properties as publication (the evidence provably predates the dispute and provably hasn't been altered) without the evidence ever becoming public. The registry's audit-defensibility story does not depend on publishing its enrichment data, only its commitments.
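
On the recipient's side, checking a disclosed bundle against its anchored commitment reduces to one hash comparison. This is a minimal sketch; it assumes the bundle is delivered as the exact byte string that was hashed, and the function name is illustrative.

```python
import hashlib
import hmac


def evidence_matches_commitment(bundle_bytes: bytes, anchored_sha256_hex: str) -> bool:
    """Compare a disclosed evidence bundle with the hash in the anchored record."""
    digest = hashlib.sha256(bundle_bytes).hexdigest()
    # Constant-time comparison; hex case is normalized before comparing.
    return hmac.compare_digest(digest.encode(), anchored_sha256_hex.strip().lower().encode())
```

The recipient reads `anchored_sha256_hex` from the canonical record it has already verified against Arweave, not from the disclosure itself.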

6. The registry's API surface

The registry exposes three public surfaces: a publisher API (write side), a lab API (read side), and a public verification surface (anyone side). All three are HTTPS; all three return JSON; all three use bearer tokens or HMAC signatures for authentication where authentication is required.

The Lab Integration Runbook contains the full request/response shapes. This section gives the high-level shape; the runbook is the implementation reference.

6.1 Publisher API (write side)

POST /v1/optouts — submit a new opt-out for a domain or subdomain. Returns a DNS challenge the publisher must publish before verification proceeds. Response is 202 Accepted with a submission_id and the challenge record details.
GET /v1/optouts/:submission_id — poll the status of a submission. States: pending_dns_verification, dns_verified, pending_anchor, anchored, failed, expired.
DELETE /v1/optouts/:submission_id — withdraw an opt-out. The withdrawal itself is anchored as a new record (type: "domain_optout_withdrawal") rather than erasing the original. The on-chain record cannot be erased; this preserves the audit trail.

Authentication: publisher account credentials issued at publisher registration time. Rate-limited per account.
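
A publisher client typically polls GET /v1/optouts/:submission_id until the submission settles. The sketch below keeps the HTTP call out of scope by taking a caller-supplied `fetch_state` function; treating anchored, failed, and expired as the terminal states is an assumption, as are the polling interval and the poll limit.

```python
import time
from typing import Callable

KNOWN_STATES = {
    "pending_dns_verification", "dns_verified", "pending_anchor",
    "anchored", "failed", "expired",
}
TERMINAL_STATES = {"anchored", "failed", "expired"}  # assumption, not from the spec


def wait_for_terminal_state(fetch_state: Callable[[], str],
                            interval_s: float = 30.0,
                            max_polls: int = 120,
                            sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the submission reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        state = fetch_state()
        if state not in KNOWN_STATES:
            raise ValueError(f"unexpected submission state: {state!r}")
        if state in TERMINAL_STATES:
            return state
        sleep(interval_s)
    raise TimeoutError("submission did not reach a terminal state")
```

Rejecting unknown states rather than ignoring them keeps the client fail-closed if the state list changes.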

6.2 Lab API (read side)

GET /v1/lookup?domain=<domain> — return all currently-effective opt-outs for the given domain, with full Merkle inclusion proofs and Arweave transaction references. This is the only endpoint the lab needs to call at training-data ingestion time. Response includes everything the lab needs for an independently-verifiable audit log entry.
POST /v1/lookup/bulk — POST a list of domains, return the bundle for each. For ingestion pipelines processing millions of URLs, this avoids one-domain-at-a-time round trips.
GET /v1/state/proof?domain=<domain>&epoch=<epoch> — return a membership or non-membership proof for the domain against the daily anchored state root (§5.7). Paid-tier entitlement; see the state commitment addendum §6.

Authentication: lab account bearer token. The token does not affect the returned facts — anchored records and proofs are the same for any authenticated lab, and every bundle is verifiable without trust in the registry's authentication. What is per-lab is the attestation envelope: negative attestations, bulk attestations, and served state proofs embed the credential's public lab_key_id inside the signed canonical message (runbook §2.1), making any redistributed response corpus attributable to the credential that obtained it.
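
For ingestion pipelines, the bulk endpoint is usually fed from a deduplicated domain list split into bounded requests. The sketch below is illustrative only: the `domains` body field, the per-request limit, and the normalization rule (trim, drop a trailing dot, lowercase) are assumptions, and the runbook's request shapes take precedence.

```python
from typing import Dict, Iterable, List


def normalize_domain(domain: str) -> str:
    # Assumed normalization: trim whitespace, drop a trailing root dot, lowercase.
    return domain.strip().rstrip(".").lower()


def bulk_request_bodies(domains: Iterable[str],
                        max_per_request: int = 1000) -> List[Dict[str, List[str]]]:
    """Deduplicate domains (order preserved) and split them into request bodies."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    normalized = (normalize_domain(d) for d in domains)
    unique = list(dict.fromkeys(n for n in normalized if n))
    return [
        {"domains": unique[i:i + max_per_request]}  # hypothetical body shape
        for i in range(0, len(unique), max_per_request)
    ]
```

Deduplicating at the domain level matters because millions of URLs commonly collapse to a far smaller set of domains.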

6.3 Public verification surface

These endpoints require no authentication. They exist so a court, auditor, or competing service can verify any registry record without a relationship with the registry:

GET /v1/public/optouts/:submission_id/verify — return the canonical record, the registry's signature, the Merkle inclusion proof, and the Arweave transaction id for a single opt-out.
GET /v1/public/batches/:batch_id — return the batch metadata, the Merkle root, the leaf count, and the Arweave transaction id. Used by verifiers to confirm a claimed root against the registry's view of the batch.
GET /v1/public/registry-key — return the registry's current Ed25519 signing public key, and the historical keys with their validity windows. Verifiers use this to validate signatures issued at any time in the registry's lifetime.
GET /v1/public/state-roots/latest and GET /v1/public/state-roots/:epoch — return the signed daily state-commitment payload (§5.7) with its Arweave transaction reference. Roots are public so that any held proof verifies forever without credentials; obtaining a proof is the credentialed operation (§6.2).

The public verification surface is the load-bearing trust property. As long as any one of these endpoints is reachable somewhere — at the registry's domain, at a mirror, in a court's exhibit folder — every opt-out the registry ever issued can be independently verified against the public Arweave network.

6.4 Publication policy — commitments public, contents private

The API surface encodes a deliberate boundary, specified in full in the state commitment addendum §2:

Published, permanently: every cryptographic commitment — batch roots, daily state roots, the key catalog — and any individual record a holder chooses to disclose. Verification never requires the registry's cooperation or credentials.
Not published, at any tier: the registry's contents in enumerable form. There is no bulk export or snapshot manifest. Per-domain facts are served through credentialed, rate-limited, watermarked endpoints; verification evidence is held privately and disclosed selectively against its anchored hash (§5.8).

This boundary is why the registry can be maximally verifiable without being copyable. temporally, by the registry's own thesis, nobody can anchor in the past — so even a party holding the complete domain list tomorrow cannot recreate the anchored attestation history, the daily root chain, or the accumulated verification timestamps. The commitments are public because publishing them costs the registry nothing that time hasn't already made irreproducible.

7. Audit-defensibility properties

This section maps the three properties stated in §3 onto specific mechanisms.

7.1 Independent timestamping — Arweave block inclusion

The registry's claim about when an opt-out became effective rests on the Arweave transaction's block timestamp, not on any field in the registry's database:

The opt-out's effective_from field is a publisher-declared intent. It carries no cryptographic weight on its own.
The opt-out's submitted_at and dns_verified_at fields are registry-declared timestamps. They carry weight only insofar as they're signed by the registry and included in the batch canonical record.
The opt-out's effective timestamp for audit purposes is the Arweave block time of the transaction that anchored the batch containing the opt-out. This is the only timestamp the registry cannot manipulate.

The lab's audit log records the Arweave transaction id. A challenger arguing the opt-out was backdated must argue against Arweave's block timestamp — which requires either compromising the Arweave network or producing a forged transaction id that resolves to the same canonical record, both of which the network's design makes infeasible.
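
The compliance question then becomes a comparison between the anchoring block's time and the lab's training cutoff. A minimal sketch, assuming the block timestamp is obtained from an Arweave gateway as Unix seconds and that the cutoff is a timezone-aware datetime; the function name is illustrative.

```python
from datetime import datetime, timezone


def opted_out_before_cutoff(block_timestamp_unix: int, training_cutoff: datetime) -> bool:
    """True when the anchoring block predates the training cutoff."""
    if training_cutoff.tzinfo is None:
        raise ValueError("training_cutoff must be timezone-aware")
    block_time = datetime.fromtimestamp(block_timestamp_unix, tz=timezone.utc)
    return block_time < training_cutoff
```

Only the block time enters the comparison; effective_from, submitted_at, and dns_verified_at are deliberately not consulted.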

7.2 Independent verifiability — no registry code in the verify path

Every cryptographic claim the registry makes is verifiable using:

Any RFC 8032 Ed25519 implementation (Node crypto, Python cryptography, Go crypto/ed25519, Rust ed25519-dalek, Web Crypto API, OpenSSL).
Any SHA-256 implementation.
Any Arweave gateway (arweave.net, g8way.io, any node operator's endpoint, or a self-run Arweave node).

The lab's verifier code is ~30 lines of Node.js using only the standard library. It runs in the lab's environment, not the registry's. It does not depend on the registry being online, honest, or extant at the moment of verification.

A lab can, and per defensible practice should, cache the registry's signing public key locally at training time, so that even if the registry is later compromised and the verify endpoint serves a substituted key, the lab can verify against the key as it was at training time. Section 9.2 discusses key rotation policy.

7.3 Tamper-evident chain — cryptographic linkage at every step

Each link in the chain from publisher intent to lab acknowledgment is cryptographically committed:

Publisher intent → DNS challenge: the DNS challenge nonce is committed in the registry's database; the publisher's TXT record value is hashed (SHA-256) and the hash goes into the canonical opt-out record. A TXT record value altered after the fact no longer matches the hash committed in the canonical record.
DNS verification → canonical record: the canonical record is a deterministic UTF-8 JSON encoding (§5.4). Any byte change anywhere in the record produces a different SHA-256, which would not match the leaf hash in the Merkle tree.
Canonical record → Merkle leaf: the leaf hash is SHA-256(0x00 || canonical_record_bytes). The 0x00 prefix prevents reinterpretation as an internal node.
Merkle leaf → Merkle root: the inclusion proof is a sequence of sibling hashes. Recomputing the root from the leaf and the proof is deterministic; any tampering with the leaf or any sibling invalidates the reconstruction.
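Merkle root → Arweave transaction: the root is embedded in the batch canonical payload, which is the body of the Arweave transaction. The transaction id is the SHA-256 of the transaction's signature, and the signature covers the transaction's fields, including the data root of the body; tampering with the body invalidates the signature, and any re-signed replacement has a different transaction id, which won't match the id the lab recorded at training time.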
Arweave transaction → block inclusion: Arweave's consensus includes the transaction in a block with a public timestamp. Block reorgs are bounded by the network's design; the registry recommends labs treat transactions with at least N block confirmations as final, where N is configurable (default: 10).

Each link is independently checkable. A challenger has to break every link to produce a plausible alternative record — not just one.
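
The per-link checks can be combined into a single routine that reports the first link that fails. This is a sketch under stated assumptions: the caller has already recomputed the root from the inclusion proof, fetched the transaction from a gateway, and counted confirmations; the parameter names and link labels are illustrative.

```python
import hashlib
from typing import Optional


def first_broken_link(record_bytes: bytes, leaf_hash_hex: str, recomputed_root_hex: str,
                      batch_payload: dict, recorded_tx_id: str, fetched_tx_id: str,
                      confirmations: int, min_confirmations: int = 10) -> Optional[str]:
    """Return the name of the first failing link, or None when every link holds."""
    checks = [
        # Leaf hash is SHA-256(0x00 || canonical_record_bytes).
        ("canonical record -> leaf",
         hashlib.sha256(b"\x00" + record_bytes).hexdigest() == leaf_hash_hex.lower()),
        # Root rebuilt from the proof must equal the root in the anchored payload.
        ("leaf -> root",
         recomputed_root_hex.lower() == batch_payload["merkle_root_sha256_hex"].lower()),
        # Transaction served now must be the one recorded at training time.
        ("root -> transaction", recorded_tx_id == fetched_tx_id),
        # Treat the anchor as final only after enough confirmations.
        ("transaction -> block", confirmations >= min_confirmations),
    ]
    for name, ok in checks:
        if not ok:
            return name
    return None
```

Logging the returned link name alongside the transaction id gives an auditor the exact point of failure rather than a bare pass/fail.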

7.4 The key-rotation failure mode and how the registry handles it

A subtle but real audit-defensibility concern: what if the registry's signing key is rotated, and an old record is later verified against the new public key? The signature won't match, and a careless verifier might conclude the record is invalid when in fact it was correctly signed by an earlier key.

The registry handles this by:

Versioned public keys. Every record's registry_signature block includes a version field. The version maps to a specific public key with a documented validity window, returned by the GET /v1/public/registry-key endpoint.
Snapshot-at-signing. The signing public key is embedded directly in the registry signature block on every record. A verifier never has to ask the registry "what was the right key for this record"; the answer is in the record itself.
Mirroring discipline. The validity windows and historical public keys are published in three places — the registry's own endpoint, a read-only mirror at a separate domain, and an Arweave-anchored declaration. The third is the most important: even if the registry and its mirror both vanish, the historical key catalog survives on the public chain.
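
A careful verifier can combine these three mechanisms: look up the record's key version in a locally held copy of the catalog, confirm the embedded key is the catalogued one, and confirm the record was anchored inside that key's validity window. The catalog entry fields below (`version`, `public_key`, `valid_from`, `valid_until`) are hypothetical, as is the use of ISO 8601 timestamps; the actual response shape of GET /v1/public/registry-key is defined in the runbook.

```python
from datetime import datetime
from typing import List, Optional


def _parse(ts: str) -> datetime:
    # Accept a trailing "Z" as UTC.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def key_valid_for_record(signature_block: dict, catalog: List[dict], anchored_at: str) -> bool:
    """Check the embedded key against a trusted catalog copy and its validity window."""
    entry: Optional[dict] = next(
        (e for e in catalog if e["version"] == signature_block["version"]), None)
    if entry is None:
        return False  # unknown key version
    if entry["public_key"] != signature_block["public_key"]:
        return False  # embedded key differs from the catalogued key
    when = _parse(anchored_at)
    starts = _parse(entry["valid_from"])
    ends = entry.get("valid_until")  # None is assumed to mean "still current"
    return starts <= when and (ends is None or when < _parse(ends))
```

Passing the Arweave block time as `anchored_at` keeps the window check tied to the one timestamp the registry cannot manipulate.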

8. Adversarial considerations

A technical evaluator at a lab will reasonably ask "what attacks does this not defend against?" This section is the honest answer.

8.1 Attacks the registry defends against

Attack	Defense
Registry backdating an opt-out	Arweave block timestamp is the trusted clock; registry can't move blocks
Registry forward-dating to manufacture liability	Same — block timestamp is fixed at inclusion
Tampering with a record after it's anchored	Canonical hash mismatch detectable by Merkle proof recomputation
Substituting a fake record into a real Merkle batch	Leaf isn't in the published leaves; inclusion proof would fail
Registry's signing key compromise after-the-fact	Historical key catalog is mirrored and Arweave-anchored; old records verify against the key that was current at signing time
Registry going offline	Records remain verifiable against Arweave; public verification endpoint can be mirrored or run by a successor entity
Registry falsely attesting "no opt-out existed" at lookup time	Daily anchored state commitment (§5.7): non-inclusion proofs verify against a root anchored before the lookup; a contradiction between an anchored opt-out and a signed non-inclusion proof is publicly provable equivocation
Bulk scraping / resale of registry responses	Per-lab `lab_key_id` embedded in every signed attestation and proof envelope (unstrippable without destroying the signature), plus rate limits and the acceptable-use contract
Network-level DNS poisoning during publisher verification	Multi-resolver DNS verification (Cloudflare, Google, Quad9, etc.) and DNSSEC chain recording where available
Replay attack on the signature	Canonical message includes `submission_id` which is unique per submission; replaying it produces a duplicate the registry rejects

8.2 Attacks the registry does not defend against

Publisher domain compromise. If the publisher loses control of their domain (e.g. domain hijack, expired registration repurchased), an attacker can pass a DNS challenge and submit opt-outs the original publisher didn't intend. Mitigation: opt-outs have an effective_from field, and the registry exposes a withdrawal flow that creates a new anchored record; the lab's audit log will reflect both. The registry does not detect that a domain has changed ownership.

Arweave network failure. If Arweave is partitioned, halted, or reorgs deeply, recently-anchored transactions could be lost. Mitigations: the registry's pre-anchor record state is retained in its database, so on Arweave recovery batches can be re-anchored (with a new, later timestamp); and the registry can dual-anchor to a second substrate (e.g. Irys or another L1, see §10.2) for cross-substrate redundancy.

Sybil submission flooding. A bad actor could attempt to register thousands of bogus opt-outs to inflate batch sizes or degrade service. Mitigation: per-account rate limits, DNS challenge requirement (each domain requires control proof), per-account cost-based throttling, and the substrate's existing daily/monthly anchoring budget caps. The DNS challenge is the strongest mitigation — it can't be passed without control of an actual domain.

Lab fabricating a verification record after the fact. A non-compliant lab could claim "we checked the registry at time T and saw no opt-out," when in fact they didn't check. The registry cannot prevent this — it controls its own records, not the lab's audit logs. The registry's job is to provide records that would defend the lab's good-faith audit log, not to police the lab's logging discipline. (A future extension: lab-side attestations signed by a publishable lab key, with the registry recording the attestation — out of scope for v1.)

Court refusing to credit Arweave timestamps. In a jurisdiction where public blockchain timestamps haven't been precedent-tested, a court may require additional evidence. The registry's design provides the cryptographic primitives; the registry cannot guarantee a specific court's admission ruling. Mitigation: dual-anchor to an additional substrate whose timestamps are precedent-credited in the relevant jurisdiction (e.g. a notarized hash, an IPFS+Filecoin commitment, a signature from a public timestamping authority).

Compelled disclosure of the registry's signing key. In a jurisdiction that compels production of the signing private key, a party holding the key could sign records that purport to be historical. Mitigation: hardware security module storage for the key (HSM-backed; key never extractable as bytes), plus the structural property that every record ever signed is already anchored on Arweave. A compelled key disclosure would let an attacker forge a future record; it could not retroactively change records already anchored, and a forged "historical" record would still carry a later Arweave block timestamp. The mirroring and Arweave-anchoring of the historical key catalog protects against post-compulsion narrative reconstruction.

8.3 Security boundaries of the substrate

All private signing keys at rest are encrypted with AES-256-GCM. The master key is environment-variable-only, never written to the database, rotated on a documented cadence.
TLS termination at the edge, with HSTS + certificate transparency monitoring.
HTTP API uses bearer tokens (JWT-signed) for authenticated endpoints. Webhook callbacks to publisher endpoints are HMAC-signed; the publisher verifies the HMAC before treating the callback as authoritative.
Anchoring spend log (anchoring_spend_log) is append-only; any anomalous spending triggers alerts.
Fail-closed semantics throughout: when in doubt, the system refuses to anchor rather than anchor incorrectly.
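
On the publisher side, the webhook check mentioned above can be sketched as follows. The spec does not state the HMAC hash function, the signature encoding, or how the signature is delivered; HMAC-SHA-256 over the raw request body with a hex-encoded signature is assumed here, and the function name is illustrative.

```python
import hashlib
import hmac


def webhook_signature_valid(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Verify a callback's HMAC before treating it as authoritative."""
    # Assumption: HMAC-SHA-256 over the exact bytes received, hex-encoded.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.strip().lower().encode())
```

The check must run over the raw bytes as received; re-serializing parsed JSON before hashing can change the bytes and cause valid callbacks to fail.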

9. Governance, jurisdiction, and operational integrity

9.1 Who operates the registry, and what happens if they stop?

The registry is operated by Akaeon. The operational responsibility includes:

Maintaining the publisher and lab API surfaces.
Maintaining the Arweave-anchored Merkle batching cadence.
Maintaining the public verification endpoints.
Custody of the registry's signing private key.

If Akaeon ceases operations, every record the registry ever produced remains independently verifiable against Arweave. The records do not depend on Akaeon's continued operation. A successor entity can take over the publisher and lab API surfaces; the cryptographic chain remains intact.

The registry commits to:

Source availability for the verifier reference implementations under a permissive open-source license (Apache 2.0 or MIT). A successor can re-host verification without negotiating a license.
Wind-down protocol: if Akaeon enters wind-down, a 90-day notice window precedes shutdown, during which the registry continues anchoring and the public verification endpoints continue operating. The signing key is published at the end of the window so no further valid records can be produced under the existing key (preventing post-wind-down forgery), and the historical-records archive is mirrored to a long-term storage commitment (the registry's preference: an Arweave-anchored declaration listing every batch ever produced and its transaction id).

9.2 Key rotation policy

The registry's signing key is rotated:

Routinely at a documented cadence (initial proposal: every 12 months).
On compromise immediately, with a public disclosure of the compromise and a re-issuance flow for records signed by the compromised key (the records themselves don't change; an additional attestation under the new key is anchored to confirm the records' continued validity).
On algorithm change if RFC 8032 Ed25519 is deprecated by NIST or cryptographic consensus shifts to a successor scheme.

Key rotation does not invalidate prior records (§7.4). Each record carries the public key snapshot it was signed with; verifiers verify against the embedded snapshot, not against the current key.

The historical key catalog is published on Arweave on each rotation. The catalog is itself a canonical record (type: "registry_key_catalog") signed by the new current key, with chained references to all prior keys' Arweave records. This produces a tamper-evident key history that does not depend on the registry's continued existence.

9.3 Jurisdiction and data residency

The registry operates from a single jurisdiction; publisher account data and the registry's database are stored in a single region. Anchored data is on the public Arweave network, which is globally distributed and does not have a single jurisdiction.

The registry's database stores:

Publisher account credentials (email, API key hashes, organization name).
Submission state (pending DNS verification, anchored batch references).
Merkle leaf hashes and their batch membership.
Spend log entries.

The database does not store: publisher private keys, lab API keys' plaintext (only Argon2id hashes), Arweave private keys.

GDPR considerations: opt-out records contain a domain and a policy. A publisher's publisher_account_id is a UUID, not a person's identifier; the account-to-person linkage is in the registry's database under the publisher's voluntary registration. A right-to-erasure request can erase the account-to-person linkage from the database; it cannot erase Arweave records (which is a property of the substrate; this is disclosed at publisher registration). The opt-out records themselves contain no personal data beyond what the publisher voluntarily anchors.

9.4 Audit and inspection rights

The registry's operational integrity is verifiable by external parties:

Read-only mirror. A second domain (mirror.akaeon-registry.com, initial proposal) serves the public verification surface in read-only mode, with a different operational stack and ideally a different hosting provider. Discrepancy between the primary and mirror signals compromise.
Reproducibility audits. Quarterly, the registry publishes a reproducibility report: take the Merkle root of every batch in the quarter, recompute it from the leaf records the registry serves, and publish the result. A third party can spot-check by recomputing themselves; the report is itself anchored as a type: "audit_report" canonical record.
Open verifier reference implementation. The reference verifier code is published under an open license. Any party can run it independently; cross-implementation in another language (Python, Go, Rust) is a welcomed contribution and a real strengthening of the trust story.
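
A third party's spot-check of a reproducibility report amounts to rebuilding a batch root from the served leaf records. The sketch below assumes the batch tree is exactly the RFC 6962 Merkle tree hash (split at the largest power of two below the leaf count, 0x00/0x01 prefixes) and that records are supplied in leaf order; the function names are illustrative.

```python
import hashlib
from typing import Sequence


def merkle_root(records: Sequence[bytes]) -> bytes:
    """RFC 6962 Merkle tree hash over canonical record bytes, in leaf order."""
    n = len(records)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return hashlib.sha256(b"\x00" + records[0]).digest()
    k = 1
    while k * 2 < n:  # largest power of two strictly below n
        k *= 2
    left, right = merkle_root(records[:k]), merkle_root(records[k:])
    return hashlib.sha256(b"\x01" + left + right).digest()


def batch_reproduces(records: Sequence[bytes], batch_payload: dict) -> bool:
    """Compare a recomputed root and leaf count with the anchored batch payload."""
    return (len(records) == batch_payload["leaf_count"]
            and merkle_root(records).hex() == batch_payload["merkle_root_sha256_hex"].lower())
```

The `batch_payload` argument should be the payload fetched from Arweave, so the comparison does not rely on the registry's own view of the batch.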

10. Open questions and known limitations

The following are unresolved as of this draft.

10.1 Leaf publication policy. The registry currently anchors only the Merkle root, not the full leaf list. This is cost-efficient but creates a dependency on the registry to produce inclusion proofs. An alternative, anchoring the full leaf-hash list in the batch canonical payload, costs more but removes the registry as a necessary mediator. Current status: the registry anchors roots only in v1; full-leaf-list anchoring is a candidate optional batch mode for v2, driven by labs' assurance requirements. Update (2026-07-10): the daily state commitment (§5.7) now covers most of what motivated full-leaf-list anchoring: a lab can obtain anchored-grade proofs of a domain's state or absence without the registry publishing its leaf population, and the publication-policy analysis in the addendum argues an enumerable leaf list should not be anchored at all. Full-leaf-list mode remains open only for labs whose requirements demand mediator-free positive proofs after registry wind-down; see addendum §9.

10.2 Cross-substrate dual-anchoring. Section 8.2 mentions dual-anchoring to a second substrate (Irys, an L1, a notarized CA timestamp). v1 anchors only to Arweave. Adding a second substrate is straightforward at the @akaeon/core-arweave network parameter level but requires operational infrastructure (a second budget envelope, a second adapter, mirrored cost controls). Initial proposal: implement dual-anchoring as a v1.1 feature, once labs articulate which second substrate has the most precedent value in their jurisdiction.

10.3 Publisher-side signature. §5.3 mentions an extension where publishers sign their own submissions in addition to the registry signing its observation. Publisher signatures require the publisher to manage a keypair, which the DNS-based flow deliberately avoids. Initial proposal: defer to v2; collect signal from publishers on whether they want this property, and from labs on whether it would change their compliance posture.

10.4 Withdrawal semantics. A withdrawal is anchored as a new record (§6.1), not an erasure. This is correct for the audit trail (a lab that trained at time T1 and saw the opt-out should be able to verify it later, even if the publisher withdrew at T2 > T1). But it raises a UX question: does the lab's GET /v1/lookup return active opt-outs only, or all opt-outs with their effective windows? Initial proposal: active only by default, with a ?include_withdrawn=true flag for compliance teams that want the full history.
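
The proposed default can be sketched as a filter over a domain's records. The record fields `anchored_at` and `withdrawn_at` are hypothetical names used only for illustration, with ISO 8601 UTC timestamps assumed; the `include_withdrawn` argument mirrors the proposed query flag.

```python
from datetime import datetime
from typing import List


def _parse(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def lookup_view(records: List[dict], as_of: str, include_withdrawn: bool = False) -> List[dict]:
    """Return opt-outs anchored by `as_of`, hiding withdrawn ones unless requested."""
    moment = _parse(as_of)
    visible = []
    for record in records:
        if _parse(record["anchored_at"]) > moment:
            continue  # not yet anchored at the time in question
        withdrawn_at = record.get("withdrawn_at")
        withdrawn = withdrawn_at is not None and _parse(withdrawn_at) <= moment
        if withdrawn and not include_withdrawn:
            continue
        visible.append(record)
    return visible
```

Evaluating the filter at T1 rather than at query time is what lets a lab that trained at T1 still see an opt-out that was withdrawn at T2 > T1.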

10.5 Granularity of opt-out. v1 supports scope: "domain" and scope: "subdomain". v2 candidates: path, url_pattern, content_type, content_hash (per-asset opt-out). Each adds complexity to the lab's lookup flow. Initial proposal: ship v1 with domain/subdomain only; collect publisher feedback before expanding.

10.6 The Merkle batching primitive's home. Currently designed to live in the registry workspace (not in @akaeon/core-arweave). If a second batching consumer emerges, the primitive can be hoisted to core.

10.7 Rate limits and pricing. Per-publisher submission rate, per-lab lookup rate, and pricing tiers are operational policy decisions not in scope for this technical specification. The runbook will document the rate limits when they're set.

Appendix A — Glossary

Term	Definition
Anchoring	Writing a record's canonical hash and metadata to a public, permanent, third-party-operated network (Arweave) such that the record's existence at a specific time becomes publicly verifiable.
Canonical record	A deterministic JSON serialization of an opt-out's stated facts (domain, policy, scope, timestamps). The domain-separated SHA-256 of this serialization (the leaf hash) is what gets committed to the Merkle tree.
Canonical message	The pipe-delimited UTF-8 string the registry signs with Ed25519. Format: `<prefix>|<components...>`. Timestampless by design.
Inclusion proof	The sequence of Merkle-tree sibling hashes a verifier uses to reconstruct the batch root from a single leaf.
Leaf hash	`SHA-256(0x00 || canonical_record_bytes)`. The bottom-of-tree value committed to the Merkle root.
Merkle root	The top-of-tree hash that summarizes every leaf in the batch. Anchored on Arweave.
Opt-out	A publisher-declared preference against ML training on content under a specified scope (domain, subdomain, etc.). The unit of record in the registry.
Substrate	The shared, brand-neutral primitives (`@akaeon/core-*` packages) that both Stelais and the registry consume.
TXT challenge	A DNS TXT record the publisher publishes at `_akaeon-registry-challenge.<domain>` to prove control of the domain.

This document is versioned. Changes appear in the changelog.