Technical specification

Cryptographic primitives, message schemas, audit-defensibility properties, and governance for the Akaeon Registry.

Audience: Technical evaluator at an AI lab — staff engineer or compliance architect reading to decide whether the Akaeon Registry is real infrastructure they can integrate against, or vaporware.

Scope: This document specifies the registry's architecture, integration model, audit-defensibility properties, and governance. It does not cover regulatory context — that section is left as a placeholder for current jurisdictional analysis.

Companion documents: Lab integration runbook · Concrete publisher opt-out example · Standalone verifier.


1. Executive summary

The Akaeon Registry is a domain-level content opt-out registry that produces cryptographic, publicly-anchored records of publisher preferences regarding machine learning training. It exists so that a lab's compliance process can answer one question — "did this domain opt out before our training cutoff?" — with an artifact that survives later legal or auditor challenge.

The registry is not built greenfield. It is the second consumer of a production substrate that already operates at stelais.com under the name Stelais — a creator product that has been anchoring per-record content provenance proofs to Arweave. The substrate provides four primitives, packaged as workspace libraries (@akaeon/core-arweave, @akaeon/core-verification, @akaeon/core-fingerprinting, @akaeon/core-watermarking), each brand-neutral and parameterized at the API boundary so the same code paths serve both consumers without duplication. The registry adds the publisher- and lab-facing API surface, a DNS-based publisher verification flow, and a Merkle batching layer for higher-throughput opt-out submissions. Underneath, the record-level signing and anchoring path is identical to Stelais's.

The integration story for a lab is: one HTTPS GET per domain at training-data ingestion time, returning a JSON bundle whose every cryptographic claim is checkable in standard-library code against the public Arweave network — three independent checks, no proprietary cryptography, no required SDK, no required ongoing relationship with the registry beyond initial credentialing.

This specification covers what is built today (the cryptographic substrate, including the Ed25519 signing and Arweave canonical-payload anchoring), what is in active development (the publisher submission API, the Merkle batcher, the lab lookup endpoint), and what the audit-defensibility properties are for both halves of the system.


2. Regulatory context

Placeholder. This section will cover EU AI Act Article 53(1)(c)/(d) TDM opt-out requirements, EU Directive 2019/790 Article 4 reservation mechanisms, U.S. state and federal context (NO FAKES Act, ELVIS Act, post-Thomson Reuters v. ROSS, post-Bartz v. Anthropic, DMCA Section 1202), UK AI-and-Copyright consultation outputs, and other jurisdictions with active TDM-reservation, content-provenance, or rightsholder-opt-out requirements (Japan Article 30-4, Singapore CDA, Australia consultation outputs). It will also map onto industry-self-regulatory frameworks (C2PA, IPTC, ai.txt, spawning.ai).

The technical specification below does not depend on the regulatory section being filled in — the cryptographic and architectural claims stand on their own merits. The regulatory section exists to map the registry's capabilities onto specific legal obligations the lab is trying to meet.


3. What problem the registry solves

A lab training on web-sourced content faces a structurally hard compliance problem: at the moment of training, the lab needs to be able to prove what it knew, and when it knew it, about whether each source had opted out. Three properties matter for that proof to survive later challenge:

1. Independent timestamping. The lab cannot rely on a registry's self-reported timestamps. If the registry says "this opt-out was effective 2026-04-01," and the lab's training cutoff was 2026-04-15, the registry could be lying — either backdating opt-outs to make the lab look non-compliant, or forward-dating them to give the lab cover. The registry's records must be timestamped by a third party the registry does not control.

2. Independent verifiability. The lab cannot rely on the registry remaining available, honest, or even existing at the moment of audit. If the registry vanishes, gets compromised, or is replaced by a successor entity, every record it ever produced must still be cryptographically verifiable against the same public substrate that timestamped it.

3. Tamper-evident chain. The link between a publisher's stated intent, the registry's record of that intent, and the public timestamp must be cryptographically unbreakable. A challenger should not be able to manufacture a plausible alternative record after the fact.

The registry's design centers on these three properties. Section 7 maps each property to the specific mechanism that delivers it.

The status quo without a registry is that labs either:

  • Don't track opt-outs at all (and accept the compliance and litigation risk), or
  • Build their own per-source opt-out crawler (which produces records the lab itself generated, with no third-party timestamp and no defense against backdating by the lab), or
  • Rely on robots.txt / ai.txt (which have no cryptographic timestamp, no DNS challenge for authority, no audit trail, and are silently mutable by anyone with site write access).

The registry provides the missing primitive: a cryptographically-signed, publicly-timestamped, third-party-anchored record of publisher opt-out intent, accessible by a single API call.


4. System architecture

The architecture is three layers stacked vertically.

4.1 Layer 1 — Consumers

Two services live at the top layer. Both are application-shaped Express/Node services in the same monorepo. Neither knows about the other; they share only the core packages underneath.

  • Stelais (services/stelais-api + services/stelais-web) — the creator-facing product, in production today. Issues per-record content-provenance proofs for creators. Out of scope for the registry.
  • Akaeon Registry (services/akaeon-registry) — the subject of this document. The workspace scaffold for it exists today; the publisher/lab API surface is in active development.

The two services are deliberately peers, not parent/child. This shape was established specifically to make the registry a separable consumer.

4.2 Layer 2 — Core packages

Four workspace packages publish the brand-neutral primitives both consumers depend on:

| Package | What it provides | Production status |
|---|---|---|
| @akaeon/core-arweave | Canonical payload builders; cost estimator; safety-limit evaluator | Production today |
| @akaeon/core-verification | Ed25519 keygen, sign, verify (RFC 8032); AES-256-GCM key encryption; canonical-message builder factory | Production today |
| @akaeon/core-fingerprinting | Text SimHash, image pHash (DCT-based), audio STFT + constellation hashing, normalized text hashing | Production today |
| @akaeon/core-watermarking | DCT and LSB watermarking primitives for image content | Production today, used by Stelais |

The brand-coupling rule across all four packages: every brand-specific parameter — app identifier, network identifier, canonical message prefix — is a required argument with no default. The package will throw if called without one. Stelais passes app: 'stelais', network: 'arweave', prefix: 'stelais:proof:v1'; the registry passes app: 'akaeon-registry', network: 'arweave', prefix: 'akaeon-registry:optout:v1'. The packages contain no string literal of either consumer's brand. This is verifiable by grep in the core source tree.

4.3 Layer 3 — Arweave anchoring substrate

The Arweave network is the public, permanent, third-party-operated trust root. Records anchored through @akaeon/core-arweave end up as transactions on Arweave, retrievable by transaction id at https://arweave.net/<txid>, queryable by GraphQL against the network's gateway nodes, and timestamped by the network's own block production.

Stelais's existing anchoring model is one Arweave transaction per record. The registry's anchoring model is one Arweave transaction per batch of opt-out records (with each record in a Merkle tree whose root is on-chain and whose inclusion proof is served by the registry). The substrate is the same; the registry adds a thin batching layer on top.

The registry has the option to switch to a different content-addressable substrate (ar.io, Irys, an L1 commitment) by swapping the network parameter and the underlying upload adapter. Existing records remain verifiable against the original network indefinitely; the choice is per-batch, not retroactive.

4.4 Safety controls inherited from the substrate

The substrate ships with production-grade safety controls that the registry inherits, with two default-value overrides where registry batch payloads differ from Stelais's per-proof records:

  • Kill switches (two independent paths): ANCHORING_ENABLED=false halts all anchoring on the next operation, no redeployment required (the incident-response lever). ANCHORING_MODE=OFF produces the same effect and is the long-term "intentionally off" signal. Both paths return a SKIPPED preflight decision; batches accumulate in pending_anchor status pending re-enablement.
  • Per-batch cost cap: ANCHORING_MAX_PER_PROOF_USD (registry default $0.50). Blocks any single batch whose pre-anchor cost estimate exceeds the cap.
  • Daily and monthly budget caps: ANCHORING_DAILY_BUDGET_USD (default $10) and ANCHORING_MONTHLY_BUDGET_USD (default $100), enforced against the rolling anchoring_spend_log table.
  • Maximum payload size: ANCHORING_MAX_PAYLOAD_BYTES (registry default 8192) hard-blocks any payload that would exceed Arweave's per-tx cost-efficient range. Doubled from Stelais's 4096 because the registry's batch payload includes the full registry signature + Merkle metadata.
  • Turbo balance probe: when ANCHORING_MODE=TURBO, the preflight blocks if the ArDrive Turbo credit balance is zero or unreachable.
  • Dry-run mode: TURBO_DRY_RUN=true swaps the Turbo SDK for a mock client that produces DRY_RUN_<random> Arweave tx IDs without spending real credits.
  • Fail-closed semantics: if cost estimation fails, the default is to block; if the spend log can't be queried, the default is to assume budget is exhausted; if the Turbo balance probe fails, the default is to treat the balance as insufficient.
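
The ordering and fail-closed behaviour of these controls can be sketched roughly as follows. This is an illustration under stated assumptions, not the substrate's code: the function name preflight, the input field names, and the BLOCKED / ALLOWED labels are hypothetical. Only the environment variable names, the default values, and the SKIPPED decision come from the list above. The monthly cap and the Turbo balance probe are omitted for brevity.

```javascript
// Hypothetical preflight sketch. Only the env var names, defaults, and the
// SKIPPED decision are taken from the spec; everything else is illustrative.
function preflight(env, input) {
  // Kill switches: either path short-circuits to SKIPPED.
  if (env.ANCHORING_ENABLED === 'false' || env.ANCHORING_MODE === 'OFF') {
    return { decision: 'SKIPPED', reason: 'anchoring disabled' };
  }

  const maxBytes = Number(env.ANCHORING_MAX_PAYLOAD_BYTES ?? 8192);
  const maxPerBatchUsd = Number(env.ANCHORING_MAX_PER_PROOF_USD ?? 0.5);
  const dailyBudgetUsd = Number(env.ANCHORING_DAILY_BUDGET_USD ?? 10);
  const block = (reason) => ({ decision: 'BLOCKED', reason });

  // A missing payload size fails this comparison and is blocked.
  if (!(input.payloadBytes <= maxBytes)) return block('payload too large');

  // Fail closed: a missing or non-finite cost estimate blocks the batch.
  if (!Number.isFinite(input.estimatedCostUsd)) return block('cost estimate unavailable');
  if (input.estimatedCostUsd > maxPerBatchUsd) return block('per-batch cost cap exceeded');

  // Fail closed: an unreadable spend log is treated as an exhausted budget.
  const spentToday = Number.isFinite(input.spentTodayUsd) ? input.spentTodayUsd : dailyBudgetUsd;
  if (spentToday + input.estimatedCostUsd > dailyBudgetUsd) return block('daily budget exceeded');

  return { decision: 'ALLOWED', reason: 'all checks passed' };
}
```

Calling preflight({ ANCHORING_ENABLED: 'false' }, {}) returns a SKIPPED decision regardless of the other inputs, which matches the kill-switch behaviour described above.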

5. The cryptographic substrate

This section specifies the cryptographic primitives the registry depends on. The reader should leave this section confident they could re-implement verification of any registry record in their language of choice, using only the standard library.

5.1 Hashing

  • SHA-256 is the hash function throughout — content hashes, canonical payload hashes, Merkle tree leaves, DNS challenge digests.
  • Implementations: any FIPS 180-4 SHA-256. Node's crypto.createHash('sha256'), Python's hashlib.sha256, Go's crypto/sha256, Rust's sha2 crate, the Web Crypto API's SubtleCrypto.digest('SHA-256', ...).
  • Hex encoding is lowercase, no leading 0x, no whitespace.
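
As a small illustration of the encoding rules above, the following Node.js sketch computes a digest and checks that a received digest is in the expected form. The helper names sha256Hex and isCanonicalDigest are hypothetical, chosen for this example only.

```javascript
const crypto = require('node:crypto');

// Lowercase hex SHA-256 of a string (hashed as UTF-8) or a Buffer.
function sha256Hex(input) {
  return crypto.createHash('sha256').update(input).digest('hex');
}

// Accepts only the encoding described above: 64 lowercase hex characters,
// no leading 0x, no whitespace.
function isCanonicalDigest(hex) {
  return typeof hex === 'string' && /^[0-9a-f]{64}$/.test(hex);
}
```

A verifier would typically reject a digest that fails isCanonicalDigest before comparing it, so that an uppercase or 0x-prefixed value is treated as malformed instead of as a mismatch.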

5.2 Signature scheme

  • Ed25519 per RFC 8032 — Curve25519 in Edwards form, deterministic signing, no per-signature randomness, 32-byte public keys, 64-byte signatures.
  • The registry's signing keypair is generated via Node's crypto.generateKeyPairSync('ed25519'). Public key is the raw 32 bytes (transported base64); private key is DER PKCS8 (encrypted-at-rest with AES-256-GCM and the service master key, never transmitted).
  • Signature production calls Node's crypto.sign(null, message, privateKey).
  • Signature verification by any third party calls crypto.verify(null, message, publicKey, signature). To reconstruct a verifier-side public key from the registry's 32-byte raw public key, prepend the 12-byte Ed25519 SPKI DER header 30 2a 30 05 06 03 2b 65 70 03 21 00. Other Ed25519 libraries accept raw 32-byte keys directly without the SPKI wrap.
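
A minimal Node.js sketch of the verification path described above, assuming the signature and the raw public key arrive base64-encoded as in the record's signature block. The function names are illustrative and not part of any registry SDK.

```javascript
const crypto = require('node:crypto');

// The 12-byte Ed25519 SPKI DER header quoted above.
const ED25519_SPKI_PREFIX = Buffer.from('302a300506032b6570032100', 'hex');

// Wrap a raw 32-byte Ed25519 public key so Node's crypto module accepts it.
function publicKeyFromRaw(rawKey) {
  if (rawKey.length !== 32) {
    throw new Error('expected a 32-byte raw Ed25519 public key');
  }
  return crypto.createPublicKey({
    key: Buffer.concat([ED25519_SPKI_PREFIX, rawKey]),
    format: 'der',
    type: 'spki',
  });
}

// Returns true only if the signature is valid for exactly these message bytes.
function verifyRegistrySignature(message, signatureB64, publicKeyB64) {
  const publicKey = publicKeyFromRaw(Buffer.from(publicKeyB64, 'base64'));
  return crypto.verify(
    null, // Ed25519 takes no separate digest algorithm
    Buffer.from(message, 'utf8'),
    publicKey,
    Buffer.from(signatureB64, 'base64'),
  );
}
```

The message argument is the canonical message string exactly as signed; any normalization on the verifier side (trimming, re-encoding) would change the bytes and make a valid signature fail.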

5.3 Canonical message format

For each signed record, the registry signs a canonical message: a deterministic pipe-delimited UTF-8 string the verifier can reconstruct from the record's stored fields without trusting the registry's timestamps.

For opt-out records:

akaeon-registry:optout:v1|<submission_id>|<domain>|<policy>

For a future extension where the registry attests to its own observation of a verified DNS challenge, an alternate format is:

akaeon-registry:dns-verify:v1|<submission_id>|<domain>|<dns_challenge_record_sha256>

The format is intentionally timestampless. The signature attests to what the registry said, not when it said it — the timestamp is delegated to the Arweave block in which the batch root is anchored, which is the trust root for time. This is the same property as Stelais's existing canonical message format stelais:proof:v1|<userId>|<fileHash>.

The prefix is the brand-coupling. It's a required argument; the core package has no default. Stelais's prefix and the registry's prefix cannot collide because they're literally different strings, and the prefix is part of the signed bytes.
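
The "required prefix, no default" rule and the pipe-delimited layout can be sketched as a small factory. This is a hypothetical illustration, not the @akaeon/core-verification API: the names createMessageBuilder and buildOptOutMessage are invented here, and the rejection of fields containing the delimiter is an assumption the spec does not state.

```javascript
// Hypothetical factory mirroring the "required prefix, no default" rule.
function createMessageBuilder(prefix) {
  if (typeof prefix !== 'string' || prefix.length === 0) {
    throw new Error('prefix is required and has no default');
  }
  return (...fields) => {
    for (const field of fields) {
      // Assumption: a field containing "|" is rejected, because otherwise
      // two different records could serialize to the same signed bytes.
      if (typeof field !== 'string' || field.length === 0 || field.includes('|')) {
        throw new Error('invalid canonical message field');
      }
    }
    return [prefix, ...fields].join('|');
  };
}

// The registry's opt-out format: prefix|submission_id|domain|policy.
const buildOptOutMessage = createMessageBuilder('akaeon-registry:optout:v1');
```

A verifier rebuilds the message from the record's stored fields with the same layout and checks the signature over those bytes; no timestamp is involved.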

5.4 Canonical record schema

An opt-out record's canonical form is a deterministic JSON document:

{
  "version": 1,
  "type": "domain_optout",
  "submission_id": "01J9XW...",
  "domain": "example-publisher.com",
  "policy": "no-training",
  "scope": "domain",
  "effective_from": "2026-05-11T00:00:00Z",
  "submitted_at": "2026-05-11T14:23:00Z",
  "dns_verified_at": "2026-05-11T14:31:00Z",
  "dns_challenge_record_sha256": "<hex>",
  "publisher_account_id": "01J9XW...",
  "app": "akaeon-registry",
  "network": "arweave"
}

The field set is deliberately minimal. Optional fields are included only when present; the canonical-serialization rule is:

  • UTF-8 encoding.
  • Sort keys lexicographically.
  • No insignificant whitespace (JSON.stringify with no space argument).
  • Reject any field not in the schema (no extension fields in v1).

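The UTF-8 bytes of this canonical document are the input to the Merkle leaf hash, which is computed with the domain-separation prefix defined in §5.5: SHA-256(0x00 || canonical_record_bytes).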
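
The serialization rule above can be sketched in Node.js as follows. This is a minimal sketch under assumptions: the record is flat (as in the example), the field list is copied from the example, and a null or undefined value is treated as an absent optional field. The function name canonicalRecordBytes is hypothetical.

```javascript
// Field names copied from the example record above.
const SCHEMA_FIELDS = new Set([
  'version', 'type', 'submission_id', 'domain', 'policy', 'scope',
  'effective_from', 'submitted_at', 'dns_verified_at',
  'dns_challenge_record_sha256', 'publisher_account_id', 'app', 'network',
]);

// Returns the canonical UTF-8 bytes of a flat opt-out record.
function canonicalRecordBytes(record) {
  const sorted = {};
  // Object.keys(...).sort() orders these ASCII keys lexicographically, and
  // JSON.stringify emits string keys in insertion order.
  for (const key of Object.keys(record).sort()) {
    if (!SCHEMA_FIELDS.has(key)) {
      throw new Error(`field not in v1 schema: ${key}`);
    }
    // Assumption: null or undefined means "optional field not present".
    if (record[key] === undefined || record[key] === null) continue;
    sorted[key] = record[key];
  }
  // No space argument, so no insignificant whitespace is produced.
  return Buffer.from(JSON.stringify(sorted), 'utf8');
}
```

Two verifiers that apply this rule to the same field values obtain the same bytes regardless of the key order in which they received the record.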

5.5 Merkle tree construction

The registry batches opt-out leaves into a Merkle tree once per batch. The construction follows the structure of RFC 6962 (Certificate Transparency) to inherit its well-studied second-preimage resistance:

  • Leaf node hash: SHA-256(0x00 || canonical_record_bytes).
  • Internal node hash: SHA-256(0x01 || left_child || right_child).
  • Odd-count handling: when a level has an odd number of nodes, the last node is promoted unchanged to the next level (CT's approach), not duplicated.
  • Empty tree: not permitted; a batch with zero leaves is not anchored.
  • Inclusion proof: the standard sibling-hash path from leaf to root. The verifier requires three inputs alongside the leaf hash and claimed root: the leaf index (the leaf's 0-based position in the tree), the tree size (total leaf count of the batch the leaf belongs to), and the sibling-hash array (lowest-level first). The tree size is load-bearing: RFC 6962's odd-count promotion means the "is the current node on the right edge of its level" decision depends on tree size, not just on the index's low bit. Naive Bitcoin-style verifiers that omit tree size produce wrong intermediate hashes for any path that passes through a promoted node, and silently fail to reconstruct the root for non-power-of-2 batches. The Lab Integration Runbook publishes the verifier algorithm for RFC 6962 §2.1.1 audit paths that labs use.

The 0x00 / 0x01 domain-separation prefixes prevent the second-preimage attack where a leaf could be reinterpreted as an internal node. This is the same rationale the Certificate Transparency log uses; reference verifiers exist in every major language.
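
The construction and the tree-size-aware verifier can be sketched as below. This is an illustrative sketch, not the runbook's reference code: the function names are hypothetical, and the verification loop follows the inclusion-proof verification procedure published in RFC 9162 (the successor to RFC 6962), which is assumed here to match what the runbook specifies.

```javascript
const crypto = require('node:crypto');

const sha256 = (...parts) =>
  crypto.createHash('sha256').update(Buffer.concat(parts)).digest();
const leafHash = (recordBytes) => sha256(Buffer.from([0x00]), recordBytes);
const nodeHash = (left, right) => sha256(Buffer.from([0x01]), left, right);

// Largest power of two strictly less than n (for n >= 2).
function splitPoint(n) {
  let k = 1;
  while (k * 2 < n) k *= 2;
  return k;
}

// Root over an array of leaf hashes; equivalent to promoting odd nodes.
function merkleRoot(hashes) {
  if (hashes.length === 0) throw new Error('empty batches are not anchored');
  if (hashes.length === 1) return hashes[0];
  const k = splitPoint(hashes.length);
  return nodeHash(merkleRoot(hashes.slice(0, k)), merkleRoot(hashes.slice(k)));
}

// Sibling path for the leaf at `index`, lowest level first.
function inclusionPath(index, hashes) {
  if (hashes.length === 1) return [];
  const k = splitPoint(hashes.length);
  return index < k
    ? [...inclusionPath(index, hashes.slice(0, k)), merkleRoot(hashes.slice(k))]
    : [...inclusionPath(index - k, hashes.slice(k)), merkleRoot(hashes.slice(0, k))];
}

// Recompute the root from a leaf hash, its index, the tree size, and the path.
function verifyInclusion(leaf, index, treeSize, path, root) {
  if (!Number.isInteger(index) || index < 0 || index >= treeSize) return false;
  let fn = index;        // position of the current node at this level
  let sn = treeSize - 1; // position of the last node at this level
  let r = leaf;
  for (const sibling of path) {
    if (sn === 0) return false; // path is longer than the tree is tall
    if (fn % 2 === 1 || fn === sn) {
      r = nodeHash(sibling, r); // current node is a right child
      // Skip the levels where this right-edge node was promoted unchanged.
      while (fn % 2 === 0 && fn !== 0) {
        fn = Math.floor(fn / 2);
        sn = Math.floor(sn / 2);
      }
    } else {
      r = nodeHash(r, sibling); // current node is a left child
    }
    fn = Math.floor(fn / 2);
    sn = Math.floor(sn / 2);
  }
  return sn === 0 && r.equals(root);
}
```

For a five-leaf batch, the last leaf's path has a single entry (the root of the first four leaves). A verifier that is told the wrong tree size for that same path returns false, which is the failure mode the tree-size requirement exists to prevent.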

5.6 Batch canonical payload

The on-chain payload for an opt-out batch is:

{
  "version": 1,
  "type": "optout_batch",
  "batch_id": "01J9YA...",
  "started_at": "2026-05-11T14:00:00Z",
  "closed_at": "2026-05-11T15:00:00Z",
  "merkle_root_sha256_hex": "f3a9...",
  "leaf_count": 2814,
  "tree_construction": "rfc6962-style",
  "registry_signature": {
    "canonical_message": "akaeon-registry:batch:v1|01J9YA...|f3a9...|2814",
    "signature": "<base64>",
    "public_key": "<base64-32-byte-raw>",
    "signature_scheme": "ed25519",
    "version": "v1"
  },
  "app": "akaeon-registry",
  "network": "arweave"
}

This is the only payload anchored on Arweave per batch. The leaves themselves are stored in the registry's database and served on-demand to verifiers who request an inclusion proof. (Section 10 discusses the alternative — anchoring the full list of leaf hashes — and its trade-offs.)

The batch payload is small (~500 bytes, well below the registry's 8192-byte payload cap from §4.4) and one batch produces one Arweave transaction. At an hourly cadence with batches of up to ~10,000 leaves, the registry produces 24 transactions per day, well within the substrate's $10 default daily budget.
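
Before checking the signature, a verifier can confirm that the signed canonical message actually commits to the payload's own fields. The sketch below assumes the message layout prefix|batch_id|merkle_root_sha256_hex|leaf_count, which is inferred from the example payload and not stated as a rule in this specification. The function names are hypothetical.

```javascript
const MAX_PAYLOAD_BYTES = 8192; // registry default from §4.4

// Assumed layout, inferred from the example payload above.
function expectedBatchMessage(batch) {
  return [
    'akaeon-registry:batch:v1',
    batch.batch_id,
    batch.merkle_root_sha256_hex,
    String(batch.leaf_count),
  ].join('|');
}

// Returns a list of problems; an empty list means the payload is consistent.
function checkBatchPayload(batch) {
  const problems = [];
  if (!Number.isInteger(batch.leaf_count) || batch.leaf_count < 1) {
    problems.push('leaf_count must be a positive integer');
  }
  if (batch.registry_signature.canonical_message !== expectedBatchMessage(batch)) {
    problems.push('canonical_message does not match the payload fields');
  }
  if (Buffer.byteLength(JSON.stringify(batch), 'utf8') > MAX_PAYLOAD_BYTES) {
    problems.push('payload exceeds the registry payload cap');
  }
  return problems;
}
```

This check matters because a valid signature over a canonical message proves nothing about a payload whose merkle_root_sha256_hex or leaf_count differs from what that message names.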


6. The registry's API surface

The registry exposes three public surfaces: a publisher API (write side), a lab API (read side), and a public verification surface (anyone side). All three are HTTPS; all three return JSON; all three use bearer tokens or HMAC signatures for authentication where authentication is required.

The Lab Integration Runbook contains the full request/response shapes. This section gives the high-level shape; the runbook is the implementation reference.

6.1 Publisher API (write side)

  • POST /v1/optouts — submit a new opt-out for a domain or subdomain. Returns a DNS challenge the publisher must publish before verification proceeds. Response is 202 Accepted with a submission_id and the challenge record details.
  • GET /v1/optouts/:submission_id — poll the status of a submission. States: pending_dns_verification, dns_verified, pending_anchor, anchored, failed, expired.
  • DELETE /v1/optouts/:submission_id — withdraw an opt-out. The withdrawal itself is anchored as a new record (type: "domain_optout_withdrawal") rather than erasing the original. The on-chain record cannot be erased; this preserves the audit trail.

Authentication: publisher account credentials issued at publisher registration time. Rate-limited per account.

6.2 Lab API (read side)

  • GET /v1/lookup?domain=<domain> — return all currently-effective opt-outs for the given domain, with full Merkle inclusion proofs and Arweave transaction references. This is the only endpoint the lab needs to call at training-data ingestion time. Response includes everything the lab needs for an independently-verifiable audit log entry.
  • POST /v1/lookup/bulk — POST a list of domains, return the bundle for each. For ingestion pipelines processing millions of URLs, this avoids one-domain-at-a-time round trips.

Authentication: lab account bearer token. The token does not affect the returned data — the registry returns the same bundle for any authenticated lab, and the bundle is verifiable without trust in the registry's authentication.
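
A lookup call at ingestion time might look like the sketch below. The base URL is a placeholder (this specification does not name the API host), lowercasing the domain is an assumption, and the response is returned as parsed JSON without interpretation because its shape is defined in the Lab Integration Runbook. It relies on the global fetch available in Node 18 and later.

```javascript
// Placeholder host; substitute the registry's real API origin.
const REGISTRY_BASE_URL = 'https://registry.example';

// Build the GET /v1/lookup URL for one domain.
function buildLookupUrl(domain, baseUrl = REGISTRY_BASE_URL) {
  const url = new URL('/v1/lookup', baseUrl);
  // Assumption: domains are looked up in lowercase, without surrounding spaces.
  url.searchParams.set('domain', domain.trim().toLowerCase());
  return url;
}

// Fetch the bundle for one domain using the lab's bearer token.
async function lookupDomain(domain, bearerToken) {
  const response = await fetch(buildLookupUrl(domain), {
    headers: { Authorization: `Bearer ${bearerToken}` },
  });
  // A failed lookup is an error, never evidence that no opt-out exists.
  if (!response.ok) {
    throw new Error(`lookup failed: HTTP ${response.status}`);
  }
  return response.json();
}
```

Throwing on a non-2xx response keeps the ingestion pipeline from silently treating an outage as "no opt-out on record".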

6.3 Public verification surface

These endpoints require no authentication. They exist so a court, auditor, or competing service can verify any registry record without a relationship with the registry:

  • GET /v1/public/optouts/:submission_id/verify — return the canonical record, the registry's signature, the Merkle inclusion proof, and the Arweave transaction id for a single opt-out.
  • GET /v1/public/batches/:batch_id — return the batch metadata, the Merkle root, the leaf count, and the Arweave transaction id. Used by verifiers to confirm a claimed root against the registry's view of the batch.
  • GET /v1/public/registry-key — return the registry's current Ed25519 signing public key, and the historical keys with their validity windows. Verifiers use this to validate signatures issued at any time in the registry's lifetime.

The public verification surface is the load-bearing trust property. As long as the data these three endpoints return exists somewhere (at the registry's domain, at a mirror, in a court's exhibit folder), every opt-out the registry ever issued can be independently verified against the public Arweave network.


7. Audit-defensibility properties

This section maps the three properties stated in §3 onto specific mechanisms.

7.1 Independent timestamping — Arweave block inclusion

The registry's claim about when an opt-out became effective rests on the Arweave transaction's block timestamp, not on any field in the registry's database:

  • The opt-out's effective_from field is a publisher-declared intent. It carries no cryptographic weight on its own.
  • The opt-out's submitted_at and dns_verified_at fields are registry-declared timestamps. They carry weight only insofar as they are part of the canonical opt-out record (§5.4), whose leaf hash is committed to the Merkle root that the registry signs in the batch payload (§5.6).
  • The opt-out's effective timestamp for audit purposes is the Arweave block time of the transaction that anchored the batch containing the opt-out. This is the only timestamp the registry cannot manipulate.

The lab's audit log records the Arweave transaction id. A challenger arguing the opt-out was backdated must argue against Arweave's block timestamp — which requires either compromising the Arweave network or producing a forged transaction id that resolves to the same canonical record, both of which the network's design makes infeasible.
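
The comparison a lab makes against its training cutoff can be sketched as follows. This is an illustrative sketch: the function name, the input shape, and the three result labels are hypothetical. It assumes the block timestamp is expressed in Unix seconds and that the lab has already retrieved the block timestamp and confirmation count from an Arweave gateway of its choice. The default of 10 confirmations is the default N given in §7.3.

```javascript
const DEFAULT_MIN_CONFIRMATIONS = 10; // default N from §7.3

// anchor.blockTimestamp: Unix seconds of the block containing the batch tx.
// anchor.confirmations: current network height minus that block's height.
function classifyAnchor(anchor, cutoffIso, minConfirmations = DEFAULT_MIN_CONFIRMATIONS) {
  const cutoffSeconds = Date.parse(cutoffIso) / 1000;
  if (Number.isNaN(cutoffSeconds)) {
    throw new Error('cutoff must be an ISO 8601 timestamp');
  }
  if (!Number.isInteger(anchor.blockTimestamp)) {
    throw new Error('missing or invalid block timestamp');
  }
  // Written so that a missing confirmation count is treated as not final.
  if (!(anchor.confirmations >= minConfirmations)) return 'not-final';
  return anchor.blockTimestamp <= cutoffSeconds
    ? 'anchored-before-cutoff'
    : 'anchored-after-cutoff';
}
```

Only the block timestamp enters the comparison; effective_from, submitted_at, and dns_verified_at are deliberately ignored, in line with the weighting described above.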

7.2 Independent verifiability — no registry code in the verify path

Every cryptographic claim the registry makes is verifiable using:

  • Any RFC 8032 Ed25519 implementation (Node crypto, Python cryptography, Go crypto/ed25519, Rust ed25519-dalek, Web Crypto API, OpenSSL).
  • Any SHA-256 implementation.
  • Any Arweave gateway (arweave.net, g8way.io, any node operator's endpoint, or a self-run Arweave node).

The lab's verifier code is ~30 lines of Node.js using only the standard library. It runs in the lab's environment, not the registry's. It does not depend on the registry being online, honest, or extant at the moment of verification.

A lab can, and per defensible practice should, cache the registry's signing public key locally at training time, so that even if the registry is later compromised and the verify endpoint serves a substituted key, the lab can verify against the key as it was at training time. Section 9.2 discusses key rotation policy.

7.3 Tamper-evident chain — cryptographic linkage at every step

Each link in the chain from publisher intent to lab acknowledgment is cryptographically committed:

  1. Publisher intent → DNS challenge: the DNS challenge nonce is committed in the registry's database; the publisher's TXT record value is hashed (SHA-256) and the hash goes into the canonical opt-out record. A TXT value altered after the fact no longer matches the hash committed in the canonical record.
  2. DNS verification → canonical record: the canonical record is a deterministic UTF-8 JSON encoding (§5.4). Any byte change anywhere in the record produces a different SHA-256, which would not match the leaf hash in the Merkle tree.
  3. Canonical record → Merkle leaf: the leaf hash is SHA-256(0x00 || canonical_record_bytes). The 0x00 prefix prevents reinterpretation as an internal node.
  4. Merkle leaf → Merkle root: the inclusion proof is a sequence of sibling hashes. Recomputing the root from the leaf and the proof is deterministic; any tampering with the leaf or any sibling invalidates the reconstruction.
  5. Merkle root → Arweave transaction: the root is embedded in the batch canonical payload, which is the body of the Arweave transaction. The transaction id is the SHA-256 of the transaction's signature, and that signature covers the transaction's data; tampering with the body invalidates the signature, and any re-signed replacement has a different transaction id, which won't match the id the lab recorded at training time.
  6. Arweave transaction → block inclusion: Arweave's consensus includes the transaction in a block with a public timestamp. Block reorgs are bounded by the network's design; the registry recommends labs treat transactions with at least N block confirmations as final, where N is configurable (default: 10).

Each link is independently checkable. A challenger has to break every link to produce a plausible alternative record — not just one.

7.4 The key-rotation failure mode and how the registry handles it

A subtle but real audit-defensibility concern: what if the registry's signing key is rotated, and an old record is later verified against the new public key? The signature won't match, and a careless verifier might conclude the record is invalid when in fact it was correctly signed by an earlier key.

The registry handles this by:

  1. Versioned public keys. Every record's registry_signature block includes a version field. The version maps to a specific public key with a documented validity window, returned by the GET /v1/public/registry-key endpoint.
  2. Snapshot-at-signing. The signing public key is embedded directly in the registry signature block on every record. A verifier never has to ask the registry "what was the right key for this record"; the answer is in the record itself.
  3. Mirroring discipline. The validity windows and historical public keys are published in three places — the registry's own endpoint, a read-only mirror at a separate domain, and an Arweave-anchored declaration. The third is the most important: even if the registry and its mirror both vanish, the historical key catalog survives on the public chain.
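
A verifier can combine the three mechanisms by cross-checking the key embedded in a record against a catalog copy it trusts (its own cache, the mirror, or the Arweave-anchored declaration). The sketch below is hypothetical: the catalog entry shape { version, public_key, valid_from, valid_until } is an assumption, since this specification does not define the response format of the registry-key endpoint.

```javascript
// Assumed catalog entry shape (not defined by this spec):
// { version, public_key (base64), valid_from (ISO), valid_until (ISO or null) }
function embeddedKeyIsCredible(signatureBlock, catalog, anchoredAtIso) {
  const entry = catalog.find((k) => k.version === signatureBlock.version);
  if (!entry) return false;

  // Compare decoded bytes so base64 padding differences do not matter.
  const embedded = Buffer.from(signatureBlock.public_key, 'base64');
  const cataloged = Buffer.from(entry.public_key, 'base64');
  if (embedded.length !== 32 || !embedded.equals(cataloged)) return false;

  // The anchor time (Arweave block time) must fall in the validity window.
  const anchoredAt = Date.parse(anchoredAtIso);
  const from = Date.parse(entry.valid_from);
  const until = entry.valid_until === null ? Infinity : Date.parse(entry.valid_until);
  // Any NaN from an unparseable date makes these comparisons false.
  return anchoredAt >= from && anchoredAt < until;
}
```

The embedded snapshot answers "which key signed this"; the catalog check answers "was that key the registry's key at the time the batch was anchored". A record signed with a key outside its window should be treated as suspect even if the signature verifies.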

8. Adversarial considerations

A technical evaluator at a lab will reasonably ask "what attacks does this not defend against?" This section is the honest answer.

8.1 Attacks the registry defends against

| Attack | Defense |
|---|---|
| Registry backdating an opt-out | Arweave block timestamp is the trusted clock; registry can't move blocks |
| Registry forward-dating to manufacture liability | Same: block timestamp is fixed at inclusion |
| Tampering with a record after it's anchored | Canonical hash mismatch detectable by Merkle proof recomputation |
| Substituting a fake record into a real Merkle batch | Leaf isn't in the published leaves; inclusion proof would fail |
| Registry's signing key compromise after-the-fact | Historical key catalog is mirrored and Arweave-anchored; old records verify against the key that was current at signing time |
| Registry going offline | Records remain verifiable against Arweave; public verification endpoint can be mirrored or run by a successor entity |
| Network-level DNS poisoning during publisher verification | Multi-resolver DNS verification (Cloudflare, Google, Quad9, etc.) and DNSSEC chain recording where available |
| Replay attack on the signature | Canonical message includes submission_id which is unique per submission; replaying it produces a duplicate the registry rejects |

8.2 Attacks the registry does not defend against

These are not silently ignored — they're real properties of the system that a sophisticated evaluator will want to understand.

Publisher domain compromise. If the publisher loses control of their domain (e.g. domain hijack, expired registration repurchased), an attacker can pass a DNS challenge and submit opt-outs the original publisher didn't intend. Mitigation: opt-outs have an effective_from field, and the registry exposes a withdraw flow that creates a new anchored record; the lab's audit log will reflect both. The registry does not detect that a domain has changed ownership.

Arweave network failure. If Arweave is partitioned, halted, or reorgs deeply, recently-anchored transactions could be lost. Mitigation: the registry's pre-anchor record state is retained in its database; on Arweave recovery, batches can be re-anchored (with a new, later timestamp). As a further mitigation, the registry can dual-anchor to a second substrate (e.g. Irys, ar.io gateway) for cross-substrate redundancy.

Sybil submission flooding. A bad actor could attempt to register thousands of bogus opt-outs to inflate batch sizes or degrade service. Mitigation: per-account rate limits, DNS challenge requirement (each domain requires control proof), per-account cost-based throttling, and the substrate's existing daily/monthly anchoring budget caps. The DNS challenge is the strongest mitigation — it can't be passed without control of an actual domain.

Lab fabricating a verification record after the fact. A non-compliant lab could claim "we checked the registry at time T and saw no opt-out," when in fact they didn't check. The registry cannot prevent this — it controls its own records, not the lab's audit logs. The registry's job is to provide records that would defend the lab's good-faith audit log, not to police the lab's logging discipline. (A future extension: lab-side attestations signed by a publishable lab key, with the registry recording the attestation — out of scope for v1.)

Court refusing to credit Arweave timestamps. In a jurisdiction where public blockchain timestamps haven't been precedent-tested, a court may require additional evidence. The registry's design provides the cryptographic primitives; the registry cannot guarantee a specific court's admission ruling. Mitigation: dual-anchor to an additional substrate whose timestamps are precedent-credited in the relevant jurisdiction (e.g. a notarized hash, an IPFS+Filecoin commitment, a signature from a public timestamping authority).

Compelled disclosure of the registry's signing key. In a jurisdiction that compels production of the signing private key, whoever obtains the key could sign records that purport to be historical. Mitigation: hardware security module storage for the key (HSM-backed; key never extractable as bytes), plus the structural property that every record ever signed is already anchored on Arweave. A compelled key disclosure would let an attacker sign new records; it could not retroactively change records already anchored, and a newly signed record cannot obtain an Arweave block timestamp earlier than its actual anchoring. The mirroring and Arweave-anchoring of the historical key catalog protects against post-compulsion narrative reconstruction.

8.3 Security boundaries of the substrate

The substrate inherits Stelais's production security model, which the registry adopts unchanged:

  • All private signing keys at rest are encrypted with AES-256-GCM. The master key is environment-variable-only, never written to the database, rotated on a documented cadence.
  • TLS termination at the edge, with HSTS + certificate transparency monitoring.
  • HTTP API uses bearer tokens (JWT-signed) for authenticated endpoints. Webhook callbacks to publisher endpoints are HMAC-signed; the publisher verifies the HMAC before treating the callback as authoritative.
  • Anchoring spend log (anchoring_spend_log) is append-only; any anomalous spending triggers alerts.
  • Fail-closed semantics throughout: when in doubt, the system refuses to anchor rather than anchor incorrectly.
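
On the publisher side, the webhook check described above might look like the following sketch. The specifics are assumptions, because this specification does not fix them: HMAC-SHA256 as the algorithm, the raw request body as the signed input, and a hex-encoded signature. The function name verifyWebhook is hypothetical.

```javascript
const crypto = require('node:crypto');

// Assumptions: HMAC-SHA256 over the raw request body, hex-encoded signature.
function verifyWebhook(sharedSecret, rawBody, signatureHex) {
  const expected = crypto.createHmac('sha256', sharedSecret).update(rawBody).digest();
  const received = Buffer.from(String(signatureHex), 'hex');
  // timingSafeEqual throws on a length mismatch, so check lengths first.
  return received.length === expected.length
    && crypto.timingSafeEqual(received, expected);
}
```

The HMAC is computed over the body bytes exactly as received, before any JSON parsing; re-serializing a parsed body can change the bytes and cause a valid callback to be rejected.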

9. Governance, jurisdiction, and operational integrity

9.1 Who operates the registry, and what happens if they stop?

The registry is operated by Akaeon. The operational responsibility includes:

  • Maintaining the publisher and lab API surfaces.
  • Maintaining the Arweave-anchored Merkle batching cadence.
  • Maintaining the public verification endpoints.
  • Custody of the registry's signing private key.

If Akaeon ceases operations, every record the registry ever produced remains independently verifiable against Arweave. The records do not depend on Akaeon's continued operation. A successor entity can take over the publisher and lab API surfaces; the cryptographic chain remains intact.

The registry commits to:

  • Source availability for the verifier reference implementations under a permissive open-source license (Apache 2.0 or MIT). A successor can re-host verification without negotiating a license.
  • Wind-down protocol: if Akaeon enters wind-down, a 90-day notice window precedes shutdown, during which the registry continues anchoring and the public verification endpoints continue operating. The signing key is published at the end of the window so no further valid records can be produced under the existing key (preventing post-wind-down forgery), and the historical-records archive is mirrored to a long-term storage commitment (the registry's preference: an Arweave-anchored declaration listing every batch ever produced and its transaction id).

9.2 Key rotation policy

The registry's signing key is rotated:

  • Routinely: at a documented cadence (initial proposal: every 12 months).
  • On compromise: immediately, with a public disclosure of the compromise and a re-issuance flow for records signed by the compromised key (the records themselves don't change; an additional attestation under the new key is anchored to confirm the records' continued validity).
  • On algorithm change: if Ed25519 (RFC 8032) is deprecated by NIST or cryptographic consensus shifts to a successor scheme.

Key rotation does not invalidate prior records (§7.4). Each record carries the public key snapshot it was signed with; verifiers verify against the embedded snapshot, not against the current key.

The historical key catalog is published on Arweave on each rotation. The catalog is itself a canonical record (type: "registry_key_catalog") signed by the new current key, with chained references to all prior keys' Arweave records. This produces a tamper-evident key history that does not depend on the registry's continued existence.
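
The chained references can be pictured as a linked list of catalog records that a verifier walks from the newest key back to the first. The sketch below is illustrative only: the field names public_key and prev_tx, the in-memory entries mapping, and the function walk_key_history are hypothetical, and the Ed25519 signature check on each catalog record is deliberately omitted because Python's standard library has no Ed25519 implementation.

```python
from typing import Optional


def walk_key_history(entries: dict[str, dict], latest_tx: str) -> list[str]:
    """Follow prev_tx links from latest_tx and return public keys oldest-first.

    `entries` maps an Arweave transaction id to a catalog record with the
    HYPOTHETICAL fields "public_key" and "prev_tx" (None for the first key).
    """
    keys: list[str] = []
    seen: set[str] = set()
    tx: Optional[str] = latest_tx
    while tx is not None:
        if tx in seen:
            raise ValueError(f"cycle in key history at {tx}")
        if tx not in entries:
            raise ValueError(f"missing key record {tx}")
        seen.add(tx)
        keys.append(entries[tx]["public_key"])
        tx = entries[tx]["prev_tx"]
    keys.reverse()  # collected newest-first, returned oldest-first
    return keys
```

A gap or a cycle in the chain raises rather than returning a partial history, in keeping with the fail-closed posture described elsewhere in this document.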

9.3 Jurisdiction and data residency

The registry operates from a single jurisdiction; publisher account data and the registry's database are stored in a single region. Anchored data is on the public Arweave network, which is globally distributed and does not have a single jurisdiction.

The registry's database stores:

  • Publisher account credentials (email, API key hashes, organization name).
  • Submission state (pending DNS verification, anchored batch references).
  • Merkle leaf hashes and their batch membership.
  • Spend log entries.

The database does not store publisher private keys (publishers don't have keypairs in v1), plaintext lab API keys (only Argon2id hashes are kept), or Arweave private keys (these live in environment-only secrets).

GDPR considerations: opt-out records contain a domain and a policy. A publisher's publisher_account_id is a UUID, not a person's identifier; the account-to-person linkage is in the registry's database under the publisher's voluntary registration. A right-to-erasure request can erase the account-to-person linkage from the database; it cannot erase Arweave records (which is a property of the substrate; this is disclosed at publisher registration). The opt-out records themselves contain no personal data beyond what the publisher voluntarily anchors.

9.4 Audit and inspection rights

The registry's operational integrity is verifiable by external parties:

  • Read-only mirror. A second domain (mirror.akaeon-registry.com, initial proposal) serves the public verification surface in read-only mode, with a different operational stack and ideally a different hosting provider. A discrepancy between the primary and the mirror signals possible compromise.
  • Reproducibility audits. Quarterly, the registry publishes a reproducibility report: for every batch anchored in the quarter, it recomputes the Merkle root from the leaf records it serves, compares the result with the anchored root, and publishes the outcome. A third party can spot-check by recomputing the roots independently; the report is itself anchored as a type: "audit_report" canonical record.
  • Open verifier reference implementation. The reference verifier code is published under an open license. Any party can run it independently; a cross-implementation in another language (Python, Go, Rust) is a welcome contribution that materially strengthens the trust story.
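
A third-party spot-check of a reproducibility report amounts to recomputing a batch root from the served leaf records. The following is a minimal sketch, assuming details this section does not state: the leaf hash follows the glossary definition, but the 0x01 interior-node prefix (RFC 6962 style) and the promotion of an unpaired node are assumptions, and the function names are hypothetical. The registry's Merkle batching rules take precedence wherever they differ.

```python
import hashlib


def leaf_hash(canonical_record_bytes: bytes) -> bytes:
    # Glossary definition: SHA-256(0x00 || canonical_record_bytes).
    return hashlib.sha256(b"\x00" + canonical_record_bytes).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # ASSUMPTION: interior nodes are domain-separated with a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    if not leaves:
        raise ValueError("a batch must contain at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        nxt = [node_hash(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            # ASSUMPTION: an unpaired trailing node is promoted unchanged.
            nxt.append(level[-1])
        level = nxt
    return level[0]


def audit_batch(records: list[bytes], anchored_root_hex: str) -> bool:
    """Recompute the root from served records and compare with the anchored one."""
    recomputed = merkle_root([leaf_hash(r) for r in records])
    return recomputed.hex() == anchored_root_hex.lower()
```

An auditor would call audit_batch once per batch, passing the leaf records fetched from the registry (or the mirror) and the root read from the Arweave transaction.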

10. Open questions and known limitations

The following are unresolved as of this draft. They're surfaced explicitly so a sharp reviewer can engage with them rather than discover them later.

10.1 Leaf publication policy. The registry currently anchors only the Merkle root, not the full leaf list. This is cost-efficient but creates a dependency on the registry to produce inclusion proofs. An alternative — anchor the full leaf-hash list in the batch canonical payload — costs more but removes the registry as a necessary mediator. The trade-off is real and the decision is not yet final. Initial proposal: anchor roots only for v1, add full-leaf-list anchoring as an optional batch mode in v2 driven by labs' assurance requirements.
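
To make the dependency concrete: under roots-only anchoring, a lab needs the registry to supply the sibling hashes, after which the check itself is standard-library code. The sketch below assumes a proof encoded as (side, sibling_hash) pairs and a 0x01 interior-node prefix; both are illustrative assumptions, as is the function name verify_inclusion. Only the leaf hash follows the glossary definition.

```python
import hashlib


def leaf_hash(canonical_record_bytes: bytes) -> bytes:
    # Glossary definition: SHA-256(0x00 || canonical_record_bytes).
    return hashlib.sha256(b"\x00" + canonical_record_bytes).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # ASSUMPTION: interior nodes use a 0x01 prefix.
    return hashlib.sha256(b"\x01" + left + right).digest()


def verify_inclusion(record: bytes,
                     proof: list[tuple[str, bytes]],
                     anchored_root: bytes) -> bool:
    """Rebuild the root from one leaf and its sibling path.

    Each proof step is (side, sibling), where side is "L" if the sibling
    sits to the left of the running hash and "R" if it sits to the right
    (HYPOTHETICAL encoding).
    """
    current = leaf_hash(record)
    for side, sibling in proof:
        if side == "L":
            current = node_hash(sibling, current)
        elif side == "R":
            current = node_hash(current, sibling)
        else:
            return False  # malformed proof: fail closed
    return current == anchored_root
```

If the full leaf-hash list were anchored with the batch, a lab could derive the proof argument itself from the public data instead of requesting it from the registry.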

10.2 Cross-substrate dual-anchoring. §8.2 mentions dual-anchoring to a second substrate (Irys, an L1 chain, or a notarized CA timestamp). v1 anchors only to Arweave. Adding a second substrate is straightforward at the @akaeon/core-arweave network parameter level but requires operational infrastructure (a second budget envelope, a second adapter, mirrored cost controls). Initial proposal: implement dual-anchoring as a v1.1 feature, once labs articulate which second substrate has the most precedent value in their jurisdiction.

10.3 Publisher-side signature. §5.3 mentions an extension where publishers sign their own submissions in addition to the registry signing its observation. Publisher signatures require the publisher to manage a keypair, which the DNS-based flow deliberately avoids. Initial proposal: defer to v2; collect signal from publishers on whether they want this property, and from labs on whether it would change their compliance posture.

10.4 Withdrawal semantics. A withdrawal is anchored as a new record (§6.1), not an erasure. This is correct for the audit trail (a lab that trained at time T1 and saw the opt-out should be able to verify it later, even if the publisher withdrew at T2 > T1). But it raises a UX question: does the lab's GET /v1/lookup return active opt-outs only, or all opt-outs with their effective windows? Initial proposal: active only by default, with a ?include_withdrawn=true flag for compliance teams that want the full history.
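
The proposed default can be sketched as a filter over records that carry an effective window. The record shape (effective_from, withdrawn_at) and the lookup function below are hypothetical and stand in for whatever GET /v1/lookup actually returns; only the include_withdrawn flag comes from the proposal above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class OptOut:
    domain: str
    effective_from: datetime                 # HYPOTHETICAL: when the opt-out took effect
    withdrawn_at: Optional[datetime] = None  # None while the opt-out is still active

    def active_at(self, t: datetime) -> bool:
        """True if the opt-out was in force at time t."""
        if t < self.effective_from:
            return False
        return self.withdrawn_at is None or t < self.withdrawn_at


def lookup(records: list[OptOut], domain: str, now: datetime,
           include_withdrawn: bool = False) -> list[OptOut]:
    """Active opt-outs only by default; full history when include_withdrawn is set."""
    matches = [r for r in records if r.domain == domain]
    if include_withdrawn:
        return matches
    return [r for r in matches if r.active_at(now)]
```

With the full history returned, a compliance team can evaluate active_at against its own training cutoff T1 rather than against the time of the request, which is the audit-trail case described above.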

10.5 Granularity of opt-out. v1 supports scope: "domain" and scope: "subdomain". v2 candidates: path, url_pattern, content_type, content_hash (per-asset opt-out). Each adds complexity to the lab's lookup flow. Initial proposal: ship v1 with domain/subdomain only; collect publisher feedback before expanding.

10.6 The Merkle batching primitive's home. The primitive is currently designed to live in the registry workspace (not in @akaeon/core-arweave). If a second batching consumer emerges, it can be hoisted to core.

10.7 Rate limits and pricing. Per-publisher submission rate, per-lab lookup rate, and pricing tiers are operational policy decisions not in scope for this technical specification. The runbook will document the rate limits when they're set.


Appendix A — Glossary

| Term | Definition |
|---|---|
| Anchoring | Writing a record's canonical hash and metadata to a public, permanent, third-party-operated network (Arweave) such that the record's existence at a specific time becomes publicly verifiable. |
| Canonical record | A deterministic JSON serialization of an opt-out's stated facts (domain, policy, scope, timestamps). The SHA-256 of this serialization is what gets committed to the Merkle tree. |
| Canonical message | The pipe-delimited UTF-8 string the registry signs with Ed25519. Format: <prefix>\|<components...>. Timestampless by design. |
| Inclusion proof | The sequence of Merkle-tree sibling hashes a verifier uses to reconstruct the batch root from a single leaf. |
| Leaf hash | SHA-256(0x00 \|\| canonical_record_bytes). The bottom-of-tree value committed to the Merkle root. |
| Merkle root | The top-of-tree hash that summarizes every leaf in the batch. Anchored on Arweave. |
| Opt-out | A publisher-declared preference against ML training on content under a specified scope (domain, subdomain, etc.). The unit of record in the registry. |
| Substrate | The shared, brand-neutral primitives (@akaeon/core-* packages) that both Stelais and the registry consume. |
| TXT challenge | A DNS TXT record the publisher publishes at _akaeon-registry-challenge.<domain> to prove control of the domain. |


This document is versioned. Changes appear in the changelog.