# Sigsum Logging Design v0

We propose sigsum logging. It is similar to Certificate Transparency, except that cryptographically **sig**ned check**sum**s are logged instead of X.509 certificates. Publicly logging sigsum statements allows anyone to discover which keys produced what checksum signatures. For example, malicious and unintended key-usage can be _detected_. We present our design and discuss a few use-cases like binary transparency and reproducible builds.

**Preliminaries.** You should have a basic understanding of cryptographic primitives, e.g., digital signatures, hash functions, and Merkle trees. You should also roughly know what problem Certificate Transparency solves and how.

**Warning.** This is a work-in-progress document that may be moved or modified. A future revision of this document will bump the version number to v1. Please let us know if you have any feedback.

## 1 - Introduction

Transparent logs make it possible to detect unwanted events. For example: are there any (mis-)issued TLS certificates [\[CT\]](https://tools.ietf.org/html/rfc6962), did you get a different Go module than everyone else [\[ChecksumDB\]](https://go.googlesource.com/proposal/+/master/design/25530-sumdb.md), or is someone running unexpected commands on your server [\[AuditLog\]](https://transparency.dev/application/reliably-log-all-actions-performed-on-your-servers/)? A sigsum log brings transparency to **sig**ned check**sum**s.

### 1.1 - Problem description

Suppose you are an entity that distributes some opaque data. For example, the opaque data might be a provenance file, an executable binary, an automatic software update, or a TPM quote. You claim to distribute the right opaque data to everyone. However, past incidents have taught us that talk is cheap and sometimes things go wrong. Trusted parties get compromised and lie about it [\[DigiNotar\]](), or they might not even realize it until later on because the break-in was stealthy [\[SolarWinds\]](https://www.zdnet.com/article/third-malware-strain-discovered-in-solarwinds-supply-chain-attack/).

The goal of sigsum logging is to make your claims verifiable by you and others. To keep the design simple and general, we want to achieve this goal with few assumptions about the opaque data or the involved claims. You can think of this as some sort of bottom line for what it takes to apply a transparent logging pattern. Past use-cases that wanted to piggy-back on an existing reliable log ecosystem fit well into our scope [\[BinTrans\]](https://wiki.mozilla.org/Security/Binary_Transparency).

We also want the design to be easy from many different perspectives, for example log operations and verification in constrained environments. This includes considerations such as simple parsing, protection against log spam and poisoning, and a well-defined gossip protocol without any complex auditing logic.

### 1.2 - Abstract setting

You would like users of the opaque data to _believe_ your claims. Therefore, we refer to you as a _claimant_ and your users as _believers_. Belief is going to be reasonable because each claim is expressed as a _signed statement_ that is transparency logged. The opaque data and relevant proofs of public logging are then distributed through a _repository_. Note that the repository is an abstract construct; for example, it may be a website or something else. A believer can now be convinced that public logging actually happened, so that a _verifier_ can discover any statement that you as a claimant produced.
If a statement turns out to contain a false claim, an _arbiter_ that can act on it is notified. An overview of these _roles_ and how they interact is shown in Figure 1. A party may play multiple roles. A role may also be fulfilled by multiple parties. Refer to the [claimant model](https://github.com/google/trillian/blob/master/docs/claimantmodel/CoreModel.md) for additional detail.

```
    statement  +----------+
    +----------| Claimant |----------+
    |          +----------+          |Data
    |                                |Proof
    v                                v
+---------+                   +------------+
|   Log   |                   | Repository |
+---------+                   +------------+
     |                              |    |
     |                              |    |Data
     |statements  +----------+ Data |    |Proof
     +----------->| Verifier |<-----+    |
                  +----------+           v
+---------+            |           +------------+
| Arbiter | <----------+           |  Believer  |
+---------+  false claim           +------------+

Figure 1: abstract setting
```

A claimant's statement encodes the following claim: _the right opaque data has a certain cryptographic hash_. It is stored in a sigsum log for discoverability. A claimant may add additional claims that are _implicit_ for each statement. An implicit claim is not stored by the log and is therefore communicated through policy. Examples of implicit claims:

- The opaque data can be located in repository Y using X as an identifier.
- The opaque data is a `.buildinfo` file that facilitates a reproducible build [\[R-B\]](https://wiki.debian.org/ReproducibleBuilds/BuildinfoFiles).

Detailed examples of use-case specific claimant models are defined in a separate document [\[CM-Examples\]](https://github.com/sigsum/sigsum/blob/main/doc/claimant.md).

### 1.3 - Design considerations

Our main contribution is in the details that surround the log role in practice. Below is a brief summary.

- **Preserved data flows:** a believer can enforce sigsum logging without making additional outbound network connections. Proofs of public logging are provided using the same distribution mechanism as before.
- **Sharding to simplify log life cycles:** starting to operate a log is easier than closing it down in a reliable way. We have a predefined shard interval that determines the time during which the log will be active.
- **Defenses against log spam and poisoning:** to maximize a log's utility it should be open for anyone to use. However, accepting logging requests from anyone at arbitrary rates can lead to abusive usage patterns. We store as little metadata as possible to combat log poisoning. We piggyback on DNS to combat log spam.
- **Built-in mechanisms that ensure a globally consistent log:** transparent logs rely on gossip protocols to detect forks. We built a proactive gossip protocol directly into the log. It is a variant of [witness cosigning]().
- **No cryptographic agility:** the only supported signature scheme is Ed25519. The only supported hash function is SHA256. Not having any cryptographic agility makes the protocol and the data formats simpler and more secure.
- **Simple (de)serialization parsers:** complex (de)serialization parsers increase attack surfaces and make the system more difficult to use in constrained environments. A claimant's sigsum statements are serialized using [Trunnel](https://gitlab.torproject.org/tpo/core/trunnel/-/blob/main/doc/trunnel.md). A sigsum log's statements are serialized using line-terminated ASCII [\[Checkpoint\]](). A sigsum log's HTTP(S) API uses line-terminated ASCII [\[SigsumAPI\]](). The required parsing is easy to implement yourself; a minimal sketch follows after this list.
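To give a feel for the last point, here is a minimal Go sketch that parses line-terminated ASCII into key-value pairs. The `key=value` syntax and the field names used in the example are assumptions made for illustration only; the authoritative wire formats are defined by the API and checkpoint documents referenced above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseASCII reads line-terminated "key=value" pairs into a map. The same
// key may occur more than once (e.g., one line per cosignature), so values
// are collected in slices. NOTE: the exact syntax is an assumption made
// for illustration; see the sigsum API documentation for the real format.
func parseASCII(input string) (map[string][]string, error) {
	kv := make(map[string][]string)
	scanner := bufio.NewScanner(strings.NewReader(input))
	for scanner.Scan() {
		line := scanner.Text()
		if line == "" {
			continue // tolerate a trailing newline
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		kv[parts[0]] = append(kv[parts[0]], parts[1])
	}
	return kv, scanner.Err()
}

func main() {
	// A hypothetical cosigned tree head response.
	resp := "tree_size=4711\nroot_hash=23f1...\ncosignature=88ab...\ncosignature=90cd...\n"
	kv, err := parseASCII(resp)
	if err != nil {
		panic(err)
	}
	fmt.Println("tree size:", kv["tree_size"][0])
	fmt.Println("number of cosignatures:", len(kv["cosignature"]))
}
```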
### 1.4 - Roadmap

First we describe our threat model. Then we give a bird's-eye view of the design. Finally, we go into greater detail using a question-answer format that is easy to extend and/or modify. The last part contains documentation TODOs.

## 2 - Threat model and (non-)goals

We consider a powerful attacker that gained control of a claimant's signing and release infrastructure. This covers a weaker form of attacker that is able to sign data and distribute it to a subset of isolated users. For example, this is essentially what the FBI requested from Apple in the San Bernardino case [\[FBI-Apple\]](https://www.eff.org/cases/apple-challenges-fbi-all-writs-act-order). The fact that signing keys and related infrastructure components get compromised should not be controversial these days [\[SolarWinds\]](https://www.zdnet.com/article/third-malware-strain-discovered-in-solarwinds-supply-chain-attack/).

The attacker can also gain control of the transparent log's signing key and infrastructure. This covers a weaker form of attacker that is able to sign log data and distribute it to a subset of isolated users. For example, this could have been the case when a remote code execution was found for a Certificate Transparency log [\[DigiCert\]](https://groups.google.com/a/chromium.org/g/ct-policy/c/aKNbZuJzwfM).

Any attacker that is able to position itself to control these components will likely be _risk-averse_. This is at minimum due to two factors. First, detection would result in a significant loss of capability that is by no means trivial to come by. Second, detection means that some part of the attacker's malicious behavior will be disclosed publicly.

Following from our introductory goal, we want to facilitate _discovery_ of sigsum statements. Such discovery makes it possible to detect attacks on a claimant's signing and release infrastructure. For example, a claimant can detect an unwanted sigsum by inspecting the log. It could be the result of a compromised signing key. The opposite direction is also possible. Anyone may detect that a repository is not serving data and/or proofs of public logging.

It is a non-goal to disclose the data that a cryptographic checksum represents _in the log_. It is also a non-goal to allow richer metadata that is use-case specific. The type of detection that a sigsum log supports is therefore more _coarse-grained_ when compared to Certificate Transparency. A significant benefit is that the resulting design becomes simpler, more general, and less costly to bootstrap into a reliable log ecosystem.

For security we need a collision resistant hash function and an unforgeable signature scheme. We also assume that at most a threshold of seemingly independent parties are adversarial to protect against split-views [\[Gossip\]]().

## 3 - Design

We consider a _claimant_ that claims to distribute the _right_ opaque data with cryptographic hash X. A claimant may add additional falsifiable claims. However, all claims must be digitally signed to ensure non-repudiation [\[CM\]](https://github.com/google/trillian/blob/master/docs/claimantmodel/CoreModel.md). A user should only use the opaque data if there is reason to _believe_ the claimant's claims. Therefore, users are called _believers_.

A good first step is to verify that the opaque data is accompanied by a valid digital signature. This corresponds to current practices where, say, a software developer signs new releases with `gpg` or `minisign -H`. The problem is that it is difficult to verify whether the opaque data is actually _the right opaque data_. For example, what if the claimant was coerced or compromised? Something malicious could be signed as a result.
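For reference, the "good first step" above boils down to a plain detached-signature check. The minimal Go sketch below (using Ed25519, the only signature scheme sigsum supports) illustrates that check, and also the gap it leaves: a valid signature shows which key signed, not whether the right opaque data was signed.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

func main() {
	// A claimant's signing key pair.
	pub, priv, _ := ed25519.GenerateKey(nil)

	// Current practice: sign the release and ship a detached signature.
	data := []byte("opaque data, e.g., an executable binary")
	sig := ed25519.Sign(priv, data)

	// The believer's naive check: a valid signature proves which key signed,
	// but says nothing about whether the *right* data was signed.
	fmt.Println("signature valid:", ed25519.Verify(pub, data, sig))

	// The SHA256 checksum of the opaque data is what a sigsum statement
	// would later cover, making this kind of signing event discoverable.
	fmt.Printf("checksum: %x\n", sha256.Sum256(data))
}
```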
A sigsum log adds _discoverability_ to a claimant's signed statements, see Figure 1. Such discoverability facilitates _verification of claims_. Verifiability is a significant improvement when compared to the blind trust that we had before.

### 3.1 - How it works

A sigsum log maintains a public append-only Merkle tree. Independent witnesses verify that this tree is fresh and append-only before cosigning it to achieve a distributed form of trust. A tree leaf contains four fields:

- **shard_hint**: a number that binds the leaf to a particular _shard interval_. Sharding means that the log has a predefined time during which logging requests are accepted. Once elapsed, the log can be shut down.
- **checksum**: a cryptographic hash of some opaque data. The log never sees the opaque data; just the hash.
- **signature**: a digital signature that is computed by a claimant over the leaf's shard hint and checksum.
- **key_hash**: a cryptographic hash of the claimant's verification key that can be used to verify the signature.

The signed statement encodes the following claim: "the right opaque data has cryptographic hash X". The claimant may also communicate additional implicit claims through policy. For example, "the opaque data can be located in repository Y" and "the opaque data facilitates a reproducible build".

A verifier that monitors the log ecosystem can discover new statements and contact an arbiter if any claim turns out to be false. Examples of verifiers in a reproducible builds system include third-party rebuilders. Ideally, a believer should only use a (supposedly) reproducible build artifact if it is accompanied by proofs of public logging.

Verifiers use the key hash field to determine which claimant produced a new statement. A hash, rather than the full verification key, is used to motivate verifiers to locate the key and make an explicit trust decision. Not disclosing verification keys in the log makes it less likely that someone would use an untrusted key _by mistake_.

#### 3.1.1 - Preparing a logging request (step 1)

A claimant selects a shard hint and a checksum that should be logged. The selected shard hint represents an abstract statement like "sigsum logs that are active during 2021". The selected checksum is the output of a cryptographic hash function. It could be the hash of an executable binary, a reproducible build recipe, etc.

The selected shard hint and checksum are signed by the claimant. A shard hint is incorporated into the signed statement to ensure that old log leaves cannot be replayed in a newer shard by a good Samaritan.

The claimant will also have to do a one-time DNS setup. As outlined below, the log will check that _some domain_ is aware of the claimant's verification key. This is part of a defense mechanism that combats log spam.

#### 3.1.2 - Submitting a logging request (step 2)

Sigsum logs implement an HTTP(S) API. Input and output are human-readable and use a simple ASCII format. A more complex format like JSON is not needed because the exchanged data structures are primitive enough.

A claimant submits their shard hint, checksum, signature, and public verification key as key-value pairs. The log uses the public verification key to check that the signature is valid, then hashes it to construct the leaf's key hash.

The claimant also submits a _domain hint_. The log will look up a DNS TXT resource record based on the provided domain name. The looked-up result must match the public verification key hash.
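As a rough illustration of steps 1 and 2, the Go sketch below signs a (shard hint, checksum) statement with Ed25519 and assembles a key-value request. The exact signed message format, the field names, the `key=value` syntax, the endpoint path, and the domain hint value are all assumptions made for illustration; the sigsum API documentation is authoritative.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"net/http"
)

// statementMessage builds the byte string that the claimant signs.
// NOTE: concatenating shard hint and checksum like this is only an
// assumption for illustration; the real format is defined elsewhere.
func statementMessage(shardHint uint64, checksum [sha256.Size]byte) []byte {
	buf := make([]byte, 8, 8+sha256.Size)
	binary.BigEndian.PutUint64(buf, shardHint)
	return append(buf, checksum[:]...)
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(nil)

	// Step 1: select a shard hint and a checksum, then sign the statement.
	shardHint := uint64(1640995200) // e.g., a shard that is active during 2022
	checksum := sha256.Sum256([]byte("opaque data, e.g., an executable binary"))
	sig := ed25519.Sign(priv, statementMessage(shardHint, checksum))

	// Step 2: submit the request as line-terminated ASCII key-value pairs.
	// Field names, syntax, URL, and domain hint are hypothetical.
	body := fmt.Sprintf(
		"shard_hint=%d\nchecksum=%x\nsignature=%x\nverification_key=%x\ndomain_hint=example.com\n",
		shardHint, checksum, sig, pub)

	resp, err := http.Post("https://log.example.org/sigsum/v0/add-leaf",
		"text/plain", bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("log responded with status:", resp.Status)
}
```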
By verifying that all claimants control a domain that is aware of their verification key, rate limits can be applied per second-level domain. As a result, you would need a large number of domain names to spam the log in any significant way. Using DNS to combat spam is convenient because many claimants already have a domain name. A single domain name is also relatively cheap. Another benefit is that the same anti-spam mechanism can be used across several independent logs without coordination. This is important because a healthy log ecosystem needs more than one log to be reliable. DNS also has built-in caching, which claimants can influence by setting their TTLs accordingly.

A claimant's domain hint is not part of the leaf because key management is more complex than that. A separate project should focus on transparent key management. Our work is related to transparent _key-usage_.

A sigsum log will _try_ to incorporate a leaf into its Merkle tree if a logging request is accepted. There are no _promises of public logging_ as in Certificate Transparency. Therefore, a claimant needs to wait for an inclusion proof before concluding that the logging request succeeded. Not having inclusion promises makes the entire log ecosystem less complex. The downside is that the resulting log ecosystem cannot guarantee low latency.

#### 3.1.3 - Proofs of public logging (step 3)

Claimants are responsible for collecting all cryptographic proofs that their believers will need to enforce public logging. These proofs are distributed using the same mechanism as the opaque data. A believer receives:

1. **Opaque data**: a claimant's opaque data.
2. **Shard hint**: a claimant's selected shard hint.
3. **Signature**: a claimant's signed statement.
4. **Checkpoint**: a log's signed tree head and a list of cosignatures from so-called _witnesses_.
5. **Inclusion proof**: a proof of inclusion that is based on the logged leaf and the above checkpoint.

Ideally, a believer should only accept the opaque data if these criteria hold:

- The claimant's signed statement verifies.
- The log's tree head can be reconstructed from the logged leaf and the provided inclusion proof.
- The log's checkpoint has enough valid (co)signatures.

Notice that there are no new outbound network connections for a believer. Therefore, a proof of public logging is only as convincing as the tree head that an inclusion proof leads up to. Sigsum logs have trustworthy tree heads because they use a variant of witness cosigning. A believer cannot be tricked into accepting some opaque data that has yet to be publicly logged unless the attacker controls more than a threshold of witnesses. In other words, witnesses are trust anchors that ensure verifiers see the same sigsum statements as believers.

Sigsum logging can facilitate detection of attacks even if a believer fails open or enforces the above criteria partially. For example, the fact that a repository mirror does not serve proofs of public logging could indicate that there is an ongoing attack against a claimant's distribution infrastructure. Interested parties can look for that.

_Monitoring_ -- as in inspecting the log for sigsums that interest you -- can be viewed as a separate 4th step. A monitor implements the verifier role and is necessarily ecosystem specific. For example, it requires knowledge of public verification keys, what the opaque data is, and where the opaque data is located.
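To make the believer-side criteria above concrete, here is a minimal Go sketch. It assumes RFC 6962-style Merkle tree hashing, treats the bytes covered by the log's signature and the cosignatures as opaque input, and leaves the exact leaf serialization and witness policy out of scope; those details are defined by the sigsum formats and by a believer's policy.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"fmt"
)

// leafHash and interiorHash use RFC 6962-style Merkle tree hashing. This is
// an assumption made for illustration; the exact hashing and serialization
// are defined by the sigsum formats.
func leafHash(leaf []byte) [sha256.Size]byte {
	return sha256.Sum256(append([]byte{0x00}, leaf...))
}

func interiorHash(left, right [sha256.Size]byte) [sha256.Size]byte {
	return sha256.Sum256(append(append([]byte{0x01}, left[:]...), right[:]...))
}

// verifyInclusion reconstructs a root hash from a leaf, its index, and an
// inclusion proof (audit path), then compares it against rootHash. This is
// the standard RFC 6962/9162 verification algorithm.
func verifyInclusion(leaf []byte, index, treeSize uint64, proof [][sha256.Size]byte, rootHash [sha256.Size]byte) bool {
	if index >= treeSize {
		return false
	}
	fn, sn, r := index, treeSize-1, leafHash(leaf)
	for _, p := range proof {
		if sn == 0 {
			return false
		}
		if fn%2 == 1 || fn == sn {
			r = interiorHash(p, r)
			for fn%2 == 0 && fn != 0 {
				fn, sn = fn>>1, sn>>1
			}
		} else {
			r = interiorHash(r, p)
		}
		fn, sn = fn>>1, sn>>1
	}
	return sn == 0 && r == rootHash
}

// checkpoint models what a believer needs from a cosigned tree head: the
// exact bytes covered by the log's signature and the witness cosignatures
// (their serialization is out of scope here), plus the tree head values.
type checkpoint struct {
	signedBytes  []byte
	treeSize     uint64
	rootHash     [sha256.Size]byte
	logSig       []byte
	cosignatures [][]byte // indexed like the believer's witness key list; nil if missing
}

// believerAccepts mirrors the criteria listed above: a valid claimant
// statement, a tree head that can be reconstructed from the leaf and the
// inclusion proof, and enough valid (co)signatures on the checkpoint.
func believerAccepts(claimantKey ed25519.PublicKey, statement, statementSig, leaf []byte,
	index uint64, proof [][sha256.Size]byte, cp checkpoint,
	logKey ed25519.PublicKey, witnessKeys []ed25519.PublicKey, threshold int) bool {
	if !ed25519.Verify(claimantKey, statement, statementSig) {
		return false
	}
	if !verifyInclusion(leaf, index, cp.treeSize, proof, cp.rootHash) {
		return false
	}
	if !ed25519.Verify(logKey, cp.signedBytes, cp.logSig) {
		return false
	}
	valid := 0
	for i, key := range witnessKeys {
		if i < len(cp.cosignatures) && cp.cosignatures[i] != nil &&
			ed25519.Verify(key, cp.signedBytes, cp.cosignatures[i]) {
			valid++
		}
	}
	return valid >= threshold
}

func main() {
	// Exercise the inclusion proof check on a tiny two-leaf tree.
	a, b := []byte("statement 0"), []byte("statement 1")
	root := interiorHash(leafHash(a), leafHash(b))
	proof := [][sha256.Size]byte{leafHash(b)}
	fmt.Println("leaf 0 included:", verifyInclusion(a, 0, 2, proof, root))
}
```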
### 3.2 - Summary

Sigsum logs are sharded and shut down at predefined times. A sigsum log can shut down _safely_ because verification on the believer's side is not interactive. The difficulty of bypassing public logging is based on the difficulty of controlling enough independent witnesses. A witness verifies that a log's checkpoint is correct before cosigning it.

Claimants, verifiers, and witnesses interact with the log using an HTTP(S) API. A claimant must prove that they own a domain name as an anti-spam mechanism. Believers interact with the log _indirectly_ through their claimant's existing distribution mechanism. It is the claimant's job to log sigsums and distribute the necessary proofs of public logging. It is the verifier's job to look for new statements in the log and alert an arbiter if any claim is false.

An overview of the entire system is provided in Figure 2.

```
TODO: add complete system overview. See drafty figure in archive.
- Make terminology consistent with Figure 1
  - E.g., s/Monitor/Verifier
  - E.g., s/leaves/statements
- Add arbiter
```

## 4 - A peek into the details

Our bird's-eye view introduction skipped many details that matter in practice. Some of these details are presented here using a question-answer format. A question-answer format is helpful because it is easily modified and extended.

### 4.1 - What is the point of having a shard hint?

Unlike X.509 certificates, which already have validity ranges, a checksum does not carry any such information. Therefore, we require that a claimant selects a _shard hint_. The selected shard hint must be in the log's _shard interval_. A shard interval is defined by a start time and an end time. Both ends of the shard interval are inclusive and expressed as the number of seconds since the UNIX epoch (January 1, 1970 00:00 UTC).

Without sharding, a good Samaritan could add all leaves from an old log into a newer one that just started its operations. This makes log operations unsustainable in the long run because log sizes would grow indefinitely. Such re-logging also comes at the risk of activating someone else's rate limits.

Note that _the claimant's shard hint is not a verified timestamp_. The submitter should set the shard hint as large as possible. If a roughly verified timestamp is needed, a cosigned tree head can be used instead.

### How is the threat of log spam and poisoning reduced?

- Relates to: "why not log richer metadata and why not store the opaque data"
- Relates to: "why we removed the identifier field from the leaf"
- Relates to: domain hint (maybe better as a separate heading)

### What are the details for witness cosigning?

- Relates to: explain `tree-head-latest`, `tree-head-to-sign` and `tree-head-cosigned`

### What cryptographic primitives are supported?

### What (de)serialization parsers are needed?

### Are there any privacy concerns?

### Other

- How does it work with more than one log?
- What policy should a believer use?
- Coarse-grained vs fine-grained detectability properties

## Concluding remarks