VLS Channel-State Splitbrain - The Bug and Its Correct Fix

TL;DR

Core Lightning hit a bug integrating a remote VLS signer: if the node crashes at exactly the wrong moment (after VLS signs a new channel state, but before the node persists that fact), the node and the signer end up disagreeing about the latest state index. On restart, they are "split-brained."

The tempting fix is to drop VLS's policy-commitment-retry-same so it will re-sign at the same index. That fix is unsafe. From the signer's point of view, the crash-and-recover sequence is indistinguishable from a deliberate theft attempt, so allowing one allows the other, and a compromised node could steal funds.

The correct fix lives in the node software, not VLS: write the intent to sign before signing, as in a write-ahead log. After a crash, the node then re-converges with the signer instead of diverging. For VLS itself, this is a wontfix.

Introduction

Core Lightning (CLN) recently reported an issue with its Validating Lightning Signer (VLS) integration, where an unfortunately-timed crash of the Core Lightning node would cause VLS and CLN to fall out of synchronization with each other.

The initial proposed solution was for VLS to drop its policy-commitment-retry-same (one of VLS's policy controls). This policy prevents VLS from signing a new channel state for the remote side at any state index N if it already signed a different channel state at state index N.

With this policy, VLS will only sign if:

The inputs to the signer are exactly the same as the most recent previous call.
OR
The state index is one higher than the most recent previous call and the previous-previous state was already revoked.
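
The two conditions above can be sketched as a small check. This is a toy model, not the VLS implementation: `SignRequest`, `ToySigner`, `revoked_below`, and the use of an HTLC set as a stand-in for "the inputs to the signer" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SignRequest:
    index: int             # counterparty commitment state index
    htlcs: FrozenSet[str]  # hypothetical stand-in for the full signer inputs


class ToySigner:
    """Illustrative model only; not the VLS implementation."""

    def __init__(self, index: int) -> None:
        self.last = SignRequest(index, frozenset())  # most recent signed request
        self.revoked_below = index  # every state with index < this is revoked

    def revoke(self, index: int) -> None:
        """Record that the counterparty revoked state `index`."""
        self.revoked_below = max(self.revoked_below, index + 1)

    def may_sign(self, req: SignRequest) -> bool:
        if req == self.last:
            return True  # condition 1: exact repeat of the most recent call
        # condition 2: advance by exactly one, and the state before the most
        # recently signed one (the "previous-previous" state) is revoked
        return (req.index == self.last.index + 1
                and self.revoked_below >= self.last.index)

    def sign(self, req: SignRequest) -> str:
        if not self.may_sign(req):
            raise PermissionError("policy-commitment-retry-same")
        self.last = req
        return f"sig<{req.index}>{sorted(req.htlcs)}"


signer = ToySigner(index=9)                       # N - 1 = 9
h1 = SignRequest(10, frozenset({"H1"}))
h1_h2 = SignRequest(10, frozenset({"H1", "H2"}))
signer.sign(h1)                                   # advance 9 -> 10: allowed
retry_ok = signer.may_sign(h1)                    # same inputs again: allowed
alt_ok = signer.may_sign(h1_h2)                   # different state at 10: refused
```

In this model, a second request at index 10 is accepted only when it is identical to the first, which is the behaviour the policy describes.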

The argued reason for dropping this policy is that, as long as state index N - 1 of the remote side has not been revoked by the remote side, the node should still be able to advance. That is, if the Core Lightning node negotiates a different state at N, it should still be safe for Validating Lightning Signer to sign that alternate reality. In this case, the argument goes, VLS should "follow along" with what Core Lightning believes the state should be, because that is what it negotiates with its peer.

However, it turns out that this policy is absolutely necessary for correct operation, as this writeup will show.

The correct solution, it turns out, is to change any Lightning Network node implementation that wants to integrate with VLS, and not to change VLS. And the correct, performant version of that solution requires a slight, but radical, rethinking of when to save the state of the channel.

The Issue Report

The issue as reported gives the following user story:

Actors:
- A, a Lightning Network node that integrates a remote signer.
- V, a Validating Lightning Signer that provides signer capability to the above A.
- R, a Lightning Network node that is peered with and has a channel with A.
Start at state index N - 1.
A and R exchange some update_* messages with each other. Suppose R sends an update_add_htlc for a new HTLC, let us call it H1.
A decides it wants to advance to the next state, and asks V to sign a new R-side state with the new H1 HTLC, at state index N.
- This is done by sending A -> hsmd_sign_commitment_tx -> V.
V signs and writes this fact to its persistent storage, then releases the signature to A.
- This is done by sending V -> hsmd_sign_commitment_tx_reply -> A.
A CRASHES before it can receive the message.
- Since A would only have saved the signature into its persistent storage after receiving the message, A FORGETS that there is an already-signed new state at N.
A restarts and reconnects to both R and V.
A and R exchange channel_reestablish.
- Because A never received the signature for the R-side state at index N, it thinks that the latest state is still N - 1.
A and R exchange some new update_* messages. Suppose that this time R sends:
- update_add_htlc to add H1 (which was sent before, but is still included now).
- update_add_htlc to add H2.
A decides to advance to the next state, and asks V to sign a new R-side state with H1 and H2 HTLCs at state index N.
ERROR: V sees the attempt to repeat the request at index N with a different set of HTLCs H1 and H2 and reports that A is compromised.
- This is the policy-commitment-retry-same!

The proposed solution is to drop the policy-commitment-retry-same, which would let V sign the new state and let both V and A update to the alternate version.

Exploiting The Lack Of `policy-commitment-retry-same`

Unfortunately, a Validating Lightning Signer which does not enforce policy-commitment-retry-same can lose funds!

The exploit is walked through step by step below.

More importantly, take the point of view of the Validating Lightning Signer V. In both the exploit below and the issue user story above, V sees the exact same sequence of events. Thus, preventing the exploit below is equivalent to causing the issue reported above!

At a glance: the same events, two intentions

The two timelines below are the honest crash (the reported bug) and the malicious "pretend crash" (the exploit). Watch only the V lane: the messages VLS receives and sends are identical in both. VLS has no way to tell them apart, so any change that lets the honest retry in the first timeline succeed necessarily lets the exploit in the second timeline succeed too.

Honest crash (the reported bug)

sequenceDiagram
    participant V as VLS
    participant A as Node A
    participant R as Peer R
    Note over A,R: state index N-1
    R->>A: update_add_htlc H1
    A->>V: sign_commitment_tx N (H1)
    V->>V: persist "signed N"
    V-->>A: reply: sig N(H1)
    Note over A: CRASH before persisting,<br/>forgets state N
    Note over A: restart, believes state = N-1
    A->>R: channel_reestablish
    R->>A: update_add_htlc H1
    R->>A: update_add_htlc H2
    A->>V: sign_commitment_tx N (H1, H2)
    Note over V: retry at N, different HTLCs<br/>policy-commitment-retry-same: REJECT

Malicious "pretend crash" (the exploit)

sequenceDiagram
    participant V as VLS
    participant A as Node A (compromised)
    participant R as Peer R (attacker)
    Note over A,R: state index N-1
    R->>A: pretend update_add_htlc H1
    A->>V: sign_commitment_tx N (H1)
    V->>V: persist "signed N"
    V-->>A: reply: sig N(H1)
    A->>R: leak sig N(H1)
    Note over A: PRETEND crash + restart,<br/>claims state = N-1
    A->>R: channel_reestablish
    R->>A: update_add_htlc H1
    R->>A: update_add_htlc H2
    A->>V: sign_commitment_tx N (H1, H2)
    Note over V: no policy: SIGN sig N(H1,H2)
    Note over R: later publishes sig N(H1) on-chain,<br/>H2 never lands: funds stolen

The table below makes the overlap exact. Read each row across the two columns: every step is identical from V's point of view, except the two steps where only the node's intent differs (steps 3 and 4: a real crash versus a pretended one) and the final outcome.

| Step | Honest crash (the bug) | Malicious node (the exploit) |
| --- | --- | --- |
| 1 | `R` adds HTLC `H1`; `A` asks `V` to sign state `N` (`H1`) | `R` and `A` pretend to add `H1`; `A` asks `V` to sign state `N` (`H1`) |
| 2 | `V` signs `<N> H1`, persists it, returns the signature | `V` signs `<N> H1`, persists it, returns the signature |
| 3 (intent) | `A` crashes before persisting the signature | `A` pretends to crash (and leaks `<N> H1` to `R`) |
| 4 (intent) | `A` restarts, genuinely believing state is `N - 1` | `A` pretends to be back at `N - 1` |
| 5 | `R` adds `H1` + `H2`; `A` asks `V` to sign `N` (`H1,H2`) | `R` adds `H1` + `H2`; `A` asks `V` to sign `N` (`H1,H2`) |
| What `V` saw | sign `N`(`H1`) → reply → sign `N`(`H1,H2`) | sign `N`(`H1`) → reply → sign `N`(`H1,H2`) |
| Outcome | Legitimate retry rejected: the reported bug | If the policy is dropped, `V` signs `<N> H1,H2`; `R` later publishes `<N> H1` on-chain: funds stolen |
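
The overlap can also be checked mechanically. The sketch below transcribes the two timelines as lists of (channel, message) pairs and filters them down to what a given actor takes part in; the event encoding and the `seen_by` helper are illustrative assumptions, not part of any real protocol trace format.

```python
# Each event is (channel, message); the channel names sender and receiver.
HONEST = [
    ("R->A", "update_add_htlc H1"),
    ("A->V", "sign_commitment_tx N (H1)"),
    ("V->A", "reply: sig N(H1)"),
    ("A", "crash before persisting; restart at N-1"),
    ("A->R", "channel_reestablish"),
    ("R->A", "update_add_htlc H1"),
    ("R->A", "update_add_htlc H2"),
    ("A->V", "sign_commitment_tx N (H1, H2)"),
]
MALICIOUS = [
    ("R->A", "pretend update_add_htlc H1"),
    ("A->V", "sign_commitment_tx N (H1)"),
    ("V->A", "reply: sig N(H1)"),
    ("A->R", "leak sig N(H1)"),
    ("A", "pretend crash; claim state N-1"),
    ("A->R", "channel_reestablish"),
    ("R->A", "update_add_htlc H1"),
    ("R->A", "update_add_htlc H2"),
    ("A->V", "sign_commitment_tx N (H1, H2)"),
]


def seen_by(actor, timeline):
    """Messages the given actor sends or receives, in order."""
    return [msg for channel, msg in timeline if actor in channel.split("->")]


same_for_v = seen_by("V", HONEST) == seen_by("V", MALICIOUS)  # V's view
same_for_r = seen_by("R", HONEST) == seen_by("R", MALICIOUS)  # R's view
```

Only V's view is the same in both lists; R's view differs because R receives the leaked signature. Whatever rule V applies to one timeline, it necessarily applies to the other.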

In the story below, suppose we did not enforce the policy-commitment-retry-same that caused the issue described above:

Actors:
- A, a Lightning Network node that integrates a remote signer.
  - In actuality, A has been compromised by the actor R and will now coordinate with R to steal funds.
- V, a Validating Lightning Signer that provides signer capability to the above A.
- R, a Lightning Network node that is peered with and has a channel with A.
- S, another Lightning Network node that has a channel with A.
  - In actuality, R and S (and A, which was compromised by R) are controlled by the same entity.
Start at state index N - 1.
R and A pretend to negotiate an update_add_htlc from R for an HTLC H1.
A requests V to sign a new R-side state at index N, with HTLC H1.
- A -> hsmd_sign_commitment_tx -> V.
V signs the R-side state index N with HTLC H1, stores it in V persistent storage, and releases the signature to A.
- Let us call this signature <N> H1.
- V -> hsmd_sign_commitment_tx_reply -> A.
A receives the signature <N> H1 and sends it to R, and then PRETENDS TO CRASH.
A PRETENDS TO RESTART and to have renegotiated, via channel_reestablish, that the R-side state index was at N - 1 (because it is pretending that it never received the signature <N> H1).
A and R pretend to renegotiate two new update_*s:
- update_add_htlc to add H1.
- update_add_htlc to add H2.
A pretends to "re-advance" from the state N - 1 to state N due to having crashed and never receiving the previous hsmd_sign_commitment_tx_reply. It now sends a new hsmd_sign_commitment_tx to V.
V, because it is NOT enforcing the policy-commitment-retry-same (because we were trying to solve the reported issue), signs the new R-side state with H1 and H2. Call this signature <N> H1,H2.
A pretends to receive this new signature, but throws it away.
R sends the revocation for R-side state at index N - 1 to A.
A sends the revocation for N - 1 to V, via the message hsmd_validate_revocation.
- A -> hsmd_validate_revocation -> V.
V now believes that H1 and H2 are "irrevocably committed", because the only valid R-side state is N (and specifically, <N> H1,H2).
- (BOLT nerds: yes, I know that is not the full definition of "irrevocably committed"; I am eliding some details.)
A now tells V that it wants to forward H2 to the other peer S.
V allows the forwarding of H2 from R -> H2 -> A -> H2 -> S (and completes the necessary sign-revoke cycle to instantiate it) based on its knowledge of the state of the R-A channel.
- It believes it is now on <N> H1,H2.
THE EXPLOIT HAPPENS HERE.
- R takes the <N> H1 state signature --- the one that only has H1 and not H2 --- and publishes it onchain.
- Because it is at N and not N - 1, this state cannot be penalized as revoked: R only gave the revocation key for N - 1, not N.
- The compromised node A thus loses the value of H2:
  - It sent out the amount H2 to S (really controlled by R).
  - The incoming HTLC H2 on the R -> A channel is not present on the onchain version of the channel state, so it did not receive that amount!

Thus, Validating Lightning Signer cannot drop the policy-commitment-retry-same in order to solve the issue!

Why this is the crux: the signer cannot tell the honest crash from the malicious "pretend crash"; it observes the identical message sequence in both. That is exactly why a compromised node is the threat model VLS is designed to contain. For the bigger picture on why "non-custodial" alone doesn't answer this, see The Lightning Security Spectrum.

The Correct Solution

...is to modify everyone except Validating Lightning Signer, of course. In other words: wontfix.

When typical Lightning Network node software decides to sign the counterparty commitment, it usually does so in this order:

Call "sign" procedure.
Write the fact that we signed to disk.
Send commitment_signed.

The issue here is that there is a race condition between the "sign" procedure and "write the fact we signed":

Call "sign" procedure.
- CRASH!
  - Node thinks we are at N - 1.
  - VLS thinks we are at N.
  - On restart, node tries to make a different N, which VLS rejects as per policy-commitment-retry-same.
Write the fact that we signed to disk.
Send commitment_signed.

Instead, the correct order is:

Write the intent to sign the new state.
Call "sign" procedure.
Send commitment_signed.

When a crash happens:

Write the intent to sign the new state.
- CRASH!
  - Node thinks we are at N.
  - VLS thinks we are at N - 1.
  - On restart, node asks to sign at N, which VLS accepts because it advances the state from N - 1 to N.
Call "sign" procedure.
Send commitment_signed.
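
The two orderings above can be compared with a small crash simulation. Everything here is a toy: `ToySigner`, `Node`, `Crash`, and the tuple payloads are hypothetical stand-ins, the signer omits revocation tracking, and the peer and `commitment_signed` are left out entirely.

```python
from typing import Tuple

Payload = Tuple[str, ...]  # hypothetical stand-in for the commitment contents


class Crash(Exception):
    """Simulated process crash."""


class ToySigner:
    """Enforces retry-same; revocation tracking is omitted for brevity."""

    def __init__(self, index: int) -> None:
        self.index, self.payload = index, ()

    def sign(self, index: int, payload: Payload) -> str:
        same = (index, payload) == (self.index, self.payload)
        if not same and index != self.index + 1:
            raise PermissionError("policy-commitment-retry-same")
        self.index, self.payload = index, payload  # persisted before replying
        return f"sig<{index}>{payload}"


class Node:
    def __init__(self, signer: ToySigner, index: int) -> None:
        self.signer = signer
        self.disk = (index, ())  # the only node state that survives a crash

    def sign_then_write(self, payload: Payload, crash: bool = False) -> str:
        index = self.disk[0] + 1
        sig = self.signer.sign(index, payload)
        if crash:
            raise Crash()  # signer advanced, disk did not
        self.disk = (index, payload)
        return sig

    def write_then_sign(self, payload: Payload, crash: bool = False) -> str:
        self.disk = (self.disk[0] + 1, payload)  # intent is durable first
        if crash:
            raise Crash()  # disk advanced, signer did not
        return self.signer.sign(*self.disk)

    def redo(self) -> str:
        """On restart, repeat the signing step for the logged intent."""
        return self.signer.sign(*self.disk)


def run_sign_then_write() -> str:
    node = Node(ToySigner(9), 9)
    try:
        node.sign_then_write(("H1",), crash=True)
    except Crash:
        pass
    try:
        node.sign_then_write(("H1", "H2"))  # renegotiated, different state N
        return "signed"
    except PermissionError:
        return "rejected"


def run_write_then_sign() -> str:
    node = Node(ToySigner(9), 9)
    try:
        node.write_then_sign(("H1",), crash=True)
    except Crash:
        pass
    return node.redo()  # signer sees a plain advance from N - 1 to N
```

In this model, `run_sign_then_write()` ends with the signer rejecting the node, while `run_write_then_sign()` recovers. Calling `redo()` after a crash that happens after signing is also accepted, because it repeats the same inputs at the same index.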

This solution feels wrong. Here's why it isn't. Our usual intuition is that writing to disk should be the last thing we do ***before*** we send anything to the counterparty: we make sure ***everything*** has been done correctly, then update our on-disk state, and only then send the message to the remote side. To assuage the above "feels bad man," let us turn to databases and write-ahead logs.

Database And Intent Logs

What a write-ahead log is

When a modern database storage engine has to commit a transaction, it typically uses a technique called a "write-ahead log", sometimes called an "intent log".

A "write-ahead log" is an append-only structure stored on the disk, where the storage engine writes out the instructions to update the actual table storage.

Once the database has written out the instructions to update the table storage, it then signals the transaction as completely committed, without actually updating the table storage.

Why it's fast

As a linear append-only structure, writes to the write-ahead log are very fast. With spinning rust, the drive can write the log entry on contiguous sectors. Even with flash and other non-volatile memories, the Flash Translation Layer will often balance write wear by internally creating long sequences of cells that it will write in sequence, also naturally mapping to long sequences of the "blocks" it presents to the operating system.

When reading from persistent storage, the database first scans the write-ahead log linearly, in reverse order (so that the latest transactions are checked first), before it looks into the table storage at all. This sacrifices read throughput for write throughput, but it is a good trade. The database can cache data in memory and serve most reads from that cache, so it rarely has to read from persistent storage. It absolutely has to get every committed transaction onto persistent storage, though (to keep its Durability promise), so biasing towards fast disk writes at the cost of slower disk reads is the better choice.
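
The read path described above can be sketched in a few lines. This is an in-memory toy under simplifying assumptions: the `wal` list and `table` dict stand in for on-disk structures, and each log entry is a hypothetical absolute "set key to value" instruction.

```python
from typing import Dict, List, Optional, Tuple

wal: List[Tuple[str, int]] = []    # append-only log of (key, new_value)
table: Dict[str, int] = {"a": 1}   # the slower "table storage"


def write(key: str, value: int) -> None:
    """Commit by appending to the log; the table is not touched."""
    wal.append((key, value))


def read(key: str) -> Optional[int]:
    """Consult the log newest-first, then fall back to the table."""
    for logged_key, value in reversed(wal):
        if logged_key == key:
            return value
    return table.get(key)


write("a", 2)
write("b", 5)
write("a", 3)
latest_a = read("a")   # newest log entry for "a" wins over the table
```

Reads see the committed values even though `table` still holds the old ones; the cost is the scan over the log on every uncached read.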

Checkpointing the write-ahead log

Obviously space will run out if we just keep appending to the write-ahead log. So periodically, the database storage engine will actually read from the write-ahead log, and perform the updates written into the log on the actual table storage. This is called "checkpointing".

Once the entire write-ahead log has been applied to the table storage, the storage engine can clear the entire log. If the log is a separate file, it can simply truncate it. Otherwise it can invalidate the whole log contents in one step, for example by encrypting log entries under one key and then changing that key completely: every old entry then fails MAC validation, so the log is "empty" because it holds no valid data. This neatly provides integrity checking and torn-write protection as well, and if the encryption derives a per-block key from the current key and the block index, it protects against misdirected writes too.

Now, suppose the database machine crashes after it has written to the write-ahead log. In that case, on restart, the database storage engine simply resumes operating as usual: reads are first looked up in the write-ahead log, writes append to the write-ahead log, and checkpointing is initiated if the write-ahead log is too large.

The really neat part is that it does not matter whether the crash happens while the storage engine is in the middle of checkpointing or not!

Even if the storage engine was checkpointing, and the table storage is in a halfway state where only half a transaction is non-atomically updated in the actual table, the view of the storage engine is still the same fully ACID view. This is because the write-ahead log is referred to first when reading, and the write-ahead log contains the entire transaction. Even if the update of the table storage is not complete, on restart, checkpointing can be restarted and the writes re-done regardless of whether or not the entry was already written to the table storage before the crash.
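
A crash in the middle of checkpointing can be simulated with a toy model. The sketch assumes log entries are absolute "set key to value" instructions, which is what makes replaying them harmless; `Crash`, `checkpoint`, and `crash_before` are hypothetical names used only for this illustration.

```python
class Crash(Exception):
    """Simulated machine crash."""


def read(key, wal, table):
    """Log first (newest entry wins), then table storage."""
    for logged_key, value in reversed(wal):
        if logged_key == key:
            return value
    return table.get(key)


def checkpoint(wal, table, crash_before=None):
    """Apply every log entry to the table, then clear the log."""
    for i, (key, value) in enumerate(wal):
        if i == crash_before:
            raise Crash()
        table[key] = value   # setting the same value twice is harmless
    wal.clear()              # only after every entry has been applied


# One committed transaction: move 50 from alice to bob.
wal = [("alice", 50), ("bob", 150)]
table = {"alice": 100, "bob": 100}

try:
    checkpoint(wal, table, crash_before=1)   # crash with half the tx applied
except Crash:
    pass

half_applied = dict(table)                   # table alone is inconsistent
view_after_crash = (read("alice", wal, table), read("bob", wal, table))

checkpoint(wal, table)                       # restart: redo from the beginning
```

After the simulated crash the table holds only half the transfer, yet reads through the log still return the full transaction, and rerunning the checkpoint from the start finishes the job.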

This means that as soon as the database writes a complete transaction to the write-ahead log (with protection against "torn writes" where a transaction is not completely written because the machine crashed while the disk was writing; this is usually done by checksumming entire transaction entries), it can now treat the transaction as "committed".
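
Checksumming whole entries can be sketched as follows. The framing used here (a 4-byte length and a 4-byte CRC-32 in front of each payload) is an assumption chosen for illustration, not the format of any particular database.

```python
import struct
import zlib
from typing import List

HEADER = struct.Struct(">II")  # payload length, CRC-32 of the payload


def encode_entry(payload: bytes) -> bytes:
    """Frame one transaction entry as header + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_log(buf: bytes) -> List[bytes]:
    """Return the entries that were written completely, in order."""
    entries, pos = [], 0
    while pos + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        payload = buf[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt tail: treat it as never committed
        entries.append(payload)
        pos = start + length
    return entries


log = encode_entry(b"tx1") + encode_entry(b"tx2")
complete = decode_log(log)
torn = decode_log(log[:-1])   # the machine died one byte short of the end
```

A partially written final entry fails the length or checksum test and is ignored, so only complete transactions count as committed.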

What I want to focus on is that "as soon as we have written the update instructions for the entire transaction to disk, we have atomically updated the database regardless of whether the table storage was updated or not".

Basically, even though the database storage engine has not actually applied the update instructions to the table (it has only written them to the log), it already treats the transaction as "committed": it is now Atomic and Durable on the disk, as far as the database storage engine is concerned.

Thus, in database storage engines, writing your intent to update the table to the log is equivalent to atomically updating the table.

More generally, writing your intent to DO SOMETHING is equivalent to atomically DOING SOMETHING.

Mapping it back to Lightning

Let us now map that back to our own situation over here in Lightning-land:

Writing our intent to sign the commitment transaction is equivalent to atomically signing the commitment transaction.

In particular, in this view, Validating Lightning Signer and the remote peer are the "table storage", and the storage of the node, where it writes the "intent to sign", is the "intent log" or "write-ahead log".

Requesting a signature from VLS and sending that signature to the remote peer via commitment_signed is the equivalent of "checkpointing" from the "intent log" to the "table storage". So even if VLS and the remote peer are not yet up-to-date, the node has the intent log and can repeat those operations on restart, regardless of whether or not it had already performed them before the crash.

Thus, the proposed solution:

Write the intent to sign the new state. = write a log entry to the WAL.
Call "sign" procedure. = write the logged update to the table storage copy 1 (VLS).
Send commitment_signed. = write the logged update to the table storage copy 2 (remote peer).

In effect, we have a "write-ahead log" that is size-limited to only one entry, and as soon as we make that entry, we immediately have to "checkpoint" (due to hitting the maximum size of the write-ahead log) by updating VLS and the remote peer. And as noted, it is immaterial, on restart, whether the checkpointing was completed or not; on restart, we simply need to re-do the signing and commitment_signed once again.
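
A one-entry intent log could look roughly like the sketch below. The JSON file layout and the `write_intent`, `redo_intent`, `fake_sign`, and `fake_send` names are assumptions for illustration; a real node would use its own database and its real signer and peer connections. Fully durable renames may also need the containing directory synced, which this sketch skips.

```python
import json
import os
import tempfile
from typing import Callable, List


def write_intent(path: str, index: int, htlcs: List[str]) -> None:
    """Durably record the state we intend to sign, before signing it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"index": index, "htlcs": htlcs}, f)
        f.flush()
        os.fsync(f.fileno())   # force the bytes to disk before the rename
    os.replace(tmp, path)      # swap in the new entry in one step


def redo_intent(path: str, sign: Callable, send: Callable) -> None:
    """'Checkpoint' the logged intent: sign it, then send commitment_signed."""
    with open(path) as f:
        intent = json.load(f)
    sig = sign(intent["index"], intent["htlcs"])
    send(intent["index"], sig)


calls = []


def fake_sign(index, htlcs):            # stand-in for the signer round trip
    calls.append(("sign", index, tuple(htlcs)))
    return f"sig<{index}>"


def fake_send(index, sig):              # stand-in for sending to the peer
    calls.append(("commitment_signed", index, sig))


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "intent.json")
    write_intent(path, 10, ["H1"])
    redo_intent(path, fake_sign, fake_send)   # normal path
    redo_intent(path, fake_sign, fake_send)   # same steps after a restart
```

Because the redo step reads only the logged intent, running it again after a restart issues exactly the same signing request and the same message, wherever the crash happened.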