VLS Channel-State Splitbrain - The Bug and Its Correct Fix
A badly-timed Core Lightning crash can desync a node from its VLS signer. The obvious fix enables theft; the correct one is a write-ahead log.
TL;DR
Core Lightning hit a bug integrating a remote VLS signer: if the node crashes at exactly the wrong moment (after VLS signs a new channel state, but before the node persists that fact), the node and the signer end up disagreeing about the latest state index. On restart, they are "split-brained."
The tempting fix is to drop VLS's policy-commitment-retry-same so it will re-sign at the same index. That fix is unsafe. From the signer's point of view, the crash-and-recover sequence is indistinguishable from a deliberate theft attempt, so allowing one allows the other, and a compromised node could steal funds.
The correct fix lives in the node software, not in VLS: write the intent to sign before signing, in the style of a write-ahead log. After a crash, the node then re-converges with the signer instead of diverging. For VLS itself, this is a wontfix.
Introduction
Core Lightning (CLN) recently reported an issue with its Validating Lightning Signer (VLS) integration, where an unfortunately-timed crash of the Core Lightning node would cause VLS and CLN to fall out of synchronization with each other.
The initial proposed solution was for VLS to drop its
policy-commitment-retry-same (one of VLS's
policy controls).
This policy prevents VLS from signing a new channel state
for the remote side at any state index N if it already
signed a different channel state at state index N.
With this policy, VLS will only sign if:
- The inputs to the signer are exactly the same as the
most recent previous call, OR
- The state index is one higher than the most recent previous call and the previous-previous state was already revoked.
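The two conditions above can be sketched as a small check. This is only an illustration of the rule as described here, not VLS's actual code: the `SignerChannelState` layout, the `may_sign` name, and the use of HTLC names as the signer "inputs" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class SignerChannelState:
    """What a signer might remember per channel (hypothetical layout)."""
    last_index: int                  # index of the most recently signed state
    last_inputs: Tuple[str, ...]     # inputs of the most recent sign call
    revoked: Set[int] = field(default_factory=set)  # indices the peer revoked


def may_sign(state: SignerChannelState, index: int, inputs: Tuple[str, ...]) -> bool:
    """Return True if the retry-same rule, as described above, allows signing."""
    # Condition 1: an exact retry of the most recent call.
    if index == state.last_index and inputs == state.last_inputs:
        return True
    # Condition 2: advance by one, with the previous-previous state revoked.
    if index == state.last_index + 1 and (state.last_index - 1) in state.revoked:
        return True
    return False


# The signer already signed index 5 with H1; index 4 is not yet revoked.
s = SignerChannelState(last_index=5, last_inputs=("H1",), revoked={3})
print(may_sign(s, 5, ("H1",)))        # exact retry
print(may_sign(s, 5, ("H1", "H2")))   # same index, different inputs
s.revoked.add(4)
print(may_sign(s, 6, ("H1", "H2")))   # advance after revocation of index 4
```

In this sketch, the second call is the one the reported issue runs into: same index, different HTLC set.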
The argument for dropping this policy is that, as long as
the remote side has not revoked its state index N - 1, the
node should still be able to advance to a different state
at N.
That is, if the Core Lightning node negotiates a different
state at N, it should still be safe for Validating
Lightning Signer to sign that alternate reality.
In this case, VLS should "follow along" with
what Core Lightning believes the state should be, because
that is what it negotiates with its peer.
However, it turns out that this policy is absolutely necessary for correct operation, as this writeup will show.
The correct solution, it turns out, is to change any Lightning Network node implementation that wants to integrate with VLS, and not to change VLS. And the correct, performant version of that solution requires a slight, but radical, rethinking of when to save the state of the channel.
The Issue Report
The issue as reported gives the following user story:
- Actors:
  - A, a Lightning Network node that integrates a remote signer.
  - V, a Validating Lightning Signer that provides signer capability to the above A.
  - R, a Lightning Network node that is peered with and has a channel with A.
- Start at state index N - 1.
- A and R exchange some update_* messages with each other. Suppose R sends an update_add_htlc for a new HTLC, let us call it H1.
- A decides it wants to advance to the next state, and asks V to sign a new R-side state with the new H1 HTLC, at state index N.
  - This is done by sending A -> hsmd_sign_commitment_tx -> V.
- V signs and writes this fact to its persistent storage, then releases the signature to A.
  - This is done by sending V -> hsmd_sign_commitment_tx_reply -> A.
- A CRASHES before it can receive the message.
  - Since it would save the signature into the persistent storage of A after receiving the message, A FORGETS that there is an already-signed new state at N.
- A restarts and reconnects to both R and V.
- A and R exchange channel_reestablish.
  - Because A never received the signature for the R-side state at index N, it thinks that the latest state is still N - 1.
- A and R exchange some new update_* messages. Suppose that this time R sends:
  - update_add_htlc to add H1 (which was sent before, but is still included now).
  - update_add_htlc to add H2.
- A decides to advance to the next state, and asks V to sign a new R-side state with H1 and H2 HTLCs at state index N.
- ERROR: V sees the attempt to repeat the request at index N with a different set of HTLCs H1 and H2 and reports that A is compromised.
  - This is the policy-commitment-retry-same!
The proposed solution is to drop the
policy-commitment-retry-same, which would let V
sign the new state and let both V and A update to
the alternate version.
Exploiting The Lack Of policy-commitment-retry-same
Unfortunately, a Validating Lightning Signer which does
not enforce policy-commitment-retry-same can lose
funds!
The exploit is walked through step by step below.
More importantly, take the point of view of the
Validating Lightning Signer V.
In both the exploit below and the issue user story
above, V sees the exact same sequence of events.
Thus, preventing the exploit below is equivalent
to causing the reported issue above!
At a glance: the same events, two intentions
The two timelines below are the honest crash (the reported bug)
and the malicious "pretend crash" (the exploit). Watch only the
V lane: the messages VLS receives and sends are identical in both.
VLS has no way to tell them apart, so any change that lets the second
timeline succeed necessarily breaks the rejection that the first one
relies on.
Honest crash (the reported bug)
sequenceDiagram
participant V as VLS
participant A as Node A
participant R as Peer R
Note over A,R: state index N-1
R->>A: update_add_htlc H1
A->>V: sign_commitment_tx N (H1)
V->>V: persist "signed N"
V-->>A: reply: sig N(H1)
Note over A: CRASH before persisting,<br/>forgets state N
Note over A: restart, believes state = N-1
A->>R: channel_reestablish
R->>A: update_add_htlc H1
R->>A: update_add_htlc H2
A->>V: sign_commitment_tx N (H1, H2)
Note over V: retry at N, different HTLCs<br/>policy-commitment-retry-same: REJECT
Malicious "pretend crash" (the exploit)
sequenceDiagram
participant V as VLS
participant A as Node A (compromised)
participant R as Peer R (attacker)
Note over A,R: state index N-1
R->>A: pretend update_add_htlc H1
A->>V: sign_commitment_tx N (H1)
V->>V: persist "signed N"
V-->>A: reply: sig N(H1)
A->>R: leak sig N(H1)
Note over A: PRETEND crash + restart,<br/>claims state = N-1
A->>R: channel_reestablish
R->>A: update_add_htlc H1
R->>A: update_add_htlc H2
A->>V: sign_commitment_tx N (H1, H2)
Note over V: no policy: SIGN sig N(H1,H2)
Note over R: later publishes sig N(H1) on-chain,<br/>H2 never lands: funds stolen
The table below makes the overlap exact. Read each row across the two
columns: every step is identical from V's point of view, except the two
steps where only the node's intent differs (steps 3 and 4: a real crash
versus a pretended one) and the final outcome.
| Step | Honest crash (the bug) | Malicious node (the exploit) |
|---|---|---|
| 1 | R adds HTLC H1; A asks V to sign state N (H1) | R and A pretend to add H1; A asks V to sign state N (H1) |
| 2 | V signs <N> H1, persists it, returns the signature | V signs <N> H1, persists it, returns the signature |
| 3 (intent) | A crashes before persisting the signature | A pretends to crash (and leaks <N> H1 to R) |
| 4 (intent) | A restarts, genuinely believing state is N - 1 | A pretends to be back at N - 1 |
| 5 | R adds H1 + H2; A asks V to sign N (H1,H2) | R adds H1 + H2; A asks V to sign N (H1,H2) |
| What V saw | sign N(H1) → reply → sign N(H1,H2) | sign N(H1) → reply → sign N(H1,H2) |
| Outcome | Legitimate retry rejected: the reported bug | If the policy is dropped, V signs <N> H1,H2; R later publishes <N> H1 on-chain: funds stolen |
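The "What V saw" row can be made concrete with a toy model. The sketch below is an assumption-laden simplification: requests are reduced to `(state index, HTLC names)` pairs, the function names are hypothetical, and the "advance by one" rule is left out because neither timeline uses it. The point is only that any decision V makes from its inputs gives the same verdict in both timelines.

```python
from typing import List, Optional, Tuple

Request = Tuple[int, Tuple[str, ...]]  # (state index, HTLC set) as seen by V


def honest_crash_trace(n: int) -> List[Request]:
    # A asks for N(H1), really crashes, restarts, then asks for N(H1, H2).
    return [(n, ("H1",)), (n, ("H1", "H2"))]


def pretend_crash_trace(n: int) -> List[Request]:
    # A asks for N(H1), leaks the signature to R, then asks for N(H1, H2).
    return [(n, ("H1",)), (n, ("H1", "H2"))]


def run_signer(trace: List[Request], enforce_retry_same: bool) -> List[str]:
    """Replay sign requests against a toy signer and collect its verdicts."""
    last: Optional[Request] = None
    verdicts = []
    for index, htlcs in trace:
        conflict = last is not None and last[0] == index and last[1] != htlcs
        if conflict and enforce_retry_same:
            verdicts.append("REJECT")
        else:
            verdicts.append("SIGN")
            last = (index, htlcs)
    return verdicts


for enforce in (True, False):
    honest = run_signer(honest_crash_trace(5), enforce)
    malicious = run_signer(pretend_crash_trace(5), enforce)
    print(enforce, honest, malicious, honest == malicious)
```

With the policy enforced, both timelines end in a rejection; with it dropped, both end in a second signature at the same index.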
In the below story, suppose we did not enforce
the policy-commitment-retry-same that caused the
issue described above:
- Actors:
  - A, a Lightning Network node that integrates a remote signer.
    - In actuality, A has been compromised by the actor R and will now coordinate with R to steal funds.
  - V, a Validating Lightning Signer that provides signer capability to the above A.
  - R, a Lightning Network node that is peered with and has a channel with A.
  - S, another Lightning Network node that has a channel with A.
    - In actuality, R and S (and A, which was compromised by R) are controlled by the same entity.
- Start at state index N - 1.
- R and A pretend to negotiate an update_add_htlc from R for an HTLC H1.
- A requests V to sign a new R-side state at index N, with HTLC H1.
  - A -> hsmd_sign_commitment_tx -> V.
- V signs the R-side state index N with HTLC H1, stores it in V persistent storage, and releases the signature to A.
  - Let us call this signature <N> H1.
  - V -> hsmd_sign_commitment_tx_reply -> A.
- A receives the signature <N> H1 and sends it to R, and then PRETENDS TO CRASH.
- A PRETENDS TO RESTART and to have renegotiated, via channel_reestablish, that the R-side state index was at N - 1 (because it is pretending that it never received the signature <N> H1).
- A and R pretend to renegotiate two new update_*s:
  - update_add_htlc to add H1.
  - update_add_htlc to add H2.
- A pretends to "re-advance" from the state N - 1 to state N due to having crashed and never receiving the previous hsmd_sign_commitment_tx_reply. It now sends a new hsmd_sign_commitment_tx to V.
- V, because it is NOT enforcing the policy-commitment-retry-same (because we were trying to solve the reported issue), signs the new R-side state with H1 and H2. Call this signature <N> H1,H2.
- A pretends to receive this new signature, but throws it away.
- R sends the revocation for R-side state at index N - 1 to A.
- A sends the revocation for N - 1 to V, via the message hsmd_validate_revocation.
  - A -> hsmd_validate_revocation -> V.
- V now believes that H1 and H2 are "irrevocably committed", because the only valid R-side state is N (and specifically, <N> H1,H2).
  - (BOLT nerds: yes I know that is not the full definition of "irrevocably committed" I am eliding some details)
- A now tells V that it wants to forward H2 to the other peer S.
- V allows the forwarding of H2 from R -> H2 -> A -> H2 -> S (and completes the necessary sign-revoke cycle to instantiate it) based on its knowledge of the state of the R-A channel.
  - It believes it is now on <N> H1,H2.
- THE EXPLOIT HAPPENS HERE.
- R takes the <N> H1 state signature (the one that only has H1 and not H2) and publishes it onchain.
  - Because it is at N and not N - 1, this invalid state is not revocable by V: R only gave the revocation key for N - 1, not N.
  - The compromised node A thus loses the value of H2:
    - It sent out the amount H2 to S (really controlled by R).
    - The incoming HTLC H2 on the R -> A channel is not present on the onchain version of the channel state, so it did not receive that amount!
Thus, Validating Lightning Signer cannot
drop the policy-commitment-retry-same in order
to solve the issue!
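The loss in the exploit story can be written as a one-line balance. This is a deliberately crude sketch: the amount is a made-up example value, fees and H1 are ignored, and it assumes A can claim an incoming HTLC whenever that HTLC is present on the published R-side state. The function name is hypothetical.

```python
def node_a_net_change(h2_amount_msat: int, onchain_htlcs: frozenset) -> int:
    """Net change for A after forwarding H2 to S, given the published R-side state."""
    paid_out_to_s = h2_amount_msat  # A forwarded H2 to S
    # A only gets the incoming H2 if the published state actually carries it.
    received_from_r = h2_amount_msat if "H2" in onchain_htlcs else 0
    return received_from_r - paid_out_to_s


H2_AMOUNT = 1_000_000  # hypothetical amount in millisatoshi

# R publishes <N> H1,H2: the forward nets out.
print(node_a_net_change(H2_AMOUNT, frozenset({"H1", "H2"})))
# R publishes <N> H1 instead: A has paid S but never receives H2.
print(node_a_net_change(H2_AMOUNT, frozenset({"H1"})))
```

Both published states carry a valid signature at index N once the policy is dropped, which is why R gets to pick the one that omits H2.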
The Correct Solution
...is to modify everyone else except for Validating
Lightning Signer, of course.
i.e. wontfix.
When a typical Lightning Network node software decides it wants to sign the counterparty commitment, this is how it is typically implemented:
- Call "sign" procedure.
- Write the fact that we signed to disk.
- Send commitment_signed.
The issue here is that there is a race condition between the "sign" procedure and "write the fact we signed":
- Call "sign" procedure.
  - CRASH!
    - Node thinks we are at N - 1.
    - VLS thinks we are at N.
    - On restart, node tries to make a different N, which VLS rejects as per policy-commitment-retry-same.
- Write the fact that we signed to disk.
- Send commitment_signed.
Instead, the correct order is:
- Write the intent to sign the new state.
- Call "sign" procedure.
- Send commitment_signed.
When a crash happens:
- Write the intent to sign the new state.
  - CRASH!
    - Node thinks we are at N.
    - VLS thinks we are at N - 1.
    - On restart, node asks to sign at N, which VLS accepts because it advances the state from N - 1 to N.
- Call "sign" procedure.
- Send commitment_signed.
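The two orderings can be compared in a toy simulation. Everything here is hypothetical scaffolding: `Signer` is a stand-in for V that enforces a simplified retry-same rule, `Node` models only what survives a crash, and the "crash" is a flag that makes the function return early. It is a sketch of the ordering argument, not of any real node implementation.

```python
class Signer:
    """Toy stand-in for V: remembers the last signed (index, htlcs)."""

    def __init__(self, index):
        self.index = index
        self.htlcs = ()

    def sign(self, index, htlcs):
        if index == self.index:
            if htlcs != self.htlcs:
                raise PermissionError("retry at same index with different inputs")
        elif index != self.index + 1:
            raise PermissionError("non-sequential index")
        self.index, self.htlcs = index, htlcs
        return f"sig<{index}>{','.join(htlcs)}"


class Node:
    """Only the fields that survive a crash (the node's disk)."""

    def __init__(self, index):
        self.disk_index = index
        self.disk_htlcs = ()


def sign_then_write(node, signer, htlcs, crash_after_sign=False):
    index = node.disk_index + 1
    sig = signer.sign(index, htlcs)
    if crash_after_sign:
        return None  # crash: the signer advanced, the node's disk did not
    node.disk_index, node.disk_htlcs = index, htlcs
    return sig


def write_then_sign(node, signer, htlcs, crash_after_write=False):
    index = node.disk_index + 1
    node.disk_index, node.disk_htlcs = index, htlcs  # persist the intent first
    if crash_after_write:
        return None  # crash: the node's disk advanced, the signer did not
    return signer.sign(index, htlcs)


def recover(node, signer):
    # On restart, redo the sign for whatever intent is on disk.
    return signer.sign(node.disk_index, node.disk_htlcs)
```

In the sign-then-write ordering, a crash leaves the signer ahead of the node, and the node's next request at the same index with new HTLCs raises `PermissionError`. In the write-then-sign ordering, `recover` succeeds whether the crash happened before or after the signer was reached, because the request is either a one-step advance or an identical retry.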
Database And Intent Logs
What a write-ahead log is
When a modern database storage engine has to commit a transaction, it typically uses a technique called "write-ahead log", sometimes called an "intent log".
A "write-ahead log" is an append-only structure stored on the disk, where the storage engine writes out the instructions to update the actual table storage.
Once the database has written out the instructions to update the table storage, it then signals the transaction as completely committed, without actually updating the table storage.
Why it's fast
As a linear append-only structure, the write-ahead log is very fast to write. With spinning rust, the drive can write the log entry on contiguous sectors. Even with flash and other non-volatile memories, the Flash Translation Layer will often balance write wear by internally creating long sequences of cells that it writes in order, which also map naturally to long sequences of the "blocks" it presents to the operating system.
When reading from persistent storage, the database will first linearly scan through the write-ahead log, in reverse order (so that later transactions are checked for the latest data first), before it even tries to look into the table storage. This sacrifices read throughput for more write throughput, but this is a good trade: the database can cache things in memory so that it rarely has to read from the persistent storage and can serve most reads from in-memory cache, but it absolutely has to ensure that any committed transactions are on persistent storage (to achieve its Durability promise), so that biasing towards fast disk writes versus slow disk reads is better.
Checkpointing the write-ahead log
Obviously space will run out if we just keep appending to the write-ahead log. So periodically, the database storage engine will actually read from the write-ahead log, and perform the updates written into the log on the actual table storage. This is called "checkpointing".
Once the entire write-ahead log has been applied to the table storage, the storage engine can clear the entire log. If the log is a separate file, it can simply be truncated. Alternatively, the engine can invalidate the entire log contents in place: for example, it can encrypt the write-ahead log entries using one key, then change that key completely, so that all the data inside the log fails MAC validation afterwards and the log is "empty" due to not having valid data. This neatly includes integrity checking and torn write protection, and if the encryption uses the block index as part of a derivation from the current key, it protects against misdirected writes too.
Now, suppose the database machine crashes after it has written to the write-ahead log. In that case, on restart, the database storage engine simply continues operating: reads are first looked up in the write-ahead log, writes append to the write-ahead log, and checkpointing is initiated if the write-ahead log is too large.
The really neat part is that it does not matter whether the crash happens while the storage engine is in the middle of checkpointing or not!
Even if the storage engine was checkpointing, and the table storage is in a halfway state where only half a transaction is non-atomically updated in the actual table, the view of the storage engine is still the same fully ACID view. This is because the write-ahead log is referred to first when reading, and the write-ahead log contains the entire transaction. Even if the update of the table storage is not complete, on restart, checkpointing can be restarted and the writes re-done regardless of whether or not the entry was already written to the table storage before the crash.
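A minimal in-memory sketch of this behaviour follows. It is not how any particular database implements its log: `TinyStore` and its methods are hypothetical, the "disk" is just Python lists and dicts, and the crash is simulated by stopping the checkpoint early.

```python
class TinyStore:
    """Toy storage engine: an append-only log in front of a table."""

    def __init__(self):
        self.log = []    # append-only list of (key, value) update instructions
        self.table = {}  # the "table storage"

    def commit(self, updates):
        # The transaction counts as committed once its instructions are logged.
        self.log.extend(updates.items())

    def read(self, key):
        # Scan the log newest-first before consulting the table.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return self.table.get(key)

    def checkpoint(self, crash_after=None):
        # Apply log entries to the table in order; optionally "crash" partway.
        for i, (k, v) in enumerate(self.log):
            if crash_after is not None and i >= crash_after:
                return False  # crashed: log is kept, table is half-updated
            self.table[k] = v
        self.log.clear()
        return True


store = TinyStore()
store.commit({"a": 1, "b": 2})
store.checkpoint(crash_after=1)           # table now holds only "a"
print(store.table)                        # half-applied transaction
print(store.read("a"), store.read("b"))   # reads still see the full transaction
store.checkpoint()                        # redo from the start; safe to repeat
print(store.table, store.log)
```

Re-running the checkpoint rewrites "a" with the same value it already has, which is why it does not matter how far the interrupted checkpoint got.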
This means that as soon as the database writes a complete transaction to the write-ahead log (with protection against "torn writes" where a transaction is not completely written because the machine crashed while the disk was writing; this is usually done by checksumming entire transaction entries), it can now treat the transaction as "committed".
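Torn-write detection by checksumming can be sketched with a simple framed log format. The layout below (4-byte length, 4-byte CRC32, then the payload) is an assumption chosen for the example, not the format of any specific database; the function names are likewise hypothetical.

```python
import struct
import zlib


def encode_entry(payload: bytes) -> bytes:
    # Assumed layout for this sketch: 4-byte length, 4-byte CRC32, payload.
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload


def decode_log(buf: bytes):
    """Return payloads of all complete, checksum-valid entries, in order.

    Stops at the first torn or corrupt entry and ignores everything after it.
    """
    entries, pos = [], 0
    while pos + 8 <= len(buf):
        length, crc = struct.unpack_from(">II", buf, pos)
        payload = buf[pos + 8 : pos + 8 + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt entry: treat as never committed
        entries.append(payload)
        pos += 8 + length
    return entries


log = encode_entry(b"tx1") + encode_entry(b"tx2")
print(decode_log(log))        # both entries are complete
print(decode_log(log[:-1]))   # last entry was cut short by a "crash"
```

An entry whose bytes did not all reach the disk fails the length or checksum test, so the reader treats that transaction as if it had never been written.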
What I want to focus on is that "as soon as we have written the update instructions for the entire transaction to disk, we have atomically updated the database regardless of whether the table storage was updated or not".
Basically, even though the database storage engine has not actually applied the update instructions to the table on disk (it has only written the update instructions to the log), it already treats the transaction as "committed": it is now Atomic and Durable on the disk, as far as the database storage engine is concerned.
Thus, in database storage engines, writing your intent to update the table to the log is equivalent to atomically updating the table.
More generally, writing your intent to DO SOMETHING is equivalent to atomically DOING SOMETHING.
Mapping it back to Lightning
Let us now map that back to our own situation over here in Lightning-land:
Writing our intent to sign the commitment transaction is equivalent to atomically signing the commitment transaction.
In particular, in this view, Validating Lightning Signer and the remote peer are the "table storage", and the storage of the node, where it writes the "intent to sign", is the "intent log" or "write-ahead log".
Requesting a signature from VLS and sending the signature
via commitment_signed to the remote peer is equivalent
to "checkpointing" from the "intent log" to the "table
storage".
Even if VLS and the remote peer are not up-to-date, the
node has the intent log and can repeat the operations on
restart, regardless of whether or not it was already
performing those operations before the crash.
Thus, the proposed solution:
- Write the intent to sign the new state. = write a log entry to the WAL.
- Call "sign" procedure. = write the logged update to the table storage copy 1 (VLS).
- Send commitment_signed. = write the logged update to the table storage copy 2 (remote peer).
In effect, we have a "write-ahead log" that is size-limited
to only one entry, and as soon as we make that entry, we
immediately have to "checkpoint" (due to hitting the
maximum size of the write-ahead log) by updating VLS and
the remote peer.
And as noted, it is immaterial, on restart, whether the
checkpointing was completed or not; on restart, we simply
need to re-do the signing and commitment_signed once
again.
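A one-entry intent log of this kind could be sketched as a small file that is replaced atomically. This is an illustration under stated assumptions, not Core Lightning's storage code: the JSON layout, the `write_intent` and `redo_from_intent` names, and the `sign` and `send` callables are all hypothetical. It also assumes the signer accepts an identical retry (as the policy described earlier allows) and that repeating the send after a reconnect is acceptable.

```python
import json
import os
import tempfile


def write_intent(path, index, htlcs):
    """Durably record the intent to sign state `index` before calling the signer."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"index": index, "htlcs": list(htlcs)}, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomic rename: the file holds the old entry or the new one, never a mix.
    os.replace(tmp, path)


def redo_from_intent(path, sign, send):
    """On restart, repeat both 'checkpoint' steps for the single logged entry."""
    with open(path) as f:
        intent = json.load(f)
    sig = sign(intent["index"], tuple(intent["htlcs"]))  # copy 1: the signer
    send(intent["index"], sig)                           # copy 2: the remote peer
    return sig


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    intent_path = os.path.join(workdir, "intent.json")
    sent = []
    write_intent(intent_path, 5, ("H1",))
    # Stub signer and peer; a real node would talk to VLS and to R here.
    stub_sign = lambda i, h: f"sig<{i}>{','.join(h)}"
    stub_send = lambda i, s: sent.append((i, s))
    redo_from_intent(intent_path, stub_sign, stub_send)
    redo_from_intent(intent_path, stub_sign, stub_send)  # redo after a "crash"
    print(sent)
```

Writing to a temporary file, calling `os.fsync`, and then renaming is a common way to avoid torn writes for a single small record; a production implementation would typically also sync the containing directory.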