當我們討論 Impossibility 我們在談什麼？

07-05

來自專欄 FLP Impossibility

分散式系統理論里有各種 Impossibility 理論, 比如：最簡單的 Two Generals Problem，或是為了理解 Paxos 一定會談到的 FLP Impossibility。當讀到這些理論後，作為碼農的我，在實際工作中應該如何運用這些理論呢？

1) 吹牛逼，比如像我現在正在做的這樣；

2) 在工作中謹慎地挖坑，防止給自己挖了一個impossible的坑；

那麼除此之外我們還能做些什麼呢？Nancy A. Lynch (就是 FLP 里的 L) 在「A Hundred Impossibility Proofs for Distributed Computing」里對這個問題進行了解答：

Impossibility results tell you when you should stop trying to devise or improve an algorithm...
Its probably true that most systems developers, even when confronted with the proved impossibility of what theyre trying to do, will still keep trying to do it. This doesnt necessarily mean that they are obstinate, but rather they have some flexibility in their goal...
Proving impossibility results cause us to take a very analytical approach to understand the area. It causes us to state carefully exactly what assumptions the results depends on.

即

1) 如果某個問題不可解，那麼再多的優化嘗試解決它也是徒勞的。或者更 general 的說，解決一個問題前先做 research，否則會白白費力 (反例：Redis 實現的鎖 Redlock [4] )。

2) 對於不可解的問題，我們要明確地理解：到底在什麼條件下，這個問題為什麼不可解？我們可不可以放寬某個條件？或者說我們能不能弱化求解問題的性質？

比如對於 FLP Impossibility 不可解性質：

它並不是說的共識問題像停機問題一樣不可解，而是說在能容忍至少一個進程失敗的非同步消息傳遞的系統中，如果想滿足 safety 性質（大家達成一個 non-trivial 的共識），那麼存在一種可能的執行順序(schedule)使得系統無限地推遲(delay)而無法達成共識。[5]

而真實世界是這樣的么？

在真實的系統的執行順序是隨機的，發送消息越多，那麼系統執行某一個特定的執行順序的概率會趨向於0，所以用隨機演算法 [6]能夠解決共識問題。

同時在 FLP 的模型中我們假設系統是完全非同步，但是現實的網路里我們更像是一個 paritially synchronous 的系統，從而對時間和消息傳遞做出某種假設(比如：true time），這也是為什麼在實際環境中 Multi-paxos 和 Raft 可以利用 timeout 機制來避免 liveness 問題。

寫完這篇文章，明天又能和同事吹上一天了。

Reference

Two Generals Problem

2. FLP Impossibility

3. A Hundred Impossibility Proofs for Distributed Computing

4. How to do distributed locking

5. Stumbling over consensus research: Misunderstandings and issues

6. Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols