Raft lecture (Raft user study) 2

Normal operation is pretty simple. The client sends the leader a command that it would like to have executed by all of the state machines. The leader first appends that command to its own log, then issues AppendEntries remote procedure calls to the followers in the cluster. Typically it executes these RPCs in parallel, sending the same message to all of the followers at the same time, and then waits for the responses to come back.

Once the leader has received enough responses to consider the entry committed, that is, responses from at least half of the other servers in the cluster (so with itself that makes a majority), then it's okay to execute the command. The leader passes the command to its state machine, and when the command finishes, it returns the result back to the client.
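The majority test described above can be sketched in a few lines. This is a hypothetical helper, not part of any particular Raft implementation; `ack_count` counts follower acknowledgements, and the leader's own copy of the entry is added on top:

```python
def entry_committed(ack_count, cluster_size):
    """An entry is committed once it is stored on a majority of servers.

    ack_count is the number of followers that acknowledged the
    AppendEntries RPC; the leader's own copy makes ack_count + 1
    replicas, which must exceed half the cluster.
    """
    return ack_count + 1 > cluster_size // 2
```

For example, in a five-server cluster, two follower acknowledgements plus the leader's own copy give three replicas, which is a majority, so the leader can execute the command without waiting for the remaining two servers.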

Furthermore, once the leader knows that a particular entry is committed, it notifies the other servers in subsequent AppendEntries RPCs. So eventually each of the followers will find out that the entry has been committed, and when they do, they will execute that command on their own state machines as well.

Now, if a follower has crashed or is slow to respond to an AppendEntries remote procedure call, the leader will keep retrying that call over and over. If the follower crashes and comes back up, the leader will retry it. But the leader doesn't have to wait for every single follower to respond; it only needs enough responses to guarantee that the entry is stored on a majority of the servers in the cluster. This results in very good performance in the common case.

In the normal case, all that's needed to finish a client command is a response from a majority of the servers, in fact from the fastest majority of machines in the cluster, at which point the leader can immediately execute the command and return the result to the client. So, for example, one slow server doesn't necessarily slow down clients, because the leader doesn't have to wait for that server.

Raft tries to maintain a high level of consistency between the various logs in a cluster. Ideally they'd all be identical at all times. We can't do that, of course, given that there can be crashes, but Raft tries to keep the logs as close to identical as possible, and this slide lists some properties that hold at all times.

The first property is that the combination of an index and a term uniquely identifies a log entry. That is, if two log entries are at the same log index position and they have the same term, then it's guaranteed that they also contain the same command. Furthermore, it's also guaranteed that all preceding entries in those two logs match each other. So the combination of a term and an index uniquely identifies an entire log from its beginning up to that point.
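This Log Matching Property can be written down as a checkable invariant. The sketch below is illustrative only: it represents each entry as a hypothetical `(term, command)` tuple and asserts that whenever two logs agree on the term at some index, they agree on everything up through that index:

```python
def check_log_matching(log_a, log_b):
    """Assert the Log Matching Property over two logs.

    Each log is a list of (term, command) tuples, where position i
    in the list is log index i + 1. If two entries at the same index
    have the same term, then those entries and all preceding entries
    must be identical.
    """
    for i, (entry_a, entry_b) in enumerate(zip(log_a, log_b)):
        if entry_a[0] == entry_b[0]:  # same term at the same index
            assert log_a[:i + 1] == log_b[:i + 1], \
                "Log Matching Property violated at index %d" % (i + 1)
```

Running this over logs produced by a correct Raft implementation should never trip the assertion; a pair of logs that share an (index, term) slot but differ in the command or in an earlier entry would be caught immediately.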

Furthermore, it turns out that if a particular entry is committed, then all preceding entries are also committed. This follows from the previous rule: if a majority of the servers store an entry at, say, index five, then because of the rule above they must also store all of the same earlier entries, so we know those entries also exist on a majority of the servers.

This property is enforced by a check made during the AppendEntries remote procedure call. When a leader issues an AppendEntries RPC to a follower, it includes two values in addition to the new log entries: the index and the term of the entry just before the new ones. The follower will only accept the RPC if its log contains that exact matching entry; if its log doesn't have that entry, it will reject the remote procedure call.
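The follower's side of this check can be sketched as follows. This is a simplified, hypothetical handler (real AppendEntries processing also checks the leader's term, among other things); here a log is just a list of term numbers, with list position `i` holding log index `i + 1`:

```python
def handle_append_entries(log, prev_index, prev_term, new_entries):
    """Follower-side consistency check for AppendEntries (a sketch).

    prev_index/prev_term identify the entry just before the new ones.
    Reject unless our log has exactly that entry; on success, append
    the new entries, discarding anything that conflicts after prev_index.
    """
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1] != prev_term:
            return False  # no matching entry: reject the RPC
    log[prev_index:] = new_entries
    return True
```

With the follower's log holding terms `[1, 1, 2, 2]`, an RPC carrying `prev_index=4, prev_term=2` and a new term-3 entry is accepted, matching the first example in the lecture. If the follower's entry at index four had a different term, the same RPC would be rejected, matching the second example.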

So let's go through an example. Suppose a leader has just received a new command, jump, from a client, and it sends an AppendEntries remote procedure call to a follower. The leader includes the index and term of the preceding entry, so it includes index four and term two in the RPC. The follower checks that it has a matching entry there, and since it does, it accepts the new entry into its log.

Now consider the example on the bottom. Suppose the follower's log actually had a different entry preceding the new one; in this case the entry at index four has a different term. Because of this, the AppendEntries remote procedure call will be rejected, and the follower will not accept the new entry.

This consistency check is really important. You can think of it as an induction step in the proof of the properties on the preceding slide: it guarantees that a new entry is only accepted if the logs match in their previous entry. But of course the same check was applied when those previous entries were created, which guarantees the logs also match in the entries preceding those, and so on. So if a follower accepts a new entry from the leader, its log is guaranteed to exactly match the leader's log up through that entry.

That finishes the discussion of normal operation. Now let's talk about leader changes. When a new leader comes to power, the logs may not be in a very clean state, because the previous leader could have crashed before it finished replicating some of the log entries. The way Raft handles this is that it doesn't take any special steps when a new leader comes online; there is no separate cleanup phase. The new leader just starts normal operation, and the cleanup happens during normal operation.

The reason for this is that when a new leader comes up, some of the other machines may be down, so there's no way it can clean up their logs right away. It has to be able to resume operation even if some machines are down, and it could be a long time before those machines come back up. So we have to design the system so that normal operation eventually converges all of the logs.

Raft's approach is to assume that the leader's log is always correct: it has everything important, so all the leader has to do is, over time, make all of the followers' logs match its own. But in the meantime that leader could crash before it finishes the job, and so could the next leader, and the next. Extraneous log entries can pile up over a long period of time to create a fairly chaotic-looking situation, like the example at the bottom of the slide here.

First, I want to mention that I'm going to change my notation a little bit. Up until now I've shown the commands in log entries, but I'm not going to do that anymore. Since we know that the combination of a log index and the term stored in an entry uniquely identifies that entry, from now on I'm just going to show term numbers on the log entries, without commands.

This particular scenario at the bottom could have happened if servers four and five were the leaders for terms two, three, and four, but somehow never replicated any entries beyond themselves, and then crashed. The system then partitioned for a while, and the other servers, one, two, and three, took turns being leaders for terms five, six, and seven, but were not able to communicate with servers four and five to clean them up. So now we're in a situation where the logs are really quite a mess.

The only things that really matter here are the entries I'm drawing here, entries one through three. These are committed entries, so we have to make sure we preserve them. None of the other entries have been committed, so it doesn't really matter whether we keep them or throw them away. We haven't passed any of them to a state machine, and no client has seen the results of executing any of these commands, so they are all expendable.

If, for example, server two is the leader for term seven and it's able to communicate with everyone, then eventually it will make all of the logs in the cluster look like its log, and any conflicting entries will get deleted.

I'll come back later to talk about how a leader makes the followers' logs match its own, but first I want to talk about correctness and safety. How do we know the system is behaving correctly, and that we're not losing important information? You can see clearly here that we're going to have to throw away some log entries in order to bring everything back into consistency, so how do we do that in a safe fashion?

There's a fundamental overall safety requirement that any system for implementing replicated logs must obey, written in red here: once a particular state machine has received a log entry and applied it as a command, we must make sure that no other state machine ever applies a different value for that log entry. They must all apply the same values, in the same order, for the log entries.

In order to achieve this overall safety requirement, Raft implements a somewhat narrower property, what I call the safety property, and it's what's written on the slide here: once a leader has decided that a particular entry is committed, Raft guarantees that the entry will be present in the logs of all future leaders in the system.

So whenever a leader comes to power in the future, for its entire lifetime it will have all of the committed log entries present. If we can make Raft conform to this property, then that guarantees the safety requirement at the top of the slide. The rough argument is this: first, a leader never overwrites an entry in its log, it only appends, so if a committed entry is always present on the leader, it will never change; second, in order to be committed an entry has to be in the leader's log, so no other value can be committed for that slot; and third, entries have to be committed before they're applied to a state machine. Put all of those together, and we've guaranteed the property at the top of the slide.

Now, the Raft algorithm as I've described it so far does not yet guarantee this property, and I'm going to go through the problems and show you how we solve them. But first, let me restate what we're trying to do: if an entry is committed, that implies it will be present in future leaders' logs. In order to achieve this, we're going to change the Raft algorithm in two ways.

First, I'm going to modify the election process to exclude a machine from becoming leader if it doesn't have the right stuff in its log. Second, because that's not going to be enough by itself, we're going to change the definition of committed a little bit, so that sometimes we have to delay committing an entry until we know it will be safe, that is, until we can guarantee that it will be present in future leaders' logs.

I'll talk about the election part first: how do we make sure we pick a leader that holds all of the committed log entries? This is kind of tricky, because we can't actually tell which entries are committed. Consider a cluster with three servers, and suppose we're having to pick a new leader while one of the servers is unavailable. Just looking at the servers that are available during the transition, we can't tell whether entry five is committed or not; it depends on whether it's stored on the unavailable server. In this case it is, but in other cases it might not be.

So we can't know for sure which entries are committed. Instead, what we do is try to pick as election winner the candidate whose log is most likely to have all of the committed entries. I'll first describe this intuitively, then come back and make it more precise, to prove that we can in fact guarantee that we pick a candidate that has all the committed entries.

The way this works is by comparing logs. When a candidate requests a vote from another server, it includes information about its log, and all it needs to include is the index of its last log entry and the term of that entry. Remember from the earlier discussion that this pair uniquely describes the entire log.

The voting server that receives this vote request compares its own log to the candidate's, and if the voter's log is more complete, in an intuitive sense, it denies its vote. To make this precise, we define it like this: if the last term in the voting server's log is greater than the last term in the candidate's log, then the voter's log is more complete and it denies the vote; or, if the last terms match and the voting server's log is longer than the candidate's, then again the voter's log is more complete and it denies the vote. The result is that whoever wins the election is guaranteed to have the most complete log among the servers that voted for it, that is, among some majority of the cluster. And again, by most complete I mean this particular definition in terms of the index and term of the last entry in the log.
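The two-part comparison just described can be sketched as a single function. This is an illustrative fragment of the RequestVote handling only, ignoring current-term checks and the one-vote-per-term rule; all parameter names are my own:

```python
def grant_vote(voter_last_term, voter_last_index,
               cand_last_term, cand_last_index):
    """Log-completeness check from the vote request (a sketch).

    Deny the vote if the voter's log is more complete than the
    candidate's: either its last term is larger, or the last terms
    match and its log is longer.
    """
    if voter_last_term != cand_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index
```

So a voter whose log ends in term 3 denies a candidate whose log ends in term 2, however long that candidate's log is; and between two logs ending in the same term, the shorter one loses.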

