[OSDI2012] Spanner: Google's Globally-Distributed Database, Part 1

Google recently launched Cloud Spanner, so let's take a look back at how Spanner actually works. (Paper PDF: static.googleusercontent.com)

Spanner is a scalable, globally-distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that shards data across many sets of Paxos [21] state machines in datacenters spread all over the world. Replication is used for global availability and geographic locality; clients automatically failover between replicas.

The ideas all look very simple and obvious now, but actually building this in 2012 was genuinely impressive.

Our initial customer was F1 [35], a rewrite of Google's advertising backend. F1 uses five replicas spread across the United States. Most other applications will probably replicate their data across 3 to 5 datacenters in one geographic region, but with relatively independent failure modes. That is, most applications will choose lower latency over higher availability, as long as they can survive 1 or 2 datacenter failures.

The key point is that even though Spanner can give you very strong guarantees, customers will sometimes trade them away for speed (for example, replicating within one region rather than across the globe).

Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner supports general-purpose transactions, and provides a SQL-based query language.

This is where the strength of Google's backend engineers shows. Plenty of companies now run strongly consistent databases in their backends, but layering SQL on top of one, in 2012, was seriously impressive.

Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance).

Second, Spanner has two features that are difficult to implement in a distributed database: it provides externally consistent [16] reads and writes, and globally-consistent reads across the database at a timestamp.

Users can control how many replicas they want, and Spanner still guarantees globally-consistent reads.

In addition, the serialization order satisfies external consistency (or equivalently, linearizability [20]): if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's. Spanner is the first system to provide such guarantees at global scale.

Here the paper states Spanner's external-consistency (linearizability) guarantee: if transaction T1 commits before transaction T2 starts, then T1's commit timestamp is smaller than T2's. Later sections show how this is actually achieved.
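
To make the guarantee concrete, here is a tiny sketch I put together (in Go; the types and names are my own, not the paper's) of the invariant that external consistency promises:

package main

import "fmt"

// Txn records when a transaction started and committed in real time,
// plus the commit timestamp Spanner assigned to it.
type Txn struct {
    StartWallTime  int64
    CommitWallTime int64
    CommitTS       int64
}

// externallyConsistent checks the guarantee for one pair of transactions:
// if t1 committed (in real time) before t2 started, then t1's commit
// timestamp must be smaller than t2's. Overlapping transactions are not
// constrained by this rule.
func externallyConsistent(t1, t2 Txn) bool {
    if t1.CommitWallTime < t2.StartWallTime {
        return t1.CommitTS < t2.CommitTS
    }
    return true
}

func main() {
    t1 := Txn{StartWallTime: 10, CommitWallTime: 20, CommitTS: 15}
    t2 := Txn{StartWallTime: 30, CommitWallTime: 40, CommitTS: 35}
    fmt.Println(externallyConsistent(t1, t2)) // true
}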

Implementation

Spanner is organized as a set of zones, where each zone is the rough analog of a deployment of Bigtable servers [9]. Zones are the unit of administrative deployment. The set of zones is also the set of locations across which data can be replicated.

A zone has one zonemaster and between one hundred and several thousand spanservers. The former assigns data to spanservers; the latter serve data to clients.

The universe master and the placement driver are currently singletons. The universe master is primarily a console that displays status information about all the zones for interactive debugging. The placement driver handles automated movement of data across zones on the timescale of minutes.

This describes the pieces that sit above the spanservers: the placement driver handles automated movement of data (replicas) between zones, the zonemaster assigns data to the spanservers in its zone, and the universe master is essentially a status console that humans look at when debugging the system.

Software Stack

This section focuses on the spanserver implementation to illustrate how replication and distributed transactions have been layered onto our Bigtable-based implementation. The software stack is shown in Figure 2. At the bottom, each spanserver is responsible for between 100 and 1000 instances of a data structure called a tablet. A tablet is similar to Bigtable's tablet abstraction, in that it implements a bag of the following mappings:

(key:string, timestamp:int64) → string

The paper does not really say why one spanserver holds so many tablets; my understanding is that smaller tablets make data movement easier, at the price of putting more pressure on shard-map lookup speed. Another interesting detail is that the timestamp is stored in the key rather than in the value.
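
As a rough illustration (my own sketch, not Spanner code), a tablet can be thought of as a multi-version map where a read names a timestamp and gets back the latest version at or before it:

package main

import "fmt"

// versioned is one timestamped version of a value.
type versioned struct {
    ts    int64
    value string
}

// tablet maps each key to all of its versions (names are hypothetical).
type tablet map[string][]versioned

func (t tablet) write(key string, ts int64, value string) {
    t[key] = append(t[key], versioned{ts, value})
}

// readAt returns the newest value of key whose timestamp is <= ts,
// which is what lets applications read data at old timestamps.
func (t tablet) readAt(key string, ts int64) (string, bool) {
    var best *versioned
    for i := range t[key] {
        v := &t[key][i]
        if v.ts <= ts && (best == nil || v.ts > best.ts) {
            best = v
        }
    }
    if best == nil {
        return "", false
    }
    return best.value, true
}

func main() {
    t := tablet{}
    t.write("user:1", 100, "alice")
    t.write("user:1", 200, "alice-renamed")
    fmt.Println(t.readAt("user:1", 150)) // alice true (old version still readable)
    fmt.Println(t.readAt("user:1", 250)) // alice-renamed true
}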

Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store. A tablet's state is stored in a set of B-tree-like files and a write-ahead log, all on a distributed file system called Colossus (the successor to the Google File System [15]).

Interestingly, the storage here is B-tree-like files plus a write-ahead log, essentially not that different from a MySQL storage engine, and it sits on the successor to GFS (Colossus) rather than on a write-optimized store like LevelDB. Indirectly, this suggests Spanner is designed for data that is not written all that heavily.

To support replication, each spanserver implements a single Paxos state machine on top of each tablet.

The Paxos state machines are used to implement a consistently replicated bag of mappings. The key-value mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date. The set of replicas is collectively a Paxos group.

One key point here: writes always go to the leader, while reads are served locally, as long as the local data is sufficiently up to date.
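
A minimal sketch of that routing rule, with hypothetical names, assuming each replica tracks a "safe" timestamp up to which it has applied all writes:

package main

import "fmt"

// replica tracks whether it is the Paxos leader and a safe timestamp:
// every write with a timestamp <= safeTS has already been applied locally.
type replica struct {
    isLeader bool
    safeTS   int64
}

// canServeRead: a replica may answer a read at readTS from its local tablet
// only if it is sufficiently up to date for that timestamp.
func (r replica) canServeRead(readTS int64) bool {
    return readTS <= r.safeTS
}

// canAcceptWrite: writes must initiate the Paxos protocol at the leader.
func (r replica) canAcceptWrite() bool {
    return r.isLeader
}

func main() {
    follower := replica{isLeader: false, safeTS: 100}
    fmt.Println(follower.canServeRead(90))  // true: local read is fine
    fmt.Println(follower.canServeRead(120)) // false: not up to date enough
    fmt.Println(follower.canAcceptWrite())  // false: only the leader takes writes
}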

At every replica that is a leader, each spanserver implements a lock table to implement concurrency control. The lock table contains the state for two-phase locking: it maps ranges of keys to lock states.

The leader is responsible for locking key ranges; this scheme only stays fast because writes already have to go through the leader anyway.
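
Roughly, the leader-side lock table can be pictured like this (a toy sketch of my own; a real lock table would detect overlapping ranges, queue waiters, and so on):

package main

import "fmt"

type lockMode int

const (
    unlocked lockMode = iota
    shared
    exclusive
)

// keyRange is a half-open range [start, end) of keys.
type keyRange struct{ start, end string }

// lockTable maps key ranges to their lock state; it is kept only at the leader.
type lockTable map[keyRange]lockMode

// tryLock grants a lock on r unless a conflicting lock is already held.
// For brevity this only checks the exact same range.
func (lt lockTable) tryLock(r keyRange, m lockMode) bool {
    cur := lt[r]
    if cur == unlocked || (cur == shared && m == shared) {
        lt[r] = m
        return true
    }
    return false
}

func main() {
    lt := lockTable{}
    r := keyRange{start: "user:1", end: "user:2"}
    fmt.Println(lt.tryLock(r, exclusive)) // true
    fmt.Println(lt.tryLock(r, shared))    // false: conflicts with the exclusive lock
}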

At every replica that is a leader, each spanserver also implements a transaction manager to support distributed transactions. The transaction manager is used to implement a participant leader; the other replicas in the group will be referred to as participant slaves. If a transaction involves only one Paxos group (as is the case for most transactions), it can bypass the transaction manager, since the lock table and Paxos together provide transactionality. If a transaction involves more than one Paxos group, those groups' leaders coordinate to perform two-phase commit. One of the participant groups is chosen as the coordinator: the participant leader of that group will be referred to as the coordinator leader, and the slaves of that group as coordinator slaves. The state of each transaction manager is stored in the underlying Paxos group (and therefore is replicated).

Here the paper spells out two-phase commit: when a transaction spans more than one Paxos group, the group leaders coordinate, and one of them is chosen as the coordinator.
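
A heavily simplified sketch of that flow (the interfaces are hypothetical; the real protocol makes the prepare and commit state durable through each group's Paxos, per the quoted paragraph above):

package main

import "fmt"

// groupLeader stands in for the participant leader of one Paxos group.
type groupLeader struct{ name string }

// prepare acquires locks and durably records a prepare decision; in this
// toy version it always succeeds.
func (g groupLeader) prepare() bool { return true }

func (g groupLeader) commit() { fmt.Println(g.name, "committed") }
func (g groupLeader) abort()  { fmt.Println(g.name, "aborted") }

// twoPhaseCommit is driven by the coordinator leader, which is simply the
// leader of one of the participating groups, chosen for this transaction.
func twoPhaseCommit(coordinator groupLeader, others []groupLeader) {
    all := append([]groupLeader{coordinator}, others...)
    for _, p := range all { // phase 1: prepare everywhere
        if !p.prepare() {
            for _, q := range all {
                q.abort()
            }
            return
        }
    }
    for _, p := range all { // phase 2: commit everywhere
        p.commit()
    }
}

func main() {
    twoPhaseCommit(groupLeader{name: "group-A"},
        []groupLeader{{name: "group-B"}, {name: "group-C"}})
}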

Directories and Placement

On top of the bag of key-value mappings, the Spanner implementation supports a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix. (The choice of the term directory is a historical accident; a better term might be bucket.)

A directory is the unit of data placement. All data in a directory has the same replication configuration. When data is moved between Paxos groups, it is moved directory by directory, as shown in Figure 3.

All data is organized into directories (or call them buckets). Next comes a very important point, the difference between a Bigtable tablet and a Spanner tablet:

The fact that a Paxos group may contain multiple directories implies that a Spanner tablet is different from a Bigtable tablet: the former is not necessarily a single lexicographically contiguous partition of the row space. Instead, a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this decision so that it would be possible to colocate multiple directories that are frequently accessed together.

As for why different directories are put together in one tablet: presumably because with SQL, different tables are often accessed together (JOINs), and if that data did not live under the same tablet, locking would become a real headache.
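
The structural difference can be sketched like this (the types are my own illustration):

package main

import "fmt"

// directory is a set of contiguous keys sharing a common prefix.
type directory struct{ prefix string }

// spannerTablet is a container that may hold several directories, which do
// not have to be adjacent in the row space.
type spannerTablet struct{ directories []directory }

// bigtableTablet, by contrast, is exactly one contiguous key range.
type bigtableTablet struct{ startKey, endKey string }

func main() {
    t := spannerTablet{directories: []directory{
        {prefix: "users/A/"},  // two frequently co-accessed directories
        {prefix: "albums/A/"}, // colocated in the same tablet
    }}
    fmt.Println(len(t.directories), "directories share one Spanner tablet")
    _ = bigtableTablet{startKey: "a", endKey: "m"}
}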

Movedir is the background task used to move directories between Paxos groups [14]. Movedir is also used to add or remove replicas to Paxos groups [25], because Spanner does not yet support in-Paxos configuration changes. Movedir is not implemented as a single transaction, so as to avoid blocking ongoing reads and writes on a bulky data move. Instead, movedir registers the fact that it is starting to move data and moves the data in the background. When it has moved all but a nominal amount of the data, it uses a transaction to atomically move that nominal amount and update the metadata for the two Paxos groups.

Only the final step of the move is done as a transaction; the basic pattern is copying the bulk of the data plus a small delta at the end. That is probably easier to implement, and it also suits Spanner's data layout (key + timestamp).
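
Here is the movedir pattern as a sketch, with hypothetical helper functions standing in for the real data copies and metadata updates:

package main

import "fmt"

// The helpers below are hypothetical stand-ins for the real work of copying
// directory data between Paxos groups and updating their metadata.
func registerMoveStarted(dir, from, to string)  { fmt.Println("move registered:", dir, from, "->", to) }
func copyBulkInBackground(dir, from, to string) { fmt.Println("bulk copy in background:", dir) }
func copyRemainingDelta(dir, from, to string)   { fmt.Println("copy final delta:", dir) }
func updateGroupMetadata(dir, from, to string)  { fmt.Println("metadata updated for", from, "and", to) }

// runTransaction stands in for a single (small) Spanner transaction.
func runTransaction(f func()) { f() }

// moveDirectory follows the movedir pattern: the bulk copy happens outside
// any transaction so reads and writes are not blocked; only the last,
// nominal amount of data and the metadata flip are done transactionally.
func moveDirectory(dir, from, to string) {
    registerMoveStarted(dir, from, to)
    copyBulkInBackground(dir, from, to)
    runTransaction(func() {
        copyRemainingDelta(dir, from, to)
        updateGroupMetadata(dir, from, to)
    })
}

func main() {
    moveDirectory("users/A", "group-1", "group-2")
}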

A directory is also the smallest unit whose geographic replication properties (or placement, for short) can be specified by an application. The design of our placement-specification language separates responsibilities for managing replication configurations. Administrators control two dimensions: the number and types of replicas, and the geographic placement of those replicas. They create a menu of named options in these two dimensions (e.g., North America, replicated 5 ways with 1 witness). An application controls how data is replicated, by tagging each database and/or individual directories with a combination of those options. For example, an application might store each end-user's data in its own directory, which would enable user A's data to have three replicas in Europe, and user B's data to have five replicas in North America.

This is a really nice piece of design, so let me go into some detail. Whoever creates a directory can specify which datacenters it should be replicated in and how many replicas to keep. Why does this matter? Because there is latency between datacenters, and the bandwidth between datacenters varies too. Say I have datacenters in the US and in Europe: I want US users' data in the US and European users' data in Europe, because 1. read latency is lower; 2. write latency is lower, since a Paxos write needs acknowledgements from more than one server; and 3. it saves storage, since not every piece of data needs a copy in every datacenter.
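
A sketch of that separation of responsibilities, with made-up option names (this is not Spanner's actual placement-specification language):

package main

import "fmt"

// placementOption is one named entry in the administrators' menu.
type placementOption struct {
    regions   []string
    replicas  int
    witnesses int
}

func main() {
    // Administrators define the menu: number/type of replicas and where they go.
    menu := map[string]placementOption{
        "na-5-with-1-witness": {regions: []string{"North America"}, replicas: 5, witnesses: 1},
        "eu-3":                {regions: []string{"Europe"}, replicas: 3},
    }

    // Applications tag databases or individual directories with menu options;
    // here each end-user's data lives in its own directory.
    directoryPlacement := map[string]string{
        "users/A": "eu-3",                // user A: three replicas in Europe
        "users/B": "na-5-with-1-witness", // user B: five replicas in North America
    }

    for dir, opt := range directoryPlacement {
        fmt.Printf("%s -> %+v\n", dir, menu[opt])
    }
}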

Data Model

Spanner exposes the following set of data features to applications: a data model based on schematized semi-relational tables, a query language, and general-purpose transactions.

The paper also mentions that one way Spanner improves on Bigtable is its support for transactions and for synchronous replication across datacenters, where Bigtable only offers eventual consistency. Next, how Spanner is similar to and different from a SQL database.

An application creates one or more databases in a universe. Each database can contain an unlimited number of schematized tables. Tables look like relational-database tables, with rows, columns, and versioned values

Spanner's data model is not purely relational, in that rows must have names. More precisely, every table is required to have an ordered set of one or more primary-key columns. This requirement is where Spanner still looks like a key-value store: the primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns. A row has existence only if some value (even if it is NULL) is defined for the row's keys. Imposing this structure is useful because it lets applications control data locality through their choices of keys.

Spanner data must have a primary key, which is what sharding is based on. Suppose I am running a photo-storage service; my schema would look like this:
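
(The schema below is roughly the example from Figure 4 of the paper.)

CREATE TABLE Users {
  uid INT64 NOT NULL, email STRING
} PRIMARY KEY (uid), DIRECTORY;

CREATE TABLE Albums {
  uid INT64 NOT NULL, aid INT64 NOT NULL,
  name STRING
} PRIMARY KEY (uid, aid),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;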

Here, Albums uses INTERLEAVE IN to declare that its parent table is Users, and ON DELETE CASCADE to declare that when a user is deleted, their albums are deleted too. This interleaving guarantees that when I join Users with Albums, the whole thing can be done on a single server.

In the next part I will continue with Spanner's TrueTime and how concurrent requests are handled.
