[OSDI] Large-scale Incremental Processing Using Distributed Transactions and Notifications

02-04

今天我們來聊一聊percolator， Google的incremental web index update system。原文點這裡：https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36726.pdf

We have built Percolator, a system for incrementally
processing updates to a large data set, and deployed it
to create the Google web search index. By replacing a
batch-based indexing system with an indexing system

based on incremental processing using Percolator, we
process the same number of documents per day, while
reducing the average age of documents in Google search
results by 50%.

Introduction

文章一開始先解釋這個問題的難點：每次更新google search index的時候，如果用mapreduce的方法的話，需要更新的是所有crawl下來的文件。這樣的話，如果更新的部分比較小，就會耗費很多資源

However, reprocessing the entire web discards the work done in
earlier runs and makes latency proportional to the size of
the repository, rather than the size of an update.

現有的資料庫沒有辦法支持這麼大數據的吞吐量，Google的bigtable倒是可以存和度這麼大的亮，但是在bigtable上面得需要一些更多的功能。

跟當下的stream processing的最大的不同在與percolator還要支持multi table transaction，因為這些index在外部看起來必須是consistent的。為了看起來consistent，現在用的是snapshot isolation semantic。

為了可以部分更新index，percolator支持observer，這些observer還可以建立更多的observer。

由於這個系統就是專美為了可以incremental update index設計的，percolator在更新速度方面還是效果拔群的

By converting the indexing system to an
incremental system, we are able to process individual
documents as they are crawled. This reduced the average
document processing latency by a factor of 100, and
the average age of a document appearing in a search result
dropped by nearly 50 percent (the age of a search result
includes delays other than indexing such as the time
between a document being changed and being crawled).

Design

percolator主要是基於percolator binary，GFS和Bigtable

A Percolator system consists of three binaries that run
on every machine in the cluster: a Percolator worker, a

Bigtable [9] tablet server, and a GFS [20] chunkserver.
All observers are linked into the Percolator worker,
which scans the Bigtable for changed columns (「noti-
fications」) and invokes the corresponding observers as
a function call in the worker process.

Bigtable主要是用來存鎖的，GFS是用來存具體文件的。Percolator不是所有的transaction都會實時處理的，很多時候鎖會被lazily cleanup來處理。這個paper裡面特別講到percolator是大部分基於percolator的，而且API跟BigTable差不多，所以percolator跟BigTable最大的差別就是multirow transaction跟observer framework

Transactions

下面這個圖是這個文章裡面給的一個最簡單的例子

Get跟Commit是blocking的。Commit如果失敗了會返回false。這裡沒有commit的話就必須要想辦法重試了。具體commit數據進入資料庫是下面的過程

每次真的去查看這一行的數據的時候，必須看現在最近的數據的時間戳是哪一個時間戳。Percolator commit的psudocode是長這樣的：

我們可以看到的是這裡有prewrite和commit，這兩個是分開的。prewrite是可以失敗，有可能需要重試的。get首先得查找0到現在時間戳的，看看有沒有人鎖住這個record，如果有鎖的話get必須等到解鎖了才行。

這裡可以看出來之前的鎖要是沒有被解鎖的話，get是無法讀取的，所以get必須有辦法判斷這個鎖是不是可以被刪除掉。percolate裡面通過定義primary lock，每次要刪改的時候必須拿到primary lock，包括在commit執行之中。有的時候在清理lock的時候，client得根據數據具體的狀況來決定到底是rollback還是roll forward。

來判斷一個lock的worker到底是不是在幹活，這個系統用了另外一個lock服務叫Chubby，每次client在幹活的時候回定期更新Chubby上面的時間戳，已到達其他訪問可以判斷這些數據是不是正在被處理。

這裡時間戳的概念沒有Google另外一個資料庫那麼複雜，就是monotonically increasing就行了，因為所有的時間戳都是一個oracle發布出來的。

Notification

Percolator有大量client的邏輯是基於notification的，這些代碼會分布到每一個percolator機器上面。每一個column只能有一個observer，每個observer後面可以跟其他的observer。

每一個observer接收到了一個消息之後，最多能commit一個transaction。有的時候多個消息會同時給一個observer，以確保太熱的entry不至於一直被更新。

消息是作為一個bigtable column存在的，每個消息都有一個ack column，來看這個消息被接收到了沒有。如果兩個observer同時相應，只有一個observer可以拿到ack column。

製造消息方必須有能力找到被改過的column，所以每一個column都有一個對應的notify，transaction每次成功的時候都要寫入notify。這個notify是一個分開的表，而不是原來數據的一個column，這樣每次創造消息的時候，比起搜索所有的數據來找到被修改過的column，需要搜索的數據就少了很多。

每次搜索消息的時候都，會有多個worker去各個Bigtable tablet上面去找這個tablet裡面有哪些消息，而且每個worker是隨機挑選一個tablet去搜。每次選tablet的時候還需要去找一個lock service來確保兩個worker不會隨機抽到同一個worker，但是抽到了也沒事，所以這個lock service相對來說就不用搞的那麼複雜，因為它不需要存state。

這個paper後面有具體討論這個系統跟其他系統的差別，我就不講跟多的細節了，最重要的就是下面這個圖

percolator是有代價的，當處理的部分超過一定程度的時候（這裡是30%），那麼處理速度是比不上直接跑一個全局map reduce的。

歡迎噴歡迎留言，paper鏈接在最開頭。