Paper Reading: The Classic Systems That Ushered in the Big Data Era

04-07

// Still collecting material; this post is not finished yet

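Some background: I recently got handed a chore. Our group is starting a paper reading series, and the job of organizing it landed on me, a decidedly unscholarly undergrad...

The first question is what topic to cover. That one is easy enough, since the senior folks basically pick it. They said to start with the classic papers in big data.

The next question is where to find the papers, which isn't hard either... Courses like MIT 6.824 and CMU 15-712 are well known, and the brilliant people behind them have already pointed the way.

The question after that is one you can't get around: why spend time reading these papers at all? This line probably answers it best:

"A classic becomes a classic only after countless people have made the same mistakes countless times." If you want to fall into fewer big design pits, go see what the people before you said in their papers.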

Paper list

1.MapReduce

MapReduce: Simplified Data Processing on Large Clusters, OSDI04

MapReduce is a programming model and an associated
implementation for processing and generating large
data sets. Users specify a map function that processes a
key/value pair to generate a set of intermediate key/value
pairs, and a reduce function that merges all intermediate
values associated with the same intermediate key. Many
real world tasks are expressible in this model.

Keywords: one of Google's "big three" papers, a key abstraction for data processing, Jeff Dean
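
To make the abstract concrete, here is a minimal single-process sketch of the model in Python, using the usual word-count example. The names `map_fn`, `reduce_fn` and `run_mapreduce` are my own, not from the paper, and nothing here is distributed or fault-tolerant; it only shows the shape of the two user-supplied functions and the grouping step between them.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(key: str, value: str) -> Iterator[tuple[str, int]]:
    """Map: take one (document name, contents) pair, emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share one key."""
    return sum(values)


def run_mapreduce(
    inputs: Iterable[tuple[str, str]],
    mapper: Callable[[str, str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> dict[str, int]:
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    intermediate: dict[str, list[int]] = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            intermediate[ikey].append(ivalue)  # the "shuffle" step, done locally
    return {ikey: reducer(ikey, ivalues) for ikey, ivalues in intermediate.items()}


docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog and the fox")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["the"], counts["fox"])
```

In the real system the grouping step is a shuffle across machines, and the paper spends most of its effort on what this sketch leaves out: partitioning, scheduling, and re-running failed tasks.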

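Note: one system worth comparing with MapReduce is Dryad. Dryad offers a more complex and more flexible computation abstraction. It is in effect an execution engine for directed acyclic graphs, where each vertex represents a computation and each edge represents a data flow, with the direction of the edge giving the direction the data moves. For details, see "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks".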
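
The "vertices are computations, edges are data flows" idea in the note above can be sketched in a few lines of Python. This is not Dryad's API, just a toy single-process illustration; `run_dag` and the vertex names are made up for the example, and ordering is delegated to the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Any, Callable


def run_dag(
    vertices: dict[str, Callable[[list[Any]], Any]],
    edges: list[tuple[str, str]],
) -> dict[str, Any]:
    """Run each vertex once, after all of its upstream vertices have run.

    vertices: name -> function that receives the list of upstream outputs
    edges:    (src, dst) pairs; data flows from src to dst
    """
    preds: dict[str, list[str]] = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)

    results: dict[str, Any] = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        inputs = [results[p] for p in preds[name]]
        results[name] = vertices[name](inputs)
    return results


vertices = {
    "read_a": lambda ins: [1, 2, 3],
    "read_b": lambda ins: [10, 20],
    "square": lambda ins: [x * x for x in ins[0]],
    "merge": lambda ins: sorted(ins[0] + ins[1]),
    "total": lambda ins: sum(ins[0]),
}
edges = [
    ("read_a", "square"),
    ("square", "merge"),
    ("read_b", "merge"),
    ("merge", "total"),
]
results = run_dag(vertices, edges)
print(results["merge"], results["total"])
```

A map stage feeding a reduce stage is just one particular graph shape here, which is the sense in which the DAG abstraction is the more general one.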

2.GFS

The Google File System, SOSP03

We have designed and implemented the Google File System,
a scalable distributed file system for large distributed
data-intensive applications. It provides fault tolerance while
running on inexpensive commodity hardware, and it delivers
high aggregate performance to a large number of clients.

Keywords: one of Google's "big three" papers, a scalable, fault-tolerant, datacenter-scale file system

3.BigTable / HBase

Bigtable: A Distributed Storage System for Structured Data, OSDI06

Note: the DZone article "Understanding HBase and BigTable" is a helpful aid for understanding BigTable.

4.Chubby

The Chubby lock service for loosely-coupled distributed systems, OSDI06

5.Spark

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI12

We present Resilient Distributed Datasets (RDDs), a distributed
memory abstraction that lets programmers perform
in-memory computations on large clusters in a
fault-tolerant manner.

Keywords: an abstraction for in-memory computation

6.Dynamo/Cassandra

Dynamo: Amazon's Highly Available Key-value Store, SOSP07

Reliability at massive scale is one of the biggest challenges we
face at Amazon.com, one of the largest e-commerce operations in
the world; even the slightest outage has significant financial
consequences and impacts customer trust. The Amazon.com
platform, which provides services for many web sites worldwide,
is implemented on top of an infrastructure of tens of thousands of
servers and network components located in many datacenters
around the world. At this scale, small and large components fail
continuously and the way persistent state is managed in the face
of these failures drives the reliability and scalability of the
software systems.

Keywords: the early shape of the cloud era

7.Pig

Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD08

The arrival of Pig marked the point where the Hadoop community set off on a path that diverged from Google's, and it marked the start of the modern period of big data. In my view, this shift was a milestone event of historic significance.

8.Hive

Hive – A Petabyte Scale Data Warehouse Using Hadoop, ICDE10

9.Dremel

Dremel: Interactive Analysis of Web-Scale Datasets, VLDB10

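Reference links:

1.Mu Li: What do you need to know to learn distributed systems?

2.Big Data Stories (5): The Sinking Microsoft and Dryad

3.Big Data Stories (7): The Soaring Latin Pig

4.Big Data Stories (8): The Early Days of Hive

5.SOSP09 follow-up + paper reviews (detailed version, written up to the 2nd paper)

6.A paper that changed the course of the Internet | Dynamo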