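Paper Reading: The Classic Systems That Ushered in the Big Data Era
// Still collecting material; this post is under construction
Some background: I recently got stuck with a chore. Our group is starting a paper reading series, and the organizing work landed on me, a rather unlettered undergrad...
The first question is what topic to cover. That one is easy enough, since the senior folks basically pick it. They said to start with the classic papers in the big data field.
The next question is where to find the papers, which is not hard either... Courses like MIT 6.824 and CMU 15-712 are well known, and the brilliant people behind them have already pointed the way.
The next question is the one you cannot dodge: why spend time reading these papers at all? This quote probably answers it best:
"A classic becomes a classic only after countless people have made the same mistakes countless times." If you want to fall into fewer big design pits, go read what your predecessors had to say in the papers.
Paper list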
1.MapReduce
MapReduce: Simplified Data Processing on Large Clusters, OSDI04
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model.
Keywords: one of Google's "big three" papers, an important abstraction for data processing, Jeff Dean
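To make the map/reduce contract from the abstract concrete, here is a minimal single-process sketch of word counting. It is only an illustration of the programming model, not Google's implementation: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are made up for this post, and the "shuffle" is just an in-memory dictionary rather than a distributed sort.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input key/value pair, emit intermediate key/value pairs."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: List[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    # "Shuffle" step: group intermediate values by intermediate key.
    # A real system partitions and sorts these across many machines.
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            intermediate[ikey].append(ivalue)
    # Apply the reducer once per intermediate key.
    return {ikey: reducer(ikey, values) for ikey, values in intermediate.items()}


docs = [("a.txt", "the quick brown fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

The point of the abstraction is that the user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` (grouping, scheduling, fault tolerance) is what the real system takes care of.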
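Note: a system worth comparing with MapReduce is Dryad. Dryad offers a more complex but also more flexible computation abstraction: it is essentially an execution engine for directed acyclic graphs, where each vertex represents a computation, each edge represents a data flow, and the direction of an edge determines which way the data moves. For details, see "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks".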
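The "vertices are computations, edges are data flows" idea can be sketched in a few lines. This is a toy, single-machine illustration and has nothing to do with Dryad's actual API: `run_dag` and the vertex names are hypothetical, and the standard-library `graphlib` module (Python 3.9+) is used only to get an order in which every vertex runs after its inputs.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple


def run_dag(
    vertices: Dict[str, Callable[[List[Any]], Any]],
    edges: List[Tuple[str, str]],
) -> Dict[str, Any]:
    """Run each vertex once, feeding it the outputs of its upstream vertices."""
    # Build predecessor lists: an edge (src, dst) means data flows src -> dst.
    preds: Dict[str, List[str]] = {name: [] for name in vertices}
    for src, dst in edges:
        preds[dst].append(src)

    outputs: Dict[str, Any] = {}
    # static_order() yields every vertex after all of its predecessors.
    for name in TopologicalSorter(preds).static_order():
        inputs = [outputs[p] for p in preds[name]]
        outputs[name] = vertices[name](inputs)
    return outputs


# Two independent read+sort branches feeding one merge vertex.
vertices = {
    "read_a": lambda ins: [3, 1, 2],
    "read_b": lambda ins: [6, 5, 4],
    "sort_a": lambda ins: sorted(ins[0]),
    "sort_b": lambda ins: sorted(ins[0]),
    "merge": lambda ins: sorted(ins[0] + ins[1]),
}
edges = [
    ("read_a", "sort_a"),
    ("read_b", "sort_b"),
    ("sort_a", "merge"),
    ("sort_b", "merge"),
]
results = run_dag(vertices, edges)
print(results["merge"])
```

Seen this way, a MapReduce job is one particular fixed graph shape (map vertices, then reduce vertices), while a DAG engine lets the user choose the shape.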
2.GFS
The Google File System, SOSP03
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
Keywords: one of Google's "big three" papers, a scalable and fault-tolerant datacenter-scale file system
3.BigTable / HBase
Bigtable: A Distributed Storage System for Structured Data, OSDI06
Note: the DZone article "Understanding HBase and BigTable" is helpful for understanding BigTable.
4.Chubby
The Chubby lock service for loosely-coupled distributed systems, OSDI06
4.Spark
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI12
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
Keywords: an abstraction for in-memory computation
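One way to picture "in-memory and fault-tolerant at the same time" is lineage: each dataset remembers how it was derived from its parent, so a lost in-memory partition can be recomputed instead of restored from a replica. The sketch below is a toy model of that idea under heavy simplifying assumptions; `ToyRDD` and its methods are invented for this post and are not Spark's API, and unlike real Spark this toy keeps every computed partition in memory automatically.

```python
from typing import Any, Callable, Dict, List


class ToyRDD:
    """A partitioned dataset that can rebuild any partition from its lineage."""

    def __init__(self, num_partitions: int, compute_partition: Callable[[int], List[Any]]):
        self.num_partitions = num_partitions
        self._compute_partition = compute_partition  # the lineage: how to rebuild
        self._cache: Dict[int, List[Any]] = {}       # partitions held in memory

    @classmethod
    def from_lists(cls, partitions: List[List[Any]]) -> "ToyRDD":
        data = [list(p) for p in partitions]  # stands in for stable storage
        return cls(len(data), lambda i: list(data[i]))

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Lazy: nothing is computed until a partition is actually requested.
        return ToyRDD(self.num_partitions, lambda i: [fn(x) for x in self.partition(i)])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(self.num_partitions, lambda i: [x for x in self.partition(i) if pred(x)])

    def partition(self, i: int) -> List[Any]:
        if i not in self._cache:
            # Missing (never computed, or lost): rebuild it from the parent.
            self._cache[i] = self._compute_partition(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        """Simulate a node failure that wipes one in-memory partition."""
        self._cache.pop(i, None)

    def collect(self) -> List[Any]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_lists([[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before_failure = even_squares.collect()
even_squares.lose_partition(1)           # pretend the machine holding it died
after_failure = even_squares.collect()   # partition 1 is recomputed via lineage
print(before_failure, after_failure)
```

Only the lost partition is recomputed; partition 0 is served from memory both times, which is the trade the paper's abstraction is built around.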
5.Dynamo/Cassandra
Dynamo: Amazon's Highly Available Key-value Store, SOSP07
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems.
Keywords: an early prototype of the cloud era
6.Pig
Pig Latin: A Not-So-Foreign Language for Data Processing, SIGMOD08
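The arrival of Pig marked the point where the Hadoop community set off on a path separate from Google's, and with it the start of big data's modern era. In my view, this shift was a milestone of historic significance.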
7.HIVE
Hive – A Petabyte Scale Data Warehouse Using Hadoop, ICDE10
8.Dremel
Dremel: Interactive Analysis of Web-Scale Datasets, VLDB10
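References:
1. Mu Li: What knowledge do you need to learn distributed systems?
2. Big Data Stories (5): The Sinking Microsoft and Dryad
3. Big Data Stories (7): The Soaring Latin Pig
4. Big Data Stories (8): The Early Days of HIVE
5. SOSP09 follow-up + paper commentary (detailed version, written up to the 2nd paper)
6. A paper that changed the course of the Internet | Dynamo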