Spark Summit Europe 2017 Session Summaries (Spark Ecosystem Track)

To download all of the videos and slides, follow the WeChat official account (bigdata_summit) and click the "Video Download" menu.

A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP

by Artem Aliev, DataStax

video,

Graph is on the rise and it's time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame motif API as we show examples of both side by side. No need to be familiar with graphs or Spark for this presentation, as we'll be explaining everything from the ground up! Session hashtag: #EUeco3
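
To give a concrete feel for the GraphFrames side, here is a minimal motif query in Scala on a toy graph. The data, column names and session setup below are illustrative only (not from the talk), and a roughly equivalent Gremlin traversal is noted in a comment.

```scala
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object MotifDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("motif-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy property graph: people connected by "knows" edges.
    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol")).toDF("id", "name")
    val edges = Seq(("a", "b", "knows"), ("b", "c", "knows")).toDF("src", "dst", "relationship")
    val g = GraphFrame(vertices, edges)

    // Motif: two-hop "friend of a friend" paths x -> y -> z.
    // A roughly equivalent Gremlin traversal would be: g.V().out("knows").out("knows")
    val foaf = g.find("(x)-[e1]->(y); (y)-[e2]->(z)")
    foaf.select("x.name", "z.name").show()

    spark.stop()
  }
}
```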

Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark

by Emily Curtin, The Weather Company / IBM

video, slide

spark-bench is an open-source benchmarking tool, and it's also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking. In particular, this talk will address the use of spark-bench in developing new features for Spark core. Session hashtag: #EUeco8

Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBase through Spark SQL

by Weiqing Yang, Hortonworks

video, slide

Both Spark and HBase are widely used, but how to use them together with high performance and simplicity is a very challenging topic. The Spark HBase Connector (SHC) provides feature-rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key-value store and complex relational SQL queries, and enables users to perform complex data analytics on top of HBase using Spark. SHC implements the standard Spark data source APIs and leverages the Spark Catalyst engine for query optimization. To achieve high performance, SHC constructs the RDD from scratch instead of using the standard HadoopRDD. With the customized RDD, all critical techniques can be applied and fully implemented, such as partition pruning, column pruning, predicate pushdown and data locality. The design makes maintenance easy while achieving a good tradeoff between performance and simplicity. In addition to fully supporting all the Avro schemas natively, SHC has also integrated natively with Phoenix data types. With SHC, Spark can execute batch jobs to read/write data from/into Phoenix tables. Phoenix can also read/write data from/into HBase tables created by SHC. For example, users can run a complex SQL query on top of an HBase table created by Phoenix inside Spark, perform a table join against a DataFrame which reads the data from a Hive table, or integrate with Spark Streaming to implement a more complicated system. In this talk, apart from explaining why SHC is of great use, we will also demo how SHC works, how to use SHC in secure/non-secure clusters, how SHC works with multiple secure HBase clusters, etc. This talk will also benefit people who use Spark and other data sources (besides HBase), as it inspires them with ideas of how to support high-performance data source access at the Spark DataFrame level. Session hashtag: #EUeco7
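
As a rough sketch of what the talk describes, the snippet below reads an HBase table through SHC's Spark SQL data source, following the pattern in the connector's documentation; the catalog JSON (table name and column mapping) is a made-up example, and the setup assumes the SHC package is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object ShcReadDemo {
  // Hypothetical catalog mapping a DataFrame schema onto an HBase table.
  val catalog: String =
    """{
      |  "table":   {"namespace": "default", "name": "contacts"},
      |  "rowkey":  "key",
      |  "columns": {
      |    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
      |    "name":  {"cf": "info",   "col": "name",  "type": "string"},
      |    "email": {"cf": "info",   "col": "email", "type": "string"}
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shc-demo").getOrCreate()

    // Read via the SHC data source; column pruning and predicate pushdown
    // are handled by the connector's custom RDD.
    val df = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    df.filter(df("name").startsWith("A")).select("name", "email").show()
    spark.stop()
  }
}
```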

Best Practices for Using Alluxio with Apache Spark

by Gene Pang, Alluxio, Inc.

video, slide

Alluxio, formerly Tachyon, is a memory-speed virtual distributed storage system that leverages memory for storing data and accelerating access to data in different storage systems. Many organizations and deployments use Alluxio with Apache Spark, and some of them scale out to petabytes of data. Alluxio can enable Spark to be even more effective, in both on-premise deployments and public cloud deployments. Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. In this talk, we briefly introduce Alluxio and present different ways Alluxio can help Spark jobs. We discuss best practices for using Alluxio with Spark, including RDDs and DataFrames, as well as on-premise deployments and public cloud deployments. Session hashtag: #EUeco2
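
A recurring best practice in this area is to persist shared working data in Alluxio and read it back through the alluxio:// scheme rather than relying only on Spark's own caching. A minimal sketch, assuming the Alluxio client is on the Spark classpath; the master address, paths and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object AlluxioDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("alluxio-demo").getOrCreate()

    // Placeholder Alluxio master address and paths; 19998 is the default master RPC port.
    val in  = "alluxio://alluxio-master:19998/data/events.parquet"
    val out = "alluxio://alluxio-master:19998/data/events_by_day.parquet"

    // Reading through Alluxio serves hot data at memory speed while the
    // underlying store (S3, HDFS, ...) remains the source of truth.
    val events = spark.read.parquet(in)
    val daily  = events.groupBy("day").count()

    // Writing back to Alluxio makes the result immediately available to
    // other Spark jobs without recomputation.
    daily.write.mode("overwrite").parquet(out)

    spark.stop()
  }
}
```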

Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spark Job-Server

by Arvind Heda, InfoEdge Ltd.

video, slide

Kapil Malik and Arvind Heda will discuss a solution for interactive querying of large-scale structured data, stored in a distributed file system (HDFS / S3), in a scalable and reliable manner using a unique combination of Spark SQL, Apache Zeppelin and Spark Job-server (SJS) on YARN. The solution is production tested and can cater to thousands of queries processing terabytes of data every day. It contains the following components:

1. Zeppelin server: A custom interpreter is deployed, which decouples the Spark context from the user notebooks. It connects to the remote Spark context on Spark Job-server. A rich set of APIs is exposed for the users. The user input is parsed, validated and executed remotely on SJS.
2. Spark Job-server: A custom application is deployed, which implements the set of APIs exposed on the Zeppelin custom interpreter as one or more Spark jobs.
3. Context router: It routes different user queries from the custom interpreter to one of many Spark Job-servers / contexts.

The solution has the following characteristics:

* Multi-tenancy: There are hundreds of users, each having one or more Zeppelin notebooks. All these notebooks connect to the same set of Spark contexts for running a job.
* Fault tolerance: The notebooks do not use the Spark interpreter, but a custom interpreter connecting to a remote context. If one Spark context fails, the context router sends user queries to another context.
* Load balancing: The context router identifies which contexts are under heavy load or responding slowly, and selects the most optimal context for serving a user query.
* Efficiency: We use Alluxio for caching common datasets.
* Elastic resource usage: We use Spark dynamic allocation for the contexts. This ensures that cluster resources are blocked by this application only when it's doing some actual work.

Session hashtag: #EUeco9
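
For context on the job-server piece, jobs deployed on the open-source Spark Job-server implement its SparkJob interface, roughly as sketched below; the package and trait names follow the spark-jobserver project, while the configuration key and query logic are placeholders, not the speakers' actual application.

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

object AdhocQueryJob extends SparkJob {
  // Validate the request before the job is scheduled on the shared context.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.path")) SparkJobValid
    else SparkJobInvalid("missing input.path")

  // Runs inside a long-lived, shared SparkContext managed by the job-server,
  // which is what lets many Zeppelin users reuse the same contexts.
  override def runJob(sc: SparkContext, config: Config): Any = {
    val path = config.getString("input.path")
    sc.textFile(path).count()
  }
}
```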

Running Spark Inside Docker Containers: From Workload to Cluster

by Haohai Ma, IBM

video, slide

This presentation describes the journey we went through in containerizing Spark workloads into multiple elastic Spark clusters in a multi-tenant Kubernetes environment. Initially we deployed Spark binaries onto a host-level filesystem, so the Spark drivers, executors and master can transparently migrate to run inside a Docker container by automatically mounting host-level volumes. In this environment, we do not need to prepare a specific Spark image in order to run Spark workloads in containers. We then utilized Kubernetes Helm charts to deploy a Spark cluster. The administrator could further create a Spark instance group for each tenant. A Spark instance group, which is akin to the Spark notion of a tenant, is logically an independent kingdom for a tenant's Spark applications in which they own dedicated Spark masters, history server, shuffle service and notebooks. Once a Spark instance group is created, it automatically generates its image and commits it to a specified repository. Meanwhile, from Kubernetes' perspective, each Spark instance group is a first-class deployment and thus the administrator can scale its size up or down according to the tenant's SLA and demand. In a cloud-based data center, each Spark cluster can provide Spark as a service while sharing the Kubernetes cluster. Each tenant that is registered into the service gets a fully isolated Spark instance group. In an on-prem Kubernetes cluster, each Spark cluster can map to a business unit, and thus each user in the BU can get a dedicated Spark instance group. The next step on this journey will address resource sharing across Spark instance groups by leveraging new Kubernetes features (Kubernetes31068/9), as well as elastic workload containers depending on job demands (SPARK-18278). Demo: youtube.com/watch? Session hashtag: #EUeco5

Smack Stack and Beyond—Building Fast Data Pipelines

by Jorg Schad, Mesosphere

video, slide

There is an ever-increasing number of use cases, like online fraud detection, for which the response times of traditional batch processing are too slow. In order to be able to react to such events in close to real time, you need to go beyond classical batch processing and utilize stream processing systems such as Apache Spark Streaming, Apache Flink, or Apache Storm. These systems, however, are not sufficient on their own. For an efficient and fault-tolerant setup, you also need a message queue and a storage system. One common example of a fast data pipeline is the SMACK stack. SMACK stands for:

Spark (Streaming) – the stream processing system
Mesos – the cluster orchestrator
Akka – the system for providing custom actors for reacting upon the analyses
Cassandra – the storage system
Kafka – the message queue

Setting up this kind of pipeline in a scalable, efficient and fault-tolerant manner is not trivial. First, this workshop will discuss the different components in the SMACK stack. Then, participants will get hands-on experience in setting up and maintaining data pipelines. Session hashtag: #EUeco1
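
To make the pipeline concrete, here is a minimal sketch of its Spark Streaming leg, consuming from Kafka and persisting to Cassandra via the DataStax spark-cassandra-connector; the broker address, topic, keyspace and table names are invented for illustration, and the "scoring" step is a trivial stand-in.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

object FraudPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("smack-fraud-demo")
      .set("spark.cassandra.connection.host", "cassandra") // placeholder host
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092", // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "fraud-detector",
      "auto.offset.reset" -> "latest"
    )

    // Consume raw transactions from Kafka (the "K" in SMACK).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("transactions"), kafkaParams)
    )

    // Trivial stand-in for scoring, then persist to Cassandra (the "C").
    stream
      .map(record => (record.key, record.value, record.value.length))
      .saveToCassandra("fraud", "scored_transactions", SomeColumns("id", "payload", "score"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```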

Testing Apache Spark—Avoiding the Fail Boat Beyond RDDs

by Holden Karau, IBM

video,

As Spark continues to evolve, we need to revisit our testing techniques to support Datasets, streaming, and more. This talk expands on "Beyond Parallelize and Collect" (not required to have been seen) to discuss how to create large-scale test jobs while supporting Spark's latest features. We will explore the difficulties of testing streaming programs, options for setting up integration testing with Spark beyond just local mode, and best practices for acceptance tests. Session hashtag: #EUeco4
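
As a baseline for the techniques the talk builds on, here is a minimal local-mode DataFrame test in ScalaTest; the test name and data are illustrative, and Holden's spark-testing-base library packages this kind of setup into reusable traits.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class WordCountSpec extends AnyFunSuite {

  // A shared local-mode session keeps unit tests cheap; integration tests
  // (the harder part discussed in the talk) need a real or containerized cluster.
  private lazy val spark: SparkSession =
    SparkSession.builder().master("local[2]").appName("wordcount-test").getOrCreate()

  test("counts words in a small Dataset") {
    import spark.implicits._

    val words  = Seq("spark", "spark", "test").toDS()
    val counts = words.groupBy("value").count().as[(String, Long)].collect().toMap

    assert(counts == Map("spark" -> 2L, "test" -> 1L))
  }
}
```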

Variant-Apache Spark for Bioinformatics

by Piotr Szul, CSIRO Data61

video, slide

This talk will showcase work done by the bioinformatics team at CSIRO in Sydney, Australia to make Spark more useful and usable for the bioinformatics community. They have created a custom library, variant-spark, which provides a DSL and also a custom implementation of Spark ML via random forests for genomic pipeline processing. We've created a demo, using their 'Hipster-genome' and a Databricks notebook, to better explain their library to the world-wide bioinformatics community. The notebook also compares results with another popular genomics library (HAIL.io). Session hashtag: #EUeco6
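
variant-spark's own DSL is not reproduced here; purely as a frame of reference, the sketch below shows what a stock Spark ML random-forest fit looks like on a generic feature DataFrame, which is the kind of workload the library specializes for genomic data. The input path, column names and parameters are invented.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rf-sketch").getOrCreate()

    // Placeholder input: one row per sample, a label column plus numeric features.
    val data = spark.read.parquet("/path/to/samples.parquet")

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3")) // invented feature columns
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(100)

    val model = rf.fit(assembler.transform(data))

    // Variable importances are the key output for this kind of analysis.
    println(model.featureImportances)
    spark.stop()
  }
}
```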

Recommended reading:

How to compute Cartesian products quickly with Spark?
What exactly makes Scala so good?
How to learn Scala well? A science-based timeline
The Road to Silicon Valley 38: Spark Made Simple (3): What is Standalone?
7 Python tools every data scientist must know: how many do you use?

TAG: Spark | Hadoop | Big Data