Spark Summit Europe 2017 Session Summaries (Developer Track)

To download all videos and slides, follow the WeChat public account (bigdata_summit) and click the "Video Download" menu.

A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets

by Jules Damji, Databricks

video, slide

Of all the developers' delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs (RDDs, DataFrames, and Datasets) available in Apache Spark 2.x. In particular, I will emphasize three takeaways: 1) why and when you should use each set as a best practice; 2) their performance and optimization benefits; and 3) the scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and how to interoperate among them. (This talk is a vocalization of the blog post, along with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: databricks.com/blog/201…) Session hashtag: #EUdev12
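
Below is a minimal sketch (not the talk's notebook code) of the same records expressed as an RDD, a Dataset, and a DataFrame, and of moving between the three APIs. The `Device` case class, its fields, and the sample values are assumptions made purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for this sketch.
case class Device(id: Long, name: String, temp: Double)

object ThreeApis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("three-apis").getOrCreate()
    import spark.implicits._

    // RDD: the low-level API over JVM objects, with no Catalyst optimization.
    val rdd = spark.sparkContext.parallelize(
      Seq(Device(1, "sensor-a", 20.5), Device(2, "sensor-b", 35.1)))

    // Dataset: a typed view of the same data, optimized by Catalyst/Tungsten.
    val ds = rdd.toDS()

    // DataFrame: Dataset[Row], i.e. the untyped, column-oriented view.
    val df = ds.toDF()

    // Interoperating among the three: filter as a DataFrame, come back to a
    // typed Dataset, then drop down to an RDD if needed.
    val hot = df.filter($"temp" > 30.0).as[Device]
    val hotRdd = hot.rdd

    hot.show()
    spark.stop()
  }
}
```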

An Adaptive Execution Engine For Apache Spark SQL

by Carson Wang, Intel

video, slide

Catalyst is an excellent optimizer in Spark SQL, providing an open interface for rule-based optimization at the planning stage. However, static (rule-based) optimization does not take runtime data distribution into account. A technology called Adaptive Execution, introduced since Spark 2.0, aims to cover this gap, but it is still at an early stage. We enhanced the existing Adaptive Execution feature, focusing on adjusting the execution plan at runtime according to the intermediate outputs of different stages: setting partition numbers for joins and aggregations, avoiding unnecessary data shuffling and disk I/O, handling data-skew cases, and even optimizing the join order as a CBO would. In our benchmark comparison experiments, this feature saves the huge manual effort of tuning parameters such as the shuffle partition number, which is error-prone and misleading. In this talk, we will present the new adaptive execution framework, task scheduling, the failover retry mechanism, runtime plan switching, and more. Finally, we will share our experience benchmarking TPCx-BB at the 100-300 TB scale on a Spark cluster of hundreds of bare-metal nodes. Session hashtag: #EUdev4
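
As a rough illustration of what runtime adjustment replaces, the sketch below enables the experimental adaptive execution flags available in stock Spark 2.x. The table names and the 64 MB target size are assumptions for illustration, and the enhanced framework described in the talk goes well beyond these settings.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("adaptive-execution-sketch")
  // Let Spark decide post-shuffle partition counts at runtime instead of
  // relying on a fixed spark.sql.shuffle.partitions value.
  .config("spark.sql.adaptive.enabled", "true")
  // Approximate target input size per post-shuffle partition (64 MB here).
  .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", (64 * 1024 * 1024).toString)
  .getOrCreate()

// Joins and aggregations now derive their reducer counts from the actual
// map-output statistics of the previous stage.
val joined = spark.table("sales").join(spark.table("customers"), "customer_id")
joined.groupBy("region").count().show()
```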

Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Methodologies

by Luca Canali, CERN

video, slide

This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators, and performance practitioners. You will find examples illustrating the importance of using the right tools and the right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root-cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant to performance analysis, including tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metrics and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, and tools for profiling CPU usage and for Flame Graph visualization. Session hashtag: #EUdev2
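
For context, here is a minimal sketch of how sparkMeasure is typically used to collect task metrics around a workload. It assumes an active SparkSession named `spark` (as in spark-shell or a notebook) and the spark-measure package on the classpath; the exact API may differ between versions.

```scala
// Assumes e.g. --packages ch.cern.sparkmeasure:spark-measure_2.11:<version>
import ch.cern.sparkmeasure.StageMetrics

val stageMetrics = StageMetrics(spark)

// Runs the enclosed workload and prints aggregated stage/task metrics
// (elapsed time, executor CPU time, shuffle and I/O bytes, ...).
stageMetrics.runAndMeasure {
  spark.sql("SELECT count(*) FROM range(1000) CROSS JOIN range(1000)").show()
}
```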

Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Big Data Analytics

by Akshay Rai, LinkedIn

video, slide

Is your job running slower than usual? Do you want to make sense of the thousands of Hadoop and Spark metrics? Do you want to monitor the performance of your flows, get alerts, and auto-tune them? These are common questions every Hadoop user asks, but there is no single solution that addresses them. At LinkedIn we faced many such issues and built a simple self-serve tool for Hadoop users called Dr. Elephant. Dr. Elephant, which is already open sourced, is a performance monitoring and tuning tool for Hadoop and Spark. It tries to improve developer productivity and cluster efficiency by making it easier to tune jobs. Since being open sourced, it has been adopted by multiple organizations and followed with a lot of interest in the Hadoop and Spark community. In this talk, we will discuss Dr. Elephant and outline our efforts to expand its scope into a comprehensive monitoring, debugging, and tuning tool for Hadoop and Spark applications. We will talk about how Dr. Elephant performs exception analysis, gives clear and specific tuning suggestions, tracks metrics, and monitors their historical trends. Open source: github.com/linkedin/dr-… Session hashtag: #EUdev9

Extending Apache Spark SQL Data Source APIs with Join Push Down

by Ioana Delaney, IBM

video, slide

When Spark applications operate on distributed data coming from disparate data sources, they often have to directly query data sources external to Spark, such as backing relational databases or data warehouses. For that, Spark provides Data Source APIs, which are a pluggable mechanism for accessing structured data through Spark SQL. Data Source APIs are tightly integrated with the Spark Optimizer and provide optimizations such as filter push down to the external data source and column pruning. While these optimizations significantly speed up Spark query execution, depending on the data source, they only provide a subset of the functionality that can be pushed down and executed at the data source. As part of our ongoing project to provide a generic data source push down API, this presentation will show our work related to join push down. An example is the star-schema join, which can simply be viewed as filters applied to the fact table. Today, the Spark Optimizer recognizes star-schema joins based on heuristics and executes star-joins using efficient left-deep trees. An alternative execution proposed by this work is to push down the star-join to the external data source in order to take advantage of multi-column indexes defined on the fact tables, and of other star-join optimization techniques implemented by the relational data source. Session hashtag: #EUdev7
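
To make the baseline concrete, the sketch below shows the kind of push down the existing Data Source API already supports: column pruning and filters via `PrunedFilteredScan`. The relation, its schema, and the `fact_table` name are hypothetical, and the join push down API proposed in the talk is an extension of this idea rather than what is shown here.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// A toy relation standing in for an external relational source.
class JdbcLikeRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("id", LongType), StructField("region", StringType)))

  // Spark hands us only the columns and filters it needs; a real implementation
  // would translate them into a query executed by the external source.
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    val pushedQuery = s"SELECT ${requiredColumns.mkString(", ")} FROM fact_table " +
      s"/* WHERE clause built from ${filters.length} pushed filters */"
    // Placeholder result; a real relation would run pushedQuery remotely.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```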

Extending Apache Spark's Ingestion: Building Your Own Java Data Source

by Jean Georges Perrin, Oplo

video, slide

Apache Spark is a wonderful platform for running your analytics jobs. It has great ingestion features for CSV, Hive, JDBC, etc. However, you may have your own data sources or formats you want to use. Your solution could be to convert your data into a CSV or JSON file and then ask Spark to ingest it through its built-in tools. However, for enhanced performance, we will explore how to build a data source, in Java, to extend Spark's ingestion capabilities. We will first understand how Spark handles ingestion, then walk through the development of this data source plug-in. Targeted audience: software and data engineers who need to expand Spark's ingestion capability. Key takeaways: requirements, needs & architecture – 15%; building the required tool set in Java – 85%. Session hashtag: #EUdev6
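
As a usage-side sketch (the talk itself builds the plug-in in Java), this is roughly how a custom data source is consumed once it is on the classpath. It assumes an active SparkSession named `spark`; the short name "x-files" and its options are invented for illustration and are not the talk's plug-in.

```scala
// Hypothetical custom format: Spark resolves "x-files" to a relation provider
// implementation shipped in your own jar.
val df = spark.read
  .format("x-files")
  .option("path", "/data/incoming")   // assumed option understood by the plug-in
  .load()

df.printSchema()
df.show(5)
```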

Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud

by Michael McCune, Red Hat

video, slide

Writing intelligent cloud-native applications is hard enough when things go well, but what happens when performance and debugging issues arise in production? Inspecting the logs is a good start, but what if the logs don't show the whole picture? Now you have to go deeper, examining the live performance metrics that are generated by Spark, or even deploying specialized microservices to monitor and act upon that data. Spark provides several built-in sinks for exposing metrics data about the internal state of its executors and drivers, but getting at that information when your cluster is in the cloud can be a time-consuming and arduous process. In this presentation, Michael McCune will walk through the options available for gaining access to the metrics data even when a Spark cluster lives in a cloud-native containerized environment. Attendees will see demonstrations of techniques that will help them integrate a full-fledged metrics story into their deployments. Michael will also discuss the pain points and challenges around publishing this data outside of the cloud and explain how to overcome them. In this talk you will learn about: deploying metrics sinks as microservices, common configuration options, and accessing metrics data through a variety of mechanisms. Session hashtag: #EUdev11
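
As one concrete, hedged example of a configuration option, the sketch below wires Spark's metrics system to a Graphite sink using the `spark.metrics.conf.*` property form; the Graphite endpoint is an assumption, and the same entries can equally live in conf/metrics.properties.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metrics-sink-sketch")
  // Equivalent to entries in conf/metrics.properties, expressed as Spark conf.
  .config("spark.metrics.conf.*.sink.graphite.class",
          "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com") // assumed endpoint
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .getOrCreate()

// Driver and executor metrics are now reported to the external sink, where a
// collector outside the containerized cluster can pick them up.
```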

From Pipelines to Refineries: Building Complex Data Applications with Apache Spark

by Tim Hunter, Databricks

video, slide

Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need of rapid iterations. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse. Session hashtag: #EUdev1
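
The framework in the talk is experimental and not shown here; as a minimal sketch of the underlying functional-programming idea, pipeline steps can be modeled as pure DataFrame-to-DataFrame functions and composed. The step names and paths below are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object RefinerySketch {
  // Each pipeline step is a pure transformation over a DataFrame.
  type Step = DataFrame => DataFrame

  val dropNulls: Step = _.na.drop()
  val dedupe: Step    = _.dropDuplicates()
  def tagRun(id: String): Step = df => df.withColumn("run_id", lit(id))

  // Compose many steps into one transformation; composition stays lazy until
  // an action is triggered.
  def pipeline(steps: Seq[Step]): Step = steps.reduce(_ andThen _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("refinery-sketch").getOrCreate()
    val raw = spark.read.parquet("/data/raw/events") // hypothetical input path
    val refined = pipeline(Seq(dropNulls, dedupe, tagRun("2017-10-26")))(raw)
    refined.write.mode("overwrite").parquet("/data/refined/events")
    spark.stop()
  }
}
```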

Lessons From the Field: Applying Best Practices to Your Apache Spark Applications

by Silvio Fiorito, Databricks

video, slide

Apache Spark is an excellent tool to accelerate your analytics, whether you're doing ETL, Machine Learning, or Data Warehousing. However, to really make the most of Spark it pays to understand best practices for data storage, file formats, and query optimization. This talk will cover best practices I've applied over years in the field helping customers write Spark applications, as well as identifying which patterns make sense for your use case. Session hashtag: #EUdev5
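
One common pattern along these lines (not the speaker's code; paths and column names are hypothetical, and an active SparkSession named `spark` is assumed) is storing data as partitioned columnar files so queries can prune both partitions and columns.

```scala
// Land raw data, then persist it as date-partitioned Parquet.
val events = spark.read.json("/data/landing/events")   // assumed raw input

events.write
  .partitionBy("event_date")        // enables partition pruning on this column
  .mode("overwrite")
  .parquet("/data/warehouse/events")

// Reads only the matching date partitions and only the referenced columns.
spark.read.parquet("/data/warehouse/events")
  .where("event_date = '2017-10-25'")
  .groupBy("event_type")
  .count()
  .show()
```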

Optimal Strategies for Large-Scale Batch ETL Jobs

by Emma Tang, Neustar

video, slide

The ad tech industry processes large volumes of pixel and server-to-server data for each online user's click, impression, and conversion events. At Neustar, we process 10+ billion events per day, and all of our events are fed through a number of Spark ETL batch jobs. Many of our Spark jobs process over 100 terabytes of data per run, and each job runs to completion in around 3.5 hours. This means we needed to optimize our jobs in specific ways to achieve massive parallelization while keeping memory usage (and cost) as low as possible. Our talk is focused on strategies for dealing with extremely large data. We will talk about the things we learned and the mistakes we made, including: optimizing memory usage using Ganglia; optimizing partition counts for different types of stages and effective joins; counterintuitive strategies for materializing data to maximize efficiency; Spark default settings specific to large-scale jobs, and why they matter; running Spark on Amazon EMR with more than 3,200 cores; reviewing the different types of errors and stack traces that occur in large-scale jobs and how to read and handle them; how to deal with the large number of map output statuses when 100k partitions join with 100k partitions; how to prevent serialization buffer overflow as well as map output status buffer overflow, which can easily happen when data is extremely large; and how to effectively use partitioners to combine stages and minimize shuffle. Session hashtag: #EUdev3
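
A minimal sketch of the kind of knobs involved (values and paths are illustrative, not the talk's recommendations): shuffle parallelism, Kryo buffer limits to guard against serialization buffer overflow, and pre-partitioning both join sides on the join key.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("large-batch-etl-sketch")
  // Enough reducers for very wide joins; the right value depends on data size.
  .config("spark.sql.shuffle.partitions", "100000")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Guards against serialization buffer overflow on very large records.
  .config("spark.kryoserializer.buffer.max", "512m")
  .getOrCreate()

// Hash-partition both sides on the join key with the same partition count so
// the join itself does not need yet another exchange.
val impressions = spark.read.parquet("/data/impressions").repartition(100000, col("user_id"))
val conversions = spark.read.parquet("/data/conversions").repartition(100000, col("user_id"))
val joined = impressions.join(conversions, "user_id")
joined.write.mode("overwrite").parquet("/data/joined")
```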

Storage Engine Considerations for Your Apache Spark Applications

by Mladen Kovacevic, Cloudera

video, slide

You have the perfect use case for your Spark applications, whether it be batch processing or super-fast near-real-time streaming. Now, where do you store your valuable data? In this talk we take a look at four storage options: HDFS, HBase, Solr, and Kudu. With so many to choose from, which will fit your use case? What considerations should be taken into account? What are the pros and cons, what are the similarities and differences, and how do they fit in with your Spark application? Learn the answers to these questions and more with a look at design patterns and techniques, and sample code to integrate into your application immediately. Walk away with the confidence to propose the right architecture for your use cases and the development know-how to implement and deliver it with success. Session hashtag: #EUdev10

Supporting Highly Multitenant Spark Notebook Workloads: Best Practices and Useful Patches

by Brad Kaiser, IBM

video, slide

Notebooks: they enable our users, but they can cripple our clusters. Let's fix that. Notebooks have soared in popularity at companies worldwide because they provide an easy, user-friendly way of accessing the cluster-computing power of Spark. But the more users you have hitting a cluster, the harder it is to manage the cluster resources as big, long-running jobs start to starve out small, short-running jobs. While you could have users spin up EMR-style clusters, this reduces the ability to take advantage of the collaborative nature of notebooks. It also quickly becomes expensive as clusters sit idle for long periods of time waiting on single users. What we want is fair, efficient resource utilization on a large single cluster for a large number of users. In this talk we'll discuss dynamic allocation and the best practices for configuring the current version of Spark as-is to help solve this problem. We'll also present new improvements we've made to address this use case, including: decommissioning executors without losing cached data, proactively shutting down executors to prevent starvation, and improving the start times of new executors. Session hashtag: #EUdev8
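
For the "configure current Spark as-is" part, below is a minimal sketch of dynamic allocation settings commonly used on shared notebook clusters; the values are illustrative, not recommendations from the talk.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shared-notebook-session")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")   // shuffle files must outlive released executors
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  // Idle executors are returned to the cluster for other notebook users.
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Executors holding cached data are kept around longer before reclamation.
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")
  .getOrCreate()
```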

