Spark Summit Europe 2017 Session Summaries (Engineering Category)

To download all of the videos and slides, follow the WeChat official account (bigdata_summit) and click the "Video Download" menu.

Apache Spark Pipelines in the Cloud with Alluxio

by Gene Pang, Alluxio, Inc.

video, slide

Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics take the form of data processing pipelines: a series of processing stages where each stage performs a particular function and the output of one stage is the input of the next. There are several examples of pipelines, such as log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and it slows data sharing and job completion times. Using Alluxio, a memory-speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this results in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance, and how Alluxio can improve completion times and reduce performance variability for Spark pipelines in the cloud. Session hashtag: #EUde5
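
The integration point is that Spark can read and write Alluxio through its Hadoop-compatible alluxio:// URI scheme, so intermediate data can be exchanged between stages without code changes beyond the path. A minimal sketch, assuming the Alluxio client library is on the Spark classpath; the master hostname, bucket and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: two pipeline stages exchanging intermediate data through Alluxio.
// "alluxio-master", port 19998 and all paths are placeholder values.
val spark = SparkSession.builder().appName("alluxio-pipeline").getOrCreate()

// Stage 1: write intermediate results to Alluxio instead of S3/Blob/GCS,
// so they stay in memory for the next stage.
val cleaned = spark.read.json("s3a://my-bucket/raw/events/")   // hypothetical source
  .filter("status = 'OK'")
cleaned.write.parquet("alluxio://alluxio-master:19998/pipeline/stage1/")

// Stage 2 (possibly a separate Spark job): read the shared data back at memory speed.
val stage1 = spark.read.parquet("alluxio://alluxio-master:19998/pipeline/stage1/")
stage1.groupBy("userId").count()
  .write.parquet("alluxio://alluxio-master:19998/pipeline/stage2/")
```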

Beyond Unit Tests: End-to-End Testing for Spark Workflows

by Anant Nag, LinkedIn

video, slide

As a Spark developer, do you want to quickly develop your Spark workflows? Do you want to test your workflows in a sandboxed environment similar to production? Do you want to write end-to-end tests for your workflows and add assertions on top of them? In just a few years, the number of users writing Spark jobs at LinkedIn has grown from tens to hundreds, and the number of jobs running every day has grown from hundreds to thousands. With the ever increasing number of users and jobs, it becomes crucial to reduce the development time for these jobs. It is also important to test these jobs thoroughly before they go to production. Currently, there is no way users can test their Spark jobs end-to-end; the only option is to divide the jobs into functions and unit-test those functions. We've tried to address these issues by creating a testing framework for Spark workflows. The testing framework enables users to run their jobs in an environment similar to the production environment and on data sampled from the original data. The testing framework consists of a test deployment system, a data generation pipeline to generate the sampled data, a data management system to help users manage and search the sampled data, and an assertion engine to validate the test output. In this talk, we will discuss the motivation behind the testing framework before deep diving into its design. We will further discuss how the testing framework is helping Spark users at LinkedIn to be more productive. Session hashtag: #EUde12
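
LinkedIn's framework itself is not shown in the abstract, but the general idea of asserting on a whole workflow's output rather than on individual functions can be illustrated with a plain ScalaTest sketch: run the job against sampled input in a sandboxed local session, then assert on the resulting data. The workflow object, schema and paths below are entirely hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.count
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical workflow under test; in reality this would be the production job's entry point.
object SessionizeWorkflow {
  def run(spark: SparkSession, pageViews: DataFrame): DataFrame =
    pageViews.groupBy("member_id").agg(count("*").as("session_count"))
}

// End-to-end test: run the whole workflow on sampled data and assert on its
// output, instead of unit-testing individual functions.
class SessionizeWorkflowTest extends AnyFunSuite {
  test("produces one row per sampled member") {
    val spark = SparkSession.builder()
      .master("local[2]")                 // sandboxed, production-like session
      .appName("e2e-test")
      .getOrCreate()

    // Sampled copy of the production input (placeholder path).
    val input = spark.read.parquet("src/test/resources/sampled/page_views")

    val output = SessionizeWorkflow.run(spark, input)

    // Assertions over the end-to-end result.
    assert(output.count() === input.select("member_id").distinct().count())
    assert(output.columns.contains("session_count"))
    spark.stop()
  }
}
```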

High Performance Enterprise Data Processing with Apache Spark

by Sandeep Varma, ZS Associates

video, slide

Data engineering to support reporting and analytics for commercial life sciences groups consists of very complex, interdependent processing with highly complex business rules (thousands of transformations on hundreds of data sources). We will talk about our experiences in building a very high performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance. We will touch upon optimizing enterprise-grade Spark architecture for data warehousing and data mart type applications, optimizing end-to-end pipelines for extreme performance, running hundreds of jobs in parallel in Spark, orchestrating across multiple Spark clusters, and some guidelines for high speed platform and application development within enterprises. Key takeaways: an example architecture for complex data warehousing and data mart applications on Spark; an architecture for building high performance Spark platforms for enterprises that balances functionality with total cost of ownership; orchestrating multiple elastic Spark clusters while running hundreds of jobs in parallel; and the business benefits of high performance data engineering, especially for life sciences. Session hashtag: #EUde3
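
One of the techniques named above, running many jobs in parallel inside a single Spark application, typically relies on the FAIR scheduler plus concurrent job submission from the driver. A hedged sketch of that pattern; the pool names, mart names and warehouse paths are illustrative only and not the speaker's implementation:

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Sketch: submit many independent Spark jobs concurrently from one driver,
// using the FAIR scheduler so they share executors instead of queueing FIFO.
val spark = SparkSession.builder()
  .appName("parallel-marts")
  .config("spark.scheduler.mode", "FAIR")        // enable fair scheduling
  .getOrCreate()

val marts = Seq("sales", "claims", "rx")         // illustrative data-mart names

val jobs = marts.map { mart =>
  Future {
    // Each concurrent job runs in its own scheduler pool (created lazily).
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", mart)
    spark.read.parquet(s"/warehouse/staging/$mart")      // placeholder paths
      .groupBy("region").count()
      .write.mode("overwrite").parquet(s"/warehouse/marts/$mart")
  }
}

Await.result(Future.sequence(jobs), Duration.Inf)
```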

How to Share State Across Multiple Apache Spark Jobs using Apache Ignite

by Akmal Chaudhri, GridGain

video, slide

Attend this session to learn how to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications, using the implementation of the Spark RDD abstraction provided in Apache Ignite. During the talk, attendees will learn in detail how IgniteRDD, an implementation of the native Spark RDD and DataFrame APIs, shares the state of the RDD across other Spark jobs, applications and workers. Examples will show how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames. Session hashtag: #EUde9
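
A minimal sketch of the pattern as Ignite's Spark integration documents it: an IgniteContext backs an IgniteRDD by a named cache, so a second job, or a different Spark application attached to the same Ignite cluster, sees the same keys. The cache name and configuration below are placeholders.

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.SparkContext

// Sketch: job A writes pairs into an Ignite cache; job B (possibly another
// Spark application on the same Ignite cluster) reads the shared state back.
def writeJob(sc: SparkContext): Unit = {
  val ic = new IgniteContext(sc, () => new IgniteConfiguration())  // default config; placeholder
  val shared = ic.fromCache[Int, Int]("sharedState")               // cache name is illustrative
  shared.savePairs(sc.parallelize(1 to 100000).map(i => (i, i * i)))
}

def readJob(sc: SparkContext): Unit = {
  val ic = new IgniteContext(sc, () => new IgniteConfiguration())
  val shared = ic.fromCache[Int, Int]("sharedState")
  // The state written by the other job is visible here, across jobs and applications.
  println(shared.filter(_._2 > 90000).count())
}
```

When the underlying cache is configured with indexed types, IgniteRDD additionally exposes a sql(...) method, which is where the indexed SQL speed-up mentioned in the abstract comes in.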

Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production

by Brandon Carl, Facebook

video, slide

With more than 700 million monthly active users, Instagram continues to make it easier for people across the globe to join the community, share their experiences, and strengthen connections to their friends and passions. Powering Instagram's various products requires the use of machine learning, high performance ranking services, and most importantly large amounts of data. At Instagram, we use Apache Spark for several critical production pipelines, including generating labeled training data for our machine learning models. In this session, you'll learn about how one of Instagram's largest Spark pipelines has evolved over time in order to process ~300 TB of input and ~90 TB of shuffle data. We'll discuss the experience of building and managing such a large production pipeline and some tips and tricks we've learned along the way to manage Spark at scale. Topics include migrating from RDD to Dataset for better memory efficiency, splitting up long-running pipelines in order to better tune intermediate shuffle data, and dealing with changing data skew over time. Finally, we will also go over some optimizations we have made in order to maintain the reliability of this critical data pipeline. Session hashtag: #EUde0
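
One of the tips mentioned, migrating from RDD to Dataset for better memory efficiency, comes down to letting Spark's encoders keep rows in compact Tungsten binary form instead of JVM objects. An illustrative before/after; the Impression schema and path are invented, not Instagram's actual types:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative schema; the real pipeline's record types are not public.
case class Impression(userId: Long, mediaId: Long, label: Double)

val spark = SparkSession.builder().appName("rdd-to-dataset").getOrCreate()
import spark.implicits._

// Before: an RDD of case-class objects, stored on-heap as JVM objects.
val rdd = spark.sparkContext
  .textFile("/data/impressions.csv")                    // placeholder path
  .map(_.split(","))
  .map(a => Impression(a(0).toLong, a(1).toLong, a(2).toDouble))

// After: Dataset[Impression] uses Tungsten encoders, so the same records are
// held in compact binary form and caching/shuffle use far less memory.
val ds = rdd.toDS()
val training = ds.filter($"label" > 0).groupBy($"mediaId").count()
```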

Real-Time Detection of Anomalies in the Database Infrastructure using Apache Spark

by Daniel Lanza, CERN

video, slide

At CERN, the biggest physics laboratory in the world, large volumes of data are generated every hour, which poses serious challenges for storing and processing all of it. An important part of this responsibility falls to the database group, which provides not only RDBMS services but also scalable systems such as Hadoop, Spark and HBase. Since databases are critical, they need to be monitored; for that we have built a highly scalable, secure and central repository that stores consolidated audit data along with listener, alert and OS log events generated by the databases. This central platform is used for reporting, alerting and security policy management. The database group wants to further exploit the information available in this central repository to build an intrusion detection system that enhances the security of the database infrastructure, and to build pattern detection models that flush out anomalies using the monitoring and performance metrics available in the central repository. Finally, this platform also helps us with capacity planning of the database deployment. The audience will get first-hand experience of how to build a real-time Apache Spark application that is deployed in production. They will hear about the challenges faced and decisions taken while developing the application and troubleshooting Apache Spark and Spark Streaming applications in production. Session hashtag: #EUde13
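
The talk centers on a production Spark streaming job over consolidated audit and log events. As a hedged illustration of the shape such a job can take, here is a Structured Streaming sketch that flags hosts with an unusually high error rate; the Kafka topic, schema and threshold rule are all invented and are not CERN's actual detection logic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sketch only: flag database hosts whose error-event count in a 5-minute
// window exceeds a fixed threshold. Brokers, topic, schema and threshold are invented.
val spark = SparkSession.builder().appName("db-anomaly-detection").getOrCreate()

val schema = new StructType()
  .add("host", StringType)
  .add("event_type", StringType)
  .add("ts", TimestampType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")   // placeholder brokers
  .option("subscribe", "db_audit_events")            // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

val anomalies = events
  .where(col("event_type") === "ERROR")
  .withWatermark("ts", "10 minutes")
  .groupBy(window(col("ts"), "5 minutes"), col("host"))
  .count()
  .where(col("count") > 100)                         // illustrative threshold

anomalies.writeStream
  .outputMode("update")
  .format("console")                                 // an alerting sink would go here
  .start()
  .awaitTermination()
```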

The State of Apache Spark in the Cloud

by Nicolas Poggi, Barcelona Super Computing

video, slide

Cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Spark and Hive come ready to use, with general-purpose configuration and upgrade management. Over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the release of v2, making it challenging to keep up-to-date production services both on-premises and in the cloud for compatibility and stability. Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance of major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, and Google Dataproc, with an on-premises commodity cluster as baseline. Nicolas uses BigBench, the brand new standard (TPCx-BB) for big data systems, with both Spark and Hive implementations for benchmarking the systems. BigBench combines SQL queries, MapReduce, user code (UDF), and machine learning, which makes it ideal for stressing Spark libraries (SparkSQL, DataFrames, MLlib, etc.). The work is framed within the ALOJA research project, which features an open source benchmarking and analysis platform that has recently been extended to support SQL-on-Hadoop engines and BigBench. The ALOJA project aims to lower the total cost of ownership (TCO) of big data deployments and study their performance characteristics for optimization. Nicolas highlights how to easily repeat the benchmarks through ALOJA and benefit from BigBench to optimize a Spark cluster for advanced users. The work is a continuation of a paper to be published at the IEEE Big Data 16 conference. Session hashtag: #EUde6
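
To make "SQL queries plus user code (UDF)" concrete, here is a trivial, hypothetical example of the pattern such benchmark queries exercise in SparkSQL: register a Scala function as a UDF and call it from SQL. The table, column and banding rule are placeholders, not an actual BigBench query.

```scala
import org.apache.spark.sql.SparkSession

// Trivial illustration of mixing SQL and user code (UDF); not a BigBench query.
val spark = SparkSession.builder().appName("sql-plus-udf").getOrCreate()

// Hypothetical sales table registered as a temporary view (placeholder path).
spark.read.parquet("/data/store_sales").createOrReplaceTempView("store_sales")

// User code: a Scala function exposed to SQL.
spark.udf.register("price_band", (price: Double) =>
  if (price < 10) "low" else if (price < 100) "mid" else "high")

spark.sql(
  """SELECT price_band(ss_sales_price) AS band, COUNT(*) AS cnt
    |FROM store_sales
    |GROUP BY price_band(ss_sales_price)""".stripMargin).show()
```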

Using Apache Spark in the Cloud—A Devops Perspective

by Telmo Oliveira, Toon

video, slide

Toon is a leading brand in the European smart energy market, currently expanding internationally, providing energy usage insights, eco-friendly energy management and smart thermostat use for the connected home. As value added services become ever more relevant in this market, we need to ensure that we can easily and safely onboard new tenants into our data platform. In this talk we're going to guide you through a less discussed side of using Spark in production: devops. We will speak about our journey from an on-premise cluster to a managed solution in the cloud. A lot of moving parts were involved: ETL flows, data sharing with third parties and data migration to the new environment. Add to this the need to have a multi-tenant environment, revamp our toolset and deploy a live public-facing service. It's easy to find great examples of how Spark is used for data-science purposes; on the data engineering side, we need to deploy production services, ensure data is cleaned, secured and available, and keep the data-science teams happy. We'd like to share some of the choices we made and some of the lessons learned from this (ongoing) transition. Session hashtag: #EUde10

Working with Skewed Data: The Iterative Broadcast

by Fokko Driesprong, GoDataDriven

video, slide

Skewed data is the enemy when joining tables using Spark. It shuffles a large proportion of the data onto a few overloaded nodes, bottlenecking Spark's parallelism and resulting in out-of-memory errors. The go-to answer is to use broadcast joins: leave the large, skewed dataset in place and transmit a smaller table to every machine in the cluster for joining. But what happens when your second table is too large to broadcast and does not fit into memory? Or even worse, when a single key is bigger than the total size of your executor? Firstly, we will give an introduction to the problem. Secondly, the current ways of fighting the problem will be explained, including why these solutions are limited. Finally, we will demonstrate a new technique, the iterative broadcast join, developed while processing ING Bank's global transaction data. This technique, implemented on top of the Spark SQL API, allows multiple large and highly skewed datasets to be joined successfully while retaining a high level of parallelism. This is something that is not possible with existing Spark join types. Session hashtag: #EUde11
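
The core idea, as described in the abstract, is to keep the large skewed table in place and broadcast the second table in slices when it is too big to broadcast whole. A minimal sketch for an inner equi-join on a column named "key"; the chunk count and column name are placeholders, and the talk's actual implementation may differ:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Sketch of an iterative broadcast join for an inner equi-join on "key":
// split the medium-sized table into numChunks slices, broadcast-join each
// slice to the large, skewed table, and union the partial results.
def iterativeBroadcastJoin(large: DataFrame,
                           medium: DataFrame,
                           numChunks: Int): DataFrame = {
  val partials = (0 until numChunks).map { i =>
    // pmod(hash(key), numChunks) deterministically assigns each key to one slice.
    val slice = medium.where(pmod(hash(col("key")), lit(numChunks)) === i)
    large.join(broadcast(slice), "key")
  }
  partials.reduce(_ union _)
}
```

Because every join key of the medium table lands in exactly one slice, the union of the partial inner joins equals the full inner join, while each broadcast stays small enough to fit in executor memory.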


TAG: Spark | Hadoop | Big Data