Spark Summit Europe 2017 Session Summaries (Spark Core Technology Category)

To download all the videos and slides, follow the WeChat public account (bigdata_summit) and tap the 「視頻下載」 ("Video Download") menu.

A Developer's View Into Spark's Memory Model

by Wenchen Fan, Databricks

video, slide

As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark's backend execution and push performance closer to the limits of modern hardware. In this talk, we'll take a deep dive into Apache Spark's unified memory model and discuss how Spark exploits the memory hierarchy and leverages application semantics to manage memory explicitly (both on- and off-heap) to eliminate the overheads of the JVM object model and garbage collection. Session hashtag: #EUdd2

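The unified memory model and the explicit on/off-heap management the abstract mentions are controlled by a handful of standard Spark 2.x configuration options. The snippet below is a minimal PySpark sketch, not taken from the talk; the values shown are the defaults or purely illustrative.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("unified-memory-demo")
    # Fraction of (JVM heap - 300MB) shared by execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.6")
    # Share of that region where cached data is immune to eviction by execution.
    .config("spark.memory.storageFraction", "0.5")
    # Let Tungsten allocate execution memory off-heap, outside the GC's reach.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")  # illustrative size, not a recommendation
    .getOrCreate()
)
```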

Deep Dive into Deep Learning Pipelines

by Sue Ann Hong, Databricks

video, slide

Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple, building on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into: the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code; how to work with complex data such as images in Spark and Deep Learning Pipelines; and how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks. Session hashtag: #EUdd3

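As a concrete illustration of the "few lines of code" claim, here is a hedged sketch of the transfer-learning use case with the sparkdl package (Deep Learning Pipelines). It assumes sparkdl and its deep learning dependencies are attached to the cluster; train_df, test_df, and the label column are placeholders.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from sparkdl import DeepImageFeaturizer  # Deep Learning Pipelines package

# train_df / test_df are assumed DataFrames with an "image" column
# (loaded via the package's image reader) and a numeric "label" column.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")  # pre-trained CNN as a fixed feature extractor
classifier = LogisticRegression(maxIter=20, regParam=0.05,
                                featuresCol="features", labelCol="label")

model = Pipeline(stages=[featurizer, classifier]).fit(train_df)
predictions = model.transform(test_df)
```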

Deep Dive into Deep Learning Pipelines (continued)

by Sue Ann Hong, Databricks

video, slide


Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark

by Tathagata Das, Databricks

video, slide

Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers can write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees. Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy/unstructured files, a structured/columnar historical data warehouse, or arriving in real time from Kafka/Kinesis. In this session, Das will walk through a concrete example where, in less than 10 lines, you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He'll use techniques including event-time-based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks. Session hashtag: #EUdd1

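The "less than 10 lines" pipeline described above follows the standard Structured Streaming pattern sketched below in PySpark. This is not Das's exact demo: the broker address, topic name, JSON schema, output path, and the static device_info_df DataFrame are all placeholders.

```python
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# Placeholder schema for the JSON payload carried in the Kafka "value" field.
schema = (StructType()
          .add("device", StringType())
          .add("signal", DoubleType())
          .add("eventTime", TimestampType()))

enriched = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
            .option("subscribe", "events")                       # placeholder topic
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("e"))
            .select("e.*")
            .join(device_info_df, "device"))                     # enrich with static data

# Continuously write the result out as a Parquet table for batch / ad-hoc queries.
query = (enriched.writeStream
         .format("parquet")
         .option("path", "/tables/events")                       # placeholder output path
         .option("checkpointLocation", "/tables/events_checkpoint")
         .start())
```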

Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark (continued)

by Tathagata Das, Databricks

video, slide


From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations

by Jacek Laskowski

video

There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and the not-so-basic groupByKey that shines bright in Apache Spark Structured Streaming's stateful aggregations, through the more advanced cube, rollup and pivot, to my beloved windowed aggregations. It's unbelievable how different their performance characteristics are, even for the same use cases. What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs groupBy. And that's just Spark SQL alone. Then there is Spark Structured Streaming, which has put the groupByKey operator at the forefront of stateful stream processing (and, to my surprise, its performance might not be that satisfactory). This deep-dive talk is going to show all the different use cases for the aggregate operators and functions, as well as their performance differences, in Spark SQL 2.2 and beyond. Code and fun included! Session hashtag: #EUdd5

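For reference, here is a small PySpark sketch of the DataFrame-side operators named in the abstract, run on toy data (the talk itself uses Scala, and groupByKey belongs to the typed Dataset API, so it is not shown here).

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

sales = spark.createDataFrame(
    [("EU", "books", 10), ("EU", "music", 20), ("US", "books", 15)],
    ["region", "category", "amount"])

# Basic groupBy aggregation: one output row per region.
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

# rollup and cube add subtotal and grand-total rows.
sales.rollup("region", "category").agg(F.sum("amount")).show()
sales.cube("region", "category").agg(F.sum("amount")).show()

# pivot turns the distinct category values into columns.
sales.groupBy("region").pivot("category").agg(F.sum("amount")).show()

# Windowed aggregation: keeps every input row and adds a per-region total.
by_region = Window.partitionBy("region")
sales.withColumn("region_total", F.sum("amount").over(by_region)).show()
```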

From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations (continued)

by Jacek Laskowski

video


Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and TensorFlow

by Alexander Thomas, Indeed

video, slide

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. Ideally, all three of these pieces should be able to be integrated into a single workflow. This makes development, experimentation, and deployment much easier. Spark's MLlib provides a number of machine learning algorithms, and now there are also projects making deep learning achievable in MLlib pipelines. All that is missing is the NLP annotation framework. SparkNLP adds NLP annotations into the MLlib ecosystem. This talk will introduce SparkNLP: how to use it, its current functionality, and where it is going in the future. Session hashtag: #EUdd4

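To show how Spark NLP annotators slot into an ordinary MLlib Pipeline, here is a hedged PySpark sketch. It assumes the spark-nlp package is on the classpath; the class names follow the library's Python API, and the exact module layout has varied across versions.

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import SentenceDetector, Tokenizer

# Raw text lives in an ordinary DataFrame column called "text" (placeholder data).
text_df = spark.createDataFrame(
    [("Spark NLP adds annotators to MLlib pipelines.",)], ["text"])

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
finisher = Finisher().setInputCols(["token"])  # turn annotations back into plain arrays

pipeline = Pipeline(stages=[document, sentence, token, finisher])
annotated = pipeline.fit(text_df).transform(text_df)
annotated.show(truncate=False)
```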

