Spark Summit Europe 2017 Session Summaries (Stream Processing Track)

To download all of the videos and slides, follow the WeChat official account (bigdata_summit) and tap the "Video Download" menu.

Apache Spark Streaming + Kafka 0.10: An Integration Story

by Joan Viladrosa Riera, Billy Mobile

video, slide

Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself. So a new Spark Streaming integration has come to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at-least-once, at-most-once, exactly-once) with code examples. Finally, we will briefly introduce how Billy Mobile uses this integration to ingest and process the continuous stream of events from our AdNetwork. Session hashtag: #EUstr5
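The delivery semantics the abstract mentions come down to when offsets are committed relative to processing. The toy sketch below (plain Python, not the actual Spark or Kafka API; all names are invented for illustration) shows why committing after processing gives at-least-once, while committing before gives at-most-once:

```python
# Toy model of a consumer with manual offset commits. All names here are
# hypothetical illustrations, not the Spark/Kafka 0.10 API.

class ToyConsumer:
    """Minimal stand-in for a consumer that tracks a committed offset."""
    def __init__(self, records):
        self.records = records
        self.committed = 0          # last committed offset

    def poll(self):
        # Return everything after the last committed offset; after a crash
        # and restart, reading resumes from here.
        return self.records[self.committed:]

    def commit(self, offset):
        self.committed = offset

def process_at_least_once(consumer, sink):
    # Commit offsets only AFTER processing: a crash between the two steps
    # replays records on restart (duplicates possible, no loss).
    batch = consumer.poll()
    for record in batch:
        sink.append(record)
    consumer.commit(consumer.committed + len(batch))

def process_at_most_once(consumer, sink):
    # Commit offsets BEFORE processing: a crash after the commit skips
    # unprocessed records on restart (loss possible, no duplicates).
    batch = consumer.poll()
    consumer.commit(consumer.committed + len(batch))
    for record in batch:
        sink.append(record)
```

Exactly-once additionally requires either idempotent writes or storing offsets atomically with the results, for example in the same transaction as the output store.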

Apache Spark Streaming Programming Techniques You Should Know

by Gerard Maas, Lightbend

video, slide

At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, we will show practical uses of the forgotten 'ConstantInputDStream', and we will explain how to combine Spark Streaming with probabilistic data structures to optimize the use of memory in order to improve the resource usage of long-running streaming jobs. Attendees of this session will come away with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs. Session hashtag: #EUstr2
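As a hint of how probabilistic data structures can bound memory in long-running jobs, here is a minimal Bloom filter sketch in plain Python (a toy stand-in, not the structures the talk uses; the sizes and hashing scheme are arbitrary choices). It answers "have I possibly seen this event before?" in constant memory, at the cost of a tunable false-positive rate:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: memory stays constant no matter how many items
    are added, at the cost of a tunable false-positive rate."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0               # big integer used as a bit set

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely never added; True means probably added.
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

In a streaming deduplication job, a structure like this replaces an ever-growing set of seen keys, which is what keeps memory flat over weeks of uptime.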

Apache Spark Structured Streaming Helps Smart Manufacturing

by Xiaochang Wu, Intel

video

This presentation introduces how we designed and implemented a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line produces a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output and database records. There are two main data scenarios: 1) picture and video data, arriving at low frequency but in large volumes; 2) continuous data arriving at high frequency, where each record is small but the total amount is very large, such as the vibration data used to assess equipment quality. These data have the characteristics of streaming data: they are real-time, volatile, bursty, out of order and unbounded. Making effective real-time decisions to extract value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput and reliable operational system covering data acquisition, transmission, analysis and storage. An actual user case proved that the system meets the needs of real-time decision-making. The system greatly improves predictive fault repair and production-line material tracking, and can reduce the labor required on the production lines by about half. Session hashtag: #EUstr1
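To make the "effective real-time decisions" concrete, here is a toy sketch (plain Python with invented names, not the system described in the talk) of the kind of per-record check a streaming job might run over high-frequency vibration samples: flag any sample that deviates too far from a rolling mean of recent readings.

```python
from collections import deque

def anomaly_flags(samples, window=5, threshold=3.0):
    """Flag each sample whose deviation from the rolling mean of the
    preceding `window` samples exceeds `threshold`. A toy stand-in for a
    per-record decision in a streaming job; parameters are arbitrary."""
    recent = deque(maxlen=window)
    flags = []
    for value in samples:
        if len(recent) == window:
            mean = sum(recent) / window
            flags.append(abs(value - mean) > threshold)
        else:
            # Not enough history yet to judge this sample.
            flags.append(False)
        recent.append(value)
    return flags
```

In a real Structured Streaming job the same logic would run per device key over a windowed aggregation rather than over an in-memory list.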

Building a Business Logic Translation Engine with Spark Streaming for Communicating Between Legacy Code and Microservices

by Patrick Bamba, Attestation Légale

video, slide

Attestation Légale is a social networking service for companies that alleviates the administrative burden European countries impose on client-supplier relationships. It helps companies from the construction, staffing and transport industries digitize, secure and share their legal documents. With clients ranging from one-person businesses to industry leaders such as Orange or Bouygues Construction, it eases business relationships for a social network of companies that would be equivalent to a 34 billion dollar industry. While providing a high quality of service through our SaaS platform, we faced many challenges, including refactoring our monolith into microservices, a daunting architectural task a lot of organizations are facing today. Strategies for tackling that problem primarily revolve around extracting business logic from the monolith, or building new applications with their own logic that interfaces with the legacy. Sometimes, however, especially in companies sustaining significant growth, new business opportunities arise, and the logic required by your microservices may differ greatly from the legacy. We will discuss how we used Spark Streaming and Kafka to build a real-time business logic translation engine that allows loose technical and business coupling between our microservices and legacy code. You will also hear how making Apache Spark part of our consumer-facing product came with technical challenges of its own, especially when it comes to reliability. Finally, we will share the lambda architecture that allowed us to move data both in batch (migrating data from the monolith for initialization) and in real time (handling data generated afterwards through use). Key takeaways include: breaking down this strategy and its derived technical and business benefits; feedback on how we achieved reliability; examples of implementations using RabbitMQ (then Kafka) and GraphX; and testing business rules and data transformations. Session hashtag: #EUstr6
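One way to picture such a translation engine (a hypothetical sketch in plain Python; the rules, field names and event shapes are invented, not Attestation Légale's actual schema) is as a set of rules, each mapping a legacy event shape to the message a microservice expects, so neither side needs to know about the other:

```python
# Hypothetical sketch of the "translation" idea: each rule pairs a
# predicate (does this legacy event match?) with a transform (what
# message should the microservice receive?). Field names are invented.

def translate(event, rules):
    """Apply the first rule whose predicate matches the legacy event."""
    for matches, transform in rules:
        if matches(event):
            return transform(event)
    raise ValueError(f"no rule for event {event!r}")

# One example rule: a legacy document event becomes a microservice command.
rules = [
    (
        lambda e: e.get("type") == "LEGACY_DOC_UPLOADED",
        lambda e: {"command": "RegisterDocument",
                   "company_id": e["cid"],
                   "document": e["payload"]},
    ),
]
```

In the architecture the talk describes, a Spark Streaming job would apply such rules to every event flowing through Kafka, which is what keeps the coupling between legacy and microservice vocabularies loose: changing a mapping means changing a rule, not either codebase.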

Deep Dive into Stateful Stream Processing in Structured Streaming

by Tathagata Das, Databricks

video, slide

Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming. In particular, I am going to discuss the following: different stateful operations in Structured Streaming; how state data is stored in a distributed, fault-tolerant manner using State Stores; and how you can write custom State Stores for saving state to external storage systems. Session hashtag: #EUstr7
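The State Store idea can be sketched as follows (a toy in-memory model in plain Python, not the actual Spark StateStore API): per-key state with staged updates that only become visible on commit, so a failed batch can be retried against the last committed version.

```python
# Toy model of a versioned per-key state store. Method names are chosen
# for illustration and do not mirror Spark's StateStore interface.

class ToyStateStore:
    def __init__(self):
        self.committed = {}   # last committed state, keyed by grouping key
        self.staged = {}      # updates from the in-flight batch

    def get(self, key, default=None):
        # Reads within a batch see that batch's own staged writes.
        if key in self.staged:
            return self.staged[key]
        return self.committed.get(key, default)

    def put(self, key, value):
        self.staged[key] = value

    def abort(self):
        # The batch failed: discard staged state so a retry starts clean.
        self.staged = {}

    def commit(self):
        # The batch succeeded: publish staged state atomically.
        self.committed.update(self.staged)
        self.staged = {}

def update_counts(store, batch):
    """A stateful streaming aggregation: running count per key."""
    for key in batch:
        store.put(key, store.get(key, 0) + 1)
```

A custom store along these lines could persist `committed` to an external system on `commit()`, which is the hook the talk's third bullet is about.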

Fast Data with Apache Ignite and Apache Spark

by Christos Erotocritou, GridGain

video, slide

Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next-generation real-time applications is to use them together? In this session, Christos Erotocritou, lead GridGain solutions architect, will explain in detail how IgniteRDD, an implementation of the native Spark RDD and DataFrame APIs, shares the state of the RDD across other Spark jobs, applications and workers. Christos will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames. Furthermore, we will discuss the newest feature additions and what the future holds for this integration. Session hashtag: #EUstr10

Monitoring Structured Streaming Applications Using Web UI

by Jacek Laskowski

video

Spark Structured Streaming in Apache Spark 2.2 comes with quite a few unique Catalyst operators, most notably stateful streaming operators and three different output modes. Understanding how Spark Structured Streaming manages intermediate state between triggers, and how that affects performance, is paramount. After all, you use Apache Spark to process huge amounts of data, which alone can be tricky to get right, and Spark Structured Streaming adds the additional streaming factor that, for a given structured query, can make the data even bigger due to state management. This deep-dive talk is going to show you what is included in the execution diagrams, the logical and physical plans, and the metrics in the SQL tab's Details for Query page. Among other questions, the talk is going to answer: what do the blue boxes in the SQL tab's Details for Query page represent?

Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming

by Ben Teeuwen, Booking.com

video, slide

We are using Spark Streaming to build online machine learning (ML) features that are used at Booking.com for real-time prediction of the behaviour and preferences of our users and of demand for hotels, and to improve processes in customer support. Our initial set of goals was to speed up experimentation with real-time features, make features reusable by data scientists (DS) within the company, and reduce the training/serving data skew problem. The tooling that we've built and integrated into the company's infrastructure simplifies the development of new features to the point that online feature collection can be implemented and deployed into production by a DS with very little or no help from developers. That makes this approach scalable and allows us to iterate fast. We use Kafka as a streaming source of real-time events from the website as well as other sources, and with connectivity to Cassandra and Hive we were able to make data more consistent between the training and serving phases of ML pipelines. Our key takeaways: it is possible to design production pipelines in a way that allows a DS to build and deploy them without the help of a developer; and constructing online features is a much more complex job than offline construction, so business-wise it is not always a priority to invest in their construction even when they are proven to benefit model performance. We plan to invest further in the development of pipelines with Spark Streaming and are happy to see that support for streaming operations in Spark is evolving in the right direction. Session hashtag: #EUstr4
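One common way to reduce the training/serving skew the abstract mentions (a generic sketch in plain Python; the feature names are invented, not Booking.com's actual features) is to define each feature exactly once, as a plain function, and call the same code from both the offline training pipeline and the online scoring path:

```python
# Skew arises when training and serving compute "the same" feature with
# two separate implementations that drift apart. Sharing one definition
# removes that failure mode. All names below are hypothetical.

def pages_viewed_last_n(events, n=10):
    """Feature: number of distinct pages among the user's last n events."""
    return len({e["page"] for e in events[-n:]})

def feature_vector(events):
    # Single definition, called by BOTH the batch training job and the
    # streaming scoring job, so the two phases cannot diverge.
    return {
        "pages_last_10": pages_viewed_last_n(events, 10),
        "n_events": len(events),
    }
```

In the setup the talk describes, the streaming job would compute these vectors from Kafka events and write them out (e.g. to Cassandra), while the training pipeline reads historical events (e.g. from Hive) through the same functions.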

Story Deduplication and Mutation

by Andrew Morgan, ByteSumo Ltd

video, slide

We demonstrate how to use Spark Streaming to build a global News Scanner that scrapes news in near real time and uses sophisticated text analysis and SimHash. Session hashtag: #EUstr9
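A minimal sketch of the SimHash idea in plain Python (an illustration of the classic algorithm, not the talk's implementation): similar token sets produce fingerprints at a small Hamming distance, which is what makes near-duplicate story detection cheap.

```python
import hashlib

def simhash(tokens, bits=64):
    """Classic SimHash: hash each token, sum +1/-1 votes per bit position,
    and keep the sign of each sum as the fingerprint bit."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two stories that share most of their wording land within a few bits of each other, so deduplication reduces to a Hamming-distance threshold instead of a full pairwise text comparison.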
