Spark Summit Europe 2017 Session Summaries (Artificial Intelligence)

To download all videos and slides, follow the WeChat official account (bigdata_summit) and tap the 「視頻下載」 ("Video Download") menu.

Apache Spark and Tensorflow as a Service

by Jim Dowling, KTH Royal Institute of Technology

video, slide

In Sweden, from the RISE ICE Data Center at www.hops.site, we provide researchers with both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks' Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written on Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training. Session hashtag: #EUai8
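
The TensorFlowOnSpark programming model mentioned above can be sketched in a few lines. The snippet below is a minimal sketch only: it assumes TensorFlowOnSpark is installed, that `data_rdd` is an RDD already transformed in Spark, and that a real `map_fun` would build the TensorFlow graph; the cluster sizes are placeholders.

```python
# Minimal TensorFlowOnSpark sketch (assumes tensorflowonspark is installed
# and `data_rdd` is a feature/label RDD prepared in Spark).
from pyspark import SparkContext
from tensorflowonspark import TFCluster

def map_fun(args, ctx):
    # Runs on every executor as one TensorFlow node (worker or parameter
    # server). A real job builds its tf.Graph here and, with InputMode.SPARK,
    # consumes the batches that Spark feeds into the cluster.
    pass

sc = SparkContext(appName="tfos-demo")
cluster = TFCluster.run(sc, map_fun, None,  # no extra TF args in this sketch
                        4,                  # num_executors (illustrative)
                        1,                  # num_ps parameter servers
                        True,               # launch TensorBoard
                        TFCluster.InputMode.SPARK)
cluster.train(data_rdd, 1)  # feed Spark-transformed data for one epoch
cluster.shutdown()
```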

Deduplication and Author-Disambiguation of Streaming Records via Supervised Models Based on Content Encoders

by Reza Karimi, Elsevier

video, slide

Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in three ways. First, the use of Databricks and AWS makes this a scalable implementation; compute resources are considerably lower than for traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. Second, we create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while preserving semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, and journals. This standard representation simplifies the recommendation problem into a pairwise similarity search, and can hence offer a basic recommender for cross-product applications where no dedicated recommender engine has been designed. Third, traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. It is therefore crucial to maintain historical profiles, so we have developed a machine learning implementation that handles data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw output of the pairwise similarity function into final clusters. Lessons learned from this talk can help any company that wants to integrate its data or deduplicate its user/customer/product databases. Session hashtag: #EUai2
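
As a rough illustration of the fingerprinting idea, the sketch below trains a Word2Vec encoder with Spark ML and uses the averaged word vectors as a fixed-size document fingerprint for pairwise similarity. The toy data, column names, and vector size are illustrative assumptions, not Elsevier's actual pipeline.

```python
# Sketch: content fingerprints via Spark ML Word2Vec (illustrative only).
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("fingerprints").getOrCreate()
docs = spark.createDataFrame([
    ("a1", "deep learning for record linkage".split()),
    ("a2", "supervised record deduplication with encoders".split()),
], ["id", "tokens"])

# Spark's Word2Vec averages the word vectors of each document into a single
# vector, which can serve as a compact semantic fingerprint.
w2v = Word2Vec(vectorSize=64, minCount=1, inputCol="tokens", outputCol="fingerprint")
fingerprints = w2v.fit(docs).transform(docs)

# Pairwise cosine similarity between two fingerprints (driver-side for brevity).
v1, v2 = [row.fingerprint.toArray() for row in fingerprints.collect()]
print(float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))))
```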

Extending Apache Spark ML: Adding Your Own Algorithms and Tools

by Holden Karau, IBM

video

Apache Spark's machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren't available yet. This talk introduces Spark's ML pipelines, and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark's ML pipelines, you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work, of course). Even if you don't have your own machine learning algorithms that you want to implement, this session will give you an inside look at how the ML APIs are built. It will also help you make even more awesome ML pipelines and customize Spark models for your needs. And if you don't want to extend Spark ML pipelines with custom algorithms, you'll still benefit by developing a stronger background for future Spark ML projects. The examples in this talk will be presented in Scala, but any non-standard syntax will be explained. Session hashtag: #EUai6
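
The talk's examples are in Scala; for consistency with the other sketches in this summary, here is a rough Python equivalent of the core idea: a custom Transformer that plugs into a pyspark.ml Pipeline and thereby inherits pipeline persistence and parameter search. The class and column names are made up for illustration.

```python
# Sketch: a custom pyspark.ml Transformer that composes with ML Pipelines.
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F

class LowerCaser(Transformer, HasInputCol, HasOutputCol):
    """Toy stage: lower-cases a string column."""

    def __init__(self, inputCol="text", outputCol="text_lc"):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)

    def _transform(self, df):
        # The only method a Transformer must provide.
        return df.withColumn(self.getOutputCol(),
                             F.lower(F.col(self.getInputCol())))

# Because it is a real pipeline stage, it can sit next to built-in stages
# and benefit from the same meta-algorithms the talk describes.
pipeline = Pipeline(stages=[LowerCaser(inputCol="title", outputCol="title_lc")])
```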

Getting Ready to Use Redis with Apache Spark

by Dvir Volk, Redis Labs

video, slide

This session is a technical tutorial on integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and its module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples that demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance. Session hashtag: #EUai4
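
The redis-ml commands themselves are best taken from the module's documentation; as a hedged sketch of the underlying "train in Spark, serve from Redis" pattern, the code below fits a logistic regression with Spark ML and stores its parameters in a Redis hash using the standard redis-py client. The key name, host, and scoring helper are illustrative assumptions.

```python
# Sketch: train in Spark ML, serve model parameters from Redis (illustrative;
# redis-ml provides dedicated commands for this, not shown here).
import math
import redis
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redis-serving").getOrCreate()
train = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1])),
    (0.0, Vectors.dense([2.0, 1.0])),
], ["label", "features"])
model = LogisticRegression(maxIter=10).fit(train)

# Persist the learned parameters in a Redis hash (key name is made up).
r = redis.Redis(host="localhost", port=6379)
r.hset("model:ctr", mapping={
    "intercept": model.intercept,
    "weights": ",".join(str(w) for w in model.coefficients),
})

def predict(features):
    """Low-latency scoring against the parameters held in Redis."""
    stored = r.hgetall("model:ctr")
    weights = [float(w) for w in stored[b"weights"].split(b",")]
    z = float(stored[b"intercept"]) + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

print(predict([0.5, 1.0]))
```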

MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library for Apache Spark

by Miruna Oprescu, Microsoft

video, slide

With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, data scientists find themselves struggling with issues such as: low-level data manipulation, lack of support for image processing, text analytics, and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms. Session hashtag: #EUai7
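
For a taste of the "1/10th of the code" claim, the snippet below follows the pattern of MMLSpark's published quickstart, in which TrainClassifier featurizes raw columns internally so a model can be fit directly on a DataFrame. It assumes MMLSpark is installed and that `train` and `test` DataFrames with a "label" column exist; treat it as a sketch, not a verbatim API reference.

```python
# Sketch in the style of MMLSpark's quickstart (assumes mmlspark is installed
# and `train`/`test` DataFrames with a "label" column are already defined).
from mmlspark import TrainClassifier, ComputeModelStatistics
from pyspark.ml.classification import LogisticRegression

# TrainClassifier featurizes the raw columns internally, which is where
# much of the code saving comes from.
model = TrainClassifier(model=LogisticRegression(), labelCol="label").fit(train)
predictions = model.transform(test)
ComputeModelStatistics().transform(predictions).show()
```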

MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL

by Mingjie Tang, Hortonworks

video, slide

The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods rely heavily on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation-plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL, and optimize the matrix execution plan based on the Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement. Session hashtag: #EUai1
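
MatFast is a research layer on top of Spark SQL rather than part of upstream Spark, so the sketch below instead shows the baseline Spark API it improves upon: a chained distributed matrix multiply with pyspark.mllib's BlockMatrix. The block layout and sizes are illustrative.

```python
# Sketch: chained distributed matrix multiply with Spark's built-in
# BlockMatrix API, the kind of computation MatFast plans and optimizes.
from pyspark import SparkContext
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

sc = SparkContext(appName="matmul-demo")

# Two 4x4 matrices stored as sparse grids of 2x2 dense blocks;
# missing blocks are treated as all-zero.
blocks_a = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1, 2, 3, 4])),
                           ((1, 1), Matrices.dense(2, 2, [5, 6, 7, 8]))])
blocks_b = sc.parallelize([((0, 0), Matrices.dense(2, 2, [1, 0, 0, 1])),
                           ((1, 1), Matrices.dense(2, 2, [2, 0, 0, 2]))])
A = BlockMatrix(blocks_a, 2, 2, 4, 4)
B = BlockMatrix(blocks_b, 2, 2, 4, 4)

# A chain like (A * B) * A is where sparsity estimation and
# dependency-aware partitioning pay off at scale.
C = A.multiply(B).multiply(A)
print(C.toLocalMatrix())
```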

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark

by Marcin Kulka, 9LivesData

video, slide

Building accurate machine learning models has been an art of data scientists, involving algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these "black arts" have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB of memory, and 17TB of SSD into a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from the reliability and stability standpoints. This talk covers the presentation already shown at Spark Summit SF'17 (#SFds5), but from a more technical perspective. Session hashtag: #EUai9
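
NEC's automatic modeling system itself is not public, so the sketch below uses Spark ML's built-in tuning API to illustrate the kind of parallel algorithm-and-parameter search it automates at much larger scale. The grid values and the `train` DataFrame are assumptions.

```python
# Sketch: parallel model search with Spark ML's built-in tuning tools
# (an automatic system explores a far larger space than this small grid).
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())  # 9 candidate models

cv = CrossValidator(estimator=Pipeline(stages=[lr]),
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
best_model = cv.fit(train).bestModel  # assumes a `train` DataFrame exists
```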

Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Learning

by Eiti Kimura, Movile

video, slide

Have you ever imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could result in revenue losses. The application monitors distributed systems and sends notifications describing the problems it detects, so that operators can act quickly to avoid serious issues that directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program's source code, and you will be able to adapt it and implement it in your own company. This solution already helped prevent about US$3 million in losses last year. Session hashtag: #EUai10
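
To make the approach concrete, here is a minimal sketch of the idea as described in the abstract (not Movile's actual code): fit a linear regression of an expected platform metric on time features, then alert when a fresh reading falls well below the prediction. The 20% tolerance, the column names, and the synthetic data are assumptions.

```python
# Sketch: regression-based revenue/throughput monitoring (illustrative).
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("revenue-monitor").getOrCreate()

# Historical platform metric: hour of day -> transactions processed.
history = spark.createDataFrame(
    [(float(h), 1000.0 + 50.0 * h) for h in range(24)], ["hour", "tx_count"])

assembler = VectorAssembler(inputCols=["hour"], outputCol="features")
model = LinearRegression(labelCol="tx_count").fit(assembler.transform(history))

# Score the latest reading and alert when it drops below 80% of the
# prediction (tolerance chosen for illustration).
latest = assembler.transform(
    spark.createDataFrame([(14.0, 820.0)], ["hour", "tx_count"]))
scored = model.transform(latest)
if scored.where(F.col("tx_count") < 0.8 * F.col("prediction")).count() > 0:
    print("ALERT: throughput below expected level, possible revenue leakage")
```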

Recommended reading:

How can stability and performance both be achieved? The technical evolution and practice of the 58.com big data platform
Big Data Stories (26): Do You Still Love Me? Stinger's Efforts

TAG: Spark | Hadoop | Big Data |