Understanding the Spark on k8s Implementation Through Debugging


Starting with version 2.3.0, Spark supports Kubernetes as a native resource manager and scheduler, and spark-submit can submit Spark applications directly to a Kubernetes cluster. The spark-submit flow works as follows:

  1. The client starts a Kubernetes pod that runs the Spark driver.
  2. The Spark driver starts a set of Kubernetes pods; each pod creates an executor and runs the application code.
  3. When the application code finishes, the Spark driver cleans up the executor pods and remains in the "Completed" state.
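
While a job is running, this lifecycle can be watched directly with kubectl. This is only a sketch: the spark-role labels are the ones Spark 2.3 attaches to the pods it creates, and the pod names on your cluster will differ.

# Watch driver and executor pods get created and torn down
kubectl get pods -w

# Spark labels its pods, so drivers and executors can be listed separately
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor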

Running Spark on k8s

Prerequisites: a Kubernetes cluster and a JDK must already be installed.

1. Download Spark 2.3.0

apache.org/dyn/closer.l

2. Build the image

cd spark-2.3.0-bin-hadoop2.7
docker build -t xxx/spark:2.3.0 -f kubernetes/dockerfiles/spark/Dockerfile .
docker push xxx/spark:2.3.0
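
Before pushing, the image can be sanity-checked locally. A minimal sketch, assuming the stock Dockerfile, which installs Spark under /opt/spark; the --entrypoint override simply bypasses the image's own entrypoint script:

# Print the Spark version baked into the image
docker run --rm --entrypoint /opt/spark/bin/spark-submit xxx/spark:2.3.0 --version

# The example jar referenced later via local:// must exist inside the image
docker run --rm --entrypoint ls xxx/spark:2.3.0 /opt/spark/examples/jars/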

3. Submit the job

bin/spark-submit --master k8s://172.22.3.107:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=xxx/spark:2.3.0 local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
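
In cluster mode spark-submit only creates and monitors the driver pod, so the run itself is followed from the cluster side. A sketch, with <driver-pod-name> standing in for whatever name kubectl reports:

# Find the driver pod for this job and follow its log
kubectl get pods -l spark-role=driver
kubectl logs -f <driver-pod-name>

# If a pod is stuck in Pending or Error, describe it to see the reason
kubectl describe pod <driver-pod-name>

If executor pods never appear, a common cause is RBAC: in cluster mode the driver's service account must be allowed to create pods, which can be configured with spark.kubernetes.authenticate.driver.serviceAccountName.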

Debugging Spark on k8s

1. Debugging spark-submit

Launch command

/root/spark/jdk1.8.0_171/bin/java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -cp /root/spark/spark-2.3.0-bin-hadoop2.7/conf/:/root/spark/spark-2.3.0-bin-hadoop2.7/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://172.22.12.11:6443 --deploy-mode cluster --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 --conf spark.executor.instances=5 --class org.apache.spark.examples.SparkPi --name spark-pi local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
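
The full java command line above comes from the Spark launcher; it does not have to be assembled by hand. The same JDWP agent can be injected through the SPARK_SUBMIT_OPTS environment variable that bin/spark-submit picks up; a sketch of that approach:

# suspend=y makes the spark-submit JVM wait on port 8000 until a debugger attaches
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
bin/spark-submit --master k8s://172.22.12.11:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar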

Attach with jdb

jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=8000

Set breakpoints

stop at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication:196

stop at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication:210

stop at org.apache.spark.deploy.k8s.submit.Client:107

Get the call stack

main[1] where
  [1] org.apache.spark.deploy.k8s.submit.Client.run (KubernetesClientApplication.scala:134)
  [2] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply (KubernetesClientApplication.scala:235)
  [3] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply (KubernetesClientApplication.scala:227)
  [4] org.apache.spark.util.Utils$.tryWithResource (Utils.scala:2,585)
  [5] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run (KubernetesClientApplication.scala:227)
  [6] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start (KubernetesClientApplication.scala:192)
  [7] org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain (SparkSubmit.scala:879)
  [8] org.apache.spark.deploy.SparkSubmit$.doRunMain$1 (SparkSubmit.scala:197)
  [9] org.apache.spark.deploy.SparkSubmit$.submit (SparkSubmit.scala:227)
  [10] org.apache.spark.deploy.SparkSubmit$.main (SparkSubmit.scala:136)
  [11] org.apache.spark.deploy.SparkSubmit.main (null)

SparkSubmit starts KubernetesClientApplication, which calls the Kubernetes API to create the driver pod: the container runs image xxx/spark:2.3.0 and is started with the argument driver.
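
That the container really is started with the driver argument can be checked against the pod spec the client generated; a sketch, with <driver-pod-name> standing in for the pod kubectl reports:

# The first container arg selects the "driver" branch of the image's entrypoint script
kubectl get pod <driver-pod-name> -o jsonpath='{.spec.containers[0].args}'

# The rest of the generated spec (image, env, volumes, ...) is visible here
kubectl get pod <driver-pod-name> -o yaml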

2. Debugging the Spark driver

Launch command (the JDWP agent is passed to the driver via --driver-java-options)

/root/spark/jdk1.8.0_171/bin/java -cp /root/spark/spark-2.3.0-bin-hadoop2.7/conf/:/root/spark/spark-2.3.0-bin-hadoop2.7/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://172.22.12.11:6443 --deploy-mode cluster --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 --driver-java-options -agentlib:jdwp=transport=dt_socket,address=9904,server=y,suspend=y --conf spark.executor.instances=5 --class org.apache.spark.examples.SparkPi --name spark-pi local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar

Enter the driver pod

kubectl exec -it spark-pi-eb149b0ba85138e4bf60c594c0755309-driver bash

Attach with jdb

jdb -attach localhost:9904
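
If jdb is not available inside the image, an alternative is to forward the JDWP port to the submission host and attach from there; a sketch:

# Forward local port 9904 to the driver pod's debug port
kubectl port-forward spark-pi-eb149b0ba85138e4bf60c594c0755309-driver 9904:9904

# In another terminal, attach jdb (or an IDE remote-debug session) through the tunnel
jdb -attach localhost:9904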

Set breakpoints

stop at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend:103

stop at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager:47

Get the call stack

  [1] org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend (KubernetesClusterManager.scala:47)
  [2] org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler (SparkContext.scala:2,741)
  [3] org.apache.spark.SparkContext.<init> (SparkContext.scala:492)
  [4] org.apache.spark.SparkContext$.getOrCreate (SparkContext.scala:2,486)
  [5] org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply (SparkSession.scala:930)
  [6] org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply (SparkSession.scala:921)
  [7] scala.Option.getOrElse (Option.scala:121)
  [8] org.apache.spark.sql.SparkSession$Builder.getOrCreate (SparkSession.scala:921)
  [9] org.apache.spark.examples.SparkPi$.main (SparkPi.scala:31)
  [10] org.apache.spark.examples.SparkPi.main (null)

The Spark driver runs SparkPi's main function and creates a SparkSession, which in turn uses KubernetesClusterManager to create the scheduler backend (KubernetesClusterSchedulerBackend); the scheduler backend then launches the Kubernetes pods that run the executors.
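
The executor pods created by KubernetesClusterSchedulerBackend can be observed the same way as the driver; a sketch, with <executor-pod-name> standing in for any executor pod kubectl lists:

# The five executors requested via spark.executor.instances show up as separate pods
kubectl get pods -l spark-role=executor

# Each executor container goes through the same image entrypoint, started with the argument "executor"
kubectl get pod <executor-pod-name> -o jsonpath='{.spec.containers[0].args}'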

References:

Running Spark on Kubernetes

