Understanding the Implementation of Spark on k8s Through Debugging
Starting with Spark 2.3.0, Kubernetes is supported as a native resource manager and scheduler for Spark, and spark-submit can submit Spark applications directly to a Kubernetes cluster. The spark-submit flow is as follows (a kubectl sketch for observing these pods comes right after the list):
- The client starts a Kubernetes pod that runs the Spark Driver
- The Spark Driver starts a set of Kubernetes pods; each pod creates an Executor and runs the application code
- When the application code finishes, the Spark Driver cleans up the Executor pods and its own pod remains in the "Completed" state
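To watch this lifecycle while a job runs, something like the commands below should work. This is a minimal sketch; the spark-role=driver / spark-role=executor label selectors are an assumption about the labels Spark on k8s attaches to the pods it creates.
# Watch the driver pod move from Running to Completed
kubectl get pods -l spark-role=driver --watch
# List the executor pods created by the driver (they disappear once the job finishes)
kubectl get pods -l spark-role=executor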
Running Spark on k8s
Prerequisites: a working Kubernetes cluster and a JDK must be installed.
1. Download Spark 2.3.0
https://www.apache.org/dyn/closer.lua/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
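A minimal sketch of fetching and unpacking the release; the archive.apache.org URL is an assumption about where the 2.3.0 tarball is still hosted (the closer.lua link above only resolves to a mirror list):
# Download and unpack the prebuilt Spark 2.3.0 release
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar -xzf spark-2.3.0-bin-hadoop2.7.tgz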
2. Build the image
cd spark-2.3.0-bin-hadoop2.7
docker build -t xxx/spark:2.3.0 -f kubernetes/dockerfiles/spark/Dockerfile .
docker push xxx/spark:2.3.0
3. Submit the job
bin/spark-submit --master k8s://172.22.3.107:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=xxx/spark:2.3.0 local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
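After submission, the driver pod and the job output can be checked with kubectl. A sketch; the pod name below is hypothetical, substitute the one reported by spark-submit or kubectl get pods:
# Find the driver pod created for this submission
kubectl get pods | grep spark-pi
# SparkPi is expected to print its result ("Pi is roughly ...") to the driver log
kubectl logs spark-pi-<submission-id>-driver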
Debugging Spark on k8s
1. Debugging spark-submit
Launch command:
/root/spark/jdk1.8.0_171/bin/java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000 -cp /root/spark/spark-2.3.0-bin-hadoop2.7/conf/:/root/spark/spark-2.3.0-bin-hadoop2.7/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://172.22.12.11:6443 --deploy-mode cluster --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 --conf spark.executor.instances=5 --class org.apache.spark.examples.SparkPi --name spark-pi local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
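An arguably simpler way to get the same JDWP agent into the spark-submit JVM is to let bin/spark-submit assemble the java command and inject the options through SPARK_SUBMIT_OPTS. This is a sketch under the assumption that Spark's launcher honors that environment variable:
# Suspend the spark-submit JVM until a debugger attaches on port 8000
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000"
bin/spark-submit --master k8s://172.22.12.11:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar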
Debugging with jdb:
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=8000
Set breakpoints:
stop at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication:196
stop at org.apache.spark.deploy.k8s.submit.KubernetesClientApplication:210
stop at org.apache.spark.deploy.k8s.submit.Client:107
Get the call stack:
main[1] where
  [1] org.apache.spark.deploy.k8s.submit.Client.run (KubernetesClientApplication.scala:134)
  [2] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply (KubernetesClientApplication.scala:235)
  [3] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication$$anonfun$run$5.apply (KubernetesClientApplication.scala:227)
  [4] org.apache.spark.util.Utils$.tryWithResource (Utils.scala:2,585)
  [5] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run (KubernetesClientApplication.scala:227)
  [6] org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start (KubernetesClientApplication.scala:192)
  [7] org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain (SparkSubmit.scala:879)
  [8] org.apache.spark.deploy.SparkSubmit$.doRunMain$1 (SparkSubmit.scala:197)
  [9] org.apache.spark.deploy.SparkSubmit$.submit (SparkSubmit.scala:227)
  [10] org.apache.spark.deploy.SparkSubmit$.main (SparkSubmit.scala:136)
  [11] org.apache.spark.deploy.SparkSubmit.main (null)
SparkSubmit starts KubernetesClientApplication, which calls the Kubernetes API to launch a Kubernetes pod using container image xxx/spark:2.3.0, with driver as its launch argument.
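To confirm what the client built, the driver pod's spec can be inspected; the pod name here is hypothetical, use the one kubectl get pods reports:
# The spec should show the xxx/spark:2.3.0 image and the argument that puts the container into driver mode
kubectl get pod spark-pi-<submission-id>-driver -o yaml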
2. Debugging the Spark Driver
Launch command:
/root/spark/jdk1.8.0_171/bin/java -cp /root/spark/spark-2.3.0-bin-hadoop2.7/conf/:/root/spark/spark-2.3.0-bin-hadoop2.7/jars/* org.apache.spark.deploy.SparkSubmit --master k8s://172.22.12.11:6443 --deploy-mode cluster --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 --driver-java-options -agentlib:jdwp=transport=dt_socket,address=9904,server=y,suspend=y --conf spark.executor.instances=5 --class org.apache.spark.examples.SparkPi --name spark-pi local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
Exec into the driver pod:
kubectl exec -it spark-pi-eb149b0ba85138e4bf60c594c0755309-driver bash
Debugging with jdb:
jdb -attach localhost:9904
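As an alternative to exec'ing into the pod, the JDWP port can be forwarded to the local machine and jdb attached from outside the cluster; a sketch reusing the pod name above:
# Forward the driver's debug port, then attach locally
kubectl port-forward spark-pi-eb149b0ba85138e4bf60c594c0755309-driver 9904:9904 &
jdb -attach localhost:9904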
Set breakpoints:
stop at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend:103
stop at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager:47
Get the call stack:
  [1] org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend (KubernetesClusterManager.scala:47)
  [2] org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler (SparkContext.scala:2,741)
  [3] org.apache.spark.SparkContext.<init> (SparkContext.scala:492)
  [4] org.apache.spark.SparkContext$.getOrCreate (SparkContext.scala:2,486)
  [5] org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply (SparkSession.scala:930)
  [6] org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply (SparkSession.scala:921)
  [7] scala.Option.getOrElse (Option.scala:121)
  [8] org.apache.spark.sql.SparkSession$Builder.getOrCreate (SparkSession.scala:921)
  [9] org.apache.spark.examples.SparkPi$.main (SparkPi.scala:31)
  [10] org.apache.spark.examples.SparkPi.main (null)
The Spark Driver runs SparkPi's main function, which creates a SparkSession; when the underlying SparkContext is created, it uses KubernetesClusterManager to build the KubernetesClusterSchedulerBackend, which in turn launches the Kubernetes pods that run the Executors.
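The same debugging approach can be extended to the Executors by passing the JDWP agent through spark.executor.extraJavaOptions; a sketch, with suspend=n so the executors still register with the driver, and a hypothetical executor pod name:
bin/spark-submit --master k8s://172.22.12.11:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 --conf spark.kubernetes.container.image=zuowang/spark:2.3.0 --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005" local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
# Forward the debug port of one executor pod and attach
kubectl port-forward <executor-pod> 5005:5005 &
jdb -attach localhost:5005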
Reference:
Running Spark on Kubernetes