Prerequisites:
- Docker Desktop for Windows 10 installed, with its built-in Kubernetes enabled
- Ingress installed on the Kubernetes cluster
The following steps are adapted from:
https://testdriven.io/blog/deploying-spark-on-kubernetes/
Spark Docker image
Dockerfile:
FROM java:openjdk-8-jdk

# define spark and hadoop versions
ENV SPARK_VERSION=3.0.0
ENV HADOOP_VERSION=3.3.0

# download and install hadoop
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz | \
        tar -zx hadoop-${HADOOP_VERSION}/lib/native && \
    ln -s hadoop-${HADOOP_VERSION} hadoop && \
    echo Hadoop ${HADOOP_VERSION} native libraries installed in /opt/hadoop/lib/native

# download and install spark
RUN mkdir -p /opt && \
    cd /opt && \
    curl http://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz | \
        tar -zx && \
    ln -s spark-${SPARK_VERSION}-bin-hadoop2.7 spark && \
    echo Spark ${SPARK_VERSION} installed in /opt

# add scripts and update spark default config
ADD common.sh spark-master spark-worker /
ADD spark-defaults.conf /opt/spark/conf/spark-defaults.conf
ENV PATH $PATH:/opt/spark/bin
You can find the corresponding Dockerfile in this GitHub repository:
Repository address: https://github.com/testdrivenio/spark-kubernetes
Build the image:
docker build -f docker/Dockerfile -t spark-hadoop:3.0.0 ./docker
If you don't want to build the image locally, you can use the prebuilt image published on Docker Hub as mjhea0/spark-hadoop:3.0.0: pull it with docker pull, then rename it to spark-hadoop:3.0.0 with docker tag.
$ docker image ls spark-hadoop
REPOSITORY     TAG     IMAGE ID       CREATED          SIZE
spark-hadoop   3.0.0   8f3ccdadd795   11 minutes ago   911MB
Spark Master
spark-master-deployment.yaml:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-master
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: spark-hadoop:3.0.0
          command: ["/spark-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
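As an aside, the Deployment above ties itself to its pods through labels: spec.selector.matchLabels must match the labels in template.metadata.labels. A minimal plain-Python sketch of that matching rule (labels hard-coded for illustration):

```python
# matchLabels selects a pod when every selector key/value pair
# appears in the pod's labels (extra pod labels are allowed).
def matches(selector: dict, pod_labels: dict) -> bool:
    return all(pod_labels.get(k) == v for k, v in selector.items())

selector = {"component": "spark-master"}    # spec.selector.matchLabels
pod_labels = {"component": "spark-master"}  # template.metadata.labels

print(matches(selector, pod_labels))                      # True
print(matches(selector, {"component": "spark-worker"}))   # False
```

If the two label sets disagree, the apiserver rejects the Deployment, which is why the manifest repeats `component: spark-master` in both places.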
spark-master-service.yaml:
kind: Service
apiVersion: v1
metadata:
  name: spark-master
spec:
  ports:
    - name: webui
      port: 8080
      targetPort: 8080
    - name: spark
      port: 7077
      targetPort: 7077
  selector:
    component: spark-master
Deploy the Spark master and create its service:
kubectl create -f ./kubernetes/spark-master-deployment.yaml
kubectl create -f ./kubernetes/spark-master-service.yaml
Verification:
$ kubectl get deployments
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
spark-master   1/1     1            1           12s
$ kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
spark-master-6c4469fdb6-rs642   1/1     Running   0          6s
Spark Workers
spark-worker-deployment.yaml:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: spark-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: spark-hadoop:3.0.0
          command: ["/spark-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
Deploy:
$ kubectl create -f ./kubernetes/spark-worker-deployment.yaml
Verification:
$ kubectl get deployments
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
spark-master   1/1     1            1           92s
spark-worker   2/2     2            2           6s
$ kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
spark-master-6c4469fdb6-rs642   1/1     Running   0          114s
spark-worker-5d4bdd44db-p2q8v   1/1     Running   0          28s
spark-worker-5d4bdd44db-v4d84   1/1     Running   0          28s
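If you want to check readiness programmatically, the kubectl output above can be parsed line by line. A small sketch with the sample output hard-coded (a real script would run `kubectl get pods --no-headers` and capture its stdout instead):

```python
# kubectl output from above, hard-coded as sample input for illustration.
sample = """\
spark-master-6c4469fdb6-rs642   1/1     Running   0          114s
spark-worker-5d4bdd44db-p2q8v   1/1     Running   0          28s
spark-worker-5d4bdd44db-v4d84   1/1     Running   0          28s"""

def all_ready(listing: str) -> bool:
    """True when every pod is Running with all its containers ready."""
    for line in listing.splitlines():
        name, ready, status, *_ = line.split()
        have, want = ready.split("/")   # READY column, e.g. "1/1"
        if status != "Running" or have != want:
            return False
    return True

print(all_ready(sample))  # True
```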
Ingress
Ingress is used to expose the Spark master web UI. (The manifest in the GitHub repository is written for minikube; our setup is different, so the Ingress API version needs to be updated.)
minikube-ingress.yaml (original version) :
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: minikube-ingress
  annotations:
spec:
  rules:
    - host: spark-kubernetes
      http:
        paths:
          - path: /
            backend:
              serviceName: spark-master
              servicePort: 8080
minikube-ingress.yaml (modified version, for Kubernetes 1.19 and later; an ingress controller must be installed in advance):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: minimal-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: spark-master
                port:
                  number: 8080
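pathType: Prefix matches the URL path element by element, so path: / routes every request to spark-master. A simplified plain-Python sketch of the matching rule (real ingress controllers handle more edge cases):

```python
# Prefix matching on "/"-separated path elements, simplified:
# "/" matches everything; "/app" matches "/app" and "/app/...",
# but not "/apple".
def prefix_match(rule_path: str, request_path: str) -> bool:
    rule = rule_path.rstrip("/")
    if rule == "":  # rule "/" matches every request path
        return True
    return request_path == rule or request_path.startswith(rule + "/")

print(prefix_match("/", "/jobs/"))      # True: routed to spark-master
print(prefix_match("/app", "/apple"))   # False: not an element boundary
```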
Create the Ingress object:
$ kubectl apply -f ./kubernetes/minikube-ingress.yaml
Visit http://127.0.0.1 to see the Spark master web UI.
Test
Start pyspark in the spark-master pod:
$ kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
spark-master-6c4469fdb6-rs642   1/1     Running   0          10m
spark-worker-5d4bdd44db-p2q8v   1/1     Running   0          8m42s
spark-worker-5d4bdd44db-v4d84   1/1     Running   0          8m42s
$ kubectl exec spark-master-6c4469fdb6-rs642 -it -- pyspark
words = 'the quick brown fox jumps over the lazy dog the quick brown fox jumps over the lazy dog'
seq = words.split()
data = spark.sparkContext.parallelize(seq)
counts = data.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
dict(counts)
After that you should see the following result:
{'brown': 2, 'lazy': 2, 'over': 2, 'fox': 2, 'dog': 2, 'quick': 2, 'the': 4, 'jumps': 2}
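For comparison, the same count can be reproduced without a cluster: map(word -> (word, 1)) followed by reduceByKey with addition is exactly a per-key count, which plain Python's Counter computes directly:

```python
from collections import Counter

words = ('the quick brown fox jumps over the lazy dog '
         'the quick brown fox jumps over the lazy dog')

# Equivalent to data.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts = Counter(words.split())
print(dict(counts))
# {'the': 4, 'quick': 2, 'brown': 2, 'fox': 2, 'jumps': 2, 'over': 2, 'lazy': 2, 'dog': 2}
```

The Spark version distributes exactly this computation: map emits a (word, 1) pair per token, and reduceByKey sums the 1s for each distinct word across partitions.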