Containers Open the Data Service Journey (2): How Kubernetes Helps Spark Big Data Analysis
Overview
This article introduces Spark + OSS on ACK, a containerized data service that lets Spark distributed computing nodes directly access Alibaba Cloud OSS object storage. Through the deep integration of Alibaba Cloud Container Service for Kubernetes with Alibaba Cloud OSS, a Spark distributed in-memory computing and machine learning cluster can analyze big data on the cloud and save its results directly back to OSS.
Prerequisites
You have created a Kubernetes cluster through Alibaba Cloud Container Service. For detailed steps, see Creating a Kubernetes Cluster.
Create a Spark OSS instance from the Container Service console
Create a 1-master + 3-worker Spark OSS instance in a few clicks
1 Log in to https://cs.console.aliyun.com/
2 Click "Application Directory"
3 Select "spark-oss", click "Parameters"

- Give your application a name, e.g. spark-oss-online2
- (Required) Fill in your oss_access_key_id and oss_access_key_secret
Worker:
# set OSS access keyID and secret
oss_access_key_id: <Your sub-account>
oss_access_key_secret: <your key_secret of sub-account>
- (Optional) Modify the number of worker nodes: Worker.Replicas: 3
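Behind these chart parameters, the access keys are typically wired into Spark's Hadoop configuration so that oss:// paths resolve. A minimal sketch of the equivalent properties, assuming the image uses the hadoop-aliyun OSS connector (the endpoint value is illustrative; check your region):

```
# spark-defaults.conf (sketch; property names assume the hadoop-aliyun connector)
spark.hadoop.fs.oss.accessKeyId      <your oss_access_key_id>
spark.hadoop.fs.oss.accessKeySecret  <your oss_access_key_secret>
spark.hadoop.fs.oss.endpoint         oss-cn-hangzhou.aliyuncs.com
spark.hadoop.fs.oss.impl             org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
```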

4 Click "Deploy"
5 Click "Kubernetes Console" to view the deployment instance
6 Click on the service to view the external endpoint, click on the URL to access the Spark cluster

7 Test the Spark cluster
1. Open a spark-shell
kubectl get pod | grep worker
spark-oss-online2-worker-57894f65d8-fmzjs 1/1 Running 0 44m
spark-oss-online2-worker-57894f65d8-mbsc4 1/1 Running 0 44m
spark-oss-online2-worker-57894f65d8-zhwr4 1/1 Running 0 44m
kubectl exec -it spark-oss-online2-worker-57894f65d8-fmzjs -- /opt/spark/bin/spark-shell --master spark://spark-oss-online2-master:7077
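As a convenience (not part of the original walkthrough), the worker pod name above can be selected programmatically instead of copied by hand. This sketch simulates the `kubectl get pod` output with `printf` so the text-processing pipeline itself can be tried anywhere; on a live cluster you would pipe `kubectl get pod` directly.

```shell
# Simulated `kubectl get pod` output (two pods shown for illustration)
sample_output='spark-oss-online2-worker-57894f65d8-fmzjs   1/1   Running   0   44m
spark-oss-online2-master-0                  1/1   Running   0   44m'

# Pick the first worker pod name: filter for "worker", take the first line,
# keep only the NAME column
worker_pod=$(printf '%s\n' "$sample_output" | grep worker | head -n 1 | awk '{print $1}')
echo "$worker_pod"
# prints: spark-oss-online2-worker-57894f65d8-fmzjs
```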
Paste the following code into the spark-shell to test reading and writing OSS with Spark:
// Save RDD to OSS bucket
val stringRdd = sc.parallelize(Seq("Test Strings\n Test String2"))
stringRdd.saveAsTextFile("oss://eric-new/testwrite12")
// Read data from OSS bucket
val lines = sc.textFile("oss://eric-new/testwrite12")
lines.take(10).foreach(println)
Test Strings
Test String2
Command-line (CLI) operations
Set up the keys and deploy the Spark cluster in one command:
export OSS_ID=<your oss id>
export OSS_SECRET=<your oss secrets>
helm install -n myspark-oss --set "Worker.oss_access_key_id=$OSS_ID,Worker.oss_access_key_secret=$OSS_SECRET" incubator/spark-oss
kubectl get svc | grep oss
myspark-oss-master ClusterIP 172.19.9.111 <none> 7077/TCP 2m
myspark-oss-webui LoadBalancer 172.19.13.1 120.55.104.27 8080:30477/TCP 2m
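Because both key values are passed in a single comma-separated `--set` argument, quoting mistakes are easy to make. A quick, hypothetical sanity check is to echo the fully expanded command before running helm (the key values below are placeholders, not real credentials):

```shell
# Placeholder credentials for illustration only
OSS_ID='LTAI-example-id'
OSS_SECRET='example-secret'

# Build the --set argument once, then print the command that would be run
SET_ARG="Worker.oss_access_key_id=${OSS_ID},Worker.oss_access_key_secret=${OSS_SECRET}"
echo helm install -n myspark-oss --set "$SET_ARG" incubator/spark-oss
# prints: helm install -n myspark-oss --set Worker.oss_access_key_id=LTAI-example-id,Worker.oss_access_key_secret=example-secret incubator/spark-oss
```

If the echoed line looks right, drop the `echo` and run it for real.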