Containers open the journey of data services (2): How Kubernetes helps Spark big data analysis

Abstract: This article introduces Spark + OSS on ACK, a containerized data service that lets Spark distributed computing nodes directly access Alibaba Cloud OSS object storage.

Overview

This article introduces Spark + OSS on ACK, a containerized data service that allows Spark distributed computing nodes to directly access Alibaba Cloud OSS object storage. By deeply integrating Alibaba Cloud Container Service for Kubernetes with Alibaba Cloud OSS storage, Spark distributed in-memory computing and machine learning clusters can analyze big data on the cloud and write the results back to OSS directly.
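
As context for the steps below, here is a minimal sketch of how a Spark application addresses OSS once an OSS Hadoop connector (such as hadoop-aliyun) is on the classpath. The spark-oss chart used in this article wires up equivalent settings for you from the access key parameters; the endpoint, bucket path, and environment variable names below are only assumptions for illustration.

// Minimal sketch (Scala): pointing Spark at OSS via Hadoop configuration.
// Assumes the hadoop-aliyun connector is available; the image bundled with the
// spark-oss chart may use a different connector with slightly different keys.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("oss-config-demo"))

// Endpoint and credentials are placeholders; the chart fills the equivalent
// values in from oss_access_key_id / oss_access_key_secret.
sc.hadoopConfiguration.set("fs.oss.endpoint", "oss-cn-hangzhou.aliyuncs.com")
sc.hadoopConfiguration.set("fs.oss.accessKeyId", sys.env("OSS_ID"))
sc.hadoopConfiguration.set("fs.oss.accessKeySecret", sys.env("OSS_SECRET"))

// After this, oss:// paths behave like any other Hadoop-compatible filesystem.
println(sc.textFile("oss://<your-bucket>/<path>").count())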

Prerequisites

You have created a Kubernetes cluster through Alibaba Cloud Container Service. For detailed steps, see Creating a Kubernetes Cluster.

Create a Spark OSS instance from the Container Service console

Create a Spark OSS instance with 1 master and 3 workers in a few clicks

1. Log in to https://cs.console.aliyun.com/
2. Click "Application Directory"
3. Select "spark-oss", then click "Parameters"


  1. Give your application a name, e.g. spark-oss-online2
  2. (Required) Fill in your oss_access_key_id and oss_access_key_secret in the Worker section:

  # Set the OSS access key ID and secret
  oss_access_key_id: <access key ID of your sub-account>
  oss_access_key_secret: <access key secret of your sub-account>

  3. (Optional) Modify the number of worker nodes, e.g. Worker.Replicas: 3


4. Click "Deploy"
5. Click "Kubernetes Console" to view the deployed instance


6. Click the service to view its external endpoint, then click the URL to access the Spark cluster.

7. Test the Spark cluster

  1. Open a spark-shell:

kubectl get pod | grep worker

spark-oss-online2-worker-57894f65d8-fmzjs 1/1 Running 0 44m

spark-oss-online2-worker-57894f65d8-mbsc4 1/1 Running 0 44m
spark-oss-online2-worker-57894f65d8-zhwr4 1/1 Running 0 44m

kubectl exec -it spark-oss-online2-worker-57894f65d8-fmzjs --  /opt/spark/bin/spark-shell --master spark://spark-oss-online2-master:7077

  2. Paste the following code to test reading from and writing to OSS with Spark (a small follow-up analysis is sketched after the output below):

// Save RDD to OSS bucket
val stringRdd = sc.parallelize(Seq("Test Strings\n Test String2"))
stringRdd.saveAsTextFile("oss://eric-new/testwrite12")

// Read data from OSS bucket
val lines = sc.textFile("oss://eric-new/testwrite12")
lines.take(10).foreach(println)

Test Strings
Test String2
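
The smoke test above only writes two lines and reads them back; the same spark-shell session can also run a small analysis and persist the result to OSS. A minimal sketch, reusing the bucket from the test (the output path is hypothetical):

// Word count over the data written above, with the result saved back to OSS.
val counts = sc.textFile("oss://eric-new/testwrite12")
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// The output path is a placeholder; any writable oss:// prefix works.
counts.saveAsTextFile("oss://eric-new/wordcount-out")
counts.collect().foreach(println)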

Command-line (CLI) operations

Set up keys and deploy the Spark cluster in one command

export OSS_ID=<your oss id>
export OSS_SECRET=<your oss secrets>

helm install -n myspark-oss --set "Worker.oss_access_key_id=$OSS_ID,Worker.oss_access_key_secret=$OSS_SECRET" incubator/spark-oss
kubectl get svc | grep oss
myspark-oss-master   ClusterIP      172.19.9.111    <none>          7077/TCP         2m
myspark-oss-webui    LoadBalancer   172.19.13.1     120.55.104.27   8080:30477/TCP   2m
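
Once the release is up, an application running inside the same Kubernetes cluster can reach the standalone master through its ClusterIP service name. A minimal sketch, assuming a driver pod that has Spark available and the release name used above (the bucket path is a placeholder):

// Connect to the deployed standalone master from inside the cluster.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("oss-analysis")
  .master("spark://myspark-oss-master:7077")  // service name from `kubectl get svc`
  .getOrCreate()

// Read from OSS exactly as in the spark-shell test above.
val lines = spark.sparkContext.textFile("oss://<your-bucket>/<path>")
println(lines.count())

spark.stop()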
