spark study notes 01

1. Course Objectives

  • 1. Become familiar with Spark-related concepts

  • 2. Build a Spark cluster

  • 3. Write a simple Spark application

2. Spark overview

  • What is Spark

    • A memory-based distributed computing engine with very fast computation. It only handles the computation of data, not its storage; it can connect to external data sources such as HDFS (in which case a Hadoop cluster also needs to be built)

  • Why learn Spark

    • Spark runs fast: because intermediate results can be kept directly in memory, it is much faster than MapReduce.

3. Spark features

  • High speed

    • Spark is up to 100x faster than MapReduce when data fits in memory and about 10x faster when it goes to disk

      • Intermediate results of Spark tasks do not have to be written to disk; they can be kept directly in memory

      • In a MapReduce job with 100 tasks, 100 corresponding processes are spawned to run them: MapReduce runs tasks as processes

      • In a Spark job with 100 tasks, only 100 threads need to be started: Spark runs tasks as threads

  • Ease of use

    • Spark applications can be written quickly in any of 4 languages: Java, Scala, Python, and R

  • Versatility

    • The same stack provides Spark SQL, Spark Streaming, MLlib, and GraphX

  • Compatibility

    • Spark can run on different resource-scheduling platforms

      • YARN: resource scheduling is handled by the ResourceManager

      • Mesos: an Apache open source resource scheduling framework

      • Standalone: resource scheduling is handled by Spark's own master

4. Spark cluster installation

  • 1. Download the spark installation package

    • spark-2.0.2-bin-hadoop2.7.tgz

  • 2. First plan the installation directory

    • /export/servers

  • 3. Unzip the installation package

    • tar -zxvf spark-2.0.2-bin-hadoop2.7.tgz -C /export/servers

  • 4. Rename the installation directory

    • mv spark-2.0.2-bin-hadoop2.7 spark

  • 5. Modify the configuration file

    • Go to the conf folder in the spark installation directory

      • Modify spark-env.sh.template (rename first)

        • mv spark-env.sh.template spark-env.sh

        • Configure java environment variables

          • export JAVA_HOME=/export/servers/jdk

        • Configure the master address of the spark cluster

          • export SPARK_MASTER_HOST=node1

        • Configure the master port of the spark cluster

          • export SPARK_MASTER_PORT=7077

      • Modify slaves.template (rename first)

        • mv slaves.template slaves

        • Add the worker nodes of the Spark cluster

          • node2

          • node3

  • 6. Configure spark environment variables

    • Modify /etc/profile

      • export SPARK_HOME=/export/servers/spark

      • export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

  • 7. Distribute the installation directory to other nodes

    • scp -r /export/servers/spark root@node2:/export/servers

    • scp -r /export/servers/spark root@node3:/export/servers

    • scp /etc/profile root@node2:/etc

    • scp /etc/profile root@node3:/etc

  • 8. Make the environment variables of all nodes take effect

    • Execute the command on all nodes

      • source /etc/profile

5. Spark cluster start and stop

  • start spark cluster

    • $SPARK_HOME/sbin/start-all.sh

  • stop spark cluster

    • $SPARK_HOME/sbin/stop-all.sh

6. Spark's master web management interface
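
  • By default the standalone master serves a web UI on port 8080 (for this cluster that would be http://node1:8080), showing the registered workers, their resources, and running and completed applications.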

7. SparkHA high availability cluster based on zk

  • 1. You need to build a zookeeper cluster

  • 2. The configuration needs to be modified

    • $SPARK_HOME/conf/spark-env.sh

      • 1. Comment out the manually specified master host (in HA mode the alive master is chosen via zookeeper)

        • #export SPARK_MASTER_HOST=node1

      • 2. Introduce zk configuration

        • export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=node1:2181,node2:2181,node3:2181 -Dspark.deploy.zookeeper.dir=/spark"

  • 3. Start Spark HA

    • Start zookeeper first

    • Run start-all.sh on any machine (password-free SSH login needs to be configured between all the machines)

      • When the script is executed, it spawns a master process on the current machine.

      • It reads the worker nodes from the slaves file and then starts a worker process on each of those hosts

    • Standby master processes can then be started separately on the other master nodes

      • start-master.sh
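
    • To check which master is currently active, open each master's web UI: the active master shows status ALIVE and the others show STANDBY; if the alive master goes down, zookeeper elects one of the standby masters to take over.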

8. Introduction to Spark roles

  • 1. Driver

    • The process that runs the main method; it creates the SparkContext

  • 2. Application

    • A complete application: the driver code plus the resources the whole computation needs

  • 3. Master

    • The leader of the Spark cluster, responsible for resource scheduling and task allocation

  • 4. Worker

    • A node that actually runs the computation

  • 5. Executor

    • A process started on a worker node to run tasks

  • 6. Task

    • The smallest unit of work in a Spark cluster; it runs as a thread inside an executor process (a short sketch follows this list)
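
  • A minimal sketch of how these roles show up in code, assuming it is pasted into spark-shell (so a SparkContext named sc already exists and the shell process itself is the driver):

    // The master has already assigned executors on the workers for this application.
    // An RDD with 4 partitions means each stage of a job over it is split into 4 tasks,
    // and every task runs as a thread inside an executor process.
    val rdd = sc.parallelize(1 to 100, 4)
    println(rdd.getNumPartitions)   // 4
    // Running an action submits a job; its 4 tasks are executed by executor threads.
    val sum = rdd.map(_ * 2).reduce(_ + _)
    println(sum)                    // 10100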

9. A first Spark program

  • 1. Submit in normal mode (specify the currently alive master address in the cluster)


    bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://node1:7077 \
    --executor-memory 1G \
    --total-executor-cores 2 \
    examples/jars/spark-examples_2.11-2.0.2.jar \
    100
  • 2. Submit in high availability mode


    bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://node1:7077,node2:7077,node3:7077 \
    --executor-memory 1G \
    --total-executor-cores 2 \
    examples/jars/spark-examples_2.11-2.0.2.jar \
    100
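
    • When multiple master addresses are listed, the driver tries them in turn and registers the application with whichever master is currently alive, so the submission keeps working after a master failover.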

10. Using spark-shell

  • 1. spark-shell --master local[2]: read a local data file and do a word count

    • local[N]

      • local means running locally, and the following number N means how many threads are used locally to run the task

    • local[*]

      • local means running locally, followed by * means using all available resources on the current machine

    • It will spawn a SparkSubmit process.


    • sc.textFile("file:///root/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
  • 2. spark-shell --master local[2] Read HDFS data files to achieve word count


    sc.textFile("hdfs://node1:9000/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
    • Spark integrates HDFS

      • Need to modify spark-env.sh

        • export HADOOP_CONF_DIR=/export/servers/hadoop/etc/hadoop


        sc.textFile("/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
  • 3. spark-shell --master spark://node1:7077 Read HDFS data files to achieve word count


    sc.textFile("/words.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect

11. Use scala to write spark's wordcount program (run locally)

  • import dependencies


      <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.11.8</version>
      </dependency>
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.0.2</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-client</artifactId>
          <version>2.7.4</version>
      </dependency>
  • code development


package cn.itcast

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

//todo: use scala to implement spark's wordcount program
object WordCount {
  def main(args: Array[String]): Unit = {
    //1. Create SparkConf and set the appName and master; local[2] means run with 2 threads locally
    val sparkConf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[2]")
    //2. Create SparkContext, the entry point of all computation; it creates the DAGScheduler and TaskScheduler
    val sc = new SparkContext(sparkConf)
    //3. Read the data file
    val data: RDD[String] = sc.textFile("D:\\words.txt")
    //4. Split each line into words
    val words: RDD[String] = data.flatMap(_.split(" "))
    //5. Count each word as 1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    //6. Accumulate the occurrences of the same word
    val result: RDD[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    //7. Print the output
    val finalResult: Array[(String, Int)] = result.collect()
    println(finalResult.toBuffer)
    //8. Close sc
    sc.stop()
  }
}
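
  • Running WordCount from the IDE prints the collected result as a buffer, for example ArrayBuffer((hello,3), (spark,2), ...); the actual words and counts depend on the contents of D:\words.txt (the values shown here are only illustrative).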


12. Use scala to write spark wordcount program (cluster operation)


package cn.itcast

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

//todo: use scala to implement spark's wordcount program (run on the cluster)
object WordCount_Online {
  def main(args: Array[String]): Unit = {
    //1. Create SparkConf and set the appName (the master is supplied by spark-submit)
    val sparkConf: SparkConf = new SparkConf().setAppName("WordCount_Online")
    //2. Create SparkContext, the entry point of all computation; it creates the DAGScheduler and TaskScheduler
    val sc = new SparkContext(sparkConf)
    //3. Read the data file (input path passed as the first argument)
    val data: RDD[String] = sc.textFile(args(0))
    //4. Split each line into words
    val words: RDD[String] = data.flatMap(_.split(" "))
    //5. Count each word as 1
    val wordAndOne: RDD[(String, Int)] = words.map((_, 1))
    //6. Accumulate the occurrences of the same word
    val result: RDD[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    //7. Save the result to HDFS (output path passed as the second argument)
    result.saveAsTextFile(args(1))
    //8. Close sc
    sc.stop()
  }
}


  • submit script


spark-submit --class cn.itcast.WordCount_Online --master spark://node1:7077 --executor-memory 1g --total-executor-cores 2 original-spark_class04-2.0.jar /words.txt /2018
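
  • The output path (/2018 here) is an HDFS directory that must not already exist; saveAsTextFile writes the word counts there as part-* files, which can be inspected with hdfs dfs -cat /2018/part-*.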

13. Use java to write spark's wordcount program (run locally)

package cn.itcast;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

//todo: use java to implement spark's wordcount program
public class WordCount_Java {
  public static void main(String[] args) {
      //1. Create the SparkConf object
      SparkConf sparkConf = new SparkConf().setAppName("WordCount_Java").setMaster("local[2]");
      //2. Create the JavaSparkContext
      JavaSparkContext jsc = new JavaSparkContext(sparkConf);
      //3. Read the data file
      JavaRDD<String> dataJavaRDD = jsc.textFile("d:\\words.txt");
      //4. Split each line into words
      JavaRDD<String> wordsJavaRDD = dataJavaRDD.flatMap(new FlatMapFunction<String, String>() {
          public Iterator<String> call(String line) throws Exception {
              String[] words = line.split(" ");
              return Arrays.asList(words).iterator();
          }
      });
      //5. Count each word as 1
      JavaPairRDD<String, Integer> wordAndOneJavaPairRDD = wordsJavaRDD.mapToPair(new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String word) throws Exception {
              return new Tuple2<String, Integer>(word, 1);
          }
      });
      //6. Accumulate the occurrences of the same word
      JavaPairRDD<String, Integer> resultJavaPairRDD = wordAndOneJavaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
          public Integer call(Integer v1, Integer v2) throws Exception {
              return v1 + v2;
          }
      });
      //To sort by the number of occurrences in descending order, first swap (word, count) into (count, word)
      JavaPairRDD<Integer, String> reverseJavaRDD = resultJavaPairRDD.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
          public Tuple2<Integer, String> call(Tuple2<String, Integer> t) throws Exception {
              return new Tuple2<Integer, String>(t._2, t._1);
          }
      });
      //Sort by count in descending order, then swap (count, word) back into (word, count)
      JavaPairRDD<String, Integer> sortedRDD = reverseJavaRDD.sortByKey(false).mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
          public Tuple2<String, Integer> call(Tuple2<Integer, String> t) throws Exception {
              return new Tuple2<String, Integer>(t._2, t._1);
          }
      });
      //7. Print the result
      List<Tuple2<String, Integer>> finalResult = sortedRDD.collect();
      for (Tuple2<String, Integer> tuple : finalResult) {
          System.out.println("Word: " + tuple._1 + " Times: " + tuple._2);
      }
      //8. Close
      jsc.stop();
  }
}
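
  • On Java 8 and later, the anonymous FlatMapFunction/PairFunction/Function2 classes above can be replaced with lambda expressions (for example reduceByKey((a, b) -> a + b)); the logic is unchanged.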


 
