Spark technical architecture, component functions, and a WordCount implementation

1. In how many ways can Spark be deployed, and what are the differences?

local (local mode): commonly used for local development and testing; it is further divided into local (single-threaded) and local-cluster (multi-threaded);

Standalone (cluster mode): a typical Master/Slave architecture, which also means the Master is a single point of failure; Spark supports ZooKeeper to achieve HA;

on yarn (cluster mode): runs on the YARN resource management framework; YARN is responsible for resource management, while Spark is responsible for task scheduling and computation;

on mesos (cluster mode): runs on the Mesos resource management framework; Mesos is responsible for resource management, while Spark is responsible for task scheduling and computation;

on cloud (cluster mode): for example AWS EC2; this mode makes it easy to access Amazon's S3. Spark supports multiple distributed storage systems, such as HDFS and S3.
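
In code, the choice of deployment mode comes down to the master URL handed to SparkConf. A minimal sketch, assuming placeholder host names and ports (none of these values come from the original post):

import org.apache.spark.SparkConf

object DeployModeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("deploy-mode-demo")
      .setMaster("local[*]")               // local mode: one worker thread per core
      // .setMaster("spark://host:7077")   // standalone cluster (Master/Slave)
      // .setMaster("yarn")                // on yarn: YARN manages the resources
      // .setMaster("mesos://host:5050")   // on mesos: Mesos manages the resources
  }
}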

 

 

2. What components are in the Spark technology stack, what does each one do, and which scenarios does each one address?

1) Spark Core: the foundation of the other components and the kernel of Spark. It mainly includes the directed acyclic graph (DAG), RDDs, lineage, cache, and broadcast variables, and it encapsulates the underlying communication framework.
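
A small sketch of the primitives named above: an RDD whose lineage is built from transformations, cache() for reuse, and a broadcast variable. The data and names are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object CoreDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("core-demo"))

    val nums   = sc.parallelize(1 to 100)             // an RDD; its lineage starts here
    val evens  = nums.filter(_ % 2 == 0).cache()      // cached so later actions can reuse it
    val factor = sc.broadcast(10)                     // broadcast variable shared by all tasks

    println(evens.map(_ * factor.value).sum())        // the action triggers the DAG to execute
    sc.stop()
  }
}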

 

2) Spark Streaming: a high-throughput, fault-tolerant streaming system for real-time data streams. It can apply Map-like operations and complex operations such as Reduce and Join to multiple data sources (e.g., Kafka, Flume, Twitter, ZeroMQ, and TCP sockets), and it turns the stream computation into a series of short batch jobs.
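
A minimal Spark Streaming sketch that reads lines from a TCP socket and counts words in 5-second micro-batches; the host and port are placeholders, and the source could equally be Kafka or Flume:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-demo")
    val ssc  = new StreamingContext(conf, Seconds(5))  // each 5-second window becomes a short batch job

    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)                              // Map/Reduce-style operations on the stream
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}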

 

3) Spark SQL: Shark was the predecessor of Spark SQL. An important feature of Spark SQL is its ability to handle relational tables and RDDs in a unified way, making it easy for developers to use SQL commands for external queries while also performing more complex data analysis.
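
A sketch of the "unified relational tables and RDDs" point, using the SparkSession API from Spark 2.x: an RDD is promoted to a table and then queried with plain SQL (the table and column names are made up):

import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sql-demo").getOrCreate()
    import spark.implicits._

    val peopleRDD = spark.sparkContext.parallelize(Seq(("Ann", 30), ("Bob", 25)))
    val peopleDF  = peopleRDD.toDF("name", "age")      // the RDD becomes a relational table
    peopleDF.createOrReplaceTempView("people")

    spark.sql("SELECT name FROM people WHERE age > 26").show()  // plain SQL over the same data
    spark.stop()
  }
}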

 

4) BlinkDB: a massively parallel query engine for running interactive SQL queries over massive data. It allows users to trade accuracy for query response time, keeping the error of the results within an allowed range.

 

5) MLBase: the part of the Spark ecosystem focused on machine learning. It lowers the barrier to machine learning so that users who know little about it can use MLBase easily. MLBase is divided into four parts: MLlib, MLI, ML Optimizer, and MLRuntime.
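
A sketch of the MLlib layer (the bottom layer of MLBase): clustering a handful of points with KMeans; the data points and parameters are invented for illustration:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object MLlibDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("mllib-demo"))
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    val model = KMeans.train(points, 2, 20)            // k = 2 clusters, 20 iterations
    points.collect().foreach(p => println(s"$p -> cluster ${model.predict(p)}"))
    sc.stop()
  }
}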

 

6) GraphX: the Spark component for graphs and graph-parallel computation.
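
A sketch of graph-parallel computation with GraphX: a tiny made-up graph and a PageRank run over it:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

object GraphXDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("graphx-demo"))

    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph    = Graph(vertices, edges)

    graph.pageRank(0.001).vertices.collect().foreach(println)  // (vertexId, rank) pairs
    sc.stop()
  }
}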

 

3. The idea and implementation of WordCount statistics in Spark.

1) Create a SparkConf configuration object and set the AppName and Master properties

2) Create a SparkContext object, the entry point of Spark

3) Call flatMap to split each line of data into a collection of words (List[String])

4) Use the map method to map each element of the collection to a (word, 1) tuple

5) Aggregate the tuple data with an aggregation function (reduceByKey) to obtain the word-count results

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val context = new SparkContext("local", "wordcount",
      new SparkConf().set("log4j.rootCategory", "WARN, console"))
    context.textFile("data/words.txt").flatMap(_.split(" +")).map((_, 1)).reduceByKey(_ + _).foreach(println)
  }
}

 

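The same word count, implemented with Spark's Java API: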
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Iterator;


public class WordCount {
    public static void main(String[] args) {
        // Only one SparkContext may be active per JVM, so build a single context from the conf.
        SparkConf conf = new SparkConf();
        conf.setMaster("local");
        conf.setAppName("wc");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setLogLevel("ERROR");
        JavaRDD<String> lines = sc.textFile("./data/words");
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });
        JavaPairRDD<String, Integer> pairWords = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
        });
        JavaPairRDD<String, Integer> result = pairWords.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });
        result.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> tp) throws Exception {
                System.out.println(tp);  // print each (word, count) tuple
            }
        });
        sc.stop();
    }
}

 

 


Origin www.cnblogs.com/eric666666/p/11203488.html