Spark Fast Big Data Analysis - Chapter II

1. The driver accesses Spark through a SparkContext object, which represents a connection to a compute cluster. The Spark shell automatically creates a SparkContext for you, and the SparkContext is what you use to create RDDs.
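For example, in spark-shell the pre-created SparkContext is available as sc, and you can use it directly to build an RDD. A minimal sketch, reusing the HDFS path from the Scala WordCount example later in this chapter:

// Paste into spark-shell: `sc` is the SparkContext the shell already created for you.
// The path is the same HDFS file used by the Scala WordCount example below.
val lines = sc.textFile("hdfs://localhost:9000/hdc/input_1")
println(lines.count())   // count() is an action, so this runs an actual job through the SparkContext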

2. Differences between Spark and MapReduce

  MapReduce has only two phases, map and reduce; once those two phases finish, the MapReduce job is over. Within one job you can only do whatever processing fits into a single map and a single reduce, which is very limiting.

  Spark uses an iterative computation model: a single job can be divided into n stages. Because the iteration happens in memory, after one stage finishes the job can keep processing through further stages, instead of being limited to just two.
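A rough Scala sketch of this idea (the file names and CSV layout are hypothetical, used only for illustration): each shuffle operator such as reduceByKey or join ends one stage and starts another, yet everything stays inside a single application whose intermediate data lives in memory.

import org.apache.spark.{SparkConf, SparkContext}

object MultiStageSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("MultiStageSketch").setMaster("local"))

    // clicks.txt and users.txt are hypothetical "userId,..." CSV files.
    val clicksPerUser = sc.textFile("clicks.txt")
      .map(line => (line.split(",")(0), 1))
      .reduceByKey(_ + _)                        // shuffle: one stage ends, the next begins
    val userNames = sc.textFile("users.txt")
      .map(line => (line.split(",")(0), line.split(",")(1)))
    val report = clicksPerUser.join(userNames)   // another shuffle, another stage
      .map { case (id, (clicks, name)) => name + "\t" + clicks }

    report.saveAsTextFile("report_out")          // the action triggers the whole multi-stage DAG
    sc.stop()
  }
}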

3. Spark's RDD explained

  Logically, an RDD is an abstraction that represents, for example, a file on HDFS; physically it is divided into multiple partitions, and Spark scatters these partitions across different nodes. For example, an RDD holding 400,000 records could be split into four partitions, each living on a different node.

  Elasticity: by default Spark keeps each partition of an RDD in memory; when there is not enough memory to hold a partition, Spark stores it on the disk of the storage node instead.

  Fault tolerance: suppose data flows from node 1 to node 2, and node 2 fails, so the data of partition 2 (on node 2) is lost. When Spark detects the failure, it automatically recomputes the lost data starting from node 1 to recover what it needs, as sketched below.
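A minimal Scala sketch that illustrates the three points above, assuming a local run (the record count and storage level are chosen purely for illustration): the RDD is split into four partitions, persisted with a memory-then-disk storage level, and its lineage, which Spark would replay to rebuild a lost partition, is printed.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("RddSketch").setMaster("local[4]"))

    // 400,000 records split into 4 partitions; on a real cluster they would sit on different nodes.
    val rdd = sc.parallelize(1 to 400000, 4)
    println(rdd.getNumPartitions)                                 // 4

    // "Elastic" storage: keep partitions in memory, spill to disk when memory is insufficient.
    val doubled = rdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)
    println(doubled.count())                                      // the first action materialises and caches the data

    // Fault tolerance: the RDD remembers its lineage, which Spark replays to rebuild a lost partition.
    println(doubled.toDebugString)

    sc.stop()
  }
}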

4. WordCount examples

  (1) Java version

package cn.spark.study.core;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

public class WordCountLocal {

    public static void main(String[] args) {
        // A Spark application that runs locally, so the main method can be executed directly in Eclipse.

        // Step 1: create the SparkConf object and set the application's configuration.
        // setMaster() sets the URL of the Spark master node the application connects to;
        // setting it to "local" means the application runs locally.
        SparkConf conf = new SparkConf()
                .setAppName("WordCountLocal")
                .setMaster("local"); // runs locally; to run on a Spark cluster, delete this line, then repackage and submit.

        // Step 2: create the JavaSparkContext object.
        // In Spark, the SparkContext is the entry point to all functionality; whether you write in Java, Scala
        // or even Python, you must have a SparkContext. Its main job is to initialize the core components a
        // Spark application needs, including the schedulers (DAGScheduler, TaskScheduler), and to register the
        // application with the Spark master node, and so on.
        // In short, the SparkContext is arguably the single most important object in a Spark application.
        // Note, however, that different kinds of Spark applications use different contexts:
        //   - in Scala you use the native SparkContext;
        //   - in Java you use JavaSparkContext;
        //   - for Spark SQL programs you use SQLContext or HiveContext;
        //   - for Spark Streaming programs you use its own StreamingContext;
        //   - and so on.
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Step 3: create an initial RDD from the input source (an HDFS file, a local file, etc.).
        // The input data is broken up and assigned to the RDD's partitions, forming an initial distributed
        // dataset. Because this is a local test, we read a local file here.
        // SparkContext provides a method for creating an RDD from a file input source, called textFile().
        // In Java, an ordinary RDD is called a JavaRDD.
        // An RDD consists of elements; for an HDFS or local file, each element of the created RDD
        // corresponds to one line of the file.
        JavaRDD<String> lines = sc.textFile("C:/Users/Think/Desktop/spark.txt");

        // Step 4: apply transformation operations to the initial RDD, i.e. the actual computation.
        // This is usually done by creating a function and passing it to an RDD operator such as map or flatMap.
        // If the function is simple, it is written as an anonymous inner class implementing the function
        // interface; if it is more complex, a separate class implementing the interface is created.
        // First, split each line into individual words.
        // FlatMapFunction has two generic parameters, the input and output types. Here the input is String,
        // because the input is lines of text, and the output is also String, the individual words.
        // Briefly, the flatMap operator splits one element of the RDD into one or more elements.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterator<String> call(String s) throws Exception {
                List<String> list = new ArrayList<String>();
                String[] arr = s.split(" ");
                for (String ss : arr) {
                    list.add(ss);
                }
                return list.iterator();
            }

        });

        // Next, map each word to the form (word, 1).
        // Only then can the word be used as the key to accumulate how many times each word appears.
        // mapToPair maps each element to a Tuple2 element of the form (v1, v2).
        // If you remember tuples from Scala: yes, Tuple2 here is the Scala type holding two values.
        // mapToPair must be used together with PairFunction; its first generic parameter is the input type,
        // and the second and third are the types of the first and second values of the output Tuple2.
        // The two generic parameters of JavaPairRDD are the types of the first and second values of each tuple.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(

                new PairFunction<String, String, Integer>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Tuple2<String, Integer> call(String word) throws Exception {
                        return new Tuple2<String, Integer>(word, 1);
                    }

                });

        // Next, count the occurrences of each word, using the word as the key.
        // The reduceByKey operator performs a reduce over all the values that share the same key.
        // For example, if the JavaPairRDD contains (hello, 1) (hello, 1) (hello, 1) (world, 1),
        // the reduce computes the first value with the second, then that result with the third, and so on.
        // For "hello" that is 1 + 1 = 2, then 2 + 1 = 3.
        // The elements of the returned JavaPairRDD are still tuples, where the first value is the key and
        // the second value is the reduced result for that key, i.e. the number of times the word appears.
        JavaPairRDD<String, Integer> wordCounts = pairs.reduceByKey(

                new Function2<Integer, Integer, Integer>() {

                    private static final long serialVersionUID = 1L;

                    @Override
                    public Integer call(Integer v1, Integer v2) throws Exception {
                        return v1 + v2;
                    }

                });

        // At this point we have chained several Spark operators and computed the word counts.
        // However, flatMap, mapToPair and reduceByKey are all transformation operations.
        // A Spark application that contains only transformations will not execute anything; it also needs
        // an action. So finally we use an action operation, for example foreach, to trigger execution.
        wordCounts.foreach(new VoidFunction<Tuple2<String, Integer>>() {

            private static final long serialVersionUID = 1L;

            @Override
            public void call(Tuple2<String, Integer> wordCount) throws Exception {
                System.out.println(wordCount._1 + " appeared " + wordCount._2 + " times.");
            }

        });

        sc.close();
    }

}

  (2) Scala version: package it (with Maven, or export it) and run it.

package cn.spark.study.core

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://localhost:9000/hdc/input_1")
    val words = lines.flatMap { line => line.split(" ") }
    val pairs = words.map { word => (word, 1) }
    val wordCounts = pairs.reduceByKey { _ + _ }
    wordCounts.foreach(wordCount => println(wordCount._1 + " appeared " + wordCount._2 + " times."))
  }
}

Then a spark-submit script is used to configure and run the job; the script looks like this:

  

# main class: package cn.spark.study.core, class WordCountCluster
/home/hdc/software/spark-2.4.3-bin-hadoop2.7/bin/spark-submit \
--class cn.spark.study.core.WordCountCluster \
/home/hdc/Document/spark_test/Wordcount.jar

 
