[05] Spark In-Depth Study Tour: RDD Basics - 01

----------------

This Section

1. RDD workflow

2. WordCount explained

 · Shell version of WordCount

 · Java version of WordCount

----------------

 

A. RDD workflow

   1. RDD is a Spark-specific data model. When RDDs come up, so do terms such as resilient distributed dataset and directed acyclic graph; we will not expand on these advanced concepts in this article. While reading, you can simply treat an RDD as an array, which helps a great deal when learning the RDD API. All sample code in this article is written in Scala. The RDD workflow is as follows (a minimal sketch follows the list):

• Create an input RDD from an external data source, or by distributing a collection of objects from the driver program

• Transform the RDD, turning one RDD into a new RDD, for example with the filter() operation

• If the RDD will be reused, call persist() on it

• Perform an action to trigger parallel computation; Spark first optimizes the execution plan and then runs it, for example count() and first()
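
As a minimal sketch of these four steps (assuming an existing SparkContext sc; the HDFS path is illustrative, not a file used elsewhere in this article):

-----------------------

val lines = sc.textFile("hdfs:///tmp/test/input.txt")   // 1. create an input RDD from external data
val longer = lines.filter(l => l.length > 10)           // 2. transform it into a new RDD
longer.persist()                                        // 3. persist it, because two actions reuse it below
println(longer.count())                                 // 4. actions trigger the actual parallel computation
println(longer.first())

-----------------------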

  

  RDDs can be created in two ways:

(1) Distribute a collection of objects from the driver program, i.e., build the RDD from in-memory data, using the makeRDD and parallelize methods:

----------------------- 

val rdd01 = sc.makeRDD(List(1,2,3,4,5,6));

val r01 = rdd01.map { x => x * x }

println(r01.collect().mkString(","))

/* from an Array */

val rdd02 = sc.makeRDD(Array(1,2,3,4,5,6))

val r02 = rdd02.filter { x => x < 5}

println(r02.collect().mkString(","))

 

val rdd03 = sc.parallelize(List(1,2,3,4,5,6), 1)

val r03 = rdd03.map { x => x + 1 }

println(r03.collect().mkString(","))

/* from an Array */

val rdd04 = sc.parallelize(Array(1,2,3,4,5,6), 1)

val r04 = rdd04.filter { x => x > 3 }

println(r04.collect().mkString(","))

 ----------------------- 
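
A quick sanity check on partitioning (a small sketch that reuses the rdd03 defined above):

// The second argument to parallelize/makeRDD sets the number of partitions;
// partitions.size shows how the RDD was actually split.
println(rdd03.partitions.size)   // prints 1, because a single partition was requested above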

 

2. The difference between makeRDD and parallelize

  makeRDD has two implementations. The first has the same declaration as parallelize and accepts exactly the same parameters, def makeRDD[T: ClassTag](seq: Seq[T], numSlices: Int = defaultParallelism); this implementation simply delegates to parallelize. The second implementation, def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]), additionally attaches preferred-location hints (a Seq of hostnames) to each element's partition.

The first makeRDD implementation:

val blog1=sc.parallelize(List(1,2,3));

val blog2=sc.makeRDD(List(1,2,3));

 

The second makeRDD implementation:

val seq = List((1, List("a","b","c")), (2, List("aa","bb","cc")));

val blog3 = sc.makeRDD(seq);

// The Seq[String] attached to each element becomes the preferred locations of the corresponding partition
blog3.preferredLocations(blog3.partitions(0));   // List(a, b, c)

blog3.preferredLocations(blog3.partitions(1));   // List(aa, bb, cc)
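
By contrast, an RDD built with the first form carries no location hints, so the same query comes back empty (a small check against the blog2 defined above):

// blog2 was created with the plain makeRDD(List(1,2,3)), so no preferred locations were recorded
blog2.preferredLocations(blog2.partitions(0));   // returns an empty list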

 

B. WordCount

  WordCount is the classic first example of distributed programming, and this section also uses WordCount as the RDD demo.

1. Spark shell version

---------------------------------------------------

// load the file from HDFS
val txtFile = "/tmp/test/core-site.xml";
val txtData = sc.textFile(txtFile);

// Cache the RDD produced in the previous step so that Spark does not have to recompute it on every later query
txtData.cache();

// flatMap maps each line to its words and flattens the result; map emits (word, 1) pairs; reduceByKey sums the counts per word
val wcData = txtData.flatMap(l => l.split(" ")).map(word => (word, 1)).reduceByKey(_ + _);

// collect() brings all elements of the RDD back to the driver; print them line by line
wcData.collect().foreach(println);
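
As a variant, instead of collecting the result to the driver you could write it back to HDFS (the output path below is illustrative, not used elsewhere in this article):

// Writes one part-file per partition under the given output directory
wcData.saveAsTextFile("/tmp/test/wordcount-output");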

 

Remarks:

A. Startup parameters for spark-shell

bin/spark-shell --executor-memory 1G --total-executor-cores 10 --executor-cores 1 --master yarn-client --driver-class-path /usr/local/tdr_hadoop/spark/spark-1.6.0-bin-hadoop2.6/lib/mysql-connector-java-5.1.40-bin.jar

 

--executor-memory: the amount of memory used by each executor

--total-executor-cores: the total number of CPU cores used by all executors

--executor-cores: the number of CPU cores used by each executor

--driver-class-path: extra jar packages to load onto the driver classpath

--master:

local[8]: run locally on a single machine using 8 threads; the data is pulled to the local machine for execution

spark://master01:7077: run on a Standalone cluster; the application is submitted to the master at the given address. This requires a real Standalone cluster to have been started in advance. Multiple master addresses can be specified, separated by commas.

yarn-client: client mode; the driver runs in the same process as the client that submits the application

yarn-cluster: cluster mode; the driver is started inside a worker process in the cluster, and the client process exits as soon as the job has been submitted, without waiting for the application to finish. spark-shell must use yarn-client mode, because you need to type commands interactively on the client.

 

B. spark-shell is itself a Spark application, so it must request resources from the resource manager it runs on: Spark Standalone, YARN, or Mesos. In this example the resources are requested from Spark Standalone, so when starting spark-shell you need to point it at the Standalone cluster that will provide the resources, via the MASTER parameter.

If MASTER is not set in spark-env.sh, start the shell with MASTER=spark://cdh1:7077 bin/spark-shell; if MASTER is already set in spark-env.sh, you can simply run bin/spark-shell.

By default, spark-shell requests all available CPU resources.

 

C. Each Spark executor executes the tasks assigned to it

 

 

 

 

 

2. Java version

Setting up the Spark development environment

(1) Prerequisites: JDK and Scala are already installed and configured on Windows
(2) Download IntelliJ IDEA from the official website: https://www.jetbrains.com/idea/; on Windows you can install it by double-clicking the installer
(3) Install the Scala plugin

Install the Scala plugin as shown below, then click Restart to restart IntelliJ.

 

(4) Write the WordCount code in IntelliJ.
Create a new Scala project:
  File -> New -> Project -> Scala, Project Name: spark02

Under the src directory create the package cn.com, and inside that package create an object named word. The complete word.scala code is as follows:

---------------------------------------------------------------------
package cn.com
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

/**
  * Created by Administrator on 2016/11/2.
  */
object word {
  def main(args: Array[String]) {
    if(args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    //SparkContext is the channel through which code is submitted to a cluster or run locally; whatever Spark code we write, whether it runs locally or on a cluster, it must have a SparkContext instance
    
    val line = sc.textFile(args(0))
    //Save the file contents in the line variable; line is in fact a MappedRDD, and all Spark operations are based on RDDs

    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect.foreach(println)
    sc.stop
  }

}

---------------------------------------------------------------------
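
The code above creates a bare SparkConf and relies on spark-submit to supply the master and application name. As a sketch, for local testing you could also set them explicitly (the values below are illustrative, not part of the original project):

val conf = new SparkConf()
  .setAppName("word")      // the name shown in the Spark/YARN UI; illustrative
  .setMaster("local[2]")   // only for local testing; omit it when submitting with spark-submit --master yarn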

b. Import the Spark jar

File -> Project Structure -> Project Settings -> Libraries -> +

Import the spark-assembly-1.6.0-hadoop2.6.0.jar package (obtained from the lib directory of the Spark installation).

 

c. Configure Artifacts

File -> Project Structure -> Project Settings -> Artifacts -> +, select the module to package and its main class,

and specify the output location of the jar package.






d. Build the jar package
Build -> Build Artifacts -> Build; the jar is produced at D:\spark02\out\artifacts\spark02_jar\spark02.jar

 

e. Upload the jar to the Spark client and run it
  Run the command:

spark-submit --master yarn --executor-memory 1000M /usr/local/tdr_hadoop/spark/spark02.jar hdfs://tdrHadoop/tmp/test/core-site.xml

 

The YARN web UI shows the job running.

 

The output of the job:

Reposted from: https://www.cnblogs.com/licheng/p/6815309.html
