[spark] RDD creation

First, we need to create a SparkConf configuration object, and then create a SparkContext from that configuration.

import org.apache.spark._

object MyRdd {
    def main(args: Array[String]): Unit = {
        // Initialize the configuration: set the master URL and the application name
        val conf = new SparkConf().setMaster("local[*]").setAppName("MyRdd")
        // Create the SparkContext from the conf
        val sc = new SparkContext(conf)
    }
}

Then we create RDDs through the SparkContext.

There are several ways to create an RDD:

1. Create an RDD from a collection in the program - role: mainly used for testing

  Use the sc.parallelize(collection) method to create the RDD:

       /*
        * Create an RDD from a Scala collection
        * Calculation: 1+2+3+...+100
        */
        val nums = (1 to 100).toList          // the collection
        val rdd = sc.parallelize(nums)        // create the RDD
        val sum = rdd.reduce(_ + _)           // sum all elements
        println(sum)                          // prints 5050
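
  sc.parallelize also accepts an optional second argument that controls how many partitions the resulting RDD is split into. A minimal sketch, run inside the same main as above (the partition count 4 is only an illustrative value, not from the original example):

        // Split the data into 4 partitions explicitly (4 is an assumed value)
        val rdd4 = sc.parallelize(1 to 100, 4)
        println("partitions: " + rdd4.partitions.length) // prints 4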

2. Create an RDD from a local file - role: testing with larger volumes of data. For example:

"file:///home/hadoop/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json"

3. Create an RDD from HDFS - role: the most commonly used way to create an RDD in a production environment. For example:

"hdfs://112.74.21.122:9000/user/hive/warehouse/hive_test"

  Read the file with the sc.textFile(path) method; the same call works for both local and HDFS paths:

       /*
        * Create an RDD from the local file system
        * Calculate the total number of characters in the people.json file
        */
        val rows = sc.textFile("file:///home/hadoop/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json") // local file or HDFS path
        val length = rows.map(row => row.length).reduce(_ + _)
        println("total chars length:" + length)

4. Create RDDs from databases, NoSQL stores (such as HBase), S3, and data streams
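
  As one illustration of the NoSQL case, an RDD can be built on top of an HBase table through Hadoop's InputFormat API. This is only a sketch under assumptions: the HBase client libraries are on the classpath, and the ZooKeeper quorum address and table name "hive_test" are placeholder values, not from the original article.

        import org.apache.hadoop.hbase.HBaseConfiguration
        import org.apache.hadoop.hbase.client.Result
        import org.apache.hadoop.hbase.io.ImmutableBytesWritable
        import org.apache.hadoop.hbase.mapreduce.TableInputFormat

        // Point the HBase client at the cluster and pick the table to scan
        // (the quorum address and table name below are placeholders)
        val hbaseConf = HBaseConfiguration.create()
        hbaseConf.set("hbase.zookeeper.quorum", "112.74.21.122")
        hbaseConf.set(TableInputFormat.INPUT_TABLE, "hive_test")

        // Each element of the resulting RDD is a (row key, Result) pair
        val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf,
            classOf[TableInputFormat],
            classOf[ImmutableBytesWritable],
            classOf[Result])
        println("rows in table: " + hbaseRdd.count())

  Reading from S3 follows the same textFile pattern with an s3a:// style URI (provided the Hadoop S3 connector and credentials are configured), while live data streams are normally handled through Spark Streaming, which produces RDDs in small batches.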
