First, we need to create a SparkConf configuration object, and then create a SparkContext from that configuration.
import org.apache.spark._

object MyRdd {
  def main(args: Array[String]): Unit = {
    // Initialize the configuration: set the master URL and the application name
    val conf = new SparkConf().setMaster("local[*]").setAppName("MyRdd")
    // Create the SparkContext from the configuration
    val sc = new SparkContext(conf)
  }
}
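The sc variable created here is the SparkContext used in all of the examples that follow.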
Then we create RDDs through the SparkContext.
Several ways to create an RDD
1. Create an RDD based on a collection in the program - role: mainly used for testing
Create the RDD through the sc.parallelize(collection) method.
/*
 * Create an RDD from a Scala collection
 * Calculation: 1 + 2 + 3 + 4 + 5
 */
val nums = List(1, 2, 3, 4, 5)   // the source collection
val rdd = sc.parallelize(nums)   // create the RDD from the collection
val sum = rdd.reduce(_ + _)      // sum the elements
println(sum)
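parallelize also accepts an optional second argument that controls how many partitions the resulting RDD is split into. A minimal sketch (the partition count 4 here is an arbitrary example value):

// Spread the collection over 4 partitions (an arbitrary example);
// without the argument, Spark uses the default parallelism
val rdd4 = sc.parallelize(1 to 100, 4)
println(rdd4.getNumPartitions)   // 4
println(rdd4.reduce(_ + _))      // 5050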
2. Create an RDD based on a local file - role: testing with larger data volumes
"file:///home/hadoop/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json"
3. Create an RDD based on HDFS - role: the most commonly used way to create RDDs in a production environment
"hdfs://112.74.21.122:9000/user/hive/warehouse/hive_test"
Read the file through the sc.textFile(path) method.
/*
 * Create an RDD from the local file system
 * Calculate the total number of characters in the people.json file
 */
// Local file path (the file shown above) or an HDFS path
val rows = sc.textFile("file:///home/hadoop/spark-1.6.0-bin-hadoop2.6/examples/src/main/resources/people.json")
val length = rows.map(row => row.length()).reduce(_ + _)
println("total chars length: " + length)
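The same sc.textFile call also reads from HDFS. For example, using the HDFS path shown above (a sketch, assuming that path exists and is readable):

// Read from HDFS the same way; each element of the RDD is one line of the file
val hdfsRows = sc.textFile("hdfs://112.74.21.122:9000/user/hive/warehouse/hive_test")
println("lines: " + hdfsRows.count())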
4. Create RDDs based on databases, NoSQL stores (such as HBase), S3, and data streams.
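Sources like HBase are typically read through a Hadoop InputFormat via sc.newAPIHadoopRDD, while S3 can be read with sc.textFile using an s3a:// URL. Below is a minimal HBase sketch, assuming a table named hbase_test exists and the HBase client jars are on the classpath (the table name is a hypothetical example):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the HBase configuration at the table to scan ("hbase_test" is a hypothetical name)
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "hbase_test")

// Each element of the RDD is a (row key, Result) pair read through the HBase TableInputFormat
val hbaseRdd = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println("rows in table: " + hbaseRdd.count())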