Java RDD Introduction

RDD Introduction

RDD, short for Resilient Distributed Dataset, is the most central concept in Spark: it is Spark's abstraction of data.

An RDD is a distributed collection of elements. Every RDD is read-only and is divided into multiple partitions that are stored on different nodes of the cluster. In addition, an RDD lets the user explicitly specify whether its data is kept in memory or on disk. Mastering RDDs is the first step in Spark development.

 

1: Creation operations: creating an RDD is the responsibility of the SparkContext.
2: Transformation operations: one RDD is converted into another RDD through some operation.
3: Action operations: Spark evaluates lazily; an action on an RDD is what triggers a Spark job to run (see the sketch after this list).
4: Control operations: operations such as persisting an RDD.
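
As point 3 says, transformations such as map() only record lineage; no computation happens until an action is called. A minimal sketch illustrating this, assuming an already created JavaSparkContext named sc and java.util.Arrays imported:

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(5, 4, 3, 2, 1));
// Transformation: nothing runs yet, Spark only records how to compute 'tens'.
JavaRDD<Integer> tens = numbers.map(v -> v * 10);
// Action: this call triggers an actual Spark job and returns the results to the driver.
System.out.println(tens.collect());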

 

Demo code: https://github.com/zhp8341/sparkdemo/blob/master/src/main/java/com/demo/spark/rdddemo/OneRDD.java

One: Creation operations

An RDD can be created in two ways (a fuller setup sketch follows after the two snippets below):
1. Read an external data set (SparkContext.textFile()):

JavaDStream<String> lines = jssc.textFileStream("/Users/huipeizhu/Documents/sparkdata/input/");
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);


2. Parallelize an in-memory collection (SparkContext.parallelize()):

List<Integer> list = Arrays.asList(5, 4, 3, 2, 1);
JavaRDD<Integer> rdd = sc.parallelize(list);
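
Note that the two snippets under way 1 actually use the streaming API (a JavaStreamingContext named jssc) rather than SparkContext.textFile(). A minimal batch setup sketch, assuming the Spark 2.x Java API, a local master, and the sample input path used above:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDCreationSketch {
    public static void main(String[] args) {
        // App name and local[*] master are illustrative choices.
        SparkConf conf = new SparkConf().setAppName("RDDCreationSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Way 1: create an RDD from an external data set (one element per line of the input files).
        JavaRDD<String> fileLines = sc.textFile("/Users/huipeizhu/Documents/sparkdata/input/");

        // Way 2: create an RDD by parallelizing an in-memory collection.
        List<Integer> list = Arrays.asList(5, 4, 3, 2, 1);
        JavaRDD<Integer> rdd = sc.parallelize(list);
        System.out.println("Parallelized elements: " + rdd.collect());

        sc.stop();
    }
}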

Two: Transformation operations

1: Transformation operations on a single RDD
map(): applies a function to each element and returns a new RDD
System.out.println("RDD with each element multiplied by 10: " + rdd.map(v -> v * 10).collect());


filter(): tests each element and returns a new RDD containing only the elements that satisfy the predicate
System.out.println("RDD with the element 1 removed: " + rdd.filter(v -> v != 1).collect());

flatMap(): applies a function to each element and flattens all the returned iterators into a single new RDD
rdd.flatMap(x -> IntStream.rangeClosed(x, 3).boxed().iterator()).collect()   // java.util.stream.IntStream; yields the values from x up to 3

distinct(): removes duplicate elements
System.out.println("RDD after deduplication: " + rdd.distinct().collect());

Maximum and minimum values of an RDD:

Integer max = rdd.reduce((v1, v2) -> Math.max(v1, v2));

Integer min = rdd.reduce((v1, v2) -> Math.min(v1, v2));


2: Transformation operations on two RDDs

Simple set operations on two RDDs, rdd1 = [1, 2, 3] and rdd2 = [3, 4, 5] (a runnable sketch follows at the end of this subsection):

union(): union of the two RDDs, without removing duplicates
System.out.println("Union of the two RDDs: " + rdd1.union(rdd2).collect());

intersection(): intersection of the two RDDs
System.out.println("Elements common to both RDDs: " + rdd1.intersection(rdd2).collect());

cartesian(): Cartesian product of the two RDDs
System.out.println("Cartesian product of the two RDDs: " + rdd1.cartesian(rdd2).collect());

subtract(): removes from one RDD the elements that also appear in the other
rdd1.subtract(rdd2).collect()
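
Putting the two-RDD operations together, a small sketch using the [1, 2, 3] and [3, 4, 5] data above (assuming an existing JavaSparkContext sc and java.util.Arrays imported; element order in the printed results may vary):

JavaRDD<Integer> rdd1 = sc.parallelize(Arrays.asList(1, 2, 3));
JavaRDD<Integer> rdd2 = sc.parallelize(Arrays.asList(3, 4, 5));

System.out.println(rdd1.union(rdd2).collect());        // [1, 2, 3, 3, 4, 5] (duplicates kept)
System.out.println(rdd1.intersection(rdd2).collect()); // [3]
System.out.println(rdd1.subtract(rdd2).collect());     // [1, 2]
System.out.println(rdd1.cartesian(rdd2).collect());    // nine pairs: (1,3), (1,4), (1,5), (2,3), ...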

 

Three: Action operations


collect(): returns all elements
System.out.println("Original data: " + rdd.collect());

count(): returns the number of elements
System.out.println("Number of elements in the RDD: " + rdd.count());

countByValue(): returns the number of times each element occurs
System.out.println("Occurrences of each element: " + rdd.countByValue());

take(num): returns the first num elements
System.out.println("First two elements of the RDD: " + rdd.take(2));

top(num): returns the top num elements (largest first)
System.out.println("Top two elements of the RDD: " + rdd.top(2));


reduce(func): aggregates all elements of the RDD in parallel (the most commonly used action)
System.out.println("Sum of all elements in the RDD: " + rdd.reduce((v1, v2) -> v1 + v2));

foreach(func): applies func to each element
rdd.foreach(t -> System.out.print(t));


Four: Control operations


cache(): persists the RDD in memory; equivalent to persist() with the default storage level

persist(StorageLevel level): persists the RDD at the given storage level, keeping its lineage (dependencies)

checkpoint(): saves the RDD to reliable storage and cuts its lineage (dependencies)

Control operations are essentially about persistence.
You can persist an RDD with the persist() or cache() method. The RDD is computed the first time an action runs on it, and it is then kept in memory on each node. Spark's cache is fault-tolerant: if any partition of a cached RDD is lost, Spark automatically recomputes it from the original transformations that created it.
In addition, each persisted RDD can be stored using a different storage level.
Spark automatically monitors cache usage on each node and evicts old data with a least-recently-used (LRU) policy. If you want to remove an RDD from the cache manually, use the RDD.unpersist() method.
In practice we can also persist data in third-party stores such as Redis. A minimal persistence sketch follows below.
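
A sketch of the control operations above, assuming an existing JavaSparkContext sc, java.util.Arrays imported, StorageLevel from org.apache.spark.storage.StorageLevel, and an illustrative checkpoint directory:

sc.setCheckpointDir("/tmp/spark-checkpoint");   // illustrative path for checkpoint files

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(5, 4, 3, 2, 1));
JavaRDD<Integer> tens = rdd.map(v -> v * 10);

// cache(): keep the RDD in memory after the first action computes it (default storage level).
rdd.cache();

// persist(): same idea, but with an explicit storage level; the lineage is kept,
// so a lost partition can be recomputed from its parent RDD.
tens.persist(StorageLevel.MEMORY_AND_DISK());

// checkpoint(): write the RDD to reliable storage and cut its lineage.
// It must be called before the first action on the RDD.
tens.checkpoint();

System.out.println(tens.count());   // first action: computes, caches, and checkpoints

// Manually remove the RDDs from the cache when they are no longer needed.
tens.unpersist();
rdd.unpersist();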

 

From: https://www.cnblogs.com/diaozhaojian/p/9152530.html
