Spark Core Development 2

Commonly used Spark operators:
map operator:
Each data item of the original RDD is mapped into a new element by the user-defined function f; partition by partition, each original partition is transformed by f into the corresponding partition of the new RDD. Note, however, that f is only actually executed, together with the other operations of the same stage, once an action operator is triggered.
mapValues operator:
Performs a map over the value of (key, value) data without operating on the key. For example, with a => a + 2, a record such as (V1,1) keeps its key and only has 2 added to its value 1, giving (V1,3).
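A minimal mapValues sketch in the Java API, assuming the JavaSparkContext sc built in the listing further below, a scala.Tuple2 import, and Java 8 lambda syntax; the pairs and the a => a + 2 logic mirror the (V1,1) example above:

JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("V1", 1), new Tuple2<>("V2", 5)));
// only the value is touched, the key is left as-is: (V1,1) -> (V1,3)
JavaPairRDD<String, Integer> plusTwo = pairs.mapValues(a -> a + 2);
System.out.println(plusTwo.collect()); // [(V1,3), (V2,7)]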
flatMap operator:
Each element of every partition of the original RDD is converted by the function f into one or more new elements, and the generated collections are merged into a single collection.
mapPartitions operator:
Operates on one whole partition at a time: the partition function receives an iterator over all elements of the partition and processes the entire partition in one call.
union operator: takes the union.
The two RDDs must contain elements of the same type; the returned RDD has the same element type, all elements of both RDDs are kept, and no deduplication is performed.
intersection operator: takes the intersection.
Returns the intersection of the two RDDs, with duplicates removed.
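A small sketch contrasting the two, assuming the JavaSparkContext sc from the listing further below:

JavaRDD<Integer> a = sc.parallelize(Arrays.asList(1, 2, 3, 3));
JavaRDD<Integer> b = sc.parallelize(Arrays.asList(3, 4, 5));
// union keeps every element of both RDDs, duplicates included
System.out.println(a.union(b).collect());        // [1, 2, 3, 3, 3, 4, 5]
// intersection returns only the common elements and removes duplicates
System.out.println(a.intersection(b).collect()); // [3]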
groupByKey operator:
Before this operator is executed, a key has to be generated for each element by some function, converting the data into key-value format; the operator then groups the elements with the same key together. The number of partitions to generate can be specified as groupByKey(x); x determines the number of partitions and the partitioning function, and thus the degree of parallelism.
combineByKey operator:
aggregateByKey, foldByKey, reduceByKey and other functions are implemented on top of this function, which combines the values belonging to the same key (for example turning the elements of an RDD[(Int, Int)] into elements of an RDD[(Int, Seq[Int])]):
(V1,2)
(V1,1) -----> (V1, Seq(2,1))
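A hedged combineByKey sketch that reproduces the (V1,2),(V1,1) -----> (V1, Seq(2,1)) transition above by collecting each key's values into a list; it assumes the JavaSparkContext sc from the listing below and Java 8 lambda syntax:

JavaPairRDD<String, Integer> kv = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("V1", 2), new Tuple2<>("V1", 1), new Tuple2<>("V2", 7)));
JavaPairRDD<String, List<Integer>> combined = kv.combineByKey(
        v -> new ArrayList<>(Arrays.asList(v)),     // createCombiner: start a list with the first value of a key
        (list, v) -> { list.add(v); return list; }, // mergeValue: add further values within one partition
        (l1, l2) -> { l1.addAll(l2); return l1; }); // mergeCombiners: merge the per-partition lists
System.out.println(combined.collect()); // e.g. [(V1,[2, 1]), (V2,[7])]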
reduceByKey operator:
Merges the two values of a key into one value, (Int, V) ====> (Int, C).
With the user-defined function (a, b) => a + b, data with the same key such as (V1,1), (V1,2) ----> (V1,3).
aggregateByKey operator:
Operates in units of partitions: aggregateByKey(initial value, function 1, function 2).
Example: function 1 defines the logic that selects the maximum among the several values of each key;
function 2 defines the logic that sums.
Underlying implementation:
1. Within each partition, for each key, the initial value is compared with v1, the result with v2, and so on, until the maximum value is obtained.
2. The statistics obtained from the different partitions are then aggregated: the results for the same key are added together.
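A sketch matching the description above (function 1 takes the per-partition maximum of a key's values, function 2 sums those maxima across partitions), with 0 as the initial value; it assumes the sc from the listing below and Java 8 lambdas:

JavaPairRDD<String, Integer> kv = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("V1", 3), new Tuple2<>("V1", 2),
        new Tuple2<>("V1", 5), new Tuple2<>("V2", 4)), 2);  // 2 partitions
JavaPairRDD<String, Integer> agg = kv.aggregateByKey(0,
        (acc, v) -> Math.max(acc, v),  // function 1: maximum within one partition
        (m1, m2) -> m1 + m2);          // function 2: sum the per-partition maxima
System.out.println(agg.collect()); // with the split above: V1 -> 3 + 5 = 8, V2 -> 4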
sortByKey operator:
Sorts the (key, value) data by key.
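A minimal sortByKey sketch, again assuming the sc from the listing below:

JavaPairRDD<Integer, String> kv = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(3, "c"), new Tuple2<>(1, "a"), new Tuple2<>(2, "b")));
// true = ascending by key; sortByKey(false) would sort in descending order
System.out.println(kv.sortByKey(true).collect()); // [(1,a), (2,b), (3,c)]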
join operator:
Joining two RDDs requires the cogroup operation first, so that data with the same key lands in the same partition. After cogroup produces the new RDD, the Cartesian product of the elements under each key is formed and the result is flattened: all the tuples under one key form a collection, and RDD[(K, (V, W))] is returned.
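A short join sketch under the same assumptions (the sc from the listing below, illustrative keys k1/k2/k3); only keys present in both RDDs survive, and the values under a shared key are combined as a Cartesian product:

JavaPairRDD<String, Integer> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("k1", 1), new Tuple2<>("k1", 2), new Tuple2<>("k2", 3)));
JavaPairRDD<String, String> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("k1", "x"), new Tuple2<>("k3", "y")));
// result type is JavaPairRDD<K, Tuple2<V, W>>, i.e. RDD[(K, (V, W))]
System.out.println(left.join(right).collect()); // [(k1,(1,x)), (k1,(2,x))]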
cogroup operator:
The cogroup function partitions two RDDs cooperatively. For the key-value elements of the two RDDs, the elements with the same key in each RDD are aggregated into one collection, and for every key a pair of the two collections is returned: (K, (Iterable[V], Iterable[W])), where the value is a tuple of the two collections built from the data with that key in the two RDDs.
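A cogroup sketch with the same two RDDs as in the join example, assuming the sc from the listing below; unlike join, every key appears in the result, with a (possibly empty) collection of values from each side:

JavaPairRDD<String, Integer> left = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("k1", 1), new Tuple2<>("k1", 2), new Tuple2<>("k2", 3)));
JavaPairRDD<String, String> right = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("k1", "x"), new Tuple2<>("k3", "y")));
// result type is JavaPairRDD<K, Tuple2<Iterable<V>, Iterable<W>>>
System.out.println(left.cogroup(right).collect());
// e.g. [(k1,([1, 2],[x])), (k2,([3],[])), (k3,([],[y]))]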
cartesian operator:
Performs a Cartesian product over all elements of the two RDDs.
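A cartesian sketch, assuming the sc from the listing below:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2));
JavaRDD<String> letters = sc.parallelize(Arrays.asList("a", "b"));
// every element of the first RDD is paired with every element of the second
System.out.println(nums.cartesian(letters).collect()); // [(1,a), (1,b), (2,a), (2,b)]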
repartition operator:
Repartitions the input RDD; each output partition holds a subset of the input data.
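A repartition sketch, assuming the sc from the listing below; the data is the same list used in mapPartitionDemo further down:

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);
System.out.println(rdd.partitions().size());            // 2
JavaRDD<Integer> repartitioned = rdd.repartition(4);    // redistributes the data via a shuffle
System.out.println(repartitioned.partitions().size());  // 4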
filter operator:
Filters the elements; the filter function's return type is Boolean. Elements for which it returns true are kept in the RDD, elements for which it returns false are filtered out.
sample(false, 0.001, 20):
false: whether to sample with replacement
0.001: the sampling fraction
20: the random number seed
Samples the data of each RDD partition according to the sampling fraction and returns a new RDD.
takeSample (action operator):
Returns the sampled data set directly (as an action, the result is brought back to the driver rather than produced as a new RDD).
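A takeSample sketch, assuming the sc from the listing below; unlike the sample transformation, takeSample takes a count rather than a fraction and returns the sampled elements directly to the driver:

List<Integer> nums = new ArrayList<Integer>();
for (int i = 0; i < 10000; i++) {
    nums.add(i);
}
JavaRDD<Integer> intRdd = sc.parallelize(nums);
// arguments: with replacement?, number of elements to take, random seed
List<Integer> sampled = intRdd.takeSample(false, 20, 20L);
System.out.println(sampled); // 20 randomly chosen numbers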

Java development:

package com.spark;

import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;


public class RDDOperation {
	static String basedir = System.getProperty("user.dir") + File.separator + "conf" + File.separator;
	public static void main(String[] args) {
		SparkConf conf =new SparkConf().setAppName("RDDOperation").setMaster("local");
		JavaSparkContext sc = new JavaSparkContext(conf);
		mapDemo(sc);
		System.out.println("-------------");
		flatMapDemo(sc);
		System.out.println("-------------");
		filterDemo(sc);
		System.out.println("-------------");
		sampleDemo(sc);
		System.out.println("-------------");
		mapPartitionDemo(sc);
		System.out.println("-------------");
		groupByKeyDemo(sc);
		System.out.println("-------------");
		reduceByKeyDemo(sc);
		System.out.println("-------------");
			
		
		
	}
	
private static void mapDemo(JavaSparkContext sc) {
	List<String> list =Arrays.asList("hadoop hbase hdfs spark storm","java scala python");
	JavaRDD<String> stringRdd = sc.parallelize(list); // only one partition here
//	[hadoop hbase hdfs spark storm, java scala python]
//	System.out.println(stringRdd.collect());
//	the map function is handed each element of the RDD in turn, e.g. "hadoop hbase hdfs spark storm"
	JavaRDD<String[]>splitRdd =stringRdd.map(
			new Function<String,String[]>(){
				private static final long serialVersionUID = 1L;
//				the call method is invoked repeatedly, once per element
				public String[] call(String v1) {
					return v1.split(" ");
				}
	    
	});
	List<String[]> result =splitRdd.collect();
	for (int i = 0; i <result.size(); i++) {
		for(String s :result.get(i)) {
			System.out.println("Array"+i+"Data:"+ s);			
		}		
	}	
}

private static void flatMapDemo(JavaSparkContext sc) {
	List<List<String>> list =Arrays.asList(
			Arrays.asList("hadoop hbase hdfs spark storm","java scala python"),
			Arrays.asList("java scala python","beijing","zhejiang","shanghai"));
	JavaRDD<List<String>> stringRdd =sc.parallelize(list);
	JavaRDD<String> flatmapRdd = stringRdd.flatMap(
			new FlatMapFunction<List<String>, String>() {
				private static final long serialVersionUID = 1L;
				@Override
				public Iterable<String> call(List<String> s) {
//					join the strings of one input list with spaces, then split back into single words
					List<String> words = new ArrayList<String>();
					StringBuilder sb = new StringBuilder();
					for (int i = 0; i < s.size(); i++) {
						sb.append(s.get(i));
						sb.append(" ");
					}
					for (String word : sb.toString().split(" ")) {
						words.add(word);
					}
					return words;
				}
		});
	    System.out.println("source data " + stringRdd.collect());
	    System.out.println("flatmap output " + flatmapRdd.collect());
}
private static void filterDemo(JavaSparkContext sc) {
	List<String> list = Arrays.asList("hadoop hbase hdfs spark strom","java scala r");
	JavaRDD<String> stringRdd = sc.parallelize(list);
	JavaRDD<String> filterRDD = stringRdd.filter(
			new Function<String, Boolean>() 
			{
				private static final long serialVersionUID = 1L;
				public Boolean call(String s) {
//					the source RDD has two string elements; call is invoked once for each
					return s.contains("j");
//					only "java scala r" contains a "j", so only that element is kept
					
				}
						
	});
	System.out.println(filterRDD.collect());
	
}
private static void mapPartitionDemo(JavaSparkContext sc) {
	List<Integer> list = Arrays.asList(1,2,3,4,5,6);
	JavaRDD<Integer> intRdd = sc.parallelize(list,2);
//	each call receives one whole partition, as an iterator over its elements
	JavaRDD<Integer> mapPartitionRdd = intRdd.mapPartitions(
		new FlatMapFunction<Iterator<Integer>, Integer>() {
		private static final long serialVersionUID = 1L;
		public Iterable<Integer> call(Iterator<Integer> integerIterator) throws Exception {
			List<Integer> list =new ArrayList<>();
			while (integerIterator.hasNext()) {
				list.add(integerIterator.next());							
			}
			list.add(0);
			return list;
				
		}
	});
	System.out.println(mapPartitionRdd.collect());
	
}
private static void sampleDemo(JavaSparkContext sc) {
	List<Integer> list = new ArrayList<Integer>();
	for (int i = 0; i < 10000; i++) {
		list.add(i);
		
	}
	JavaRDD<Integer> intRdd= sc.parallelize(list);
//	with replacement?   sampling fraction   random seed
	JavaRDD<Integer> sampleRdd = intRdd.sample(false, 0.001,20);
	System.out.println(sampleRdd.partitions().size());
	System.out.print(sampleRdd.collect());
	
	
}
private static void groupByKeyDemo(JavaSparkContext sc) {
	List<String> list =Arrays.asList("dog","tiger","lion","cat","spider","elephent");
	JavaPairRDD<Integer, String> pairRDD = sc.parallelize(list).keyBy(new Function<String, Integer>() {
		private static final long serialVersionUID=1L;
		public Integer call(String v1) {
			return v1.length();
			
		}		
		
	});
	JavaPairRDD<Integer, Iterable<String>> groupByPairRDD = pairRDD.groupByKey();
	System.out.println(groupByPairRDD.collect());
}
private static void reduceByKeyDemo(JavaSparkContext sc) {
	List<String> list = Arrays.asList("dog","cat","owl","gnu","ant");
	JavaPairRDD<Integer, String> pairRDD = sc.parallelize(list).keyBy(new Function<String, Integer>() {
		private static final long serialVersionUID=1L;
		public Integer call(String v1) {
			return v1.length();
//			returns (3,dog), (3,cat), ...
		}
		
	});
	JavaPairRDD<Integer, String> reduceByRdd = pairRDD.reduceByKey(
//			two inputs, one output:
//			v1 from <k1,v1> and v2 from <k2,v2>  ----->  v1-v2
//			then v1-v2 and v3  ----->  v1-v2-v3
			new Function2<String, String, String>() {
		
		@Override
	
		public String call(String v1, String v2) throws Exception {
			return v1 + "-" + v2;
		}
	});
	System.out.print(reduceByRdd.collect());
}



}

Origin blog.csdn.net/surijing/article/details/104561172