HDFS to HDFS process
Note where map and flatMap sit in this process.
Definition of flatMap and map
map() applies a function to each element in the RDD, and the return values form a new RDD.
flatMap() applies a function to each element in the RDD and builds a new RDD from all the contents of the returned iterators.
Example:
val rdd = sc.parallelize(List("coffee panda","happy panda","happiest panda party"))
Input
rdd.map(x => x).collect
Result
res9: Array[String] = Array(coffee panda, happy panda, happiest panda party)
Input
rdd.flatMap(x => x.split(" ")).collect
Result
res8: Array[String] = Array(coffee, panda, happy, panda, happiest, panda, party)
flatMap means "map first, then flatten". Let's look at an example:
val rdd1 = sc.parallelize(List(1,2,3,3))
scala> rdd1.map(x=>x+1).collect
res10: Array[Int] = Array(2, 3, 4, 4)
scala> rdd1.flatMap(x=>x.to(3)).collect
res11: Array[Int] = Array(1, 2, 3, 2, 3, 3, 3)
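To make the "map first, then flatten" reading concrete, the two steps can be written out separately. This is a minimal sketch that uses a plain Scala List as a stand-in for the RDD (an assumption made so the snippet runs without Spark); the object name MapThenFlatten is hypothetical.

```scala
// Plain Scala collections stand in for an RDD here; no Spark is needed.
object MapThenFlatten {
  val nums: List[Int] = List(1, 2, 3, 3)

  // Step 1 (map): every element becomes a whole range, so the result is nested.
  val mapped: List[List[Int]] = nums.map(x => x.to(3).toList)

  // Step 2 (flatten): the nested lists are concatenated into one flat list.
  val flattened: List[Int] = mapped.flatten

  // flatMap performs both steps in a single call.
  val viaFlatMap: List[Int] = nums.flatMap(x => x.to(3))

  def main(args: Array[String]): Unit = {
    println(mapped)     // List(List(1, 2, 3), List(2, 3), List(3), List(3))
    println(flattened)  // List(1, 2, 3, 2, 3, 3, 3)
    println(viaFlatMap) // List(1, 2, 3, 2, 3, 3, 3)
  }
}
```

The flattened list has the same elements, in the same order, as res11 above.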
map(func)
Pass each element of the source data through the function func and return a new distributed dataset. (Original: Return a new distributed dataset formed by passing each element of the source through a function func.)
flatMap(func)
Similar to map(func), but each input item can become 0 or more output items (so func should return a sequence rather than a single item). (Original: Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).)
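The "0 or more output items" part is easiest to see when some inputs produce nothing at all. The sketch below again uses a plain Scala List in place of an RDD; the helper words and the object name are hypothetical, introduced only for illustration.

```scala
// Plain Scala collections stand in for an RDD here; no Spark is needed.
object ZeroOrMoreOutputs {
  val lines: List[String] = List("coffee panda", "", "happiest panda party")

  // Hypothetical helper: an empty line yields no items,
  // any other line yields one item per word.
  def words(line: String): Seq[String] =
    if (line.isEmpty) Seq.empty else line.split(" ").toSeq

  // map keeps one result per input, including the empty one.
  val viaMap: List[Seq[String]] = lines.map(words)

  // flatMap drops the empty result and merges the rest.
  val viaFlatMap: List[String] = lines.flatMap(words)

  def main(args: Array[String]): Unit = {
    println(viaMap.length)     // 3
    println(viaFlatMap.length) // 5
  }
}
```

Three inputs give three results with map, but five items with flatMap, because the empty line contributes zero items.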
flatMap and map usage notes
In use, map converts an RDD of length N into another RDD of length N, while flatMap converts each of the N elements into a collection and then merges those N collections into a single result RDD.
For example, take a data file "README.md" with three lines of content.
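As a rough illustration of the length difference, here is a sketch in which three made-up lines play the role of the file. The actual contents of README.md are not given, so the strings below are purely hypothetical, and a plain Scala List stands in for the RDD that sc.textFile would return.

```scala
// Plain Scala collections stand in for an RDD here; no Spark is needed.
object LengthComparison {
  // Hypothetical stand-in for the three lines of README.md.
  val readme: List[String] =
    List("# Apache Spark", "Spark is fast", "See the docs")

  // map: three lines in, three arrays of words out.
  val perLine: List[Array[String]] = readme.map(_.split(" "))

  // flatMap: three lines in, one flat list of all words out.
  val allWords: List[String] = readme.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(perLine.length)  // 3
    println(allWords.length) // 9
  }
}
```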
The map function and the flatMap function are two commonly used functions in Spark. map operates on each element in the collection. flatMap operates on each element in the collection and then flattens the results.
For example:
val arr = sc.parallelize(Array(("A", 1), ("B", 2), ("C", 3)))
arr.flatMap(x => (x._1 + x._2)).foreach(println)
The output is
A
1
B
2
C
3
If you use map instead:
val arr = sc.parallelize(Array(("A", 1), ("B", 2), ("C", 3)))
arr.map(x => (x._1 + x._2)).foreach(println)
The output is
A1
B2
C3
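So flatMap's flattening roughly means: apply map once, then break every resulting value into its own elements and merge all of those elements into one collection. Here each result is a string such as "A1", and a string is treated as a sequence of characters, so it is split into "A" and "1".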
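The difference between the two outputs comes from the fact that a string can be viewed as a sequence of characters. The following sketch reproduces both results on a plain Scala List (a stand-in for the RDD). It calls toList to make the string-to-characters step explicit, which is an illustrative choice rather than what the code above does.

```scala
// Plain Scala collections stand in for an RDD here; no Spark is needed.
object StringFlatMap {
  val pairs: List[(String, Int)] = List(("A", 1), ("B", 2), ("C", 3))

  // map keeps each concatenated string whole.
  val mapped: List[String] = pairs.map(x => x._1 + x._2)

  // flatMap sees each string as a sequence of Char and merges all of them.
  // toList spells out the conversion for clarity.
  val flatMapped: List[Char] = pairs.flatMap(x => (x._1 + x._2).toList)

  def main(args: Array[String]): Unit = {
    println(mapped)     // List(A1, B2, C3)
    println(flatMapped) // List(A, 1, B, 2, C, 3)
  }
}
```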
flatMap and map in a practical scenario
Here is a scenario: how do you count the number of occurrences of adjacent character pairs in a string? For example, in the string A;B;C;D;B;C, the adjacent pairs (A,B), (C,D) and (D,B) each appear once, and (B,C) appears twice. Suppose we have the following data:
A;B;C;D;B;D;C
B;D;A;E;D;C
A;B
The code for counting the occurrences of adjacent character pairs is as follows:
data.map(_.split(";")).flatMap(x => {
  // emit one (pair, 1) record for each pair of neighbouring elements
  for (i <- 0 until x.length - 1) yield (x(i) + "," + x(i + 1), 1)
}).reduceByKey(_ + _).foreach(println)
The output is
(A,E,1)
(E,D,1)
(D,A,1)
(C,D,1)
(B,C,1)
(B,D,2)
(D,C,2)
(D,B,1)
(A,B,2)
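The same counting logic can be checked locally without a cluster. The sketch below keeps the map and flatMap steps exactly as above, but replaces reduceByKey with groupBy plus a sum on a plain Scala List. That swap is an assumption made so the snippet is self-contained, and the object name AdjacentPairs is hypothetical.

```scala
// Plain Scala collections stand in for an RDD here; no Spark is needed.
object AdjacentPairs {
  val data: List[String] = List("A;B;C;D;B;D;C", "B;D;A;E;D;C", "A;B")

  // One (pair, 1) record per adjacent pair, across all lines.
  val pairs: List[(String, Int)] =
    data.map(_.split(";")).flatMap { x =>
      for (i <- 0 until x.length - 1) yield (x(i) + "," + x(i + 1), 1)
    }

  // Local stand-in for reduceByKey(_ + _): group by pair, then sum the ones.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (pair, ones) => pair -> ones.map(_._2).sum }

  def main(args: Array[String]): Unit =
    counts.foreach(println)
}
```

Because counts is a Map, the printing order is not guaranteed, just as the order of the Spark output above depends on partitioning.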
The map-style operations I remember are map (one-to-one), mapToPair (map into key-value pairs), and flatMap (one record becomes n records, n >= 0).
Difference between flatMap and map
map(func) applies func to each input and returns one object per input. flatMap(func) also applies func to each input, and each input returns a corresponding collection, but in the end all of those collections are merged into one. In terms of the number of results, map returns exactly as many objects as there were inputs, while the number returned by flatMap may differ.
In other words, the map function performs the specified operation on each input and returns one object per input, while the flatMap function combines two operations, "map first and then flatten":
Operation 1: the same as the map function: perform the specified operation on each input and return one object per input.
Operation 2: finally, merge all of the returned objects into one collection.
As the figure above shows, flatMap simply has one more step than map: the flatten operation.