The difference between map and flatMap on a Spark RDD

[Figure: an HDFS-to-HDFS processing flow, showing where map and flatMap sit in it]


Definition of flatMap and map

map() applies a function to each element in the RDD; the return values form a new RDD.

flatMap() applies a function to each element in the RDD; the function returns an iterator (a sequence) for each element, and the contents of all the returned iterators together form the new RDD.

Example:

val rdd = sc.parallelize(List("coffee panda","happy panda","happiest panda party"))

Input

rdd.map(x => x).collect

Result

res9: Array[String] = Array(coffee panda, happy panda, happiest panda party)

Input

rdd.flatMap(x => x.split(" ")).collect

Result

res8: Array[String] = Array(coffee, panda, happy, panda, happiest, panda, party)
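Because map(x => x) above is just the identity, the contrast is easier to see when both calls use the same split function. The following is a minimal sketch, assuming plain Scala Lists as a stand-in for the RDD (so no SparkContext is needed); the object name NestedVsFlat is made up for illustration.

```scala
// Plain Scala collections stand in for the RDD here; this is an assumption
// made so the sketch runs without Spark.
object NestedVsFlat {
  val lines = List("coffee panda", "happy panda", "happiest panda party")

  // map keeps one result per input line, so each result is a whole list of words.
  val nested: List[List[String]] = lines.map(line => line.split(" ").toList)

  // flatMap splices the words of every line into one flat list.
  val flat: List[String] = lines.flatMap(line => line.split(" ").toList)

  def main(args: Array[String]): Unit = {
    println(nested) // List(List(coffee, panda), List(happy, panda), List(happiest, panda, party))
    println(flat)   // List(coffee, panda, happy, panda, happiest, panda, party)
  }
}
```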

flatMap can be read as "map first, then flatten". Let's look at another example.

val rdd1 = sc.parallelize(List(1,2,3,3))

scala> rdd1.map(x => x + 1).collect

res10: Array[Int] = Array(2, 3, 4, 4)

scala> rdd1.flatMap(x => x.to(3)).collect

res11: Array[Int] = Array(1, 2, 3, 2, 3, 3, 3)
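To make the "map first, then flatten" reading concrete, the two steps can be written out separately. This is a sketch on a plain Scala List rather than an RDD (an assumption, so that it runs without Spark); the object name MapThenFlatten is hypothetical.

```scala
object MapThenFlatten {
  val nums = List(1, 2, 3, 3)

  // Step 1 (map): every number x becomes the range x to 3.
  val ranges: List[List[Int]] = nums.map(x => x.to(3).toList)

  // Step 2 (flatten): the per-element ranges are concatenated in order.
  val flattened: List[Int] = ranges.flatten

  // flatMap performs both steps in a single call.
  val direct: List[Int] = nums.flatMap(x => x.to(3))

  def main(args: Array[String]): Unit = {
    println(ranges)              // List(List(1, 2, 3), List(2, 3), List(3), List(3))
    println(flattened)           // List(1, 2, 3, 2, 3, 3, 3)
    println(flattened == direct) // true
  }
}
```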

map(func)

Passes each element of the source data through the function func and returns a new distributed dataset made of the results. (Original: Return a new distributed dataset formed by passing each element of the source through a function func.)

flatMap(func)

Similar to map(func), but each input item can become 0 or more output items (so func should return a sequence rather than a single item). (Original: Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).)
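The "0 or more output items" part is worth a small illustration: an input may contribute nothing at all to the result. The sketch below again assumes plain Scala Lists instead of an RDD, and the object name ZeroOrMore and the sample values are invented for the example.

```scala
import scala.util.Try

object ZeroOrMore {
  val raw = List("1", "oops", "3", "")

  // Each input yields zero items (parse failure) or one item (the parsed number).
  val parsed: List[Int] = raw.flatMap(s => Try(s.toInt).toOption)

  // Each input n yields n copies of itself, so 0 contributes nothing.
  val repeated: List[Int] = List(0, 1, 2).flatMap(n => List.fill(n)(n))

  def main(args: Array[String]): Unit = {
    println(parsed)   // List(1, 3)
    println(repeated) // List(1, 2, 2)
  }
}
```

With map, the same parsing function would have to return something for every input, for example an Option per element, and the failures would still occupy a slot in the result.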

flatMap and map usage instructions

map converts an RDD of N elements into another RDD of N elements. flatMap first turns each of the N elements into a collection, and then merges those N collections into a single result RDD, whose length can differ from N.

For example, take a data file "README.md" with three lines of content.
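The contents of that file are not shown in the article, so the following sketch assumes three hypothetical lines and uses a plain Scala List in place of the RDD that sc.textFile would produce. It only illustrates the length rule: map gives 3 results for 3 lines, while flatMap gives one result per word.

```scala
object LengthDemo {
  // Hypothetical file contents: the article does not show the real README.md.
  val readme = List(
    "Apache Spark",
    "Spark is a fast engine",
    "for large-scale data processing"
  )

  // map: one array of words per line, so the length stays 3.
  val perLine: List[Array[String]] = readme.map(_.split(" "))

  // flatMap: all words in one list, so the length is the total word count.
  val words: List[String] = readme.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(perLine.size) // 3
    println(words.size)   // 11 (2 + 5 + 4 words)
  }
}
```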

map and flatMap are two commonly used functions in Spark. map operates on each element in the collection. flatMap operates on each element in the collection and then flattens the results.
 
 

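For example:

val arr = sc.parallelize(Array(("A",1),("B",2),("C",3)))

arr.flatMap(x => (x._1 + x._2)).foreach(println)

The output is (one character per line, because each string is flattened into its characters; the order may vary between partitions)

A

1

B

2

C

3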

If you use map instead:

val arr = sc.parallelize(Array(("A",1),("B",2),("C",3)))

arr.map(x => (x._1 + x._2)).foreach(println)

The output is

A1

B2

C3

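So the flattening in flatMap means: apply map once, then break each result into its elements and merge all of those elements into one collection. Here each result is a string such as "A1", which is treated as a sequence of characters, so it is split into A and 1.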
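The reason a string gets split into characters is that Scala can treat a String as a sequence of Char. The sketch below spells that conversion out with an explicit .toList (the article's code relies on it happening implicitly) and uses a plain Scala List instead of an RDD; the object name StringFlatten is hypothetical.

```scala
object StringFlatten {
  val pairs = List(("A", 1), ("B", 2), ("C", 3))

  // map: one concatenated string per tuple.
  val joined: List[String] = pairs.map(x => x._1 + x._2)

  // flatMap: each string is viewed as a list of characters, then all are merged.
  val chars: List[Char] = pairs.flatMap(x => (x._1 + x._2).toList)

  def main(args: Array[String]): Unit = {
    println(joined) // List(A1, B2, C3)
    println(chars)  // List(A, 1, B, 2, C, 3)
  }
}
```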

flatMap and map actual usage scenarios

Here is a scenario: count the number of occurrences of adjacent character pairs in a string. For example, in the string A;B;C;D;B;C, the adjacent pairs (A,B), (C,D) and (D,B) each appear once, and (B,C) appears twice. Suppose we have the following data:

A;B;C;D;B;D;C

B;D;A;E;D;C

A;B

The code for counting the occurrences of adjacent character pairs is as follows

data.map(_.split(";")).flatMap(x => {

      // emit one (pair, 1) record for every adjacent pair in the line
      for (i <- 0 until x.length - 1) yield (x(i) + "," + x(i + 1), 1)

    }).reduceByKey(_ + _).foreach(println)

The output is (the order may vary)

(A,E,1)

(E,D,1)

(D,A,1)

(C,D,1)

(B,C,1)

(B,D,2)

(D,C,2)

(D,B,1)

(A,B,2)
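The same pipeline can be tried without a cluster. The following is a sketch that assumes plain Scala collections in place of the RDD and uses groupBy plus a sum as a local stand-in for reduceByKey(_ + _); it is not the Spark API itself, and the object name AdjacentPairs is made up.

```scala
object AdjacentPairs {
  val data = List("A;B;C;D;B;D;C", "B;D;A;E;D;C", "A;B")

  // Same map + flatMap steps as above: split each line, then emit one
  // ("X,Y", 1) record for every adjacent pair in that line.
  val pairs: List[(String, Int)] =
    data.map(_.split(";")).flatMap { x =>
      for (i <- 0 until x.length - 1) yield (x(i) + "," + x(i + 1), 1)
    }

  // Local stand-in for reduceByKey(_ + _): group by key and sum the ones.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (pair, ones) => pair -> ones.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Sorted only to make the printed order deterministic.
    counts.toList.sorted.foreach(println)
  }
}
```

flatMap is the right choice here because a line with k characters produces k - 1 pairs, so the number of output records differs from the number of input lines.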

The map-style operations worth remembering are map (one-to-one), mapToPair (maps each record to a key-value pair; this one belongs to the Java API), and flatMap (one record becomes n records, n >= 0).

Difference between flatMap and map

map(func) applies func to each input and returns exactly one object per input. flatMap(func) also applies func to each input, but each input yields a collection of objects, and in the end all of those collections are merged into one. In terms of the number of results, map returns exactly as many objects as there were inputs, while the number returned by flatMap can differ from the number of inputs.

 

The map function performs the specified operation on each input and returns one object per input. The flatMap function combines two operations, "map first, then flatten":

Operation 1: the same as map: perform the specified operation on each input and return one object per input.

Operation 2: finally, merge all of the returned objects into one collection.

As the figure above shows, flatMap performs one more step than map: the flatten.
