Spark Streaming Operators

Transformation: Meaning
map(func): Returns a new DStream by passing each element of the source DStream through the function func.
flatMap(func): Similar to map, but each input item can be mapped to zero or more output items.
filter(func): Returns a new DStream by selecting only the records of the source DStream on which func returns true.
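To make the difference between the three element-wise operators concrete, here is a minimal Python sketch that models one micro-batch of a DStream as a plain list. This is illustration only, not PySpark code; the sample data is made up.

```python
from itertools import chain

# One micro-batch of a DStream, modeled as a plain Python list.
batch = ["a b", "c", "a"]

# map: exactly one output element per input element.
mapped = [len(s) for s in batch]

# flatMap: zero or more output elements per input element.
flat = list(chain.from_iterable(s.split() for s in batch))

# filter: keep only elements for which the predicate returns true.
kept = [s for s in batch if len(s) > 1]
```

Note how flatMap can grow or shrink the batch (here "a b" becomes two elements), while map always preserves its size.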
repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream): Returns a new DStream that contains the union of the elements in the source DStream and otherStream.
count(): Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func): Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using the function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
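The reason the associativity and commutativity requirement matters can be sketched in plain Python (illustration only, not PySpark; the batch values are made up): Spark reduces each partition independently and then merges the partial results, which only gives the same answer as a sequential reduce if the function is associative and commutative.

```python
from functools import reduce as fold

def add(a, b):
    return a + b

batch = [4, 1, 3, 2]            # elements of one RDD in the micro-batch

# Sequential reduce over the whole batch.
total = fold(add, batch)

# Parallel-style reduce: each "partition" is reduced independently,
# then the partial results are merged. Because add is associative and
# commutative, the answer is the same either way.
part1, part2 = batch[:2], batch[2:]
merged = add(fold(add, part1), fold(add, part2))
```

A non-commutative function such as subtraction would give different results depending on how the batch was partitioned.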
countByValue(): When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
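The per-batch frequency counting that countByValue performs is equivalent, in plain Python terms, to building a counter over one micro-batch (illustration only, not PySpark; sample data is made up):

```python
from collections import Counter

# One micro-batch of a DStream of K-typed elements.
batch = ["a", "b", "a", "a"]

# Equivalent of the (K, Long) pairs countByValue would emit for this batch.
freq = Counter(batch)
```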
reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default this uses Spark's default number of parallel tasks (2 in local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
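The per-key aggregation that reduceByKey performs on each micro-batch can be sketched as a small Python function (illustration only, not PySpark; the helper name and sample pairs are made up for this sketch):

```python
def reduce_by_key(pairs, func):
    # Fold the values of each key together with func, in encounter order.
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return acc

# One micro-batch of (K, V) pairs.
batch = [("a", 1), ("b", 2), ("a", 3)]
result = reduce_by_key(batch, lambda x, y: x + y)
```

In real Spark the grouping is distributed across numTasks tasks, but the per-key result is the same as this sequential fold.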
join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.
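The difference between join and cogroup is easiest to see side by side: join emits one output pair per matching (V, W) combination and drops keys that appear on only one side, while cogroup emits exactly one entry per key, with possibly empty value lists. A minimal Python sketch of both semantics (illustration only, not PySpark; function names and sample data are made up):

```python
from collections import defaultdict

def join(left, right):
    # (K, V) x (K, W) -> (K, (V, W)) for every matching pair of values.
    rights = defaultdict(list)
    for k, w in right:
        rights[k].append(w)
    return [(k, (v, w)) for k, v in left for w in rights[k]]

def cogroup(left, right):
    # (K, V) x (K, W) -> (K, ([V], [W])) with exactly one entry per key.
    keys = {k for k, _ in left} | {k for k, _ in right}
    return {k: ([v for kk, v in left if kk == k],
                [w for kk, w in right if kk == k]) for k in keys}

a = [("x", 1), ("y", 2)]
b = [("x", 10), ("x", 11)]
joined = join(a, b)       # "y" is absent: it has no match on the right
grouped = cogroup(a, b)   # "y" is present, with an empty right-hand list
```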
transform(func): Returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to apply arbitrary RDD operations that are not exposed in the DStream API. For example, joining every batch of a data stream with another dataset is not directly exposed in the DStream API, but you can easily implement it with transform. This enables very powerful possibilities, such as real-time data cleaning by joining the input stream with precomputed spam information (itself possibly generated with Spark).
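The spam-cleaning example above can be sketched in plain Python: transform applies an arbitrary per-batch function to every micro-batch (here each batch is modeled as a list). This is illustration only, not PySpark; the spam set, URLs, and function names are made up.

```python
# Precomputed spam information (in the text's example, generated offline).
spam = {"bad.com"}

def transform(batches, func):
    # Apply an arbitrary per-RDD function to every micro-batch.
    return [func(batch) for batch in batches]

# A stream of two micro-batches of URLs.
stream = [["good.com", "bad.com"], ["bad.com"]]

# The per-batch function can do anything, e.g. filter against the spam set.
cleaned = transform(stream, lambda rdd: [u for u in rdd if u not in spam])
```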
updateStateByKey(func): Returns a new "state" DStream where the state of each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. Using it requires two steps: (1) define the state, which can be of any data type; (2) define the state update function, which specifies how to update the state using the previous state and the new values from the input stream.
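The two steps above can be sketched in plain Python, using a running count per key as the state (illustration only, not PySpark; function names and sample data are made up for this sketch):

```python
def update_state_by_key(state, batch, update):
    # state: {key: previous state}; batch: list of (key, value) pairs.
    new_values = {}
    for k, v in batch:
        new_values.setdefault(k, []).append(v)
    # Step 2 of the text: for every key seen before or in this batch, the
    # update function gets the key's new values and previous state, and
    # returns the new state.
    return {k: update(new_values.get(k, []), state.get(k))
            for k in set(state) | set(new_values)}

# Step 1 of the text: here the state is simply an int (a running count).
def running_count(values, prev_state):
    return (prev_state or 0) + len(values)

s = update_state_by_key({}, [("a", 1), ("a", 1), ("b", 1)], running_count)
s = update_state_by_key(s, [("a", 1)], running_count)
```

Note that keys absent from the current batch ("b" in the second call) are still passed through the update function with an empty value list, which is also how Spark lets you expire old state.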

Origin blog.csdn.net/CH_Axiaobai/article/details/104170264