Flink Stream Processing API (Part 2)

1. Transform

1.1 map

val streamMap = stream.map { x => x * 2 }

1.2 flatMap

flatMap function signature: def flatMap[A, B](as: List[A])(f: A ⇒ List[B]): List[B]

E.g. flatMap(List(1,2,3))(i ⇒ List(i,i))

The result is List(1,1,2,2,3,3)

and List("a b", "c d").flatMap(line ⇒ line.split(" "))

The result is List(a, b, c, d)

val streamFlatMap = stream.flatMap{
    x => x.split(" ")
} 

1.3 filter

val streamFilter = stream.filter{
    x => x == 1
}

1.4 keyBy

DataStream → KeyedStream: logically partitions a stream into disjoint partitions, each containing elements with the same key. Internally this is implemented with hash partitioning.

val streamKeyby = stream.keyBy(0)

1.5 Rolling aggregation operators (Rolling Aggregation)

These operators perform a rolling aggregation on each substream of a KeyedStream (a short sketch follows the list).

  • sum()
  • min()
  • max()
  • minBy()
  • maxBy()
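
A minimal sketch of how these rolling aggregations look, assuming a hypothetical keyed stream of (id, value) tuples named sensorStream (not part of the original example):

// assumes sensorStream: DataStream[(String, Int)] of (id, value) tuples
val keyed = sensorStream.keyBy(0)

// each incoming element emits an updated rolling aggregate for its key
val rollingSum = keyed.sum(1)   // running sum of the value field
val rollingMax = keyed.max(1)   // running max of the value field
val rollingMin = keyed.minBy(1) // whole element holding the smallest value seen so far

Note that max() only guarantees the aggregated field is correct (the other fields are taken from the first element seen), while maxBy()/minBy() return the entire element containing the extreme value.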

1.6 reduce

KeyedStream → DataStream: an aggregation operation on a keyed (grouped) data stream. It combines the current element with the last aggregated value to produce a new value. The returned stream contains the result of every aggregation step, rather than only the final aggregated result.

// WaterSensor is assumed to be a case class with fields id, ts and vc, defined elsewhere
val env: StreamExecutionEnvironment =
    StreamExecutionEnvironment.getExecutionEnvironment

val dataDS: DataStream[String] = env.readTextFile("input/data.txt")
val ds: DataStream[WaterSensor] = dataDS.map(
    s => {
        val datas = s.split(",")
        WaterSensor(datas(0), datas(1).toLong, datas(2).toDouble)
    }
).keyBy(0).reduce(
    (s1, s2) => {
        println(s"${s1.vc} <==> ${s2.vc}")
        // keep the id and timestamp of the first element, take the larger water level
        WaterSensor(s1.id, s1.ts, math.max(s1.vc, s2.vc))
    }
)

ds.print()

env.execute("sensor")

 

1.7 split & select

Split operator

DataStream → SplitStream: splits one DataStream into two or more DataStreams according to some characteristic of the elements.

val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case 1 => List("odd")
    }
)     

 

Select operator

SplitStream → DataStream: obtains one or more DataStreams from a SplitStream by selecting split labels.

val even = split select("even")
val odd = split select("odd")
val all = split.select("even","odd")       

 

1.8 Connect & CoMap

DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their (possibly different) types. After the connect, the two streams are merely placed inside one stream; internally each keeps its own data and form unchanged, and the two streams remain independent of each other.

val someStream : DataStream[Int] = ...
val otherStream : DataStream[String] = ...

val connectedStreams = someStream.connect(otherStream)

 

connectedStreams.map(
    (_ : Int) => true,
    (_ : String) => false
)
connectedStreams.flatMap(
    // flatMap functions must return a collection; both sides must share one output type
    (num : Int) => Seq(num.toString),
    (str : String) => str.split(" ").toSeq
)

 

1.9 Union

DataStream → DataStream: unions two or more DataStreams into a new DataStream containing all the elements from all of the streams.

dataStream.union(otherStream1, otherStream2, ...)

Differences between Connect and Union (contrasted in the sketch below):

  1. The two streams must have the same type before a union; with connect they may have different types, and a subsequent coMap can adjust them to a common type.
  2. connect can only operate on two streams, while union can operate on many.
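
A minimal sketch contrasting the two, assuming two example streams intStream: DataStream[Int] and strStream: DataStream[String] that are not part of the original text:

// union: all inputs must share one type; any number of streams may be combined
val moreInts: DataStream[Int] = intStream.map(_ * 10)
val unioned: DataStream[Int] = intStream.union(moreInts)

// connect: exactly two streams, possibly of different types;
// a coMap then brings both sides to a common output type
val connected = intStream.connect(strStream)
val unified: DataStream[String] = connected.map(
    (i: Int)    => i.toString,
    (s: String) => s
)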

2. Supported data types

Flink tries to infer a lot of information about the data types that are exchanged and stored during distributed computation; think of it as a database inferring the schema of a table. In most cases, Flink can transparently infer all the required type information on its own. Having this type information enables Flink to do things that would otherwise not be possible:

  • POJO types can be grouped, joined, and aggregated by referring to field names (such as dataSet.keyBy("username")). The type information lets Flink check for spelling mistakes and type compatibility early, before the job runs, instead of exposing these problems at runtime.
  • The more Flink knows about data types, the better its serialization and data layout schemes are. This is particularly important for Flink's memory usage paradigm (data may be processed in serialized form on or off the heap, and serialization is very cheap).
  • Finally, it frees users in most cases from worrying about serialization frameworks and type registration.

Usually, the data type information is needed during the pre-flight phase of the application, that is, after the operations on a DataStream or DataSet have been invoked and before any call to execute(), print(), count(), or collect().

Flink supports all common data types in Java and Scala. The most widely used types are the following.

2.0 Flink's TypeInformation class

The TypeInformation class is the base class for all type descriptors. It exposes the basic properties of a type and can generate a serializer for it, and in some special cases a comparator as well. (Note that comparators in Flink do more than define an order; they are basically the tool for handling keys.)

Internally, Flink makes the following distinctions between types:

  • Basic types: all Java primitive types and their boxed counterparts, plus void, String, Date, BigDecimal and BigInteger.
  • Primitive arrays and object arrays
  • Composite types
    • Flink Java tuples (Tuples, part of the Flink Java API): up to 25 fields, null fields not supported.
    • Scala case classes (including Scala tuples): null fields not supported.
    • Row: tuples with an arbitrary number of fields and support for null fields.
    • POJOs: classes that follow a certain bean-like pattern.
  • Auxiliary types (Option, Either, Lists, Maps, etc.)
  • Generic types: these are not serialized by Flink itself, but by Kryo.

POJOs are particularly interesting because they support the creation of complex types and the use of field names directly in key definitions: dataSet.join(another).where("name").equalTo("personName"). They are also transparent to the runtime and can be handled very efficiently by Flink.

TypeInformation supports the following types:

  1. BasicTypeInfo: any Java primitive type or String
  2. BasicArrayTypeInfo: any array of Java primitives or Strings
  3. WritableTypeInfo: any class implementing the Hadoop Writable interface
  4. TupleTypeInfo: any Flink Tuple type (Tuple1 to Tuple25 are supported). Flink tuples are fixed-length, fixed-type Java tuple implementations.
  5. CaseClassTypeInfo: any Scala case class (including Scala tuples)
  6. PojoTypeInfo: any POJO (Java or Scala) whose member variables are all either declared public or accessible via getter/setter methods
  7. GenericTypeInfo: any class that does not match one of the previous types

For the first six categories, Flink can automatically generate a corresponding TypeSerializer that serializes and deserializes the data very efficiently.
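
As a rough sketch of how this looks in the Scala API (the createTypeInformation macro from org.apache.flink.api.scala and the ExecutionConfig-based createSerializer call are assumptions about the API version in use):

import org.apache.flink.api.common.ExecutionConfig
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala._

// the macro derives the type information for the Scala tuple
val tupleInfo: TypeInformation[(String, Int)] = createTypeInformation[(String, Int)]

// a TypeSerializer for the type can then be generated from it
val serializer = tupleInfo.createSerializer(new ExecutionConfig)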

2.1 Basic data types

Flink supports all Java and Scala basic data types: Int, Double, Long, String, ...

val numbers: DataStream[Long] = env.fromElements(1L, 2L, 3L, 4L)
numbers.map( n => n + 1 )

 

2.2 Java and Scala tuples (Tuples)

val persons: DataStream[(String, Int)] = env.fromElements(
    ("Adam", 17),
    ("Sarah", 23))
persons.filter(p => p._2 > 18)

 

2.3 Scala case classes

case class Person(name: String, age: Int)
val persons: DataStream[Person] = env.fromElements(
    Person("Adam", 17),
    Person("Sarah", 23))
persons.filter(p => p.age > 18)

 

2.4 Java simple objects (POJOs)

public class Person {
    public String name;
    public int age;

    public Person() {}

    public Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

DataStream<Person> persons = env.fromElements(
    new Person("Alex", 42),
    new Person("Wendy", 23));

 

2.5 Other (Arrays, Lists, Maps, Enums, etc.)

Flink also supports some special-purpose types in Java and Scala, such as Java's ArrayList, HashMap, Enum, and so on.
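
A small hedged sketch (reusing the env from the earlier examples; a Java ArrayList used as an element type typically falls back to generic/Kryo serialization):

import java.util.{ArrayList => JArrayList}

val letters = new JArrayList[String]()
letters.add("a")
letters.add("b")

// a Java collection used directly as the element type of a stream
val listStream: DataStream[JArrayList[String]] = env.fromElements(letters)
listStream.print()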

3. UDF functions - more fine-grained control of the stream

In addition to its various operators, Flink gives developers the ability to extend the existing function interfaces, allowing finer-grained control over the data and how it is processed.

3.1 Function classes (Function Classes)

Flink exposes interfaces for all of its UDF functions (implemented as interfaces or abstract classes), for example MapFunction, FilterFunction, ProcessFunction, and so on.

Example of a custom function class implementing the MapFunction interface:

import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._

def main(args: Array[String]): Unit = {

    // TODO get data from the source
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val list = List(
        WaterSensor("sensor_1", 150000L, 25),
        WaterSensor("sensor_1", 150001L, 27),
        WaterSensor("sensor_1", 150005L, 30),
        WaterSensor("sensor_1", 150007L, 40)
    )

    val waterSensorDS: DataStream[WaterSensor] = env.fromCollection(list)

    // UDF function: a custom function to process the data
    // waterSensorDS.map(ws => (ws.id, ws.vc))
    // a function class can be used instead of the anonymous function
    val mapFunctionDS: DataStream[(String, Int)] = waterSensorDS.map(new MyMapFunction)

    mapFunctionDS.print("mapfun>>>")
    env.execute()

}

// custom UDF function class implementing the map transformation
// 1. extend MapFunction
// 2. override the map method
class MyMapFunction extends MapFunction[WaterSensor, (String, Int)] {
    override def map(ws: WaterSensor): (String, Int) = {
        (ws.id, ws.vc)
    }
}

 

3.2 Anonymous functions (Lambda Functions)

val tweets: DataStream[String] = ...
val flinkTweets = tweets.filter(_.contains("flink"))

 

3.3 Rich functions (Rich Functions)

"Rich function" is a function of the interface class DataStream API provides all functions Flink class has its own version of Rich. It is different from the conventional functions that you can get the context of the operating environment, and has some of the life-cycle approach, it is possible to achieve more complex functions. Also it means providing more and more feature-rich

  • RichMapFunction
  • RichFlatMapFunction
  • RichFilterFunction
  • ...

Rich functions have the notion of a lifecycle. Typical lifecycle methods are:

  • open() is the initialization method of the rich function; it is called before the operator (e.g. map or filter) is invoked.
  • close() is the last lifecycle method to be called; it does cleanup work.
  • getRuntimeContext() provides the function's RuntimeContext, with information such as the parallelism of the function, the name of the task, the subtask index, and access to state.
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

def main(args: Array[String]): Unit = {

    // TODO get data from the source
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val list = List(
        WaterSensor("sensor_1", 150000L, 25),
        WaterSensor("sensor_1", 150001L, 27),
        WaterSensor("sensor_1", 150005L, 30),
        WaterSensor("sensor_1", 150007L, 40)
    )

    val waterSensorDS: DataStream[WaterSensor] = env.fromCollection(list)

    // UDF function: a custom function to process the data
    // waterSensorDS.map(ws => (ws.id, ws.vc))
    // a rich function class can be used instead of the anonymous function
    val mapFunctionDS: DataStream[(String, Int)] = waterSensorDS.map(new MyMapRichFunction)
    mapFunctionDS.print("mapfun>>>")
    env.execute()

}

// custom UDF rich function class implementing the map transformation
// 1. extend RichMapFunction
// 2. override the methods
class MyMapRichFunction extends RichMapFunction[WaterSensor, (String, Int)] {
    
    override def open(parameters: Configuration): Unit = super.open(parameters)

    override def map(ws: WaterSensor): (String, Int) = {
        //getRuntimeContext.
        (ws.id, getRuntimeContext.getIndexOfThisSubtask)
    }

    override def close(): Unit = super.close()
}

 
