Official documentation for the operators:
- https://ci.apache.org/projects/flink/flink-docs-master/dev/batch/dataset_transformations.html
Transformation operators
Common transformation operators (a combined sketch follows this list)
- Map: takes one element and returns one element; cleansing and conversion logic can run in between
- FlatMap: takes one element and returns zero, one, or more elements
- MapPartition: like Map, but processes one whole partition at a time [if the processing needs a third-party resource connection, prefer MapPartition so the connection is opened once per partition rather than once per element]
- Filter: evaluates a predicate on each incoming element and keeps the elements that satisfy it
- Reduce: aggregates the data by combining the current element with the value returned by the previous reduce call, producing a new value
- Aggregate: built-in aggregations such as sum, max, and min
- Distinct: returns the dataset with duplicate elements removed, e.g. data.distinct()
- Join: inner join
- OuterJoin: outer join
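To make these concrete, here is a minimal word-count-style sketch (the object name and sample elements are made up for illustration) that chains FlatMap, Filter, Map, and Reduce:

import org.apache.flink.api.scala._

object TransformationDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val lines: DataSet[String] = env.fromElements("hello world", "hello flink")
    lines
      .flatMap(_.split(" "))                 // FlatMap: one line -> many words
      .filter(_.nonEmpty)                    // Filter: keep non-empty words
      .map(word => (word, 1))                // Map: word -> (word, 1)
      .groupBy(0)                            // group by the word
      .reduce((a, b) => (a._1, a._2 + b._2)) // Reduce: combine counts pairwise
      .print()                               // prints each (word, count) pair
  }
}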
Example 1: saving data to a database with mapPartition
- Step 1: add the MySQL JDBC driver dependency
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>
- Step 2: create the MySQL database and table
CREATE TABLE `user` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;
- Step 3: code
import java.sql.{DriverManager, PreparedStatement}
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

object MapPartition2MySql {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._
    val sourceDataset: DataSet[String] = environment.fromElements("1 zhangsan", "2 lisi", "3 wangwu")
    sourceDataset.mapPartition(part => {
      // open one connection and one prepared statement per partition, not per element
      Class.forName("com.mysql.jdbc.Driver")
      val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/flink_db", "flink", "123456")
      val statement: PreparedStatement = conn.prepareStatement("insert into user (id, name) values (?, ?)")
      // force evaluation with toList so the connection is not closed before the inserts run
      val results = part.map(x => {
        val fields = x.split(" ")
        statement.setInt(1, fields(0).toInt)
        statement.setString(2, fields(1))
        statement.execute()
      }).toList
      conn.close()
      results
    }).print()
    // print() triggers execution, so no explicit environment.execute() is needed
  }
}
Example 2: join operations
import org.apache.flink.api.scala.ExecutionEnvironment
import scala.collection.mutable.ListBuffer

object BatchDemoOuterJoinScala {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    import org.apache.flink.api.scala._

    val data1 = ListBuffer[Tuple2[Int, String]]()
    data1.append((1, "zs"))
    data1.append((2, "ls"))
    data1.append((3, "ww"))

    val data2 = ListBuffer[Tuple2[Int, String]]()
    data2.append((1, "beijing"))
    data2.append((2, "shanghai"))
    data2.append((4, "guangzhou"))

    val text1 = env.fromCollection(data1)
    val text2 = env.fromCollection(data2)

    // left outer join: second is null for left rows without a match
    text1.leftOuterJoin(text2).where(0).equalTo(0).apply((first, second) => {
      if (second == null) {
        (first._1, first._2, "null")
      } else {
        (first._1, first._2, second._2)
      }
    }).print()
    println("===============================")

    // right outer join: first is null for right rows without a match
    text1.rightOuterJoin(text2).where(0).equalTo(0).apply((first, second) => {
      if (first == null) {
        (second._1, "null", second._2)
      } else {
        (first._1, first._2, second._2)
      }
    }).print()
    println("===============================")

    // full outer join: either side may be null
    text1.fullOuterJoin(text2).where(0).equalTo(0).apply((first, second) => {
      if (first == null) {
        (second._1, "null", second._2)
      } else if (second == null) {
        (first._1, first._2, "null")
      } else {
        (first._1, first._2, second._2)
      }
    }).print()
  }
}
Partition operators
- Common partition operators
- Rebalance: repartitions the dataset, redistributing the records evenly to eliminate data skew
- Hash-Partition: partitions the dataset by the hash of the given key (a hash/range sketch follows the rebalance example below)
- partitionByHash()
- Range-Partition: range-partitions the dataset by the given key
- partitionByRange()
- Custom Partitioning: a user-defined partitioning rule; implement the Partitioner interface and call partitionCustom(partitioner, "someKey") or partitionCustom(partitioner, 0)
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

object FlinkPartition {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    environment.setParallelism(2)
    import org.apache.flink.api.scala._
    val sourceDataSet: DataSet[String] = environment.fromElements("hello world", "spark flink", "hive sqoop")
    // the filter can leave partitions unevenly loaded; rebalance redistributes the records evenly
    val filterSet: DataSet[String] = sourceDataSet
      .filter(x => x.contains("hello"))
      .rebalance()
    filterSet.print()
    // print() triggers execution, so no explicit environment.execute() is needed
  }
}
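The hash and range variants are used the same way; below is a minimal sketch (the object name and sample tuples are made up for illustration) that partitions a tuple dataset by field 0 and prints each partition's size:

import org.apache.flink.api.scala._

object FlinkHashRangePartition {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(2)
    val data: DataSet[(Int, String)] = env.fromElements((1, "a"), (2, "b"), (1, "c"), (3, "d"))
    // hash partitioning: all records with the same key end up in the same partition
    data.partitionByHash(0).mapPartition(part => Iterator(part.size)).print()
    // range partitioning: records are split into contiguous key ranges
    data.partitionByRange(0).mapPartition(part => Iterator(part.size)).print()
  }
}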
- Example: implementing data partitioning with a custom partitioner
- Step 1: define a custom partitioner class in Scala
import org.apache.flink.api.common.functions.Partitioner

class MyPartitioner extends Partitioner[String] {
  override def partition(word: String, num: Int): Int = {
    println("number of partitions: " + num)
    if (word.contains("hello")) {
      println("routed to partition 0")
      0
    } else {
      println("routed to partition 1")
      1
    }
  }
}
- Step 2: driver code
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}

object FlinkCustomerPartition {
  def main(args: Array[String]): Unit = {
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    // set the parallelism; if unset, the number of CPU cores is used by default
    environment.setParallelism(2)
    import org.apache.flink.api.scala._
    // build the source dataset
    val sourceDataSet: DataSet[String] = environment.fromElements("hello world", "spark flink", "hello world", "hive hadoop")
    // route every element through MyPartitioner, using the element itself as the key
    val result: DataSet[String] = sourceDataSet.partitionCustom(new MyPartitioner, x => x + "")
    val value: DataSet[String] = result.map(x => {
      println("key: " + x + ", thread: " + Thread.currentThread().getId)
      x
    })
    value.print()
    // print() triggers execution, so no explicit environment.execute() is needed
  }
}
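With parallelism 2, the two elements containing "hello" are routed to partition 0 and the other two to partition 1; the thread ids printed in the map show which subtask received each element.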
Sink operators
- Common sink operators (a sketch of the file sinks follows this list)
- writeAsText() / TextOutputFormat: writes elements line by line as strings, obtained by calling each element's toString() method
- writeAsFormattedText() / TextOutputFormat: writes elements line by line as strings, obtained by calling a user-defined format() method for each element
- writeAsCsv(…) / CsvOutputFormat: writes tuples as comma-separated-value files; row and field delimiters are configurable, and each field value comes from the object's toString() method
- print() / printToErr() / print(String msg) / printToErr(String msg) (note: never use these in production; sample the output or write to logs instead)
- write() / FileOutputFormat: method and base class for custom file output; supports custom object-to-bytes conversion
- output() / OutputFormat: the most generic output method, for data sinks that are not file based (such as storing the result in a database)
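A minimal sketch of the file sinks above (the object name and output paths are placeholders):

import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

object FlinkSinkDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data: DataSet[(Int, String)] = env.fromElements((1, "zs"), (2, "ls"))
    // writeAsText: each output line is the tuple's toString()
    data.writeAsText("file:///tmp/flink/text", WriteMode.OVERWRITE)
    // writeAsCsv: row and field delimiters are configurable
    data.writeAsCsv("file:///tmp/flink/csv", "\n", ",", WriteMode.OVERWRITE)
    // unlike print(), the write sinks are lazy, so an explicit execute() is required
    env.execute("sink demo")
  }
}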