1. Standalone cluster installation
Download:
Download the release directly from the official website,
or copy the link address from the site and fetch it with: wget <link>
2. Upload the archive to the Linux machine
Use the rz command
3. Extract the archive
tar -zxvf flink-1.5.0-bin-hadoop24-scala_2.11.tgz
4. Edit the configuration file
vim conf/flink-conf.yaml
jobmanager.rpc.address: hadoop01
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
taskmanager.tmp.dirs: /export/servers/flink-1.5.0/tmp
Configuration parameters explained:
jobmanager.rpc.address: localhost  The externally visible address of the JobManager, the master/coordinator of the distributed system (DEFAULT: localhost). Set this to the IP address of your master node.
jobmanager.rpc.port: 6123  The port of the JobManager (DEFAULT: 6123)
jobmanager.heap.mb: 1024  The default JVM heap size of the JobManager, in megabytes
taskmanager.heap.mb: 1024  The JVM heap size used by the TaskManagers, in megabytes
taskmanager.numberOfTaskSlots: 1  The number of CPUs usable per machine (default: 1)
taskmanager.memory.preallocate: false  Whether to preallocate managed memory. The default is false, so an idle Flink cluster does not hold on to cluster resources.
parallelism.default: 1  The default parallelism of programs
jobmanager.web.port: 8081  The port of the JobManager's web interface (default: 8081)
taskmanager.tmp.dirs: The directory for temporary files
5. Start the Flink cluster
Option 1:
Add a JobManager:
bin/jobmanager.sh ((start|start-foreground) cluster)|stop|stop-all
Add a TaskManager:
bin/taskmanager.sh start|start-foreground|stop|stop-all
Option 2:
bin/start-cluster.sh
bin/stop-cluster.sh
6. Run a test job
bin/flink run /export/servers/flink-1.5.0/examples/batch/WordCount.jar --input /export/servers/zookeeper.out --output /export/servers/flink_data
7. Cluster HA (high availability)
For an enterprise-grade application, stability is the first concern and performance comes second, so an HA mechanism is indispensable.
As with first-generation Hadoop, the architecture makes it obvious that the JobManager is a single point of failure (SPOF). The JobManager is responsible for job scheduling and resource allocation, so the consequences of losing it are easy to imagine. Flink's approach to JobManager HA is, in principle, essentially the same as Hadoop's.
In Standalone mode, Flink must rely on ZooKeeper to implement JobManager HA (ZooKeeper has become an indispensable building block for HA in most open-source frameworks). With ZooKeeper's help, a Standalone Flink cluster runs several live JobManagers at once, only one of which is active while the others stand by. When the active JobManager is lost (e.g. it crashes or the machine goes down), ZooKeeper elects a new JobManager from the standbys to take over the Flink cluster.
(Besides Standalone, Flink also has a YARN cluster mode; in that mode, HA is set up on the Hadoop side.)
Configuration changes:
7.1 Edit conf/flink-conf.yaml
vim conf/flink-conf.yaml
jobmanager.rpc.address: hadoop01  [Note: for HA, set this per machine]
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
taskmanager.tmp.dirs: /export/servers/flink-1.5.0/tmp
# Enable HA
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs://hadoop01:9000/flink-checkpoints
high-availability: zookeeper
high-availability.storageDir: hdfs://hadoop01:9000/flink/ha/
high-availability.zookeeper.quorum: hadoop01:2181,hadoop02:2181,hadoop03:2181
high-availability.zookeeper.client.acl: open
HA parameters explained:
state.backend  Enables checkpointing and selects the state backend. Two backends are supported here:
jobmanager: state is kept in memory and backed up to the JobManager's memory. Use this only for minimal state (e.g. Kafka offsets) or for testing and local debugging.
filesystem: state lives in the TaskManagers' memory, and state snapshots are stored in a file system. All file systems supported by Flink work, e.g. HDFS, S3, ...
state.backend.fs.checkpointdir: the directory in a Flink-supported file system where checkpoints are stored. Note: the state backend must be accessible from the JobManager; file:// is only for local setups.
high-availability: zookeeper  Selects the high-availability mode used for cluster execution
high-availability.storageDir
The directory where JobManager metadata is stored; this is the durable state backend, and only a pointer to this state is kept in ZooKeeper. Exactly like the checkpoint directory, it must be accessible from the JobManager.
high-availability.zookeeper.quorum  The ZooKeeper quorum addresses
7.2 Edit conf/zoo.cfg (only needed when no ZooKeeper installation exists yet)
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
# The directory where the snapshot is stored.
# dataDir=/tmp/zookeeper
# The port at which the clients will connect
clientPort=2181
# ZooKeeper quorum peers
server.1=hadoop01:2888:3888
server.2=hadoop02:2888:3888
server.3=hadoop03:2888:3888
7.3 Edit conf/masters
hadoop01:8081
hadoop02:8082
7.4 Edit conf/slaves
hadoop01
hadoop02
hadoop03
7.5 Start Flink in HA mode
Start ZooKeeper:
bin/zkServer.sh start  (make sure every ZooKeeper instance starts successfully)
Start HDFS (checkpoints and metadata are stored in HDFS):
start-dfs.sh
Start Flink:
bin/start-cluster.sh
Once everything is running, test the HA setup.
Simulate a sudden crash:
Kill the JobManager on hadoop01, then open the web UI of the other node: hadoop02:8082
If the failover succeeds, the HA setup is complete.
Option 2: Running Flink on YARN
In an enterprise, to make the most of cluster resources, several types of workload usually run on the same cluster, so Flink also supports running on YARN. Prerequisite for Flink on YARN: both HDFS and YARN are running.
1. Adjust the Hadoop configuration:
vim etc/hadoop/yarn-site.xml
Add:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
This property controls whether a thread checks the amount of virtual memory each task uses and kills any task that exceeds its allocation; the default is true.
We disable it here because Flink on YARN easily exceeds the virtual-memory limit, in which case YARN would kill the job automatically.
2. Edit the global environment file /etc/profile:
Add:
export HADOOP_CONF_DIR=/export/servers/hadoop/etc/hadoop
Either YARN_CONF_DIR or HADOOP_CONF_DIR must be set so that the YARN and HDFS configuration can be read.
3. Submitting jobs with Flink on YARN:
There are two main ways to start Flink on YARN:
(1) Start a YARN session (a long-running Flink cluster on YARN);
(2) Run a Flink job on YARN directly.
Option 1: YARN session
This mode starts a YARN session with the two essential Flink services, the JobManager and the TaskManagers, after which you can submit jobs to the cluster. A single session can accept multiple Flink jobs. Note that this mode requires at least Hadoop 2.2 and an HDFS installation (starting a YARN session uploads the relevant jar and configuration files to HDFS).
Start a YARN session with the ./bin/yarn-session.sh script.
The script accepts the following parameters:
Usage:
Required
-n,--container <arg> Number of YARN container to allocate (=Number of Task Managers)
Optional
-D <arg> Dynamic properties
-d,--detached Start detached
-id,--applicationId <arg> Attach to running YARN session
-j,--jar <arg> Path to Flink jar file
-jm,--jobManagerMemory <arg> Memory for JobManager Container [in MB]
-nm,--name <arg> Set a custom name for the application on YARN
-q,--query Display available YARN resources (memory, cores)
-qu,--queue <arg> Specify YARN queue.
-s,--slots <arg> Number of slots per TaskManager
-st,--streaming Start Flink in streaming mode
Note:
If you do not want the Flink YARN client to keep running, you can also start a detached YARN session with the -d or --detached parameter.
In that case, the Flink YARN client only submits Flink to the cluster and then shuts itself down.
Start:
bin/yarn-session.sh -n 2 -tm 800 -s 2
This command asks YARN for 3 containers in total (even though only two are requested, the ApplicationMaster/JobManager takes one extra container): 2 containers run TaskManagers (-n 2), each TaskManager gets two task slots (-s 2) and 800 MB of container memory (-tm 800), and one container runs the ApplicationMaster (JobManager). Once Flink is deployed in the YARN cluster, it prints the JobManager's connection details.
After a successful start, the console shows the session details.
On the YARN web UI at ip:8088 you can see the submitted Flink session.
Click ApplicationMaster to open the job page.
That page is the job UI of a session submitted with yarn-session.sh.
Then submit a job with Flink:
bin/flink run examples/batch/WordCount.jar
The result computed by WordCount.jar appears in the console.
The submitted job can also be observed on the yarn-session.sh job page.
Click a job to see its details.
Stopping the current session:
1: CTRL+C
2: the stop command
3: yarn application -kill application_1527077715040_0007
Detached YARN session
If you do not want the Flink YARN client to keep running, you can start a detached YARN session with the -d or --detached parameter.
In that case, the Flink YARN client only submits Flink to the cluster and then shuts itself down. Note that the YARN session can then no longer be stopped through Flink.
Stop the YARN session with the YARN utility instead: yarn application -kill <appId>
Start a detached YARN session:
bin/yarn-session.sh -n 2 -tm 800 -s 2 -d
Stop it:
yarn application -kill application_1527077715040_0007
Option 2: Run a single Flink job on YARN
The YARN session above starts a Flink cluster inside Hadoop YARN whose resources are shared by all Flink jobs. Alternatively, you can launch a single Flink job on YARN, still using ./bin/flink, but without starting a YARN session first:
bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar
In this mode the session parameters get a y prefix; -yn sets the number of TaskManagers.
Watch the job on the 8088 page.
Stopping the yarn-cluster job:
yarn application -kill <application ID>
Note:
The cluster's configuration parameters are fixed when the cluster is created, but business needs often require different settings per job. Rather than editing conf/flink-conf.yaml for a single run, you can use:
-D <arg>  Dynamic properties
to override the configured values, for example:
-Dfs.overwrite-files=true -Dtaskmanager.network.numberOfBuffers=16368
- Flink application development
Like Spark, Flink is a one-stop processing framework: it supports batch processing (DataSet) as well as real-time processing (DataStream).
- Import the dependencies via Maven:
<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.11.2</scala.version>
<scala.compat.version>2.11</scala.compat.version>
<hadoop.version>2.6.2</hadoop.version>
<flink.version>1.5.0</flink.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.22</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka-0.9_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
</dependencies>
DataSet development
Development workflow:
1. Obtain an execution environment
2. Load/create the initial data
3. Specify the transformations on that data
4. Specify where to put the computation results
5. Trigger program execution
Example:
object DataSet_WordCount {
  def main(args: Array[String]) {
    //TODO obtain the execution environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    //TODO load/create the initial data
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")
    //TODO specify the transformations on the data
    val split_words = text.flatMap(line => line.toLowerCase().split("\\W+"))
    val filter_words = split_words.filter(x => x.nonEmpty)
    val map_words = filter_words.map(x => (x, 1))
    val groupBy_words = map_words.groupBy(0)
    val sum_words = groupBy_words.sum(1)
    //TODO specify where to put the results
    // sum_words.setParallelism(1) // collapse the result into a single file
    sum_words.writeAsText(args(0)) // e.g. "/Users/niutao/Desktop/flink.txt"
    //TODO trigger program execution
    env.execute("DataSet wordCount")
  }
}
Package the program and submit it to YARN.
Add the Maven packaging plugins:
<build>
  <sourceDirectory>src/main/java</sourceDirectory>
  <testSourceDirectory>src/test/scala</testSourceDirectory>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.5.1</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
        <!--<encoding>${project.build.sourceEncoding}</encoding>-->
      </configuration>
    </plugin>
    <plugin>
      <groupId>net.alchim31.maven</groupId>
      <artifactId>scala-maven-plugin</artifactId>
      <version>3.2.0</version>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
            <goal>testCompile</goal>
          </goals>
          <configuration>
            <args>
              <!--<arg>-make:transitive</arg>-->
              <arg>-dependencyfile</arg>
              <arg>${project.build.directory}/.scala_dependencies</arg>
            </args>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>2.18.1</version>
      <configuration>
        <useFile>false</useFile>
        <disableXmlReport>true</disableXmlReport>
        <includes>
          <include>**/*Test.*</include>
          <include>**/*Suite.*</include>
        </includes>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <!-- zip -d learn_spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF -->
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
            <transformers>
              <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>com.itcast.DEMO.WordCount</mainClass>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
- After a successful build, the jar is generated in the target folder.
- Upload the jar with the rz command, then run the program:
bin/flink run -m yarn-cluster -yn 2 /home/elasticsearch/flinkjar/itcast_learn_flink-1.0-SNAPSHOT.jar com.itcast.DEMO.WordCount
- The submitted program can be observed on YARN's 8088 page.
- The output of the run can be found under /export/servers/flink-1.3.2/flinkJAR.
- DataSet Transformations

Map
Takes one element and produces one element.
data.map { x => x.toInt }

FlatMap
Takes one element and produces zero, one, or more elements.
data.flatMap { str => str.split(" ") }

MapPartition
Transforms a parallel partition in a single function call. The function gets the partition as an `Iterator` and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree-of-parallelism and previous operations.
data.mapPartition { in => in map { (_, 1) } }

Filter
Evaluates a boolean function for each element and retains those for which the function returns true.
IMPORTANT: The system assumes that the function does not modify the element on which the predicate is applied. Violating this assumption can lead to incorrect results.
data.filter { _ > 1000 }

Reduce
Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set, or on a grouped data set.
data.reduce { _ + _ }

ReduceGroup
Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set, or on a grouped data set.
data.reduceGroup { elements => elements.sum }

Aggregate
Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set, or on a grouped data set.
val input: DataSet[(Int, String, Double)] = // [...]
val output: DataSet[(Int, String, Double)] = input.aggregate(SUM, 0).and(MIN, 2)
You can also use short-hand syntax for minimum, maximum, and sum aggregations:
val input: DataSet[(Int, String, Double)] = // [...]
val output: DataSet[(Int, String, Double)] = input.sum(0).min(2)

Distinct
Returns the distinct elements of a data set. It removes the duplicate entries from the input DataSet, with respect to all fields of the elements, or a subset of fields.
data.distinct()

Join
Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.
// In this case tuple fields are used as keys. "0" is the join field on the first tuple
// "1" is the join field on the second tuple.
val result = input1.join(input2).where(0).equalTo(1)
You can specify the way that the runtime executes the join via Join Hints. The hints describe whether the join happens through partitioning or broadcasting, and whether it uses a sort-based or a hash-based algorithm. Please refer to the Transformations Guide for a list of possible hints and an example. If no hint is specified, the system will try to make an estimate of the input sizes and pick the best strategy according to those estimates.
// This executes a join by broadcasting the first data set
// using a hash table for the broadcasted data
val result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST).where(0).equalTo(1)
Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.

OuterJoin
Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the "outer" side (left, right, or both in case of full) are preserved if no matching key is found in the other side. Matching pairs of elements (or one element and a `null` value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.
val joined = left.leftOuterJoin(right).where(0).equalTo(1)

CoGroup
The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups. See the keys section to learn how to define coGroup keys.
data1.coGroup(data2).where(0).equalTo(1)

Cross
Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element.
val data1: DataSet[Int] = // [...]
val data2: DataSet[String] = // [...]
val result: DataSet[(Int, String)] = data1.cross(data2)
Note: Cross is potentially a very compute-intensive operation which can challenge even large compute clusters! It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().

Union
Produces the union of two data sets.
data.union(data2)

Rebalance
Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.
val data: DataSet[String] = // [...]
val result: DataSet[(String, String)] = data.rebalance().map { ... }

Hash-Partition
Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
val in: DataSet[(Int, String)] = // [...]
val result = in.partitionByHash(0).mapPartition { ... }

Range-Partition
Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
val in: DataSet[(Int, String)] = // [...]
val result = in.partitionByRange(0).mapPartition { ... }

Custom Partitioning
Manually specify a partitioning over the data.
Note: This method works only on single field keys.
val in: DataSet[(Int, String)] = // [...]
val result = in.partitionCustom(partitioner, key).mapPartition { ... }

Sort Partition
Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.
val in: DataSet[(Int, String)] = // [...]
val result = in.sortPartition(1, Order.ASCENDING).mapPartition { ... }

First-n
Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions, tuple positions or case class fields.
val in: DataSet[(Int, String)] = // [...]
// regular data set
val result1 = in.first(3)
// grouped data set
val result2 = in.groupBy(0).first(3)
// grouped-sorted data set
val result3 = in.groupBy(0).sortGroup(1, Order.ASCENDING).first(3)
1: The map function
2: The flatMap function
//obtain the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//load the data
val data = env.fromElements(("A", 1), ("B", 1), ("C", 1))
//apply transformations to the data
//TODO map
val map_result = data.map(line => line._1 + line._2)
map_result.print()
//TODO flatmap
val flatmap_result = data.flatMap(line => line._1 + line._2)
flatmap_result.print()
Exercise: given the data
A;B;C;D;B;D;C
B;D;A;E;D;C
A;B
Count the occurrences of each adjacent pair of strings.
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala._

/**
  * Created by angel
  */
object demo {
  /**
    * A;B;C;D;B;D;C
    * B;D;A;E;D;C
    * A;B
    * Count adjacent string pairs: (A+B, 2), (B+C, 1), ...
    */
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("A;B;C;D;B;D;C;B;D;A;E;D;C;A;B")
    val map_data: DataSet[Array[String]] = data.map(line => line.split(";"))
    // [A,B,C,D] ---> (pair, 1) --> groupBy --> sum --> total
    val tupe_data = map_data.flatMap { line =>
      for (index <- 0 until line.length - 1) yield (line(index) + "+" + line(index + 1), 1)
    }
    val gropudata = tupe_data.groupBy(0)
    val result = gropudata.sum(1)
    result.print()
  }
}
3: The mapPartition function
//TODO mapPartition
val ele_partition = elements.setParallelism(2) // set the parallelism to 2
// line is the data of one partition
val partition = ele_partition.mapPartition(line => line.map(x => x + "======"))
partition.print()
mapPartition processes the data one whole partition at a time.
The benefit shows when the processed data has to be stored in MySQL afterwards: with mapPartition you open one connection per partition, whereas with map you would open one MySQL connection per record.
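The difference can be sketched in plain Scala (no Flink dependency; FakeConnection, the record counts, and the partition layout are hypothetical stand-ins for a real MySQL connection and a real DataSet):

```scala
// Plain-Scala sketch of why mapPartition saves connections.
object PartitionSketch {
  var connectionsOpened = 0

  // Hypothetical stand-in for a MySQL connection.
  class FakeConnection {
    connectionsOpened += 1               // count every connection we open
    def insert(row: String): Unit = ()   // pretend to write a row
    def close(): Unit = ()
  }

  // map-style: one connection per record
  def saveWithMap(records: Seq[String]): Int = {
    connectionsOpened = 0
    records.foreach { r =>
      val conn = new FakeConnection
      conn.insert(r)
      conn.close()
    }
    connectionsOpened
  }

  // mapPartition-style: one connection per partition
  def saveWithMapPartition(partitions: Seq[Seq[String]]): Int = {
    connectionsOpened = 0
    partitions.foreach { partition =>
      val conn = new FakeConnection
      partition.foreach(conn.insert)
      conn.close()
    }
    connectionsOpened
  }
}
```

With 1000 records split over 2 partitions, saveWithMap opens 1000 connections while saveWithMapPartition opens only 2.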
4: The filter function
Filter is especially useful in production: in the data-preparation stage it can drop most records that do not match the business rules, which greatly reduces the load on the rest of the Flink pipeline.
//TODO filter
val filter: DataSet[String] = elements.filter(line => line.contains("java")) // keep only records containing "java"
filter.print()
5: The reduce function
//TODO reduce
val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 1), ("scala", 1), ("java", 1)))
val tuple_map = elements.flatMap(x => x) // flatten the inner list into tuples
val group_map = tuple_map.groupBy(x => x._1) // group by word
val reduce = group_map.reduce((x, y) => (x._1, x._2 + y._2))
reduce.print()
reduceGroup
reduceGroup is an optimized variant of the plain reduce function:
it first reduces within each group locally and only then performs the overall reduce; the benefit is reduced network IO.
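The effect can be sketched in plain Scala (no Flink dependency; the two "partitions" and all names below are hypothetical): local pre-aggregation shrinks what has to cross the network before the final merge.

```scala
// Plain-Scala sketch of reduceGroup's local-combine-then-global-reduce idea.
object CombineSketch {
  type WordCount = (String, Int)

  // Combine locally on each "node" first: one record per distinct word.
  def localCombine(partition: Seq[WordCount]): Seq[WordCount] =
    partition.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }.toSeq

  // Merge the (much smaller) combined results globally.
  def globalReduce(combined: Seq[Seq[WordCount]]): Map[String, Int] =
    combined.flatten.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // How many records would have to be shuffled after local combining.
  def shuffledRecords(partitions: Seq[Seq[WordCount]]): Int =
    partitions.map(localCombine(_).size).sum
}
```

For example, the 5 records Seq(Seq(("java",1),("java",1),("scala",1)), Seq(("java",1),("scala",1))) shrink to 4 shuffled records after local combining, and the global reduce still yields java -> 3, scala -> 2.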
//TODO reduceGroup
val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 1), ("java", 1), ("scala", 1)))
val tuple_words = elements.flatMap(x => x)
val group_words = tuple_words.groupBy(x => x._1)
val a = group_words.reduceGroup {
  (in: Iterator[(String, Int)], out: Collector[(String, Int)]) =>
    val result = in.reduce((x, y) => (x._1, x._2 + y._2))
    out.collect(result)
}
a.print()
- GroupReduceFunction and GroupCombineFunction (user-defined functions)
import collection.JavaConverters._

class Tuple3GroupReduceWithCombine extends GroupReduceFunction[(String, Int), (String, Int)]
  with GroupCombineFunction[(String, Int), (String, Int)] {

  override def reduce(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    for (in <- values.asScala) {
      out.collect((in._1, in._2))
    }
  }

  override def combine(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    var num = 0
    var s = ""
    for (in <- values.asScala) {
      num += in._2
      s = in._1
    }
    out.collect((s, num))
  }
}

// TODO GroupReduceFunction / GroupCombineFunction
val env = ExecutionEnvironment.getExecutionEnvironment
val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 3), ("java", 1), ("scala", 1)))
val collection = elements.flatMap(line => line)
val groupDatas: GroupedDataSet[(String, Int)] = collection.groupBy(line => line._1)
// use the custom reduce and combine functions via reduceGroup
val result = groupDatas.reduceGroup(new Tuple3GroupReduceWithCombine())
val result_sort = result.collect().sortBy(x => x._1)
println(result_sort)
combineGroup
The group operations used so far, such as reduceGroup (or a GroupReduceFunction), can easily cause out-of-memory errors: all the data must be transformed in one step, which requires enough memory to hold it. When memory is tight, use combineGroup instead. combineGroup applies a GroupCombineFunction to a grouped data set; a GroupCombineFunction is similar to a GroupReduceFunction, but does not perform a full data exchange.
[Note]: combineGroup may produce partial results rather than a complete result.
import collection.JavaConverters._

class MycombineGroup extends GroupCombineFunction[Tuple1[String], (String, Int)] {
  override def combine(iterable: Iterable[Tuple1[String]], out: Collector[(String, Int)]): Unit = {
    var key: String = null
    var count = 0
    for (line <- iterable.asScala) {
      key = line._1
      count += 1
    }
    out.collect((key, count))
  }
}
//TODO combineGroup
val input = env.fromElements("a", "b", "c", "a").map(Tuple1(_))
val combinedWords = input.groupBy(0).combineGroup(new MycombineGroup())
combinedWords.print()
Aggregate
Aggregates a data set to extreme values (maximum, minimum).
Aggregate works only on tuples.
//TODO Aggregate
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 98.9))
data.+=((1, "yuwen", 88.88))
data.+=((1, "wuli", 93.00))
data.+=((1, "yuwen", 94.3))
// fromCollection turns the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val output = input.groupBy(1).aggregate(Aggregations.MAX, 2)
output.print()
minBy and maxBy
//TODO minBy / maxBy
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 90.0))
data.+=((2, "shuxue", 20.0))
data.+=((3, "yingyu", 30.0))
data.+=((4, "wuli", 40.0))
data.+=((5, "yuwen", 50.0))
data.+=((6, "wuli", 60.0))
data.+=((7, "yuwen", 70.0))
// fromCollection turns the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val output: DataSet[(Int, String, Double)] = input
  .groupBy(1)
  // find the minimum score per subject;
  // minBy's argument is the field to minimize
  .minBy(2)
output.print()
- distinct (deduplication)
//TODO distinct
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 90.0))
data.+=((2, "shuxue", 20.0))
data.+=((3, "yingyu", 30.0))
data.+=((4, "wuli", 40.0))
data.+=((5, "yuwen", 50.0))
data.+=((6, "wuli", 60.0))
data.+=((7, "yuwen", 70.0))
// fromCollection turns the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val distinct = input.distinct(1)
distinct.print()
join
Flink also supports join operations, which make it easy to produce combined results, for example:
finding the highest score per subject per class.
//TODO join
val data1 = new mutable.MutableList[(Int, String, Double)]
// student id --- subject --- score
data1.+=((1, "yuwen", 90.0))
data1.+=((2, "shuxue", 20.0))
data1.+=((3, "yingyu", 30.0))
data1.+=((4, "yuwen", 40.0))
data1.+=((5, "shuxue", 50.0))
data1.+=((6, "yingyu", 60.0))
data1.+=((7, "yuwen", 70.0))
data1.+=((8, "yuwen", 20.0))
val data2 = new mutable.MutableList[(Int, String)]
// student id --- class
data2.+=((1, "class_1"))
data2.+=((2, "class_1"))
data2.+=((3, "class_2"))
data2.+=((4, "class_2"))
data2.+=((5, "class_3"))
data2.+=((6, "class_3"))
data2.+=((7, "class_4"))
data2.+=((8, "class_1"))
val input1: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data1))
val input2: DataSet[(Int, String)] = env.fromCollection(Random.shuffle(data2))
// find the highest score per subject per class
val joindata = input2.join(input1).where(0).equalTo(0) {
  (input2, input1) => (input2._1, input2._2, input1._2, input1._3)
}
// joindata.print()
// println("===================")
val aggregateDataSet = joindata.groupBy(1, 2).aggregate(Aggregations.MAX, 3)
aggregateDataSet.print()