Flink Setup and Usage

1. Standalone cluster installation

Download:

Download directly from the official website,

or copy the download link from the site and fetch it on the server: wget <paste the link>

2. Upload the installation package to the Linux machine

Use the rz command

3. Extract the archive

tar -zxvf flink-1.5.0-bin-hadoop24-scala_2.11.tgz

4. Edit the configuration file

vim conf/flink-conf.yaml

 


jobmanager.rpc.address: hadoop01

jobmanager.rpc.port: 6123

jobmanager.heap.mb: 1024

taskmanager.heap.mb: 1024

taskmanager.numberOfTaskSlots: 2

taskmanager.memory.preallocate: false

parallelism.default: 1

jobmanager.web.port: 8081

taskmanager.tmp.dirs: /export/servers/flink-1.5.0/tmp

 

Configuration parameters explained:

jobmanager.rpc.address: localhost   External address of the JobManager, the master/coordinator of the distributed system (DEFAULT: localhost). Set it to the IP address of your master node.

jobmanager.rpc.port: 6123   Port of the JobManager (DEFAULT: 6123)

jobmanager.heap.mb: 1024   Default JVM heap size of the JobManager, in megabytes

taskmanager.heap.mb: 1024   JVM heap size of the TaskManagers, in megabytes

taskmanager.numberOfTaskSlots: 1   Number of task slots per machine, typically the number of CPU cores available (DEFAULT: 1)

taskmanager.memory.preallocate: false   Whether to preallocate memory; disabled by default, so an idle Flink cluster does not tie up cluster resources

parallelism.default: 1   Default parallelism of programs

jobmanager.web.port: 8081   Port of the JobManager's web UI (DEFAULT: 8081)

taskmanager.tmp.dirs: directory for temporary files
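
To make the relationship between parallelism.default, taskmanager.numberOfTaskSlots and application code concrete, here is a minimal sketch of my own (the object name is made up, not part of the original setup): a job may override the configured default parallelism, and the parallelism it can actually obtain is bounded by the total number of task slots (number of TaskManagers × taskmanager.numberOfTaskSlots).

import org.apache.flink.api.scala._

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // overrides parallelism.default from flink-conf.yaml for this job only;
    // with the config above (taskmanager.numberOfTaskSlots: 2) each TaskManager offers 2 slots
    env.setParallelism(2)
    env.fromElements("flink", "spark", "flink")
      .map((_, 1))
      .groupBy(0)
      .sum(1)
      .print()
  }
}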

 

5. Start the Flink cluster

Option 1:

Add a JobManager:

bin/jobmanager.sh ((start|start-foreground) cluster)|stop|stop-all

Add a TaskManager:

 bin/taskmanager.sh start|start-foreground|stop|stop-all

Option 2:

  bin/start-cluster.sh

  bin/stop-cluster.sh

6. Run a test job

bin/flink run /export/servers/flink-1.5.0/examples/batch/WordCount.jar --input /export/servers/zookeeper.out --output /export/servers/flink_data

7. Cluster HA (high availability)

For an enterprise-grade application, stability comes first and performance second, so an HA mechanism is indispensable.

As with first-generation Hadoop, the architecture makes it obvious that the JobManager is a single point of failure (SPOF). The JobManager is responsible for task scheduling and resource allocation, so the consequences of losing it are easy to imagine. Flink's approach to JobManager HA is, in principle, essentially the same as Hadoop's.

In Standalone mode, Flink must rely on ZooKeeper to implement JobManager HA (ZooKeeper has become an indispensable building block for HA in most open-source frameworks). With ZooKeeper's help, a Standalone Flink cluster runs several JobManagers at the same time, only one of which is active while the others are on standby. When the active JobManager becomes unreachable (for example because the machine goes down or the process crashes), ZooKeeper elects a new JobManager from the standbys to take over the Flink cluster.

(Besides Standalone mode, Flink also has a YARN cluster mode; in that mode HA is handled on the Hadoop/YARN side.)

 

Configuration changes:

7.1 Edit conf/flink-conf.yaml

    vim conf/flink-conf.yaml

 

jobmanager.rpc.address: hadoop01   (note: with HA, this must be set to each machine's own address)

jobmanager.rpc.port: 6123

jobmanager.heap.mb: 1024

taskmanager.heap.mb: 1024

taskmanager.numberOfTaskSlots: 2

taskmanager.memory.preallocate: false

parallelism.default: 1

jobmanager.web.port: 8081

taskmanager.tmp.dirs: /export/servers/flink-1.5.0/tmp

# enable HA

state.backend: filesystem

state.backend.fs.checkpointdir: hdfs://hadoop01:9000/flink-checkpoints

high-availability: zookeeper

high-availability.storageDir: hdfs://hadoop01:9000/flink/ha/

high-availability.zookeeper.quorum: hadoop01:2181,hadoop02:2181,hadoop03:2181

high-availability.zookeeper.client.acl: open

 

HA parameters explained:

state.backend enables checkpointing and supports two kinds of backend:

jobmanager: in-memory state, backed up into the JobManager's/ZooKeeper's memory. Should only be used for minimal state (e.g. Kafka offsets) or for testing and local debugging.

filesystem: state is kept in the TaskManager's memory, and state snapshots are stored in a file system. All file systems supported by Flink work, for example HDFS, S3, ...

 

state.backend.fs.checkpointdir: directory in a Flink-supported file system in which checkpoints are stored. Note: the state backend must be accessible from the JobManager; file:// is only suitable for local setups.

high-availability: zookeeper   defines the high-availability mode used for cluster execution

high-availability.storageDir

directory used to store JobManager metadata; this is the persistent state store, and only a pointer to this state is kept in ZooKeeper. Exactly like the checkpoint directory, it must be accessible from the JobManager.

high-availability.zookeeper.quorum    the ZooKeeper quorum addresses
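
As a rough sketch of my own (not from the original text), the same checkpoint/state-backend choice can also be made per job in code; the object name, the socket source and its port below are placeholders only, while the HDFS path mirrors state.backend.fs.checkpointdir above. Settings made in code take precedence over flink-conf.yaml for that job.

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

object CheckpointedJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // take a checkpoint every 5 seconds
    env.enableCheckpointing(5000)
    // programmatic counterpart of state.backend: filesystem
    // and state.backend.fs.checkpointdir in flink-conf.yaml
    env.setStateBackend(new FsStateBackend("hdfs://hadoop01:9000/flink-checkpoints"))

    env.socketTextStream("hadoop01", 9999) // placeholder source, e.g. fed with `nc -lk 9999`
      .map((_, 1))
      .keyBy(0)
      .sum(1)
      .print()

    env.execute("checkpointed job")
  }
}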

 

7.2 Edit conf/zoo.cfg (only needed if no separate ZooKeeper installation exists; this configures the ZooKeeper bundled with Flink)

 

# The number of milliseconds of each tick

tickTime=2000

 

# The number of ticks that the initial  synchronization phase can take

initLimit=10

 

# The number of ticks that can pass between  sending a request and getting an acknowledgement

syncLimit=5

 

# The directory where the snapshot is stored.

# dataDir=/tmp/zookeeper

 

# The port at which the clients will connect

clientPort=2181

 

# ZooKeeper quorum peers

server.1=hadoop01:2888:3888

server.2=hadoop02:2888:3888

server.3=hadoop03:2888:3888

 

7.3 Edit conf/masters

hadoop01:8081

hadoop02:8082

7.4 Edit conf/slaves

hadoop01

hadoop02

hadoop03

7.5 Start Flink with HA

Start ZooKeeper:

bin/zkServer.sh start (make sure every ZooKeeper node starts successfully)

Start HDFS (the checkpoints and metadata are stored in HDFS):

start-dfs.sh

Start Flink:

bin/start-cluster.sh

Once everything is running, test HA:

Simulate a sudden failure:

Kill the JobManager on hadoop01, then open the web UI of the other node: hadoop02:8082

If the active JobManager switches over successfully, the HA setup is complete.

 

Option 2: Running Flink on YARN

To maximize cluster utilization, an enterprise usually runs several types of workload on the same cluster, so Flink also supports running on YARN. Prerequisite for Flink on YARN: HDFS and YARN are both up and running.

1. Adjust the Hadoop configuration:

vim etc/hadoop/yarn-site.xml

Add:

<property>

         <name>yarn.nodemanager.vmem-check-enabled</name>

         <value>false</value>

</property>

yarn.nodemanager.vmem-check-enabled controls whether a thread checks the amount of virtual memory each task is using and kills any task that exceeds its allocation; the default is true.

We disable it here because Flink on YARN easily exceeds the virtual-memory limit, in which case YARN would kill the job automatically.

2. Edit the global environment file /etc/profile:

Add:

export HADOOP_CONF_DIR=/export/servers/hadoop/etc/hadoop

Either YARN_CONF_DIR or HADOOP_CONF_DIR must be set as an environment variable so that Flink can read the YARN and HDFS configuration.

3. Submit jobs with Flink on YARN:

There are two main ways to start Flink on YARN:

(1) Start a YARN session (start a long-running Flink cluster on YARN);

(2) Submit a single Flink job directly to YARN (run a Flink job on YARN).

Option 1: YARN session

This mode starts a YARN session that brings up Flink's two essential services, the JobManager and the TaskManagers; you can then submit jobs to that cluster, and multiple Flink jobs can share the same session. Note that this mode requires at least Hadoop 2.2 and an HDFS installation, because starting a YARN session uploads the required jar files and configuration to HDFS.

Start a YARN session with the ./bin/yarn-session.sh script.

Parameters the script accepts:

Usage:

   Required

     -n,--container <arg>   Number of YARN container to allocate (=Number of Task Managers)

   Optional

     -D <arg>                        Dynamic properties

     -d,--detached                   Start detached

     -id,--applicationId <arg>       Attach to running YARN session

     -j,--jar <arg>                  Path to Flink jar file

     -jm,--jobManagerMemory <arg>    Memory for JobManager Container [in MB]

     -n,--container <arg>            Number of YARN container to allocate (=Number of Task Managers)

     -nm,--name <arg>                Set a custom name for the application on YARN

     -q,--query                      Display available YARN resources (memory, cores)

     -qu,--queue <arg>               Specify YARN queue.

     -s,--slots <arg>                Number of slots per TaskManager

     -st,--streaming                 Start Flink in streaming mode

    


Note:

If you do not want the Flink YARN client to keep running, you can also start a detached YARN session using the -d or --detached flag.

In that case the Flink YARN client only submits Flink to the cluster and then shuts itself down.

Start it:

bin/yarn-session.sh -n 2 -tm 800 -s 2

This command requests 3 containers from YARN in total: 2 containers run TaskManagers (-n 2), each with two task slots (-s 2) and 800 MB of container memory (-tm 800), plus one additional container for the ApplicationMaster, which hosts the JobManager. Once Flink is deployed in the YARN cluster, it prints the JobManager's connection details.

After a successful start, the console shows the session details.

Open the YARN web UI at ip:8088 to see the Flink session that was just submitted.

Click ApplicationMaster to open the job page:

That page is the job page of the session submitted with yarn-session.sh.

Then submit a job with Flink:

bin/flink run examples/batch/WordCount.jar

The result computed by WordCount.jar can be seen in the console.

The submitted job also shows up in the job page of the yarn-session.sh session:

Click a job to see its details.

Stopping the current job/session:

1: CTRL+C

2: the stop command

3: yarn application -kill application_1527077715040_0007

 

Detached YARN session

If you do not want the Flink YARN client to keep running, you can start a detached YARN session; the flag is -d or --detached.

In that case the Flink YARN client only submits Flink to the cluster and then shuts itself down. Note that the client can then no longer be used to stop the YARN session.

Stop the YARN session with the YARN utility instead: yarn application -kill <appId>

 

Run it as a detached YARN session:

bin/yarn-session.sh -n 2 -tm 800 -s 2 -d

 

Shut it down:

yarn application -kill application_1527077715040_0007

Option 2: Run a single Flink job on YARN

The YARN session above starts a Flink cluster inside Hadoop YARN whose resources are shared by all Flink jobs submitted to it. Alternatively, you can launch an individual Flink job on YARN. We still use ./bin/flink, but there is no need to start a YARN session beforehand:

bin/flink run -m yarn-cluster -yn 2 ./examples/batch/WordCount.jar

When submitting this way, the YARN-related parameters carry a y prefix; -yn sets the number of TaskManagers.

Watch the job on the 8088 page.

Stop the yarn-cluster job:

yarn application -kill <application ID>

Note:

The cluster's configuration parameters are fixed when the cluster is created, but business needs often call for different settings; you do not have to edit conf/flink-conf.yaml just because a single job needs them.

Instead, use -D <arg> (dynamic properties)

to override the existing configuration, for example:

-Dfs.overwrite-files=true -Dtaskmanager.network.numberOfBuffers=16368

  1. Flink application development

Like Spark, Flink is a one-stop processing framework: it can do batch processing (DataSet) as well as real-time stream processing (DataStream).
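
The rest of this article works with the DataSet API. As a minimal sketch of the DataStream side (my own illustration; the object name, socket host and port are placeholders), a streaming word count looks roughly like this:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object DataStream_WordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // placeholder socket source, e.g. fed with `nc -lk 9999` on hadoop01
    val text = env.socketTextStream("hadoop01", 9999)
    val counts = text
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      .keyBy(0)
      .timeWindow(Time.seconds(5)) // emit a count per word every 5 seconds
      .sum(1)
    counts.print()
    env.execute("DataStream wordCount")
  }
}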

    1. Import the required dependencies with Maven:
      <properties>
          <maven.compiler.source>1.8</maven.compiler.source>
          <maven.compiler.target>1.8</maven.compiler.target>
          <encoding>UTF-8</encoding>
          <scala.version>2.11.2</scala.version>
          <scala.compat.version>2.11</scala.compat.version>
          <hadoop.version>2.6.2</hadoop.version>
          <flink.version>1.5.0</flink.version>
      </properties>

      <dependencies>
          <dependency>
              <groupId>org.scala-lang</groupId>
              <artifactId>scala-library</artifactId>
              <version>${scala.version}</version>
          </dependency>

          <dependency>
              <groupId>org.apache.flink</groupId>
              <artifactId>flink-streaming-scala_2.11</artifactId>
              <version>${flink.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.flink</groupId>
              <artifactId>flink-scala_2.11</artifactId>
              <version>${flink.version}</version>
          </dependency>
          <dependency>
              <groupId>org.apache.flink</groupId>
              <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
      
          </dependency>

          <dependency>
              <groupId>org.apache.flink</groupId>
              <artifactId>flink-table_2.11</artifactId>
              <version>${flink.version}</version>
          </dependency>

          <dependency>
              <groupId>org.apache.hadoop</groupId>
              <artifactId>hadoop-client</artifactId>
              <version>${hadoop.version}</version>
          </dependency>

          <dependency>
              <groupId>mysql</groupId>
              <artifactId>mysql-connector-java</artifactId>
              <version>5.1.38</version>
          </dependency>

          <dependency>
              <groupId>com.alibaba</groupId>
              <artifactId>fastjson</artifactId>
              <version>1.2.22</version>
          </dependency>

          <dependency>
              <groupId>org.apache.flink</groupId>
              <artifactId>flink-connector-kafka-0.9_2.11</artifactId>
              <version>${flink.version}</version>
          </dependency>

      </dependencies>
      
      DataSet development
       Development flow:
       obtain an execution environment,
       load/create the initial data,
       specify the transformations on the data,
       specify where to put the results of the computation,
       trigger the program execution
      Example:
      import org.apache.flink.api.scala._

      object DataSet_WordCount {
        def main(args: Array[String]) {
          //TODO initialize the environment
          val env = ExecutionEnvironment.getExecutionEnvironment
          //TODO load/create the initial data
          val text = env.fromElements(
            "Who's there?",
            "I think I hear them. Stand, ho! Who's there?")
          //TODO specify the transformations on the data
          val split_words = text.flatMap(line => line.toLowerCase().split("\\W+"))
          val filter_words = split_words.filter(x => x.nonEmpty)
          val map_words = filter_words.map(x => (x, 1))
          val groupBy_words = map_words.groupBy(0)
          val sum_words = groupBy_words.sum(1)
          //todo specify where to put the results
      //    sum_words.setParallelism(1)//collapse the result into a single output
          sum_words.writeAsText(args(0))//e.g. "/Users/niutao/Desktop/flink.txt"
          //TODO trigger the program execution
          env.execute("DataSet wordCount")
        }
      }
      Package the program and submit it to YARN
      Add the Maven packaging plugins:
      <build>
          <sourceDirectory>src/main/java</sourceDirectory>
          <testSourceDirectory>src/test/scala</testSourceDirectory>
          <plugins>

              <plugin>
                  <groupId>org.apache.maven.plugins</groupId>
                  <artifactId>maven-compiler-plugin</artifactId>
                  <version>2.5.1</version>
                  <configuration>
                      <source>1.7</source>
                      <target>1.7</target>
                      <!--<encoding>${project.build.sourceEncoding}</encoding>-->
                  </configuration>
              </plugin>

              <plugin>
                  <groupId>net.alchim31.maven</groupId>
                  <artifactId>scala-maven-plugin</artifactId>
                  <version>3.2.0</version>
                  <executions>
                      <execution>
                          <goals>
                              <goal>compile</goal>
                              <goal>testCompile</goal>
                          </goals>
                          <configuration>
                              <args>
                                  <!--<arg>-make:transitive</arg>-->
                                  <arg>-dependencyfile</arg>
                                  <arg>${project.build.directory}/.scala_dependencies</arg>
                              </args>
                          </configuration>
                      </execution>
                  </executions>
              </plugin>

              <plugin>
                  <groupId>org.apache.maven.plugins</groupId>
                  <artifactId>maven-surefire-plugin</artifactId>
                  <version>2.18.1</version>
                  <configuration>
                      <useFile>false</useFile>
                      <disableXmlReport>true</disableXmlReport>
                      <includes>
                          <include>**/*Test.*</include>
                          <include>**/*Suite.*</include>
                      </includes>
                  </configuration>
              </plugin>

              <plugin>
                  <groupId>org.apache.maven.plugins</groupId>
                  <artifactId>maven-shade-plugin</artifactId>
                  <version>2.3</version>
                  <executions>
                      <execution>
                          <phase>package</phase>
                          <goals>
                              <goal>shade</goal>
                          </goals>
                          <configuration>
                              <filters>
                                  <filter>
                                      <artifact>*:*</artifact>
                                      <excludes>
                                          <!--
                                          zip -d learn_spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
                                          -->
                                          <exclude>META-INF/*.SF</exclude>
                                          <exclude>META-INF/*.DSA</exclude>
                                          <exclude>META-INF/*.RSA</exclude>
                                      </excludes>
                                  </filter>
                              </filters>
                              <transformers>
                                  <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                      <mainClass>com.itcast.DEMO.WordCount</mainClass>
                                  </transformer>
                              </transformers>
                          </configuration>
                      </execution>
                  </executions>
              </plugin>
          </plugins>
      </build>
    2. After a successful build, the jar is generated in the target folder;
    3. Upload the jar with the rz command, then run the program:

    4. bin/flink run -m yarn-cluster -yn 2 /home/elasticsearch/flinkjar/itcast_learn_flink-1.0-SNAPSHOT.jar com.itcast.DEMO.WordCount

    5. The submitted program can be observed on YARN's 8088 page:

    6. The output of the run can be found in the /export/servers/flink-1.3.2/flinkJAR folder:

    7. DataSet Transformations
    8. Each transformation below is listed with a short description and a code example:

      Map

      Takes one element and produces one element.

      data.map { x => x.toInt }

      FlatMap

      Takes one element and produces zero, one, or more elements.

      data.flatMap { str => str.split(" ") }

      MapPartition

      Transforms a parallel partition in a single function call. The function gets the partition as an `Iterator` and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree-of-parallelism and previous operations.

      data.mapPartition { in => in map { (_, 1) } }

      Filter

      Evaluates a boolean function for each element and retains those for which the function returns true.
      IMPORTANT: The system assumes that the function does not modify the element on which the predicate is applied. Violating this assumption can lead to incorrect results.

      data.filter { _ > 1000 }

      Reduce

      Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set, or on a grouped data set.

      data.reduce { _ + _ }

      ReduceGroup

      Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set, or on a grouped data set.

      data.reduceGroup { elements => elements.sum }

      Aggregate

      Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set, or on a grouped data set.

      val input: DataSet[(Int, String, Double)] = // [...]
      val output: DataSet[(Int, String, Double)] = input.aggregate(SUM, 0).aggregate(MIN, 2)

      You can also use short-hand syntax for minimum, maximum, and sum aggregations.

      val input: DataSet[(Int, String, Double)] = // [...]
      val output: DataSet[(Int, String, Double)] = input.sum(0).min(2)

      Distinct

      Returns the distinct elements of a data set. It removes the duplicate entries from the input DataSet, with respect to all fields of the elements, or a subset of fields.

      data.distinct()

      Join

      Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.

      // In this case tuple fields are used as keys. "0" is the join field on the first tuple
      // "1" is the join field on the second tuple.
      val result = input1.join(input2).where(0).equalTo(1)

      You can specify the way that the runtime executes the join via Join Hints. The hints describe whether the join happens through partitioning or broadcasting, and whether it uses a sort-based or a hash-based algorithm. Please refer to the Transformations Guide for a list of possible hints and an example. If no hint is specified, the system will try to make an estimate of the input sizes and pick the best strategy according to those estimates.

      // This executes a join by broadcasting the first data set
      // using a hash table for the broadcasted data
      val result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
                         .where(0).equalTo(1)

      Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.

      OuterJoin

      Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the "outer" side (left, right, or both in case of full) are preserved if no matching key is found in the other side. Matching pairs of elements (or one element and a `null` value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.

      val joined = left.leftOuterJoin(right).where(0).equalTo(1) {
         (left, right) =>
           val a = if (left == null) "none" else left._1
           (a, right)
        }

      CoGroup

      The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups. See the keys section to learn how to define coGroup keys.

      data1.coGroup(data2).where(0).equalTo(1)

      Cross

      Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element

      val data1: DataSet[Int] = // [...]
      val data2: DataSet[String] = // [...]
      val result: DataSet[(Int, String)] = data1.cross(data2)

      Note: Cross is potentially a very compute-intensive operation which can challenge even large compute clusters! It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().

      Union

      Produces the union of two data sets.

      data.union(data2)

      Rebalance

      Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.

      val data1: DataSet[Int] = // [...]
      val result: DataSet[(Int, String)] = data1.rebalance().map(...)

      Hash-Partition

      Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.

      val in: DataSet[(Int, String)] = // [...]
      val result = in.partitionByHash(0).mapPartition { ... }

      Range-Partition

      Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.

      val in: DataSet[(Int, String)] = // [...]
      val result = in.partitionByRange(0).mapPartition { ... }

      Custom Partitioning

      Manually specify a partitioning over the data. 
      Note: This method works only on single field keys.

      val in: DataSet[(Int, String)] = // [...]
      val result = in
        .partitionCustom(partitioner: Partitioner[K], key)

      Sort Partition

      Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.

      val in: DataSet[(Int, String)] = // [...]
      val result = in.sortPartition(1, Order.ASCENDING).mapPartition { ... }

      First-n

      Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions, tuple positions or case class fields.

      val in: DataSet[(Int, String)] = // [...]
      // regular data set
      val result1 = in.first(3)
      // grouped data set
      val result2 = in.groupBy(0).first(3)
      // grouped-sorted data set
      val result3 = in.groupBy(0).sortGroup(1, Order.ASCENDING).first(3)
      

      1: the map function

      2: the flatMap function

        import org.apache.flink.api.scala._

        //initialize the execution environment
        val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

        //load the data
        val data = env.fromElements(("A", 1), ("B", 1), ("C", 1))

        //apply transformations to the data
        //TODO map
        val map_result = data.map(line => line._1 + line._2)
        map_result.print()

        //TODO flatMap
        val flatmap_result = data.flatMap(line => line._1 + line._2)
        flatmap_result.print()

       

       

      Exercise: given the following data

      A;B;C;D;B;D;C

      B;D;A;E;D;C

      A;B

      Task: count how many times each pair of adjacent strings occurs

      import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
      import org.apache.flink.streaming.api.scala._

      /**
        * Created by angel
        *
        * A;B;C;D;B;D;C
        * B;D;A;E;D;C
        * A;B
        *
        * Count the occurrences of adjacent string pairs: (A+B , 2) (B+C , 1) ...
        */
      object demo {
        def main(args: Array[String]): Unit = {
          val env = ExecutionEnvironment.getExecutionEnvironment
          val data = env.fromElements("A;B;C;D;B;D;C;B;D;A;E;D;C;A;B")
          val map_data: DataSet[Array[String]] = data.map(line => line.split(";"))
          //[A,B,C,D] --- "A,B,C,D"
          //[A,B,C,D] ---> (x,1) , (y,1) --> groupBy ---> sum --> total
          val tupe_data = map_data.flatMap{
            line =>
              for(index <- 0 until line.length-1) yield (line(index) + "+" + line(index+1), 1)
          }
          val gropudata = tupe_data.groupBy(0)
          val result = gropudata.sum(1)
          result.print()
        }
      }

       

       

      3: the mapPartition function

      //TODO mapPartition
        val ele_partition = elements.setParallelism(2)//set the parallelism to 2
        val partition = ele_partition.mapPartition(line => line.map(x => x + "======"))//line is the data of one partition
        partition.print()

       

      mapPartition processes the data one partition at a time.
      The benefit shows up when, for example, the processed data has to be stored in MySQL: with mapPartition you open one connection per partition, whereas with map you would open one MySQL connection per record.
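
      As a sketch of that pattern (my own illustration; the JDBC URL, database, table and credentials are hypothetical, and the mysql-connector-java dependency from the POM above is assumed to be on the classpath), one connection is opened per partition rather than per record:

      import java.sql.DriverManager
      import org.apache.flink.api.scala._

      val env = ExecutionEnvironment.getExecutionEnvironment
      val words = env.fromElements(("java", 1), ("scala", 1), ("java", 1))

      val written = words.mapPartition { records =>
        // one MySQL connection for the whole partition (hypothetical url/table/credentials)
        val conn = DriverManager.getConnection("jdbc:mysql://hadoop01:3306/test", "root", "root")
        val stmt = conn.prepareStatement("INSERT INTO word_count (word, cnt) VALUES (?, ?)")
        val inserted = records.map { case (word, cnt) =>
          stmt.setString(1, word)
          stmt.setInt(2, cnt)
          stmt.executeUpdate()
          (word, cnt)
        }.toList // force the writes before the connection is closed
        stmt.close()
        conn.close()
        inserted
      }
      written.print()

      In practice a dedicated JDBC sink/output format can do the same job; the point here is only the one-connection-per-partition idea.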

       

      4: the filter function

      The filter function is extremely useful in production: filtering out the bulk of the records that do not match the business logic during the processing stage greatly reduces the overall load on Flink.

      //TODO filter
        val filter: DataSet[String] = elements.filter(line => line.contains("java"))//keep only the records that contain "java"
        filter.print()

       

       

      5: the reduce function

      //TODO reduce
       val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 1), ("scala", 1), ("java", 1)))
        val tuple_map = elements.flatMap(x => x)//unwrap the lists into individual tuples
        val group_map = tuple_map.groupBy(x => x._1)//group by word
        val reduce = group_map.reduce((x, y) => (x._1, x._2 + y._2))
        reduce.print()
      
    9. reduceGroup

      The plain reduce function combines the elements pairwise across the whole group.

    10. reduceGroup is an optimized variant of reduce;

      it first reduces within each group locally and only then performs the overall reduce, which cuts down on network I/O.

  1.  //TODO reduceGroup
      import org.apache.flink.util.Collector

      val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 1), ("java", 1), ("scala", 1)))
      val tuple_words = elements.flatMap(x => x)
      val group_words = tuple_words.groupBy(x => x._1)
      val a = group_words.reduceGroup{
        (in: Iterator[(String, Int)], out: Collector[(String, Int)]) =>
          val result = in.reduce((x, y) => (x._1, x._2 + y._2))
          out.collect(result)
      }
      a.print()
  2. GroupReduceFunction and GroupCombineFunction (user-defined functions)
  3. import java.lang.Iterable
     import org.apache.flink.api.common.functions.{GroupCombineFunction, GroupReduceFunction}
     import org.apache.flink.util.Collector
     import scala.collection.mutable
     import collection.JavaConverters._

     class Tuple3GroupReduceWithCombine extends GroupReduceFunction[(String, Int), (String, Int)] with GroupCombineFunction[(String, Int), (String, Int)] {

       override def reduce(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
         for(in <- values.asScala){
           out.collect((in._1, in._2))
         }
       }

       override def combine(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
         val map = new mutable.HashMap[String, Int]()
         var num = 0
         var s = ""
         for(in <- values.asScala){
             num += in._2
             s = in._1
         }
         out.collect((s, num))
       }
     }
  4. //  TODO GroupReduceFunction / GroupCombineFunction
     val env = ExecutionEnvironment.getExecutionEnvironment
     val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 3), ("java", 1), ("scala", 1)))
     val collection = elements.flatMap(line => line)
     val groupDatas: GroupedDataSet[(String, Int)] = collection.groupBy(line => line._1)
     //use the custom reduce/combine function via reduceGroup
     val result = groupDatas.reduceGroup(new Tuple3GroupReduceWithCombine())
     val result_sort = result.collect().sortBy(x => x._1)
     println(result_sort)
    
  5. combineGroup

    The group operations shown above (reduceGroup with a GroupReduceFunction) can easily cause out-of-memory errors, because all the data of a group has to be transformed in a single step and therefore must fit in memory; when memory is insufficient, use combineGroup instead.
    combineGroup applies a GroupCombineFunction to a grouped data set.
    A GroupCombineFunction is similar to a GroupReduceFunction, but it does not perform a full data exchange.
  6. [Note]: combineGroup may produce partial results rather than the complete result.
    
    
    import java.lang.Iterable
    import org.apache.flink.api.common.functions.GroupCombineFunction
    import org.apache.flink.util.Collector
    import collection.JavaConverters._

    class MycombineGroup extends GroupCombineFunction[Tuple1[String], (String, Int)]{

      override def combine(iterable: Iterable[Tuple1[String]], out: Collector[(String, Int)]): Unit = {
        var key: String = null
        var count = 0
        for(line <- iterable.asScala){
          key = line._1
          count += 1
        }
        out.collect((key, count))
      }
    }
  7. //TODO combineGroup
    val input = env.fromElements("a", "b", "c", "a").map(Tuple1(_))
    val combinedWords = input.groupBy(0).combineGroup(new MycombineGroup())
    combinedWords.print()
  8. Aggregate

    Aggregates over a data set to compute extremes (maximum, minimum).
    Aggregate can only be applied to tuples.
  9. //TODO Aggregate
    val data = new mutable.MutableList[(Int, String, Double)]
    data.+=((1, "yuwen", 89.0))
    data.+=((2, "shuxue", 92.2))
    data.+=((3, "yingyu", 89.99))
    data.+=((4, "wuli", 98.9))
    data.+=((1, "yuwen", 88.88))
    data.+=((1, "wuli", 93.00))
    data.+=((1, "yuwen", 94.3))
    //fromCollection turns the collection into a DataSet
    val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
    val output = input.groupBy(1).aggregate(Aggregations.MAX, 2)
    output.print()
  10. minBy and maxBy

    //TODO minBy / maxBy
    val data = new mutable.MutableList[(Int, String, Double)]
    data.+=((1, "yuwen", 90.0))
    data.+=((2, "shuxue", 20.0))
    data.+=((3, "yingyu", 30.0))
    data.+=((4, "wuli", 40.0))
    data.+=((5, "yuwen", 50.0))
    data.+=((6, "wuli", 60.0))
    data.+=((7, "yuwen", 70.0))
    //fromCollection turns the collection into a DataSet
    val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
    val output: DataSet[(Int, String, Double)] = input
      .groupBy(1)
      //find the lowest score within each subject
      //the argument of minBy is the index of the field whose minimum is wanted
      .minBy(2)
    output.print()
    
  11. distinct (deduplication)
    //TODO distinct
      val data = new mutable.MutableList[(Int, String, Double)]
      data.+=((1, "yuwen", 90.0))
      data.+=((2, "shuxue", 20.0))
      data.+=((3, "yingyu", 30.0))
      data.+=((4, "wuli", 40.0))
      data.+=((5, "yuwen", 50.0))
      data.+=((6, "wuli", 60.0))
      data.+=((7, "yuwen", 70.0))
      //fromCollection turns the collection into a DataSet
      val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
      val distinct = input.distinct(1)
      distinct.print()
  12. join

    Flink jobs sometimes need to combine related data sets, and join makes it easy to produce the desired combined result, for example:

    finding the highest score for each subject in each class.

  13. //TODO join
        val data1 = new mutable.MutableList[(Int, String, Double)]
        //student id --- subject --- score
        data1.+=((1, "yuwen", 90.0))
        data1.+=((2, "shuxue", 20.0))
        data1.+=((3, "yingyu", 30.0))
        data1.+=((4, "yuwen", 40.0))
        data1.+=((5, "shuxue", 50.0))
        data1.+=((6, "yingyu", 60.0))
        data1.+=((7, "yuwen", 70.0))
        data1.+=((8, "yuwen", 20.0))
        val data2 = new mutable.MutableList[(Int, String)]
        //student id --- class
        data2.+=((1, "class_1"))
        data2.+=((2, "class_1"))
        data2.+=((3, "class_2"))
        data2.+=((4, "class_2"))
        data2.+=((5, "class_3"))
        data2.+=((6, "class_3"))
        data2.+=((7, "class_4"))
        data2.+=((8, "class_1"))
        val input1: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data1))
        val input2: DataSet[(Int, String)] = env.fromCollection(Random.shuffle(data2))
        //find the highest score per subject within each class
        val joindata = input2.join(input1).where(0).equalTo(0){
          (input2, input1) => (input2._1, input2._2, input1._2, input1._3)
        }
    //    joindata.print()
    //    println("===================")
        val aggregateDataSet = joindata.groupBy(1, 2).aggregate(Aggregations.MAX, 3)
        aggregateDataSet.print()

     


Reposted from blog.csdn.net/someInNeed/article/details/89811345