Flink Series: Metrics

Flink is a distributed processing engine for both stream and batch data, and over the last two years it has appeared more and more frequently in the data processing field. In fact, Flink became an ASF (Apache Software Foundation) top-level project back in 2014, although for a while it was overshadowed by Spark. Spark's strengths in data processing are undeniable, but after studying the source code of both projects and using both in practice, I personally prefer Flink. For real-time computation, compared with Spark's micro-batch model, Flink's processing model is what truly deserves to be called stream processing. The recently released Spark 2.3 also provides a streaming mode similar to Flink's, but I have not seen it verified in production at large internet companies, so we will have to wait and see. Beyond the streaming model, Flink also has very distinctive designs in memory management and network transport. Comparing Flink SQL with Spark SQL at the code level, Flink is a thin wrapper that uses Calcite's API directly, so SQL no longer looks so mysterious, and it becomes much easier for us to customize semantics and define our own SQL statements. Flink's CEP (Complex Event Processing) library, built on its state mechanism, lets us match combinations of events that we define in a workflow. I will explain the principles and walk through the code for each of these topics one by one in this series.
 
Back to the title of this article: Flink Metrics. Why open the series with this topic rather than an article on Flink's principles and components? Admittedly, the series ought to move from the basics to the deep end, but I had two considerations. First, this article is meant to test the waters: I would like to find out where everyone's interests lie. If most of my readers are veteran stream-computing practitioners, beginner-level Flink articles would hardly be worth writing; but if you are interested in Flink yet not very familiar with its basic principles and concepts, I will add a second, introductory series. Second, my current project happens to demand more and more metrics work, and I wanted to write it down while it is fresh; consider it a review.

 

Flink metrics are the measurements reported while tasks run in a Flink cluster. They include machine-level system metrics, such as Hostname, CPU, Memory, Thread, GC, Network and IO, and metrics of the runtime components, such as JobManager, TaskManager, Job, Task and Operator. Flink provides metrics for two purposes. First, metrics are collected in real time and displayed in the Flink web UI, where users can see the status, latency and other information of the jobs they submitted. Second, Flink exposes an external metrics-collection interface: through a MetricReporter, users can ship the metrics of the whole Flink cluster to third-party systems for storage, display and monitoring. The second purpose is what large internet companies rely on: their clusters are usually too large to show every task in the Flink UI, so they report metrics to their own dashboards for display, and the stored metrics can also drive alerting; furthermore, mining the historical data can create even greater value. Flink natively offers several popular third-party reporters, such as JMXReporter, GangliaReporter and GraphiteReporter, which users can configure directly.

 

Flink implements its metrics by drawing on the com.codahale.metrics package. The collected metrics are divided into four categories: Counter, Gauge, Histogram and Meter, illustrated below (a registration sketch in user code follows the list):
  • Counter: a counter, used to accumulate the total count of a metric. Taking Flink's indicators as examples, numRecordsIn (the total number of records received by a task or operator) and numRecordsOut (the total number of records emitted by a task or operator) on Task/Operator are Counters.
  • Gauge: an instantaneous value, used to record the current value of a metric. In Flink, for example, JVM.Heap.Used on the JobManager or TaskManager is a Gauge; it records the heap usage, at a point in time, of the JVM on the machine hosting the JobManager or TaskManager.
  • Histogram: a histogram. Sometimes we are not satisfied with just a total or an instantaneous value; when we want a metric's maximum, minimum, median and similar statistics, we can use a Histogram. Few Flink indicators are Histograms, but the most important one is the operator latency. This indicator records the delay of data processing and plays a very important role in monitoring a job.
  • Meter: a rate, used to record the average value of a metric over a period of time. A Flink indicator of this kind on Task/Operator is numRecordsInPerSecond, which, as the name suggests, is the number of records received per second by a task or operator.
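To make this concrete, here is a minimal sketch of registering a Counter and a Gauge from user code through Flink's RichFunction API; the class and metric names (CountingMapper, myRecordsIn, lastValueLength) are invented for illustration:

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.metrics.{Counter, Gauge}

    class CountingMapper extends RichMapFunction[String, String] {
      @transient private var records: Counter = _
      @transient private var lastLength: Long = 0L

      override def open(parameters: Configuration): Unit = {
        val group = getRuntimeContext.getMetricGroup
        // Counter: accumulates a running total
        records = group.counter("myRecordsIn")
        // Gauge: returns the current value each time it is polled
        group.gauge[Long, Gauge[Long]]("lastValueLength", new Gauge[Long] {
          override def getValue: Long = lastLength
        })
      }

      override def map(value: String): String = {
        records.inc()
        lastLength = value.length
        value
      }
    }

Histograms and Meters are registered the same way through group.histogram(...) and group.meter(...), typically by wrapping a Codahale implementation (for example with the wrappers in flink-metrics-dropwizard).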
 
Metrics code analysis
How does Flink collect metrics in its code? (The metrics code lives mainly in the flink-runtime module.) Let us walk through it step by step:
  1. Flink first defines ScopeFormat. ScopeFormat defines the scope format of each component's metric group, and each component (JobManager, TaskManager, Operator, and so on) has a format class that extends ScopeFormat.
  2. Next, a MetricGroup is defined for each component. Each group defines all the metrics that belong to its component. For example, the TaskIOMetricGroup class defines the IO-related metrics of task execution.
  3. Once every MetricGroup has been defined, the corresponding group is passed as a constructor argument when the component is initialized. Take the JobManager as an example: class JobManager (protected val flinkConfiguration: Configuration,
        protected val futureExecutor: ScheduledExecutorService,
        protected val ioExecutor: Executor,
        protected val instanceManager: InstanceManager,
        protected val scheduler: FlinkScheduler,
        protected val blobServer: BlobServer,
        protected val libraryCacheManager: BlobLibraryCacheManager,
        protected val archive: ActorRef,
        protected val restartStrategyFactory: RestartStrategyFactory,
        protected val timeout: FiniteDuration,
        protected val leaderElectionService: LeaderElectionService,
        protected val submittedJobGraphs : SubmittedJobGraphStore,
        protected val checkpointRecoveryFactory : CheckpointRecoveryFactory,
        protected val jobRecoveryTimeout: FiniteDuration,
        protected val jobManagerMetricGroup: JobManagerMetricGroup,
        protected val optRestAddress: Option[String])
    The JobManager is thus initialized with a JobManagerMetricGroup; later, instantiateMetrics(jobManagerMetricGroup) is called from its preStart() method. Let us look at the content of the instantiateMetrics method:
    private def instantiateMetrics(jobManagerMetricGroup: MetricGroup) : Unit = {
      jobManagerMetricGroup.gauge[Long, Gauge[Long]]("taskSlotsAvailable", new Gauge[Long] {
        override def getValue: Long = JobManager.this.instanceManager.getNumberOfAvailableSlots
      })
      jobManagerMetricGroup.gauge[Long, Gauge[Long]]("taskSlotsTotal", new Gauge[Long] {
        override def getValue: Long = JobManager.this.instanceManager.getTotalNumberOfSlots
      })
      jobManagerMetricGroup.gauge[Long, Gauge[Long]]("numRegisteredTaskManagers", new Gauge[Long] {
        override def getValue: Long
        = JobManager.this.instanceManager.getNumberOfRegisteredTaskManagers
      })
      jobManagerMetricGroup.gauge[Long, Gauge[Long]]("numRunningJobs", new Gauge[Long] {
        override def getValue: Long = JobManager.this.currentJobs.size
      })
  }
    Inside the instantiateMetrics method, all the relevant metrics are added to the jobManagerMetricGroup, which establishes the mapping between the metrics and their metric group.
  4. Subsequently, each component instantiates a MetricRegistryImpl and calls its startQueryService method to start the MetricRegistry's metric query service (what is started is essentially an Akka Actor; see the sketch after this list).
  5. Finally, Flink's native reporters (mainly the three mentioned above) are hooked up to the MetricRegistry; through this connection a reporter can fetch all the collected metrics and send them on to a third-party system.
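As a minimal sketch of step 4, assuming Flink 1.5-era internals (the empty Configuration and the actor system name here are placeholders; in a real cluster the configuration comes from flink-conf.yaml):

    import akka.actor.ActorSystem
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.runtime.clusterframework.types.ResourceID
    import org.apache.flink.runtime.metrics.{MetricRegistryConfiguration, MetricRegistryImpl}

    object MetricRegistrySketch {
      def main(args: Array[String]): Unit = {
        // build the registry from the (reporter) settings in the cluster configuration
        val registry = new MetricRegistryImpl(
          MetricRegistryConfiguration.fromConfiguration(new Configuration()))
        // start the Akka-based metric query service
        val actorSystem = ActorSystem("metrics-sketch")
        registry.startQueryService(actorSystem, ResourceID.generate())
      }
    }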
 
Configuring Metrics
Now that we understand how Flink implements metrics, let us get hands-on: how do we configure metrics so that they take effect? The steps are as follows:
  • Under Flink's conf directory there is a flink-conf.yaml file; all of Flink's configuration goes in here.
  • Configure the metric scopes. A scope determines how the identifier of a reported metric is composed. There are six scopes that can be configured (an example follows this list):
          metrics.scope.jm — JobManager metrics; default format <host>.jobmanager
          metrics.scope.jm.job — Job metrics on the JobManager; default format <host>.jobmanager.<job_name>
          metrics.scope.tm — TaskManager metrics; default format <host>.taskmanager.<tm_id>
          metrics.scope.tm.job — Job metrics on the TaskManager; default format <host>.taskmanager.<tm_id>.<job_name>
          metrics.scope.task — Task metrics; default format <host>.taskmanager.<tm_id>.<job_name>.<task_name>.<subtask_index>
          metrics.scope.operator — Operator metrics; default format <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>
      Users can recombine any of these six scopes as they see fit. For example, I can change metrics.scope.operator to <host>.<job_name>.<task_name>.<operator_name>.<subtask_index>; after the change, operator metrics arrive in the form <host>.<job_name>.<task_name>.<operator_name>.<subtask_index>.xxx = xxxx. (If the defaults are acceptable, none of these keys need to appear in the configuration file; the default values are specified in the source code.)
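For instance, overriding just the operator scope in flink-conf.yaml looks like this (the other five scopes keep their defaults):
          metrics.scope.operator: <host>.<job_name>.<task_name>.<operator_name>.<subtask_index>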
  • Configure the reporter. Reporter configuration varies with the implementation class; I will take the GraphiteReporter used in my current project as the example:
          metrics.reporters: grph
          metrics.reporter.grph.class: org.apache.flink.metrics.graphite.GraphiteReporter
          metrics.reporter.grph.host: xxx
          metrics.reporter.grph.port: xxx
          metrics.reporter.grph.protocol: TCP/UDP
      metrics.reporters names the reporters to enable; metrics.reporter.grph.class specifies the concrete reporter implementation class; metrics.reporter.grph.host specifies the IP of the remote Graphite host; metrics.reporter.grph.port specifies the port Graphite listens on; and metrics.reporter.grph.protocol specifies the protocol Graphite uses (TCP or UDP).
  • Finally, save the file and restart the Flink cluster for the configuration to take effect.
What if we do not want to use Flink's native reporters and would rather implement a customized one? That is entirely possible: referring to the GraphiteReporter class, a user can write a custom class that extends ScheduledDropwizardReporter and overrides its report method. Besides GraphiteReporter, in our project we also defined our own KafkaReporter to report metrics, meeting more of our users' needs; a skeleton is sketched below.
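The simplest route is to implement Flink's MetricReporter and Scheduled interfaces directly, which is what ScheduledDropwizardReporter itself builds on. Below is a minimal sketch of a hypothetical KafkaReporter; the sendToKafka helper and the producer wiring are placeholders, not our production implementation:

    import java.util.concurrent.ConcurrentHashMap
    import scala.collection.JavaConverters._
    import org.apache.flink.metrics.{Counter, Gauge, Metric, MetricConfig, MetricGroup}
    import org.apache.flink.metrics.reporter.{MetricReporter, Scheduled}

    class KafkaReporter extends MetricReporter with Scheduled {
      // metric identifier -> metric, kept up to date as metrics come and go
      private val metrics = new ConcurrentHashMap[String, Metric]()

      override def open(config: MetricConfig): Unit = {
        // a real implementation would create a Kafka producer here, reading
        // e.g. config.getString("bootstrap.servers", "localhost:9092")
      }

      override def close(): Unit = {
        // close the Kafka producer
      }

      override def notifyOfAddedMetric(metric: Metric, name: String, group: MetricGroup): Unit =
        metrics.put(group.getMetricIdentifier(name), metric)

      override def notifyOfRemovedMetric(metric: Metric, name: String, group: MetricGroup): Unit =
        metrics.remove(group.getMetricIdentifier(name))

      // called periodically by Flink at the configured reporting interval
      override def report(): Unit =
        for ((id, metric) <- metrics.asScala) {
          metric match {
            case c: Counter  => sendToKafka(id, c.getCount.toString)
            case g: Gauge[_] => sendToKafka(id, s"${g.getValue}")
            case _           => // Histogram and Meter omitted in this sketch
          }
        }

      private def sendToKafka(key: String, value: String): Unit = {
        // placeholder: produce the record to a metrics topic
      }
    }

Activating it mirrors the Graphite example: point metrics.reporter.<name>.class at the custom class and add whatever keys its open() method reads.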
 
Summary
Metrics are essential to the Flink project as a whole. On the community Jira, people frequently propose improvements of all kinds and ask for this or that metric to be added, but few of them are accepted in the end, because code changes customized for one particular company bring little value when merged into master. Used wisely, metrics let us discover the state of the cluster and of its tasks in time, so that we can take the corresponding measures to keep the cluster stable and avoid unnecessary losses.

 

 

Reproduced from: https://my.oschina.net/xiaominmin/blog/3057713
