Apache NiFi Processor in action

1 Introduction

What is Apache NiFi? The official NiFi website describes it as "an easy to use, powerful, and reliable system to process and distribute data". In plainer terms, Apache NiFi is a system designed for data flow: it supports highly configurable directed graphs of data routing, transformation, and system mediation logic.
To make it clearer what NiFi can do, here is a brief introduction to its architecture, as shown in the following figure.



The following is a summary of the official website's description of each component:
• WebServer: Hosts NiFi's HTTP-based command and control API.
• Flow Controller: The core of the operation. It treats the Processor as the processing unit, provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.
• Extensions: The various types of NiFi extensions are described in other documents. The key point is that extensions operate and execute within the JVM.
• FlowFile Repository: Where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The implementation is pluggable; the default is a persistent write-ahead log located on a specified disk partition.
• Content Repository: Where the actual content bytes of a given FlowFile live. The implementation is also pluggable; the default is a fairly simple mechanism that stores blocks of data in the file system.
• Provenance Repository: Where all provenance event data is stored. The implementation is pluggable; the default uses one or more physical disk volumes, on which event data is indexed and searchable.

2 Introduction to NiFi Processor

The previous section introduced NiFi's basic concepts through its architecture diagram. From those concepts it is clear that the Flow Controller is the core of NiFi. So what is the Flow Controller? It acts as the broker for the exchange of FlowFiles between processors: it maintains the connections among multiple processors and manages each Processor, which is the actual processing unit. So let's look, through NiFi's UI, at which Processors NiFi offers.




As the figure above shows, Processors come in many types, tagged for example amazon, attributes, and hadoop, and their purpose can usually be identified from the name prefix. Names beginning with Get or Fetch acquire data, such as GetFile, GetFTP, and FetchHDFS; names beginning with Execute run something, such as ExecuteSQL, ExecuteProcess, and ExecuteFlumeSink. This makes it easy to guess what a Processor is for.
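The prefix convention described above can be sketched in a few lines of Python. This is only an illustration: the `PREFIX_PURPOSE` table and the `guess_purpose` helper are hypothetical names, not part of NiFi, and the table covers only the prefixes named in this article.

```python
# Hypothetical helper: guess a Processor's purpose from its name prefix.
# Only the prefixes mentioned in the article are listed here.
PREFIX_PURPOSE = {
    "Get": "acquisition",
    "Fetch": "acquisition",
    "Execute": "execution",
}


def guess_purpose(processor_name: str) -> str:
    """Return the purpose implied by the name prefix, or "unknown"."""
    for prefix, purpose in PREFIX_PURPOSE.items():
        if processor_name.startswith(prefix):
            return purpose
    return "unknown"


for name in ["GetFile", "GetFTP", "FetchHDFS", "ExecuteSQL", "ExecuteProcess"]:
    print(f"{name}: {guess_purpose(name)}")
```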

3 NiFi Processor in practice

After so much introduction to NiFi's architecture and Processors, what about using them in practice? This article takes a real requirement of the author's as an example. The requirement is as follows: select a data processing and scheduling tool that can schedule and execute server-side scripts in a customized way. The scripts involve environment variables, an Oracle database, and Hadoop ecosystem components. When a script finishes, the tool should return the script's run status and provide an interface for re-running on failure.
To meet this requirement, I evaluated several scheduling tools, such as Apache Oozie, Azkaban, and Pentaho. After comparing their advantages and disadvantages, I decided to try Apache NiFi. According to the NiFi Processor API documentation, the processor best suited to remote operations is ExecuteProcess. The rest of this section walks through the requirement in practice.

3.1 Addition and configuration of Processor

1. Click "Add Processor", select ExecuteProcess and click the Add button to complete the addition, as shown in the figure below.




2. Right-click ExecuteProcess and select Configure Processor to configure the Properties tab. Each configuration option provides relevant instructions, as shown in the following figure.




The options shown in the figure above are set as follows:
• Command: sh
• Command Arguments: -c;ssh user@ip sh js/job/job_hourly.sh `date
• Batch Duration: Not set. //We schedule by clock time, not by interval.
• Redirect Error Stream: Not set.
• Argument Delimiter (the delimiter between command arguments): ; //The arguments are separated by semicolons.
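To see why the Argument Delimiter matters for the settings above, here is a simplified Python model of how a command and a delimited argument string turn into an argument vector. It is a sketch, not NiFi's actual parser: `build_argv` is a hypothetical helper, the default delimiter is assumed to be a space, and the argument string is shortened to the part before the date expression.

```python
from typing import List


def build_argv(command: str, arguments: str, delimiter: str = " ") -> List[str]:
    """Split the argument string on a single-character delimiter.

    Simplified model only: quoting and escaping are not handled.
    """
    if len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    parts = [part for part in arguments.split(delimiter) if part]
    return [command] + parts


args = "-c;ssh user@ip sh js/job/job_hourly.sh"

# With ';' as delimiter, the whole ssh command stays one argument for `sh -c`.
print(build_argv("sh", args, ";"))

# With a space as delimiter, the ssh command is broken into separate words,
# so `sh -c` would receive only "ssh" as its command string.
print(build_argv("sh", args.replace(";", " "), " "))
```

Using `;` keeps the remote command together as the single string that `sh -c` expects, which is presumably why the configuration above sets it.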

3.2 Processor Scheduling

NiFi supports three scheduling strategies: Time Driven, CRON Driven, and Event Driven (not selectable here). We choose CRON Driven to match our needs. As I understand it, CRON here works much like crontab. The fields of a CRON expression are, in order: second, minute, hour, day of month, month, day of week, and, optionally, year. They are used together with special characters such as *, ?, and L (* means every value of the field is valid; ? means no specific value is given for the field; L means "last", for example the last day of the month). For example, "0 0 13 * * ?" schedules execution at 1 pm every day. Configure the scheduling parameters according to your needs, as shown below.
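As a reading aid for expressions like the one above, the following Python sketch labels each field of a CRON expression by position. It assumes the field order listed in this section with the year field optional; `label_cron_fields` is a hypothetical helper and does not validate the field values themselves.

```python
from typing import Dict

# Field order as described in the article; the trailing year is optional.
FIELD_NAMES = [
    "second", "minute", "hour", "day_of_month", "month", "day_of_week", "year",
]


def label_cron_fields(expression: str) -> Dict[str, str]:
    """Map each whitespace-separated field to its positional name."""
    fields = expression.split()
    if not 6 <= len(fields) <= 7:
        raise ValueError(f"expected 6 or 7 fields, got {len(fields)}")
    # zip stops at the shorter sequence, so "year" is omitted for 6 fields.
    return dict(zip(FIELD_NAMES, fields))


for name, value in label_cron_fields("0 0 13 * * ?").items():
    print(f"{name:>12}: {value}")
```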





3.3 Running Status Monitoring

NiFi exposes a REST API that developers can use for scheduling. Here we use the Processor API to monitor run status (fetching status parameters and starting and stopping the processor).
1. Obtain running status monitoring parameters:
The command is: curl 'http://IP/nifi-api/processors/processorsID'. It returns the following result, which can be parsed with a JSON parser.
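The same query and the JSON parsing step could be scripted roughly as follows in Python. This is a minimal sketch under assumptions: the response is assumed to carry the `revision` and `component` objects that appear in the PUT body used in this article, `SAMPLE_RESPONSE` is a made-up, trimmed stand-in for a real response, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def processor_url(base_url: str, processor_id: str) -> str:
    """Build the Processor API URL used by the curl command."""
    return f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}"


def fetch_processor(base_url: str, processor_id: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET the processor entity and decode the JSON body (needs a live NiFi)."""
    with urllib.request.urlopen(processor_url(base_url, processor_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(entity: Dict[str, Any]) -> Dict[str, Any]:
    """Pick out the fields needed for monitoring; missing keys become None."""
    component = entity.get("component", {})
    return {
        "id": component.get("id"),
        "state": component.get("state"),
        "version": entity.get("revision", {}).get("version"),
    }


# Assumed, trimmed response shape; a real response contains many more fields.
SAMPLE_RESPONSE = """
{"revision": {"version": 17},
 "component": {"id": "39e0dafc-015d-1000-918d-bee89ae2226e", "state": "STOPPED"}}
"""

print(summarize(json.loads(SAMPLE_RESPONSE)))
```

Against a running instance, `summarize(fetch_processor("http://IP", "processorsID"))` would replace the sample, with IP and processorsID filled in as in the curl command.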




2. Start and stop of Processor:
A Processor is started and stopped through the PUT method of its REST endpoint; the most useful effect of PUT here is changing the run state. A NiFi Processor has three states: Running, Stopped, and Disabled.
We can then put the REST API start and stop commands in a script and execute it.
• Start command (using the REST API's PUT method):
curl -i -X PUT -H 'Content-Type:application/json' -d '
{
"revision": {
"clientId": "586ec1d7-015d-1000-6459-28251212434e",
"version":17},
"component": {
"id": "39e0dafc-015d-1000-918d-bee89ae2226e",
"state": "RUNNING"
}
}' http://IP/nifi-api/processors/processorsID
• Stop command (using the REST API's PUT method):
curl -i -X PUT -H 'Content-Type:application/json' -d '
{
"revision": {
"clientId": "586ec1d7-015d-1000-6459-28251212434e",
"version":17},
"component": {
"id": "39e0dafc-015d-1000-918d-bee89ae2226e",
"state": "STOPPED"
}
}' http://IP/nifi-api/processors/processorsID
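For use inside a script, the two curl commands can be wrapped in one Python helper that builds the PUT request. This is a sketch, not a definitive client: `build_state_request` is a hypothetical name, only the two states shown in the commands are accepted, and the `version` value is assumed to come from a prior GET of the same processor.

```python
import json
import urllib.request

# Only the states used in the article's start and stop commands.
ALLOWED_STATES = {"RUNNING", "STOPPED"}


def build_state_request(base_url: str, processor_id: str, client_id: str,
                        version: int, state: str) -> urllib.request.Request:
    """Build (but do not send) the PUT request that changes the run state."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"unsupported state: {state}")
    body = {
        "revision": {"clientId": client_id, "version": version},
        "component": {"id": processor_id, "state": state},
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_state_request(
    "http://IP",
    "39e0dafc-015d-1000-918d-bee89ae2226e",
    "586ec1d7-015d-1000-6459-28251212434e",
    17,
    "RUNNING",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it against a live NiFi instance: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before it touches a real flow, and a re-run on failure reduces to building a STOPPED request followed by a RUNNING one.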

4 Summary and Postscript

This article first introduced Apache NiFi and then used the author's real requirement as an example to put NiFi's core component, the Processor, into practice. Because NiFi became an Apache top-level project only recently, available resources are still limited even though its functionality is powerful. This article is meant only as a starting point to invite better contributions; NiFi's real strength lies in data processing. Interested readers are welcome to join the discussion.

This article was originally published on Cobub's official website blog (www.cobub.com), author: Pan Hui
If you reprint, please indicate the author and source!
We recommend Cobub Razor (https://github.com/cobub/razor), an open-source mobile application analytics system that can be deployed privately.

