[Big Data] Apache NiFi for data processing and distribution


1. What is NiFi?

Simply put, NiFi was built to automate the flow of data between systems. Although the term dataflow is used in a variety of contexts, we use it here to mean the automated, managed flow of information between disparate systems. This problem space has existed ever since enterprises had more than one system: some systems produce data and others consume it, and the data has to move between them. The solutions to these problems have been widely studied and discussed, and Enterprise Integration Patterns (EIP) is one comprehensive and readily usable body of work among them.

Some of the challenges faced by dataflow include :

  • Systems fail : network failure, disk failure, software crash, human accident.
  • Data access exceeds capacity to consume : Sometimes a given data source can outpace some part of the processing or delivery chain, and it only takes one weak link for the entire flow to be affected.
  • Boundary conditions are mere suggestions : You will always get data that is too big, too small, too fast, too slow, corrupted, wrong, or malformed.
  • What is noise one day becomes signal the next : Real business needs change quickly; new data processing flows must be designed, and existing ones changed, just as quickly.
  • Systems evolve at different rates : The protocols and formats used by a given system can change at any time, often independently of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of loosely connected components that were often never designed to work together.
  • Compliance and security : Changes in laws, regulations and policies. Changes to Business-to-Business Agreements. System-to-system and system-to-user interactions must be secure, trustworthy, and accountable.
  • Continuous improvement occurs in production : It is often impossible to fully simulate the production environment in a test environment.

Data flow has been one of the unavoidable problems in architecture for many years. A number of active and rapidly evolving movements are making dataflow even more important to enterprises that want to succeed, such as SOA, APIs, IoT, and Big Data. In addition, the level of rigor required for compliance, privacy, and security continues to rise. Even with all these new concepts and technologies, the difficulties and challenges of dataflow remain largely the same. The main differences are the scope of the complexity, the rate of change that must be accommodated, and the fact that at scale the edge case becomes commonplace. NiFi is designed to help solve these modern dataflow challenges.

2. The core concepts of NiFi

The basic design of NiFi is closely related to the main ideas of flow-based programming (FBP). Here are some of the major NiFi concepts and how they map to FBP:

  • FlowFile (FBP: Information Packet) : A FlowFile represents each object moving through the system. For each FlowFile, NiFi keeps track of a map of key/value attribute strings and its associated content of zero or more bytes (a FlowFile has attributes and content).
  • FlowFile Processor (FBP: Black Box) : Processors are what actually perform the work. In EIP terms, a processor performs some combination of data routing, data transformation, or data mediation between systems. Processors have access to the attributes of a given FlowFile and its content. A processor can operate on zero or more FlowFiles in a given unit of work and either commit that work or roll it back.
  • Connection (FBP: Bounded Buffer) : Connections provide the actual linkage between processors. They act as queues and allow various processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enables backpressure.
  • Flow Controller (FBP: Scheduler) : The Flow Controller maintains the knowledge of how processes are connected and manages the threads and their allocation that all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.
  • Process Group (FBP: Subnet) : A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this way, entirely new functional components can be created simply by composing existing ones inside a Process Group.
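The mapping above can be sketched in a few lines of code. The following is a minimal, illustrative Python model of these FBP concepts, not NiFi's actual API (all class and function names here are invented for illustration): a FlowFile carries attributes plus content, a processor is a black box that transforms FlowFiles, and a connection is a bounded queue between processors.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class FlowFile:
    # A FlowFile is key/value attributes plus zero or more bytes of content.
    attributes: dict = field(default_factory=dict)
    content: bytes = b""

def uppercase_processor(flowfile: FlowFile) -> FlowFile:
    # A "black box" processor: reads content, transforms it, annotates attributes.
    return FlowFile(
        attributes={**flowfile.attributes, "transformed": "true"},
        content=flowfile.content.upper(),
    )

# A connection is a bounded buffer (queue) between processors.
connection: Queue = Queue(maxsize=10)

connection.put(FlowFile(attributes={"filename": "a.txt"}, content=b"hello"))
result = uppercase_processor(connection.get())
print(result.content)     # b'HELLO'
print(result.attributes)  # {'filename': 'a.txt', 'transformed': 'true'}
```

The real system adds the parts that make this production-grade: persistence of FlowFile state, transactional commit/rollback of a processor's unit of work, and scheduling by the Flow Controller.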

This design model is also similar to SEDA and brings many benefits that help NiFi become a very effective platform for building powerful and scalable data flows. Some of these benefits include:

  • Facilitates the visual creation and management of directed graphs of processors.
  • Asynchronous in nature, allowing very high throughput and ample natural buffering.
  • Provides a highly concurrent model without the developer having to worry about the typical complexities of concurrency.
  • Facilitates the development of cohesive and loosely coupled components that can then be reused in other environments and facilitate unit testing.
  • Resource-constrained connections (configurable connections in the process) make key functions such as backpressure and pressure relief very natural and intuitive.
  • Error handling becomes as natural as basic logic rather than coarse-grained catch-all.
  • The points at which data enters and exits the system, and how it flows, are easy to understand and track.
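The backpressure benefit listed above falls out naturally from bounded connections. Here is a toy Python sketch of the idea (not NiFi code): when the queue between two processors reaches its configured limit, the producer is refused or blocked, which propagates pressure upstream.

```python
from queue import Queue, Full

# A connection with an object-count threshold of 3, analogous to a
# NiFi connection's back pressure object threshold.
connection = Queue(maxsize=3)

accepted, rejected = 0, 0
for item in range(5):
    try:
        # put_nowait fails immediately when the queue is full,
        # signalling backpressure to the producer.
        connection.put_nowait(item)
        accepted += 1
    except Full:
        rejected += 1

print(accepted, rejected)  # 3 2
```

In NiFi the producer would be paused by the scheduler rather than the data being dropped; the point is that a full bounded buffer is the natural signal.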

3. NiFi architecture

[Figure: NiFi architecture within the JVM]
NiFi executes within the JVM on the operating system. The main components of NiFi on the JVM are as follows:

  • Web Server : The purpose of the web server is to host NiFi's HTTP-based command and control API.
  • Flow Controller : It is the core of the entire operation, providing threads for components to be run and managing scheduling.
  • Extensions : There are various types of NiFi extensions, which are described in other documentation. The key point here is that NiFi extensions operate and execute in the JVM.
  • FlowFile Repository : The FlowFile Repository is where NiFi keeps track of the state of each FlowFile that is currently active in the flow. The implementation of the FlowFile Repository is pluggable (there are multiple choices, it is configurable, and you can even implement your own). The default implementation uses Write-Ahead Log technology (in short, the core idea of a WAL is: before data is written to its destination, it is first written to a log, and the log records are later applied to storage) and persists to a specified disk directory.
  • Content Repository : The Content Repository is where the actual content bytes of a given FlowFile live. Content Repository implementations are pluggable. The default approach is a fairly simple mechanism that stores blocks of data in the file system. Multiple file-system storage locations can be specified so that different physical partitions are used, reducing contention on any single volume (so in practice it is best to configure multiple directories mounted on different disks to improve I/O).
  • Provenance Repository : Provenance Repository is where all event data is stored. The implementation of Provenance Repository is pluggable, and the default implementation uses one or more physical disk volumes. Event data within each location is indexed and searchable.
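The write-ahead idea behind the default FlowFile Repository can be illustrated with a toy sketch (this is not NiFi's implementation; the class, file name, and record format here are invented): every state change is appended and flushed to a log before being applied, so state can be rebuilt by replaying the log after a crash.

```python
import json
import os
import tempfile

class ToyWAL:
    """Toy write-ahead log: append a record first, then apply it to state."""
    def __init__(self, path):
        self.path = path
        self.state = {}

    def update(self, key, value):
        # 1. Write the intent to the log and force it to disk...
        with open(self.path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ...only then mutate the in-memory state.
        self.state[key] = value

    def recover(self):
        # After a crash, replay the log to rebuild the state.
        self.state = {}
        with open(self.path) as log:
            for line in log:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

log_path = os.path.join(tempfile.mkdtemp(), "flowfile.wal")
wal = ToyWAL(log_path)
wal.update("flowfile-1", "queued")
wal.update("flowfile-1", "processing")

fresh = ToyWAL(log_path)   # simulate a process restart
fresh.recover()
print(fresh.state)         # {'flowfile-1': 'processing'}
```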
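The on-disk locations of the three repositories are configured in nifi.properties. As an example (the directory values below are illustrative; adjust them for your environment), the Content Repository in particular accepts multiple named directory properties so it can be spread across disks:

```properties
# FlowFile Repository (write-ahead log of FlowFile state)
nifi.flowfile.repository.directory=./flowfile_repository

# Content Repository: multiple named locations reduce contention on one volume
nifi.content.repository.directory.default=/mnt/disk1/content_repository
nifi.content.repository.directory.content2=/mnt/disk2/content_repository

# Provenance Repository
nifi.provenance.repository.directory.default=./provenance_repository
```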

NiFi is also able to run within a cluster.

[Figure: NiFi cluster architecture]
Starting with NiFi 1.0, NiFi clusters employ a Zero-Master Clustering model. Each node in a NiFi cluster performs the same tasks on the data, but each node operates on a different set of data. Apache ZooKeeper elects a single node as the Cluster Coordinator, and ZooKeeper handles failover automatically. All cluster nodes report heartbeat and status information to the Cluster Coordinator, which is responsible for disconnecting and connecting nodes. In addition, every cluster has one Primary Node, also elected by ZooKeeper. We can interact with the NiFi cluster through the user interface of any node, and any changes we make are replicated to all nodes in the cluster.

4. Performance expectations and characteristics of NiFi

NiFi is designed to take full advantage of the capabilities of the underlying host system it runs on. This resource maximization is particularly evident with regard to CPU and disk.

  • For IO : The throughput or latency one can expect varies greatly depending on how the system is configured. Given that most NiFi subsystems have pluggable implementations, performance depends on the implementation chosen. But for something concrete and broadly applicable, consider the out-of-the-box default implementations: they are persistent, provide guaranteed delivery, and do so using local disk. So, as a conservative estimate, assume a read/write rate of roughly 50 MB per second on a modest disk or RAID volume in a typical server; NiFi should then be able to efficiently reach 100 MB per second or more of throughput for a large class of dataflows. This is because linear growth is expected for each physical partition and Content Repository added to NiFi, with the bottleneck instead occurring at some point in the FlowFile Repository and Provenance Repository. We plan to provide a benchmark and performance test template that would allow users to easily test their systems and determine where the bottlenecks are and why they may be bottlenecks. This template should also make it easy for system administrators to make changes and verify their impact. (Looking forward to this test capability.)

  • For CPU : The Flow Controller acts as the engine, dictating when a particular processor is given a thread to execute. Processors are written to return the thread as soon as they are done executing a task. The Flow Controller can be given a configuration value indicating the number of threads available for the various thread pools it maintains. The ideal number of threads depends on the number of host system cores, whether other services are running on the system, and the nature of the processing in the flow. For typical IO-heavy flows, it is reasonable to make many threads available.

  • For RAM : NiFi lives within the JVM and is thus limited to the memory the JVM provides. JVM garbage collection becomes a very important factor both in restricting the practical total heap size and in optimizing how well the application runs over time. NiFi jobs can be I/O-intensive when the same content is read repeatedly, so disks can be sized generously to keep performance optimal.
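The Flow Controller's role of handing threads to processors can be pictured with a plain thread pool. This is a conceptual Python sketch, not NiFi internals: a fixed pool size caps concurrency, excess tasks queue up, and each task returns its thread to the pool as soon as it finishes.

```python
from concurrent.futures import ThreadPoolExecutor

# The "Flow Controller" maintains a bounded pool; here, 4 worker threads,
# loosely analogous to configuring NiFi's available thread count.
pool = ThreadPoolExecutor(max_workers=4)

def processor_task(flowfile_id: int) -> str:
    # A processor borrows a thread, does its unit of work, and
    # releases the thread back to the pool as soon as it is done.
    return f"processed-{flowfile_id}"

# Schedule more tasks than threads; the pool queues the excess.
futures = [pool.submit(processor_task, i) for i in range(10)]
results = [f.result() for f in futures]
pool.shutdown()

print(len(results))  # 10
print(results[0])    # processed-0
```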

5. High-level overview of NiFi key features

  • Flow Management

    • Guaranteed Delivery : A core philosophy of NiFi is that delivery must be guaranteed, even at very high scale. This is achieved through the efficient use of purpose-built Write-Ahead Log and Content repositories. Together they are designed to allow very high transaction rates, efficient load distribution, copy-on-write, and take advantage of traditional disk read/writes.
    • Data Buffering w/ Back Pressure and Pressure Release : NiFi supports buffering of all queued data, as well as the ability to provide backpressure when these queues reach specified limits, or the ability to age data when it reaches a specified age (its value has expired).
    • Prioritized Queuing : NiFi allows setting one or more priority schemes for how data is retrieved from the queue. The default is FIFO, but sometimes the newest data should be fetched first (LIFO), largest data first out, or other customized schemes.
    • Flow Specific QoS (latency v throughput, loss tolerance, etc.) : There are points in a dataflow where the data is absolutely critical and loss-intolerant, and there are times when data must be processed and delivered within seconds to be of any value. NiFi enables fine-grained, flow-specific configuration of these concerns.
  • Ease of Use

    • Visual Command and Control : The processing logic and process of data flow can be very complex. Being able to visualize these processes and express them visually can greatly help users reduce the complexity of their data flows and identify where simplification is needed. NiFi can realize the visual establishment of data flow in real time. It's not "design and deployment", it's more like clay sculpture. If changes are made to the data flow, the changes take effect immediately and are fine-grained and component-isolated. Users do not need to stop an entire process or group of processes in order to make certain modifications.
    • Flow Templates : Dataflows tend to be highly pattern-oriented, and while there are often many different ways to solve a problem, being able to share those best practices helps greatly. Flow templates allow designers to build and publish their flow designs for others to benefit from and reuse.
    • Data Provenance : NiFi automatically records, indexes, and makes available provenance data as objects flow through the system, even across fan-in, fan-out, transformation, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
    • Recovery / Recording a rolling buffer of fine-grained history : NiFi's Content repository is designed to act as a rolling buffer of historical data. Data is deleted only when the Content repository ages or space is needed. The Content repository, combined with the Data provenance capability, provides a very useful foundation for enabling functionality such as content viewing, content downloading and replaying at specific points in an object's lifecycle, even across generations.
  • Security

    • System to System : A dataflow is only as good as it is secure. At every point in a dataflow, NiFi offers secure exchange through the use of protocols with encryption such as two-way SSL. In addition, NiFi's flows can encrypt and decrypt content and use shared keys or other mechanisms on either side of the sender/recipient equation to keep the data secure.
    • User to System : NiFi supports two-way SSL authentication and provides pluggable authorization so that it can properly control a user's access at specific levels (read-only, dataflow management, admin). If a user enters a sensitive property (such as a password) into the flow, it is immediately encrypted server side and never again exposed to the client (front-end UI), not even in its encrypted form (for example, once user A enters a MySQL password into the flow, no one, not even user A, can see the cleartext password again).
    • Multi-tenant Authorization : The authority level of a NiFi dataflow applies to each component, allowing admin users to have fine-grained control over access. This means each NiFi cluster is capable of handling the requirements of one or more organizations. Compared to isolated topologies, multi-tenant authorization enables a self-service model for dataflow management, allowing each team or organization to manage its flows with full awareness of the rest of the flow, to which they do not have access.
  • Extensible Architecture

    • Extension : NiFi is at its core built for extension, and as such it is a platform on which dataflow processes can execute and interact in a predictable and repeatable manner. Points of extension include: processors, Controller Services, Reporting Tasks, Prioritizers, and Custom User Interfaces.
    • Classloader Isolation : As with any component-based system, dependency problems can quickly occur. NiFi addresses this by providing a custom class loader model, ensuring that each extension bundle is exposed to a very limited set of dependencies. As a result, extensions can be built with little concern for whether they might conflict with another extension. The concept of these extension bundles is called "NiFi Archives" (NARs) and is discussed in more detail in the Developer's Guide.
    • Site-to-Site Communication Protocol : The preferred communication protocol between NiFi instances is the NiFi Site-to-Site (S2S) protocol. S2S transfers data from one NiFi instance to another easily, efficiently and securely. The NiFi client library can be easily built and bundled into other applications or devices to communicate with NiFi via the S2S protocol. S2S supports Socket protocol and HTTP/HTTPS protocol as the underlying transmission protocol, making it possible to embed the proxy server into the communication of the S2S protocol.
  • Flexible Scaling Model

    • Scale-out (Clustering) : NiFi is designed to be clusterable and horizontally scalable. If you provision a single node and configure it to process hundreds of MB of data per second, you can configure the cluster to process gigabytes of data per second. But this also brings the challenge of load balancing and failover between NiFi and the systems it obtains data from. Adopting asynchronous queuing-based protocols (such as messaging services, Kafka, etc.) can help solve these problems. Using NiFi's S2S functionality is also very effective because it is a protocol that allows NiFi and clients (including another NiFi cluster) to communicate with each other, share information about loading, and exchange specific authorized data ports.
    • Scale-up & down : NiFi is also very flexible to scale up and down. From the perspective of the NiFi framework, in terms of increasing throughput, you can increase the number of concurrent tasks on the processor under the "Scheduling" tab during configuration. This allows more threads to execute simultaneously, providing higher throughput. On the other hand, you can perfectly scale down NiFi to run on edge devices where the required footprint is small due to limited hardware resources, which can be done using MiNiFi .
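The prioritized queuing described under Flow Management above can be emulated with a priority queue. The following is an illustrative Python sketch, not NiFi's actual prioritizer API: a "largest first" scheme drains the biggest FlowFile content from a connection first.

```python
import heapq

# Queued FlowFiles represented as (name, size-in-bytes) pairs.
queued = [("small.txt", 10), ("huge.bin", 5000), ("medium.csv", 300)]

# "Largest first" prioritization: heapq is a min-heap, so push
# negated sizes to pop the largest content first.
heap = [(-size, name) for name, size in queued]
heapq.heapify(heap)

drain_order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
print(drain_order)  # ['huge.bin', 'medium.csv', 'small.txt']
```

Swapping the sort key (arrival time ascending for FIFO, descending for newest-first) yields the other schemes mentioned above.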

Origin blog.csdn.net/be_racle/article/details/133578302