Building a Python processor using Apache NiFi 2.0.0

The built-in Python processor in the latest version of Apache NiFi simplifies data processing tasks, increases flexibility, and speeds development.

Translated from "Apache NiFi 2.0.0: Building Python Processors" by Robert Kimani.

Apache NiFi is a powerful platform dedicated to data flow management, which provides many features designed to increase the efficiency and flexibility of data processing. Its web-based user interface provides a seamless experience for designing, controlling and monitoring data flows.

NiFi supports building custom processors and extensions, allowing users to tailor the platform to their specific needs.

With a multi-tenant user experience, NiFi ensures that multiple users can interact with the system simultaneously, each with their own set of access rights.

The Python processor provides a powerful way to extend the functionality of NiFi, allowing users to leverage the rich ecosystem of Python libraries and tools in their data flows. Here, we discuss the benefits of incorporating Python into NiFi workflows and explore real-world use cases where Python processors can simplify data processing tasks, increase flexibility, and speed development.

Whether you want to integrate machine learning algorithms, perform custom data transformations, or interact with external systems, building a Python processor in Apache NiFi can help you meet these data integration needs.

A standout feature of NiFi is its highly configurable nature, allowing users to tailor data routing, transformation and system mediation logic to their specific requirements. NiFi helps users achieve the data processing results they want, such as prioritizing fault tolerance over guaranteed delivery, or optimizing for low latency over high throughput.

Dynamic prioritization allows real-time adjustments to the priority of data in a stream, while the ability to modify streams at runtime adds a layer of flexibility to adapt to changing needs. NiFi also incorporates a backpressure mechanism to regulate data flow rates and prevent overload, ensuring smooth and efficient operation even under varying workloads.

NiFi is designed to support both vertical and horizontal scaling. Whether scaling to leverage the full power of a single machine or using a zero-leader cluster model, NiFi can adapt to data processing tasks of any size.

Data provenance is another key feature that allows users to track the journey of data from its origin to its final destination. This provides valuable insights for auditing, troubleshooting and ensuring data integrity throughout the process.

Security is paramount in NiFi, which supports SSL, SSH, HTTPS and encrypted content, among other security measures. Pluggable, fine-grained role-based authentication and authorization mechanisms ensure that access to data flows is carefully controlled, allowing multiple teams to securely manage and share specific parts of the flow.

NiFi's design philosophy, inspired by concepts such as flow-based programming and staged event-driven architecture, offers several compelling advantages:

  • Intuitive visual interface for designing and managing data flows, improving productivity and ease of use.
  • Asynchronous processing model that supports high throughput and natural buffering to accommodate fluctuating loads.
  • Built-in concurrency management abstracts the complexity of multi-threaded programming.
  • Emphasis on component reusability and testability, promoting modular and robust design methods.
  • Native support for backpressure and error handling ensures robustness and reliability in data processing pipelines.
  • Comprehensive visibility into data flow dynamics, enabling effective monitoring and troubleshooting.

Why use Python processors in Apache NiFi?

Apache NiFi is a powerful tool for data ingestion, transformation and routing. The Python processor in NiFi provides a flexible way to extend its functionality, especially for processing unstructured data or integrating with external systems such as AI models or vector stores like the cloud-native vector database Milvus.

When dealing with the unstructured file types that tools like Cloudera Data Flow can extract, Python processors are critical for implementing custom logic to parse and manipulate the data. For example, you can use Python to extract specific information from text files, perform sentiment analysis on text data, or preprocess images before further analysis.
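Such custom parsing logic often boils down to a few regular expressions. As a hypothetical sketch (the function and field names are invented for illustration, not taken from the article), here is the kind of extraction code that could run inside a Python processor:

```python
import re

def extract_invoice_fields(text: str) -> dict:
    """Pull a couple of structured fields out of free-form text with regexes."""
    invoice = re.search(r"Invoice\s*#?(\d+)", text)
    total = re.search(r"Total:\s*\$(\d+(?:\.\d+)?)", text)
    return {
        "invoice_id": invoice.group(1) if invoice else None,
        "total": float(total.group(1)) if total else None,
    }

fields = extract_invoice_fields("Invoice #1042 for services rendered. Total: $99.50")
print(fields)  # {'invoice_id': '1042', 'total': 99.5}
```

The same pattern generalizes to sentiment keywords, log fields, or any other semi-structured text a flow file may carry.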

On the other hand, structured file types can often be handled by NiFi's built-in processors without the need for custom Python code. NiFi provides a wide range of processors for structured data formats such as CSV, JSON, and Avro, and for interacting with databases, APIs, and other enterprise systems.
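The conversions those built-in processors handle are largely mechanical. As a purely illustrative sketch in plain Python (not how NiFi implements it), converting CSV rows to JSON records looks like this:

```python
import csv
import io
import json

def csv_to_json(csv_text: str) -> str:
    """Convert CSV text into a JSON array of row objects, keyed by the header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

print(csv_to_json("name,age\nalice,30\nbob,25"))
# [{"name": "alice", "age": "30"}, {"name": "bob", "age": "25"}]
```

In a real flow you would simply wire up the appropriate record-oriented processor instead of writing this by hand.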

When you need to interact with AI models or other external systems such as Milvus, the Python processor provides a convenient way to integrate this functionality into your NiFi data flow. For tasks such as text-to-text, text-to-image, or text-to-speech processing, you can write Python code to interact with the relevant model or service and incorporate this processing into your NiFi pipeline.

Python: A new era in NiFi 2.0.0

Apache NiFi 2.0.0 brings some major improvements to the platform, especially in terms of Python integration and performance enhancements. The ability to seamlessly integrate Python scripts into NiFi data flows opens up a wide range of possibilities for working with a variety of data sources and leveraging the power of generative AI.

Prior to this release, while it was possible to use Python in NiFi, flexibility was limited and executing Python scripts was not as streamlined as users would have liked. The latest version greatly improves Python integration, allowing more seamless execution of Python code in NiFi pipelines.

In addition, support for JDK 21+ brings performance improvements that make NiFi faster and more efficient, especially when handling multi-threaded tasks. This can significantly improve the scalability and responsiveness of NiFi data flows, particularly when processing large volumes of data or complex processing tasks.

Introducing features such as process groups as stateless operations and a rules engine for development assistance further enhance the functionality and usability of NiFi, giving developers more flexibility and tools to build powerful data flow pipelines.

An example processor: calling WatsonX AI via the Watson SDK

This Python code defines a NiFi processor called CallWatsonXAI that interacts with the IBM WatsonX AI service to generate responses based on input prompts. Please note that NiFi 2.0.0 requires Python 3.10 or later.

Let's break down the code and explain the various parts.

Imports

import json
import re
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
from nifiapi.properties import PropertyDescriptor, StandardValidators, ExpressionLanguageScope

The following are the necessary imports for the script:

  • json and re are Python's built-in modules for processing JSON data and regular expressions, respectively.
  • FlowFileTransform and FlowFileTransformResult are classes from the nifiapi.flowfiletransform module of the NiFi Python API, which handles flow file transformation.
  • PropertyDescriptor, StandardValidators, and ExpressionLanguageScope are classes from the nifiapi.properties module, used to define processor properties.

Class definition

class CallWatsonXAI(FlowFileTransform):
    ...
  • This defines a class called CallWatsonXAI that extends FlowFileTransform, the base class for processors that transform flow file content in NiFi.

Processor details

class ProcessorDetails:
    version = '2.0.0-M2'
    description = 'Calls IBM WatsonX AI service to generate responses based on input prompts.'
    tags = ['watsonx', 'ai', 'response', 'generation']
  • Defines processor metadata such as the version, description, and tags via the ProcessorDetails inner class that the NiFi Python API expects. Please note that 2.0.0-M2 was the current milestone release at the time of writing.

Property descriptors

PROMPT_TEXT = PropertyDescriptor(
    name="Prompt Text",
    description="Specifies the text (including the full prompt with context) to send",
    required=True,
    validators=[StandardValidators.NON_EMPTY_VALIDATOR],
    expression_language_scope=ExpressionLanguageScope.FLOWFILE_ATTRIBUTES
)
  • Defines the properties that can be configured on this processor. In this case there are PROMPT_TEXT, WATSONXAI_API_KEY, and WATSONXAI_PROJECT_ID.

Constructor

def __init__(self, **kwargs):
    super().__init__()
    self.property_descriptors.append(self.PROMPT_TEXT)
    self.property_descriptors.append(self.WATSONXAI_API_KEY)
    self.property_descriptors.append(self.WATSONXAI_PROJECT_ID)
  • Initializes the processor and appends the property descriptors to the property list.

get_property_descriptors method

def get_property_descriptors(self):
    return self.property_descriptors
  • This method is required by the NiFi framework to obtain the list of supported property descriptors.

transform method

def transform(self, context, flowfile):
    ...
  • This method performs the actual processing. It receives a context object, which provides access to the processor's configuration and execution environment, and a flowfile object containing the data to be processed.
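The flow file's contents are typically read with getContentsAsBytes(). To show the shape of this contract outside NiFi, here is a minimal stand-in; the FakeFlowFile class is invented scaffolding for illustration, while the method name follows the NiFi Python API:

```python
class FakeFlowFile:
    """Stand-in for NiFi's flow file object, exposing contents as bytes."""
    def __init__(self, contents: bytes):
        self._contents = contents

    def getContentsAsBytes(self) -> bytes:
        return self._contents

def transform_contents(flowfile) -> str:
    """Mirror of a transform body: read the incoming contents, produce new contents."""
    text = flowfile.getContentsAsBytes().decode("utf-8")
    return text.upper()

print(transform_contents(FakeFlowFile(b"hello nifi")))  # HELLO NIFI
```

Inside a real processor, the returned contents would be handed back to NiFi via a FlowFileTransformResult rather than returned directly.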

IBM WatsonX integration

from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes
from ibm_watson_machine_learning.foundation_models import Model
  • Import the IBM Watson Machine Learning modules.
prompt_text = context.getProperty(self.PROMPT_TEXT).evaluateAttributeExpressions(flowfile).getValue()
watsonx_api_key = context.getProperty(self.WATSONXAI_API_KEY).evaluateAttributeExpressions(flowfile).getValue()
project_id = context.getProperty(self.WATSONXAI_PROJECT_ID).evaluateAttributeExpressions(flowfile).getValue()

Get input values such as the prompt text, WatsonX API key, and project ID from the NiFi processor properties, evaluating expression language against the flow file's attributes.

model_id = ModelTypes.LLAMA_2_70B_CHAT
gen_parms = None
space_id = None
verify = False

# Credentials for the WatsonX service: a dict with the service URL and API key.
# The region URL here is an assumed example; use your own WatsonX endpoint.
my_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": watsonx_api_key
}

model = Model(model_id, my_credentials, gen_parms, project_id, space_id, verify)
gen_parms_override = None
generated_response = model.generate(prompt_text, gen_parms_override)
  • Configure and call the IBM WatsonX module to generate a response based on the prompt text.

Output processing

attributes = {"mime.type": "application/json"}
output_contents = json.dumps(generated_response)
  • Set the mime.type attribute and serialize the generated response to JSON for output.

Logging and returning the result

self.logger.debug(f"Prompt: {prompt_text}")
  • Log the prompt text at debug level.
return FlowFileTransformResult(relationship="success", contents=output_contents, attributes=attributes)

Returns the transform result, routing the flow file to the success relationship and providing the output contents and attributes.

Prepackaged Python processor

NiFi 2.0.0 ships with a diverse set of Python processors that provide a wide range of functionality.

  • Pinecone's VectorDB interface : This processor facilitates interaction with Pinecone , a vector database service, allowing users to query and store data efficiently.
  • ChunkDocument : This processor breaks large documents into smaller chunks, making them suitable for processing and storage, especially in vector databases where size restrictions may apply.
  • ParseDocument : This versatile processor parses various document formats, such as Markdown, PowerPoint, Google Docs, and Excel, extracting text content for further processing or storage.
  • ConvertCSVtoExcel : As the name suggests, this processor converts data from CSV format to Excel format, providing flexibility for data exchange and processing.
  • DetectObjectInImage : This processor uses deep learning techniques for object detection in images, enabling users to analyze image data and extract valuable insights.
  • PromptChatGPT : This processor integrates with ChatGPT or similar conversational AI models, enabling users to generate responses or hold conversations based on prompts.
  • PutChroma and QueryChroma : These processors are related to Chroma , an open source database for large language models (LLMs). They facilitate data storage (PutChroma) and retrieval/query (QueryChroma) in a Chroma database or similar system.
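To make the chunking idea concrete, here is a hypothetical sliding-window chunker in plain Python; it illustrates the concept only and is not ChunkDocument's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into fixed-size chunks with a small overlap between neighbors."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

parts = chunk_text("x" * 500)
print([len(p) for p in parts])  # [200, 200, 140]
```

The overlap preserves context across chunk boundaries, which matters when the chunks are later embedded and stored in a vector database with size limits.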

Conclusion

Prioritizing Python integration in Apache NiFi marks an important milestone in bridging the gap between data engineers and data scientists, while expanding the platform's versatility and applicability.

By enabling Python enthusiasts to seamlessly develop NiFi components in Python, the development cycle is simplified, accelerating the implementation of data pipelines and workflows.

It's an exciting time for Python processors in NiFi, and contributing to the ecosystem can be very valuable. Developing and sharing Python processors can extend NiFi's functionality and solve specific use cases.

To get started with NiFi, users can refer to the Quick Start Guide for development and the NiFi Developer Guide for more comprehensive information on how to contribute to the project.

This article was first published on Yunyunzhongsheng ( https://yylives.cc/ ).

Origin: my.oschina.net/u/6919515/blog/11059170