Integration of ODPS and Kettle

Scenario overview

Traditional data management vendors (those not built on cloud computing) often use Oracle as the data warehouse storage and Kettle as the ETL and job-scheduling tool. Relying on Oracle's stability and performance and on Kettle's flexibility, this traditional architecture can handle a wide range of complex scenarios. Its data-governance structure looks roughly like this:

Traditional architecture

As cloud computing technology matures and spreads, traditional architectures are slowly fading out of the market, but during project delivery it is inevitable to encounter scenarios that combine cloud computing with traditional vendor tooling. For example, when we use Alibaba Cloud's DataWorks, which integrates the cloud data warehouse ODPS and the offline synchronization tool DataX, the overall architecture becomes:

Converged architecture

This article explains how to keep DataX jobs and Kettle jobs synchronized in the scenario above.

Implementation steps

1. Create a virtual node (vn_root) in DataWorks and set it to the "paused" state (a paused instance is converted to the "failed" state when its scheduled execution time arrives), then configure this node as the upstream of every DataX data integration task;

2. Encapsulate the DataWorks API in Java in a class named ResumeTask and package it as a jar (active_vn_root.jar). The code implements the following process to activate the DataX jobs:

Activation process
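The article does not show the internals of ResumeTask. As a rough illustration of the activation flow in the figure, here is a minimal Java sketch with the actual DataWorks OpenAPI calls hidden behind a hypothetical DataWorksClient interface. All names in this interface are illustrative assumptions, not the real Alibaba Cloud SDK, and return codes 4–6 and 9 from the table below are omitted for brevity:

```java
import java.util.List;

/* Hypothetical abstraction over the DataWorks OpenAPI; a real
 * ResumeTask would issue the corresponding HTTP/SDK calls here. */
interface DataWorksClient {
    List<Long> findInstanceIds(long nodeId); // today's instances of the node
    boolean resume(long instanceId);         // move an instance out of "paused"
    boolean isFailed(long instanceId);       // is the instance in "failed" state?
    boolean restart(long instanceId);        // rerun a failed instance
}

public class ResumeTaskSketch {
    /** Return codes mirror the ones documented in the Kettle step below. */
    public static int doResumeTask(DataWorksClient client, long nodeId) {
        List<Long> ids;
        try {
            ids = client.findInstanceIds(nodeId);
        } catch (RuntimeException e) {
            return -1;                          // other error (e.g. network)
        }
        if (ids.isEmpty())   return 1;          // no matching instance found
        if (ids.size() > 1)  return 2;          // multiple matching instances
        long id = ids.get(0);
        if (!client.resume(id))   return 3;     // resume failed
        if (!client.isFailed(id)) return 8;     // resumed; not failed, no rerun needed
        if (!client.restart(id))  return 7;     // resumed, but rerun failed
        return 0;                               // success
    }
}
```

The key design point is that resuming the paused vn_root instance turns it into a failed instance, which must then be rerun so that its downstream DataX tasks are triggered.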

3. Put active_vn_root.jar into the data-integration\lib directory, then restart Kettle;

4. Add a Java code step to the original business process. The source code is as follows:

/* Reference the method packaged in the jar */
import dataworks.ResumeTask;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {

    Object[] r = getRow();
    if (r == null) {
        setOutputDone();
        return false;
    }

    /* Only process the first row */
    if (first) {
        first = false;
    } else {
        setOutputDone();
        return false;
    }

    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());

    /* Read the signal from the upstream step */
    String signal = get(Fields.In, "signal").getString(r);

    /* Handle the signal */
    if (signal.compareTo("1") == 0) {
        logBasic("signal is 1");
        /* Node id 1234 is a fixed value and does not change */
        int ret = ResumeTask.doResumeTask(1234);
        /** Meaning of the return codes:
         *    0  success;
         *   -1  other error (e.g. network error);
         *    1  query error: no matching instance found;
         *    2  query error: multiple matching instances found;
         *    3  resume failed;
         *    4  resume failed: the task is a one-off task;
         *    5  resume failed: the task is a dry-run task;
         *    6  resume failed: unexpected API return value;
         *    7  instance resumed, but rerun failed;
         *    8  instance resumed, but rerun skipped: the task is not a failed task and does not need a rerun;
         *    9  instance resumed, but rerun failed: unexpected API return value;
         */
        logBasic("return code is " + ret);
        if (ret == 0) {
            logBasic("active root node successfully");
            get(Fields.Out, "result").setValue(outputRow, "success");
        } else {
            String err_msg = "UNKNOWN ERROR";
            switch (ret) {
                case -1: err_msg = "other error (e.g. network error)"; break;
                case  1: err_msg = "query error: no matching instance found"; break;
                case  2: err_msg = "query error: multiple matching instances found"; break;
                case  3: err_msg = "resume failed"; break;
                case  4: err_msg = "resume failed: the task is a one-off task"; break;
                case  5: err_msg = "resume failed: the task is a dry-run task"; break;
                case  6: err_msg = "resume failed: unexpected API return value"; break;
                case  7: err_msg = "instance resumed, but rerun failed"; break;
                case  8: err_msg = "instance resumed, but rerun skipped: not a failed task, no rerun needed"; break;
                case  9: err_msg = "instance resumed, but rerun failed: unexpected API return value"; break;
            }
            logBasic("active root node failed: " + err_msg);
            get(Fields.Out, "result").setValue(outputRow, "fail");
        }
    } else {
        logBasic("signal is not 1, will do nothing: unexpected signal received");
        get(Fields.Out, "result").setValue(outputRow, "fail");
    }
    putRow(data.outputRowMeta, outputRow);
    setOutputDone();
    return false;
}

5. Since the DataWorks API call is asynchronous (it returns immediately after the call, without waiting for the task to finish executing), there is no need to worry that this step will block the execution of the overall process.

Origin blog.csdn.net/ManWZD/article/details/110679589