[Real-time data warehouse] Dimension table association optimization for the DWM order wide table -- asynchronous queries

I. DWM layer - order wide table

1 Dimension table association code implementation

(1) Optimization 2: asynchronous queries

In Flink stream processing it is often necessary to interact with external systems, using dimension tables to enrich the fields of the fact table.

For example, in an e-commerce scenario, a commodity's sku id needs to be associated with attributes of the commodity, such as the industry it belongs to, its manufacturer, and information about that manufacturer; in a logistics scenario, a package id needs to be associated with the package's industry attributes, shipping information, receiving information, and so on.

By default, in Flink's MapFunction a single parallel subtask can only interact with external storage synchronously: send a request, block on the IO, wait for the response, then send the next request. This synchronous interaction spends much of its time waiting on the network. Processing efficiency can be improved by increasing the parallelism of the MapFunction, but higher parallelism means more resources, so this is not a very good solution.

Flink introduced Async I/O in version 1.2. In asynchronous mode, the IO operations are made asynchronous: a single parallel subtask can send multiple requests in succession, and whichever response returns first is processed first, so consecutive requests do not block waiting for each other. This greatly improves stream processing efficiency.

Async I/O is a very popular feature contributed to the community by Alibaba. It solves the problem of network latency becoming the system bottleneck when interacting with external systems.


An asynchronous query actually hands the dimension table lookup off to a separate thread pool, so processing is not blocked by any single query, and a single parallel subtask can issue multiple requests in a row, improving concurrency.

This approach is aimed especially at operations involving network IO and reduces the time wasted waiting for requests.

**Prerequisite.** Properly implementing asynchronous I/O interactions with a database (or key/value store) requires a database client that supports asynchronous requests. Many mainstream databases provide such clients. If no such client exists, a synchronous client can be turned into a limited-concurrency client by creating multiple client instances and handling the synchronous calls with a thread pool. However, this approach is usually less efficient than a proper asynchronous client.
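
As a minimal sketch of that fallback (not project code; the blockingLookup function stands in for whatever synchronous client call is actually used), a blocking lookup can be wrapped behind a bounded thread pool so callers receive a future instead of blocking:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public class LimitedConcurrencyClient {

    // Bounded pool caps how many blocking lookups run at once
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Stand-in for a real blocking client call (e.g. a JDBC/Phoenix query)
    private final Function<String, String> blockingLookup;

    public LimitedConcurrencyClient(Function<String, String> blockingLookup) {
        this.blockingLookup = blockingLookup;
    }

    /** Wraps the synchronous lookup so callers get a future instead of blocking. */
    public CompletableFuture<String> lookupAsync(String key) {
        return CompletableFuture.supplyAsync(() -> blockingLookup.apply(key), pool);
    }
}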

a Encapsulate the thread pool utility class

package com.hzy.gmall.realtime.utils;

import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Thread pool utility class
 */
public class ThreadPoolUtil {

    // volatile so the double-checked locking below is safe
    private static volatile ThreadPoolExecutor pool;

//    int corePoolSize,         initial number of threads
//    int maximumPoolSize,      maximum number of threads
//    long keepAliveTime,       idle threads beyond corePoolSize are destroyed after keepAliveTime
//    TimeUnit unit,            time unit for keepAliveTime
//    BlockingQueue<Runnable> workQueue     queue of tasks waiting to be executed
    public static ThreadPoolExecutor getInstance(){
        // lazy singleton created with double-checked locking
        if (pool == null){
            synchronized (ThreadPoolUtil.class){
                if (pool == null){
                    System.out.println("Creating thread pool...");
                    pool = new ThreadPoolExecutor(
                            4, 20, 300, TimeUnit.SECONDS,
                            new LinkedBlockingDeque<Runnable>(Integer.MAX_VALUE)
                    );
                }
            }
        }
        return pool;
    }
}
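
A quick usage sketch, assuming this hypothetical demo class (not part of the project): obtain the shared pool and submit a task to it.

import com.hzy.gmall.realtime.utils.ThreadPoolUtil;

import java.util.concurrent.ThreadPoolExecutor;

public class ThreadPoolUtilDemo {
    public static void main(String[] args) {
        // Repeated calls return the same lazily-created singleton pool
        ThreadPoolExecutor pool = ThreadPoolUtil.getInstance();
        // Any Runnable submitted here runs on one of the pool's worker threads
        pool.submit(() -> System.out.println("a dimension lookup would run here"));
        // Shut down only in this demo; the streaming job keeps the pool alive
        pool.shutdown();
    }
}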

b Encapsulate the asynchronous dimension query function class DimAsyncFunction

This class extends Flink's asynchronous function class RichAsyncFunction and implements the custom dimension query interface (DimJoinFunction, defined in section c below).

RichAsyncFunction<IN,OUT> is the asynchronous function class provided by Flink. Here, because the input type of the query operation is the same as the output type, the type parameters are <T,T>.

The DimAsyncFunction class overrides two methods:

open, which initializes the thread pool used for the asynchronous queries.

asyncInvoke, the core method; the work inside it must be asynchronous. If the queried database offers an asynchronous API, that can be used directly; if not, a thread pool is needed to make the query asynchronous.

package com.hzy.gmall.realtime.app.fun;

import com.alibaba.fastjson.JSONObject;
import com.hzy.gmall.realtime.utils.DimUtil;
import com.hzy.gmall.realtime.utils.ThreadPoolUtil;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.concurrent.ExecutorService;

/**
 * Asynchronous dimension association.
 * Template method design pattern: the parent class defines the skeleton of the core algorithm
 * and defers the concrete steps to subclasses; each subclass provides its own implementation
 * without changing the parent's skeleton.
 */
public abstract class DimAsyncFunction<T> extends RichAsyncFunction<T, T> implements DimJoinFunction<T> {

    private ExecutorService executorService;
    private String tableName;

    public DimAsyncFunction(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // Obtain the shared thread pool
        executorService = ThreadPoolUtil.getInstance();
    }

    // Sends an asynchronous request and completes the dimension association.
    // The request is made asynchronous by running it on a pool thread.
    // This method is invoked once for every element of the stream.
    @Override
    public void asyncInvoke(T obj, ResultFuture<T> resultFuture) throws Exception {
        // Hand the lookup to a thread from the pool
        executorService.submit(new Runnable() {
            // The code in run() performs the actual asynchronous dimension association
            @Override
            public void run() {
                try {
                    long start = System.currentTimeMillis();
                    // Get the dimension key from the stream object
                    String key = getKey(obj);
                    // Look up the dimension object in the dimension table by key
                    JSONObject dimJsonObj = DimUtil.getDimInfo(tableName, key);
                    // Copy the dimension attributes onto the stream object (dimension association)
                    if (dimJsonObj != null) {
                        join(obj, dimJsonObj);
                    }
                    long end = System.currentTimeMillis();
                    System.out.println("Asynchronous dimension query took " + (end - start) + " ms");
                    resultFuture.complete(Collections.singleton(obj));
                } catch (Exception e) {
                    e.printStackTrace();
                    System.out.println("Exception during asynchronous dimension query...");
                    // Propagate the failure so the async operator does not wait until the timeout
                    resultFuture.completeExceptionally(e);
                }
            }
        });
    }
}
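
A concrete subclass (for example, the anonymous subclass created in OrderWideApp in section d below) only needs to implement the join and getKey methods declared by DimJoinFunction; the asynchronous plumbing above is inherited unchanged.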

c Custom dimension query interface DimJoinFunction

This asynchronous dimension query approach works for any dimension table; the caller defines which key to query with and how to merge the returned result into the stream object.

For this purpose an interface DimJoinFunction<T> is defined, containing two methods.

package com.hzy.gmall.realtime.app.fun;

import com.alibaba.fastjson.JSONObject;

// Interface for dimension association queries
public interface DimJoinFunction<T> {

    /**
     * Defines how the query result is merged into the stream object
     * @param obj        the stream object
     * @param dimJsonObj the result of the asynchronous query
     * @throws Exception
     */
    void join(T obj, JSONObject dimJsonObj) throws Exception;

    /**
     * Defines how the dimension key is extracted from the stream object
     * @param obj the stream object
     */
    String getKey(T obj);
}

d Use DimAsyncFunction

The core class is AsyncDataStream, which provides two methods: orderedWait (ordered waiting) and unorderedWait (unordered waiting).

  • unorderedWait (unordered waiting)

If a later record's asynchronous query returns before an earlier record's, it is emitted first. Performance is better, but the output may be out of order.

  • orderedWait (ordered waiting)

The original arrival order is strictly preserved, so even if a later record's query completes first, it must wait for the earlier records; performance is therefore worse. A small sketch of this variant follows this list.

  • Notes
    • Here the user dimension table is queried, so the join method (which assembles the query result into the stream object) and the getKey method (which supplies the query rowkey) must be overridden.
    • The last two arguments of the method, 60 and TimeUnit.SECONDS, mean the asynchronous query may take at most 60 seconds; otherwise a timeout exception is reported.
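
For comparison, a hedged sketch of the ordered variant (reusing the project's DimAsyncFunction and OrderWide types; variable names are illustrative) only changes the AsyncDataStream call:

        // Same arguments as unorderedWait, but output order matches input order,
        // at the cost of head-of-line waiting when a later lookup finishes first.
        SingleOutputStreamOperator<OrderWide> orderWideWithUserOrderedDS = AsyncDataStream.orderedWait(
                orderWideDS,
                new DimAsyncFunction<OrderWide>("DIM_USER_INFO") {
                    @Override
                    public void join(OrderWide orderWide, JSONObject dimJsonObj) {
                        orderWide.setUser_gender(dimJsonObj.getString("GENDER"));
                    }

                    @Override
                    public String getKey(OrderWide orderWide) {
                        return orderWide.getUser_id().toString();
                    }
                },
                60,
                TimeUnit.SECONDS
        );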
Source code for associating the user dimension

Associate data of different dimensions in OrderWideApp.

        // TODO 8 Associate with the user dimension
        // The following (synchronous) approach is inefficient
//        orderWideDS.map(
//                new MapFunction<OrderWide, OrderWide>() {
//                    @Override
//                    public OrderWide map(OrderWide orderWide) throws Exception {
//                        Long user_id = orderWide.getUser_id();
//                        JSONObject userDimInfo = DimUtil.getDimInfo("dim_user_info", user_id.toString());
//                        String gender = userDimInfo.getString("GENDER");
//                        orderWide.setUser_gender(gender);
//                        return orderWide;
//                    }
//                }
//        );
        // Asynchronous association with the user dimension table
        SingleOutputStreamOperator<OrderWide> orderWideWithUserDS = AsyncDataStream.unorderedWait(
                orderWideDS,
                // dynamic binding
                new DimAsyncFunction<OrderWide>("DIM_USER_INFO") {
                    @Override
                    public void join(OrderWide orderWide, JSONObject dimJsonObj) throws ParseException {
                        String gender = dimJsonObj.getString("GENDER");
                        orderWide.setUser_gender(gender);

                        // e.g. 2000-10-02
                        String birthday = dimJsonObj.getString("BIRTHDAY");
                        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
                        Date birthdayDate = sdf.parse(birthday);
                        long diffMs = System.currentTimeMillis() - birthdayDate.getTime();
                        // approximate age in years
                        long ageLong = diffMs / 1000L / 60L / 60L / 24L / 365L;
                        orderWide.setUser_age((int) ageLong);
                    }

                    @Override
                    public String getKey(OrderWide orderWide) {
                        return orderWide.getUser_id().toString();
                    }
                },
                60,
                TimeUnit.SECONDS
        );

        orderWideWithUserDS.print(">>>");
Test

The overall business process is as follows:


Start ZooKeeper, Kafka, and HDFS; wait for HDFS to leave safe mode, then start HBase, Maxwell, and Phoenix.

Configuration table

Add the configuration for the user table to the configuration table. To import everything in one batch, delete the existing configuration rows and execute the following statements:

INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('activity_info', 'insert', 'hbase', 'dim_activity_info', 'id,activity_name,activity_type,activity_desc,start_time,end_time,create_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('activity_info', 'update', 'hbase', 'dim_activity_info', 'id,activity_name,activity_type,activity_desc,start_time,end_time,create_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('activity_rule', 'insert', 'hbase', 'dim_activity_rule', 'id,activity_id,activity_type,condition_amount,condition_num,benefit_amount,benefit_discount,benefit_level', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('activity_rule', 'update', 'hbase', 'dim_activity_rule', 'id,activity_id,activity_type,condition_amount,condition_num,benefit_amount,benefit_discount,benefit_level', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('activity_sku', 'insert', 'hbase', 'dim_activity_sku', 'id,activity_id,sku_id,create_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('activity_sku', 'update', 'hbase', 'dim_activity_sku', 'id,activity_id,sku_id,create_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_category1', 'insert', 'hbase', 'dim_base_category1', 'id,name', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_category1', 'update', 'hbase', 'dim_base_category1', 'id,name', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_category2', 'insert', 'hbase', 'dim_base_category2', 'id,name,category1_id', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_category2', 'update', 'hbase', 'dim_base_category2', 'id,name,category1_id', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_category3', 'insert', 'hbase', 'dim_base_category3', 'id,name,category2_id', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_category3', 'update', 'hbase', 'dim_base_category3', 'id,name,category2_id', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_dic', 'insert', 'hbase', 'dim_base_dic', 'id,dic_name,parent_code,create_time,operate_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_dic', 'update', 'hbase', 'dim_base_dic', 'id,dic_name,parent_code,create_time,operate_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_province', 'insert', 'hbase', 'dim_base_province', 'id,name,region_id,area_code,iso_code,iso_3166_2', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_province', 'update', 'hbase', 'dim_base_province', 'id,name,region_id,area_code,iso_code,iso_3166_2', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_region', 'insert', 'hbase', 'dim_base_region', 'id,region_name', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_region', 'update', 'hbase', 'dim_base_region', 'id,region_name', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_trademark', 'insert', 'hbase', 'dim_base_trademark', 'id,tm_name', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('base_trademark', 'update', 'hbase', 'dim_base_trademark', 'id,tm_name', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('cart_info', 'insert', 'kafka', 'dwd_cart_info', 'id,user_id,sku_id,cart_price,sku_num,img_url,sku_name,is_checked,create_time,operate_time,is_ordered,order_time,source_type,source_id', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('comment_info', 'insert', 'kafka', 'dwd_comment_info', 'id,user_id,nick_name,head_img,sku_id,spu_id,order_id,appraise,comment_txt,create_time,operate_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('coupon_info', 'insert', 'hbase', 'dim_coupon_info', 'id,coupon_name,coupon_type,condition_amount,condition_num,activity_id,benefit_amount,benefit_discount,create_time,range_type,limit_num,taken_count,start_time,end_time,operate_time,expire_time,range_desc', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('coupon_info', 'update', 'hbase', 'dim_coupon_info', 'id,coupon_name,coupon_type,condition_amount,condition_num,activity_id,benefit_amount,benefit_discount,create_time,range_type,limit_num,taken_count,start_time,end_time,operate_time,expire_time,range_desc', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('coupon_range', 'insert', 'hbase', 'dim_coupon_range', 'id,coupon_id,range_type,range_id', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('coupon_range', 'update', 'hbase', 'dim_coupon_range', 'id,coupon_id,range_type,range_id', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('coupon_use', 'insert', 'kafka', 'dwd_coupon_use', 'id,coupon_id,user_id,order_id,coupon_status,get_type,get_time,using_time,used_time,expire_time', 'id', ' SALT_BUCKETS = 3');
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('coupon_use', 'update', 'kafka', 'dwd_coupon_use', 'id,coupon_id,user_id,order_id,coupon_status,get_type,get_time,using_time,used_time,expire_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('favor_info', 'insert', 'kafka', 'dwd_favor_info', 'id,user_id,sku_id,spu_id,is_cancel,create_time,cancel_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('financial_sku_cost', 'insert', 'hbase', 'dim_financial_sku_cost', 'id,sku_id,sku_name,busi_date,is_lastest,sku_cost,create_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('financial_sku_cost', 'update', 'hbase', 'dim_financial_sku_cost', 'id,sku_id,sku_name,busi_date,is_lastest,sku_cost,create_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_detail', 'insert', 'kafka', 'dwd_order_detail', 'id,order_id,sku_id,sku_name,order_price,sku_num,create_time,source_type,source_id,split_activity_amount,split_coupon_amount,split_total_amount', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_detail_activity', 'insert', 'kafka', 'dwd_order_detail_activity', 'id,order_id,order_detail_id,activity_id,activity_rule_id,sku_id,create_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_detail_coupon', 'insert', 'kafka', 'dwd_order_detail_coupon', 'id,order_id,order_detail_id,coupon_id,coupon_use_id,sku_id,create_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_info', 'insert', 'kafka', 'dwd_order_info', 'id,consignee,consignee_tel,total_amount,order_status,user_id,payment_way,delivery_address,order_comment,out_trade_no,trade_body,create_time,operate_time,expire_time,process_status,tracking_no,parent_order_id,img_url,province_id,activity_reduce_amount,coupon_reduce_amount,original_total_amount,feight_fee,feight_fee_reduce,refundable_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_info', 'update', 'kafka', 'dwd_order_info_update', 'id,consignee,consignee_tel,total_amount,order_status,user_id,payment_way,delivery_address,order_comment,out_trade_no,trade_body,create_time,operate_time,expire_time,process_status,tracking_no,parent_order_id,img_url,province_id,activity_reduce_amount,coupon_reduce_amount,original_total_amount,feight_fee,feight_fee_reduce,refundable_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('order_refund_info', 'insert', 'kafka', 'dwd_order_refund_info', 'id,user_id,order_id,sku_id,refund_type,refund_num,refund_amount,refund_reason_type,refund_reason_txt,refund_status,create_time', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('payment_info', 'insert', 'kafka', 'dwd_payment_info', 'id,out_trade_no,order_id,user_id,payment_type,trade_no,total_amount,subject,payment_status,create_time,callback_time,callback_content', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('payment_info', 'update', 'kafka', 'dwd_payment_info', 'id,out_trade_no,order_id,user_id,payment_type,trade_no,total_amount,subject,payment_status,create_time,callback_time,callback_content', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('refund_payment', 'insert', 'kafka', 'dwd_refund_payment', 'id,out_trade_no,order_id,sku_id,payment_type,trade_no,total_amount,subject,refund_status,create_time,callback_time,callback_content', 'id', NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('refund_payment', 'update', 'kafka', 'dwd_refund_payment', 'id,out_trade_no,order_id,sku_id,payment_type,trade_no,total_amount,subject,refund_status,create_time,callback_time,callback_content', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('sku_info', 'insert', 'hbase', 'dim_sku_info', 'id,spu_id,price,sku_name,sku_desc,weight,tm_id,category3_id,sku_default_img,is_sale,create_time', 'id', ' SALT_BUCKETS = 4');
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('sku_info', 'update', 'hbase', 'dim_sku_info', 'id,spu_id,price,sku_name,sku_desc,weight,tm_id,category3_id,sku_default_img,is_sale,create_time', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('spu_info', 'insert', 'hbase', 'dim_spu_info', 'id,spu_name,description,category3_id,tm_id', 'id', ' SALT_BUCKETS = 3');
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('spu_info', 'update', 'hbase', 'dim_spu_info', 'id,spu_name,description,category3_id,tm_id', NULL, NULL);
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('user_info', 'insert', 'hbase', 'dim_user_info', 'id,login_name,name,user_level,birthday,gender,create_time,operate_time', 'id', ' SALT_BUCKETS = 3');
INSERT INTO `table_process`(`source_table`, `operate_type`, `sink_type`, `sink_table`, `sink_columns`, `sink_pk`, `sink_extend`) VALUES ('user_info', 'update', 'hbase', 'dim_user_info', 'id,login_name,name,user_level,birthday,gender,create_time,operate_time', NULL, NULL);


Start BaseDBApp. With the new configuration entries in place, the program creates the corresponding dimension tables in HBase.

The corresponding data can be viewed in Phoenix.


Synchronization of historical data

The user_info table in the business system already contains data, so Maxwell needs to synchronize this historical user data when collecting.

# maxwell-bootstrap synchronizes historical data: it reads every row of the user_info table. It does not itself package the rows as JSON and send them on; the id maxwell_1 passed via --client_id is configured in /opt/module/maxwell-1.25.0/config.properties, and the running maxwell process with that client_id packages the data as JSON and sends it downstream (to Kafka, and from there into HBase)
bin/maxwell-bootstrap --user maxwell  --password 123456 --host hadoop101  --database gmall2022 --table user_info --client_id maxwell_1

After the bootstrap completes, the historical user data can be seen in the DIM_USER_INFO dimension table.

Start Redis (redis-server /home/hzy/redis2022.conf), start OrderWideApp, and generate business data.

The enriched order wide records, now carrying the user dimension fields, are printed to the console.

Summary
  • Processes that need to be started:
    • zookeeper, kafka, maxwell, hdfs, hbase, redis
    • BaseDBApp (stream splitting)
    • OrderWideApp (order wide table preparation)
  • Program flow:
    • Run the simulation jar that generates business data
    • The generated business data is inserted into the MySQL business database
    • MySQL records the changes in its binlog
    • Maxwell reads the binlog, packages each change as a JSON string, and sends it to Kafka's ods topic (ods_base_db_m)
    • BaseDBApp reads the ods_base_db_m topic and splits the stream
      • Fact data: written back to Kafka dwd topics
      • Dimension data: saved to dimension tables in Phoenix
    • OrderWideApp reads the order and order detail data from the dwd topics
    • intervalJoin performs the dual-stream join of orders and order details
    • The user dimension is associated onto the order wide table
      • Basic dimension association
      • Optimization 1: cache-aside (side cache)
      • Optimization 2: asynchronous IO, implemented with the template method design pattern
