X-Pack Spark Integration with Alibaba Cloud Log Service (LogHub)

Overview

The X-Pack Spark analysis engine, built on Spark, provides complex analytics, streaming, and machine learning capabilities. It can connect to a variety of data sources, such as ApsaraDB for HBase, MongoDB, and Phoenix, and it also supports connecting to Alibaba Cloud Log Service (LogHub). Log Service (LOG) is a one-stop real-time log data service that provides log collection, consumption, shipping, and analysis, helping you process and analyze massive volumes of log data.

Scenario

Consider a sales platform whose users open the home page, search, browse product detail pages, and finally place orders in the APP; the events generated by these operations are recorded in Alibaba Cloud Log Service. We now want to run statistical analysis on the users' behavioral data in the APP, produce daily and weekly operational reports, and provide users with online detail queries.

Implementation

These requirements can be met with Alibaba Cloud Log Service + X-Pack Spark + ApsaraDB for HBase. The data flow is shown in the following diagram:

[Figure: data flow diagram]


As the figure shows, the data flow is: APP logs are collected into LogHub -> Spark Streaming reads from LogHub and synchronizes the data to HBase (Phoenix) -> the online data is synchronized to the Spark offline data warehouse -> the offline warehouse runs batch computations and outputs operational reports.
Each APP log entry records a user event generated while using the APP. The following is a simple example of how to implement each step.

Collecting APP logs into LogHub

Assume the APP writes its logs to a file directory on the machine, and LogHub collects these log files for reading and analysis. The log fields are assumed to be as follows:

event_time: long    # timestamp when the event occurred
user_id: string     # user ID, a unique value
device_id: string   # device ID, the device on which the APP is used
event_name: string  # event name, e.g. home page, search, detail page, purchase
prod_id: string     # product ID
stay_times: int     # dwell time

In the APP log these fields are written as comma-separated values, so when configuring LogHub collection, select the comma-separated (delimiter) collection mode.
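For reference, hypothetical log lines in this format might look like the following (the values are made up for illustration; only the field order and the comma separator come from the description above):

1546308000,user_id_1006,device_0001,search,prod_1001,5
1546308035,user_id_1006,device_0001,detail,prod_1001,30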

Connecting Spark Streaming to LogHub

To connect Spark Streaming to LogHub, use the LogHub connector from the X-Pack connectors (see: Spark-LogHub Quick Start). Spark Streaming can be set, for example, to synchronize data from LogHub to Phoenix once every minute.
We need to create a table in Phoenix before synchronizing the data, as follows:

CREATE TABLE IF NOT EXISTS user_event (
   event_time BIGINT NOT NULL,
   user_id VARCHAR NOT NULL,
   device_id VARCHAR,
   event_name VARCHAR,
   prod_id VARCHAR,
   CONSTRAINT my_pk PRIMARY KEY (event_time, user_id)
  );

The Phoenix table user_event uses event_time and user_id as a composite primary key: user_id mainly serves per-user detail queries, while event_time makes it easy to synchronize data to Spark by time range.
The main logic of the Spark Streaming code that synchronizes LogHub data to Phoenix is as follows:

val loghubStream = LoghubUtils.createStream(
        ssc,
        loghubProject,
        logStore,
        loghubGroupName,
        endpoint,
        numReceiver,
        accessKeyId,
        accessKeySecret,
        StorageLevel.MEMORY_AND_DISK)

      loghubStream.foreachRDD { rdd =>
        rdd.foreachPartition { pt =>
          // Open a JDBC connection to Phoenix
          val phoenixConn = DriverManager.getConnection("jdbc:phoenix:" + zkAddress)
          val statement = phoenixConn.createStatement()
          var i = 0
          while (pt.hasNext) {
            val value = pt.next()
            // The data read from LogHub is JSON and needs to be parsed
            val valueFormatted = JSON.parseObject(new String(value))
            // Build the Phoenix upsert statement
            val upsertSql = s"upsert into $phoenixTableName values(" +
              s"${valueFormatted.getLong("event_time")}," +
              s"'${valueFormatted.getString("user_id").trim}'," +
              s"'${valueFormatted.getString("device_id").trim}'," +
              s"'${valueFormatted.getString("event_name").trim}'," +
              s"'${valueFormatted.getString("prod_id").trim}')"
            statement.execute(upsertSql)
            i = i + 1
            // Commit to Phoenix every batchSize rows
            if (i % batchSize == 0) {
              phoenixConn.commit()
              println(s"====finish upsert $i rows====")
            }
          }
          // Commit any remaining rows and release resources
          phoenixConn.commit()
          println(s"==last==finish upsert $i rows====")
          statement.close()
          phoenixConn.close()
        }
      }
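For context, a minimal sketch of the driver program that could wrap the snippet above (the object name LoghubToPhoenix and the 60-second batch interval are illustrative assumptions, not part of the original code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LoghubToPhoenix {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LoghubToPhoenix")
    // One batch per minute, matching the once-per-minute sync to Phoenix described above
    val ssc = new StreamingContext(conf, Seconds(60))

    // ... create loghubStream with LoghubUtils.createStream and write to Phoenix as shown above ...

    ssc.start()
    ssc.awaitTermination()
  }
}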

After Spark Streaming has synchronized the data to Phoenix, users can run detail queries against the Phoenix database. For example:

# Query all browsing details of user user_id_1006.
select * from user_event where user_id = 'user_id_1006';

Synchronizing data to the Spark offline data warehouse

Phoenix serves as the online database for detail queries; statistics require offline computation in the Spark data warehouse. Synchronizing Phoenix data to the warehouse essentially means creating a table in Spark and then synchronizing the data into that Spark table.
For archiving data into Spark, refer to: Batch archive. The synchronization logic is expressed in SQL here, and we assume the data is synchronized to Spark once per day.
The table creation and synchronization statements in Spark are as follows:

# Create a Parquet table user_event_parquet in Spark, using dt as the partition column.
create table user_event_parquet(
    event_time long,
    user_id string,
    device_id string,
    event_name string,
    prod_id string, 
    dt string
) using parquet
partitioned by(dt);

# Create a table user_event_phoenix in Spark that maps to the table in the Phoenix database.
CREATE TABLE user_event_phoenix USING org.apache.phoenix.spark
OPTIONS (
'zkUrl' 'hb-xx-master3-001.hbase.rds.aliyuncs.com:2181,hb-xx-master1-001.hbase.rds.aliyuncs.com:2181,hb-xx-master2-001.hbase.rds.aliyuncs.com:2181',
'table' 'user_event'
);

# Insert one day of data (2019-01-01) into the Parquet table user_event_parquet.
insert into user_event_parquet select EVENT_TIME,USER_ID,DEVICE_ID,EVENT_NAME,PROD_ID,'2019-01-01' from user_event_phoenix where EVENT_TIME >=1546272000 and EVENT_TIME < 1546358400;
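As a sketch of how the daily synchronization could be scheduled, the following Scala snippet computes the day's time range and runs the same insert via spark.sql (the application name and the Asia/Shanghai time zone are assumptions for this example):

import java.time.{LocalDate, ZoneId}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DailyArchive").getOrCreate()

// The day to archive; 2019-01-01 00:00 Asia/Shanghai corresponds to epoch second 1546272000
val dt = LocalDate.of(2019, 1, 1)
val zone = ZoneId.of("Asia/Shanghai")
val start = dt.atStartOfDay(zone).toEpochSecond
val end = dt.plusDays(1).atStartOfDay(zone).toEpochSecond

spark.sql(
  s"""insert into user_event_parquet
     |select EVENT_TIME, USER_ID, DEVICE_ID, EVENT_NAME, PROD_ID, '$dt'
     |from user_event_phoenix
     |where EVENT_TIME >= $start and EVENT_TIME < $end""".stripMargin)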

Offline batch computation in the data warehouse

Once the data has been synchronized to Spark, statistical analysis can be run on it, for example:

# Count the number of visits per day
select dt, count(*) from user_event_parquet group by dt
# Top 10 days by number of visits
select dt, count(*) total from user_event_parquet group by dt order by total desc limit 10
# Visit counts of the top 100 users per day
select dt,user_id, count(*) total  from user_event_parquet group by dt,user_id order by total desc limit 100

The computation results can be written back to the online database to serve business queries and reports.
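For example, a minimal sketch of writing a daily result back to Phoenix with the same org.apache.phoenix.spark connector used above (the result table DAILY_PV and its schema are hypothetical, and the zkUrl is a placeholder):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("WriteBackResults").getOrCreate()

// Daily visit counts, as in the first statistics query above
val dailyPv = spark.sql("select dt as DT, count(*) as TOTAL from user_event_parquet group by dt")

// Hypothetical Phoenix result table: DAILY_PV(DT VARCHAR PRIMARY KEY, TOTAL BIGINT)
dailyPv.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)  // the Phoenix Spark connector upserts rows in Overwrite mode
  .option("table", "DAILY_PV")
  .option("zkUrl", "hb-xx-master1-001.hbase.rds.aliyuncs.com:2181")
  .save()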

Summary

This article briefly described how Spark connects to LogHub, how the data is synchronized, and other common operations.

Origin yq.aliyun.com/articles/706379