1. Summary of data collection projects

1. Data Warehouse

A data warehouse stores data and provides data support for the enterprise.

2. Classification of data

Business data: records the order information itself.

Behavioral data: records what happened during the process of placing the order.

2.1 User business data

What is it:

Data generated when users use the (e-commerce) platform that is closely tied to the platform's business (purchases, orders, payments, favorites, searches).

When it is produced:

Generated while the user is using the app.

How to save:

Relational Database

Why:

Whether transactions are needed is the key factor that separates scenarios suited to an RDBMS from those suited to NoSQL.
RDBMS: designed for OLTP (online transaction processing), with a focus on transactions and online processing.
NoSQL: born in the mobile Internet era, with a focus on raw performance.
Hive: designed for OLAP (online analytical processing), with a focus on queries.
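To make the transaction argument concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the business database. The table names `order_info` and `payment_info`, their columns, and the `place_order` helper are assumptions made for illustration only; they are not taken from the project's actual schema.

```python
import sqlite3

# In-memory database standing in for the business RDBMS (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_info (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE payment_info ("
    "order_id INTEGER NOT NULL, amount REAL NOT NULL CHECK (amount > 0))"
)
conn.commit()


def place_order(conn, order_id, amount):
    """Insert an order and its payment atomically; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "INSERT INTO order_info (id, status) VALUES (?, 'paid')", (order_id,)
            )
            conn.execute(
                "INSERT INTO payment_info (order_id, amount) VALUES (?, ?)",
                (order_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_order(conn, 1, 99.0)       # both rows are written
failed = place_order(conn, 2, -5.0)   # payment violates the CHECK, order is rolled back
order_count = conn.execute("SELECT COUNT(*) FROM order_info").fetchone()[0]
```

The second call fails on the payment row, and the order row inserted just before it disappears as well. That all-or-nothing behavior is what business data relies on.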

2.2 User behavior data

What is it:

Data that records the operations and behaviors (startup, comments, favorites, etc.) of users while they use the platform app.

When it is produced:

Generated while the user is using the app.

How to save:

Stored as log files, with each record written in JSON format.

Why:

The data is dense, low in value per record, accumulated over long periods, and complex in structure.

3. Data example

3.1 Log data example

Start log:

{
  common: "xxx",
  start: "xxxx",
  err: "xxx",
  ts: "timestamp at which the startup occurred"
}

{
  "common": {
    "ar": "370000",
    "ba": "Honor",
    "ch": "wandoujia",
    "md": "Honor 20s",
    "mid": "eQF5boERMJFOujcp",
    "os": "Android 11.0",
    "uid": "76",
    "vc": "v2.1.134"
  },
  "start": {
    "entry": "icon",            -- icon: app icon; notice: notification; install: launched after installation
    "loading_time": 18803,      -- startup loading time
    "open_ad_id": 7,            -- ad page ID
    "open_ad_ms": 3449,         -- total ad play time
    "open_ad_skip_ms": 1989     -- point at which the user skipped the ad
  },
  "err": {                      -- error
    "error_code": "1234",       -- error code
    "msg": "***********"        -- error message
  },
  "ts": 1585744304000
}

{
  "common": {"ar":"420000","ba":"iPhone","ch":"Appstore","md":"iPhone 8","mid":"mid_991","os":"iOS 13.3.1","uid":"418","vc":"v2.1.134"},
  "page": {"during_time":3336,"item":"3,7","item_type":"sku_ids","last_page_id":"trade","page_id":"payment"},
  "ts": 1583769315209
}
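As a small sketch of how such a log line might be read downstream, the snippet below parses the page log above with Python's json module, splits the comma-separated `item` field, and turns the millisecond `ts` into a calendar date. The helper name `event_date` and the choice of UTC are assumptions for illustration; a real pipeline would use whatever time zone the warehouse partitions by.

```python
import json
from datetime import datetime, timezone

# The page log from the example above, as it would appear on one line of a log file.
line = (
    '{"common":{"ar":"420000","ba":"iPhone","ch":"Appstore","md":"iPhone 8",'
    '"mid":"mid_991","os":"iOS 13.3.1","uid":"418","vc":"v2.1.134"},'
    '"page":{"during_time":3336,"item":"3,7","item_type":"sku_ids",'
    '"last_page_id":"trade","page_id":"payment"},"ts":1583769315209}'
)


def event_date(record, tz=timezone.utc):
    """Convert the millisecond `ts` field to a YYYY-MM-DD string (UTC assumed)."""
    return datetime.fromtimestamp(record["ts"] / 1000, tz=tz).strftime("%Y-%m-%d")


record = json.loads(line)
sku_ids = record["page"]["item"].split(",")  # item_type is "sku_ids": several ids in one string
day = event_date(record)
```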

Event log:

{
  "common": {},
  "actions": [],
  "page": {},
  "err": {},
  "ts": xxx
}

{
  "common": {                     -- common information
    "ar": "230000",               -- region code
    "ba": "iPhone",               -- phone brand
    "ch": "Appstore",             -- channel
    "md": "iPhone 8",             -- phone model
    "mid": "YXfhjAYH6As2z9Iq",    -- device ID
    "os": "iOS 13.2.9",           -- operating system
    "uid": "485",                 -- member ID
    "vc": "v2.1.134"              -- app version
  },
  "actions": [                    -- actions (events)
    {
      "action_id": "favor_add",   -- action ID
      "item": "3",                -- target ID
      "item_type": "sku_id",      -- target type
      "ts": 1585744376605         -- action timestamp
    }
  ],
  "displays": [
    {
      "displayType": "query",     -- exposure type
      "item": "3",                -- exposed item ID
      "item_type": "sku_id",      -- exposed item type
      "order": 1                  -- display order
    },
    {
      "displayType": "promotion",
      "item": "6",
      "item_type": "sku_id",
      "order": 2
    },
    {
      "displayType": "promotion",
      "item": "9",
      "item_type": "sku_id",
      "order": 3
    },
    {
      "displayType": "recommend",
      "item": "6",
      "item_type": "sku_id",
      "order": 4
    },
    {
      "displayType": "query",
      "item": "6",
      "item_type": "sku_id",
      "order": 5
    }
  ],
  "page": {                       -- page information
    "during_time": 7648,          -- duration in milliseconds
    "item": "3",                  -- target ID
    "item_type": "sku_id",        -- target type
    "last_page_id": "login",      -- previous page
    "page_id": "good_detail",     -- page ID
    "sourceType": "promotion"     -- source type
  },
  "err": {                        -- error
    "error_code": "1234",         -- error code
    "msg": "***********"          -- error message
  },
  "ts": 1585744374423             -- timestamp of entering the page
}

{
  "common": {"ar":"420000","ba":"iPhone","ch":"Appstore","md":"iPhone 8","mid":"mid_991","os":"iOS 13.3.1","uid":"418","vc":"v2.1.134"},
  "displays": [
    {"displayType":"promotion","item":"10","item_type":"sku_id","order":1},
    {"displayType":"query","item":"10","item_type":"sku_id","order":2},
    {"displayType":"query","item":"10","item_type":"sku_id","order":3},
    {"displayType":"promotion","item":"5","item_type":"sku_id","order":4},
    {"displayType":"query","item":"3","item_type":"sku_id","order":5},
    {"displayType":"query","item":"7","item_type":"sku_id","order":6},
    {"displayType":"query","item":"5","item_type":"sku_id","order":7},
    {"displayType":"recommend","item":"1","item_type":"sku_id","order":8},
    {"displayType":"query","item":"10","item_type":"sku_id","order":9},
    {"displayType":"query","item":"6","item_type":"sku_id","order":10}
  ],
  "page": {"during_time":12161,"item":"2","item_type":"sku_id","last_page_id":"good_detail","page_id":"good_spec","sourceType":"query"},
  "ts": 1583769287899
}
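Because one event log nests a whole array of exposures, analysis usually needs one row per exposure. The following is a minimal sketch of that flattening step, assuming the field names shown in the examples above. The function name `explode_displays` and the selection of output columns are illustrative choices, and the sample line is a trimmed copy of the log above with only two display entries.

```python
import json


def explode_displays(record):
    """Yield one flat row per display entry, carrying device and page context."""
    common = record.get("common", {})
    page = record.get("page", {})
    for display in record.get("displays", []):
        yield {
            "mid": common.get("mid"),
            "uid": common.get("uid"),
            "page_id": page.get("page_id"),
            "display_type": display["displayType"],
            "item": display["item"],
            "order": display["order"],
            "ts": record.get("ts"),
        }


# Trimmed copy of the example log (only the fields used here, two displays).
sample_line = (
    '{"common":{"mid":"mid_991","uid":"418"},'
    '"displays":['
    '{"displayType":"promotion","item":"10","item_type":"sku_id","order":1},'
    '{"displayType":"query","item":"10","item_type":"sku_id","order":2}],'
    '"page":{"page_id":"good_spec"},"ts":1583769287899}'
)

rows = list(explode_displays(json.loads(sample_line)))
```

A log without a `displays` key simply yields no rows, so the same function can be applied to every line of a mixed log file.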

Types of log data:

Startup, exposure (display), action, page, and error.
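One way to tell these types apart is to look at which top-level keys a record carries. The sketch below assumes the key names from the examples above (`start`, `page`, `displays`, `actions`, `err`); the function name `log_types` and the returned labels are hypothetical.

```python
def log_types(record):
    """Return the log types present in one parsed record, based on its top-level keys."""
    found = []
    if "start" in record:
        found.append("start")
    if "page" in record:
        found.append("page")
    if record.get("displays"):   # a non-empty exposure list
        found.append("display")
    if record.get("actions"):    # a non-empty action list
        found.append("action")
    if "err" in record:
        found.append("error")
    return found


start_log = {"common": {}, "start": {"entry": "icon"}, "ts": 1585744304000}
event_log = {
    "common": {},
    "actions": [{"action_id": "favor_add"}],
    "displays": [{"displayType": "query"}],
    "page": {"page_id": "good_detail"},
    "err": {"error_code": "1234"},
    "ts": 1585744374423,
}

start_types = log_types(start_log)
event_types = log_types(event_log)
```

A single record can therefore contribute to several types at once, as the event log example does.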

Business data requirements:

① Which tables are there? (23 tables)

② How is the data in each table generated?

③ How is each table updated? Which fields are updated?

④ Which strategy is used to import each table into HDFS, and why?

4. Collection platform

4.1 Collection method

Business data: how is it imported, and why that way?

Use Sqoop to import the data in MySQL directly into HDFS.

Why Sqoop?

It matches the business scenario.
It is a batch-processing scenario.
It is open source and free, has many users, and has an active community.

Be familiar with the import method for each table:

Daily full import: import all the data in the table

select xxx from table_name

Daily incremental import (new rows only):

select xxx from table_name where date_format(create_time,'%Y-%m-%d')='$do_date'

Daily new-and-changed import (only rows added or changed that day):

select xxx from table_name
where
date_format(create_time,'%Y-%m-%d')='$do_date'
or
date_format(operate_time,'%Y-%m-%d')='$do_date'
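The three strategies differ only in their WHERE clause, which can be sketched as a small query builder. The function name `build_query`, the strategy labels, and the sample table `order_info` are assumptions for illustration. The string interpolation is acceptable here only because the table name, columns, and date come from the job's own configuration rather than from user input.

```python
def build_query(table, columns, strategy, do_date):
    """Build the import query for one table under the given strategy."""
    base = f"select {', '.join(columns)} from {table}"
    created = f"date_format(create_time,'%Y-%m-%d')='{do_date}'"
    operated = f"date_format(operate_time,'%Y-%m-%d')='{do_date}'"

    if strategy == "full":
        return base                                        # every row, every day
    if strategy == "incremental":
        return f"{base} where {created}"                   # rows created that day
    if strategy == "new_and_changed":
        return f"{base} where ({created} or {operated})"   # created or modified that day
    raise ValueError(f"unknown strategy: {strategy}")


full_sql = build_query("order_info", ["id", "status"], "full", "2020-03-10")
incr_sql = build_query("order_info", ["id", "status"], "incremental", "2020-03-10")
changed_sql = build_query("order_info", ["id", "status"], "new_and_changed", "2020-03-10")
```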

When do full and incremental imports apply?

The choice between full and incremental depends only on the amount of data.
Small amount of data: full or incremental.
Large amount of data: incremental.

How do you tell which tables hold a small amount of data and which hold a large amount?

Tables are divided by their nature:

Dimension table: describes a dimension of a fact.

Province table, region table
User table
Product table
Product category table
The amount of data is limited.

Fact table: each row records a fact that happened: 3W (who, when, where) + a quantity.

Placing an order, paying, commenting
The amount of data keeps growing over time.

Fact table: incremental
Dimension table: full or incremental

Log data: how is it collected?
Build a two-layer Flume collection pipeline. Which two layers?

1. Log server --> Kafka
2. Kafka --> HDFS

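4.2 What is the difference between Flume and Sqoop?

Sqoop:   A single use case: moving data between an RDBMS and HDFS, in either direction.
         Batch processing.

Flume:   Many use cases: supports multiple data sources and can deliver to multiple destinations.
         Stream processing.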

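4.3 Why is a two-layer Flume needed?

The data also feeds real-time analysis in Spark Streaming, so it must first be collected into Kafka.

Advantages:

① Security, and consistency with the company's current cluster planning and design.
② Peak shaving:
  a) The first-layer Flume writes data to Kafka, and the second-layer Flume finally collects it into HDFS.
     Instead of N first-layer Flume agent processes each sending requests to HDFS directly,
     the second-layer Flume agent sends the requests from one place, which reduces the request load on the NameNode (NN) under concurrency.
  b) Data is written to Kafka first and then read from Kafka by the second-layer Flume, so Kafka acts as a buffer.
     Even if upstream production outpaces the consumption capacity of the second-layer Flume, no data is lost.
③ Layered decoupling, which makes maintenance easier.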
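The buffering effect in ② b) can be illustrated with a toy simulation. This is only a sketch: an in-memory deque stands in for Kafka, the function name `simulate` and the per-tick numbers are made up, and a real Kafka topic is bounded by its retention settings rather than being unlimited.

```python
from collections import deque


def simulate(produced_per_tick, consume_rate):
    """Simulate a bursty producer and a fixed-rate consumer joined by a buffer.

    Returns the delivered event ids in order and the largest backlog observed.
    """
    buffer = deque()          # stands in for the Kafka topic
    delivered = []            # stands in for what the second layer writes to HDFS
    peak_backlog = 0
    next_id = 0

    for produced in produced_per_tick:
        # First layer: write everything it receives into the buffer.
        for _ in range(produced):
            buffer.append(next_id)
            next_id += 1
        # Second layer: read at most `consume_rate` events per tick.
        for _ in range(min(consume_rate, len(buffer))):
            delivered.append(buffer.popleft())
        peak_backlog = max(peak_backlog, len(buffer))

    # After producers go quiet, the consumer drains whatever is left.
    while buffer:
        delivered.append(buffer.popleft())
    return delivered, peak_backlog


# A burst of 10 events arrives in the second tick, but only 3 can be consumed per tick.
delivered, peak = simulate([2, 10, 1, 0, 0], consume_rate=3)
```

The backlog grows during the burst and shrinks afterwards, while every event still reaches the consumer in its original order.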
