I. Data processing architecture
As shown, there are two data pipelines: the real-time calculation flow and the offline calculation flow.
- Real-time calculation: event (hive table) --(events sent with dw-event-to-collector.sh)--> event collection tool (collector) --> flume distribution --> kafka cache --> flink calculation --> hbase --> elasticsearch
- Offline calculation: event hdfs (hive tables) --(hive tables read actively by flink)--> flink calculation --> hbase --> elasticsearch
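A rough end-to-end check of the real-time path can be assembled from the commands documented in the sections below; the channel (c8), topic, table and index names and the hosts are examples, and the collector script's arguments depend on the channel configuration:
# 1. send events from the hive table to the collector (arguments per channel configuration)
sh dw-event-to-collector.sh
# 2. confirm the events reached the kafka topic
sh kafka-console-consumer.sh --topic event_c8 --bootstrap-server 172.00.0.000:9092 --max-messages 10
# 3. confirm flink wrote calculated traits to hbase
echo "scan 'trait_c8', {LIMIT => 5}" | hbase shell
# 4. confirm the traits are searchable in elasticsearch (es host is a placeholder)
curl -s "http://<es-host>:9200/trait_c8/trait_c8/_search?size=5"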
II. Real-time calculation process tools
1.hive
- Enter the hive data warehouse: hive
- View current database: show databases;
- Switch to the cdp database: use cdp;
- Create a table (events are configured on the SMH front end, which generates the statement automatically):
CREATE TABLE IF NOT EXISTS tablename (
  uid string,
  event_time bigint,
  touch_point_id string
)
PARTITIONED BY (process_date string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
- View the table creation statement: show create table c8_shopping;
- View current tables: show tables;
- View table column names: desc tablename;
- Load data into the corresponding hive event table: load data local inpath "/home/hadoop/shopping.txt" into table tablename partition (process_date = "2019-07-22");
- Query data in the table: select * from tablename where process_date = '2019-04-26' limit 10;
- Print column names with the query results (run before executing the query): set hive.cli.print.header = true;
- Delete the data in a table: truncate table tablename;
- Delete a table: drop table tablename;
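A minimal sketch of the load-and-verify sequence run from the shell with hive -e; the table, file and partition date are the examples used above:
# load one day's events into a partition of the event table
hive -e "use cdp;
load data local inpath '/home/hadoop/shopping.txt'
  into table c8_shopping partition (process_date='2019-07-22');"
# spot-check the partition with column headers enabled
hive -e "set hive.cli.print.header=true;
use cdp;
select * from c8_shopping where process_date='2019-07-22' limit 10;"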
2.kafka
Query kafka consumption; path: /home/hadoop/kafka_2.11-0.10.2.0/bin
Command: sh kafka-console-consumer.sh --topic event_c8 --from-beginning --bootstrap-server 172.00.0.000:9092 > event_c8
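A couple of variations that can be useful before dumping a whole topic, assuming the same kafka 0.10.2 installation; the zookeeper address is a placeholder:
cd /home/hadoop/kafka_2.11-0.10.2.0/bin
# list the topics on the cluster (0.10.x kafka-topics.sh talks to zookeeper)
sh kafka-topics.sh --list --zookeeper <zookeeper-host>:2181
# peek at a limited number of messages instead of redirecting the whole topic to a file
sh kafka-console-consumer.sh --topic event_c8 --bootstrap-server 172.00.0.000:9092 --max-messages 10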
3.flink
- Restart a flink task; path: /home/hadoop/cdp-etl-jobs/bin/job/realtime
- Stop a flink task: yarn application -kill <application id>
- Start the flink tasks: sh indexing-trait.sh, sh calculate-trait.sh
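A minimal restart sequence combining the commands above; the grep filter on the application name is an assumption about how the flink jobs appear in yarn:
cd /home/hadoop/cdp-etl-jobs/bin/job/realtime
# find the running flink application id on yarn
yarn application -list | grep -i flink
# stop the old job, then start the two realtime jobs again
yarn application -kill <application id>
sh indexing-trait.sh
sh calculate-trait.sh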
4.hbase
- Enter hbase: hbase shell
- View existing tables: list
- Query a trait value: scan 'trait_c8', {COLUMNS => ['d:t1425', 'd:uid']}
- Query uids in deleted state: scan 'trait_c8', {COLUMNS => 'd:delete_status', FILTER => "ValueFilter(=, 'substring:true')"}
- Look up a single uid: get 'trait_c8', 'fff144eb653e7348f051307cde7db169'
- Delete table data: truncate "tablename"; flush "tablename";
- Delete a table: disable 'tablename'; drop 'tablename';
- Full sync from hbase to es: cdp/cdp-etl-jobs/bin/job/batch/trait-crowd-calc.sh -calcType sync (incremental: incr)
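hbase shell commands can also be run non-interactively by piping them in, which is handy for quick checks from scripts; a sketch using the table and uid from the examples above:
# fetch one trait column for a single uid without entering the shell
echo "get 'trait_c8', 'fff144eb653e7348f051307cde7db169', {COLUMN => 'd:t1425'}" | hbase shell
# count the rows in the trait table (can be slow on large tables)
echo "count 'trait_c8'" | hbase shell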
5.elasticsearch
Queries can be run with kibana or the elasticsearch-head plugin; commonly used commands:
- Query a trait:
GET /trait_c39/trait_c39/_search?size=1000
{
"query": {
"match_all": {}
},
"_source": ["t596"]
}
- Query a crowd:
GET /trait_c39/trait_c39/_search?size=1000
{
"query": {
"match_all": {}
},
"post_filter": {"term": {
"crowds_code": "cr197"
}}
}
- Look up a single uid:
GET /trait_c33/trait_c33/uid-1
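The same queries can be issued with curl when kibana or the head plugin is not available; the es host is a placeholder:
# query a trait, returning only the t596 field
curl -s -XGET "http://<es-host>:9200/trait_c39/trait_c39/_search?size=1000" -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}, "_source": ["t596"]}'
# fetch a single uid document
curl -s -XGET "http://<es-host>:9200/trait_c33/trait_c33/<uid>"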
III. Offline calculation process tools
1.hdfs
Web UI query address: http://172.23.x.xxx:50070/explorer.html#/cdp/warehouse
View directory: hadoop fs -ls /cdp/warehouse/c8/offline/
View File: hadoop fs -cat /cdp/warehouse/c8/offline/shopping.txt
Download data: hadoop fs -get /cdp/warehouse/c8/offline/
Delete files: hadoop fs -rm -r /cdp/warehouse/c8/offline/shopping.txt
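Two small additions that help with large offline files, using the same directory as above:
# check file sizes before downloading
hadoop fs -du -h /cdp/warehouse/c8/offline/
# copy a single file instead of the whole directory
hadoop fs -get /cdp/warehouse/c8/offline/shopping.txt ./shopping.txt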
2.azkaban
- cdp-batch-process: offline batch data processing
dw-etl-process: starts the data warehouse etl
dw-event-to-hdfs: actively reads events into HDFS
user-delete: deletes users
event-ub-to-hbase: sends events to hbase, used for the user profile data display
common-jobs-config: generates job configuration information, path: /home/hadoop/cdp-etl-jobs/jobs-tmp/codes/ (see the sketch after this list); the generated lists are:
ALL_EVENT_TRAIT: list of all traits triggered by event arrival
ALL_ACC_TRAIT: list of all event-accumulation-triggered traits, excluding timeline traits
ALL_REF_TRAIT: list of all traits triggered by trait changes
ALL_CROWD: full list of crowds within the channel
CALC_EVENT_TRAIT: list of traits triggered on event arrival that need to be recalculated
CALC_TRAIT: list of traits triggered by trait changes that need to be recalculated
CALC_CROWD: list of crowds to be calculated that day, including crowds that need recalculation and crowds whose calculation cycle is due
CLEAN_CROWD: list of crowds to be deleted
CLEAN_TRAIT: list of traits to be deleted
EXPORT_TRAIT: list of traits to be exported during idmapping
CANCELED_TRAIT: list of traits affected by authorization revocation
event-trait-calc-full: full rerun of the data; traitupdate evaluates the history and assigns the latest data to the trait
event-trait-calc-incr: calculates the data warehouse's daily incremental data; traitupdate sends only that day's data
event-trait-calc-init: recalculates traits triggered by event arrival; traitupdate sends only that day's data
trait-crowd-calc: calculates crowds, traits that must be recalculated when triggered by trait changes, timeline-type traits, and data updates for site administrators / operations specialists
id-mapping-clean: deletes obsolete mapping relationships
id-mapping-init: initializes idMapping and establishes the mapping relationships
id-mapping-copy: copies traits during idMapping
report-crowd-count: updates crowd sizes to mysql (cdp_crowd table, crowd_scale column)
report-metric: periodically calculates the long-term tracking dashboard metrics for all crowds and the full-channel metrics
- cdp-clean-jobs: clears temporary files and expired crowd export files
- crowd-export: exports crowds
- init-channels: initializes channels
- trait-import: imports traits
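A small sketch for inspecting the lists generated by common-jobs-config; the exact file layout under codes/ is an assumption:
cd /home/hadoop/cdp-etl-jobs/jobs-tmp/codes/
ls
# find which generated list a given trait or crowd code appears in
grep -r "t1425" .
grep -r "cr197" .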