New Infrastructure Project: Cleaning Historical Data

1. Background:

Data slowly decays as it is used: frequently used records are frozen for a period of time, and a number of other unknown factors can make a record unavailable at a given point in time. Because of this, we need a process for cleaning historical data. Our current raw cleaning process is only responsible for obtaining basic information; after obtaining that information, it does not determine whether a record can still perform tasks. I have therefore designed the historical data cleaning in two stages: the initial cleaning obtains basic information, and the secondary cleaning extracts the records that can still perform tasks.

2. Logic design:

Historical data is backed up every month in three parts, named by year-month plus a part number, for example: his_clean_t2018111, his_clean_t2018112, his_clean_t2018113.
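A minimal sketch of how such backup tables could be created, assuming they clone the schema of a hypothetical live table his_clean_t (only the backup table names appear in the original design):

-- Hypothetical: one backup table per part, cloned from the live table's schema
CREATE TABLE his_clean_t2018111 LIKE his_clean_t;
CREATE TABLE his_clean_t2018112 LIKE his_clean_t;
CREATE TABLE his_clean_t2018113 LIKE his_clean_t;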

In the initial cleaning, each part of the historical data is deduplicated on its key fields and stored in old_clean_t. Project xx-ck-clean distributes this data to project xx-rbt-clean for the initial cleaning; successful results are stored in the temporary table old_short_succ_t and failures in the temporary table old_short_error_t. To ensure that no record is distributed twice, records that have already been distributed are stored in the temporary table old_short_run_t. Finally, the results are persisted to the basic-information table old_used_t.
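A minimal SQL sketch of this bookkeeping, under assumptions: the tables are keyed by uuid, old_short_run_t records distributed uuids, and old_used_t shares the schema of old_short_succ_t (none of this is confirmed by the original beyond the table names):

-- Hypothetical sketch: record a batch as distributed before sending it,
-- so the same rows are never distributed twice.
INSERT INTO old_short_run_t (uuid)
SELECT c.uuid FROM old_clean_t c
LEFT JOIN old_short_run_t r ON r.uuid=c.uuid
WHERE r.uuid IS NULL;

-- Hypothetical sketch: persist successful cleaning results to the
-- basic-information table (assumes compatible schemas).
INSERT INTO old_used_t SELECT * FROM old_short_succ_t;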

In the secondary cleaning, the data in old_used_t is inserted into the table old_used_task_t. Data that is already valid does not need secondary cleaning, so before the secondary cleaning runs, the rows in old_used_task_t that are already valid are deleted. Project xx-ck-task distributes the data to the secondary cleaning project xx-rbt-task, which keeps the uuid of each valid record in Redis; finally, the rows whose uuids are in Redis are extracted and stored in cookie_used_t, the final table of data that can perform tasks. (The corresponding SQL appears in section 4 below, with a sketch of the final extraction step at the end.)

3. Project design:

Initial cleaning data distribution: xx-ck-clean;

Initial cleaning: xx-rbt-clean;

Secondary cleaning data distribution: xx-ck-task;

Secondary cleaning: xx-rbt-task

4. Data maintenance SQL:

-- Modify ua content whose length is too long
UPDATE his_clean_t2018113 SET ua='HUAWEI-ARS-TL00__weibo__7.3.0__android__android8.1.0'
WHERE LENGTH(ua)>200;
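-- (Optional check, not in the original script: count how many rows
-- the UPDATE above will touch before running it.)
SELECT COUNT(1) FROM his_clean_t2018113 WHERE LENGTH(ua)>200;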
-- Insert data for the initial cleaning
SET @uuid=2018111010000000000;
INSERT INTO old_clean_t(`uuid`,`cki`,`cks`,`ckua`,`ckaid`,`ckuid`,`ckfrom`,`ckgsid`)
SELECT @uuid:=@uuid+1 uuid,i,s,ua,aid,uid,`from`,`gsid` FROM his_clean_t2018111 GROUP BY gsid;
SET @uuid=2018112010000000000;
INSERT INTO old_clean_t(`uuid`,`cki`,`cks`,`ckua`,`ckaid`,`ckuid`,`ckfrom`,`ckgsid`)
SELECT @uuid:=@uuid+1 uuid,i,s,ua,aid,uid,`from`,`gsid` FROM his_clean_t2018112 GROUP BY gsid;
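-- Note (not in the original script): GROUP BY gsid deduplicates on gsid and relies
-- on MySQL's loose GROUP BY mode (ONLY_FULL_GROUP_BY disabled) to pick values for
-- the non-aggregated columns; @uuid:=@uuid+1 assigns each row a sequential uuid.
-- A quick sanity check that the inserted uuid ranges look right:
SELECT MIN(uuid),MAX(uuid),COUNT(1) FROM old_clean_t;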

-- Query counts
SELECT COUNT(1) FROM old_clean_t;
SELECT COUNT(1) FROM old_clean_t WHERE state=0;
SELECT COUNT(1) FROM old_clean_t WHERE state=1;
SELECT COUNT(1) FROM old_clean_t WHERE state=2;
SELECT COUNT(1) FROM old_short_run_t;
SELECT COUNT(1) FROM old_short_error_t;
SELECT COUNT(1) FROM old_short_succ_t;
SELECT COUNT(1) FROM old_used_t;
SELECT COUNT(1) FROM old_used_task_t;
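-- (Alternative, not in the original script: the three per-state counts
-- above can be collapsed into one grouped query.)
SELECT state,COUNT(1) FROM old_clean_t GROUP BY state;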
-- Cleanup
DELETE FROM old_clean_t;
DELETE FROM old_used_task_t;

-- Insert data for the secondary cleaning
INSERT INTO old_used_task_t(`uuid`,`cki`,`cks`,`ckua`,`ckaid`,`ckuid`,`ckfrom`,`ckgsid`,`cuid`,`clevel`,`cvip`,`cnv`,`cfollower`,`cfriend`,`carticle`,`cheader`)
SELECT uuid,cki,cks,ckua,ckaid,ckuid,ckfrom,ckgsid,cuid,clevel,cvip,cnv,cfollower,cfriend,carticle,cheader FROM old_used_t;
-- Delete rows that are already valid (data that is already valid does not need secondary cleaning)
DELETE outt FROM old_used_task_t outt INNER JOIN cookie_used_t cut ON outt.cuid=cut.cuid;
-- After the secondary cleaning, the uuids of valid rows are stored in Redis; the program then inserts those rows into cookie_used_t
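A minimal sketch of that final step, assuming cookie_used_t shares the column list used above and that the program substitutes the uuids it reads from Redis into the IN list (the two uuid values below are placeholders):

-- Hypothetical: extract the rows whose uuids the secondary cleaning marked valid in Redis
INSERT INTO cookie_used_t(`uuid`,`cki`,`cks`,`ckua`,`ckaid`,`ckuid`,`ckfrom`,`ckgsid`,`cuid`,`clevel`,`cvip`,`cnv`,`cfollower`,`cfriend`,`carticle`,`cheader`)
SELECT uuid,cki,cks,ckua,ckaid,ckuid,ckfrom,ckgsid,cuid,clevel,cvip,cnv,cfollower,cfriend,carticle,cheader
FROM old_used_task_t
WHERE uuid IN (2018111010000000001,2018111010000000002); -- placeholder uuids fetched from Redis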
