ETL: Data Warehousing in Practice

ETL is the abbreviation of Extraction, Transformation, Loading: the process of extracting, transforming, and loading data.

As a business grows and expands, it adds more production lines and generates more and more data, with widely varying collection methods, raw data formats, volumes, storage requirements, and usage scenarios. As the data hub, the warehouse must guarantee the security and accuracy of stored data, scalability going forward, and timely data analysis. That is a big challenge.

Glossary:

  • ODS: Operational Data Store
  • DW: Data Warehouse
  • DM: Data Mart

1. Data Extraction

Data extraction means extracting source data into the ODS and DW, where it is then processed into presentation data for the relevant people to review.

Source data:

  • User access logs
  • Custom event logs and operation logs
  • Business logs
  • Logs generated by each service
  • System logs: operating system logs, CDN logs, etc.
  • Monitoring logs
  • Other logs

Extraction frequency:

  • If there are no special requirements, once a day is enough, but avoid pulling logs during peak hours
  • Logs with real-time requirements can be extracted once an hour, or collected directly with tools like Kafka; either way, factor in the load the system can bear

Extraction strategy:

  • Because data volumes are large, extraction is usually incremental; but for certain data, such as order data, whose status changes over time and whose volume is predictable and relatively small, a full pull is the better strategy
  • For incremental pulls of file-based logs, a date can be added to the file name, for example server_log_2018082718.log, which makes pulling by the hour straightforward
  • To guard against failures, keep source data on the source server for at least two days
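As a minimal sketch of the hourly incremental pull described above (the file-name prefix and the watermark handling are assumptions, not from the original), the list of date-stamped log files to fetch since the last successful pull could be built like this:

```python
from datetime import datetime, timedelta

def hourly_log_names(last_pulled: datetime, now: datetime, prefix: str = "server_log"):
    """Build the file names for every full hour after the last successful pull.

    File names follow the yyyyMMddHH pattern from the article,
    e.g. server_log_2018082718.log for 2018-08-27 18:00.
    """
    names = []
    # Start from the hour right after the last one already pulled
    hour = last_pulled.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    while hour <= now:
        names.append(f"{prefix}_{hour:%Y%m%d%H}.log")
        hour += timedelta(hours=1)
    return names

# Example: last pull covered 16:00, it is now 18:30 -> fetch the 17:00 and 18:00 files
print(hourly_log_names(datetime(2018, 8, 27, 16), datetime(2018, 8, 27, 18, 30)))
# → ['server_log_2018082717.log', 'server_log_2018082718.log']
```

The returned names would then be fetched with whatever transport the team uses (rsync, Flume, etc.), and the watermark advanced only after a successful copy.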

2. Data Transformation and Cleaning

As the name suggests, this step processes unnecessary and non-compliant data. Cleansing is best kept out of the extraction step, because you sometimes need to inspect the raw data. Most companies have their own specifications; the points below are listed for reference only.

Data cleaning includes the following aspects:

  1. Null handling: depending on business needs, null values may be replaced with a specific value or filtered out directly;
  2. Validating data correctness: mainly handling non-compliant data according to its business meaning, for example replacing a string found in a numeric field with 0, or filtering out non-date strings in a date field;
  3. Normalizing data formats: for example, formatting all dates as yyyy-MM-dd HH:mm:ss;
  4. Transcoding data: converting a coded field in the source data into the value representing its real meaning, by joining against a code table;
  5. Standardizing and unifying data: for example, the source data may represent gender in many different ways; during extraction, convert them to the values defined in the model so that gender is represented uniformly;
  6. Other cleaning rules defined by the business...
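The rules above can be sketched for a single record as follows; the field names (`amount`, `ts`, `gender`) and the gender code table are hypothetical examples, not from the original:

```python
from datetime import datetime

# Hypothetical code table for rules 4/5: source codes -> the model's unified values
GENDER_CODES = {"M": "male", "1": "male", "F": "female", "0": "female"}

def clean_record(raw: dict) -> dict:
    """Apply the cleaning rules above to one raw record (field names are assumptions)."""
    rec = dict(raw)
    # Rule 1: null handling - replace a missing amount with 0
    rec["amount"] = rec.get("amount") or 0
    # Rule 2: correctness - a non-numeric value in a numeric field becomes 0
    try:
        rec["amount"] = float(rec["amount"])
    except (TypeError, ValueError):
        rec["amount"] = 0.0
    # Rule 3: normalize all dates to yyyy-MM-dd HH:mm:ss
    ts = datetime.strptime(rec["ts"], "%Y%m%d%H%M%S")
    rec["ts"] = ts.strftime("%Y-%m-%d %H:%M:%S")
    # Rules 4/5: transcode and unify gender via the code table
    rec["gender"] = GENDER_CODES.get(str(rec.get("gender")), "unknown")
    return rec

print(clean_record({"amount": "abc", "ts": "20180827180000", "gender": "1"}))
# → {'amount': 0.0, 'ts': '2018-08-27 18:00:00', 'gender': 'male'}
```

In practice this logic would run as a Hive/Spark job over the whole extract rather than record by record, but the rule ordering is the same.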

3. Data Loading

Once the data has been pulled and cleaned, it needs to be presented. Typically the cleaned data is loaded into MySQL and consumed by the various systems, or shown to the relevant people directly with Tableau.
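A minimal sketch of the MySQL load step (the table name and columns are assumptions): build one parameterized INSERT and hand it to the driver's `executemany`, rather than formatting values into SQL strings by hand.

```python
def build_insert(table: str, rows: list) -> tuple:
    """Build a parameterized INSERT plus its parameter tuples for executemany().

    Table and column names here are illustrative; all rows are assumed
    to share the keys of the first row.
    """
    cols = list(rows[0])
    placeholders = ", ".join(["%s"] * len(cols))  # %s is the MySQL driver placeholder
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    params = [tuple(r[c] for c in cols) for r in rows]
    return sql, params

rows = [{"dt": "2018-08-27", "pv": 1200}, {"dt": "2018-08-28", "pv": 1350}]
sql, params = build_insert("daily_pv", rows)
print(sql)
# → INSERT INTO daily_pv (dt, pv) VALUES (%s, %s)
# With a real connection: cursor.executemany(sql, params); conn.commit()
```

Batching rows through `executemany` keeps the load fast and lets the whole batch be committed or rolled back as one unit.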

4. ETL Tools

There are many ETL-related tools; only some common ones are listed here. Every company's technology stack is different, so choose according to your actual situation.

Data extraction tools:

  • kafka
  • flume
  • sync

Data cleaning:

  • hive/tez
  • pig/tez
  • storm
  • spark

Other tools:

  • Data storage: hadoop, hbase, ES, redis
  • Task scheduling: azkaban, oozie
  • Data synchronization: datax, sqoop

5. Metadata in the ETL Process

Imagine taking over someone else's work as a newcomer: no documentation, no comments in the code, and not a single comment on the tables and fields in the database. Wouldn't you stare out the window and heave a long sigh...

So a metadata management system is a must for a data warehouse, and the relevant people must maintain it regularly. If the metadata falls out of sync with changes in the warehouse, the metadata system exists in name only.


Origin blog.csdn.net/oZuoLuo123/article/details/87913833