ETL

ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into the data warehouse. It is an important part of a BI project: ETL typically consumes at least one third of the total project time, and the quality of the ETL design directly determines whether the BI project succeeds or fails.

The design of ETL is divided into three parts: data extraction, data cleaning and transformation, and data loading, and we approach the design from these three parts as well. Data extraction pulls data from the various sources into the ODS (Operational Data Store); some cleaning and transformation can already happen during this step, and the right extraction method must be chosen for each source to keep ETL running as efficiently as possible. Of the three parts, the "T" (transformation and cleaning) takes the longest: as a rule of thumb it accounts for about two thirds of the entire ETL workload. Data loading is generally just writing the cleaned data into the DW (Data Warehouse).

There are many ways to implement ETL; three are common. The first uses ETL tools (such as Oracle's OWB, SQL Server 2000's DTS, SQL Server 2005's SSIS service, or Informatica); the second uses hand-written SQL; the third combines ETL tools with SQL. The first two each have advantages and disadvantages. Tools let you stand up an ETL project quickly, shield you from complex coding, raise speed and lower difficulty, but they lack flexibility. SQL is flexible and lets you tune ETL for runtime efficiency, but the coding is complex and the technical bar is higher. The third approach combines the advantages of the other two and greatly improves both the development speed and the efficiency of the ETL.

  1. Data extraction (Extract)

This part requires a lot of work during the research stage. First, find out where the data comes from: how many business systems feed it, which DBMS each business system's database server runs, whether there is manually maintained data and how much of it, and whether there is any unstructured data. Only after this information has been collected can the data extraction be designed.

1. Handling data sources that use the same DBMS as the DW

This type of data source is the easiest to design for. Under normal circumstances the DBMS (SQL Server, Oracle) provides a database link feature: establish a direct link between the DW database server and the source business system, and the data can then be accessed simply by writing SELECT statements.
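
As a minimal sketch of that approach, assuming a SQL Server DW with a linked server named BIZ_SRV already pointing at the business system (all server, database, table, and column names here are hypothetical), extraction is just an INSERT ... SELECT across the four-part name:

```sql
-- Pull recent orders from the business system into an ODS staging table
-- over an existing linked server (BIZ_SRV); all names are placeholders.
INSERT INTO ods.dbo.stg_orders (order_id, customer_id, amount, order_time)
SELECT order_id, customer_id, amount, order_time
FROM BIZ_SRV.erp.dbo.orders                 -- server.database.schema.table
WHERE order_time >= DATEADD(DAY, -1, GETDATE());
```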

2. Handling data sources that use a different DBMS than the DW

For this type of source, a database link can usually still be established through ODBC, for example between SQL Server and Oracle. If no database link can be established, there are two ways to proceed: export the source data to .txt or .xls files with a tool and then import those files into the ODS, or move the data through a program interface.
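
For instance, assuming the DW runs on SQL Server and the source is Oracle, a link can be registered once with the standard system procedures; the server name, TNS alias, and login below are placeholders:

```sql
-- Register an Oracle source as a linked server on the SQL Server DW host.
EXEC sp_addlinkedserver
     @server     = 'ORA_SRC',
     @srvproduct = 'Oracle',
     @provider   = 'OraOLEDB.Oracle',
     @datasrc    = 'ORCL';                  -- TNS alias of the Oracle instance

-- Map local logins to an Oracle account used for extraction.
EXEC sp_addlinkedsrvlogin
     @rmtsrvname  = 'ORA_SRC',
     @useself     = 'FALSE',
     @rmtuser     = 'etl_user',
     @rmtpassword = 'etl_password';
```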

3. For file-based data sources (.txt, .xls), business staff can be trained to import these files into a designated database with database utilities, and extraction then runs against that database; alternatively, the import can be done with a tool.
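
Where the DW sits on SQL Server, one possible way to pull a delimited .txt file into a staging table is a plain BULK INSERT; the file path, delimiters, and table below are assumptions for illustration:

```sql
-- Load a tab-delimited export from the source system into a staging table.
BULK INSERT ods.dbo.stg_suppliers
FROM 'D:\etl\incoming\suppliers.txt'
WITH (
    FIELDTERMINATOR = '\t',   -- tab-separated columns
    ROWTERMINATOR   = '\n',   -- one record per line
    FIRSTROW        = 2       -- skip the header row
);
```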

4. The incremental update problem

For systems with large data volumes, incremental extraction must be considered. Under normal circumstances the business system records the time at which each transaction occurs, and that timestamp can serve as the incremental marker: before each extraction, determine the latest timestamp already recorded in the ODS, then fetch from the business system all records newer than it. This relies on the business system's timestamps, and in practice business systems often have no timestamps at all, or only partial ones.
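
A minimal sketch of that high-water-mark pattern, with hypothetical table and column names: read the newest timestamp already in the ODS, then pull only newer rows from the source.

```sql
-- Incremental extraction by timestamp high-water mark (names are placeholders).
DECLARE @last_loaded DATETIME;

-- 1. Newest business timestamp already present in the ODS.
SELECT @last_loaded = ISNULL(MAX(order_time), '19000101')
FROM ods.dbo.stg_orders;

-- 2. Fetch only the records created after that point from the business system.
INSERT INTO ods.dbo.stg_orders (order_id, customer_id, amount, order_time)
SELECT order_id, customer_id, amount, order_time
FROM BIZ_SRV.erp.dbo.orders
WHERE order_time > @last_loaded;
```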

2. Data cleaning and transformation (Cleaning, Transform)

In general, the data warehouse is split into two layers, ODS and DW. The usual practice is to clean on the way from the business system into the ODS, filtering out dirty and incomplete data, and to transform on the way from the ODS into the DW, computing and aggregating according to the business rules.

  1. Data cleaning

The task of data cleaning is to filter out the data that does not meet the requirements and hand the filtered results to the responsible business department, which confirms whether each record should be discarded or corrected by the business unit before it is extracted again.

Data that does not meet the requirements falls mainly into three categories: incomplete data, wrong data, and duplicate data.

(1) Incomplete data: data that is missing information it should have, such as a missing supplier name, branch name, or customer region, or master-table records in the business system that cannot be matched to their detail records. This kind of data is filtered out and written, grouped by what is missing, into separate Excel files submitted to the client, who is asked to complete it within a set deadline. Only after it has been completed is it written into the data warehouse.

(2) Wrong data: these errors arise because the business system is not robust enough and writes input straight to the back-end database without validation, for example numeric values typed as full-width digit characters, a stray carriage return at the end of a string, incorrect date formats, or out-of-range dates. This data also has to be classified. Problems like full-width characters or invisible characters before or after the data can only be found by writing SQL statements (see the sketch after this list); the client is then asked to correct them in the business system before the data is re-extracted. Errors such as malformed or out-of-range dates will make the ETL run fail; they must be picked out of the business system's database with SQL, handed to the responsible business department with a deadline for correction, and re-extracted after the fix.

(3) Duplicate data: for this category, which shows up especially in dimension tables, export all fields of the duplicated records and have the client confirm and consolidate them.
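
As a hedged illustration of how the three categories might be probed with SQL on SQL Server (every table, column, and collation choice here is an assumption to be adapted per source):

```sql
-- (1) Incomplete data: detail rows whose master record is missing.
SELECT d.*
FROM ods.dbo.stg_order_detail AS d
LEFT JOIN ods.dbo.stg_orders AS m ON m.order_id = d.order_id
WHERE m.order_id IS NULL;

-- (2) Wrong data: full-width digits or a stray carriage return in a code field.
-- A binary collation makes the bracket range match by code point.
SELECT *
FROM ods.dbo.stg_orders
WHERE customer_id COLLATE Latin1_General_BIN LIKE N'%[０-９]%'
   OR customer_id LIKE '%' + CHAR(13) + '%';

-- (2) Wrong data: date strings that cannot be parsed, which would break the load.
SELECT *
FROM ods.dbo.stg_orders
WHERE ISDATE(order_time_text) = 0;

-- (3) Duplicate data: the same natural key appearing more than once in a dimension.
SELECT supplier_code, COUNT(*) AS cnt
FROM ods.dbo.stg_suppliers
GROUP BY supplier_code
HAVING COUNT(*) > 1;
```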

Data cleaning is an iterative process that cannot be finished in a few days; problems can only be found and fixed continuously. Whether a record is filtered out or corrected generally requires the client's confirmation. Filtered data should be written to Excel files or to a rejects table; in the early stage of ETL development, e-mailing the filtered data to the business units every day pushes them to correct the errors quickly, and it also serves as evidence when validating the data later. One caution: do not filter out useful data. Verify every filter rule carefully and have the users confirm it.

2. Data transformation

The main tasks of data transformation are converting inconsistent data, converting data granularity, and computing business rules.

(1) Inconsistent data conversion: this is an integration step that unifies data of the same type coming from different business systems. For example, the same supplier may be coded XX0001 in the settlement system but YY0001 in the CRM; after extraction both are converted to a single unified code (see the sketch after this list).

(2) Data granularity conversion: business systems generally store very fine-grained data, while the data in the warehouse is used for analysis and does not need that level of detail. The usual practice is to aggregate the business-system data to the warehouse's granularity.

(3) Business rule calculation: different enterprises have different business rules and different indicators, and some of these indicators cannot be produced by simple addition and subtraction. In those cases the indicators should be computed in the ETL and stored in the data warehouse, ready for analysis.
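
A minimal sketch combining the three tasks in one ODS-to-DW pass (the fact table, the mapping table, and the gross-margin rule are all hypothetical):

```sql
-- (1) Unify supplier codes via a project-maintained mapping table,
-- (2) aggregate transaction rows up to day x supplier granularity,
-- (3) compute a business measure (gross margin) during the load.
INSERT INTO dw.dbo.fact_supplier_day (date_key, supplier_key, sales_amt, gross_margin)
SELECT CONVERT(CHAR(8), o.order_time, 112)  AS date_key,      -- day grain: YYYYMMDD
       map.unified_supplier_code            AS supplier_key,  -- one code across systems
       SUM(o.amount)                        AS sales_amt,
       SUM(o.amount - o.cost)               AS gross_margin   -- business rule
FROM ods.dbo.stg_orders AS o
JOIN ods.dbo.supplier_code_map AS map
  ON map.source_system = o.source_system
 AND map.source_code   = o.supplier_code     -- e.g. XX0001 or YY0001
GROUP BY CONVERT(CHAR(8), o.order_time, 112),
         map.unified_supplier_code;
```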

3. ETL logging and warning sending

1. ETL logs

ETL logs fall into three categories.

The first is the execution-process log: a record written at every step of the ETL run, noting each step's start time and how many rows of data it affected, in running-account form.

The second is the error log: whenever a module fails, an entry records the time of the error, the module that failed, the error message, and so on.

The third is the overall log, which records only the ETL start time, end time, and whether the run succeeded. If an ETL tool is used, the tool generates some logs automatically, and these can also be kept as part of the ETL logs.

The purpose of logging is to know at any moment how the ETL is running and, if something fails, exactly where it failed.
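
One possible shape for such log tables, sketched with hypothetical names (a real project would add keys and indexes):

```sql
-- Execution-process log: one running-account row per step per run.
CREATE TABLE dbo.etl_step_log (
    run_id        INT,
    step_name     VARCHAR(100),
    start_time    DATETIME,
    end_time      DATETIME,
    rows_affected INT
);

-- Error log: written whenever a module fails.
CREATE TABLE dbo.etl_error_log (
    run_id     INT,
    step_name  VARCHAR(100),
    error_time DATETIME,
    error_msg  VARCHAR(4000)
);

-- Overall log: only start time, end time, and success of each run.
CREATE TABLE dbo.etl_run_log (
    run_id     INT,
    start_time DATETIME,
    end_time   DATETIME,
    succeeded  BIT
);
```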

2. Warning sending

If the ETL fails, it should not only write an error log but also send a warning to the system administrator. There are many ways to send warnings; the most common is to e-mail the system administrator with the error information attached, making it easy to track down the fault.
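
On SQL Server, assuming Database Mail is already configured (the profile name, recipient, and the step procedure below are placeholders), a step's error handler might log and mail the failure like this:

```sql
BEGIN TRY
    EXEC dbo.usp_load_orders;   -- hypothetical ETL step
END TRY
BEGIN CATCH
    -- Write the error log, then alert the administrator by e-mail.
    INSERT INTO dbo.etl_error_log (run_id, step_name, error_time, error_msg)
    VALUES (42, 'load stg_orders', GETDATE(), ERROR_MESSAGE());

    EXEC msdb.dbo.sp_send_dbmail
         @profile_name = 'etl_mail_profile',
         @recipients   = 'dba@example.com',
         @subject      = 'ETL failure alert',
         @body         = 'Step load stg_orders failed; see etl_error_log for details.';
END CATCH;
```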

ETL is a key part of a BI project, and it is a long-term process. Only by continually finding and fixing problems can the ETL run more efficiently and supply accurate, efficient data to the later stages of BI development.

Postscript

ETL is a key link in building a data warehouse system. Put grandly, ETL is a data integration solution; put plainly, it is a tool for moving data around. Looking back over my working years, I have actually handled quite a few data migration and transformation jobs, but those were basically one-off tasks or involved very small data volumes. In a data warehouse system, however, ETL rises to a certain theoretical level and is no longer the small-scale tool work it used to be. What exactly is different can be seen from the name itself: the process of moving data has been split into three steps, with E, T, and L standing for extraction, transformation, and loading.

In essence, the ETL process is simply the flow of data from various sources to various targets. In a data warehouse, though, ETL has a few distinguishing characteristics.

First, data synchronization: it is not a one-shot dump that is finished once the data has been moved; it is a recurring activity that runs on a fixed schedule, and some people now even speak of real-time ETL.

Second, data volume: it is generally huge, large enough to make it worth splitting the flow of data into E, T, and L.

There are now many mature tools that provide ETL functionality; their quality aside, from an application standpoint the ETL process itself is not terribly complex. These tools bring great convenience to data warehouse projects, especially convenience in development and maintenance. On the other hand, developers can easily get lost in them. Take an example: VB is a very simple language and a very easy programming tool, extremely quick to pick up, yet how many true VB masters are there? Microsoft usually designs its products on the principle of "treat the user as a fool"; under that principle Microsoft's products are indeed very easy to use, but if you, the developer, also treat yourself as a fool, then you really are one. ETL tools are the same: they give us a graphical interface so we can concentrate on the rules, in the hope of raising development efficiency. In practice these tools do let you build a job to process some piece of data very quickly, but viewed as a whole they do not necessarily make the overall effort much more efficient. The problem lies mainly not with the tools but with the designers and developers, who get lost in the tools and never explore the essence of ETL. These tools have been used for so long, in so many projects and environments, that they must have their merits and must embody that essence. If we do not look past the surface-level use of the tools to the thinking behind them, what we end up building is a pile of independent jobs, and integrating them still takes enormous work. As everyone knows, theory and practice must be combined: to go beyond the ordinary in a field, you must reach a certain theoretical level.
