A question set for ETL developers

Original link: https://my.oschina.net/u/1186503/blog/1633715

ETL explained (in great detail!)

 

ETL is the process of extracting data from business systems, cleaning and transforming it, and loading it into the data warehouse. Its goal is to integrate data that is scattered across the enterprise, messy, and inconsistently standardized, and to provide an analytical basis for decision making. ETL is an important part of any BI project. Typically, ETL consumes at least 1/3 of the total time in a BI project, and the quality of the ETL design directly determines the project's success or failure.

  ETL design falls into three parts: data extraction, data cleaning and transformation, and data loading. We will lay out the ETL design along these same three parts. Data extraction pulls data from the various data sources into the ODS (Operational Data Store); some cleaning and conversion may already be done during this step. During extraction, a suitable extraction method must be chosen for each source so that the ETL runs as efficiently as possible. Of the three parts, "T" (Transform: cleaning and conversion) takes the longest; in general it accounts for about 2/3 of the total ETL workload. Data loading is generally performed right after cleaning, writing the data directly into the DW (Data Warehouse).

  There are several ways to implement ETL; three are common. One is to use an ETL tool (such as Oracle's OWB, SQL Server 2000's DTS, SQL Server 2005's SSIS service, Informatica, etc.); another is hand-written SQL; the third combines ETL tools with SQL. The first two each have their own pros and cons: tools let you build an ETL project quickly and shield you from complex coding work, which raises speed and lowers difficulty, but they lack flexibility. The SQL approach is flexible and makes the ETL run efficiently, but the coding is complex and the technical bar is relatively high. The third approach combines the advantages of the first two and greatly improves both development speed and ETL efficiency.

  I. Data extraction (Extract)

  A great deal of this work must be done in the research phase. First find out: from how many business systems does the data come? What DBMS does each business system's database server run? Is there manually maintained data, and how much? Is there unstructured data? And so on. Once this information has been collected, the data extraction can be designed.

  1. Handling data sources stored in the same DBMS as the DW

  This type of data source is relatively easy to design for. Under normal circumstances, the DBMS (SQL Server, Oracle) provides a database link feature: establish a direct link between the DW database server and the original business system, and the data can be accessed simply by writing Select statements.

  2. Handling data sources stored in a different DBMS from the DW

  For this type of data source, a database link can usually still be established through ODBC, for example between SQL Server and Oracle. If a database link cannot be established, there are two alternatives. One is to export the source data to a .txt or .xls file with a tool and then import that file into the ODS. The other is to go through a programming interface.

  3. For file-type data sources (.txt, .xls), business staff can be trained to use database utilities to import the data into a database, from which it is then extracted. Alternatively, this can also be accomplished with tools.
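As a minimal sketch of that staging step, the snippet below loads a delimited text export into a database table so it can then be extracted like any other relational source. It uses Python's stdlib `csv` and `sqlite3`; the table and column names are purely illustrative.

```python
import csv
import io
import sqlite3

def load_text_source(conn, table, text, delimiter="\t"):
    """Read delimited text (first row = header) and bulk-insert it into `table`.

    Returns the number of data rows loaded.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # SQLite accepts typeless column declarations, which suits a raw staging table.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(header)})")
    rows = list(reader)
    marks = ", ".join("?" for _ in header)
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
sample = "vendor_id\tvendor_name\nXX0001\tAcme\nXX0002\tGlobex\n"
n = load_text_source(conn, "stg_vendor", sample)
```

In a real project the file would come from disk and the staging database would be the ODS, but the shape of the step is the same: parse, stage, then extract with ordinary SQL.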

  4. Incremental updates

  For systems with large data volumes, incremental extraction must be considered. Under normal circumstances, the business system records the time at which each transaction occurs, and we can use that as the incremental marker: before each extraction, first determine the maximum recorded time in the ODS, then fetch from the business system all records whose time is greater than that. Use the business system's timestamp; note that in practice some business systems have no timestamp at all, or only some of their tables do.
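The high-water-mark approach just described can be sketched as follows. This is an illustrative example with made-up table names (`orders`, `ods_orders`) and an in-memory SQLite database standing in for both the business system and the ODS.

```python
import sqlite3

def extract_increment(src, ods):
    """Pull only source rows newer than the ODS high-water mark."""
    # 1. Find the maximum timestamp already loaded into the ODS
    #    (empty string if the ODS table is empty).
    last = ods.execute(
        "SELECT COALESCE(MAX(txn_time), '') FROM ods_orders").fetchone()[0]
    # 2. Fetch only the records newer than that mark.
    rows = src.execute(
        "SELECT id, amount, txn_time FROM orders WHERE txn_time > ?",
        (last,)).fetchall()
    ods.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)", rows)
    ods.commit()
    return len(rows)

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id, amount, txn_time)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, "2020-01-01"), (2, 20, "2020-01-02")])

ods = sqlite3.connect(":memory:")
ods.execute("CREATE TABLE ods_orders (id, amount, txn_time)")

first = extract_increment(src, ods)   # initial run: loads both rows
src.execute("INSERT INTO orders VALUES (3, 30, '2020-01-03')")
second = extract_increment(src, ods)  # next run: loads only the new row
```

The second run touches only the one record added since the first, which is the whole point of incremental extraction.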

II. Data cleaning and transformation (Cleaning, Transform)

  In general, the data warehouse is divided into two layers, ODS and DW. The usual practice is to clean the data on the way from the business system into the ODS, filtering out dirty and incomplete data, and to perform aggregation and business-rule calculations on the way from the ODS into the DW.

  1. Data cleaning

  The task of data cleaning is to filter out data that does not meet requirements, hand the filtered results to the responsible business departments, and let them either confirm that it should be discarded or have the business units correct it so it can be extracted again.

Data that does not meet requirements falls mainly into three categories: incomplete data, erroneous data, and duplicate data.

  (1) Incomplete data: this category is mainly records with missing information that should be present, such as a missing vendor name, branch name, or customer region, or detail records in the business system that cannot be matched to their master records. This kind of data is filtered out, the missing content is written into Excel files and submitted to the respective customers, who are required to complete it within a stipulated time. Only after completion is it written into the data warehouse.

  (2) Erroneous data: this category arises because the business system is not robust enough; input is written straight into the backend database without validation. Examples are numeric fields entered as full-width characters, a carriage return typed after a string, and incorrect or out-of-range dates. This type of data must also be categorized. Problems like full-width characters and invisible characters before or after the data can only be found by writing SQL statements, after which the customer is asked to correct them in the business system before re-extraction. Incorrect or out-of-range dates can cause the ETL run to fail; this kind of error must be picked out of the business system's database with SQL, handed to the responsible department with a deadline for correction, and extracted again after it is fixed.
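Two of the error types mentioned above, full-width digits in numeric fields and malformed dates, can be detected with short validation checks like the sketch below (the function names and the date format are illustrative choices, not prescribed by any tool).

```python
import datetime
import unicodedata

def has_fullwidth_digits(s):
    """True if the string contains any full-width digit (U+FF10..U+FF19),
    a common symptom of numeric fields typed in a CJK input mode."""
    return any(unicodedata.name(ch, "").startswith("FULLWIDTH DIGIT")
               for ch in s)

def is_valid_date(s, fmt="%Y-%m-%d"):
    """True if the string parses as a real calendar date in the given format."""
    try:
        datetime.datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False
```

Rows failing either check would be written to the rejected-data report for the business side to correct, as the text describes.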

  (3) Duplicate data: for this type of data, which appears especially in dimension tables, export all the fields of the duplicated records and let the customer confirm and clean them up.
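A minimal sketch of that duplicate report: collect the records where every field repeats, so they can be exported for the business side to review. The sample dimension rows are invented for illustration.

```python
from collections import Counter

def find_full_duplicates(rows):
    """Return one copy of each record that appears more than once
    (duplicate on ALL fields, as described for dimension tables)."""
    counts = Counter(tuple(r) for r in rows)
    return [list(r) for r, c in counts.items() if c > 1]

dim = [["YY0001", "Acme"],
       ["YY0002", "Globex"],
       ["YY0001", "Acme"]]   # full duplicate of the first row
dups = find_full_duplicates(dim)
```

In SQL the equivalent is a `GROUP BY` over all columns with `HAVING COUNT(*) > 1`.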

  Data cleansing is an iterative process that cannot be completed in a few days; problems can only be discovered and solved continuously. Whether to filter and whether to correct generally requires the customer's confirmation. Filtered-out data can be written to Excel files or to a filtered-data table; in the early stages of ETL development, the filtered data can be emailed to the business units every day, prompting them to correct the errors as soon as possible, and it can also serve as a basis for verifying data later. One caution: do not filter out useful data. Verify every filtering rule carefully and have the users confirm it.

  2. Data transformation

  The main tasks of data transformation are converting inconsistent data, converting data granularity, and calculating some business rules.

  (1) Converting inconsistent data: this is an integration process that unifies the same type of data from different business systems. For example, the same vendor's code in the billing system is XX0001 while its code in the CRM is YY0001; after extraction, both are converted into one unified code.
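One common way to implement this unification is a cross-reference map from (source system, source code) to the warehouse's unified code. The map below is hypothetical; `V0001` is an invented unified code for the XX0001/YY0001 vendor from the example above.

```python
# Hypothetical cross-system code map, normally maintained as a mapping table.
CODE_MAP = {
    ("billing", "XX0001"): "V0001",
    ("crm",     "YY0001"): "V0001",
}

def unify_code(system, code):
    """Translate a source-system code into the unified warehouse code.

    Raises ValueError on unmapped codes so they surface as cleaning work
    instead of silently loading inconsistent keys.
    """
    try:
        return CODE_MAP[(system, code)]
    except KeyError:
        raise ValueError(f"unmapped code {code!r} from system {system!r}")
```

Failing loudly on an unmapped code is a deliberate choice: an unknown vendor code is exactly the kind of record that should go back to the business side for confirmation.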

  (2) Converting data granularity: business systems generally store very detailed data, while data in the data warehouse is used for analysis and does not need to be that fine-grained. Under normal circumstances, the business system's data is aggregated to the data warehouse's granularity.
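As a sketch of granularity conversion, the snippet below rolls transaction-level rows up to one row per (day, product), a grain chosen here purely for illustration; in SQL this would be a `GROUP BY day, product` with `SUM(amount)`.

```python
from collections import defaultdict

def rollup_daily(rows):
    """Aggregate (day, product, amount) detail rows to daily totals
    per product, the assumed grain of the warehouse fact table."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        totals[(day, product)] += amount
    return sorted((d, p, a) for (d, p), a in totals.items())

detail = [("2020-01-01", "A", 10.0),
          ("2020-01-01", "A", 5.0),
          ("2020-01-01", "B", 7.0)]
daily = rollup_daily(detail)
```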

  (3) Calculating business rules: different enterprises have different business rules and different data indicators, which sometimes cannot be obtained by simple arithmetic alone. In such cases, the indicator values must be computed during the ETL and stored in the data warehouse for analysis.
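A derived indicator of this kind might look like the following sketch: a gross-margin rate computed during ETL and stored alongside the base facts. The indicator name and its rounding are assumptions for the example, not a rule from the text.

```python
def gross_margin_rate(revenue, cost):
    """Derived business indicator: (revenue - cost) / revenue.

    Returns 0.0 when revenue is zero so the ETL does not abort
    on divide-by-zero for empty periods.
    """
    if revenue == 0:
        return 0.0
    return round((revenue - cost) / revenue, 4)
```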

III. ETL logs and warning notifications

  1. ETL logs

  ETL logs fall into three categories.

The first is the execution log, recorded while the ETL steps execute: it records the start time of each step of each run and the number of data rows affected, in the style of a running account.

The second is the error log: when a module fails, the error log records the time of each error, the module that failed, and the error message.

The third category is the overall log, which records only the ETL start time, end time, and whether it succeeded. If an ETL tool is used, the tool automatically generates some logs, and these can also serve as part of the ETL log.

The purpose of logging is to know at any time how the ETL is running and, if something goes wrong, to know exactly what went wrong.
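One simple way to capture the execution and error logs described above is a decorator around each ETL step. This is a minimal sketch: the logs live in in-memory lists here, whereas a real implementation would write them to log tables or files.

```python
import datetime
import functools

EXEC_LOG = []   # running account: step name, start time, rows affected
ERROR_LOG = []  # failures: step name, start time, error message

def logged_step(fn):
    """Record each step's start time and row count; on failure,
    record the error and re-raise so the run can be aborted."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = datetime.datetime.now().isoformat()
        try:
            rows = fn(*args, **kwargs)
            EXEC_LOG.append({"step": fn.__name__, "start": start,
                             "rows": rows})
            return rows
        except Exception as exc:
            ERROR_LOG.append({"step": fn.__name__, "start": start,
                              "error": str(exc)})
            raise
    return wrapper

@logged_step
def extract_orders():
    return 42  # pretend 42 rows were extracted

extract_orders()
```

The overall log (start time, end time, success flag) would be written once by the driver that calls the decorated steps in sequence.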

  2. Sending warnings

  If the ETL fails, it should not only write to the ETL error log but also send a warning to the system administrator. There are various ways to send a warning; a common one is to email the system administrator, attaching the error information so the administrator can troubleshoot.
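A sketch of building that warning mail with the stdlib `email` package is below. Actually sending it (for example via `smtplib.SMTP`) is deliberately left out so the example stays self-contained; the recipient address is a placeholder.

```python
from email.message import EmailMessage

def build_etl_alert(step, error, to_addr="etl-admin@example.com"):
    """Build the warning email for a failed ETL step,
    including the error information for troubleshooting."""
    msg = EmailMessage()
    msg["Subject"] = f"ETL failure in step: {step}"
    msg["To"] = to_addr
    msg.set_content(f"Step {step} failed during the ETL run.\n"
                    f"Error: {error}")
    return msg

alert = build_etl_alert("extract_orders", "timestamp out of range")
```

In a real deployment this would be called from the error branch of the logging wrapper and handed to an SMTP client.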

  ETL is a key part of a BI project, and it is a long-running process; only by continuously finding and solving problems can the ETL run ever more efficiently and provide accurate, timely data for the later stages of the BI project.

Postscript

     For a data warehouse system, ETL is the key link. Writ large, ETL is a data integration solution; writ small, it is a tool for shoveling data around. Looking back over my years of work, data migration and conversion jobs have come up a lot, but those were mostly one-off jobs, or involved only very small amounts of data. In a data warehouse system, however, ETL rises to a certain theoretical level and differs from the old habit of chipping away with whatever tool was at hand. The difference shows in the name itself: the process of shoveling data has been divided into three steps, E, T, and L, standing for extract, transform, and load.

In fact, the ETL process is simply the flow of data: data moves from various sources to various destinations. In a data warehouse, however,

ETL has several distinctive characteristics.

First, data synchronization: it is not a one-time dump of the data; it is a regular activity that runs on a fixed cycle, and some people have even proposed the concept of real-time ETL.

Second, data volume: it is generally huge, which makes it worthwhile to split the data flow into the E, T, and L stages.

    There are many mature tools that provide ETL functionality, and I won't debate their relative merits here. From an application perspective, the ETL process is actually not very complicated. These tools bring great convenience to data warehouse projects, especially convenience in development and ease of maintenance. On the other hand, it is easy for developers to get lost in these tools. For example, VB is a very simple language and a very easy-to-use programming tool, and it is especially quick to pick up, but how many true VB masters are there? Microsoft's products are usually designed on the principle of "treat the user as a fool"; under this principle, Microsoft's things are indeed extremely easy to use, but for a developer, treating yourself as a fool really does make you one. ETL tools are the same: they give us a graphical interface so that we can focus our main energy on the rules and thereby improve development efficiency. In terms of results, these tools do let us build a job very quickly to process some piece of data, but on the whole, the overall efficiency is not necessarily much higher. The main problem lies not with the tools but with design and with the developers, who get lost in the tool and never explore the essence of ETL. Tools that have been applied for so long, in so many projects and environments, must have something behind their success, and must reflect the essence of ETL. If we do not see the ideas implied behind these tools and only skim their surface, what we end up with is a pile of isolated jobs, and integrating them will still be a huge amount of work. As the saying goes, theory guides practice: to excel in any field, you must reach a certain height at the theoretical level.



Origin: blog.csdn.net/choy9999/article/details/100591142