Pentaho Kettle

By most estimates, collecting and preparing data accounts for roughly 90% of the total workload in a data analysis project, while modeling takes less than 10%. ETL is therefore a critical link in the data processing pipeline. ETL engineers make up a large share of data warehouse positions, and the pay is competitive. Moving directly from a general IT role to data analyst can be difficult; becoming a data warehouse/ETL engineer first and then looking for opportunities to move up is one reasonable path. Among ETL tools, the open source Kettle is the most widely used: it is completely free, and its functionality and performance are not inferior to commercial ETL software such as DataStage. Used together with other open source data platform software, such as a MySQL cluster or a Hadoop cluster, Kettle makes for a very cost-effective architecture. This course systematically explains Kettle and its internals.

Course Introduction
ETL (Extract, Transform, Load) tools are essential for building data warehouses and for data integration. There are a variety of commercial ETL tools on the market, such as Informatica and DataStage, but few practical open source options; Kettle is one of the few open source ETL tools. This course explains the basic use of Kettle and its secondary development, drawing on real project cases to show how Kettle is applied in practice and what problems can arise. Given the current prominence of big data, the course also describes how Kettle supports big data technologies such as Hadoop, HBase, MongoDB, and MapReduce. Beyond day-to-day use, the later classes cover the secondary development of Kettle: a guide to reading the Kettle code, the Kettle API and how to use it, and how to develop Kettle plugins.
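As a taste of the API material covered later in the course, here is a minimal sketch of running a transformation from Java, assuming a Kettle 5.x-style API on the classpath; the file name demo.ktr is a placeholder for your own transformation:

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunDemoTransformation {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle runtime (plugin registry, logging, ...)
            KettleEnvironment.init();

            // Load the transformation definition ("demo.ktr" is a placeholder)
            TransMeta transMeta = new TransMeta("demo.ktr");

            // Execute it and wait for completion
            Trans trans = new Trans(transMeta);
            trans.execute(null); // no extra command-line arguments
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                System.err.println("Transformation finished with errors.");
            }
        }
    }

Jobs follow the same pattern through JobMeta and Job.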

Course Content
Week 1: The concept of ETL; the concept, functions, and operation of Kettle
Week 2: The Kettle repository, logs, and execution modes
Week 3: Input Steps (Table Input, Text File Input, XML File Input...)
Week 4: Output Steps (Table Output, Update, Delete, Text File Output, XML File Output...)
Week 5: Transformation steps (filtering, string processing, splitting fields, calculators...)
Week 6: Transformation steps (field selection, sorting, adding check columns, removing duplicate records...)
Week 7: Application steps and flow steps (process file, execute a program, send mail, do nothing (no-op), block step, abort...)
Week 8: Query steps and join steps (database lookup, stream lookup, merge records, record-set join, Cartesian product...)
Week 9: Script steps (JavaScript, Java Class, regular expressions...)
Week 10: Job entries (copy, move, FTP, SFTP...)
Week 11: Kettle parameters and variables; Kettle clusters
Week 12: Compiling the Kettle code, code structure, application integration, and the various configuration files
Week 13: Plugin development: steps and job entries (a minimal step-plugin sketch appears at the end of this page)
Week 15: Big data plugins (Hadoop file input/output, HBase input/output, MapReduce input/output, MongoDB input/output)

Target Group
1. ETL engineers and Java development engineers
2. DBAs who often do data processing
3. Students with some database and Java background

Course Expectations
1. Understand the basic functions of the Kettle software.
2. Be able to use Kettle to complete basic data processing work.
3. Understand some of Kettle's advanced functions.
4. For students with Java development experience: gain some understanding of the Kettle code structure and be able to develop basic Kettle plugins in Java.
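
To preview the week 13 plugin material, below is a minimal sketch of the core of a Kettle step plugin: the processRow() method of a class extending BaseStep. It assumes a Kettle 5.x-style API; the class name MyStep is illustrative, and a complete plugin would also need the companion meta, data, and dialog classes plus plugin registration:

    import org.pentaho.di.core.exception.KettleException;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;
    import org.pentaho.di.trans.step.*;

    // A pass-through step: reads each incoming row and forwards it unchanged.
    public class MyStep extends BaseStep implements StepInterface {

        public MyStep(StepMeta stepMeta, StepDataInterface stepDataInterface,
                      int copyNr, TransMeta transMeta, Trans trans) {
            super(stepMeta, stepDataInterface, copyNr, transMeta, trans);
        }

        public boolean processRow(StepMetaInterface smi, StepDataInterface sdi)
                throws KettleException {
            Object[] row = getRow();  // read one row from the input stream
            if (row == null) {        // no more input rows
                setOutputDone();      // tell downstream steps we are finished
                return false;         // Kettle stops calling processRow()
            }
            if (first) {              // 'first' is a BaseStep flag; a real step
                first = false;        // would prepare its output row metadata here
            }
            putRow(getInputRowMeta(), row); // forward the row to the next step
            return true;                    // more rows may follow
        }
    }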
