A magic book on Kettle

   With the help of my friend, I found a book on Kettle called <<Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration>>. It is really a nice book which helps me to know more about Kettle.
   

    

   On 2012-12-18, I began to read this book, from page 1 to page 44, and I got to know the history of Kettle and the relation between OLTP systems and the data warehouse. Because the English is difficult for me, I have to read very carefully.

20130106

  TOPIC 1 Agile BI
1) ETL Design

2) Data Acquisition

3) Beware of Spreadsheets

4) Design for failure

  Kettle contains many features that help you design for failure. You can:
• Test a repository connection.
• Ping a host to check whether it’s available.
• Wait for a SQL command to return success/failure based on a row count condition.
• Check for empty folders.
• Check for the existence of a file, table, or column.
• Compare files or folders.
• Set a timeout on FTP and SSH connections.
• Create failure/success outputs on every available job step.

5) Change data capture

6) Data Quality

2013-1-16
   Today I tried to study the Kettle components. Kettle is very powerful, with the following building blocks. Although it is a little difficult to develop the ETL jobs at the beginning, it is much easier to maintain the ETL jobs in the end, so it is a nice tool.
    The Building Blocks of Kettle Design
This section introduces and explains some of the Kettle-specific terminology.

Transformations
A transformation is the workhorse of your ETL solution. It handles the manipulation of rows of data in the broadest possible meaning of the extraction, transformation, and loading acronym.
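To make this concrete, here is a minimal sketch of running a transformation from Java with the Kettle API (an untested sketch; example.ktr is a placeholder file name, class names as in recent PDI versions):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunTransformation {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();                      // initialize the Kettle runtime and plugins
            TransMeta meta = new TransMeta("example.ktr"); // parse the transformation definition
            Trans trans = new Trans(meta);
            trans.execute(null);                           // launches one thread per step
            trans.waitUntilFinished();                     // block until all steps are done
            if (trans.getErrors() > 0) {
                System.err.println("The transformation finished with errors.");
            }
        }
    }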

Steps
A step is a core building block in a transformation. It is graphically represented in the form of an icon, and it typically reads, transforms, or writes rows of data.

Transformation Hops
A hop, represented by an arrow between two steps, defines the data path between the steps. The hop also represents a row buffer called a row set between two steps.

Parallelism
The simple rules enforced by the hops allow steps to be executed in parallel, each in a separate thread.

Rows of Data
The data that passes from step to step over a hop comes in the form of rows of data. A row is a collection of zero or more fields that can contain data in any of the following data types (a small sketch follows this list):
• String: Any type of character data without any particular limit.
• Number: A double precision floating point number.
• Integer: A signed long integer (64-bit).
• BigNumber: A number with arbitrary (unlimited) precision.
• Date: A date-time value with millisecond precision.
• Boolean: A Boolean value can contain true or false.
• Binary: Binary fields can contain images, sounds, videos, and other types of binary data.
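As a small sketch of how this looks in the Java API (class names as in recent PDI versions; the field names are made up), a row travels as a plain Object[] while a separate metadata object describes its fields:

    import org.pentaho.di.core.row.RowMeta;
    import org.pentaho.di.core.row.RowMetaInterface;
    import org.pentaho.di.core.row.value.ValueMetaInteger;
    import org.pentaho.di.core.row.value.ValueMetaString;

    public class RowExample {
        public static void main(String[] args) throws Exception {
            // Metadata describing the fields of the row
            RowMetaInterface rowMeta = new RowMeta();
            rowMeta.addValueMeta(new ValueMetaString("name"));  // String field
            rowMeta.addValueMeta(new ValueMetaInteger("age"));  // Integer field, held as java.lang.Long

            // The row itself is just an array of objects; data and metadata travel separately
            Object[] row = new Object[] { "Alice", 42L };
            System.out.println(rowMeta.getString(row, 0));      // prints: Alice
            System.out.println(rowMeta.getInteger(row, 1));     // prints: 42
        }
    }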

Data Conversion
Data conversion happens either explicitly, for example in a Select Values step, or implicitly, for example when writing string data to a numeric database column.
Jobs
A job consists of one or more job entries that are executed in a certain order. The order of execution is determined by the job hops between job entries as well as the result of the execution itself.
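Analogous to the transformation example above, a minimal sketch of running a job from Java (again a placeholder file name, and no repository is used):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunJob {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();
            JobMeta jobMeta = new JobMeta("example.kjb", null); // second argument: repository (none here)
            Job job = new Job(null, jobMeta);                   // first argument: repository (none here)
            job.start();                                        // job entries execute in sequence
            job.waitUntilFinished();
            if (job.getErrors() > 0) {
                System.err.println("The job finished with errors.");
            }
        }
    }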

Job Entries
A job entry is a core building block of a job. Like a step, it is also graphically represented in the form of an icon. However, if you look a bit closer, you see that job entries differ in a number of ways: they do not stream rows between each other; instead, each entry passes a result object to the next one, and by default job entries execute in sequence rather than in parallel.
Job Hops
A job hop links two job entries and specifies whether the next entry runs unconditionally, only on success, or only on failure of the previous one.

Multiple Paths and Backtracking
When several hops leave the same job entry, multiple execution paths are possible; Kettle follows one path to its end and then backtracks to execute the remaining ones.

Job Entry Results
The result of a job entry (success or failure, together with any result rows and files) is passed along, and it is what the conditional job hops evaluate.
   Tools and Utilities
Kettle contains a number of tools and utilities that help you in various ways and in various stages of your ETL project. The core tools of the Kettle software stack include:
• Spoon: A graphical user interface that will allow you to quickly design and manage complex ETL workloads.
• Kitchen: A command-line tool that allows you to run jobs.
• Pan: A command-line tool that allows you to run transformations.
• Carte: A lightweight (around 1MB) web server that enables remote execution of transformations and jobs. A Carte instance also represents a slave server, a key part of Kettle clustering (MPP).
Chapter 3 provides more detailed information on these tools; a few typical invocations are sketched below.
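For a quick taste, typical invocations look roughly like this on Linux (file names and port are placeholders; Windows ships matching .bat scripts):

    ./kitchen.sh -file=/path/to/example.kjb -level=Basic   # run a job with Kitchen
    ./pan.sh -file=/path/to/example.ktr                    # run a transformation with Pan
    ./carte.sh localhost 8081                              # start a Carte slave server on port 8081
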
Repositories
    When you are faced with larger ETL projects with many ETL developers working together, it’s important to have facilities in place that enable cooperation. Kettle provides a way of defining repository types in a pluggable and flexible way.
• Database repository: stores the ETL metadata in the tables of a relational database.
• Pentaho repository: a plugin (part of the Pentaho Enterprise Edition) that stores the ETL metadata in a central content repository.
• File repository: uses a folder on a file system (any VFS location) as a repository.
Depending on the repository type, features such as the following are available:
• Central storage: the whole team shares one central store of ETL metadata.
• File locking: prevents two developers from changing the same object at the same time.
• Revision management: keeps a history of changes so that earlier revisions can be recovered and compared.
• Referential integrity checking: verifies the references between repository objects.
• Security: authenticates users and prevents unauthorized changes.
• Referencing: renaming or moving objects keeps the references to them intact.

Virtual File Systems
   Flexible and uniform file handling is very important to any ETL tool. That is why Kettle supports the specification of files in the broadest sense as URLs. The Apache Commons VFS back end that was put in place will then take care of the complexity for you. For example, with Apache VFS, it is possible to process a selection of files inside a .zip archive in exactly the same way as you would process a list of files in a local folder. For more information on how to specify VFS files, visit the Apache VFS website at http://commons.apache.org/vfs/.
Table 2-5 in the book shows a few typical examples, along the lines of the sketch below.
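The table itself is not copied into these notes, but following the Apache VFS conventions, typical file specifications look like this (hosts and paths are made up):

    Local file:             /home/user/input.txt  (or file:///home/user/input.txt)
    File inside a ZIP:      zip:file:///home/user/archive.zip!/input.txt
    Gzip-compressed file:   gz:file:///home/user/input.txt.gz
    File on an FTP server:  ftp://user:password@ftp.example.com/input.txt
    File over HTTP:         http://www.example.com/input.txt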







Reposted from flyqantas.iteye.com/blog/1749792