Kettle tuning (large data volumes with blocking steps)

I. Overview

        In a previous post, https://blog.csdn.net/qq_35995514/article/details/106856885 , I described the design of a Kettle synchronization program. The online test failed: the transformation ran until it reached the Text file input step and then got stuck.

 

II. Kettle step communication

An important parameter in Kettle tuning is the number of rows in the rowset (the "Nr of rows in rowset" setting in the transformation properties).
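If the transformation is launched from Java code rather than from Spoon, the same setting can also be changed programmatically. The following is only a minimal sketch: it assumes the PDI kettle-engine library is on the classpath, my_sync.ktr and the 1,000,000 value are placeholders, and method names may differ slightly between PDI versions.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RowsetSizeExample {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // "my_sync.ktr" is a placeholder for the synchronization transformation.
            TransMeta meta = new TransMeta("my_sync.ktr");
            System.out.println("current rowset size: " + meta.getSizeRowset());

            // Enlarge the rowset so a blocking step can buffer the whole data set.
            meta.setSizeRowset(1_000_000);

            Trans trans = new Trans(meta);
            trans.execute(null);
            trans.waitUntilFinished();
        }
    }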

This number is the size of the cache queue through which steps communicate. Internally, Kettle implements this cache with a list of RowSet objects: each hop between two steps is backed by a RowSet queue, and every step holds its incoming and outgoing hops as a List<RowSet>. The source step writes one row at a time into the queue, and the target step reads one row at a time out of it.

Steps in a Kettle transformation run in parallel; data flows through them row by row, and the queue size defaults to 10,000.
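To make the mechanism concrete, here is a minimal plain-Java model of a hop (not Kettle's actual classes): the bounded queue plays the role of the RowSet, the producer thread plays the source step (putRow) and the consumer thread plays the target step (getRow). Because both sides run in parallel, far more rows than the queue size can flow through.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal model of a hop between two steps: the source step writes rows one
    // by one into a bounded queue and the target step reads them one by one in
    // parallel, so the queue never has to hold the whole data set.
    public class RowSetModel {
        private static final Object[] END = new Object[0];   // end-of-data marker

        public static void main(String[] args) throws Exception {
            BlockingQueue<Object[]> rowSet = new ArrayBlockingQueue<>(10_000); // "Nr of rows in rowset"

            Thread source = new Thread(() -> {
                try {
                    for (int i = 0; i < 1_000_000; i++) {
                        rowSet.put(new Object[]{i, "row-" + i});  // like putRow()
                    }
                    rowSet.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread target = new Thread(() -> {
                try {
                    long count = 0;
                    Object[] row;
                    while ((row = rowSet.take()) != END) {        // like getRow()
                        count++;
                    }
                    System.out.println("target read " + count + " rows");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            source.start();
            target.start();
            source.join();
            target.join();
        }
    }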

But when you use the blocking steps ("Block this step until steps finish" and "Blocking step"), the size needs to be enlarged according to the volume of business data.

Otherwise rows keep piling up in the cache queue between two of the steps, and once the backlog reaches the queue size the transformation gets stuck.
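The sketch below, again plain Java rather than Kettle's classes, shows why this happens: the downstream thread behaves like "Block this step until steps finish" and refuses to read anything until the upstream thread is done, while the upstream thread cannot finish because the bounded rowset fills up first. The row counts are illustrative only.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Why a blocking step can wedge a transformation when the rowset is too small:
    // the target refuses to read until the source has finished, but the source
    // cannot finish because the bounded rowset fills up first.
    public class BlockedRowsetSketch {
        public static void main(String[] args) throws Exception {
            final int rowsetSize = 10_000;   // default "Nr of rows in rowset"
            final int totalRows  = 50_000;   // business data exceeds the rowset
            BlockingQueue<Object[]> rowSet = new ArrayBlockingQueue<>(rowsetSize);

            Thread source = new Thread(() -> {
                try {
                    for (int i = 0; i < totalRows; i++) {
                        // like putRow(); give up after a while so the demo terminates
                        if (!rowSet.offer(new Object[]{i}, 2, TimeUnit.SECONDS)) {
                            System.out.println("stuck at row " + i
                                    + ": rowset is full and the target step is not reading");
                            return;
                        }
                    }
                    System.out.println("source finished");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            source.start();
            source.join();   // the "blocking" target waits for the source to finish first
            System.out.println("rows buffered in the rowset: " + rowSet.size());
            // With rowsetSize >= totalRows (e.g. 100_000) the source finishes and the
            // join() returns normally, which is what enlarging the setting buys you.
        }
    }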

Setting a large cache queue size does not burden the job by itself. Looking at the source code, the cache queue is made up of each step's inputRowSet and outputRowSet lists, whose underlying implementation is ArrayList. When first created its default capacity is 10; an array of the configured length is not allocated immediately. Instead it relies on ArrayList's behavior: when the number of elements to be added exceeds the capacity of the backing array, the ArrayList grows dynamically, expanding to 1.5 times its size.
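A quick way to see why a large setting costs nothing up front is to replay ArrayList's growth policy (new capacity = old + old/2, i.e. roughly 1.5x) starting from its default capacity of 10. The target size below is a hypothetical rowset value, not anything read from Kettle.

    // Simulates ArrayList's growth policy to show how many re-allocations happen
    // before a configured rowset size is reached; nothing is pre-allocated.
    public class ArrayListGrowthSketch {
        public static void main(String[] args) {
            int capacity = 10;          // ArrayList's default capacity
            int target = 1_000_000;     // hypothetical "Nr of rows in rowset"
            int resizes = 0;
            while (capacity < target) {
                capacity = capacity + (capacity >> 1); // grow() expands by ~1.5x
                resizes++;
            }
            System.out.println("reached capacity " + capacity + " after " + resizes + " resizes");
        }
    }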

 

Open question: given that the data volume will be fairly large and the backlog may grow further, should the Kettle program's data processing flow be optimized, or should the ETL be implemented in code instead?


Origin blog.csdn.net/qq_35995514/article/details/111320268