KafkaStreams study notes-01

Reference book: Kafka Streams in Action. Because the project involves distributed stream data processing, I bought a copy at the bookstore.

Chapter One

Concepts

Kafka is an open-source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. It is a high-throughput, distributed publish-subscribe messaging system.
Reference: https://baijiahao.baidu.com/s?id=1651919282506404758&wfr=spider&for=pc

Kafka Streams is a Java library (lib) that depends on Kafka.
Common stream processors that are frameworks rather than libraries include Storm, Spark, etc.

A framework imposes rules: development on top of a framework follows the framework's structure and usage conventions.
A library is a collection of APIs that you can call flexibly or build on yourself.
A library is lighter-weight and more flexible.

Batch processing: handles large, static data sets in batches, triggered either by time or by some condition. For example, process every 5 s [time], or process once the accumulated data reaches 1M [condition]. Framework: Hadoop.
Stream processing: processes each event in real time. Although each new record is processed individually, many stream processing systems also support "window" operations, which let the processing reference data that arrives within a specified time interval before and/or after the current record [processing centers on the data inside the window]. Frameworks: Storm, Samza.
Micro-batch processing: the batch interval is short, but instead of being event-driven like stream processing, the data is still divided into small batches. Framework: Spark.
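
To make the "window" idea concrete, here is a minimal Kafka Streams sketch of a 5-second tumbling-window count. This is my own illustration, not from the book; the topic name "events" and the String key/value types are assumptions.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class WindowedCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-count-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Each record is still processed one at a time (stream processing),
        // but the count aggregates only records whose timestamps fall
        // inside the same 5-second window.
        KTable<Windowed<String>, Long> counts = builder.<String, String>stream("events")
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofSeconds(5)))
                .count();

        new KafkaStreams(builder.build(), props).start();
    }
}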

Lambda expressions, a new feature in JDK 8, allow a function to be passed as a method parameter.
Format:

(parameters) -> expression
or
(parameters) -> { statements; }

E.g.:

() -> 5		// takes no arguments, but returns 5
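
A small self-contained example of "a function as a method parameter", using both formats above. The class and method names here are my own, just for illustration:

import java.util.function.IntBinaryOperator;
import java.util.function.IntSupplier;

public class LambdaDemo {
    // The method receives behavior, not data: any (int, int) -> int works.
    static int apply(int a, int b, IntBinaryOperator op) {
        return op.applyAsInt(a, b);
    }

    public static void main(String[] args) {
        IntSupplier five = () -> 5;                               // takes nothing, returns 5
        System.out.println(five.getAsInt());                      // 5
        System.out.println(apply(3, 4, (x, y) -> x + y));         // expression form: 7
        System.out.println(apply(3, 4, (x, y) -> { return x * y; })); // statement form: 12
    }
}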

Hash code reference: https://blog.csdn.net/qq_21430549/article/details/52225801
Understanding the data grouping on p. 6:
When partitioning data, if I want to divide it into 3 partitions (1, 2, 3) and I have 6 records (a, b, c, d, e, f), all you can do is attach a tag to each record, e.g. a-3, b-1, c-2, d-2, e-1, f-3. This looks very much like a key-value structure, but note that 1, 2, 3 are the keys and a, b, c are the values. Since identical keys overwrite one another, the keys must all be distinct; but if every key is distinct, how is partitioning achieved? Use some algorithm to hash the key, and place entries with the same hash code into the same partition. The formula on the next page takes the key's hash value and then the remainder modulo the number of partitions to obtain the partition number. Note that the hashCode function may need to be overridden for your key type.

There is also the situation where my data lives in several different Map structures. Although the keys within each map are unique, their arrangement is probably not ordered, so partitioning by searching on the keys directly would be very inefficient. The hash code exists to improve speed: hash the key to get a hash code, and the same hash code always goes to the same partition, which is very fast. Note that the hashing rule is generally tied to the number of partitions.
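
A minimal sketch of that remainder formula. This only mirrors the shape of the idea; Kafka's real default partitioner hashes the serialized key bytes (with murmur2), so treat this as illustration:

import java.util.List;

public class PartitionSketch {
    // Same key -> same hash -> same partition. Masking the sign bit keeps
    // the result non-negative even for negative hashCode values.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        for (String key : List.of("a", "b", "c", "d", "e", "f")) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}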

Depth-first: from the starting point, keep walking forward. If you hit a dead end, back up one step and try another way, until you find the target position. This "never turn back until you hit the south wall" approach, even when it succeeds, does not necessarily find a good path, but its advantage is that fewer positions need to be remembered.
Breadth-first: from the starting point, first walk through all positions reachable in one step; if the target is not among them, walk through all positions reachable in two steps to see whether the target is there; and so on, step by step, until the target position is reached. This method is guaranteed to find a shortest path, but far more positions have to be remembered.
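
A compact sketch of both traversals on a tiny directed graph (the graph and labels are made up for illustration):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class TraversalSketch {
    static final Map<String, List<String>> GRAPH = Map.of(
            "start", List.of("a", "b"),
            "a", List.of("deadend"),
            "b", List.of("target"),
            "deadend", List.of(),
            "target", List.of());

    // DFS: go as deep as possible; a dead end is an implicit backtrack (the return).
    static boolean dfs(String node, Set<String> visited) {
        if (!visited.add(node)) return false;
        if (node.equals("target")) return true;
        for (String next : GRAPH.get(node)) {
            if (dfs(next, visited)) return true;
        }
        return false;
    }

    // BFS: explore all 1-step positions, then all 2-step positions, and so on;
    // the first time the target appears, it is at the shortest distance.
    static boolean bfs(String start) {
        Queue<String> queue = new ArrayDeque<>(List.of(start));
        Set<String> visited = new HashSet<>(List.of(start));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            if (node.equals("target")) return true;
            for (String next : GRAPH.get(node)) {
                if (visited.add(next)) queue.add(next);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("DFS finds target: " + dfs("start", new HashSet<>()));
        System.out.println("BFS finds target: " + bfs("start"));
    }
}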

Backpressure: a flow-control mechanism.
Backpressure means the producer produces only as much as the consumer needs. This is somewhat like flow control in TCP: the receiver controls its receive rate via its receive window, and through the ACK packets flowing back it controls the sender's send rate. This scheme is only valid for a cold Observable, i.e. a source that allows its rate to be reduced. For example, when two machines transfer a file, the rate can be large or small; even reduced to a few bytes per second, as long as enough time passes the transfer still completes. The opposite example is live audio/video streaming, where the whole feature becomes unusable once the rate falls below a certain value (this is like a hot Observable). [Thinking] The data stream produced by a real-time camera is clearly not a cold Observable, i.e. the rate at which the source produces data cannot be dialed down, so the backpressure mechanism does not apply in that scenario.
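
A toy illustration of the idea with a bounded queue (my own example, not from the book): the producer blocks whenever the consumer falls behind, so production is throttled to the consumption rate.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackpressureSketch {
    public static void main(String[] args) {
        // Capacity 10: once the consumer lags 10 items behind, put() blocks,
        // which is exactly "produce only as much as the consumer needs".
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(10);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    queue.put(i); // blocks when full -> backpressure on the producer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < 100; i++) {
                    Integer item = queue.take();
                    Thread.sleep(100); // the slow consumer sets the effective rate
                    System.out.println("consumed " + item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
    }
}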

Summary

Chapter One outlines how Kafka Streams works using a purchase example. The points that stayed with me after reading it: the source, the processor nodes, and the fact that the mapper returns a corresponding KStream instance when the purchase stream object is processed; this instance is the processor.

As the complete purchase object flows from the source through each node, it feels like it passes through successive layers of information filtering: each node receives only the data relevant to its own processing, and data filtered out by the mapper is not passed on to the nodes it generates downstream. In other words, the nodes downstream of the source are laid out by function; once a node's function is done there is nothing else at that node, because each node's flow direction is single and rather "narrow".
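
A minimal sketch of such a source -> processor -> sink topology. The topic names ("purchases", "masked-purchases"), the String values, and the masking logic are my own placeholders, not the book's exact code:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PurchaseTopologySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-notes-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Source node: reads the raw purchase records.
        KStream<String, String> source = builder.stream("purchases");
        // Processor node: mapValues does NOT modify the stream in place;
        // it returns a NEW KStream whose values are transformed copies.
        KStream<String, String> masked = source.mapValues(value -> value.replaceAll("[0-9]", "*"));
        // Sink node: the transformed stream is written downstream.
        masked.to("masked-purchases");

        new KafkaStreams(builder.build(), props).start();
    }
}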

Thinking
But there is some doubt here: this KStream instance is returned by the method, but which object called the method? The API shows that a KStream calls the mapValues method and then returns a new processor object. That object should then be what processes the purchase object. I think this KStream should call some method whose parameter is the purchase object and which returns a processed copy of the purchase. I don't know whether I've understood this correctly; maybe I should look at the concrete code implementation.
The principle of the mapper, i.e. how it actually works, is also still unclear to me.
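
For reference, mapValues accepts a ValueMapper. Its shape (paraphrased from the Kafka Streams API, with simplified generic names) is a single-method interface, so the "mapper" is just the function that receives each value and returns the transformed copy:

// Paraphrase of org.apache.kafka.streams.kstream.ValueMapper:
@FunctionalInterface
interface ValueMapper<V, VR> {
    VR apply(V value); // called once per record; returns the new value
}

// So kStream.mapValues(p -> p.maskCreditCard()) means: for every record,
// the runtime calls apply(purchase) on the lambda and forwards the result
// into the new KStream that mapValues returns. (maskCreditCard is an
// illustrative method name, not the book's code.)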
