Table of contents
1. Use an online JSON editor when writing JSON
2. MySQL<—>MySQL: the jdbcUrl types in the reader and writer are inconsistent
3. Using kafka—>mysql as an example: the mapping relationship between fields
4. kafka—>mysql: when dirty data enters Kafka, MySQL writes blank rows
5. kafkareader: group-offsets mode cannot read from the committed offset
This article records tips gathered while using chunjun, and notes the places where its behavior does not match the official documentation, in order to reduce the learning cost.
The easiest and quickest way to get started is to familiarize yourself with the connector parameters in the official documentation (ChunJun, dtstack.github.io).
1. Use an online JSON editor when writing JSON
Editor | JSON Crack (https://jsoncrack.com/editor) checks whether the JSON is well formed, standardizes the formatting automatically, and can render a tree diagram for visually inspecting the structure, avoiding tasks that fail to run because of JSON format problems.
An error caused by a JSON format problem looks like this: Caused by: com.google.gson.stream.MalformedJsonException: Unterminated array at line 24 column 16 path $.job.content[0].reader.parameter.[1]
Checking the JSON format first avoids such errors.
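For illustration, a made-up fragment like the one below (note the column array that is never closed before the surrounding object ends) triggers exactly this kind of MalformedJsonException; the field names are hypothetical:

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "parameter": {
            "column": [
              {"name": "id", "type": "string"},
              {"name": "name", "type": "string"}
          }
        }
      }
    ]
  }
}
```

A tool such as JSON Crack flags the missing `]` immediately, while chunjun only reports it at runtime.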
2. MySQL<—>MySQL: the jdbcUrl types in the reader and writer are inconsistent
For tasks such as MySQL<—>MySQL that require a jdbcUrl, note that the jdbcUrl types in the reader and the writer differ:
- jdbcUrl in the reader is an Array
- jdbcUrl in the writer is a String
This is inconsistent with the description in the official documentation.
If you write the task as the documentation describes, a type-matching error occurs:
Caused by: java.lang.IllegalStateException: Expected STRING but was BEGIN_ARRAY at path $.jdbcUrl
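A minimal sketch of the two shapes, assuming a typical chunjun mysql job layout (the connection URLs and table names are placeholders):

```json
{
  "reader": {
    "name": "mysqlreader",
    "parameter": {
      "connection": [
        {
          "jdbcUrl": ["jdbc:mysql://localhost:3306/source_db"],
          "table": ["source_table"]
        }
      ]
    }
  },
  "writer": {
    "name": "mysqlwriter",
    "parameter": {
      "connection": [
        {
          "jdbcUrl": "jdbc:mysql://localhost:3306/sink_db",
          "table": ["sink_table"]
        }
      ]
    }
  }
}
```

The only difference that matters here is the array brackets around the reader's jdbcUrl versus the bare string in the writer.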
3. Using kafka—>mysql as an example: the mapping relationship between fields
The kafka topic contains two kinds of records:
{"id":"1","name":"a1","A1":"0.001","A2":"0.005","A3":"100","A4":"abadc","A5":"eqerd"}
{"id":"2","name":"a2","A1":"0.001","A2":"0.005","A3":"5","A4":"abadc","A5":"eqerd"}
{"id":"3","name":"a3","A1":"0.1","A2":"0.3","A3":"20","A4":"","A5":"qerda"}
{"id":"4","name":"a4","A1":"0.00070","A2":"12.2","A3":"10","A4":null,"A5":"weaef"}
{"id":"5","name":"a5","A1":"0.1","A2":"0.3","A3":"20","A4":"adfsa","A5":"qerda"}
{"id":"6","name":"a1","A1":null,"A2":null,"A3":"100","A4":"abadc","A5":"eqerd"}
{"id":"1","name":"a1","B1":"0.1","B2":"5","B3":"GKLGU"}
{"id":"2","name":"a2","B1":"1.425","B2":"10","B3":"HJFV"}
{"id":"3","name":"a3","B1":"54.12","B2":"4325","B3":"FDGAD"}
{"id":"4","name":"a4","B1":"10.0","B2":"1","B3":null}
{"id":"5","name":"a5","B1":null,"B2":"11","B3":"SDF"}
{"id":"6","name":"a7","B1":null,"B2":null,"B3":null}
The first kind contains the fields id, name, A1, A2, A3, A4, A5.
The second kind contains the fields id, name, B1, B2, B3.
The target table contains the fields id, name, A1, A2, A3, A4, A5, B1, B2, B3.
① Experiment 1:
kafkareader: name, A1, A2, A3, A4, A5, B1, B2, B3
mysqlwriter: name, A1, A2, A3, A4, A5, B1, B2, B3
Result: the data is written normally.
② Experiment 2:
kafkareader: name, A1, A2, A3, A4, A5, B1, B2, B3
mysqlwriter: name, A2, A1, A3, A5, A4, B1, B2, B3
Result: in the target table, columns A1 and A2 are swapped, and columns A4 and A5 are swapped.
③ Experiment 3:
kafkareader: name, A2, A1, A3, A4, A5, B1, B2, B3
mysqlwriter: name, A1, A2, A3, A4, A5, B1, B2, B3
Result: columns A1 and A2 are swapped in the target table.
④ Experiment 4 (screenshot missing):
kafkareader: name, a1, a2, a3, A4, A5, b1, B2, B3
mysqlwriter: name, A1, A2, A3, A4, A5, B1, B2, B3
Result: columns A1, A2, A3, and B1 in the target table are null.
⑤ Experiment 5 (screenshot missing):
kafkareader: name, A1, A2, A3, A4, A5, B1, B2, B3
mysqlwriter: name, a1, a2, A3, A4, A5, B1, B2, B3
Result: the task fails because the columns a1 and a2 cannot be found in mysql.
Conclusions:
① When the task reaches the reader, chunjun matches the data in the topic against the fields defined in kafkareader; the field order is not constrained by the order in kafka. If a field in the reader does not appear in the kafka topic, it is assigned null (the matching is case-sensitive, per Experiment 4).
② Values are assigned positionally. If the reader declares M, L, N and the writer declares l, m, n, the mapping is M->l, L->m, N->n. The official documentation's description of this is inaccurate. Once you understand the correspondence between the data in kafka and the table fields in mysql, writing the json is quick.
③ When the task reaches the writer, chunjun writes to the mysql table according to the fields defined in mysqlwriter; the field order is not constrained by the order in mysql. If a field in the writer does not exist in mysql, an error is reported.
④ The field correspondence can be one-to-many. For example, with A1, A1, A1 in the reader and A1, A2, A3 in the writer, columns A1, A2, and A3 in the mysql table all receive the A1 value.
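The positional mapping above can be sketched as follows; this is an illustrative fragment assuming the usual chunjun column syntax, not a complete job file:

```json
{
  "reader": {
    "name": "kafkareader",
    "parameter": {
      "column": [
        {"name": "name", "type": "string"},
        {"name": "A1", "type": "string"},
        {"name": "B1", "type": "string"}
      ]
    }
  },
  "writer": {
    "name": "mysqlwriter",
    "parameter": {
      "column": [
        {"name": "name", "type": "varchar"},
        {"name": "A1", "type": "varchar"},
        {"name": "B1", "type": "varchar"}
      ]
    }
  }
}
```

The i-th reader column feeds the i-th writer column: reader names must match the json keys in kafka (case-sensitive, or the value becomes null), and writer names must exist in the mysql table (or the task fails).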
4. kafka—>mysql: when dirty data enters Kafka, MySQL writes blank rows
If any column of the target table is declared NOT NULL, the task fails outright, regardless of whether errorLimit is configured in chunjun.
Pay attention to the compliance of the data written into kafka.
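For reference, errorLimit lives in the job's setting block; this is a sketch assuming the common chunjun/FlinkX layout, and, per the behavior above, it does not prevent the failure when a target column is NOT NULL:

```json
{
  "job": {
    "setting": {
      "errorLimit": {
        "record": 100
      }
    }
  }
}
```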
5. kafkareader: group-offsets mode cannot read from the committed offset
chunjun's kafkareader supports five start modes:
- group-offsets: start consuming from the offsets committed by the specified consumer group in ZK / Kafka brokers
- earliest-offset: start at the earliest offset (if possible)
- latest-offset: start at the latest offset (if possible)
- timestamp: start at the specified timestamp for each partition
- specific-offsets: start at specified specific offsets for each partition
In group-offsets mode, when the consumer group has not yet consumed any data, the offset defaults to the sentinel value -915623761773L.
In actual testing, this mode then skips the existing data and jumps straight to the latest offset, which is equivalent to latest-offset mode. Digging into the cause:
This part of the chunjun code reuses the Flink Kafka API. In group-offsets mode, if the group's offset does not exist or is invalid, the starting offset is determined by the property auto.offset.reset, which defaults to largest.
If you want to consume the existing data, manually set auto.offset.reset to earliest.
With auto.offset.reset set to earliest:
- if a partition has a committed offset, consumption starts from that offset
- if a partition has no committed offset, consumption starts from the beginning
This covers most scenarios.
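A sketch of passing this property through the kafkareader. The consumerSettings map is the usual place for raw Kafka consumer properties in chunjun jobs, but treat the exact parameter name as an assumption and check it against your version; topic, groupId, and bootstrap.servers are placeholders:

```json
{
  "reader": {
    "name": "kafkareader",
    "parameter": {
      "topic": "my_topic",
      "groupId": "my_group",
      "mode": "group-offsets",
      "consumerSettings": {
        "bootstrap.servers": "localhost:9092",
        "auto.offset.reset": "earliest"
      }
    }
  }
}
```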