Technology Sharing | OMS Introduction

Author: Gao Peng

DBA, responsible for day-to-day project troubleshooting; this bio slot is available for long-term rental.

Source of this article: original contribution

*Produced by the Aikesheng open source community. Original content may not be used without authorization; to reprint, please contact the editor and cite the source.


Special thanks to @打生春 (北分之光), the main contributor of the OMS source code analysis behind this article.

1. A First Look at OMS

This article uses OMS Community Edition 3.3.1 as an example.

The architecture diagram, available on the official site, looks like this:

As the diagram shows, an OMS data migration deployment comprises many components, including DBCat, Store, Connector, JDBCWriter, Checker, and Supervisor. We won't restate what each component does here; the official documentation covers that well enough. Instead, let's talk about what the official site doesn't tell you.

A while back, my manager asked me to capture flame graphs during an OMS performance test to see where the time went during migration. But when I logged in to the OMS container, I found a crowd of related Java processes and couldn't tell which was which. So let's go through these processes one by one.

1. Ghana-endpoint

[ActionTech ~]# ps uax | grep Ghana-endpoint
root        61  3.1  0.5 20918816 1582384 pts/0 Sl  Feb07 1756:47 java -Dspring.config.location=/home/ds/ghana/config/application-oms.properties -server -Xms1g -Xmx2g -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/home/admin/logs/ghana/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/admin/logs/ghana -jar /home/ds/ghana/boot/Ghana-endpoint-1.0.0-executable.jar

Ghana-endpoint serves the OMS management console and schedules the TaskJob and StructTaskJob programs.

Tips:

StructTaskJob: the structure migration task scheduler

TaskJob:

  • TaskJob::scheduledTask(): schedules execution of the subtasks in the forward-switch step
  • TaskJob::scheduleMigrationProject(): initializes all steps of a structure migration project and schedules monitoring of task execution progress

2. commons-daemon (CM)

[ActionTech ~]# ps uax | grep commons-daemon
root        50  297  1.7 25997476 4711620 pts/0 Sl  Feb07 163685:09 java -cp /home/ds/cm/package/deployapp/lib/commons-daemon.jar:/home/ds/cm/package/jetty/start.jar -server -Xmx4g -Xms4g -Xmn4g -Dorg.eclipse.jetty.util.URI.charset=utf-8 -Dorg.eclipse.jetty.server.Request.maxFormContentSize=0 -Dorg.eclipse.jetty.server.Request.maxFormKeys=20000 -DSTOP.PORT=8089 -DSTOP.KEY=cm -Djetty.base=/home/ds/cm/package/deployapp org.eclipse.jetty.start.Main

The CM (cluster management) process provides the API that the OMS management console uses to create tasks such as incremental log pulling, full migration, incremental synchronization, and full verification, and to query the execution progress of those tasks.

3. oms-supervisor

[ActionTech ~]# ps uax | grep oms-supervisor
ds          63  1.0  0.3 11780820 985860 pts/0 Sl   Feb07 566:35 java -server -Xms1g -Xmx1g -Xmn512m -verbose:gc -Xloggc:./log/gc.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Dserver.port=9000 -DconfigDir=/u01/ds/supervisor/config/ -Dspring.main.allow-circular-references=true -jar ./bin/oms-supervisor.jar

The oms-supervisor process starts the worker processes (components) that perform incremental log pulling, full migration, incremental synchronization, and full verification, and monitors the status of those processes.

4. store

Incremental log pulling by store is a multi-process collaboration. If you grep for the store process, you may see the following: its root process is ./bin/store, which has a child process and several descendant processes.

  • ./bin/store: acts as a replica (slave) of the source node and receives its incremental logs
  • /u01/ds/store/store7100/bin/metadata_builder: filters, converts, writes files, and processes DDL

These processes will continuously pull the incremental logs that need to be migrated to the OMS server and store them for the incremental synchronization task.

5. Full migration and full verification

[ActionTech ~]# ps -ef | grep VEngine
UID          PID    PPID  C STIME TTY          TIME       CMD
ds         32635       1 99 11:21 pts/0    00:00:02       /opt/alibaba/java/bin/java -server -Xms8g -Xmx8g -Xmn4g -Xss512k -XX:ErrorFile=/u01/ds/bin/..//run//10.186.17.106-9000:90230:0000000016/logs/hs_err_pid%p.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/u01/ds/bin/..//run//10.186.17.106-9000:90230:0000000016/logs -verbose:gc -Xloggc:/u01/ds/bin/..//run//10.186.17.106-9000:90230:0000000016/logs/gc_%p.log -XX:+PrintGCDateStamps -XX:+PrintGCDetails -classpath lib/verification-core-1.2.54.jar:lib/verification-OB05-1.2.54.jar:lib/verification-TIDB-1.2.54.jar:lib/verification-MySQL-1.2.54.jar:lib/verification-OB-Oracle-Mode-1.2.54.jar:lib/verification-OB10-1.2.54.jar:lib/verification-Oracle-1.2.54.jar:lib/verification-DB2-1.2.54.jar:lib/verification-Sybase-1.2.54.jar:lib/verification-common-1.2.54.jar:lib/apache-log4j-extras-1.2.17.jar:lib/slf4j-log4j12-1.7.21.jar:lib/oms-conditions-3.2.3.12-SNAPSHOT.jar:lib/oms-common-3.2.3.12-SNAPSHOT.jar:lib/log4j-1.2.17.jar:lib/dws-rule-1.1.6.jar:lib/dws-schema-1.1.6.jar:lib/oms-record-3.2.3.12-SNAPSHOT.jar:lib/commons-io-2.6.jar:lib/metrics-core-4.0.2.jar:lib/connect-api-2.1.0.jar:lib/kafka-clients-2.1.0.jar:lib/slf4j-api-1.7.25.jar:lib/dss-transformer-1.0.10.jar:lib/calcite-core-1.19.0.jar:lib/dss-record-1.0.0.jar:lib/avatica-core-1.13.0.jar:lib/jackson-datatype-jsr310-2.11.1.jar:lib/jackson-databind-2.11.1.jar:lib/esri-geometry-api-2.2.0.jar:lib/jackson-dataformat-yaml-2.9.8.jar:lib/jackson-core-2.11.1.jar:lib/jackson-annotations-2.11.1.jar:lib/mysql-connector-java-5.1.47.jar:lib/oceanbase-1.2.1.jar:lib/druid-1.1.11.jar:lib/etransfer-0.0.65-SNAPSHOT.jar:lib/commons-lang3-3.9.jar:lib/aggdesigner-algorithm-6.0.jar:lib/commons-lang-2.6.jar:lib/fastjson-1.2.72_noneautotype.jar:lib/commons-beanutils-1.7.0.jar:lib/log4j-1.2.15.jar:lib/mapdb-3.0.8.jar:lib/kotlin-stdlib-1.2.71.jar:lib/annotations-16.0.3.jar:lib/calcite-linq4j-1.19.0.jar:lib/guava-29.0-jre.jar:lib/maven-project-2.2.1.jar:lib/maven-artifact-manager-2.2.1.jar:lib/
maven-reporting-api-3.0.jar:lib/doxia-sink-api-1.1.2.jar:lib/doxia-logging-api-1.1.2.jar:lib/maven-settings-2.2.1.jar:lib/maven-profile-2.2.1.jar:lib/maven-plugin-registry-2.2.1.jar:lib/plexus-container-default-1.0-alpha-30.jar:lib/groovy-test-2.5.5.jar:lib/plexus-classworlds-1.2-alpha-9.jar:lib/junit-4.12.jar:lib/commons-dbcp2-2.5.0.jar:lib/httpclient-4.5.6.jar:lib/commons-logging-1.2.jar:lib/commons-collections4-4.1.jar:lib/oms-operator-3.2.3.12-SNAPSHOT.jar:lib/retrofit-2.9.0.jar:lib/jsr305-3.0.2.jar:lib/servlet-api-2.5.jar:lib/org.osgi.core-4.3.1.jar:lib/protobuf-java-3.11.0.jar:lib/maven-plugin-api-2.2.1.jar:lib/okhttp-3.14.9.jar:lib/maven-artifact-2.2.1.jar:lib/wagon-provider-api-1.0-beta-6.jar:lib/maven-repository-metadata-2.2.1.jar:lib/maven-model-2.2.1.jar:lib/plexus-utils-3.0.16.jar:lib/javax.annotation-api-1.3.2.jar:lib/javassist-3.20.0-GA.jar:lib/xml-apis-1.3.03.jar:lib/error_prone_annotations-2.3.4.jar:lib/easy-random-core-4.2.0.jar:lib/objenesis-3.1.jar:lib/commons-collections-3.2.2.jar:lib/lombok-1.18.16.jar:lib/antlr4-runtime-4.9.1.jar:lib/ojdbc8-19.7.0.0.jar:lib/orai18n-19.3.0.0.jar:lib/oceanbase-client-1.1.10.jar:lib/db2jcc-db2jcc4.jar:lib/jtds-1.3.1.jar:lib/javax.ws.rs-api-2.1.1.jar:lib/groovy-ant-2.5.5.jar:lib/groovy-cli-commons-2.5.5.jar:lib/groovy-groovysh-2.5.5.jar:lib/groovy-console-2.5.5.jar:lib/groovy-groovydoc-2.5.5.jar:lib/groovy-docgenerator-2.5.5.jar:lib/groovy-cli-picocli-2.5.5.jar:lib/groovy-datetime-2.5.5.jar:lib/groovy-jmx-2.5.5.jar:lib/groovy-json-2.5.5.jar:lib/groovy-jsr223-2.5.5.jar:lib/groovy-macro-2.5.5.jar:lib/groovy-nio-2.5.5.jar:lib/groovy-servlet-2.5.5.jar:lib/groovy-sql-2.5.5.jar:lib/groovy-swing-2.5.5.jar:lib/groovy-templates-2.5.5.jar:lib/groovy-test-junit5-2.5.5.jar:lib/groovy-testng-2.5.5.jar:lib/groovy-xml-2.5.5.jar:lib/groovy-2.5.5.jar:lib/sketches-core-0.9.0.jar:lib/json-path-2.4.0.jar:lib/janino-3.0.11.jar:lib/commons-compiler-3.0.11.jar:lib/hamcrest-core-1.3.jar:lib/gson-2.8.5.jar:lib/httpcore-4.4.13.jar:lib/commo
ns-codec-1.15.jar:lib/ini4j-0.5.2.jar:lib/backport-util-concurrent-3.1.jar:lib/plexus-interpolation-1.11.jar:lib/failureaccess-1.0.1.jar:lib/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:lib/checker-qual-2.11.1.jar:lib/j2objc-annotations-1.3.jar:lib/classgraph-4.8.65.jar:lib/zstd-jni-1.3.5-4.jar:lib/lz4-java-1.5.0.jar:lib/snappy-java-1.1.7.2.jar:lib/ant-junit-1.9.13.jar:lib/ant-1.9.13.jar:lib/ant-launcher-1.9.13.jar:lib/ant-antlr-1.9.13.jar:lib/commons-cli-1.4.jar:lib/picocli-3.7.0.jar:lib/qdox-1.12.1.jar:lib/jline-2.14.6.jar:lib/junit-platform-launcher-1.3.2.jar:lib/junit-jupiter-engine-5.3.2.jar:lib/testng-6.13.1.jar:lib/avatica-metrics-1.13.0.jar:lib/commons-pool2-2.6.0.jar:lib/snakeyaml-1.23.jar:lib/memory-0.9.0.jar:lib/eclipse-collections-forkjoin-11.0.0.jar:lib/eclipse-collections-11.0.0.jar:lib/eclipse-collections-api-11.0.0.jar:lib/lz4-1.3.0.jar:lib/elsa-3.0.0-M5.jar:lib/okio-1.17.2.jar:lib/junit-platform-engine-1.3.2.jar:lib/junit-jupiter-api-5.3.2.jar:lib/junit-platform-commons-1.3.2.jar:lib/apiguardian-api-1.0.0.jar:lib/jcommander-1.72.jar:lib/kotlin-stdlib-common-1.2.71.jar:lib/opentest4j-1.1.1.jar:conf com.alipay.light.VEngine -t 10.186.17.106-9000:90230:0000000016 -c /home/ds//run/10.186.17.106-9000:90230:0000000016/conf/checker.conf start

Full migration and full verification use the same start command, but they are two separate processes, so the ps output alone cannot tell you whether a given process is doing full migration or full verification.

To tell them apart, look at the /home/ds//run/10.186.17.106-9000:90230:0000000016/conf/checker.conf file, which is the configuration file given by the -c parameter in the startup command above.

condition.whiteCondition=[{"name":"sakila","all":false,"attr":{"dml":"d,i,u"},"sub":[{"name":"city"},{"name":"country"}]}]
condition.blackCondition=[{"name":"sakila","all":false,"sub":[{"name":"DRC_TXN*","func":"fn"},{"name":"drc_txn*","func":"fn"}]}]
datasource.master.type=MYSQL
......
task.split.mode=false
task.type=verify
task.checker_jvm_param=-server -Xms8g -Xmx8g -Xmn4g -Xss512k
task.id=90230
task.subId=3
task.resume=false

The value of the task.type configuration item identifies the process type:

  • task.type=migrate: full migration process
  • task.type=verify: full verification process
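Given that the command lines are identical, a quick way to tell the two apart in a script is to read task.type from the conf file passed via -c. A minimal sketch in Python (the parsing helpers here are ours for illustration, not part of OMS):

```python
# Decide whether a checker process is doing full migration or full
# verification by reading task.type from its checker.conf.

def parse_conf(text):
    """Parse simple key=value lines (Java properties style) into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

def process_type(conf):
    return {"migrate": "full migration", "verify": "full verification"}.get(
        conf.get("task.type"), "unknown")

sample = """task.split.mode=false
task.type=verify
task.id=90230"""
print(process_type(parse_conf(sample)))  # full verification
```

Running this against the file named by each process's -c argument lets you label every VEngine process before attaching a profiler.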

6. Incremental synchronization process

[ActionTech ~]# ps aux | grep "coordinator\.Bootstrap"
ds       58500 34.7  0.9 44600844 2575092 pts/0 Sl  Feb08 18483:51 java -server -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/u01/ds/bin/../run/10.186.17.104-9000:p_4gmn723gtt28_dest-000-0:0002000001 -Djava.library.path=/u01/ds/bin/../plugins/jdbc_connector/ -Duser.home=/home/ds -Dlogging.path=/u01/ds/bin/../run/10.186.17.104-9000:p_4gmn723gtt28_dest-000-0:0002000001/logs -cp jdbc_connector.jar:jdbc-source-store.jar:jdbc-sink-ob-mysql.jar com.oceanbase.oms.connector.jdbc.coordinator.Bootstrap -c conf -s com.oceanbase.oms.connector.jdbc.source.store.StoreSource -d com.oceanbase.oms.connector.jdbc.sink.obmysql.OBMySQLJdbcSink -t 10.186.17.104-9000:p_4gmn723gtt28_dest-000-0:0002000001 start

This process reads incremental logs and constructs SQL statements to replay on the target node.

7. Relationship between processes

2. Migration process

The above covers the component processes that make up OMS and how they relate to one another. Next, we use an example to briefly describe OMS's internal workflow.

Take the incremental migration task as an example:

From our colleagues' analysis of the source code, the migration flow is roughly as follows:

Now that we know the migration flow, the next question, and the one every operations engineer cares most about, is how to analyze OMS migration speed.

The OMS console is quite minimal: although it uses InfluxDB, it displays very little information, which is hardly helpful.

Log in to InfluxDB and you can see the following monitoring measurements:

["checker.dest.rps"]
["checker.dest.rt"]
["checker.dest.write.iops"]
["checker.source.read.iops"]
["checker.source.rps"]
["checker.source.rt"]
["checker.verify.dest.read.iops"]
["checker.verify.dest.rps"]
["checker.verify.dest.rt"]
["checker.verify.source.read.iops"]
["checker.verify.source.rps"]
["checker.verify.source.rt"]
["jdbcwriter.delay"]
["jdbcwriter.iops"]
["jdbcwriter.rps"]
["store.conn"]
["store.delay"]
["store.iops"]
["store.rps"]

In addition, OMS uses io.dropwizard.metrics5 for part of its monitoring. The coordinator process prints metrics every 10 seconds to /u01/ds/run/$test_id/logs/msg/metrics.log. The log content is JSON, and the records are quite detailed.

[2023-03-18 22:49:18.101] [{"jvm":{"JVM":"jvm:[heapMemory[max:30360MB, init:2048MB, used:677MB, committed:1980MB], noHeapMemory[max:0MB, init:2MB, used:58MB, committed:63MB], gc[gcName:ParNew, count:2643, time:64034ms;gcName:ConcurrentMarkSweep, count:8, time:624ms;], thread[count:88]]"},"sink":{"sink_worker_num":0,"sink_total_transaction":4.5654123E7,"rps":0.0,"tps":0.0,"iops":0.0,"sink_total_record":4.5654123E7,"sink_commit_time":0.0,"sink_worker_num_all":64,"sink_execute_time":0.0,"sink_total_bytes":2.8912799892E10},"source":{"StoreSource[0]source_delay":5,"p_4gmn723gtt28_source-000-0source_record_num":1.36962569E8,"p_4gmn723gtt28_source-000-0source_iops":198.98,"StoreSource[0]source_status":"running","p_4gmn723gtt28_source-000-0source_dml_num":4.5654123E7,"p_4gmn723gtt28_source-000-0source_dml_rps":0.0,"p_4gmn723gtt28_source-000-0source_rps":0.0},"dispatcher":{"wait_dispatch_record_size":0,"ready_execute_batch_size":0},"frame":{"SourceTaskManager.createdSourceSize":1,"queue_slot1.batchAccumulate":0,"forward_slot0.batchAccumulate":0,"forward_slot0.batchCount":4.7873212E7,"queue_slot1.batchCount":4.7873212E7,"queue_slot1.rps":0.6999300122261047,"SourceTaskManager.sourceTaskNum":0,"forward_slot0.recordAccumulate":0,"forward_slot0.rps":0.6999300122261047,"queue_slot1.tps":0.6999300122261047,"forward_slot0.tps":0.6999300122261047,"queue_slot1.recordAccumulate":0,"queue_slot1.recordCount":4.7873212E7,"forward_slot0.recordCount":4.7873212E7}}]

metrics.log is an important tool for analyzing migration bottlenecks; focus on the source, dispatcher, and sink monitoring values.
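Since each line is `[timestamp] [json]`, pulling values out of metrics.log is straightforward. A minimal sketch, assuming that layout (the helper is ours, not part of OMS, and the sample line is abridged from the one above):

```python
import json

def parse_metrics_line(line):
    """Strip the leading [timestamp] bracket and load the JSON payload."""
    payload = line[line.index("]") + 1:].strip()
    return json.loads(payload)[0]

line = ('[2023-03-18 22:49:18.101] '
        '[{"sink":{"rps":0.0,"sink_worker_num":0,"sink_worker_num_all":64},'
        '"dispatcher":{"wait_dispatch_record_size":0}}]')
m = parse_metrics_line(line)
print(m["sink"]["rps"], m["dispatcher"]["wait_dispatch_record_size"])  # 0.0 0
```

Tailing the file and feeding each line through this gives you a live view of the pipeline without the console.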

1. source stage

Role: connects to the source database and pulls incremental logs. It runs single-threaded, reading SQL and emitting it downstream in the form of transactions; because it is single-threaded, the whole action can only complete serially.

TransactionAssembler::TransactionAssemblerInner::generateTransaction() stitches the SQL it reads into a transaction;

TransactionAssembler::TransactionAssemblerInner::putTransaction() sends the transaction to the QueuedSlot queue.
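The single-threaded assemble-and-enqueue flow described above can be sketched conceptually like this (this mirrors the flow only; it is not the OMS implementation):

```python
from queue import Queue

def assemble(events, out_queue):
    """Stitch row events into transactions; enqueue each on COMMIT."""
    txn = []
    for ev in events:            # single thread: strictly serial
        if ev == "COMMIT":
            out_queue.put(txn)   # putTransaction equivalent
            txn = []
        else:
            txn.append(ev)       # generateTransaction equivalent

q = Queue()
assemble(["INSERT t1", "UPDATE t2", "COMMIT", "DELETE t3", "COMMIT"], q)
print(q.qsize())  # 2
```

Because one loop does both the stitching and the enqueueing, a slow source read stalls everything downstream, which is why the source stage's delay metric matters.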

The following monitoring values can be obtained from metrics.log:

  • source_record_num: total number of row records processed (counted while stitching row records into transactions, TransactionAssembler.notify)
  • source_rps: row records processed per second (TransactionAssembler.notify)
  • source_iops: bytes of row records processed per second (TransactionAssembler.notify)
  • source_dml_num: total number of DML row records processed (TransactionAssembler.notify)
  • source_dml_rps: DML row records processed per second (TransactionAssembler.notify)
  • StoreSource[0]source_status: status of the source stage, running/stopped
  • StoreSource[0]source_delay: delay of the source stage, i.e. the time spent processing data

The InfluxDB data can be turned into monitoring dashboards, as shown below (partial):

2. Dispatcher stage

Role: coordinates the pace between production and consumption. It reads transactions from the QueuedSlot queue and writes them into the transaction scheduling queue (DefaultTransactionScheduler::readyQueue). The default queue size is 16384, and it runs single-threaded.

  • dispatcher_ready_transaction_size: number of transactions backlogged in the DefaultTransactionScheduler::readyQueue, waiting for SinkTaskManager threads to process
  • coordinator_dispatch_record_size: total number of row records across the transactions backlogged in DefaultTransactionScheduler::readyQueue, waiting for SinkTaskManager threads to process

3. sink stage

Role: fetches data from the upstream transaction scheduling queue and replays it in the target database. Multithreaded; the number of threads is set by the workerNum configuration item, default 16.

  • sink_tps: transactions replayed per second (counted when transaction replay completes)
  • sink_total_transaction: total number of replayed transactions
  • sink_worker_num: number of busy worker threads
  • sink_worker_num_all: total number of worker threads
  • sink_rps: row records replayed per second
  • sink_total_record: total number of row records replayed
  • sink_execute_time: average execution time per row record = transaction SQL execution time / number of records in the transaction
  • sink_commit_time: average time spent in the commit phase of a transaction
  • sink_iops: bytes of row records replayed per second, in KB/s
  • sink_total_bytes: total bytes of row records replayed
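When hunting bottlenecks, we read the sink and dispatcher values together. A sketch of that heuristic, with logic and thresholds that are ours for illustration, not from OMS:

```python
# Rough triage of one parsed metrics.log record (dict of dicts).
def diagnose(m):
    sink = m["sink"]
    backlog = m["dispatcher"]["wait_dispatch_record_size"]
    # all workers busy and records piling up upstream => sink-bound
    if sink["sink_worker_num"] >= sink["sink_worker_num_all"] and backlog > 0:
        return "sink is the bottleneck: all workers busy, dispatch queue backing up"
    if backlog == 0 and sink["sink_worker_num"] == 0:
        return "pipeline idle or source-bound"
    return "no obvious backlog"

sample = {"sink": {"sink_worker_num": 64, "sink_worker_num_all": 64},
          "dispatcher": {"wait_dispatch_record_size": 120000}}
print(diagnose(sample))  # sink is the bottleneck: all workers busy, dispatch queue backing up
```

This is exactly the pattern the next section's stress test exhibits: a full dispatch queue with every sink worker busy.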

3. OMS optimization

OMS does not leave much room for optimization on the O&M side. Here we only tune the migration link; the two most direct optimizations are listed below.

1. sink_worker_num

Start by using a set of monitoring graphs to check whether OMS has a performance problem.

During the stress test, the sink_worker_num and wait_dispatch_record curves were almost identical. Clearly, transactions were backing up in the dispatch queue because downstream consumption could not keep up, and the dispatch queue was full.

Once the cause is found, the fix is simple: increase the number of sink worker threads, modified as shown in the figure below.

O&M Monitoring - Components - Update

Find the JDBCWriter.worker_num value. A good starting point is the host's logical CPU core count, up to a maximum of 4x the logical core count (tune the exact value to your environment).
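The recommendation above, expressed as code (the helper and its defaults are illustrative, not an OMS API):

```python
import os

def recommended_worker_num(logical_cores=None, factor=1):
    """Start JDBCWriter.worker_num at the logical core count, cap at 4x."""
    cores = logical_cores or os.cpu_count()
    return min(cores * factor, cores * 4)

print(recommended_worker_num(logical_cores=16))            # 16
print(recommended_worker_num(logical_cores=16, factor=4))  # 64
```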

2. Migration Options

When creating a migration task in OMS, if the task includes full migration you can set the "Full Migration Concurrency Speed" option.

The resources required by each level are as follows. These are the built-in official templates, and they can be adjusted to your actual situation.

Note: If only incremental migration is selected, this option is not available.

3. Other optimizations require case-by-case analysis of the specific cause.


Originally published at blog.csdn.net/ActionTech/article/details/130221855