Elasticsearch upgrade and index rebuild

1. Background description

es carries three lines of business in the company: on-site search, order data statistics, and ELK log analysis.

In 2020, the team decided to upgrade Elasticsearch. The version in use (es is short for Elasticsearch, same below) was 1.x, to be upgraded to version 5.x.

5.x supports the following new features:

  • Lucene 6.x support: roughly half the disk space, half the indexing time, and about 25% better query performance
  • Java REST client (high-level API)
  • Painless scripting, which is safer, more concise, and faster than Groovy scripts

For on-site search and order data statistics, the current business architecture is

mysql -> canal -> kafka -> (es Index server) -> es

(Kafka Connect could be considered as an alternative to Canal.)

1.1 How to configure mysql -> canal -> kafka

1.1.1. Configure MySQL

Enable binlog:

[mysqld]
log-bin=mysql-bin # enable binlog
binlog-format=ROW # use ROW format
server_id=1 # required for MySQL replication; must not clash with canal's slaveId

Create a canal user and grant it replication privileges:

CREATE USER canal IDENTIFIED BY 'canal';  
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
-- GRANT ALL PRIVILEGES ON *.* TO 'canal'@'%' ;
FLUSH PRIVILEGES;

1.1.2. Configure canal

Download https://github.com/alibaba/canal/releases/download/canal-1.1.6/canal.deployer-1.1.6.tar.gz

Modify conf/canal.properties

# tcp, kafka, rocketMQ, rabbitMQ, pulsarMQ
canal.serverMode = kafka # messages are delivered to Kafka

kafka.bootstrap.servers = 127.0.0.1:9092
kafka.acks = all
kafka.compression.type = none
kafka.batch.size = 16384
kafka.linger.ms = 1
kafka.max.request.size = 1048576
kafka.buffer.memory = 33554432
kafka.max.in.flight.requests.per.connection = 1
kafka.retries = 0

Modify conf/example/instance.properties

# username/password
canal.instance.dbUsername=canal
canal.instance.dbPassword=canal
canal.instance.defaultDatabaseName=mysql_test # database to synchronize

# mq config
canal.mq.topic=canal_topic # Kafka topic to write to

Start canal:

./bin/start.sh

1.1.3. Start ZooKeeper and Kafka

brew services start zookeeper
brew services start kafka

1.1.4. Testing

Insert data into the database, then use the Kafka console consumer to see the synchronized messages:

INSERT INTO `mysql_test`.`user` (`id`, `name`) VALUES ('6', 'Bob');

➜  bin ./kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --from-beginning --topic canal_topic
{"data":[{"id":"6","name":"Bob","age":null}],"database":"mysql_test","es":1684221427000,"id":5,"isDdl":false,"mysqlType":{"id":"int","name":"varchar(32)","age":"int"},"old":null,"pkNames":["id"],"sql":"","sqlType":{"id":4,"name":12,"age":4},"table":"user","ts":1684221427082,"type":"INSERT"}

2. Difficulties

1. How to avoid affecting the current business during the upgrade.

2. How to roll back quickly if the upgrade fails.

3. Specific steps

The main approach is a dual-write mechanism.

3.1. Deploy a new es cluster

Download the 5.x version of es, and deploy a new cluster on a new machine.

Configure the machine:

  • Disable swapping: swapoff -a
  • Lock memory: bootstrap.memory_lock: true (bootstrap.mlockall in 1.x)
  • Raise the file-descriptor limit (check with ulimit -a, e.g. set ulimit -n 65536)
  • Give half of the memory to es and leave the other half to the filesystem cache: ES_JAVA_OPTS="-Xms16g -Xmx16g"

3.2. Pull the code and upgrade it to the new es version

Because the jump from 1.x to 5.x is large, many Java APIs have changed and the code must be fixed accordingly:

  • Common field type change: text/keyword replaces string (see the sketch after this list)
  • The type type is no longer supported
  • Java API alias semantics have changed
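
As an illustration of the string -> text/keyword change, here is a hedged sketch of creating a 5.x index with the low-level REST client; the index, type, and field names are made up, and in 1.x the name field would have been a string field instead.

// Hedged sketch: create a 5.x index whose string-like fields use text/keyword
// instead of the removed "string" type. All names here are hypothetical.
import java.util.Collections;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

public class CreateIndexExample {
    public static void main(String[] args) throws Exception {
        String body = "{"
            + "\"mappings\": {\"user\": {\"properties\": {"
            + "  \"name\":   {\"type\": \"text\"},"      // analyzed, for full-text search
            + "  \"status\": {\"type\": \"keyword\"},"   // not analyzed, for filters/aggregations
            + "  \"age\":    {\"type\": \"integer\"}"
            + "}}}}";
        RestClient es = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        es.performRequest("PUT", "/user_index_v2",
            Collections.<String, String>emptyMap(),
            new NStringEntity(body, ContentType.APPLICATION_JSON));
        es.close();
    }
}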

3.3. Rebuild index

  We use an index-rebuild program to create the new indexes. The specific steps are as follows. We call the index currently serving traffic the online index, and the newly created index the new index.

  3.3.1. Init

    Refresh the index name mapping relationship and check that the current alias has only one physical index.

    Create the new index according to the predefined mapping.

    Enable the online index's data change log, i.e. record the Kafka data consumed by the online index and store it as a change-log file (a sketch follows).
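
The startChangeLog() implementation is not shown in this post; below is a minimal sketch, under the assumption that each consumed canal message is appended as one line so that the later replay steps are straightforward.

// Hedged sketch: append every canal message consumed by the online index to a
// change-log file, one JSON message per line. Class and field names are hypothetical.
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class ChangeLogWriter implements AutoCloseable {
    private final BufferedWriter writer;
    private long lineCount = 0;

    public ChangeLogWriter(String path) throws IOException {
        this.writer = new BufferedWriter(new FileWriter(path, true)); // append mode
    }

    public synchronized void append(String canalJson) throws IOException {
        writer.write(canalJson);
        writer.newLine();
        lineCount++;                 // compared against the number of replayed lines later
    }

    public long getLineCount() { return lineCount; }

    @Override
    public void close() throws IOException { writer.close(); }
}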

  3.3.2. Fully index the database data into the new index

    The data read from MySQL is synchronized to es. If there are multiple sub-tables, they are synchronized in table order. Multi-threaded batch inserts can be enabled (see the sketch below).
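
A hedged sketch of this step: page through a MySQL table by primary key and bulk-write each page to the new index with the 5.x low-level REST client. The table, index, and connection details are assumptions, and the real program adds multi-threading.

// Hedged sketch: full index from MySQL into new_index. Names and connection
// details are assumptions; no JSON escaping is done (fine for a sketch only).
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.RestClient;

import java.sql.*;
import java.util.Collections;

public class FullIndexer {
    public static void main(String[] args) throws Exception {
        RestClient es = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:3306/mysql_test", "canal", "canal")) {
            long lastId = 0;
            int pageSize = 1000;
            while (true) {
                StringBuilder bulk = new StringBuilder();
                int rows = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, name, age FROM user WHERE id > ? ORDER BY id LIMIT ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, pageSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            bulk.append("{\"index\":{\"_id\":\"").append(lastId).append("\"}}\n");
                            bulk.append("{\"name\":\"").append(rs.getString("name"))
                                .append("\",\"age\":").append(rs.getInt("age")).append("}\n");
                            rows++;
                        }
                    }
                }
                if (rows == 0) break;   // no more pages
                es.performRequest("POST", "/new_index/user/_bulk",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(bulk.toString(), ContentType.APPLICATION_JSON));
            }
        }
        es.close();
    }
}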

  3.3.3. Index optimization for new index

    Refresh and flush the index, then call the force-merge API to merge segments (see below).
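
These are plain REST calls; a sketch using the low-level client follows (in 5.x the endpoint is _forcemerge, which replaced the 1.x _optimize API).

import java.util.Collections;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class OptimizeNewIndex {
    public static void main(String[] args) throws Exception {
        // Refresh and flush the new index, then force-merge it down to one segment.
        RestClient es = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        es.performRequest("POST", "/new_index/_refresh");
        es.performRequest("POST", "/new_index/_flush");
        es.performRequest("POST", "/new_index/_forcemerge",
            Collections.singletonMap("max_num_segments", "1"));
        es.close();
    }
}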

  3.3.4. Replay the change log to the new index

    Each change-log entry is converted into an es write and applied to the new index (a sketch follows).
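
A simplified, single-pass sketch of the replay loop, reusing the hypothetical CanalMessageIndexer from the earlier sketch but pointed at the new index; the real code splits the replay into two phases, as described in the following steps.

// Hedged sketch: replay recorded canal messages line by line against the new index.
import java.io.BufferedReader;
import java.io.FileReader;

public class ChangeLogReplayer {
    public static int replay(String changeLogPath, CanalMessageIndexer indexer) throws Exception {
        int replayed = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(changeLogPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                indexer.apply(line);   // convert the canal message into an ES write
                replayed++;
            }
        }
        return replayed;               // compared against the number of recorded lines
    }
}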

  3.3.5. Suspend online index writing

    Because the online index and the new index use the same Kafka consumer group, consumption by the online index must be stopped (see the pause/resume sketch below).
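
If the online index consumes through a plain KafkaConsumer, pausing can be done with the standard pause/resume API. This is a sketch of what pauseOnlineIndex()/keepRuningOnlineIndex() in the code below might wrap; it is an assumption, not the post's actual implementation.

// Hedged sketch: stop the online index's consumer from fetching new records while
// the change log is replayed, then resume afterwards. The poll loop must keep
// calling poll() so the consumer does not leave the group.
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public final class OnlineIndexPauser {
    public static void pause(KafkaConsumer<String, String> consumer) {
        Set<TopicPartition> assigned = consumer.assignment();
        consumer.pause(assigned);    // poll() now returns no records for these partitions
    }

    public static void resume(KafkaConsumer<String, String> consumer) {
        consumer.resume(consumer.assignment());  // continue from the same offsets
    }
}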

  3.3.6. Close the change log

    Stop recording the online index's data change log.

  3.3.7. Replay the change log (second stage)

    Replay the remaining change-log entries into the new index, the same way as in step 3.3.4.

  3.3.8. Delete change log 

    Delete the online index's data change log.

  3.3.9. Set the number of replicas

    When the new index is created, the number of replicas defaults to 0; it is now dynamically adjusted to the value the business requires. For example, the search business is given two replicas, while the order statistics index needs none.

PUT /new_index/_settings
{
    "number_of_replicas": 2
}

    This stage can take a few minutes before it is safe to proceed to the next step. A better approach is to call the health API and check the shard status (see the waiting sketch after the example output).

GET _cluster/health

{
  "cluster_name" : "testcluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0, // 重新定位的分片
  "initializing_shards" : 0, // 初始化中的分片
  "unassigned_shards" : 1, // 未分配的分片
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50.0
}
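
Rather than sleeping for a fixed time, the health API can also be asked to block until the shards are allocated. A sketch, with the index name and timeout as assumptions:

// Hedged sketch: wait until the new index reports green (all replicas allocated)
// instead of sleeping for a fixed number of minutes.
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class WaitForGreen {
    public static void main(String[] args) throws Exception {
        RestClient es = RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")).build();
        Map<String, String> params = new HashMap<>();
        params.put("wait_for_status", "green");  // block until replicas are assigned
        params.put("timeout", "10m");            // assumed upper bound
        es.performRequest("GET", "/_cluster/health/new_index", params);
        es.close();
    }
}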

  3.3.10. Alias switching

POST /_aliases
{
    "actions": [
        { "remove": { "index": "online_index", "alias": "my_index" }},
        { "add":    { "index": "new_index", "alias": "my_index" }}
    ]
}

  3.3.11. Run online index (read data from kafka)

    new_index starts to consume the latest data from Kafka. Because of possible delays in the previous steps, it may take a few minutes to catch up to the latest data.

  3.3.12. Delete the old index

    delete old_index

The detailed code is as follows:

        // 1. init
        logger.info("Initializing");
        ESHighLevelFactory esHighLevelFactory = ESHighLevelFactory.getInstance(indexContext.getIndex().getIndexName());
        logger.info("Refreshing the index name mapping");
        if (!indexContext.refreshIndexName()) {
            throw new IndexException("Failed to refresh the index name mapping");
        }

        rebuildIndexName = indexContext.getPhysicalRebuildIndexName();

        logger.info("Initializing the rebuild environment, current rebuild index name: " + rebuildIndexName);
        logger.info("Creating index, index name: " + rebuildIndexName);
        boolean isCreate = false;
        try {
            isCreate = indexContext.getIndex().createIndex(rebuildIndexName);
        } catch (Throwable t) {
            logger.info("Failed to create the index; this failure can be ignored and will be retried automatically ...");
        }

        logger.info("Enabling the online index data change log");
        indexContext.startChangeLog();

        // 2. Rebuild the index
        logger.info("Fully indexing the data in the database ...");
        long startRebuildTime = System.currentTimeMillis();
        rebuild();
        logger.info(" ------  Finished fully indexing the database data into index " + rebuildIndexName + ", took " + ((System.currentTimeMillis() - startRebuildTime) / 1000)
            + " seconds    ------  ");

        // 3. Index optimization -- consider moving this to after the change log replay is finished
        logger.info("Optimizing the index ...");
        long startOptimizeTime = System.currentTimeMillis();
        ESHighLevelFactory.getInstance(rebuildIndexName).optimize(rebuildIndexName, 1);
        logger.info(" ------  Finished optimizing index " + rebuildIndexName + ", took " + ((System.currentTimeMillis() - startOptimizeTime) / 1000)
            + " seconds    ------  ");

        // TODO character set configuration
        BufferedReader logReader = new BufferedReader(new FileReader(indexContext.getChangeLogFilePath()));

        // 4. Replay the change log
        logger.info("Replaying the local data change log [phase 1] ...");
        long startReplay1Time = System.currentTimeMillis();
        int replayChangeLogCount = replayChangeLogFirst(logReader);
        logger.info(" ------  Finished [phase 1] change log replay, lines: " + replayChangeLogCount + ", took "
            + ((System.currentTimeMillis() - startReplay1Time) / 1000) + " seconds    ------  ");

        // 5. Pause the online index
        logger.info("Pausing the online index");
        indexContext.pauseOnlineIndex();
        isPauseOnline.set(true);

        // 6. Make the online index only do index updates, and close the change log
        logger.info("Stopping the change log");
        indexContext.stopChangeLog();

        // 7. Continue replaying the change log
        logger.info("Replaying the local data change log [phase 2] ...");
        long startReplay2Time = System.currentTimeMillis();
        replayChangeLogCount = replayChangeLogCount + replayChangeLogSecond(logReader);
        if ((indexContext.getWriteChangeLogCount() - replayChangeLogCount) != 0) {
            logger.error("The change log is in an inconsistent state, recorded lines: " + indexContext.getWriteChangeLogCount() + ", but only replayed: " + replayChangeLogCount);
        }
        logger.info(" ------  Finished [phase 2] change log replay, lines: " + replayChangeLogCount + ", took "
            + ((System.currentTimeMillis() - startReplay2Time) / 1000) + " seconds    ------  ");

        // 8. Delete the change log; OnlineIndex.startChangeLog already cleans up the environment, so nothing is done here
        logger.info("Lightly optimizing the index ...");
        long startSimpleOptimizeTime = System.currentTimeMillis();
        ESHighLevelFactory.getInstance(rebuildIndexName).optimize(rebuildIndexName, null);

        logger.info(" ------  Finished light optimization of index " + rebuildIndexName + ", took " + ((System.currentTimeMillis() - startSimpleOptimizeTime) / 1000)
            + " seconds    ------  ");

        // 9. Set the number of replicas (suspected to be time-consuming ~~~ to be confirmed)
        logger.info("Setting the number of replicas ...");
        int replicas = 3;
        if (rebuildIndexName.startsWith(IndexNameConst.ORDER_INDEX_PREFIX)) {
            replicas = 1;
        } else if (rebuildIndexName.startsWith(IndexNameConst.IndexName.activityTicket.getIndexName())) {
            replicas = 2;
        } else {
            String replicasStr = Configuration.getInstance().loadDiamondProperty(Configuration.ES_INDEX_REPLICAS);
            if (NumberUtils.isNumber(replicasStr)) {
                replicas = NumberUtils.toInt(replicasStr);
            }
        }
        ESHighLevelFactory.getInstance(rebuildIndexName).setReplicas(rebuildIndexName, replicas);

        // Execute the index switch flow
        // In staging and production, block for 2 minutes to let data sync before switching the index and deleting the old one
        try {
            if (IDCUtil.isBuildOrProduction()) {
                Thread.sleep(120 * 1000);
            }
        } catch (InterruptedException e) {
        }
        // 10. Alias switch
        logger.info("Index switch: setting " + rebuildIndexName + " as the online index");
        if (!indexContext.switchIndex(rebuildIndexName)) {
            throw new IndexException("Index switch failed: could not set " + rebuildIndexName + " as the online index");
        }

        // 11. Run the online index
        logger.info("Running the online index");
        indexContext.keepRuningOnlineIndex();
        isPauseOnline.set(false);

        // 12. Delete the original online index
        String oldOnlineIndexName = indexContext.getPhysicalRebuildIndexName();
        logger.info("Deleting the original online index, index name: " + oldOnlineIndexName);
        if (!ESHighLevelFactory.getInstance(indexContext.getIndex().getIndexName()).deleteIndex(oldOnlineIndexName)) {
            throw new IndexException("Failed to delete index, index name: " + oldOnlineIndexName);
        }

Some thoughts

If you simply need to build a brand-new index, you can do it this way (using a different consumer group):

  1. Record the current timestamp

  2. Fully index the data from the database

  3. Find the Kafka offset corresponding to the recorded timestamp; the offset's timestamp must be earlier than the recorded timestamp (see the sketch after this list)

  sh kafka_2.11-2.3.0/bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list broker1:9092,broker2:9092 --topic topicName --time 1585186237000

  4. Start indexing data from the offset found in the previous step
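
The same offset lookup can be done programmatically with the Kafka consumer API (offsetsForTimes requires brokers and clients at 0.10.1 or later). A sketch, assuming the partition assignment is handled elsewhere:

// Hedged sketch: find the earliest offset whose timestamp is at or after the
// recorded timestamp (or a slightly earlier timestamp for safety) and seek there.
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class SeekByTimestamp {
    public static void seek(KafkaConsumer<String, String> consumer,
                            TopicPartition partition, long recordedTimestampMs) {
        // The consumer must already be assigned this partition (e.g. via assign()).
        Map<TopicPartition, Long> query = new HashMap<>();
        query.put(partition, recordedTimestampMs);
        Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);
        OffsetAndTimestamp offset = result.get(partition);
        if (offset != null) {
            consumer.seek(partition, offset.offset());  // start consuming from here
        }
    }
}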

3.4. Use the new cluster for business testing

Deploy a new client service that calls the new es cluster and check that the business behaves normally: verify that search results are consistent for on-site queries and that statistical results are consistent for statistical queries.

Run different tests for different business scenarios:

1. Compare the old and new clusters to check whether the index data volumes match (see the sketch after this list)

2. For the search business, compare the results of popular keyword queries

3. For the statistics business, compare the index data volume and check that common aggregation queries return consistent results

4. The ELK business can be upgraded separately
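
A quick way to compare index data volume between the two clusters is to call _count on both and compare. A sketch, with hosts and index name as assumptions:

// Hedged sketch: compare document counts of the same index on the old and new clusters.
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.RestClient;

public class CompareCounts {
    public static void main(String[] args) throws Exception {
        RestClient oldEs = RestClient.builder(new HttpHost("old-es-host", 9200, "http")).build();
        RestClient newEs = RestClient.builder(new HttpHost("new-es-host", 9200, "http")).build();
        String oldCount = EntityUtils.toString(
            oldEs.performRequest("GET", "/my_index/_count").getEntity());
        String newCount = EntityUtils.toString(
            newEs.performRequest("GET", "/my_index/_count").getEntity());
        System.out.println("old=" + oldCount + " new=" + newCount);
        oldEs.close();
        newEs.close();
    }
}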

3.5. Release the online client search code, changing the es address to the new cluster's address

  Go online and observe whether the business is stable.

3.6. Take the old es cluster offline

  Release the resources of the old es cluster.

4. Summary

  The es upgrade was done two years ago, so some details may be missing from this summary. Overall there were still plenty of gains, and there is room for improvement in both architecture and code details; the index-rebuild code could be made more generic and then open-sourced.


Source: https://blog.csdn.net/Z__7Gk/article/details/131662725