The Road to Graph Database Refactoring: A Real-World Case of Migrating from OrientDB to NebulaGraph

1. Foreword

Readers who follow my official account know that I have done many refactoring projects — you could call me a "refactoring nail house". But this time, migrating the graph database from OrientDB to Nebula Graph (https://www.nebula-graph.io/) was the most difficult refactoring I have ever done, so this article shares the road to graph database refactoring.

2. Where the Difficulties Lie

1. Heavy historical burden. The original OrientDB system was developed in 2016. Its logic is very complicated, and the historical context is completely undocumented.

2. We didn't understand the business. We took over this big-data requirement temporarily; having never worked on this business before, we didn't understand it at all.

3. We didn't know the technology stack. This was our first contact with graph databases (no one on the team had used one), and we had never touched OrientDB or Nebula before. On top of that, most of the old system's code is written in Scala, and the HBase, Spark, and Kafka used in the system were also relatively unfamiliar to us.

4. Time was tight.

To sum up: we didn't understand the business, and we weren't familiar with the technology stack!

Tip: think about this question — how do you carry out a refactoring when you are familiar with neither the business nor the technology stack?

3. Technical solution

The following introduces the refactoring plan.

1. Background

The graph database OrientDB behind Orion has performance bottlenecks and single points of failure, and needs to be upgraded to Nebula.

The technology stack of the old system cannot support elastic scaling, and its monitoring and alerting facilities are incomplete.

2. Research items

Note: since we were not familiar with the business, what exactly did we research?

1) External interface inventory: sort out all external interfaces of the system, including interface name, purpose, request volume (QPS), average latency, and callers (service and IP).

2) Core processes of the old system: produce a structure diagram of the old system and flowcharts for the important interfaces (about 10).

3) Environment inventory: which projects need to be transformed; application deployments; MySQL, Redis, and HBase cluster IPs; and the branches currently deployed online.

4) Triggering scenarios: how each interface is triggered, starting from the business usage scenarios. Every interface is covered by at least one scenario, which makes later functional verification easier.

5) Transformation plan: feasibility analysis for each interface — how to translate it (OrientDB statements to Nebula query statements) and how to transform the graph write process.

6) New system design: produce the structure diagram and core flowcharts.

3. Project objectives

Complete the migration of the graph database from OrientDB to Nebula, refactor the old system onto a unified Java technology stack, and support horizontal scaling at the service level.

4. Overall plan

We adopted a fairly aggressive approach:

1. Start from the calling interfaces and directly rewrite the underlying old system, keeping the impact area controllable.

2. Solve the problem once and for all, making later maintenance easier.

3. Unify on the Java technology stack and plug into the company's unified service framework, which makes monitoring and maintenance easier.

4. The boundary of the base graph-database application is clear, making it easier for upper-layer applications to access the graph database later.

Note: diagrams were drawn during the research stage, but since they involve business details, I won't include them here.

5. Grayscale scheme


**1) Grayscale strategy**

Write requests: synchronous double write.

Read requests: ramp traffic from small to large for a smooth transition.

**2) Grayscale schedule**

| Stage | Traffic | Duration |
| --- | --- | --- |
| 1 | 0% | Synchronous double write, sampled comparison of replayed traffic; proceed after a 100% pass, estimated 2 days |
| 2 | 1‰ | 2 days |
| 3 | 1% | 2 days |
| 4 | 10% | 5 days; stress testing happens at this stage |
| 5 | 20% | 2 days |
| 6 | 50% | 2 days |
| 7 | 100% | - |

Note:

  1. A config-center switch controls everything; if anything goes wrong we can flip the switch at any time and recover in seconds.

  2. Missing a read interface in the grayscale has no impact; only changed interfaces are affected.

  3. Use the hash of the request parameters as the key, so that repeated requests with the same parameters get consistent results. A request hits the grayscale when abs(key) % 1000 < X (0 < X < 1000, X is dynamically configurable).
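As a sketch, the hit rule above can be written as follows (`GrayHitDemo` and the constant `X` are hypothetical names for illustration; in the real system X comes from the dynamic config center):

```java
// Minimal sketch of the grayscale hit rule: abs(hash(params)) % 1000 < X.
// GrayHitDemo and X are hypothetical; in practice X is read at runtime
// from a dynamic configuration center.
public class GrayHitDemo {
    static final int X = 100; // hypothetical threshold: ~10% of traffic

    public static boolean hit(Object params) {
        int key = params.hashCode();
        // Note: Math.abs(Integer.MIN_VALUE) is still negative; a production
        // implementation may prefer Math.floorMod(key, 1000) < X.
        return Math.abs(key) % 1000 < X;
    }

    public static void main(String[] args) {
        String req = "getSubGraph:{id=42}";
        // The same parameters always map to the same bucket, so repeated
        // requests get a consistent old/new routing decision.
        System.out.println(hit(req) == hit(req)); // prints: true
    }
}
```

Because the decision is a pure function of the parameter hash, a retried request never flip-flops between the old and new systems mid-grayscale.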

Digression: the grayscale plan is really the most important part of a refactoring — I mentioned this in an earlier article. Real traffic is compared asynchronously, and the rollout only ramps up after the comparison fully passes. This data-comparison stage took longer than expected (2 weeks in practice, and it surfaced many hidden problems).

6. Data comparison scheme

1) The flow for requests that miss the grayscale is as follows:

Call the old system first. Then, if the request is sampled (sampling ratio configurable from 0% to 100%), send a message to MQ. The new system consumes the MQ message, calls the new system's interface, and compares its result with the old interface's response as JSON. Any inconsistency triggers an enterprise WeChat notification, so data inconsistencies are noticed in real time and problems can be found and fixed.
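A minimal sketch of the comparison step on the consumer side (`CompareConsumer` is a hypothetical name; the real consumer does a field-level JSON diff and pushes mismatches to enterprise WeChat):

```java
// Hypothetical sketch of the MQ consumer's comparison step. The real
// system deserializes both responses and does a field-level JSON diff;
// plain string equality keeps this sketch short.
public class CompareConsumer {
    /** Returns null when consistent, otherwise the alert message. */
    public static String compare(String methodKey, String oldRes, String newRes) {
        if (oldRes.equals(newRes)) {
            return null; // consistent: nothing to report
        }
        // In the real flow this text is sent as an enterprise WeChat alert.
        return "[inconsistent] method=" + methodKey
                + " old=" + oldRes + " new=" + newRes;
    }

    public static void main(String[] args) {
        System.out.println(compare("getSubGraph", "{\"a\":1}", "{\"a\":1}")); // prints: null
        System.out.println(compare("getSubGraph", "{\"a\":1}", "{\"a\":2}"));
    }
}
```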


For requests that hit the grayscale, the flow is the mirror image: call the new system first, then compare against the old.

7. Data migration plan

1) Full volume (historical data): migrate everything with a script; inconsistencies that arise during the rollout window are repaired from roughly 3 days of data consumed from MQ.

2) Incremental data: synchronous double write (there are few write interfaces, and write QPS is low).
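The synchronous double write can be sketched as below (`DoubleWriteService` and `GraphWriter` are illustrative names, not the project's actual classes; the pattern keeps the old store authoritative and logs, rather than propagates, a failed write to the new store):

```java
// Illustrative double-write wrapper. Names are hypothetical; the pattern
// is: write the old store first (source of truth), then best-effort write
// the new store, repairing any failure via the comparison/migration tools.
public class DoubleWriteService {
    public interface GraphWriter { void write(String statement); }

    private final GraphWriter orientDb; // old store, authoritative
    private final GraphWriter nebula;   // new store, best effort

    public DoubleWriteService(GraphWriter orientDb, GraphWriter nebula) {
        this.orientDb = orientDb;
        this.nebula = nebula;
    }

    public void write(String statement) {
        orientDb.write(statement);
        try {
            nebula.write(statement);
        } catch (RuntimeException e) {
            // Do not fail the caller; inconsistencies are caught later by
            // the data-comparison pipeline and repaired.
            System.err.println("nebula write failed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        DoubleWriteService svc = new DoubleWriteService(
                s -> System.out.println("orientdb <- " + s),
                s -> System.out.println("nebula   <- " + s));
        svc.write("upsert vertex ...");
    }
}
```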

8. Transformation case: subgraph query as an example

1) Before transformation

```java
@Override
public MSubGraphReceive getSubGraph(MSubGraphSend subGraphSend) {
    logger.info("-----start getSubGraph------(" + subGraphSend.toString() + ")");
    MSubGraphReceive r = (MSubGraphReceive) akkaClient.sendMessage(subGraphSend, 30);
    logger.info("-----end getSubGraph:");
    return r;
}
```

2) After transformation

Define the grayscale module interface:

```java
public interface IGrayService {
    /**
     * Whether the request hits the grayscale. Config value 0 ~ 1000.
     * true: hit, false: miss.
     *
     * @param hashCode hash of the request parameters
     * @return whether the grayscale is hit
     */
    boolean hit(Integer hashCode);

    /**
     * Whether the request is sampled. Config value 0 ~ 100.
     *
     * @return whether sampling is hit
     */
    boolean hitSample();

    /**
     * Send the request/response data for comparison.
     *
     * @param requestDTO message payload
     */
    void sendReqMsg(MessageRequestDTO requestDTO);

    /**
     * Whether sampling is hit for the given method.
     *
     * @param methodKeyEnum method key
     * @return whether sampling is hit
     */
    boolean hitSample(MethodKeyEnum methodKeyEnum);
}
```
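For illustration, a minimal in-memory implementation of the hit/sample logic might look like this (`SimpleGrayService` and its setters are hypothetical; in the real system both thresholds live in the dynamic config center so they can be changed at runtime):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory implementation of the grayscale switches. In
// production the two thresholds are pushed from a config center, which is
// what allows flipping the switch at any time and recovering in seconds.
public class SimpleGrayService {
    private final AtomicInteger grayThreshold = new AtomicInteger(0); // 0 ~ 1000
    private final AtomicInteger sampleRate = new AtomicInteger(0);    // 0 ~ 100

    public boolean hit(Integer hashCode) {
        return Math.abs(hashCode) % 1000 < grayThreshold.get();
    }

    public boolean hitSample() {
        return ThreadLocalRandom.current().nextInt(100) < sampleRate.get();
    }

    // Simulates a config-center push.
    public void setGrayThreshold(int x) { grayThreshold.set(x); }
    public void setSampleRate(int p) { sampleRate.set(p); }
}
```

With both values starting at 0, all traffic stays on the old path until the configuration is raised, matching stage one of the grayscale schedule.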

The interface is modified as follows. kgpCoreService calls the new kgp-core service, whose business logic matches orion-x but whose underlying graph database queries are now against Nebula:

```java
@Override
public MSubGraphReceive getSubGraph(MSubGraphSend subGraphSend) {
    logger.info("-----start getSubGraph------(" + subGraphSend.toString() + ")");
    long start = System.currentTimeMillis();
    // 1. Check the grayscale
    boolean hit = grayService.hit(HashUtils.getHashCode(subGraphSend));
    MSubGraphReceive r;
    if (hit) {
        // 2. Grayscale hit: take the new path (call the new service via Dubbo)
        r = kgpCoreService.getSubGraph(subGraphSend);
    } else {
        // Original path, using Akka communication
        r = (MSubGraphReceive) akkaClient.sendMessage(subGraphSend, 30);
    }
    long requestTime = System.currentTimeMillis() - start;

    // 3. If sampling is hit, send the data-comparison message to MQ
    if (grayService.hitSample(MethodKeyEnum.getSubGraph_subGraphSend)) {
        MessageRequestDTO requestDTO = new MessageRequestDTO.Builder()
                .req(JSON.toJSONString(subGraphSend))
                .res(JSON.toJSONString(r))
                .requestTime(requestTime)
                .methodKey(MethodKeyEnum.getSubGraph_subGraphSend)
                .isGray(hit).build();
        grayService.sendReqMsg(requestDTO);
    }
    logger.info("-----end getSubGraph: {} ms", requestTime);
    return r;
}
```

9. Project scheduling plan

Manpower: 4 developers, 1 tester.

The main work items and their durations were as follows:

**Scheme design (1 week)**

1. Process sorting
2. Drawing flowcharts and structure diagrams
3. Scheme design

**Development (3 weeks)**

1. New service project setup; Nebula operation classes and ORM framework wrapper
2. Interface transformation (more than 10 interfaces)
3. MQ consumption transformation
4. Data comparison tool (including enterprise WeChat notifications)
5. Data migration scripts
6. Interface joint debugging
7. Code review within the team

**Testing (2 weeks)**

1. Functional testing
2. Data comparison
3. Regression test of the old system at 100% traffic
4. Regression test of the new system at 100% traffic
5. Production data migration

**Grayscale**

1. Seven grayscale stages for a smooth transition
2. Real-time comparison of production data
3. Complete monitoring & alerting facilities (finished before the stress test; metrics observed during the planned stress test)
4. Stress testing (at the 10% traffic stage)
5. Data backup and recovery drill (using Nebula snapshots) and a scaling drill

10. Required resources

3 Nebula machines, each with 8 cores, 64 GB RAM, and a 2 TB SSD.

6 Docker service instances, each with 2 cores and 4 GB RAM.

4. Refactoring benefits

After 2 months of hard work by the team, the grayscale rollout is complete. The benefits are as follows:

1. Nebula itself supports distributed scale-out, the new system's services support elastic scaling, and the system as a whole now supports horizontal performance scaling.

2. Judging from the stress test results, interface performance has improved significantly, and the supported request volume far exceeds expectations.

3. The system is connected to the company's unified monitoring and alerting, which makes later maintenance easier.

5. Summary

This refactoring was completed successfully. Thanks to the teammates who worked on it together, to the big-data and risk-control colleagues for their support, and also to the Nebula community (https://discuss.nebula-graph.com.cn/), which quickly answered the questions we raised.

Welcome to follow the WeChat official account "Talking about Architecture", where I share original technical articles from time to time.


Origin: blog.csdn.net/weixin_38130500/article/details/131929301