Cloud Service Migration Summary

Common service products

Function

Points to note

Requirements & Risks

Cloud server (ECS)

Whether the ECS instance's bandwidth is inherited directly from the VPC network bandwidth, or has to be calculated separately from the instance specification

Relational Database

Backup cost calculation

DTS

Whether it supports full + incremental synchronization, whether the task configuration can be updated on the fly, and whether data latency stays at the level of seconds under normal conditions

Whether tasks support resume and retry

CDN

During the transition, whether back-to-origin requests can be routed to different origin sites according to URL rules

URLs should contain dates or other string features that can be used to split traffic

OSS object storage

Whether it supports second-level back-to-origin: on a miss, go back to the original origin site and pull the file into the current storage bucket

Application Firewall WAF

If the origin site is not hosted with the same cloud vendor, how much latency is added, and are billing and bandwidth any different?

Whether it supports SQL-like log queries

Cloud firewall

Basic components

Block storage

Disk snapshot fees

Load balancing

Whether it supports SQL-like log queries

middleware-ES

Whether it supports flexible downgrading, and whether migration tools support it

middleware-Redis

Whether it supports flexible downgrading, and whether migration tools support it

middleware-Kafka

Whether it supports flexible downgrading

What the default number of partitions per topic is, and whether you can adjust it yourself

VPC network

VPN

Whether it supports flexible downgrading

Log service

Data transmission

Traffic cost of data transmission

ECS

The most basic of basics. Comparable service providers generally differ little in product capability, so no major problems are expected

Points to note:

Run single-node and cluster performance tests and you will be fine. Pay special attention to the instance's network throughput; sometimes the performance bottleneck is the network traffic.

Relational Database

Also the most basic of basics. Comparable service providers differ little in product capability, so no major problems are expected

Points to note:

Include the cost of backups when calculating cost. Generally, backups run at least twice a week, and most service providers require them to be kept for at least 7 days

DTS

Top priority! It directly determines the efficiency and success of the migration (unless the data volume is very small)

Points to note:

Test the stability of the tasks and the integrity of the data in advance. A task must support modification after it is created, so that you are not forced to rebuild it for every change.

Ideally the product has a built-in data consistency check; otherwise you have to write the check yourself in Python
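
If you do end up writing the check yourself, a minimal sketch might look like the following. It assumes both sides are reachable through DB-API connections and that every table has an orderable key column; the in-memory sqlite3 databases, the `orders` table, and the function names are stand-ins invented for illustration, not part of any DTS product.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Return (row_count, sha256 hex digest) for a table, reading rows in key order."""
    digest = hashlib.sha256()
    count = 0
    cursor = conn.cursor()
    # Table and column names come from a hand-maintained list here;
    # never build SQL like this from untrusted input.
    cursor.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    for row in cursor:
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def find_mismatched_tables(source, target, tables):
    """Compare fingerprints table by table and return the names that differ."""
    return [
        name
        for name, key in tables.items()
        if table_fingerprint(source, name, key) != table_fingerprint(target, name, key)
    ]


def _demo_db(rows):
    # Hypothetical demo schema standing in for a real business table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


old_db = _demo_db([(1, 100), (2, 250)])
new_db = _demo_db([(1, 100), (2, 999)])  # simulated drift on row 2
print(find_mismatched_tables(old_db, new_db, {"orders": "id"}))  # ['orders']
```

Hashing `repr(row)` only works when both drivers return identical Python types; across different database engines you would likely need to normalize values first, and for large tables to fingerprint in key ranges rather than whole tables.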

CDN and object storage

Points to note:

Avoid a one-shot cutover: migrate in stages, and check that the files are complete after each stage (automated scripts or tools can help)
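
As a rough sketch of such a script: assume you can export a listing of each bucket as a mapping from object key to size in bytes (how you obtain the listing depends on the vendor's tooling and is not shown). The keys and sizes below are made up for illustration.

```python
def diff_manifests(source, target):
    """Compare two {object_key: size_in_bytes} manifests.

    Returns (missing, mismatched): keys absent from the target,
    and keys present on both sides whose sizes differ.
    """
    missing = sorted(key for key in source if key not in target)
    mismatched = sorted(
        key for key in source if key in target and source[key] != target[key]
    )
    return missing, mismatched


# Hypothetical listings of the old and new buckets after one migration stage.
old_bucket = {"2021/03/a.jpg": 1024, "2021/03/b.jpg": 2048, "2021/04/c.jpg": 512}
new_bucket = {"2021/03/a.jpg": 1024, "2021/03/b.jpg": 2000}

missing, mismatched = diff_manifests(old_bucket, new_bucket)
print("missing:", missing)           # missing: ['2021/04/c.jpg']
print("size mismatch:", mismatched)  # size mismatch: ['2021/03/b.jpg']
```

Size alone will not catch corrupted content of the same length; comparing a content hash, where the storage service exposes one, is stricter.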

CDN

During the transition, whether back-to-origin requests can be routed to different origin sites according to URL rules

OSS object storage

Whether it supports second-level back-to-origin: on a miss, go back to the original origin site and pull the file into the current storage bucket

Together these two capabilities make a gradual switch possible and avoid the implementation cost and risk of a one-shot cutover

Early in our migration, we relied on the service provider's engineers to configure a date rule on their back end, so that CDN requests for URLs from certain years and months went to the new origin site. That let us migrate the historical files first.

The object storage product also supports second-level back-to-origin: when the requested file is not in the local bucket, it can be configured to go back to an origin and pull the file from our old bucket. With that in place, the object storage switch was seamless.
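
The two behaviours described above are vendor-side features, configured rather than coded. The sketch below only models their logic to make it concrete. The URL layout (`/files/<yyyy>/<mm>/<name>`), the origin host names, and the dict-based buckets are all assumptions for illustration.

```python
import re

# Hypothetical URL layout: /files/<yyyy>/<mm>/<name>
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/")


def choose_origin(path, migrated_months, new_origin, old_origin):
    """Send a request to the new origin only if its year/month has been migrated."""
    match = DATE_IN_PATH.search(path)
    if match and (match.group(1), match.group(2)) in migrated_months:
        return new_origin
    return old_origin


def read_through(key, new_bucket, old_bucket):
    """Serve from the new bucket; on a miss, pull from the old bucket and keep a copy."""
    if key not in new_bucket:
        new_bucket[key] = old_bucket[key]  # KeyError if the file exists nowhere
    return new_bucket[key]


migrated = {("2021", "03")}
print(choose_origin("/files/2021/03/a.jpg", migrated, "new-origin.example", "old-origin.example"))
print(choose_origin("/files/2022/01/b.jpg", migrated, "new-origin.example", "old-origin.example"))

old_store = {"files/2022/01/b.jpg": b"image-bytes"}
new_store = {}
print(read_through("files/2022/01/b.jpg", new_store, old_store))
print("files/2022/01/b.jpg" in new_store)  # True: the miss populated the new bucket
```

The first rule moves whole months to the new origin as they are migrated; the second catches anything not yet copied, so a request never fails just because a file has not been moved.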

PS: For a relatively large platform, a service provider that cannot offer comparable capabilities is not recommended. CDN and object storage are tightly coupled and usually have to be debugged together; without flexible back-to-origin configuration, the risk of the migration rises sharply

Application Firewall WAF

Platform security is the top priority! You will generally choose a product from the same vendor as your cloud servers, but completeness of features is the deciding factor

Points to note:

Alibaba Cloud's product is our first recommendation. It is mature, reliable, and easy to use, and log queries are very convenient. It has one drawback: it is expensive!

In our measurements, Alibaba Cloud WAF going back to an origin site at another cloud provider added 30 ms of latency, which is acceptable

Also, the Alibaba Cloud Enterprise Edition comes with 100M of bandwidth, but only 30M when the origin is at a different service provider, so additional bandwidth has to be purchased.

SQL-like log queries are crucial for later operation and maintenance and greatly shorten the time needed to locate and solve problems

Cloud firewall

A basic security component; the products are all much alike, so there is not much to say

Block storage

Essentially the cloud disks attached to ECS instances

Points to note:

Watch the cost of snapshots; the disk backup strategy directly determines the snapshot size

Load balancing

The most basic of basics; there is not much to say

Points to note:

Check how complete the logging features are, but this is not a must, since ELK can cover it

Log service

There is not much to say about features; just check whether the price is right

Middleware

ES

There is not much to say about features; check whether the price is right and whether it supports flexible allocation

In addition, pay attention to the number of shards used by default for a single topic

Kafka

There is not much to say about features; check whether the price is right and whether it supports flexible allocation

Redis

There is not much to say about features; check whether the price is right and whether it supports flexible allocation

Pressure test

Besides the routine single-node and cluster pressure tests with JMeter, you can use GoReplay for pressure tests based on traffic replication

There are many tutorials online, so I won't repeat them. In actual use, I found that replaying 100% of the traffic produced far less concurrent load than the real traffic, only about 1/10 (after ruling out performance problems on the load-generating machine), so we raised the replay rate to 1000%.

Migration preparation

Take stock of the current state and write it down in documents

Resource Matrix

Makes it easy to find and replace the resource addresses, users, and passwords involved (database, Redis, ES, Kafka, object storage)
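
One way to put such a matrix to work is to keep it as an old-to-new mapping and apply it mechanically to configuration text. The sketch below assumes that shape; every host name in it is a made-up placeholder.

```python
# Hypothetical resource matrix: old address -> new address.
RESOURCE_MATRIX = {
    "mysql-old.internal:3306": "mysql-new.internal:3306",
    "redis-old.internal:6379": "redis-new.internal:6379",
    "es-old.internal:9200": "es-new.internal:9200",
}


def rewrite_config(text, matrix):
    """Replace every old address found in a config text; also return which ones were hit."""
    replaced = []
    for old, new in matrix.items():
        if old in text:
            text = text.replace(old, new)
            replaced.append(old)
    return text, replaced


config = "db.url=mysql-old.internal:3306\ncache.url=redis-old.internal:6379\n"
new_config, replaced = rewrite_config(config, RESOURCE_MATRIX)
print(new_config)
print("replaced:", replaced)
```

Returning the list of replaced entries makes it easy to spot a system whose configuration still points at a resource that is missing from the matrix.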

Project-dependent resource details

Compile a list of the middleware that each project involved in the migration depends on. It is used to evaluate and check whether each system's configuration transformation is complete, and whether the resources the system depends on are ready and can be migrated

Scheduled task details

Scheduled tasks are an important part of the business and a place where migrations easily go wrong. The following must be sorted out and made clear:

The system each task depends on, its execution time, and the business logic it executes

Configuration transformation

One codebase should support deployment to multiple data centers

Move local configuration files into a configuration center (Apollo, Nacos) so that the same code can be deployed unchanged in different clouds, pulling its configuration from the configuration center in each cloud to adapt to the local middleware and resources
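
The idea can be sketched without any particular client library. Below, a plain dictionary stands in for the configuration centers running in each cloud, and a deploy-time variable named `DEPLOY_CLOUD` (a name invented here) selects which one to read. A real deployment would call the Apollo or Nacos client instead.

```python
import os

# Stand-in for the configuration centers running in each cloud.
# A real client would fetch these values over the network.
CONFIG_CENTERS = {
    "old-cloud": {"redis.host": "redis-old.internal", "es.host": "es-old.internal"},
    "new-cloud": {"redis.host": "redis-new.internal", "es.host": "es-new.internal"},
}


def load_config(environ=os.environ):
    """Pick the configuration center from a deploy-time variable, not from the code."""
    cloud = environ.get("DEPLOY_CLOUD", "old-cloud")
    return CONFIG_CENTERS[cloud]


# The same code yields different middleware addresses depending on where it runs.
print(load_config({"DEPLOY_CLOUD": "old-cloud"})["redis.host"])  # redis-old.internal
print(load_config({"DEPLOY_CLOUD": "new-cloud"})["redis.host"])  # redis-new.internal
```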

Minimize calls across data centers

Unless the budget is very generous, a dedicated line is generally not used. So in the migration plan, avoid cross-data-center calls as far as possible and keep interactions inside the local data center.

Necessary double-write transformation

Data synchronization can rely on DTS

Redis generally does not need real-time synchronization; a full copy is enough

Kafka normally does not need data synchronization

ES normally needs incremental synchronization: the searchable data changes constantly, so with full synchronization only it is very hard to keep the two sides consistent at switch time. We have not found a really good tool so far, so during implementation we developed a double-write scheme that synchronizes index updates asynchronously
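
The source does not show the scheme itself, so the following is only one possible shape for an asynchronous double write, not the author's implementation. Plain dictionaries stand in for the two ES clusters, and the class and method names are invented for illustration.

```python
import queue
import threading


class DoubleWriter:
    """Write to the primary index synchronously and to the secondary in the background."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = queue.Queue()
        self.failed = []  # updates that could not be applied, kept for later replay
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def index(self, doc_id, doc):
        self.primary[doc_id] = doc       # the request path only waits for this write
        self.pending.put((doc_id, doc))  # the secondary is updated asynchronously

    def _drain(self):
        while True:
            doc_id, doc = self.pending.get()
            try:
                self.secondary[doc_id] = doc
            except Exception:
                self.failed.append((doc_id, doc))
            finally:
                self.pending.task_done()

    def flush(self):
        """Block until every queued update has been processed."""
        self.pending.join()


old_cluster, new_cluster = {}, {}
writer = DoubleWriter(old_cluster, new_cluster)
writer.index("1", {"title": "first"})
writer.index("1", {"title": "first, edited"})
writer.flush()
print(old_cluster == new_cluster)  # True
```

A single worker draining a FIFO queue keeps updates to the same document in order. An in-process queue loses pending updates if the process dies, so a production version would more likely buffer them in something durable.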

Switching

Changing the DNS CNAME binding is not recommended as the way to switch, because every change takes time to propagate, which brings uncertain risk

If a WAF is in use, first consider switching by changing the back-to-origin address in the WAF, which takes effect in real time (Alibaba Cloud WAF supports wildcard domain configuration, which makes the switch more convenient)

Without a WAF, you can configure server groups + weights on the cloud provider's CLB to control how traffic is distributed (this option incurs extra traffic cost, because the distributed traffic still travels over the public network)
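
To get a feel for what the weights do, here is a small simulation. It is not how any particular CLB is implemented; the group names and the 90/10 split are arbitrary examples.

```python
import random
from collections import Counter


def pick_group(weights, rng):
    """Pick a server group with probability proportional to its weight."""
    groups = list(weights)
    return rng.choices(groups, weights=[weights[g] for g in groups], k=1)[0]


rng = random.Random(42)  # fixed seed so repeated runs give the same split
weights = {"old-env": 90, "new-env": 10}
hits = Counter(pick_group(weights, rng) for _ in range(10_000))
print(hits)  # roughly 9 requests to old-env for every 1 to new-env
```

Raising the weight of the new environment step by step shifts traffic gradually, and setting the old environment's weight to zero completes the switch.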

Launch plan

Prepare a launch plan, detailing each operation step, operator, and time

This is our plan, with sensitive details removed

Step

Operation content

Person in charge/team

Completion

Configuration transformation

Deploy new environment

Bring ES cluster double write online

Full database copy

New environment function test

Pressure test

Traffic replication test

Upgrade Alibaba Cloud WAF traffic configuration

Stop new environment scheduling

Start database incremental synchronization

Redis replication

Once a day

Copy ES

Double write + do one drop every day

Check configuration center content

Switching phase

Stop old environment scheduling

Change the WAF back-to-origin address

Stop incremental data copy

Start new environment scheduling

Check the status of scheduled tasks

Additional logging is needed to help with this

Check nginx logs (parallel)

Check front-end system logs (parallel)

Check server system logs (parallel)

Start regression testing (parallel)

Switch complete

Problems encountered on switch day

Follow the steps above and there will be basically no major problems. Something always slips through the net, though. Don't panic when you hit a problem, and don't rush to change code; it is most likely a configuration issue.

The main issues we found during the switching phase:

The Redis access password configured for a few systems was incorrect

Because some systems were still left in the old environment, problems in the nginx redirect configuration prevented requests from reaching the old services

No problems requiring code changes were found that day. The whole switch ran from 23:50 to 03:00 the next morning


Origin blog.csdn.net/windywolf301/article/details/129031090