Common service products
| Function | Points to note | Requirements & Risks |
| --- | --- | --- |
| Cloud server (ECS) | Is the bandwidth of an ECS instance inherited directly from the bandwidth of the VPC network, or is it calculated separately according to the instance specification? | |
| Relational database | Backup cost calculation | |
| DTS | Does it support full + incremental synchronization, and can the task configuration be updated in real time? Under normal circumstances, data delay should be at the second level. | Can a task be resumed and retried? |
| CDN | During the transition, can it route back-to-origin requests to different origin sites according to URL rules? | URLs should contain a date or another string feature that can be split on. |
| Object storage (OSS) | Does it support secondary back-to-origin: going back to the original origin site and pulling the file into the current bucket? | |
| Application firewall (WAF) | If the origin site is not on the same vendor's cloud, what is the added delay, and do billing and bandwidth differ? Does it support SQL-like log queries? | |
| Cloud firewall | Basic component | |
| Block storage | Disk snapshot fees | |
| Load balancing | Does it support SQL-like log queries? | |
| Middleware: ES | Does it support flexible specification downgrades, and is it supported by migration tools? | |
| Middleware: Redis | Does it support flexible specification downgrades, and is it supported by migration tools? | |
| Middleware: Kafka | Does it support flexible specification downgrades? What is the default number of partitions per topic, and can you adjust it yourself? | |
| VPC network | | |
| VPN | Does it support flexible specification downgrades? | |
| Log service | | |
| Data transmission | Traffic cost of data transmission | |
ECS
The most basic building block. Comparable service providers differ little in product capability here, so major problems are unlikely.
Points to note:
Run single-node and cluster performance tests and that is usually enough. Pay special attention to the network throughput of the instance: sometimes the performance bottleneck is the traffic.
Relational Database
Equally fundamental. Comparable service providers differ little in product capability, so major problems are unlikely.
Points to note:
When calculating the cost, include the cost of backups. Generally speaking, backups run at least twice a week, and most service providers require them to be kept for at least 7 days.
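The backup share of the bill can be estimated before choosing a vendor. The following is a minimal sketch, assuming every backup is a full uncompressed copy and storage is billed at a flat per-GB monthly price. The function names, the sizes and the price are hypothetical placeholders, not any vendor's actual billing model.

```python
import math


def retained_backup_gb(db_size_gb: float, backups_per_week: int, retention_days: int) -> float:
    """Estimate the storage held by backups once retention is in steady state.

    Assumes every backup is a full, uncompressed copy of the database,
    so this is an upper bound for most real setups.
    """
    # Number of backups that fall inside the retention window.
    backups_kept = math.ceil(backups_per_week * retention_days / 7)
    return db_size_gb * backups_kept


def monthly_backup_cost(db_size_gb: float, backups_per_week: int,
                        retention_days: int, price_per_gb_month: float) -> float:
    """Monthly backup storage cost under a flat per-GB price (an assumption)."""
    stored_gb = retained_backup_gb(db_size_gb, backups_per_week, retention_days)
    return stored_gb * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical numbers: 500 GB database, 2 backups a week, kept for 7 days.
    print(retained_backup_gb(500, 2, 7))
    print(monthly_backup_cost(500, 2, 7, price_per_gb_month=0.02))
```

Real vendors often bill incremental or compressed backups differently, so treat the result as a rough ceiling when comparing quotes.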
DTS
Top priority! It directly determines the efficiency and success of the migration (unless the data volume is very small).
Points to note:
Test the stability of tasks and the integrity of the data in advance. A task must support modification after it is created, so that you do not have to rebuild the task for every change.
Ideally the service includes a function for checking whether the data is consistent; otherwise you can only write your own Python scripts.
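If the DTS product has no built-in consistency check, a self-written script might look like the sketch below. It compares a row count and a content digest per table. The standard-library `sqlite3` module stands in for the real source and target databases, and `table_fingerprint` and `tables_match` are hypothetical helper names, not part of any DTS product.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table: str, key_column: str):
    """Return (row_count, sha256 hex digest) over all rows ordered by key.

    `table` and `key_column` are interpolated into the SQL text, so they
    must be trusted identifiers, never user input.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    digest = hashlib.sha256()
    count = 0
    for row in cursor:
        # repr() of the row tuple gives a stable text form for hashing.
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
        count += 1
    return count, digest.hexdigest()


def tables_match(source, target, table: str, key_column: str) -> bool:
    """True when both sides hold the same rows in the given table."""
    return (table_fingerprint(source, table, key_column)
            == table_fingerprint(target, table, key_column))


if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    for conn in (src, dst):
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
    print(tables_match(src, dst, "users", "id"))
```

For large tables a full scan is expensive; hashing by key range and comparing range by range is one way to narrow down where the two sides differ.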
CDN and object storage
Points to note:
Avoid a one-shot cutover. Migrate in stages, and check that the files are complete after each stage (automated scripts or tools can help).
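The completeness check after each stage can be automated by comparing object listings from both sides. This is a minimal sketch, assuming each listing has already been exported as a mapping from object key to `(size_bytes, checksum)`. How the listing is produced depends on the vendor and is not shown, and `diff_manifests` is a hypothetical helper name.

```python
def diff_manifests(source: dict, target: dict):
    """Compare two {object_key: (size_bytes, checksum)} manifests.

    Returns (missing, mismatched): keys absent from the target, and keys
    present on both sides whose size or checksum differs.
    """
    missing = sorted(key for key in source if key not in target)
    mismatched = sorted(
        key for key in source
        if key in target and source[key] != target[key]
    )
    return missing, mismatched


if __name__ == "__main__":
    old_bucket = {"2021/03/a.png": (120, "aa11"), "2021/03/b.png": (300, "bb22")}
    new_bucket = {"2021/03/a.png": (120, "aa11")}
    missing, mismatched = diff_manifests(old_bucket, new_bucket)
    print("missing:", missing)
    print("mismatched:", mismatched)
```

Running this per migrated stage (for example per year and month) keeps each comparison small and makes any gap easy to re-copy.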
| CDN | During the transition, can it route back-to-origin requests to different origin sites according to URL rules? |
| Object storage (OSS) | Does it support secondary back-to-origin: going back to the original origin site and pulling the file into the current bucket? |
These two capabilities make a gradual switch possible and avoid the implementation cost and risk of a one-shot cutover.
In the early stage of our migration, we relied on the service provider's engineers to configure a date-based rule on the backend, which switched CDN access for URLs of certain years and months to the new origin site, so that historical files could be migrated there first.
The object storage product then supports secondary back-to-origin: when the local bucket does not have the requested file, it can be configured to pull the file from our old bucket. With that, the object storage switch was seamless.
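The two mechanisms can be illustrated with a small simulation. This is only a sketch of the logic, not vendor configuration: the host names, the `/YYYY/MM/` URL layout and both function names are assumptions, and plain dictionaries stand in for the buckets.

```python
import re
from typing import Optional

# Hypothetical origin hosts.
OLD_ORIGIN = "old-origin.example.com"
NEW_ORIGIN = "new-origin.example.com"

# Assumes URLs carry the upload date as /YYYY/MM/ somewhere in the path.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/")


def choose_origin(path: str, migrated_months: set) -> str:
    """CDN side: send already-migrated months to the new origin."""
    match = DATE_IN_PATH.search(path)
    if match and (int(match.group(1)), int(match.group(2))) in migrated_months:
        return NEW_ORIGIN
    return OLD_ORIGIN


def fetch_with_fallback(key: str, new_bucket: dict, old_bucket: dict) -> Optional[bytes]:
    """Storage side: on a miss, pull from the old bucket and keep a copy."""
    if key in new_bucket:
        return new_bucket[key]
    data = old_bucket.get(key)
    if data is not None:
        # Secondary back-to-origin: store the object in the current bucket.
        new_bucket[key] = data
    return data


if __name__ == "__main__":
    print(choose_origin("/img/2021/03/a.png", {(2021, 3)}))
    print(choose_origin("/img/2023/01/b.png", {(2021, 3)}))
```

URLs without a recognizable date keep going to the old origin in this sketch, which matches the requirement that URLs contain a date or another feature that can be split on.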
PS: For a relatively large platform, a service provider that cannot offer comparable capabilities is not recommended. CDN and object storage are closely coupled and problems usually involve both. Without flexible back-to-origin configuration, the risk of the migration increases greatly.
Application Firewall WAF
Platform security is the top priority! Generally you would choose a product from the same vendor as your cloud servers, but completeness of functionality is the deciding factor.
Points to note:
Alibaba Cloud's product is our first recommendation: mature, reliable, easy to use, with very convenient log queries. It has only one drawback: it is expensive!
In our measurements, Alibaba Cloud WAF going back to an origin site on another cloud provider added about 30 ms of delay, which is acceptable.
Also, the Alibaba Cloud Enterprise Edition comes with 100M of bandwidth, but only 30M when going back to origin across service providers, so additional bandwidth has to be purchased.
SQL-like log queries are crucial for later operation and maintenance; they greatly shorten the time needed to locate and solve problems.
cloud firewall
A basic security component. The offerings are all similar, so there is not much to say.
block storage
Essentially cloud disks attached to ECS instances.
Points to note:
Watch the cost of snapshots: the disk backup strategy directly determines the total snapshot size.
load balancing
Another fundamental component; there is not much to say.
Points to note:
Check how complete the logging function is, although it is not essential, since ELK can fill the gap.
log service
Functionally there is not much to say; just check whether the price is suitable.
middleware
ES
Functionally there is not much to say; check whether the price is suitable and whether flexible configuration changes are supported.
Kafka
Functionally there is not much to say; check whether the price is suitable and whether flexible configuration changes are supported.
In addition, pay attention to the default number of partitions for a single topic.
Redis
Functionally there is not much to say; check whether the price is suitable and whether flexible configuration changes are supported.
Load testing
Besides routine single-node and cluster load tests with JMeter, you can use GoReplay to replay copied production traffic as a load test.
There are many tutorials online, so I won't repeat them. In practice, I found that the concurrency produced by replaying 100% of the traffic was far lower than the real load, roughly 1/10 of it (after ruling out performance problems on the load-generating machine), so we raised the replay rate to 1000%.
Migration preparation
Take stock of the current state and write it down in documents
Resource Matrix
Records the resource addresses, users and passwords involved (database, Redis, ES, Kafka, object storage), to make searching and replacing them easier.
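Once the matrix exists, the search and replace step can be scripted. The sketch below assumes the matrix is kept as a mapping from old address to new address; the addresses and function names are hypothetical, and credentials would be handled the same way but belong in a secret store rather than in code.

```python
import re

# Hypothetical resource matrix: old address -> new address.
RESOURCE_MATRIX = {
    "10.0.0.5:6379": "redis.new.example:6379",
    "10.0.0.5": "db.new.example",
    "es-old.example:9200": "es-new.example:9200",
}


def rewrite_config(text: str, matrix: dict) -> str:
    """Replace every old address in one pass.

    Longer addresses are tried first so that an address which is a prefix
    of another (host vs. host:port) is not rewritten prematurely.
    """
    if not matrix:
        return text
    ordered = sorted(matrix, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(old) for old in ordered))
    return pattern.sub(lambda m: matrix[m.group(0)], text)


def leftover_addresses(text: str, matrix: dict) -> list:
    """List old addresses that still appear in a config after rewriting."""
    return [old for old in matrix if old in text]


if __name__ == "__main__":
    config = "redis=10.0.0.5:6379\ndb=10.0.0.5\nes=es-old.example:9200\n"
    migrated = rewrite_config(config, RESOURCE_MATRIX)
    print(migrated)
    print(leftover_addresses(migrated, RESOURCE_MATRIX))
```

Running `leftover_addresses` over every project's configuration is a cheap way to confirm that no old address was missed before the switch.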
Project-dependent resource details
Compile a list of the middleware that each project involved in the migration depends on. It is used to check whether each system's configuration changes are complete, and whether the resources the system depends on are ready and can be migrated.
Scheduled task details
Scheduled tasks are an important part of the business and a place where migrations are prone to problems. For each task, sort out and make clear:
the system it depends on, its execution time, and the business it executes
Configuration transformation
One codebase should support deployment in multiple data centers
Move the local configuration files into a configuration center (Apollo, Nacos) so that the same codebase can be deployed directly on different cloud services, pulling its configuration from the configuration center in each cloud to adapt to the local middleware and resources.
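The idea is that the code only knows which site it is running in, and everything else comes from that site's configuration center. The following sketch uses an in-memory dictionary as a stand-in for the configuration center; it does not call the Apollo or Nacos client APIs, and the `DEPLOY_SITE` variable, site names and keys are all hypothetical.

```python
import os

# Stand-in for what each cloud's configuration center would serve.
CONFIG_CENTER = {
    "cloud-a": {
        "redis.host": "redis.cloud-a.example",
        "es.hosts": "es.cloud-a.example:9200",
    },
    "cloud-b": {
        "redis.host": "redis.cloud-b.example",
        "es.hosts": "es.cloud-b.example:9200",
    },
}


def load_config(site=None, center=CONFIG_CENTER) -> dict:
    """Return the configuration for the site this process is deployed in.

    The site is taken from the argument, or from the DEPLOY_SITE
    environment variable (a hypothetical name) when no argument is given.
    """
    site = site or os.environ.get("DEPLOY_SITE")
    if site not in center:
        raise KeyError(f"no configuration published for site {site!r}")
    # Return a copy so callers cannot mutate the shared configuration.
    return dict(center[site])


if __name__ == "__main__":
    config = load_config("cloud-b")
    print(config["redis.host"])
```

Because the same build is deployed everywhere, the only per-environment difference is the site name, which also keeps middleware access inside the local data center.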
Minimize cross-data-center calls
Unless the budget is very generous, a dedicated line will generally not be used. The migration plan should therefore avoid cross-data-center calls as far as possible and keep interactions within the local data center.
Necessary double-write transformation
Data synchronization can rely on DTS
Redis generally does not need real-time synchronization; a full copy is enough
Kafka normally does not need its data synchronized
ES normally needs incremental synchronization: the search data changes constantly, and with full synchronization only it is very troublesome to keep the two ends consistent at switch time. We have not found a really useful tool so far, so during the implementation we developed a double-write scheme that synchronizes index updates asynchronously.
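An asynchronous double write can be sketched as follows. This is an illustration of the general pattern, not the scheme we actually deployed: the `DualWriter` class is hypothetical, and dictionary-like objects stand in for the old and new ES clusters so the example runs without an Elasticsearch client.

```python
import queue
import threading


class DualWriter:
    """Write to the primary index synchronously, and to the secondary
    index from a background thread so callers are not slowed down."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.failed = []  # writes that need to be retried or reconciled
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def index(self, doc_id, doc):
        self.primary[doc_id] = doc          # the old cluster stays authoritative
        self._pending.put((doc_id, doc))    # the new cluster is updated later

    def _drain(self):
        # A single worker and a FIFO queue keep updates in write order.
        while True:
            doc_id, doc = self._pending.get()
            try:
                self.secondary[doc_id] = doc
            except Exception:
                self.failed.append((doc_id, doc))
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued write has been attempted."""
        self._pending.join()


if __name__ == "__main__":
    old_cluster, new_cluster = {}, {}
    writer = DualWriter(old_cluster, new_cluster)
    writer.index("1", {"title": "hello"})
    writer.flush()
    print(new_cluster == old_cluster)
```

Failed secondary writes are only recorded here; in practice they would need a retry or a periodic full copy to bring the two clusters back in line.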
Switching
Changing the DNS CNAME binding is not recommended for the switch: every change takes time to take effect, which brings uncertain risk.
If a WAF is in use, first consider switching by changing the back-to-origin address in the WAF, which takes effect in real time (Alibaba Cloud WAF supports wildcard domain configuration, which makes the switch more convenient).
Without a WAF, you can configure server groups with weights on the cloud provider's CLB to control the distribution of traffic (this option incurs additional traffic cost, because the distributed traffic still goes over the public network).
Launch plan
Prepare a launch plan that details each operation step, its operator, and its time.
This is our plan, with sensitive details removed:
| Step | Operation content | Person in charge/team | Completion |
| --- | --- | --- | --- |
| Configuration transformation | | | |
| Deploy new environment | | | |
| Bring ES cluster double-write online | | | |
| Full database copy | | | |
| New environment function test | | | |
| Load test | | | |
| Traffic replay test | | | |
| Upgrade Alibaba Cloud WAF traffic configuration | | | |
| Stop scheduling in the new environment | | | |
| Start database incremental synchronization | | | |
| Redis replication | Once a day | | |
| Copy ES | Double-write + one copy every day | | |
| Check configuration center content | | | |
| Switching phase | | | |
| Stop scheduling in the old environment | | | |
| Change WAF back-to-origin | | | |
| Stop incremental data copy | | | |
| Start scheduling in the new environment | | | |
| Check the status of scheduled tasks | Needs extra logging to assist | | |
| Check nginx logs (parallel) | | | |
| Check frontend syslog (parallel) | | | |
| Check server syslog (parallel) | | | |
| Start regression testing (parallel) | | | |
| Switch complete | | | |
Problems encountered on switch day
Follow the steps above and there are basically no major problems. Something always slips through the net, though. Don't panic when you hit a problem, and don't rush to change code: there is a high probability that it is configuration-related.
The main issues we found during the switching phase:
The Redis access password configured for a few individual systems was incorrect
Because some systems were still left in the old environment, problems in the nginx redirect configuration meant the old services could not be reached correctly
No problems requiring code changes were found that day; the whole switch ran from 23:50 to 03:00 the next morning