ECS Active O&M 2.0: an upgraded experience, more results with less effort

Abstract: Alibaba Cloud is committed to providing a better operation and maintenance (O&M) experience, making the use of ECS more transparent and efficient, and making O&M more standardized and automated. With Active O&M 2.0, using ECS cloud servers is smoother. With system events, you no longer need to submit work orders to contact customer service: you can handle instance restarts triggered by active O&M yourself, reducing the impact on system reliability and business continuity.

      ECS (Elastic Compute Service) is an elastic, scalable computing service that helps you reduce IT costs, improve O&M efficiency, and focus on core business innovation. When you build a business system on ECS cloud servers, the advantages and features of cloud computing let you respond to business needs quickly and give you strong guarantees of business continuity. On this basis, Alibaba Cloud is committed to providing a better O&M experience, making the use of ECS more transparent and efficient, and making O&M more standardized and automated.

 

Active operation and maintenance

      Alibaba Cloud applies strict data center (IDC) standards, server admission standards, and O&M standards to ensure the high availability of the entire cloud computing infrastructure, the reliability of data, and the high availability of cloud servers. For a single ECS instance, Alibaba Cloud promises that service availability in a service cycle will not be lower than 99.95%; for a multi-AZ deployment in a single region, Alibaba Cloud promises that service availability in a service cycle will not be lower than 99.99%.
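      To put the 99.95% and 99.99% figures into perspective, the short sketch below converts an availability percentage into the downtime it allows per service cycle. It assumes a 30-day service cycle, which the article does not define, and the function name is purely illustrative.

```python
# Downtime allowed per service cycle for a given availability commitment.
# Assumption: one service cycle is a 30-day month (not stated in the article).

MINUTES_PER_CYCLE = 30 * 24 * 60  # 43,200 minutes in a 30-day cycle


def downtime_budget_minutes(availability_percent: float,
                            cycle_minutes: int = MINUTES_PER_CYCLE) -> float:
    """Return the maximum unavailable minutes allowed in one service cycle."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return cycle_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for level in (99.95, 99.99):
        minutes = downtime_budget_minutes(level)
        print(f"{level}% -> {minutes:.2f} minutes per cycle")
```

      Under that assumption, 99.95% leaves about 21.6 minutes of downtime per cycle and 99.99% leaves about 4.32 minutes.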

      At the infrastructure level, there are always potential factors, such as software bugs or hardware failures, that can affect the operation of ECS instances. To deliver the high service availability described above, active O&M is therefore indispensable alongside the high-availability design of the cloud computing infrastructure. Active O&M, the invisible guardian of ECS, proactively performs routine maintenance and fault detection on the physical servers that run ECS instances, and repairs potential faults through online or rolling upgrades wherever possible. This continuously improves system reliability, performance, and security protection, and keeps cloud servers running stably.

      In some cases, however, the physical server must be restarted or shut down for maintenance. The active O&M system then notifies the users whose ECS instances run on that server that their instances need to be restarted and migrated to a healthy physical server. Previously, after receiving such a notification, users had to submit a work order and contact customer service to authorize the operation. With Active O&M 2.0, this experience has improved in several ways.

 

Experience upgrade

 

1. Live migration in active O&M keeps instances running without interruption

      When active O&M detects that a physical server is at risk of failure, the system first tries to live-migrate the ECS instances on that server to another physical server. An instance that is live-migrated successfully is not interrupted, and its services stay online. Only the small number of instances for which live migration is risky enter the active O&M restart-and-migrate process. Since this strategy was introduced, the impact on users' business continuity has been effectively reduced. Even as the Alibaba Cloud user base has grown rapidly, the number of work orders related to active O&M has dropped by a factor of 125 year-on-year!

 

2. Clearer risk warnings, with the migration impact known in advance

      For instances that need to be restarted and migrated, Alibaba Cloud sends notifications and targeted prompts to users in advance. Local storage (a local disk) resides on a single physical server and is not backed by multi-replica distributed storage, so data on a local disk is erased during migration. For local disk instances, the notification therefore states this risk clearly and reminds you to back up your data before the migration. For cloud disk instances, the notification provides operation guidance. You no longer need to submit a work order to contact customer service; you can handle the instance restart and migration directly in the console or through the API.

 

3. No work orders needed: handle it yourself with system events

      Self-service handling of restart-and-migrate operations for cloud disk instances is available in the console and through the API. When you receive a scheduled system event for a restart and migration, you can see the execution plan for that event. Depending on your business needs, you can restart immediately, schedule a restart for off-peak hours, or wait for the system to execute the event as planned, and prepare your own O&M operations accordingly. This process no longer depends on work orders, which improves efficiency and reduces the impact of instance restarts on your running business.
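      The three options described above (restart now, restart during off-peak hours, or wait for the system) can be sketched as a small decision helper. This is only an illustration of the choice, not Alibaba Cloud's API: the function, its parameters, and the `Response` enum are hypothetical names, and the times are plain naive `datetime` values.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Tuple


class Response(Enum):
    """The three ways to respond to a scheduled restart event (hypothetical names)."""
    RESTART_NOW = "restart immediately"
    RESTART_OFF_PEAK = "schedule a restart during off-peak hours"
    WAIT_FOR_SYSTEM = "wait for the system to execute as planned"


def choose_response(now: datetime,
                    planned_at: datetime,
                    off_peak_start_hour: int,
                    busy_now: bool) -> Tuple[Response, datetime]:
    """Pick a response and the time at which the restart would happen.

    now                 -- current time
    planned_at          -- time at which the system restarts the instance itself
    off_peak_start_hour -- hour of day (0-23) when the off-peak window opens
    busy_now            -- True if the business is currently under load
    """
    if planned_at <= now:
        raise ValueError("the planned execution time has already passed")

    # If the business is idle right now, restarting immediately is simplest.
    if not busy_now:
        return Response.RESTART_NOW, now

    # Otherwise look for the next off-peak window opening after 'now'.
    candidate = now.replace(hour=off_peak_start_hour, minute=0,
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)

    # Use the window only if it opens before the system acts on its own.
    if candidate < planned_at:
        return Response.RESTART_OFF_PEAK, candidate
    return Response.WAIT_FOR_SYSTEM, planned_at


if __name__ == "__main__":
    action, when = choose_response(
        now=datetime(2018, 5, 1, 14, 0),
        planned_at=datetime(2018, 5, 3, 10, 0),
        off_peak_start_hour=2,
        busy_now=True,
    )
    print(action.value, "at", when)
```

      In a real setup, the planned execution time would come from the system event itself, and the chosen action would be carried out in the console or through the API.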


 

Do more with less

      The improvements described above let you do more with less effort. They come not only from the evolution of the active O&M process but also from the release of system events. System events make changes in the running status of ECS instances more visible to users, who can then take targeted action to respond to an event or avoid its impact on existing business. By closing the loop with system events, more O&M scenarios become standardized and automated, giving users a better O&M experience on the cloud.
