Thousands of servers, ten million users: Easyhome's two-year cloud native transformation journey

 

Overview: Traditional enterprises usually have a top-down decision chain, so an Internet-style transformation is not just a matter for R&D; the company's executives also need to upgrade their thinking and knowledge. This is what Easyhome's Tangping Designer experienced on its road to the cloud over the past few years. This article focuses on Easyhome's cloud native practice with Alibaba Cloud Container Service (ACK), and aims to help readers understand how a traditional company evolved from a traditional monolithic architecture to a cloud native one.

 

In 2009, the Homestyler R&D team was established and began exploring the first version. Ten years on, the product has officially been renamed Tangping Designer (literally "lie flat designer") and has nearly ten million users. Over more than two years of cloud native transformation, the team moved everything to the cloud, decommissioned the thousands of servers it had been delivering and operating, and then explored and adopted Serverless and Service Mesh to complete the transformation. The result is overall availability of three nines or better, while IT costs fell by almost half. This article shares Tangping Designer's cloud native practice.

 

Since Matt Stine of Pivotal first proposed it in 2013, the concept of cloud native has been gradually reshaping the entire software life cycle. A well-architected cloud native system is largely self-healing, cost-effective, and easy to update and maintain through CI/CD (continuous integration / continuous delivery). Fortunately, the servers, disks, and networks that make up the cloud are the same as in traditional infrastructure, which means that almost all good architectural design principles still apply to cloud native architecture.

 

But in the cloud, some basic assumptions about how an architecture operates change. For example, provisioning a replacement server in a traditional environment may take several weeks, whereas in a cloud environment it takes only a few seconds. Application architecture needs to take this into account.

 

This article is based on an exclusive interview with Xie Kang, R&D director of Easyhome's Tangping Designer. It traces the path this enterprise took from a traditional monolithic architecture to a cloud native one.

 

Practical background

 

Tangping Designer, formerly Homestyler, is Easyhome's one-stop service brand for home improvement design, comprising design tools and a community. It primarily serves home improvement designers: more than 400,000 designers are currently active on the platform, and internationally the figure has passed nine million.

 

Before the cloud native transformation, the technology stack was fairly traditional. The early core algorithms and systems were built in C++ and Scala, but experienced Scala engineers had become hard to recruit, and rewriting the whole platform in Java in the short term would have cost too much. At the same time, iteration was very slow, requirements took a long time to deliver, capacity for innovation was clearly insufficient, and the costs of server operations, networking, and so on were becoming unbearable.

 

In the early days, the problems of the traditional stack were less obvious: fewer than 10 servers were enough to support the whole business, and infrastructure costs were affordable. But as the user base grew, especially with the rapid growth of overseas users, even expanding rapidly to thousands of servers was not enough to ride out the daily traffic peaks.

 

Back-end rendering is a compute-intensive task with very high CPU demand, and the number of tasks fluctuates widely between daily peaks and troughs. At peak hours, a rendering job often had to wait tens of minutes or even hours, which is unacceptable to designers. Worse, when the required computing resources exceeded what the cluster was designed for, every task could fail in an avalanche-style crash. During traffic troughs, large numbers of servers sat idle and resources were poorly utilized.

 

In addition, the Tangping Designer R&D team's strengths lie in vertical domains such as 3D graphics and image processing, yet it was putting a great deal of effort and money into non-core work such as infrastructure maintenance. As scale kept growing, this cost climbed higher, software development costs became increasingly hard to control, and resources that should have gone into the core product were diluted.

 

The whole R&D team was in a painful state at the time. Xie Kang quipped in the interview: "Back then, our technology gap with the leading Internet companies was probably 5 to 10 years."

 

Finally, under the pressure of being about to lose its first-mover advantage in its vertical, and facing a huge, bloated technology platform pieced together from multiple programming languages, the team gritted its teeth and decided to "revolutionize itself": open up the underlying technology stack to the cloud platform, migrate to a cloud native architecture, put down the burden it had been carrying, and return its attention to the core business.

 

Cloud native evolution

 

Once the decision to move to the cloud was made, the team's requirements for a cloud platform were fairly clear.

 

Xie Kang said:

 

  • First, stability: a self-built data center struggles to achieve high stability, and the team frequently ran into problems caused by the network and by hardware and software facilities;
  • Second, elasticity of the overall system: it must scale out rapidly during traffic peaks and reclaim resources when traffic is low in order to reduce costs;
  • Finally, high performance: as stated earlier, design rendering demands a lot of computing power and is CPU-intensive, which a traditional self-built data center finds hard to deliver. By contrast, the convenience of automated operations and the extensibility of the system were lower priorities and seemed less urgent.

 

In the end, after careful consideration and evaluation, the team began its cloud native transformation on AWS in 2016 (and later migrated entirely to Alibaba Cloud).

 

The first stage: organizational restructuring and microservice transformation

 

In 1967, Melvin Conway proposed what is now known as Conway's Law, which can be summed up in one sentence: "Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations." In other words, an organization's structure is reflected in its system architecture, and traditional companies are mostly used to being driven top-down. Therefore, before the microservice transformation, Tangping Designer first adjusted its organizational structure and personnel.

 

Xie Kang said that when Tangping Designer was first established, most of its staff came from Autodesk, and it was later incorporated into the Easyhome group. This history is reflected in its decision-making model and organizational structure, which differ greatly from those of domestic Internet companies. To transform, the organizational structure had to be adjusted first. Specifically, the technical team has unique strengths in core algorithms such as image processing and 3D graphics, so those R&D engineers were retained as the focus, while the middleware and infrastructure operations work the team was not good at was handed over to Alibaba Cloud.

 

The network and data center teams essentially disappeared from the organization and the operations team shrank significantly. Meanwhile the product, algorithm, and big data teams, including engineers working on large-scale real-time computing, all grew, and the company began recruiting engineers with cloud native skills such as ServiceStack and Docker.

 

Once the organizational restructuring was complete, the team started the microservice transformation. At this point, a new problem emerged.

 

Tangping Designer's system originally used a monolithic architecture similar to that of most traditional enterprises, but unlike most of them it was very large: hundreds of services were blended together, with very complex logical dependencies between them. Doing the microservice transformation on premises before moving to the cloud would have made the decomposition time-consuming and labor-intensive, and arguably impossible in the short term.

 

Xie Kang said the team took a very aggressive approach at the time: it used Service Mesh to sink service governance capabilities into the infrastructure, so that the system could run on the cloud platform without deep modification of the applications.

 

Service Mesh was also chosen because the original technology stack was written in Scala and C++, which is hard to run well in the cloud without deep modification and has relatively weak microservice support.

 

After that, the team could split and refactor the applications into microservices step by step. Operations became much easier, and deployments could scale without affecting normal business. Xie Kang added that Tangping Designer was probably among the earliest companies to use Service Mesh. After a fairly short migration period, all of its core business was running on the Alibaba Cloud platform, and the microservice transformation was partially complete.

 

Strictly speaking, however, this phase was a forced "lift and shift" onto the cloud. Although the team enjoyed the elasticity, availability, and computing power provided by the cloud platform itself, the overall gain in value was not very obvious.

 

Nevertheless, the team went ahead with the second phase of the transformation as planned, for the following reasons:

 

  • First, continuing toward cloud native had become a company mandate: whether on public or private cloud, without cloud native capabilities the technology generation gap between Homestyler (later renamed Tangping Designer) and mainstream Internet companies would only grow;

  • Second, since the Internet 2.0 era began, scale and the speed of product iteration have become increasingly important to enterprises. Without migrating to the cloud platform and using rapid iteration to polish the product and grow quickly, the company would soon lose its first-mover advantage and face heavy survival pressure.

 

The second stage: from on the cloud to cloud native

 

As the first phase neared completion, Tangping Designer immediately began the second: deeper microservice decomposition, automated cluster management, and on-demand scaling of the cluster. Xie Kang said the second phase moved from merely being on the cloud toward being purely cloud native, and its value was more obvious: delivery speed and operations automation both improved dramatically after the transformation.

 

For example, before the cloud native transformation, Tangping Designer's delivery cycle was measured in quarters. The lead time is now down to weeks, with a major feature upgrade every two weeks. During traffic peaks the cluster can scale out in minutes, so designs can be rendered promptly even at peak times, in some cases faster than locally installed software, and the large-scale compute stalls of the past no longer occur.

 

Like other Internet companies, Tangping Designer has a large number of web applications and microservices, all of which run on Alibaba Cloud Container Service (ACK).

 

The difference, Xie Kang said, is that more of Tangping Designer's servers run compute-intensive tasks. Most servers in the original IDC data center were 48-core, 96-core, or even custom models, with hundreds of such high-end machines, and running them at their compute limit for long periods led to a very high hardware failure rate.

 

Moving such a system to the cloud at scale was not easy. Architects and technical experts from Alibaba Cloud worked with the team throughout the transformation, repeatedly discussing the plan, testing, and overcoming many difficulties, and eventually solved the problems of running compute-intensive tasks on Alibaba Cloud.

 

Today, Tangping Designer has decommissioned all of its on-premises rendering servers and moved the workload to Alibaba Cloud hosts and the ACK container service, achieving elastic scaling at the application level. The team no longer has to worry about a sudden hardware failure triggering an avalanche in computing capacity, and cluster size is adjusted dynamically as users submit tasks.
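One way to picture this kind of dynamic resizing is the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule to a hypothetical rendering worker pool. The function name, the chosen metric (pending render tasks per worker), and the bounds are illustrative assumptions and are not taken from the interview.

```python
import math


def desired_workers(current_workers: int,
                    pending_per_worker: float,
                    target_per_worker: float,
                    min_workers: int = 1,
                    max_workers: int = 1000) -> int:
    """Proportional scaling rule in the style of the Kubernetes HPA.

    All parameters are hypothetical; a real cluster would read the
    metric from its monitoring pipeline.
    """
    if current_workers < 1 or target_per_worker <= 0:
        raise ValueError("current_workers must be >= 1 and target must be > 0")
    raw = math.ceil(current_workers * pending_per_worker / target_per_worker)
    # Clamp so that a spike or a quiet period cannot push the pool
    # outside the configured bounds.
    return max(min_workers, min(max_workers, raw))


if __name__ == "__main__":
    # Peak: 10 workers, each with 12 queued tasks against a target of 4.
    print(desired_workers(10, 12.0, 4.0))
    # Trough: 30 workers, each with 0.5 queued tasks against a target of 4.
    print(desired_workers(30, 0.5, 4.0))
```

The same rule covers both scale-out at peak and scale-in during troughs, which is the behavior the cost argument depends on.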

 

Beyond solving on-demand scaling of computing resources, Xie Kang said that Serverless and Service Mesh are increasingly widely used at Tangping Designer. As the cloud native transformation continues, more and more compute nodes that need to scale on demand are being decoupled. Logic that previously depended heavily on written state and data, and that frequently caused problems, is being decoupled through refactoring and migrated entirely to Serverless nodes.

 

Driven by a series of triggers, this not only makes the whole system more flexible; paying only while the code is running is also very economical. Istio on ACK can be used to create, control, and observe microservices, and makes it easy to implement load balancing, service-to-service authentication, and monitoring.

 

The Service Mesh cluster carries almost all of the basic services. The second phase is essentially a process of deep microservice decomposition: using the strangler pattern, the team gradually strips away the inherited monolith piece by piece, until each logical group of services can scale freely and degrade gracefully on failure.
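The strangler pattern mentioned above can be sketched as a routing layer that sends already-migrated paths to new services and everything else to the legacy system. This is a minimal in-process illustration only; in practice the routing would more likely live in a gateway or in mesh routing rules. The handler names and path prefixes are hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_monolith(path: str) -> str:
    """Stand-in for the original monolith (hypothetical)."""
    return f"legacy:{path}"


def render_service(path: str) -> str:
    """Stand-in for a newly extracted microservice (hypothetical)."""
    return f"render-service:{path}"


class StranglerRouter:
    """Routes by path prefix; anything not yet migrated falls through to legacy."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        """Register a new service as the owner of a path prefix."""
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # Longest matching prefix wins, so nested routes can migrate independently.
        matches = [p for p in self._migrated if path.startswith(p)]
        if matches:
            return self._migrated[max(matches, key=len)](path)
        return self._legacy(path)


if __name__ == "__main__":
    router = StranglerRouter(legacy_monolith)
    print(router.handle("/render/job/1"))   # served by the monolith
    router.migrate("/render", render_service)
    print(router.handle("/render/job/1"))   # now served by the new service
    print(router.handle("/community/feed")) # still served by the monolith
```

Each call to `migrate` moves one slice of traffic, so the monolith shrinks incrementally and a failed migration can be rolled back by removing a single route.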

 

Initially, the team also considered building its own private cloud on OpenStack.

 

But whether private or public, cloud cluster management is made up of a series of separate but interrelated projects, each of which in turn consists of several components. These components cooperate with one another to build the cloud ecosystem: one is responsible for network management, another for storage management, another for image management.

 

For Tangping Designer at the time, the overall complexity was too high, and cloud R&D talent was scarce; even with the willingness to invest, recruiting the right people in this area was not easy. So the final decision was to go with a public cloud vendor, use the infrastructure and services the public cloud provides, and focus on the team's own core business and algorithm development.

 

As is well known, Kubernetes has become the de facto standard. Tangping Designer therefore uses Alibaba Cloud Container Service (ACK), whose mature container and container orchestration capabilities let R&D ignore the complexity of IaaS while providing powerful configuration and management features, which greatly simplifies configuration management during development.

 

The third stage: DevOps practices and follow-up planning

 

Although the second phase of the transformation is still under way, the Tangping Designer technical team has already planned its next technical steps.

 

Next, advancing DevOps, edge computing, and deeper integration of Service Mesh and Serverless with large-scale computing power will be important directions for the R&D team's cloud native practice. Xie Kang said there is still room to improve DevOps, and that solutions originally built on generic approaches need some customization to fit the public cloud and the team's own needs.

 

In simple terms, for DevOps:

 

  • First, timely awareness: the operational status of every node in the system must be observable;

  • Next, remediation: problems at each stage should be resolved by technical means rather than by simply rebooting. Every link in the core chain must be able to self-heal, so that when a major fault occurs the system can trip a circuit breaker or degrade, avoiding global paralysis;

  • Finally, intelligent operations: apply machine learning to historical data to manage the overall health of the system, use AI algorithms to automate operations, and ultimately reach unattended operations.
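The "circuit breaker or degrade" behavior in the second point can be illustrated with a minimal circuit breaker. This is a sketch under stated assumptions, not the team's implementation: the class, its thresholds, and the injected clock are hypothetical, and in a Service Mesh setup this logic is often configured in the mesh rather than written in application code.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after repeated failures and serves a fallback while open."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after      # seconds to stay open
        self._clock = clock                 # injectable for testing
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: degrade immediately without touching the dependency.
                return fallback()
            # Half-open: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        return result
```

A caller would wrap each call to a fragile dependency, for example `breaker.call(lambda: submit_render(job), lambda: "queued for retry")`, where `submit_render` is a hypothetical function. While the breaker is open, the failing dependency receives no traffic, which is what keeps a local fault from spreading into a global outage.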

 

In addition, as cloud native becomes more standardized, breakthroughs are beginning in highly automated CI/CD and in edge computing, which is another direction Tangping Designer is watching. Xie Kang explained that although the company's flagship capability is in the cloud, users interact with Tangping Designer very frequently, and the traffic generated in the process is relatively large.

 

If some user behavior could be handled at edge network nodes rather than all being concentrated in cloud nodes, the user experience could improve greatly. A breakthrough in edge computing would go a long way toward solving this problem, so the team is following developments in the field closely.

 

Practical effect

 

Beyond the shorter iteration cycles and performance gains described above, the transformation also greatly improved the experience of Tangping Designer's users. Xie Kang summarizes the results in four points:

 

  • Infrastructure costs halved: even as scale grew, investment in infrastructure fell by nearly 50%;
  • R&D costs: more than 50% of the team is now R&D and product staff. With infrastructure handed over to Alibaba Cloud, the whole team can concentrate on core business development, which greatly increases delivery speed;
  • System availability rose to 99.96%: even with fewer people and lower costs, overall availability improved greatly;
  • Increased security: previously, constrained by the overall architecture, once past 99% the cost of each additional digit after the decimal point rose nonlinearly. With the cloud native transformation, a higher level of assurance can be achieved relatively economically.
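To put the availability figure in perspective, an availability percentage can be converted into an allowed-downtime budget. The helper below is an illustrative calculation only; the function name is hypothetical, and it assumes availability is measured over wall-clock time across a 365-day year.

```python
def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at a given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_days * 24 * 60


if __name__ == "__main__":
    for level in (0.99, 0.999, 0.9996, 0.9999):
        hours = downtime_budget_minutes(level, 365) / 60
        print(f"{level:.2%}: {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.96% leaves roughly 3.5 hours of downtime per year, compared with 87.6 hours at 99%, which shows how much each additional digit tightens the budget.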

 

Conclusion

 

Looking back over the whole process, the transformation was not easy. Xie Kang said that Tangping Designer was at least lucky to settle on a transformation plan quickly and put it into practice successfully.

 

For companies that want to go cloud native, Xie Kang advises building up the relevant technical reserves and being psychologically prepared for painful surgery. A smooth transition that requires no hard work is not possible; technical renewal is essential.

 

Second, although cloud native components and the related specifications are maturing, the threshold remains high for traditional businesses. The organization must be eager to improve its technical capability and reach internal consensus on structural adjustment, and the whole R&D system must prepare adequately. Do not assume that handing everything to a cloud vendor lets you sit back and reap the benefits, or that you can move to the cloud in a single step. Throughout the move, decision makers must be ready to adapt to change at any time.

 

Finally, as noted above, traditional enterprises usually have a top-down decision chain.

 

An Internet-style transformation is therefore not just a matter for R&D: the company's executives also need to upgrade and update their thinking and knowledge. This is what Tangping Designer learned on its road to the cloud over the past few years.


Guest Introduction:

Xie Kang is a veteran with ten years in Internet technology who previously worked at Shanda, Ctrip, and Tongcheng-Elong. He focuses on large-scale, highly available Internet architecture design and on putting cloud native into practice, has many years of platform architecture experience, and has designed and implemented microservice and DevOps platforms hosting thousands of services. He now works at Tangping Designer, where he is responsible for the overall technical architecture and for driving cloud native adoption.

 

To learn more about the ACK container service, see: https://www.aliyun.com/product/kubernetes


Origin www.cnblogs.com/alisystemsoftware/p/11387232.html