Ele.me's technical past (part 1)

Introduction: Ele.me is an Internet startup that grew up in the mobile Internet era, with both its business volume and its technical team growing tenfold. Its experience is a microcosm of the technical teams at many Internet startups. This article records the experience and lessons we gathered along the way. (Huang Xiaolu)



Ele.me’s technical system has gone through the following four stages:

1. An early architecture in which the core system was All in one;
2. A fully service-oriented architecture, built by splitting the system along domain lines and separating business systems from middleware and other infrastructure;
3. An infrastructure system that moved from traditional operation and maintenance to DevOps as the automation platform and container scheduling system matured;
4. A Cloud Ready architecture, built on a multi-data-center system, taking shape.

Throughout this period, rapid business growth, accidents large and small, changes and mergers in the organizational structure, the team's engineering culture and technology-stack backgrounds, and conflicts between different technical philosophies were all intertwined. They acted on one another and shaped how the architecture changed.

The first stage: All in one

This is what the Ele.me technology system looked like early on. The technical co-founder led a group of geeks to build the earliest technology system from 0 to 1, supporting order volume at the million level.

At this stage the business was racing ahead and the technology was scrambling to keep up with it; speed mattered above all.

The technology stack was mainly Python, with some PHP systems, deployed in a mode where multiple applications shared a single machine. Application releases and system operation and maintenance were basically done by development engineers typing at the command line. The core business systems (merchants, users, and transactions) all shared one codebase, built in a system called zeus. As the business grew rapidly in a short period, the online databases could no longer support ETL needs, so a big data system and a data warehouse began to be built.

Construction began on the Shanghai data center, which housed big data, and the Beijing data center, which housed the online business systems. These two data centers witnessed the changes in our architecture until everything later moved to the cloud. In the earliest days the technical team had just over 40 people, and sometimes the founders had to go to the machine room themselves to move servers.

When the system could not keep up with the pace of business growth, the core system went through some embarrassing periods of intermittent downtime. Some newly launched business systems also went down repeatedly as soon as they went online, and the business had to be slowed down temporarily. But the process was also rewarding: many development engineers built very strong online troubleshooting skills and became very fluent at scripting.

Engineers in this period often filled several roles at once, including front-end, back-end, development, testing, and operations and deployment. They had a deep understanding of the business and sometimes acted as both technologist and product manager.

Gains and lessons: the importance of culture

Early Ele.me had a distinctive trait: everyone was highly responsible, open, and tolerant, and shirking or dodging responsibility was rare. Although many of the accidents of that time look quite serious in retrospect, the organization was relatively tolerant of the mistakes technical staff made as they grew. The team was relatively flat, and technical disputes between managers and their reports were common, but they were argued on technical merits.

Ele.me's engineering culture was quite strong: engineers thought about whether more utilization could be squeezed out of the servers; before deciding to use Redis at scale, they read the Redis source code; many solutions were settled by finding a bar counter and a whiteboard, discussing quickly, reaching consensus quickly, and going live. Perhaps this atmosphere attracted many like-minded people and shaped the culture of the technical team. The fact that the original culture formed by the founding technical team has carried on is the foundation that allowed the team to grow from a few dozen people to thousands while keeping its cohesion and execution.

The second stage: splitting up and infrastructure

The architecture of the technical system affects the efficiency of business delivery, so the system has to be restructured or even rebuilt. If an unreasonable organizational structure hinders the iteration of the system architecture and becomes a bottleneck for business development, then the organizational structure has to be adjusted.

To use an analogy: if the business system is a racing car, then the infrastructure is the racetrack. Infrastructure was also the focus of our work at this stage, laying the foundation for rapid business growth later on.

At this stage we faced several problems:

The technology stack was Python-based. The existing engineers were strong individual performers, but at that time Python engineers were in seriously short supply on the market.

In the All in one system, the business domains were not separated and the code of the business modules was interleaved, which hurt delivery efficiency at a time when the business needed to develop rapidly.

Infrastructure and business system development were not separated. Development engineers who held multiple roles each had strengths and weaknesses across infrastructure operation and maintenance, middleware development, and front-end and back-end business system development.

The traditional manual mode of release, deployment, operation and maintenance, and monitoring (SSH to a server and run scripts by hand) was inefficient. Recovery took a long time when an accident occurred, and releases were hard to roll back.

Formation

As business volume grew rapidly and the business systems became more complex, the technical team expanded too. At that time the talent market supplied more Java engineers than Python engineers, so there was more choice. As a result, many domains led by Java began to grow, gradually forming two major technology stacks, Python and Java.

Multiple roles emerged: front-end, mobile, back-end business applications in various domains, operation and maintenance, middleware, risk control and security, big data, project management, and so on. Engineers in different roles did what they were good at. At this stage, the systems and the organization settled into several parts: business development, operation and maintenance, and shared components and middleware.

Business system

1. Division of business areas

The monolithic system began to be split by domain. For the All in one system, the business domains were delineated, the technical leads in each area claimed the domains they would be responsible for, and the organizational structure was adjusted accordingly; completing this division was very difficult. The systems for shopping guide, search and recommendation, marketing, transactions, finance, public services, merchant products, merchant fulfillment, customer service, the newly built logistics systems (waybill, delivery, and scheduling), and the big data warehouse gradually identified their own domains, subdomains, and corresponding modules. In this process, some technical leads did not pay attention to staffing right after taking over their domains, which left them short of delivery capacity and made them a bottleneck for business development. In the transition from technical lead to team leader, it is easy to overlook the team's personnel structure.

2. System split

As requirements in each domain iterated rapidly, the systems also expanded quickly, and mutual dependencies and domain boundaries grew complicated. Things that used to be a "closed loop" inside one codebase now required interaction between systems. Previously one could even modify another domain's code directly, raise a PR, or access another team's database, but that no longer worked. Splitting a monolith into domain services changed the old way of thinking, caused a lot of discomfort, and led to some detours. For performance, the shopping guide domain built its own cache of merchant product data for product queries, so it had to understand the merchant domain's business and subscribe to changes in this master data, and the merchant side could not keep this data closed within its own domain. Cache freshness, changes to the underlying data structures, and system refactoring all became troublesome. The transaction domain also suffered: to avoid depending on transactions, some domains kept redundant copies of most of the order data. In the logistics fulfillment domain, several downstream systems held redundant copies of data. All of this left domain responsibilities without closure, brought many consistency problems, increased system complexity, and raised the cost of system interaction and communication.

At the other extreme, systems were split too finely and called one another frequently, so systems that should have been highly cohesive were broken into low-cohesion pieces. Orders and logistics both went through the pain of excessive asynchrony: fault recovery took too long, and complexity and troubleshooting costs rose. During this period, disagreements over domain boundaries were also a headache.
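The cache freshness problem described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the merchant domain publishes change events carrying a per-product version number; the names ProductChanged and ProductCache are hypothetical and do not come from Ele.me's systems. Even this small consumer has to know the shape of the merchant data and guard against stale or out-of-order events, which is the kind of coupling the text describes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ProductChanged:
    """Hypothetical change event published by the merchant domain."""
    product_id: str
    version: int       # assumed to increase with every change to the product
    name: str
    price_cents: int


class ProductCache:
    """Hypothetical read cache kept by a consuming domain (e.g. shopping guide)."""

    def __init__(self) -> None:
        self._items: Dict[str, ProductChanged] = {}

    def on_change(self, event: ProductChanged) -> bool:
        """Apply an event; return False if it is stale or a duplicate."""
        current = self._items.get(event.product_id)
        if current is not None and current.version >= event.version:
            return False  # an equal or newer version is already cached
        self._items[event.product_id] = event
        return True

    def get(self, product_id: str) -> Optional[ProductChanged]:
        """Return the cached product, or None if it has never been seen."""
        return self._items.get(product_id)
```

If the merchant domain renames a field or changes how versions are assigned, this consumer breaks too, so a refactoring on one side becomes a coordinated change across domains.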

Experience and lessons: Conway's law and technical culture

In the order fulfillment system there was a team that belonged to the merchant domain and was responsible for the order push system. The system's main responsibility was to handle merchants' requests for delivery and push orders to the logistics waybill center. To reduce its dependence on the order system, so that fulfillment could continue even if the order system failed, the development team kept redundant copies of a lot of order data. It therefore had to handle both the forward and reverse flows of orders and waybills as well as its own availability, and the system ended up being designed in a rather complicated way.

During this time, whenever a project involved interaction between the logistics and order push systems, the two teams often disagreed about domain boundaries in scenarios that involved fetching data from orders. The logistics team held that "I should not have to understand this part of the logic; you should fetch it from the upstream system and push it to us", while the team in the merchant domain held that "This data does not belong to the order domain and the order system itself does not use it; logistics needs it, so logistics should fetch it itself".

Similar problems kept recurring, the repeated discussions consumed a lot of everyone's time, and they would clearly keep happening for the foreseeable future. Long call chains also brought hidden risks to stability. In the end, the order push system was handed over to the team responsible for the order domain. The logic could then be closed within the order domain, and the problem was solved.

So when similar problems keep recurring at the boundary between two domains, it may be worth considering Conway's law: a system's design tends to mirror the communication structure of the organization that builds it.

On controversy: heated debate during the system design stage is entirely reasonable. Thorough discussion greatly reduces the chance that a design fails and spares the development stage much rework. Technical discussions should center on technical soundness and stick to the matter at hand. Do not shirk responsibility, and do not make the final decision for non-technical reasons. Once the discussion is over, everyone should be able to accept the decision with good grace. Most importantly, the engineers involved fully understand the trade-offs behind the final design, so there is no deviation during implementation and no blame-shifting when problems arise. This is also what makes the technical atmosphere of many teams attractive. Culture is not a slogan; it is these everyday details and practices.

Operation and maintenance system

1. Team
The business operation and maintenance team responsible for software delivery, the system operation and maintenance team responsible for the underlying operating systems and hardware delivery, the DBA team responsible for databases, and the stability assurance team were established one after another.

2. Monitoring and alerting
The number of cluster instances grew dramatically. The operating mode of logging on to servers to read logs was no longer realistic and was replaced by a telemetry-based monitoring system. Along with the 7*24-hour NOC team (emergency fault response), an ELK-based monitoring system was set up, and core indicators were displayed on a monitoring wall.

Experience and lessons: the significance of monitoring and warning mechanisms

Once Internet applications reach a certain complexity, with huge clusters and, especially after containerization, dynamically allocated IPs, enormous log volumes, and complex, scattered log data, monitoring and troubleshooting have to rely on the same idea used to troubleshoot satellites: telemetry. The logging system needs to support aggregation and querying. Monitoring needs to collect and sample various indicators in real time. The monitoring panel should show the system's current health at any moment, and the alerting mechanism should detect the first signs of smoke as early as possible.

There was once an incident in which everything looked normal on the monitoring panel. The root cause later turned out to be an int32 overflow bug that caused orders to fail. Why did monitoring not show it? Because the code swallowed the exception and the call returned success, while our core indicators relied on the count of successful calls to the order interface plus some exception indicators.
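The failure mode in this incident can be reproduced in a few lines. The following is a minimal Python sketch, not Ele.me's actual code: the function names, the metric names, and the use of `struct.pack` to stand in for a 32-bit storage column are all assumptions made for illustration.

```python
import struct
from collections import Counter

metrics = Counter()      # hypothetical in-process metric store
INT32_MAX = 2**31 - 1


def save_order(order_id: int) -> None:
    """Stand-in for a storage layer that keeps ids as 32-bit signed integers."""
    # struct.pack raises struct.error when the value does not fit in int32.
    struct.pack("<i", order_id)


def place_order_swallowing(order_id: int) -> bool:
    """Anti-pattern: the exception is swallowed and the call reports success."""
    try:
        save_order(order_id)
    except Exception:
        pass  # the failure disappears here
    metrics["order_api.success"] += 1
    return True


def place_order_reporting(order_id: int) -> bool:
    """Report the failure and count an order only after it is really saved."""
    try:
        save_order(order_id)
    except struct.error:
        metrics["order_api.failure"] += 1
        return False
    metrics["order_api.success"] += 1
    metrics["orders.persisted"] += 1
    return True
```

With the first version, an id above `INT32_MAX` still increments `order_api.success`, so a panel built on successful call counts stays green while no order is actually stored.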

After we overhauled the key indicators, our monitoring panel focused on three types of core indicators:

Business indicators: these track the overall health of the business in real time. From them you can see directly the extent, duration, and scope of business damage, for example when order volume dropped, by what percentage, and for how long, or the success rate of visits to key pages. For the order case above, the indicator has to be recorded after the order has actually been placed successfully (by the logic responsible for persisting the order), so that it truly reflects the state of the business. Note that such indicators usually involve sensitive business information, so they need some processing before display.

Application indicators: these track the real-time health of the application, including call volume, latency, success rate, and exceptions for the application itself and its direct upstream and downstream dependencies. For security reasons, take care not to expose sensitive system information, especially for business systems in the finance and security domains.

System indicators: these track the real-time health of middleware and the operating system, such as network input/output, CPU load and utilization, and memory usage. When a fault occurs, these indicators usually turn abnormal one after another, so it is necessary to work out which is the cause and which is the effect to avoid being misled.
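As a rough illustration of a business indicator, the Python sketch below answers the three questions from the text (when orders dropped, by what percentage, and for how long) by comparing orders per minute against a baseline. The 30% threshold, the per-minute granularity, and the names `DropReport` and `detect_order_drop` are assumptions for the example, not details from the original system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DropReport:
    start_minute: int        # index of the first minute below the threshold
    duration_minutes: int    # length of the contiguous drop
    max_drop_percent: float  # deepest drop relative to the baseline


def detect_order_drop(
    actual: List[int],
    baseline: List[int],
    threshold_percent: float = 30.0,
) -> Optional[DropReport]:
    """Return the first contiguous window in which orders per minute fall
    at least threshold_percent below the baseline, or None if there is none."""
    n = min(len(actual), len(baseline))
    start: Optional[int] = None
    max_drop = 0.0
    for minute in range(n):
        expected = baseline[minute]
        # Skip the comparison when there is no baseline to compare against.
        drop = 0.0 if expected <= 0 else (expected - actual[minute]) * 100.0 / expected
        if drop >= threshold_percent:
            if start is None:
                start = minute
            max_drop = max(max_drop, drop)
        elif start is not None:
            return DropReport(start, minute - start, max_drop)
    if start is not None:
        return DropReport(start, n - start, max_drop)  # drop still ongoing
    return None
```

For this to reflect the real state of the business, `actual` would need to count orders that were really persisted rather than successful interface calls.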

Of course, the above is not enough; monitoring has other areas that need attention. As the business develops, it brings more challenges to our monitoring system, and in the next stage the monitoring system will undergo radical changes.

Another lesson from this accident was the need to isolate critical paths from non-critical paths: the int32 overflow bug was triggered on a non-critical path, but the critical paths had not yet been mapped out, so a non-critical failure affected the critical path. Accidents of this kind happened from time to time. As the critical paths were mapped out, the ability to degrade services began to be built up step by step.
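One common way to express this isolation in code is sketched below in Python. It is an illustration under assumptions, not the mechanism Ele.me built: the function names and the shape of the order dictionary are hypothetical. The critical step (persisting the order) is allowed to fail the request, while each non-critical step is wrapped so that its failure is logged and reported as degraded instead of being silently swallowed or propagated.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("order")


def run_non_critical(name: str, step: Callable[[], None], degraded: List[str]) -> None:
    """Run a non-critical step; on failure, log it and mark it as degraded."""
    try:
        step()
    except Exception:
        logger.exception("non-critical step %s failed; degrading", name)
        degraded.append(name)


def place_order(
    order: Dict,
    persist: Callable[[Dict], None],
    side_effects: Dict[str, Callable[[Dict], None]],
) -> Dict:
    """Persist the order on the critical path, then run side effects in isolation."""
    persist(order)  # critical path: any exception propagates to the caller
    degraded: List[str] = []
    for name, step in side_effects.items():
        # Bind step as a default argument so each lambda calls its own step.
        run_non_critical(name, lambda step=step: step(order), degraded)
    return {"order_id": order["id"], "accepted": True, "degraded": degraded}
```

The returned `degraded` list keeps the failure visible to callers and to monitoring, so an overflow in a statistics or notification step would show up as a degraded side effect rather than as a failed or falsely successful order.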

3. Business operation and maintenance
The business operation and maintenance team was responsible for initializing the runtime environments of many business systems, such as virtual machines, HAProxy, Nginx, Redis, RabbitMQ, and MySQL, and for capacity evaluation. Stability assurance was also one of its main responsibilities. This division of labor let development engineers focus more on delivering business systems, but it was also a double-edged sword; more on that later.

4. System operation and maintenance
With the establishment of professional operation and maintenance teams, systems migrated from physical machines to virtual machines. Service Instance per VM was the main deployment mode at this stage. Hardware operation and network planning gradually became more professional and standardized, and construction of a CMDB began.

5. DBA/DA
Database operation and maintenance was consolidated into this team, which was responsible for database capacity planning, reliability assurance, index optimization, and so on, and delivered many database operation tools, including a database monitoring system. The team also took part in data architecture design, evaluation, and technology selection for business systems.

6. Rules and policies
Definitions of failure levels, an architecture review mechanism, and a global project mechanism were also rolled out. Establishing rules, enforcing them, and putting people first are three things that are never easy to reconcile. If the people involved do not buy in, enforcement is watered down and drifts from the original intent of the rule, so rules also need to iterate. Rules are the bottom line; what the rules do not cover is upheld by team culture. But if culture is turned into rules, the losses outweigh the gains.

Experience and lessons: what exactly should an architect do?

The architect is a role, not a rank. The responsibilities of this role include but are not limited to the following:

1. Technical design and iteration planning for business systems
2. Definition and design of non-functional requirements, and technology selection
3. Governance of the existing architecture, division of domain boundaries, and trade-offs between design principles and reality (technical debt)
4. Future evolution of the architecture

However, without going deep enough, it is hard to achieve the above. Up-front design, follow-through during delivery, feedback from operating the live system afterwards, and continuous optimization and iteration all require the architect to lead or participate throughout.

Design is an easily overlooked step in agile development. As a process, we organized review meetings made up of architects from operation and maintenance, middleware, business development, database, and security and risk control. Design plans were reviewed and evaluated before requests for infrastructure resources were approved.

In fact, this kind of "prevention beforehand" before going live also has limits, because by then the design has already been produced. Although the design document is submitted ahead of the review, the reviewers have not taken part in the whole life cycle, so their depth is limited and they cannot cover everything. A better mechanism is for each team to have someone in the architect role who takes part in the whole design process. That is the real "beforehand".

Therefore, for a long time, architects led or were deeply involved in key cross-domain projects or projects spanning the whole technology center. In daily iterations, when there were major disagreements over a design or over domain boundaries, architects were often pulled in passively and ended up playing the role of neighborhood-committee mediator. With too wide an area to cover, architects became a bottleneck, yet the architecture of a business system is built precisely through daily iterations. Later, when the global architecture group was established, architects truly began to go deep into the day-to-day delivery of services in their respective domains:

As the owner, the person in charge of a design is responsible for it and follows it through delivery, with both the authority and the obligation to do so;

Build influence. There is no shortcut other than working on daily delivery alongside development engineers. This takes a relatively long time and an accumulation of projects; it cannot be done overnight. If the architect earns the recognition of front-line engineers and development managers in the process, influence gradually forms and grows; otherwise it gradually fades. An architect without influence can accomplish nothing, because organizationally the architect is not necessarily in the reporting line of the front-line development engineers or development managers. A neutral, objective attitude and an open mind are also key to building influence. Influence is the embodiment of an architect's leadership.

Stability assurance is always one of the architect's core concerns, and stability indicators are something the architect must own. These include delivery quality, availability, elasticity, and system capacity. It is hard to provide such assurance without going deep into specific domains. For this reason, architects need the authority to mobilize resources to ensure stability.

As for spreading technical culture and carrying out architecture plans, regular talks and sharing sessions help, but what matters more is the gradual, unspoken influence exerted during daily project delivery and technical discussions. Design thinking, design principles, and a shared technical culture are not formed by preaching; consensus is reached in project iterations, online troubleshooting, and accident review meetings.

Only on the basis of the points above can architects take overall responsibility for the architecture and drive its evolution. Stay tuned for next week's content: the implementation and challenges of Ele.me's fully service-oriented architecture, containerization practices and the DevOps transformation, and more.

Author: Huang Xiaolu (nickname: Pulse-kun), joined Ele.me in October 2015 and is responsible for global architecture.

Original link: https://developer.aliyun.com/article/776433?



Origin blog.csdn.net/alitech2017/article/details/109292002