The technical support story behind the global live broadcast of Luo Pang's New Year's Eve speech: an interview with Fang Yuan, Chief Architect of Luoji Siwei

Introduction: In recent years, paid-knowledge products have taken the stage one after another. You may have heard about the recent New Year's Eve event held by Luoji Siwei, or have used the Dedao app to learn something new. The technical challenges and experience behind a product like Dedao, however, are not well known to the outside world. Fang Yuan, chief architect of Luoji Siwei, served as producer of the middleware track at the GIAC conference in December 2017, and High Availability Architecture took the opportunity to interview him.
Fang Yuan, Chief Architect of Luoji Siwei, previously worked on infrastructure R&D at Cisco and Sina Weibo. He has focused on back-end technology for more than ten years and has rich experience in message communication, distributed storage and related areas. His technical interests are broad, centering on programming languages such as Go, Java and Python, and especially their application in frontier fields such as cloud computing.
Question: Many friends around us use the Dedao app, but may not know much about the technical team behind it. Can you briefly introduce your technical team? As chief architect, what are your main day-to-day responsibilities?
Fang Yuan: Dedao's technical team mainly includes the front-end teams (Web, iOS, Android), the back-end team, the mall (e-commerce) team, the basic services team, the big data team, and the operations team. My job is mainly to lead the back-end team responsible for back-end development of the Dedao app, which covers day-to-day development of business features as well as the service frameworks and tools we use internally.
Question: You served as producer of the middleware track at the GIAC conference in December 2017. Can you explain to readers what middleware is? As an architect, how do you view the value of middleware?

Fang Yuan: Middleware is a technical component that is not business-specific. Personally, I think that in the broadest sense everything other than the operating system is middleware; in the usual, narrower sense, middleware mainly refers to message middleware, service frameworks, configuration services, caches and so on.
The most important value of middleware to an application system is reducing the complexity of the system's control logic so that engineers can focus as much as possible on business logic. For example, a service framework reduces the complexity of splitting services; a framework like DRDS lets programmers worry less about sharding databases and tables; and message middleware can be used to decouple application systems from one another.
For engineers, learning middleware is a quick way to improve abstraction and coding ability. Once you are familiar with a piece of middleware, it is best to implement a small demo of it to deepen your understanding.

Question: Most teams use open-source middleware in their projects, while some prefer to build their own. What are your suggestions on introducing versus self-developing middleware?

Fang Yuan: In the initial trial-and-error stage of a business, I recommend using mature open-source middleware to avoid pitfalls and speed up development. The selection should be based on your team's familiarity rather than blindly following trends, so as to reduce the team's overall learning cost.
Once the business has stabilized, you can do research and development based on your own business characteristics. You must also consider the company's resource investment and compatibility with mainstream data formats and communication protocols. Self-developed middleware usually has a long cycle, so it needs to be broken down into phased goals, which makes it easier to land. At the same time, you must take the old system into account and plan ahead for migrating online data, so that the business can transition smoothly.
Question: Alibaba re-established the Dubbo development team in 2017, and the GIAC middleware track also included related sharing. From what you saw at the conference, what is new with Dubbo?

Fang Yuan: Attendees' concerns about Dubbo were mainly two: first, how strong is Alibaba's support for Dubbo? Second, how can third-party teams contribute code to Dubbo?
Alibaba has picked Dubbo back up, and it is officially quite confident about making Dubbo an Apache project. Some time ago we also saw news of Dubbo 3.0. The new Dubbo kernel is completely different from Dubbo 2.0 while remaining compatible with it: Dubbo 3.0 will take Streaming as its core instead of the RPC of the 2.0 era, and RPC will become an optional form of remote Streaming in 3.0. For the specific changes we still need to wait for the new version of Dubbo to be released.

Question: Kafka has been widely used in the big data field in recent years. Can you introduce some developments around Kafka? Compared with previous years, have Kafka's usage scenarios and the concerns around it changed?

Fang Yuan: Key Kafka milestones in recent years:
0.8: added replica support, enhancing fault tolerance
0.9: added the group coordinator, completely solving dynamic rebalancing of partitions and consumers
0.10: added stream processing support, and moved consumer offsets into a default internal topic
August 5, 2017: Kafka released version 0.11, supporting exactly-once semantics and enhancing transaction processing capabilities
August 2017: LinkedIn open-sourced Kafka Cruise Control, providing automated operations capabilities
August 28, 2017: Confluent announced the open-sourcing of KSQL, a streaming SQL engine for Kafka
November 3, 2017: Kafka 1.0.0 was released, with enhanced stream processing capabilities
The biggest change in Kafka usage scenarios: at first, people mainly used Kafka for log processing systems; later it was mainly used as a message queue; and in the past two years, with the enhancement of Kafka's stream processing capabilities, it has gradually turned into a lightweight stream processing platform.
In addition, the LinkedIn team has recently done a lot of work on automated Kafka operations. At this conference, Qin Jiangjie from LinkedIn introduced Cruise Control, the automated operations tool they built; I had also seen an article on High Availability Architecture introducing this tool before.

Question: Yu Zhaohui from Qunar shared a topic on message queues (MQ) at the GIAC conference. Can you briefly introduce his talk? What inspiration does it offer for scenarios that use MQ?

Fang Yuan: Qunar's MQ sharing is very helpful for implementing distributed transactions. The talk had two parts: a simple model of distributed transactions, and the optimizations Qunar made in its self-developed middleware in order to support distributed transactions, along with a comparison against popular message middleware (e.g. Kafka and RocketMQ). It happens that our company also needs to implement distributed transactions, so we discussed a lot of details with the speaker to avoid stepping into the same pits.

Question: Let's turn back to your team. The recent New Year's Eve speech received a lot of attention. What preparations did you make for it?

Fang Yuan: We officially started preparations in October 2017, though the earliest work began in September. It focused mainly on the following aspects:
Sorting out the business architecture: We combed through many potential problems in the business architecture. For example, the early system had many two-way calls, many idle resources, a lot of code that caused read or write amplification, and many unreasonable call relationships. After this sorting, the structure of the business system became relatively clear, and a number of problems on the call links were identified.
Service/resource splitting: In the early days the main business system was a monolithic architecture; the core business call chain used only one database, and cache usage was concentrated in a few large cache clusters. We therefore did a lot of resource and service splitting to spread the pressure.
Refactoring of important service code: The main business modules were split into separate services, with resources properly abstracted. To cope with greater pressure we implemented a simple multi-level cache framework and used it in all of the code refactoring projects; this multi-level cache framework guarantees the processing capacity of the business systems.
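To make the multi-level cache idea above more concrete, here is a minimal sketch assuming a two-level design: a small in-process cache in front of a shared cache such as Redis, with a loader as the final fallback. The type and function names are illustrative and are not the team's actual framework.

```go
package cache

import (
	"sync"
	"time"
)

// Remote is the shared cache layer (e.g. Redis); it is abstracted here so the
// sketch stays self-contained. A real implementation would wrap a Redis client.
type Remote interface {
	Get(key string) (string, bool)
	Set(key, value string, ttl time.Duration)
}

type localEntry struct {
	value    string
	expireAt time.Time
}

// MultiLevelCache checks a small in-process cache first, then the shared
// cache, and finally falls back to a loader (e.g. a database query).
type MultiLevelCache struct {
	mu     sync.RWMutex
	local  map[string]localEntry
	ttl    time.Duration
	remote Remote
	loader func(key string) (string, error)
}

func New(remote Remote, ttl time.Duration, loader func(string) (string, error)) *MultiLevelCache {
	return &MultiLevelCache{
		local:  make(map[string]localEntry),
		ttl:    ttl,
		remote: remote,
		loader: loader,
	}
}

func (c *MultiLevelCache) Get(key string) (string, error) {
	// Level 1: in-process cache, absorbs most hot-key traffic.
	c.mu.RLock()
	if e, ok := c.local[key]; ok && time.Now().Before(e.expireAt) {
		c.mu.RUnlock()
		return e.value, nil
	}
	c.mu.RUnlock()

	// Level 2: shared cache cluster.
	if v, ok := c.remote.Get(key); ok {
		c.setLocal(key, v)
		return v, nil
	}

	// Level 3: load from the source of truth and backfill both levels.
	v, err := c.loader(key)
	if err != nil {
		return "", err
	}
	c.remote.Set(key, v, c.ttl)
	c.setLocal(key, v)
	return v, nil
}

func (c *MultiLevelCache) setLocal(key, value string) {
	c.mu.Lock()
	c.local[key] = localEntry{value: value, expireAt: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}
```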
Stress testing: The stress tests were divided into two parts. One part was done by development engineers before a feature went live; problems found there could be analyzed with Go's profiling tools, and a feature was released only after it met a set standard. The other part was the full-link stress testing provided by the Alibaba Cloud PTS team: within three months we ran 18 rounds of full-link stress tests covering the main interfaces (close to 200 interfaces, a coverage rate of nearly 50%). Single-service stress tests solved the performance problems of individual services, while the full-link stress tests exposed the problems caused by call links. After 18 rounds, the system's load capacity had increased by more than 25 times, ready for New Year's Eve.
API gateway: Even with the business splitting, service splitting and refactoring described above, we could not guarantee the system would be 100% problem-free, especially for the systems that had not been refactored; after all, a system's load capacity is determined by its shortest plank. Our answer was to introduce an API gateway. We brought it in during September and, before New Year's Eve, configured rate-limiting strategies for the interfaces that might run into trouble.
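As an illustration of the per-interface current limiting just described, here is a minimal sketch of a Go HTTP middleware using a token bucket from golang.org/x/time/rate. The path and limit values are illustrative assumptions, not the team's actual gateway configuration.

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

// limits maps an API path to its own token-bucket limiter, so a problematic
// interface can be throttled without affecting the rest of the system.
var limits = map[string]*rate.Limiter{
	// Illustrative values: 100 requests/second with a burst of 200.
	"/api/legacy/report": rate.NewLimiter(100, 200),
}

// withRateLimit rejects requests that exceed the per-path limit instead of
// letting them pile up on the backend.
func withRateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if l, ok := limits[r.URL.Path]; ok && !l.Allow() {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/legacy/report", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", withRateLimit(mux))
}
```

The point of putting the limiter in front of the weakest interfaces is that even if such a system is the shortest plank, requests beyond its capacity are rejected early instead of dragging it down.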
Question: We understand that Dedao did an important refactoring on the eve of New Year's Eve. Can you briefly introduce the background and results of this refactoring?

Fang Yuan: The background is actually fairly simple. On August 31 last year, the second product launch event was broadcast on Shenzhen Satellite TV and several video websites; it brought roughly 4 times the traffic of our usual morning and evening peaks and caused a major fault. So starting in September we concentrated part of our development effort on refactoring more than 10 important business modules. In the refactoring, we mainly considered the following points to optimize performance:
Strictly controlling the use of resources (databases, caches): Early services used resources at will, and a lot of code caused read or write amplification, so the new systems strictly control how resources are used.
Ensuring that non-core business can be automatically degraded: Calls for non-core business data can be degraded automatically (a small sketch of this kind of degradation is given after this list). However, because we use Go, we have not yet introduced a circuit-breaker mechanism; that is one of the goals we have set for 2018.
Guaranteeing the stability of core links: For the core New Year's Eve links such as listening, purchasing, user acquisition and redemption, we make sure the core links are not affected by non-core links.
Peak shaving and asynchronous writes: For the purchase flow we made parts of it asynchronous, and for business flows unrelated to purchases we did peak shaving through MQ. The effect was obvious: after peak shaving and merging of writes, the IOPS of the related databases dropped by an order of magnitude.
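Returning to the second point above, here is a minimal sketch of automatic degradation of non-core data, assuming a timeout-plus-fallback approach; it is only illustrative, and, as noted, a full circuit breaker was still on the 2018 roadmap at the time.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// recommendations is a non-core call: if it is slow or failing, the page can
// still render with an empty list instead of dragging down the core link.
func recommendations(ctx context.Context, userID string) ([]string, error) {
	// Imagine an RPC or cache call here; we simulate a slow dependency.
	select {
	case <-time.After(300 * time.Millisecond):
		return []string{"course-a", "course-b"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// recommendationsWithDegrade enforces a short timeout and falls back to a
// default value, so non-core data degrades automatically under pressure.
func recommendationsWithDegrade(userID string) []string {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	recs, err := recommendations(ctx, userID)
	if err != nil {
		// Degrade: log and return an empty fallback instead of failing the request.
		fmt.Println("recommendations degraded:", err)
		return nil
	}
	return recs
}

func main() {
	fmt.Println(recommendationsWithDegrade("u1001"))
}
```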
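And to illustrate the peak shaving and write merging in the last point, here is a minimal sketch in which a buffered channel stands in for the message queue and a consumer merges events into batched writes. The queue, batch size and flush interval are illustrative assumptions rather than the production setup.

```go
package main

import (
	"fmt"
	"time"
)

// writeEvent represents a non-critical write (e.g. a counter or log record)
// that does not have to hit the database synchronously.
type writeEvent struct {
	UserID string
	Delta  int
}

// events stands in for the message queue: producers enqueue and return
// immediately, so traffic spikes do not translate into database spikes.
var events = make(chan writeEvent, 10000)

// produce is called on the request path; it only enqueues.
func produce(e writeEvent) {
	events <- e
}

// consume drains the queue, merging events into batches so that one flush
// replaces many small writes and database IOPS drops accordingly.
func consume(batchSize int, flushEvery time.Duration) {
	batch := make(map[string]int)
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()

	flush := func() {
		if len(batch) == 0 {
			return
		}
		// In production this would be a single bulk UPDATE/INSERT.
		fmt.Printf("flushing %d merged rows\n", len(batch))
		batch = make(map[string]int)
	}

	for {
		select {
		case e := <-events:
			batch[e.UserID] += e.Delta
			if len(batch) >= batchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}

func main() {
	go consume(500, 100*time.Millisecond)
	for i := 0; i < 2000; i++ {
		produce(writeEvent{UserID: fmt.Sprintf("u%d", i%300), Delta: 1})
	}
	time.Sleep(time.Second) // let the consumer drain the queue
}
```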
Question: How did things go during the New Year's Eve event itself? Were there any unexpected problems, and how did you handle them at the time?

Fang Yuan: During the New Year's Eve event things went basically as expected. Peak traffic on the core systems was only about 1/8 of what we had prepared for, so the pressure on them was not great; to exaggerate a bit, the core systems could have handled eight Luo Pangs giving New Year's Eve speeches without much strain. But some legacy systems were still under a lot of pressure. For example, one legacy system's database came under heavy pressure during the traffic peak; we noticed it through the monitoring system and quickly rate-limited that system's APIs through the gateway to make sure it would not go down.

Question: The team must have gained a lot from this New Year's Eve event. Can you share some of the experience from this response?
Fang Yuan: Personally, I think the following three points are the most important:
Full-link stress testing: When we started, the back-end services had no tracing system yet. Although we could guarantee the processing capacity of a single system, once multiple systems were combined the overall performance was hard to grasp. Alibaba PTS helped us find many problems on the call links; every round of stress testing uncovered new problems that were fixed soon afterwards, so the system's load capacity improved markedly with each round.
API gateway: As mentioned above, the API gateway is used to rate-limit APIs, ensuring that even if a problem occurs, the back-end systems are protected and at least some users can still be served normally.
Refactoring of core business links: Do not be lenient with old systems that have known pitfalls. Code refactoring improves the processing capacity of the system on one hand, and on the other ensures that subsequent feature development can proceed with less baggage.
Question: As chief architect, do you still write code every day? If so, what areas does your code mainly cover, and what contributions has it made to the team? In what ways do you think a chief architect's value to the team is mainly reflected?

Fang Yuan: I don't write code every day, but I still keep up regular code output. My code mainly goes into public libraries, frameworks and tools, because this code is critical to the development efficiency and code quality of the whole team.
For example, in a stress test before one launch we found that the QPS of the main interface on a single machine was only a little over 300. After locating the problem with a flame graph, we made a small change to a public library; in the stress test a day later, single-machine QPS reached about 12,000, and because the change was in the shared code base, other business modules using the library also saw several-fold performance improvements.
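For readers unfamiliar with how such a problem is located in Go, here is a minimal sketch of exposing the standard pprof endpoints from which CPU profiles and flame graphs are typically generated during a stress test; the port and handler are illustrative, and this is ordinary net/http/pprof usage rather than the team's specific setup.

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Business handlers would be registered here as usual.
	http.HandleFunc("/api/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello"))
	})

	// With the pprof import above, a 30-second CPU profile can be captured
	// while the stress test runs and rendered as a flame graph, e.g.:
	//   go tool pprof -http=:8081 http://localhost:6060/debug/pprof/profile?seconds=30
	http.ListenAndServe(":6060", nil)
}
```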
I personally think the chief architect's work includes the following:
managing the company's architecture team;
setting the technical direction, and doing architecture analysis, design and part of the implementation;
driving the company's technology platform and business lines to reinforce each other;
controlling the business architecture and its implementation.
Question: Many architects do not lead a team and therefore cannot directly ask engineers to execute according to their ideas. In that case, how can an architect better realize their value, and avoid being merely an executor under the CTO or technical director?

Fang Yuan: Personally, I think that if an architect does not lead people, it is hard to influence an engineer's execution plan. When an engineer executes, many factors may affect the choice of plan, and the architect is only one of them. Besides, people in different positions, at different times and with different backgrounds may view the same technical issue and the choice of solution differently, so it is hard to have a fixed pattern for getting a plan adopted.
For example, when I first joined the company I kept emphasizing the read/write amplification problem, but nobody took it seriously, for all kinds of reasons; most people felt feature development was too tight to spare the effort. It was not until I took charge of the back-end's concrete work that everyone was pushed to actually solve it. In the end everyone found that the system's processing capacity could be improved by an order of magnitude without much extra work, and in subsequent work there was no need to emphasize it again; everyone simply kept doing it that way.
Personally, I think the roles of the architect and of the CTO or technical director are different: the CTO and technical director are more responsible for management, while the architect is more of a technical role. The CTO and technical director need to let the architect do the design and implementation; the architect, besides doing architecture work, also needs to assist the CTO and technical director with some management work.
Editor: Thank you, Fang Yuan, for accepting the interview. If you have questions about architecture or middleware technology, you are welcome to leave a message and discuss them further with Fang Yuan.
