Volcano Engine: the confidence to take an unusual path

Reprint: https://baijiahao.baidu.com/s?id=1764113509942259731&wfr=spider&for=pc

Introduction:

In the current large-model boom, Volcano Engine has chosen a distinctive angle: providing the best machine learning platform for companies that need to train large models.

This includes support for large-model training at the scale of 10,000 cards and for networks with microsecond-level latency, making large-model training more stable and faster.

Asked whether Volcano Engine itself will build large models, Tan Dai, the president of Volcano Engine, answered very cleverly: "There are dozens of companies in China that make large models, and most of them are already on Volcano Engine's cloud. We will connect to multiple large models and provide enterprises and consumers with richer AI applications."

Everything shows that Volcano Engine is an unusual cloud computing company.

1. Super cloud generated by super business

On April 18, Volcano Engine unveiled an attractive lineup of new products and services at its "Motive Power Conference".

Among other things, Volcano Engine showcased its latest explorations, applications and practices in cloud technology, cloud services and cloud scenarios, and released new products in three areas: agile iteration, data-driven operation and experience innovation.

Volcano Engine was officially launched on June 22, 2020. What the author has never figured out is why it was not simply called "Byte Cloud", since it constantly has to be explained to people who are unfamiliar with it: Volcano Engine is Byte Cloud.

Still, isn't it remarkable to have built such a rich product line in less than three years?

At least, Tan Dai saw hope when he joined Volcano Engine in 2020.

Before joining, he received satisfactory answers from Yang Zhenyuan, Vice President of ByteDance:

First: focus on the to-B (enterprise) business.

Second: Byte's accumulated capabilities in cloud computing can be opened to the outside world in a timely manner.

"If you look purely at Byte Cloud's current market share, it is indeed outside the top five. But I look at this question with a different calculation," Tan Dai told the author. "If you take into account the cloud computing infrastructure, cloud-native capabilities and understanding of the cloud that sit behind Byte's own business, we may already be an invisible top-three player in the domestic cloud market. And the iron law of this industry is that super scale produces super clouds."

"Cloud is extremely dependent on scale. That is why not every giant has a cloud business; in essence, their scale is not big enough," Tan Dai said. "Scale trains your scheduling capabilities and dilutes average costs and, more importantly, hones the boundaries of the cloud under extreme conditions. So the domestic cloud companies that can really grow big can be counted on one hand, and I think Byte has unique competitiveness and opportunities."

Starting a new project at Byte is equivalent to starting a business internally. During the most severe months of the epidemic in 2020, Tan Dai and his colleagues wrote the first version of Volcano Engine's business plan (BP) in Feishu Docs and began the first round of interviews at a time when meeting in person was very difficult. Volcano Engine was born, and it moved step by step toward opening up.

In fact, before the establishment of Volcano Engine, a well-known mobile phone manufacturer approached ByteDance and hoped to make some optimizations to the browser and app store algorithms and use ByteDance's personalized recommendation algorithm. ByteDance took over this "extraordinary task" with some hesitation, but unexpectedly achieved good results.

This cooperation deeply inspired ByteDance. Over the years, ByteDance has achieved healthy and sustained growth by relying on its user-oriented philosophy and data-driven way of working. If the technical capabilities accumulated behind this were exported to the outside world, they could produce huge value for the industry.

The formal establishment of the Volcano Engine project marked its move onto the fast track to commercialization.

2. It’s not as simple as “the same model as Byte”

In the official introduction, "Volcano Engine" is described this way: it is a cloud computing service platform owned by ByteDance that, relying on ByteDance's technical capabilities, growth philosophy and operating methodology, provides technical services to corporate customers.

Some people believe that the biggest attraction lies in the phrase "relying on ByteDance's technical capabilities, growth philosophy and operating methodology." Admirers simply call Volcano Engine's products "the same model as ByteDance," while detractors call them a "rich second generation" that is "trading on its father's name."

Indeed, ByteDance, which relies on data-driven development, has created a world-class growth miracle, so much so that the outside world wants to know the secret of its success.

Byte also has a number of fans in the industry. For example, after Li Xiang, the founder of Li Auto, introduced Byte's collaboration tool Feishu, he praised it on various occasions, saying for instance: "We studied various advanced organizations and finally found one very close to us: ByteDance." He also pointed out that Feishu, as a collaboration tool that embodies advanced management concepts, has brought many changes to Li Auto in terms of the efficiency of information flow and the building of organizational culture.

So the author put two more soul-searching questions to Tan Dai.

The first question: won't this give away Byte's core secrets?

Tan Dai's view is that, basically, it will not.

"Byte's development has many factors, including leadership, content, operations, technology, and even timing and luck. In my opinion, this kind of success cannot be reproduced through the output of a few technologies," Tan Dai said.

In Tan Dai's view, good technology flows naturally. For example, Baidu does search best, but many companies can now apply search technology; Byte does recommendation well, but most Internet companies now use recommendation technology to some degree. Moreover, technology flows naturally as technical people move on and spread out, and it also flows through open source culture. In terms of technology alone, all good companies end up in a situation where "there is some of you in me and some of me in you."

Therefore, we can now see some companies that have a pan-competitive relationship with Byte Cloud in Byte Cloud's customer list.

This is similar to BYD: it is currently the best-selling carmaker in China, but that does not prevent it from also selling its own batteries. Blade batteries are indeed one of BYD's core technologies, but they are not the only factor in BYD's success, so BYD can sell them with confidence. Of course, the author believes that daring to export technology still requires great confidence on a company's part.

The second question is, what standards will Byte use to release technology?

Tan Dai's answer: good technology, excellent technology, and especially technology that has won Byte's internal competition (PK).

In fact, in toB companies, there have always been two routes.

The first route is to use one technology stack internally and sell a different technology stack externally.

Enterprises that take the first route generally have a long history, such as BAT. They came from the pre-cloud era and then moved to the cloud bit by bit, passing through a long historical stage of "walking on two legs". As a result, two separate technology, R&D and management paths naturally formed, one for internal use and one for external sales.

The advantage is that it is easier to manage and the line between inside and outside is clear. The disadvantage is that, later on, many technical routes tend to diverge and need to be forcibly integrated when necessary. Sometimes external customers are interested in a certain internal capability, and after discussion it is agreed that the capability can be released externally, but it then requires testing and productization, which often means the opportunity is lost.

**The other route is to have only one technology stack from the beginning, so that the internal and external technologies are not only the same but also come from the same origin.** The cost of this is relatively high management overhead at the outset, because the entire architecture is naturally decentralized and distributed rather than bureaucratic. This requires a complete set of coordination mechanisms, and ByteDance is a model of this approach. (One more thing worth noting is that Feishu was actually developed to meet the needs of this kind of management.)

That's why there's a saying: no executive knows where all their employees sit.

But this is not the end. Tan Dai pointed out that a product that can truly be commercialized "first wins Byte's internal competition (PK). In terms of technology and R&D management, Byte can achieve a 'small front end, large middle platform' on the premise that the middle platform's capabilities are strong enough. This comes from how the mechanism is set up: Byte implemented an internal settlement mechanism relatively early."

"All our technologies and products can actually be 'purchased' on the internal platform, and they are really settled internally. This is fairer and harsher than any assessment. Only technology that is truly widely accepted internally and is excellent can be sold, so before it appears on the market it has already been tested by our internal market," Tan Dai said.

3. Easter eggs

In fact, when a customer buys a certain technology or function from Volcano Engine, they often get some bonus "Easter eggs", making the transaction extremely cost-effective.

For example, event tracking (literally "burying points") is important know-how for making good use of recommendation technology.

So-called event-tracking analysis captures, processes and analyzes specific events at the "operating nodes" where data needs to be collected, and then analyzes the full set of user behavior. This meets an enterprise's need to separate the signal from the noise in massive data and to iterate quickly on products and services.

For example, when a video is played, which is more valuable: the user clicking "Collect", or the length of time the user stays? It's hard to say, so in a sense event tracking is an important technique, and one that is difficult to standardize.

So if Byte engineers are willing to offer some tips at this point, small and medium-sized enterprises may gain specific and important knowledge that they have long been unable to acquire.
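To make the idea of event tracking concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Event` and `Tracker` classes, the event names `video_play` and `collect`, and the `watch_seconds` property are all hypothetical and are not taken from any Volcano Engine or ByteDance product. It shows how the two signals discussed above (a "Collect" click and time spent watching) could both be recorded at the same touch point so that they can be compared later.

```python
from collections import Counter
from dataclasses import dataclass, field
import time


@dataclass
class Event:
    """One tracked user action (hypothetical schema for illustration)."""
    user_id: str
    name: str                      # e.g. "video_play" or "collect"
    properties: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


class Tracker:
    """Keeps events in memory; a real system would ship them to a data pipeline."""

    def __init__(self):
        self.events = []

    def track(self, user_id, name, **properties):
        # Record one event with arbitrary key/value properties.
        self.events.append(Event(user_id, name, properties))

    def count_by_name(self):
        # How many times each kind of event fired.
        return Counter(e.name for e in self.events)

    def total_watch_seconds(self, user_id):
        # Sum of watch time across this user's "video_play" events.
        return sum(
            e.properties.get("watch_seconds", 0)
            for e in self.events
            if e.user_id == user_id and e.name == "video_play"
        )


tracker = Tracker()
tracker.track("u1", "video_play", video_id="v9", watch_seconds=42)
tracker.track("u1", "collect", video_id="v9")
tracker.track("u2", "video_play", video_id="v9", watch_seconds=5)

print(tracker.count_by_name())
print(tracker.total_watch_seconds("u1"))
```

Which of the two signals deserves more weight is exactly the judgment call the article describes; the sketch only shows that both have to be captured before the question can be answered with data.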

To give another example, BI (Business Intelligence) is a complete set of solutions that integrates an enterprise's existing data, provides reports quickly and accurately, and gives enterprises a basis for making informed business decisions.

As a data-driven company, ByteDance has a large number of useful data tools. More importantly, it has formed a culture that relies on data to advance the business: 60,000 employees use BI tools to drive their work every day.

"Many users are eager for such a capability, but it cannot be achieved just by buying some tools. For example, is your organization strong enough to get one-third of your employees to develop the habit and culture of using BI? Or does your data foundation allow you to extract data in real time and analyze it effectively?" Tan Dai said. "So what we sell to customers is not just tools. Tools are practical capabilities in solidified form, but driving that capability also requires a matching way of thinking, so customers also draw inspiration from our ideas."

Therefore, the reason Byte can keep attracting new users is not only that the tools are good enough, but also that the concepts needed to use them well are passed on along with them, quietly giving users new inspiration. This is a priceless and invisible "Easter egg".

Of course, this intangible wealth can only be passed on through specific products, which is why Volcano Engine's product list includes the "Data Flywheel".

In short, this is what the outside world wants most: a workbench for practicing data-driven operation, built on ByteDance's 10 years of experience. The core proposition it addresses is: "Use data consumption to drive data production, and use data consumption to support business development."

However, there are two key elements in building a data flywheel. One is the "data-driven" concept of the data flywheel itself, and the other is the appropriate products and services to put the model into practice. Therefore, Volcano Engine also focuses on data products and consulting services to help enterprises implement the data flywheel and actually get it turning.

Since releasing the digital intelligence platform VeDI last year, Volcano Engine has continued to open up ByteDance's internal data technologies and tools and has kept launching products. For example, it helps enterprises build serverless intelligent lakehouses whose high performance and fully managed serverless capabilities reduce the cost and improve the efficiency of building data infrastructure, and it has launched Management Cockpit Plus to respond quickly to managers' needs to see real-time data and make fact-based decisions.

There were many products like this at the April 18 launch event.

4. From the same source, the same model to the same pool

Another thing that shocked the author was that on April 18, Volcano Engine announced that it is pooling resources with Byte's domestic business on a large scale, so that internal and external resources can be reused at scale in real time.

This means that Volcano Engine can deliver large amounts of resources to enterprise customers within a specific period of time and can schedule 100,000 CPU cores within minutes, ensuring agile elasticity and deeper cost reduction and efficiency gains.

Tan Dai said that Byte currently has CPU clusters with hundreds of millions of cores, dozens of exabytes of enterprise storage, and huge cloud computing capabilities. Pooling these with Volcano Engine will enhance Volcano Engine's cloud computing scale and capabilities and bring better cloud services to partners.

It should be said that this shocked the author more than the other focus of the event: providing a computing platform for large models.

The key lies in ByteDance's conceptual breakthrough, as well as its social significance and what it signals for the future.

Many years ago, when the author visited Didi Chuxing, the company faced a problem: every morning and evening, countless people complained that they could not get a ride and that Didi was raising fares, yet at other times in the morning and afternoon a large number of vehicles were driving around empty.

This seemingly unbreakable loop led Didi's executives to seek out the economist Zhou Qiren.

Zhou Qiren got to the heart of the matter. This is a classic problem in the transportation industry and in the service industry in general: if a company purchases fixed capacity to match demand peaks, it will inevitably lose money; if it purchases capacity to match troughs, it will inevitably be unable to meet demand.

What Zhou Qiren offered was an answer already known to economists: flexible capacity. This answer became the basis for a series of later businesses such as carpooling and ride-hailing.

The same problem exists in the digital world we live in. The computing power purchased by enterprises at high prices is not enough during the peak period and is wasted during the trough period. Therefore, to a certain extent, the agile and elastic capabilities provided by cloud computing are a cost-effective solution for users from a micro perspective, and a good reuse of increasingly scarce resources from a macro perspective.
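The peak-versus-trough trade-off can be made concrete with a little arithmetic. The following Python sketch uses an entirely hypothetical 24-hour demand curve (the numbers are illustrative assumptions, not figures from Volcano Engine or any real workload) and compares three choices: provisioning fixed capacity for the peak, provisioning fixed capacity for the trough, and paying only for what is used.

```python
# Hypothetical hourly demand in CPU cores over one day (illustrative only).
hourly_demand = [200] * 8 + [1000] * 4 + [400] * 6 + [1000] * 4 + [200] * 2


def fixed_capacity_core_hours(demand, capacity):
    """Core-hours paid for when a fixed capacity runs for every hour."""
    return capacity * len(demand)


def unmet_core_hours(demand, capacity):
    """Core-hours of demand that a fixed capacity cannot serve."""
    return sum(max(d - capacity, 0) for d in demand)


def elastic_core_hours(demand):
    """Core-hours paid for when capacity follows demand exactly."""
    return sum(demand)


peak = max(hourly_demand)      # 1000 cores
trough = min(hourly_demand)    # 200 cores

paid_at_peak = fixed_capacity_core_hours(hourly_demand, peak)
used = elastic_core_hours(hourly_demand)

print("provision for peak:  ", paid_at_peak, "core-hours paid")
print("elastic:             ", used, "core-hours paid")
print("utilization at peak: ", round(used / paid_at_peak, 3))
print("provision for trough:", unmet_core_hours(hourly_demand, trough),
      "core-hours of demand unmet")
```

Under these assumed numbers, provisioning for the peak pays for 24,000 core-hours while only 12,400 are used, and provisioning for the trough leaves 7,600 core-hours of demand unserved. Real cloud pricing is more complicated (elastic capacity usually costs more per unit than reserved capacity), so this is a sketch of the reasoning rather than a cost model.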

But in the actual environment, especially in China, few companies dare to actually put all their data on the public cloud, even though the deployment cost of hybrid cloud is higher and even goes against the original intention of cloud computing.

From a technical perspective, Byte's approach proves its cloud computing capabilities, including self-developed servers, self-developed OS, etc. Its self-developed virtual network can reduce transmission delays by 50%; its self-developed mGPU increases deployment density by more than 500%, bringing higher resource utilization to upper-layer applications.

It also shows that, as a company born in the cloud-native era, Byte built its infrastructure on cloud-native principles, and more than 95% of its internal computing is containerized. Only in this way can it pool, share, circulate and schedule internal and external resources at scale.

But these details matter less than the social significance.

In his novel "Inferno", Dan Brown created a fanatical biologist who believed that the earth would quickly be destroyed once its population exceeded 8 billion, and who did not hesitate to develop a virus to sterilize humans...

A novel is just a novel, but the problem it points to cannot be ignored: as the population continues to grow, we will increasingly face resource scarcity. Brain-computer interfaces, the metaverse and biological editing technology are all, to a greater or lesser extent, attempts to solve this problem.

According to the International Data Corporation (IDC), the global data volume will reach 175ZB by 2025, and nearly 90% of the data is unstructured. This data requires a lot of computing power to be analyzed and processed, and therefore consumes a lot of energy. At the same time, as AI algorithms continue to be upgraded and developed, their complexity and computational load are also increasing.

It is estimated that the current energy consumption of AI accounts for about 3% of global energy consumption. According to a report, AI will consume 15% of the global electricity supply by 2025. This means that the rapid development of AI will have a huge impact on energy consumption and the environment.

AI is only part of the huge world of data. If we do not adopt more radical mechanisms to save resources, the social benefits we create through digitalization and intelligence may, to some extent, be offset by environmental pressure, or worse.

The social significance of Volcano Engine's "pooling" approach is that, through the relentless pursuit of technology, resources are reused to the greatest possible extent, thereby exploring a future with more of a sharing spirit and more intensive benefits, and setting a symbolic milestone. Its motivation may be commercial, but its benefits are social.


Byte-Douyin-Volcano, Technology Chapter

  • ByteDance has long practiced a technical culture built around the technology middle platform, letting the middle platform commercialize its own products directly. For example, for recommendation, we use the same recommendation platform, tools and methodology as Toutiao and Douyin. In this way, we can use our best internal capabilities to serve the outside world.
  • The overall product technology system of Volcano Engine is divided into four layers: unified basic services, technology middle platform, intelligent applications and industry solutions. From bottom to top, these four layers meet the needs of enterprises in different industries and business scenarios, from operation and maintenance, research and development, and products to operations and marketing.
  • **Build a data-driven flywheel.** Being data-driven will become a habit of daily internal collaboration and eventually the source of business growth. Building a flywheel is divided into four key steps: business process digitization, digital collaboration, data-driven business optimization, and objective analysis and evaluation.
  • For a summary of technical practices in data-driven and agile development, see the chart below for details.

If recommendation algorithms and big data technology are the technical capabilities that support ByteDance’s business development, then what is the core technical concept of its iterative innovation?

On October 27, at the "Rare Earth Developer Conference", Tan Dai, general manager of Volcano Engine, gave a talk titled "Data-driven x agile development, dual engines for rapid business growth". He unpacked in depth the two major technical concepts behind the rapid development of ByteDance's business, data-driven operation and agile development, and shared how to build a data-driven flywheel and how a full-stack cloud-native architecture supports agile development for large-scale applications.

The following is the transcript of Tan Dai’s speech:

Hello everyone, I am Tan Dai, the person in charge of ByteDance’s Volcano Engine business. I am very happy to receive the invitation to the Rare Earth Developer Conference and be able to share and discuss ByteDance’s technical concepts and practices with you today.

Volcano Engine is the digital growth engine for enterprises

Before I start sharing, let me first introduce you to the Volcano Engine.

Volcano Engine is an enterprise-level technical service platform owned by ByteDance. It is the unified window through which ByteDance's technical teams provide technical services to the outside world. Through Volcano Engine, we hope to open up ByteDance's technology, products and services, including cloud, AI, big data and recommendation, to help companies in different industries achieve their own growth and digital transformation.

As we all know, ByteDance has long practiced a technical culture built around the technology middle platform. So in doing technology for business customers (ToB), we adopted the same mechanism and let the technology middle platform commercialize its own products directly. As a result, the technologies and tools that Volcano Engine opens to the outside world share exactly the same origin as ByteDance's own technology platform. For example, for recommendation, we use the same recommendation platform, tools and methodology as Toutiao and Douyin. In this way, we can use our best internal capabilities to serve the outside world.

This is the overall product technology system of Volcano Engine, which is divided into four layers, namely: unified basic services, technology middle platform, intelligent applications and industry solutions. From bottom to top, these four layers meet the needs of enterprises in different industries and business scenarios from operation and maintenance, research and development, products, operations to marketing.

This is the result of our continuous commercialization of ByteDance's internal technology over the past year. During this process, we have kept asking ourselves: as ByteDance developed step by step, what technical concepts supported the rapid development of its business? Today I want to share my understanding with you. I think two concepts are very important here: data-driven and agile development.

Data-driven: Building a data-driven flywheel

First, let's talk about being data-driven. Amazon has a famous flywheel theory: a company's business modules should be organically combined and reinforce each other, like meshing gears. It takes effort to get each wheel moving from rest, but because they are linked together, no turn is wasted. Once one gear turns, the entire system turns, faster and faster.

Building a data-driven flywheel

Returning to the topic of being data-driven, we believe the same is true. Becoming data-driven does not happen overnight. It is not done by adopting a tool or building a few reports. Instead, you keep solving problems one by one along the way and eventually form multiple systems that keep themselves turning, producing a data flywheel effect. Once the flywheel effect is formed, it spins faster and faster. Being data-driven becomes a habit of daily internal collaboration and eventually the source of business growth.

Focusing on this goal, we can divide the construction of the flywheel into four key steps: business process digitization, digital collaboration, data-driven business optimization, and objective analysis and evaluation.

These steps form an organic process:

The digitization of business processes is the first and very critical step. The more fully the business process is digitized, the more accurate the description of the business will be, which will facilitate the development of subsequent steps. Therefore, we need to continuously bring offline activities online, refine online activities, and express them all through digital means.

After business processes have been digitized, the second step is digital collaboration. First, the underlying data must be expressed in a standardized and unified way through data governance and other means. Second, more people need to be involved, so tools such as data visualization are needed to let different roles (developers, operators, users, managers, etc.) join the digital collaboration process.

The most direct effect of digital collaboration is improved efficiency. The better the collaboration, the more timely and comprehensive the understanding of the business, and the more objectively data can support the optimization of the business built on top of it.

The effect of an optimization must not be judged by gut feeling or on a whim; it requires objective analysis and evaluation. On the one hand, we can use A/B testing and other methods to measure accurately, through data, the actual benefit an optimization brings to the business. On the other hand, we also need to go further and establish multi-dimensional attribution of causes.

Finally, after completing these four steps, we can accumulate more data during the business optimization and evaluation process, which forms a closed loop and realizes the rotation of the flywheel.
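As an illustration of the objective evaluation step, the sketch below runs a standard two-proportion z-test on the conversion counts of two variants. This is a common textbook technique and only an assumption about how such an evaluation could be done; the talk does not describe the statistics behind ByteDance's own A/B testing platform, and the function name and sample numbers here are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rate.

    conv_a / n_a: conversions and sample size of the control group (A).
    conv_b / n_b: conversions and sample size of the treatment group (B).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10.0% vs 15.0% conversion, 1,000 users per group.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the observed lift is unlikely to be noise, which is what replaces gut feeling in the evaluation step. Production A/B platforms typically add more on top, such as sample-size planning and corrections for repeated looks at the data.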

ByteDance's data-driven flywheel

What I just described is a bit abstract. Let’s take a look at the specific situation of ByteDance:

  • The digitization of business processes mainly means event tracking at different touch points, such as apps, mini-programs and operation pages;
  • Digital collaboration is the collaborative handling of data applications by multiple roles: for example, how R&D does data development and data governance well, and how operations uses data better and faster;
  • Data-driven business optimization mainly means optimizing products and algorithms based on data and the insights it generates, such as optimizing recommendation system strategies or tailoring operations to different user groups;
  • Objective analysis and evaluation means, on the one hand, objectively evaluating each new iteration through A/B testing and, on the other hand, gaining further data insights through ABI, which accumulates relevant insights and keeps the whole process turning.

This is how ByteDance builds its entire data-driven flywheel. In this process, we have distilled the three concepts of "business process digitization", "digital collaboration" and "objective analysis and evaluation" and solidified them into unified data middle platform capabilities that support data-driven optimization for different applications. At the same time, these middle platform capabilities can further optimize different dimensions of the business, including growth, experience and monetization.

Next, we will expand on the data middle platform and on application optimization.

Application-oriented data middle platform

I mentioned the data middle platform just now. One of its biggest functions is to help applications and businesses optimize in a data-driven way. So there is a very important principle in building a data middle platform: it must be built for applications, starting from data and using data to verify. When it comes to verification with data, the most important thing is A/B testing. We have stressed on various occasions how much importance Byte attaches to A/B testing; even the names Douyin and Toutiao were chosen through A/B testing.

For evaluation, testing is only the first step. We also need to further analyze the results, so we have built corresponding data operation platforms, intelligent data insights, customer data platforms and other tools to help products and operations analyze data in depth.

At the bottom level, for the large-scale, batch, and real-time data generated every day, we have also built a complete suite of data collection, research and development, and management to improve the efficiency of data development.

So it can be said that at the bottom level, we pay more attention to the efficiency and scale of data development, while at the upper level, we focus on the ease of use and interactivity of the entire product and operation in the data analysis process. To achieve a connection between ease of use, interactivity, and underlying scale and efficiency, we need a very powerful data analysis engine, which is our ByteHouse.

ByteHouse originated from the open source ClickHouse project, hence the "House" suffix. But it is actually a cloud-native, large-scale data analysis platform that has been reworked for ByteDance's large-scale data scenarios.

As mentioned just now, being data-driven is an important technical concept at ByteDance. Every day we have dozens of petabytes of new data, and tens of thousands of people analyze this data across various dimensions and levels of detail. There are many performance and real-time issues to solve, and ByteHouse is what sits behind them.

So far, ByteHouse serves almost all business lines within Byte, and is also the core engine of analysis systems such as ABI systems, UBA systems, profiling systems, and A/B testing. The overall scale has reached 30,000 servers, with tens of millions of queries per day.

Faced with the large-scale challenges just mentioned, we have mainly made five levels of in-depth transformation on ByteHouse:

The first is support for streaming data. For analysis, we have very high requirements for real-time performance, so we support the processing of real-time data through Kafka. In this way, ByteHouse can provide a unified analysis platform for real-time and offline data, supporting batch and stream integration.

The second is the separation of computing and storage. Because our scale is so large, how to support tens of thousands of people and perform tens of millions of real-time queries efficiently and quickly based on dozens of petabytes of new data is a big challenge. By separating computing and storage, we can better solve performance problems. After separation, the computing layer can be flexibly expanded and reduced independently. In terms of storage, it can be connected to distributed storage systems, including HDFS, S3, etc. This can solve the storage stability problem on the one hand, and the capacity expansion problem on the other hand.

In addition to the separation of computing and storage, we have done a lot of work in terms of operation and maintenance and security to further make up for the lack of functionality in the community version.

The last and most important thing is that we have implemented multi-level resource isolation. Different departments and roles run all kinds of analyses every day, and their requirements for permissions and timeliness differ. Through tenant isolation, read/write separation, and heterogeneous computing resources, we can allocate resources properly when many departments and roles share one large centralized deployment.
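
As a rough illustration of tenant isolation combined with read/write separation, the sketch below routes each query to a dedicated compute pool. The tenant names, pool names, and the `route` function are hypothetical and are not ByteHouse's actual mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    tenant: str       # department or team issuing the query (hypothetical)
    is_write: bool    # ingestion/insert vs. analytical read


# Hypothetical pool layout: each tenant gets separate read and write pools.
POOLS = {
    ("growth", False): "growth-read-pool",
    ("growth", True): "growth-write-pool",
    ("ads", False): "ads-read-pool",
    ("ads", True): "ads-write-pool",
}


def route(query: Query) -> str:
    """Pick the compute pool for a query. Unknown tenants are refused
    rather than being allowed to spill into another tenant's resources."""
    try:
        return POOLS[(query.tenant, query.is_write)]
    except KeyError:
        raise PermissionError(f"tenant {query.tenant!r} has no pool") from None


read_pool = route(Query("growth", is_write=False))
write_pool = route(Query("growth", is_write=True))
```

Because reads and writes land on different pools, a heavy ingestion job from one department cannot slow down another department's interactive queries.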

Through the above five levels of optimization, ByteHouse can support the core steps of data-driven work across ByteDance.

Application optimization

Insert image description here

I just talked about some practices of the data middle platform; next I will talk about how to optimize applications and business in a data-driven way. Here is an example from customer growth.

Of course, whether it is a growth scenario or any other scenario, if you want to do data-driven optimization well, the first and most critical thing is to design a good indicator system. If the indicators are wrong, then no matter how much you do, the result will be wrong.

So for growth, we believe that there are two most important indicators - "positive input-output" and "healthy user scale".

Positive input-output simply means ROI > 1. It looks simple, but calculating ROI correctly and accurately, and tracking long-term ROI at the granularity of each user, is the real difficulty and the key.
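
A minimal sketch of tracking ROI at user granularity and rolling it up by channel follows. The record layout (`UserLedger`), the channel names, and the figures are assumptions made for illustration; the talk does not describe the actual data model.

```python
from dataclasses import dataclass


@dataclass
class UserLedger:
    """Hypothetical per-user record: what it cost to acquire the user
    and the revenue attributed to that user so far."""
    user_id: str
    channel: str
    acquisition_cost: float
    revenue_to_date: float


def roi(cost: float, revenue: float) -> float:
    """ROI as revenue divided by cost; a value above 1 is positive input-output."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return revenue / cost


def roi_by_channel(ledgers):
    """Aggregate per-user ledgers into one ROI figure per acquisition channel."""
    totals = {}
    for u in ledgers:
        cost, rev = totals.get(u.channel, (0.0, 0.0))
        totals[u.channel] = (cost + u.acquisition_cost, rev + u.revenue_to_date)
    return {ch: roi(cost, rev) for ch, (cost, rev) in totals.items()}


ledgers = [
    UserLedger("u1", "feed_ads", 2.0, 3.0),
    UserLedger("u2", "feed_ads", 2.0, 1.5),
    UserLedger("u3", "search", 4.0, 3.0),
]
channel_roi = roi_by_channel(ledgers)
```

In this made-up data the `feed_ads` channel ends up above 1 and `search` below 1, even though one of the `feed_ads` users is individually unprofitable, which is why the per-user ledger is kept rather than only the channel total.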

Of course, we cannot look only at short-term ROI; we also need to look at long-term user health, including retention, LT (lifetime), and so on.
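
To make retention and LT concrete, here is a small sketch. It assumes one common approximation, namely that LT is the sum of daily retention rates over the observation window; the talk does not say how LT is computed internally, and the cohort numbers are invented.

```python
def retention_curve(cohort_size, active_by_day):
    """Fraction of the cohort still active on each day after install."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return [active / cohort_size for active in active_by_day]


def lifetime_days(retention):
    """Approximate LT as the sum of daily retention rates over the window,
    i.e. the expected number of active days per acquired user."""
    return sum(retention)


# Hypothetical cohort of 1,000 users observed for 5 days (day 0 = install day).
curve = retention_curve(1000, [1000, 500, 400, 300, 250])
lt = lifetime_days(curve)
```

Multiplying LT by an average revenue per active day gives a long-term value that can be compared against acquisition cost, which ties this metric back to ROI.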

After setting these key indicators, you can actually use the indicators to find the corresponding optimized growth strategy. This growth strategy must not only meet the positive indicators, but also have a sustainable, scalable, and replicable model. This transforms the business growth model into a measurable and trackable data-driven model.

Finally, one picture sums up how data-driven thinking, the middle platform, and application optimization come together to build an overall flywheel.

  • First, target users based on data: define the goals and find the people who matter most to the product;
  • Once they are found, produce the corresponding creatives and content, and let the highest-quality, most attractive content reach customers through different channels, producing conversions and generating new data. The whole process is recorded digitally, which enables accurate attribution and detailed tracking of results;
  • Many creative ideas come up during optimization. We iterate quickly through A/B testing to see which idea works better. The evaluation produces more data, which feeds back into the overall strategy and ultimately forms a data-driven growth flywheel.
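
The A/B evaluation step in the flywheel above can be sketched with a standard two-proportion z-test. This is a textbook method chosen for illustration, not necessarily what ByteDance's experiment platform uses, and the traffic numbers are made up.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for the difference in conversion rate B - A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution, via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: creative A vs. creative B, 10,000 users each.
z, p = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1100, n_b=10000)
```

A small p-value suggests that the lift of B over A is unlikely to be noise, so B can be rolled out and the next idea tested.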

In such a process, the speed of experimentation is critical. If others can only run 10 experiments a day and you can run 100, the result is self-evident. From small creative experiments to large iterations of app features, speed plays a very important role. This echoes the second concept I want to talk about, agile development.

Agile development: full-stack cloud-native architecture supports large-scale applications

When it comes to agile development, we can see a variety of solutions at different levels on the market, such as low-code, aPaaS, etc. However, the main thing I want to talk to you about today is cloud native, because whether it is a SaaS layer or a PaaS layer solution, it is inseparable from the support of a complete set of cloud native architecture at the bottom layer.

ByteDance full-stack cloud-native architecture

Insert image description here

Here is a brief review of the history of cloud infrastructure technology; I believe many of you are familiar with this trajectory. 2013 was an important turning point. After 2013, with the rise and spread of technologies such as Docker and K8s, the cloud shifted from being infrastructure-centered to application-centered, and from resource services to platform services. ByteDance happened to be founded in 2012, so we were lucky to carry no historical baggage and could embrace the latest cloud-native technology directly.

Insert image description here

Let me share a set of numbers (statistics as of February 2021). In ByteDance's internal business, the number of server nodes is nearly one million; more than 80,000 microservices are online at the same time, growing by 2,000 per month; there are more than 7.5 million containers; and the daily increment exceeds 60 PB.

From these figures, you can also see that we are facing a very large-scale challenge in terms of service volume that is still growing rapidly. So from an infrastructure perspective, we believe there are three issues that need to be considered:

The first is how to support massive numbers of services. With microservices, the object of governance changes from a single application to a much larger number of microservices, which makes global governance harder. This includes building a global configuration center, a more flexible global network, runtime selection, and complete security mechanisms, as well as connecting them end to end with the entire DevOps process.

The second is how to keep the infrastructure stable under large-scale scheduling and operations. At present, the average internal cluster has more than 5,000 nodes, and large clusters have tens of thousands. At this scale many issues need to be considered: image preheating and multi-cluster federation management in large-scale image distribution scenarios; cloud-edge collaboration in weak network environments; and GPU scheduling for machine learning scenarios in heterogeneous environments.

The third is hybrid online/offline deployment. At such a scale the cost is naturally high, so we must improve utilization, and mixing online and offline workloads is a very important means of doing so. ByteDance's business in particular has obvious peaks and troughs: Douyin peaks at night, for example, and QPS at other times is much lower. We therefore designed an online/offline mixed deployment mechanism, which reduces cost on the one hand and copes better with business growth under extreme circumstances on the other.
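
The idea behind mixed deployment can be sketched as lending idle capacity to offline jobs while holding back some headroom for online spikes. The function names, the 20% headroom, and the load profile below are assumptions for illustration and do not describe ByteDance's scheduler.

```python
def offline_quota(capacity_cores, online_usage_by_hour, headroom=0.2):
    """Cores that can be lent to offline jobs in each hour.

    A fraction `headroom` of capacity is always held back so that a
    sudden online spike can be absorbed before offline work is evicted.
    """
    reserve = capacity_cores * headroom
    return [max(0.0, capacity_cores - reserve - used)
            for used in online_usage_by_hour]


def utilization(capacity_cores, online, offline):
    """Average utilization when offline jobs use their full quota."""
    used = sum(o + f for o, f in zip(online, offline))
    return used / (capacity_cores * len(online))


# Hypothetical 4-hour profile for a 1,000-core cluster: quiet daytime, evening peak.
online = [200, 300, 700, 900]
quota = offline_quota(1000, online)
before = utilization(1000, online, [0] * len(online))
after = utilization(1000, online, quota)
```

In this toy profile, offline work fills the daytime trough and is squeezed out entirely at the evening peak, which raises average utilization without touching online capacity when it is needed most.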

At the same time, at the bottom level, we have also built an overall container + multi-cloud solution.

In terms of multi-cloud, not only our computing but also our stateful storage can span multiple clouds, so we can respond flexibly to emergencies and traffic surges, such as the red envelope grab during the Spring Festival Gala at the beginning of the year and the 818 Trendy Shopping Festival.

Insert image description here

This diagram further explains the full-stack cloud-native architecture from the perspective of the architecture system.

First, at the lowest level, there is a complete set of cloud native infrastructure. Providing a new generation of high-performance computing storage and network solutions through a unified bottom layer is actually the cornerstone of ensuring business stability and agility.

On top of the cloud-native infrastructure is the service platform layer, which abstracts common platform and service capabilities needed in business development. This includes a high-performance microservice framework, service mesh-based microservice governance, and serverless and edge computing platform capabilities. The service platform lets developers write business logic with more agility and focus, and worry less about resources, platforms, inter-service communication, and governance.

Above the platform layer is the construction of the entire R&D system. At this level, we hope that through various tools, process mechanisms and organizations, we can help ByteDance flexibly support the rapid development and development management of all business lines.

On both sides of the middle three-tier facilities are important cloud-native security systems and SRE service support systems.

The first is the cloud-native security system. Compared with a traditional security system, it needs to be extended in two directions. One is the shift left: it focuses not only on runtime security but is also integrated with the DevOps process to cover the security of the application's entire life cycle. The other is the shift down: it focuses not only on the security of the container but also on the security of the host.

The second is the SRE system, which supports the stability of the entire business during its rapid development.

Due to limited time, I picked two interesting topics to share further. One is microservices and the other is mobile development. On the one hand, they are relatively representative, and on the other hand, they cover most business research and development scenarios.

Server side - microservices, service governance and DevOps

Insert image description here

Let’s look at microservices first. We can use four points to describe the current situation of ByteDance microservices:

It's huge and growing rapidly. I just mentioned that ByteDance currently has 80,000 microservices, but in 2018 the total was only about 7,000 to 8,000, so the number has grown roughly tenfold in three years and is still growing. Along the way we naturally ran into many challenges.

More than 90% of online microservices run in containers. Business lines do not see raw resources, only PaaS and containers. This brings a lot of convenience and makes it easier to roll out the core features of new technologies, but it also brings many challenges, especially in scheduling complexity.

The technical system is mainly based on Golang. According to the latest survey statistics, Golang is the main language within ByteDance, and more than 55% of services use it. The second-ranked language is NodeJS, followed by other languages.

The comprehensive implementation and application of Service Mesh. ByteDance is one of the first companies in China to use Service Mesh on a large scale in the production process.

You can find that ByteDance is very fast in using microservices, and can even be said to be quite radical. The reason behind this is that at ByteDance, speed and efficiency are the top issues we want to solve in our research and development. New applications and new users are growing very fast every day, and R&D must solve the production capacity problem. This is also the reason why we radically adopt a microservice architecture. But on such a large scale, doing such fast iterations will naturally have a huge impact on stability and trust.

Insert image description here

In order to deal with these difficulties and contradictions, we made various optimizations when implementing the end-to-end microservice architecture:

The first is the language level. Golang is the main language used, so a lot of framework-level optimizations have been made at the Golang level, such as RPC framework and HTTP framework. We have given back these frameworks to the community through open source - in early September, ByteDance open sourced CloudWeGo to help more developers build cloud-native microservice architecture.

The second is the governance of massive services. We have built our own service mesh system based on the Service Mesh concept and solidified service governance capabilities into ByteDance's internal platform. On the one hand, this helps us with compatibility across many kinds of services; on the other hand, through the stable Golang frameworks and mesh-based governance, we have built global traffic management, unitization, and systematization as a whole.

Finally, the efficiency of R&D is improved through the implementation and practice of DevOps tools and methods, and the observability of operation and maintenance is further improved.

Insert image description here

Let’s expand one by one below.

The first is the Golang frameworks. One is Kitex, the RPC framework; the other is Hertz, an HTTP framework. These frameworks integrate our self-developed high-performance network library to solve performance and interaction problems at the network level. We also support multiple message protocols (Thrift/Protobuf) and multiple interaction modes (Ping-Pong/Oneway/Streaming), and provide a more flexible and customizable code generator.

Insert image description here

This is a performance comparison between Kitex and gRPC. We ran two groups, one based on the Thrift protocol and one based on Protobuf. Kitex performs better in both. In TP99 latency in particular, Kitex's advantage grows as the number of concurrent connections increases.
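
For readers unfamiliar with TP99: it is the latency below which 99% of requests complete. A minimal sketch using the nearest-rank definition follows; the sample latencies are invented and are not taken from the benchmark.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and two slow outliers.
latencies = [1.0] * 98 + [20.0, 50.0]
tp99 = percentile(latencies, 99)
mean = sum(latencies) / len(latencies)
```

The mean hides the slow tail while TP99 exposes it, which is why tail latency is the metric that matters as concurrency grows.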

Insert image description here

This is a comparison between Hertz and some frameworks in the industry, including average latency, QPS, and comparison results under different size packages. We now provide these two frameworks to the outside world through open source, so developers are welcome to download and use them, communicate with us, and provide opinions.

Insert image description here

Next, let us look at governance with the service mesh. As mentioned, because of our business types and volume, we face many challenges in practicing a microservice architecture, such as language fragmentation, service heterogeneity, protocol heterogeneity, as well as security, observability, and call tracing for troubleshooting. We therefore adopted a service mesh-based model for overall microservice governance.

The green box in the picture above is the control plane, and the dotted boxes are the data plane. The service mesh separates the control plane from the data plane, which avoids a single point of failure. For example, when data plane traffic is too heavy and performance problems occur, the routing policy of the control plane is not affected; conversely, when the control plane is overloaded with policy work, forwarding on the data plane is not affected.
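
The separation described above can be sketched as follows: the proxy forwards from a local snapshot of the routing policy, so a control-plane outage does not stop traffic. Both classes, the service name, and the address are hypothetical and only illustrate the principle.

```python
class ControlPlane:
    """Hypothetical control plane: owns routing policy and pushes
    snapshots to proxies; it never sits on the request path."""

    def __init__(self):
        self.routes = {}
        self.available = True

    def push_to(self, proxy):
        if not self.available:
            raise ConnectionError("control plane unreachable")
        proxy.apply_snapshot(dict(self.routes))


class DataPlaneProxy:
    """Hypothetical sidecar proxy: forwards using its local snapshot."""

    def __init__(self):
        self._routes = {}

    def apply_snapshot(self, routes):
        self._routes = routes

    def forward(self, service):
        # Purely local lookup: no call to the control plane per request.
        try:
            return self._routes[service]
        except KeyError:
            raise LookupError(f"no route for {service}") from None


cp, proxy = ControlPlane(), DataPlaneProxy()
cp.routes["user-service"] = "10.0.0.7:8080"
cp.push_to(proxy)

cp.available = False  # simulate a control-plane outage
still_routed = proxy.forward("user-service")
```

During the outage the proxy keeps serving the last policy it received; only policy updates are delayed until the control plane recovers.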

Each dotted box in the figure is a pod. Unlike traditional services, our service mesh uses a sidecar to manage traffic: circuit breaking, rate limiting, timeout and retry, degradation, and so on. These functions are pulled out of each service and placed in a separate proxy at the network layer, and governance between services is implemented through these proxies. The advantage is that each service can focus on its own business logic without worrying about global scheduling and communication, which makes development simpler and more efficient.
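
As an illustration of one of the governance functions a sidecar can take over, here is a minimal circuit breaker. It is a generic sketch of the pattern with assumed thresholds and an injectable clock, not the logic of ByteDance's mesh proxy.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then rejects
    calls until `reset_after` seconds pass, after which one trial call
    is allowed through (half-open)."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Half-open: fall through and let one trial call run.
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreaker(max_failures=2, reset_after=5.0, clock=lambda: now[0])


def flaky():
    raise TimeoutError("upstream timed out")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except TimeoutError:
        outcomes.append("failed")
    except RuntimeError:
        outcomes.append("rejected")

now[0] = 6.0  # after the reset window a trial call is let through
recovered = breaker.call(lambda: "ok")
```

Because this logic lives in the proxy, a service written in any language gets the same protection without linking a library or changing its code.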

Insert image description here

Of course, the non-intrusive mode of Service Mesh brings a lot of convenience, but it also brings many challenges. The biggest is the additional performance overhead, so we have done a lot of work on performance optimization of the service mesh. The optimization spans multiple levels:

At the network and kernel levels, we use shared memory or system calls to implement true zero copy.
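
The shared-memory idea can be illustrated with Python's standard `multiprocessing.shared_memory` module. This is only a sketch of the principle (two parties mapping the same bytes instead of sending them through a socket), run inside one process for simplicity; it is not how the mesh proxy is implemented, and the payload is made up.

```python
from multiprocessing import shared_memory

payload = b"rpc-request-bytes"

# Producer side (e.g. the service): write the message once into shared memory.
producer = shared_memory.SharedMemory(create=True, size=len(payload))
producer.buf[:len(payload)] = payload

# Consumer side (e.g. the proxy): attach to the same segment by name.
consumer = shared_memory.SharedMemory(name=producer.name)
view = consumer.buf[:len(payload)]   # a memoryview slice, not a copy
received = bytes(view)               # copied here only so it can be inspected

# Both handles map the same memory: a write via one is visible via the other.
producer.buf[0:3] = b"RPC"
shared = bytes(consumer.buf[:3])

# Release views before closing, then free the segment.
view.release()
consumer.close()
producer.close()
producer.unlink()
```

In a real deployment the two sides would be separate processes in the same pod, and only a small notification would cross the process boundary rather than the message body.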

We will also optimize the basic library and component architecture levels to remove some unnecessary interactions. Even in the compilation stage, we can obtain a performance improvement of about 2% through better fully static compilation without any code modification.

In the end, through this overall, multi-level combination of optimizations, we enjoy the convenience of the service mesh while still ensuring performance.

Mobile terminal - mobile APP development for ultimate experience

What I just talked about is the microservice framework and service governance. Next, let’s talk about mobile development.

ByteDance can be regarded as a mobile-native company: most of our business is delivered through apps. So far we have operated more than 100 apps, and the company has a mobile R&D team of thousands of people.

To support such a large R&D team and the corresponding businesses, we must build an industry-leading mobile application development platform and keep optimizing it through extensive practice and polishing in all kinds of extreme scenarios. We therefore established a company-level mobile R&D platform very early, codenamed MARS, through which we uniformly support the development of the various upper-level business applications. Apps that everyone uses today, such as Douyin and Toutiao, are developed and iterated on MARS.

From a hierarchical perspective, MARS as a whole can be divided into 5 sections:

  1. The first is project management. By abstracting the characteristics of ByteDance's internal R&D, we have built a unified project management platform to support day-to-day iteration management, and in particular to optimize special processes such as release.
  2. Secondly, efficiency in the application development step is critical, and we use low-code methods to improve it further. For example, designers get a way to generate code directly from designs. For operations and R&D staff, we take a what-you-see-is-what-you-get approach that lets business personnel build applications more easily through drag and drop.
  3. Then, for the traditional coding and development stage, we provide a complete end-to-end development platform for different targets such as apps, front ends, and mini programs.
  4. In addition, in terms of quality control, we also provide a one-stop full-link testing platform that simulates actual online scenarios based on a large number of real machines to detect potential anomalies to the maximum extent.
  5. Finally, there is the full-link monitoring platform, which can cover the complete application link monitoring of "terminal-network-backend application-basic environment" to help R&D personnel accurately locate and solve problems.

Through the above introduction to microservices and the mobile development platform MARS, I hope everyone now has a more concrete picture of agile development at ByteDance.

Back to today's topic: behind the development of ByteDance's technology, data-driven and agile development are two important concepts, but they are not separate; they work as one. To be data-driven, we need many experiments to find good solutions to roll out and weak points to improve, and agile development is what makes it possible to run a large number of experiments every day. In turn, being data-driven lets us find what is valuable and accumulate more data, which closes the loop for the rapid development of the whole business.

Insert image description here

Here is some data to share. Within ByteDance, we launch 1,500 new experiments every day, and the cumulative total exceeds 800,000. More than 10,000 experiments run at the same time, covering more than 500 internal business lines and a variety of scenarios, including personalization, push, site building, server-side, and advertising and marketing scenarios.

As for our underlying technology, platform technology, and business-layer technology, it is precisely because these two concepts are constantly accumulated and iterated that they ultimately promote the rapid development of the business.

In fact, the principle is very simple. As the saying goes, of all the martial arts in the world, only speed is unbeatable. But to do these things well, we need to keep accumulating tools, platforms, and methods, and turn these methods into daily habits, which eventually become the driving force behind the business.

The above is my summary and sharing of ByteDance’s technical practices in data-driven and agile development. Hope it inspires everyone. Many of the technologies mentioned in it have basically been commercialized on the Volcano Engine. I also look forward to everyone using these products, giving feedback, and creating greater value.

Origin blog.csdn.net/qq_43842093/article/details/135163201