The evolution of code management architecture at Baidu's scale of 10,000 collaborating developers

In Internet R&D, speed is the one advantage that cannot be beaten. To improve the company's overall R&D efficiency, Baidu has introduced the industry's best engineering practices and designed and developed a complete R&D tool chain. It consists mainly of a project management platform, a code development collaboration platform, and a continuous delivery platform, which provide tools, processes, and data support for the requirements, development, and delivery scenarios respectively, as shown in Figure 1.

 

 

Figure 1 Baidu R&D Tool Chain

 

The target scenario of code management is development, the core link of R&D activities, which connects upstream requirements with downstream delivery. Baidu builds its code management along three lines: cultural communication, engineering practice, and product construction, to continuously raise the company's level of code management. To this end, we introduced a five-layer pyramid model for code management construction, as shown in Figure 2. Each layer represents a different level of capability.

 

 

Figure 2 Baidu code management overview

 

The bottom layer is code hosting, which is the most basic capability of code management. 

The second layer is collaborative development, which supports fast and orderly collaborative development of various business lines under different R&D models. Baidu has many products and business lines. Different team sizes, different development languages, and different R&D models have put forward different requirements for development collaboration. 

The third layer is DevOps support, which links the tools end to end and automates them across the entire product life cycle. 

The fourth layer is R&D data measurement, which gives the company reference data on R&D and drives improvement of the R&D process by building a measurement system. 

The fifth layer is engineering culture, which puts practices such as code review, internal open source, and social programming into effect within the company.

 

Baidu code management challenges

 

Baidu has a development team of 10,000 people and nearly 100,000 projects. More than 200,000 problems are automatically detected in the code every week, and more than 10,000 reviews are initiated every day. To ensure code quality, we require automated checks before and after code commits. To speed up compilation and integration, we run large-scale distributed compilation and continuous integration systems. Baidu's C/C++ code uses source-level dependencies, so the compilation system has to check out all of the code it depends on, which multiplies the access pressure on the code base. These are the challenges Baidu code management faces. They can be summed up in three points: code quality, collaboration at scale, and secure and stable service, as shown in Figure 3.

 

 

Figure 3 Challenges encountered by Baidu code management

 

Faced with these three challenges, the code development collaboration platform focuses on five aspects of code management: code hosting, collaborative development, code quality, code security and openness, and R&D improvement. 


1. Code hosting 
Code hosting is the infrastructure for R&D. It must keep the service secure, stable, and reliable while delivering high performance in large-scale collaboration scenarios. 
2. Code quality 
Built on the code check-in process, the platform provides easy-to-use code review, supports automated checks in the review flow such as code scanning, coding standards checks, and security scanning, and supports automated testing through continuous integration, so that code is fully quality-checked before it is merged. 
3. Code security and openness 
Code security requires strict access control, along with support for security scanning and security auditing. Code openness encourages code sharing and open source in order to achieve code reuse. 
4. Collaborative development 
The platform supports mainstream workflows to meet the needs of the different R&D models of the business lines, such as traditional branch development, trunk-based development, feature branches, and git flow. 
5. R&D improvement 
R&D management needs data support: use data to measure everything, continuously optimize the R&D process, promote efficient collaboration, and improve R&D efficiency.

 

The evolution of Baidu code management architecture

 

The code development collaboration platform has gone through four stages of evolution. Each stage posed different problems, and we adopted a different solution for each, as shown in Figure 4.

 

 

Figure 4 The evolution of Baidu's code management architecture

 

Product start-up period

 

To validate the product quickly, we applied lean thinking and built an MVP. In designing the code base service, we set capacity and performance aside for the time being and adopted a single-instance Master-Slave structure, as shown in Figure 5.

 

 

Figure 5 Product start-up period - architecture

 

Under this architecture, we mainly ensure the security and reliability of services from two aspects.

 

  1. Storage is configured with RAID, and DRBD backs up data in real time to ensure data reliability. We use DRBD's synchronous replication protocol, so data written to the master is synchronized to the slave in real time.

  2. To keep the service highly available, KeepAlived ensures that service switches quickly to the slave when the master fails, which achieves automatic failover.

 

Product development period

 

As the platform grew rapidly, the concurrency and capacity demands on the code base increased dramatically. We first improved hardware performance with large memory and SSDs, and then optimized I/O, network, caching, and so on. After repeated performance tests, we arrived at the optimal single-machine configuration.

 

For capacity expansion, we mainly considered two solutions: distributed storage and data sharding.

 

  1. Distributed storage 
    1-1. The advantage is that the architecture is simple, the data is backed up, and the capacity can be scaled horizontally. 
    1-2. The disadvantage is that the I/O performance decreases.

  2. Data sharding 
    2-1. The advantages are reliable performance, flexible control, and easy expansion (different sharding strategies and load balancing schemes can be implemented according to business requirements). 
    2-2. The disadvantages are that the existing architecture must change substantially and that cross-shard operations are costly to implement.

 

After performance testing and MVP verification, we chose the data sharding scheme. The main reason is that the code service is I/O-intensive, and the I/O performance of distributed storage is significantly worse than that of local storage. Write performance in particular dropped by an order of magnitude.

 

 

Figure 6 Product Development Period - Architecture

 

The main change in this version of the architecture, shown in Figure 6, is that Git services are deployed as separate instances. Sharding is based on repositories: different repositories are assigned to different instances. The database service is deployed independently in master-standby mode and supports read-write separation for database access.
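Repository-based sharding can be pictured with a small sketch. This is only an illustration under stated assumptions: the `ShardMap` class, the least-loaded placement rule, and the instance and repository names are all hypothetical, since the article does not say which sharding strategy Baidu actually uses.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ShardMap:
    """Maps each repository to the Git instance that hosts it (hypothetical)."""
    instances: list
    assignments: dict = field(default_factory=dict)

    def assign(self, repo):
        """Place a new repository on the instance hosting the fewest repositories."""
        if repo in self.assignments:
            return self.assignments[repo]
        load = Counter(self.assignments.values())
        # min() keeps the first instance on ties, so placement is deterministic.
        target = min(self.instances, key=lambda name: load[name])
        self.assignments[repo] = target
        return target

    def locate(self, repo):
        """Return the instance that serves an already assigned repository."""
        return self.assignments[repo]


# Made-up names, for illustration only.
shards = ShardMap(instances=["git-instance-1", "git-instance-2"])
for repo in ["team-a/repo-1", "team-a/repo-2", "team-a/repo-3"]:
    shards.assign(repo)
```

Because a repository always lives on exactly one instance, single-repository operations never cross shards. Operations that span repositories are the costly case noted among the disadvantages of sharding.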

 

A user request first passes through the proxy, which calls the unified routing service and forwards the request to the corresponding instance. The authentication service is deployed independently, and the proxy integrates an authentication module to strengthen user authentication.

 

The routing service is the core service. To reduce the cost of changing the business systems, we designed a unified routing service and routing module that intercept all access requests to the code base in an aspect-oriented manner, which keeps intrusion into business code low and makes routing transparent to callers. Because we adopted a decentralized microservice architecture, we use client-side routing. A local cache is also added, so that routing keeps working even if the routing service is down, as shown in Figure 7.
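A minimal sketch of client-side routing with a local cache follows. The class names, the `lookup` callable, and the fallback policy (ask the routing service first, fall back to the last known route on failure) are assumptions made for illustration, not the platform's actual implementation.

```python
class RoutingServiceDown(Exception):
    """Raised when the central routing service cannot be reached."""


class CachedRouter:
    """Client-side router that keeps a local copy of repository routes."""

    def __init__(self, lookup):
        # `lookup` is a hypothetical callable: repository name -> instance.
        self._lookup = lookup
        self._cache = {}

    def route(self, repo):
        try:
            instance = self._lookup(repo)
            self._cache[repo] = instance   # refresh the local copy
            return instance
        except RoutingServiceDown:
            if repo in self._cache:
                return self._cache[repo]   # serve the last known route
            raise                          # never seen before: cannot route


class FakeRoutingService:
    """Stand-in for the real routing service, for demonstration only."""

    def __init__(self, table):
        self.table = table
        self.up = True

    def __call__(self, repo):
        if not self.up:
            raise RoutingServiceDown(repo)
        return self.table[repo]


service = FakeRoutingService({"team-a/repo-1": "git-instance-1"})
router = CachedRouter(service)
first = router.route("team-a/repo-1")    # answered by the routing service
service.up = False
second = router.route("team-a/repo-1")   # answered from the local cache
```

The cache only helps for repositories the caller has already routed at least once. A repository never seen before still fails while the routing service is down.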

 

 

Figure 7 Product development period - routing design

 

Product maturity period

 

Due to the explosive growth of demands such as compilation, automated testing, and continuous integration, the code base receives more than 300,000 read requests and 20,000 write requests per day. During peak hours, TPS approaches 1,000 and the gigabit network cards are saturated. After evaluating throughput demand, we expect TPS to exceed 10,000. To ensure performance, the code download rate during peak hours should stay above 30 MB/s for automation systems and above 5 MB/s for developers. Insufficient throughput has therefore become the core problem. Our improvement plan is as follows:

 

  1. Increase bandwidth. Replace the gigabit network cards with 10-gigabit network cards.

  2. Add machines. Spread the bandwidth pressure by splitting into smaller instances, and add read-only nodes to each group of instances, because in our scenario reads far outnumber writes and most of the throughput pressure comes from read requests. The idle cold standby nodes are also upgraded to read-only nodes.
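A rough calculation shows why a single gigabit link saturates at these download targets. The sketch below assumes the full line rate is usable (1 Gbit/s = 125 MB/s) and ignores protocol overhead, so the real numbers would be somewhat lower. The function name is hypothetical.

```python
def concurrent_downloads(link_gbps, per_client_mb_s):
    """Clients one link can serve at the target rate, ignoring overhead."""
    link_mb_s = link_gbps * 1000 / 8   # 1 Gbit/s = 125 MB/s
    return int(link_mb_s // per_client_mb_s)


# Targets from the text: 30 MB/s per automation system, 5 MB/s per developer.
gigabit_automation = concurrent_downloads(1, 30)
ten_gigabit_automation = concurrent_downloads(10, 30)
gigabit_developers = concurrent_downloads(1, 5)
ten_gigabit_developers = concurrent_downloads(10, 5)
```

Under these assumptions a gigabit card sustains only 4 automation downloads at the target rate, and a 10-gigabit card sustains 41. This is why the plan combines faster network cards with more read-only nodes instead of relying on either one alone.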

 

 

Figure 8 Product Maturity Period - Architecture

 

Figure 8 shows the read-write separation architecture. The proxy classifies each request as a read or a write, sends write requests to the master node, and distributes read requests across all nodes of the instance through the load balancing module. In this version of the architecture, we still used DRBD+KeepAlived for disaster recovery backup. Read-write separation greatly improved system throughput, but the DRBD cold standby machines sat idle, which was a serious waste of resources, so we made further improvements.
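The proxy's read-write decision can be sketched as follows. In Git's smart transfer protocol, fetch and clone invoke `git-upload-pack` and push invokes `git-receive-pack`, so the service name is enough to classify a request. The `ReadWriteProxy` class, the node names, and the round-robin policy are assumptions for illustration. The article does not describe the actual load balancing algorithm.

```python
import itertools

# git-upload-pack serves fetch/clone (reads); git-receive-pack serves push (writes).
WRITE_SERVICES = {"git-receive-pack"}


class ReadWriteProxy:
    """Sends writes to the master and spreads reads over all nodes."""

    def __init__(self, master, read_only_nodes):
        self.master = master
        # Reads may go to every node of the instance, master included.
        self._readers = itertools.cycle([master, *read_only_nodes])

    def pick_node(self, git_service):
        if git_service in WRITE_SERVICES:
            return self.master
        return next(self._readers)   # simple round-robin load balancing


# Made-up node names, for illustration only.
proxy = ReadWriteProxy("node-master", ["node-ro-1", "node-ro-2"])
push_target = proxy.pick_node("git-receive-pack")
fetch_targets = [proxy.pick_node("git-upload-pack") for _ in range(4)]
```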

 

 

Figure 9 Product Maturity Stage - Architecture Optimization

 

Figure 9 shows the improved architecture. We abandoned DRBD backup and implemented our own high-availability solution, which has two main phases:

 

  1. Master failure detection. When the heartbeat check of a proxy node detects an abnormality in the master node, that proxy initiates a vote. If more than half of the proxy nodes judge the master node to be abnormal, the master is declared failed.

  2. Slave promotion. A second round of voting promotes one slave node in this group of Git instances to master. After the vote completes, the new instance information is written to the routing service, which notifies all callers of the routing change so that they update their local routing caches promptly.
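The two phases can be sketched as two majority votes. This is an illustration under assumptions: the article states the "more than half" rule only for failure detection, so requiring a majority in the promotion vote is a guess, and the function names and the `routes` dictionary standing in for the routing service are hypothetical.

```python
def majority_agrees(votes):
    """True when more than half of the proxy nodes voted True."""
    return sum(votes) * 2 > len(votes)


def elect_new_master(ballots):
    """Return the slave named on more than half of the ballots, else None."""
    if not ballots:
        return None
    counts = {}
    for candidate in ballots:
        counts[candidate] = counts.get(candidate, 0) + 1
    winner = max(counts, key=counts.get)
    return winner if counts[winner] * 2 > len(ballots) else None


def fail_over(routes, group, health_votes, ballots):
    """Run both phases; change the route only if both votes pass."""
    if not majority_agrees(health_votes):
        return routes[group]           # master is still considered healthy
    new_master = elect_new_master(ballots)
    if new_master is not None:
        routes[group] = new_master     # stands in for the routing service write
    return routes[group]


routes = {"group-1": "node-a"}
# Only one of three proxies sees a failure: nothing changes.
kept = fail_over(routes, "group-1", [True, False, False], ["node-b"] * 3)
# Two of three proxies see a failure, and two of three pick node-b.
promoted = fail_over(routes, "group-1", [True, True, False],
                     ["node-b", "node-b", "node-c"])
```

Requiring a majority in the first phase keeps a single proxy with a broken network path from triggering a failover on its own.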

 

The overall architecture of the code development collaboration platform

 

 

Figure 10 Baidu code development collaboration platform architecture

 

The Baidu code development collaboration platform adopts a microservice architecture as a whole. Business service units are built on a self-developed microservice framework and are developed, released, deployed, and run independently. The overall architecture is shown in Figure 10.

 

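1. Access services 
Httpd Proxy is mainly used for Web access, Sshd Proxy is used for Git command-line operations, and API Gateway provides a unified entry point for open APIs, which simplifies API authorization and management. 
On top of the access services, Baidu's unified front-end access service is used to build a highly available load balancer. This improves the system's capacity for concurrent access and its resistance to attacks, which protects the platform's security. 
2. Access control 
The platform builds a unified security policy and user authentication system to keep the system secure. 
3. Service center 
The service center is the core of service governance. It provides service registration/discovery, service routing, service configuration, service degradation, circuit breaking, and flow control to keep the platform's services stable as a whole. Services are managed uniformly through service routing, the configuration management center, and service registration/discovery. A unified management console is also provided to manage the application service cluster, the Git cluster, and the basic service cluster. 
4. Open services 
The platform supports two open-capability mechanisms, Webhook and Plugin, so that third-party systems can integrate easily. The main use of Webhook is to automatically trigger a continuous integration build after a developer commits a code change. Plugin is mainly used for automated code checks during code review. 
5. Business services 
Business services are organized into service units under the microservice architecture. Every business service registers with the service center. When it calls another business service, it uses the service center's discovery mechanism to obtain the list of instances that provide that service, and client-side routing decides which provider to call. This keeps services reliable and improves system throughput. The platform provides business components such as code management, code browsing, code review, code search, and code scanning. 
6. Git cluster 
The Git cluster is the core and most fundamental part of the platform. To keep the Git cluster secure, highly available, and high-performing, the platform provides the following capabilities: 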
6-1. Support both soft real-time and hard real-time backup capabilities of data. 
6-2. Provide triple backup, with at least three copies of each repository. 
6-3. Provide the ability to shard data according to different sharding strategies, and support the dynamic expansion of Git cluster. 
6-4. Provide the ability to separate read and write, and support one master and multiple backups. 
6-5. Provide HA solution and support automatic failover. 
7. Basic services 
The platform relies on multiple basic services such as database, index, cache, user management, notification, and storage. These basic services ensure their own availability and thereby improve the overall reliability of the platform.
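The Webhook capability in item 4, where a committed change automatically triggers a continuous integration build, can be sketched as below. Everything here is hypothetical: the `WebhookRegistry` class, the payload fields, the injected `send` transport, and the example URL. A real platform would issue an HTTP POST where this demo only records the call.

```python
import json


class WebhookRegistry:
    """Keeps the callback URLs that third-party systems register per repository."""

    def __init__(self, send):
        # `send` is a hypothetical transport with the shape send(url, body).
        self._send = send
        self._hooks = {}

    def register(self, repo, url):
        self._hooks.setdefault(repo, []).append(url)

    def on_push(self, repo, branch, commit_id):
        """Notify every registered URL that new commits were pushed."""
        body = json.dumps({"event": "push", "repo": repo,
                           "branch": branch, "commit": commit_id})
        for url in self._hooks.get(repo, []):
            self._send(url, body)


delivered = []
hooks = WebhookRegistry(send=lambda url, body: delivered.append((url, body)))
hooks.register("team-a/repo-1", "https://ci.example.com/build")
hooks.on_push("team-a/repo-1", "master", "abc123")
hooks.on_push("team-a/repo-2", "master", "def456")   # no hook registered
```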

Code management architecture practice under enterprise-level SaaS service

After this series of architectural improvements, the capacity, performance, and reliability of the code development collaboration platform were improved and verified. As Baidu's efficiency cloud products began serving external customers, code management encountered new challenges in other respects.

  1. Security. Code is a core asset of an enterprise. Only by guaranteeing that enterprise code is neither leaked nor lost can we win the enterprise's trust.

  2. Higher capacity and performance requirements. An external service will have more users in the future, and more users means more repositories and higher concurrency for the platform to support.

  3. Elastic scaling. Enterprises have different capacity and performance requirements for the code base at different stages of development, so elastic scaling is needed to meet their changing resource requirements.

  4. Automated operations. This mainly serves two goals: fast onboarding of enterprises and easier management of large-scale clusters.

 

Based on these enterprise-level SaaS requirements for code management, we propose an enterprise dedicated cloud solution, as shown in Figure 11.

 

 

Figure 11 Dedicated cloud

 

We implement the enterprise dedicated cloud in three layers:

 

  1. Access layer. A unified proxy service integrates the security authentication module and supports enterprise account access, which isolates requests per enterprise.

  2. Application layer. Services are shared, and enterprises are isolated through a unified access control layer.

  3. Resource layer. Core enterprise assets (such as code bases and product repositories) are isolated at the resource layer. Different enterprises' services run on different resources, which achieves true physical isolation.

 

Building on the enterprise dedicated cloud solution, we strengthened multi-tenant management and resource management in the architecture design, as shown in Figure 12.

 

 

Figure 12 Code management architecture under enterprise-level SaaS services

 

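For multi-tenant management, tenant authentication has been added to the unified access control layer. Every request must carry tenant information to pass authentication.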
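A tenant check in the access control layer might look like the sketch below. The header names `X-User` and `X-Tenant-Id`, the `user_tenants` membership table, and the exception type are all assumptions. The article says only that every request must carry tenant information.

```python
class TenantAuthError(Exception):
    """Raised when the access control layer rejects a request."""


def authenticate(headers, user_tenants):
    """Return (user, tenant) if the request carries a valid tenant for its user.

    `headers` is a dict of request headers. The header names and the
    `user_tenants` lookup table are hypothetical.
    """
    user = headers.get("X-User")
    tenant = headers.get("X-Tenant-Id")
    if not tenant:
        raise TenantAuthError("request carries no tenant information")
    if tenant not in user_tenants.get(user, set()):
        raise TenantAuthError("user does not belong to this tenant")
    return user, tenant


members = {"alice": {"acme"}}
ok = authenticate({"X-User": "alice", "X-Tenant-Id": "acme"}, members)
```

Rejecting requests that carry no tenant at all, rather than falling back to a default tenant, keeps a misconfigured client from reaching another enterprise's data.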

 

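For resource management, both Docker resources and virtual machine resources are supported. When an enterprise is onboarded, the Admin system automatically requests resources from the unified resource pool and completes automated deployment with Docker. We support both co-located and independent deployment. Co-located deployment places the code base instances of several enterprises on the same resource. For customers with stricter isolation requirements for code security, we deploy their services independently on a dedicated virtual machine.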
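The choice between co-located and independent deployment can be sketched as a placement decision over a resource pool. The `Host` and `ResourcePool` classes, the capacity field, and the first-fit policy are hypothetical. The article does not describe how the Admin system actually selects resources.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Host:
    name: str
    capacity: int                        # max code base instances on this host
    dedicated_to: Optional[str] = None   # set when one tenant owns the host
    tenants: list = field(default_factory=list)


class ResourcePool:
    def __init__(self, hosts):
        self.hosts = hosts

    def place(self, tenant, isolated=False):
        """Pick a host for a tenant's code base instance (first fit)."""
        for host in self.hosts:
            if isolated:
                # Independent deployment: take a host nobody uses yet.
                if not host.tenants:
                    host.dedicated_to = tenant
                    host.tenants.append(tenant)
                    return host.name
            elif host.dedicated_to is None and len(host.tenants) < host.capacity:
                # Co-located deployment: share a host that still has room.
                host.tenants.append(tenant)
                return host.name
        raise RuntimeError("resource pool exhausted")


pool = ResourcePool([Host("host-1", capacity=2), Host("host-2", capacity=2),
                     Host("host-3", capacity=2)])
placements = [
    pool.place("tenant-a"),
    pool.place("tenant-b"),
    pool.place("tenant-c", isolated=True),
    pool.place("tenant-d"),
]
```

In this sketch a dedicated host is never offered to other tenants, even when it has spare capacity. That is the trade-off between isolation and resource utilization.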

 

Summary

 

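The Baidu code development collaboration platform builds its business services on a microservice architecture, which both integrates the existing business systems and improves system stability and performance. A combination of data sharding and read-write separation solves the capacity and performance problems of the code base service. The dedicated cloud solution handles multi-tenancy, helps enterprise customers onboard quickly, and provides resource isolation. However, there is still much to improve. For example, we chose data sharding for capacity expansion because of performance and development cost. As the capacity of the code base keeps growing, the architectural complexity, operations cost, and performance bottlenecks that sharding brings have begun to show. The read-write separation and master-standby switchover scheme works acceptably under highly concurrent reads, but its performance and reliability fall short under highly concurrent writes.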

 

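Architecture design is closely tied to business requirements. Only a suitable architecture is a good architecture, so different stages of product development call for different technical architectures. An evolvable architecture is also a better choice for coping with the growth and change of business requirements.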
