Why does Google put billions of lines of code in one repository

Unlike most companies, Google keeps nearly all of its code in a single repository, and many people don't understand why. This article, written by an engineer in Google's infrastructure group, explains the reasoning in detail.

 

Early Googlers decided to use a centralized source control system to manage the codebase. This approach has worked at Google for over 16 years, and today the vast majority of Google software is still stored in a shared codebase.

 

As the amount of software developed by Google has steadily increased, so has the size of the Google codebase. As a result, the technology used to manage the codebase has also changed significantly.

 

This article describes the scale of the codebase and details Google's custom-built, centralized repository and the reasons this model was chosen. Google manages the company's codebase with an in-house version control system, a centralized system that underlies the workflows of most Google developers. Here we provide background on the systems and workflows that make it possible to manage and use such a large codebase effectively. We explain Google's "trunk-based development" strategy and its supporting systems, including build workflows and the tools that keep the codebase healthy: software for static analysis, code cleanup, and streamlined code review.

 

 

Google scale 

 

The codebase used by 95% of software developers at Google meets the definition of a hyperscale system [4], and this repository is evidence that a centralized codebase can be successfully scaled.

 

The Google codebase contains about a billion files and has a history of about 35 million commits (including all Google commits for 18 years). The codebase contains 86TB of data, including 9 million source files and approximately 2 billion lines of code.

 

The total file count also includes source files copied into release branches, files deleted at the latest revision, configuration files, documentation, and supporting data files. The table in the original paper summarizes Google repository statistics from January 2015.

 

In 2014, 15 million lines of code were modified in the Google codebase every week. By contrast, the Linux kernel is an example of a large open-source software codebase containing about 15 million lines of code in 40,000 files. [14]

 

Google's codebase is shared by more than 25,000 Google software developers in dozens of offices around the world. On a typical work day, they typically make 16,000 changes to the codebase, and another 24,000 changes are committed by automated systems.

 

Every day, the codebase serves billions of file read requests, with a peak of about 800,000 queries per second, and a weekday average of about 500,000 queries per second, with most of the traffic coming from Google's distributed build and test system.

 

 

The total commit count includes both interactive commits by humans and commits by automated systems, and commit rates drop sharply on holidays (such as Christmas and New Year's Day, US Thanksgiving, and US Independence Day).

 

In October 2012, the Google codebase added support for Windows and Mac users (previously it supported only Linux), and the formerly separate Windows and Mac repositories were merged into the main codebase. Google's repository-merging tool attributed all historical changes to their original authors.

 

 

As the weekly commit chart shows, commit rates were dominated by human users until 2012, when Google moved the codebase to its in-house implementation. After that, commits to the repository began to climb, with the growth driven largely by automation, as described below.

 

Managing a codebase and a development effort of this size is an ongoing challenge for Google. Despite several years of experimentation, Google has not found a commercially available or open-source version control system that can support a single repository at this scale. Google's in-house solution to this problem is Piper.

 

Background

 

Before you look at the pros and cons of using a single codebase, you need some background on Google's tools and workflows.

 

Piper and CitC

 

Piper is a large codebase implemented on standard Google infrastructure, originally based on BigTable and now based on Spanner. [3] Piper is distributed in 10 Google data centers around the world and relies on the Paxos [6] algorithm to ensure replica consistency.

 

This architecture provides high redundancy and helps optimize latency for Google software developers. Also, caching and asynchronous operations can hide a lot of network latency. This is important because getting the full benefits of Google's cloud toolchain requires developers to be online.

 

Before launching Piper, Google relied primarily on a single Perforce instance (augmented with a custom caching infrastructure [1]) for more than 10 years. The need to keep scaling the Google codebase was the main driver behind Piper's development.

 

Because Google's source code is one of the company's most important assets, security features were a key consideration in Piper's design. Piper supports file-level access control lists. Most of the codebase is visible to all Piper users, but access to important configuration files or files containing critical algorithms can be restricted more tightly.

 

All reads and writes to files in Piper are logged. If sensitive data is accidentally committed to Piper, the file can be purged, and the read logs let administrators determine whether anyone accessed the file before it was deleted.
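The purge-and-audit workflow can be sketched in a few lines. This is a hypothetical illustration, not Piper's actual interface; the `AuditLog` class, file paths, and user names are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class AccessEvent:
    user: str
    path: str
    timestamp: int  # monotonically increasing event time

class AuditLog:
    """Toy read-audit log: every file access is recorded, so after
    purging a sensitive file an administrator can ask who read it
    before it was deleted."""

    def __init__(self):
        self.events = []

    def record_read(self, user, path, timestamp):
        self.events.append(AccessEvent(user, path, timestamp))

    def readers_before(self, path, purge_time):
        """Users who read `path` before it was purged at `purge_time`."""
        return sorted({e.user for e in self.events
                       if e.path == path and e.timestamp < purge_time})

log = AuditLog()
log.record_read("alice", "//secrets/key.txt", 100)
log.record_read("bob", "//secrets/key.txt", 150)
log.record_read("carol", "//secrets/key.txt", 300)  # after the purge
print(log.readers_before("//secrets/key.txt", purge_time=200))
```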

 

 

In the Piper workflow, developers create local copies of files, which are stored in a developer-owned workspace, before making changes to the codebase.

 

A Piper workspace is comparable to a working copy in Apache Subversion, a local clone in Git, or a client in Perforce. Updates from the Piper codebase can be pulled into a workspace and merged with work in progress as needed.

 

Workspace snapshots can be shared with other developers for review, and files in the workspace are only committed to the central repository after going through the Google code review process.

 

Most developers access Piper through a system called Clients in Cloud, or CitC, which consists of a cloud-based storage backend and a Linux FUSE [13] filesystem, where developers see their workspace as a directory, overlaying changes on top of the full Piper library.

 

CitC supports code browsing and Unix tools without the need for local cloning or syncing state. Developers can browse and edit files anywhere in the Piper repository, and only modified files are stored in their workspace.

 

This structure means that CitC workspaces typically consume only a small amount of storage (average workspace is less than 10 files) while presenting the entire Piper codebase to developers.
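The overlay behavior described above can be modeled simply: reads fall through to the full repository unless the workspace holds a modified copy. This is a toy sketch of the idea, not the real FUSE implementation; the class name, paths, and contents are illustrative.

```python
class OverlayWorkspace:
    """Copy-on-write overlay: the workspace stores only modified files,
    while reads of everything else fall through to the full repository."""

    def __init__(self, repo):
        self.repo = repo   # the full (read-only) repository tree
        self.local = {}    # only files changed in this workspace

    def read(self, path):
        if path in self.local:
            return self.local[path]
        return self.repo[path]

    def write(self, path, contents):
        self.local[path] = contents  # the underlying repo is untouched

repo = {"//lib/util.py": "v1", "//app/main.py": "print('hi')"}
ws = OverlayWorkspace(repo)
ws.write("//lib/util.py", "v2")
print(ws.read("//lib/util.py"))   # the workspace copy
print(ws.read("//app/main.py"))   # falls through to the repository
print(len(ws.local))              # only one file stored locally
```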

 

All writes to files are stored as snapshots in CitC, making it possible to resume previous stages of work as needed. Snapshots can be explicitly named, resumed, or marked for review.

 

CitC workspaces are available on any cloud-connected machine, making it easy to switch machines and continue work uninterrupted. It also lets developers view one another's work in CitC workspaces. Storing all in-progress work in the cloud is an important element of Google's workflow.

 

This in-progress work is available to other tools, including the cloud-based build system, the automated testing infrastructure, and the code browsing, editing, and review tools.

 

Several workflows take advantage of the uncommitted code feature in CitC, enabling software developers to work more efficiently with large codebases.

 

For example, when sending changes for a code review, developers can enable the auto-commit option, which is especially useful when code authors and reviewers are in different time zones. The test will run when the review is marked as complete.

 

If the tests pass, the code is merged into the repository without further human intervention. CodeSearch, Google's code browsing tool, also supports simple edits using a CitC workspace.

 

When browsing the repository, developers can click a button to enter edit mode and make a simple change (such as fixing a typo or improving a comment). Then, without leaving the code browser, they can send their change to the appropriate reviewers with auto-commit enabled.

 

Piper can also be used without CitC, developers can store Piper workspaces on their local computer, and Piper is also interoperable with Git. Currently, over 80% of Piper users use CitC, and usage continues to grow due to CitC's many advantages.

 

Piper and CitC guarantee efficient work with a single codebase at the scale of Google's codebase. The design and architecture of these systems are influenced by the trunk-based development model adopted by Google, as described below.

 

 

Trunk-based development

 

Google practices trunk-based development on top of the Piper repository. Piper users work overwhelmingly at "head", the most recent version of a single copy of the code called the "trunk" or "mainline", and changes are committed to the codebase serially.

 

The combination of trunk-based development and a central codebase defines a single codebase model, and after any commit, changes are visible to all other developers. Piper users' consistent view of the Google codebase is key to delivering the benefits described later in this article.

 

 

Trunk-based development is beneficial because it avoids the painful merges that come with long-lived branches. Branches are typically used for releases, but development on branches is unusual and not well supported at Google.

 

Bug fixes and enhancements that must be added to a release are usually developed on trunks and then pulled into the release branch.

 

Because of the need to maintain stability and limit churn on release branches, a release is usually a snapshot of head, with an optional small number of changes cherry-picked from head as needed. Long-lived branches with parallel development on both the branch and the trunk are exceedingly rare.

 


When developing new features, both old and new code paths often coexist, controlled by using conditional flags. This technique avoids the need for development branches, and features are turned on or off through configuration updates.
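A minimal sketch of this flag technique, with invented flag and function names: both code paths live at head, and a configuration value selects between them, so no development branch is needed.

```python
# Configuration-driven feature flag: flipping the value switches
# implementations without any code change or branch merge.
FLAGS = {"use_new_ranker": False}

def rank_old(items):
    # The existing, battle-tested code path.
    return sorted(items)

def rank_new(items):
    # The new implementation, developed at head alongside the old one.
    return sorted(items, reverse=True)

def rank(items):
    if FLAGS["use_new_ranker"]:
        return rank_new(items)
    return rank_old(items)

print(rank([3, 1, 2]))            # old path: [1, 2, 3]
FLAGS["use_new_ranker"] = True    # the "flag flip", e.g. via a config push
print(rank([3, 1, 2]))            # new path: [3, 2, 1]; flip back to roll back
```

If the new ranker misbehaves in production, rolling back is a configuration change rather than a code revert.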

 

While this imposes some extra complexity on developers, it avoids development-branch merges entirely, and flipping a flag makes it much easier and faster to roll back a problematic new implementation.

 

This method is typically used for project-specific code rather than common library code, and the flags and legacy code paths are eventually removed. Google uses a similar mechanism for experiments: A/B tests can measure everything from code performance to user engagement with product changes.

 

Google Workflow 

 

Several best practices and support systems are required to avoid the problems encountered in the trunk-based development model. For example, Google has an automated testing infrastructure that launches tests for all affected dependencies on almost every commit.

 

If a code change breaks the build, the system automatically undoes the change. To reduce the incidence of broken code being committed in the first place, Google's highly customizable "pre-commit" infrastructure automatically runs tests and analyses on a change before it is added to the codebase.

 

A global set of pre-commit analyses runs on every change, and code owners can create custom analyses that run only on specified directories of the codebase. Only a small set of very low-level core libraries use a branch-like mechanism to perform additional testing before a new version is exposed to client code.
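The "launch tests for all affected dependencies" behavior can be sketched as a traversal of the reverse dependency graph: start from the changed targets and collect everything downstream. The build-target names and graph below are invented; this illustrates the general technique, not Google's actual implementation.

```python
from collections import deque

deps = {  # target -> targets it depends on (a toy build graph)
    "//lib:base": [],
    "//lib:net": ["//lib:base"],
    "//app:server": ["//lib:net"],
    "//app:server_test": ["//app:server"],
    "//lib:base_test": ["//lib:base"],
}

def affected_targets(changed):
    """BFS over the reverse dependency graph from the changed targets."""
    rdeps = {t: set() for t in deps}
    for target, ds in deps.items():
        for d in ds:
            rdeps[d].add(target)
    seen, queue = set(changed), deque(changed)
    while queue:
        for dependent in rdeps[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

# A change to //lib:base affects everything built on top of it,
# so both test targets must run.
tests = sorted(t for t in affected_targets(["//lib:base"])
               if t.endswith("_test"))
print(tests)
```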

 

An important aspect of encouraging code quality is the expectation that all code is reviewed before being committed to the codebase. Most developers can view, and propose changes to, code anywhere in the codebase (except for a small set of highly confidential code that is more carefully controlled).

 

The risk of developers unfamiliar with a piece of code changing it is mitigated by the code review process and the concept of code ownership. The Google codebase is laid out as a tree, and each directory has a set of owners who control whether changes to files in that directory are accepted.

 

Owners are typically the developers who work on the project in the relevant directory. A change usually requires both a detailed code review by a developer assessing the quality of the change and approval from an owner assessing whether the change is appropriate for their area of the codebase.
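Directory-based ownership can be sketched as a walk up the directory tree until an owners entry is found. The ownership mapping and paths below are hypothetical, invented for illustration.

```python
import posixpath

# Hypothetical ownership table: directory -> set of owners. A file is
# governed by the nearest ancestor directory that has an entry.
OWNERS = {
    "//": {"root-admin"},
    "//search": {"alice", "bob"},
    "//search/indexing": {"carol"},
}

def owners_for(path):
    """Walk up from the file's directory to the nearest owners entry."""
    d = posixpath.dirname(path)
    while True:
        if d in OWNERS:
            return OWNERS[d]
        if d in ("//", ""):
            return set()
        d = posixpath.dirname(d)

print(owners_for("//search/indexing/shard.py"))  # nearest entry wins
print(owners_for("//search/query.py"))
print(owners_for("//ads/click.py"))              # falls back to the root
```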

 

Code reviewers comment on all aspects of code quality, including design, functionality, complexity, testing, naming, comment quality, and code style.

 

Google has written a code review tool called Critique that lets reviewers see the evolution of the code and comment on any line of a change. It encourages further revisions and conversation until the reviewers and owners approve the change.

 

Google's static analysis system (Tricorder [10]) and pre-commit infrastructure also automatically surface data on code quality, test coverage, and test results in the code review tool. These computationally intensive checks are triggered periodically, as well as when a code change is sent for review.

 

Tricorder also suggests fixes for many bugs, and these systems provide important data to improve the effectiveness of code reviews and keep Google's codebase healthy.

 

From time to time, the Google developer team conducts code cleanups to further maintain the health of the codebase. Developers making these changes typically divide the process into two phases.

 

A large backward-compatible change is made first; once it is complete, a second, smaller change removes the original code that is no longer referenced. The Rosie tool supports the first phase of such large-scale cleanups and code changes.

 

With Rosie, developers can create one big patch. Rosie is responsible for breaking large patches into smaller patches, testing them independently, sending them out for code review, and committing them automatically after passing the tests and code review.

 

Rosie splits patches by project directory line, relying on the code ownership hierarchy described earlier to send patches to the appropriate reviewers.
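Splitting a large patch along project-directory lines can be sketched as a simple partition of changed files by directory. This is an illustrative reduction of the idea, with invented file paths; the real tool also tests, mails, and commits each piece independently.

```python
from collections import defaultdict
import posixpath

def split_patch(changed_files):
    """Partition one big patch into per-directory patches, so each piece
    can be tested and reviewed by that directory's owners."""
    patches = defaultdict(list)
    for path in changed_files:
        patches[posixpath.dirname(path)].append(path)
    return dict(patches)

big_patch = [
    "//maps/render.py",
    "//maps/tiles.py",
    "//gmail/inbox.py",
    "//search/query.py",
]
for directory, files in sorted(split_patch(big_patch).items()):
    print(directory, files)
```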

 

As Rosie's popularity and usage grew, it became apparent that some controls had to be established to limit Rosie's use to high-value changes.

 

In 2013, Google adopted a formal large-scale change review process, which led to a decrease in the number of Rosie commits from 2013 to 2014. When evaluating a Rosie change, the review board balances the benefits of the change against the costs of reviewer time and repository churn; we examine similar trade-offs more closely later.

 

In summary, Google has developed a number of practices and tools to support its enormous codebase: trunk-based development, the distributed source repository Piper, the workspace client CitC, and the workflow tools Critique, CodeSearch, Tricorder, and Rosie. We discuss the pros and cons of this model below.

 

Analysis

 

This section outlines the advantages of a single codebase and the costs of maintaining such a model at scale.

 

Advantages

 

Supporting Google's extremely large codebase while maintaining good performance for thousands of users is a challenge, but Google has embraced the monolithic model because of its compelling advantages.

 

Most importantly, it supports:

  • Unified versioning, with one source of truth.
  • Extensive code sharing and reuse.
  • Simplified dependency management.
  • Atomic changes.
  • Large-scale refactoring.
  • Collaboration across teams.
  • Flexible team boundaries and code ownership.
  • Code visibility and a clear tree structure with implicit team namespaces.

 

A single code base provides unified version control and a single source of code. There is no confusion as to which repository hosts the authoritative version of the file.

 

If one team wants to depend on another team's code, it can depend on it directly. The Google codebase contains a wealth of useful libraries, and the monolithic repository leads to extensive code sharing and reuse.

 

The Google build system [5] makes it easy to include code across directories, simplifying dependency management. A change to a project's dependencies triggers a rebuild of the dependent code, and since all code is versioned in the same repository, there is only ever one version; there is no concern about independent versioning of dependencies.

 

 

Most notably, this model allows Google to avoid the "diamond dependency" problem, which occurs when A depends on B and C, which both depend on D, but B requires version D.1 and C requires version D.2.
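A tiny sketch makes the conflict concrete: in a multi-repository world, B and C can pin incompatible versions of D, while in a monorepo there is only one version to choose. The names and version strings below are invented for illustration.

```python
def resolve(requirements):
    """Pick one version per dependency; return None if the requirements
    conflict (the diamond-dependency problem)."""
    chosen = {}
    for consumer, deps in requirements.items():
        for dep, version in deps.items():
            if dep in chosen and chosen[dep] != version:
                return None  # unsatisfiable: consumers disagree on `dep`
            chosen[dep] = version
    return chosen

# Multi-repo world: B and C pinned different releases of D.
multi_repo = {"B": {"D": "1.0"}, "C": {"D": "2.0"}}
print(resolve(multi_repo))   # None: the diamond cannot be satisfied

# Monorepo world: everyone builds against the single version at head.
monorepo = {"B": {"D": "head"}, "C": {"D": "head"}}
print(resolve(monorepo))     # {'D': 'head'}
```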

 

In this situation it can be difficult for the maintainers of D to release a new version without causing breakage, because all callers must be updated at the same time, which is hard when the callers are hosted in different repositories.

 

In the open source world, where dependencies are often broken by library updates, finding the versions of dependent libraries that all work together is a challenge. Updating versions of dependencies can be painful for developers, and delaying updates can turn into very expensive technical debt.

 

With a single codebase, it's easier for someone updating the library to update all affected dependencies at the same time. The technical debt incurred by dependencies is repaid as soon as a change is made, and changes to the underlying library are immediately propagated through the dependency chain to the final product that depends on the library, without the need for a separate sync or migration step.

 

Note that the diamond-dependency problem can exist at the source/API level as well as between binaries [12]. At Google, binary-level problems are avoided through the use of static linking.

 

The ability to make atomic changes is also a very powerful feature of the monolithic model: a developer can make a sweeping change across hundreds or thousands of files in the repository in a single consistent operation. For example, a developer can rename a class or function in a single commit without breaking any builds or tests.
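An atomic rename can be sketched as building the complete change first and then applying it in one step, so no intermediate state ever mixes old and new names. The repository contents and symbol names below are invented, and real tools would use semantic rewriting rather than a regex.

```python
import re

# A toy in-memory repository: path -> file contents.
repo = {
    "//lib/geo.py": "def OldName(): pass",
    "//app/a.py": "from lib.geo import OldName\nOldName()",
    "//app/b.py": "import lib.geo\nlib.geo.OldName()",
}

def atomic_rename(repo, old, new):
    """Compute the full change first, then apply it as one 'commit'."""
    change = {path: re.sub(rf"\b{old}\b", new, src)
              for path, src in repo.items() if old in src}
    repo.update(change)  # every affected file changes together
    return len(change)

n = atomic_rename(repo, "OldName", "NewName")
print(n)                                           # files changed in one commit
print(any("OldName" in s for s in repo.values()))  # no stale callers remain
```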

 

The availability of all source code in a single codebase, or at least on a centralized server, makes it easier for maintainers of the core library to perform tests and performance benchmarks before committing high-impact changes.

 

This approach is useful for exploring and measuring the value of highly disruptive changes, a specific example being an experiment evaluating the feasibility of converting Google data centers to support non-x86 machine architectures.

 

Because of the structure of Google's codebase, developers don't need to decide on codebase boundaries, and engineers don't need to "branch" development of shared libraries, or merge across repositories to update code.

 

Team boundaries are fluid: when project ownership changes or plans call for merging systems, all the code is already in the same repository. This environment makes it easy to refactor and reorganize the codebase over time; moving a project and updating its dependencies can be applied atomically, and the development history of the affected code remains intact and available.

 

Another property of a monolithic codebase is the easy-to-understand layout of the codebase, as it is organized in a single tree, with each team having a directory structure in the main tree, effectively acting as the project's own namespace.

 

Each source file can be uniquely identified by a single string, a file path that optionally includes a revision number. When browsing the codebase, it is easy to see how any source file fits into the larger whole.
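As a hedged illustration of this addressing scheme (the `path@revision` syntax here is invented for the example, not Piper's actual notation), a file identifier might be parsed like this:

```python
def parse_file_id(file_id):
    """Split an identifier like '//search/query.py@123' into a path and
    an optional revision; None means 'head'. The syntax is hypothetical."""
    if "@" in file_id:
        path, rev = file_id.rsplit("@", 1)
        return path, int(rev)
    return file_id, None

print(parse_file_id("//search/query.py@123"))  # pinned to a revision
print(parse_file_id("//search/query.py"))      # tracks head
```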

 

The Google codebase is constantly evolving, and more complex codebase modernization efforts (such as updating it to C++11 or rolling out performance optimizations [9]) are typically managed centrally by dedicated codebase maintainers. 

 

Such an effort can touch half a million variable declarations or function call sites spread across hundreds of thousands of source files. Because all projects are stored centrally, a small team of experts can do the work for the entire company, rather than requiring many individuals to develop their own tools.

 

An example

 

Consider Google's compiler team, who make sure Google's developers use the latest toolchain and benefit from the latest improvements in generated code and "debuggability".

 

A single codebase gives the compiler team a complete view of how the various languages are used at Google and allows them to perform codebase-wide cleanups so their changes do not break builds.

 

This greatly simplifies compiler validation, shortens compiler release cycles, and makes it possible for Google to perform compiler upgrades safely (typically more than 20 per year for the C++ compilers).

 

By analyzing data from nightly runs of performance and regression tests, the compiler team can adjust the default compiler settings to be optimal.

 

For example, from 2014 to 2015, Java developers at Google saw garbage collection (GC) CPU consumption drop by more than 50% and GC pause time drop by 10%-40%. In addition, when a software bug is found, the compiler team can add new warnings to prevent it from recurring.

 

Along with such a change, they scan the entire repository to find and fix other instances of the problem before turning on the new compiler error. Having the compiler reject patterns that proved problematic in the past has significantly improved the overall health of Google's code.

 

Storing all source code in a common version control repository allows repository maintainers to efficiently analyze and change Google's source code. Tools like Refaster [11] and ClangMR [15] (often used with Rosie) utilize a single view of Google's source code to perform advanced transformation of source code.

 

A single codebase captures all dependency information, allowing older APIs to be removed with confidence. Because the new API can be made available to all callers at any given time, a single codebase greatly simplifies the development of these tools by ensuring atomicity of changes and a single global view of the entire repository.

 


 

Costs and Tradeoffs

 

Note that a single codebase in no way implies a monolithic software design, and using this model involves some drawbacks and tradeoffs that must be considered.

 

These costs and tradeoffs fall into three categories:

  • Investment in building and running the tooling.
  • Codebase complexity, including unnecessary dependencies and difficulties with code discovery.
  • Effort invested in keeping code healthy.

In many ways, a single codebase leads to simpler tools. However, there is also a need to scale the tooling to the size of the codebase.

 

For example, Google has written a custom plug-in for the Eclipse integrated development environment (IDE) to enable the IDE to work with large codebases.

 

Google's code indexing system supports static analysis, cross-referencing in the code browsing tools, and rich IDE functionality for Emacs, Vim, and other development environments. These tools require ongoing investment to keep up with the ever-growing scale of the codebase.

 

In addition to the investment in building and maintaining scalable tools, Google must also bear the cost of running these systems, some of which are very computationally intensive.

 

Much of Google's internal suite of developer tools, including automated testing infrastructure and a highly scalable build infrastructure, is critical to supporting the scale of a single codebase. Therefore, how to run these tools must be weighed to balance the cost of execution with the benefits of the data provided to the developer.

 

A single codebase makes it easier to understand the structure of the codebase, because there are no cross-repository boundaries between dependencies. However, as scale increases, code discovery becomes more difficult, because standard tools like grep bog down.

 

Developers must be able to explore codebases, find related libraries, and understand how to use them and who wrote them, and library authors often need to understand how their APIs are used.

 

This requires a significant investment in code search and browsing tools, which Google has found to be very beneficial, increasing the productivity of all developers. [9]

 

Access to the entire codebase encourages extensive code sharing and reuse. Some would argue that this model, which relies on the extensibility of Google's build system, makes it too easy to add dependencies and reduces developers' incentive to design stable, well-thought-out APIs.

 

Because of the ease of creating dependencies, often teams don't think about their dependency graph, making code cleanup more error-prone. Unnecessary dependencies can increase a project's risk of breaking downstream builds, cause binary bloat, and create extra work in builds and tests, plus maintaining legacy projects can lead to lost productivity.

 

In its efforts to control unnecessary dependencies, Google has built tools that help identify and remove unwanted dependencies, as well as tools for identifying underutilized dependencies and unneeded libraries.

 

One such tool, Clipper, relies on a custom Java compiler to generate an accurate cross-reference index, which it then uses to build a reachability graph and determine which classes are never used.
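Clipper's reachability analysis can be sketched as a graph traversal from a set of entry points; classes never reached are candidates for removal. The class names and cross-reference graph below are invented; this is a sketch of the technique, not Clipper's implementation.

```python
from collections import deque

xrefs = {  # class -> classes it references (a toy cross-reference index)
    "Main": ["Server", "Config"],
    "Server": ["Handler"],
    "Handler": [],
    "Config": [],
    "LegacyCache": ["Config"],  # nothing references LegacyCache itself
    "OldCodec": [],
}

def unused_classes(xrefs, entry_points):
    """BFS from the entry points; anything unreached is never used."""
    seen, queue = set(entry_points), deque(entry_points)
    while queue:
        for ref in xrefs[queue.popleft()]:
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return sorted(set(xrefs) - seen)

print(unused_classes(xrefs, ["Main"]))  # removal candidates
```

Note that `LegacyCache` is flagged even though it references a live class (`Config`): reachability is about who uses *it*, not what it uses.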

 

Clipper guides dependency-refactoring work by helping developers find targets that are relatively easy to remove or decompose.

 


 

Dependency refactoring and cleanup tools are helpful, but ideally code owners should be able to prevent unnecessary dependencies from being created.

 

In 2011, Google began promoting the concept of API visibility, setting the default visibility of new APIs to "private", which forces developers to explicitly mark an API as available to other teams. A lesson from Google's experience with a large codebase is that such visibility restrictions should be put in place as early as possible, to encourage a healthier dependency structure.
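Default-private visibility can be sketched as a simple allow-list check, similar in spirit to build-system visibility rules; the target names and the `PUBLIC` marker here are illustrative, not Google's actual syntax.

```python
# Hypothetical visibility table: an API is usable only by the packages
# listed, unless its owners explicitly open it with "PUBLIC".
VISIBILITY = {
    "//payments:charge_api": ["//payments"],  # private (the default)
    "//base:strings": ["PUBLIC"],             # deliberately opened up
}

def may_depend(consumer_pkg, api):
    """May `consumer_pkg` add a dependency on `api`?"""
    allowed = VISIBILITY[api]
    return "PUBLIC" in allowed or consumer_pkg in allowed

print(may_depend("//payments", "//payments:charge_api"))  # own package: ok
print(may_depend("//ads", "//payments:charge_api"))       # blocked by default
print(may_depend("//ads", "//base:strings"))              # public: ok
```

Making "private" the default means each new cross-team dependency is a deliberate decision by the API's owners rather than an accident.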

 

Most Google code is available to all Google developers, which has led to a culture where some teams want other developers to read their code, rather than providing them with separate API documentation.

 

This approach has pros and cons: developers sometimes read API implementation code and end up relying on low-level implementation details, which creates a maintenance burden for teams that would rather not expose those details to users.

 

The model also requires teams to collaborate when using open-source code; one area of the repository is reserved for open-source code (whether developed at Google or externally).

 

To prevent dependency conflicts, only one version of an open-source project can be available at any given time, so teams that use open-source code are expected to occasionally spend time upgrading the codebase to newer versions of open-source libraries.

 

Google puts a lot of effort into maintaining code health to address some of the issues related to codebase complexity and dependency management.

 

For example, specialized tooling automatically detects and removes dead code, splits large refactorings and automatically assigns code reviews (as through Rosie), and marks APIs as deprecated.

 

There are also costs in the human effort needed to run these tools and manage the corresponding large-scale code changes, as well as in reviewing codebase-wide cleanups and other ongoing simple refactorings.

 

Alternatives

 

With the increasing popularity and use of distributed version control systems (DVCS) like Git, Google considered whether to move Piper to Git as its primary version control system.

 

A team at Google is focused on supporting Git, which is used by Google's Android and Chrome teams outside the main Google codebase. For these teams, using Git is important because of external partnerships and open-source collaboration.

 

The Git community strongly advises using more and smaller repositories, and a Git clone copies everything to the local machine, an operation incompatible with a huge repository. Moving to Git-based source hosting would have required splitting Google's repository into thousands of separate repositories to achieve reasonable performance.

 

Such a reorganization would require both cultural and workflow changes for Google developers. For comparison, Google's Git-hosted Android codebase is divided into more than 800 separate repositories.

 

Given the value of the tools Google has built and the many advantages of the monolithic codebase structure, a switch to many smaller repositories did not make sense for Google's main codebase. Moving to Git, or to any other system that would require splitting the repository, was not compelling for Google.

 

The Google source code team's current investments focus on the continued reliability, scalability, and security of the internal source systems. The team is also doing experimental work with Mercurial, an open-source DVCS similar to Git.

 

The goal is to add scalability features to the Mercurial client to efficiently support Google's scale. This will give Google developers an alternative to the popular DVCS-style workflow with a single codebase repository.

 

This effort is in partnership with the open source Mercurial community, which includes contributors from other companies.

 

Conclusion

 

Google adopted its single source control strategy in 1999, when it migrated the existing Google codebase from CVS to Perforce. Early Google engineers maintained that a single repository was strictly better than splitting up the codebase, though at the time they did not anticipate the future scale of the codebase or all of the supporting tooling it would require.

 

Over the years, as the investment required to keep scaling the centralized repository has grown, Google's leadership has occasionally considered whether it would make sense to move away from the monolithic model. Despite the effort required, Google has chosen to stick with the centralized single codebase because of its advantages.

 

The monolithic model of source control is not for everyone; it works best for organizations like Google with an open and collaborative culture. It is less applicable to organizations where much of the codebase is private or hidden between groups.

 

On Google's side, we've found that with some investment, the overall model of source control can be successfully scaled to a codebase with over a billion files, 35 million commits, and thousands of developers around the world.

 

As projects at Google and elsewhere continue to grow in size and complexity, we hope the analysis and workflows described in this article can help organizations weigh decisions about the long-term structure of their codebases.

 
