How to manage a very large Python code repository with nearly 30,000 files?


Note from Python Cat: I shared the English original of this article in issue 11 of Python Trends Weekly. Here is the translation.

Original (English): How we organise our very large Python monolith

By David Seddon

Translator: RyomaHan | Xiaobai

Source: https://wiki.blanc.site/archives/d644d904.html

TLDR (Summary by AI Claude)

In this article, a Python developer describes how the code of a very large Python project is structured.

The project contains nearly 30,000 Python files and is maintained by more than 400 developers around the world. To cope with the growing complexity of the code, the project adopts a layered architecture: the code base is divided into layers, and dependencies between layers are restricted so that they can only flow from higher layers to lower ones.

The article describes the project's layer structure in detail and shows how the Import Linter tool is used to enforce the layering rules. Progress towards full layering is measured by tracking the number of illegal imports that are still being ignored.

A layered architecture can effectively reduce the complexity of a large project and makes independent development easier. It also has drawbacks: too much code tends to accumulate in the higher layers, and fully adopting layering takes time. In general, introducing a layered architecture as early as possible reduces later refactoring work and is an effective way to manage a large Python project.

Through a real large-scale Python project case, this article vividly introduces the implementation process, advantages and disadvantages of the layered architecture, which is very useful for managing large-scale projects.

Foreword

Hi, I'm David, a Python developer at Kraken Technologies. My job at Kraken is to maintain a Python application that had 27,637 modules at last count. Yes, you read that right: this project has almost 28K individual Python files (not including test code).

I maintain this behemoth together with 400 other developers around the world, who are constantly merging new code into it. With nothing more than a colleague's approval on GitHub, anyone can change the code and trigger a deployment of the software, which runs at 17 different energy and utility companies and serves millions of customers.

Reading that description, you will most likely assume that the code of this project must be extremely messy. Frankly, I would have thought so too. But the truth is that, at least in the area I work in, a large number of developers can work productively on one large Python project.

Many things make this possible, and a lot of them are about culture and conventions rather than technology. In this blog post, I want to focus on how the way we organise our code helps.

Layered architecture

If you have maintained an application's code base for a while, you will have felt its complexity rise over time. As development continues, the logic of different parts of the application becomes entangled, and it gets harder and harder to reason about any one module in isolation.

This was the problem we ran into in the early days of our code base. After some research, we decided to tackle it with a layered architecture: dividing the code base into components (the layers) and restricting the dependencies between them.

Layering is a fairly common software architecture pattern in which the components (layers) are organised into a conceptual stack. Within the stack, a lower-level component may not depend on (import from) any component above it.

[Figure: Layered architecture, with dependencies flowing downwards]

For example, in the diagram above, C can depend on B and A, but not D.
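The rule can be sketched in a few lines of Python. This is only an illustration: the layer names A to D come from the diagram, the ordering with D at the top is an assumption inferred from the sentence above, and the helper may_depend_on is hypothetical rather than part of any real tool.

```python
# Layers listed from highest to lowest (assumed order: D on top, A at the bottom).
LAYERS = ["D", "C", "B", "A"]


def may_depend_on(importer: str, imported: str) -> bool:
    """Return True if `importer` is allowed to depend on `imported`.

    A layer may depend on itself and on any layer below it,
    but never on a layer above it.
    """
    return LAYERS.index(importer) <= LAYERS.index(imported)


print(may_depend_on("C", "B"))  # B sits below C, so this is allowed
print(may_depend_on("C", "D"))  # D sits above C, so this is forbidden
```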

Layered architecture applies very broadly, and you are free to choose what counts as a component. For example, you can treat several independently deployable services as the components, or you can simply treat the source files in different parts of a project as the components.

The definition of a dependency is also broad. In general, we consider there to be a dependency between two components whenever they interact directly (even if only at a conceptual level). Indirect interaction (for example, via configuration) is generally not considered a dependency.

How to apply layered architecture in Python projects

The best way to apply a layered architecture to a Python project is to use Python modules as the layers and import statements as the dependencies.

Take the following project directory as an example:

myproject
 __init__.py
 payments/
  __init__.py
  api.py
  vendor.py
 products.py
 shopping_cart.py

The modules nested directly inside the package are good candidates for layers. Suppose we decide to layer them in the following order:

# Dependencies flow downwards (higher layers may depend on lower layers)
shopping_cart
payments
products

To satisfy this architecture, we need to forbid any module in payments from importing from shopping_cart, while still allowing it to import from products (see the figure above).
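To make the rule concrete, here is a rough sketch that uses the standard library ast module to spot imports that reach into a higher layer. It is an illustration, not how any particular tool is implemented: the helpers layer_of and illegal_imports are hypothetical, the imported names Product and Cart are made up, and only simple absolute imports of the form "import a.b" or "from a.b import c" are handled.

```python
import ast
from typing import List, Optional

# Layers of the example `myproject` package, highest first.
LAYERS = ["shopping_cart", "payments", "products"]


def layer_of(module: str) -> Optional[str]:
    """Map a dotted module name such as 'myproject.payments.api' to its layer."""
    parts = module.split(".")
    if len(parts) >= 2 and parts[0] == "myproject" and parts[1] in LAYERS:
        return parts[1]
    return None


def illegal_imports(module: str, source: str) -> List[str]:
    """List the absolute imports in `source` that reach into a higher layer."""
    own_layer = layer_of(module)
    if own_layer is None:
        return []
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            target_layer = layer_of(target)
            # A smaller index means a higher layer, which must not be imported.
            if target_layer is not None and LAYERS.index(target_layer) < LAYERS.index(own_layer):
                found.append(target)
    return found


# Hypothetical contents of myproject/payments/api.py
source = (
    "from myproject.products import Product\n"
    "from myproject.shopping_cart import Cart\n"
)
print(illegal_imports("myproject.payments.api", source))
```

Here the import from products is accepted because products sits below payments, while the import from shopping_cart is reported.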

Layers can also be nested, so we can continue layering in the payments module, for example:

api
vendor

There is no single right answer for how many layers to have or what order to put them in; you have to work that out through practice. But used sensibly, layering effectively reduces the complexity of a project's structure and makes it easier to understand and change.

How we practice layered architecture in Kraken projects

As of this writing, 17 different energy and utilities companies license Kraken. Internally we call these companies clients, and we run a separate instance for each one. As a result, the different Kraken instances are like different branches growing from the same root.

Put simply, much of the behaviour is shared between instances, but each client also has its own custom code to meet its specific needs. The same is true at the level of territories. All clients operating in the UK have certain things in common (they belong to the same energy industry), which Octopus Energy in Japan does not share.

As the Kraken platform has grown, we have kept refining our layering to help us meet the needs of different clients. The current top-level layers look roughly like this:

# Dependencies flow downwards (higher layers may depend on lower layers)
kraken/
    __init__.py

    clients/
        __init__.py
        oede/
        oegb/
        oejp/
        ...
    
    territories/
        __init__.py
        deu/
        gbr/
        jpn/
        ...
    
    core/

The clients layer is at the top of the structure. Each client has its own subpackage in this layer (for example, oede corresponds to Octopus Energy Germany). Below it is the territories layer, which holds behaviour specific to each country and likewise has a subpackage per territory. At the bottom is the core layer, which contains the code shared by all clients.

We have also made a further rule: the subpackages under clients must be independent of one another (that is, one client must not import from another), and the same goes for the subpackages under territories.
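The independence rule can be sketched as a check on a single import. This is a simplified illustration that looks only at one direct import at a time; the helper names are hypothetical, and the submodules billing, tariffs and accounts are invented for the example.

```python
from typing import List, Optional

# Sibling subpackages that must not import one another (names from the tree above).
CLIENTS = ["kraken.clients.oede", "kraken.clients.oegb", "kraken.clients.oejp"]


def owning_package(module: str, packages: List[str]) -> Optional[str]:
    """Return the sibling package that contains `module`, if any."""
    for package in packages:
        if module == package or module.startswith(package + "."):
            return package
    return None


def breaks_independence(importer: str, imported: str, packages: List[str] = CLIENTS) -> bool:
    """True if the import crosses from one sibling package into another."""
    source = owning_package(importer, packages)
    target = owning_package(imported, packages)
    return source is not None and target is not None and source != target


# One client importing from another client: not allowed.
print(breaks_independence("kraken.clients.oegb.billing", "kraken.clients.oejp.tariffs"))
# A client importing from core: fine as far as independence is concerned.
print(breaks_independence("kraken.clients.oegb.billing", "kraken.core.accounts"))
```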

Structuring Kraken in layers like this lets us change and maintain code within a limited area (such as one subpackage of a layer) with little effort. Because the clients layer is at the top, nothing else depends on it directly, which makes it easier to change something specific to one client without affecting the behaviour of the others.

Likewise, a change inside one subpackage of territories does not affect the other subpackages. This lets teams develop quickly and independently, especially when a change affects only a small number of Kraken instances.

Ensure layered implementation in your project with Import Linter

Once we had introduced the layered structure, we quickly found that describing the layers in theory was not enough: developers often introduced violations between layers by accident. We needed a way to make sure the code actually followed the layering, so we brought a third-party library, Import Linter, into the project.

Import Linter is an open source tool for checking that the imports in a project follow a specified structure. First, we describe the required structure as a contract in an INI file, something like this:

[importlinter:contract:top-level]
name = Top level layers
type = layers
layers =
    kraken.clients
    kraken.territories
    kraken.core

We can also use two more contracts to force the different clients and territories to be independent of one another. Something like this:

# Contract 1
[importlinter:contract:client-independence]
name = Client independence
type = independence
modules =
    kraken.clients.oede
    kraken.clients.oegb
    kraken.clients.oejp
    ...

# Contract 2
[importlinter:contract:territory-independence]
name = Territory independence
type = independence
modules =
    kraken.territories.deu
    kraken.territories.gbr
    kraken.territories.jpn
    ...

Then we can run lint-imports from the command line, and it tells us whether any import in the project breaks the contracts in our configuration. We run this check on every pull request, so if someone adds a non-compliant import, the check fails and the code cannot be merged.

The contracts shown above are not the only ones in our project. Teams can add their own layers deeper in the application: kraken.territories.jpn, for example, is itself layered. We currently have over 40 contracts defining our layering.

Eliminate technical debt

We could not make the whole project comply as soon as we decided to adopt a layered architecture. So we used a feature of Import Linter that lets you list certain imports to be ignored when checking for illegal imports.

[importlinter:contract:my-layers-contract]
name = My contract
type = layers
layers =
    kraken.clients
    kraken.territories
    kraken.core
ignore_imports =
    kraken.core.customers -> kraken.territories.gbr.customers.views
    kraken.territories.jpn.payments -> kraken.utils.urls
    (and so on...)

From then on, we used the number of imports that Import Linter ignores at build time as a metric for tracking how much of this technical debt remains. That way we can see whether the situation is improving over time, and how quickly.
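A metric like this could be collected with a short script. The sketch below is one possible approach, not the one Kraken actually uses: it assumes the contracts live in an INI-style file, reads it with the standard library configparser, and counts the entries under ignore_imports. The function name count_ignored_imports and the example configuration are made up for illustration.

```python
import configparser

# A made-up configuration in the same shape as the contract above.
EXAMPLE_CONFIG = """
[importlinter:contract:my-layers-contract]
name = My contract
type = layers
layers =
    kraken.clients
    kraken.territories
    kraken.core
ignore_imports =
    kraken.core.customers -> kraken.territories.gbr.customers.views
    kraken.territories.jpn.payments -> kraken.utils.urls
"""


def count_ignored_imports(config_text: str) -> int:
    """Count the entries listed under ignore_imports across all contracts."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    total = 0
    for section in parser.sections():
        if not section.startswith("importlinter:contract:"):
            continue
        raw = parser.get(section, "ignore_imports", fallback="")
        # Each non-empty line of the multi-line value is one ignored import.
        total += sum(1 for line in raw.splitlines() if line.strip())
    return total


print(count_ignored_imports(EXAMPLE_CONFIG))  # 2
```

Recording this number on each build gives the data points for a burndown chart.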

[Figure: Ignored imports since 1 May 2022]

The graph above shows how the number of ignored imports has changed over the past year or so. I share this graph regularly to show people our progress and to encourage our developers to work towards full adherence to the layering contracts. We use the same burndown approach for several other kinds of technical debt.

No silver bullet: the downsides of layered architecture

Reality is complex

The real world is complicated, and dependencies are everywhere in a project. Once you adopt a layered architecture, you will often find yourself wanting to break the layering, and you will often end up calling higher-level components from lower-level ones without meaning to.

Fortunately, there is always a way around this kind of problem. It is called Inversion of Control (IoC), and it is easy to do in Python; it just takes a different mindset. It does increase "local complexity", but that is a price worth paying to make the project simpler as a whole.

Too much high-level code in the structure

In a layered structure, components in the higher layers are inherently easier to change. For that reason, we have deliberately made it easy to change code for specific clients or territories. Core, on the other hand, is the foundation of all the other code, so changing it is costly and risky.

That cost and risk discourage us from changing low-level code and push us to write more high-level code for specific clients or territories. The end result is that there is far more high-level code than we would like. We are still learning how to address this.

We're not quite done yet

Remember the imports that we told Import Linter to ignore? Years later, they are still not all resolved: 15 remain at last count. These last few imports are also the most stubborn and the hardest to untangle.

Refactoring an existing project takes a lot of time, so the earlier you introduce layering, the less trouble you will face.

Summary

Kraken's layered structure lets us keep development and maintenance healthy even with this much code, and it is relatively easy to work with, especially considering its size. If we did not restrict the dependencies between tens of thousands of modules, our repository would very likely be a tangled mess.

Instead, the architecture we chose has let us get a great deal done in a single Python code base. It seems impossible, but it is true.

If you are working on a large Python project, or even a relatively small one, give layering a try. As I said before: the earlier you layer, the less trouble you will face.


Origin blog.csdn.net/chinesehuazhou2/article/details/132157770