Notes on software platform architecture design and technical management

Cognition

Leading all aspects of software platform work requires a very high level of technical knowledge, thinking patterns, decision-making ability, work style, and cultural shaping, which together can be called "domain wisdom." The cost of cognitive blind spots is huge: "not knowing what you don't know" has more serious consequences than "knowing what you don't know," and can lead to directional errors.

A large technology team is thousands of people, each with a different face. When indecision and confusion arise in management work, solid cognition is the only psychological anchor that keeps you from drifting like rootless duckweed.

In tedious, often boring IT work, do not forget to look up at the road while keeping your head down: take the initiative to develop more "artistic" sensibility and a more abstract understanding, and eventually the clouds will clear.

When you reach a deadlock at work, return to your cognition and examine yourself; this helps you break out of the cocoon.

Generally speaking, the technical director's responsibilities center on team building and management, with accountability for task completion. The architect works mainly in the engineering and technical fields. The CTO carries more leadership value: beyond management affairs and external relations, the CTO should also build industry influence and win standing for the organization.

Hard skills are the necessary foundation, while soft skills determine the final portrait of the technical leader. Mastering the platform requires the support of "soft power."

If you can become an active member of open source communities or forums, or a participant in formulating industry standards, and thereby increase your influence in related fields, that is a bonus.

The market competitiveness of the platform depends to a large extent on front-end capabilities, user experience, and external evaluation, which directly reflect the strength of the platform's front end in asynchronous communication, lazy loading, compression, rendering, and other capabilities. The speed of product iteration and release depends directly on the architectural strength of front-end componentization and module encapsulation and reuse, and on packaging and deployment capabilities. Following new industry technologies and new ecosystems relies on the front-end team's technology stack and its ability to learn and apply new languages.

The front-end field also remains the area hardest hit by security and compatibility issues.

The distributed development framework solves the expansion and evolution problems of monolithic applications, but the disadvantage is that, from a logical perspective, there are more and more systems, and more and more interactions between cross-process system services.

The widespread application of cloud technology, service mesh, and the integration of development and operations require mastering and using cloud-native capabilities.

The industry's technological development accelerates the refinement of the division of labor, which in turn further promotes technological development. In such a cycle, technical personnel must avoid being "held hostage" by it.

It is no longer possible for the person in charge of platform technology to be a specialist in every subdivided technical field, nor to focus mainly on technical depth. Faced with a growing technical team and increasingly differentiated skills, the technical leader must focus on the overall blueprint from the "platform technical capability perspective" to deliver work value and competitiveness: act as chief architect and evangelist for each product line's development team; carry out global domain planning, granular design of application systems and services, capability support and public abstraction, and platform-level component construction; organize service governance, inter-system relationships, and technology reuse across product lines; and stay clear about which stage current progress has reached, what capabilities the plan should deliver and what their value is, while making rational technical decisions on problems.

You can't have your cake and eat it too. Work decisions within the technical department, especially technical propositions and architectural design, often present scenarios with no single right answer, where each team insists on its own view and each view is objectively defensible. Examples include the risks and opportunities of new technologies, the granularity of microservice splitting, the choice among several development frameworks, the design of relationships between systems and the choice of communication protocols, and the risk estimation of business data migration. Contradictions are everywhere: between short development cycles and strict testing requirements, between quality and delivery speed, between parsing data quickly and using strong encryption algorithms to ensure security. There are no simple answers in technical decision-making. Every design and proposition has its reasons, and the technical leader must bear the decision-making responsibility.

No perfect answer

Decision-making is not about giving perfect answers. The core of decision-making is to communicate as fully as possible within a limited time and make reasonable trade-offs; it is fundamentally the art of "compromise" in pursuit of a balanced goal. In most cases the result is not a single choice, but a balance among multiple viewpoints.

Arbitrary decision-making is one extreme; overly democratic, consultative decision-making is the other. Both have flaws.

The best decisions select among ideas according to the credibility of the opinions. Delineate who participates in the decision and have everyone put forward opinions; those with strong ability and responsibility should also work to resolve each other's disagreements along the way. Opinions from people of different ability carry different weights, and credibility weighting produces a decision reference result.
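
As a minimal sketch of this idea (participants, weights, and scores are all hypothetical), credibility weighting reduces to a weighted average of opinion scores per option:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each participant scores each option, and scores are
// weighted by that participant's credibility in the problem domain.
public class WeightedDecision {

    record Opinion(String participant, double credibility, Map<String, Double> scores) {}

    // Returns the credibility-weighted average score for one option.
    static double weightedScore(List<Opinion> opinions, String option) {
        double weightedSum = 0, totalWeight = 0;
        for (Opinion o : opinions) {
            Double s = o.scores().get(option);
            if (s == null) continue;          // participant abstained on this option
            weightedSum += o.credibility() * s;
            totalWeight += o.credibility();
        }
        return totalWeight == 0 ? 0 : weightedSum / totalWeight;
    }
}
```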

Decision makers do not have to judge everything, but they must be able to end debates when appropriate and know how to transcend differences and promote consensus on next steps. Sometimes delaying a decision (for example, until the next meeting) is a good way to slow things down.

Carefully record the reasons for your decision

For architectural design work, carefully documented decision rationales are themselves records of architectural evolution, and are worth the effort in any case.

Mastering and using industry examples as references can, to a large extent, help us quickly select the appropriate "methodology, expression mode, and tool support."

The word "architectural view" is indeed a classic, but it gives people the feeling of "what is seen from the section and window". It focuses on the expression of results and is slightly lacking in meaning. (Such as a typical 4+1 architecture view: logical view, process view, physical view, development view, scene view)

"Architecture theme" means the identification and analysis of characteristics, as well as the exploration of themes. To visualize it, a view is a cross-section of the entire platform. The front-end technology architecture, big data architecture, unified identity authentication architecture, etc. are vertical components of the platform. They are areas, not a cross-section of the entire platform from a certain perspective. Strictly speaking, the object of architectural design is not a few classic views, but a series of domain objects. There are both cross-sectional views and vertical field views.

You can use methodological tools that help with architectural decisions, such as architecture tradeoff analysis, cost-benefit analysis, and the ADMEMS method.

The development history of architecture

The first generation is the Monoliths architecture:

  • Features: Horizontal layering

  • Pros: Easy to get started, deploy and test

  • Disadvantages: high coupling, single technology selection, low development efficiency

The second generation is service-oriented architecture (SOA):

  • Features: Vertical layering; systems interact through service APIs and a centrally managed Enterprise Service Bus (ESB)
  • Advantages: service-based integration and governance
  • Disadvantages: Each service is essentially still a monolith, and there is heavy reliance on the ESB

The third generation is the MicroService architecture:

  • Features: Combining horizontal layering and vertical layering, treating a single application as a set of small services, distributed technology

  • advantage:

    It has capabilities such as service registration, circuit breaking, fault tolerance, self-detection, and automated release, and it supports fast iteration and continuous delivery. Microservice architecture and distributed development frameworks complement each other, embodying small-team autonomy and rapid, iterative product development.

    Microservices are in fact a better alternative to service-oriented architecture: a decentralized distributed service architecture (DSA) that completely separates service addressing from service invocation and no longer relies on an ESB.

The fourth generation is the Cloud Native architecture:

  • Features: Based on microservice ideas and using containers as the carrier, it provides a new model for product development and operation.

  • advantage:

    In terms of development and operations, it includes: cloud-based capabilities and dynamic resource management; Docker container technology; Service Mesh; Serverless architecture; API servitization; lightweight, stateless services; RESTful style; self-service resource management; continuous integration (CI) and continuous delivery (CD); and DevOps, the integration and automation of development and operations.

To talk about architecture, you must talk about patterns.

In terms of content expression, whether more technical or more conceptual, the essence is a reusable solution to a specific class of problems: that is a pattern. Patterns are everywhere; good design methods are themselves patterns, with no hard boundaries.

The genes of software architecture have inherent "fuzziness":

What exactly is software architecture? Is its core patterns and tools, or concepts and ideas? Is a pattern a style? Is microservices an architectural pattern or an architectural style? Do architectural patterns and design patterns really have boundaries in actual work?

Different experts will have different answers to the above questions.

The development of the fourth-generation architecture reflects decades of practice with core concepts, styles, and patterns; it is not a silver-bullet technology. Unlike the concrete descriptions of the first three generations, the definition of the fourth generation is blurred: it is a broad package of capabilities and corresponds mostly to the widespread application of cloud technology.

However, the way the current four architecture generations are defined will not cover all future mainstreams. The career paths of algorithm developers have probably not benefited from cloud-native architecture, and Hadoop big data developers do not care much about microservice architecture.

In the future there may be no single term that can stand on its own as the shorthand definition of a fifth-generation architecture; instead, development will branch out across various fields.

Whether or not a fifth-generation architecture emerges, the generational label has lost practical significance. Moreover, we should not assume that later generations are better than earlier ones; "good or bad" is too absolute a judgment. We can only say that later generations are new products of industry development. The software itself is not the purpose; what matters is whether it suits the goal. For a very simple small demo, you can still apply the monolithic concept and implement it with the most basic technology stack.

The theory of emerging architecture generations expresses the mainstream currents in development trends, as well as the patterns and methods by which new concepts and new technologies are put into practice.

Architecture should not be considered a "silver bullet," and too much energy should not go into pondering architectural patterns. Understand that a great many design methods live more broadly in programming languages; writing clean code and using automated tests matter more to the success or failure of a system, and remain the core of modern software development.

Govern by doing nothing

The four levels of governance described in Laozi’s Tao Te Ching are:

The lowest level relies on "manager intelligence and resourcefulness".

The second is to use "rules, regulations, and provisions" for formalized management and restraint.

The higher level is the virtue level of "tolerance and benevolence".

The highest level of the four levels is "governing by doing nothing".

Governing by doing nothing means standing aside and observing more, not being eager to exert authority, and not making "active efforts" everywhere. What the technical leader may lack most is listening.

Learning to let go is true understanding of the essence of management. The technical leader should pay attention to whether the team is executing according to the design and work plan, but there is no need to stand behind everyone and monitor them.

Delegating authority and giving team members sufficient autonomy and room to unleash their creativity and abilities may be the greatest value a technical leader provides.

Architectural design thinking principles and patterns

  • People oriented:

    The essence of design work is interaction between people: understand stakeholders' requirements and drive understanding within subordinate teams. Respect all stakeholders and think from their perspectives.

  • Delayed decision making

    Ambiguous projects are dangerous. Don't rush to make final design decisions until conditions are ripe.

    Some can even be placed outside the scope of the work and left to subsequent designers to decide. Remember, this is in no way an avoidance of responsibility.

  • Learn from and reuse

    You can start your own design from others' work, or use frameworks others have built to solve the problem. When designing architecture, spend more time studying existing designs, reduce original invention, and avoid inefficient production. People who have read widely and carefully studied a large number of design patterns therefore appear more capable, as long as they do not suffer from "pattern disease" (applying patterns for their own sake).

  • Turn the imaginary into reality

    Let the audience understand and digest the idea through perceptual knowledge. If others cannot accept the idea, then no matter how good the design concept and creativity, it cannot generate value.

  • Understand

    Research the business goals stakeholders care about, understand the important business requirements and, more importantly, the non-functional requirements, including the development team's resources and style and even office politics; all of this must be mastered.

  • Explore

    Define concepts scientifically and rigorously. Architectural design exploration means forming a series of design concepts and determining the engineering methods to solve problems, including studying a large number of patterns, technologies, and development methods.

  • Exhibit

    Exhibition not only embodies the emphasis, among the four principles, on turning the imaginary into reality so that others understand and accept your design ideas; it is also used for architecture evaluation and verification. Presenting ideas is the actual output of architecture work, advancing negotiations and shaping plans. Therefore, focus on improving the expressiveness of architectural deliverables.

  • Evaluate

    The only way to share architecture is to make it concrete. It is easy to see the role of evaluation: validating the suitability of an architectural design and determining whether it meets the needs of the various stakeholders.

Being good at technology and being good at managing upward are two different kinds of ability, and technical leaders need both. In terms of value and resources, ten thousand lines of code cannot compare with one word of recognition from leadership.

Architecture

Boundary thinking

When designing technical functions based on the business needs of a large software platform, you should refer to or follow DDD as a best-practice guide. Use the bounded context pattern to identify all contexts (business-based problem spaces), classify the contexts into core domains, supporting domains, and generic domains, form a corresponding domain model for each classification, and design the relationships between domains through context mapping.

The subtlety of bounded contexts is that microservices are essentially equivalent to bounded contexts in DDD, so they can be used to set microservice granularity. Boundary thinking suits not only the construction of new platforms and systems, but especially the overall reconstruction of "big ball of mud" legacy systems.

Architectural Patterns and Design Patterns

Patterns provide the most valuable tools for software architects. Patterns are layered by level and can be divided into architectural patterns and design patterns. The former are relatively high-level and aim to provide the overall skeleton of the system architecture; the latter are finer-grained and solve common problems, such as organization and communication at the component level of a system (or a key technology within a business).

Patterns are templates and tools abstracted and distilled by technical predecessors, easy for later generations to reuse, and thus help us carry out technical design concisely and effectively. That is their correct usage. We must maintain clear insight across all technical work: provide practical, effective solutions that serve the business, design appropriately, and strike a balance.

In platform construction, first carry out a minimal implementation, deploy early, and establish the architecture. Realizing the "established architecture" means immediately opening up the paths of development, testing, deployment, and maintenance; immediately connecting the front-end engineering scaffolding, front-end/back-end communication, back-end access to cache and database, and the pipeline; and verifying the service registration and discovery of each service node, the effectiveness of the communication mechanism, and even the availability of load balancing.

Keeping the architecture in a usable state and performing incremental deployments helps build confidence that the whole is moving in the right direction, and a series of serial work becomes more flexible. Not every product has to wait until it is perfect to be released. Incremental deployment means incremental, selective cultivation and iterative verification on top of a workable architecture.

Open up the data model

What is the data model? Essentially, it includes the definition of entities, relationships, and attributes, as well as the normal forms used. Different normal forms impose different principles and constraints on data redundancy and data integrity.

The current development process of databases can be divided into three stages:

  • The first stage: the decade or so after 2000, the world of large commercial databases

    Such as Oracle, IBM DB2, Teradata

  • The second stage: after about 2006, open source databases such as MySQL

    The open-source attribute gave rise to clustered deployment, as well as the widespread application of database sharding, table sharding, and read-write separation, achieving a true balance between investment cost and performance.

  • The third stage: from 2012 to the present, the Hadoop ecosystem driven by the big data industry and the development of distributed computing

    Including NoSQL, HDFS storage, unstructured databases (such as HBase columnar database) and data warehouse (Hive), etc.

In the first two stages, the core progress was the leap in database software technology brought by Internet thinking and open-source technology. However, the core elements, "the form in which data exists and its main application mode," did not change: the mainstream remained structured storage and two-dimensional table output under SQL-standard normal forms.

Technology in the database field keeps changing, but the attribute and concept of "content is king" have not changed. The system-construction idea of building a solid database fortress has not changed in the past 20 years.

The user interface and application logic will change, the business will develop, and personnel will change, but data is retained forever. Whether two-dimensional structured or columnar unstructured, the immutable characteristics of data do not change. Given these characteristics, build a solid data model from day one.

Another meaning of content being king is that the database is given the highest mission from the perspective of enterprise security. Data has the highest security level in the system and is the most valuable asset in the enterprise.

Applications can be rebuilt, but without data there is nothing. Platform operation and maintenance should focus more on databases and core data. Encryption and desensitization and cold/hot backups are basic, necessary platform capabilities. Meanwhile, important operational work such as regular attack-defense and switchover drills, disaster recovery center construction, and the various qualifications and ratings should all take data as the theme.

Understand operating water levels and conduct regular drills

Carry out platform water-level management and proactively grasp the system's operating pressure line; know yourself and know the enemy, and you will never be in danger. There are generally three methods for assessing the operating water level:

One is the TPS estimation method (illustrated in the sketch after this list).

The second is the operating system resource usage estimation method.

The third is the historical value estimation method.
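
As a rough illustration of the first method (the capacity figure would come from load testing; all numbers and names here are hypothetical), the water level can be expressed as peak TPS over capacity TPS:

```java
// A rough sketch of the TPS estimation method: water level as the ratio of the
// observed peak TPS to the measured capacity TPS (capacityTps would come from
// load testing; both values below are illustrative).
public class WaterLevel {
    static double waterLevel(double peakTps, double capacityTps) {
        return peakTps / capacityTps;  // e.g. 1200 / 2000 = 0.6, i.e. a 60% water level
    }

    public static void main(String[] args) {
        double level = waterLevel(1200, 2000);
        System.out.printf("Operating water level: %.0f%%%n", level * 100);
        if (level > 0.8) System.out.println("Warning: approaching the pressure line");
    }
}
```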

A regular drill mechanism can reveal "the iceberg beneath the water's surface." Switchover drills at the active/backup center level, physical machine downtime drills, core database outage drills, and replacing the network DNS resolution provider are all good drill topics. Remember: finding a problem once in an actual drill is worth more than emphasizing self-inspection ten thousand times in meetings. This benign self-examination method is the best answer to technical staff's inertia toward "self-examination and self-correction."

Operation and maintenance manuals and technical white papers

The operation and maintenance manual should be positioned so that all elements are familiar and can be found immediately when needed.

Troubleshooting may be more suitable to be placed in the operation and maintenance knowledge base, and it is better represented by a list of problems and processing methods. The operation and maintenance manual is the related material that troubleshooting work relies on, and it is impossible to foresee all failure scenarios in the operation and maintenance manual.

Regarding technical white papers, industry practitioners believe they are better suited to low-level technology applications. For example, a white paper on a communication protocol or a storage technology pays more attention to technical principles and specifications; the content is deep and informative, a thoroughly professional technical document. A technical white paper on a big-name topic like blockchain is reflected more as an industry standard, even taking the form of an international standard.

The white paper of an application system platform is more commonly understood as the standard services, access, and integration methods the platform publishes so that any developer can carry out secondary development or customer-side access under an authorization model with strict boundaries but an open implementation form. This type of white paper is in fact a high-standard technical manual for developers.

It is recommended that the technical leader organize the various technical teams to define one major version each year and to compile and publish a company-wide technical white paper. The main content can include:

  • Various internal standard specification documents: interface specifications, development specifications, operation specifications, database specifications, etc.
  • Highly abstract expression of platform-level capabilities
  • The most important part of the technical white paper: solemnly state the platform's capacity, TPS/QPS (concurrency performance), SLA (platform availability), and RT (service response time) indicators, etc.
  • Statistics on delivery and quality, number of annual releases (and corresponding number of requirements), average delivery cycle, etc.

Integration of models and code

The domain model is the core of DDD. If the ideas the architect embeds in the design model cannot be reflected in the code, and the model becomes decoupled from the code, then the thinking and deduction behind non-functional designs such as performance, availability, and scalability also become useless. Therefore, pay attention to integrating the domain model into the code and reducing the deviation between code and model.

The organization of code packages, whether by layer or by functional module, should match the designed module structure and highlight the architecture. This should be treated as a standard development specification.

To implement clear and accurate usage relationships between architectural elements, you need to pay attention to access restrictions in the code module structure, or distribute the modules as libraries. If these are not possible, then you can consider using tools to monitor these usage relationships.

Practice design by contract: set usage conditions in the code and check them at runtime. If the contract is violated, an error should be thrown and the operation terminated. This can be applied at various granularities, including objects, services, threads, and processes.
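
A minimal Java sketch of this contract style, with hypothetical account-transfer preconditions:

```java
import java.util.Objects;

// Design-by-contract sketch: preconditions are checked at runtime, and a
// violated contract throws immediately instead of letting bad state spread.
public class TransferService {

    public void transfer(String fromAccount, String toAccount, long amountCents) {
        // Preconditions (the "contract" of this service method)
        Objects.requireNonNull(fromAccount, "fromAccount must not be null");
        Objects.requireNonNull(toAccount, "toAccount must not be null");
        if (amountCents <= 0) {
            throw new IllegalArgumentException("amount must be positive: " + amountCents);
        }
        if (fromAccount.equals(toAccount)) {
            throw new IllegalArgumentException("cannot transfer to the same account");
        }
        // ... business logic runs only after the contract holds ...
    }
}
```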

Upgrading code modules into components, using a microservice architecture for component-level call management, and using containers for component-level process encapsulation and lightweight distribution are good ways to keep the implementation under the model's control.

In addition, you can also rely on traditional code review meetings to check the integration of models and code.

Top-level design

Based on the analysis of business needs, top-level design must be three-dimensional, labeling the platform as much as possible; the definition of these labels and the answers to their questions drive the platform architecture design. The output of technical architecture work is generally understood as "a rich expression of the software platform and the systems to be implemented, from multiple aspects, perspectives, and dimensions." Once each aspect can be expressed, the architectural design of the entire platform emerges.

Layered overall architecture

The layered architecture is the most common and widely used architectural model in the industry. Although microservice thinking and distributed architecture have become popular, and the hexagonal architecture paradigm is used more abroad, domestic teams remain most enthusiastic about the layered architecture. Its characteristics: simple and easy to understand, easy to draw, and able to express at once, to the greatest extent, the user layer, business functions, technical components, underlying data, operating systems, the basic platform environment, and much more.

Facing various communication targets such as leaders, business parties, and partners, and considering various work scenarios such as project initiation, reporting, basic presentations, and training, only this structural paradigm can express the most general knowledge.

Among all architecture topics, the layered architecture is the coarsest-grained and should be done first. The main content is horizontal layering (each layer a horizontal rectangle), with each upper layer relying on the lower layer for support, forming a typical top-down division.

It can be divided as follows:

  • User layer:

Covers all client types: C-side mobile apps, mini programs, PC-side browsers, etc.

  • Gateway layer (can also be defined as access layer):

It mainly expresses the form of product output: describing the various access methods such as the gateway system, APIs, and file transfer; describing the network links through which the user layer accesses, including the Internet and dedicated lines; and describing the positioning and functional responsibilities of the gateway layer, such as service publication, caller identity authentication, key distribution, protocol conversion, request routing and distribution, and flow control.

  • Application layer:

It is the application system layer, also called the business logic layer.

It describes the systems (or modules) the platform will implement, which by the DDD classification fall into three categories: core domain, supporting domain, and generic domain.

The application layer is the most important layer in the overall framework. It is the product of the layers below it and provides business services to the user layer above it, expressing what products and business services the entire platform will ultimately deliver.

  • Technical capability layer

The technical capability layer is also called the shared (or public) technical component layer. This is the core of the technical architecture from a logical perspective. This layer reflects the platform's technical components, which can include the encapsulation and output of common basic technologies such as message channels, log services, search services, desensitization and encryption, rule engines, BI reports, and indicator calculation, as well as specific component services required by platform business, such as text parsing, image recognition, speech processing, and real-time video.

If the application layer plays the most important role in the layered architecture, then the technical capability layer is the most important area of the architect's work. As the technology sharing center of the entire platform, it is the manufacturing workshop for all technical and business components.

  • Outreach service layer

Includes two types of outreach services, technical and business.

The first is to describe the technical third-party services accessed by the platform, including email, SMS services, etc.

The second is to describe the services provided by the business partners of the platform for external access, such as payment services.

For example, if the platform's payment service connects to UnionPay, NetsUnion, or XX Bank, all of the platform's business partners become visible at a glance.

  • Middleware layer

Describe the various technical middleware on which the platform runs, including the application containers and environments used, main communication protocols, caching, load balancing, application configuration and service registration, big data processing software, etc.

Because of the core position of data, many layered architectures carve out a separate layer for the database, the data persistence layer, to focus on expressing data and file storage.

  • Platform capability layer

If a cloud platform based on XX technology is used, that should be stated here, along with the native capabilities it provides, such as internal network division, resource management and containers, service orchestration, grayscale release, and service degradation.

  • Other sectors

Add vertical rectangles on the left and right as appropriate to carry the operations, testing, and management sections, making the overall picture more comprehensive: the code repository, build and release tools, code inspection tools, documentation knowledge base, project management tools, and testing tools in use.

Interaction design

Interaction relationship design can be divided, by granularity and purpose, into "interaction process design" and "system logical relationship design." Interaction relationship design focuses on the business-function connections among the roles and endpoints involved; technical relationships are not its main point.

Interactive flow chart design

The interaction flow chart essentially designs the business process, including each node (process handling point) and the temporal interaction relationships between nodes.

Best practice: an interaction process must start with user access and end with service completion.

The number of interaction links in a single flow chart should be moderate; 20 to 30 is a pleasing number.

In addition to using swim lane diagrams, you can also use a three-dimensional diagram to express the interaction process.

(Figure: swim lane diagram example)

(Figure: three-dimensional view)

(Figure: 3D detail)

System logical relationship design

As another way to express interaction relationships, the system logical relationship describes the (business interface) service calling relationships among the platform's application systems; calls to purely technical systems (such as encryption machines) and shared technical capabilities (such as SMS channels) need not be included.

There are many service interfaces between systems and no reference standard; such diagrams are generally free-form three-dimensional drawings with a relatively high drawing threshold. It is recommended to design and maintain one if capability and resources allow. Another way is to raise the granularity from application systems to sections and maintain the interface calling relationships between sections. Sections can basically correspond to business (product) lines; that is, the application systems under one business line are grouped into one section.

Data architecture design

The platform's top-level data architecture first focuses on dividing areas, using data areas to outline the overall structure. The divide-and-conquer idea embodied in area division is the first principle of most architecture theme designs. Data architecture is a huge design system spanning business, technical, and management aspects. Each aspect is highly three-dimensional and divides into a logical perspective and a physical perspective, which is not hard to understand: the logical perspective faces business themes, the physical perspective faces technology and implementation, and the physical perspective is the technical realization of the logical design.

Business perspective design

1. Divide various data areas

The data areas of a complete platform can include: the online business processing and transaction data area (such as customers, merchants, orders, payments), the business support and management data area (such as billing, risk control, customer service), the data service data area (such as data recommendation), the data management data area (such as data governance), and other involved resource data areas (such as a certain industry's data or a certain public service's data).

2. Describe the logical relationship between data topics

Based on the business perspective, describe the logical relationship between each data topic in various system areas of the platform, as well as the transfer processing method.

3. Design ETL functions

Carry out partition management planning for transaction data and for data used in analysis and statistics, use different types of databases, and design detailed, effective data extraction, transmission, and transformation processes between the two zones.

4. Hierarchical planning of data warehouse

Hierarchical planning of the data warehouse mainly refers to its layered design, generally including the operational data store (ODS) layer, the data warehouse (DW) layer, and the data mart (DM) layer. The data warehouse and ETL (Extract-Transform-Load) need to be connected and integrated.

5. Reflect data management

Data management includes data visualization operations, data sharing, and data permission management, reflecting the content of data hierarchical and classified management.

6. Reflect data governance

Low-level data governance includes metadata management, data standards, data quality, and data asset management. High-level data governance should include data tags, data maps, data lineage, and even data sandboxes.

Engineering Technology Architecture

The core of engineering technology architecture is setting the development frameworks and technology stacks used in each major field, and how the chosen technologies combine with engineering (that is, application systems).

Traffic distribution design

Traffic distribution is the platform architecture theme that best embodies aesthetics, and it matters even more on dual-center (generally active and backup data centers) or multi-center platforms. Traffic is to the platform what the meridians are to the human body: clearly grasping the platform's traffic distribution is like a doctor reading the heartbeat and pulse, and it is an important foundation for all platform operations work.

This traffic is the "technical traffic" generated by business requests, that is, the volume of calls between system nodes. It can be called link traffic.

Link traffic is calculated as follows:

  • Horizontal direction

    It can be divided into two sections. The first section is between the platform entrance and the application gateway, and the second section is from the application gateway to the application system area.

  • Vertical direction

    Choose an appropriate granularity, split the platform's business sections and functions vertically, and determine the number of visits per unit time to a given page (or pages) or API (or APIs). For each vertical unit, starting from the business visit count, expand step by step through all the application nodes accessed to complete that business request, calculating each cross-node call along the way.

The unit of traffic on a link is not bandwidth but the number of calls.
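
A toy calculation for one hypothetical vertical slice (node names and fan-out factors are assumptions) shows how entry-level business visits expand into per-link call counts:

```java
// Illustrative only: a vertical slice where one business request to the gateway
// fans out into downstream calls. Node names and fan-out factors are assumed.
public class LinkTraffic {
    public static void main(String[] args) {
        double pageRequestsPerSec = 500;                  // entry -> application gateway
        double gatewayToOrder = pageRequestsPerSec * 1;   // each request hits the order system once
        double orderToAccount = gatewayToOrder * 2;       // order system calls account service twice
        double orderToPayment = gatewayToOrder * 1;       // and the payment service once

        System.out.println("entry   -> gateway : " + pageRequestsPerSec + " calls/s");
        System.out.println("gateway -> order   : " + gatewayToOrder + " calls/s");
        System.out.println("order   -> account : " + orderToAccount + " calls/s");
        System.out.println("order   -> payment : " + orderToPayment + " calls/s");
    }
}
```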

Application deployment design

The deployment architecture focuses on describing, from a logical perspective, how systems at each level are realized, that is, how the application systems combine with the platform's operating environment, for use by application operations personnel and in operations-related work scenarios. The application deployment architecture is not a true physical architecture: detailed physical information such as servers, virtual machines, and network devices need not appear directly in the application deployment design output.

System communication design

Communication design specifically refers to the communication network, communication methods and protocols used between various application system processes on the platform.

The so-called "network" here refers to the network channels, addresses and network services used for application system communication, and does not include network engineering (such as physical network, switch router network device configuration, firewall policy, etc.).

Application security architecture

Security governance

Security governance mainly refers to the security organization and its assessment and evaluation systems, as well as security training, security assurance, and compliance-oriented security management.

Security management

Security management mainly refers to the security system, employee security awareness and capacity building, full-cycle system security management (all processes from requirements analysis to launch, operation, and maintenance), data usage security management (application, storage, transmission media, destruction, etc.), and security incident management.

Security technology

Security technology mainly includes application security, network security, host security, data security, as well as office security and operational security, etc.

Security audit

Security audit mainly includes security supervision, compliance audit, risk assessment, operation and maintenance audit, etc.

The operating mechanisms and working strategies of the various application security technical measures can be classified into the following three types:

Real-time protection, with relatively fixed rules

Such as program hardening, access identity authentication, and encryption of data in transmission and storage

Monitoring and identification, relying more on model and strategy capabilities

The first is situational awareness: mirror the traffic and monitor its security. The second is risk-control early warning: feed business requests into the risk-control model for analysis and processing; when the set security threshold is reached, raise an early warning and trigger a disposal strategy.

Active attack, detection technology, and assessment

That is, arrange vulnerability scanning, regular inspections, attack-defense drills, and the like, and build the corresponding professional security capabilities and tools (such as for hacker attacks).

Log system design

A log is an abstraction of the changes during system operation; its content is the ordered, time-based combination of the operation results of specified objects. Logs are to a system what underground pipes are to a city: on the surface, underground pipes do not represent the city's construction level, but a heavy rain will explain everything.

Logs carry all textual data and have extremely high reliability; the risk of a log write failing is lower than that of almost every other type of system operation. As a cornerstone of platform operation, establishing a log center from a global perspective is an inevitable trend for medium and large distributed application platforms.

Log hierarchical classification:

1. Business log

2. Monitoring logs

3. System log

4. Console log

Also consider log aggregation and usage:

The biggest purpose of a log center is to aggregate logs and view them in one window, which resolves the security risks and operational complexity of frequently logging in to each system node's console.

Operation guarantee

High availability system design

(1) Full chain redundancy mechanism.

There are multiple channels available; when one channel fails, automatic switchover or fault isolation happens in time, without affecting the overall service.

(2) Defense and degradation capabilities.

Proper protection allows the platform to contain many scenario problems within a limited scope and prevent them from becoming failures.

(3) Application release guarantee.

Use the grayscale publishing mechanism to avoid platform failures caused by problems with the application itself.

(4) Apply high-performance design.

This refers to the ability to cope with heavy load and to maintain normal response times under high pressure.

Design of redundancy mechanism

Redundancy means there are two options, A and B: if A is not usable, switch to B, and vice versa. During the switchover, the business and customer-side experience carried by the platform are not affected at all, or the impact is negligible.

Defense, by contrast, has only A: if A is not fully usable, let it degrade to A- and support the business with A-. During the period of running on A-, some traffic cannot be processed or is rejected, and the platform's services are discounted.

Several aspects of redundancy mechanism design:

Multi-node load probing

The best-known high-availability deployment solution is "load balancing + horizontally deployed multi-node application systems." The load balancer can detect node liveness and distributes business requests to the "live" application nodes according to load rules, generally of three kinds: round-robin (even distribution), random distribution, or fixed-proportion distribution.
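
A minimal sketch of the round-robin rule combined with liveness detection (the probe is injected here; a real balancer would use HTTP or TCP health checks):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Minimal round-robin balancer that skips nodes failing a liveness probe.
public class RoundRobinBalancer {
    private final List<String> nodes;
    private final Predicate<String> isAlive;
    private final AtomicInteger cursor = new AtomicInteger();

    public RoundRobinBalancer(List<String> nodes, Predicate<String> isAlive) {
        this.nodes = nodes;
        this.isAlive = isAlive;
    }

    public String next() {
        for (int i = 0; i < nodes.size(); i++) {
            String node = nodes.get(Math.floorMod(cursor.getAndIncrement(), nodes.size()));
            if (isAlive.test(node)) return node;   // first live node in the rotation
        }
        throw new IllegalStateException("no live nodes");
    }
}
```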

Fault isolation mechanism

The fault isolation mechanism is an encapsulated upgrade of simple horizontal multi-node probing: a group of application nodes with upstream-downstream calling relationships is formed into a cell, and probing individual applications becomes probing cells. Unavailable cells are closed and no longer receive traffic. For example, if three application systems A, B, and C call each other in cascade, with A deploying 6 nodes, B 12 nodes, and C 6 nodes, then 2 A + 4 B + 2 C can form one cell, defining 3 cells in total; load high availability is thus upgraded from nodes to cells.

Real-time switching between active and standby nodes

If multi-node load probing is not feasible, you can use the web-service high-availability solution implemented with Keepalived to avoid single points of failure. A web service runs Keepalived on at least two servers, one as the master (Master) and one as the backup (Backup), which present a single virtual IP to the outside. This is an older redundant deployment mechanism, but it still works. The VRRP protocol used by Keepalived was designed to solve the single-point-of-failure problem of static routing.

Application service dual channel

Redundancy from the application-service perspective must be realized through self-development. The core idea is very simple. Take the SMS channel as an example: select two service providers and have the application implement dual access; you cannot choose just one channel and hang yourself from that one tree.

Application dual-channel is a widely used design method, most commonly for resource connections. For example, in an active-active center setup, the application's Redis connection string and MySQL JDBC connection string are configured with both a primary address and a standby (StandBy) address, and the application is responsible for probing the addresses to achieve high availability of resource access.
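
A sketch of the primary/standby idea (the connect function stands in for a Redis or JDBC connection factory; addresses are placeholders):

```java
import java.util.function.Function;

// Dual-channel sketch: try the primary address first, fall back to the standby.
public class DualChannel<T> {
    private final String primaryAddr;
    private final String standbyAddr;
    private final Function<String, T> connect;

    public DualChannel(String primaryAddr, String standbyAddr, Function<String, T> connect) {
        this.primaryAddr = primaryAddr;
        this.standbyAddr = standbyAddr;
        this.connect = connect;
    }

    public T open() {
        try {
            return connect.apply(primaryAddr);
        } catch (RuntimeException primaryFailure) {
            return connect.apply(standbyAddr);   // degrade to the standby channel
        }
    }
}
```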

Microservice registration and discovery mechanism

Use the service registration and service discovery mechanisms provided by the microservice distributed development framework (or frameworks such as ZooKeeper) to manage microservice calls across multiple nodes, kick unavailable nodes out of the list, and ensure that available nodes are selected when services call each other.

The registration and discovery mechanism of microservices provides availability management of multi-node services; it is not only redundancy but also a mechanism that supports high performance. The difference from multi-node load probing is that microservices rely on the load and scheduling capabilities provided by the development framework itself.
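
The contract such frameworks provide can be sketched with a toy in-memory registry: nodes register via heartbeats, and discovery returns only nodes whose heartbeat is still fresh (the TTL and data structure are illustrative, not any framework's actual API):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy registry: heartbeat-based registration, discovery filters stale nodes.
public class ServiceRegistry {
    private static final long TTL_MILLIS = 10_000;
    private final Map<String, Map<String, Long>> services = new ConcurrentHashMap<>();

    public void heartbeat(String service, String nodeAddr) {
        services.computeIfAbsent(service, s -> new ConcurrentHashMap<>())
                .put(nodeAddr, System.currentTimeMillis());
    }

    public List<String> discover(String service) {
        long now = System.currentTimeMillis();
        Map<String, Long> nodes = services.getOrDefault(service, Map.of());
        return nodes.entrySet().stream()
                .filter(e -> now - e.getValue() < TTL_MILLIS)  // kick out stale nodes
                .map(Map.Entry::getKey)
                .toList();
    }
}
```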

Build a disaster recovery center

The disaster recovery center is the largest-granularity redundancy solution. It can also be called the operation guarantee mechanism for the most extreme situations, covering disaster-level events, such as earthquakes and citywide power failures, that other solutions cannot. Disaster recovery can take multiple modes, such as active-active and active-standby.

Defense degradation design

The defense system protects the platform itself when platform performance, or some link, is stuck and cannot satisfy all external traffic requests, using strategies such as rate limiting, circuit breaking, maintenance suspension, and timeout disconnection. Defense lets the platform avoid avalanches, heal itself quickly, and minimize losses.

Rate limiting mechanism

Control the rate at which clients request access at the traffic entrance, and deny service beyond the limit.
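
A common way to implement this is a token bucket; a minimal sketch (capacity and refill rate are illustrative):

```java
// Minimal token-bucket rate limiter: requests consume tokens; the bucket
// refills at a fixed rate, and requests are rejected when it is empty.
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill = System.nanoTime();

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) { tokens -= 1; return true; }
        return false;  // over the limit: deny service
    }
}
```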

Circuit breaker

At each service's traffic entrance, sample a certain proportion of requests and measure the time from receiving the request to returning the result. If the response time of a certain percentage of samples (say 50%) exceeds the platform's service response time limit (generally 8 to 10 seconds for Internet platforms), the service is considered unavailable and requests are refused immediately (generally for a period of 30 seconds). After the period ends, service resumes automatically, repeating period by period.
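
A simplified breaker following the mechanism just described, with illustrative thresholds:

```java
// Simplified circuit breaker: sample response times, open when too many samples
// exceed the limit, and resume automatically after a fixed cool-down period.
public class CircuitBreaker {
    private final long limitMillis = 8_000;      // service response time limit
    private final long coolDownMillis = 30_000;  // refusal period once opened
    private final int windowSize = 100;          // samples per evaluation window
    private int samples, slowSamples;
    private long openedAt = -1;

    public synchronized boolean allowRequest() {
        if (openedAt >= 0 && System.currentTimeMillis() - openedAt < coolDownMillis) {
            return false;                         // breaker open: refuse immediately
        }
        openedAt = -1;                            // cool-down over: resume service
        return true;
    }

    public synchronized void record(long responseMillis) {
        samples++;
        if (responseMillis > limitMillis) slowSamples++;
        if (samples >= windowSize) {
            if (slowSamples * 2 >= samples) {     // >= 50% of samples too slow
                openedAt = System.currentTimeMillis();
            }
            samples = slowSamples = 0;            // start a new sampling window
        }
    }
}
```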

Service maintenance

Under high platform pressure, shut down low-priority (non-core, degradable) services. The implementation is generally maintenance suspension: the suspended entry link is temporarily blocked and taken offline, or a pop-up window tells the customer that the system is busy or under maintenance (with a statement of the maintenance window). This approach is a vertical service maintenance mechanism.

A horizontal service maintenance mechanism can also be provided: a service's backend application has 10 horizontally deployed nodes; under high platform pressure, two of them can be set not to serve the business, returning a "system busy or under maintenance" response code to the front-end page or API gateway.

To summarize briefly: vertical maintenance turns off selected services, and horizontal maintenance turns off a selected proportion of traffic. The best plan combines the two, a vertical + horizontal service maintenance model that can fine-tune which services, and how much traffic, to turn off.

Timeout disconnection

The complete processing of a request generally involves cascade calls across multiple systems, which is the link concept from the log system design. Links include calls between internal systems and calls from systems to databases or external third-party systems. Each system should set a timeout on its calls (often 3 seconds); when it expires, the connection is dropped, protecting system ports from being crushed by masses of dead TCP/IP connections and allowing the system to heal itself.
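
With the JDK's built-in HTTP client, the 3-second rule of thumb maps to explicit connect and request timeouts (the URL below is a placeholder):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Setting explicit timeouts so dead connections are dropped instead of piling up.
public class TimeoutCall {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))   // connection establishment timeout
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.internal/api/orders"))
                .timeout(Duration.ofSeconds(3))          // whole-request timeout
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```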

Function downgrade

Many low-level and auxiliary functions are implemented by independently deployed systems or by calling third-party system interfaces, and their failure rate is no lower than that of core functions. Here, consider using preset "baffles" to downgrade for temporary emergency relief, so that the unavailability of such functions does not block overall functionality and trigger customer complaints. For example, if the advertising service fails to load, a fixed static page can be shown instead.
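
A minimal sketch of such a baffle: wrap the auxiliary call and serve preset static content on failure (class and content names are illustrative):

```java
import java.util.function.Supplier;

// "Baffle" downgrade sketch: if the auxiliary call fails, serve preset static
// content instead of propagating the failure to the whole page.
public class AdSlot {
    private static final String STATIC_FALLBACK = "<div>Welcome!</div>"; // preset baffle

    static String render(Supplier<String> adService) {
        try {
            return adService.get();
        } catch (RuntimeException adFailure) {
            return STATIC_FALLBACK;   // degrade: fixed page instead of an error
        }
    }
}
```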

Elastic scaling

Elastic scaling includes elastic expansion and elastic contraction. Elastic expansion refers to automatic horizontal expansion when traffic pressure is high and service response time becomes long, while elastic contraction means that when traffic decreases, it can automatically reduce the number of nodes, recycle excess resources, and save the overall cost of the servers occupied by the platform.

Generally speaking, the essence of automatic defense is "slow release": that is, blocking at the entrance and recovering internally by releasing connections or expanding capacity.

Release Guarantee

1. Use tools for automated deployment

2. Step by step, batch and grayscale release:

Application deployment is not the same as release. Deployment is the technical rollout; release is the real business rollout, meaning traffic is let in. Deployment and release may be operationally continuous, but their meanings differ greatly, and release is the last barrier before actual launch.

The core idea is to proceed step by step, in batches. First deploy and release part of the nodes (for example, 2 out of 10), that is, let traffic in; once production is verified correct, the other nodes can be deployed and released.

Grayscale release first delineates grayscale cells. Its core is to deploy and release the nodes managed by those cells, called the grayscale production environment. Traffic sources are distinguished logically at the traffic entrance (for example, using an identification field in the request header), directing the specified traffic to the grayscale environment for production verification; after verification passes, all nodes are deployed and released.

3. Let the program warm up to achieve smooth launch

Looking at launch from the JVM's runtime perspective reveals the problem: when traffic is released to a newly started node at a business peak, a large number of access requests enter the freshly started program, triggering JIT compilation of a large number of class files simultaneously.

To sum up, when releasing an application, give each node a warm-up process so it can finish compilation smoothly. One method is to choose a low-traffic period (usually at night) to launch key applications; another is to have the load balancer control the traffic distributed to newly launched nodes, increasing it gradually in steps so the node gets a sufficient warm-up process.
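
The second method can be sketched as a stepwise weight schedule for the load balancer (step values and timings are illustrative assumptions):

```java
// Stepwise warm-up schedule for a newly released node: the load balancer's
// weight for the node grows over time so JIT compilation finishes under light load.
public class WarmupRamp {
    private static final int[] WEIGHT_STEPS = {5, 15, 40, 70, 100}; // percent of full share
    private static final long STEP_MILLIS = 60_000;                 // one minute per step

    // Weight the balancer should assign, given how long the node has been up.
    static int weightFor(long upMillis) {
        int step = (int) Math.min(upMillis / STEP_MILLIS, WEIGHT_STEPS.length - 1);
        return WEIGHT_STEPS[step];
    }
}
```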

Apply high-performance design

For high-performance design, the focus is from an application development perspective.

Asynchronous, caching, and parallelism are the three core concept mechanisms for achieving high-performance systems:

The utility of asynchrony shows in two aspects: one is keeping the main function from blocking so it completes as soon as possible; the other is converting serial processing into parallel processing;

Caching directly increases access speed;

Parallelism is to make full use of computing resources and do more things in the same time, focusing on the application of technology such as resource pools, multi-threading, connection pools, (distributed) multi-node simultaneous computing, etc.

Hot data cache:

Here is a performance optimization method for mid-tier applications: hot data caching. Take the sales ranking on a web page as an example. The sales of all items change constantly, so there is no way to prepare the data in advance. Such high-frequency "hot" data has high heat but relatively low business attributes (that is, little need for persistent storage). Consider a lightweight key-value cache, so that customer query requests for this type of business never pass through to the database, thereby achieving extremely high query performance. When sales change, the cache must be actively updated; depending on the situation, non-real-time updates may be acceptable. In any case, the database must be protected from the impact of such business.
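
A minimal in-process sketch of the read and write paths (a real system would use Redis or another key-value store rather than a local map):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hot-data cache sketch: ranking queries never touch the database; the order
// flow actively refreshes the cached value on each sale.
public class SalesRankCache {
    private final Map<String, Long> salesByItem = new ConcurrentHashMap<>();

    // Read path: served entirely from the cache.
    public long sales(String itemId) {
        return salesByItem.getOrDefault(itemId, 0L);
    }

    // Write path: actively update the cache when sales change, so reads stay
    // fresh without querying the database.
    public void recordSale(String itemId, long quantity) {
        salesByItem.merge(itemId, quantity, Long::sum);
    }
}
```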

Monitoring and alarm system

Building platform monitoring capability is a task that looks simple, with a low entry threshold, but has an extremely high ceiling if you want to do it well. The fundamental reason is that beyond the technology itself, it requires a long process of exploration, indicator selection, and tuning, plus a thorough understanding of every aspect of the system; otherwise false alarms and missed alarms will occur from time to time.

1. Monitoring in active polling mode

Monitoring in active polling mode is essentially plug-in liveness detection. The plug-in can be a packet sender built by the test team that simulates customer requests from within the real customer network environment.

If budget allows, a third-party dial-testing service can be purchased instead.
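For a self-built probe, here is a minimal active-polling sketch using the JDK's HTTP client (the target URL, timeout, and 30-second interval are illustrative):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Periodically issues a synthetic request and raises an alarm on failure.
public class DialTestProbe {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest probe = HttpRequest.newBuilder(
                URI.create("https://platform.example.com/api/ping")).build();

        while (true) {
            try {
                HttpResponse<String> resp =
                        client.send(probe, HttpResponse.BodyHandlers.ofString());
                if (resp.statusCode() != 200) {
                    alert("unexpected status " + resp.statusCode());
                }
            } catch (Exception e) {
                alert("probe failed: " + e.getMessage());  // target unreachable
            }
            Thread.sleep(30_000);  // polling interval
        }
    }

    static void alert(String msg) {
        System.err.println("[ALERT] " + msg);  // placeholder for a real alarm channel
    }
}
```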

2. Automated business monitoring

The general pattern is to process business monitoring logs collected in real time and issue alarms through an early-warning indicator system.

3. Automated link monitoring

There are generally two modes: log collection and in-process probes. The log-collection mode gathers link monitoring logs and, using the per-interface time consumption correlated by log ID, monitors the call links between systems segment by segment from a technical perspective, raising alarms against the configured link latency thresholds;

In-process probes bury interceptors in the system and use the interceptor program to obtain key monitoring data such as the elapsed time at each observed point.
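A minimal in-process probe sketch: a generic interceptor that times an observed call and reports the duration (the reporting sink is a placeholder):

```java
import java.util.function.Supplier;

// Wraps any observed call and reports its elapsed time to the monitoring pipeline.
public class TimingInterceptor {

    public static <T> T intercept(String pointName, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long micros = (System.nanoTime() - start) / 1_000;
            report(pointName, micros);  // push to the link-monitoring pipeline
        }
    }

    static void report(String point, long micros) {
        System.out.printf("point=%s elapsed=%dus%n", point, micros);
    }

    public static void main(String[] args) {
        // Hypothetical observed call wrapped by the probe.
        String result = intercept("orderService.query", () -> "order-123");
        System.out.println(result);
    }
}
```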

4. Application system health monitoring

Health monitoring comes in two categories: implementing a self-check interface in each application system, and, in some scenarios, monitoring the application's heartbeat via the VRRP protocol (for example, with Keepalived).
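A minimal self-check interface sketch using the JDK's built-in HTTP server (the port and dependency checks are illustrative placeholders):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Exposes /health so the monitoring system can poll the node's own view
// of its dependencies.
public class HealthEndpoint {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/health", exchange -> {
            boolean healthy = checkDatabase() && checkCache();
            byte[] body = (healthy ? "UP" : "DOWN").getBytes();
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }

    static boolean checkDatabase() { return true; }  // simulated dependency check
    static boolean checkCache() { return true; }     // simulated dependency check
}
```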

5. Front-end page monitoring

Load a monitoring script in front-end pages to record the response speed of page loads and user operations, as well as the timeliness and error rates of AJAX requests and JavaScript loading, thereby tracking important user experience metrics such as page load time, white-screen rate, and white-screen duration.

6. Resource usage monitoring

This is the most basic and core monitoring method. For servers, virtual machines, and containers, monitoring items include CPU and memory usage, I/O throughput, bandwidth usage, number of TCP connections, etc.
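A minimal resource sampler using the JDK's standard management beans (a production agent would also collect I/O, bandwidth, and TCP connection counts from the operating system):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Samples CPU load average and heap usage from inside the JVM.
public class ResourceSampler {

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        Runtime rt = Runtime.getRuntime();

        double loadAvg = os.getSystemLoadAverage();  // recent system load average
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);

        System.out.printf("loadAvg=%.2f heapUsed=%dMB heapMax=%dMB%n",
                loadAvg, usedMb, maxMb);
    }
}
```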

Availability and capacity measurement

Platform capability metrics include availability, concurrent response capability, and the platform's maximum capacity in terms of users, data, lines, and so on.

The most important high-availability indicator in platform operation is service availability, which many platforms refer to simply as the SLA.

If the SLA of a single application system is 99.9%, then the SLA of a unit composed of two cascaded systems is about 99.8% (99.9% × 99.9%), and that of three cascaded systems about 99.7% (99.9% × 99.9% × 99.9%), which confirms the principle of "shallow depth and fewer cascades" mentioned in high-performance deployment.

Given the availability A of the server (and operating system) and the availability B of the application, the application system's SLA can be calculated as A × B. For example, with A = 99.95% and B = 99.9%, the resulting SLA is about 99.85%.

After adopting cloud technology, two things improve: the availability of cloud resource servers is higher than that of self-managed servers, so A rises; and cloud-native capabilities plus microservice software practices improve availability at the application level, so B rises as well.

In short, on the basis of guaranteeing A, strive to improve B so that the actual SLA does not fall below the platform's SLA target.
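A tiny sketch of the arithmetic above: cascaded components multiply, so every extra layer lowers the composite availability:

```java
// Composite SLA of components in series is the product of their availabilities.
public class SlaCalc {

    static double cascade(double... availabilities) {
        double sla = 1.0;
        for (double a : availabilities) {
            sla *= a;
        }
        return sla;
    }

    public static void main(String[] args) {
        double server = 0.9995;      // A: server and operating system
        double application = 0.999;  // B: application
        System.out.printf("system SLA = %.4f%n", cascade(server, application));  // ~0.9985
        System.out.printf("3 cascaded systems = %.4f%n",
                cascade(0.999, 0.999, 0.999));                                   // ~0.9970
    }
}
```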

[Figure: Calculation of platform SLA]

Capacity measurements include: user capacity, data capacity, and line capacity.

The industry's rule of thumb for the quantitative relationship between registered users, online users, and concurrent users is:

Number of online users = number of registered users × (5% ~ 20%)

Number of concurrent users = number of online users × (5% ~ 20%)

For example, a platform with 1,000,000 registered users can be expected to have roughly 50,000 to 200,000 online users, and in turn roughly 2,500 to 40,000 concurrent users.

If each tenant's data volume is 100GB and the total allocatable table-space resource is 10TB, then the maximum number of tenants is about 100 (10TB / 100GB).

Considering physical space utilization in database storage, each tenant's planned data volume should be amplified by a certain ratio; a recommended value is 1.3 to 1.5 times. With that factor, the 10TB example above supports roughly 66 to 76 tenants rather than 100.

Line capacity, as the name implies, is how many service requests the line bandwidth can carry at the same time. It is divided into external and internal: external refers to Internet bandwidth, internal to LAN bandwidth.

Concurrency performance measurement

QPS (Queries Per Second) is the number of queries an application service can answer per second. It measures how much query traffic a server handles within a given period and is primarily a query-oriented server performance metric.

TPS (Transactions Per Second) is the number of transactions per second, where a transaction is the full round trip from the client sending a request to the server responding. A transaction may consist of one request or several; there is no strict fixed definition. For example, if one business transaction triggers three backend queries, then TPS = 100 implies QPS ≈ 300.

TPS and QPS are common indicators on both the client and server sides, but as their definitions show, TPS better describes the client's perspective, while QPS better describes the server's.

Distributed statelessness

As the distributed "troika," stateless applications, distributed transactions, and distributed locks are must-have options when developing application services.


Stateless sessions are a typical feature of contemporary distributed application platforms. Stateless design needs to be examined along the entire chain, mainly from the following entry points.

Looking at stateless sessions from the perspective of server node deployment:

Application system nodes are peers of one another and are completely decoupled from the user's SessionID and session context information.

Specifically, a user's business process may span multiple requests whose steps are serial: the input of each step depends on the result of the previous one. How, then, can the load balancer distribute any step's request to any node of the application system? The recommended solution is to move the context information off the node and into a shared middleware that supports high-speed reads (such as Redis).
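A minimal sketch of externalized session context using the Jedis Redis client (the key format and 30-minute TTL are assumptions):

```java
import redis.clients.jedis.Jedis;

// Session context lives in Redis rather than on any node, so the load
// balancer can route each step of the flow to any node.
public class SharedSessionStore {

    private static final int TTL_SECONDS = 30 * 60;

    /** Any node saves step results under the session's key. */
    public static void saveContext(Jedis redis, String sessionId, String contextJson) {
        redis.setex("session:" + sessionId, TTL_SECONDS, contextJson);
    }

    /** Any other node can pick up the flow by reading the same key. */
    public static String loadContext(Jedis redis, String sessionId) {
        return redis.get("session:" + sessionId);
    }

    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            saveContext(redis, "abc123", "{\"step\":1,\"cart\":[\"sku-1\"]}");
            System.out.println(loadContext(redis, "abc123"));
        }
    }
}
```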

Stateless sessions from the perspective of client-server interaction

The server does not hold client state. In most implementations, after the server authenticates the user's identity, it issues the user some kind of session credential (i.e., a SessionID), for example a signed token.
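A minimal sketch of issuing a self-contained, HMAC-signed credential with the JDK's crypto APIs (the secret handling and payload layout are illustrative, not a specific token standard):

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// The server signs the payload; later it can verify a presented token by
// re-computing the signature, without storing any per-session state.
public class TokenIssuer {

    private static final byte[] SECRET =
            "change-me-server-secret".getBytes(StandardCharsets.UTF_8);  // assumed key handling

    public static String issue(String userId, long expiresAtEpochSec) throws Exception {
        String payload = userId + "|" + expiresAtEpochSec;
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(SECRET, "HmacSHA256"));
        Base64.Encoder enc = Base64.getUrlEncoder().withoutPadding();
        String sig = enc.encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        return enc.encodeToString(payload.getBytes(StandardCharsets.UTF_8)) + "." + sig;
    }

    public static void main(String[] args) throws Exception {
        // Token valid for one hour from now.
        System.out.println(issue("user-42", System.currentTimeMillis() / 1000 + 3600));
    }
}
```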

Stateless sessions from a communications protocol perspective

Each request from the client to the server must contain all the information needed to understand that request; this is a "stateless API." The first-choice implementation is a RESTful API over HTTP.

Stateless Sessions from a User Convenience Perspective

Using unified identity authentication to achieve user single sign-on is a typical case.

Architecture design drawings

[Figure: Layered architecture oriented toward the middle office and technology stack]

[Figure: Layered architecture oriented toward business domains]

[Figure: Mobile application security architecture]

[Figure: Swim-lane style interaction process]

[Figure: Reverse swim-lane style interaction process]

[Figure: Three-dimensional style interaction process]

[Figure: Three-dimensional style logical relationships]

[Figure: Layered-style system logical relationships]

[Figure: Application system deployment design]

[Figure: Data subject processing relationships]

[Figure: Data partition classification design]

[Figure: List-style functional framework]

[Figure: Functional and interactive hybrid functional framework]

[Figures: Front-end technology review form]

[Figures: Back-end technology review form]

[Figures: Version release ledger]

[Figures: Version running ledger]

Origin: blog.csdn.net/yinweimumu/article/details/134794274