Parallel and Distributed Computing, Chapter 1: Basic Concepts

1.1 The inevitability of parallel and distributed computing

1.1.1 Three walls from the perspective of data flow

  • Limited growth of GPU memory capacity (the memory wall)
  • Interconnect/link bandwidth improves far more slowly than AI computing power (the communication wall)
  • Data grows far faster than Moore's Law, while storage device speed (bandwidth or IOPS) lags far behind processors (the storage wall)

1.1.2 System structure evolution

The von Neumann architecture

Characteristics of the von Neumann architecture

  • The structure is centered on the arithmetic unit (ALU)
  • Instructions consist of an opcode and an address code
  • Program instructions are stored in memory in advance; the computer executes them in the logical order specified by the program, with the program counter (PC) and branch instructions controlling the execution sequence
  • Instructions and data are stored in the same memory, treated uniformly, and represented in binary, so a program made of instructions can itself be modified (in contrast to the Harvard architecture, which separates instruction and data memories)
  • Each memory cell has a fixed size; memory is linearly addressed in sequence and accessed by address
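To make the stored-program idea concrete, here is a minimal sketch of the fetch-decode-execute cycle on a toy machine (the instruction set, register file, and memory layout are entirely hypothetical, not any real ISA): instructions and data live in the same memory, and the program counter plus a jump opcode control the execution order.

```python
# Toy stored-program machine: instructions and data share one memory,
# a program counter (PC) fetches instructions in sequence, and a branch
# instruction overwrites the PC. Hypothetical 3-field instructions: (opcode, a, b).
memory = [
    ("LOAD", 0, 100),   # reg[0] = mem[100]
    ("LOAD", 1, 101),   # reg[1] = mem[101]
    ("ADD", 0, 1),      # reg[0] += reg[1]
    ("STORE", 0, 102),  # mem[102] = reg[0]
    ("HALT", 0, 0),
] + [0] * 95 + [7, 35, 0]   # data region starts at address 100

reg = [0, 0]
pc = 0
while True:
    opcode, a, b = memory[pc]     # fetch + decode
    pc += 1                       # default: sequential execution
    if opcode == "LOAD":
        reg[a] = memory[b]
    elif opcode == "ADD":
        reg[a] += reg[b]
    elif opcode == "STORE":
        memory[b] = reg[a]
    elif opcode == "JUMP":
        pc = b                    # branch: overwrite the PC
    elif opcode == "HALT":
        break

print(memory[102])   # 42
```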

The evolution of instruction set architecture

  • Instruction set function
    · Complex instruction set computers (CISC)
    · Reduced instruction set computers (RISC), the basis of contemporary computer design
  • Instruction set format
    · Variable-length instruction set computers
    · Fixed-length instruction set computers
  • Instruction address space and addressing modes
    · Multiple flexible addressing modes

Computing Architecture Evolution

  • Single-level computing -> multi-level computing -> network computing -> cloud computing
  • Instruction word width grows: 8, 16, 32, 64 bits, and then vector instructions
  • Parallelism: single-core, multi-core, many-core
  • Moore's Law begins to break down
  • Operating frequency increases
  • Control complexity increases

Storage structure evolution

  • Single-level storage -> multi-level storage -> network storage -> cloud storage
  • Storage architecture continues to hold the line for Moore's Law
  • Capacity rises
  • Speed rises
  • Energy consumption falls

IO structure evolution

1.2 Parallel computing

1.2.1 Classification of computer system structures

Flynn's taxonomy, proposed by Michael Flynn, classifies computer architectures into four categories:

  • Single Instruction stream, Single Data stream (SISD)
  • Single Instruction stream, Multiple Data streams (SIMD)
  • Multiple Instruction streams, Single Data stream (MISD)
  • Multiple Instruction streams, Multiple Data streams (MIMD)

**Single Instruction stream, Single Data stream, SISD (von Neumann)**

  • Each instruction operates on a single data item
  • Exploits instruction-level parallelism
  • Each PU (processing unit / ALU) has its own dedicated CU (control unit)
  • Techniques: pipelining, dynamic scheduling, lookahead execution, superscalar, multiple issue.

**Single Instruction stream, Multiple Data streams, SIMD (array processors, vector machines)**

  • Each instruction operates on multiple data streams

  • Data-level parallelism

  • Multiple PUs, each possibly with private storage, share a single CU

  • Examples: vector architectures, GPUs

**Multiple Instruction streams, Multiple Data streams, MIMD**

  • Each instruction stream operates on its own data stream

  • Task-level parallelism

  • Tight coupling manifests as thread-level parallelism: multi-core, shared-memory model

  • Loose coupling manifests as process-level parallelism: multiple processors/machines, message-passing model

**Multiple Instruction streams, Single Data stream, MISD (high performance, multi-machine)**

  • Multiple instructions operate on a single data stream

  • Stream computing

  • Tight coupling manifests as on-chip stream computing

  • Loose coupling manifests as big-data workflows
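To make the SISD/SIMD distinction above concrete, a small sketch (assuming NumPy is available; NumPy's vectorized array operations are typically compiled down to SIMD vector instructions, whereas the explicit loop processes one element per step in SISD fashion):

```python
import numpy as np

a = np.arange(100_000, dtype=np.float64)
b = np.arange(100_000, dtype=np.float64)

# SISD style: one instruction operates on one pair of data items per step.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# SIMD style: a single (vectorized) add is applied to whole blocks of data,
# i.e. data-level parallelism under one instruction stream.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
```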

1.2.2 Parallelism

Parallelism: A computer system performs multiple calculations or operations at the same time or within the same time interval.

  • Simultaneity: Two or more events occurring at the same time.
  • Concurrency: Two or more events occurring within the same time interval.

From the perspective of data processing, the levels of parallelism from low to high are:

  • Word-serial, bit-serial: only one bit of one word is processed at a time (the most basic serial mode, with no parallelism)
  • Word-serial, bit-parallel: all bits of one word are processed at the same time, but different words are processed serially (parallelism begins to appear)
  • Word-parallel, bit-serial: the same bit position (a bit slice) of many words is processed at the same time (higher parallelism)
  • Fully parallel: all or some bits of many words are processed simultaneously (the highest level of parallelism)

From the perspective of the executor, the parallelism levels are from low to high:

  • Intra-instruction parallelism: parallelism between micro-operations in a single instruction
  • Instruction-level parallelism: executing two or more instructions in parallel
  • Thread-level parallelism: executing two or more threads in parallel. (Usually multiple threads derived from one process are used as the scheduling unit)
  • Task-level or process-level parallelism: executing two or more processes or tasks (program segments) in parallel
  • Job or program-level parallelism: Execution of two or more jobs or programs in parallel.

1.2.3 Parallel computing classified by cost

  • Compute-intensive: e.g., large-scale scientific and engineering computation and numerical simulation;
  • Data-intensive: e.g., data warehousing, data mining, computational visualization;
  • Communication-intensive: e.g., collaborative computing and streaming-media services

1.2.4 Application fields of parallel computing system structure

  • Highly structured numerical computation
  • unstructured numerical computation
  • Real-time multi-factor problems
  • Problems with large storage capacity and intensive input and output
  • Graphics and Design Issues
  • AI

1.2.5 Three basic ideas for improving parallelism

  • Time overlap (e.g., multi-stage pipelining) introduces the time factor: multiple processing steps are staggered in time and take turns using different parts of the same set of hardware, speeding up hardware turnover and thereby gaining speed.
  • Resource duplication (e.g., multi-core) introduces the space factor and wins by sheer quantity: replicating hardware resources greatly improves system performance (as with large models).
  • Resource sharing (software concurrency) is a software technique that lets multiple tasks take turns using the same set of hardware in a defined time order, e.g., virtual machines in cloud computing.

1.2.6 The development of parallelism

The development of parallelism in stand-alone systems
In the process of developing high-performance single processors, the principle of time overlap plays a leading role.

The basis for achieving time overlap: specialization of component functions

  • Divide a piece of work into several interconnected parts according to function;
  • Assign each part to a dedicated component for completion;
  • According to the principle of time overlap, the execution processes of each part are overlapped in time, so that all components can complete a common set of tasks in sequence.

Application of the resource-duplication principle within a single processor

  • Multi-bank (interleaved) memory
  • Multiple functional units:
    · A general-purpose unit is decomposed into several specialized units, such as adders, multipliers, dividers, and logic units, and the same unit can also be replicated several times.
    · As long as the functional unit an instruction needs is free, execution of that instruction can begin (provided its operands are ready).
    · This realizes instruction-level parallelism.
  • An array processor (parallel processor) provides many identical processing elements which, under a single controller and a single instruction, perform the same operation on the elements of a vector or array at the same time.

Application of resource sharing in a single machine
In a single processor, resource sharing essentially uses one processor to simulate the functions of multiple processors, giving rise to the concept of the virtual machine: each user of a virtual machine feels as if he or she has a dedicated processor.

The development of parallelism in multi-machine systems

  • Coupling degree: reflects how closely the machines in a multi-machine system are physically connected and how strongly they can interact ("low coupling, high cohesion" thinking).
  • Tightly coupled (directly coupled) systems: the physical connections between computers have high bandwidth, usually via a bus or high-speed switch interconnect, and main memory can be shared. Interaction between machines takes place at the variable level.
  • Loosely coupled (indirectly coupled) systems: computers are generally interconnected through channels or communication lines and can share external storage devices (disks, tapes, etc.). Interaction between machines takes place at the file or data-set level.

Multi-machine systems

  • High availability: structural robustness (multiple active instances), data reliability (hot standby and replicas), reconfigurable systems (multi-point collaboration)
  • High performance: load balancing (reverse proxies), elastic computing (web services), low latency (message queues, distributed caches)
  • High security: data encryption, identity authentication, access control

1.2.7 Paradigms of parallel computing

Phase parallelism
Phase Parallel: a parallel program consists of a sequence of supersteps, each also called a phase. Within each superstep, every process performs its own computation in parallel, followed by an interaction step (communication, synchronization, etc.). Phase parallelism is also called loosely synchronous parallelism. Its advantage is that it simplifies error checking and performance analysis; its disadvantages are that the computation and interaction phases cannot overlap and that load balance is hard to maintain.
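A minimal sketch of the superstep structure (using Python threads and a Barrier as the loose synchronization point; the per-phase work is a made-up placeholder):

```python
import threading

NUM_WORKERS = 4
NUM_PHASES = 3
barrier = threading.Barrier(NUM_WORKERS)   # the interaction/synchronization point
partial = [0] * NUM_WORKERS

def worker(rank):
    for phase in range(NUM_PHASES):
        # computation phase: each process does its own local work
        partial[rank] += rank + phase
        # interaction phase: all workers synchronize before the next superstep
        barrier.wait()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
print(partial)
```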

Divide-and-conquer parallelism

Divide and Conquer Parallel: a leading process splits its workload into several smaller loads and assigns them to sub-processes. These sub-processes complete their computations in parallel, and a successor process then merges their results. This process of splitting and merging naturally leads to recursion. The disadvantage is that load balance is hard to maintain. The predecessor and successor processes may be the same process or different processes; MapReduce is an example.
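A sketch of the split / compute / merge pattern, assuming a hypothetical parallel sum with multiprocessing.Pool:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # each sub-process computes on its own slice
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    # divide: the leading process splits the workload
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        parts = pool.map(partial_sum, chunks)   # conquer in parallel
    total = sum(parts)                          # merge the partial results
    print(total == sum(data))
```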


Pipeline parallelism
Pipeline Parallel: multiple processes execute in an overlapped fashion on different segments of a pipeline, achieving an overall parallel effect. The disadvantage is that the segments usually carry unequal loads, so load balance is hard to maintain, and the imbalance creates pipeline bubbles. The remedy is to replicate the bottleneck segment or to subdivide it further.
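A sketch of a three-stage pipeline built from threads and queues (the stage functions are made up for illustration); different items occupy different segments at the same time:

```python
import threading, queue

q1, q2 = queue.Queue(), queue.Queue()
STOP = object()   # sentinel marking the end of the stream

def stage1(items):
    for x in items:
        q1.put(x * 2)          # segment 1
    q1.put(STOP)

def stage2():
    while (x := q1.get()) is not STOP:
        q2.put(x + 1)          # segment 2
    q2.put(STOP)

def stage3(out):
    while (x := q2.get()) is not STOP:
        out.append(x)          # segment 3

out = []
threads = [threading.Thread(target=stage1, args=(range(10),)),
           threading.Thread(target=stage2),
           threading.Thread(target=stage3, args=(out,))]
for t in threads: t.start()
for t in threads: t.join()
print(out)   # [1, 3, 5, ..., 19]
```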

Master-slave parallelism

Master-Slave Parallel: a master process executes the serial part of a parallel program and spawns sub-processes that perform computations in parallel; when a sub-process finishes its computation it reports back to the master, and the master assigns it a new task. This kind of parallelism is also called herd parallelism. The master handles all coordination, so it easily becomes the system bottleneck. The remedy is to use multiple masters and multiple slaves, which in turn introduces the data-consistency problem of multi-master coordination.
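A sketch of the master-slave pattern (the main process acts as master; multiprocessing.Pool's imap_unordered hands a worker a new task as soon as it reports back a result; the task function is hypothetical):

```python
from multiprocessing import Pool

def slave_task(n):
    # work done by a sub-process (slave)
    return n, n * n

if __name__ == "__main__":
    tasks = range(20)
    with Pool(4) as pool:
        # the master dispatches tasks; whenever a slave finishes one,
        # it receives the next pending task from the master
        for n, result in pool.imap_unordered(slave_task, tasks):
            print(f"task {n} -> {result}")
```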

Work-pool parallelism
Work Pool Parallel: at the start the pool contains only one job, and any idle process may take a job out of the pool and execute it. During execution it may generate one or more new jobs and put them back into the pool for other idle processes to take. The parallel program ends when the pool becomes empty. The work pool is a logically global data structure, which may be an unordered queue, an ordered queue, a priority multi-queue, and so on. This is the message-queue pattern commonly used in enterprise applications.
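A sketch of a work pool built on a shared queue, where an executing job may put new jobs back into the pool (here it hypothetically expands a small binary tree) and the program ends when the pool drains:

```python
import threading, queue

pool = queue.Queue()          # the work pool: a logically global job queue
results = []
lock = threading.Lock()

def worker():
    while True:
        try:
            job = pool.get(timeout=0.5)   # an idle process takes a job from the pool
        except queue.Empty:
            return                        # the pool has drained: this worker stops
        with lock:
            results.append(job)
        if job < 4:                       # executing a job may generate new jobs
            pool.put(2 * job)
            pool.put(2 * job + 1)

pool.put(1)                               # at the start the pool holds a single job
workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers: w.start()
for w in workers: w.join()
print(sorted(results))                    # the 7 nodes of a small binary tree
```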

Main issues in parallel computing

  • Memory wall (Memory Wall): the imbalance between memory-access capability and the computing power of the processing units leaves the processor in a "starved" state.
  • Power wall (Power Wall)
  • Ecosystem wall

1.3 Distributed computing

1.3.1 Why is a distributed computing system needed?

Two major technologies that promote the development of distributed computing systems:

  • The development of computer hardware technology and software technology;
  • The development of computer network technology.

**These two technologies have changed the way people use computers:**

  • In the 1950s, users booked time on the machine and monopolized all of its resources;
  • In the 1960s, batch-processing technology;
  • In the 1970s, time-sharing systems allowed multiple users to share one computer at the same time;
  • In the 1980s, personal computers gave each user a dedicated machine;
  • From the 1990s to the present, multiple computers are used through computer networks.

The physical basis determines the mode of use.

New problems brought by the multi-computer system environment:

  • In use, users must know the difference between local objects and remote objects;
  • In management, administrators cannot travel from machine to machine to perform operations such as file backups.

The way to solve these new problems is to implement a distributed operating system.
In a distributed computing system, multiple computers form a complete system that behaves like a single stand-alone system. The distributed operating system is the core of realizing such a system.

1.3.2 Related concepts of distributed computing systems

What is a distributed computing system
A distributed computing system is a computing system composed of multiple interconnected, independent computer systems; from the user's perspective it appears to be a single centralized system.

A distributed computing system is a computing system composed of multiple interconnected processing resources that can cooperate to perform a common task under the control of the entire system, with minimal reliance on centralized programs, data, and hardware. These processing resources can be physically adjacent or geographically dispersed.

Description of distributed computing system definition:

  • The system is composed of multiple processors or computer systems.
  • Two kinds of structure: the processing resources may be physically adjacent processors connected by internal buses or switches, communicating through shared main memory; or they may be geographically dispersed computer systems connected by a communication network (a wide-area or local-area network), communicating by exchanging messages.
  • These resources form a whole and are transparent to users: a user does not need to know where a resource is located in order to use it.
  • A program can be distributed to run across the various computing resources.
  • All computer systems have equal status: apart from being governed by the system-wide operating system, there is no master-slave relationship or centralized point of control.

Tightly coupled and loosely coupled distributed computing systems

Tightly coupled distributed computing system

  • Connection method: internal bus or intra-machine interconnection network;
  • Distance between processing resources: physically separate but very close together;
  • Processing resources: multiprocessors;
  • Communication method: shared memory.

Loosely coupled distributed computing system

  • Connection method: communication network;
  • Distance between processing resources: geographically dispersed and far apart;
  • Processing resources: multiple computer systems;
  • Communication method: message exchange.
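A minimal sketch contrasting the two communication styles (shared memory among the threads of one process vs. message exchange between separate processes; the counter and the message text are just illustrative):

```python
import threading, multiprocessing

def bump(shared, lock):
    # tightly coupled style: workers communicate through shared memory
    with lock:
        shared["counter"] += 1

def producer(q):
    # loosely coupled style: independent processes exchange messages
    q.put("hello from another process")

if __name__ == "__main__":
    # shared-memory communication among threads of one process
    shared, lock = {"counter": 0}, threading.Lock()
    threads = [threading.Thread(target=bump, args=(shared, lock)) for _ in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(shared["counter"])          # 8

    # message-passing communication between separate processes
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=producer, args=(q,))
    p.start()
    print(q.get())
    p.join()
```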

Homogeneous and heterogeneous distributed computing systems

  • In a homogeneous distributed system, the hardware and software of the constituent computers are identical or very similar, as are the hardware and software of the interconnecting network.
  • In a heterogeneous distributed system, the hardware or software of the constituent computers differs, or the hardware or software of the interconnecting network differs.

1.3.3 Transparency of distributed computing systems

  • Transparency of a distributed computing system: users and programmers cannot see the existence of the network. From their perspective, all the machines in the network appear as one; they cannot see machine boundaries or the network itself, and they need not know where data is stored or where processes execute.

Forms of transparency in distributed computing systems:

  • Name transparency: an object's name is globally unique, and the same name is used to access the object no matter where the access originates. Moving a program within the system therefore does not affect its correctness.
  • Location transparency: a resource's name contains no information about its location, so when the resource is moved within the system, programs that use the unchanged name continue to run correctly.
  • Access transparency: users need not distinguish between local and remote resources; both are accessed in the same way.
  • Migration transparency: users do not know whether a resource or one of their jobs has been migrated elsewhere. Migration transparency requires name transparency.
  • Replication transparency: multiple copies of a file or other object may exist in the system at the same time, but this is invisible to users, and a modification of the object is applied to all of its copies.
  • Concurrency and parallelism transparency: several processes may access the same resource concurrently or in parallel, or one process may use several resources at once, without interfering with or corrupting one another.
  • Failure transparency: when part of the system fails, the system as a whole does not fail and continues to operate normally.

Transparency gives distributed computing systems the following advantages:

  • Makes software development easier because there is only one way to access resources and the functionality of the software is independent of its location.
  • Changes in certain system resources have little or no impact on application software.
  • The system's resource redundancy (hardware redundancy and software redundancy) makes operation more reliable and availability better. Transparency makes it easy to replace various redundant resources with each other when implementing this redundancy.
  • In terms of resource operation, an operation can be moved from one place to another (or to several places) without impact.

Factors affecting transparency

  • The heterogeneity of the system affects transparency.
    · Loose integration through web services.
    · Sharing programs written in different languages.
    · Adding front-end software on top of multiple existing systems.
  • The impact of local autonomy on transparency
    · A distributed computing system is a collection of computers dispersed across different locations; each location may wish to retain control over its own machines, and this local autonomy limits global transparency.
    · Resource control: the machines in a distributed computing system are operated by different users, or controlled by different departments of an organization, which want greater control over how their resources are used.
    · Naming: even on machines of the same model, different users may organize their directories in different ways.
  • The impact of network interconnection on transparency
    · Many networks connect different families of computers from different manufacturers.
    · Today's networks evolved directly from early network designs whose primary purpose was communication; distributed computing was not a consideration.
    · Wide-area networks are generally expensive resources characterized by low bandwidth or high latency.

1.3.4 Architecture and design of distributed computing system

Layered architecture of distributed computing system
A distributed computing system can be divided into several logical layers. The boundary between adjacent layers is called an interface, so each layer has two interfaces. The functions provided by a layer can be further divided into modules, with interfaces between the modules as well. An interface consists of the following three parts:

  • A set of visible abstract objects (resources) and the operations and parameters required for these objects.
  • A set of rules that govern the legal order of these operations.
  • Encoding and formatting conventions required for operations and parameters.

A process and message-passing model of a distributed computing system

The distributed computing system consists of four layers: the first layer is the hard core made up of hardware or firmware; the second is the kernel of the distributed operating system; the third is the service layer of the distributed operating system; and the fourth is the user-related application layer.

Distributed computing system based on middleware

Design issues for distributed computing systems
Design issues common to each layer or many layers:

  1. Naming issues.
  2. Error control.
  3. Resource management issues.
  4. Synchronization issues.
  5. Protection issues.
  6. Testing issues.

1.4 Principles of quantification

1.4.1 Principle of regularity

Focus on frequently occurring events: apply optimizations to the common case to obtain the greatest overall improvement.

1.4.2 Principle of locality

The distribution of memory addresses accessed during program execution is not random, but relatively clustered.

  • Pareto principle
    80% of a program's execution time is spent executing 20% of its code.
    (The 80/20 rule of everyday life; the last 20%.)
  • Temporal locality of the program
    The information that the program is about to use is likely to be the information currently being used (or information that has just been used recently).
  • Spatial locality of the program
    The information that the program is about to use is likely to be spatially adjacent or close to the information currently being used.
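A small illustration of spatial locality (hypothetical array size; actual timings depend on the machine): traversing a row-major 2-D array row by row touches consecutive addresses and is cache-friendly, while traversing it column by column strides across memory:

```python
import numpy as np, time

a = np.random.rand(4000, 4000)   # stored row-major (C order)

t0 = time.perf_counter()
s1 = sum(a[i, :].sum() for i in range(a.shape[0]))   # row-wise: good spatial locality
t1 = time.perf_counter()
s2 = sum(a[:, j].sum() for j in range(a.shape[1]))   # column-wise: strided access
t2 = time.perf_counter()

print(f"row-wise {t1 - t0:.3f}s, column-wise {t2 - t1:.3f}s")
```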

1.4.3 Benchmark principle

Benchmark suites: select a set of test programs that are representative in all respects to form a general-purpose benchmark collection. This avoids the one-sidedness of a single isolated test program and measures the performance of a computer system as comprehensively as possible (for comparison against other typical systems).

1.4.4 Moore’s Law

The number of transistors per unit area doubles roughly every 18 months, while the price per transistor falls correspondingly.

1.4.5 Dennard’s law

In each new technology generation, transistor density doubles while power consumption per unit chip area remains the same.

1.4.6 Amdahl’s law

Describes the overall system speedup obtainable by accelerating the execution of one component, which is limited by the fraction of total execution time that the component accounts for.

  • Improvable fraction: in the original system, the proportion of total execution time taken by the part that can be improved (always less than 1).
  • Component speedup ratio: the factor by which the improved part becomes faster, i.e., the ratio of its execution time before the improvement to its execution time after.
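Using f for the improvable fraction and s for the component speedup ratio, the law can be written compactly (a standard formulation consistent with the two definitions above):

```latex
S_{\text{overall}} \;=\; \frac{T_{\text{old}}}{T_{\text{new}}}
\;=\; \frac{1}{(1 - f) + \dfrac{f}{s}},
\qquad 0 \le f < 1,\ s \ge 1,
\qquad \lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - f}.
```

For example, if 90% of the execution time can be improved (f = 0.9) and that part is sped up 10x (s = 10), the overall speedup is 1 / (0.1 + 0.09) ≈ 5.3, far less than 10; even with unlimited component speedup it cannot exceed 1 / (1 - f) = 10.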

1.4.7 Computational cost evaluation model

Parallel computing overhead = computing overhead + data overhead + communication overhead + scheduling overhead

• Communication time is denoted Tn, memory-access time Td, and computation time Tc.
• If an asynchronous operation mechanism is used:
  · when Td < Tc, Td is partially hidden by Tc and the total cost is T = Tc;
  · when Td > Tc, Tc is partially hidden by Td and the total cost is T = Td;
  · when Td = Tc, Tc and Td hide each other completely, giving the best time overlap.
• The degree to which Tc and Td hide each other is therefore the key target of optimization.
• Communication overhead (processor to processor) and data overhead (processor to memory) are point-to-point transfers.
• Computation overhead arises in the compute units within a processor, and scheduling overhead in the scheduling logic within a processor; both occur within a point.
• Hence: parallel computing overhead = (within-point) computing overhead + (point-to-point) transmission overhead.
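A tiny sketch of the overlap rule under the asynchronous assumption (symbols follow the text above; the numbers are made up):

```python
def total_cost(Tc, Td):
    # With an asynchronous operation mechanism, computation time Tc and
    # memory-access time Td overlap, so the total cost is the larger of the two.
    return max(Tc, Td)

print(total_cost(Tc=8.0, Td=5.0))   # Td partially hidden by Tc -> 8.0
print(total_cost(Tc=5.0, Td=8.0))   # Tc partially hidden by Td -> 8.0
print(total_cost(Tc=6.0, Td=6.0))   # Tc and Td fully hide each other -> 6.0
```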

1.4.8 CPU performance formula

CPU time = IC × CPI × Clock cycle time

  • IC (Instruction Count): the number of instructions the program executes; determined by the instruction set architecture and the compiler technology.
  • CPI (Cycles Per Instruction): the average number of clock cycles per instruction; determined by the machine organization and the instruction set architecture.
  • Clock cycle time: the reciprocal of the system clock frequency; determined by the hardware implementation technology and the machine organization.

The ideal CPI is the CPI obtained when instruction and data accesses hit in the cache 100% of the time: CPI = ideal CPI + stall CPI.
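A worked example with hypothetical numbers (2 × 10^9 instructions, average CPI of 1.5, 2 GHz clock):

```python
# Hypothetical program: 2e9 instructions, average CPI 1.5, 2 GHz clock.
IC = 2e9                       # instruction count
CPI = 1.5                      # average cycles per instruction
clock_cycle_time = 1 / 2e9     # seconds per cycle (reciprocal of 2 GHz)

cpu_time = IC * CPI * clock_cycle_time
print(f"CPU time = {cpu_time:.2f} s")          # 2e9 * 1.5 / 2e9 = 1.50 s

# The same formula split into ideal CPI plus stall CPI (e.g. cache-miss stalls):
ideal_CPI, stall_CPI = 1.0, 0.5
print(IC * (ideal_CPI + stall_CPI) * clock_cycle_time)   # 1.5 s again
```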
