Parallel and Distributed Chapter 7 Architecture Part 2

7.3 Interconnection structure

7.3.1 Basic concepts of network topology

The interconnection structure is a structure formed by switching elements connected according to a certain topology and working according to a certain control method. It is used to realize the interconnection between multiple functional components in a computer system.

Three elements

• Interconnection topology: describes the shape of the connection paths
• Switching elements: describe the switching states of the connection paths
• Control method: describes the operating rules of the connection channels

Classification

• Static interconnection networks (linear array, ring, chordal ring, tree, mesh, hypercube)
• Dynamic interconnection networks (buses, crossbar switches, multistage interconnection networks)

Protocol
• PCIe, InfiniBand, Ethernet, FDDI optical network

• Network size (Size): the number of nodes in the network;
• Number of links (Link): the number of links between all nodes in the network;
• Link redundancy (Link Redundancy): the number of distinct links connecting two specific nodes in the network;
• Node degree (Node Degree): the number of edges entering or leaving a node; degree = number of incoming edges + number of outgoing edges;
• Diameter (Diameter): the longest of the shortest paths between any two nodes in the network, i.e. the maximum path length;
• Bisection width (Bisection Width): the minimum number of edges that must be removed to cut the network into two halves;
• Symmetry (Symmetric): whether the network looks the same when viewed from any node.
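These metrics can be computed directly on a small example graph. The following Python sketch (the `ring` helper and the 8-node size are illustrative choices, not from the chapter) measures size, link count, node degree, and diameter for a bidirectional ring:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop count from src to every reachable node (breadth-first search)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def ring(n):
    """Bidirectional ring: node i is linked to (i - 1) mod n and (i + 1) mod n."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

adj = ring(8)
size = len(adj)                                   # network size: 8
links = sum(len(v) for v in adj.values()) // 2    # undirected links: 8
degree = max(len(v) for v in adj.values())        # node degree: 2
diameter = max(max(bfs_dist(adj, s).values()) for s in adj)   # N/2 = 4
```

The same BFS-based diameter check works for any of the static topologies in the next section.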

7.3.2 Classification of interconnection networks

• According to whether the connection path in the interconnection network can be shared, it can be divided into static interconnection network and dynamic interconnection network.
• In a static interconnection network, nodes are fixedly wired to switch units, establishing fixed (passive) connection paths between nodes.
• In a dynamic interconnection network, nodes connect only to switch units at the boundary of the network. The connections between switch units can be reconfigured at run time according to application requirements, establishing actively controllable node-to-node paths.

Design Requirements for Static Interconnect Networks

• The degree of each node should be small, preferably equal for all nodes and independent of the network size;
• The network diameter should be small and should grow slowly as the number of nodes increases;
• The network should be as symmetric as possible, so that traffic is distributed evenly;
• Node addressing should be regular enough to admit efficient routing algorithms;
• The network should have high path redundancy to meet robustness requirements;
• Expansion should be cheap, and the original topological properties should be preserved after expansion.

7.3.3 Typical static network

One-dimensional linear array
• N nodes are arranged in a 1×N linear array. Each node is connected only to its left and right neighbors, so this is also called a two-nearest-neighbor connection.
• The node degree is 2, the network diameter is N-1, and the bisection width is 1.

One-dimensional circular array
• Constructed by joining the two endpoints of a linear array with an additional link, forming a ring; the ring may be unidirectional or bidirectional.
• The node degree is 2 and the bisection width is 2.
• The diameter of a unidirectional ring is N, and that of a bidirectional ring is N/2.

Chordal ring
The more chord links are added, the higher the node degree and the smaller the network diameter.

Fully connected chordal ring
• Node degree: N-1
• Diameter: 1, the shortest possible

Cyclic shift network
• Formed by adding, from each node of a ring, extra links to every node whose distance is an integer power of 2.
• Network size N = 2^n.
• Node i is connected to node j if |j - i| = 2^r (mod N), r = 0, 1, 2, …, n-1.
• The node degree is d = 2n-1, and the diameter is D = n/2.
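The degree and diameter claims can be verified on a small instance. This sketch (with n = 4, i.e. 16 nodes, chosen for illustration) builds the cyclic-shift adjacency and checks both values by brute force:

```python
from collections import deque

def barrel(n):
    """Cyclic shift (barrel shifter) network with N = 2**n nodes: node i is
    linked to every j with |j - i| = 2**r (mod N), r = 0 .. n-1."""
    N = 1 << n
    adj = {}
    for i in range(N):
        nbrs = set()
        for r in range(n):
            nbrs.add((i + (1 << r)) % N)
            nbrs.add((i - (1 << r)) % N)
        adj[i] = nbrs
    return adj

def diameter(adj):
    """Longest shortest path, by BFS from every node."""
    best = 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

adj = barrel(4)            # N = 16
degree = len(adj[0])       # 2n - 1 = 7 (the +2**(n-1) and -2**(n-1) links coincide)
diam = diameter(adj)       # n/2 = 2
```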

Tree connection
• Star connection: in a network of N nodes, if one node's degree is raised as far as possible, to N-1, a star is formed; the network diameter is then 2.
• Binary tree connection: in a network of N nodes, every node other than the root and the leaves connects only to its parent and its two children, so this is also called a three-nearest-neighbor connection. The node degree is 3, the bisection width is 1, and the diameter is 2 log N.
• A binary fat tree (Fat Tree) addresses the bottleneck at the root of a binary tree by increasing link redundancy near the root: the closer a link is to the root, the greater its redundancy.


multidimensional grid

• A k-dimensional mesh has N = n^k nodes.
• The network diameter is k(n-1).
• Internal nodes have degree 2k; in a two-dimensional mesh, edge nodes and corner nodes have degree 3 and 2 respectively.
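The mesh degrees follow from counting in-grid neighbors per dimension; a small sketch (a 4×4 two-dimensional mesh, chosen for illustration) makes the corner/edge/internal distinction concrete:

```python
def mesh_degree(coord, n):
    """Degree of a node at the given coordinates in a k-dimensional n x ... x n
    mesh (no wraparound): one neighbor per dimension per side inside the grid."""
    return sum((c > 0) + (c < n - 1) for c in coord)

n, k = 4, 2                       # a 4x4 two-dimensional mesh, N = n**k = 16
corner = mesh_degree((0, 0), n)   # 2
edge = mesh_degree((0, 1), n)     # 3
inner = mesh_degree((1, 1), n)    # 2k = 4
diameter = k * (n - 1)            # 6: corner-to-corner Manhattan distance
```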

multidimensional ring grid

• A k-dimensional ring mesh has N = n^k nodes.
• Because every dimension wraps around into a ring, every node has degree 2k; there are no lower-degree edge or corner nodes.

Illiac mesh
• A k×k two-dimensional mesh that uses a wraparound ring connection in the vertical direction and a cross-row serpentine connection in the horizontal direction is called an Illiac mesh.
• The node degree is 4.
• The network diameter is k-1.
• The bisection width is 2k.

Torus mesh
• A k×k two-dimensional mesh that uses wraparound ring connections in both the vertical and the horizontal direction is called a Torus mesh.
• The node degree is 4.
• The network diameter is k.
• The bisection width is 2k.

Hypercube (CUBE)
• A binary n-cube consists of N = 2^n nodes distributed across n dimensions, with 2 nodes in each dimension; each node is a vertex of the cube.
• The node degree of a binary n-cube is n, the network diameter is n, and the bisection width is 2^(n-1).
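Hypercube properties fall out of binary node labels: neighbors differ in one bit, and the shortest path length equals the Hamming distance. A sketch for n = 4 (an illustrative size):

```python
def cube_neighbors(i, n):
    """In a binary n-cube, node i is linked to every node whose label
    differs from i in exactly one bit."""
    return [i ^ (1 << b) for b in range(n)]

def hamming(a, b):
    """Shortest path length between two hypercube nodes = Hamming distance."""
    return bin(a ^ b).count("1")

n = 4
N = 1 << n                                         # N = 2**n = 16 nodes
degree = len(cube_neighbors(0, n))                 # n = 4
diameter = max(hamming(0, j) for j in range(N))    # n = 4
bisection = 1 << (n - 1)                           # 2**(n-1) = 8 links cross one dimension
```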

Hypercube Ring
• If each vertex of a 3-cube is replaced by a ring, a 3-cube ring (cube-connected cycles) is formed.
• Hypercubes and hypercube rings scale poorly and have limited practical value, but their topological and mathematical properties give them academic value.

7.3.4 Typical dynamic interconnection network

A bus is a set of wires and sockets connecting the components of a computer system, used for data transfer between master devices and slave devices. Because the shared bus works on a time-division basis, each bus must be equipped with a bus controller.

Issues to consider in bus design include:
Bus arbitration, interrupt handling, protocol conversion, barrier synchronization, cache coherence, bus bridging, hierarchical expansion, etc.

Bus level
• Local bus: bus implemented on a printed circuit board
• Processor bus: bus at the CPU board level
• System bus: the path provided on the backplane (motherboard) for communication among all plugged-in boards
• Data bus: bus at the I/O board and communication board level
• Memory bus: bus at the memory board level


Crossbar switch

A crossbar switch (CrossBar Switcher) is an interconnection network implemented as a single-stage array of crosspoint switches. The state of each switch can be controlled dynamically by the program, providing a dedicated connection path between each source-destination pair on demand.
• With N ports, the switch count (complexity) is N^2.
• In parallel processing, crossbar switches are typically used in two ways: to realize one-to-one communication between processors, and to realize one-to-many communication between processors and interleaved memory banks.
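The defining constraint of a crossbar is that each input row and each output column can carry at most one active connection at a time. A minimal sketch (the `Crossbar` class and its port numbering are illustrative, not from the chapter):

```python
class Crossbar:
    """Minimal sketch of an N-port crossbar: at most one active crosspoint per
    input row and per output column, so the active state is a partial
    permutation of inputs onto outputs."""

    def __init__(self, n):
        self.n = n
        self.route = {}                 # input port -> output port

    def connect(self, src, dst):
        """Close the crosspoint (src, dst) unless either port is busy."""
        if src in self.route or dst in self.route.values():
            raise ValueError("port busy")
        self.route[src] = dst

    def crosspoints(self):
        """Hardware complexity: the crosspoint count grows as N**2."""
        return self.n * self.n

xb = Crossbar(4)
xb.connect(0, 2)                        # e.g. processor 0 -> memory bank 2
xb.connect(1, 3)                        # e.g. processor 1 -> memory bank 3
```

The N^2 growth of `crosspoints()` is exactly why a single crossbar cannot scale without limit, motivating the multistage networks below.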


Multistage Interconnection Network
A single crossbar switch cannot grow without limit, so to build a large switching network, multiple stages of crossbar switches are cascaded into a Multistage Interconnection Network (MIN) that performs dynamic switching between inputs and outputs.


Interconnection scale
• Bus/crossbar switch: connects components on a printed circuit board inside a node; shortest distance and highest bandwidth, e.g. the SCSI protocol;
• SAN (System Area Network): connects nodes over short distances (3-25 m) into a tightly coupled single system with higher bandwidth, e.g. InfiniBand;
• LAN (Local Area Network): connects nodes within a building or one site's premises into a loosely coupled multi-machine system; network distance 25-500 m, e.g. Gigabit Ethernet;
• MAN (Metropolitan Area Network): a computer network covering an entire city; network distance <= 25 km, e.g. an FDDI optical fiber network;
• WAN (Wide Area Network): a national-level network interconnecting cities, usually with logical rather than physical links, e.g. the China Education Network.

Interconnection protocol
PCI bus standard protocol, InfiniBand, Ethernet, FDDI fiber optic network, VPN, SDN

7.4 Performance evaluation


7.4.1 Workload

Workload: the average number of tasks using or waiting to use the CPU over a period of time. High CPU usage does not necessarily mean a high load. Example:

• A program that uses the CPU's computing capability continuously may drive CPU usage to 100%, yet the workload is close to 1, because the CPU is handling only one job.
• If two such programs run at the same time, CPU usage is still 100%, but the workload becomes 2.
• The higher the CPU workload, the more frequently the CPU must context-switch between tasks.

7.4.2 Peak speed

• An important indicator of computer system performance is the peak floating-point rate: the maximum number of floating-point operations the computer can complete per second. It comes in two forms: the theoretical floating-point peak and the measured floating-point peak. The theoretical peak is the maximum number of floating-point operations the machine could in principle complete per second, determined mainly by the CPU clock frequency.
• Theoretical floating-point peak = CPU clock frequency × floating-point operations performed by the CPU per clock cycle × number of CPUs in the system
• The number of floating-point operations a CPU performs per clock cycle is determined by the number of floating-point units in the processor and how many floating-point operations each unit can process per cycle.
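The peak formula above is a straight product. A sketch with hypothetical hardware numbers (the 2.5 GHz clock, 16 FLOPs/cycle, and 8 CPUs are made-up values for illustration):

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, n_cpus):
    """Theoretical floating-point peak = clock frequency
    x floating-point operations per cycle per CPU x number of CPUs."""
    return clock_ghz * flops_per_cycle * n_cpus

# hypothetical system: 2.5 GHz clock, 16 FLOPs per cycle per CPU
# (e.g. wide SIMD units with fused multiply-add), 8 CPUs
peak = theoretical_peak_gflops(2.5, 16, 8)   # 320.0 GFLOPS
```

The measured peak of a real machine is always below this figure, since memory and communication stalls keep the floating-point units from issuing every cycle.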

7.4.3 Parallel execution time

• T_n = T_comput + T_paro + T_comm

  • T_comput: computation time
  • T_paro: parallel overhead
    • Process management time: process creation, process termination, process switching
    • Group operation time: process group creation, process group destruction
    • Process query time: querying process ID and rank, querying process group ID and size
  • T_comm: communication time
    • Synchronization time: barriers and events, locks and critical sections
    • Communication time: point-to-point message passing, shared-variable reads and writes
    • Aggregation operation time: reduction operations, prefix operations

memory performance

In a hierarchical storage structure, each level is characterized by three parameters:
• Capacity C: total number of bytes stored, in B (bytes)
• Delay D: total time required to read one word, in s (seconds)
• Bandwidth B: rate of data transfer between storage levels, in Bps (bytes per second)

Communication overhead: measurement method (ping-pong scheme)
• With K nodes, measure a matrix M[K×K] = {M_ij, i = 1…K, j = 1…K} of pairwise communication overheads, each entry obtained by bouncing a message back and forth between nodes i and j.


Communication overhead (2) analysis method

  • The communication overhead of sending a message of m bytes is T(m) = T0 + m/R∞
    • R∞ is the asymptotic bandwidth, in MBps.
  • T0 is the communication latency, also called the startup time: during T0 the data path is being opened and the first byte has not yet arrived; after T0, one byte arrives per cycle.
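The linear model T(m) = T0 + m/R∞ can be evaluated directly; a useful derived quantity is the half-performance message length m = T0·R∞, at which startup and transfer times are equal. The link numbers below (50 µs startup, 100 MB/s bandwidth) are hypothetical:

```python
def comm_time_us(m_bytes, t0_us, r_inf_mbps):
    """Linear cost model T(m) = T0 + m / R_inf.  With T0 in microseconds and
    R_inf in MB/s, bytes / (MB/s) conveniently comes out in microseconds."""
    return t0_us + m_bytes / r_inf_mbps

def half_length_bytes(t0_us, r_inf_mbps):
    """Message size at which startup time equals transfer time (m = T0 * R_inf):
    below it latency dominates, above it bandwidth dominates."""
    return t0_us * r_inf_mbps

# hypothetical link: 50 us startup latency, 100 MB/s asymptotic bandwidth
t = comm_time_us(1_000_000, 50, 100)   # 50 + 10000 = 10050 us for 1 MB
m_half = half_length_bytes(50, 100)    # 5000 bytes
```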

Communication overhead (3) Measurement of overall communication overhead


7.4.4 Performance-price ratio

• Price = raw material cost + direct cost + gross profit + discount
• Performance/cost ratio of a machine: the performance (often expressed in MIPS or MFLOPS) achieved per unit cost (often expressed in millions of dollars).
• A high performance/cost ratio means the cost is used effectively; cost effectiveness is measured by utilization.
• Machine utilization (Utilization): the ratio of achievable speed to peak speed.
• Cost effectiveness drives technology selection: supercomputer vs. workstation cluster.

7.4.5 Multiprocessor Performance Laws

Speedup ratio of a multi-processor system: For a given application, how many times faster the execution speed of a parallel algorithm (or parallel program) is compared to the execution speed of a serial algorithm (or serial program).
Different speedup laws apply under different conditions:
• Amdahl's law: applies when the computational load is fixed
• Gustafson's law (1988): applies when the problem size is scalable
• Sun & Ni's law: applies when memory is the limiting factor

Parameter conventions
• P: number of processors in the parallel system;
• W: problem scale, the total computational load;
• Ws: the serial component of the application; Wp: the parallelizable component of the application;
• f: the fraction of the serial component; 1-f: the fraction of the parallelizable component;
• Ts: serial execution time; Tp: parallel execution time;
• S: speedup; E: efficiency




Amdahl's law

• Starting point: in application types with strict real-time requirements, the computational load is fixed. By distributing this fixed load across multiple processors, adding processors speeds up execution.
• Amdahl's law focuses on reducing the time of a given fixed-size problem. It states that the sequential part of the problem (algorithm) limits the total speedup that can be achieved as system resources increase.
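In the standard statement of Amdahl's law (with the serial fraction f and processor count p from the parameter conventions above), the speedup is S = 1/(f + (1-f)/p), capped at 1/f. A sketch:

```python
def amdahl_speedup(f, p):
    """Amdahl's law for a fixed workload with serial fraction f on p processors:
    S = 1 / (f + (1 - f) / p); as p grows, S is capped at 1 / f."""
    return 1.0 / (f + (1.0 - f) / p)

s = amdahl_speedup(0.1, 10)       # 1 / 0.19, about 5.26
cap = amdahl_speedup(0.1, 10**9)  # approaches 1 / 0.1 = 10
```

Even with a billion processors, a 10% serial fraction limits the speedup to just under 10.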

Gustafson's Law

• Starting point: in application types that demand high accuracy, the amount of computation must be increased to improve accuracy, and the number of processors must be increased correspondingly to keep the execution time unchanged.
• For this type of application, it is common to first run a coarse-grained version on small-scale computing resources as a debugging test, and then run the fine-grained version on large-scale computing resources as the production run.
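In the standard statement of Gustafson's law (using the same f and p conventions), the scaled speedup is S = f + (1-f)·p = p - f·(p-1). A sketch:

```python
def gustafson_speedup(f, p):
    """Gustafson's law for a workload scaled with p processors, serial
    fraction f:  S = f + (1 - f) * p = p - f * (p - 1)."""
    return f + (1.0 - f) * p

s = gustafson_speedup(0.1, 10)    # 9.1: near-linear scaled speedup
```

Unlike Amdahl's cap of 1/f, the scaled speedup grows linearly in p, reflecting the assumption that the parallel part of the workload grows with the machine.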


Sun&Ni's Law
Starting point: Amdahl's law and Gustafson's law each represent an extreme case. To unify and generalize the two, Xian-He Sun (Sun Xianhe), a professor in the Department of Computer Science at the Illinois Institute of Technology, and Lionel Ni (Ni Mingxuan) proposed the Sun & Ni law in 1993.
• Neither Amdahl's law nor Gustafson's law places any restriction on the number of processors or the storage capacity. In practice, in a parallel system composed of multiple nodes, the scale of the solvable problem is limited by the storage capacity.
• The basic idea: as long as storage space permits, the problem size should be increased as much as possible to produce a better, more accurate solution (even if this slightly increases execution time); the goals are greater speedup, higher solution accuracy, and better resource utilization.

Parameter convention
• In a parallel system with P nodes, each node has storage capacity M, so the total storage capacity is PM.
• When the problem is scaled from one node to p nodes, the storage requirement grows by a factor of p and the parallel workload grows by a factor of G(p): the workload expands from W = fW + (1-f)W to W' = fW + (1-f)G(p)W.
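From the scaled workload W' = fW + (1-f)G(p)W, the standard memory-bounded speedup of the Sun & Ni law is S = (f + (1-f)G(p)) / (f + (1-f)G(p)/p); it reduces to Amdahl's law when G(p) = 1 and to Gustafson's law when G(p) = p. A sketch:

```python
def sun_ni_speedup(f, p, G):
    """Sun & Ni memory-bounded speedup with scaling function G(p):
    S = (f + (1 - f) * G(p)) / (f + (1 - f) * G(p) / p).
    G(p) = 1 recovers Amdahl's law; G(p) = p recovers Gustafson's law."""
    g = G(p)
    return (f + (1 - f) * g) / (f + (1 - f) * g / p)

amdahl_case = sun_ni_speedup(0.1, 10, lambda p: 1)     # 1 / 0.19, about 5.26
gustafson_case = sun_ni_speedup(0.1, 10, lambda p: p)  # 9.1
```

A superlinear G(p) models problems whose computation grows faster than their memory footprint, giving speedups above Gustafson's line.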



Origin blog.csdn.net/weixin_61197809/article/details/134511606