2022 National Graduate Mathematical Modeling Competition Huawei Cup Question D PISA Architecture Chip Resource Arrangement Problem Solving Process Documents and Procedures

2022 National Graduate Mathematical Modeling Competition Huawei Cup

Question D: PISA architecture chip resource arrangement problem

Original problem statement:

  1. Background introduction
Chips are the foundation of the electronics industry. In today's increasingly complex international situation, chips have become a key technology that all major countries compete for. This problem focuses on switching chips in the field of network communications. Traditional switching chips have fixed functions: when a new network protocol appears, the chip must be redesigned, and it often takes several years from chip design to deployment. Fixed-function switching chips therefore greatly reduce research and development efficiency. Programmable switching chips were created to solve this problem. PISA (Protocol Independent Switch Architecture) is one of the current mainstream programmable switching chip architectures. It offers a processing rate comparable to that of fixed-function switching chips while also being programmable, and it has broad application prospects in future networks [1-2].
Before further explaining the PISA architecture, we first clarify a few basic concepts:
  1. Message: A message is a data packet transmitted in network communication. The data sent by the user is encapsulated into individual data packets for transmission.
  2. Basic block: A basic block is a fragment of the source program. Dividing the source program yields a set of such fragments. How to divide the program into basic blocks is also a question worth discussing, but it is beyond the scope of this problem and is not elaborated here.
  3. Pipeline: A pipeline is composed of a series of processing units connected in series. A message passes through each processing unit in sequence until processing is complete. Each stage (level) of the pipeline refers to one processing unit in the pipeline.
  The PISA architecture is shown in Figure 1. It includes three components: message parsing (Parser), the multi-level message processing pipeline (Pipeline Packet Process), and message reassembly (Deparser). Message parsing identifies message types. The multi-level message processing pipeline modifies message data; in actual PISA architecture chips, the number of pipeline levels may differ from chip to chip. Message reassembly reassembles the message. This problem focuses only on the multi-level message processing pipeline.
  In the PISA programming model, the user describes the message processing behavior in the P4 language to obtain a P4 program, and the compiler then compiles the P4 program into machine code that can be executed on the chip. When compiling a P4 program, the compiler first divides it into a series of basic blocks and then arranges each basic block into a level of the pipeline. Since each basic block occupies a certain amount of chip resources, arranging the basic blocks into the pipeline levels means arranging the resources of each basic block into the pipeline levels (that is, it is necessary to determine which level of the pipeline each basic block is arranged in). We therefore call the basic block arrangement problem the PISA architecture chip resource arrangement problem.
  In the actual design of PISA architecture chips, in order to reduce wiring complexity, there are often various constraints on the resources within each pipeline level and on the resources shared between pipeline levels. This series of complex resource constraints makes the resource arrangement problem particularly difficult. However, all chip resources are limited, and higher resource utilization means that the chip's capabilities are better exploited and more services can be supported. Resource arrangement algorithms with high resource utilization are therefore crucial for compiler design.

Figure 1 PISA architecture diagram
  2. Problem Description
  From the definition of a basic block, a basic block is a fragment of the source program. Instructions are executed in the basic block to complete the computation. When an instruction is executed, its source operands are read (that is, the variables corresponding to the source operands are read), the computation is performed, and the result is assigned to the destination operand (that is, the variable corresponding to the destination operand is written). After division into basic blocks, the instructions in each basic block are executed in parallel. Execution follows a read-first-then-write order (determined by the underlying implementation of the chip): all source operands are first read at the same time, then all instructions are computed in parallel, and finally the results are assigned to the destination operands. Because writing the same variable several times in parallel would cause a conflict, each basic block writes any given variable at most once (that is, no two instructions in a basic block assign values to the same variable).
  A basic block can be abstracted into a node. After abstraction, the specific instructions executed in the basic block are hidden, and only the information about the variables read and written is retained. When execution can jump from basic block A to basic block B, a directed edge is added from A to B, so that the P4 program can be expressed as a directed acyclic graph (a P4 program has no cycles), called the P4 program flow graph, as shown in the left part of Figure 2. PISA architecture resource arrangement means arranging each node (each basic block) of the P4 program flow graph into the levels of the pipeline while satisfying the constraints.
  The constraints come from two sources. The first is the P4 program itself. Each basic block of the P4 program writes some variables (that is, assigns values to them) and reads some variables. The reading and writing of variables creates data dependencies between basic blocks. In addition, after a basic block is executed, execution may jump to one of several basic blocks, which creates control dependencies between basic blocks. Data dependencies and control dependencies constrain the relative order of the pipeline levels in which basic blocks are arranged. For detailed descriptions of data dependencies and control dependencies, see Appendix A. The second source is the resource limits of the chip. The resources in the chip fall into four categories: TCAM, HASH, ALU, and QUALIFY (Appendix B explains the functions of the four categories for interested students; not understanding them does not affect answering the question). The pipeline places strict limits on these four types of resources (the specific limits are given in the questions), and the resource limits of the chip must not be violated when arranging resources.
  In this problem, the input data gives the adjacency relationships of the basic blocks in the P4 program flow graph, the amounts of the four types of resources occupied by each basic block, and the variables read and written by each basic block. The problem contains two questions. Each requires students to arrange resources while satisfying the data dependencies, the control dependencies, and the resource constraints specific to that question, and to take the optimization objective of that question fully into account in order to maximize chip resource utilization.
  To help you further understand the problem, a simple resource arrangement example is given in Appendix C.
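The read-first-then-write rule described above can be illustrated with a small sketch. The function name `run_basic_block` and the instruction encoding (destination, function, list of source names) are assumptions made for illustration only; they are not part of the competition material.

```python
def run_basic_block(env, instructions):
    """Execute all instructions of one basic block with read-then-write semantics.

    env:          dict mapping variable name -> value.
    instructions: list of (dest, func, sources) tuples (hypothetical encoding).
    """
    dests = [dest for dest, _, _ in instructions]
    # A basic block may write each variable at most once.
    if len(dests) != len(set(dests)):
        raise ValueError("a basic block may write each variable only once")
    # Phase 1: read every source operand from the state before the block runs.
    operands = [[env[name] for name in sources] for _, _, sources in instructions]
    # Phase 2: compute all results; no instruction sees another one's result.
    results = [func(*args) for (_, func, _), args in zip(instructions, operands)]
    # Phase 3: write all destination operands.
    new_env = dict(env)
    for dest, value in zip(dests, results):
        new_env[dest] = value
    return new_env


# Swapping two variables needs no temporary because all reads happen first.
state = run_basic_block(
    {"X0": 1, "X1": 2},
    [("X0", lambda a: a, ["X1"]), ("X1", lambda a: a, ["X0"])],
)
```

Because the reads are taken from the state before the block runs, the order in which the instructions are listed does not matter.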


Figure 2 Schematic diagram of PISA architecture resource arrangement

  3. Input data description
  The input data consists of three attachments, which give the resource usage of each basic block, the variables read and written by each basic block, and the adjacent basic blocks of each basic block in the flow graph. The format of each attachment is as follows:
  (1) attachment1.csv: resource information table for each basic block. The first column is the basic block number, and the second to fifth columns are the amounts of the four types of resources (TCAM, HASH, ALU, QUALIFY) used by the basic block. For example, in the sample table (an image in the original, not reproduced here), basic block No. 0 occupies 2 ALU resources, and basic block No. 4 occupies 10 ALU resources and 3 QUALIFY resources.
  (2) attachment2.csv: table of the variables read and written by each basic block. The first column is the basic block number. In the second column, "W" indicates writing and "R" indicates reading. The subsequent columns list the variables written or read by the basic block. When a row has no elements from the third column onward, the basic block does not write (or read) any variables (in that case the basic block merely serves as an intermediate block connecting other basic blocks and performs no computation). For example, basic block No. 0 writes variables X0 and X1 but does not read any variables; basic block No. 1 neither writes nor reads any variables.
  (3) attachment3.csv: table of the adjacent basic blocks of each basic block in the flow graph. The first column is the basic block number, and the subsequent columns are the numbers of the basic blocks adjacent to it, that is, in the flow graph (as shown in the left part of Figure 2) there is a directed edge from this basic block to each basic block in the subsequent columns. For example, the first row of the sample table shows directed edges from basic block No. 0 to basic block No. 1 and to basic block No. 2 (after basic block No. 0 is executed, execution can jump to basic block No. 1 or basic block No. 2). The third row shows that no edge starts from basic block No. 2 (the program ends after basic block No. 2 is executed and does not jump to any other basic block). From this file you can determine the execution order of the basic blocks in the source program, determine the destination basic blocks that each basic block can jump to, and then build the flow graph of the basic blocks.
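As a minimal sketch of how the three attachment formats could be read, the helpers below parse CSV text into dictionaries. The function names are hypothetical, the column order TCAM, HASH, ALU, QUALIFY is taken from the order in which the text lists the resources, and the files are assumed to have no header row. The sample rows only mirror the examples quoted in the text and are not the real data.

```python
import csv
import io


def _rows(text):
    """Yield non-empty CSV rows with whitespace and empty cells removed."""
    for row in csv.reader(io.StringIO(text)):
        cells = [c.strip() for c in row if c.strip()]
        if cells:
            yield cells


def parse_resources(text):
    """attachment1 layout: block, TCAM, HASH, ALU, QUALIFY (assumed order)."""
    return {int(r[0]): tuple(int(x) for x in r[1:5]) for r in _rows(text)}


def parse_read_write(text):
    """attachment2 layout: block, 'W' or 'R', then zero or more variable names."""
    table = {}
    for r in _rows(text):
        entry = table.setdefault(int(r[0]), {"W": set(), "R": set()})
        entry[r[1]].update(r[2:])
    return table


def parse_adjacency(text):
    """attachment3 layout: block, then zero or more successor blocks."""
    return {int(r[0]): [int(x) for x in r[1:]] for r in _rows(text)}


# Illustrative rows modelled on the examples in the text (not the real files).
resources = parse_resources("0,0,0,2,0\n4,0,0,10,3\n")
read_write = parse_read_write("0,W,X0,X1\n0,R\n1,W\n1,R\n")
adjacency = parse_adjacency("0,1,2\n1\n2\n")
```

To read the real files, the same functions can be given the file contents, for example `parse_adjacency(open(path, encoding="utf8").read())`, where `path` points to attachment3.csv.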

  4. Problems
  This problem requires establishing a mathematical model for the resource arrangement problem and, on that basis, solving the following two questions.
  Question 1: The given resource constraints are as follows:
  (1) The maximum number of TCAM resources at each level of the pipeline is 1;
  (2) The maximum number of HASH resources at each level of the pipeline is 2;
  (3) The maximum number of ALU resources at each level of the pipeline is 56;
  (4) The maximum number of QUALIFY resources at each level of the pipeline is 64;
  (5) Levels 0 and 16, levels 1 and 17, ..., levels 15 and 31 of the pipeline are defined as folded levels. For each pair of folded levels, the total TCAM resources are at most 1 and the total HASH resources are at most 3. Note: if more than 32 pipeline levels are needed, the levels from the 32nd onward are not subject to the folding limits;
  (6) The number of even-numbered levels that use TCAM resources does not exceed 5;
  (7) Each basic block can be arranged in only one level.
  The resources are to be arranged under the above constraints, and the optimization objective is to occupy as few pipeline levels as possible. Please give the resource arrangement algorithm and output the basic block arrangement result. The output format is given as a table in the original problem statement (an image, not reproduced here).
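The seven constraints of Question 1 can be checked mechanically for any candidate arrangement. The following is a minimal sketch of such a checker; `check_question1` and its data layout are hypothetical names chosen for illustration, and applying constraint (6) to all even-numbered levels, including those from level 32 onward, is an assumption because the text does not say otherwise.

```python
ORDER = ("TCAM", "HASH", "ALU", "QUALIFY")
LEVEL_CAP = {"TCAM": 1, "HASH": 2, "ALU": 56, "QUALIFY": 64}  # constraints (1)-(4)
FOLD_CAP = {"TCAM": 1, "HASH": 3}                              # constraint (5)
MAX_EVEN_TCAM_LEVELS = 5                                       # constraint (6)


def check_question1(level_of, resources):
    """Return a list of violated constraints; an empty list means feasible.

    level_of:  dict block -> level (one level per block, constraint (7)).
    resources: dict block -> (TCAM, HASH, ALU, QUALIFY) demand.
    """
    usage = {}
    for block, level in level_of.items():
        row = usage.setdefault(level, dict.fromkeys(ORDER, 0))
        for name, amount in zip(ORDER, resources[block]):
            row[name] += amount

    def used(level, name):
        return usage.get(level, {}).get(name, 0)

    violations = []
    # Constraints (1)-(4): per-level caps.
    for level in sorted(usage):
        for name in ORDER:
            if used(level, name) > LEVEL_CAP[name]:
                violations.append(f"level {level}: {name} over per-level cap")
    # Constraint (5): folded pairs (0, 16), ..., (15, 31); levels >= 32 are exempt.
    for low in range(16):
        for name, cap in FOLD_CAP.items():
            if used(low, name) + used(low + 16, name) > cap:
                violations.append(f"levels {low}/{low + 16}: {name} over folded cap")
    # Constraint (6): at most 5 even-numbered levels may use TCAM
    # (counted over all levels here, which is an assumption).
    even_tcam = sum(1 for lv in usage if lv % 2 == 0 and used(lv, "TCAM") > 0)
    if even_tcam > MAX_EVEN_TCAM_LEVELS:
        violations.append("more than 5 even-numbered levels use TCAM")
    return violations
```

For example, two blocks that each need one TCAM unit are feasible at levels 0 and 1 but not at levels 0 and 16, because those two levels form a folded pair.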
  Question 2: Consider the flow graph shown in Figure 3. Basic block 2 and basic block 3 are not on the same execution flow (after basic block 1 is executed, either basic block 2 or basic block 3 is executed; they cannot both be executed). Precisely, in the P4 program flow graph, if one basic block can reach another basic block, the two are on the same execution flow; otherwise they are not. Basic blocks that are not on the same execution flow can share HASH resources and ALU resources. As long as the HASH and ALU resources of each of basic blocks 2 and 3 on its own do not exceed the per-level limits, basic blocks 2 and 3 can be arranged in the same level. Accordingly, constraints (2), (3), and (5) of Question 1 are changed as follows:
  (2) In each level of the pipeline, the sum of the HASH resources of basic blocks on the same execution flow is at most 2;
  (3) In each level of the pipeline, the sum of the ALU resources of basic blocks on the same execution flow is at most 56;
  (5) For the two levels of a folded pair, the TCAM constraint is unchanged. For HASH resources, each level separately computes the HASH resources occupied by basic blocks on the same execution flow, the results of the two levels are then added, and the sum must not exceed 3.
  The other constraints are the same as in Question 1. With the changed resource constraints, reconsider Question 1, give the arrangement algorithm, and output the basic block arrangement result. The output format is the same as in Question 1.

Figure 3 Flow graph example
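One possible reading of the modified constraints (2) and (3) is that, for each level, the largest total demand along any single execution path must stay within the cap. Under that assumption, the effective demand of a level is a heaviest-path computation on the flow graph, where blocks outside the level weigh zero. The sketch below uses a hypothetical function name, and the small graph with a join block 4 is made up for illustration because Figure 3 is not reproduced here.

```python
from collections import deque


def max_demand_on_one_flow(successors, demand, blocks_in_level):
    """Largest total demand that any single execution flow places on one level.

    successors:      dict block -> list of successor blocks (a DAG).
    demand:          dict block -> HASH (or ALU) units needed by the block.
    blocks_in_level: set of blocks arranged in the level being checked.
    """
    nodes = set(successors) | {s for succ in successors.values() for s in succ}
    indegree = {n: 0 for n in nodes}
    for succ in successors.values():
        for s in succ:
            indegree[s] += 1
    # Kahn's algorithm visits the DAG in topological order.
    queue = deque(n for n in nodes if indegree[n] == 0)
    best = {n: 0 for n in nodes}  # heaviest in-level demand on a path ending before n
    answer = 0
    while queue:
        n = queue.popleft()
        weight = demand.get(n, 0) if n in blocks_in_level else 0
        total = best[n] + weight
        answer = max(answer, total)
        for s in successors.get(n, []):
            best[s] = max(best[s], total)
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    return answer


# Block 1 branches to 2 or 3; block 4 is an assumed join block.
successors = {1: [2, 3], 2: [4], 3: [4], 4: []}
hash_demand = {1: 1, 2: 2, 3: 2, 4: 0}
shared = max_demand_on_one_flow(successors, hash_demand, {2, 3})
with_block_1 = max_demand_on_one_flow(successors, hash_demand, {1, 2, 3})
```

In this example blocks 2 and 3 together count as 2 HASH units rather than 4, so they fit in one level, while adding block 1 to the same level raises the demand on a single flow to 3.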

Overview of the overall solution process (abstract)

  As the global chip shortage continues to develop, chip technology, described as the "food of industry", has become one of the fields in which China urgently needs a breakthrough. As a switching chip architecture that combines good processing speed with programmability, PISA effectively alleviates the low R&D efficiency of traditional fixed-function switching chips. To make full use of the chip's capabilities, choosing a resource arrangement algorithm with high resource utilization is particularly important. This paper therefore decomposes the complex basic block resource arrangement problem into several sub-problems, combines the various constraints, and solves it step by step by establishing a goal programming model and a basic block arrangement model. It then designs heuristic rules to optimize the solution and finally takes the arrangement with the fewest pipeline levels as the answer.
  Regarding Question 1, this paper divides it into two sub-problems: dependency analysis and basic block resource arrangement. For sub-problem 1, the data in attachment3.csv is analyzed to obtain the control flow graph (CFG) and its forward dominator tree (FDT). The difference between the CFG and the FDT gives the control dependencies of the basic blocks. Next, the read-write relationships of the basic blocks in attachment2.csv are read, and breadth-first search (BFS) is used to determine the connectivity between basic blocks and hence the data dependencies. The complete dependency relationship is the union of the control dependencies and the data dependencies. For sub-problem 2, the complete dependency relationship is abstracted into a weighted directed acyclic graph, which gives an initial ordering of the pipeline levels of the basic blocks. Then, with the primary objective of occupying as few pipeline levels as possible, a goal programming model is established from the basic block resource requirements in attachment1.csv and the given resource constraints. To solve the model, a basic block arrangement model is designed. Following the principle of serial arrangement, a basic block is selected by a random rule from the basic blocks with no unarranged predecessors and placed into the earliest pipeline level that can hold it, until all basic blocks are arranged. The result for Question 1 is 40 occupied pipeline levels.
  Regarding Question 2, basic blocks that are not on the same execution flow can share HASH resources and ALU resources. Question 2 thus introduces, on top of Question 1, the concept of basic blocks that are not on the same execution flow, which relieves the HASH and ALU pressure seen in Question 1. This paper decomposes Question 2 into two sub-problems: determining whether two basic blocks are on the same execution flow, and arranging basic block resources. For sub-problem 1, based on the program flow graph, breadth-first search is used to determine whether two basic blocks are on the same execution flow, that is, whether one basic block can be reached from the other, which yields a symmetric matrix of execution-flow relationships between basic blocks. For sub-problem 2, the primary objective is again to occupy as few pipeline levels as possible. Taking into account that basic blocks not on the same execution flow can share HASH and ALU resources, together with the three changed constraints, a goal programming model similar to that of Question 1 is established. During solving, whenever a new basic block is arranged into a pipeline level, the HASH and ALU requirements of basic blocks on different control flows in that level are merged. The result for Question 2 is 34 occupied pipeline levels.
  For solution optimization, the instability of random rule selection is taken into account and the basic block selection rule is optimized. This paper constructs 6 heuristic rules, including earliest start time (EST), most following resource blocks (MST), and the combined rule EST_MST, and selects the basic block to add to the pipeline according to these rules. The best result among 10 random rules and 6 heuristic rules is taken as the final pipeline arrangement. After optimization, Question 1 occupies 40 pipeline levels and Question 2 occupies 34. Finally, this paper finds that the bottleneck of Question 1 lies in HASH resources and that of Question 2 lies in TCAM resources. By jointly considering control dependencies, data dependencies, and the resource constraints of every pipeline level, the effectiveness and generality of the proposed solution are demonstrated. The advantages and disadvantages of the proposed algorithms and models are summarized, and directions for future research are outlined.
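The serial arrangement idea in the abstract (pick a ready basic block by some rule, put it in the earliest level that can hold it, and keep the best of several runs) can be sketched as follows. This is not the authors' implementation: all names are hypothetical, only simple per-level caps are modelled (no folded-level or even-level rules), the encoding of a dependency as a minimum level gap (0 for "not earlier than", 1 for "strictly later") is an assumption, and `est_rule` is only one plausible reading of the EST heuristic.

```python
import random


def earliest_level(block, preds, level_of):
    """Earliest level allowed by the already arranged predecessors of a block."""
    return max((level_of[p] + gap for p, gap in preds.get(block, [])), default=0)


def serial_schedule(blocks, preds, demand, caps, pick):
    """Arrange blocks one at a time into the earliest level with enough capacity."""
    for b in blocks:
        if any(d > c for d, c in zip(demand[b], caps)):
            raise ValueError(f"block {b} cannot fit in any level")
    level_of, used = {}, {}
    remaining = set(blocks)
    while remaining:
        # A block is ready when all of its predecessors have been arranged.
        ready = sorted(b for b in remaining
                       if all(p in level_of for p, _ in preds.get(b, [])))
        if not ready:
            raise ValueError("dependencies contain a cycle")
        earliest = {b: earliest_level(b, preds, level_of) for b in ready}
        block = pick(ready, earliest)
        level = earliest[block]
        while True:
            current = used.get(level, (0,) * len(caps))
            after = tuple(u + d for u, d in zip(current, demand[block]))
            if all(a <= c for a, c in zip(after, caps)):
                break
            level += 1
        used[level] = after
        level_of[block] = level
        remaining.remove(block)
    return level_of


def random_rule(seed):
    """Selection rule that picks a ready block uniformly at random."""
    rng = random.Random(seed)
    return lambda ready, earliest: rng.choice(ready)


def est_rule(ready, earliest):
    """Selection rule that prefers the block with the smallest earliest level."""
    return min(ready, key=lambda b: (earliest[b], b))


def best_of(blocks, preds, demand, caps, rules):
    """Run every rule and keep the arrangement that occupies the fewest levels."""
    plans = [serial_schedule(blocks, preds, demand, caps, rule) for rule in rules]
    return min(plans, key=lambda plan: max(plan.values()) + 1)


# Tiny made-up instance: one resource type with a cap of 2 per level.
preds = {1: [(0, 1)], 2: [(0, 0)], 3: [(1, 1), (2, 0)]}
demand = {b: (1,) for b in range(4)}
plan = best_of(range(4), preds, demand, caps=(2,), rules=[est_rule, random_rule(1)])
```

Passing the selection rule as a function keeps the placement loop unchanged when switching between random and heuristic rules.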

Model assumptions:

  [1] It is assumed that the experimental scenarios are all in an ideal state and are not affected by factors such as temperature and electromagnetism.
  [2] It is assumed that there are no errors caused by chip failures during the experiment.
  [3] Assume that we only focus on the multi-stage message processing pipeline part of the PISA architecture chip.
  [4] Assume that the conditions for each level of the pipeline are the same except for the resource constraints given in the question.
  [5] Assume that there is no upper limit on the number of pipeline stages that can be arranged.
  [6] It is assumed that the process of seeding random numbers in a random model is completely random.

Problem analysis:

Analysis of problem one:

  Question 1 requires considering the data dependencies and control dependencies between basic blocks and, together with the given resource constraints, arranging and optimizing the basic block resources so that as few pipeline levels as possible are occupied, in order to maximize chip resource utilization.
The problem provides attachment1.csv, attachment2.csv, and attachment3.csv, which contain the resources used by each basic block, the variables read and written by each basic block, and the adjacent basic blocks of each basic block in the flow graph.
  From the analysis of the question, we conclude that it can be broken down into the following three sub-problems:
  Sub-problem 1: Design algorithms and models to process and analyze the data in attachment2.csv and attachment3.csv, determine the control dependencies and data dependencies between the basic blocks, and thereby initially determine the level ordering of the basic blocks in the pipeline.
  Sub-problem 2: Using the resource constraints given in the question and the resource usage data in attachment1.csv, establish a model and determine the level in which each basic block is arranged in the pipeline.
  Sub-problem 3: Optimize the arrangement of basic blocks so that the number of pipeline levels they occupy is as small as possible.

Analysis of question two:

  Question 2 introduces, on top of Question 1, the concept of basic blocks that are not on the same execution flow. Such basic blocks can share HASH resources and ALU resources, so Question 2 relieves the HASH and ALU pressure of Question 1. The question again asks for the algorithm and the resulting arrangement of basic blocks in the pipeline, but under changed constraints.
  Based on this analysis, Question 2 can be broken down into the following two sub-problems:
  Sub-problem 1: Based on the program flow graph, establish a model to determine whether one basic block can reach another, that is, whether two basic blocks are on the same execution flow.
  Sub-problem 2: Taking into account that basic blocks not on the same execution flow can share HASH and ALU resources, together with the changed constraints, establish a model to determine the level of each basic block in the pipeline so that as few pipeline levels as possible are occupied.
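Sub-problem 1 amounts to a reachability test between every pair of basic blocks. The sketch below builds the symmetric same-execution-flow matrix with breadth-first search; the function names and the small example graph are hypothetical, and blocks are assumed to be numbered 0 to n - 1.

```python
from collections import deque


def reachable_from(start, successors):
    """Set of blocks reachable from start (including start) by BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def same_flow_matrix(n, successors):
    """m[i][j] == 1 when block i can reach block j or block j can reach block i."""
    reach = [reachable_from(i, successors) for i in range(n)]
    return [[int(j in reach[i] or i in reach[j]) for j in range(n)]
            for i in range(n)]


# Made-up graph: 0 -> 1, then 1 branches to 2 or 3.
matrix = same_flow_matrix(4, {0: [1], 1: [2, 3], 2: [], 3: []})
```

Here `matrix[2][3]` is 0 because neither branch block can reach the other, so under Question 2 those two blocks may share HASH and ALU resources in one level.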


Code:

Part of the Python program is as follows:

import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
import csv
import scipy as sp

# In-degree flags of the nodes of GF
inGF = []
# Step 1: DFS
dfn = []
rak = []
fa = []
search_path = []
# Step 2: semi-dominators (sdom)
val = []
bel = []
sdom = []
# Step 3: immediate dominators (idom)
idom = []
# Control dependence edges (CDG) obtained by comparison
cdg = []
idomGF = nx.DiGraph()


def read_csv3(file_name):
    # Read attachment3.csv and return the control-flow edges (block, successor)
    control_edges = []
    with open(file_name, encoding='UTF8') as f:
        for line in csv.reader(f):
            adj_num = len(line)
            if adj_num > 1:
                for i in range(1, adj_num):
                    control_edges.append((int(line[0]), int(line[i])))
    # print(control_edges)
    return control_edges


def subgraph(pointList, linkList):
    G = nx.DiGraph()
    GF = nx.DiGraph()
    # Convert to graph structures: G is the flow graph, GF is its reverse
    for node in pointList:
        G.add_node(node)
        GF.add_node(node)
    for link in linkList:
        G.add_edge(link[0], link[1])
        GF.add_edge(link[1], link[0])
    return G, GF


def dfs(GF):
    global rak
    # The root of GF is the artificially added node with the largest index
    root = len(GF.nodes) - 1
    T = nx.dfs_tree(GF, root)
    for n in GF.nodes():
        fa.append(0)
        dfn.append(n)
    rak = list(T)  # all nodes, in DFS order
    for i in range(0, len(fa)):
        dfn[rak[i]] = i
    for i in list(T.edges):  # all DFS tree edges
        fa[i[1]] = i[0]
    # print(dfn)
    # print(rak)
    # print(fa)


def find(v):
    # v has not been linked to its parent yet
    if v == bel[v]:
        return v
    tmp = find(bel[v])
    if dfn[sdom[val[bel[v]]]] < dfn[sdom[val[v]]]:
        val[v] = val[bel[v]]
    bel[v] = tmp
    return bel[v]


def tarjanFDT():
    # Initialize val, sdom, bel and idom
    for i in range(0, len(dfn)):
        val.append(i)
        sdom.append(i)
        bel.append(i)
        idom.append(i)
    # Traverse the DFS tree in decreasing DFS order
    for i in range(len(dfn) - 1, 0, -1):
        # u is the unprocessed node with the largest DFS number
        u = rak[i]
        # All predecessors of u in GF. The original called G.neighbors(u),
        # which is the same set except that it omits the added root.
        neighbors = GF.predecessors(u)
        for v in neighbors:
            find(v)
            if dfn[sdom[val[v]]] < dfn[sdom[u]]:
                sdom[u] = sdom[val[v]]
        # print(sdom)
        sdomGF = nx.DiGraph()
        for j in range(0, len(sdom)):
            sdomGF.add_node(j)
            sdomGF.add_edge(sdom[j], j)
        bel[u] = fa[u]
        u = fa[u]
        neighbors = sdomGF.neighbors(u)
        for v in neighbors:
            find(v)
            if sdom[val[v]] == u:
                idom[v] = u
            else:
                idom[v] = val[v]
    for i in range(0, len(dfn)):
        u = rak[i]
        if idom[u] != sdom[u]:
            idom[u] = idom[idom[u]]
    for i in range(0, len(idom)):
        idomGF.add_node(i)
        idomGF.add_edge(idom[i], i)
    # nx.draw_networkx(sdomGF, with_labels=True)
    # plt.show()
    # nx.draw_networkx(idomGF, with_labels=True)
    # plt.show()


def getInGF():
    # Visit all nodes of GF
    for i in range(0, len(GF.nodes)):
        # Initialize: 0 marks in-degree 0
        inGF.append(0)
    for edge in GF.edges:
        inGF[edge[1]] = 1


def addGFRoot():
    # print(len(GF.nodes))
    GF.add_node(len(GF.nodes))
    for i in range(0, len(inGF)):
        if inGF[i] == 0:
            GF.add_edge(len(GF.nodes) - 1, i)
    # nx.draw_networkx(GF, with_labels=True)
    # plt.show()


if __name__ == '__main__':
    # Read attachment3.csv
    linkList = read_csv3("./data/attachment3.csv")
    pointList = list(range(0, 607))
    # Original directed acyclic graph G and reversed graph GF
    G, GF = subgraph(pointList, linkList)
    # Find all nodes of GF with in-degree 0
    getInGF()
    # Add a root node to GF
    addGFRoot()
    # Compute the forward dominator tree (FDT) of G, i.e. the dominator tree
    # of GF, and store it in idomGF
    dfs(GF)
    tarjanFDT()
    # Compare G with the FDT: edges of G whose reverse is not in the FDT form the CDG
    for edgeG in G.edges:
        if not idomGF.has_edge(edgeG[1], edgeG[0]):
            cdg.append(edgeG)
    with open("./data/control_dependance_less_equal.txt", "w") as f:
        f.write(str(cdg))
    print(len(cdg))
    print(cdg)

Origin blog.csdn.net/weixin_43292788/article/details/132836811