Core Technologies and Processes of Chip Manufacturing

Author: Zen and the Art of Computer Programming

1. Introduction

Architecture innovation engineers are primarily responsible for the design, R&D, testing, and deployment of Huawei's independently developed system-level chips, AI processors, edge computing platforms, and related technologies. They are often senior leaders in Huawei's technology departments and are crucial to the success of Huawei's products and services. Ren Zhengfei believes that architecture innovation engineers must have strong learning ability and be good at discovering and solving new problems. He also emphasizes that R&D efficiency, development quality, stability, and security are the four major benchmarks that determine the development of a field. It is therefore worth laying out the work of architecture innovation engineers in a blog article, to build a solid foundation for subsequent training for this role.

2. Explanation of basic concepts and terms

2.1 System-on-chip

Architecture innovation engineers are usually responsible for developing various types of system-level chips, such as processors, network chips, memory chips, image processing chips, accelerator cards, integrated circuits, RFID reader chips, and so on. These chips are devices built to perform specific functions. Their power and performance requirements are demanding, so they generally need a high degree of reliability together with low power consumption, and they must also be able to meet the special needs of each application. System-level chips are the integrated circuits used in all Huawei mobile phones, laptops, servers, and various smart devices.

2.2 AI processor

The AI processor is built around artificial neural networks (ANNs), which enable capabilities such as automatic learning, intelligent decision-making, and self-improvement. These processors can complete complex image recognition, machine translation, speech synthesis, text understanding, and other advanced tasks. Huawei's AI processors have already surpassed the 500-megapixel range for camera processing. The performance of these chips has been verified, but many bottlenecks remain to be solved.

2.3 Edge computing platform

The edge computing platform is a distributed, programmable computing platform for massive data analysis. It can quickly analyze data stored across huge numbers of devices and return results within seconds, helping enterprises manage and operate massive data sets and improve productivity and efficiency. Huawei's edge computing platform initially aims to run sensors and machine learning models on a variety of edge devices: environmental data from cars, screen images from smartphones, lidar data from robots, and so on. Due to the performance limitations of AI processors, such platforms still face significant obstacles.

3. Explanation of core algorithm principles, specific operating steps and mathematical formulas

3.1 System-on-chip

3.1.1 VLIW processor architecture

A VLIW (Very Long Instruction Word) processor is an architecture that exploits instruction-level parallelism to implement complex functions efficiently. Its defining characteristic is a long instruction word: each instruction bundles multiple micro-operations, and the different micro-operations in a bundle execute in the same clock cycle. Because the compiler schedules these bundles statically, the hardware avoids much of the overhead of dynamic scheduling, which reduces resource cost and increases processing speed. The research and development of system-level chips is mainly based on the VLIW processor architecture.
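
To make the bundle idea concrete, here is a minimal sketch in Python (a toy model, not Huawei's toolchain; all register and opcode names are invented) of a VLIW machine issuing several independent micro-operations in one clock cycle:

```python
# Minimal toy model of VLIW execution: each "bundle" packs several
# independent micro-operations that all issue in the same clock cycle.
# Purely illustrative; real VLIW encodings and functional units are
# far more constrained.

from typing import Callable, Dict, List, Tuple

# Each micro-op: (opcode, destination register, source registers).
MicroOp = Tuple[str, str, List[str]]

ALU_OPS: Dict[str, Callable[[List[int]], int]] = {
    "add": lambda s: s[0] + s[1],
    "mul": lambda s: s[0] * s[1],
    "shl": lambda s: s[0] << s[1],
}

def run(bundles: List[List[MicroOp]], regs: Dict[str, int]) -> Dict[str, int]:
    for cycle, bundle in enumerate(bundles):
        # All micro-ops in a bundle read the *old* register values:
        # the compiler guaranteed they are independent.
        snapshot = dict(regs)
        for op, dst, srcs in bundle:
            regs[dst] = ALU_OPS[op]([snapshot[r] for r in srcs])
        print(f"cycle {cycle}: {regs}")
    return regs

# One bundle executes an add, a multiply and a shift in the same cycle.
run([[("add", "r2", ["r0", "r1"]),
      ("mul", "r3", ["r0", "r1"]),
      ("shl", "r4", ["r1", "r0"])]],
    {"r0": 2, "r1": 5, "r2": 0, "r3": 0, "r4": 0})
```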

A typical system-on-chip consists of the following parts:

  • Controller: responsible for scheduling the work of the other components, including instruction generation, scheduling, fetching, decoding, and execution.
  • Accelerator: executes operation instructions such as multiplication, division, addition, shifts, and comparisons.
  • Cache: stores instructions and data to speed up access.
  • I/O unit: handles input and output for peripherals.
  • On-chip storage: stores static information such as instructions, configuration parameters, and initialization state.

Different types of SoCs can be distinguished by their instruction sets, hardware configurations, data pipeline layouts, and electrical characteristics. Some system-level chips, such as image processing chips, may demand high performance yet be limited in memory and unable to support online training. There are further differences between chips: some emphasize single-core performance, while others emphasize multi-core parallelism.

3.1.2 Chip resource allocation method

Currently, chip resource allocation generally uses one of the following three methods:

  • Deterministic allocation: design each component from scratch, then set the allocation ratio based on resource usage and performance requirements. For example, the designers of the AMD Opteron series of processors used this method: they designed the processor from scratch and fixed the resource allocation ratio of each part to balance resource utilization against performance.
  • Automated allocation: SoC developers can use artificial intelligence algorithms to search automatically for the best resource allocation. For example, the developers of the NVIDIA Tesla P100 GPU adopted an automated approach, where system-level chip developers supply configuration parameters and let the tooling find the most appropriate resource allocation plan (a toy sketch of such a search appears after this list).
  • Evolutionary allocation: start from the resource allocation of earlier devices and use certain mechanisms to adjust the allocation strategy for the current device. Intel's processor designers, for example, have evolved each generation's resource allocation from that of its predecessor.
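
As a toy illustration of the automated approach, the sketch below (Python, with an invented performance model and arbitrary numbers) randomly searches for a split of a fixed silicon-area budget among compute, cache, and I/O:

```python
# Toy automated resource allocation: random search over how a fixed
# silicon-area budget is split among compute, cache and I/O blocks.
# The performance model and all numbers are invented for illustration.

import random

AREA_BUDGET = 100.0  # arbitrary area units

def performance(compute: float, cache: float, io: float) -> float:
    # Made-up diminishing-returns model: performance grows sub-linearly
    # with each resource, and starves if any resource is near zero.
    return (compute ** 0.6) * (cache ** 0.25) * (io ** 0.15)

def random_split() -> tuple:
    a, b = sorted(random.uniform(0, AREA_BUDGET) for _ in range(2))
    return a, b - a, AREA_BUDGET - b  # three parts summing to the budget

best_alloc, best_perf = None, -1.0
for _ in range(10_000):
    alloc = random_split()
    perf = performance(*alloc)
    if perf > best_perf:
        best_alloc, best_perf = alloc, perf

print(f"best split (compute, cache, io): "
      f"{tuple(round(x, 1) for x in best_alloc)}")
```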

3.1.3 Reliability and security

Reliability and security are two key factors for SoCs: a design flaw or runtime fault can compromise the chip's correctness. Generally speaking, the reliability and security of system-level chips can be considered from the following aspects:

  • Fault tolerance: a system-on-chip should have good fault tolerance, i.e. the ability to recover from a fault. For example, Intel's Skylake processors have strong fault tolerance: their interface specifications, protocols, drivers, BIOS, and so on all include measures for handling potential failures.
  • Trustworthiness: the system-level chip should be trustworthy and able to prevent malicious attacks and data tampering. For example, Intel's Arria 10 GX FPGA development team adopted a Trusted Boot mechanism that allows the FPGA to operate only after legitimate firmware has been loaded, thus preventing malicious attacks.
  • Security: system-level chip security can also be enforced through dedicated mechanisms. For example, NVIDIA's Jetson TX2 development team adopted a Secure Boot mechanism that allows the processor to run only after firmware signature authentication, preventing malicious tampering (a simplified sketch follows this list).
  • Performance: the performance of the system-on-chip should be good enough to support the various needs of the application. For example, Intel's Ice Lake processors include the powerful AVX-512 instruction set, which supports high-performance image processing tasks.
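
To illustrate the spirit of such a secure-boot check, here is a deliberately simplified Python sketch: it accepts a firmware image only if its SHA-256 digest matches a trusted reference. Real secure boot verifies an asymmetric signature chain rooted in hardware; the plain digest comparison here is an assumption made for brevity.

```python
# Simplified secure-boot-style check: boot only if the firmware image
# hashes to a digest we already trust. Real secure boot instead verifies
# a cryptographic signature chain rooted in hardware fuses.

import hashlib
import hmac

# Digest that would have been provisioned at manufacturing time.
TRUSTED_DIGEST = hashlib.sha256(b"official firmware v1.0").hexdigest()

def verify_and_boot(firmware_image: bytes) -> bool:
    digest = hashlib.sha256(firmware_image).hexdigest()
    # Constant-time comparison to avoid leaking how many bytes matched.
    if hmac.compare_digest(digest, TRUSTED_DIGEST):
        print("check passed, booting firmware")
        return True
    print("firmware rejected: digest mismatch")
    return False

verify_and_boot(b"official firmware v1.0")   # accepted
verify_and_boot(b"tampered firmware")        # rejected
```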

3.2 AI processor

3.2.1 Deep learning and neural network

Deep learning is a branch of machine learning that uses multi-layer neural networks. An artificial neural network (ANN) is a network structure loosely inspired by biological neurons: a multi-layer network in which each layer consists of multiple neuron nodes. A typical ANN consists of an input layer, one or more hidden layers, and an output layer. The input layer receives external signals, the hidden layers transform the signal and pass it forward layer by layer, and the output layer produces the final output.

Through these connections, an ANN can accept input, process information, and produce output. Unlike traditional classifiers, an ANN can model highly non-linear relationships, which is what enables complex image recognition and natural language processing.
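
A minimal sketch of such a feedforward network, in Python with NumPy (random, untrained weights purely for illustration):

```python
# Minimal feedforward ANN: input layer -> one hidden layer -> output layer.
# Weights are random here purely for illustration; a real network would
# learn them by backpropagation.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes: 4 inputs, 8 hidden neurons, 3 outputs.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

def forward(x: np.ndarray) -> np.ndarray:
    hidden = sigmoid(x @ W1 + b1)        # hidden layer activations
    logits = hidden @ W2 + b2            # output layer pre-activations
    exp = np.exp(logits - logits.max())  # softmax over the outputs
    return exp / exp.sum()

print(forward(np.array([0.5, -1.2, 3.0, 0.1])))  # probabilities over 3 classes
```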

3.2.2 Acceleration method

Currently, Huawei's AI processors use a VLIW processor architecture similar to that of modern CPUs, with the difference that some of their processing units are specialized AI accelerators. Therefore, to fully exploit the performance of an AI processor, the workload must first be optimized for the processor.

Currently, in Huawei's AI processors, both the GPU and the TPU adopt a multi-core processor architecture similar to ARM's big.LITTLE. The GPU usually contains two kinds of cores, which handle floating-point operations, integer operations, and graphics processing tasks respectively. It is also paired with high-bandwidth memory ("HBM") that serves as the GPU's local memory. The GPU supports multi-threading, can execute many different tasks in parallel, and can use Tensor Core units to accelerate matrix multiplication. The TPU contains a core that runs integer arithmetic and graphics processing tasks but has no floating-point unit; it focuses on inference, so its raw operation speed is lower than that of the GPU.

Beyond these architectural differences, AI processors differ in many other ways. For example, the floating-point performance of the GPU is higher than that of the TPU. Huawei's AI processor developers must therefore select the appropriate processor type for each workload to achieve optimal performance.
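
As a rough illustration of why matrix multiplication is the target of such accelerators, the sketch below performs a blocked (tiled) matrix multiply in Python; units like Tensor Cores effectively compute one small tile product per hardware step, and the 4x4 tile size here is an arbitrary assumption:

```python
# Blocked matrix multiply: the core pattern that Tensor-Core-style units
# accelerate by computing one small tile product per hardware step.
# NumPy stands in for the hardware; the 4x4 tile size is illustrative.

import numpy as np

TILE = 4

def blocked_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % TILE == m % TILE == k % TILE == 0
    C = np.zeros((n, m))
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):
                # One tile product: what an accelerator does in one step.
                C[i:i+TILE, j:j+TILE] += (
                    A[i:i+TILE, p:p+TILE] @ B[p:p+TILE, j:j+TILE]
                )
    return C

A = np.random.rand(8, 8)
B = np.random.rand(8, 8)
assert np.allclose(blocked_matmul(A, B), A @ B)  # matches a direct multiply
```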

3.2.3 Development and application of machine learning

With the development of artificial intelligence technology, more and more researchers are paying attention to machine learning. Machine learning is a data-driven, algorithm-guided technology for training and prediction. It can be applied to various fields, such as image recognition, natural language processing, predictive analysis, recommendation systems, advertising ranking, etc. At present, machine learning has become a hot topic in academia and industry, and many domestic and foreign companies have invested heavily in related research.

Researchers in the field of machine learning are constantly exploring new algorithms, new models, and trying to find a suitable machine learning framework. In this process, they encountered many problems, such as over-fitting, under-fitting, label noise, etc. While solving these problems, they are also facing new challenges, such as model compression, improvement of model inference efficiency, and model training on heterogeneous computing clusters.

In recent years, people have been increasingly inclined to use statistical learning methods to solve machine learning problems. Statistical learning is a sub-branch of machine learning that uses statistical methods to estimate model parameters and optimization methods to minimize error. It can train and predict on large amounts of data and achieve good performance. Statistical learning has become a consensus approach in academia and industry, and major universities and companies are committed to promoting its application.
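
For instance, ordinary least squares, perhaps the simplest statistical-learning method, estimates model parameters by minimizing squared error. A small Python example on synthetic data:

```python
# Ordinary least squares: estimate parameters by minimizing squared error.
# Data are synthetic; the true line is y = 2x + 1 plus noise.

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)

# Design matrix with a bias column; solve argmin_w ||Xw - y||^2.
X = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"estimated slope={w[0]:.2f}, intercept={w[1]:.2f}")  # ~2.00, ~1.00
```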

3.3 Edge computing platform

3.3.1 Challenges of big data processing

Currently, big data processing faces multiple challenges, including the collection, storage, analysis, and processing of massive data. In the face of these challenges, edge computing platforms must be able to cope with emergencies and maintain availability. Edge computing platforms are mainly used for application scenarios that occur far from central data centers, often kilometers away, such as monitoring, security, drive testing, and video analysis.

The goal of the edge computing platform is to efficiently analyze and process massive data under wireless access, low bandwidth, and weak coverage conditions. Therefore, edge computing platforms must be designed to withstand large delays, poor connection quality, and high pressure during peak hours. Under such circumstances, edge computing platforms also need to deploy as few nodes as possible in order to maintain high availability while reducing resource consumption and costs.

On the other hand, to prevent data leakage, edge computing platforms must encrypt data and restrict access to only authorized applications. In addition, edge computing platforms must have high processing efficiency, respond to requests in real time, and be able to handle massive amounts of data.
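
A minimal sketch of the encrypt-before-storing idea, using the Fernet authenticated-encryption scheme from the widely used Python cryptography package (key handling is simplified here; a real platform would use a hardware-backed key store):

```python
# Encrypt sensor data before it leaves or is stored on an edge node.
# Fernet (from the "cryptography" package) provides authenticated
# symmetric encryption; key management is simplified for this sketch.

from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()      # in practice: provisioned, hardware-backed
cipher = Fernet(key)

reading = b'{"sensor": "lidar-07", "range_m": 42.5}'
token = cipher.encrypt(reading)  # ciphertext safe to store or transmit

# Only an authorized application holding the key can recover the data.
print(cipher.decrypt(token))

try:
    Fernet(Fernet.generate_key()).decrypt(token)  # wrong key
except InvalidToken:
    print("access denied: unauthorized key")
```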

3.3.2 Impact of edge networks

The edge computing platform is mainly used for application scenarios that occur far from central infrastructure, including monitoring, security, drive testing, and video analysis. Because these scenarios are far from core network infrastructure, the number and density of mobile terminals is small, so terminals are usually in a weak-coverage state. In addition, because of the long distances and the high latency of wireless access, the transmission bandwidth and processing performance of mobile terminals may be limited under weak coverage and during peak hours. To achieve seamless application connectivity, the edge computing platform needs to adapt and optimize the edge network's transmission protocols, routing mechanisms, packet-loss retransmission mechanisms, and so on.
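
One such adaptation, a packet-loss retransmission policy with exponential backoff, can be sketched in Python as follows (the flaky send function is a stand-in for a real radio link):

```python
# Retransmission with exponential backoff: a common adaptation for lossy,
# high-latency edge links. unreliable_send() simulates a weak radio link.

import random
import time

def unreliable_send(packet: bytes) -> bool:
    return random.random() > 0.6  # 60% simulated loss

def send_with_retries(packet: bytes, max_attempts: int = 6) -> bool:
    delay = 0.05  # seconds; grows exponentially after each loss
    for attempt in range(1, max_attempts + 1):
        if unreliable_send(packet):
            print(f"delivered on attempt {attempt}")
            return True
        print(f"attempt {attempt} lost, backing off {delay:.2f}s")
        time.sleep(delay)
        delay *= 2  # exponential backoff eases pressure at peak hours
    return False

send_with_retries(b"lidar frame 1024")
```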

Although edge computing platforms face many challenges, their irreplaceability and economies of scale are turning them into an independent field of research.

4. System-level chip development process

The research and development of a system-on-chip is a complex and detailed process that usually comprises multiple stages and steps. The general flow of system-level chip development is introduced in detail below to help you understand it.

  1. Requirements analysis stage:
    The first step in system-level chip development is to clarify the requirements. The goal of this phase is to communicate with customers and relevant stakeholders to understand their needs and expectations, as well as the functionality and performance that the chip should have. This stage usually requires market research, competitive analysis and technical feasibility assessment.

  2. Architecture design stage:
    Based on the requirements analysis, the system-level chip architecture design begins. The goal of this stage is to determine the overall structure, core components and functional modules of the chip, and establish the relationship and interaction between them. Design teams typically use a variety of design tools and methods, such as system modeling, simulation, and optimization, to ensure that the chip design meets requirements and is scalable and reliable.

  3. Functional design stage:
    Once the chip architecture is determined, the project enters the functional design stage. At this stage, the design team designs the circuits and logic of each functional module in detail. They write code in a hardware description language (HDL) to describe the behavior and functionality of the chip. These sources are then simulated, verified, and synthesized with electronic design automation (EDA) tools to generate circuit netlists and gate-level designs (a small behavioral sketch of the kind of logic HDL describes appears after this list).

  4. Physical design phase:
    After the functional design is completed, the physical design phase is entered. The goal of this phase is to translate the logical design into actual physical structures. The design team will perform layout design and wiring design to determine the location and connection methods of the internal circuits of the chip. They also consider factors such as power consumption, timing and signal integrity to ensure the chip is functioning properly on a physical level. After the physical design is completed, layout files will be generated for chip manufacturing and production.

  5. Verification and Validation Phase:
    After the physical design of the chip is completed, it must be verified and validated. The goal of this phase is to ensure that the chip functions and performs as expected, free of errors or defects. The design team performs various verification methods such as functional verification, timing verification, and electrical verification. They also use simulation tools and hardware verification platforms to test and debug the chip, confirming its reliability and stability under different operating conditions.

  6. Manufacturing and production phase:
    Once the design and verification of the chip are completed, the manufacturing and production phase is entered. This stage involves cooperation with chip manufacturers, sending the chip layout files to the manufacturers, and performing chip production and packaging. Factors such as process, materials and cost need to be considered during the manufacturing process to ensure the quality and reliability of the chip.

  7. Testing and debugging phase:
    After the chip production is completed, the chip needs to be tested and debugged. The goal of this phase is to verify the performance and reliability of the chip in a real operating environment. The testing team will use various testing equipment and methods to conduct functional testing, performance testing and reliability testing on the chip. They also perform fault analysis and repair on chips to ensure the quality and reliability of the chips.

  8. Integration and system commissioning phase:
    Finally, the qualified chips are integrated into target boards and products. System-level commissioning, carried out together with the software teams, confirms that the chip works correctly as part of the complete system.
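
HDL itself is beyond the code samples in this article, but the behavior an HDL module captures can be mimicked in Python. Below is a hypothetical behavioral model of a 4-bit ripple-carry adder, the kind of logic an HDL design would describe before synthesis to gates:

```python
# Behavioral model of a 4-bit ripple-carry adder: the kind of module an
# HDL design would describe, written in Python purely to illustrate
# "describe behavior, then synthesize to gates".

from typing import List, Tuple

def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    s = a ^ b ^ cin                   # sum bit
    cout = (a & b) | (cin & (a ^ b))  # carry out
    return s, cout

def ripple_adder4(a_bits: List[int], b_bits: List[int]) -> Tuple[List[int], int]:
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):  # least-significant bit first
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 0b0110 (6) + 0b0111 (7) = 13 with no carry out of 4 bits.
bits, carry = ripple_adder4([0, 1, 1, 0], [1, 1, 1, 0])
value = sum(bit << i for i, bit in enumerate(bits)) + (carry << 4)
print(bits, carry, value)  # [1, 0, 1, 1] 0 13
```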

5. Chip manufacturing core technology

Chip manufacturing is a process involving precision craftsmanship and complex technology, comprising multiple core technical steps. The core technologies of chip manufacturing are introduced below to help you understand the importance and applications of each (a short yield-model sketch follows the list).

  1. Wafer Preparation:
    Wafer preparation is the first step in chip manufacturing. It involves producing silicon wafers with the required characteristics and structure. Key technologies include:
  • Single crystal growth: high-purity single-crystal silicon ingots are grown (typically by the Czochralski method; epitaxial layers can later be added by chemical vapor deposition) to achieve high purity and lattice integrity.
  • Cutting and polishing: Cut large pieces of single crystal silicon into thin slices, and then use processes such as polishing and chemical mechanical polishing (CMP) to make the surface smooth.
  2. Photolithography:
    Photolithography is a crucial step in chip manufacturing. It is used to transfer the chip pattern to the wafer. Key technologies include:
  • Mask preparation: Use computer-aided design software (CAD) to design the layout of the chip, and transfer the layout to a photolithography mask through electron beam exposure or laser lithography.
  • Exposure and development: the mask is placed over the wafer and illuminated with ultraviolet light or a laser; the development step then removes either the exposed or the unexposed photoresist (depending on resist polarity), forming the chip's pattern.
  3. Process Technology:
    Process technology is the core stage of chip manufacturing. It involves the gradual addition and formation of different material layers and device structures on the wafer. Key technologies include:
  • Oxidation and deposition: Using techniques such as chemical vapor deposition (CVD) or physical vapor deposition (PVD) to form an oxide layer or add layers of other materials on the wafer.
  • Patterning and etching: through photolithography and etching processes, unwanted material layers or regions are etched away to form circuits and structures.
  • Doping and diffusion: Dopants are introduced or diffused on the wafer surface through ion implantation or diffusion processes to change the conductive properties of the material.
  4. Metallization and Interconnection:
    Metallization and interconnection are the key steps that wire the chip's devices together, forming conductors and connection structures on the wafer. Key technologies include:
  • Metal evaporation and electrolytic deposition: Use metal evaporation or electrolytic deposition technology to form metal wires and electrodes on the wafer surface.
  • Plating and Filling: Using plating and filling processes, the gaps and holes between metal conductors are filled to improve conductivity and connectivity.
  5. Packaging and Testing:
    Packaging and testing is the final stage of chip manufacturing. It involves packaging the chip into a usable package and testing the chip for functionality and reliability. Key technologies include:
  • Packaging technology: placing the chip on a packaging substrate, connecting it with bond wires or solder balls, and sealing it to protect against mechanical and environmental damage.
  • Final testing: using test equipment to verify the functionality and reliability of each packaged chip before shipment.
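
As a closing quantitative note on manufacturing quality, the classical first-order Poisson yield model relates die area and defect density to the fraction of good dies; the numbers below are illustrative only:

```python
# Classical Poisson yield model: the fraction of defect-free dies is
# Y = exp(-D * A), with D the defect density and A the die area.
# The numbers below are illustrative, not data for any real process.

import math

def poisson_yield(defect_density_per_cm2: float, die_area_cm2: float) -> float:
    return math.exp(-defect_density_per_cm2 * die_area_cm2)

for area in (0.5, 1.0, 2.0):  # die area in cm^2
    y = poisson_yield(defect_density_per_cm2=0.2, die_area_cm2=area)
    print(f"die area {area:.1f} cm^2 -> expected yield {y:.1%}")
```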
