Why is the Apple M1 chip so fast?

Apple's chips are advancing rapidly. In raw specifications and beyond, Apple has outdone its own past designs while leaving its peers far behind. This article answers in detail: why does Apple's M1 run so fast?

Original title: Why Is Apple's M1 Chip So Fast?

Original link:

https://erik-engheim.medium.com/why-is-apples-m1-chip-so-fast-3262b158cba2

Author: Erik Engheim, based in Norway; interested in UX, Julia programming, science, and writing.

This article is a CSDN translation; please credit the source when reposting.

Author | Erik Engheim (translation authorized by the author)

Translator | Crescent Moon    Editor | Zhang Wen

Cover image | CSDN, licensed from Visual China

Produced by | CSDN (ID: CSDNnews)

The following is the translation:

I saw a YouTube video in which a Mac user described the iMac he had bought last year: a machine costing about $4,000, maxed out with 40GB of RAM. He then watched in disbelief as that expensive iMac was beaten by his new M1 Mac Mini, a device that cost only about $700.

In one real-world test after another, M1 Macs have not merely edged past top-of-the-line Intel Macs; they have crushed them. Many people found this hard to believe and started asking how it could be possible.

If you have such doubts, then you have come to the right place. In this article, I will analyze Apple's M1 chip in depth. Specifically, I think many people have the following questions:

  1. From a technical point of view, why is the M1 chip so fast?

  2. Does Apple use any unusual technology?

  3. Is it easy for competitors like Intel and AMD to adopt the same technology?

Of course, you can search the internet for answers, but if you try to read up on what Apple has actually done, you will quickly be buried in highly technical jargon, such as the M1 using a very wide instruction decoder, a huge re-order buffer (ROB), and so on. Unless you know your CPU hardware well, most of these articles will read like gibberish.

To make things easier to follow, let me start with some basics about what a chip like the M1 actually is.

What is a microprocessor (CPU)?

 

Usually, when we talk about Intel and AMD chips, we are talking about the central processing unit (CPU), or microprocessor. It fetches instructions from memory and executes them one after another.

Figure: A very basic RISC CPU (not M1)

Instructions move from memory into an instruction register along the blue arrows. The decoder then parses each instruction and activates different parts of the CPU through the red control lines. Finally, the arithmetic logic unit (ALU) adds or subtracts the numbers.

At its most basic, a CPU consists of a set of registers and a number of arithmetic logic units (ALUs). A register is a named storage cell; the ALU is the unit that does the actual computation, performing addition, subtraction, and other basic mathematical operations. However, the ALU is connected only to the CPU's registers. If you want to add two numbers, you must first fetch them from memory and place them in two registers.

Here is an example of the kind of instructions the RISC CPU in the M1 executes:

load r1, 150
load r2, 200
add  r1, r2
store r1, 310

Here, r1 and r2 are the registers we just talked about. Modern RISC CPUs cannot operate on numbers that are not in registers; for example, they cannot add two numbers sitting in two different locations in memory. Instead, both numbers have to be pulled into separate registers, which is what the first two instructions do: we fetch the number at memory address 150 into register r1, then the number at address 200 into register r2. Only then can the two numbers be added with add r1, r2.
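To make the register model concrete, here is a minimal sketch in Python (my own illustration, not anything from the original article and certainly not how the silicon works) that interprets the four instructions above against a small simulated memory:

# Toy register machine: a simplified model of a load/store RISC CPU.
# Memory is a dictionary mapping addresses to values; registers are named cells.

memory = {150: 7, 200: 35}    # pretend these values already live in RAM
registers = {"r1": 0, "r2": 0}

def load(reg, addr):
    # Copy a value from memory into a register.
    registers[reg] = memory[addr]

def add(dst, src):
    # Add two registers; the ALU only ever sees registers, never RAM.
    registers[dst] = registers[dst] + registers[src]

def store(reg, addr):
    # Copy a register's value back out to memory.
    memory[addr] = registers[reg]

# The example program from the article:
load("r1", 150)    # r1 <- memory[150]
load("r2", 200)    # r2 <- memory[200]
add("r1", "r2")    # r1 <- r1 + r2
store("r1", 310)   # memory[310] <- r1

print(registers)     # {'r1': 42, 'r2': 35}
print(memory[310])   # 42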

Figure: An old mechanical calculator

It has two registers: the accumulator and the input register. Modern CPUs typically have more than a dozen registers, and they are electronic rather than mechanical.

The concept of registers is an old one. In the mechanical calculator pictured above, for example, the registers are where the two numbers to be added are held.

M1 is not a CPU!

 

The first thing to understand about the M1 is this:

The M1 is not a CPU; it is a whole system of multiple chips placed in one package. The CPU is just one of them.

Simply put, the M1 is an entire computer on a single chip. It contains the CPU, the graphics processing unit (GPU), memory, input/output controllers, and many other components that make up a complete computer. This is what we call a system on a chip (SoC).

Today, when you buy an Intel or AMD chip, you actually get what amounts to multiple microprocessors in one package. In the past, the chips in a computer were physically scattered across the motherboard.

Figure: Example of a computer motherboard

Memory, CPU, graphics card, IO controller, network card and many other components are connected to the motherboard and can communicate with each other.

However, because we can now place an enormous number of transistors on a single silicon die, companies such as Intel and AMD began integrating multiple microprocessors onto one chip. We call these CPU cores. A core is essentially a fully independent processor that can read instructions from memory and perform calculations.

Figure: A microchip with multiple CPU cores

For a long time, if you wanted more performance, you simply added more general-purpose CPU cores. But the situation has changed, and one player in the CPU market is now moving away from that trend.

Apple's heterogeneous computing strategy is not so mysterious

 

Apple did not choose to keep adding general-purpose CPU cores. Instead, they followed another strategy: adding more and more specialized chips, each dedicated to a particular kind of task. The advantage is that a specialized chip can do its job much faster than a general-purpose CPU core while consuming far less power.

This is not an entirely new approach. For years, graphics cards from Nvidia and AMD have contained specialized chips, graphics processing units (GPUs), that perform graphics-related operations much faster than a general-purpose CPU.

What Apple has done is push much more aggressively in this direction. Besides general-purpose cores and memory, the M1 contains a whole collection of specialized units:

  1. Central processing unit (CPU): the brain of the SoC; runs most of the code of the operating system and your applications.

  2. Graphics processing unit (GPU): handles graphics-related tasks, such as drawing an application's user interface and rendering 2D/3D games.

  3. Image signal processor (ISP): speeds up common tasks in image-processing applications.

  4. Digital signal processor (DSP): handles mathematically intensive work that would be a poor fit for the CPU, such as decompressing music files.

  5. Neural processing unit (NPU): used in high-end smartphones to accelerate machine learning (AI) tasks, including voice recognition and camera processing.

  6. Video encoder/decoder: handles conversion of video files and formats in a power-efficient way.

  7. Secure Enclave: encryption, authentication, and other security tasks.

  8. Unified memory: Allows the CPU, GPU and other cores to exchange information quickly.

This is part of the reason why many people see huge speedups when doing image and video editing on an M1 Mac: many of those tasks run directly on dedicated hardware. That is how an inexpensive M1 Mac Mini can breeze through encoding a large video file while an expensive iMac cannot keep up even with all its fans running at full speed.

Figure: In the blue area, multiple CPU cores access memory; in the green box, a large number of GPU cores access the same memory.

You may be wondering what unified memory is. How is it different from shared memory? And wasn't sharing video memory with main memory considered a bad idea in the past, precisely because it hurt performance? Yes, traditional shared memory really was bad, because the CPU and GPU had to take turns accessing it. Sharing meant contending for the data bus: the GPU and CPU had to alternate over a narrow pipe when storing or retrieving data.

Unified memory is different. With unified memory, GPU cores and CPU cores can access memory at the same time, so there is no turn-taking overhead. In addition, the CPU and GPU can simply tell each other where data lives. Previously, the CPU had to copy data from its region of main memory into the region used by the GPU. With unified memory, the CPU can tell the GPU, "I placed 30MB of polygon data starting at memory address 2430," and the GPU can use that memory without any copying.
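As a loose software analogy (my own illustration, not from the article): the difference is like handing another component a zero-copy reference to a buffer versus duplicating the whole buffer for it. A small Python sketch:

import time

# A buffer standing in for 30MB of polygon data sitting in RAM.
polygons = bytearray(30 * 1024 * 1024)

# "Separate memory pools" style: the data must be copied for the other side.
start = time.perf_counter()
gpu_private_copy = bytes(polygons)    # a full 30MB copy
copy_time = time.perf_counter() - start

# "Unified memory" style: just tell the other side where the data already is.
start = time.perf_counter()
gpu_view = memoryview(polygons)       # zero-copy reference to the same bytes
share_time = time.perf_counter() - start

print(f"copying took {copy_time * 1000:.2f} ms")
print(f"sharing took {share_time * 1000:.4f} ms")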

This means that because all the specialized processors on the M1 can use the same pool of memory and exchange information quickly, performance improves dramatically.

Before unified memory, Macs also used GPUs with their own memory. You could even add a graphics card housed outside the computer, connected with a Thunderbolt 3 cable. There is speculation that this may still be possible in the future.

Why don't Intel and AMD use the same strategy?

 

If Apple's approach is so smart, why isn't everyone doing it? To some extent, they are. Other ARM chip makers are investing more and more in dedicated hardware.

AMD has also been putting more powerful GPUs on some of its chips and moving gradually toward a kind of SoC with its accelerated processing units (APUs), which place CPU cores and GPU cores on the same die.

Figure: An AMD Ryzen accelerated processing unit (APU) combines a CPU and a GPU (Radeon Vega) on one chip, but it does not include the other coprocessors, the IO controllers, or unified memory.

Still, there are important reasons why Intel and AMD cannot fully copy Apple. An SoC is essentially an entire computer built on one chip, which makes it a natural fit for an actual computer maker such as HP or Dell. A simple car analogy: if your business model is building and selling car engines, it is an unusual leap to start building and selling complete cars.

For ARM, by contrast, this is no problem at all. Computer makers such as Dell or HP could license ARM intellectual property, buy IP for other chips, and design their own SoC with whatever specialized hardware they want. They would then hand the finished design to a semiconductor foundry such as GlobalFoundries or TSMC, which already manufacture chips for AMD and Apple.

Figure: TSMC's semiconductor foundry, which manufactures chips for companies such as AMD, Apple, Nvidia, and Qualcomm.

Here we run into a big problem with the Intel and AMD business models. Those models are built on selling general-purpose CPUs that people simply plug into a large PC motherboard. Computer makers can then buy motherboards, memory, CPUs, and graphics cards from different vendors and integrate them into one solution.

But the industry is rapidly moving away from that model. In the new SoC world, you do not assemble physical components from different vendors; you assemble intellectual property. You buy designs for graphics cards, CPUs, modems, IO controllers, and other parts from various vendors, use them to design an SoC in-house, and then have a foundry manufacture it.

Now you see the problem: Intel, AMD, and Nvidia are not about to license their intellectual property to Dell or HP so that those companies can build their own SoCs.

Of course, Intel and AMD could simply start selling complete SoCs. But containing what? Every PC maker may have its own idea of what such a chip should include, and there would likely be conflict among Intel, AMD, Microsoft, and the PC makers over which specialized chips to put in, because those chips need software support.

For Apple, none of this is a problem, because they control the whole stack. They give developers the Core ML library, for example, to make it easy to write machine learning code. Whether Core ML runs on Apple's CPU or on the Neural Engine is an implementation detail developers do not have to care about.
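As a hedged illustration of what that looks like from the developer's side: Apple's coremltools Python package (version 5 or later, if I recall its API correctly) lets the caller state a policy for which compute units may be used, while the framework decides where each part of the model actually runs. The model file name below is hypothetical:

import coremltools as ct

# Load a (hypothetical) Core ML model. compute_units is a policy, not a placement:
# ALL lets the framework schedule work across the CPU, GPU, and Neural Engine.
model = ct.models.MLModel(
    "MyClassifier.mlmodel",              # hypothetical model file
    compute_units=ct.ComputeUnit.ALL,    # or CPU_ONLY / CPU_AND_GPU to restrict it
)

# prediction = model.predict({"image": some_input})  # input name depends on the model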

The fundamental problem of making CPUs faster

So heterogeneous computing is part of the reason for the M1's performance, but it is not the only reason. Firestorm, the general-purpose CPU core on the M1, is genuinely fast. This is a major departure from ARM CPU cores of the past, which were weak compared with AMD and Intel cores.

Firestorm, by contrast, beats most Intel cores and comes very close to beating the fastest AMD Ryzen cores. Conventional wisdom said that was not supposed to happen.

Before discussing the reasons why Firestorm runs so fast, let's first understand which core concepts can really speed up the CPU.

In principle, you can combine the following two strategies to speed up the CPU:

  1. Execute more instructions quickly .

  2. Execute a large number of instructions in parallel .

In the 1980s, making instructions execute faster was easy: just raise the clock frequency, and every instruction finished sooner. A clock cycle is the time in which the computer performs one step. But one clock cycle is not always enough; an instruction may be made up of several smaller tasks and therefore need multiple clock cycles to complete.

Today, however, it is almost impossible to keep raising the clock frequency. After more than a decade of pushing, that easy scaling has run out, which is what people mean when they say Moore's Law is over.

Therefore, all we can do is execute as many instructions as possible in parallel.

Multi-core and out-of-order processors


There are two ways to execute a large number of instructions in parallel. One is to add more CPU cores. From a software developer's point of view, each core looks like a hardware thread; if you don't know what a thread is, think of it as a process for carrying out a task. A CPU with two cores can perform two separate tasks, two threads, at the same time. The tasks might be two separate programs stored in memory, or the same program running twice. Each thread needs some bookkeeping, such as where it currently is in the program's sequence of instructions, and each thread needs somewhere of its own to keep temporary results.

In principle, a processor can run multiple threads even with only one core. It suspends one thread, saves its current state, switches to another, and switches back later. This does not bring much of a performance gain; it helps only when a thread frequently has to stop and wait, for example for user input or for data from a slow network connection. These might be called software threads. Hardware threads mean there is actual extra physical hardware, such as an additional core, to speed up the work.
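A rough illustration in Python (my own sketch, not from the article) of why software concurrency pays off mainly for threads that spend their time waiting:

import threading
import time

def fetch(name, delay):
    # Stand-in for a thread that mostly waits (network, disk, user input).
    time.sleep(delay)   # while parked here, the core is free to run other threads
    print(f"{name} finished after waiting {delay}s")

start = time.perf_counter()
threads = [threading.Thread(target=fetch, args=(f"request-{i}", 1.0)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Even on a single core this takes about 1 second, not 4, because the threads
# overlap their waiting. Compute-bound work would need extra cores to go faster.
print(f"total: {time.perf_counter() - start:.1f}s")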

The problem, however, is that developers have to write code specifically to take advantage of multiple cores. Some workloads, such as server software, make this easy: each user can be handled separately, and those tasks are independent of one another. That is why a large number of cores is an excellent fit for servers, and especially for cloud services.

Figure: The Ampere Altra Max ARM CPU, with 128 cores, designed for cloud computing, where a large number of hardware threads is an advantage.

This is why ARM CPU makers such as Ampere build chips like the Altra Max with 128 cores. The chip is designed specifically for cloud computing. Individual cores do not need heroic performance, because in the cloud what matters most is squeezing as many concurrent users as possible out of every watt of power.

Apple's situation is the exact opposite. Apple builds single-user devices, so piling on threads is not the answer. Their machines are used for gaming, video editing, software development, and so on, and Apple wants desktops with beautiful, responsive graphics and animation.

Desktop software generally does not use many cores. Eight cores is plenty for a computer game; 128 cores would be a complete waste. What desktop software wants instead is a smaller number of much more powerful cores.

Now comes the interesting part. Out-of-order execution is a way to execute more instructions in parallel without exposing them as multiple threads. Developers do not have to write special software to benefit: from the developer's point of view, each core simply appears to run faster.

To understand how it works, we need a little background on memory. Requesting data at a particular memory location is slow, but there is essentially no difference in latency between fetching 1 byte and fetching 128 bytes. Data travels over the data bus, which you can picture as a channel or pipe between memory and the various parts of the CPU. Physically it is just copper traces carrying electrical signals, and if the bus is wide enough, many bytes can be fetched at once.

So the CPU fetches a whole block of instructions at a time, even though the instruction set obliges it to behave as if they execute one after another. Modern microprocessors therefore do what is called out-of-order execution.

This means they can quickly analyze a buffer of instructions and work out which ones depend on which. Take this simple example:

01: mul r1, r2, r3    // r1 ← r2 × r3
02: add r4, r1, 5     // r4 ← r1 + 5
03: add r6, r2, 1     // r6 ← r2 + 1

Multiplication is a relatively slow operation; suppose it takes several clock cycles to complete. The second instruction then has to wait, because it needs the result that will be placed in register r1.

The third instruction, on line 03, does not depend on any earlier result, so an out-of-order processor can start computing it in parallel.

In reality, the processor is juggling hundreds of instructions at any moment, and the CPU has to work out all the dependencies among them.

It analyzes the instructions by checking the inputs of each one and seeing whether those inputs depend on the outputs of one or more other instructions. Inputs and outputs here mean registers holding the results of previous calculations.

For example, the input r1 of the instruction add r4, r1, 5 depends on the result of the earlier instruction mul r1, r2, r3. These dependencies link together into a graph the CPU can work with: the nodes are instructions, and the edges are the registers that connect them.

The CPU analyzes this graph to determine which instructions it can execute in parallel and where it must wait for the results of several earlier calculations before continuing.
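Here is a sketch of that dependency analysis for the three-instruction example above, written in Python as my own simplification (real hardware does this with dedicated circuitry, not dictionaries):

# Each instruction lists the registers it reads and the register it writes.
instructions = {
    "01: mul r1, r2, r3": {"reads": {"r2", "r3"}, "writes": "r1"},
    "02: add r4, r1, 5":  {"reads": {"r1"},       "writes": "r4"},
    "03: add r6, r2, 1":  {"reads": {"r2"},       "writes": "r6"},
}

# An instruction depends on the most recent earlier instruction that wrote
# one of the registers it reads.
last_writer = {}
deps = {name: set() for name in instructions}
for name, info in instructions.items():
    for reg in info["reads"]:
        if reg in last_writer:
            deps[name].add(last_writer[reg])
    last_writer[info["writes"]] = name

for name, d in deps.items():
    print(name, "depends on", d or "nothing")
# 01 and 03 depend on nothing and can start in parallel;
# 02 must wait for 01 to produce r1.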

Many instructions may finish ahead of time, but we cannot treat their results as final. We cannot commit them, because they completed in the wrong order; from the user's point of view, the instructions must appear to have executed in the order they were issued.

Like working through a stack, the CPU keeps popping completed instructions off the top until it hits one that has not finished yet.
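A minimal sketch of that retire step (again my own illustration): results may become ready in any order, but they only leave the buffer from the front, in program order.

from collections import deque

# Buffer entries in program order: (instruction, finished?)
rob = deque([
    ("mul r1, r2, r3", False),   # slow multiply, still executing
    ("add r4, r1, 5",  False),
    ("add r6, r2, 1",  True),    # finished early, out of order
])

def retire(rob):
    # Commit finished instructions from the front only.
    retired = []
    while rob and rob[0][1]:
        retired.append(rob.popleft()[0])
    return retired

print(retire(rob))   # [] -- nothing retires while the oldest instruction is unfinished

rob[0] = ("mul r1, r2, r3", True)   # the multiply completes
print(retire(rob))   # ['mul r1, r2, r3'] only; the second add is still pending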

This description is far from complete, but I hope it gives you the general idea. Basically, you can either have programmers write explicitly parallel code, or you can let the CPU pretend everything runs on a single thread while it executes out of order behind the scenes.

It is this exceptional out-of-order execution capability that makes the Firestorm core on the M1 so powerful. It is, in fact, stronger at this than anything from Intel or AMD, and possibly stronger than anything else on the mainstream market.

Why is AMD and Intel's out-of-order execution inferior to the M1's?

 

In my explanation of out-of-order execution, I skipped some important details that need to be covered; otherwise it is hard to see why Apple is ahead and why it is hard for Intel and AMD to catch up.

The "stack" I mentioned is really called the re-order buffer (ROB), and it does not hold ordinary machine-code instructions, that is, not the instructions the CPU fetches from memory. Those belong to the instruction set architecture (ISA): what we call x86, ARM, PowerPC, and so on.

Internally, however, the CPU works on a completely different set of instructions that is invisible to the programmer: micro-operations, or micro-ops (μops for short). The ROB is full of micro-ops.

For a CPU working hard to execute instructions in parallel, filling the ROB with micro-ops is far more practical, because micro-ops are very wide (contain many bits) and can carry all sorts of meta-information. The ARM and x86 instruction sets cannot carry that much information, because:

  1. It would bloat the size of program executables;

  2. It would expose the CPU's internal workings, such as whether it has out-of-order execution units or uses register renaming;

  3. Much of the meta-information is meaningful only within the current execution context.

You can think of it like writing software: you have a public API that must stay stable and that everyone uses. That is the ARM, x86, PowerPC, or MIPS instruction set. Micro-ops are the private API used to implement the public one.

Micro-ops are also usually easier for the CPU to work with. Why? Because each one does a single, simple task. Regular ISA instructions can be more complex and may trigger a whole sequence of operations, so they have to be translated into multiple micro-ops.
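A toy illustration of that cracking step, with a made-up instruction format of my own (not real x86 or ARM encoding): one instruction that touches memory becomes several simpler register-only micro-ops.

def crack(instruction):
    # Split a (made-up) CISC-style instruction into load/compute/store micro-ops.
    # Example: "add [310], r2" means memory[310] = memory[310] + r2.
    op, dst, src = instruction.replace(",", "").split()
    if dst.startswith("["):                 # memory operand: needs extra micro-ops
        addr = dst.strip("[]")
        return [
            f"uop.load  tmp0, {addr}",      # bring the value into a hidden register
            f"uop.{op}  tmp0, {src}",       # do the arithmetic register-to-register
            f"uop.store tmp0, {addr}",      # write the result back to memory
        ]
    return [f"uop.{op} {dst}, {src}"]       # register-only: already RISC-like

for uop in crack("add [310], r2"):
    print(uop)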

CISC CPUs usually have no choice but to use micro-ops; otherwise their large, complex instructions would make pipelining and out-of-order execution essentially impossible.

RISC CPUs have a choice. Small ARM CPUs, for example, do not use micro-ops at all, but that also means they cannot do things like out-of-order execution.

You may ask: why does any of this matter? Why do you need to know these details to understand why Apple has pulled ahead of AMD and Intel?

Because how fast a chip can run depends on how quickly the ROB can be filled with micro-ops, and how many it can hold. The faster it fills and the larger it is, the more opportunity there is to pick out instructions that can run in parallel, and the higher the performance.

Machine-code instructions are chopped into micro-ops by the instruction decoder. With more decoders, more instructions can be chopped up in parallel, and the ROB fills faster.

This is where the huge difference between Apple and the others appears. The beefiest Intel and AMD microprocessor cores have four decoders, meaning they can decode four instructions in parallel.

Apple has eight. Not only that, the M1's ROB is about three times the size of Intel's and AMD's, holding roughly three times as many instructions. No other mainstream chip maker puts that many decoders in a CPU.

Why can't Intel and AMD add more instruction decoders?

This is where the RISC advantage of the ARM architecture used by the M1's Firestorm core really starts to matter.

You see, an x86 instruction can be anywhere from 1 to 15 bytes long, while RISC instructions have a fixed length. Why does that matter?

Because if every instruction has the same length, splitting a stream of bytes into eight parallel streams for eight decoders is trivial.

On an x86 CPU, however, a decoder has no idea where the next instruction starts; it has to actually parse each instruction to find out how long it is.

Intel and AMD deal with this by brute force: they attempt to decode instructions at every possible starting point, which means many wrong guesses have to be discarded. This makes the decoder stage so convoluted and complicated that it is very hard to add more decoders. For Apple, by contrast, adding more decoders is easy.
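A sketch of why fixed-length instructions make parallel decoding easy (my own illustration; the byte values and lengths are meaningless placeholders):

# With fixed 4-byte instructions, decoder i can jump straight to offset i*4.
code = bytes(range(32))                 # 8 fixed-length "instructions"
fixed_starts = [i * 4 for i in range(len(code) // 4)]
print("fixed-length starts:", fixed_starts)    # known up front: 8 decoders, no guessing

# With variable-length instructions, each start is only known after the previous
# instruction has been measured, so finding the boundaries is inherently serial.
lengths = [1, 3, 2, 5, 1, 4, 2, 6]      # pretend lengths of successive x86-style instructions
variable_starts, offset = [], 0
for length in lengths:
    variable_starts.append(offset)
    offset += length
print("variable-length starts:", variable_starts)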

In fact, adding more decoders creates so many other problems that, for AMD, four is essentially the upper limit.

This is what allows the M1 Firestorm core to process, in effect, twice as many instructions as AMD and Intel CPUs at the same clock frequency.

One could object that CISC instructions turn into more micro-ops, that they are denser, so decoding one x86 instruction is worth roughly decoding two ARM instructions.

In the real world, however, this is not the case. Highly optimized x86 code rarely uses the complex CISC instructions; in many respects it looks rather RISC-like.

That does not help Intel or AMD, though, because even if 15-byte instructions are rare, the decoders still have to be built to handle them, and that complexity is what prevents AMD and Intel from adding more decoders.

But AMD's Zen3 core is faster, right?

 

As far as I know, the latest AMD core, Zen3, is slightly faster than a Firestorm core in raw performance. But that is only because the Zen3 core runs at 5 GHz while Firestorm runs at 3.2 GHz: despite a clock advantage of nearly 60%, Zen3 is only marginally faster than Firestorm.
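A quick back-of-the-envelope calculation (the ~3% figure for "slightly faster" is my own assumption, not a number from the article) shows what that clock gap implies about work done per cycle:

# The article's claim: Zen3 at 5.0 GHz is only slightly faster overall than
# Firestorm at 3.2 GHz. What does that imply per clock cycle?
zen3_clock, firestorm_clock = 5.0, 3.2    # GHz
overall_speed_ratio = 1.03                # assumed ~3% for "slightly faster"

clock_ratio = zen3_clock / firestorm_clock            # ~1.56x more cycles per second
work_per_cycle_ratio = overall_speed_ratio / clock_ratio

print(f"Zen3 clock advantage:  {clock_ratio:.2f}x")
print(f"Zen3 work per cycle:   {work_per_cycle_ratio:.2f}x of Firestorm's")
# ~0.66x, i.e. Firestorm gets roughly 1.5x more work done per clock in this crude model.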

So why doesn't Apple just raise the clock frequency? Because a higher clock makes the chip run hotter, and low heat is one of Apple's main selling points: unlike Intel and AMD systems, their computers barely need cooling.

In essence, then, the Firestorm core really does beat the Zen3 core. Zen3 only stays in the fight by drawing more power and producing more heat, a path Apple has chosen not to take.

If Apple needs more performance, they simply add more Firestorm cores, which raises performance while keeping power consumption low.

Future developments

 

It seems that AMD and Intel are already in trouble:

  • Their business model makes it hard to pursue heterogeneous computing and SoC designs

  • The legacy x86 CISC instruction set makes it hard to improve out-of-order execution performance

But this does not mean it is the end of the road. They can still raise clock frequencies, use better cooling, add more cores, and enlarge CPU caches, but each of these has downsides. Intel is in the worst position, because its cores are already soundly beaten by Firestorm and the GPUs in its SoC solutions are weak.

The problem with adding more cores is that for typical desktop workloads, the benefit of many cores drops off quickly. For servers, of course, more cores are better.

Meanwhile, companies such as Amazon and Ampere are already attacking with 128-core CPUs, which means Intel and AMD are being squeezed from both sides.

Fortunately for AMD and Intel, Apple does not sell its chips on the open market, so PC users are stuck with whatever those two offer. PC users may defect to Apple, but that is a slow process; people do not abandon the platform they use every day overnight.

But young buyers with money in their pockets and little investment in any particular platform will increasingly choose Apple, growing its share of the premium market and, ultimately, its share of the PC market as a whole.

Editor's comment

The M1 is so fast not because of some exotic secret technology, but because Apple has thrown an enormous amount of hardware at the problem.

The M1 is very wide (8 decoders) and has many execution units. Its re-order buffer is about 630 entries deep, its caches are huge, and it has plenty of memory bandwidth. It is a very powerful and well-balanced design.

None of this is brand-new technology, either: Apple's A-series chips have improved steadily year after year. It is just that nobody believed a chip designed for phones could beat the chips in laptops and desktops.
