[Repost] What programmers should know about modern CPUs

Reprinted from: http://www.iteye.com/news/30978

What's new in CPUs since the 1980s? How do these changes affect programmers? This article will answer these questions for you.

Original link: http://danluu.com/new-cpu-features/
Author: Dan Luu
Someone on Twitter described their own understanding of CPUs like this:
Quote

The CPU model I carry around in my head is still stuck in the 1980s: a box that does arithmetic, logic, shifts, and bit operations, and loads and stores things in memory. I'm vaguely aware of various newer developments, such as vector instructions (SIMD), and that newer CPUs have virtualization support (though I'm not sure what that means in practice).
What cool developments have I missed? Is there anything today's CPUs can do that last year's couldn't? What about CPUs from two, five, or ten years ago? The things I'm most interested in are features that programmers have to do something about themselves to take full advantage of (or that force a redesign of the programming environment). I think this shouldn't include hyperthreading/SMT, but I'm not sure. I'm also interested in things CPUs can't do yet but will be able to do in the future.

Unless otherwise stated, this article refers to x86 and Linux. History repeats itself, and many of the things that are new on x86 are old news on supercomputers, mainframes, and workstations.

Status
Miscellaneous

Modern CPUs have wider registers and can address more memory. You may have used an 8-bit CPU in the 1980s, but you are certainly using a 64-bit CPU now. Besides providing more address space, 64-bit mode provides more registers and more consistent floating-point results (by avoiding the pseudo-random extra 80 bits of precision that x87 floating point gives 32-bit and 64-bit operations). Other features that have very likely been introduced to the x86 you use since the early 1980s include paging/virtual memory, pipelining, and floating-point arithmetic.

This article avoids unusual low-level features such as APIC/x2APIC, SMM, or the NX bit, which only matter when writing drivers or BIOS code or doing security reviews.

Memory / Caches

Of all the topics here, the one most likely to affect your day-to-day programming is memory access. My first computer was a 286; on that machine a memory access took only a few clock cycles. A few years ago I was using a Pentium 4, where a memory access took over 400 clock cycles. Processors have gotten faster much more quickly than memory has, and the response to the relatively slow memory has been to add caches, which make frequently used data faster to access, and prefetching, which preloads data into the cache when the access pattern is predictable.

A few cycles versus 400+ sounds terrible, a 100x slowdown. But for a loop that reads and operates on a block of 64-bit (8-byte) values, the CPU is smart enough to prefetch the right data before I need it; at about 22GB/s on a 3GHz processor, we lose only around 8% of performance, not 100x.

The biggest wins on modern CPU cache architectures come from using predictable memory access patterns and operating on chunks of data that are smaller than the CPU cache. If you want to be as efficient as possible, this document (Ulrich Drepper's "What Every Programmer Should Know About Memory") is a good starting point. After digesting the 100-page PDF, you'll want to get familiar with your system's microarchitecture and memory subsystem, and learn to analyze and test applications with tools like likwid.
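
As a rough illustration of how much the access pattern matters, here is a minimal sketch (the buffer size and the strides are arbitrary choices, not numbers from the article) that sums the same buffer sequentially and then with one value per 4KB page:
C++ code
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sum every element of v, visiting them with the given stride, so the same
// work is done with a prefetch-friendly (stride 1) or cache-hostile pattern.
static int64_t sum_with_stride(const std::vector<int64_t>& v, size_t stride) {
    int64_t sum = 0;
    for (size_t start = 0; start < stride; ++start)
        for (size_t i = start; i < v.size(); i += stride)
            sum += v[i];
    return sum;
}

int main() {
    std::vector<int64_t> v(1 << 25, 1);                     // 256MB of 8-byte values
    const size_t strides[] = {1, 4096 / sizeof(int64_t)};   // sequential vs. one value per page
    for (size_t stride : strides) {
        auto t0 = std::chrono::steady_clock::now();
        int64_t sum = sum_with_stride(v, stride);
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("stride %zu: sum=%lld, %.2f GB/s\n", stride,
                    (long long)sum, v.size() * sizeof(int64_t) / secs / 1e9);
    }
    return 0;
}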

TLBs

The chip also has small caches that handle all sorts of things besides data; unless you're squeezing out every last drop of performance, you don't need to know about the decoded-instruction cache and the other funny little caches. The big exception is the TLB, the cache for virtual memory lookups (which on x86 go through a 4-level page table structure). The page tables live in the L1 data cache, so each full virtual address lookup costs 4 lookups, or 16 cycles. That is unacceptable for something every user-mode memory access needs, so there is a small, fast cache of virtual address translations.

Because the first-level TLB has to be fast, it is severely limited in size; with 4K pages, it determines how much memory you can touch without taking a TLB miss. x86 also supports 2MB and 1GB pages, and some applications benefit greatly from using these larger pages. If you have a long-running application that uses a lot of memory, it's worth looking into the details of this technique.
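
On Linux, one way to try this is transparent huge pages via madvise, or explicitly reserved huge pages via mmap with MAP_HUGETLB. A minimal sketch, with error handling mostly omitted and a 1GB region chosen arbitrarily (explicit huge pages also require a configured hugetlbfs pool):
C++ code
#include <cstdio>
#include <cstring>
#include <sys/mman.h>

int main() {
    const size_t len = 1UL << 30;  // 1GB region, size chosen arbitrarily

    // Transparent huge pages: ask the kernel to back this range with large pages.
    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    madvise(p, len, MADV_HUGEPAGE);   // hint only; needs THP enabled in the kernel

    // Explicit huge pages: requires a pre-reserved pool (vm.nr_hugepages).
    void *q = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (q == MAP_FAILED) perror("mmap(MAP_HUGETLB)");

    memset(p, 0, len);  // touch the memory so pages are actually allocated
    return 0;
}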

Out of Order Execution / Serialization

For the last two decades or so, x86 chips have been able to reorder the execution of instructions (to avoid stalling on a single blocked resource). This sometimes leads to very odd behavior. x86 is quite strict in requiring that, for a single CPU, externally visible state like registers and memory be updated as if everything had executed in order.

These restrictions make things look like they execute in order, and in most cases you can ignore the existence of OoO (out-of-order) execution unless you are trying to squeeze out performance. The main exception is when you need things not only to look like they execute in order from the outside, but to actually execute in order internally.

An example you might care about: measuring the execution time of a sequence of instructions with rdtsc. rdtsc reads out a hidden internal counter and places the result in the externally visible registers edx and eax.

Suppose we do this:
Assembly code
foo
rdtsc
bar
mov %eax, [%ebx]
baz

Here, foo, bar, and baz don't touch eax, edx, or [%ebx]. The mov that follows rdtsc writes the value of eax to some location in memory, and because eax is externally visible, the CPU guarantees that the mov executes after rdtsc, so that everything appears to happen in order.

However, because there is no apparent dependency between rdtsc and foo or bar, rdtsc could execute before foo, between foo and bar, or after bar. It is even possible for baz to execute before rdtsc, as long as baz doesn't affect the mov in any way. That's fine in some cases, but not if rdtsc is being used to measure the execution time of foo.

To order rdtsc precisely with respect to other instructions, we need to serialize execution. How, exactly? See this document from Intel: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
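
The technique described there boils down to executing a serializing instruction (cpuid) before the first rdtsc, and using rdtscp, which waits for earlier instructions to finish, for the second reading. A rough sketch of the idea for 64-bit GCC/Clang (the measured loop is just a placeholder):
C++ code
#include <stdint.h>
#include <stdio.h>

// cpuid serializes: nothing above it can leak past it before rdtsc runs.
static inline uint64_t rdtsc_begin() {
    uint32_t lo, hi;
    asm volatile("cpuid\n\t"
                 "rdtsc\n\t"
                 "mov %%edx, %0\n\t"
                 "mov %%eax, %1\n\t"
                 : "=r"(hi), "=r"(lo)
                 :
                 : "%rax", "%rbx", "%rcx", "%rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

// rdtscp waits for earlier instructions to finish before reading the counter;
// the trailing cpuid keeps later instructions from starting too early.
static inline uint64_t rdtsc_end() {
    uint32_t lo, hi;
    asm volatile("rdtscp\n\t"
                 "mov %%edx, %0\n\t"
                 "mov %%eax, %1\n\t"
                 "cpuid\n\t"
                 : "=r"(hi), "=r"(lo)
                 :
                 : "%rax", "%rbx", "%rcx", "%rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

int main() {
    uint64_t start = rdtsc_begin();
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x += i;   // the code being measured
    uint64_t end = rdtsc_end();
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}
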
Memory / Concurrency

The ordering restrictions above imply that loads and stores to the same location cannot be reordered with respect to each other. Beyond that, x86 loads and stores have some other restrictions. In particular, for a single CPU, stores are never reordered with earlier loads, regardless of whether they are to the same location.

However, loads can be reordered with earlier stores. For example:

mov 1, [%esp]  
mov [%ebx], %eax  


may be executed as if it were written:

mov [%ebx], %eax
mov 1, [%esp]

But not the other way around: if you write the second version, it will never be executed as if you had written the first.

You could force the first example to execute as written by inserting a serializing instruction. But that is very slow, because it forces the CPU to wait until all in-flight instructions have finished before doing anything. If you only care about load/store ordering, there is the mfence instruction, which only serializes loads and stores.
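
For example, putting mfence between the two instructions above (using the same pseudo-syntax as the earlier snippets) prevents the load from being reordered ahead of the store; in C++11, std::atomic_thread_fence(std::memory_order_seq_cst) is roughly the portable way to ask for the same full fence:

mov 1, [%esp]
mfence
mov [%ebx], %eax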

This article isn't going to discuss the other memory fences, lfence and sfence, but you can read more about them here.

Single-core loads and stores are therefore mostly ordered. For multiple cores, the same restrictions apply: if core0 observes core1, it will see that all the single-core rules apply to core1's loads and stores. However, if core0 and core1 interact, there is no guarantee that their interaction is ordered.

For example, suppose core0 and core1 start with eax and edx set to 0, and core0 executes:
Assembly code
mov 1, [_foo]
mov [_foo], %eax
mov [_bar], %edx

while core1 executes:
Assembly code
mov 1, [_bar]
mov [_bar], %eax
mov [_foo], %edx

For both cores, eax must be 1 because of the dependency between the first and second instructions. However, it is possible for edx to be 0 on both cores, because line 3 of core0 can execute before core0 sees anything from core1, and vice versa.

Memory barriers serialize memory accesses within a core. Linus had this to say about using memory barriers instead of locking:
Quote

The real cost of not doing locking ends up being unavoidable in the end. Trying to be clever with memory barriers is almost always a precursor to bugs. With all the things that can happen across a dozen different architectures with different memory orderings, a missing tiny barrier somewhere is really hard for you to sort out... In fact, any time anybody comes up with a new locking mechanism, they always get it wrong.

And it turns out that on modern x86 CPUs, using locking to implement concurrency primitives is often cheaper than using memory barriers, so let's look at locks.

If we set _foo to 0 and have two threads that each execute incl (_foo) 10,000 times, incrementing the same location with a single instruction 20,000 times in total, the final result is not guaranteed to be 20,000; in theory it could be as low as 2. Figuring out why is a good exercise.

We can experiment with a simple piece of code:
C++ code

#include <stdio.h>
#include <thread>

#define NUM_ITERS 10000
#define NUM_THREADS 2

int counter = 0;
int *p_counter = &counter;

void asm_inc() {
    int *p_counter = &counter;
    for (int i = 0; i < NUM_ITERS; ++i) {
        asm("incl (%0) \n\t" : : "r" (p_counter));
    }
}

int main() {
    std::thread t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; ++i) {
        t[i] = std::thread(asm_inc);
    }
    for (int i = 0; i < NUM_THREADS; ++i) {
        t[i].join();
    }
    printf("Counter value: %i\n", counter);
    return 0;
}

Compiling this with clang++ -std=c++11 -pthread and running it on my two machines gives a spread of results: not only do the results vary from run to run, but the distribution of results differs across machines. We never reach the theoretical minimum of 2, or for that matter anything below 10000, but it is possible to get any final result between 10000 and 20000.

Although incl is a single instruction, it is not guaranteed to be atomic. Internally, incl is a load, followed by an add, followed by a store. An increment on cpu0 can sneak in and execute between the load and the store on cpu1, and vice versa.

Intel's solution to this is the lock prefix, which can be applied to a small number of instructions to make them atomic. If we change the incl in the code above to lock incl, the output is always 20000.
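
Concretely, the only change to the asm_inc loop above is the prefix, sketched below (a "memory" clobber is added so the compiler doesn't reorder around it; C++11's std::atomic fetch_add gives you the same guarantee portably):
C++ code
void asm_lock_inc() {
    int *p_counter = &counter;
    for (int i = 0; i < NUM_ITERS; ++i) {
        // lock makes the whole read-modify-write atomic across cores
        asm("lock incl (%0) \n\t" : : "r" (p_counter) : "memory");
    }
}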

To make a sequence of operations atomic, we can use xchg or cmpxchg, which are always locked and serve as compare-and-swap primitives. This article won't describe in detail how they work, but if you're curious you can read this article by David Dalrymple.

In addition to making memory transactions atomic, locked instructions are globally ordered with respect to each other, and loads and stores are not reordered with respect to them. For a rigorous model of x86 memory ordering, see the x86 TSO paper.
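
To illustrate how an always-locked exchange becomes a locking primitive, here is a minimal test-and-set spinlock sketch (my own illustration, not code from the article); std::atomic's exchange compiles down to an implicitly locked xchg on x86:
C++ code
#include <atomic>
#include <cstdio>
#include <thread>

// Minimal test-and-set spinlock: only one thread at a time can see 0 and
// win the exchange, because xchg is atomic.
struct Spinlock {
    std::atomic<int> locked{0};
    void lock() {
        while (locked.exchange(1, std::memory_order_acquire)) {
            // spin until we win the exchange
        }
    }
    void unlock() { locked.store(0, std::memory_order_release); }
};

Spinlock lk;
long total = 0;

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            lk.lock();
            ++total;
            lk.unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    printf("total = %ld\n", total);  // always 200000
    return 0;
}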

In C or C++:
C code
local_cpu_lock = 1;
// .. do something important..
local_cpu_lock = 0;

The compiler has no way of knowing that local_cpu_lock = 0 can't be moved into the middle of the important section. Compiler barriers are distinct from CPU memory barriers. Because the x86 memory model is relatively strict, some compiler barriers are no-ops at the hardware level and simply tell the compiler not to reorder things. If you use a language at a higher level of abstraction than microcode, assembly, C, or C++, your compiler most likely has no annotation for this at all.
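
With GCC or Clang, the usual compiler-only barrier is an empty asm statement with a "memory" clobber: it emits no instructions, but it stops the compiler from moving memory accesses across it. A sketch (the surrounding variables are hypothetical):
C++ code
// Compiler barrier: generates no machine code, but the compiler may not move
// memory accesses across it. This is NOT a CPU memory barrier.
#define barrier() asm volatile("" ::: "memory")

volatile int local_cpu_lock = 0;
int important_data = 0;

void critical_section() {
    local_cpu_lock = 1;
    barrier();            // keep the work below from floating above the lock
    important_data += 1;  // ... do something important ...
    barrier();            // keep the unlock below from floating above the work
    local_cpu_lock = 0;
}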

Memory / Porting

If you're porting code to other architectures, keep in mind that x86 has perhaps the strongest memory model of any architecture you're likely to encounter today. Code that isn't thought through carefully will almost certainly break when ported to an architecture with weaker guarantees (PPC, ARM, or Alpha).

Consider Linus's comments on this example:
Quote

CPU1                 CPU2
----                 ----
                     if (x == 1) z = y;
y = 5;               mb();
x = 1;

...if I've read the Alpha architecture memory ordering guarantees correctly, then at least in theory you really can end up with z = 5

mb is a memory barrier. This article won't go into the details, but if you're wondering why anyone would create a specification that allows this sort of crazy behavior, consider that before rising fab costs knocked out DEC, its chips were so fast that they could run x86 software in emulation faster than x86 chips could run it natively. For why most RISC-y architectures made the decisions they made at the time, see the paper on the motivation behind the Alpha architecture.

By the way, this is a major reason I'm skeptical of the Mill architecture: regardless of whether it can achieve the performance it claims, being technically excellent is not, by itself, a sound business model.

Memory / Non-Temporal Stores / Write-Combine Memory

The restrictions described in the previous section apply to cacheable (i.e., "write-back" or WB) memory. Before that existed, there was only uncacheable (UC) memory.

One interesting thing about UC memory is that every load and store is expected to actually go out on the bus. That makes perfect sense for a processor with little or no onboard cache.

Memory / NUMA

Non-uniform memory access (NUMA), where memory access latency and bandwidth differ depending on which processor is accessing which memory, is common enough now that it can be assumed by default.

The practical requirement here is that threads that share memory should live on the same socket, and a thread that does heavy memory-mapped I/O should make sure it sits on the socket closest to the I/O device it talks to.
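
On Linux, one way to arrange this is with CPU affinity. A hedged sketch using pthread affinity (the assumption that CPUs 0-7 are the cores of socket 0 is purely illustrative; check the real topology with lscpu or use libnuma):
C++ code
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin a thread to a range of CPUs. We assume, purely for illustration, that
// CPUs 0-7 are the cores of socket 0; check your actual topology first.
static void pin_to_cpus(std::thread &t, int first_cpu, int last_cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = first_cpu; cpu <= last_cpu; ++cpu) CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::thread worker([] {
        // ... work on data shared with other socket-0 threads ...
    });
    pin_to_cpus(worker, 0, 7);   // keep this thread on socket 0
    worker.join();
    return 0;
}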

Once upon a time there was only memory. Then CPUs got so fast relative to memory that people wanted to add caches. The bad news is that a cache can be inconsistent with its backing store (memory), so the cache has to keep track of what it's holding so that it knows whether and when it needs to write things back to the backing store.

That isn't too bad, but once you have two cores, each with its own cache, things get complicated. To keep the same programming model as the one without caches, the caches have to be consistent with each other and with the backing store. Because the existing load/store instructions have nothing in their API that lets them say "Sorry! This load failed because another CPU is holding the address you want", the simplest approach is for each CPU to send a message onto the bus every time it wants to load or store something. We already have a memory bus both CPUs are connected to, so we just require the other CPU to respond when the data has been modified in its cache (and to give up the corresponding cache line).

In most cases, each CPU touches only data the other CPUs don't care about, so all this does is waste some bus traffic. That's not too bad, because once a CPU has put out a message saying "Hi! I'm taking ownership of this address and modifying its data", it can assume it owns the address outright until some other CPU asks for it, which doesn't always happen.

With 4 cores this still works, although a bit more traffic is wasted. But the scheme fails badly as the number of CPUs grows well beyond that, both because the bus becomes saturated and because the caches become saturated (the physical size/cost of a cache is O(n^2) in the number of simultaneous reads and writes it supports, and its speed is inversely related to its size).

The "simple" solution to this problem is to have a single centralized directory that records all the information, instead of doing N-way peer-to-peer broadcasts. Since we're packing 2-16 cores onto a chip these days anyway, it's natural for each socket to have a single directory that tracks the cache state of each of its cores.

That solves the problem for each chip, but the chips still need some way to talk to each other. Unfortunately, as these systems scaled up, bus speeds got so fast that it's genuinely hard to drive a signal far enough to connect a bunch of chips and memory on one bus. The simplest solution is for each socket to own a region of memory, so that not every socket needs to be connected to every part of memory. This also avoids the complexity of needing a higher-level directory of directories, since it's clear which directory owns any particular piece of memory.

The downside is that if you're sitting on one socket and want memory owned by another socket, you pay a significant performance penalty. For simplicity, most "small" (< 128 core) systems use a ring bus, so the penalty isn't just the direct latency/bandwidth cost of taking a series of hops to reach the memory; it also uses up a limited resource (the ring bus) and slows down accesses to other sockets.

In theory the OS handles this transparently, but in practice it is often inefficient.

Context Switches/System Calls (Syscalls)

Here, syscall refers to the Linux system call, not the x86 SYSCALL or SYSENTER instructions.

A side effect of all the fanciness in modern cores is that context switches are expensive, which makes system calls expensive. Livio Soares and Michael Stumm discuss this in detail in their paper, and I'll use some of their data below. Their measurement of instructions per cycle (IPC) for a Core i7 running Xalan shows that even 14,000 cycles after a system call, the code is still not running at full speed.

They also measured the footprint of several different system calls: both the direct costs (instructions and cycles) and the indirect costs (number of cache and TLB evictions).

Some syscalls cause more than 40 TLB evictions! For a chip with a 64-entry D-TLB, that nearly wipes the TLB out. The cache evictions aren't free either.

The high cost of system calls is why people have turned to batched versions of system calls (such as epoll, or recvmmsg) for high-performance code, and why people who need very high-performance I/O often use userspace I/O stacks. The cost of context switches is also why high-performance code tends to use one thread per core (or even a single thread pinned to a core) rather than one thread per logical task.
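
As an illustration of what "batched" means here, recvmmsg lets one system call return many UDP packets. A rough sketch (the port number and batch size are arbitrary, and error handling is minimal):
C++ code
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <cstdio>
#include <cstring>

int main() {
    const int BATCH = 64;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(9000);          // arbitrary port for the sketch
    bind(fd, (sockaddr *)&addr, sizeof(addr));

    char bufs[BATCH][2048];
    iovec iov[BATCH];
    mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; ++i) {
        iov[i] = {bufs[i], sizeof(bufs[i])};
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    // One syscall can return up to BATCH packets instead of one per recvmsg.
    int n = recvmmsg(fd, msgs, BATCH, 0, nullptr);
    for (int i = 0; i < n; ++i) {
        printf("packet %d: %u bytes\n", i, msgs[i].msg_len);
    }
    return 0;
}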

This high cost is also the motivation for the VDSO, which turns some system calls that don't require elevated privileges into plain userspace library calls.

SIMD

Basically all modern x86 CPUs support SSE: 128-bit wide vector registers and instructions. Since it's common to want to do the same operation many times, Intel added instructions that let you treat a 128-bit block of data as 2 64-bit values, 4 32-bit values, 8 16-bit values, and so on. ARM supports the same kind of thing under a different name (NEON), with a similar instruction set.

Getting a 2x or 4x speedup from SIMD instructions is quite common, and it's definitely worth looking into if you have a computation-heavy workload.
Compilers are good enough to recognize common, simple patterns that can be vectorized; for code like the following, a modern compiler will automatically use vector instructions:
C++ code
for (int i = 0; i < n; ++i) {
sum += a[i];
}

However, compilers will often produce code that's worse than hand-written assembly, especially for SIMD, so if you care about getting the best possible performance, you should look at the disassembly and check for compiler optimization mistakes.
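
For comparison, here is what that reduction looks like written by hand with SSE intrinsics, as a sketch for float data (it assumes n is a multiple of 4; real code needs a scalar tail loop):
C++ code
#include <immintrin.h>

// Sum n floats using 128-bit SSE registers, 4 lanes at a time.
// Assumes n is a multiple of 4; a real implementation needs a tail loop.
float sum_sse(const float *a, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  // unaligned load of 4 floats
    }
    // Horizontal sum of the 4 lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}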

Power Management

Modern CPUs have a lot of fancy power management features that optimize power usage in different scenarios. The upshot of these is that "race to idle", getting the work done as fast as possible and then putting the CPU back to sleep, is the most energy-efficient strategy.

While there are specific micro-optimizations that have been shown to help power consumption, applying them to real workloads often yields smaller benefits than expected.

GPU/GPGPU

I'm not as qualified to talk about these as about the other topics. Fortunately, Cliff Burdick volunteered to write the following section:
Quote

Before 2005, graphics processing units (GPUs) were limited to APIs that allowed only very limited control of the hardware. As the libraries became more flexible, programmers began using these processors for more general tasks, such as linear algebra routines. The parallel architecture of a GPU can work on large blocks of a matrix by launching hundreds of concurrent threads. However, the code had to go through the traditional graphics APIs and was still limited by how much of the hardware it could control. Nvidia and ATI took notice and released frameworks that gave people outside the graphics world more familiar APIs and more access to the hardware. The libraries gained popularity, and today GPUs are widely used alongside CPUs for high-performance computing (HPC).

Compared with CPUs, GPU hardware has a few major differences, outlined below:

Processors

At the top level, a GPU contains one or more streaming multiprocessors (SMs). Each streaming multiprocessor on a modern GPU typically contains more than 100 floating-point units, commonly referred to as cores in the GPU world. Each core usually runs at around 800MHz, although, as with CPUs, processors with higher clock frequencies but fewer cores also exist. GPU processors lack many features of their CPU counterparts, including large caches and branch prediction. Communication gets slower and slower as you move between the levels of cores, SMs, and the processor as a whole. For this reason, problems that perform well on GPUs are usually highly parallel, but with some data that can be shared among a small number of threads. The memory section below explains why.

Memory

Modern GPU memory is divided into 3 categories: global memory, shared memory, and registers. Global memory is the GDDR memory advertised on the GPU's box, typically 2-12GB in size, with throughput of roughly 300-400GB/s. Global memory is accessible by threads on all the SMs on the processor, and it is also the slowest type of memory on the card. Shared memory is, as the name suggests, memory shared among all the threads within the same SM. It is usually at least twice as fast as global memory, but it is not accessible between threads of different SMs. Registers are much like registers on a CPU: they are the fastest way to access data on a GPU, but they are local to each thread, and the data is not visible to other running threads. Both shared memory and global memory have strict rules about how they can be accessed, with severe performance penalties for not following them. To reach the throughput mentioned above, memory accesses must be fully coalesced among the threads in the same thread group. Similar to a CPU reading a single cache line, the GPU can serve all the threads in a group with a single access if the accesses are properly aligned. The worst case is when every thread in a group accesses a different cache line, so that each thread requires its own memory read; this usually means most of the data in each cache line goes unused, and the usable memory throughput drops. Similar rules apply to shared memory, with some exceptions that we won't cover here.

Threading Model

GPU threads run in a single-instruction, multiple-thread (SIMT) fashion: threads run in groups of a size predefined by the hardware (usually 32). This has a lot of implications; every thread in a group must execute the same instruction at the same time. If any threads in a group need to take a divergent code path (for example, one side of an if statement), all the threads not participating in that branch simply wait until the branch finishes. As a simple example:
C code
if (threadId < 5) {
// Do something
}
// Do More

In the code above, the branch causes 27 of our 32 threads to suspend execution until the branch ends. As you can imagine, if many groups of threads run this code, overall performance takes a big hit while most of the cores sit idle. Only when an entire group of threads is stalled is the hardware allowed to swap another group onto those cores.

Interfaces

Modern GPUs need a CPU to copy data back and forth between CPU and GPU memory and to launch the GPU code. At peak throughput, a PCIe 3.0 bus with 16 lanes reaches around 13-14GB/s. That may sound high, but it is an order of magnitude slower than the memory inside the GPU itself. In fact, graphics processors have become so powerful that the PCIe bus is increasingly the bottleneck. To see any performance advantage from the GPU over the CPU, the GPU must be given so much work that the time it spends running the work far exceeds the time spent sending and receiving the data.

Newer GPUs have features that let GPU code launch more work dynamically without going back to the CPU, but their use is currently quite limited.

GPU Conclusion

Due to the major architectural differences between CPUs and GPUs, it's hard to imagine either completely replacing the other. In fact, a GPU complements a CPU's parallel work very well, allowing the CPU to complete other tasks independently while the GPU is running. AMD is trying to merge the two technologies with their "Heterogeneous System Architecture" (HSA), but taking existing CPU code and deciding how it should be split between the CPU and GPU parts of the processor will be a big challenge, not only for the processors but also for compilers.

Virtualization

Unless you're writing very low-level code that deals directly with virtualization, the virtualization instructions Intel has added are not something you need to think about.

Dealing with that stuff is pretty hairy, as you can see from the code here. Even for the very simple example shown there, it takes about 1000 lines of low-level code to set up Intel's VT instructions and launch a virtual guest.

Virtual Memory

If you look at Vish's VT code, you'll see there's a nice block of code dedicated to page tables/virtual memory. This is another "new" feature you don't have to worry about unless you're writing an OS or other low-level system code. Using virtual memory is much simpler than using segmented memory, but that's all that needs to be said about it here.

SMT/Hyper-threading

Hyperthreading is mostly transparent to the programmer. A typical speedup from enabling SMT on a single core is around 25%. That's good for overall throughput, but it means each thread may get only about 60% of its original performance. For applications where you care a lot about single-threaded performance, you may be better off disabling SMT. It depends heavily on the workload, though, and as with any other change, you should run benchmarks on your specific workload to see what works best.

A side effect of all this complexity being added to chips (and software) is that performance is much less predictable than it once was; the importance of benchmarking on your specific hardware has risen correspondingly.

People often cite the Computer Language Benchmarks Game as evidence that one language is faster than another. When I've tried to reproduce the results myself on my mobile Haswell (versus the server Kentsfield used for the published results), I've gotten results that differ by up to 2x in relative speed. Nathan Kurz recently pointed me to an example where gcc -O3 is 25% slower than gcc -O2, even when running the same benchmark on the same machine. Changing the link order of a C++ program can cause a 15% performance change. Benchmark selection is a hard problem.

Branches

Conventional wisdom says that branches are expensive and should be avoided by every (or at least most) means possible. On Haswell, the misprediction penalty for a branch is 14 clock cycles. The branch misprediction rate depends on the workload; using perf stat on a few different things (bzip2, top, mysqld, regenerating my blog), I got branch misprediction rates between 0.5% and 4%. If we say a correctly predicted branch costs 1 cycle, the average cost is somewhere between .995 * 1 + .005 * 14 = 1.065 cycles and .96 * 1 + .04 * 14 = 1.52 cycles. That's not so bad.

This actually overstates the cost of branches since about 1995, because Intel added conditional move instructions that let you move data conditionally without a branch. That instruction was memorably criticized by Linus, which gave it a bad reputation, but it's fairly common to get a significant speedup from cmov compared with branches. A fairly common real-world example of the cost of extra branches is enabling integer overflow checks: when using bzip2 to compress a particular file, that increases the instruction count by about 30% (all of the increase coming from extra branch instructions), which results in a roughly 1% performance penalty.

Unpredictable branches are bad, but most branches are predictable. Ignoring the cost of branches until your profiler tells you there's a hot spot is quite reasonable these days. CPUs have gotten much better at executing poorly optimized code over the last decade, and compilers have gotten better at optimizing code, which makes hand-optimizing branches a poor use of time unless you're trying to squeeze the absolute best performance out of some piece of code.

If that does turn out to be what you need, you're better off using profile-guided optimization than trying to do it by hand.

If you really have to do this manually, there are some compiler directives you can use to indicate whether a particular branch is likely to be taken or not. Modern CPUs ignore branch hint instructions, but they help the compiler lay out the code better.
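
With GCC and Clang, the annotation is __builtin_expect, often wrapped in likely/unlikely macros in the style of the Linux kernel; the error-path example below is made up for illustration:
C++ code
// Branch layout hints for GCC/Clang. Modern CPUs ignore the old branch-hint
// prefixes, but the compiler uses this to put the unlikely path out of line.
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(int *p) {
    if (unlikely(p == nullptr)) {
        return -1;              // rare error path, laid out out of line
    }
    return *p + 1;              // hot path falls straight through
}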

Alignment

Conventional experience says we should pad out structs and make sure data is aligned. But on a Haswell chip, the misalignment penalty is essentially zero for just about anything you can think of that doesn't span pages, for single-threaded code. There are cases where alignment helps, but in the general case it's another optimization that doesn't matter, because CPUs have gotten so much better at executing bad code; it's also mildly harmful, because it increases the memory footprint for no real benefit.

Also, don't page-align or otherwise arrange things to large bounds, or you'll kill cache performance.

Self-modifying code

This is another optimization that makes little sense today. Self-modifying code used to make sense as a way to reduce code size or increase performance, but because modern caches tend to split L1 into separate instruction and data caches, modifying running code requires expensive communication between a chip's L1 caches.

The Future
Below are some possible changes, ranging from the most conservative to the boldest.

Transactional Memory and Hardware Lock Elision

IBM already has these features in its POWER chips. Intel tried to add them to Haswell, but they were disabled because of a bug.

Transactional memory support is exactly what it sounds like: hardware support for transactions, via three new instructions: xbegin, xend, and xabort.

xbegin starts a new transaction. A conflict (or an explicit xabort) rolls the architectural state of the processor (including memory) back to its state before the xbegin. If you use transactional memory through a library or a language feature, this should be transparent to you. If you're implementing the library support, you have to figure out how to map hardware support with a limited hardware buffer size onto abstract transactions.
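
For the curious, GCC and Clang expose the instructions as intrinsics in immintrin.h (compile with -mrtm, and the CPU must actually support TSX). A hedged sketch of the usual pattern, which always needs a fallback path because a transaction can abort for many reasons:
C++ code
#include <immintrin.h>
#include <atomic>

std::atomic<int> fallback_lock{0};
long counter = 0;

// Try to update counter inside a hardware transaction; fall back to a simple
// spinlock if the transaction aborts (conflicts, capacity, interrupts, ...).
void increment() {
    if (_xbegin() == _XBEGIN_STARTED) {
        // Reading the fallback lock puts it in our read set, so if another
        // thread takes the lock, this transaction aborts instead of racing it.
        if (fallback_lock.load(std::memory_order_relaxed) != 0) _xabort(0xff);
        ++counter;                      // becomes visible only at _xend()
        _xend();
    } else {
        while (fallback_lock.exchange(1, std::memory_order_acquire)) { }
        ++counter;
        fallback_lock.store(0, std::memory_order_release);
    }
}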

Hardware lock elision is built on mechanisms very similar to those used to implement transactional memory, and it's designed to speed up lock-based code. If you want to take advantage of HLE, take a look at this documentation.

Fast I/O

For both storage and networking, I/O bandwidth keeps rising and I/O latency keeps falling. The problem is that I/O is usually done through system calls, and as we saw above, the relative cost of system calls has been going up. For both storage and networking, the answer is to move to userspace I/O stacks.

Dark Silicon/System-on-Chip

An interesting side effect of transistor scaling is that we can pack a huge number of transistors onto a chip, but they generate so much heat that the average transistor can't be switching most of the time if you don't want the chip to melt.

A result of this is that it makes more sense to include dedicated hardware that sits unused a significant fraction of the time. On one hand, this means we get all sorts of specialized instructions like PCMP and ADX. On the other hand, it means we're integrating onto the chip entire devices that used to live off-chip, including things like GPUs and (for mobile devices) radios.

Combined with the trend toward hardware accelerators, it also means it makes more sense for companies to design their own chips, or at least parts of their own chips. Apple has gotten a lot of mileage out of acquiring PA Semi: first by adding a handful of custom accelerators to a bog-standard ARM architecture, and then by adding custom accelerators to their own custom architecture. Thanks to the right custom hardware and a careful combination of benchmarking and system design, the iPhone 4 is slightly more responsive than my flagship Android phone, which is many years newer, with a much faster processor and much more RAM.

Amazon picked up part of the old Calxeda team and is hiring a hardware design team of significant size. Facebook has also picked up ARM SoC experts and is collaborating with Qualcomm on something. Linus is on record saying, "we're going to see more dedicated hardware everywhere," and so on.

Conclusion
x86 chips have picked up a lot of new features and very useful little quirks. In most cases, you don't need to know what they are to take advantage of them: the really low-level stuff is usually hidden by libraries or drivers, and compilers try to take care of the rest. The exceptions are if you're actually writing low-level code, in which case the world has gotten a lot messier, or if you're trying to wring the absolute best performance out of your code, in which case it's gotten even weirder.

Some things seem bound to happen in the future. But past experience tells us that most predictions are wrong, so who knows?

Translator: Ted, a software engineer at Realtek in Singapore, working on R&D of WiFi and other chips, embedded system software design and development, and the Internet of Things.
