C is no longer a programming language

Recently, an article by Rust and Swift veteran Aria Beingessner, "C is no longer a programming language," caused a heated discussion on Hacker News.

Original link: https://gankra.github.io/blah/c-isnt-a-language/

Hacker News comment section: https://news.ycombinator.com/item?id=30704642

Aria and her friend phantomderp agree that they are "very disappointed with the C ABI and want to fix it," but they disagree about the reasons for that disappointment. What exactly do they disagree on, and why claim that C is no longer a programming language? The following is compiled from the original article:

Compiled by | Yu Xuan

Produced by | Program Life (ID: coder_life)

phantomderp is trying to improve the experience of using C itself as a programming language, while Aria wants to improve the experience of using every language other than C.

At this point you may be wondering: what does that second problem have to do with C at all?

Aria's answer: if C were really just a programming language, it would have nothing to do with this. Unfortunately, it isn't. The problem is not that C has countless implementations and a hierarchy of standards failures that leave it badly defined; it's that C has been elevated to a role of prestige and power whose reign is absolute and eternal. C is the lingua franca of programming. We all have to learn C, and so C is no longer just a programming language: it has become a protocol that every general-purpose programming language has to abide by.

This does partly overlap with the complaint that "C is an inscrutable, implementation-defined mess," but the fact that we are all forced to use it as a protocol turns that mess into an even bigger nightmare.

Foreign function interfaces

Let's get technical. Suppose you have just finished designing your new language, Bappyscript, with first-class support for Bappy Paws/Hooves/Fins. An amazing language that will revolutionize the way cats, sheep, and sharks program.

But now it needs to actually do something useful, like accept user input, produce output, or have literally any observable effect at all. If you want programs written in your language to run on mainstream operating systems, you need to talk to the operating system's interfaces. We've heard that on Linux everything is "just a file," so let's open a file on Linux!

OPEN(2)


NAME
       open, openat, creat - open and possibly create a file


SYNOPSIS


       #include <fcntl.h>


       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);


       int creat(const char *pathname, mode_t mode);


       int openat(int dirfd, const char *pathname, int flags);
       int openat(int dirfd, const char *pathname, int flags, mode_t mode);
       /* Documented separately, in openat2(2): */
       int openat2(int dirfd, const char *pathname,
                   const struct open_how *how, size_t size);


   Feature Test Macro Requirements for glibc (see
   feature_test_macros(7)):


       openat():
           Since glibc 2.10:
               _POSIX_C_SOURCE >= 200809L
           Before glibc 2.10:
               _ATFILE_SOURCE

Hold on, this is C, not Bappyscript. Where's the Bappyscript interface for Linux?

What do you mean Linux has no Bappyscript interface? Well, of course it doesn't: it's a brand new language. Surely one will get added, right? No: you find that you have to use the interface they give you, and that interface is described in C.

So you need some mechanism that lets your language call foreign functions: an FFI. And sure enough, Rust has a C FFI, Swift has one, even Python has one.


Everyone has to learn to speak C to talk to the major operating systems, and once everyone speaks C, languages start using C to talk to each other too. So... why not just talk to each other in C?
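Even a language as far from C as Python ends up speaking C for this. Here is a minimal sketch of calling the open from the man page above through Python's ctypes FFI (assuming Linux, where O_RDONLY is 0):

```python
import ctypes

# Load the C library already linked into the process (works on Linux).
libc = ctypes.CDLL(None, use_errno=True)

# We must re-state open()'s C signature by hand; ctypes cannot read fcntl.h.
libc.open.argtypes = [ctypes.c_char_p, ctypes.c_int]
libc.open.restype = ctypes.c_int

O_RDONLY = 0  # hardcoded from Linux's headers: not portable C, just this ABI
fd = libc.open(b"/dev/null", O_RDONLY)
assert fd >= 0, "open() failed"
libc.close(fd)
```

Notice that nothing here is Python-shaped: the signature, the flag value, and the error convention all come from C.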

And so C stopped being just a programming language: it became a protocol.

What does talking to C involve?

By now it should be clear that basically every language has to learn to talk to C, so let's be explicit about what that means.

What does "talking to C" mean? It means taking a description of an interface's types and functions, in the form of a C header, and somehow:

  • matching the layouts of those types

  • doing whatever the linker does to resolve the functions' symbols to pointers

  • calling those functions with the appropriate ABI (e.g., putting arguments in the right registers)

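The first of those steps can be made concrete: matching a type's layout means re-declaring it by hand and hoping you got it right. A sketch in Python's ctypes, mirroring a hypothetical C struct (not from any real header):

```python
import ctypes

# Hypothetical C definition we are mirroring:
#   struct point { int32_t x; int32_t y; int64_t id; };
class Point(ctypes.Structure):
    _fields_ = [
        ("x", ctypes.c_int32),
        ("y", ctypes.c_int32),
        ("id", ctypes.c_int64),
    ]

# If this guess at size/alignment is wrong, every call that passes a
# Point by value or by pointer silently corrupts data.
assert ctypes.sizeof(Point) == 16
assert Point.id.offset == 8  # x and y pack into the first 8 bytes
```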
So, here are a few questions:

  • You can't actually write a C parser

  • C doesn't really have an ABI, not even a defined type layout

Can't actually parse a C header file

Aria asserts that parsing C is basically impossible. But aren't there plenty of tools that can read C headers, like rust-bindgen? Not really. bindgen's own documentation says:

bindgen uses libclang to parse C and C++ header files. To modify the way bindgen searches for libclang, see the clang-sys documentation. For more details on how bindgen uses libclang, see the bindgen user guide.

Everyone who spends serious time trying to handle C(++) headers quickly gives up on parsing them themselves and lets a real C(++) compiler do the job. Remember, meaningfully parsing a C header is more than just parsing: you also have to resolve #includes, typedefs, and macros! So now you not only need to implement C's semantics, you need to replicate every platform's header-resolution logic and work out what is #defined where!

Take Swift, for example, a language with every possible advantage in C interop and resources. It's developed by Apple, it effectively replaced Objective-C as the language for defining and using system APIs on Apple's platforms, and it takes ABI stability and design further than almost anyone else.

It also has one of the strongest FFI stories Aria has ever seen: it can natively import (Objective-)C(++) headers and produce a nice native Swift interface, with types automatically "bridged" to their Swift equivalents at the boundary (often transparently, since the types share the same ABI).

Swift is also developed by many of the people at Apple who build and maintain Clang and LLVM. These are the world's top experts on C and its derivatives. One of them is Doug Gregor, who once expressed his views on C FFI:


This is one of the reasons Swift uses Clang internally to handle the C(++) ABI: so we don't have to chase every new attribute Clang adds that affects the ABI.

As you can see, even Swift doesn't want to parse C(++) headers itself. So what do you do if you absolutely refuse to let a C compiler parse and resolve headers at compile time?

You translate them by hand! int64_t? Write i64. long...? Um... what's a long?

C doesn't actually have an ABI

No surprises here: C's integer types, whose sizes are deliberately left wobbly for "portability," genuinely do vary in size. We can dismiss platforms where CHAR_BIT isn't 8 as exotic, but that still doesn't tell us the size and alignment of long.
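You can watch the wobble directly: fixed-width types have one size everywhere, while the classic C types only have a size per platform. A small probe (the exact numbers printed depend on your platform):

```python
import ctypes

# Fixed-width types are the same size on every platform:
assert ctypes.sizeof(ctypes.c_int64) == 8

# But `long` is 8 bytes on x86_64 Linux and 4 on 64-bit Windows; `int` is
# 4 almost everywhere today. C itself guarantees none of this.
sizes = {name: ctypes.sizeof(getattr(ctypes, name))
         for name in ("c_char", "c_short", "c_int", "c_long", "c_longlong")}
print(sizes)
```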

You might object that every platform has a standardized calling convention and ABI. They do, and those documents generally define the layouts of C's key primitives (though some don't define the calling convention purely in terms of C types; side-eye at AMD64 SysV).

Here's the thorny part: neither the architecture nor the operating system alone pins down the ABI. You have to commit to a full target triple, like "x86_64-pc-windows-gnu" (not to be confused with "x86_64-pc-windows-msvc"). rustc currently lists 176 of them:

> rustc --print target-list


aarch64-apple-darwin
aarch64-apple-ios
aarch64-apple-ios-macabi
aarch64-apple-ios-sim
aarch64-apple-tvos


...
armv7-unknown-linux-musleabi
armv7-unknown-linux-musleabihf
armv7-unknown-linux-uclibceabihf
...
x86_64-uwp-windows-gnu
x86_64-uwp-windows-msvc
x86_64-wrs-vxworks
>_

And that undercounts the number of ABIs, because it doesn't even include all the different calling conventions on each platform, like stdcall vs fastcall or aapcs vs aapcs-vfp.

But at least all these ABIs and calling conventions are written down somewhere. And at least the major C compilers agree on the ABI for a given target triple! Sure, there are weird janky C compilers out there, but Clang and GCC surely aren't among them... right?

> abi-checker --tests ui128 --pairs clang_calls_gcc gcc_calls_clang


...


Test ui128::c::clang_calls_gcc::i128_val_in_0_perturbed_small        passed
Test ui128::c::clang_calls_gcc::i128_val_in_1_perturbed_small        passed
Test ui128::c::clang_calls_gcc::i128_val_in_2_perturbed_small        passed
Test ui128::c::clang_calls_gcc::i128_val_in_3_perturbed_small        passed
Test ui128::c::clang_calls_gcc::i128_val_in_0_perturbed_big          failed!
test 57 arg3 field 0 mismatch
caller: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 3A, 3B, 3C, 3D, 3E, 3F]
callee: [38, 39, 3A, 3B, 3C, 3D, 3E, 3F, 40, 41, 42, 43, 44, 45, 46, 47]
Test ui128::c::clang_calls_gcc::i128_val_in_1_perturbed_big          failed!
test 58 arg3 field 0 mismatch
caller: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 3A, 3B, 3C, 3D, 3E, 3F]
callee: [38, 39, 3A, 3B, 3C, 3D, 3E, 3F, 40, 41, 42, 43, 44, 45, 46, 47]


...


392 passed, 60 failed, 0 completely failed, 8 skipped

Above is Aria's FFI abi-checker running on x64 Ubuntu 20.04, testing some pretty boring situations on this rather important, well-behaved platform. It turns out that some integer arguments fail to pass correctly by value between two static libraries compiled by Clang and GCC!

Aria found that Clang and GCC can't even agree on the ABI of __int128 on x64 Linux.

The abi-checker was originally written to find bugs in rustc; Aria didn't expect it to find a disagreement between the two major C compilers on such an important and long-established ABI.

Trying to tame C

Aria believes the scary thing is that only the platform's C compiler can semantically parse C headers. And even once the C compiler has told you the types and how to interpret the annotations, you still don't actually know the size, alignment, and conventions of everything. So how do you interoperate with all this mess? Aria offers two options.

The first option is to surrender completely and soul bind your language to C, which can be any of the following:

  • Write your compiler/runtime in C(++), so that it speaks C natively anyway

  • Make your "codegen" emit C(++) directly, so the user needs a C compiler anyway

  • Build your compiler on top of a full-fledged major C compiler (Clang or GCC)

But all of the above only gets you so far, because unless your language actually exposes unsigned long long and friends, you're inheriting C's huge portability mess.

Which brings us to the second option: lying, cheating, and stealing.

If all this is an unavoidable disaster anyway, you might as well start translating type and interface definitions into your language by hand, which is basically what we do in Rust every day. People use rust-bindgen and friends to automate some of it, but in many cases the definitions are checked in or tweaked by hand, because nobody wants their build to depend on wrangling every platform's C build system.

In Rust, what's intmax_t on x64 Linux?

pub type intmax_t = i64;

In Nim, what's long long on x64 Linux?

clonglong {.importc: "long long", nodecl.} = int64

A lot of code has given up on keeping C in the loop entirely and just hardcodes the definitions of the core types. After all, they're clearly part of the platform's ABI! Are they going to change the size of intmax_t? That would obviously be an ABI-breaking change!
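The hardcoding itself is trivial; the load-bearing part is the assumption that the platform will never change it. A sketch of the same move in Python, with a sanity check of that assumption:

```python
import ctypes

# Hardcoded, exactly like Rust's `pub type intmax_t = i64` on x64 Linux:
intmax_t = ctypes.c_int64

# The bet being made: intmax_t is, and forever remains, the same size
# as long long on this platform.
assert ctypes.sizeof(intmax_t) == ctypes.sizeof(ctypes.c_longlong) == 8
```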

So what is phantomderp working on?

We discussed why intmax_t can't change: if it went from long long (a 64-bit integer) to __int128_t (a 128-bit integer), some binary somewhere would end up using the wrong calling/return convention. But is there a way, if code opts in or something, to upgrade the function calls for newer applications while leaving older applications untouched? Let's write some code to test the idea that transparent aliases can help an ABI.

Aria's question in response: how do other programming languages handle that change? How do I specify which version of intmax_t I want to interoperate with? If a C header mentions intmax_t, which definition is it referring to?

The primary mechanism we have for talking about platforms with different ABIs is the target triple. Do you know what a target triple is? Do you know how much is covered by x86_64-unknown-linux-gnu alone? While it's ostensibly possible to compile against that target and get a binary that "just works" on all of those systems, Aria doesn't believe those binaries would keep working if intmax_t were ever compiled to be larger than int64_t.

Wouldn't any platform that tried to make this change become, in effect, a new x86_64-unknown-linux-gnu2 target triple? One on which anything compiled for plain x86_64-unknown-linux-gnu would still be allowed to run?

Change the signature without breaking the ABI

"So what, does C never get to improve again?" No! But also kind of yes, because the mechanisms it provides for change are badly designed.

Honestly, making ABI-compatible changes is an art form. A big part of that art is preparation: changes are much easier to make without breaking the ABI if you planned for them in advance.

As phantomderp's article points out, glibc (the gnu in x86_64-unknown-linux-gnu) understood this long ago, and uses mechanisms like symbol versioning to update signatures and APIs while keeping the old versions around for anything compiled against them.

So if you have int32_t my_rad_symbol(int32_t), you tell the compiler to export it as my_rad_symbol_v1; anyone who compiles against your header writes my_rad_symbol in their code but links against my_rad_symbol_v1.

Then, when you decide it should really take int64_t, you export int64_t my_rad_symbol(int64_t) as my_rad_symbol_v2 while keeping the old definition around as my_rad_symbol_v1. Anyone who compiles against the new headers happily uses the v2 symbol, while anyone who compiled against the old headers keeps using v1!
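The scheme can be sketched as a toy model (hypothetical symbol names; real glibc does this with linker version scripts, not dictionaries):

```python
# The "library" exports every version of the symbol, forever:
exports = {
    "my_rad_symbol_v1": lambda x: (x + 1) % 2**32,  # old: 32-bit behavior
    "my_rad_symbol_v2": lambda x: (x + 1) % 2**64,  # new: 64-bit behavior
}

# Each release of the "headers" aliases the public name to one version:
old_headers = {"my_rad_symbol": "my_rad_symbol_v1"}
new_headers = {"my_rad_symbol": "my_rad_symbol_v2"}

def link(headers, name):
    """Resolve the name a caller wrote to the symbol it was compiled against."""
    return exports[headers[name]]

# Binaries built against old headers keep getting v1; new ones get v2.
# Neither breaks, even though the signature "changed".
assert link(old_headers, "my_rad_symbol")(2**32) == 1
assert link(new_headers, "my_rad_symbol")(2**32) == 2**32 + 1
```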

But you still have a compatibility problem: anything compiled against the new headers can't link against old versions of the library, which simply don't contain the v2 symbols! So if you want the hot new signature, you have to accept incompatibility with older systems.

That's not a huge deal, but it makes platform vendors sad: nobody gets immediate access to the thing they spent so long building. You ship a shiny new feature and then wait years for it to become widespread and mature enough that people are willing to rely on it and drop support for older platforms (or are willing to implement dynamic checks and fallbacks for it).

If you really want people to be able to upgrade right away, you need forward compatibility: making old versions of things somehow work with new features they had no concept of.

Change the type without breaking the ABI

So in addition to changing the signature of a function, can you also change the type layout? Aria says it depends on how you expose the type.

One of the genuinely wonderful features of C is that it lets you distinguish a type with a known layout from a type with an unknown layout. If you only forward-declare a type in a C header, user code that interacts with it isn't "allowed" to know its layout and must always handle it opaquely behind a pointer.

So you can build an API like MyRadType* make_val() and use_val(MyRadType*), use the same symbol-versioning trick to expose make_val_v1 and use_val_v1 symbols, and, whenever you want to change the layout, bump the version on everything that interacts with the type. Internally you keep around MyRadTypeV1, MyRadTypeV2, and so on, plus typedefs to make sure users pick up the "current" type. This makes it possible to change a type's layout between versions.

But bad things can happen if multiple things are built on top of your library and then start passing your opaque types to each other:

  • lib1: exposes an API that takes a MyRadType* and calls use_val on it

  • lib2: calls make_val and passes the result to lib1

If lib1 and lib2 were compiled against different versions of your library, the value from make_val_v1 ends up being fed into use_val_v2! You have two options for dealing with this:

1. Say it's forbidden, blame those who do it anyway, and be sad

2. Design MyRadType in a forward-compatible way, so that mixing versions is fine

Common forward compatibility tricks include:

  • Reserve unused fields for future versions

  • All versions of MyRadType have a common prefix that allows you to "check" which version you are using

  • Include size fields so older versions can "skip over" the new parts
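The last two tricks can be made concrete with a toy serialized format (invented field names, modeled loosely on the MINIDUMP case study below): the writer records its sizes, and an older reader uses them to skip what it doesn't understand.

```python
import struct

# "V2" writer: header = (size_of_header, size_of_descriptor, count);
# descriptor v2 = (handle: u64, attrs: u32, extra: u32); `extra` is new in v2.
HEADER = struct.Struct("<III")
DESC_V2 = struct.Struct("<QII")
payload = (HEADER.pack(HEADER.size, DESC_V2.size, 2)
           + DESC_V2.pack(7, 1, 99)
           + DESC_V2.pack(8, 2, 100))

# A "V1" reader only knows (handle: u64, attrs: u32), yet it still reads
# v2 data correctly by using the sizes embedded in the header:
hdr_size, desc_size, count = HEADER.unpack_from(payload, 0)
handles = []
for i in range(count):
    offset = hdr_size + i * desc_size   # desc_size skips fields we don't know
    handle, attrs = struct.unpack_from("<QI", payload, offset)
    handles.append((handle, attrs))

assert handles == [(7, 1), (8, 2)]
```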

Case Study: MINIDUMP_HANDLE_DATA

Microsoft is a master of this kind of forward compatibility, even across architectures. An example Aria has been working with recently is MINIDUMP_HANDLE_DATA_STREAM in Minidumpapiset.h.

This API describes a versioned list of values. The list starts with this type:

typedef struct _MINIDUMP_HANDLE_DATA_STREAM {
    ULONG32 SizeOfHeader;
    ULONG32 SizeOfDescriptor;
    ULONG32 NumberOfDescriptors;
    ULONG32 Reserved;
} MINIDUMP_HANDLE_DATA_STREAM, *PMINIDUMP_HANDLE_DATA_STREAM;

Where:

  • SizeOfHeader is the size of MINIDUMP_HANDLE_DATA_STREAM itself. If more fields ever need to be added at the end, that's fine: older readers can use this value to detect the "version" of the header and skip any fields they don't know about.

  • SizeOfDescriptor is the size of each element in the array, so readers know which "version" of the element they have and can skip any fields they don't know about.

  • NumberOfDescriptors is the length of the array

  • Reserved is some extra space they decided to keep in the header anyway. (Minidumpapiset.h is very careful never to introduce padding anywhere, because padding bytes have unspecified values and this is a serialized binary file format. Aria suspects they added this field so that the struct's size is a multiple of 8, leaving no question of whether the array elements need padding after the header. That's taking compatibility seriously!)

And Microsoft has in fact had reason to use this versioning scheme: there are two versions of the array element defined:

typedef struct _MINIDUMP_HANDLE_DESCRIPTOR {
    ULONG64 Handle;
    RVA TypeNameRva;
    RVA ObjectNameRva;
    ULONG32 Attributes;
    ULONG32 GrantedAccess;
    ULONG32 HandleCount;
    ULONG32 PointerCount;
} MINIDUMP_HANDLE_DESCRIPTOR, *PMINIDUMP_HANDLE_DESCRIPTOR;




typedef struct _MINIDUMP_HANDLE_DESCRIPTOR_2 {
    ULONG64 Handle;
    RVA TypeNameRva;
    RVA ObjectNameRva;
    ULONG32 Attributes;
    ULONG32 GrantedAccess;
    ULONG32 HandleCount;
    ULONG32 PointerCount;
    RVA ObjectInfoRva;
    ULONG32 Reserved0;
} MINIDUMP_HANDLE_DESCRIPTOR_2, *PMINIDUMP_HANDLE_DESCRIPTOR_2;




// The latest MINIDUMP_HANDLE_DESCRIPTOR definition.
typedef MINIDUMP_HANDLE_DESCRIPTOR_2 MINIDUMP_HANDLE_DESCRIPTOR_N;
typedef MINIDUMP_HANDLE_DESCRIPTOR_N *PMINIDUMP_HANDLE_DESCRIPTOR_N;

The actual details of these structures are not very interesting, except:

  • They just changed it by adding fields at the end

  • There is a "latest version" type definition

  • They reserved what is probably padding, again (an RVA is a ULONG32)

It's an indestructible forward-compatible behemoth. Because they are so careful about padding, the type even has the same layout on 32-bit and 64-bit targets (which really matters, since you want a minidump processor on one architecture to be able to handle minidumps from every architecture).

Case Study: jmp_buf

Aria is less familiar with this case, but while researching historical glibc breakage she came across a great LWN article, "The glibc s390 ABI break," which she assumes is accurate.

It turns out that glibc has broken a type's ABI before, at least on s390, and by the article's account it was messy.

In particular, they changed the layout of jmp_buf, the saved-state type used by setjmp/longjmp. Now, they knew this was an ABI-breaking change, so they did the responsible thing and used symbol versioning.

But jmp_buf isn't an opaque type: other things store instances of it inline, including Perl's runtime. Needless to say, this relatively obscure type had seeped into enough binaries that the end result was essentially everything in Debian needing to be recompiled!

This post even discusses the possibility of upgrading the libc version to deal with this situation:

An SO-name bump in a mixed-ABI environment like Debian results in two libcs being loaded and competing for the same symbol namespace, with resolution (and therefore ABI selection) determined by ELF interposition and scoping rules. It's a real nightmare. It is probably a worse solution than just telling everyone to rebuild and move on.

Is it really possible to change intmax_t?

In Aria's view, not really. Like jmp_buf, it's not an opaque type: it's inlined into countless random structs, has countless language- and compiler-specific representations assumed for it, and is probably part of countless public interfaces that aren't under the control of libc, Linux, or even the distro maintainers.

Of course, libc itself could use the symbol-versioning tricks to keep its own API compatible with a new definition, but changing the size of a basic datatype like intmax_t is asking for chaos across a platform's wider ecosystem.

Aria would love to be proven wrong, but as far as she can tell, making such a change requires a new target triple and forbids running any binaries or libraries built for the old ABI on it. Someone could certainly do that work, but she doesn't envy any distro that tries.

Even then, there's the cautionary tale of int on x64: it's such a fundamental type, and had been 32 bits for so long, that countless applications had strange, invisible assumptions about it. That's why int is 32-bit on x64 even though it "should" be 64-bit: int had been 32-bit for so long that updating software to the new size was completely hopeless, despite this being a whole new architecture with new target triples.
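You can still observe this fossil on any mainstream 64-bit target today (assuming a 64-bit build of Python):

```python
import ctypes

# int stayed 32-bit on x86_64 even though registers and pointers doubled:
assert ctypes.sizeof(ctypes.c_int) == 4
assert ctypes.sizeof(ctypes.c_void_p) == 8  # pointers did move to 64-bit
```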

Once again, Aria wishes she were wrong, but sometimes mistakes are so deeply entrenched that they are simply irreversible. If C were just a programming language, it could be fixed. But it isn't: it's a protocol, a bad protocol, and we all have to speak it.

C conquered the world, and maybe, because of that, it never gets to have nice things again.


Source: blog.csdn.net/weixin_41055260/article/details/123700694