Socket-depth analysis and system calls

When an application calls socket() to request a service from the operating system, it necessarily issues a system call. The kernel uses the system call number passed by the caller to decide which routine to run; if the number corresponds to socket, the socket service routine is executed. Inside that routine, different handlers are selected depending on the service requested. When the handler finishes, execution returns from the service routine to the point where the call was initiated (int 0x80 on 32-bit x86), the CPU drops back to user mode, the call unwinds layer by layer, and socket() returns.
Three questions concern us here:
1. How does the application issue a system call, that is, how does it enter kernel mode?
2. How is the system call entry related to the service routines, and how does it jump to the one we need?
3. What is done at initialization so that the socket call can be served?

The application calls socket

We reuse the hello/hi chat program written earlier and debug the client to see how socket() is executed.

Preparation:

To debug inside libc, its debug symbols and source code need to be downloaded.
1. Install the glibc symbol table:
sudo apt-get install libc6-dbg
2. Debugging libc requires the corresponding source files. Since libc is open source, we can download the source so that gdb can show where execution is while debugging:
sudo apt-get source libc6-dev
Note the path the source is downloaded to; it is needed later during debugging.
3. Source file: client.c


#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#define MAX_len 1024
int sock_fd;
struct sockaddr_in add;
int main()
{
        int ret;
        char buf[MAX_len]={0};
        char buf_rec[MAX_len]={0};
        char buf_p[5]={"0"};
        memset(&add,0,sizeof(add));
        add.sin_family=AF_INET;
        add.sin_port=htons(8000);
        add.sin_addr.s_addr=inet_addr("127.0.0.1");

        if((sock_fd=socket(PF_INET,SOCK_STREAM,0))<=0)
        {

                perror("socket");
                return 1;
        }
        if((ret=connect(sock_fd,(struct sockaddr*)&add,sizeof(add)))<0)
        {
                perror("connect");
                return 1;
        }
        /* send the initial "0"; the length must come from buf_p, not the empty buf */
        if((ret=send(sock_fd,(void*)buf_p,strlen(buf_p),0))<0)
        {
                perror("send");
                return 1;
        }
        while (1)
        {
                /* bound the read so input cannot overflow buf */
                if(scanf("%1023s",buf)!=1)
                        break;
                if((ret=send(sock_fd,(void*)buf,sizeof(buf),0))<0)
                {
                        perror("send");
                        return 1;
                }
                /* leave room for the terminating NUL before printing */
                if((ret=recv(sock_fd,(void*)buf_rec,sizeof(buf_rec)-1,0))<0)
                {
                        perror("recv");
                        return 1;
                }
                buf_rec[ret]='\0';
                printf("%s\n",buf_rec);
        }
        return 0;
}

Start debugging:

1. Compile the file with debug information:
gcc -g -o client client.c
2. Start the gdb debugger: gdb client

(gdb) file client
Load new symbol table from "client"? (y or n) y
Reading symbols from client...done.
(gdb) b 23
Breakpoint 1 at 0x40091d: file client.c, line 23.
(gdb) c
The program is not being run.
(gdb) run 
Starting program: /home/netlab/netlab/systemcall/client 

Breakpoint 1, main () at client.c:23
23          if((sock_fd=socket(PF_INET,SOCK_STREAM,0))<=0)

The breakpoint is set at line 23, the line where socket() is first called. After the program is run it stops at line 23; we then use the step command to enter socket() and see how it is implemented internally.

(gdb) s
socket () at ../sysdeps/unix/syscall-template.S:84
84  ../sysdeps/unix/syscall-template.S: No such file or directory.

gdb reports that the file it wants to show does not exist, because it cannot find the libc source; this is why we downloaded the source earlier. Following the directory in the message, we use the command directory glibc-2.23/sysdeps/unix/ to add the libc source to gdb's search path, and then step again:

(gdb) directory glibc-2.23/sysdeps/unix/
Source directories searched: /home/netlab/netlab/systemcall/glibc-2.23/sysdeps/unix:$cdir:$cwd
(gdb) s
socket () at ../sysdeps/unix/syscall-template.S:85
85      ret
(gdb) l
80  
81  /* This is a "normal" system call stub: if there is an error,
82     it returns -1 and sets errno.  */
83  
84  T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
85      ret
86  T_PSEUDO_END (SYSCALL_SYMBOL)
87  
88  #endif
89  
(gdb) 

The program has stepped into syscall-template.S and is about to return. The two surrounding lines are macro invocations, so nothing can be seen directly. As the name suggests, this file is the template from which system call stubs are generated; it provides the conventional form of a system call.
So gdb cannot usefully step through the system call stub itself, and the same thing happens for both 32-bit and 64-bit builds. We therefore skip debugging here and analyze the libc source code instead.

The glibc implementation of socket:

First, macro definitions tie __socket to socket:

#define __socket socket
#define __recvmsg recvmsg
#define __bind bind
#define __sendto sendto

The library then implements __socket():

int __socket (int fd, int type, int domain)
{
    #ifdef __ASSUME_SOCKET_SYSCALL
      return INLINE_SYSCALL (socket, 3, fd, type, domain);
    #else
      return SOCKETCALL (socket, fd, type, domain);
    #endif
}
libc_hidden_def (__socket)
weak_alias (__socket, socket)

Inside __socket(), either SOCKETCALL or INLINE_SYSCALL is called; both eventually expand to INLINE_SYSCALL. INLINE_SYSCALL is closely tied to the architecture; the x86_64 implementation is as follows:

# define INLINE_SYSCALL(name, nr, args...) \
  ({                                          \
    unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args);        \
    if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, )))        \
      {                                       \
    __set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, ));           \
    resultvar = (unsigned long int) -1;                   \
      }                                       \
    (long int) resultvar; })
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...)            \
    internal_syscall##nr (SYS_ify (name), err, args)

Since there are three arguments here, INTERNAL_SYSCALL expands to internal_syscall3:

#define internal_syscall3(number, err, arg1, arg2, arg3)        \
({                                  \
    unsigned long int resultvar;                    \
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);              \
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);              \
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);              \
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;           \
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;           \
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;           \
    asm volatile (                          \
    "syscall\n\t"                           \
    : "=a" (resultvar)                          \
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3)         \
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);            \
    (long int) resultvar;                       \
})

Inline assembly is used here: the arguments are placed in rdi, rsi and rdx, the system call number is placed in rax (the "0" constraint ties it to the same register as the "=a" output), and the syscall instruction then traps into the kernel, which dispatches to the corresponding handler. This is where the application side ends.

Kernel interrupt response:

To see how the kernel responds to the application's socket request, we debug the kernel with qemu + gdb and observe how it handles the request.
1. Run menuos in debug mode, making sure the client program has been added so that it can be debugged.
[picture 1]
2. Choose the breakpoint. To observe how the kernel serves socket, the breakpoint should obviously sit on the call path of the handler. Breaking on every system call entry is impractical, since it would be hard to pick out the call we want. The best place is the entry of the socket system call handler, which only a socket request can trigger. How do we find that entry?
The kernel lists all x86 system call entries under arch/x86/entry/syscalls. For backward compatibility there are separate 32-bit and 64-bit tables:
32-bit:

99  i386    statfs          sys_statfs          __ia32_compat_sys_statfs
100 i386    fstatfs         sys_fstatfs         __ia32_compat_sys_fstatfs
101 i386    ioperm          sys_ioperm          __ia32_sys_ioperm
102 i386    socketcall      sys_socketcall          __ia32_compat_sys_socketcall
103 i386    syslog          sys_syslog          __ia32_sys_syslog
104 i386    setitimer       sys_setitimer           __ia32_compat_sys_setitimer
105 i386    getitimer       sys_getitimer           __ia32_compat_sys_getitimer

64-bit:

......
40  common  sendfile        __x64_sys_sendfile64
41  common  socket          __x64_sys_socket
42  common  connect         __x64_sys_connect
43  common  accept          __x64_sys_accept
44  common  sendto          __x64_sys_sendto

......

A 32-bit program enters through the 32-bit table and a 64-bit program through the 64-bit table. We look at the entry corresponding to each.
First the 32-bit case: set a breakpoint at __ia32_compat_sys_socketcall, run menuos and start the client program; gdb stops on entry to __ia32_compat_sys_socketcall:

(gdb) b __ia32_compat_sys_socketcall
Breakpoint 3 at 0xffffffff818474b0: file net/compat.c, line 718.
(gdb) c
Continuing.

Breakpoint 3, __ia32_compat_sys_socketcall (regs=0xffffc900001eff58)
    at net/compat.c:718
718 COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)

[Photo 2]
Look at this function:

718 COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
719 {
720     u32 a[AUDITSC_ARGS];
721     unsigned int len;
722     u32 a0, a1;
(gdb) 
723     int ret;
724 
725     if (call < SYS_SOCKET || call > SYS_SENDMMSG)
726         return -EINVAL;
727     len = nas[call];
728     if (len > sizeof(a))
729         return -EINVAL;
730 
731     if (copy_from_user(a, args, len))
732         return -EFAULT;
733 
734     ret = audit_socketcall_compat(len / sizeof(a[0]), a);
735     if (ret)
736         return ret;
737 
738     a0 = a[0];
739     a1 = a[1];
740 
741     switch (call) {
742     case SYS_SOCKET:
(gdb) 
743         ret = __sys_socket(a0, a1, a[2]);
744         break;
745     case SYS_BIND:
746         ret = __sys_bind(a0, compat_ptr(a1), a[2]);
747         break;
748     case SYS_CONNECT:
749         ret = __sys_connect(a0, compat_ptr(a1), a[2]);
750         break;
751     case SYS_LISTEN:
752         ret = __sys_listen(a0, a1);

Clearly, this handler is the common entry point for the whole family of socket operations. It first copies the system call arguments from user space and then, depending on the requested call, dispatches to the matching handler. We continue by tracing the function it dispatches to:

(gdb) b __sys_socket
Breakpoint 4 at 0xffffffff817eea40: file net/socket.c, line 1498.
(gdb) c
Continuing.

Breakpoint 4, __sys_socket (family=2, type=1, protocol=0) at net/socket.c:1498
1498    {
(gdb) l
1493        return __sock_create(net, family, type, protocol, res, 1);
1494    }
1495    EXPORT_SYMBOL(sock_create_kern);
1496    
1497    int __sys_socket(int family, int type, int protocol)
1498    {
1499        int retval;
1500        struct socket *sock;
1501        int flags;
1502    
(gdb) 
1503        /* Check the SOCK_* constants for consistency.  */
1504        BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
1505        BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
1506        BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
1507        BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
1508    
1509        flags = type & ~SOCK_TYPE_MASK;
1510        if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
1511            return -EINVAL;
1512        type &= SOCK_TYPE_MASK;
(gdb) 
1513    
1514        if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
1515            flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
1516    
1517        retval = sock_create(family, type, protocol, &sock);
1518        if (retval < 0)
1519            return retval;
1520    
1521        return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
1522    }

__sys_socket() only validates its arguments and then calls sock_create().

(gdb) b __sock_create
Breakpoint 7 at 0xffffffff817ec9a0: file net/socket.c, line 1363.
(gdb) c
Continuing.

Breakpoint 5, __sys_socket (family=2, type=1, protocol=0) at net/socket.c:1517
1517        retval = sock_create(family, type, protocol, &sock);
(gdb) c
Continuing.

Breakpoint 7, __sock_create (net=0xffffffff824e94c0 <init_net>, family=2, type=1, 
    protocol=0, res=0xffffc90000047e98, kern=0) at net/socket.c:1363
1363        if (family < 0 || family >= NPROTO)
(gdb) l
1358        const struct net_proto_family *pf;
1359    
1360        /*
1361         *      Check protocol is in range
1362         */
1363        if (family < 0 || family >= NPROTO)
1364            return -EAFNOSUPPORT;
1365        if (type < 0 || type >= SOCK_MAX)
1366            return -EINVAL;
1367    
(gdb) l
1368        /* Compatibility.
1369    
1370           This uglymoron is moved from INET layer to here to avoid
1371           deadlock in module load.
1372         */
1373        if (family == PF_INET && type == SOCK_PACKET) {
1374            pr_info_once("%s uses obsolete (PF_INET,SOCK_PACKET)\n",
1375                     current->comm);
1376            family = PF_PACKET;
1377        }
(gdb) l
1378    
1379        err = security_socket_create(family, type, protocol, kern);
1380        if (err)
1381            return err;
1382    
1383        /*
1384         *  Allocate the socket and allow the family to set things up. if
1385         *  the protocol is 0, the family is instructed to select an appropriate
1386         *  default.
1387         */
(gdb) l
1388        sock = sock_alloc();
1389        if (!sock) {
1390            net_warn_ratelimited("socket: no more sockets\n");
1391            return -ENFILE; /* Not exactly a match, but its the
1392                       closest posix thing */
1393        }
1394    
1395        sock->type = type;
1396    
1397    #ifdef CONFIG_MODULES
(gdb) l
1398        /* Attempt to load a protocol module if the find failed.
1399         *
1400         * 12/09/1996 Marcin: But! this makes REALLY only sense, if the user
1401         * requested real, full-featured networking support upon configuration.
1402         * Otherwise module support will break!
1403         */
1404        if (rcu_access_pointer(net_families[family]) == NULL)
1405            request_module("net-pf-%d", family);
1406    #endif
1407    
(gdb) l
1408        rcu_read_lock();
1409        pf = rcu_dereference(net_families[family]);
1410        err = -EAFNOSUPPORT;
1411        if (!pf)
1412            goto out_release;
1413    
1414        /*
1415         * We will call the ->create function, that possibly is in a loadable
1416         * module, so we have to bump that loadable module refcnt first.
1417         */
(gdb) l
1418        if (!try_module_get(pf->owner))
1419            goto out_release;
1420    
1421        /* Now protected by module ref count */
1422        rcu_read_unlock();
1423    
1424        err = pf->create(net, sock, protocol, kern);
1425        if (err < 0)
1426            goto out_module_put;
1427    
(gdb) l
1428        /*
1429         * Now to bump the refcnt of the [loadable] module that owns this
1430         * socket at sock_release time we decrement its refcnt.
1431         */
1432        if (!try_module_get(sock->ops->owner))
1433            goto out_module_busy;
1434    
1435        /*
1436         * Now that we're done with the ->create function, the [loadable]
1437         * module can have its refcnt decremented
(gdb) l
1438         */
1439        module_put(pf->owner);
1440        err = security_socket_post_create(sock, family, type, protocol, kern);
1441        if (err)
1442            goto out_sock_release;
1443        *res = sock;
1444    
1445        return 0;
1446    
1447    out_module_busy:
(gdb) l
1448        err = -EAFNOSUPPORT;
1449    out_module_put:
1450        sock->ops = NULL;
1451        module_put(pf->owner);
1452    out_sock_release:
1453        sock_release(sock);
1454        return err;
1455    
1456    out_release:
1457        rcu_read_unlock();
(gdb) l
1458        goto out_sock_release;
1459    }
1460    EXPORT_SYMBOL(__sock_create);
1461    
1462    /**
1463     *  sock_create - creates a socket
1464     *  @family: protocol family (AF_INET, ...)
1465     *  @type: communication type (SOCK_STREAM, ...)
1466     *  @protocol: protocol (0, ...)
1467     *  @res: new socket
(gdb) l
1468     *
1469     *  A wrapper around __sock_create().
1470     *  Returns 0 or an error. This function internally uses GFP_KERNEL.
1471     */
1472    
1473    int sock_create(int family, int type, int protocol, struct socket **res)
1474    {
1475        return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
1476    }

A breakpoint set on sock_create was never hit. The kernel source shows that sock_create is only a thin wrapper around __sock_create, so the call evidently ends up there and the breakpoint is set on __sock_create instead. Continuing with __sock_create:
err = security_socket_create(family, type, protocol, kern); first checks whether the request is permitted. Then comes the most important call:
sock = sock_alloc();
To understand this line we need to know that sock is a variable of type struct socket, the core structure of the socket interface. It is defined as follows:

struct socket {
    socket_state        state;

    short           type;

    unsigned long       flags;

    struct socket_wq    *wq;

    struct file     *file;
    struct sock     *sk;
    const struct proto_ops  *ops;
};

state is the current state of the socket, for example whether it is connected. type is the socket type, such as SOCK_STREAM for TCP-style service. flags holds flag bits such as SOCK_ASYNC_NOSPACE. wq is the wait queue, since several requests may be waiting on one socket. file points to the associated file: a socket can also be treated as a file, so it carries this pointer to stay compatible with file operations. sk is very important and very large; it records the protocol-specific state. This split keeps struct socket largely protocol-independent and generic. ops points to the table of basic socket operations; this is a common Linux idiom, in which the operations on an object are collected as function pointers in a separate structure:

struct proto_ops {
    int     family;
    struct module   *owner;
    int     (*release)   (struct socket *sock);
    int     (*bind)      (struct socket *sock,
                      struct sockaddr *myaddr,
                      int sockaddr_len);
    int     (*connect)   (struct socket *sock,
                      struct sockaddr *vaddr,
                      int sockaddr_len, int flags);
    int     (*socketpair)(struct socket *sock1,
                      struct socket *sock2);
    int     (*accept)    (struct socket *sock,
                      struct socket *newsock, int flags, bool kern);
    int     (*getname)   (struct socket *sock,
                      struct sockaddr *addr,
                      int peer);
    __poll_t    (*poll)      (struct file *file, struct socket *sock,
                      struct poll_table_struct *wait);
    int     (*ioctl)     (struct socket *sock, unsigned int cmd,
                      unsigned long arg);
#ifdef CONFIG_COMPAT
    int     (*compat_ioctl) (struct socket *sock, unsigned int cmd,
                      unsigned long arg);
#endif
    int     (*listen)    (struct socket *sock, int len);
    int     (*shutdown)  (struct socket *sock, int flags);
    int     (*setsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, unsigned int optlen);
    int     (*getsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, int __user *optlen);
#ifdef CONFIG_COMPAT
    int     (*compat_setsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, unsigned int optlen);
    int     (*compat_getsockopt)(struct socket *sock, int level,
                      int optname, char __user *optval, int __user *optlen);
#endif
    int     (*sendmsg)   (struct socket *sock, struct msghdr *m,
                      size_t total_len);
    /* Notes for implementing recvmsg:
     * ===============================
     * msg->msg_namelen should get updated by the recvmsg handlers
     * iff msg_name != NULL. It is by default 0 to prevent
     * returning uninitialized memory to user space.  The recvfrom
     * handlers can assume that msg.msg_name is either NULL or has
     * a minimum size of sizeof(struct sockaddr_storage).
     */
    int     (*recvmsg)   (struct socket *sock, struct msghdr *m,
                      size_t total_len, int flags);
    int     (*mmap)      (struct file *file, struct socket *sock,
                      struct vm_area_struct * vma);
    ssize_t     (*sendpage)  (struct socket *sock, struct page *page,
                      int offset, size_t size, int flags);
    ssize_t     (*splice_read)(struct socket *sock,  loff_t *ppos,
                       struct pipe_inode_info *pipe, size_t len, unsigned int flags);
    int     (*set_peek_off)(struct sock *sk, int val);
    int     (*peek_len)(struct socket *sock);

    /* The following functions are called internally by kernel with
     * sock lock already held.
     */
    int     (*read_sock)(struct sock *sk, read_descriptor_t *desc,
                     sk_read_actor_t recv_actor);
    int     (*sendpage_locked)(struct sock *sk, struct page *page,
                       int offset, size_t size, int flags);
    int     (*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
                      size_t size);
    int     (*set_rcvlowat)(struct sock *sk, int val);
};

Back to the debugging session: sock_alloc() allocates a socket structure. How is it implemented internally? We set another breakpoint and observe:

(gdb) b sock_alloc
Breakpoint 8 at 0xffffffff817ec230: file net/socket.c, line 569.
(gdb) c
Continuing.

Breakpoint 8, sock_alloc () at net/socket.c:569
569     inode = new_inode_pseudo(sock_mnt->mnt_sb);
(gdb) l
564 struct socket *sock_alloc(void)
565 {
566     struct inode *inode;
567     struct socket *sock;
568 
569     inode = new_inode_pseudo(sock_mnt->mnt_sb);
570     if (!inode)
571         return NULL;
572 
573     sock = SOCKET_I(inode);
(gdb) l
574 
575     inode->i_ino = get_next_ino();
576     inode->i_mode = S_IFSOCK | S_IRWXUGO;
577     inode->i_uid = current_fsuid();
578     inode->i_gid = current_fsgid();
579     inode->i_op = &sockfs_inode_ops;
580 
581     return sock;
582 }

sock_alloc creates two structures: an inode, and the socket structure that is returned. The inode is then filled in, and the socket is obtained from SOCKET_I(). From this it looks as if SOCKET_I is where the socket is created, but passing in an inode is a little hard to understand. We would like to step further, but SOCKET_I is an inline function, so gdb cannot step into it and we have to read the source:

static inline struct socket *SOCKET_I(struct inode *inode)
{
    return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

container_of() is a classic kernel macro. container_of(A, B, C) takes a pointer A to the member named C inside a structure of type B and returns a pointer to that enclosing structure. Here the inode is the vfs_inode member of a struct socket_alloc, so the macro recovers the socket_alloc, and its socket member is then returned: inode -> socket_alloc -> socket. struct socket_alloc is defined as follows:

struct socket_alloc {
    struct socket socket;
    struct inode vfs_inode;
};

How is the inode created? sock_alloc calls new_inode_pseudo, which is implemented as follows:

struct inode *new_inode_pseudo(struct super_block *sb)
{
    struct inode *inode = alloc_inode(sb);

    if (inode) {
        spin_lock(&inode->i_lock);
        inode->i_state = 0;
        spin_unlock(&inode->i_lock);
        INIT_LIST_HEAD(&inode->i_sb_list);
    }
    return inode;
}

This in turn calls alloc_inode:

static struct inode *alloc_inode(struct super_block *sb)
{
    struct inode *inode;

    if (sb->s_op->alloc_inode)
        inode = sb->s_op->alloc_inode(sb);
    else
        inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);

    if (!inode)
        return NULL;

    if (unlikely(inode_init_always(sb, inode))) {
        if (inode->i_sb->s_op->destroy_inode)
            inode->i_sb->s_op->destroy_inode(inode);
        else
            kmem_cache_free(inode_cachep, inode);
        return NULL;
    }

    return inode;
}

Now things become clearer. The inode is obtained by calling the superblock's s_op->alloc_inode, which brings in the file system layer. In Linux a superblock represents a file system, and each file system supplies its own methods for creating and deleting inodes. Sockets are evidently implemented as a file system too, so the function invoked here is the socket file system's own allocation function. It does not create a bare inode; it creates a socket_alloc structure, which contains both a struct inode and a struct socket, and finally the socket is initialized and returned.

One detail remains. The return value of the socket system call is a socket descriptor (a file descriptor), yet no file descriptor has appeared so far. It comes from the sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK)); call in __sys_socket. As mentioned, a socket is treated as part of a file system, and struct socket contains a struct file pointer; sock_map_fd is the function that binds the file descriptor, the struct socket and the struct file together. It is implemented as follows:

static int sock_map_fd(struct socket *sock, int flags)
{
    struct file *newfile;
    int fd = get_unused_fd_flags(flags);
    if (unlikely(fd < 0)) {
        sock_release(sock);
        return fd;
    }

    newfile = sock_alloc_file(sock, flags, NULL);
    if (likely(!IS_ERR(newfile))) {
        fd_install(fd, newfile);
        return fd;
    }

    put_unused_fd(fd);
    return PTR_ERR(newfile);
}

Here get_unused_fd_flags(flags) reserves the file descriptor fd that will be returned, sock_alloc_file(sock, flags, NULL); creates the struct file object, and fd_install(fd, newfile) binds the two. The struct file is created as follows:

struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
{
    struct file *file;

    if (!dname)
        dname = sock->sk ? sock->sk->sk_prot_creator->name : "";

    file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
                O_RDWR | (flags & O_NONBLOCK),
                &socket_file_ops);
    if (IS_ERR(file)) {
        sock_release(sock);
        return file;
    }

    sock->file = file;
    file->private_data = sock;
    return file;
}

This in turn calls alloc_file_pseudo. Note the key structure socket_file_ops, which defines the basic file operations of a socket; this step binds those file operations to the new file. It is defined as follows:

static const struct file_operations socket_file_ops = {
    .owner =    THIS_MODULE,
    .llseek =   no_llseek,
    .read_iter =    sock_read_iter,
    .write_iter =   sock_write_iter,
    .poll =     sock_poll,
    .unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
    .compat_ioctl = compat_sock_ioctl,
#endif
    .mmap =     sock_mmap,
    .release =  sock_close,
    .fasync =   sock_fasync,
    .sendpage = sock_sendpage,
    .splice_write = generic_splice_sendpage,
    .splice_read =  sock_splice_read,
};

This answers the first two questions. Because the relationships are complicated, here is the call chain once more:
__ia32_compat_sys_socketcall -> __sys_socket -> sock_create -> __sock_create -> sock_alloc -> new_inode_pseudo -> alloc_inode -> sb->s_op->alloc_inode, followed by sock_map_fd -> sock_alloc_file -> alloc_file_pseudo.
The core of this process is the creation of a socket_alloc structure, because it contains both the socket and the inode.
One final question remains: socket initialization. When struct socket was introduced above, we saw that its proto_ops member holds function pointers for the different services, but we have not yet analyzed when and how those pointers are assigned. That is the question examined next.

socket initialization

Likewise, we use gdb to observe the Linux kernel boot process and see in what order and how sockets are initialized.
Restart qemu, load menuos, and attach gdb to the kernel.
First, set a breakpoint at start_kernel and look for network initialization code:

(gdb) target remote: 1234
Remote debugging using : 1234
0x0000000000000000 in fixed_percpu_data ()
(gdb) b start_kernel 
Breakpoint 1 at 0xffffffff82997b05: file init/main.c, line 552.
(gdb) c
Continuing.

Breakpoint 1, start_kernel () at init/main.c:552
warning: Source file is more recent than executable.
552 asmlinkage __visible void __init start_kernel(void)
553 {
554     char *command_line;
555     char *after_dashes;
556 
(gdb) l
557     set_task_stack_end_magic(&init_task);
558     smp_setup_processor_id();
559     debug_objects_early_init();
560 
561     cgroup_init_early();
562 
563     local_irq_disable();
564     early_boot_irqs_disabled = true;
565 
566     /*
(gdb) l
567      * Interrupts are still disabled. Do necessary setups, then
568      * enable them.
569      */
570     boot_cpu_init();
571     page_address_init();
572     pr_notice("%s", linux_banner);
573     setup_arch(&command_line);
574     mm_init_cpumask(&init_mm);
575     setup_command_line(command_line);
576     setup_nr_cpu_ids();
 

We see no network initialization code here. Note the call to arch_call_rest_init(); it should carry out the rest of the initialization beyond what is listed here, so we set a breakpoint in arch_call_rest_init():

Breakpoint 2, arch_call_rest_init () at init/main.c:548
546 
547 void __init __weak arch_call_rest_init(void)
548 {
549     rest_init();
550 }
551 
552 asmlinkage __visible void __init start_kernel(void)
(gdb) b rest_init
Breakpoint 4, rest_init () at init/main.c:411
411 
(gdb) l
406 
407 noinline void __ref rest_init(void)
408 {
409     struct task_struct *tsk;
410     int pid;
411 
412     rcu_scheduler_starting();
413     /*
414      * We need to spawn init first so that it obtains pid 1, however
415      * the init task will end up wanting to create kthreads, which, if
(gdb) 

arch_call_rest_init() contains a single line, the call to rest_init(). Continuing to trace, we get the complete code of rest_init:

noinline void __ref rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    pid = kernel_thread(kernel_init, NULL, CLONE_FS);
    /*
     * Pin init on the boot CPU. Task migration is not properly working
     * until sched_init_smp() has been run. It will set the allowed
     * CPUs for init to the non isolated CPUs.
     */
    rcu_read_lock();
    tsk = find_task_by_pid_ns(pid, &init_pid_ns);
    set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
    rcu_read_unlock();

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();

    /*
     * Enable might_sleep() and smp_processor_id() checks.
     * They cannot be enabled earlier because with CONFIG_PREEMPT=y
     * kernel_thread() would trigger might_sleep() splats. With
     * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
     * already, but it's stuck on the kthreadd_done completion.
     */
    system_state = SYSTEM_SCHEDULING;

    complete(&kthreadd_done);

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}

rest_init creates two kernel threads, kernel_init and kthreadd, and the actual initialization is done by them, so we set breakpoints on these two functions. kernel_init looks like this:

static int __ref kernel_init(void *unused)
{
    int ret;

    kernel_init_freeable();
    /* need to finish all async __init code before freeing the memory */
    async_synchronize_full();
    ftrace_free_init_mem();
    free_initmem();
    mark_readonly();

    /*
     * Kernel mappings are now finalized - update the userspace page-table
     * to finalize PTI.
     */
    pti_finalize();

    system_state = SYSTEM_RUNNING;
    numa_default_policy();

    rcu_end_inkernel_boot();

    if (ramdisk_execute_command) {
        ret = run_init_process(ramdisk_execute_command);
        if (!ret)
            return 0;
        pr_err("Failed to execute %s (error %d)\n",
               ramdisk_execute_command, ret);
    }

    /*
     * We try each of these until one succeeds.
     *
     * The Bourne shell can be used instead of init if we are
     * trying to recover a really broken machine.
     */
    if (execute_command) {
        ret = run_init_process(execute_command);
        if (!ret)
            return 0;
        panic("Requested init %s failed (error %d).",
              execute_command, ret);
    }
    if (!try_to_run_init_process("/sbin/init") ||
        !try_to_run_init_process("/etc/init") ||
        !try_to_run_init_process("/bin/init") ||
        !try_to_run_init_process("/bin/sh"))
        return 0;

    panic("No working init found.  Try passing init= option to kernel. "
          "See Linux Documentation/admin-guide/init.rst for guidance.");
}

kernel_init runs first. It decides which init program should be executed and finally jumps to it, but before loading the user-space init program it completes some further initialization through kernel_init_freeable(), so we step into kernel_init_freeable().

Inside kernel_init_freeable(), none of the functions other than do_basic_setup perform the initialization we are looking for. We set a breakpoint at do_basic_setup; however, execution first arrives at kthreadd:

```
int kthreadd(void *unused)
{
    struct task_struct *tsk = current;

    /* Setup a clean context for our children to inherit. */
    set_task_comm(tsk, "kthreadd");
    ignore_signals(tsk);
    set_cpus_allowed_ptr(tsk, cpu_all_mask);
    set_mems_allowed(node_states[N_MEMORY]);

    current->flags |= PF_NOFREEZE;
    cgroup_init_kthreadd();

    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (list_empty(&kthread_create_list))
            schedule();
        __set_current_state(TASK_RUNNING);

        spin_lock(&kthread_create_lock);
        while (!list_empty(&kthread_create_list)) {
            struct kthread_create_info *create;

            create = list_entry(kthread_create_list.next,
                                struct kthread_create_info, list);
            list_del_init(&create->list);
            spin_unlock(&kthread_create_lock);

            create_kthread(create);

            spin_lock(&kthread_create_lock);
        }
        spin_unlock(&kthread_create_lock);
    }

    return 0;
}
```

kthreadd is responsible for creating a series of threads according to kthread_create_list, which clearly has nothing to do with network initialization, so we go back to observing do_basic_setup:

static void __init do_basic_setup(void)
{
    cpuset_init_smp();
    shmem_init();
    driver_init();
    init_irq_proc();
    do_ctors();
    usermodehelper_enable();
    do_initcalls();
}

static void __init do_initcalls(void)
{
    int level;

    for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
        do_initcall_level(level);
}

do_initcalls iterates over initcall_levels and calls do_initcall_level(level) for each level, so we first need to look at what do_initcall_level does:

static void __init do_initcall_level(int level)
{
    initcall_entry_t *fn;

    strcpy(initcall_command_line, saved_command_line);
    parse_args(initcall_level_names[level],
           initcall_command_line, __start___param,
           __stop___param - __start___param,
           level, level,
           NULL, &repair_env_string);

    trace_initcall_level(initcall_level_names[level]);
    for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
        do_one_initcall(initcall_from_entry(fn));
}
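
The two loops above walk a table of function pointers one level at a time. As a rough user-space illustration (not kernel code), the sketch below mimics that layout: a flat array of routines plus an array of pointers marking where each level begins. All names in it (`initcall_fn`, `all_calls`, `level_start`, the `*_setup` routines) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an init routine: returns 0 on success. */
typedef int (*initcall_fn)(void);

#define MAX_LOG 8
const char *call_log[MAX_LOG];  /* names of the routines, in call order */
int calls_made;                 /* how many routines have run so far */

static int record(const char *name)
{
    if (calls_made < MAX_LOG)
        call_log[calls_made] = name;
    calls_made++;
    printf("initcall: %s\n", name);
    return 0;
}

static int core_setup(void)   { return record("core_setup"); }
static int fs_setup(void)     { return record("fs_setup"); }
static int net_setup(void)    { return record("net_setup"); }
static int device_setup(void) { return record("device_setup"); }

/* Every routine in one flat array, already ordered by level. */
static initcall_fn all_calls[] = {
    core_setup, fs_setup, net_setup, device_setup
};

/* level_start[i] is the first routine of level i; the last entry is an end marker. */
static initcall_fn *level_start[] = {
    &all_calls[0],  /* level 0: core_setup */
    &all_calls[1],  /* level 1: fs_setup, net_setup */
    &all_calls[3],  /* level 2: device_setup */
    &all_calls[4],  /* end marker, one past the last routine */
};

#define NUM_LEVELS (sizeof(level_start) / sizeof(level_start[0]) - 1)

/* Run every routine between the start of this level and the start of the next. */
void run_level(size_t level)
{
    initcall_fn *fn;

    assert(level < NUM_LEVELS);
    for (fn = level_start[level]; fn < level_start[level + 1]; fn++)
        (*fn)();
}

/* Run all levels in order and return the number of routines called. */
int run_all_levels(void)
{
    size_t level;

    for (level = 0; level < NUM_LEVELS; level++)
        run_level(level);
    return calls_made;
}
```

Calling `run_all_levels()` once runs the four routines in level order. The point of the end marker is the same as the `- 1` in the kernel loop: each level is delimited by the start of the next one.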

initcall_levels is a table that holds every initialization routine that has been registered. initcall_from_entry(fn) returns the address of the routine, and do_one_initcall then calls it. How a routine ends up in this table can be seen from the network initialization routine inet_init:

(Much unrelated code has been omitted.)

static int __init inet_init(void)
{
    struct inet_protosw *q;
    struct list_head *r;
    int rc = -EINVAL;

    sock_skb_cb_check_size(sizeof(struct inet_skb_parm));

    rc = proto_register(&tcp_prot, 1);
    if (rc)
        goto out;

    rc = proto_register(&udp_prot, 1);
    if (rc)
        goto out_unregister_tcp_proto;

    rc = proto_register(&raw_prot, 1);
    if (rc)
        goto out_unregister_udp_proto;

    rc = proto_register(&ping_prot, 1);
    if (rc)
        goto out_unregister_raw_proto;

    (void)sock_register(&inet_family_ops);
    ip_static_sysctl_init();


    if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
        pr_crit("%s: Cannot add ICMP protocol\n", __func__);
    if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
        pr_crit("%s: Cannot add UDP protocol\n", __func__);
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);

    /* Register the socket-side information for inet_create. */
    for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
        INIT_LIST_HEAD(r);

    for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
        inet_register_protosw(q);

    arp_init();

    ip_init();
    tcp_init();
    udp_init();
    udplite4_register();
    raw_init();
    ping_init();


    ipv4_proc_init();

    ipfrag_init();

    dev_add_pack(&ip_packet_type);

    ip_tunnel_core_init();

    rc = 0;
out:
    return rc;
out_unregister_raw_proto:
    proto_unregister(&raw_prot);
out_unregister_udp_proto:
    proto_unregister(&udp_prot);
out_unregister_tcp_proto:
    proto_unregister(&tcp_prot);
    goto out;
}
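
The inet_add_protocol() calls above attach one handler to each IP protocol number. A minimal user-space sketch of that idea follows: a table indexed by protocol number, where adding a handler fails if the slot is already taken. The `demo_` names and the toy handlers are assumptions made for illustration, and the real kernel table and locking differ; only the protocol numbers (ICMP 1, TCP 6, UDP 17) are the standard values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DEMO_IPPROTO_ICMP 1
#define DEMO_IPPROTO_TCP  6
#define DEMO_IPPROTO_UDP  17
#define DEMO_MAX_PROTOS   256

/* Hypothetical per-protocol descriptor: a name and a receive handler. */
struct demo_net_protocol {
    const char *name;
    int (*handler)(const char *payload);
};

/* One slot per protocol number; NULL means no handler registered. */
static const struct demo_net_protocol *demo_protos[DEMO_MAX_PROTOS];

/* Toy handlers: they just report the payload length. */
static int demo_tcp_rcv(const char *payload) { return (int)strlen(payload); }
static int demo_udp_rcv(const char *payload) { return (int)strlen(payload); }

static const struct demo_net_protocol demo_tcp = { "tcp", demo_tcp_rcv };
static const struct demo_net_protocol demo_udp = { "udp", demo_udp_rcv };

/* Register a handler; returns 0 on success, -1 if the slot is occupied. */
int demo_add_protocol(const struct demo_net_protocol *prot, unsigned char num)
{
    if (demo_protos[num] != NULL)
        return -1;
    demo_protos[num] = prot;
    return 0;
}

/* Look up the handler for a protocol number and pass the payload to it. */
int demo_deliver(unsigned char num, const char *payload)
{
    const struct demo_net_protocol *prot = demo_protos[num];

    if (prot == NULL) {
        fprintf(stderr, "no handler for protocol %u\n", (unsigned)num);
        return -1;
    }
    return prot->handler(payload);
}
```

With this layout, registering `demo_tcp` under `DEMO_IPPROTO_TCP` succeeds once, a second registration on the same number is rejected, and `demo_deliver()` only reaches protocols that were registered beforehand.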

fs_initcall(inet_init);

So fs_initcall(inet_init) registers inet_init in initcall_levels, and it is eventually called from do_initcalls. The best way to verify this is to reboot, set a breakpoint at inet_init, and see whether the function gets called.

(gdb) b inet_init
Breakpoint 1 at 0xffffffff829f49fe: file net/ipv4/af_inet.c, line 1906.
(gdb) c
Continuing.

Breakpoint 1, inet_init () at net/ipv4/af_inet.c:1906
1906    {
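
How a one-line macro can place a function into a table that is walked at boot can be imitated in user space. The sketch below assumes GCC or Clang on an ELF platform such as Linux, where the linker provides `__start_<section>` and `__stop_<section>` symbols for sections whose names are valid C identifiers. The macro `demo_initcall`, the section name `demo_initcalls`, and both registered functions are hypothetical; the kernel's own macros and linker script differ in detail.

```c
#include <assert.h>
#include <stdio.h>

typedef int (*initcall_fn)(void);

/*
 * Store a pointer to fn in the "demo_initcalls" section. The "used"
 * attribute keeps the otherwise unreferenced variable from being dropped.
 */
#define demo_initcall(fn) \
    static initcall_fn demo_initcall_##fn \
        __attribute__((used, section("demo_initcalls"))) = fn

/* Provided by the linker: the bounds of the "demo_initcalls" section. */
extern initcall_fn __start_demo_initcalls[];
extern initcall_fn __stop_demo_initcalls[];

int inet_ready;  /* set by the demo init routine below */

static int demo_inet_init(void)
{
    inet_ready = 1;
    printf("demo_inet_init called\n");
    return 0;
}
demo_initcall(demo_inet_init);

static int demo_other_init(void)
{
    printf("demo_other_init called\n");
    return 0;
}
demo_initcall(demo_other_init);

/* Call every registered routine and return how many were run. */
int run_registered_initcalls(void)
{
    int count = 0;
    initcall_fn *fn;

    for (fn = __start_demo_initcalls; fn < __stop_demo_initcalls; fn++) {
        int rc = (*fn)();

        if (rc != 0)
            fprintf(stderr, "initcall failed with %d\n", rc);
        count++;
    }
    return count;
}
```

Nothing in `run_registered_initcalls()` names the registered functions; they are found only through the section bounds, which is why adding a new routine needs nothing more than one `demo_initcall(...)` line next to its definition.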

Now look at the inet_init code. It covers almost all the network protocols: TCP, UDP, ICMP, and so on. The process is to first register each protocol with proto_register, then add the corresponding protocol handlers with inet_add_protocol, and finally run the initialization routine of each protocol. The trace comes to an end here, but we have not seen how lower-level operating system pieces, such as the socket's alloc_inode, are initialized. They are assigned directly where the socket superblock operations are defined, not by the initialization process:

static const struct super_operations sockfs_ops = {
    .alloc_inode    = sock_alloc_inode,
    .destroy_inode  = sock_destroy_inode,
    .statfs     = simple_statfs,
};
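
The table above is filled in at compile time with designated initializers, so no run-time registration step is needed. A small user-space sketch of the same pattern is given here; the `demo_` types and functions are invented for illustration and only imitate the shape of the kernel structure.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_inode {
    int id;
};

/* Hypothetical operations table: a struct of function pointers. */
struct demo_super_operations {
    struct demo_inode *(*alloc_inode)(void);
    void (*destroy_inode)(struct demo_inode *inode);
};

int demo_live_inodes;   /* number of inodes currently allocated */
static int demo_next_id;

static struct demo_inode *demo_sock_alloc_inode(void)
{
    struct demo_inode *inode = malloc(sizeof(*inode));

    if (inode == NULL)
        return NULL;
    inode->id = demo_next_id++;
    demo_live_inodes++;
    return inode;
}

static void demo_sock_destroy_inode(struct demo_inode *inode)
{
    if (inode == NULL)
        return;
    free(inode);
    demo_live_inodes--;
}

/* Fixed at compile time; callers only ever go through the pointers. */
static const struct demo_super_operations demo_sockfs_ops = {
    .alloc_inode   = demo_sock_alloc_inode,
    .destroy_inode = demo_sock_destroy_inode,
};
```

A caller would write `demo_sockfs_ops.alloc_inode()` and later pass the result to `demo_sockfs_ops.destroy_inode()`, without knowing which concrete functions sit behind the table.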

Although the trace ends here, the third question has still not been fully answered: the proto_ops field in struct socket, that is, the function pointers that handle a specific protocol, must be initialized somewhere, but I did not find where. I hope to discover this in later study.


Origin www.cnblogs.com/myguaiguai/p/12041354.html