In-depth analysis of socket and system calls
As one would expect, when an application calls the socket() interface to request an operating system service, it ends up making a system call. The kernel uses the system call number passed in by the caller to decide which routine to execute; if the number corresponds to socket, the socket service routine is run. Inside that routine, different handlers are executed depending on the service requested. When processing ends, execution returns from the interrupt service routine to the point where the interrupt (int 0x80) was initiated, the user program resumes in user mode, the calls return layer by layer, and socket() completes.
At this point we are concerned with three questions:
1. How does the application request a system call, that is, how does it enter kernel mode?
2. How is the interrupt entry related to the service routine, and how does it jump to the routine we need?
3. What was done at initialization time so that our socket call can complete?
The application calls socket
We reuse the hello/hi chat program written earlier and debug the client to see how socket() is executed.
Preparation:
To debug into libc, its debug symbols and source code need to be downloaded.
1. First install the glibc symbol table:
sudo apt-get install libc6-dbg
2. Stepping through libc requires the corresponding source files. libc is open source, so we can download its source code and see where execution is while debugging:
sudo apt-get source libc6-dev
Note the path where the source is downloaded; it will be needed later during debugging.
3. Source file: client.c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MAX_len 1024

int sock_fd;
struct sockaddr_in add;

int main()
{
    int ret;
    char buf[MAX_len]={0};
    char buf_rec[MAX_len]={0};
    char buf_p[5]={"0"};
    memset(&add,0,sizeof(add));
    add.sin_family=AF_INET;
    add.sin_port=htons(8000);
    add.sin_addr.s_addr=inet_addr("127.0.0.1");
    if((sock_fd=socket(PF_INET,SOCK_STREAM,0))<=0)
    {
        perror("socket");
        return 1;
    }
    if((ret=connect(sock_fd,(struct sockaddr*)&add,sizeof(add)))<0)
    {
        perror("connect");
        return 1;
    }
    /* send the initial message held in buf_p */
    if((ret=send(sock_fd,(void*)buf_p,strlen(buf_p),0))<0)
    {
        perror("send");
        return 1;
    }
    while (1)
    {
        /* read one word from stdin, leaving room for the terminating NUL */
        if(scanf("%1023s",buf)!=1)
            break;
        if((ret=send(sock_fd,(void*)buf,sizeof(buf),0))<0)
        {
            perror("send");
            return 1;
        }
        /* keep one byte free so the reply can be NUL-terminated before printing */
        if((ret=recv(sock_fd,(void*)buf_rec,sizeof(buf_rec)-1,0))<0)
        {
            perror("recv");
            return 1;
        }
        buf_rec[ret]='\0';
        printf("%s\n",buf_rec);
    }
    return 0;
}
Start debugging:
1. Compile the file and generate debug information:
gcc -g -o client client.c
2. Start the gdb debugger: gdb client
(gdb) file client
Load new symbol table from "client"? (y or n) y
Reading symbols from client...done.
(gdb) b 23
Breakpoint 1 at 0x40091d: file client.c, line 23.
(gdb) c
The program is not being run.
(gdb) run
Starting program: /home/netlab/netlab/systemcall/client
Breakpoint 1, main () at client.c:23
23 if((sock_fd=socket(PF_INET,SOCK_STREAM,0))<=0)
We set a breakpoint on line 23, the line where socket() is first called, and run the program. It stops at line 23. We then use the step command to enter socket() and see how it is implemented internally.
(gdb) s
socket () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
However, gdb reports that the file we stepped into does not exist, because it cannot find the libc source code. This is why we downloaded the source earlier. Following the path in the message, we use the command directory glibc-2.23/sysdeps/unix/ to add the downloaded libc source to gdb's search path, and then step again:
(gdb) directory glibc-2.23/sysdeps/unix/
Source directories searched: /home/netlab/netlab/systemcall/glibc-2.23/sysdeps/unix:$cdir:$cwd
(gdb) s
socket () at ../sysdeps/unix/syscall-template.S:85
85 ret
(gdb) l
80
81 /* This is a "normal" system call stub: if there is an error,
82 it returns -1 and sets errno. */
83
84 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
85 ret
86 T_PSEUDO_END (SYSCALL_SYMBOL)
87
88 #endif
89
(gdb)
The program jumped into syscall-template.S and returned. The listing consists only of macros, so there is little to see. This file is in fact the template from which system call stubs are generated, as its name suggests; it provides the standard form of a system call.
So it seems the system call cannot be followed any further with gdb here, and the same thing happens for both 32-bit and 64-bit builds. We therefore skip debugging and analyze the libc source code instead.
The glibc implementation of socket:
First, __socket is mapped to socket through macro definitions:
#define __socket socket
#define __recvmsg recvmsg
#define __bind bind
#define __sendto sendto
The library then implements __socket():
int __socket (int fd, int type, int domain)
{
#ifdef __ASSUME_SOCKET_SYSCALL
return INLINE_SYSCALL (socket, 3, fd, type, domain);
#else
return SOCKETCALL (socket, fd, type, domain);
#endif
}
libc_hidden_def (__socket)
weak_alias (__socket, socket)
__socket() calls either SOCKETCALL or INLINE_SYSCALL, and both eventually expand to INLINE_SYSCALL. INLINE_SYSCALL is closely tied to the architecture; the x86_64 implementation is as follows:
# define INLINE_SYSCALL(name, nr, args...) \
({ \
unsigned long int resultvar = INTERNAL_SYSCALL (name, , nr, args); \
if (__glibc_unlikely (INTERNAL_SYSCALL_ERROR_P (resultvar, ))) \
{ \
__set_errno (INTERNAL_SYSCALL_ERRNO (resultvar, )); \
resultvar = (unsigned long int) -1; \
} \
(long int) resultvar; })
#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, err, nr, args...) \
internal_syscall##nr (SYS_ify (name), err, args)
The macro name is built from the number of arguments. With three arguments it expands to internal_syscall3:
#define internal_syscall3(number, err, arg1, arg2, arg3) \
({ \
unsigned long int resultvar; \
TYPEFY (arg3, __arg3) = ARGIFY (arg3); \
TYPEFY (arg2, __arg2) = ARGIFY (arg2); \
TYPEFY (arg1, __arg1) = ARGIFY (arg1); \
register TYPEFY (arg3, _a3) asm ("rdx") = __arg3; \
register TYPEFY (arg2, _a2) asm ("rsi") = __arg2; \
register TYPEFY (arg1, _a1) asm ("rdi") = __arg1; \
asm volatile ( \
"syscall\n\t" \
: "=a" (resultvar) \
: "0" (number), "r" (_a1), "r" (_a2), "r" (_a3) \
: "memory", REGISTERS_CLOBBERED_BY_SYSCALL); \
(long int) resultvar; \
})
Inline assembly is used here. The three arguments are placed in rdi, rsi and rdx, the system call number is placed in rax (the "a" register), and the syscall instruction then traps into the kernel, which dispatches to the corresponding handler. This is where the application side ends.
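The same step can be tried from user space without reading glibc macros. The following is a minimal sketch, assuming a Linux system whose headers define SYS_socket (the case on x86_64, where the table further down lists it as number 41). raw_socket is a hypothetical helper name, not part of glibc; it uses the generic syscall() wrapper to issue the call by number.

```c
#define _GNU_SOURCE
#include <sys/socket.h>   /* AF_UNIX, SOCK_STREAM */
#include <sys/syscall.h>  /* SYS_socket */
#include <unistd.h>       /* syscall(), close() */

/* Hypothetical helper: issue the socket system call by number,
 * bypassing the socket() wrapper. Returns the new descriptor, or
 * -1 with errno set, which is the same contract as socket(). */
static int raw_socket(int domain, int type, int protocol)
{
    return (int)syscall(SYS_socket, domain, type, protocol);
}
```

Calling raw_socket(AF_INET, SOCK_STREAM, 0) should behave like the socket() call in client.c, since both end in the same kernel entry point.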
Kernel interrupt response:
To see how the kernel responds to the application's socket request, we debug the kernel with qemu + gdb and observe how it handles the request.
1. Run MenuOS in debug mode; remember to add the client program to it to make debugging easier.
[picture 1]
2. Choose the breakpoint. To observe how the kernel handles the socket request, the breakpoint should sit on the call path of the handler. We cannot break on every interrupt entry, because it would then be hard to pick out the one we want. The best position is the entry of the socket system call handler: only a socket request can trigger it, so we can analyze it directly. How do we find that entry?
All system call entries for the x86 architecture are described under arch/x86/entry/syscalls in the kernel tree. For backward compatibility they are split into a 32-bit table and a 64-bit table:
32-bit:
99 i386 statfs sys_statfs __ia32_compat_sys_statfs
100 i386 fstatfs sys_fstatfs __ia32_compat_sys_fstatfs
101 i386 ioperm sys_ioperm __ia32_sys_ioperm
102 i386 socketcall sys_socketcall __ia32_compat_sys_socketcall
103 i386 syslog sys_syslog __ia32_sys_syslog
104 i386 setitimer sys_setitimer __ia32_compat_sys_setitimer
105 i386 getitimer sys_getitimer __ia32_compat_sys_getitimer
64-bit:
......
40 common sendfile __x64_sys_sendfile64
41 common socket __x64_sys_socket
42 common connect __x64_sys_connect
43 common accept __x64_sys_accept
44 common sendto __x64_sys_sendto
......
A 32-bit program enters through the 32-bit table and a 64-bit program through the 64-bit table. We look at the entry for each in turn.
First the 32-bit case: set a breakpoint on __ia32_compat_sys_socketcall, run MenuOS, and run the client program. gdb stops on entering __ia32_compat_sys_socketcall:
(gdb) b __ia32_compat_sys_socketcall
Breakpoint 3 at 0xffffffff818474b0: file net/compat.c, line 718.
(gdb) c
Continuing.
Breakpoint 3, __ia32_compat_sys_socketcall (regs=0xffffc900001eff58)
at net/compat.c:718
718 COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
[Photo 2]
Let's look at this function:
718 COMPAT_SYSCALL_DEFINE2(socketcall, int, call, u32 __user *, args)
719 {
720 u32 a[AUDITSC_ARGS];
721 unsigned int len;
722 u32 a0, a1;
(gdb)
723 int ret;
724
725 if (call < SYS_SOCKET || call > SYS_SENDMMSG)
726 return -EINVAL;
727 len = nas[call];
728 if (len > sizeof(a))
729 return -EINVAL;
730
731 if (copy_from_user(a, args, len))
732 return -EFAULT;
733
734 ret = audit_socketcall_compat(len / sizeof(a[0]), a);
735 if (ret)
736 return ret;
737
738 a0 = a[0];
739 a1 = a[1];
740
741 switch (call) {
742 case SYS_SOCKET:
(gdb)
743 ret = __sys_socket(a0, a1, a[2]);
744 break;
745 case SYS_BIND:
746 ret = __sys_bind(a0, compat_ptr(a1), a[2]);
747 break;
748 case SYS_CONNECT:
749 ret = __sys_connect(a0, compat_ptr(a1), a[2]);
750 break;
751 case SYS_LISTEN:
752 ret = __sys_listen(a0, a1);
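The dispatch pattern in the listing above can be sketched in user space. This is only an illustration under stated assumptions: every demo_ name is hypothetical, the handlers are stand-ins that return fixed values, and the argument table stores argument counts where the kernel's nas[] table stores byte lengths. The call numbers 1 to 4 are the ones used by socketcall for socket, bind, connect and listen.

```c
#include <errno.h>
#include <string.h>

/* Call numbers used by socketcall; only a subset is shown. */
enum { CALL_SOCKET = 1, CALL_BIND = 2, CALL_CONNECT = 3, CALL_LISTEN = 4 };

/* Number of arguments each call expects, indexed by call number. */
static const unsigned char demo_nargs[] = { 0, 3, 3, 3, 2 };

/* Hypothetical stand-ins for __sys_socket() and friends. */
static long demo_socket(long family, long type, long protocol)
{
    (void)family; (void)type; (void)protocol;
    return 3;                       /* pretend descriptor 3 was allocated */
}
static long demo_bind(long fd, long addr, long len)
{
    (void)fd; (void)addr; (void)len;
    return 0;
}
static long demo_connect(long fd, long addr, long len)
{
    (void)fd; (void)addr; (void)len;
    return 0;
}
static long demo_listen(long fd, long backlog)
{
    (void)fd; (void)backlog;
    return 0;
}

/* One entry point for a whole class of operations: validate the call
 * number, copy exactly as many arguments as that call takes, then
 * dispatch to the matching handler. */
static long demo_socketcall(int call, const long *user_args)
{
    long a[3] = { 0, 0, 0 };

    if (call < CALL_SOCKET || call > CALL_LISTEN)
        return -EINVAL;
    memcpy(a, user_args, demo_nargs[call] * sizeof(a[0]));

    switch (call) {
    case CALL_SOCKET:  return demo_socket(a[0], a[1], a[2]);
    case CALL_BIND:    return demo_bind(a[0], a[1], a[2]);
    case CALL_CONNECT: return demo_connect(a[0], a[1], a[2]);
    case CALL_LISTEN:  return demo_listen(a[0], a[1]);
    default:           return -EINVAL;
    }
}
```

The per-call argument count is what lets a single entry point copy the right amount of data from the caller before it knows anything else about the request.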
Clearly, this handler is the common entry point for a whole class of socket operations. It first fetches the system call arguments and then, depending on the service requested, dispatches to a different handler. We continue by following the function it dispatches to:
(gdb) b __sys_socket
Breakpoint 4 at 0xffffffff817eea40: file net/socket.c, line 1498.
(gdb) c
Continuing.
Breakpoint 4, __sys_socket (family=2, type=1, protocol=0) at net/socket.c:1498
1498 {
(gdb) l
1493 return __sock_create(net, family, type, protocol, res, 1);
1494 }
1495 EXPORT_SYMBOL(sock_create_kern);
1496
1497 int __sys_socket(int family, int type, int protocol)
1498 {
1499 int retval;
1500 struct socket *sock;
1501 int flags;
1502
(gdb)
1503 /* Check the SOCK_* constants for consistency. */
1504 BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
1505 BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
1506 BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
1507 BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
1508
1509 flags = type & ~SOCK_TYPE_MASK;
1510 if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
1511 return -EINVAL;
1512 type &= SOCK_TYPE_MASK;
(gdb)
1513
1514 if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
1515 flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
1516
1517 retval = sock_create(family, type, protocol, &sock);
1518 if (retval < 0)
1519 return retval;
1520
1521 return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
1522 }
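The flag handling at lines 1509 to 1512 of the listing can be reproduced in isolation. The sketch below assumes a Linux user-space build where SOCK_CLOEXEC and SOCK_NONBLOCK are visible in sys/socket.h. DEMO_SOCK_TYPE_MASK is a local assumption that mirrors the kernel's SOCK_TYPE_MASK (0xf), which is not exported to user space, and split_socket_type is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/socket.h>  /* SOCK_STREAM, SOCK_NONBLOCK, SOCK_CLOEXEC */

/* Assumption: same value as the kernel's SOCK_TYPE_MASK. */
#define DEMO_SOCK_TYPE_MASK 0xf

/* Split the "type" argument of socket() into the base socket type and
 * the flag bits, rejecting any flag other than the two supported ones. */
static int split_socket_type(int type, int *base_type, int *flags)
{
    int f = type & ~DEMO_SOCK_TYPE_MASK;

    if (f & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
        return -EINVAL;

    *base_type = type & DEMO_SOCK_TYPE_MASK;
    *flags = f;
    return 0;
}
```

This is why a caller can write socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0): the low bits select the type and the remaining bits carry the flags.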
__sys_socket() itself only validates its arguments, then calls sock_create() to do the actual work and finally sock_map_fd(), which is discussed later. We follow sock_create():
(gdb) b __sock_create
Breakpoint 7 at 0xffffffff817ec9a0: file net/socket.c, line 1363.
(gdb) c
Continuing.
Breakpoint 5, __sys_socket (family=2, type=1, protocol=0) at net/socket.c:1517
1517 retval = sock_create(family, type, protocol, &sock);
(gdb) c
Continuing.
Breakpoint 7, __sock_create (net=0xffffffff824e94c0 <init_net>, family=2, type=1,
protocol=0, res=0xffffc90000047e98, kern=0) at net/socket.c:1363
1363 if (family < 0 || family >= NPROTO)
(gdb) l
1358 const struct net_proto_family *pf;
1359
1360 /*
1361 * Check protocol is in range
1362 */
1363 if (family < 0 || family >= NPROTO)
1364 return -EAFNOSUPPORT;
1365 if (type < 0 || type >= SOCK_MAX)
1366 return -EINVAL;
1367
(gdb) l
1368 /* Compatibility.
1369
1370 This uglymoron is moved from INET layer to here to avoid
1371 deadlock in module load.
1372 */
1373 if (family == PF_INET && type == SOCK_PACKET) {
1374 pr_info_once("%s uses obsolete (PF_INET,SOCK_PACKET)\n",
1375 current->comm);
1376 family = PF_PACKET;
1377 }
(gdb) l
1378
1379 err = security_socket_create(family, type, protocol, kern);
1380 if (err)
1381 return err;
1382
1383 /*
1384 * Allocate the socket and allow the family to set things up. if
1385 * the protocol is 0, the family is instructed to select an appropriate
1386 * default.
1387 */
(gdb) l
1388 sock = sock_alloc();
1389 if (!sock) {
1390 net_warn_ratelimited("socket: no more sockets\n");
1391 return -ENFILE; /* Not exactly a match, but its the
1392 closest posix thing */
1393 }
1394
1395 sock->type = type;
1396
1397 #ifdef CONFIG_MODULES
(gdb) l
1398 /* Attempt to load a protocol module if the find failed.
1399 *
1400 * 12/09/1996 Marcin: But! this makes REALLY only sense, if the user
1401 * requested real, full-featured networking support upon configuration.
1402 * Otherwise module support will break!
1403 */
1404 if (rcu_access_pointer(net_families[family]) == NULL)
1405 request_module("net-pf-%d", family);
1406 #endif
1407
(gdb) l
1408 rcu_read_lock();
1409 pf = rcu_dereference(net_families[family]);
1410 err = -EAFNOSUPPORT;
1411 if (!pf)
1412 goto out_release;
1413
1414 /*
1415 * We will call the ->create function, that possibly is in a loadable
1416 * module, so we have to bump that loadable module refcnt first.
1417 */
(gdb) l
1418 if (!try_module_get(pf->owner))
1419 goto out_release;
1420
1421 /* Now protected by module ref count */
1422 rcu_read_unlock();
1423
1424 err = pf->create(net, sock, protocol, kern);
1425 if (err < 0)
1426 goto out_module_put;
1427
(gdb) l
1428 /*
1429 * Now to bump the refcnt of the [loadable] module that owns this
1430 * socket at sock_release time we decrement its refcnt.
1431 */
1432 if (!try_module_get(sock->ops->owner))
1433 goto out_module_busy;
1434
1435 /*
1436 * Now that we're done with the ->create function, the [loadable]
1437 * module can have its refcnt decremented
(gdb) l
1438 */
1439 module_put(pf->owner);
1440 err = security_socket_post_create(sock, family, type, protocol, kern);
1441 if (err)
1442 goto out_sock_release;
1443 *res = sock;
1444
1445 return 0;
1446
1447 out_module_busy:
(gdb) l
1448 err = -EAFNOSUPPORT;
1449 out_module_put:
1450 sock->ops = NULL;
1451 module_put(pf->owner);
1452 out_sock_release:
1453 sock_release(sock);
1454 return err;
1455
1456 out_release:
1457 rcu_read_unlock();
(gdb) l
1458 goto out_sock_release;
1459 }
1460 EXPORT_SYMBOL(__sock_create);
1461
1462 /**
1463 * sock_create - creates a socket
1464 * @family: protocol family (AF_INET, ...)
1465 * @type: communication type (SOCK_STREAM, ...)
1466 * @protocol: protocol (0, ...)
1467 * @res: new socket
(gdb) l
1468 *
1469 * A wrapper around __sock_create().
1470 * Returns 0 or an error. This function internally uses GFP_KERNEL.
1471 */
1472
1473 int sock_create(int family, int type, int protocol, struct socket **res)
1474 {
1475 return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
1476 }
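The lookup through net_families[family] followed by pf->create() in the listing above is a registry pattern. The following is a simplified sketch of that idea, not the kernel code: all demo_ names are hypothetical, DEMO_NPROTO is an arbitrary small bound, and locking, RCU and module reference counting are left out.

```c
#include <errno.h>
#include <stddef.h>

#define DEMO_NPROTO 8   /* assumption: arbitrary small bound for the demo */

struct demo_socket { int type; int family; };

/* One entry per protocol family, carrying its creation callback. */
struct demo_proto_family {
    int family;
    int (*create)(struct demo_socket *sock, int protocol);
};

static const struct demo_proto_family *demo_families[DEMO_NPROTO];

/* Register a family once; a second registration is rejected. */
static int demo_sock_register(const struct demo_proto_family *pf)
{
    if (pf->family < 0 || pf->family >= DEMO_NPROTO)
        return -ENOBUFS;
    if (demo_families[pf->family])
        return -EEXIST;
    demo_families[pf->family] = pf;
    return 0;
}

/* Range-check the family, look it up, and let it set the socket up. */
static int demo_sock_create(int family, int type, int protocol,
                            struct demo_socket *sock)
{
    const struct demo_proto_family *pf;

    if (family < 0 || family >= DEMO_NPROTO)
        return -EAFNOSUPPORT;
    pf = demo_families[family];
    if (!pf)
        return -EAFNOSUPPORT;

    sock->type = type;
    return pf->create(sock, protocol);
}

/* A stand-in family using number 2, the value of AF_INET. */
static int demo_inet_create(struct demo_socket *sock, int protocol)
{
    (void)protocol;
    sock->family = 2;
    return 0;
}
static const struct demo_proto_family demo_inet_family = { 2, demo_inet_create };
```

Until a family has been registered, creation fails with -EAFNOSUPPORT, which matches the error path at line 1410 of the listing.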
At first I set a breakpoint on sock_create, but it was never reached. Reading the kernel source shows that the breakpoint has to go on __sock_create instead; sock_create is only a thin wrapper and is probably redirected (for example inlined) to it. Continuing with __sock_create:
err = security_socket_create(family, type, protocol, kern);
first checks whether the request is permitted. Then comes the most important call:
sock = sock_alloc();
To understand this line we need to know that sock is a variable of type struct socket, the core structure of the socket interface. It is defined as follows:
struct socket {
socket_state state;
short type;
unsigned long flags;
struct socket_wq *wq;
struct file *file;
struct sock *sk;
const struct proto_ops *ops;
};
state is the current state of the socket, for example whether it is connected. type is the service type of the socket, such as SOCK_STREAM for TCP. flags holds flag bits such as SOCK_ASYNC_NOSPACE. wq is the wait queue, since several requests may be waiting on one socket. file points to the associated file: a socket can also be treated as a file, so it carries this pointer to stay compatible with file operations. sk is very important and very large; it records the protocol-specific state. This split keeps struct socket largely protocol-independent and generic. ops points to the set of basic operations for the socket. This is a common Linux idiom: the operations on an object are collected in a structure of function pointers:
struct proto_ops {
int family;
struct module *owner;
int (*release) (struct socket *sock);
int (*bind) (struct socket *sock,
struct sockaddr *myaddr,
int sockaddr_len);
int (*connect) (struct socket *sock,
struct sockaddr *vaddr,
int sockaddr_len, int flags);
int (*socketpair)(struct socket *sock1,
struct socket *sock2);
int (*accept) (struct socket *sock,
struct socket *newsock, int flags, bool kern);
int (*getname) (struct socket *sock,
struct sockaddr *addr,
int peer);
__poll_t (*poll) (struct file *file, struct socket *sock,
struct poll_table_struct *wait);
int (*ioctl) (struct socket *sock, unsigned int cmd,
unsigned long arg);
#ifdef CONFIG_COMPAT
int (*compat_ioctl) (struct socket *sock, unsigned int cmd,
unsigned long arg);
#endif
int (*listen) (struct socket *sock, int len);
int (*shutdown) (struct socket *sock, int flags);
int (*setsockopt)(struct socket *sock, int level,
int optname, char __user *optval, unsigned int optlen);
int (*getsockopt)(struct socket *sock, int level,
int optname, char __user *optval, int __user *optlen);
#ifdef CONFIG_COMPAT
int (*compat_setsockopt)(struct socket *sock, int level,
int optname, char __user *optval, unsigned int optlen);
int (*compat_getsockopt)(struct socket *sock, int level,
int optname, char __user *optval, int __user *optlen);
#endif
int (*sendmsg) (struct socket *sock, struct msghdr *m,
size_t total_len);
/* Notes for implementing recvmsg:
* ===============================
* msg->msg_namelen should get updated by the recvmsg handlers
* iff msg_name != NULL. It is by default 0 to prevent
* returning uninitialized memory to user space. The recvfrom
* handlers can assume that msg.msg_name is either NULL or has
* a minimum size of sizeof(struct sockaddr_storage).
*/
int (*recvmsg) (struct socket *sock, struct msghdr *m,
size_t total_len, int flags);
int (*mmap) (struct file *file, struct socket *sock,
struct vm_area_struct * vma);
ssize_t (*sendpage) (struct socket *sock, struct page *page,
int offset, size_t size, int flags);
ssize_t (*splice_read)(struct socket *sock, loff_t *ppos,
struct pipe_inode_info *pipe, size_t len, unsigned int flags);
int (*set_peek_off)(struct sock *sk, int val);
int (*peek_len)(struct socket *sock);
/* The following functions are called internally by kernel with
* sock lock already held.
*/
int (*read_sock)(struct sock *sk, read_descriptor_t *desc,
sk_read_actor_t recv_actor);
int (*sendpage_locked)(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
int (*sendmsg_locked)(struct sock *sk, struct msghdr *msg,
size_t size);
int (*set_rcvlowat)(struct sock *sk, int val);
};
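The idiom of collecting an object's operations in a structure of function pointers can be shown with a much smaller example. This is a sketch only: demo_ops, demo_sock and the two sendmsg variants are hypothetical, and the 16-byte datagram limit is invented for the demonstration.

```c
#include <stddef.h>
#include <string.h>

struct demo_sock;

/* A cut-down operations table with a single operation. */
struct demo_ops {
    const char *name;
    long (*sendmsg)(struct demo_sock *sock, const char *buf, size_t len);
};

/* The object only stores a pointer to its table. */
struct demo_sock {
    const struct demo_ops *ops;
    size_t bytes_sent;
};

/* Stream variant: accepts any length. */
static long stream_sendmsg(struct demo_sock *sock, const char *buf, size_t len)
{
    (void)buf;
    sock->bytes_sent += len;
    return (long)len;
}

/* Datagram variant: refuses payloads above a made-up 16-byte limit. */
static long dgram_sendmsg(struct demo_sock *sock, const char *buf, size_t len)
{
    (void)buf;
    if (len > 16)
        return -1;
    sock->bytes_sent += len;
    return (long)len;
}

static const struct demo_ops stream_ops = { "stream", stream_sendmsg };
static const struct demo_ops dgram_ops  = { "dgram",  dgram_sendmsg  };

/* Generic code calls through the table and never names a protocol. */
static long demo_send(struct demo_sock *sock, const char *buf)
{
    return sock->ops->sendmsg(sock, buf, strlen(buf));
}
```

The caller of demo_send does not need to know which variant it is talking to, which is the property that lets the generic socket layer stay independent of the protocol.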
Back to debugging. sock_alloc() allocates a socket structure; how is it implemented internally? We set another breakpoint and observe:
(gdb) b sock_alloc
Breakpoint 8 at 0xffffffff817ec230: file net/socket.c, line 569.
(gdb) c
Continuing.
Breakpoint 8, sock_alloc () at net/socket.c:569
569 inode = new_inode_pseudo(sock_mnt->mnt_sb);
(gdb) l
564 struct socket *sock_alloc(void)
565 {
566 struct inode *inode;
567 struct socket *sock;
568
569 inode = new_inode_pseudo(sock_mnt->mnt_sb);
570 if (!inode)
571 return NULL;
572
573 sock = SOCKET_I(inode);
(gdb) l
574
575 inode->i_ino = get_next_ino();
576 inode->i_mode = S_IFSOCK | S_IRWXUGO;
577 inode->i_uid = current_fsuid();
578 inode->i_gid = current_fsgid();
579 inode->i_op = &sockfs_inode_ops;
580
581 return sock;
582 }
sock_alloc creates two structures: an inode, as for a file on disk, and the socket structure that it returns. The assignments that follow are all made to the inode, and the socket itself comes from SOCKET_I(). From this, SOCKET_I looks like the place where the socket is created, but passing it an inode is puzzling. I wanted to step into it, but SOCKET_I is an inline function, so gdb cannot enter it and it can only be analyzed from the source:
static inline struct socket *SOCKET_I(struct inode *inode)
{
return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}
container_of() is a classic macro. container_of(ptr, type, member) takes a pointer ptr to the field member inside a structure of type type and returns a pointer to that enclosing structure. Here the inode is the vfs_inode field of a struct socket_alloc, so container_of yields the enclosing socket_alloc, and its socket field is then returned: inode -> socket_alloc -> socket. struct socket_alloc is defined as follows:
struct socket_alloc {
struct socket socket;
struct inode vfs_inode;
};
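The pointer arithmetic behind SOCKET_I can be checked in user space with offsetof. The sketch below uses a simplified demo_container_of macro (the kernel version adds type checking) and cut-down demo_ structures that are assumptions for illustration only.

```c
#include <stddef.h>  /* offsetof */

/* Simplified version of the idea: step back from the member's address
 * by the member's offset to reach the start of the enclosing struct. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_socket { int type; };
struct demo_inode  { unsigned long i_ino; };

/* Same layout idea as struct socket_alloc: both objects in one block. */
struct demo_socket_alloc {
    struct demo_socket socket;
    struct demo_inode  vfs_inode;
};

/* Given the inode, recover the socket that lives next to it. */
static struct demo_socket *demo_SOCKET_I(struct demo_inode *inode)
{
    return &demo_container_of(inode, struct demo_socket_alloc, vfs_inode)->socket;
}
```

No lookup table is involved: because the two structures are allocated together, knowing the address of one is enough to compute the address of the other.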
So how is the inode created? sock_alloc calls new_inode_pseudo, which is implemented as follows:
struct inode *new_inode_pseudo(struct super_block *sb)
{
struct inode *inode = alloc_inode(sb);
if (inode) {
spin_lock(&inode->i_lock);
inode->i_state = 0;
spin_unlock(&inode->i_lock);
INIT_LIST_HEAD(&inode->i_sb_list);
}
return inode;
}
This in turn calls alloc_inode:
static struct inode *alloc_inode(struct super_block *sb)
{
struct inode *inode;
if (sb->s_op->alloc_inode)
inode = sb->s_op->alloc_inode(sb);
else
inode = kmem_cache_alloc(inode_cachep, GFP_KERNEL);
if (!inode)
return NULL;
if (unlikely(inode_init_always(sb, inode))) {
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
kmem_cache_free(inode_cachep, inode);
return NULL;
}
return inode;
}
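The choice made in alloc_inode above, between a file-system-specific allocator and a generic fallback, can be sketched as follows. All demo_ names are hypothetical and calloc stands in for the kernel's slab caches. The socket-specific allocator here hands out an inode embedded in a larger block, which is the behaviour the rest of this section relies on (in the kernel source that role is played by sock_alloc_inode in net/socket.c).

```c
#include <stdlib.h>
#include <stddef.h>

struct demo_inode  { int i_state; };
struct demo_socket { int type; };
struct demo_socket_alloc {
    struct demo_socket socket;
    struct demo_inode  vfs_inode;
};

struct demo_super_block;
struct demo_super_ops {
    struct demo_inode *(*alloc_inode)(struct demo_super_block *sb);
};
struct demo_super_block { const struct demo_super_ops *s_op; };

/* Socket file system: allocate socket and inode as one block and
 * return a pointer to the inode part of it. */
static struct demo_inode *demo_sock_alloc_inode(struct demo_super_block *sb)
{
    struct demo_socket_alloc *ei = calloc(1, sizeof(*ei));

    (void)sb;
    return ei ? &ei->vfs_inode : NULL;
}

/* Generic entry point: prefer the file system's own allocator and
 * fall back to a bare inode when none is provided. */
static struct demo_inode *demo_alloc_inode(struct demo_super_block *sb)
{
    if (sb->s_op && sb->s_op->alloc_inode)
        return sb->s_op->alloc_inode(sb);
    return calloc(1, sizeof(struct demo_inode));
}

static const struct demo_super_ops demo_sockfs_ops = { demo_sock_alloc_inode };
static struct demo_super_block demo_sockfs_sb = { &demo_sockfs_ops };
static struct demo_super_block demo_plain_sb  = { NULL };
```

A caller that receives an inode from demo_sockfs_sb must free the enclosing demo_socket_alloc block, not the inode pointer itself.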
alloc_inode makes things clearer. The inode is allocated by calling the superblock's s_op->alloc_inode, which brings in the file system layer. In Linux a superblock represents a file system, and each file system supplies its own methods for creating files, deleting files and so on. Clearly, sockets are implemented as a file system too, and the function called here is the inode allocation function that the socket file system registered earlier. It does not create a bare inode; it creates a socket_alloc structure, which contains both a struct inode and a struct socket. Finally the socket is initialized and returned.
One detail remains. The socket system call returns a socket descriptor (a file descriptor), but no file descriptor has appeared so far. It is produced in __sys_socket by the call sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));. As mentioned above, a socket is treated as a file, and struct socket contains a struct file pointer. sock_map_fd is the function that binds the file descriptor, the struct socket and the struct file together. It is implemented as follows:
static int sock_map_fd(struct socket *sock, int flags)
{
struct file *newfile;
int fd = get_unused_fd_flags(flags);
if (unlikely(fd < 0)) {
sock_release(sock);
return fd;
}
newfile = sock_alloc_file(sock, flags, NULL);
if (likely(!IS_ERR(newfile))) {
fd_install(fd, newfile);
return fd;
}
put_unused_fd(fd);
return PTR_ERR(newfile);
}
Here get_unused_fd_flags(flags) obtains the file descriptor fd that will be returned, sock_alloc_file(sock, flags, NULL) creates a struct file object, and fd_install(fd, newfile) binds the two. The struct file is created as follows:
struct file *sock_alloc_file(struct socket *sock, int flags, const char *dname)
{
struct file *file;
if (!dname)
dname = sock->sk ? sock->sk->sk_prot_creator->name : "";
file = alloc_file_pseudo(SOCK_INODE(sock), sock_mnt, dname,
O_RDWR | (flags & O_NONBLOCK),
&socket_file_ops);
if (IS_ERR(file)) {
sock_release(sock);
return file;
}
sock->file = file;
file->private_data = sock;
return file;
}
This calls alloc_file_pseudo. Note the key structure socket_file_ops: it defines the basic file operations for a socket, so this step binds those file operations to the file. It is defined as follows:
static const struct file_operations socket_file_ops = {
.owner = THIS_MODULE,
.llseek = no_llseek,
.read_iter = sock_read_iter,
.write_iter = sock_write_iter,
.poll = sock_poll,
.unlocked_ioctl = sock_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_sock_ioctl,
#endif
.mmap = sock_mmap,
.release = sock_close,
.fasync = sock_fasync,
.sendpage = sock_sendpage,
.splice_write = generic_splice_sendpage,
.splice_read = sock_splice_read,
};
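That a socket descriptor really is backed by a file object with its own read and write operations can be observed from user space. The sketch below is an illustration with hypothetical helper names; it uses socketpair with AF_UNIX so that it needs no server, then checks the inode type with fstat and moves data with plain write() and read() instead of send() and recv().

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if fd refers to a socket inode, 0 if not, -1 on error. */
static int fd_is_socket(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISSOCK(st.st_mode) ? 1 : 0;
}

/* Write msg into one end of a socket pair with write() and read it
 * back from the other end with read(). Returns the number of bytes
 * read, or -1 on failure. */
static ssize_t echo_through_socketpair(const char *msg, size_t len,
                                       char *out, size_t outlen)
{
    int sv[2];
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (fd_is_socket(sv[0]) == 1 && write(sv[0], msg, len) == (ssize_t)len)
        n = read(sv[1], out, outlen);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The S_ISSOCK check corresponds to the S_IFSOCK bit that sock_alloc stores in inode->i_mode, and the generic write() and read() calls reach the socket through the write_iter and read_iter entries of the table above.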
This completes the first two questions. Because the relationships are complicated, let's lay out the call chain:
__ia32_compat_sys_socketcall -> __sys_socket -> sock_create -> sock_alloc -> new_inode_pseudo -> alloc_inode -> sb->s_op->alloc_inode
The core of this process is creating a socket_alloc structure, because that structure contains both the socket and the inode.
One question remains: socket initialization. In the introduction to struct socket above, the proto_ops field holds function pointers for the various services, but we have not yet analyzed when and how those pointers are assigned. That is what we look at next.
socket initialization
As before, we use gdb to follow the Linux kernel boot process and see in what order, and how, the socket layer is initialized.
Restart qemu, load MenuOS, and attach gdb to the kernel.
First set a breakpoint on start_kernel and check whether it contains any network initialization code:
(gdb) target remote: 1234
Remote debugging using : 1234
0x0000000000000000 in fixed_percpu_data ()
(gdb) b start_kernel
Breakpoint 1 at 0xffffffff82997b05: file init/main.c, line 552.
(gdb) c
Continuing.
Breakpoint 1, start_kernel () at init/main.c:552
warning: Source file is more recent than executable.
552 asmlinkage __visible void __init start_kernel(void)
553 {
554 char *command_line;
555 char *after_dashes;
556
(gdb) l
557 set_task_stack_end_magic(&init_task);
558 smp_setup_processor_id();
559 debug_objects_early_init();
560
561 cgroup_init_early();
562
563 local_irq_disable();
564 early_boot_irqs_disabled = true;
565
566 /*
(gdb) l
567 * Interrupts are still disabled. Do necessary setups, then
568 * enable them.
569 */
570 boot_cpu_init();
571 page_address_init();
572 pr_notice("%s", linux_banner);
573 setup_arch(&command_line);
574 mm_init_cpumask(&init_mm);
575 setup_command_line(command_line);
576 setup_nr_cpu_ids();
We see no network initialization code here. Note the call arch_call_rest_init(); though: it performs the remaining initialization that is not listed here, so we set a breakpoint on arch_call_rest_init:
Breakpoint 2, arch_call_rest_init () at init/main.c:548
546
547 void __init __weak arch_call_rest_init(void)
548 {
549 rest_init();
550 }
551
552 asmlinkage __visible void __init start_kernel(void)
(gdb) b rest_init
Breakpoint 4, rest_init () at init/main.c:411
411
(gdb) l
406
407 noinline void __ref rest_init(void)
408 {
409 struct task_struct *tsk;
410 int pid;
411
412 rcu_scheduler_starting();
413 /*
414 * We need to spawn init first so that it obtains pid 1, however
415 * the init task will end up wanting to create kthreads, which, if
(gdb)
arch_call_rest_init() has only one line, a call to rest_init(). Following it, we get the complete code of rest_init:
407 noinline void __ref rest_init(void)
408 {
409 struct task_struct *tsk;
410 int pid;
411
412 rcu_scheduler_starting();
413 /*
414 * We need to spawn init first so that it obtains pid 1, however
415 * the init task will end up wanting to create kthreads, which, if
(gdb)
416 * we schedule it before we create kthreadd, will OOPS.
417 */
418 pid = kernel_thread(kernel_init, NULL, CLONE_FS);
419 /*
420 * Pin init on the boot CPU. Task migration is not properly working
421 * until sched_init_smp() has been run. It will set the allowed
422 * CPUs for init to the non isolated CPUs.
423 */
424 rcu_read_lock();
425 tsk = find_task_by_pid_ns(pid, &init_pid_ns);
(gdb)
426 set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
427 rcu_read_unlock();
428
429 numa_default_policy();
430 pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
431 rcu_read_lock();
432 kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
433 rcu_read_unlock();
434
435 /*
(gdb)
436 * Enable might_sleep() and smp_processor_id() checks.
437 * They cannot be enabled earlier because with CONFIG_PREEMPT=y
438 * kernel_thread() would trigger might_sleep() splats. With
439 * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
440 * already, but it's stuck on the kthreadd_done completion.
441 */
442 system_state = SYSTEM_SCHEDULING;
443
444 complete(&kthreadd_done);
445
(gdb)
446 /*
447 * The boot idle thread must execute schedule()
448 * at least once to get things moving:
449 */
450 schedule_preempt_disabled();
451 /* Call into cpu_idle with preempt disabled */
452 cpu_startup_entry(CPUHP_ONLINE);
453 }
Two kernel threads are created here, kernel_init and kthreadd, and the actual initialization is done by them, so we set breakpoints on both functions:
1086 static int __ref kernel_init(void *unused)
1087 {
1088 int ret;
1089
1090 kernel_init_freeable();
1091 /* need to finish all async __init code before freeing the memory */
(gdb)
1092 async_synchronize_full();
1093 ftrace_free_init_mem();
1094 free_initmem();
1095 mark_readonly();
1096
1097 /*
1098 * Kernel mappings are now finalized - update the userspace page-table
1099 * to finalize PTI.
1100 */
1101 pti_finalize();
(gdb)
1102
1103 system_state = SYSTEM_RUNNING;
1104 numa_default_policy();
1105
1106 rcu_end_inkernel_boot();
1107
1108 if (ramdisk_execute_command) {
1109 ret = run_init_process(ramdisk_execute_command);
1110 if (!ret)
1111 return 0;
(gdb)
1112 pr_err("Failed to execute %s (error %d)\n",
1113 ramdisk_execute_command, ret);
1114 }
1115
1116 /*
1117 * We try each of these until one succeeds.
1118 *
1119 * The Bourne shell can be used instead of init if we are
1120 * trying to recover a really broken machine.
1121 */
(gdb)
1122 if (execute_command) {
1123 ret = run_init_process(execute_command);
1124 if (!ret)
1125 return 0;
1126 panic("Requested init %s failed (error %d).",
1127 execute_command, ret);
1128 }
1129 if (!try_to_run_init_process("/sbin/init") ||
1130 !try_to_run_init_process("/etc/init") ||
1131 !try_to_run_init_process("/bin/init") ||
(gdb)
1132 !try_to_run_init_process("/bin/sh"))
1133 return 0;
1134
1135 panic("No working init found. Try passing init= option to kernel. "
1136 "See Linux Documentation/admin-guide/init.rst for guidance.");
1137 }
kernel_init runs first. It decides where the init program is located and finally jumps to it, but before loading the user-space init program it performs further initialization through kernel_init_freeable, so we move on to kernel_init_freeable().
Inside that function, apart from do_basic_setup, nothing of interest is initialized. We set a breakpoint on do_basic_setup, but the program first arrives at kthreadd:
568 int kthreadd(void *unused)
569 {
570 struct task_struct *tsk = current;
571
572 /* Setup a clean context for our children to inherit. */
573 set_task_comm(tsk, "kthreadd");
(gdb) l
574 ignore_signals(tsk);
575 set_cpus_allowed_ptr(tsk, cpu_all_mask);
576 set_mems_allowed(node_states[N_MEMORY]);
577
578 current->flags |= PF_NOFREEZE;
579 cgroup_init_kthreadd();
580
581 for (;;) {
582 set_current_state(TASK_INTERRUPTIBLE);
583 if (list_empty(&kthread_create_list))
(gdb)
584 schedule();
585 __set_current_state(TASK_RUNNING);
586
587 spin_lock(&kthread_create_lock);
588 while (!list_empty(&kthread_create_list)) {
589 struct kthread_create_info *create;
590
591 create = list_entry(kthread_create_list.next,
592 struct kthread_create_info, list);
593 list_del_init(&create->list);
(gdb)
594 spin_unlock(&kthread_create_lock);
595
596 create_kthread(create);
597
598 spin_lock(&kthread_create_lock);
599 }
600 spin_unlock(&kthread_create_lock);
601 }
602
603 return 0;
(gdb)
604 }
kthreadd is responsible for creating a series of threads according to kthread_create_list, which clearly has nothing to do with network initialization, so we go on to look at do_basic_setup:
static void __init do_basic_setup(void)
{
cpuset_init_smp();
shmem_init();
driver_init();
init_irq_proc();
do_ctors();
usermodehelper_enable();
do_initcalls();
}
static void __init do_initcalls(void)
{
int level;

for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
do_initcall_level(level);
}
do_initcalls walks through initcall_levels and calls do_initcall_level(level) for each level, so we first need to look at what do_initcall_level does:
static void __init do_initcall_level(int level)
{
initcall_entry_t *fn;
strcpy(initcall_command_line, saved_command_line);
parse_args(initcall_level_names[level],
initcall_command_line, __start___param,
__stop___param - __start___param,
level, level,
NULL, &repair_env_string);
trace_initcall_level(initcall_level_names[level]);
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
do_one_initcall(initcall_from_entry(fn));
}
initcall_levels is a table that holds, for each level, the initialization entries that were registered into it. initcall_from_entry returns the function address stored in the entry fn, and do_one_initcall then calls that address. As for how entries get into this table, the answer can be found in the network initialization routine inet_init:
(A lot of irrelevant code has been omitted.)
static int __init inet_init(void)
{
struct inet_protosw *q;
struct list_head *r;
int rc = -EINVAL;
sock_skb_cb_check_size(sizeof(struct inet_skb_parm));
rc = proto_register(&tcp_prot, 1);
if (rc)
goto out;
rc = proto_register(&udp_prot, 1);
if (rc)
goto out_unregister_tcp_proto;
rc = proto_register(&raw_prot, 1);
if (rc)
goto out_unregister_udp_proto;
rc = proto_register(&ping_prot, 1);
if (rc)
goto out_unregister_raw_proto;
(void)sock_register(&inet_family_ops);
ip_static_sysctl_init();
if (inet_add_protocol(&icmp_protocol, IPPROTO_ICMP) < 0)
pr_crit("%s: Cannot add ICMP protocol\n", __func__);
if (inet_add_protocol(&udp_protocol, IPPROTO_UDP) < 0)
pr_crit("%s: Cannot add UDP protocol\n", __func__);
if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
pr_crit("%s: Cannot add TCP protocol\n", __func__);
/* Register the socket-side information for inet_create. */
for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
INIT_LIST_HEAD(r);
for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
inet_register_protosw(q);
arp_init();
ip_init();
tcp_init();
udp_init();
udplite4_register();
raw_init();
ping_init();
ipv4_proc_init();
ipfrag_init();
dev_add_pack(&ip_packet_type);
ip_tunnel_core_init();
rc = 0;
out:
return rc;
out_unregister_raw_proto:
proto_unregister(&raw_prot);
out_unregister_udp_proto:
proto_unregister(&udp_prot);
out_unregister_tcp_proto:
proto_unregister(&tcp_prot);
goto out;
}
fs_initcall(inet_init);
So fs_initcall(inet_init) registers the inet_init function in initcall_levels, and do_initcalls eventually runs it. The best way to verify this is to reboot, set a breakpoint at inet_init, and see whether the function is called.
(gdb) b inet_init
Breakpoint 1 at 0xffffffff829f49fe: file net/ipv4/af_inet.c, line 1906.
(gdb) c
Continuing.
Breakpoint 1, inet_init () at net/ipv4/af_inet.c:1906
1906 {
Now look at the inet_init code. It covers almost all of the network protocols: TCP, UDP, ICMP and so on. The process is to first register each protocol with proto_register, then add the corresponding protocol handlers with inet_add_protocol, and finally call the individual init functions (arp_init, ip_init, tcp_init, ...). The trace comes to a stop here for now. What we have not seen is how the lower-level pieces, such as the socket's alloc_inode, are initialized. These are filled in directly where the socket superblock operations are defined, not during the initialization process:
static const struct super_operations sockfs_ops = {
.alloc_inode = sock_alloc_inode,
.destroy_inode = sock_destroy_inode,
.statfs = simple_statfs,
};
Although the trace ends here, the third question has not been fully resolved: I did not find where the proto_ops field of struct socket, that is, the set of function pointers that handle a specific protocol, is initialized. Hopefully later study will turn it up.