Here's a back-end interview experience from Alibaba's spring recruitment~

operating system

When we measure an operating system's memory usage, which parts of memory does it generally consist of?

Reader Answers: Heap and Stack

Supplement:

This is actually asking about your understanding of the free command.
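As a rough illustration (my addition, not part of the interview), here is a minimal Go sketch that reads /proc/meminfo, the same place free gets its numbers from, and prints a few of the standard fields (MemTotal, MemFree, Buffers, Cached, SwapTotal):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Print a few of the /proc/meminfo fields that the free command summarizes.
func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	want := map[string]bool{
		"MemTotal": true, "MemFree": true, "Buffers": true,
		"Cached": true, "SwapTotal": true,
	}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line looks like "MemTotal:       16290904 kB".
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) == 2 && want[parts[0]] {
			fmt.Printf("%-10s %s\n", parts[0], strings.TrimSpace(parts[1]))
		}
	}
}
```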

 

When the host performs memory cleanup, do you know which memory areas are involved in that operation?

Reader Answer: I don't know

Supplement:

When system memory is tight, the kernel reclaims memory. So which memory can be reclaimed?

There are two main types of memory that can be reclaimed, and they are reclaimed in different ways.

1. File-backed pages: both the disk data cached by the kernel (Buffer) and the file data cached by the kernel (Cache) are called file pages. Most file pages can simply be released from memory and re-read from disk later when needed. Data that has been modified by applications but not yet written to disk (dirty pages), however, must be written back to disk before the memory can be released. So clean pages are reclaimed by releasing the memory directly, while dirty pages are reclaimed by first writing them back to disk and then releasing the memory.

2. Anonymous pages: this memory has no backing store, unlike the file cache, which is backed by files on disk; heap and stack data are examples. Since this memory is likely to be accessed again, it cannot simply be released. It is reclaimed through Linux's swap mechanism: swap first writes infrequently accessed memory to disk and then frees it for processes that need it more. When this memory is accessed again, it is simply read back from disk.

Reclaiming both file pages and anonymous pages is based on the LRU algorithm, which means memory that is accessed least frequently is reclaimed first. The LRU reclaim algorithm actually maintains two doubly linked lists, active and inactive, where:

1. active_list is the active page list, which stores recently accessed (active) memory pages;

2. inactive_list is the inactive page list, which stores rarely accessed (inactive) memory pages;

The closer a memory page is to the tail of the list, the less frequently it is accessed. This way, when reclaiming memory, the system can prioritize reclaiming inactive memory according to how active each page is.
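To make the two-list idea concrete, here is a toy sketch in Go (not kernel code; the names pageLRU, touch, age, and reclaim are made up for illustration): touched pages move to the front of the active list, cold pages age into the inactive list, and reclaim takes them from its tail.

```go
package main

import (
	"container/list"
	"fmt"
)

// pageLRU is a toy model of the kernel's two-list LRU: recently used pages
// sit on the active list, candidates for reclaim sit on the inactive list.
type pageLRU struct {
	active   *list.List // recently accessed pages
	inactive *list.List // rarely accessed pages, reclaimed from the tail
}

func newPageLRU() *pageLRU {
	return &pageLRU{active: list.New(), inactive: list.New()}
}

// touch records an access: the page moves to the front of the active list.
func (l *pageLRU) touch(page int) {
	l.active.PushFront(page)
}

// age demotes the least recently used active page to the inactive list.
func (l *pageLRU) age() {
	if e := l.active.Back(); e != nil {
		l.active.Remove(e)
		l.inactive.PushFront(e.Value)
	}
}

// reclaim frees the coldest page: the tail of the inactive list.
func (l *pageLRU) reclaim() (int, bool) {
	e := l.inactive.Back()
	if e == nil {
		return 0, false
	}
	l.inactive.Remove(e)
	return e.Value.(int), true
}

func main() {
	lru := newPageLRU()
	for p := 1; p <= 4; p++ {
		lru.touch(p)
	}
	lru.age() // page 1 becomes inactive
	if p, ok := lru.reclaim(); ok {
		fmt.Println("reclaimed page", p) // reclaimed page 1
	}
}
```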

Copying a file from directory a to directory b: which system calls are involved in this operation, and how many copies of the data are performed?

Reader Answer: I don't know

Supplement:

A simple way to understand this process: create a new file in directory b, read the data of the original file a with read, and then write it into the new file with write.

Four data copies occur, two of which are by DMA and two by the CPU.

Let's walk through the process (a rough code sketch follows the list):

1. The first copy moves the data on disk into the operating system's kernel buffer; this copy is carried out by DMA.

2. The second copy moves the data from the kernel buffer into the user buffer, so that our application can use it; this copy is done by the CPU.

3. The third copy moves the data that was just placed in the user buffer back into a kernel buffer; this is again done by the CPU.

4. The fourth copy moves the data from the kernel buffer to disk; this copy is once again carried out by DMA.
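Here is the rough code sketch mentioned above, in Go (the paths a/file.txt and b/file.txt are just placeholders): a plain read/write copy loop, which is where the four copies above come from.

```go
package main

import (
	"io"
	"os"
)

// copyFile copies src to dst with a plain read/write loop. Each Read copies
// data from the kernel buffer into the user buffer (a CPU copy; the
// disk-to-kernel copy is done by DMA), and each Write copies it back into a
// kernel buffer before DMA writes it to disk.
func copyFile(src, dst string) error {
	in, err := os.Open(src) // open(2) on the source
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst) // open(2) with O_CREAT on the destination
	if err != nil {
		return err
	}
	defer out.Close()

	buf := make([]byte, 32*1024) // user-space buffer
	for {
		n, err := in.Read(buf) // read(2)
		if n > 0 {
			if _, werr := out.Write(buf[:n]); werr != nil { // write(2)
				return werr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	if err := copyFile("a/file.txt", "b/file.txt"); err != nil {
		panic(err)
	}
}
```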

database

The difference between NoSQL and relational databases

Reader Answers:

The main difference is that relational databases have a relatively uniform format and there are relational mappings between tables, whereas NoSQL is mostly based on key-value pairs and can provide different services in different scenarios. For example, MongoDB is a document store, and graph databases handle graph data; these do not have the strong relational structure of relational databases. That is the main gap between them.

Supplement:

NoSQL and relational databases are two different types of databases with some notable differences. The following is an analysis of the differences between these two database types from various angles:

1. Data model: Relational databases use a table model to store data. Each row in the table represents a record, and each column represents an attribute or field. NoSQL databases use different data models, such as documents, key-value pairs, graphs, and columns.

2. Data structure: The data structure of a relational database is strict, requiring every table to have a predefined schema, including data types, sizes, relationships, and so on. NoSQL databases do not have this restriction and allow the data structure to change dynamically.

3. Scalability: NoSQL databases usually scale well; they can be scaled horizontally by adding more nodes to meet the needs of large-scale data processing and highly concurrent access. Relational databases can also be scaled vertically, that is, by adding more processing power and storage, but this approach has certain limitations.

4. ACID properties: Relational databases usually provide ACID properties (atomicity, consistency, isolation, and durability), which guarantee data integrity and consistency. NoSQL databases usually do not support full ACID properties, but instead offer higher availability and performance.

5. Query language: Relational databases usually use SQL to query and manipulate data; it is easy to learn and supports complex queries and statistical analysis. NoSQL databases use different query languages, such as MongoDB's query language and Cassandra's CQL.

A MySQL table contains a lot of data; how can we speed up queries?

Reader Answer: Indexes can be built to improve data query speed.

Supplement:

When you want to look something up in a book, would you search page by page, or look it up in the book's table of contents?

Everyone knows time is precious: of course you look it up in the table of contents and then turn to the corresponding page. The table of contents plays the role of an index; it lets us find content in the book quickly. So an index is designed around the idea of trading space for time.

Switching to databases, an index is defined as a data structure that helps the storage engine fetch data quickly; to put it vividly, an index is the table of contents of the data.

Therefore, to speed up data queries, we build indexes to increase query speed.

How is the index implemented? Why can it speed up the query?

Reader Answer: My personal understanding is that an index is a way of organizing data: depending on the scenario, the data can be stored in different data structures to speed up queries. If memory is large enough, or you are after the ultimate lookup efficiency, a hash index is very fast. A B+ tree structure is better suited to storage on disk: a leaf node is about 16 KB, which lines up with integer multiples of a disk page size, and it also supports range lookups, which speeds up reading data.

Supplement:

The index of the InnoDB storage engine is a B+ tree, so I will take the B+ tree as an example.

A B+Tree is a multi-way tree: only leaf nodes store data, non-leaf nodes store only index keys, and the data in each node is stored in primary key order. The index value of each parent node also appears among its child nodes' index values, so the leaf nodes contain all the index values, and each leaf node has two pointers, one to the next leaf node and one to the previous leaf node, forming a doubly linked list.

The B+Tree of the primary key index is shown in the figure (in the picture I drew a singly linked list between the leaf nodes, but it is actually a doubly linked list; I can't find the original image to modify it, so please imagine the doubly linked list):

Primary key index B+Tree

For example, we executed the following query statement:

select * from product where id = 5;

This statement uses the primary key index to query the product with id 5. The query proceeds as follows, with the B+Tree searched layer by layer from top to bottom:

1. Compare 5 with the root node's index data (1, 10, 20); since 5 is between 1 and 10, following the B+Tree search logic we move to the second-level node with index data (1, 4, 7);

2. Search the second-level index data (1, 4, 7); since 5 is between 4 and 7, we move to the third-level node with index data (4, 5, 6);

3. Search the leaf node's index data (4, 5, 6), and we find the row whose index value is 5.

It is precisely because the B+ tree index is stored in sorted order that we can quickly locate the corresponding data with an algorithm similar to binary search.
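To show what "an algorithm similar to binary search" looks like inside a node, here is a heavily simplified sketch in Go (the node layout is a toy, not InnoDB's real page format; sort.SearchInts does the binary search over the sorted keys):

```go
package main

import (
	"fmt"
	"sort"
)

// node is a drastically simplified B+Tree node: sorted keys plus either
// children (internal node) or row values (leaf node).
type node struct {
	keys     []int
	children []*node // nil for leaf nodes
	rows     []string
}

// search walks from the root down to a leaf, doing a binary search on the
// sorted keys at every level, then looks the key up in the leaf.
func search(n *node, key int) (string, bool) {
	for n.children != nil {
		// j = number of keys <= key; child j-1 covers the range containing key.
		j := sort.SearchInts(n.keys, key+1)
		if j == 0 {
			return "", false // key is smaller than every key in the tree
		}
		n = n.children[j-1]
	}
	i := sort.SearchInts(n.keys, key)
	if i < len(n.keys) && n.keys[i] == key {
		return n.rows[i], true
	}
	return "", false
}

func main() {
	leaf1 := &node{keys: []int{1, 2, 3}, rows: []string{"row1", "row2", "row3"}}
	leaf2 := &node{keys: []int{4, 5, 6}, rows: []string{"row4", "row5", "row6"}}
	root := &node{keys: []int{1, 4}, children: []*node{leaf1, leaf2}}
	fmt.Println(search(root, 5)) // row5 true
}
```

Running search(root, 5) on this tiny two-level tree descends via binary search at each level and returns the row stored under key 5.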

What is the specific implementation of B+ tree?

Reader's answer: A B+ tree is a tree structure that supports multiple children per node (a multi-way tree). Compared with a balanced binary tree, each level can hold more nodes. What is special about the B+ tree is that all nodes except the leaves store only path and index information; the actual data is stored only in the leaf nodes at the bottom level.

The difference between the B+ tree, the red-black tree, and the B tree

Reader Answers:

1. A red-black tree is a kind of binary tree whose nodes are only of two colors, red and black. Compared with a B+ tree, the depth and height of the whole structure is therefore greater, and for queries a greater height means lower query efficiency. The B+ tree is "shorter and fatter" than the red-black tree and can store more data per level.

2. The B tree is also a multi-way tree, but when scanning a range of data the B tree may need to backtrack, whereas the B+ tree can scan directly by traversal.

Supplement:

The red-black tree also achieves self-balancing through a set of constraints, but those constraints are more complicated and are not the focus of this article; you can read a book on data structures to learn about them.

The following is the process of inserting a node into a red-black tree; the left and right rotations are what keep it self-balanced.

(Figure: inserting a node into a red-black tree)

Whether it is a balanced binary search tree or a red-black tree, the height of the tree grows as more elements are inserted, which means more disk I/O operations and therefore lower overall query efficiency.

Although balanced binary search trees and red-black trees keep the time complexity of queries at O(log n), they are still binary trees at heart: each node can have at most 2 children, so as the number of nodes grows, the height of the tree grows accordingly, which increases the number of disk I/Os and hurts query efficiency.

To reduce the height of the tree, the B tree and the B+ tree were introduced. They no longer restrict a node to 2 children but allow M children (M > 2), which lowers the height of the tree and thereby reduces the number of disk I/Os.

Both the B tree and the B+ tree use a multi-way structure to shorten the height of the tree, so both are well suited to retrieving data stored on disk. However, MySQL's default storage engine, InnoDB, uses the B+ tree as its index data structure for the following reasons:

1. The non-leaf nodes of a B+ tree do not store actual record data, only index keys. With the same amount of data, compared with a B tree, which stores both indexes and records, the non-leaf nodes of a B+ tree can hold more index keys, so the B+ tree is "shorter and fatter" than the B tree, and fewer disk I/Os are needed to reach the bottom level.

2. The B+ tree has a large number of redundant nodes (all non-leaf nodes are redundant indexes). This redundancy makes insertion and deletion more efficient; for example, deleting the root node does not trigger the complex structural changes that happen in a B tree;

3. The leaf nodes of a B+ tree are connected by a linked list, which helps range queries, whereas a B tree can only complete a range query by traversing the tree, which involves disk I/O on many nodes, so its range-query efficiency is lower than the B+ tree's (a small sketch follows this list).
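Here is the small sketch mentioned in point 3: a toy Go model of the leaf-level linked list (again, not InnoDB's real structure), showing that once the starting leaf is located, a range query simply follows next pointers.

```go
package main

import "fmt"

// leaf is a toy B+Tree leaf: sorted keys plus a pointer to the next leaf,
// so all leaves form a linked list at the bottom of the tree.
type leaf struct {
	keys []int
	next *leaf
}

// rangeScan returns every key in [lo, hi], starting from the leaf that
// contains lo (found via a normal top-down search, omitted here) and then
// just walking the leaf chain.
func rangeScan(start *leaf, lo, hi int) []int {
	var out []int
	for l := start; l != nil; l = l.next {
		for _, k := range l.keys {
			if k < lo {
				continue
			}
			if k > hi {
				return out
			}
			out = append(out, k)
		}
	}
	return out
}

func main() {
	l3 := &leaf{keys: []int{7, 8, 9}}
	l2 := &leaf{keys: []int{4, 5, 6}, next: l3}
	l1 := &leaf{keys: []int{1, 2, 3}, next: l2}
	fmt.Println(rangeScan(l1, 2, 8)) // [2 3 4 5 6 7 8]
}
```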

When do we use transactions?

Reader's answer: When there are multiple operations that you want to be atomic, that is, they should either all succeed or all fail together. Such scenarios require transactions.

Supplement:

This is my wallet, with a total of 1 million yuan.

Today I am in a good mood. I decided to transfer 1 million to you. The final result must be that my balance has become 0 yuan, and your balance has increased by 1 million yuan. Are you happy just thinking about it?

The action of transferring money will involve a series of operations in the program. Assume that the process of transferring 1 million to you consists of the following steps:

It can be seen that the transfer process involves two operations that modify the database.

Suppose the server suddenly loses power right after the third step: something terrible happens. My account has been debited 1 million, but the money never arrived in your account, which means the 1 million has simply vanished!

To solve this problem, all the database operations in the transfer must be indivisible: either they all succeed or they all fail, with no intermediate state allowed.

The " transaction (*Transaction*) " in the database can achieve this effect.

We start a transaction before the transfer operation and commit it after all the database operations are finished. For a committed transaction, the modifications it made to the database take effect permanently; if an interruption or error occurs midway, the modifications made during the transaction are rolled back to the state before the transaction started.
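As a sketch of how this looks in application code, here is a minimal Go example using database/sql (the account table with its name and balance columns, and the transfer function itself, are assumptions for illustration, not something from the interview):

```go
package demo

import (
	"context"
	"database/sql"
)

// transfer moves amount from one account to another inside a single
// transaction: either both UPDATEs take effect, or neither does.
// A MySQL driver must be registered elsewhere (e.g. via a blank import).
func transfer(ctx context.Context, db *sql.DB, from, to string, amount int64) error {
	tx, err := db.BeginTx(ctx, nil) // start the transaction
	if err != nil {
		return err
	}
	// Rollback is a no-op if Commit has already succeeded; otherwise it
	// undoes everything done inside the transaction.
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx,
		"UPDATE account SET balance = balance - ? WHERE name = ?", amount, from); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE account SET balance = balance + ? WHERE name = ?", amount, to); err != nil {
		return err
	}
	return tx.Commit() // make the changes permanent
}
```

Either both UPDATEs are committed together, or the deferred Rollback undoes them both, which is exactly the all-or-nothing behavior described above.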

How to ensure atomicity, i.e. that the operations succeed together or fail together?

Reader Answer: Roll back if execution fails; if execution succeeds, the changes are persisted. The undo log is used to guarantee atomicity.

Supplement:

The undo log (rollback log) guarantees the atomicity (Atomicity) in the ACID characteristics of the transaction.

The undo log is a log used for rollback. Before a transaction is committed, MySQL first records the pre-update data in the undo log file; when the transaction is rolled back, the undo log is used to undo the changes. As shown below:

rollback transaction

Whenever the InnoDB engine performs an operation on a record (insert, delete, update), it records all the information needed to roll it back into the undo log, for example:

1. When inserting a record, write down the record's primary key value, so that a later rollback only needs to delete the record with that primary key value;

2. When deleting a record, write down the entire contents of the record, so that a later rollback can insert a record made of those contents back into the table;

3. When updating a record, write down the old values of the updated columns, so that a later rollback can restore those columns to their old values.

When a rollback happens, the data in the undo log is read and the reverse of the original operation is performed. For example, when a record is deleted, its contents are recorded in the undo log; during rollback, that data is read back and an insert is performed.
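To make "record what is needed and apply the reverse operation" concrete, here is a toy sketch in Go (undoRecord and table are invented for illustration and have nothing to do with InnoDB's real undo record format):

```go
package main

import "fmt"

// undoRecord stores just enough to reverse one operation, mirroring the
// three cases above: delete what was inserted, re-insert what was deleted,
// restore the old value of what was updated.
type undoRecord struct {
	op     string // "insert", "delete" or "update"
	key    int
	oldVal string // previous value (used when rolling back delete/update)
}

type table map[int]string

// rollback applies the reverse operations in reverse order.
func (t table) rollback(log []undoRecord) {
	for i := len(log) - 1; i >= 0; i-- {
		r := log[i]
		switch r.op {
		case "insert": // the row was inserted, so roll back by deleting it
			delete(t, r.key)
		case "delete": // the row was deleted, so roll back by re-inserting it
			t[r.key] = r.oldVal
		case "update": // the row was updated, so restore the old value
			t[r.key] = r.oldVal
		}
	}
}

func main() {
	t := table{1: "a"}
	log := []undoRecord{
		{op: "update", key: 1, oldVal: "a"}, // we set t[1] = "A"
		{op: "insert", key: 2},              // we inserted t[2] = "b"
	}
	t[1], t[2] = "A", "b"
	t.rollback(log)
	fmt.Println(t) // map[1:a] — back to the original state
}
```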

Suppose we have just inserted a row and are about to modify another row. Can the inserted data be queried by another transaction at this point?

Reader's answer: It depends on the database's isolation level. If the isolation level is relatively high, another transaction cannot see the modification until the transaction completes; but if the transaction's isolation level is relatively low, it can be seen.

What are the common isolation levels?

Reader Answer: There are four isolation levels: read uncommitted, read committed, repeatable read, and serializable.

Supplement:

The SQL standard defines four isolation levels to avoid these phenomena; the higher the isolation level, the lower the performance. The four isolation levels are as follows:

1. Read uncommitted: changes made by a transaction can be seen by other transactions even before it commits;

2. Read committed: changes made by a transaction can be seen by other transactions only after it commits;

3. Repeatable read: the data seen during a transaction's execution is always consistent with the data seen when the transaction started. This is the default isolation level of the MySQL InnoDB engine;

4. Serializable: a read-write lock is placed on the record. When multiple transactions read and write the same record, a transaction that arrives later and hits a read-write conflict must wait for the earlier transaction to finish before it can continue;

Ordered from the lowest to the highest isolation level: read uncommitted < read committed < repeatable read < serializable.

For different isolation levels, the phenomena that may occur during concurrent transactions will also be different.


That is to say:

1. Under the "read uncommitted" isolation level, dirty reads, non-repeatable reads, and phantom reads may occur;

2. Under the "read committed" isolation level, non-repeatable reads and phantom reads may occur, but dirty reads are impossible;

3. Under the "repeatable read" isolation level, phantom reads may occur, but dirty reads and non-repeatable reads are impossible;

4. Under the "serialization" isolation level, dirty reads, non-repeatable reads, and phantom reads are impossible.

Therefore, to eliminate dirty reads, the isolation level must be raised to at least "read committed"; to eliminate non-repeatable reads, it must be raised to "repeatable read"; and to eliminate phantom reads, it must be raised to "serializable".

Different database vendors support the four isolation levels of the SQL standard to different degrees; some databases implement only a few of them. Although the MySQL we are discussing supports all four isolation levels, there are some differences from the SQL standard in what each isolation level allows to happen.

Under the "repeatable read" isolation level, MySQL can largely avoid phantom reads (note: largely, not completely), so MySQL does not resort to the "serializable" isolation level to avoid phantom reads, because serializable would hurt performance.

Go

Why Go?

Reader Answer: Go supports high concurrency; in my research field Go is what is basically used, and it has a fairly good ecosystem.

Supplement:

Cloud-native projects such as k8s and Docker are implemented in Go and have good API support for Go; Go also has a natural advantage in concurrency, and its syntax is simple.

Why do goroutines support high concurrency?

Reader's answer: Compared with threads, goroutines use far fewer resources, and one thread can run many coroutines. Threads rely on synchronization mechanisms, while coroutines work in a non-blocking style; combined with channels, development does not have to worry as much about concurrency control, because it can be coordinated through channels. And since they are coroutines, they share resources and can interact through shared memory, so compared with the locking mechanisms of threads, coroutines are more efficient.

Supplement:

1. A goroutine is much smaller than a thread: its initial stack is only about 2 KB, so far more goroutines can fit in memory to handle concurrent tasks (see the sketch after this list);

2. Context switching between goroutines does not require switching from user mode to kernel mode, and fewer register values have to be saved, so the switching overhead is lower and efficiency is higher;

3. Go has a goroutine scheduler built into its runtime at the language level, and goroutines are scheduled efficiently as a whole through the GMP model.
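Here is the sketch referred to in point 1: a tiny Go program that launches 100,000 goroutines (an arbitrary number), something that would be unreasonable with one OS thread per task.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 100000 // far more than a sensible number of OS threads
	var wg sync.WaitGroup
	var mu sync.Mutex
	sum := 0

	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(i int) { // each goroutine starts with a small (~2 KB) stack
			defer wg.Done()
			mu.Lock()
			sum += i
			mu.Unlock()
		}(i)
	}
	wg.Wait()
	fmt.Println(sum) // 5000050000
}
```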

If the overhead is high, can't it be solved with a thread pool?

Reader's answer: It can be solved, but the user needs to build a thread pool by himself, which also requires overhead, and all threads need to be managed manually.

Supplement:

Although a thread pool can control the number of threads, the context-switching overhead of threads is still higher than that of coroutines, since it requires switching from user mode to kernel mode. You also need to implement and maintain the thread pool yourself, which is complicated.
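For comparison, the Go-style counterpart of a thread pool is a small worker pool built from goroutines and a channel; a minimal sketch (the pool size of 4 and the squaring "work" are arbitrary):

```go
package main

import (
	"fmt"
	"sync"
)

// workerPool runs jobs from the jobs channel on a fixed number of goroutines,
// which is roughly what a thread pool does with OS threads.
func workerPool(workers int, jobs <-chan int, results chan<- int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j // stand-in for real work
			}
		}()
	}
	wg.Wait()
	close(results)
}

func main() {
	jobs := make(chan int, 10)
	results := make(chan int, 10)
	for i := 1; i <= 10; i++ {
		jobs <- i
	}
	close(jobs)

	go workerPool(4, jobs, results)
	for r := range results {
		fmt.Println(r)
	}
}
```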

What does overhead mean? What management do you think the thread pool needs to do?

Reader's answer: Something like cleaning up idle or expired threads; I'm not sure about the details.

Supplement:

Overhead here refers to the cost of thread switching: switching from user mode to kernel mode and saving the execution context.

1. corePoolSize (number of core threads)

When a task is submitted to the thread pool, if the number of threads already created is less than corePoolSize, a new thread is created to execute the task even if idle threads exist, until the number of created threads reaches corePoolSize. (Besides creating and starting threads on demand as tasks are submitted, you can also start the core threads in advance with prestartCoreThread() or prestartAllCoreThreads().)

2. maximumPoolSize (maximum number of threads)

The maximum number of threads the pool allows to run concurrently. When the queue is full and the number of created threads is less than maximumPoolSize, the pool creates new threads to execute tasks. For an unbounded queue, this parameter can be ignored.

3. keepAliveTime (thread survival time)

When the number of threads in the thread pool is greater than the number of core threads, if the idle time of the thread exceeds the thread survival time, the thread will be destroyed until the number of threads in the thread pool is less than or equal to the number of core threads.

4. workQueue (task queue)

A blocking queue for transferring and holding tasks waiting to be executed.

5. threadFactory (thread factory)

Used to create new threads. Threads created by the threadFactory are still created with new Thread(), and their names follow a uniform style: pool-m-thread-n (m is the thread pool's number, and n is the thread's number within that pool).

6. handler (thread saturation strategy)

When both the thread pool and the queue are full, newly submitted tasks are handled by this rejection policy.

Goroutine scheduling

Reader's answer: Basically it is the GMP model. Goroutines are cached in queues, waiting for a P (processor) to schedule them; the actual executor is M, which is a thread. The number of M basically depends on the CPU, and it is recommended to keep it roughly equal to the number of CPU cores.
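As a small aside (my addition, not from the interview), the P count the reader mentions can be checked at runtime; by default GOMAXPROCS equals the number of CPU cores:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) only queries the current value; by default it equals NumCPU().
	fmt.Println("CPU cores:", runtime.NumCPU())
	fmt.Println("P (GOMAXPROCS):", runtime.GOMAXPROCS(0))
	fmt.Println("live goroutines:", runtime.NumGoroutine())
}
```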

interview summary

Impressions

My operating system knowledge ran out quickly; I hadn't paid attention to the underlying technical principles.

I got rejected in the later rounds.

Shortcomings

1. My grasp of operating system fundamentals is not solid enough.

2. I picked up too few of the points the interviewer was driving at.


Source: blog.csdn.net/JACK_SUJAVA/article/details/130082729