Alibaba final-round interview: how to insert 1 billion rows into MySQL as fast as possible

To import 1 billion rows into the database as quickly as possible, first clarify the problem with the interviewer: where does the data live and in what form, how large is each row, must the import preserve order, must duplicates be avoided, and is the target database MySQL?

After clarifying these assumptions with the interviewer, we settle on the following constraints:

  1. 1 billion rows, each about 1 KB in size

  2. The data is unstructured user access logs, which need to be parsed before being written to the database.

  3. The data is stored on HDFS or S3 distributed file storage.

  4. The 1 billion rows are not one huge file; they are roughly split into 100 files, with a suffix marking their order.

  5. The data must be imported in order, and duplicates should be avoided as far as possible.

  6. The database is MySQL

First, consider whether it is feasible to write all 1 billion rows into a single MySQL table.

Can a single table hold 1 billion rows?

The answer is no: the recommended size of a single table is below 20 million (2000W) rows. How is this value derived?

MySQL's index data structure is a B+ tree, and all row data is stored in the leaf nodes of the primary key (clustered) index. The insert and query performance of a B+ tree is directly related to its height: below roughly 20 million rows the tree has three levels, while above that it may grow to four.

The page size of a B+ tree node in MySQL is 16 KB. Since each row here is about 1 KB, a leaf page can be thought of as holding 16 rows. A non-leaf page is also 16 KB, but it only stores primary keys and pointers to child pages. Assuming a BIGINT primary key (8 bytes) and InnoDB's 6-byte page pointer, each entry takes 14 bytes, so a non-leaf page can hold 16 * 1024 / 14 ≈ 1170 entries.

That is, each non-leaf node can point to 1170 child nodes, and each leaf node stores 16 rows. From this we can tabulate the maximum capacity for each B+ tree height. Above roughly 20 million rows the tree needs 4 levels and performance degrades further.

Levels    Maximum rows
2         1170 * 16 = 18,720
3         1170 * 1170 * 16 = 21,902,400 ≈ 20 million
4         1170 * 1170 * 1170 * 16 = 25,625,808,000 ≈ 25.6 billion

To keep the arithmetic simple, we cap each table at 10 million (1000W) rows, so the 1 billion rows are spread across 100 tables.

How to write to the database efficiently

Inserting rows one at a time performs poorly, so use batch inserts instead. The batch size can be adjusted dynamically; with 1 KB rows, a default batch of 100 rows is a reasonable starting point.

How do we ensure that a whole batch either succeeds or fails as a unit? MySQL's InnoDB storage engine wraps the batch write in a transaction, so all rows in the batch commit or roll back together.
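As a rough illustration, here is a minimal JDBC sketch of a batched, transactional insert. The database URL, table, column names and the LogRecord type are hypothetical, and the MySQL Connector/J option rewriteBatchedStatements=true is generally needed for the driver to rewrite the batch into a single multi-row INSERT:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical parsed log line: primary key id + raw content.
record LogRecord(long id, String content) {}

static void insertBatch(List<LogRecord> batch) throws SQLException {
    String url = "jdbc:mysql://127.0.0.1:3306/database_0?rewriteBatchedStatements=true";
    String sql = "INSERT INTO table_0 (id, content) VALUES (?, ?)";
    try (Connection conn = DriverManager.getConnection(url, "user", "password");
         PreparedStatement ps = conn.prepareStatement(sql)) {
        conn.setAutoCommit(false);            // wrap the whole batch in one transaction
        for (LogRecord r : batch) {           // e.g. 100 rows per batch by default
            ps.setLong(1, r.id());
            ps.setString(2, r.content());
            ps.addBatch();
        }
        ps.executeBatch();
        conn.commit();                        // the batch commits or rolls back as a unit
    }
}

In production the connection would come from a pool and be reused across batches rather than opened per call.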

Writes must support retries: if a batch fails, retry it; if it still fails after N retries, fall back to inserting the 100 rows one by one, log the rows that still fail, and then discard them.

In addition, inserting in primary key order gives the fastest write path, whereas secondary (non-primary-key) indexes are not necessarily inserted sequentially, and the frequent index reorganisation they cause slows inserts down. It is best not to create secondary indexes at all, or to create them only after the import finishes, to keep insert performance as high as possible.

Do you need to write to the same table concurrently?

No.

  1. Concurrent writes to the same table cannot guarantee that rows are written in order.

  2. Raising the batch-insert size already increases insert throughput to a degree, so there is no need for concurrent writes to a single table.

MySQL storage engine selection

MyISAM has better insert performance than InnoDB, but it gives up transaction support: there is no guarantee that a batch insert succeeds or fails as a whole, so when a batch times out or fails and is retried, some duplicate rows are bound to appear. Still, for the sake of faster import speed, the MyISAM storage engine can be kept as one of the candidate plans.

At this point I will borrow someone else's benchmark results: a comparative analysis of MyISAM and InnoDB.

[Benchmark charts comparing MyISAM and InnoDB insert performance omitted]

The figures show that batch writes are significantly faster than single-row writes, and that once InnoDB's flush-on-every-commit policy is relaxed, InnoDB's insert performance is not much worse than MyISAM's.

innodb_flush_log_at_trx_commit controls MySQL's strategy for flushing the redo log to disk:

  1. The default is 1: the log is flushed to disk on every transaction commit, which is the safest setting and loses no data.

  2. When set to 0 or 2, the log is flushed to disk roughly once per second; if MySQL or the server crashes, up to 1 second of data may be lost.

Given that InnoDB's batch performance is also good once the flush-on-every-commit policy is relaxed, we tentatively choose InnoDB first (if the company's MySQL cluster does not allow this variable to be changed, MyISAM may be used instead). When testing in the production environment, focus on comparing the insert performance of the two engines.
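For reference, a hedged sketch of relaxing the flush policy for the import window and restoring it afterwards; url, user and password are placeholders, and changing a GLOBAL variable requires elevated privileges (SUPER or SYSTEM_VARIABLES_ADMIN), which managed clusters often do not grant:

try (Connection conn = DriverManager.getConnection(url, "user", "password");
     Statement stmt = conn.createStatement()) {
    // Flush the redo log roughly once per second instead of on every commit.
    stmt.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2");
    // ... run the bulk import ...
    // Restore full durability once the import is finished.
    stmt.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 1");
}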

Do we need to shard across multiple databases?

A single MySQL instance has a bottleneck for concurrent writes; in general, 5K write TPS is already very high.

The data here is assumed to live on SSDs, which should perform well. With HDDs, sequential reads and writes are fast, but an HDD cannot cope with concurrent writes: with 10 tables per database being written concurrently, even though each table is written sequentially, the tables sit at different positions on disk, and an HDD has only one head, so it must keep seeking back and forth. This adds a lot of latency and throws away the benefit of sequential I/O. So on HDDs, concurrently writing multiple tables in one database is not a good plan. Back to the SSD case: different SSD vendors and models have different write capabilities, both in bandwidth and in concurrency. Some sustain 500 MB/s, some 1 GB/s; some handle 8 concurrent writers, some only 4. Before testing in production we simply do not know the actual numbers.

Therefore, the design must be more flexible and support the following capabilities:

  1. Configurable number of databases

  2. Configurable number of tables written concurrently per database (if MySQL sits on an HDD, only one table is written sequentially at a time and the other tasks wait)

With these knobs, the system can adjust the number of databases and the table-write concurrency at runtime. Whether the storage is HDD or SSD, and whatever the SSD vendor or model, the configuration can be tuned until we get the best performance. That is the design idea: there is no fixed magic number; everything must be dynamically adjustable.
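For illustration, a minimal sketch of such adjustable settings (the property names and defaults are made up; in practice they would come from a configuration center so they can be changed at runtime):

// Hypothetical import settings, adjustable without redeploying.
public class ImportConfig {
    // Number of physical databases the data is sharded across.
    private volatile int databaseCount = Integer.getInteger("import.databaseCount", 10);
    // Tables in one database that may be written concurrently:
    // 1 keeps writes sequential for an HDD; raise it gradually for SSDs and measure.
    private volatile int tableWriteConcurrency = Integer.getInteger("import.tableWriteConcurrency", 1);
    // Rows per batch insert; tuned online (100, 1000, 10000, ...).
    private volatile int batchSize = Integer.getInteger("import.batchSize", 100);

    public int getDatabaseCount() { return databaseCount; }
    public int getTableWriteConcurrency() { return tableWriteConcurrency; }
    public int getBatchSize() { return batchSize; }
}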

Next, file reading. With 1 billion rows of 1 KB each, the total is about 931 GB. A single file of nearly 1 TB is rarely produced, so we assume the data has already been split into roughly 100 files of similar size. Why 100 pieces rather than, say, 1000 pieces with more read concurrency to import faster? As discussed, database write throughput is limited by the disk, and on any disk reads are faster than writes. Reading only has to pull bytes from a file, while writing makes MySQL maintain indexes, parse SQL, run transactions, and so on. So the maximum write concurrency is 100, and the file-read concurrency does not need to exceed 100 either.

More importantly, making the file-read concurrency equal to the number of tables simplifies the model: 100 read tasks and 100 write tasks map one-to-one onto 100 tables.

How to ensure that writes to the database are in order

Since the data is split into 100 files of about 10 GB each, the file suffix plus the line number within the file can serve as the unique key of a record, as long as the contents of one file always go to the same table. For example:

  1. index_90.txt is written to database database_9, table_0,

  2. index_67.txt is written to database database_6, table_7.

This way each table is ordered on its own, and global order is captured by the database suffix plus the table name suffix.
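A small sketch of this routing rule, assuming the file names follow the index_{NN}.txt pattern above:

// Map "index_67.txt" to database_6 / table_7: tens digit -> database suffix, ones digit -> table suffix.
static String[] route(String fileName) {
    String suffix = fileName.replace("index_", "").replace(".txt", "");
    int fileIndex = Integer.parseInt(suffix);   // 0..99
    int databaseIndex = fileIndex / 10;
    int tableIndex = fileIndex % 10;
    return new String[]{"database_" + databaseIndex, "table_" + tableIndex};
}

// route("index_90.txt") -> ["database_9", "table_0"]
// route("index_67.txt") -> ["database_6", "table_7"]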

How to read files faster

Obviously, a 10 GB file cannot be read into memory in one go. Common ways to read a file include:

  1. Files.readAllBytes, which loads the whole file into memory at once

  2. FileReader + BufferedReader, reading line by line

  3. File + BufferedReader

  4. Scanner, reading line by line

  5. Java NIO FileChannel, reading into a fixed-size buffer

On a Mac, a comparison of reading a 3.4 GB file with these methods:

Reading method                               Time
Files.readAllBytes                           out of memory (OOM)
FileReader + BufferedReader, line by line    11 s
File + BufferedReader                        10 s
Scanner                                      57 s
Java NIO FileChannel, buffer mode            3 s

For detailed evaluation content, please refer to: File Reading Performance Comparison: https://zhuanlan.zhihu.com/p/142029812

Clearly Java NIO's FileChannel is the fastest, but FileChannel reads a fixed-size buffer and does not support reading line by line; there is also no guarantee that a buffer ends exactly on a line boundary. If the last bytes of the buffer fall in the middle of a line, extra work is needed to stitch them together with the next read. Turning buffers back into whole lines is the awkward part.

File file = new File("/xxx.zip");
long now = System.currentTimeMillis();
long size = 0;
// try-with-resources closes the stream and channel automatically.
try (FileInputStream fileInputStream = new FileInputStream(file);
     FileChannel fileChannel = fileInputStream.getChannel()) {

    int capacity = 1 * 1024 * 1024; // 1 MB buffer
    ByteBuffer byteBuffer = ByteBuffer.allocate(capacity);
    int bytesRead;
    while ((bytesRead = fileChannel.read(byteBuffer)) != -1) {
        // Count only the bytes actually read; byteBuffer.array() always returns the
        // full 1 MB backing array and would over-count the final partial buffer.
        size += bytesRead;
        // Reset position to 0 and limit to capacity so the next read fills the buffer from the start.
        byteBuffer.clear();
    }
    System.out.println("file size:" + size);
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println("Time:" + (System.currentTimeMillis() - now));

Java NIO is buffer-based; a ByteBuffer can be turned into a byte array, but that array then has to be decoded into strings and split into lines by hand.

BufferedReader (plain Java IO), by contrast, supports line-by-line reading out of the box, and its performance is acceptable: about 30 seconds for a 10 GB file. Since the overall bottleneck is the database write path, a 30-second read does not hurt overall throughput. So we read files with BufferedReader, line by line, i.e. option 3.
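A minimal sketch of option 3, assuming a hypothetical insertBatch helper, an illustrative 8 MB read buffer and a placeholder file path; the line number is tracked so it can be combined with the file suffix to form each record's primary key:

try (BufferedReader reader = new BufferedReader(new FileReader("/data/index_67.txt"), 8 * 1024 * 1024)) {
    List<String> batch = new ArrayList<>(100);
    long lineNumber = 0;
    String line;
    while ((line = reader.readLine()) != null) {
        lineNumber++;                       // file suffix + lineNumber forms the primary key
        batch.add(line);                    // parse the raw log line here
        if (batch.size() == 100) {
            // insertBatch(fileIndex, lineNumber, batch);  // hypothetical batch writer
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        // insertBatch(fileIndex, lineNumber, batch);      // flush the final partial batch
    }
} catch (IOException e) {
    e.printStackTrace();
}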

How to coordinate file-reading tasks and database-writing tasks

This section is confusing, so please read it patiently.

Can we simply have 100 read tasks, each reading a batch and writing it to the database immediately? As noted above, the database's concurrent-write bottleneck means one database cannot take large concurrent batches on all 10 of its tables at once. So 100 tasks writing simultaneously would mean every database has 10 tables being written "sequentially" at the same time, which amplifies the concurrent write pressure on the disk. To keep speed up while limiting the slowdown caused by concurrent disk writes, some write tasks have to be paused. Does read concurrency need to be limited as well? No.

If read tasks and write tasks are merged, read concurrency becomes coupled to write concurrency. The initial plan was therefore to keep reading and writing separate so that neither delays the other, but this turned out to be harder to design in practice.

The original idea was to introduce Kafka: 100 read tasks publish data to Kafka, and write tasks consume from Kafka and write to the DB. But with 100 read tasks publishing, ordering is lost, so how do we keep database writes ordered? The thought was to use Kafka partition routing, i.e. route all messages of one read task (by task ID) to the same partition, so consumption within each partition stays ordered.

How many partitions would we need? 100 is clearly too many; with fewer than 100, say 10, messages from several tasks inevitably end up mixed together. If several tables of the same database share one Kafka partition, and that database only supports batch writes to a single table at a time rather than concurrent writes to several tables, then messages for the tables that currently cannot be written can only be dropped because of the concurrency limit. The scheme ends up complex and hard to implement.

So the Kafka approach was dropped, and the idea of fully separating read and write tasks was shelved for now.

The final design is simplified: a read task reads a batch and writes that batch itself. That is, one task is responsible both for reading the file and for inserting into the database.

How to ensure task reliability

What if a read task is interrupted halfway by a crash or a service release? Or the database fails, writes keep failing, and the task is terminated for the time being? How do we make sure that, when the task is restarted, it resumes from the breakpoint without writing duplicates?

As mentioned above, each record can be given a primary key id, namely the file suffix index plus the line number within the file. Idempotent writes are guaranteed through this primary key.

The line number within a file is at most about 10 GB / 1 KB = 10 million, i.e. 10,000,000. Splicing the largest file suffix, 99, in front gives a maximum id of roughly 9,910,000,000, which still fits comfortably in a BIGINT.

So the database does not need an auto-increment primary key; the primary key id can be supplied explicitly during the batch insert.
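For illustration, one way to encode such an id, assuming at most 100 million lines per file so the line number gets eight decimal digits:

// Hypothetical encoding: file suffix in front, eight digits reserved for the line number.
// index_99.txt, line 10_000_000 -> id = 9_910_000_000, still well within BIGINT range.
static long recordId(int fileIndex, long lineNumber) {
    return fileIndex * 100_000_000L + lineNumber;
}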

What if another job also needs to import data? To keep primary key ids isolated, the taskId would have to be spliced in as well, e.g. {taskId}{fileIndex}{fileRowNumber} converted to a Long. But if the taskId is large, the spliced value may overflow when converted to Long.

Worse, if some tasks write 10 million rows and others only 1 million, the number of digits each component occupies is not fixed, so Long-typed splicing can collide. And if we instead splice strings like {taskId}_{fileIndex}_{fileRowNumber} and add a unique index, insert performance drops and no longer meets the goal of the fastest possible import. So we need another plan.

Instead, Redis can be used to record the progress of each task: after a batch is written to the database successfully, the task's progress is advanced in Redis.

INCRBY KEY_NAME INCR_AMOUNT

increments the current progress by 100, e.g. incrby task_offset_{taskId} 100. If a batch insert fails, retry it; after repeated failures, fall back to single-row inserts and single updates to Redis. To make sure the Redis update itself goes through, retries can be added around the Redis update as well.

If you are worried about consistency between the Redis progress and the database, you can instead consume the database binlog and increment the Redis counter by one for every row added.

When a task is interrupted, first read its offset from Redis, then skip the file to that offset and continue processing.
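A hedged sketch with Redisson (the Redis client used later in this article), following the task_offset_{taskId} key pattern above:

// Track one task's import progress in Redis.
RedissonClient redissonClient = Redisson.create(config);
RAtomicLong offset = redissonClient.getAtomicLong("task_offset_" + taskId);

// After a batch of 100 rows is written to the database successfully:
offset.addAndGet(100);            // equivalent to INCRBY task_offset_{taskId} 100

// When the task restarts after an interruption:
long resumeFrom = offset.get();   // skip this many lines of the file, then continue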

How to coordinate the concurrency of reading tasks

As mentioned earlier, to avoid too many concurrent inserts into the tables of a single database hurting its performance, concurrency has to be limited. How?

Since read tasks and write tasks are merged, limiting write concurrency means limiting the read tasks themselves: only a limited set of read-and-write tasks runs at any one time.

Before doing this, you need to design the storage model of the task table.

[Task table schema screenshots omitted; the key fields are listed below]

  1. bizId is a reserved field so that other product lines can be supported later; the default is 1, meaning the current business line.

  2. databaseIndex is the assigned database suffix

  3. tableIndex is the assigned table name suffix

  4. parentTaskId is the id of the overall import job

  5. offset records the current progress of the task

  6. Importing the 1 billion rows is split into 100 tasks, so 100 taskIds are created, each handling one portion of the data, i.e. one 10 GB file.

  7. status marks whether the task is pending, in progress, or finished.
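Since the original schema screenshot is not reproduced here, a rough sketch of the sub-task record with the fields listed above (all types are assumptions):

// Rough sketch of one sub-task row; column types are assumptions.
public class ImportTask {
    private long id;            // taskId of this sub-task
    private long parentTaskId;  // id of the overall import job
    private int bizId;          // business line, default 1
    private int databaseIndex;  // assigned database suffix
    private int tableIndex;     // assigned table name suffix
    private long offset;        // rows already imported, i.e. current progress
    private int status;         // pending / in progress / finished (encoding assumed)
    // getters and setters omitted
}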

How should tasks be assigned to nodes? Preemption is one option: each node tries to claim tasks, and a node may hold at most one task at a time. Concretely, each node runs a scheduled job that periodically scans the task table for pending subtasks and tries to execute them.

How to limit concurrency? A Redisson semaphore can be used, keyed by the database index:

RedissonClient redissonClient = Redisson.create(config);
// In practice the semaphore key would include the database index, as noted above.
RSemaphore rSemaphore = redissonClient.getSemaphore("semaphore");
// Allow a concurrency of 1.
rSemaphore.trySetPermits(1);
// Try to acquire a permit without blocking; returns false if none is available.
boolean acquired = rSemaphore.tryAcquire();

Each task node polls on a schedule; once it grabs a permit it starts the task, sets the task status to in progress, and releases the permit when the task finishes or fails.

[Sequence diagram: the scheduled polling job queries the task table for pending tasks, loops trying to acquire the Redis semaphore, and on success marks the task as in progress, records the start time, reads the current offset, reads the file from that offset, batch-inserts into the database, updates the progress, and releases the semaphore before applying for the next task.]

But rate-limiting with a semaphore has a problem: what if a task forgets to release its permit, or the process crashes and never releases it? One answer is to give the permit a timeout. But then, what if the task runs longer than the timeout, the permit is released early, another client grabs it, and two clients end up writing the same task at the same time?

Wait, weren't we just importing 1 billion rows? How did this turn into a distributed-lock timeout problem?

In fact there is no good way to solve the semaphore timeout problem with Redisson semaphores. The natural idea is renewal: while the task is running, whenever the permit is found to be about to expire, extend it for a while so it never lapses. But Redisson does not provide renewal for semaphores. So what now?

Let's look at it from a different angle. We have been trying to make many nodes compete for a semaphore to limit concurrency. Instead we can elect a master node and have it poll the task list. There are three situations:

Case 1: the number of running tasks is below the concurrency limit.

  1. Pick the pending task with the smallest id, set its status to in progress, and publish a message.

  2. The process that consumes the message acquires a distributed lock for the task and starts processing it, releasing the lock when it is done. Thanks to Redisson's lock renewal (watchdog), the lock is guaranteed not to expire before the task completes (see the sketch after these three cases).

Case 2: the number of running tasks equals the concurrency limit.

  1. The master node checks whether each in-progress task still holds its lock.

  2. If the lock is gone, that task failed and should be re-dispatched. If the lock is still held, the task really is running.

Case 3: the number of running tasks exceeds the concurrency limit.

  1. Report the anomaly, raise an alert, and intervene manually.
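A minimal Redisson sketch of the per-task lock from Case 1 (the lock name and processTask are illustrative); acquiring the lock without a lease time lets Redisson's watchdog keep renewing it while the task runs:

// Each task is protected by its own lock; the watchdog renews it automatically
// while the thread holds it, so a long-running import cannot lose the lock midway.
RLock lock = redissonClient.getLock("import_task_lock_" + taskId);
if (lock.tryLock()) {            // non-blocking: false means another node already runs this task
    try {
        processTask(taskId);     // read the file from the recorded offset and batch-insert
    } finally {
        lock.unlock();           // release only when the task finishes or fails
    }
}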

Having the master node poll the task list reduces contention over tasks. It publishes messages via Kafka, and the process that receives a message handles that task. To get more nodes involved in consumption, the number of Kafka partitions can be increased. Even if one node ends up handling several tasks at once, it does not hurt, because the performance bottleneck is the database.

How should the master node be elected? Zookeeper + Curator leader election is a reliable choice.
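A minimal sketch of leader election with Curator's LeaderLatch (the connection string and ZK path are placeholders):

// Elect a single master that polls the task table and dispatches tasks.
CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
client.start();

LeaderLatch leaderLatch = new LeaderLatch(client, "/import/master");
leaderLatch.start();

if (leaderLatch.hasLeadership()) {
    // Only the elected master scans the task table and publishes tasks to Kafka.
}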

Many factors affect how long it takes to load 1 billion rows, including the type and performance of the database disks. With more shards, say 1,000 databases, it would certainly be faster; the sharding layout has to be decided from the actual production setup and largely determines the write rate. Finally, the batch-insert size is not fixed; it needs continuous testing and tuning, trying 100, 1,000, 10,000 and so on to find the optimum.

Finally, let's summarize the key points.

Summary

  1. Confirm the constraints first, then design the solution, and work out the main direction the interviewer wants to probe. For example, how to split a 1 TB file into small files is hard, but it may not be what the interviewer wants to examine.

  2. The data volume alone calls for sharding across databases and tables, and roughly fixes the number of tables.

  3. Analysing the write bottleneck of a single database shows that multiple databases are needed.

  4. Because disks differ in how well they handle concurrent writes, the number of tables written concurrently in the same database must be limited, and it must be dynamically adjustable so the optimal value can be found in the production environment.

  5. MySQL's InnoDB and MyISAM storage engines differ in write performance, which also has to be compared and verified online.

  6. The optimal batch-insert size has to be found through repeated testing.

  7. Because of the concurrency limits, separating read tasks and write tasks via Kafka is hard, so read and write tasks are merged.

  8. Redis is needed to record task progress; after a failure, the recorded progress lets the re-import resume without duplicating data.

  9. Coordinating distributed tasks is hard, and Redisson semaphores cannot solve the renewal-on-timeout problem. Instead, a master node assigns tasks and per-task distributed locks guarantee exclusive execution, with the master elected via Zookeeper + Curator.


Source: blog.csdn.net/weixin_37576193/article/details/134240305