Big data table synchronization

       Some time ago, our project team needed to synchronize a table holding tens of millions of rows. The goal was not complicated: copy one of the customer's tables into our own database, joining it with several of our own tables along the way.

       At first, for security and other reasons, the customer provided only a CSV file exported from the table: more than 700M in size, holding 12 million rows. LZ got carried away and started coding with spring+ibatis without thinking it through. A small application was finished in about a morning, and LZ was rather pleased with it. The real test that afternoon was a hard blow: the system could not cope at all. On the local machine, memory overflowed once the test data passed 1 million rows. LZ adjusted the environment variables and raised the memory configuration, but at 2 million rows all of the system's memory was in use. Suspecting that the laptop was simply not powerful enough, LZ deployed the code to a Linux host, starting with the default maximum memory of 2G. The result was the same as on the laptop. LZ then raised the memory setting to the maximum of 16G and took a chance, but the program still gave out at 7 million rows.

       LZ was not discouraged and came up with a compromise. Loading all the data at once was too much for the machine, so the load was split into batches of 100,000 rows each. Watching the rows go into the database batch by batch, LZ was very pleased, deployed the program to Linux to run in the background, and planned to check the result the next morning. That result left LZ dumbfounded: after 13 hours the data was still crawling along like a snail. Only after 16 hours were all 12 million rows finally stored.
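The batch idea can be sketched roughly as follows. This is only a minimal illustration, not LZ's original code: the class name `ChunkedCsvLoader`, the method `loadInBatches`, and the `sink` callback (which stands in for whatever writes a batch to the database) are all made up for the example. It also assumes one record per line, with no quoted fields that contain line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: streams a CSV source and hands it on in fixed-size batches.
class ChunkedCsvLoader {

    /**
     * Reads the source line by line and passes the lines to the sink in
     * batches of at most batchSize, so only one batch is buffered at a time.
     * Returns the total number of lines read.
     */
    static long loadInBatches(Reader source, int batchSize, Consumer<List<String>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    // Start a fresh list so the finished batch can be garbage collected.
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                sink.accept(batch); // last, possibly smaller, batch
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

With a batch size of 100,000, a call such as `ChunkedCsvLoader.loadInBatches(reader, 100_000, batch -> store(batch))` keeps memory bounded by one batch, where `store` is a placeholder for the real insert logic. It is still a single thread doing all the work, which matches the slow result described above.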

       The function worked, but the synchronization was far too slow to meet the customer's needs. LZ began to doubt the performance of ibatis: it is a wrapper around JDBC, so the extra layer is bound to cost something. LZ spent an hour reworking the code to call JDBC directly, redeployed it, and started the second round of testing. There was an improvement: data synchronization was clearly faster, but still not satisfactory. LZ estimated that the program would still need 8-9 hours to finish.
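The post does not show what the JDBC version looked like. One common way to write bulk inserts against plain JDBC is a single `PreparedStatement` with `addBatch`/`executeBatch` and one commit at the end. The sketch below assumes that approach; the class `JdbcBatchWriter`, the method `insertAll`, and the `flushSize` parameter are hypothetical names, and whether LZ batched statements this way is not stated.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper: bulk insert through plain JDBC with statement batching.
class JdbcBatchWriter {

    /**
     * Inserts all rows through one PreparedStatement, sending the pending
     * statements to the database every flushSize rows and committing once
     * at the end. Returns the number of executeBatch calls made.
     */
    static int insertAll(Connection conn, String insertSql, List<String[]> rows, int flushSize)
            throws SQLException {
        int flushes = 0;
        conn.setAutoCommit(false); // one commit at the end instead of one per row
        try (PreparedStatement ps = conn.prepareStatement(insertSql)) {
            int pending = 0;
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]); // JDBC parameter indexes start at 1
                }
                ps.addBatch();
                if (++pending == flushSize) {
                    ps.executeBatch();
                    flushes++;
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch(); // remaining rows that did not fill a full batch
                flushes++;
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // leave the table untouched if anything failed
            throw e;
        }
        return flushes;
    }
}
```

The SQL passed in would be something like `INSERT INTO t (id, name) VALUES (?, ?)`, with the table and column names depending on the real schema. Every column is bound as a string here to keep the sketch short; real code would bind each column with its proper type.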

       While the program was running, LZ was already thinking about new ideas, and multi-threading naturally came to mind: a single thread was simply too slow. The plan was to use the JDK thread pool with 20 threads, each handling about 100,000 rows. With the program's maximum memory at 2G, that puts about 2 million rows in memory at any one time. Since spring was already loading the properties, all of these parameters were made configurable so they could be tuned to the size of the deployment environment. At this point LZ ran into another problem. Because the data source was a CSV file rather than a table, reading it from multiple threads was awkward, and it was hard to pinpoint the offending data when an error occurred. In the end LZ accepted a cost of 3 minutes at startup to split the 12-million-row CSV file, 700M in total, into 120 small files of 6.7M each, which the threads then processed and stored. After deploying to the Linux environment, the third round of testing began. The program finished in 80 minutes, which fully met the customer's requirements. With the maximum memory raised to 16G, it finished in 40 minutes. However, to keep the Linux host from going down, the 2G limit was kept for safety.
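The split-then-parallelize approach might look something like the sketch below. It is an illustration under assumptions rather than the original program: `SplitAndLoad`, `split`, `processAll`, and `countLines` are invented names, and `countLines` is only a stand-in for the real per-file work of parsing rows and writing them to the database.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

// Hypothetical helper: split one big CSV into part files, then process them in a pool.
class SplitAndLoad {

    /** Splits source into files of at most linesPerFile lines inside outDir. */
    static List<Path> split(Path source, Path outDir, int linesPerFile) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            try {
                String line;
                int linesInPart = 0;
                while ((line = reader.readLine()) != null) {
                    if (writer == null) { // open the next part file lazily
                        Path part = outDir.resolve(String.format("part-%04d.csv", parts.size()));
                        parts.add(part);
                        writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                    }
                    writer.write(line);
                    writer.newLine();
                    if (++linesInPart == linesPerFile) {
                        writer.close();
                        writer = null;
                        linesInPart = 0;
                    }
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return parts;
    }

    /** Runs worker on every part in a fixed-size pool and sums the results. */
    static long processAll(List<Path> parts, int threads, ToLongFunction<Path> worker)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Path part : parts) {
                Callable<Long> task = () -> worker.applyAsLong(part);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // rethrows a worker failure as ExecutionException
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in worker: a real one would parse the part file and insert its rows. */
    static long countLines(Path part) {
        try (BufferedReader reader = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
            return reader.lines().count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the numbers from the post, this would be `split(source, dir, 100_000)` followed by `processAll(parts, 20, worker)`. Because each part is its own file, a failure can be traced back to a specific part file, which is the convenience the paragraph above describes.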

       LZ does not claim that these ideas and practices are the best; someone could probably bring the run time down to half an hour, or even 10 minutes. But LZ hopes that the way this requirement was improved step by step, and the final result, shows that ideas are always worth more than the technology itself, and that innovation and breakthroughs only come from learning and then applying what you have learned. Programs can change the world, as long as you are willing to use your brain.

Origin http://10.200.1.11:23101/article/api/json?id=327039633&siteId=291194637