Preconditions:
The table holds hundreds of millions of rows; the only index is on id, and there is no index on the creation time column.
Goals:
- Synchronize Alibaba Cloud RDS MySQL table data to Hive, partitioned by the creation date of the MySQL rows, one partition per day for convenient querying (a sketch of the target layout follows this list)
- Run a crontab job every day to back up the incremental data, i.e., incremental by the auto-increment id
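For concreteness, here is a minimal sketch of what that layout could look like, created through `hive -e`. All names are made up for illustration: an `orders_tmp` staging table and an `orders` formal table with one partition per day under a `dt` column.

```python
import subprocess

# Hypothetical layout: an unpartitioned staging table that the import loads
# into, and a formal table that keeps one partition per day in the dt column.
hql = """
CREATE TABLE IF NOT EXISTS orders_tmp (
  id BIGINT,
  amount DOUBLE,
  create_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';

CREATE TABLE IF NOT EXISTS orders (
  id BIGINT,
  amount DOUBLE,
  create_time STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';
"""
subprocess.run(["hive", "-e", hql], check=True)
```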
Problems encountered:
- An index on the creation time cannot be added, so querying by time range would seriously degrade the performance of the online database. How should the data be read?
  Read incrementally by id (the only indexed column) into a temporary table, then dump the temporary table into the formal table, dynamically partitioned by the creation time. A sketch of that transfer follows.
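A minimal sketch of the staging-to-formal transfer, reusing the hypothetical `orders`/`orders_tmp` tables from above: the partition value is derived from each row's `create_time`, so Hive routes every row into its daily partition.

```python
import subprocess

# Dynamic partitioning must be enabled so Hive derives the partition value
# from the SELECT output instead of a fixed PARTITION clause.
hql = """
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orders PARTITION (dt)
SELECT id, amount, create_time, to_date(create_time) AS dt
FROM orders_tmp;
"""
subprocess.run(["hive", "-e", hql], check=True)
```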
- Import into Hive directly with sqoop, or import into HDFS and load the data into a designated temporary table as a Hive-managed (internal) table?
  Direct sqoop Hive import did not fit here: the query statement cannot be customized to extract only the needed fields, and there is no way to pin the data to an id range. Hive import can only synchronize a table in full, whereas a free-form query lets the id conditions be placed wherever they are needed. Since the query approach is more flexible for me, I use sqoop to import into HDFS and synchronize the data through the managed table, as sketched below.
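A minimal sketch of that import path, with made-up connection details (`rds-host`, `mydb`, `sync_user`, the password file, the target directory) and the hypothetical `orders` table. The free-form `--query` selects only the needed columns and pins the id range; the `$CONDITIONS` token is mandatory in a sqoop free-form query, and sqoop replaces it with each mapper's split predicate.

```python
import subprocess

def sqoop_import(lo, hi, target_dir="/tmp/orders_batch"):
    # Free-form query import: only the wanted columns, only the (lo, hi] id range.
    query = ("SELECT id, amount, create_time FROM orders "
             f"WHERE id > {lo} AND id <= {hi} AND $CONDITIONS")
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://rds-host:3306/mydb",
        "--username", "sync_user",
        "--password-file", "/user/hadoop/.mysql_pwd",
        "--query", query,
        "--split-by", "id",
        "--num-mappers", "4",
        "--fields-terminated-by", "\\t",
        "--target-dir", target_dir,
        "--delete-target-dir",
    ], check=True)
    # Move the imported files from HDFS into the Hive staging table.
    subprocess.run([
        "hive", "-e",
        f"LOAD DATA INPATH '{target_dir}' INTO TABLE orders_tmp",
    ], check=True)
```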
- How much should each read/write batch cover? One cannot read too much at a time without affecting the performance of the online database (which is Alibaba Cloud RDS).
1. First query max(id) in MySQL and max(id) in Hive, compute the difference, and load it in batches. Querying the Hive max does not use a direct Hive connection but a roundabout strategy: Python calls the system command line to run `hive -e`, writes the result to the local file system, and then reads the maximum value back from the local file.
2. Here I load 3 million rows at a time; a single batch covers an id difference of at most 3 million.
3. Sqoop runs the job split into 4 map tasks; reading 3 million rows takes about 1-2 minutes.
4. Querying min(id) and max(id) in MySQL to determine the id range of each batch takes about 5 seconds.
5. A single map task needs roughly 15 seconds to read the data and hand it to sqoop; the sqoop write into the HDFS-backed managed table is in no danger of becoming a bottleneck, so I won't cover it here. A driver sketch for this batching logic follows the list.
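A minimal sketch of the driving logic for steps 1-2, under the same made-up names. It captures the `hive -e` output on the local file system precisely because there is no direct Hive connection, assumes MySQL credentials come from `~/.my.cnf`, and reuses the `sqoop_import` helper from the earlier sketch.

```python
import subprocess

BATCH = 3_000_000  # 3 million rows per load, as in step 2

def hive_max_id():
    # Roundabout strategy: shell out to hive -e, write the result to a local
    # file, then read the maximum back from that file.
    path = "/tmp/hive_max_id.txt"
    with open(path, "w") as f:
        subprocess.run(["hive", "-e", "SELECT COALESCE(MAX(id), 0) FROM orders"],
                       stdout=f, check=True)
    with open(path) as f:
        return int(f.read().strip())

def mysql_max_id():
    # -N -B: no header, tab-separated output; credentials assumed in ~/.my.cnf.
    out = subprocess.run(["mysql", "-h", "rds-host", "-N", "-B",
                          "-e", "SELECT MAX(id) FROM mydb.orders"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

lo, target = hive_max_id(), mysql_max_id()
while lo < target:                  # load the difference in chunks of <= 3M ids
    hi = min(lo + BATCH, target)
    sqoop_import(lo, hi)            # the sqoop sketch shown earlier
    lo = hi
```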
- When the Hive table is first created and synchronized, the dynamic partitions have to be rebuilt from the temporary table into the formal table. The data span is too large; wouldn't rebuilding everything every day take too much time?
  Step 1: For the initial synchronization, do not enable the transfer to the formal table with dynamic partitioning. First synchronize the full data into the Hive temporary table; after the synchronization completes, transfer the whole temporary table into the formal table, writing the data into partitions dynamically.
  Step 2: Once the full data has been synchronized, create a crontab task that calls the synchronization script on a schedule, inserts the incremental data into the temporary table, and then transfers the data for the specified dates into the formal partitions. Each run synchronizes from the last synced id up to today's maximum id; the dates involved are yesterday and today, and the formal table's data for yesterday is filtered out first (the yesterday partition is rewritten in full), so nothing is duplicated. A sketch of this transfer follows.
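A minimal sketch of step 2's transfer, again over the hypothetical `orders`/`orders_tmp` tables: with dynamic partitioning, `INSERT OVERWRITE` rewrites only the partitions that appear in the SELECT output, so restricting the query to yesterday and today replaces yesterday's partial data without touching older days.

```python
import datetime
import subprocess

today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)

# Only the yesterday/today partitions appear in the SELECT output, so only
# those two partitions are overwritten; all older partitions are left alone.
hql = f"""
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE orders PARTITION (dt)
SELECT id, amount, create_time, to_date(create_time) AS dt
FROM orders_tmp
WHERE to_date(create_time) IN ('{yesterday}', '{today}');
"""
subprocess.run(["hive", "-e", hql], check=True)
```

Scheduling this script from crontab (for example, one nightly run) completes the incremental pipeline.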