Data migration experience

  A few days ago I went on a business trip to a customer site to help migrate data. After several days of work, the migration is finally automated and the batch jobs can now run on a daily schedule. Here I would like to share the pitfalls I stepped into along the way (quite possibly just a result of my limited hands-on experience).
  First, allow me a small complaint: working outside your familiar development environment is painful. The machine I was given had no IDE, not even Java installed, and no tool that could connect to the database. The only good news was that Xshell was there, though its shortcuts did not match my preferences at all. There was no way around it: if I wanted to work efficiently I had to configure everything myself. Setting up the environment cost me the whole morning, but development went noticeably faster afterwards. After lunch, I got down to business.
  I received the migration task in the afternoon and found that one database had many tables, and some of them were far from small. I fell silent for a few seconds, feeling that my whole week would be spent here. In those few seconds I also thought through the migration process: the data first lives in a relational database; we then use Sqoop to upload it to HDFS; next we map the HDFS data with Hive external tables; and finally we build internal tables to hold the complete data. The internal tables are usually partitioned and bucketed ORC tables with indexes, so queries against them are clearly much faster than against the external tables. I will walk through each step below.

Preparation:

   When you get a database to migrate, first analyze which tables are large and which are small, separate them, and handle them with different migration approaches. Ideally the customer provides the row count of each table; if not, you can only run select count(*) from each table yourself. Having these counts not only makes it easy to tell large tables from small ones, it is also a great help when verifying the data afterwards.
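   A minimal sketch of how those counts can be collected in one pass, assuming a MySQL source; the database name, credentials, and output file are placeholders for illustration, not the actual environment I worked in:

    #!/bin/bash
    # Count the rows of every table in the source database and save the result
    # for later comparison with the Hive side.
    DB=source_db                      # placeholder database name
    PASS='***'                        # placeholder password
    OUT=table_counts.txt
    > "$OUT"
    for tbl in $(mysql -uroot -p"$PASS" -N -e "show tables" "$DB"); do
        cnt=$(mysql -uroot -p"$PASS" -N -e "select count(*) from \`$tbl\`" "$DB")
        echo -e "${tbl}\t${cnt}" >> "$OUT"
    done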

1. Migrating data from the relational database to HDFS

  I used Sqoop for the data migration. Sqoop is relatively slow, but its learning cost is low, it is easy to generate its statements in bulk, and it does not demand much development effort. First test that a single Sqoop job can migrate data successfully, then write a script that generates the Sqoop statements in batch, and finally call these statements to migrate the data in parallel in the background. A few tips on using Sqoop:

  • If cluster resources are ample (most brand-new clusters have no production jobs yet), the sqoop statement can include:
    -m    set this parameter to a value greater than 1 to run multiple map tasks in parallel for the extraction.
    --split-by    when -m is set greater than 1, this parameter must be set as well; it specifies which column of the table is used to split the rows among the map tasks.

    Choose a well-distributed column for --split-by so that each map task ends up extracting roughly the same amount of data (see the example command after this list).

  • If a table is very large, for example more than a hundred million rows, split its extraction across multiple sqoop jobs:

    --query: specify a where condition and extract only part of the data.
       The benefit: if you use a single sqoop job and it dies after extracting 90% of the data, the whole extraction fails; you have wasted the time and still do not have the data. With several sqoop jobs, they not only run in parallel, but each job also handles a modest amount of data; if one job dies, you only need to re-extract the range covered by its where condition, which also makes troubleshooting much easier.
       Choosing the split condition is just as important: a date column is usually used in the where clause, so that each sqoop job extracts roughly the same amount of data.

    • Directory plan for the data extracted by sqoop
      # directory plan for small tables
      /tmp/db_name/table_name
      # directory plan for large tables
      /tmp/db_name/table_name/partition_name
  • Query statement: when writing the query that extracts the data in the sqoop command, remember not to do this:

    - ×
    select * from table;
    - √
    select field1, field2, ... from table;
    otherwise the sqoop extraction may be slow, and the data may even fail to be extracted at all.
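
    Putting the tips above together, here is a rough example of what one such sqoop statement can look like; the JDBC URL, credentials, column names, and date range are placeholders for illustration, not the actual job I ran:

    # placeholder connection, columns, and date range
    sqoop import \
      --connect jdbc:mysql://db-host:3306/source_db \
      --username etl \
      --password '***' \
      --query "select id, device_id, status, update_time from t_app_equipment_status \
               where update_time >= '2019-07-01' and update_time < '2019-08-01' and \$CONDITIONS" \
      --target-dir /tmp/source_db/t_app_equipment_status/201907 \
      --split-by id \
      -m 4 \
      --fields-terminated-by '\001' \
      --null-string '\\N' \
      --null-non-string '\\N'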
       With the points above in mind, you can write a script that generates a sqoop statement for each table in batch: fetch the table metadata from the relational database by database name and table name, and assemble the sqoop statements from it. Finally, write a job script that executes these sqoop statements on schedule (a sketch of the generation idea follows at the end of this section).
    Actual numbers to share:
       I compared the time taken by a single sqoop job versus multiple parallel sqoop jobs on a large data set.
    For roughly 600 GB of data:
       - Multiple parallel sqoop jobs: 3 to 4 hours.
       - A single sqoop job: more than 12 hours.
       - A single sqoop job with -m 1 extracted 50 million rows in about 27 minutes.
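
    As for generating the sqoop statements in batch, here is a sketch of the idea; the metadata source, file layout, and the large-table threshold are my own assumptions rather than the exact script used on site:

    #!/bin/bash
    # Generate one sqoop import command per table, based on the row counts
    # collected in the preparation step, and write them into a runnable script.
    DB=source_db
    COUNTS=table_counts.txt           # "table<TAB>rowcount" from the preparation step
    OUT=run_sqoop.sh
    echo '#!/bin/bash' > "$OUT"
    while IFS=$'\t' read -r tbl cnt; do
        if [ "$cnt" -lt 100000000 ]; then
            # small table: a single sqoop job writing to the small-table directory
            echo "sqoop import --connect jdbc:mysql://db-host:3306/$DB --username etl --password '***' --table $tbl --target-dir /tmp/$DB/$tbl -m 1 &" >> "$OUT"
        else
            # large table: split into several --query jobs by date (handled per table)
            echo "# TODO: split $tbl ($cnt rows) into multiple --query jobs" >> "$OUT"
        fi
    done < "$COUNTS"
    echo 'wait' >> "$OUT"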

2. Creating the external table mappings

   This step maps the extracted HDFS data into Hive through external tables. There is nothing difficult about it; it mainly depends on the customer's requirements: the external tables may go into a dedicated database, or follow a uniform naming convention such as table_name_ext. But do not write the create-table statements by hand one at a time; with more than a hundred tables that is a quick way to lose your mind. Instead, use the relational database's metadata to generate the Hive create-table statements. Here is an example with MySQL:

-- Run this against the information_schema database of the source MySQL instance.
SELECT CONCAT('create table ', TABLE_NAME, '(',
              substring(column_info, 1, length(column_info) - 1), ')',
              ' comment ', '"', TABLE_COMMENT, '"', ';')
FROM (SELECT TABLE_NAME, TABLE_COMMENT,
             group_concat(CONCAT(COLUMN_NAME, ' ', DATA_TYPE, ' comment ', '"', COLUMN_COMMENT, '"')) AS column_info
      FROM (SELECT t1.TABLE_NAME,
                   CASE WHEN t2.TABLE_COMMENT IS NULL OR t2.TABLE_COMMENT = '' THEN t1.TABLE_NAME ELSE t2.TABLE_COMMENT END AS TABLE_COMMENT,
                   COLUMN_NAME,
                   CASE WHEN DATA_TYPE = 'varchar' THEN 'string' WHEN DATA_TYPE = 'int' THEN 'int'
                        WHEN DATA_TYPE = 'tinyint' THEN 'tinyint' WHEN DATA_TYPE = 'decimal' THEN 'double'
                        WHEN DATA_TYPE = 'datetime' THEN 'string' WHEN DATA_TYPE = 'timestamp' THEN 'string'
                        WHEN DATA_TYPE = 'float' THEN 'double' WHEN DATA_TYPE = 'double' THEN 'double'
                        WHEN DATA_TYPE = 'bigint' THEN 'bigint' END AS DATA_TYPE,
                   CASE WHEN COLUMN_COMMENT IS NULL OR COLUMN_COMMENT = '' THEN COLUMN_NAME ELSE COLUMN_COMMENT END AS COLUMN_COMMENT
            FROM COLUMNS t1
            JOIN TABLES t2 ON t1.TABLE_NAME = t2.TABLE_NAME
            WHERE t1.TABLE_NAME = 't_app_equipment_status') t3
      GROUP BY TABLE_NAME, TABLE_COMMENT) t4;

There are many examples of this online, so I will not go into more detail here. After the external tables are created, it is best to compare the row counts: the count of each Hive external table should match the count of the corresponding table in the relational database, which also verifies that the sqoop step did not lose any data.
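For reference, this is roughly what the external table for a small table ends up looking like once the generated DDL is adjusted into an external table with a location; the column list here is a made-up placeholder, and the delimiter and location simply follow the sqoop settings and directory plan above:

    create external table t_app_equipment_status_ext (
        id           bigint comment 'primary key',
        device_id    string comment 'device id',
        status       int    comment 'device status',
        update_time  string comment 'last update time'
    )
    comment 'equipment status (external mapping table)'
    row format delimited fields terminated by '\001'
    location '/tmp/source_db/t_app_equipment_status';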
Pit encountered:
  When a large table is extracted partition by partition, it cannot be mapped to a plain external table directly; you need to create a partitioned external table and map each extracted directory to its own partition (see the sketch below).
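
A sketch of that mapping, assuming the external table for the large table was created with partitioned by (stat_date string) and that the extraction directories follow the large-table plan above (the names are placeholders):

    alter table t_big_table_ext add if not exists
        partition (stat_date = '201907')
        location '/tmp/source_db/t_big_table/201907';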

3. Creating efficient internal tables

   In this step, the data from the external table is inserted into an internal table that has the same fields but an optimized layout. The internal table is usually partitioned and bucketed, indexed, or stored in a columnar format such as ORC; in short, it is the table that makes queries much faster, also called the business table.
   I designed the internal table structure with query efficiency in mind. Let me describe how I decide on the bucketing column. The benefits of bucketing are:
   (1) Higher query processing efficiency. Buckets add an extra structure to the table, and Hive can take advantage of it for certain queries. In particular, a join between two tables bucketed on the same column (the join column) can be executed as an efficient map-side join: rows with the same join-column values fall into corresponding buckets, so the join only has to combine matching buckets, which greatly reduces the amount of data processed.
   (2) More efficient sampling. When working with large data sets, it is very convenient during development and query tuning to run a query against a small sample of the data, and bucketing makes that cheap.
   Now, how to choose the bucketing column: if the table has a primary key, use the primary key as the bucketing column; if it has no primary key, pick a few well-distributed candidate columns and run:

 select count(distinct field) from table;

and use the column with the largest distinct count as the bucketing column. As for the number of buckets, I suggest a prime number: with a non-prime bucket count such as 9, column values like 18 and 27 all land in the same bucket, which can create a bucket "hot spot" and skew the data.
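A sketch of the kind of internal table definition this leads to; the table name, columns, partition column, and the bucket count of 31 are placeholders I chose for illustration:

    create table t_app_equipment_status (
        id           bigint comment 'primary key',
        device_id    string comment 'device id',
        status       int    comment 'device status',
        update_time  string comment 'last update time'
    )
    comment 'equipment status (internal business table)'
    partitioned by (stat_date string comment 'date partition')
    clustered by (id) into 31 buckets
    stored as orc;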

4. Inserting from the external tables into the internal tables

   This step is second only to the sqoop extraction in time consumed. Since the internal tables are partitioned, we need Hive dynamic partitioning for the insert; the statement generally looks like:

    insert into table table_1 partition (par_field) select field1, field2, ..., par_field from table_2;

Note that the last field in the select list must be the partition field.
   When the insert finishes, check that all the data was inserted successfully; at this point: row count in the relational database = row count in the external table = row count in the internal table.
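
Before running such an insert, dynamic partitioning has to be enabled in the Hive session. A minimal sketch, reusing the placeholder tables from above (the per-node partition limit is just an illustrative value):

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions.pernode=1000;

    insert into table t_app_equipment_status partition (stat_date)
    select id, device_id, status, update_time,
           to_date(update_time) as stat_date   -- the partition column comes last
    from t_app_equipment_status_ext;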

5. Handling the daily incremental data

   Once the full migration is complete, the external table is really just a staging area: data only passes through it. Once you have confirmed that the internal table holds the complete data, you can empty the external table's data, which is exactly why we put its location under /tmp. After the external data is cleared, each day's incremental data is extracted into the external table's location, and then the full content of the external table is inserted into the internal table. That is how the daily increments get imported into Hive successfully.
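
A sketch of what the daily job can look like; the dates, paths, credentials, and table names are placeholders, and the incremental condition assumes the source table has an update_time column:

    #!/bin/bash
    # Daily incremental load: pull yesterday's rows into the external table's
    # location, then append them to the internal table with dynamic partitioning.
    DAY=$(date -d "1 day ago" +%Y-%m-%d)

    # 1. extract yesterday's increment into the external table's location
    sqoop import \
      --connect jdbc:mysql://db-host:3306/source_db \
      --username etl --password '***' \
      --query "select id, device_id, status, update_time from t_app_equipment_status \
               where date(update_time) = '$DAY' and \$CONDITIONS" \
      --delete-target-dir \
      --target-dir /tmp/source_db/t_app_equipment_status \
      --split-by id -m 2

    # 2. insert the staged data from the external table into the internal table
    hive -e "
      set hive.exec.dynamic.partition=true;
      set hive.exec.dynamic.partition.mode=nonstrict;
      insert into table t_app_equipment_status partition (stat_date)
      select id, device_id, status, update_time, to_date(update_time)
      from t_app_equipment_status_ext;
    "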


Origin blog.51cto.com/14048416/2425833