Sqoop incremental import

core parameters

  • --check-column 
    Specifies the column used to determine which rows count as incremental data during an incremental import, typically an auto-increment field or a timestamp in the relational database. 
    Note: the specified column cannot be a character type such as char or varchar, and --check-column can specify multiple columns.
  • --incremental 
    Specifies the incremental import mode; the two modes are append and lastmodified.
  • --last-value 
    Specifies the maximum value that the check column reached in the previous import.
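
To make these parameters concrete, here is a minimal Python sketch (not Sqoop code; the rows, column name, and last-value are invented for illustration) of the selection that append mode performs on the check column:

```python
# Sketch of append-mode selection: conceptually, Sqoop runs
# "WHERE check_column > last_value" against the source table.
# The rows, column name, and last-value below are illustrative only.

rows = [
    {"id": 1, "name": "neil"},
    {"id": 2, "name": "jack"},
    {"id": 6, "name": "james"},
    {"id": 7, "name": "luna"},
]

def append_increment(rows, check_column, last_value):
    """Return only the rows whose check-column value exceeds last_value."""
    return [r for r in rows if r[check_column] > last_value]

print(append_increment(rows, "id", 5))  # only the rows with id 6 and 7
```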

Append mode incremental import in practice

First, execute the following command to do a full import of the existing data:

sqoop import \
--connect jdbc:mysql://master:3306/test \
--username hive \
--password 123456 \
--table customer \
-m 1

Use hdfs dfs -cat to view the generated data file and confirm the data has been imported. Then insert two rows into the customer table in MySQL:

insert into customer values(6,'james');
insert into customer values(7,'luna');

Execute the following command to perform the incremental import:

sqoop import \
--connect jdbc:mysql://master:3306/test \
--username hive \
--password 123456 \
--table customer \
--check-column id \
--incremental append \
--last-value 5

In database tables, an auto-increment field is often used as the primary key. Here we use the id field to decide whether a row is incremental data. --last-value is set to the maximum id value from the previous import, so Sqoop imports only the rows whose id is greater than 5 (here, ids 6 and 7), achieving an incremental import. 
Note: if --last-value is not specified, all rows in the table are imported again, which causes data redundancy.
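
Between runs you must keep track of the new maximum yourself. A minimal sketch (a hypothetical helper, not part of Sqoop) of computing the next --last-value from the rows just imported:

```python
def next_last_value(imported_rows, check_column):
    # The next --last-value is simply the maximum check-column
    # value among the rows imported in this run.
    return max(r[check_column] for r in imported_rows)

imported = [{"id": 6, "name": "james"}, {"id": 7, "name": "luna"}]
print(next_last_value(imported, "id"))  # 7
```

In practice, a saved sqoop job can record this state in its metastore so the value does not have to be tracked by hand.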

Lastmodified mode in practice

First we create a customertest table with a timestamp field:

create table customertest(id int,name varchar(20),last_mod timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP);

Then insert the following records:

insert into customertest(id,name) values(1,'neil');
insert into customertest(id,name) values(2,'jack');
insert into customertest(id,name) values(3,'martin');
insert into customertest(id,name) values(4,'tony');
insert into customertest(id,name) values(5,'eric');

The last_mod timestamp is set to update automatically whenever a row is created or modified. 
Now execute the sqoop command to import the data into HDFS:

sqoop import \
--connect jdbc:mysql://master:3306/test \
--username hive \
--password 123456 \
--table customertest \
-m 1

Insert one more row into the customertest table:

insert into customertest(id,name) values(6,'james');

Then run an incremental import in lastmodified mode:

sqoop import \
--connect jdbc:mysql://master:3306/test \
--username hive \
--password 123456 \
--table customertest \
--check-column last_mod \
--incremental lastmodified \
--last-value "2016-12-15 15:47:29" \
-m 1 \
--append 

The record we just inserted has been imported, but two rows were actually appended. Why? 
This is because in lastmodified mode, rows whose check-column value is greater than or equal to --last-value are treated as incremental data, so the row whose last_mod exactly equals the boundary value is imported again. 
Note: 
When using lastmodified mode, you must specify how the incremental data is written: append mode (--append) or merge mode (--merge-key).
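
The >= comparison explains the extra row. A sketch of lastmodified selection (illustrative rows with invented timestamps; the last_mod of id=5 equals the boundary on purpose):

```python
from datetime import datetime

rows = [
    {"id": 5, "name": "eric",  "last_mod": datetime(2016, 12, 15, 15, 47, 29)},
    {"id": 6, "name": "james", "last_mod": datetime(2016, 12, 15, 15, 52, 0)},
]

def lastmodified_increment(rows, last_value):
    # lastmodified mode selects check_column >= last_value, so a row whose
    # timestamp exactly equals the boundary is picked up again.
    return [r for r in rows if r["last_mod"] >= last_value]

picked = lastmodified_increment(rows, datetime(2016, 12, 15, 15, 47, 29))
print([r["id"] for r in picked])  # [5, 6] -- two rows, hence the duplicate
```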
Next we demonstrate an incremental update using --merge-key mode. Update the name field of the row with id 1:

update customertest set name = 'Neil' where id = 1;

After the update, the row's last_mod timestamp is set to the system time at which the update was made. Execute the following command, using the id field as the merge key:

sqoop import \
--connect jdbc:mysql://master:3306/test \
--username hive \
--password 123456 \
--table customertest \
--check-column last_mod \
--incremental lastmodified \
--last-value "2016-12-15 15:47:30" \
-m 1 \
--merge-key id 

Since merge-key mode runs a complete MapReduce job, the customertest directory ends up containing reducer output files such as part-r-00000, and we can see that the name of the row with id=1 has been modified while the row with id=6 has been added.
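
The merge step itself can be sketched as a keyed upsert (a Python simulation of what the merge job does, not Sqoop code; the rows are illustrative):

```python
def merge_by_key(existing, increment, key):
    # Rows from `increment` replace existing rows with the same key;
    # unmatched incremental rows are appended (a keyed upsert).
    merged = {r[key]: r for r in existing}
    for r in increment:
        merged[r[key]] = r
    return sorted(merged.values(), key=lambda r: r[key])

existing  = [{"id": 1, "name": "neil"}, {"id": 2, "name": "jack"}]
increment = [{"id": 1, "name": "Neil"}, {"id": 6, "name": "james"}]
print(merge_by_key(existing, increment, "id"))
# id=1 is updated to "Neil"; id=6 is appended
```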
