Apache Sqoop Installation and Usage (Learning Notes)

One, Apache Sqoop
1. Introduction to Sqoop
Apache Sqoop is a tool for transferring data between the Hadoop ecosystem and RDBMS systems. It comes from the Apache Software Foundation.

Sqoop works by translating import and export commands into MapReduce programs. The translated MapReduce job mainly customizes the InputFormat and OutputFormat.
The Hadoop ecosystem includes HDFS, Hive, HBase, and others.
RDBMS systems include MySQL, Oracle, DB2, and others.
Sqoop can be understood as: "SQL to Hadoop and Hadoop to SQL".

Looking at the data transfer problem from Apache's standpoint, data can be imported or exported:
Import: importing data. RDBMS ----> Hadoop
Export: exporting data. Hadoop ----> RDBMS
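To make the two directions concrete, here is a minimal sketch of what an import and an export command look like; the host node-1, database userdb, and the table and directory names are placeholders borrowed from the examples later in this article, and the full working commands are given in the sections below.

bin/sqoop import --connect jdbc:mysql://node-1:3306/userdb \
 --username root --password hadoop \
 --table emp --target-dir /sqoopresult --m 1     # RDBMS -> HDFS

bin/sqoop export --connect jdbc:mysql://node-1:3306/userdb \
 --username root --password hadoop \
 --table employee --export-dir /emp/emp_data     # HDFS -> RDBMS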

2. Sqoop installation
The prerequisite for installing Sqoop is an existing Java and Hadoop environment.
The latest stable version: 1.4.6
Modify the configuration file:
cd $SQOOP_HOME/conf

mv sqoop-env-template.sh sqoop-env.sh
vi sqoop-env.sh
# modify the configuration

export HADOOP_COMMON_HOME= /export/servers/hadoop-2.7.5 
export HADOOP_MAPRED_HOME= /export/servers/hadoop-2.7.5
export HIVE_HOME= /export/servers/hive

Add the MySQL JDBC driver package

cp /hive/lib/mysql-connector-java-5.1.32.jar $SQOOP_HOME/lib/

Verify the installation

bin/sqoop list-databases \
 --connect jdbc:mysql://localhost:3306/ \
 --username root --password hadoop

This command lists all databases on the MySQL server.
At this point, the Sqoop installation is complete.

Two, Sqoop import
"Import Tool" import individual tables from the RDBMS to HDFS. Each row in the table is considered HDFS records. All records are stored as a text file of the text data
following syntax is used to import the data HDFS.
Import Sqoop $ (the Generic-args) (Import-args)
Sqoop test data tables

Create a database named userdb in MySQL, then execute the referenced SQL script
to create three tables: emp (employees), emp_add (employee addresses), and emp_conn (employee contacts).
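The script itself is not reproduced here. As a rough sketch inferred from the sample data shown later in this article, the emp table looks roughly like the following (emp_add and emp_conn are analogous; their exact columns are not shown in this post):

create table emp(
  id int primary key,
  name varchar(20),
  deg varchar(20),
  salary int,
  dept varchar(10)
);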

1. Full import of a MySQL table into HDFS
The command below imports the emp table from the MySQL database server into HDFS.

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--delete-target-dir \
--target-dir /sqoopresult \
--table emp --m 1

Here --target-dir specifies the HDFS directory where the imported data is stored;
use the actual IP address or hostname in the MySQL JDBC URL.

To verify the data imported into HDFS, use the following command to view it:
hdfs dfs -cat /sqoopresult/part-m-00000
You can see that by default the fields of the emp table are comma-separated in HDFS.
The delimiter can be specified with --fields-terminated-by '\t' (see the example after the sample output below).

1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
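For example, a variant of the import command above that writes tab-separated output could look like this (only the delimiter option differs from the command shown earlier):

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--delete-target-dir \
--target-dir /sqoopresult \
--fields-terminated-by '\t' \
--table emp --m 1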

2. Full import of a MySQL table into Hive
2.1. Method one: first copy the table structure into Hive, then import the data
Copy the structure of the relational table into Hive:

bin/sqoop create-hive-table \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--table emp_add \
--username root \
--password hadoop \
--hive-table test.emp_add_sp

Where:
--table emp_add is the table in the MySQL database sqoopdb.
--hive-table test.emp_add_sp is the name of the new table to create in Hive.

Import the data from the relational database into the Hive table:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--username root \
--password hadoop \
--table emp_add \
--hive-table test.emp_add_sp \
--hive-import \
--m 1

2.2. Method two: copy the table structure and data directly into Hive

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table emp_conn \
--hive-import \
--m 1 \
--hive-database test;

3. Importing a subset of table data (--where filter)
A --where clause can be specified when importing data from a relational database. The corresponding SQL query is executed on the database server and the result is stored in the target directory on HDFS.

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/sqoopdb \
--username root \
--password hadoop \
--where "city ='sec-bad'" \
--target-dir /wherequery \
--table emp_add --m 1

4. Importing a subset of table data (--query)
Notes:
When using --query with a SQL statement, the --table parameter must not be added;
the query must include a WHERE clause;
the WHERE clause must contain the string $CONDITIONS;
and the SQL statement must be enclosed in single quotes, not double quotes.

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /wherequery12 \
--query 'select id,name,deg from emp WHERE  id>1203 and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 2

In Sqoop commands, --split-by id is generally used together with a parameter such as --m 10. It specifies the field used to split the data and the number of map tasks to start.

5. Incremental import

In practical work, most of the time only incremental data needs to be imported; there is no need to import the entire table into HDFS or Hive each time, which would cause data duplication. Therefore increments are usually imported based on certain fields, and Sqoop supports incremental data import.
Incremental import is a technique that imports only the newly added rows of a table.

--check-column (col)
Specifies the column(s) used to determine which rows to import incrementally, similar to the auto-increment and timestamp fields used for incremental loading in relational databases.
Note: these columns cannot be of character types such as char, varchar, etc. Also, --check-column can specify multiple columns.
--incremental (mode)
append: append mode; for example, records whose check-column value is greater than the specified last-value are imported. lastmodified: last-modified mode; records added or changed after the date specified by last-value are imported.
--last-value (value)
Specifies the maximum value of the check column from the previous import (rows with greater values are imported); the value can also be set manually.

5.1. Append mode incremental import

First, import the existing data with the following command:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /appendresult \
--table emp --m 1

Use hadoop fs -cat to view the generated data file and confirm the data is already in HDFS.
Then insert two new rows into the emp table in MySQL:

insert into `userdb`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) values ('1206', 'allen', 'admin', '30000', 'tp');
insert into `userdb`.`emp` (`id`, `name`, `deg`, `salary`, `dept`) values ('1207', 'woon', 'admin', '40000', 'tp');

Execute the following command to perform the incremental import:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root  --password hadoop \
--table emp --m 1 \
--target-dir /appendresult \
--incremental append \
--check-column id \
--last-value  1205

Finally, verify the target directory: a new file containing the incremental data can be found there.
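For example (a sketch; the exact part file names depend on the job):

hdfs dfs -ls /appendresult
hdfs dfs -cat /appendresult/part-m-*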

5.2. Lastmodified mode incremental import
First create a customertest table with a timestamp field:

create table customertest(id int, name varchar(20), last_mod timestamp default current_timestamp on update current_timestamp);

The timestamp is set when the row is created and is updated whenever the row changes.
Insert the following records:

insert into customertest(id,name) values(1,'neil');
insert into customertest(id,name) values(2,'jack');
insert into customertest(id,name) values(3,'martin');
insert into customertest(id,name) values(4,'tony');
insert into customertest(id,name) values(5,'eric');

Execute the Sqoop command to import all the data into HDFS:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--target-dir /lastmodifiedresult \
--table customertest --m 1

View the imported data at this point:
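For example (a sketch; with --m 1 the output file is normally part-m-00000):

hdfs dfs -cat /lastmodifiedresult/part-m-00000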

Then insert one more row into the customertest table:

insert into customertest(id,name) values(6,'james')

Import the increment using lastmodified incremental mode:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table customertest \
--target-dir /lastmodifiedresult \
--check-column last_mod \
--incremental lastmodified \
--last-value "2019-05-28 18:42:06" \
--m 1 \
--append

Here we only inserted one new record, yet two rows were imported. Why?
This is because in lastmodified mode the import also includes rows whose check-column value is equal to last-value, not only those greater than it.

5.3. Lastmodified mode: append and merge-key

When using lastmodified mode for incremental processing, you must specify whether incremental data is added in append mode or in merge-key mode.
The following demonstrates an incremental update using merge-key mode. First, update the name field of the row with id 1:

update customertest set name = 'Neil' where id = 1;

After the update, the timestamp of this row is automatically updated to the system time of the update.
Execute the following command, using the id field as the merge key:

bin/sqoop import \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table customertest \
--target-dir /lastmodifiedresult \
--check-column last_mod \
--incremental lastmodified \
--last-value "2019-05-28 18:42:06" \
--m 1 \
--merge-key id

Since merge-key mode runs a complete MapReduce job,
the file generated in the lastmodifiedresult directory is named part-r-00000. In it you will find that the name of the row with id = 1 has been modified and the new row with id = 6 has been added.
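A quick way to check the merged result (a sketch):

hdfs dfs -cat /lastmodifiedresult/part-r-00000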

Four, Sqoop export

Before exporting data from the Hadoop ecosystem to an RDBMS, the target table must already exist in the target database.
Export has three modes:
Default mode: rows from the data files are inserted into the table using INSERT statements.
Update mode: Sqoop generates UPDATE statements that replace existing records in the database.
Call mode: Sqoop makes a stored procedure call for each record.

The following is the export command syntax:

$ sqoop export (generic-args) (export-args)

1. Default mode: exporting HDFS data to MySQL
By default, sqoop export converts each input line into an INSERT statement and adds it to the target database table. If the table has constraints (for example, a primary key column whose values must be unique) and already contains data, care must be taken to avoid inserting records that violate those constraints: if an INSERT statement fails, the export job fails. This mode is mainly intended for exporting records into an empty table that can receive them, and is usually used for full-table export.
All records of an HDFS file or Hive table (either all fields or a subset of fields) can be exported to the target MySQL table.

1.1. HDFS data preparation

Create a file emp_data.txt under the "/emp/" directory of the HDFS file system (one way to do this is shown after the sample data):
1201,gopal,manager,50000,TP
1202,manisha,preader,50000,TP
1203,kalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
1206,satishp,grpdes,20000,GR
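One possible way to prepare this layout (a sketch, assuming the data was first written to a local file emp_data.txt; /emp/emp_data is created as a directory so that it matches the --export-dir used below):

hdfs dfs -mkdir -p /emp/emp_data
hdfs dfs -put emp_data.txt /emp/emp_data/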
1.2. Manually create the target table in MySQL

mysql> USE userdb;
mysql> CREATE TABLE employee ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT,
   dept VARCHAR(10));

1.3. Export command execution

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table employee \
--export-dir /emp/emp_data

1.4. Related configuration parameters
--input-fields-terminated-by '\t'
Specifies the field delimiter of the input files.
--columns
Selects the columns to export and controls their order. When the field order in the data files exactly matches the column order of the target table, this parameter can be omitted; otherwise, list the desired columns separated by commas, in order. Any column not listed after --columns must either have a default value or accept NULL values, otherwise the database will reject the data exported by Sqoop and the Sqoop job will fail.
--export-dir
The export directory. This parameter is required when performing an export, together with --table or --call (one of the two, or both). --table specifies the corresponding table in the target database,
and --call specifies a stored procedure.

--input-null-string and --input-null-non-string

If the first parameter is not specified, the string "NULL" in string-type columns is translated into a NULL value; if the second parameter is not specified, then for non-string-type columns both the string "NULL" and the empty string are translated into NULL values. For example:
--input-null-string '\\N' --input-null-non-string '\\N'
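Putting these parameters together, a sketch of an export command that uses them might look as follows; the column list and the '\\N' null convention are illustrative assumptions, not taken from the original example:

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table employee \
--export-dir /emp/emp_data \
--input-fields-terminated-by ',' \
--columns id,name,deg,salary,dept \
--input-null-string '\\N' \
--input-null-non-string '\\N'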

2. Update export (updateonly mode)
2.1. Parameters
--update-key: the update key, i.e. the field(s) used to match rows for updating, for example id. Multiple key fields can be specified, separated by commas.
--update-mode: set to updateonly (the default mode); only existing records are updated, no new records are inserted.
2.2. HDFS data preparation
Create a file updateonly_1.txt in the HDFS "/updateonly_1/" directory:
1201,gopal,manager,50000
1202,manisha,preader,50000
1203,kalil,php dev,30000
2.3. Manually create the target table in MySQL

mysql> USE userdb;
mysql> CREATE TABLE updateonly ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT);

2.4. Perform a full export

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table updateonly \
--export-dir /updateonly_1/

2.5. Check the data in MySQL
You can see that all of the data has been exported in full.
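To verify, a quick query such as the following works (a sketch):

mysql> SELECT * FROM updateonly;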

2.6. Add a new file
Create updateonly_2.txt: modify the first three records and append one new record. Upload it to the /updateonly_2/ directory:
1201,gopal,manager,1212
1202,manisha,preader,1313
1203,kalil,php dev,1414
1204,allen,java,1515

2.7. Perform the update export

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table updateonly \
--export-dir /updateonly_2/ \
--update-key id \
--update-mode updateonly

2.8. Check the final result

Although the export logs show that four records were exported:

in the end only the update operations take effect; no new record is inserted.

3. Update export (allowinsert mode)
3.1. Parameters
--update-key: the update key, i.e. the field(s) used to match rows for updating, for example id. Multiple key fields can be specified, separated by commas.
--update-mode: set to allowinsert; existing records are updated and new records are inserted. Essentially an insert & update operation.
3.2. HDFS data preparation
Create a file allowinsert_1.txt in the HDFS "/allowinsert_1/" directory:
1201,gopal,manager,50000
1202,manisha,preader,50000
1203,kalil,php dev,30000
3.3. Manually create the target table in MySQL

mysql> USE userdb;
mysql> CREATE TABLE allowinsert ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT);

3.4. Perform a full export

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root \
--password hadoop \
--table allowinsert \
--export-dir /allowinsert_1/

3.5. Check the data in MySQL

You can see that all of the data has been exported in full.

3.6. Add a new file
Create allowinsert_2.txt: modify the first three records and append one new record. Upload it to the /allowinsert_2/ directory:
1201,gopal,manager,1212
1202,manisha,preader,1313
1203,kalil,php dev,1414
1204,allen,java,1515
3.7. Perform the update export

bin/sqoop export \
--connect jdbc:mysql://node-1:3306/userdb \
--username root --password hadoop \
--table allowinsert \
--export-dir /allowinsert_2/ \
--update-key id \
--update-mode allowinsert

3.8. Check the final result
The export logs show that four records were exported:

This time, in addition to the updates, the new record has also been inserted.


Origin: blog.csdn.net/qq_38483094/article/details/94742961