Sqoop import into Hive

Disclaimer: this is an original article by the blogger and may not be reproduced without permission. https://blog.csdn.net/mn_kw/article/details/90602320

Sqoop imports into Hive in three stages:
1. First import the data into the HDFS directory specified by --target-dir
2. Create the table in Hive
3. Call Hive's LOAD DATA INPATH to move the data from --target-dir into the Hive table (roughly the manual equivalent of the sketch below)
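Done by hand, those three stages look roughly like this (a sketch only: the column list and delimiter are illustrative assumptions, not taken from the article; the directory and table name are the ones used below):

# 1. the data first lands in the staging directory on HDFS (what --target-dir points at)
hdfs dfs -ls /user/root/store

# 2. the table is created in Hive (columns here are made up for illustration)
hive -e "CREATE TABLE IF NOT EXISTS dw_hd.ods_store (flowno STRING, rcvtime STRING)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001';"

# 3. LOAD DATA INPATH moves the staged files into the table's warehouse directory
hive -e "LOAD DATA INPATH '/user/root/store' INTO TABLE dw_hd.ods_store;"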

sqoop import \
--hive-import \
--hive-table dw_hd.ods_store \
--connect jdbc:oracle:thin:@<HOST>:1521:app \
--username user \
--password 123456 \
--query "select * from HD.STORE where \$CONDITIONS and RCVTIME < TO_TIMESTAMP('2017-05-30 00:00:00','yyyy-mm-dd hh24:mi:ss.ff')" \
--split-by FLOWNO \
--direct \
--target-dir /user/root/store \
--null-string '\\N' \
--null-non-string '\\N' \
-m 2
 

--hive-import: specifies that the import target is Hive.
--hive-table: the Hive database and table name to import into.
--null-string and --null-non-string: control how Sqoop represents NULL values for string columns and non-string columns. If they are not specified, NULLs in string columns arrive in Hive as the literal string 'null' and NULLs in other columns as the literal string 'NULL'. Setting both to '\\N' unifies the two cases: Sqoop writes \N, which is the marker Hive itself interprets as NULL.
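A quick way to see the difference after an import (some_col is a hypothetical column name used only for illustration):

# with --null-string/--null-non-string '\\N', NULLs from the source become real NULLs in Hive
hive -e "SELECT COUNT(*) FROM dw_hd.ods_store WHERE some_col IS NULL;"

# without them, what looks like NULL is just the literal string 'null'/'NULL', which IS NULL will not match
hive -e "SELECT COUNT(*) FROM dw_hd.ods_store WHERE some_col IN ('null', 'NULL');"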
 

Question 1: after the import, why does the row count in Hive exceed the actual row count in the relational database?

Cause: --hive-import uses Hive's default field delimiter ^A and line delimiter \n.

So if the imported data itself contains '\n', Hive treats everything after it as the start of the next row and splits one record into two. The row count in Hive then ends up higher than in the source database, and the data is inconsistent.
Sqoop does provide the parameters --fields-terminated-by and --lines-terminated-by to define custom field and line delimiters.
But when you actually try it, it turns out to be a trap:
INFO hive.HiveImport: FAILED: SemanticException 1:381 LINES TERMINATED BY only supports newline '\n' right now. Even though you can pass some other character to --lines-terminated-by, Hive only supports \n as the line delimiter.
 

To check in Oracle whether a field contains carriage returns or line feeds (|| is Oracle's string concatenation operator, % is a wildcard matching any string):
  Check for a CR+LF pair, i.e. \r\n:
  SELECT * FROM system.test_tab1 WHERE name LIKE '%' || CHR(13) || CHR(10) || '%'
  Check for a carriage return alone, i.e. \r:
  SELECT * FROM system.test_tab1 WHERE name LIKE '%' || CHR(13) || '%'
  Check for a line feed alone, i.e. \n:
  SELECT * FROM system.test_tab1 WHERE name LIKE '%' || CHR(10) || '%'
 

Solution: the simplest fix is to add the parameter --hive-drop-import-delims, which strips Hive's default delimiter characters from the imported data. This is the easiest route when you are sure the data should not contain those characters, or that dropping them does no harm. Note that this option cannot be combined with the --direct option.
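Applied to the import above, the command could be adjusted roughly as follows (a sketch: --direct removed, --hive-drop-import-delims added, everything else as before):

sqoop import \
--hive-import \
--hive-table dw_hd.ods_store \
--connect jdbc:oracle:thin:@<HOST>:1521:app \
--username user \
--password 123456 \
--query "select * from HD.STORE where \$CONDITIONS and RCVTIME < TO_TIMESTAMP('2017-05-30 00:00:00','yyyy-mm-dd hh24:mi:ss.ff')" \
--split-by FLOWNO \
--target-dir /user/root/store \
--null-string '\\N' \
--null-non-string '\\N' \
--hive-drop-import-delims \
-m 2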

 

Sqoop incremental import into Hive

Question 1: after the import, all of each row's data ends up in the first field?
Cause and solution: the data was imported directly into the HDFS directory backing the Hive table, and Sqoop's default field delimiter is a comma ',', while Hive's default delimiter is \001, i.e. ^A. Hive therefore cannot split the fields, so the field delimiter has to be changed to \001 by adding the configuration below:

--fields-terminated-by \001
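A sketch of what that looks like in a full command (the connection details reuse the MySQL example later in this article; the warehouse path is an illustrative assumption):

sqoop import \
--connect jdbc:mysql://109.123.121.104:3306/testdb \
--username root \
--password 123456 \
--table user \
--target-dir /user/hive/warehouse/test.db/user \
--fields-terminated-by '\001' \
--append \
-m 1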

Defining an incremental import with a Sqoop job

Using a Sqoop job for incremental updates is convenient: the job's metastore keeps track of --last-value for you.

Step 1: create the Sqoop job
a. Configure the Sqoop metastore service
Edit the sqoop/conf/sqoop-site.xml file

Related attributes:

sqoop.metastore.server.location
sqoop.metastore.server.port
sqoop.metastore.client.autoconnect.url

<property>
 <name>sqoop.metastore.server.location</name>
 <value>/tmp/sqoop-metastore/shared.db</value>
</property>
<property>
 <name>sqoop.metastore.server.port</name>
 <value>16000</value>
</property>
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
 <value>jdbc:hsqldb:hsql://118.228.197.115:16000/sqoop</value>
</property>
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
<!-- Comment out this property
<property>
  <name>sqoop.metastore.client.enable.autoconnect</name>
  <value>false</value>
</property>
-->

b. Start the metastore with the sqoop metastore command in a console, for example as shown below (if the first three properties are not configured, skip this step)
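One simple way to keep it running in the background (the log path is an illustrative choice):

# start the shared metastore and keep it alive after the shell exits
nohup sqoop metastore > /tmp/sqoop-metastore.log 2>&1 &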
c. Create the sqoop job

(For convenience, save the following command to a file, make it executable with chmod u+x FILENAME, and then run ./FILENAME to create the job.)

sqoop job --meta-connect jdbc:hsqldb:hsql://hostIP:16000/sqoop \
--create JOBNAME -- import \
--hive-import \
--incremental append \
--connect jdbc:oracle:thin:@DatabaseIP:1521/INSTANCENAME \
--username USERNAME --password PASSWD \
--verbose -m 1 \
--bindir /opt/sqoop/lib \
--table TABLENAME \
--check-column COLUMNNAME \
--last-value VALUE

Note:

1) If the shared metastore was not configured earlier (i.e. the three properties sqoop.metastore.server.location, sqoop.metastore.server.port and sqoop.metastore.client.autoconnect.url are left commented out in the configuration file), then remove "--meta-connect jdbc:hsqldb:hsql://hostIP:16000/sqoop" from the script above.

2) In "--create JOBNAME -- import", the "--" must be followed by a space before the import command, otherwise an error is raised.
3) The --check-column column cannot be a char/varchar type; it can be date, int, and so on.

Step 2: check that the sqoop job runs correctly

List the jobs to confirm the job was created successfully:

sqoop job --list

Run the job to check that it executes normally (if the data volume is large, this will take a long time):

sqoop job --exec JOBNAME
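After a successful run the metastore records the new --last-value; it can be inspected with sqoop job --show (add --meta-connect if the job lives in the shared metastore; the grep is just a convenience filter):

# print the stored job definition, including incremental.last.value
sqoop job --show JOBNAME | grep -i last.value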

Step 3: once the sqoop job is confirmed to run correctly, script it for scheduled execution

Write the following script to a text file, for example execJob, then run chmod u+x execJob to give it execute permission

source /etc/profile

rm -f TABLENAME.java

sqoop job --exec JOBNAME
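For the scheduled execution itself, a cron entry along these lines could be used (the schedule and the path to execJob are illustrative assumptions):

# run the incremental import every day at 01:00
0 1 * * * /opt/scripts/execJob >> /var/log/sqoop-incremental.log 2>&1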

 

Importing with a where clause

sqoop import \
--connect jdbc:mysql://109.123.121.104:3306/testdb \
--username root \
--password 123456 \
--table user \
--where 'id > 5 and account like "f%"' \
--target-dir /sqoop/import/user_where \
--delete-target-dir \
--fields-terminated-by '\t' \
-m 1 \
--direct
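To spot-check the result, the files under the target directory can be listed and sampled (Sqoop names the map-output files part-m-*):

hdfs dfs -ls /sqoop/import/user_where
hdfs dfs -cat /sqoop/import/user_where/part-m-* | head -n 5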

 

table_name='WORK_DEVSTATETIME'
work_date='2018-05-14'
hive_database='test'

sqoop import --connect jdbc:oracle:thin:@10.60.127.64:1521:ORCL \
--username hhggk \
--password oracle \
--table ${table_name} \
--where "workdate= '${work_date}'" \
--fields-terminated-by "\t" \
--lines-terminated-by "\n" \
--delete-target-dir \
--hive-import \
--hive-database ${hive_database} \
--hive-overwrite \
--null-string '\\N' \
--null-non-string '\\N' \
-m 1 \
--direct \
--hive-drop-import-delims

The --direct and --hive-drop-import-delims parameters here cannot be used at the same time: the former speeds up the import, while the latter fixes the problem that our table contains whitespace/newline characters, which otherwise inflates the number of rows imported into Hive. Pick whichever matters more for your data.
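A rough way to verify the fix is to compare row counts on both sides for the same workdate (the Hive table name is assumed to be the lowercase form of the Oracle table under the test database; sqoop eval is used here only to run a COUNT against Oracle):

# count in Hive after the import
hive -e "SELECT COUNT(*) FROM test.work_devstatetime WHERE workdate = '2018-05-14';"

# count in Oracle for the same slice
sqoop eval --connect jdbc:oracle:thin:@10.60.127.64:1521:ORCL \
  --username hhggk --password oracle \
  --query "SELECT COUNT(*) FROM WORK_DEVSTATETIME WHERE workdate = '2018-05-14'"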