Detailed usage and principles of Sqoop

Reposted from: https://blog.csdn.net/zhusiqing6/article/details/95680185

 

1. Sqoop introduction
Sqoop is a tool for migrating data between Hadoop HDFS and relational databases: it can import data from a relational database (MySQL, Oracle, etc.)
into HDFS, and it can also export data from HDFS into a relational database.

2. Sqoop features:
Sqoop is implemented on top of MapReduce, so it depends on Hadoop and imports data in parallel.

3. Sqoop installation and configuration
1) Installation:
Unpack sqoop-1.4.3.bin__hadoop-1.0.0.tar.gz and add SQOOP_HOME to /etc/profile.
To connect to a database, copy that database's JDBC driver jar into Sqoop's lib directory (a sketch of these steps follows).
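A minimal sketch of those installation steps, assuming the tarball sits in the current directory, Sqoop is unpacked under /usr/local, and the MySQL connector jar name is only an example:

tar -zxvf sqoop-1.4.3.bin__hadoop-1.0.0.tar.gz -C /usr/local
# append to /etc/profile, then run: source /etc/profile
export SQOOP_HOME=/usr/local/sqoop-1.4.3.bin__hadoop-1.0.0
export PATH=$PATH:$SQOOP_HOME/bin
# copy the JDBC driver jar into Sqoop's lib directory (jar name is illustrative)
cp mysql-connector-java-5.1.32.jar $SQOOP_HOME/lib/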
2) Configuration:
Rename the template configuration file:
mv sqoop-env-template.sh sqoop-env.sh
File contents (these may be left unmodified):
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/usr/local/hadoop/

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/usr/local/hadoop

#set the path to where bin/hbase is available
export HBASE_HOME=/usr/local/hbase

#Set the path to where bin/hive is available
export HIVE_HOME=/usr/local/hive

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/usr/local/zk
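
To verify the installation, configuration, and JDBC driver, a quick hedged check is to print the version and list the databases on the MySQL server used in the examples below (the host and root/123 credentials are taken from the later sections and are assumptions here):

sqoop version
sqoop list-databases --connect jdbc:mysql://192.168.1.10:3306 --username root --password 123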

4. Sqoop usage:
Category 1: import data from a database into HDFS
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123
--table trade_detail --columns 'id, account, income, expenses'
Specify the output path and the data delimiter:
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123
--table trade_detail --target-dir '/sqoop/td' --fields-terminated-by '\t'
Specify the number of map tasks with -m:
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123
--table trade_detail --target-dir '/sqoop/td1' --fields-terminated-by '\t' -m 2

Add a where condition (note: the condition must be enclosed in quotes):
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123
--table trade_detail --where 'id>3' --target-dir '/sqoop/td2'

Add a query statement (use \ to break the statement across lines):
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123
--query 'select * from trade_detail where id > 2 and $CONDITIONS' --split-by trade_detail.id --target-dir '/sqoop/td3'
Note: when using --query, the where clause must include AND $CONDITIONS, and single and double quotes behave differently: if --query is followed by double quotes, a \ must be added before $CONDITIONS, i.e. \$CONDITIONS (see the sketch below).
If the number of map tasks is set to 1 (-m 1), --split-by ${tablename.column} can be omitted; otherwise it must be added.
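A minimal sketch of the quoting difference, reusing the trade_detail example above (the two forms are intended to be equivalent):

# single quotes: $CONDITIONS is passed through to Sqoop unchanged
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 \
--query 'select * from trade_detail where id > 2 and $CONDITIONS' \
--split-by trade_detail.id --target-dir '/sqoop/td3'
# double quotes: the shell would expand $CONDITIONS, so it must be escaped as \$CONDITIONS
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 \
--query "select * from trade_detail where id > 2 and \$CONDITIONS" \
--split-by trade_detail.id --target-dir '/sqoop/td3'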
Category 2: export data from HDFS to a database
sqoop export --connect jdbc:mysql://192.168.8.120:3306/itcast --username root --password 123
--export-dir '/td3' --table td_bak -m 1 --fields-terminated-by '\t'
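As a quick sanity check after the export (a hedged example, assuming the mysql client is installed and the target table td_bak was created in the itcast database beforehand, which sqoop export requires):

mysql -h 192.168.8.120 -u root -p123 -e "select count(*) from itcast.td_bak"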

   Category 3: common statements for importing data into Hive with Sqoop
         sqoop import --connect jdbc:postgresql://ip/db_name --username user_name  --table table_name  --hive-import -m 5
            Internally this actually runs in three steps: 1. import the data into HDFS (the corresponding directory can be found on HDFS); 2. create a Hive table with the same name; 3. load the data from HDFS into the Hive table (the sketch below shows one way to check each step).
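            A hedged way to observe those three steps, assuming the default HDFS staging directory under the running user's home and the default Hive warehouse location (both paths are assumptions):
            hdfs dfs -ls /user/$USER/table_name             # step 1: temporary import directory on HDFS (its files are moved into the warehouse by the load)
            hive -e "show create table table_name"          # step 2: the Hive table created with the same name
            hive -e "select count(*) from table_name"       # step 3: the data loaded into the Hive table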

            Create a Hive table from the PostgreSQL table with sqoop:
            sqoop create-hive-table --connect jdbc:postgresql://ip/db_name --username user_name  --table table_name  --hive-table
            hive_table_name (add --hive-partition-key partition_name if the table needs to be partitioned)

            Import into a Hive table that has already been created:
            sqoop import --connect jdbc:postgresql://ip/db_name --username user_name  --table table_name  --hive-import -m 5
            --hive-table hive_table_name  (--hive-partition-key partition_name --hive-partition-value partition_value);

            Import into a Hive table using a query:
            sqoop import --connect jdbc:postgresql://ip/db_name --username user_name  --query "select * from retail_tb_order where
            \$CONDITIONS"  --hive-import -m 5 --hive-table hive_table_name  (--hive-partition-key partition_name --hive-partition-value
            partition_value);
            Note: the $CONDITIONS condition is required. If the query clause uses double quotes, $CONDITIONS must be escaped with \; if single quotes are used, no escaping is needed.

5. Configure MySQL for remote connections
GRANT ALL PRIVILEGES ON itcast.* TO 'root'@'192.168.1.201' IDENTIFIED BY '123' WITH GRANT OPTION;
FLUSH PRIVILEGES;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123' WITH GRANT OPTION;
FLUSH PRIVILEGES;
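
A hedged check that the grant works from the client side (assuming the MySQL server runs on 192.168.1.10 as in the import examples and the mysql client is available on the Sqoop host):

mysql -h 192.168.1.10 -u root -p123 -e "show databases"
sqoop list-tables --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123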

6. Sqoop principle (using import as an example)

When Sqoop imports data it needs the --split-by parameter. Sqoop splits the data according to the values of the split-by column and assigns the resulting ranges to different map tasks.
Each map task then fetches its rows from the database and writes them to HDFS one by one. The splitting method depends on the type of the split-by column. For a simple int column,
Sqoop takes the maximum and minimum values of the split-by field and divides the range into as many regions as num-mappers specifies. For example, select max(split_by), min(split_by) from table
returns max = 1000 and min = 1; if num-mappers is 2, the range is divided into two regions, (1, 500) and (501, 1000),
and two SQL statements are generated for the two map tasks: select XXX from table where split_by >= 1 and split_by <= 500, and
select XXX from table where split_by >= 501 and split_by <= 1000. Finally, each map task imports the data returned by its own SQL (see the sketch below).
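A minimal sketch of a command that would be split this way, together with the queries Sqoop is assumed to generate for it (table and column names are illustrative):

sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 \
--table trade_detail --split-by id -m 2 --target-dir '/sqoop/td_split'
# boundary query (run once by Sqoop):
#   SELECT MIN(id), MAX(id) FROM trade_detail
# per-mapper queries (assuming min = 1 and max = 1000):
#   mapper 1: SELECT ... FROM trade_detail WHERE id >= 1 AND id <= 500
#   mapper 2: SELECT ... FROM trade_detail WHERE id >= 501 AND id <= 1000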

7. MapReduce job parameters set when Sqoop runs

  1. InputFormatClass
    com.cloudera.sqoop.mapreduce.db.DataDrivenDBInputFormat
  2. OutputFormatClass
    1)TextFile
    com.cloudera.sqoop.mapreduce.RawKeyTextOutputFormat
    2)SequenceFile
    org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
    3)AvroDataFile
    com.cloudera.sqoop.mapreduce.AvroOutputFormat
  3. Mapper
    1)TextFile
    com.cloudera.sqoop.mapreduce.TextImportMapper
    2)SequenceFile
    com.cloudera.sqoop.mapreduce.SequenceFileImportMapper
    3)AvroDataFile
    com.cloudera.sqoop.mapreduce.AvroImportMapper
  4. taskNumbers
    1)mapred.map.tasks (corresponds to the num-mappers parameter)
    2)job.setNumReduceTasks(0);
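
The file-format options on the import command select which of the classes above are used; a hedged illustration (--as-sequencefile and --as-avrodatafile are standard Sqoop flags, the paths are illustrative):

sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 \
--table trade_detail --target-dir '/sqoop/td_seq' --as-sequencefile -m 2    # SequenceFileOutputFormat + SequenceFileImportMapper
sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 \
--table trade_detail --target-dir '/sqoop/td_avro' --as-avrodatafile -m 2   # AvroOutputFormat + AvroImportMapper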

8. Example walkthrough:
Take the following command line: import --connect jdbc:mysql://localhost/test --username root --password 123456 --query "select sqoop_1.id as foo_id, sqoop_2.id as bar_id from sqoop_1, sqoop_2 WHERE $CONDITIONS" --target-dir /user/sqoop/test --split-by sqoop_1.id --hadoop-home=/home/hdfs/hadoop-0.20.2-CDH3B3 --num-mappers 2
Note: each parameter below is followed by the value derived from this command line.
1) Set the Input
DataDrivenImportJob.configureInputFormat(Job job, String tableName, String tableClassName, String splitByCol)
a) DBConfiguration.configureDB(Configuration conf, String driverClass, String dbUrl, String userName, String passwd, Integer fetchSize)
1) mapreduce.jdbc.driver.class  com.mysql.jdbc.Driver
2) mapreduce.jdbc.url  jdbc:mysql://localhost/test
3) mapreduce.jdbc.username  root
4) mapreduce.jdbc.password  123456
5) mapreduce.jdbc.fetchsize  -2147483648
b)DataDrivenDBInputFormat.setInput(Job job,Class<? extends DBWritable> inputClass, String inputQuery, String inputBoundingQuery)
1)job.setInputFormatClass(DBInputFormat.class);
2)mapred.jdbc.input.bounding.query  SELECT MIN(sqoop_1.id), MAX(sqoop_2.id) FROM (select sqoop_1.id as foo_id, sqoop_2.id as bar_id from sqoop_1, sqoop_2 WHERE (1 = 1) ) AS t1
3)job.setInputFormatClass(com.cloudera.sqoop.mapreduce.db.DataDrivenDBInputFormat.class);
4)mapreduce.jdbc.input.orderby sqoop_1.id
c)mapreduce.jdbc.input.class QueryResult
d)sqoop.inline.lob.length.max 16777216

2) Set the Output
ImportJobBase.configureOutputFormat(Job job, String tableName,String tableClassName)
a)job.setOutputFormatClass(getOutputFormatClass());
b)FileOutputFormat.setOutputCompressorClass(job, codecClass);
c)SequenceFileOutputFormat.setOutputCompressionType(job,CompressionType.BLOCK);
d)FileOutputFormat.setOutputPath(job, outputPath);
3) Set the Map
DataDrivenImportJob.configureMapper(Job job, String tableName,String tableClassName)
a)job.setOutputKeyClass(Text.class);
b)job.setOutputValueClass(NullWritable.class);
c)job.setMapperClass(com.cloudera.sqoop.mapreduce.TextImportMapper);

4) Set the task number
JobBase.configureNumTasks(Job job)
mapred.map.tasks 4
job.setNumReduceTasks(0);

9. General process

1. Read the structure of the table whose data is to be imported, generate a record class (QueryResult by default), package it into a jar, and submit it to Hadoop (see the codegen sketch after this list).

2. Set up the job, i.e. configure the parameters described in the sections above.
3. Hadoop then executes the import through MapReduce:
1) First split the data (DataSplit):
DataDrivenDBInputFormat.getSplits(JobContext job)
2) After the ranges have been split, write them out so they can be read back later:
DataDrivenDBInputFormat.write(DataOutput output); this is where the lowerBoundQuery and upperBoundQuery are stored
3) Read the ranges written in step 2):
DataDrivenDBInputFormat.readFields(DataInput input)
4) Then create a RecordReader to read data from the database:
DataDrivenDBInputFormat.createRecordReader(InputSplit split, TaskAttemptContext context)
5) Create the map:
TextImportMapper.setup(Context context)
6) The RecordReader reads data from the relational database row by row, sets the map's key and value, and hands them to the map:
DBRecordReader.nextKeyValue()
7) Run the map:
TextImportMapper.map(LongWritable key, SqoopRecord val, Context context)
The key that is finally generated is the data row produced by QueryResult, and the value is NullWritable.get()
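
Step 1 above (generating and packaging the record class) can also be run on its own with Sqoop's codegen tool; a hedged sketch reusing the example connection (the output location is Sqoop's default temporary compile directory and may differ):

sqoop codegen --connect jdbc:mysql://localhost/test --username root --password 123456 --table sqoop_1
# the generated .java, .class and .jar files typically land under /tmp/sqoop-<user>/compile/ (path may vary)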

Origin www.cnblogs.com/leon0/p/12066840.html