Sword Finger Data Warehouse-Hive04

1. Review of the last lesson

2. Sorting in big data (order by, sort by, cluster by)

3. The use of Sqoop

1. Review of the last lesson

  • https://blog.csdn.net/SparkOnYarn/article/details/105182082
  • Join is divided into inner join, outer join (left outer, right outer) and full join; three complex data types: Map, Struct, Array, including how to define them and how to store and read their values; for the built-in functions you only need to look up the detailed definitions; it is not recommended to store raw JSON in Hive, and the built-in function parse_url_tuple can parse URLs (a quick sketch follows below).
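A minimal example of how parse_url_tuple is typically called; the page_views table and its url column are hypothetical, used only to illustrate the call:

-- parse_url_tuple is a built-in UDTF that extracts several parts of a URL in one pass:
-- here the host, the path, the full query string, and the value of the id query parameter.
select parse_url_tuple(url, 'HOST', 'PATH', 'QUERY', 'QUERY:id') as (host, path, query, query_id)
from page_views;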

2. Sorting in big data (order by, sort by, cluster by)

Sorting is not a problem when the data volume is small, but as soon as the data volume grows, problems appear:

1. order by (global order)
  • To achieve global ordering as above there can be only one reducer; with multiple reducers, reduce1 is ordered and reduce2 is ordered, but global order cannot be guaranteed. Use order by with caution in production: the data volume is large, yet there is still only one reducer.

select * from emp order by empno desc;

0: jdbc:hive2://hadoop001:10000/ruozedata_hiv> set hive.mapred.mode=strict;
No rows affected (0.004 seconds)
0: jdbc:hive2://hadoop001:10000/ruozedata_hiv> select * from emp order by empno desc;
Error: Error while compiling statement: FAILED: SemanticException 1:27 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'empno' (state=42000,code=40000)

Open the Hive configuration properties documentation and note the parameter hive.mapred.mode: in strict mode, certain heavy queries are not allowed to run. For example, full table scans are prohibited, and if you use order by you must also add a LIMIT:
Let's try the same thing on a partitioned table:
0: jdbc:hive2://hadoop001:10000/ruozedata_hiv> select * from order_partition where event_month='2020-01' order by order_no desc;
Error: Error while compiling statement: FAILED: SemanticException 1:67 In strict mode, if ORDER BY is specified, LIMIT must also be specified. Error encountered near token 'order_no' (state=42000,code=40000)

0: jdbc:hive2://hadoop001:10000/ruozedata_hiv> set hive.mapred.mode=nonstrict;
No rows affected (0.002 seconds)

0: jdbc:hive2://hadoop001:10000/ruozedata_hiv> select * from order_partition where event_month='2020-01' order by order_no desc;

In strict mode (a sketch follows below):
for a normal table: order by must be accompanied by limit
for a partitioned table: order by + limit, plus a where filter on the partition column
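A minimal sketch of queries that do satisfy strict mode, using the emp and order_partition tables from above (the limit values are arbitrary):

set hive.mapred.mode=strict;
-- Normal table: ORDER BY must be accompanied by LIMIT.
select * from emp order by empno desc limit 10;
-- Partitioned table: filter on the partition column and add LIMIT.
select * from order_partition where event_month='2020-01' order by order_no desc limit 10;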

  • Again: use order by with caution in production.

sort by
  • It guarantees that each partition is ordered: however many reducers you have, each reducer's output is ordered, but global order cannot be guaranteed. sort by is not affected by strict or nonstrict mode; you can sort by more than one column. Numeric types are compared as numbers; string types are compared lexicographically (dictionary order):
  • Lexicographic order means sorting by letters in the order a, b, c, d:
The relevant parameter is mapred.reduce.tasks (default value -1, meaning Hive decides the number of reducers automatically).

1. Set the number of reducers:
set mapred.reduce.tasks=3;  -- set 3 reducers
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3

2. select * from emp sort by empno;
The result printed on the console is not that clear or intuitive:

3. In MapReduce the number of output files equals the number of reduce tasks; in Spark the number of output files equals the number of tasks:

4. Write the output files to a local Linux directory and then examine them:
insert overwrite local directory '/home/hadoop/tmp/hivetmp/sort/' select * from emp sort by empno;

5. Go to the output directory and check: there are indeed 3 output files --> this is what is meant by 'ordered within each partition':
[hadoop@hadoop001 sort]$ pwd
/home/hadoop/tmp/hivetmp/sort
[hadoop@hadoop001 sort]$ ll
total 12
-rw-r--r-- 1 hadoop hadoop 335 Apr  3 13:59 000000_0
-rw-r--r-- 1 hadoop hadoop 282 Apr  3 13:59 000001_0
-rw-r--r-- 1 hadoop hadoop  91 Apr  3 13:59 000002_0
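
A quick sanity check, assuming the same local directory as above: each part file should be sorted by empno on its own, even though the three files together are not globally sorted.

# each file is internally ordered by empno; the three files are not globally ordered
cat /home/hadoop/tmp/hivetmp/sort/000000_0
cat /home/hadoop/tmp/hivetmp/sort/000001_0
cat /home/hadoop/tmp/hivetmp/sort/000002_0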
  • The SELECT statement as described on the official website:
    https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

Why must LIMIT be added in strict mode? Because order by funnels all the data through a single reducer, and without a limit that reducer may run for a very long time on a large result. The default sort order is asc (ascending), and strings are compared lexicographically.
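A small sketch of overriding those defaults with sort by, assuming the emp columns referenced later in this post (sal, ename):

-- sal (numeric) in descending order, then ename (string) in ascending lexicographic order;
-- without asc/desc the default is ascending.
select * from emp sort by sal desc, ename asc;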

Distribute by:

How to use it: distribute by <col>: distributes the data to different reducers according to the specified field; it is equivalent to the Partitioner in MapReduce and is usually used together with sort by;

  • Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

1. Distribute according to the length of the name, and sort in ascending order of employee number within the partition:
insert overwrite local directory '/home/hadoop/tmp/hivetmp/distribute' select * from emp distribute by length(ename) sort by empno;

// Data-skew scenarios in big data rely on distribute by to spread the data across reducers.

Cluster by:
  • cluster by is a shortcut for distribute by + sort by on the same column; the sort is ascending only:

  • insert overwrite local directory '/home/hadoop/tmp/hivetmp/distribute' select * from emp cluster by empno;

To sum up:

order by: global ordering; with a large data volume the single reducer makes it inefficient
sort by: ordered within each reducer; global order cannot be guaranteed
distribute by: distributes the data to reducers according to the specified field,
often used together with sort by to make each reducer internally ordered
cluster by = distribute by + sort by on the same column (see the sketch below)
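As a concrete illustration of the last line, these two statements should produce the same distribution and the same per-reducer ordering (note that cluster by only sorts in ascending order):

select * from emp cluster by empno;
select * from emp distribute by empno sort by empno;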

Thinking: production requires ordered statistical results; order by cannot be used, and sort by cannot guarantee global order. How should this be solved?

3. The use of Sqoop

3.1 The background and introduction of Sqoop

Scenarios:
1) The data is stored in MySQL and you want to process it with Hive.
2) After statistical analysis in Hive, the results are still in Hive; how do you export them to MySQL?
--> Finally the results are displayed visually in a report. How does the report connect to the results? (1) through HiveServer2, or (2) the Hive results are exported to an RDBMS and the report connects directly to the RDBMS. These are the two essential scenarios.

Solution: write MapReduce jobs (too complicated)
-> abstract this into a common tool

Sqoop is a data import and export tool.
Just as Hive keeps its MySQL connection information in hive-site.xml under $HIVE_HOME/conf:
Connecting to an RDBMS requires: url, driver, db, table, user, password (the required information)
Connecting to HDFS requires: a path
Connecting to Hive requires: database, table, partition

--> This leads to the Sqoop framework: sqoop.apache.org

sqoop: SQL to Hadoop
Hue: a visualization framework; you write SQL in it and the results can be presented as reports; once it is configured in CDH the results come right out.

In a company's production cluster you first connect to the server through a jump server. Can DBeaver connect to the production server? Yes, it can.

Sqoop introduction:
  • Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Sqoop is a tool for efficiently transferring data; it handles the interoperation between Hadoop and structured datastores such as relational databases.

RDBMS <==> Hadoop; the data can also be imported into HBase and Hive.

Two major versions:
sqoop1.x: sqoop1 *** (about 70% of companies use this)
sqoop1.99.x: sqoop2 (it is very troublesome to use)

  • We use Sqoop 1; the two versions are incompatible with each other.
Responsibilities:
Import and export data between an RDBMS and Hadoop; under the hood this is implemented with MapReduce.
Sqoop only has a map phase; it does not need reduce.

Sqoop import: Hadoop is the point of reference, data flows into Hadoop:
RDBMS ==> Hadoop

Sqoop export: Hadoop is the starting point, data flows out to the outside:
Hadoop ==> RDBMS
(A sketch of both directions follows.)
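A hedged sketch of the two directions, reusing the connection settings from later in this post; the export half (the emp_result table and its HDFS directory) is illustrative only and not part of the original walkthrough:

# Import: RDBMS ==> Hadoop (covered in detail in section 3.4)
sqoop import \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp \
--target-dir /user/hadoop/emp \
-m 1

# Export: Hadoop ==> RDBMS (an HDFS directory written back into a MySQL table;
# the target MySQL table must already exist)
sqoop export \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp_result \
--export-dir /user/hadoop/emp_result \
--input-fields-terminated-by ','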

3.2 Installation and deployment of Sqoop

1. Download:
wget http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.6-cdh5.16.2.tar.gz

2. Unpack it and create a symbolic link:
tar -zxvf sqoop-1.4.6-cdh5.16.2.tar.gz -C ~/app/
ln -s sqoop-1.4.6-cdh5.16.2 sqoop

3. Configure the system environment variables: vi ~/.bashrc
export SQOOP_HOME=/home/hadoop/app/sqoop
export PATH=$SQOOP_HOME/bin:$PATH

4. Make the environment variables take effect:
source ~/.bashrc

5. Copy the template file under $SQOOP_HOME/conf and configure the parameters:
cp sqoop-env-template.sh sqoop-env.sh
export HADOOP_COMMON_HOME=/home/hadoop/app/hadoop
export HADOOP_MAPRED_HOME=/home/hadoop/app/hadoop
export HIVE_HOME=/home/hadoop/app/hive

6. Copy the MySQL driver jar into the $SQOOP_HOME/lib directory:
cp mysql-connector-java-5.1.27-bin.jar $SQOOP_HOME/lib/

3.3 Simple use of Sqoop

sqoop help shows the command help; learning Sqoop is basically looking things up like a dictionary:

1. sqoop version:
View the version number

2. List the databases in MySQL; view the command help with sqoop help list-databases

sqoop list-databases \
--connect jdbc:mysql://hadoop001:3306 \
--username root \
--password 960210

The output is as follows:
20/04/03 15:08:06 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
mysql
performance_schema
ruozedata_hive
sqoop
sys
wordpress

3. List the tables under the database:

sqoop list-tables \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210

// The output is as follows:
20/04/03 15:39:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
dept
emp
salgrade

3.4 Detailed use of Sqoop import (from MySQL to HDFS)

sqoop help import: View command help:

Let's run through the parameters first:
--append                      append data to an existing dataset in HDFS
--columns <col,col,col...>    the columns to import from the table
--delete-target-dir           delete the import target directory if it already exists
-m,--num-mappers <n>          use n map tasks to import in parallel
--mapreduce-job-name <name>   set the name of the generated MapReduce job
--target-dir <dir>            the HDFS destination directory
Requirement 1: import the emp table from the sqoop database into HDFS:
sqoop import \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp \
-m 1

// The following error is reported:
Exception in thread "main" java.lang.NoClassDefFoundError: org/json/JSONObject
        at org.apache.sqoop.util.SqoopJsonUtil.getJsonStringforMap(SqoopJsonUtil.java:43)
        at org.apache.sqoop.SqoopOptions.writeProperties(SqoopOptions.java:784)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
Caused by: java.lang.ClassNotFoundException: org.json.JSONObject
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

// The cause is that the java-json.jar package is missing; uploading one fixes the problem:
  • This sqoop statement runs a MapReduce job; the App Name is emp.jar (the default name when no name is specified), and the output goes under /user/hadoop/emp by default
1. As shown below, the data in MySQL was successfully imported into the HDFS directory:
[hadoop@hadoop001 bin]$ hdfs dfs -ls /user/hadoop/emp
20/04/03 16:10:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2020-04-03 16:08 /user/hadoop/emp/_SUCCESS
-rw-r--r--   1 hadoop supergroup        282 2020-04-03 16:08 /user/hadoop/emp/part-m-00000

2. One map task was used, so the result is a single file:
[hadoop@hadoop001 bin]$ hdfs dfs -text /user/hadoop/emp/part-m-00000
20/04/03 16:10:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
7369,SMITH,CLERK,7902,1980-12-17,800.0,20.0,40
7499,SMITH,CLERK,7902,1980-12-17,800.0,20.0,40
7499,SMITH,CLERK,7902,1980-12-17,800.0,20.0,40
7499,SMITH,CLERK,7902,1980-12-17,800.0,20.0,40
7499,SMITH,CLERK,7902,1980-12-17,800.0,20.0,40
7499,SMITH,CLERK,7902,1980-12-17,800.0,20.0,40


Test:

Running the same import again fails with ERROR tool.ImportTool (FileAlreadyExistsException): the output directory already exists, which means the target directory needs to be deleted in the import command:

1. In the log printed on the console when the command is executed from the $SQOOP_HOME/bin directory, why is there a LIMIT 1? It is mainly to check whether the table exists:
20/04/03 16:21:09 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM emp AS t LIMIT 1

2. Why does the log show number of splits: 1?
Because the number of map tasks was set to 1 (-m 1).

3. Every sqoop import run from the $SQOOP_HOME/bin directory generates a .java file there (the record class Sqoop generates for the table):

The takeaway is: delete the target directory every time before executing the import command:
  • This time we add the --delete-target-dir parameter, specify the job name frommysql2hdfs through the --mapreduce-job-name parameter, and drop -m 1 so that the number of maps returns to the default of 4;
sqoop import \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp \
--mapreduce-job-name frommysql2hdfs \
--delete-target-dir 
  • After removing -m 1 the error below is reported, and the reason is clear: if the emp table had a primary key there would be no error and the default of 4 maps would be used; since emp has no primary key, you must specify the number of maps or a split column (see the sketch after the error):
  • 20/04/03 16:32:18 ERROR tool.ImportTool: Import failed: No primary key could be found for table emp. Please specify one with --split-by or perform a sequential import with '-m 1'
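
As the error message suggests, the alternative to -m 1 is to tell Sqoop which column to split on. A sketch with the same connection settings (empno is a reasonable split column because it is unique):

sqoop import \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp \
--mapreduce-job-name frommysql2hdfs \
--delete-target-dir \
--split-by empno
# With --split-by, Sqoop can divide the rows among the default 4 map tasks
# even though emp has no primary key.
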
Requirement 2: when importing data into HDFS, choose the columns yourself and change the target directory (--target-dir):
sqoop import \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp \
--mapreduce-job-name frommysql2hdfs2 \
--delete-target-dir \
--target-dir emp_column \
--columns "empno,ename,job,sal,comm" \
-m 1
Requirement 3: the default field delimiter of the imported data is ','; change it to '\t', and replace nulls (empty string for string columns, 0 for non-string columns):
sqoop import \
--connect jdbc:mysql://hadoop001:3306/sqoop \
--username root \
--password 960210 \
--table emp \
--mapreduce-job-name frommysql2hdfs2 \
--delete-target-dir \
--target-dir emp_column \
--columns "empno,ename,job,sal,comm" \
--fields-terminated-by '\t' \
--null-string '' \
--null-non-string '0' \
-m 1

// --fields-terminated-by '\t' changes the field delimiter of the exported data to \t;
// --null-string '' and --null-non-string '0' mean that null string columns are written as an empty string and null non-string columns are written as 0
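
A quick way to verify the result, mirroring the earlier check; the part file name assumes a single map task (-m 1), and the relative target dir is assumed to resolve to /user/hadoop/emp_column as with the first import:

hdfs dfs -text /user/hadoop/emp_column/part-m-00000
# Fields should now be tab-separated; null string columns appear as empty strings
# and null numeric columns as 0.
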
Common interview questions:

1. What is the difference between Sort by and order by?

  • sort by orders each partition (each reducer's output); order by guarantees global order.

2. What is the difference between hadoop fs -ls and hadoop fs -ls /?
Without the slash it lists the current user's HDFS home directory, /user/<current user>, e.g. /user/hadoop.
With the slash it lists the root directory of HDFS.


Origin blog.csdn.net/SparkOnYarn/article/details/105291173