Sword Finger Data Warehouse - Hive02

1. Review of the last lesson

2. Hive02

3. Multi-level partition && static partition && dynamic partition in Hive

4. Interview questions

1. Review of the last lesson

  • https://blog.csdn.net/SparkOnYarn/article/details/105140753

  • Programming directly in MapReduce is difficult and cannot meet the need for SQL-style queries; Hive is mainly used for the offline data warehouse; advantages and disadvantages of Hive compared with ordinary MySQL; the Hive execution flow; MetaStore metadata information; and the PK (comparison) example about offline data with 350,000 concurrency.

  • Hive needs two kinds of data, the actual data on HDFS and the metadata in MySQL; both are indispensable. When deploying, the four connection properties in hive-site.xml must be configured and the MySQL JDBC driver jar must be copied to $HIVE_HOME/lib;

  • A Hive table corresponds to a folder on HDFS; Hive uses several kinds of separators: the row separator is \n and the column (field) separator is \t

  • Create a database, optionally specifying its path at creation time, and modify its properties; a database is organized as a folder. To delete a database, you must first delete the tables inside it; if the database is no longer needed at all, use cascade to drop it directly. A short HiveQL sketch follows.
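
A minimal HiveQL sketch of those database operations (the database name, path, and property below are illustrative, not from the lesson):

-- create a database at a custom HDFS location (path is hypothetical)
create database if not exists d_demo comment 'demo database' location '/user/hive/warehouse/d_demo.db';
-- modify a database property
alter database d_demo set dbproperties ('creator'='hadoop');
-- drop: fails if the database still contains tables ...
drop database if exists d_demo;
-- ... unless CASCADE is used to drop the tables first
drop database if exists d_demo cascade;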

2. Hive02

2.1 Table DDL statement

  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable

The syntax for creating a table:

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

Square brackets are optional; parentheses must be filled in where used:
1. The most basic CREATE TABLE, with only the required parts:
create table emp(
	colname datatype,
	colname1 datatype,
	colname2 datatype
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

//This means fields are separated from each other by the tab character

2.2 Several ways to create a table

1. Create a table with a DDL statement

1. Prepare the data file emp.txt; the column meanings are:

Employee No.	Name	Job	Manager No.	Hire Date	Salary	Commission	Dept No.
7369	SMITH	CLERK	7902	1980-12-17	800.00		20
7499	ALLEN	SALESMAN	7698	1981-2-20	1600.00	300.00	30
7521	WARD	SALESMAN	7698	1981-2-22	1250.00	500.00	30
7566	JONES	MANAGER	7839	1981-4-2	2975.00		20
7654	MARTIN	SALESMAN	7698	1981-9-28	1250.00	1400.00	30
7698	BLAKE	MANAGER	7839	1981-5-1	2850.00		30
7782	CLARK	MANAGER	7839	1981-6-9	2450.00		10
7788	SCOTT	ANALYST	7566	1987-4-19	3000.00		20

2. Create a table according to the data:

create table emp(
	empno int,
	ename string,
	job string,
	mgr int,
	date string,
	sal double,
	comm double,
	deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

hive (ruozedata_hive)> create table emp(emono int,ename string,job string,mgr int,date string,sal double,comm double,deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.077 seconds

3. Load data:

	LOAD DATA LOCAL INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;

	For this example, write it as:
	load data local inpath '/home/hadoop/data/emp.txt' into table emp;

4. Verify whether it has been loaded successfully:

hive (ruozedata_hive)> select * from emp;
OK
emp.emono       emp.ename       emp.job emp.mgr emp.date        emp.sal emp.comm emp.deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    20
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    30
7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    10
Time taken: 0.052 seconds, Fetched: 15 row(s)
Note: If the data you load into Hive shows up as NULL, the most likely cause is that the separator declared when creating the table does not match the separator used in the file, i.e. the schema does not match the file content.
2. Create a table based on an existing table (copies only the table structure):
hive (ruozedata_hive)> create table emp2 like emp;
OK
Time taken: 0.096 seconds
hive (ruozedata_hive)> select * from emp2;
OK
emp2.emono      emp2.ename      emp2.job        emp2.mgr        emp2.date       emp2.sal  emp2.comm       emp2.deptno
Time taken: 0.043 seconds
3. Copy both table data and table structure:
hive (ruozedata_hive)> create table emp3 as select * from emp;
Query ID = hadoop_20200328172424_0442dd5a-507d-4b4f-b00b-85b3799af226
Total jobs = 3
Launching Job 1 out of 3
  • A file brought in with LOAD keeps its original name (emp.txt), while a file produced by a MapReduce job is named like 000000_0:
    LOAD simply moves the existing file into place;
    CREATE TABLE ... AS SELECT runs a MapReduce job to produce the data

2.3 Basic SQL commands in Hive && modify table information && delete table

Basic SQL commands in Hive:

1. Query the table name in ruozedata_hive in the default database:

hive (default)> show tables in ruozedata_hive;
OK
tab_name
emp
emp2
stu2
Time taken: 0.024 seconds, Fetched: 3 row(s)

2. Query the table-building statement, only for viewing:

hive (default)> show create table ruozedata_hive.emp;

3. Query the table under the current database based on fuzzy matching:

hive (ruozedata_hive)> show tables 'emp*';
OK
tab_name
emp
emp2
Time taken: 0.015 seconds, Fetched: 2 row(s)
Modify table information:

1. Rename the table; the corresponding directory on HDFS is renamed as well:
hive (ruozedata_hive)> alter table emp3 rename to emp3bak;
OK
Time taken: 0.089 seconds

The difference between drop and truncate:

1. drop table emp2; deletes both the table structure (metadata) and the table data
2. truncate table emp3; deletes only the table data and keeps the table structure

2.4 External and internal tables in Hive (very important; one of the most frequently asked interview topics)

MANAGED_TABLE: internal (managed) table; when it is dropped, both the data on HDFS and the metadata in MySQL are deleted;

EXTERNAL_TABLE: external table; when it is dropped, the metadata in MySQL is deleted but the files on HDFS are retained;

1. Test the creation and deletion of internal tables:

	1. In Hive, run:
	create table emp_managed as select * from emp;

	2. select * from emp_managed;

	3. drop table emp_managed;

	4. Go into MySQL and look at the contents of ruozedata_hive.tbls:
	*************************** 4. row ***************************
        TBL_ID: 13
   CREATE_TIME: 1585388797
         DB_ID: 6
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
        OWNER_TYPE: USER
         RETENTION: 0
             SD_ID: 13
          TBL_NAME: emp_managed
          TBL_TYPE: MANAGED_TABLE
	VIEW_EXPANDED_TEXT: NULL
	VIEW_ORIGINAL_TEXT: NULL
	4 rows in set (0.00 sec)
	
  • When deleting the internal table, the data on hdfs and the metadata information in mysql will be deleted;

2. Test the creation and deletion of external tables:

1. Create the external table:
create external table emp_external(emono int,ename string,job string,mgr int,date string,sal double,comm double,deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. Load data and check the contents, as before:
load data local inpath '/home/hadoop/data/emp.txt' into table emp_external;
select * from emp_external;

3. drop table emp_external;

4. Query the tbls table in MySQL again:
the metadata record for emp_external can no longer be found
  • When deleting an external table, its metadata information in mysql is deleted, and the data on hdfs is retained.
Hive's official website describes external table and managed table:
  • https://cwiki.apache.org/confluence/display/Hive/Managed+vs.+External+Tables
How to convert directly between internal and external tables:

The syntax is as follows:

1. Change an internal (managed) table into an external table:
hive (ruozedata_hive)> alter table emp_managed set tblproperties('EXTERNAL'='TRUE');
OK
Time taken: 0.131 seconds
hive (ruozedata_hive)> desc formatted emp_managed;

2. Drop the (now) external table:
- Since it has been changed to an external table, it now behaves like one: after dropping it, the data on HDFS is retained and only the metadata in MySQL is deleted


2.5 Typical applications of external tables

  • Typical application of external tables:
    Use Flume to collect logs to HDFS, for example under a path like /ruozedata/access/day=2020-03-28/...; the data needs ETL cleaning, and after cleaning it is still stored on HDFS as a table. Since you know how many columns the cleaned data has and what its separator is, you can create an external table and point it at that path with LOCATION. This is very safe: even if the table is dropped, the source data on HDFS is retained. A sketch follows below.
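
    A hedged sketch of that pattern (the table name, columns, and path are illustrative assumptions):

    create external table access_log(
        ip string,
        url string,
        status int
    )
    PARTITIONED BY (day string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/ruozedata/access/';
    -- dropping access_log removes only the metastore entry; the files under /ruozedata/access/ stay on HDFS
    -- each day=... directory still needs to be registered as a partition (see section 2.6)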

  • Every operation a user performs must be recorded, and each action logs a lot of information. In MySQL this forces you to split the log into per-day tables such as log_20190327 and log_20190328, and to index the relevant columns.

–> In Hive, without partitioning, all days (0327, 0328, ...) sit in a single log table
select… from log where day = '...'
A full-path scan involves a huge amount of I/O: reading data –> disk IO, distributed computing –> network IO

–> Partition table: a partition actually corresponds to a folder/directory on HDFS:
log
  day=20190327
  day=20190328

Compare: Search in full directory vs search in specified directory?

2.6 Hive partition table && refresh metadata information (msck)

  • Think of the 10086 telecom hotline: press 1 to enter the call-charge query menu, then press 2 at the next level to enter the recharge menu; each such request is a ticket (service request)

      1. Create the order partition table order_partition:
      create table order_partition(
      	order_no string,
      	event_time string
      )
      PARTITIONED BY (event_month string)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    
      hive (ruozedata_hive)> create table order_partition(order_no string,event_time string) partitioned by (event_month string) row format delimited fields terminated by '\t';
      OK
      Time taken: 0.057 seconds
    
      2. A partition column is not a real table column; on HDFS it corresponds to a folder
      # col_name              data_type               comment             
               
      order_no                string                                      
      event_time              string                                      
                       
      # Partition Information          
      # col_name              data_type               comment             
                       
      event_month             string      
    
      3. Load data into the partition table:
      load data local inpath '/home/hadoop/data/order_created.txt' into table order_partition PARTITION (event_month='2020-01');
    
      4. Query the partition table using the partition condition:
      hive (ruozedata_hive)> select * from order_partition where event_month='2020-01';
      OK
      order_partition.order_no        order_partition.event_time      order_partition.event_month
      10703007267488  2014-05-01 06:01:12.334+01      2020-01
      10101043505096  2014-05-01 07:28:12.342+01      2020-01
      10103043509747  2014-05-01 07:50:12.33+01       2020-01
      10103043501575  2014-05-01 09:27:12.33+01       2020-01
      10104043514061  2014-05-01 09:03:12.324+01      2020-01
      Time taken: 0.204 seconds, Fetched: 5 row(s)
    
      //Note: when querying a partition table, filter by the partition column
    


  • Suppose we have cleaned some data, created the 2020-02 folder directly on HDFS, and put the data there; if we then query it in Hive, it is not visible.

  • For partition tables, if you write data straight to HDFS, a normal SQL query cannot find it. Why? Because the metastore has no record of the partition event_month=2020-02.

      1. Create the directory on HDFS (with -p, recursively), alongside the 2020-01 partition folder:
    [hadoop@hadoop001 data]$ hdfs dfs -mkdir -p /user/hive/warehouse/ruozedata_hive.db/order_partition/event_month=2020-02
      20/03/28 23:01:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      
      2. Upload the local Linux file into the 2020-02 directory:
      [hadoop@hadoop001 data]$ hdfs dfs -put /home/hadoop/data/order_created.txt /user/hive/warehouse/ruozedata_hive.db/order_partition/event_month=2020-02/
      
      3. Test directly in Hive whether this data can be queried:
      hive (ruozedata_hive)> select * from order_partition where event_month='2020-02';
      OK
      order_partition.order_no        order_partition.event_time      order_partition.event_month
      Time taken: 0.106 seconds
    
      4. Verified: the 2020-02 partition data cannot be queried in Hive.
      The reason is that the metadata in MySQL has no record of it; go to MySQL to verify:

      //The order partition table we created has TBL_ID 18;
      mysql> select * from tbls \G;
      *************************** 4. row ***************************
          TBL_ID: 18
     CREATE_TIME: 1585406888
           DB_ID: 6
    LAST_ACCESS_TIME: 0
               OWNER: hadoop
          OWNER_TYPE: USER
           RETENTION: 0
               SD_ID: 18
            TBL_NAME: order_partition
            TBL_TYPE: MANAGED_TABLE
      VIEW_EXPANDED_TEXT: NULL
      VIEW_ORIGINAL_TEXT: NULL
      4 rows in set (0.00 sec)
    
      //Partition table info is stored in the partitions table; querying by TBL_ID=18 shows that MySQL has no record for event_month=2020-02:
      mysql> select * from partitions where TBL_ID=18;
      +---------+-------------+------------------+---------------------+-------+--------+
      | PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |
      +---------+-------------+------------------+---------------------+-------+--------+
      |       1 |  1585407336 |                0 | event_month=2020-01 |    19 |     18 |
      +---------+-------------+------------------+---------------------+-------+--------+
      1 row in set (0.00 sec)
      
      5. How to fix it?
      Refresh the partition information: msck repair table order_partition;
      hive (ruozedata_hive)> msck repair table order_partition;
      OK
      Partitions not in metastore:    order_partition:event_month=2020-02
      Repair: Added partition to metastore order_partition:event_month=2020-02
      Time taken: 0.249 seconds, Fetched: 2 row(s)
    
      	6. Query MySQL again; the metadata record is now there:
      	mysql> select * from partitions where TBL_ID=18;
      +---------+-------------+------------------+---------------------+-------+--------+
      | PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |
      +---------+-------------+------------------+---------------------+-------+--------+
      |       1 |  1585407336 |                0 | event_month=2020-01 |    19 |     18 |
      |       2 |  1585409080 |                0 | event_month=2020-02 |    20 |     18 |
      +---------+-------------+------------------+---------------------+-------+--------+
      2 rows in set (0.00 sec)
    


Note: if the partition metadata is not refreshed, the data cannot be queried; msck must not be used in production because it is a heavy operation

Instead of msck, use another operation:

  • alter table order_partition ADD IF NOT EXISTS PARTITION(event_month='2020-03');

  • How do you show which partitions a table has?

  • hive (ruozedata_hive)> show partitions order_partition;
    OK
    partition
    event_month=2020-01
    event_month=2020-02
    event_month=2020-03
    Time taken: 0.049 seconds, Fetched: 3 row(s)
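
Going back to the manually created 2020-02 directory: instead of msck, that single partition can be registered explicitly. A sketch (the LOCATION clause is optional when the directory already follows the default layout):

alter table order_partition add if not exists partition (event_month='2020-02')
location '/user/hive/warehouse/ruozedata_hive.db/order_partition/event_month=2020-02';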

3. Multi-level partition && static partition && dynamic partition in Hive

  • The difference from a single-level partition is that more than one partition column is specified when creating the table, and all of them must be specified when loading data:

      1. Create a two-level partition table:
      create table order_multi_partition(
      		order_no string,
      		event_time string
      	)
      	PARTITIONED BY (event_month string,step string)
      	ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
    
      2. Two partition levels correspond to two directory levels:
      # Partition Information          
      # col_name              data_type               comment             
                       
      event_month             string                                      
      step                    string   
      
      3. Loading data works the same way:
      hive (ruozedata_hive)> load data local inpath '/home/hadoop/data/order_created.txt' into table order_multi_partition partition(event_month='2020-01',step='1');
      Loading data to table ruozedata_hive.order_multi_partition partition (event_month=2020-01, step=1)
      Partition ruozedata_hive.order_multi_partition{event_month=2020-01, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
      OK
      Time taken: 0.27 seconds
    
Note: When loading data into a partitioned table, be sure to specify all of the partition columns; omitting any of them is not allowed.
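Querying follows the same rule: filter on the partition columns. A sketch based on the table just created:

select * from order_multi_partition where event_month='2020-01' and step='1';
-- on HDFS this corresponds to .../order_multi_partition/event_month=2020-01/step=1/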
Create the department employee partition table (note that the partition column deptno must not also appear in the regular column list):
hive (ruozedata_hive)> create table emp_partition(emono int,ename string,job string,mgr int,date string,sal double,comm double) PARTITIONED BY (deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  • Use sql to insert data from the emp table into the emp_partition table:

      Standard syntax:
      INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
      INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
    
      2. For example:
      insert overwrite table emp_partition PARTITION (deptno=10) select * from emp where deptno=10;
      
      Running this in Hive produces the following error:
      FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different '10': Table insclause-0 has 7 columns, but query has 8 columns.

      //Cause of the error: the partition table we created has only 7 regular columns, while select * from emp returns 8 columns.
      
      3. The clumsiest fix: list the columns one by one:
      insert overwrite table emp_partition PARTITION (deptno=10) select empno,ename,job,mgr,date,sal,comm from emp where deptno=10;
      //This approach is very tedious
    

–> Dynamic partitioning (it differs from static partitioning mainly in the parameters that must be set; the insert syntax simply omits the partition value):
is there a more efficient way to insert the data?

Create a dynamic partition:

  • The goal is for each row to automatically land in the correct deptno partition:

    1. Syntax:
    create table emp_dynamic_partition(empno int,ename string,job string,mgr int,date string,sal double,comm double) PARTITIONED BY (deptno int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    2. Insert with a dynamic partition:
    hive (ruozedata_hive)> insert overwrite table emp_dynamic_partition PARTITION (deptno) select emono,ename,job,mgr,date,sal,comm,deptno from emp;
    FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
    hive (ruozedata_hive)> set hive.exec.dynamic.partition.mode=nonstrict;

  • The prerequisite for dynamic partitioning is setting the mode to nonstrict; a full sketch follows below
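
A minimal sketch of the full dynamic-partition flow, assuming the emp and emp_dynamic_partition tables created above (column names follow the emp table as created earlier in this lesson):

set hive.exec.dynamic.partition=true;            -- enable dynamic partitioning (often already true by default)
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic

insert overwrite table emp_dynamic_partition partition (deptno)
select emono,ename,job,mgr,date,sal,comm,deptno from emp;
-- the partition column (deptno) must come last in the select list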

3.1 Data loading in Hive

  • 1. Syntax for loading data:

      LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
    
      LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
    

load data: load data;
local: load from the local Linux filesystem; without LOCAL the path refers to HDFS;
inpath: the path to load from;
overwrite: overwrite the existing data; without OVERWRITE the data is appended (which can duplicate data)
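
A few hedged examples of those variants (the paths are illustrative):

-- from the local Linux filesystem, appending to the table
load data local inpath '/home/hadoop/data/emp.txt' into table emp;
-- from HDFS (no LOCAL): the source file is moved into the table directory
load data inpath '/ruozedata/emp.txt' into table emp;
-- OVERWRITE replaces whatever the table (or partition) already contains
load data local inpath '/home/hadoop/data/emp.txt' overwrite into table emp;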

1. Load data from HDFS into the emp table; the file /ruozedata/emp.txt is moved into the directory /user/hive/warehouse/ruozedata_hive.db/emp:
hive (ruozedata_hive)> load data inpath '/ruozedata/emp.txt' into table emp;

2. Why is the source file gone? Because emp is an internal (managed) table, and its entire life cycle is managed under /user/hive/warehouse:


For example, emp3 is the table we need; how do we write data into it from emp?
  • create table emp3 as select empno,ename from emp;

  • create table… as select…: the target table cannot exist beforehand

      1. Select columns from emp and write them into emp3:
      hive (ruozedata_hive)> create table emp3 as select empno,ename from emp;
      Query ID = hadoop_20200329142929_60ac5c1f-2dbd-4688-9316-aedcde01c186
      Total jobs = 3
      Launching Job 1 out of 3
      Number of reduce tasks is set to 0 since there's no reduce operator
    
  • insert overwrite table emp4 select * from emp; here the target table must exist in advance, so first copy the table structure: create table emp4 like emp;

  • Also supported syntax: from emp insert into table emp4 select *;

Note: There are 2 kinds of problems that may occur:
1. The target table has 8 columns but your SQL selects only two, so it fails:
hive (ruozedata_hive)> from emp insert into table emp4 select ename,mgr;
FAILED: SemanticException [Error 10044]: Line 1:27 Cannot insert into target table because column number/types are different 'emp4': Table insclause-0 has 8 columns, but query has 2 columns.

2. If two columns have the same type but you swap their order when inserting, the statement runs but the data is wrong:
hive (ruozedata_hive)> insert overwrite table emp4 select emono,job,ename,mgr,date,sal,comm,deptno from emp;

3. Because the column order is reversed, the result of the second insert is garbage: querying emp4 then returns wrong data:
hive (ruozedata_hive)> select * from emp4 where ename = 'SMITH';

3.2 Data export in Hive

1. Write to a local Linux directory; the directory /home/hadoop/tmp/hivetmp will be created automatically if it does not exist:

  • hive (ruozedata_hive)> insert overwrite local directory '/home/hadoop/tmp/hivetmp' select empno,ename,mgr from emp;

  • We have not specified any separator, so the fields run together:

      [hadoop@hadoop001 hivetmp]$ cat 000000_0 
      7369SMITH7902
      7499ALLEN7698
      7521WARD7698
      7566JONES7839
    
  • Specify the separator: insert overwrite local directory '/home/hadoop/tmp/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select empno,ename,mgr from emp;

      [hadoop@hadoop001 hivetmp]$ cat 000000_0 
      7369    SMITH   7902
      7782    CLARK   7839
      7788    SCOTT   7566
      7839    KING    \N
    

2. Write to an HDFS directory: drop the LOCAL keyword and specify an HDFS path:

  • hive (ruozedata_hive)> insert overwrite directory '/ruozedata/' select empno,ename,mgr from emp;

  • Also specify the separator: insert overwrite directory '/ruozedata/' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select empno,ename,mgr from emp;
    3. Get directly from hdfs to the local linux system directory:

  • hdfs dfs -get /ruozedata/000000_0 /home/hadoop/data/

4. Use the interactive command: hive -e

1. hive -e output can also be piped through grep to filter rows:
[hadoop@hadoop001 data]$ hive -e 'select * from ruozedata_hive.emp' |grep SALESMAN > file
[hadoop@hadoop001 data]$ cat file
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30
7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   30
7521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   30
7654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  30
7844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     30

5. Sqoop can do data import and export operations

3.3 Use SQL to query

1. Rename columns with aliases when selecting:
hive (ruozedata_hive)> select empno as id,ename as name from emp;
OK
id      name
7369    SMITH
7499    ALLEN
7521    WARD

2. Filter rows according to a where condition:
hive (ruozedata_hive)> select * from emp where ename = 'SMITH';
OK
emp.empno       emp.ename       emp.job emp.mgr emp.date        emp.sal emp.comm      emp.deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20

hive (ruozedata_hive)> select * from emp where sal between 800 and 1500;
OK
emp.empno       emp.ename       emp.job emp.mgr emp.date        emp.sal emp.comm      emp.deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20

hive (ruozedata_hive)> select * from emp where ename in ('SMITH','john');
OK
emp.empno       emp.ename       emp.job emp.mgr emp.date        emp.sal emp.comm      emp.deptno
7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    20

hive (ruozedata_hive)> select * from emp where sal like '1%';

//like uses wildcards such as _ and %
hive (ruozedata_hive)> select * from emp where sal like '_2%'

//rlike takes a regular expression
hive (ruozedata_hive)> select * from emp where sal rlike '[2]'
Null values in production data can appear in several forms: null, NULL, the empty string "", or a blank string " "
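
A small sketch of filtering those different representations (the string check on ename is illustrative):

-- rows where comm is a true NULL
select * from emp where comm is null;
-- for string columns, also catch empty or blank strings that are not true NULLs
select * from emp where ename is null or trim(ename) = '';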

3.4 Aggregate functions in Hive

1. Count the total number of rows with deptno=10 --> i.e. count occurrences of deptno=10:
hive (ruozedata_hive)> select count(1) from emp where deptno=10;
Query ID = hadoop_20200329155151_17c78e33-141a-4f58-9109-fca8b3a22f5a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1

2. Find the maximum, minimum, sum, and average salary:
hive (ruozedata_hive)> select max(sal),min(sal),sum(sal),avg(sal) from emp;
Query ID = hadoop_20200329155252_1f9c3d47-a567-4884-afe7-5d1c062bfba4

3. Average salary for each department (group by department):
hive (ruozedata_hive)> select deptno,avg(sal) from emp group by deptno;

//Any column in the select list that does not appear in group by must appear inside an aggregate function;
A wrong example:
hive (ruozedata_hive)> select ename,deptno,avg(sal) from emp group by deptno;
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'ename'

4. Departments whose average salary is greater than 2000:
select deptno,avg(sal) as avg_sal from emp group by deptno having avg_sal > 2000;
deptno  avg_sal
NULL    10300.0
10      2916.6666666666665
20      2175.0

3.5 Why do some SQL statements run MapReduce while others don't

  • On the hive parameter configuration page: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties, find this parameter: hive.fetch.task.conversion

Check the value of this parameter in the current Hive session (it is more); see the official site for its interpretation:
hive (ruozedata_hive)> set hive.fetch.task.conversion;
hive.fetch.task.conversion = more

This parameter can be set to none, minimal, or more;
1. When set to none, every operation runs MapReduce
2. When set to more: simple select / filter / limit queries do not run MapReduce

Remember one thing: select *, selecting specific columns, and simple filter conditions (on a partition or on a column value) do not run MapReduce; anything involving statistics/aggregation will run MapReduce
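
A quick way to observe the difference, sketched with the emp table from this lesson:

set hive.fetch.task.conversion=none;
select * from emp limit 3;                          -- now even this launches a MapReduce job
set hive.fetch.task.conversion=more;
select * from emp limit 3;                          -- fetched directly, no MapReduce
select deptno,count(*) from emp group by deptno;    -- aggregation still runs MapReduce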

4. Interview questions

1. What is the difference between where and having in Hive?

  • where filters individual rows (before grouping); having filters groups after aggregation
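
A small example using the emp table from this lesson:

-- where filters individual rows before grouping
select deptno,avg(sal) from emp where sal > 1000 group by deptno;
-- having filters groups after aggregation
select deptno,avg(sal) as avg_sal from emp group by deptno having avg_sal > 2000;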

2. What is the difference between count(1) and count(*)?
