Sword Finger Data Warehouse-Hive01

1. Review

2. Hive01

3. Installation and deployment of Hive

4. Hive's DDL operations

5. Interview questions that may be involved

1. Review

Hadoop:
Narrow sense: only the Apache Hadoop software itself (the most important part of the foundation)
Broad sense: the Hadoop ecosystem: Hadoop, Hive, Sqoop, HBase ...

Each component solves a specific problem. For example, Hive is a SQL analysis engine; in production, multiple frameworks are usually combined to cover a business scenario.

The key skills are how to quickly locate a problem and how to quickly solve it.

For example, Hadoop frequently needs to handle business logic such as join / group by.

Writing this with MapReduce programming is very complex. Take wordcount as an example:
  • https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  • It needs local development, testing, and packaging, and the jar is then thrown to the server to run; later maintenance is very troublesome. (A HiveQL equivalent is sketched below.)
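
For contrast, the same word count that takes a page of MapReduce code is a few lines of HiveQL. A minimal sketch, assuming a hypothetical table docs(line string) that already holds the text files on HDFS:

    -- split each line on whitespace, explode the words into rows, group and count
    select word, count(1) as cnt
    from (select explode(split(line, '\\s+')) as word from docs) t
    group by word;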

2.1 Hive background (leading to Schema)

===> Hive background

  • MapReduce-oriented programming is very troublesome: for example, practitioners coming from RDBMS relational databases find it very laborious to develop for it.
    This is the main disadvantage of processing big data directly with Hadoop MapReduce.
Schema information:
  • The equivalent concept in MySQL: table names, column names, column types in a relational database
  • In MySQL or Oracle, before we can run SQL statistics we must first create a table (table name, column names, column types ...),
    and then use SQL for the statistical analysis of the corresponding business

Files on HDFS are mostly plain text (txt), which carries no table name, column names, or column types; the most we can do with a text file is split each line on a separator (\t) to know what the first column is and what the second column is.

Source data vs metadata:
Source data: the data itself on HDFS
Metadata: data that describes the data, e.g. table names, column names, column types
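
As a small illustration (the file, path, and table below are hypothetical): a raw line in an HDFS text file carries no schema, and a table definition supplies the schema that Hive records in the metastore:

    -- a raw tab-separated line in /user/hadoop/data/stu.txt has no schema:
    --   1	john	24
    -- the table definition below layers the schema on top of that text:
    create table stu_txt(id int, name string, age int)
    row format delimited fields terminated by '\t';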

2.2 First acquaintance with Hive

Hive official website: hive.apache.org

The official website describes Hive:

  • The Apache Hive data warehouse software facilitates
    reading,
    writing,
    and managing large datasets residing
    in distributed storage (HDFS/S3)
    using SQL.

  • A command line tool
    and JDBC driver are provided to connect users to Hive.

Translation:

  • Hive is data warehouse software that can read, write, and manage large datasets on distributed file systems using SQL.
The arrival of the Hive framework was a great boon for RDBMS practitioners. Whether it is Spark or Flink, both are pushing batch processing; the frameworks are related, which makes them easier to learn together.
Background: Hive was open-sourced by Facebook to solve statistical analysis of massive structured logs across various dimensions; Hive is a data warehouse built on top of Hadoop.

1. The data in Hive is stored on HDFS;
2. Hive's underlying execution engine is MapReduce / Spark / Tez; switching engines only requires changing one parameter (see the sketch after this list):

(hive (ruozedata_hive)> set hive.execution.engine;
 hive.execution.engine=mr)

3. Hive jobs are submitted to Yarn to run;
4. Hive provides HiveQL query language, which is similar to SQL, but not completely the same.
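
Switching engines, for example, is a single set (a sketch; it assumes Tez or Spark is already installed and configured on the cluster):

    set hive.execution.engine;        -- show the current value (mr by default)
    set hive.execution.engine=tez;    -- queries in this session now run on Tez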

Interview question: What is the relationship between HiveQL and SQL?
  • Answer: apart from similar syntax, they have nothing to do with each other
Big data "cloudification": when a relational database is migrated to a big data platform, the original SQL needs to be sorted out. Replacing the platform means considering syntax compatibility and understanding the business logic, and then re-implementing it in HiveQL or Spark SQL. This is a big project.

Hive is suitable for offline processing / batch processing (processing a batch of data at once), not for real-time processing / stream processing.
To summarize so far: data on HDFS (text, compressed, columnar storage); Hive uses SQL to process data on Hadoop.

SQL on Hadoop: Hive / Presto / Impala / Spark SQL. If the SQL syntax is not fully compatible, you need to write custom UDF functions.

Advantages and disadvantages of Hive:

Advantages: SQL (convenient processing, wide audience)

Disadvantages: if Hive's underlying execution engine is MapReduce, execution performance is inevitably low.

2.3 Hive architecture

Hive completes the following for us automatically:

  • A SQL statement is just an ordinary string. Once it enters the Driver it goes through the SQL Parser (syntax parsing) -> Query Optimizer -> logical / physical execution plan generation -> (logic the built-in syntax cannot express is handled by custom UDF functions) -> SerDes (serialization / deserialization) -> Execution.

  • Which DB a table belongs to, the table's default storage path on HDFS (/user/hive/warehouse), and the table's field information are all stored in the MetaStore, and MapReduce needs this information. The MetaStore (the red box in the figure) is very important: its metadata lives in Derby by default, or in MySQL. Derby is generally not used; use MySQL. MySQL must not be a single point here, since a single point of failure would take the service out; production uses a MySQL master/standby setup.

Hive architecture

For a detailed description of the above picture:

The first layer is the user interface layer: the Linux command line (CLI) and JDBC

The second layer is the Driver:

SQL parsing: SQL --> AST (ANTLR abstract syntax tree)
Query optimization: logical / physical execution plans
UDFs / SerDes
Execution: run the job

Metadata: table name, columns (name, type, index), database, table type (internal or external table), path of the table data on HDFS
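
To see what the Driver produces after parsing and optimization, EXPLAIN prints the execution plan; for example, against the stu table created later in this article:

    hive (default)> explain select age, count(*) from stu group by age;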

Interview question: What is the relationship between Hive and RDBMS?

Hive and RDBMS comparison:


  • Similarities: both use SQL. Hive now supports transactions (generally not used; the main reason: Hive is batch processing), and both support insert, update, delete. For an offline Hive data warehouse, writes are relatively rare; generally data is batch-loaded into Hive and then statistically analyzed;

  • Differences: volume and latency: in a relational database, a button click on a page returns results immediately; Hive's latency is much higher. The difference also shows in cluster scale: Hive scales out across many commodity machines, while an RDBMS typically runs on a small number of high-performance machines;

One thing we should never do with Hive: wire a page button to a Hive job running in the background. With a small amount of data that might be fine, but with a large amount of data such a job can easily run for 8 or 9 hours.

2.4 Which Hive version introduced insert / update / delete

  • Website: https://cwiki.apache.org/confluence/display/Hive#Home-UserDocumentation
A few points to note:

1. insert syntax supported since Hive 0.14
2. update syntax supported since Hive 0.14
3. delete syntax supported since Hive 0.14
These version numbers refer to Apache Hive.

3. Hive installation and deployment

1. Prerequisites:
at least JDK 1.7 and Hadoop 2.x; also note where the metadata is stored: MySQL needs to be deployed in advance

2. Two ways of installation:

  • Download the release version and unzip the installation
  • Compile from Hive source code

We choose the decompress-and-install route here:
because our Hadoop version is hadoop-2.6.0-cdh5.16.2, we use the matching hive-1.1.0-cdh5.16.2 build of Hive. Download it with wget: wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.16.2.tar.gz (downloading from within China is extremely slow; an overseas server helps, haha).

We download the URL of the CDH version: http://archive.cloudera.com/cdh5/cdh/5/

1. Download and extract:
[hadoop@hadoop001 software]$ tar -xzvf hive-1.1.0-cdh5.16.2.tar.gz -C /home/hadoop/app/ 

2. Configure environment variables: vi ~/.bashrc
export HIVE_HOME=/home/hadoop/app/hive
export PATH=${HIVE_HOME}/bin:$PATH

3. Apply the environment variables:
source ~/.bashrc

4. After applying, verify with which:
[hadoop@hadoop001 app]$ which hive
~/app/hive/bin/hive
HIVE_HOME directory layout:

bin: scripts
lib: dependency jars
conf: configuration files

3. Configure $HIVE_HOME/conf/hive-site.xml

  • By default there are only 4 files in this directory; you need to create a new hive-site.xml with the content below:

  • Analysis of the hive-site.xml settings: connection URL (port 3306 of the MySQL instance on the hadoop001 machine; the ruozedata_hive database does not need to be created manually, it is created automatically if it does not exist), driver class name (the MySQL driver), username (MySQL user), password (MySQL password);

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://hadoop001:3306/ruozedata_hive?createDatabaseIfNotExist=true</value>
      </property>
      
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
    
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
      </property>
    
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>xxxx</value>
      </property>	
    
      <property>
      	<name>hive.cli.print.current.db</name>
      	<value>true</value>
      </property>
      <!-- print the current database name in the CLI prompt -->
      
      <property>
      	<name>hive.cli.print.header</name>
      	<value>true</value>
      </property>
      <!-- print column headers in query results -->
    

4. Upload the MySQL driver package to the $HIVE_HOME/lib directory:

  • You can download it from the MySQL website or the Maven repository; I use mysql-connector-java-5.1.47.jar

5. You must start HDFS before starting Hive: start-dfs.sh

3.1 Problems in the Hive startup process

The first error: SSL authentication problem
Fri Mar 27 22:18:58 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

Translation: establishing an SSL connection without server identity verification is not recommended.
Fix: append useSSL=true to the connection URL (note that & must be escaped as &amp; inside XML):
<value>jdbc:mysql://hadoop001:3306/ruozedata_hive?createDatabaseIfNotExist=true&amp;useSSL=true</value>
The second error: a metastore problem (learn to read the logs)
hive (default)> use default;
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

First of all, know that a solution found on Baidu may not apply to your case; you must learn to read the logs. The Hive log path is /tmp/{username}/hive.log

1. The log path is as follows:
[root@hadoop001 hadoop]# pwd
/tmp/hadoop
[root@hadoop001 hadoop]# ll
total 5508
drwx------ 2 hadoop hadoop    4096 Mar 27 22:27 34490222-9c95-43ba-a00f-e0dddee5cd14
-rw-rw-r-- 1 hadoop hadoop       0 Mar 27 22:27 34490222-9c95-43ba-a00f-e0dddee5cd146362926509336115670.pipeout
-rw-rw-r-- 1 hadoop hadoop       0 Mar 27 22:27 34490222-9c95-43ba-a00f-e0dddee5cd147456770903589387510.pipeout
-rw-rw-r-- 1 hadoop hadoop 5621893 Mar 27 22:46 hive.log

2. tail -F hive.log --> watch the log in real time; the following two lines appear:
Caused by: java.sql.SQLException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://hadoop001:3306/ruozedata_hive?createDatabaseIfNotExist=true&useSSL=true, username = root. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'root'@'hadoop001' (using password: YES)

// Access denied -- it turns out the username/password was set wrong. Also check whether MySQL allows access from other hosts, i.e., whether the '%' grant exists.
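
The corresponding MySQL-side fix is sketched below (run in the MySQL client, not in Hive; MySQL 5.x syntax, and the password is illustrative):

    GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'xxxx';
    FLUSH PRIVILEGES;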

3.2 Basic SQL commands in Hive

1. Show all databases:
hive (default)> show databases;
OK
database_name
default
Time taken: 0.03 seconds, Fetched: 1 row(s)

2. Use the default database:
hive (default)> use default;
OK
Time taken: 0.025 seconds

3. Show tables:
hive (default)> show tables;
OK
tab_name
Time taken: 0.128 seconds

4. Create the stu table in the default database:
hive (default)> create table stu(id int,name string,age int);
OK
Time taken: 0.363 seconds

5. Describe the stu table:
hive (default)> desc stu;
OK
col_name        data_type       comment
id                      int                                         
name                    string                                      
age                     int                                         
Time taken: 0.175 seconds, Fetched: 3 row(s)

6. View the table structure:
- desc stu;
- desc extended stu;
- desc formatted stu;		// this form is the most informative
hive (default)> desc formatted stu;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      int                                         
name                    string                                      
age                     int                                         
                 
# Detailed Table Information             
Database:               default                  
OwnerType:              USER                     
Owner:                  hadoop                   
CreateTime:             Fri Mar 27 23:29:21 CST 2020     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://hadoop001:9000/user/hive/warehouse/stu    
Table Type:             MANAGED_TABLE            
Table Parameters:                
        transient_lastDdlTime   1585322961          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat     
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.138 seconds, Fetched: 29 row(s)

7. Show the CREATE TABLE statement: show create table stu;
hive (default)> show create table stu;
OK
createtab_stmt
CREATE TABLE `stu`(
  `id` int, 
  `name` string, 
  `age` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hadoop001:9000/user/hive/warehouse/stu'
TBLPROPERTIES (
  'transient_lastDdlTime'='1585322961')
Time taken: 0.133 seconds, Fetched: 14 row(s)

8. Insert statement: never use it like this in real work (demo only)
insert into stu values(1,'john',24);

9. select * from stu;
Why not use insert? Because even the simplest single-row insert runs a MapReduce job:
1. insert runs MapReduce even for a single row:
hive (default)> insert into stu values(1,"john",24);
Query ID = hadoop_20200327233535_a0100136-77e5-4d07-ad69-de835ec91403
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
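
What production does instead is bulk-load files. A minimal sketch, assuming a hypothetical local file /home/hadoop/data/stu.txt whose columns match the table's field delimiter:

    LOAD DATA LOCAL INPATH '/home/hadoop/data/stu.txt' INTO TABLE stu;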

3.3 Some parameters and commands in Hive

1. Hive's storage directory on HDFS:
the stu table we created is stored by default at Location: hdfs://hadoop001:9000/user/hive/warehouse/stu

So how do we change this default storage directory on HDFS?

  • The page is as follows: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties

The warehouse directory parameter is as follows:

hive.metastore.warehouse.dir
Default Value: /user/hive/warehouse
Added In: Hive 0.2.0
Location of default database for the warehouse.
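
To override it, add the property to hive-site.xml (a sketch; the path /user/hive/mywarehouse is illustrative):

    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/user/hive/mywarehouse</value>
    </property>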

2. Hive log file storage directory:

1. The path is as follows:
[hadoop@hadoop001 conf]$ pwd
/home/hadoop/app/hive/conf

2. Copy the template config file:
[hadoop@hadoop001 conf]$ cp hive-log4j.properties.template hive-log4j.properties

3. Edit the config file: vi hive-log4j.properties
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log

Because Linux periodically cleans /tmp, it is recommended to move the log directory:
hive.log.dir=/home/hadoop/tmp/hive

3. Hive configuration properties:
global: $HIVE_HOME/conf/hive-site.xml

Temporary / current session:
view a property: set key
set a property: set key=value

1. Check whether the show-current-database switch is on; it turns out to be off:
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false

2. Turn the show-database parameter on:
hive> set hive.cli.print.current.db=true;
hive (default)> 

// Note: session-level settings apply only to the current window; a newly opened window does not inherit them

4. View the table's storage path on HDFS from inside Hive:

hive (default)> dfs -ls /user/hive/warehouse;
Found 1 items
drwx------   - hadoop supergroup          0 2020-03-27 23:36 /user/hive/warehouse/stu

3.4 Interactive commands in Hive

View hive command help:

[hadoop@hadoop001 conf]$ hive -help
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/java/jdk1.8.0_45/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
20/03/28 00:13:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)
1. hive -e: run a SQL statement from the command line without entering the Hive shell:
  • hive -e "select * from stu;"

      [hadoop@hadoop001 conf]$ hive -e "select * from stu;"
      which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/java/jdk1.8.0_45/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
      20/03/28 00:13:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      
      Logging initialized using configuration in file:/home/hadoop/app/hive-1.1.0-cdh5.16.2/conf/hive-log4j.properties
      OK
      stu.id  stu.name        stu.age
      1       john    24
      Time taken: 6.21 seconds, Fetched: 1 row(s)
    
2. hive -f: execute the specified file (the file content must be SQL statements)
1. Write the following statement into ruoze.log:
[hadoop@hadoop001 data]$ cat ruoze.log 
select * from stu;

2. Run it with hive -f:
[hadoop@hadoop001 data]$ hive -f ruoze.log 
which: no hbase in (/home/hadoop/app/hive/bin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/java/jdk1.8.0_45/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
20/03/28 00:17:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Logging initialized using configuration in file:/home/hadoop/app/hive-1.1.0-cdh5.16.2/conf/hive-log4j.properties
OK
stu.id  stu.name        stu.age
1       john    24
Time taken: 5.523 seconds, Fetched: 1 row(s)
Extension: Hive-based offline statistics / data warehousing
  • How to productionize SQL with Hive: wrap the SQL in a shell script and call hive -e "query sql ..." (a sketch follows this list)
  • Scheduled execution: crontab
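
A minimal sketch of such a wrapper (script path, query, and output location are all hypothetical):

    #!/bin/bash
    # stat_stu.sh -- wrap a Hive query so crontab can run it
    DT=$(date -d '1 day ago' +%Y%m%d)
    hive -e "select count(*) from stu;" > /home/hadoop/report/stu_${DT}.txt

    # crontab entry: run every day at 02:00
    # 0 2 * * * /home/hadoop/shell/stat_stu.sh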

3.5 Data abstraction in Hive

The table (stu) in Hive must belong to a database (default)


Question: Are buckets in Hive files or folders?

4. DDL operations in Hive

A database contains 0 to N tables; each database corresponds to a folder on HDFS.

The creation syntax of the database:
1. The official syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

2. A vertical bar inside parentheses () means choose one; square brackets [] mean optional. Create a test database with:
create database ruozedata_hive;

3. After the database is created, check the HDFS directory: the directory name is the database name plus the suffix .db:
[hadoop@hadoop001 data]$ hdfs dfs -ls /user/hive/warehouse/
20/03/28 00:37:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
drwx------   - hadoop supergroup          0 2020-03-28 00:36 /user/hive/warehouse/ruozedata_hive.db
drwx------   - hadoop supergroup          0 2020-03-27 23:36 /user/hive/warehouse/stu

4. It is recommended to add IF NOT EXISTS when creating a database:
hive (default)> create database ruozedata_hive;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Database ruozedata_hive already exists
hive (default)> create database IF NOT EXISTS ruozedata_hive;
OK
Time taken: 0.014 seconds

2. Instead of using the default /user/hive/warehouse directory when creating a database, specify the directory yourself:

  • create database IF NOT EXISTS ruozedata_hive2 LOCATION '/wordcount';

3. The difference between a self-specified directory and the default storage directory, as seen via desc database:

1. ruozedata_hive2 uses a self-specified directory, so no .db folder shows up under the default warehouse directory:
hive (ruozedata_hive2)> desc database ruozedata_hive2;
OK
db_name comment location        owner_name      owner_type      parameters
ruozedata_hive2         hdfs://hadoop001:9000/wordcount hadoop  USER
Time taken: 0.019 seconds, Fetched: 1 row(s)

2. ruozedata_hive3 uses the default directory:
hive (ruozedata_hive2)> desc database ruozedata_hive3;
OK
db_name comment location        owner_name      owner_type      parameters
ruozedata_hive3         hdfs://hadoop001:9000/user/hive/warehouse/ruozedata_hive3.db  hadoop   USER
Time taken: 0.017 seconds, Fetched: 1 row(s)

4. Add properties when creating ruozedata_hive3:

hive (ruozedata_hive2)> create database if not exists ruozedata_hive3 COMMENT 'this is your first created database' WITH DBPROPERTIES('creator'='pk','date'='2020-03-29');
OK
Time taken: 0.049 seconds
hive (ruozedata_hive2)> desc database ruozedata_hive3;
OK
db_name comment location        owner_name      owner_type      parameters
ruozedata_hive3 this is your first created database     hdfs://hadoop001:9000/user/hive/warehouse/ruozedata_hive3.db   hadoop  USER
Time taken: 0.056 seconds, Fetched: 1 row(s)
Database modification:
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)
 
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)
  
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)
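
For example, a sketch of updating a database property (key and value are illustrative):

    ALTER DATABASE ruozedata_hive3 SET DBPROPERTIES ('edited-by'='pk');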
Deletion of the database:
1. Create a test table named test under the ruozedata_hive3 database:
hive (ruozedata_hive2)> use ruozedata_hive3;
OK
Time taken: 0.012 seconds
hive (ruozedata_hive3)> create table test(id int,name string);
OK
Time taken: 0.16 seconds
hive (ruozedata_hive3)> show tables;
OK
tab_name
test
Time taken: 0.026 seconds, Fetched: 1 row(s)

2. Now ruozedata_hive3 contains the table test; drop fails because the database is not empty:
hive (ruozedata_hive3)> drop database if exists ruozedata_hive3;
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database ruozedata_hive3 is not empty. One or more tables exist.)

3. Add cascade to cascade-delete the database together with all of its tables:
hive (ruozedata_hive3)> drop database if exists ruozedata_hive3 cascade;
OK
Time taken: 1.783 seconds

CASCADE: one database can contain many tables, and CASCADE drops them all.

Use CASCADE with caution in production.

4.1 Basic data types in Hive

Data types:
the content of files on HDFS is essentially all strings; Hive's column types give that text structure

Data type official website page: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

1. Numeric types: int, bigint, float, double; banking projects that need higher precision use decimal

2. Boolean type boolean: true / false -> in production, PK replaces the boolean type with TINYINT

3. String type string: at least 90% of production columns use it

4. Date types: date, timestamp ... -> these can also be replaced with the string type
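
A sketch of a table that applies these choices (the table and columns are hypothetical):

    create table order_demo(
      id bigint,
      amount decimal(16,2),   -- money: decimal for precision
      is_paid tinyint,        -- boolean replaced with tinyint (0/1)
      order_time string       -- date kept as a string
    );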

4.2 Delimiters in Hive

Delimiters:

delimiter   code    description
^A          \001    separator between fields
\n          \n      separator between rows
^B          \002    separator between elements of ARRAY / STRUCT (complex Hive types)
^C          \003    separator between key and value in a MAP (complex Hive type)

We can define our own delimiter, but it must not collide with the field contents, e.g.:
1,pk,30
1$$$pk$$$30


1. create table stu2(id int, name string, age int) row format delimited fields terminated by ',';

2. For now, skip load data and write a row with insert:
insert into stu2 values(1,'john','24');

3. Check whether the data was written:
hive (ruozedata_hive)> select * from stu2;
OK
stu2.id stu2.name       stu2.age
1       john    24
Time taken: 0.065 seconds, Fetched: 1 row(s)

4. Look directly at the file on HDFS to confirm it is comma-separated:
hive (ruozedata_hive)> dfs -ls /user/hive/warehouse/ruozedata_hive.db/stu2;
Found 1 items
-rwxrwxrwx   1 hadoop supergroup         10 2020-03-28 14:47 /user/hive/warehouse/ruozedata_hive.db/stu2/000000_0
hive (ruozedata_hive)> dfs -cat /user/hive/warehouse/ruozedata_hive.db/stu2/000000_0;
1,john,24
If you don't want to view it through the Hive client, you can also do it this way:
1. Download the table's data file to the local data directory:
[hadoop@hadoop001 data]$ hdfs dfs -get /user/hive/warehouse/ruozedata_hive.db/stu2/000000_0 ./
20/03/28 14:52:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop001 data]$ ll
total 44
-rw-r--r-- 1 hadoop hadoop    10 Mar 28 14:52 000000_0
-rw------- 1 hadoop hadoop 32768 Mar 26 13:39 15074
-rw-r--r-- 1 hadoop hadoop    14 Mar 28 14:11 ruoze.log
-rw-rw-r-- 1 hadoop hadoop    35 Mar 24 23:23 wordcount.log

2. View the content with cat:
[hadoop@hadoop001 data]$ cat 000000_0
1,john,24

5. Interview questions that may be involved

1. Does Hive support insert? In which version did it first appear?

2. What is the relationship between HiveQL and SQL? Tell me your understanding of Hive.


Origin blog.csdn.net/SparkOnYarn/article/details/105140753