Hadoop Hive: Basic Concepts and Configuration

Big Data technologies: Hive

1. Hive basic concepts

1.1 What is Hive

Hive was open-sourced by Facebook to solve the problem of computing statistics over massive amounts of structured log data.

Hive is a data warehousing tool built on top of Hadoop. It can map a structured data file to a table and provides SQL-like query capabilities.

In essence, Hive converts HQL into MapReduce programs:

1) The data Hive processes is stored in HDFS

2) The underlying implementation Hive uses to analyze data is MapReduce

3) The generated programs run on YARN
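To make "HQL is converted to MapReduce" concrete, here is a sketch (the emp table and its columns are hypothetical) of how one query maps onto the job phases:

```sql
-- This HQL statement...
SELECT dept, count(*) FROM emp GROUP BY dept;

-- ...is compiled into roughly one MapReduce job:
--   map:     read rows of emp from HDFS and emit (dept, 1) pairs
--   shuffle: group the pairs by dept
--   reduce:  sum the counts for each dept and write the result back to HDFS
```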

 

1.2 Advantages and disadvantages of Hive

1.2.1 Advantages

1) The operation interface uses SQL-like syntax, which enables rapid development (simple and easy to use)

2) It avoids having to write MapReduce code, reducing the learning cost for developers

3) Hive's execution latency is relatively high, so it is commonly used for data analysis and for applications without strict real-time requirements

4) Hive's advantage lies in processing big data; it has no particular advantage for small data, because its execution latency is relatively high

5) Hive supports user-defined functions, so users can implement their own functions according to their needs
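As an example of user-defined functions, once a UDF has been packaged into a jar, registering and using it looks like this (a sketch; the jar path and class name are hypothetical):

```sql
-- Register a jar containing the UDF implementation (hypothetical path).
ADD JAR /opt/hive/lib/my-udfs.jar;

-- Bind a function name to the implementing class (hypothetical class).
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';

-- Use it like any built-in function.
SELECT to_upper(name) FROM student;
```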

1.2.2 Disadvantages

1) The expressive power of HQL is limited

(1) Iterative algorithms cannot be expressed

(2) It is not well suited to data mining

2) Hive's efficiency is relatively low

(1) The MapReduce jobs Hive generates automatically are usually not intelligent enough

(2) Hive tuning is difficult and coarse-grained

1.3 Hive architecture

 

As shown in the figure, Hive provides a series of interfaces to the user and receives the user's instructions (SQL). Using its own Driver together with the metadata (Metastore), it translates the instructions into MapReduce jobs and submits them to Hadoop for execution; finally, the execution results are returned to the user interface.

1) User interface: Client

CLI (hive shell), JDBC/ODBC (Java access to Hive), WebUI (browser access to Hive, e.g. HUE)

2) Metadata: Metastore

Metadata includes: the table name, the database the table belongs to (by default, default), the table owner, column/partition fields, the table type (whether it is an external table), the directory where the table's data resides, and so on;

By default the metadata is stored in the built-in Derby database. It is recommended to use MySQL to store the Metastore instead.
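Much of this metadata can be viewed from inside Hive with DESC FORMATTED (a sketch, using the student table created later in this document):

```sql
-- Prints the owner, HDFS location, table type (MANAGED_TABLE vs EXTERNAL_TABLE),
-- and column details -- i.e. the information kept in the Metastore.
DESC FORMATTED student;
```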

3) Hadoop

HDFS is used for storage and MapReduce is used for computation.

4) Driver

(1) Parser (SQL Parser): converts the SQL string into an abstract syntax tree (AST). This step is usually done with a third-party library such as ANTLR. The AST is then checked: does the table exist, do the fields exist, is the SQL semantically wrong.

(2) Compiler (Physical Plan): compiles the AST into a logical execution plan.

(3) Optimizer (Query Optimizer): optimizes the logical execution plan.

(4) Executor (Execution): converts the logical execution plan into a physical plan that can run. For Hive, that means MapReduce or Spark jobs.
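The plan the Driver produces can be inspected with EXPLAIN; this is a quick way to see the stages a query compiles into (a sketch, using the student table created later in this document):

```sql
-- Shows the stage plan (map/reduce operators) without running the query.
EXPLAIN
SELECT name, count(*) FROM student GROUP BY name;
```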

1.4 Comparing Hive with databases

 

Since Hive uses the SQL-like query language HQL (Hive Query Language), it is easy to take Hive for a database. In fact, from a structural point of view, apart from having a similar query language, Hive and databases have nothing in common. This section describes the differences between Hive and databases from several aspects. Databases can be used in online applications, but Hive is designed for data warehousing; being aware of this helps in understanding Hive's characteristics from an application point of view.

1.4.1 Query language

Because SQL is widely used in data warehousing, the SQL-like query language HQL was designed specifically for Hive's characteristics. Developers who are familiar with SQL can easily use Hive for development.

1.4.2 Data storage location

Hive is built on top of Hadoop, and all Hive data is stored in HDFS. Database data, in contrast, can be stored on block devices or in the local file system.

1.4.3 Data updates

Since Hive is designed for data warehouse applications, and the contents of a data warehouse are read often but written rarely, Hive does not support rewriting data in place: all data is determined when it is loaded. The data in a database, on the other hand, usually needs to be modified frequently, so you can add data with INSERT INTO ... VALUES and modify it with UPDATE ... SET.
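The contrast can be sketched as follows (illustrative statements; the RDBMS side is standard SQL, the Hive side is the load-oriented workflow used later in this document):

```sql
-- Typical RDBMS workflow: row-level writes and in-place updates.
INSERT INTO student VALUES (1001, 'zhangshan');
UPDATE student SET name = 'zhangsan' WHERE id = 1001;

-- Typical Hive workflow: bulk-load a whole file into the table's HDFS
-- directory; individual rows are not updated afterwards.
LOAD DATA LOCAL INPATH '/opt/datas/student.txt' INTO TABLE student;
```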

1.4.4 Indexes

Hive does not do any processing of the data while loading it, and does not even scan it, so no indexes are built on any keys in the data. To access particular values that satisfy a condition, Hive must brute-force scan all of the data, which makes access latency high. Thanks to MapReduce, Hive can access data in parallel, so even without indexes Hive still shows an advantage when accessing large amounts of data. A database usually builds indexes on one or a few columns, so when accessing a small amount of data under specific conditions it can be very efficient, with low latency. The high latency of data access means that Hive is not suitable for online queries.

1.4.5 Execution engine

Most Hive queries are executed via the MapReduce framework provided by Hadoop. Databases usually have their own execution engines.

1.4.6 Execution latency

When Hive queries data, because there are no indexes, it needs to scan the whole table, so latency is high. Another factor behind Hive's high latency is the MapReduce framework: since MapReduce itself has high latency, running Hive queries on MapReduce also incurs high latency. In contrast, the execution latency of a database is low. Of course, this "low" is conditional on the data being small; once the data scale exceeds what a database can handle, Hive's parallel computation clearly shows its advantage.

1.4.7 Scalability

Since Hive is built on top of Hadoop, Hive's scalability is the same as that of Hadoop (the world's largest Hadoop cluster, at Yahoo!, was around 4,000 nodes in 2009). A database, because of the strict restrictions imposed by ACID semantics, is very limited in how far it can scale out; even Oracle, the most advanced parallel database, scales in theory to only about 100 machines.

1.4.8 Data scale

Since Hive is built on a cluster and can take advantage of MapReduce's parallel computation, it can support very large-scale data; correspondingly, databases support data at a smaller scale.

2. Preparing the Hive installation environment

2.1 Hive installation addresses

1) Hive official website:

http://hive.apache.org/

2) Documentation:

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

3) Downloads:

http://archive.apache.org/dist/hive/

4) GitHub:

https://github.com/apache/hive

2.2 Hive installation and deployment

1) Hive installation and configuration

(1) Upload apache-hive-1.2.1-bin.tar.gz to the /opt/software directory on the Linux machine

(2) Extract apache-hive-1.2.1-bin.tar.gz into the /opt/ directory

 

[root@hadoop102 software]$ tar -zxvf apache-hive-1.2.1-bin.tar.gz -C /opt/

(3) Rename the apache-hive-1.2.1-bin directory to hive

 

[root@hadoop102 module]$ mv apache-hive-1.2.1-bin/ hive

(4) Rename hive-env.sh.template in the /opt/hive/conf directory to hive-env.sh

 

[root@hadoop102 conf]$ mv hive-env.sh.template hive-env.sh

(5) Configure the hive-env.sh file

(a) Configure the HADOOP_HOME path

 

export HADOOP_HOME=/opt/hadoop-2.7.2

(b) Configure the HIVE_CONF_DIR path

 

export HIVE_CONF_DIR=/opt/hive/conf

2) Hadoop cluster configuration

(1) HDFS and YARN must be started

 

[root@hadoop102 hadoop-2.7.2]$ sbin/start-dfs.sh

[root@hadoop103 hadoop-2.7.2]$ sbin/start-yarn.sh

 

(2) Create the /tmp and /user/hive/warehouse directories on HDFS and give the group write permission on them

 

[root@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir /tmp

[root@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir -p /user/hive/warehouse

 

[root@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /tmp

[root@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /user/hive/warehouse

 

3) Basic Hive operations

(1) Start Hive

 

     [root@hadoop102 hive]$ bin/hive

(2) View the databases

 

     hive>show databases;

(3) open the default database

 

     hive>use default;

(4) Show the tables in the default database

 

     hive>show tables;

(5) Create a table

 

     hive> create table student(id int, name string);

(6) Show the tables in the database again

 

     hive>show tables;

(7) View the table structure

 

     hive>desc student;

(8) Insert data into the table

 

hive> insert into student values(1000,"ss");

(9) Query the data in the table

 

     hive> select * from student;

(10) Exit Hive

 

     hive> quit;

2.3 Case study: importing a local file into Hive

Requirement: import the data in the local file /opt/datas/student.txt into the Hive table student(id int, name string).

1) Data preparation: put the data in /opt/datas/student.txt

(1) Create the datas directory under /opt/

 

    [root@hadoop102 module]$ mkdir datas

(2) Create the student.txt file in the /opt/datas/ directory and add the data

 

[root@hadoop102 module]$ touch student.txt

[root@hadoop102 module]$ vi student.txt

1001 zhangshan

1002 lishi

1003 zhaoliu

 

Note that the fields are separated by tab characters.
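Because the table created below declares '\t' as its field separator, the file must contain literal tab characters; editors and copy-paste often silently turn tabs into spaces. One way to generate the file reliably (a sketch; it writes to the current directory rather than /opt/datas):

```shell
# printf '\t' emits a real tab character, so the file is guaranteed
# to match a Hive table declared with FIELDS TERMINATED BY '\t'.
printf '1001\tzhangshan\n1002\tlishi\n1003\tzhaoliu\n' > student.txt

# On GNU systems, cat -A displays tabs as ^I, which makes the separators easy to verify.
cat -A student.txt
```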

 

2) Hive operations

(1) Start Hive

 

    [root@hadoop102 hive]$ bin/hive

(2) Show the databases

 

hive>show databases;

(3) Use the default database

 

    hive>use default;

(4) Show the tables in the default database

 

    hive>show tables;

(5) Delete the previously created student table

 

hive> drop table student;

(6) Create the student table and declare the file field separator '\t'

 

hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

(7) Load the /opt/datas/student.txt file into the student table.

 

hive> load data local inpath '/opt/datas/student.txt' into table student;

(8) Query the results in Hive

 

hive> select * from student;

OK

1001 zhangshan

1002 lishi

1003 zhaoliu

Time taken: 0.266 seconds, Fetched: 3 row(s)

3) Problems encountered

If you keep the first Hive session open and then start Hive from a second client window, a java.sql.SQLException is thrown:

 

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)

        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:606)

        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

        at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)

        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)

        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)

        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)

        at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)

        at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)

        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)

        ... 8 more

 

The reason is that by default the Metastore is stored in the built-in Derby database, which allows only a single client connection at a time; this is why it is recommended to use MySQL to store the Metastore.

2.4 Create the metastore database in MySQL

Enter the following in the mysql client:

mysql> create database metastore;

mysql> alter database metastore character set latin1;

mysql> grant all on metastore.* to 'root'@'%' identified by 'root';   -- important: allows remote connections

mysql> flush privileges;

Then copy the MySQL JDBC driver package to Hive's lib directory, as described in 2.5.1.

2.5 Configuring the Hive metadata store to use MySQL

2.5.1 Copying the driver

1) Extract the mysql-connector-java-5.1.27.tar.gz driver package in the /opt/software/mysql-libs directory

 

[root@hadoop102 mysql-libs]# tar -zxvf mysql-connector-java-5.1.27.tar.gz

2) Copy mysql-connector-java-5.1.27-bin.jar from the /opt/software/mysql-libs/mysql-connector-java-5.1.27 directory to /opt/hive/lib/

 

[root@hadoop102 mysql-connector-java-5.1.27]# cp mysql-connector-java-5.1.27-bin.jar /opt/hive/lib/

2.5.2 Configuring the Metastore to use MySQL

1) Create hive-site.xml in the /opt/hive/conf directory

 

[root@hadoop102 conf]# touch hive-site.xml

[root@hadoop102 conf]# vi hive-site.xml

 

 

2) Configure the parameters according to the official documentation and copy them into the hive-site.xml file.

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

The specific parameter configuration is as follows:

 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

  <name>javax.jdo.option.ConnectionURL</name>

  <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value>

  <description>JDBC connect string for a JDBC metastore</description>

</property>

 

<property>

  <name>javax.jdo.option.ConnectionDriverName</name>

  <value>com.mysql.jdbc.Driver</value>

  <description>Driver class name for a JDBC metastore</description>

</property>

 

<property>

  <name>javax.jdo.option.ConnectionUserName</name>

  <value>root</value>

  <description>username to use against metastore database</description>

</property>

 

<property>

  <name>javax.jdo.option.ConnectionPassword</name>

  <value>root</value>

  <description>password to use against metastore database</description>

</property>

</configuration>

 

3) After the configuration is complete, if Hive throws an exception on startup, try restarting the virtual machine. (After restarting, don't forget to start the Hadoop cluster.)

2.5.3 Multi-window Hive startup test

1) Start MySQL

 

[root@hadoop102 mysql-libs]$ mysql -uroot -proot

   Show the existing databases:

mysql> show databases;

+--------------------+

| Database           |

+--------------------+

| information_schema |

| mysql              |

| performance_schema |

| test               |

+--------------------+

 

2) Open several more windows and start Hive in each

 

[root@hadoop102 hive]$ bin/hive

3) After starting Hive, go back to the MySQL window and show the databases; a metastore database has been added

 

mysql> show databases;

+--------------------+

| Database           |

+--------------------+

| information_schema |

| metastore          |

| mysql              |

| performance_schema |

| test               |

+--------------------+


Origin www.cnblogs.com/zxn0628/p/11273857.html