Using the word-count algorithm as an example, this section introduces how to use Hive in a concrete application.
1. Create the input directory (the output directory will be generated automatically)
Here input is the input directory and output is the output directory.
Commands:
cd /usr/local/hadoop
mkdir input
2. Create two test files, file1.txt and file2.txt, in the input folder
Commands:
cd /usr/local/hadoop/input
echo "hello world" > file1.txt
echo "hello hadoop" > file2.txt
3. Compile and run the program in the terminal
View the files:
4. Implement word counting with HiveQL
Only the following seven lines of HiveQL are needed, and no jar has to be compiled and generated for execution:
create table docs(line string);
load data inpath 'input' overwrite into table docs;
create table word_count as
select word, count(1) as count from
(select explode(split(line, ' ')) as word from docs) w
group by word
order by word;
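The counting the HiveQL above performs can be sketched in plain Java, independent of Hive, using the contents of the two test files from step 2 as input (a minimal local sketch, not part of the experiment itself):

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    public static void main(String[] args) {
        // The contents of file1.txt and file2.txt from step 2.
        String[] lines = {"hello world", "hello hadoop"};
        // A TreeMap keeps keys sorted, mirroring "order by word".
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // split(line, ' ') plus explode: one row per word.
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
        // Prints:
        // hadoop	1
        // hello	2
        // world	2
    }
}
```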
5. Create the fusioninsight-hive Project
6. Import the dependency jars into the fusioninsight-hive Project
7. Create a Package named com.fusioninsight.hive
8. Create the Java Classes required by the experiment
Under the Package created in the previous step, create a Java Class named HiveJdbc:
Create another Java Class named HiveUDF:
9. Write the HiveJdbc code
package com.fusioninsight.hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

public class HiveJdbc {
    private static final String HIVE_DRIVER = "org.apache.hive.jdbc.HiveDriver";
    private static final String ZK_HOSTS = "host1:24002,host2:24002,host3:24002";
    private static final String SERVICES_DISCOVERY_MODE = "zooKeeper";
    private static final String ZOOKEEPER_NAMESPACE = "hiveserver2";

    static {
        try {
            Class.forName(HIVE_DRIVER);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
    }

    private String getConnectionUrl() {
        // HiveServer2 instances are discovered through ZooKeeper.
        return "jdbc:hive2://" + ZK_HOSTS + "/;serviceDiscoveryMode=" + SERVICES_DISCOVERY_MODE
                + ";zooKeeperNamespace=" + ZOOKEEPER_NAMESPACE;
    }

    private Connection getConnection(String url) {
        Connection connection = null;
        try {
            connection = DriverManager.getConnection(url);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return connection;
    }

    private PreparedStatement getStatement(Connection connection, String sql) throws SQLException {
        return connection.prepareStatement(sql);
    }

    private void execDDL(String sql) {
        Connection connection = getConnection(getConnectionUrl());
        try {
            PreparedStatement statement = getStatement(connection, sql);
            statement.executeUpdate();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    private ResultSet execDQL(String sql) {
        ResultSet resultSet = null;
        Connection connection = getConnection(getConnectionUrl());
        try {
            PreparedStatement statement = getStatement(connection, sql);
            resultSet = statement.executeQuery();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return resultSet;
    }

    private void printResultSet(ResultSet resultSet) throws SQLException {
        ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
        int columnCount = resultSetMetaData.getColumnCount();
        StringBuilder columnName = new StringBuilder();
        StringBuilder value = new StringBuilder();
        for (int i = 1; i <= columnCount; i++) {
            columnName.append(resultSetMetaData.getColumnName(i)).append("\t");
        }
        while (resultSet.next()) {
            for (int i = 1; i <= columnCount; i++) {
                value.append(resultSet.getString(i)).append("\t");
            }
            value.append("\n");
        }
        System.out.printf("%s\n%s", columnName, value);
    }

    private void execDML(String sql) {
        // DML statements are executed the same way as DDL here.
        execDDL(sql);
    }

    public static void main(String[] args) {
        HiveJdbc jdbc = new HiveJdbc();
        // Replace username with the real user name of your own development machine.
        ResultSet resultSet = jdbc.execDQL("SELECT * FROM username.bucket_test LIMIT 5");
        try {
            jdbc.printResultSet(resultSet);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
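With the constants above, getConnectionUrl() assembles a ZooKeeper-discovery JDBC URL. A standalone check of the string it builds (the host names are the placeholders from the code, to be replaced with real cluster hosts):

```java
public class UrlCheck {
    public static void main(String[] args) {
        // Same constants as in HiveJdbc.
        String zkHosts = "host1:24002,host2:24002,host3:24002";
        String url = "jdbc:hive2://" + zkHosts
                + "/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2";
        System.out.println(url);
        // jdbc:hive2://host1:24002,host2:24002,host3:24002/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
    }
}
```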
10. Write the HiveUDF code
package com.fusioninsight.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // Single-argument version: appends "hello" to the input string.
    public String evaluate(String str) {
        return str + "hello";
    }

    // Varargs version: joins all inputs, each followed by ":".
    public String evaluate(String... strs) {
        StringBuilder sb = new StringBuilder();
        for (String s : strs) {
            sb.append(s).append(":");
        }
        return sb.toString();
    }
}
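The behavior of the two evaluate overloads can be tried out locally. The sketch below reproduces their logic as plain static methods, without the Hive dependency, so it compiles and runs outside the cluster (an illustrative stand-in, not the class registered in Hive):

```java
public class HiveUDFSketch {
    // Same logic as HiveUDF.evaluate(String).
    static String evaluate(String str) {
        return str + "hello";
    }

    // Same logic as HiveUDF.evaluate(String...).
    static String evaluate(String... strs) {
        StringBuilder sb = new StringBuilder();
        for (String s : strs) {
            sb.append(s).append(":");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The exact-match String overload wins over varargs for one argument.
        System.out.println(evaluate("abc"));          // abchello
        System.out.println(evaluate("a", "b", "c"));  // a:b:c:
    }
}
```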
11. Run the HiveJdbc program and view the result
12. Package the program and configure the Hive user-defined function
13. Put the jar file onto HDFS
14. Create the function
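Steps 13 and 14 can be carried out along these lines, using Hive's CREATE FUNCTION ... USING JAR statement. The jar name, HDFS path, and function name below are illustrative assumptions, not values fixed by this experiment:

```sql
-- Step 13 (shell): upload the packaged jar to HDFS first, e.g.
--   hdfs dfs -put fusioninsight-hive.jar /user/hive/udf/
-- Step 14: register the UDF in Hive, pointing at the jar on HDFS.
create function my_udf as 'com.fusioninsight.hive.HiveUDF'
using jar 'hdfs:///user/hive/udf/fusioninsight-hive.jar';
-- Example call of the single-argument overload:
-- select my_udf('abc');
```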