Batch-loading data from files into a Hive partitioned table with the sshxcute framework

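Table of contents

The approach used in this post: sshxcute
Loading data from files into a Hive partitioned table

1. Method 1: a shell script (commonly used); other scripting languages work too
2. Method 2: the approach described in this post (commonly used)
3. Method 3: write multiple output files straight into Hive while processing the data (or use MapReduce)
4. Method 4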

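The approach used in this post: sshxcute

Required jar or pom entries
First, a note on sshxcute.jar

Link: https://pan.baidu.com/s/1sHbXSpAb8qaCYo6K6HJr5w 
Extraction code: uxuj
Description: the download is a folder named sshexec.
The folder contains:
1. Sample code: DataImport.java
2. The jar: sshxcute.jar
3. The jar as it looks after being imported into a Maven repository, which is a folder: sshxcute
Just copy that folder into the \net\neoremind\ directory of your local repository: repository\net\neoremind
See the figure below.
// Pulling the dependency in directly would require referencing two repositories, so I imported the jar into my local repository by hand.
After importing it, remember to refresh your local repository (otherwise the import is not picked up)!
// Once it is imported, simply reference it in Maven. Referencing it without the manual import will probably produce an error.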
<!-- https://mvnrepository.com/artifact/net.neoremind/sshxcute -->
		<dependency>
            <groupId>net.neoremind</groupId>
            <artifactId>sshxcute</artifactId>
            <version>1.0</version>
        </dependency>

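Figure: the jar after being imported into the Maven repository
pom.xml (second part)
You do not have to use my pom; any pom that lets you work with the HDFS API will do. If you do not use mine, remember to add item 1 (the sshxcute dependency above).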

	<repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/net.neoremind/sshxcute -->
        <dependency>
            <groupId>net.neoremind</groupId>
            <artifactId>sshxcute</artifactId>
            <version>1.0</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0-mr1-cdh5.14.0</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.0-cdh5.14.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.6.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>RELEASE</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <minimizeJar>true</minimizeJar>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

The code is as follows:

import net.neoremind.sshxcute.core.ConnBean;
import net.neoremind.sshxcute.core.Result;
import net.neoremind.sshxcute.core.SSHExec;
import net.neoremind.sshxcute.exception.TaskExecFailException;
import net.neoremind.sshxcute.task.impl.ExecCommand;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * @version v 1.0
 * @date 2019.12.24
 */
public class DataImport {
    public static void main(String[] args) throws TaskExecFailException, IOException, URISyntaxException {
        // Get the file system; the HDFS API is used to work with it
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://192.168.100.100:8020"), new Configuration());
        // List the files under the target directory
        FileStatus[] fileStatuses = fileSystem.listStatus(new Path("/aaa/project/app_output"));

        // Server to connect to: IP (the host name is enough if it is configured in the hosts file), user name, password
        ConnBean connBean = new ConnBean("hadoop01", "root", "123456");
        // Connect to the server
        SSHExec sshExec = SSHExec.getInstance(connBean);
        sshExec.connect();
        // Holds the command that is being assembled
        StringBuilder order = new StringBuilder();
        // Fixed prefix that every command starts with
        String str = "hive -e \"use bigdata_project01;";
        // Start the command with the prefix
        order.append(str);

        ExecCommand execCommand = null ;
        for (int i = 0; i < fileStatuses.length; i++) {
            FileStatus fileStatus = fileStatuses[i];
            // Take the file name and derive year, month and day from it; the name looks like 1970-01-01.txt
            String path = fileStatus.getPath().getName();
            // 1970
            String year = path.substring(0, 4);
            // 01
            String month = path.substring(5, 7);
            // 01
            String day = path.substring(8, 10);
            // Build the statement
            //   /aaa/project/app_output/ is the path on the HDFS file system
            // The resulting command has the form: hive -e "LOAD DATA  INPATH '/aaa/project/app_output/1970-01-01.txt' OVERWRITE INTO TABLE tableName PARTITION (year = 'yyyy',month = 'MM',day = 'dd');"
            String s = "LOAD DATA  INPATH '/aaa/project/app_output/" + path + "' OVERWRITE INTO TABLE app_traffic PARTITION (year = '" + year + "',month = '" + month + "',day = '" + day + "');";
            order.append(s);

            // The length of a command submitted through exec is limited, so submit once every 100 appended statements; an overly long command causes an index-out-of-bounds exception
            if (i != 0 && i % 100 == 0){
                // Close the quoted hive -e argument
                order.append("\"");
                // Create the command
                execCommand = new ExecCommand(order.toString());
                // Execute it. If this line throws an array-index-out-of-bounds exception, the command is still too long
                Result exec = sshExec.exec(execCommand);

                // Reset the StringBuilder
                order = new StringBuilder();
                // Start the next command with the prefix
                order.append(str);
            }
        }

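        // Close the quoted argument of the last command
        order.append("\"");
        // Create the last command
        execCommand = new ExecCommand(order.toString());
        // Execute it
        Result exec = sshExec.exec(execCommand);

        // Close the file system
        fileSystem.close();
        // Disconnect from the server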
        sshExec.disconnect();
    }
}
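
The batching idea in the listing above (append LOAD DATA statements, submit every so often so the command never gets too long) can also be isolated from the SSH and HDFS calls. The following is only a minimal sketch under stated assumptions: the class `LoadCommandBatcher`, its method names and the file-name pattern are hypothetical and are not part of sshxcute or Hive. Unlike the listing, it skips any file whose name does not look like 1970-01-01.txt instead of failing on `substring`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of sshxcute or Hive. */
final class LoadCommandBatcher {
    // Assumption: data files are named like 1970-01-01.txt
    private static final Pattern DATED_FILE =
            Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})\\.txt");

    /** Builds one LOAD DATA statement, or returns null if the name carries no date. */
    static String loadStatement(String dir, String table, String fileName) {
        Matcher m = DATED_FILE.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return "LOAD DATA INPATH '" + dir + "/" + fileName + "' OVERWRITE INTO TABLE " + table
                + " PARTITION (year = '" + m.group(1) + "',month = '" + m.group(2)
                + "',day = '" + m.group(3) + "');";
    }

    /** Groups the statements into hive -e commands holding at most batchSize statements each. */
    static List<String> buildCommands(String database, String dir, String table,
                                      List<String> fileNames, int batchSize) {
        List<String> commands = new ArrayList<>();
        StringBuilder current = null;
        int inBatch = 0;
        for (String name : fileNames) {
            String stmt = loadStatement(dir, table, name);
            if (stmt == null) {
                continue; // skip names that do not match the assumed pattern
            }
            if (current == null) {
                // every command starts with the same prefix as in the listing above
                current = new StringBuilder("hive -e \"use " + database + ";");
            }
            current.append(stmt);
            inBatch++;
            if (inBatch == batchSize) {
                commands.add(current.append('"').toString());
                current = null;
                inBatch = 0;
            }
        }
        if (current != null) {
            // flush the last, partially filled batch
            commands.add(current.append('"').toString());
        }
        return commands;
    }
}
```

Each returned string could then be wrapped in `new ExecCommand(...)` and passed to `sshExec.exec(...)` exactly as in the listing. Because a command is only created when at least one statement exists, no empty `hive -e` call is submitted.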

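Loading data from files into a Hive partitioned table

1. Method 1: a shell script (commonly used); other scripting languages work too

The script assumes the output files are laid out as:
/aaa/output_app/year/month/day/year-month-day.txt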

#!/bin/bash
yyyy=`ls /aaa/output_app` # directory to traverse; one subdirectory per year
  for yy in $yyyy
    do
      MM=`ls /aaa/output_app/${yy}`
      for  mm in $MM
        do
          DD=`ls /aaa/output_app/${yy}/${mm}`
            for dd in $DD
              do
                hive -e "use app_traffic;LOAD DATA LOCAL  INPATH '/aaa/output_app/${yy}/${mm}/${dd}/${yy}-${mm}-${dd}.txt' OVERWRITE INTO TABLE app_traffic PARTITION (year='${yy}',month='${mm}',day='${dd}');"
              done
        done
    done

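2. Method 2: the approach described in this post (commonly used)

3. Method 3: write multiple output files straight into Hive while processing the data (or use MapReduce)

Afterwards, just refresh Hive.

Figure: import format
The reducer-side code is provided below: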

package com.czxy.app;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * @author dell
 * @version v 1.0
 * @date 2019.12.24
 */
public class AppReducer extends Reducer<Text, Text, Text, NullWritable> {
    private static FileSystem fileSystem;

    @Override
    protected void setup(Context context) throws IOException {
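        try {
            // Get the FileSystem object
            fileSystem = FileSystem.get(new URI("hdfs://192.168.100.100:8020"), context.getConfiguration());
        } catch (URISyntaxException e) {
            // Fail the task instead of continuing with a null file system
            throw new IOException("Invalid HDFS URI", e);
        }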
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException {
        // Write the data straight into the Hive table's directory
        String s = key.toString();
        // 1970
        String year = s.substring(0, 4);
        // 01
        String month = s.substring(5, 7);
        // 01
        String day = s.substring(8, 10);
        // Create the file inside the matching partition directory
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path("/user/hive/warehouse/bigdata_project01/app_traffic/year=" + year + "/month=" + month + "/day=" + day + "/" + s + ".txt"));

        // Output
        for (Text value : values) {
            // Write each value to the file as one line
            fsDataOutputStream.write((value.toString() + "\r\n").getBytes());
        }
        fsDataOutputStream.close();
    }
}
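
Because the reducer writes files directly into the warehouse directory, Hive does not know about the new partition directories until they are registered in the metastore; the "refresh Hive" step is presumably this registration. One option is `MSCK REPAIR TABLE app_traffic;`, another is an explicit `ALTER TABLE ... ADD IF NOT EXISTS PARTITION` per partition. The sketch below only builds the directory path and the matching statement as strings. The class `PartitionRegistrar` and its methods are hypothetical names introduced for illustration, and the key format (1970-01-01) is taken from the reducer above.

```java
/** Hypothetical helper for illustration; not part of Hive or Hadoop. */
final class PartitionRegistrar {

    /** Directory that a key such as "1970-01-01" maps to under the table directory. */
    static String partitionDir(String tableDir, String dateKey) {
        String[] p = splitKey(dateKey);
        return tableDir + "/year=" + p[0] + "/month=" + p[1] + "/day=" + p[2];
    }

    /** HiveQL statement that registers the partition for the same key. */
    static String addPartitionStatement(String table, String dateKey) {
        String[] p = splitKey(dateKey);
        return "ALTER TABLE " + table + " ADD IF NOT EXISTS PARTITION (year='" + p[0]
                + "', month='" + p[1] + "', day='" + p[2] + "');";
    }

    /** Validates the assumed yyyy-MM-dd key format before splitting it. */
    private static String[] splitKey(String dateKey) {
        if (!dateKey.matches("\\d{4}-\\d{2}-\\d{2}")) {
            throw new IllegalArgumentException("Expected a key like 1970-01-01 but got: " + dateKey);
        }
        return dateKey.split("-");
    }
}
```

The generated statements could be executed with `hive -e` in the same way as the LOAD DATA commands earlier in this post.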

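4. Method 4

1. Use the HDFS API to traverse the file directory.
2. Concatenate all the statements to be executed into strings and collect them in a single file.
The file can be generated on Windows, on Linux, or on HDFS;
that is up to you.
(Generating it on Linux through the API saves the upload/download step.)
3. Then upload (or download) it to Linux.
4. Run it with hive -f.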
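
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class `HqlScriptWriter`, its method names and the file-name pattern are assumptions made for illustration, and the list of file names is passed in as plain strings (in practice it would come from `FileSystem.listStatus` as in the first listing).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of Hive or Hadoop. */
final class HqlScriptWriter {

    /** Builds the lines of a script: one "use" line, then one LOAD DATA line per dated file. */
    static List<String> scriptLines(String database, String dir, String table, List<String> fileNames) {
        List<String> lines = new ArrayList<>();
        lines.add("use " + database + ";");
        for (String name : fileNames) {
            // Assumption: data files are named like 1970-01-01.txt; anything else is skipped
            if (!name.matches("\\d{4}-\\d{2}-\\d{2}\\.txt")) {
                continue;
            }
            String year = name.substring(0, 4);
            String month = name.substring(5, 7);
            String day = name.substring(8, 10);
            lines.add("LOAD DATA INPATH '" + dir + "/" + name + "' OVERWRITE INTO TABLE " + table
                    + " PARTITION (year = '" + year + "',month = '" + month + "',day = '" + day + "');");
        }
        return lines;
    }

    /** Writes the lines to a local file that can later be run with hive -f. */
    static Path writeScript(Path target, List<String> lines) {
        try {
            return Files.write(target, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the file were written to, say, load_partitions.hql (a made-up name), it would be run on the Hive machine with `hive -f load_partitions.hql`. Since the statements live in a file rather than on the command line, the command-length limit that forced the batching in Method 2 does not apply here.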

Finally: my abilities are limited, so if you spot any problems or find anything in the code confusing, feel free to leave me a comment!

红尘丶世界
