Big Data Study Notes -- Hands-on HDFS Programming with Hadoop

Working with the basic HDFS API (plus basic IDEA setup)

In a previous blog post I described in detail how to build a distributed Hadoop platform in HA mode from scratch. Building on that work, we can finally start doing some hands-on programming. As before, a little preparation is needed first, so let's get started!

1. Prepare for Programming

In the lessons that follow we will mostly use the IntelliJ IDEA integrated development environment, so on the Windows side we first need the following three pieces of software:

1) IntelliJ IDEA: download and install it; I use IntelliJ IDEA 2019.2.3

2) A Windows build of Hadoop: download hadoop-2.7.3 and configure the environment variables. Also be sure to add the host-name/IP mappings of the virtual machines you built earlier to the Windows hosts file.
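
For example, the relevant part of C:\Windows\System32\drivers\etc\hosts might look like the sketch below; the host names and addresses here are only placeholders and must match your own virtual machines:

192.168.56.101 master1
192.168.56.102 master2
192.168.56.103 slave1
192.168.56.104 slave2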

3) Maven: later big-data work involves a large number of third-party packages and dependencies between them, and Maven greatly simplifies pulling them in. Download apache-maven-3.6.1 and configure the Maven environment variables.

4) A word about Maven: Maven is a build tool that simplifies building projects, managing dependencies, and managing project information. The heart of a Maven project is the pom file, an XML file in which groupId, artifactId, and version form the three coordinates that uniquely identify a project in the Maven world.

2. IDEA basic settings

2.1 Appearance, keymap, and settings import

After starting IDEA, first click the icon in the lower-left corner to open the side bar.

Import font settings: click Settings, search for Appearance, and change the font and theme to your liking on the right.

Set or import custom shortcut keys: click Settings -> Keymap; the default keymap is generally fine.

You can also import custom settings: click File -> Import Settings.

Enable Maven auto-import: click Settings -> search for "Maven". The first box holds the MAVEN_HOME configured in the environment variables, the second box points to the settings xml file, and Local repository is the directory where you can find a package after it has been installed.

Compiler version: a wrong or too-old compiler version makes compilation fail, so it needs to be set: click Settings -> search for Java Compiler and change every place where a version appears to 8 (an alternative is sketched below).
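
As an optional alternative (my own sketch, not part of the original setup), the compiler level can also be pinned in pom.xml so that it survives Maven re-imports:

<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>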

2.2 Converting a project into a Maven project

The steps to convert an ordinary Java project into a Maven project are as follows:

1) Add framework support (Add Framework Support)

2) Scroll to the bottom of the list, find Maven, and tick it

 

3) A prompt then appears in the lower-right corner; choose Enable Auto-Import

With that, the basic IDEA setup is done!

3. Hadoop programming

3.1 Maven dependencies and the resources folder

Note that the Hadoop version in the dependency must match the version of Hadoop you downloaded. Later on we will use unit tests to simplify the testing workflow, so the junit package is needed as well. The pom.xml is set up as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>groupId</groupId>
    <artifactId>untitled</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
    </dependencies>
    
</project>

Looking at the project structure you will see a folder named resources; place these three files in it (they come from the cluster configuration; a sketch of core-site.xml follows the list):

1. core-site.xml

2. hdfs-site.xml

3. log4j.properties
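
The core-site.xml and hdfs-site.xml are copied from the Hadoop configuration directory on the cluster so that the client can resolve the HA nameservice. As a minimal sketch of core-site.xml (the nameservice name mycluster matches the code further down; everything else must be taken from your own cluster, and hdfs-site.xml must carry the matching HA settings):

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
    </property>
</configuration>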

3.2 Reading and writing files

Write a helloworld file out to the HDFS file system:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.junit.Test;

import java.io.FileOutputStream;

//test writing a file (the imports above cover all three tests in this section)
    @Test
    public void testWrite() throws Exception{
        //note: first switch to the root user so we have permission to read and write files
        System.setProperty("HADOOP_USER_NAME","root");
        //create a Configuration object
        Configuration conf = new Configuration();
        //use the conf object to obtain the FileSystem handle
        FileSystem fs = FileSystem.get(conf);
        //set the output path of the file
        Path path = new Path("hdfs://mycluster/helloworld.txt");
        //use the path object to obtain an output stream
        FSDataOutputStream fos = fs.create(path);
        for(int i = 0; i < 100; i++){
            fos.write("helloworld".getBytes());
            fos.write("\r\n".getBytes());
        }
        fos.close();
    }
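
Once the test has run, the file can be verified from the cluster side with hdfs dfs -cat /helloworld.txt, or simply read back with the test below.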

Read a file from the HDFS file system and print it to the console:

//test reading a file
    @Test
    public void testRead() throws Exception{
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("hdfs://mycluster/helloworld.txt");
        //the read counterpart is the open method, which returns an input stream
        FSDataInputStream fis = fs.open(path);
        byte[] buf = new byte[1024];
        int len = 0;
        while((len = fis.read(buf)) != -1){
            System.out.println(new String(buf,0,len));
        }
        fis.close();
    }
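
Note that read(buf) may return fewer than 1024 bytes on any given call, which is why only the first len bytes of the buffer are turned into a String on each iteration.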

Use the very handy IOUtils utility class to copy a file:

//use the IOUtils class to copy a file conveniently
    @Test
    public void testCopy() throws Exception{
        //first prepare an input stream and an output stream
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("hdfs://mycluster/helloworld.txt");
        FSDataInputStream fis = fs.open(path);
        FileOutputStream fos = new FileOutputStream("d:/helloworld.txt");
        //copy with IOUtils using a 1024-byte buffer
        IOUtils.copyBytes(fis,fos,1024);
        //close the streams
        fis.close();
        fos.close();
    }
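
The same pattern works in the other direction: open an input stream on the local file and an output stream on HDFS. Below is a small sketch of my own (not from the original post); it reuses the paths from the example above and additionally needs import java.io.FileInputStream:

    //upload the local copy back to HDFS (sketch; the paths are just the ones used above)
    @Test
    public void testUpload() throws Exception{
        System.setProperty("HADOOP_USER_NAME","root");
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileInputStream fis = new FileInputStream("d:/helloworld.txt");
        FSDataOutputStream fos = fs.create(new Path("hdfs://mycluster/helloworld_copy.txt"));
        //this four-argument overload closes both streams once the copy finishes
        IOUtils.copyBytes(fis, fos, 1024, true);
    }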

3.3 Hadoop serialization

Hadoop implements serialization through the various implementations of the Writable interface, such as IntWritable, LongWritable, and Text. Looking at the Writable interface, an implementing class has to provide two methods:

void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;

 

Looking further at the class hierarchy of IntWritable, both IntWritable and Text implement the WritableComparable interface, which in turn extends the Comparable interface and so brings one more method to implement:

public int compareTo(T o);

 

A custom Student class:

public class Student {
    private int id;
    private int age;

    public Student() {
    }

    public Student(int id, int age) {
        this.id = id;
        this.age = age;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public int getAge() {
        return age;
    }

    public void setAge(int age) {
        this.age = age;
    }

    @Override
    public String toString() {
        return "Student{" +
                "id=" + id +
                ", age=" + age +
                '}';
    }
}

 

A custom StudentWritable class implementing the Writable interface:

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/*
    a custom class that implements the Writable interface
 */
public class StudentWritable implements Writable {
    private Student stu;

    public StudentWritable(Student stu) {
        this.stu = stu;
    }

    public Student get() {
        return stu;
    }

    public void set(Student stu) {
        this.stu = stu;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(stu.getId());
        out.writeInt(stu.getAge());
    }

    public void readFields(DataInput in) throws IOException {
        //note: first create a fresh Student object, then use its fields as containers for the values read in
        stu = new Student();
        stu.setId(in.readInt());
        stu.setAge(in.readInt());
    }
}
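
One point worth adding (my own note, not from the original post): Hadoop's serialization framework usually creates Writable instances by reflection, which requires a public no-argument constructor. StudentWritable only has a one-argument constructor, which is fine for the hand-written tests below, but in a real MapReduce job you would add a no-argument constructor as well.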

 

Testing serialization and deserialization of various data types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.junit.Test;

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;

/*
    test Hadoop's serialization mechanism
 */
public class TestSerial {

    //serialize a number
    @Test
    public void testSerial1() throws Exception{
        //create an IntWritable object holding the value 100
        IntWritable iw = new IntWritable(100);
        //create a data output stream backed by a local file
        DataOutputStream dos = new DataOutputStream(new FileOutputStream("d:/test.h"));
        //call iw's write method to write the value out to disk
        iw.write(dos);
        //close the stream
        dos.close();
    }

    //deserialize it again
    @Test
    public void testDeserial1() throws Exception{
        //construct an empty IntWritable object, which acts like an empty basket
        IntWritable iw = new IntWritable();
        //create a data input stream
        DataInputStream dis = new DataInputStream(new FileInputStream("d:/test.h"));
        //load the data into the basket
        iw.readFields(dis);
        //take the data out of the basket
        System.out.println(iw.get());
        //close the stream
        dis.close();
    }

    //serialize a string
    @Test
    public void testSerial2() throws Exception{
        Text text = new Text("helloworld");
        DataOutputStream dos = new DataOutputStream(new FileOutputStream("d:/test1.h"));
        text.write(dos);
        dos.close();
    }

    //deserialize it again
    @Test
    public void testDeserial2() throws Exception{
        Text text = new Text();
        DataInputStream dis = new DataInputStream(new FileInputStream("d:/test1.h"));
        text.readFields(dis);
        //a Text object is printed via its toString method
        System.out.println(text.toString());
        dis.close();
    }

    //serialize an object
    @Test
    public void testSerial3() throws Exception{
        StudentWritable sw = new StudentWritable(new Student(1, 18));
        DataOutputStream dos = new DataOutputStream(new FileOutputStream("d:/test2.h"));
        sw.write(dos);
        dos.close();
    }
    
    //deserialize it again
    @Test
    public void testDeserial3() throws Exception{
        StudentWritable sw = new StudentWritable(new Student());
        DataInputStream dis = new DataInputStream(new FileInputStream("d:/test2.h"));
        sw.readFields(dis);
        System.out.println(sw.get().toString());
        dis.close();
    }

}
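
A small observation (my own note): after testSerial1 runs, the file d:/test.h is exactly 4 bytes, since IntWritable.write boils down to a single DataOutput.writeInt call. This compactness is one of the main reasons Hadoop uses Writable instead of standard Java serialization for data that travels between nodes.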

 

The compareTo method: later, when we get to Hadoop secondary sort, we will need to override compareTo, so it is worth practising here.

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/*
    a custom class that implements the WritableComparable interface
 */
public class StudentWritable2 implements WritableComparable<StudentWritable2> {
    private Student stu;

    public StudentWritable2(Student stu) {
        this.stu = stu;
    }

    public Student get() {
        return stu;
    }

    public void set(Student stu) {
        this.stu = stu;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(stu.getId());
        out.writeInt(stu.getAge());
    }

    public void readFields(DataInput in) throws IOException {
        //note: first create a fresh Student object, then use its fields as containers for the values read in
        stu = new Student();
        stu.setId(in.readInt());
        stu.setAge(in.readInt());
    }

    public int compareTo(StudentWritable2 o) {
        //secondary sort: if the ids differ, compare ids in ascending order; otherwise compare ages in descending order
        return this.stu.getId() == o.stu.getId()? o.stu.getAge() - this.stu.getAge(): this.stu.getId() - o.stu.getId();
    }
}
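
A quick way to sanity-check the comparator (my own sketch, not part of the original post) is to sort a few instances with Collections.sort: the ids should come out in ascending order, and within the same id the ages should be descending.

    //sanity check for the secondary-sort comparator (sketch)
    @Test
    public void testCompareTo() {
        java.util.List<StudentWritable2> list = new java.util.ArrayList<>(java.util.Arrays.asList(
                new StudentWritable2(new Student(2, 20)),
                new StudentWritable2(new Student(1, 18)),
                new StudentWritable2(new Student(1, 25))));
        java.util.Collections.sort(list);
        //expected order: Student{id=1, age=25}, Student{id=1, age=18}, Student{id=2, age=20}
        for (StudentWritable2 s : list) {
            System.out.println(s.get());
        }
    }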
