《进击大数据》系列教程之hadoop大数据基础

前言

一、使用http的方式访问hdfs

二、hdfs各组件及其作用

三、hdfs中的数据块（Block）

四、java api 操作hdfs

五、java 开发 hdfs 应用时注意的事项

六、DataNode心跳机制的作用

七、NameNode中EditsLog 和FSImage 的作用

八、SecondaryNameNode 帮助NameNode 减负

九、NameNode 如何扩展？

扫描二维码关注公众号，回复： 12545133 查看本文章

十、创建文件的快照（备份）命令

十一、平衡数据

十二、safemode 安全模式

前言

时隔一年多，忙忙碌碌一直在做java web端的业务开发，大数据基本忘的差不多了，此次出一个大数据系列教程博文将其捡起。

hadoop，hdfs的下载安装以及启动，停止在这里就不一一介绍了，不会的可以查看我的历史博客。

默认我们已经搭建好了一个三节点的主从hadoop节点，一个master，两个salve

一、使用http的方式访问hdfs

在hdfs-site.xml中增加如下配置，然后重启hdfs：

<property>

<name>dfs.webhdfs.enabled</name>

<value>true</value>

<description>使得可以使用http的方式访问hdfs</description>

</property>

使用http访问：

查询user/hadoop-twq/cmd 文件系统中的error.txt文件

http://master:50070/webhdfs/v1/user/hadoop-twq/cmd/error.txt?op=LISTSTATUS

http://master:50070/webhdfs/v1/user/hadoop-twq/cmd/error.txt?op=OPEN

支持的op见：

http://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

二、hdfs各组件及其作用

三、hdfs中的数据块（Block）

数据块的默认大小：128M

设置数据块的大小为：256M = 256*1024*1024

在${HADOOP_HOME}/ect/hadoop/hdfs-site.xml增加配置：

<property>

<name>dfs.block.size</name>

<value>268435456</value>

</property>

数据块的默认备份数是3

设置数据块的备份数

对指定的文件设置备份数

hadoop fs -setrep 2 /user/hadoop-twq/cmd/big_file.txt

全局文件备份数是直接在hdfs-site进行设置：

<property>

   <name>dfs.replication</name>

 <value>3</value>

</property>

数据块都是存储在每一个datanode 所在的机器本地磁盘文件中

四、java api 操作hdfs

使用javaapi 写数据到文件

package com.dzx.hadoopdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

/**
 * @author duanzhaoxu
 * @ClassName:
 * @Description:
 * @date 2020年12月17日 17:35:37
 */
public class hdfs {
    public static void main(String[] args) throws Exception {
        String content = "this is a example";
        String dest = "hdfs://master:9999/user/hadoop-twq/cmd/java_writer.txt";
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(URI.create(dest), configuration);
        FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(dest));
        fsDataOutputStream.write(content.getBytes(StandardCharsets.UTF_8));
        fsDataOutputStream.close();
    }
}

使用java api 读取文件

package com.dzx.hadoopdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

/**
 * @author duanzhaoxu
 * @ClassName:
 * @Description:
 * @date 2020年12月17日 17:35:37
 */
public class hdfs {
    public static void main(String[] args) throws Exception {
        String dest = "hdfs://master:9999/user/hadoop-twq/cmd/java_writer.txt";
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(URI.create(dest), configuration);
        FSDataInputStream fsDataInputStream = fileSystem.open(new Path(dest));
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(fsDataInputStream));
        String line = null;
        while (bufferedReader.readLine() != null) {
            System.out.println(line);
        }
        fsDataInputStream.close();
        bufferedReader.close();
    }
}

使用java api 获取文件状态信息

package com.dzx.hadoopdemo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

/**
 * @author duanzhaoxu
 * @ClassName:
 * @Description:
 * @date 2020年12月17日 17:35:37
 */
public class hdfs {
    public static void main(String[] args) throws Exception {
        //获取指定文件的文件状态信息
        String dest = "hdfs://master:9999/user/hadoop-twq/cmd/java_writer.txt";
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(URI.create("hdfs://master:9999/"), configuration);
        FileStatus fileStatus = fileSystem.getFileStatus(new Path(dest));

        System.out.println(fileStatus.getPath());
        System.out.println(fileStatus.getAccessTime());
        System.out.println(fileStatus.getBlockSize());
        System.out.println(fileStatus.getGroup());
        System.out.println(fileStatus.getLen());
        System.out.println(fileStatus.getModificationTime());
        System.out.println(fileStatus.getOwner());
        System.out.println(fileStatus.getPermission());
        System.out.println(fileStatus.getReplication());
        System.out.println(fileStatus.getSymlink());

        //获取指定目录下的所有文件的文件状态信息
        FileStatus[] fileStatuses = fileSystem.listStatus(new Path("hdfs://master:9999/user/hadoop-twq/cmd"));
        for (FileStatus status : fileStatuses) {
            System.out.println(status.getPath());
            System.out.println(status.getAccessTime());
            System.out.println(status.getBlockSize());
            System.out.println(status.getGroup());
            System.out.println(status.getLen());
            System.out.println(status.getModificationTime());
            System.out.println(status.getOwner());
            System.out.println(status.getPermission());
            System.out.println(status.getReplication());
            System.out.println(status.getSymlink());
        }

        //创建目录
        fileSystem.mkdirs(new Path("hdfs://master:9999/user/hadoop-twq/cmd/java"));
        //创建目录并指定权限  rwx--x---
        fileSystem.mkdirs(new Path("hdfs://master:9999/user/hadoop-twq/cmd/temp"), new FsPermission(FsAction.ALL, FsAction.EXECUTE, FsAction.NONE));

        //删除指定文件
        fileSystem.delete(new Path("hdfs://master:9999/user/hadoop-twq/cmd/java/1.txt"), false);
        //删除指定目录
        fileSystem.delete(new Path("hdfs://master:9999/user/hadoop-twq/cmd/java"), true);

    }
}

五、java 开发 hdfs 应用时注意的事项

 //需要把core-site.xml文件放到resources目录下，自动读取hdfs的ip端口配置
        String dest = "user/hadoop-twq/cmd/java_writer.txt";
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        FileStatus fileStatus = fileSystem.getFileStatus(new Path(dest));

六、DataNode心跳机制的作用

七、NameNode中EditsLog 和FSImage 的作用

八、SecondaryNameNode 帮助NameNode 减负

九、NameNode 如何扩展？

向主节点master的hdfs-site.xml 增加如下配置

查看 master节点的 clusterId

将hdfs-site.xml 从主节点master 拷贝到 slave1 和 slave2

形成三个节点的集群之后，我们使用java api 就不知道指定哪个 nameNode 的ip：端口了，所以我们需要进行viewFs配置

首先将core-site.xml 中的fs.defaultFS 配置项注释掉，然后添加如下配置

<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
		<xi:include href="mountTable.xml"/>
        <property>
            <name>fs.default.name</name>
            <value>viewfs://my-cluster</value>        
        </property>		
</configuration>

然后增加一个 mountTable.xml 文件( 元数据管理分布映射，相当于将namenode 管理的元数据分散到不同的namenode节点上)

然后将修改好的配置文件同步到 slave1 和 slave2 上，重启hdfs集群即可

重启之后在任意节点都可以使用通用的请求方式，例如：

hadoop fs -ls viewfs:://my-cluster/

十、创建文件的快照（备份）命令

给指定的目录授权创建快照

hadoop dfsadmin -allowSnapshot /user/hadoop-twq/data

创建快照

hadoop fs -createSnapshot /user/hadoop-twq/data data-20180317-snapshot

查看创建的快照文件

hadoop fs -ls /user/hadoop-twq/data/.snapshot/data-20180317-snapshot

其他快照相关命令

十一、平衡数据

当我们对hdfs 集群进行扩展的时候，难免新扩展进来的节点，分配的数据量较少，这个时候为了能够均衡的分配数据，可以使用hdfs balancer命令

十二、safemode 安全模式

开启安全模式之后无法创建，删除目录和文件，只允许查看目录和文件

hadoop dfsadmin -safemode get

Safe mode is OFF

hadoop dfsadmin -safemode enter

Safe mode is ON

hadoop dfsadmin -safemode leave

Safe mode is ON