Big Data: HBase (Part 1) --- HBase introduction, features, installation and deployment, shell commands, client-HBase interaction, and writing a million rows through the Java API

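I. HBase Introduction
----------------------------------------------
    1. A database built on Hadoop: a distributed, scalable store for large data sets

    2. Designed for random access with real-time reads and writes

    3. Hosts very large tables: billions of rows by millions of columns

    4. A versioned, non-relational database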


II. HBase Features
-------------------------------------------------
    Linear and modular scalability.

    Strictly consistent reads and writes.

    Automatic and configurable sharding of tables.

    Automatic failover support between RegionServers.

    Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.

    Easy to use Java API for client access.

    Block cache and Bloom Filters for real-time queries.

    Query predicate push down via server side Filters.
    (Filters are evaluated on the server, so rows that do not match are discarded before anything is sent to the client.)

    Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.

    Extensible jruby-based (JIRB) shell.

    Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.

    In short: a column-oriented, non-relational database with strictly consistent reads and writes (row-level locks, optimistic locking).
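
    The Bloom filter named in the feature list can be illustrated with a toy version. This is a minimal sketch in plain Java, not HBase's own implementation; the class name TinyBloomFilter and its hashing scheme are assumptions made for illustration. It shows the property that makes the filter useful for reads: a negative answer is always correct, so a file that cannot contain a row key can be skipped.

```java
import java.util.BitSet;

// Toy Bloom filter: may report false positives, never false negatives.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // i-th probe position, derived from two hash values (double hashing).
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false; // definitely absent
            }
        }
        return true; // present, or a false positive
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.add("row0000008");
        // A key that was added is always reported as possibly present.
        System.out.println(filter.mightContain("row0000008"));
        // A key that was never added is usually, but not always, rejected.
        System.out.println(filter.mightContain("row0000009"));
    }
}
```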


III. HBase Storage Model
-------------------------------------------------------------
    Column-oriented storage; the rows of a table are kept sorted by row key
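
    Sorting by row key is lexicographic, which has a practical consequence for key design. The sketch below uses a plain Java TreeMap as a stand-in for a sorted table (an assumption for illustration; HBase compares row keys as byte arrays, which gives the same order for ASCII keys like these). The class name RowKeyOrder is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class RowKeyOrder {
    // Returns the row keys in the order a table sorted by row key would hold them.
    static List<String> sortedKeys(String... rowKeys) {
        TreeMap<String, String> table = new TreeMap<>();
        for (String key : rowKeys) {
            table.put(key, "value-of-" + key);
        }
        return new ArrayList<>(table.keySet());
    }

    public static void main(String[] args) {
        // Unpadded numbers do not sort numerically.
        System.out.println(sortedKeys("row9", "row10", "row1"));
        // prints [row1, row10, row9]

        // Zero-padded to a fixed width, lexicographic order matches numeric order.
        System.out.println(sortedKeys("row0000009", "row0000010", "row0000001"));
        // prints [row0000001, row0000009, row0000010]
    }
}
```

    This is why fixed-width, zero-padded ids are a common choice for row keys that embed a number.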


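IV. Deploying HBase
-------------------------------------------------------------
    1. Install the JDK

    2. Set up ssh

    3. Install Hadoop

    4. Download HBase and extract the tarball

    5. Set the environment variables, create a symbolic link, and verify the installation (hbase version)

    6. Set JAVA_HOME in the configuration file [hbase/conf/hbase-env.sh]
        export JAVA_HOME=/soft/jdk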

    7. Configure HBase
        (1) Local (standalone) mode
            a. Edit [hbase/conf/hbase-site.xml]
                ...
                <property>
                    <name>hbase.rootdir</name>
                    <value>file:/home/hadoop/HBase/HFiles</value>
                </property>
                ...

        (2) Pseudo-distributed mode
            a. Edit [hbase/conf/hbase-site.xml]
                <property>
                    <name>hbase.cluster.distributed</name>
                    <value>true</value>
                </property>
                <!-- the host and port must match fs.defaultFS in Hadoop's core-site.xml -->
                <property>
                    <name>hbase.rootdir</name>
                    <value>hdfs://localhost:8030/hbase</value>
                </property>

        (3) Fully distributed mode
            a. [hbase/conf/hbase-env.sh]
                export JAVA_HOME=/soft/jdk
                # use an external ZooKeeper ensemble instead of the one HBase can manage itself
                export HBASE_MANAGES_ZK=false

            b. [hbase/conf/hbase-site.xml]
                <!-- run in fully distributed mode -->
                <property>
                    <name>hbase.cluster.distributed</name>
                    <value>true</value>
                </property>

                <!-- where HBase stores its data on HDFS -->
                <property>
                    <name>hbase.rootdir</name>
                    <value>hdfs://mycluster/hbase</value>
                </property>
                <!-- ZooKeeper addresses -->
                <property>
                    <name>hbase.zookeeper.quorum</name>
                    <value>192.168.43.131:2181,192.168.43.132:2181,192.168.43.133:2181</value>
                </property>
                <!-- local data directory for ZooKeeper -->
                <property>
                    <name>hbase.zookeeper.property.dataDir</name>
                    <value>/home/ubuntu/zookeeper</value>
                </property>

    8. List the region servers
        [hbase/conf/regionservers]
        s200
        s300
        s400

    9. Start the HBase cluster (on s100)
        $> start-hbase.sh

    10. Open the HBase web UI
        http://s100:16010
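
    A malformed hbase-site.xml (an unclosed tag, for example) is an easy mistake to make in step 7. The sketch below checks a site file with the XML parser from the Java standard library before HBase is started. It is an illustration only: the class name SiteXmlCheck is hypothetical, and it assumes the usual Hadoop-style layout of <property> elements with <name> and <value> children under a <configuration> root.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: parses the text of a site file and returns its properties.
final class SiteXmlCheck {
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element p = (Element) nodes.item(i);
                String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            // Malformed XML or a property without name/value ends up here.
            throw new IllegalArgumentException("invalid site file: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        String xml = "<configuration>"
                + "<property><name>hbase.cluster.distributed</name><value>true</value></property>"
                + "<property><name>hbase.rootdir</name><value>hdfs://mycluster/hbase</value></property>"
                + "</configuration>";
        // Prints the parsed name/value pairs in file order.
        System.out.println(parse(xml));
    }
}
```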


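V. HBase Cluster Management Commands
--------------------------------------
    $> hbase-daemon.sh start master             //start the master on this host
    $> hbase-daemon.sh stop master              //stop the master on this host
    $> hbase-daemon.sh start regionserver       //start the region server on this host
    $> hbase-daemon.sh stop regionserver        //stop the region server on this host
    $> hbase-daemons.sh start regionserver      //start a region server on every host listed in conf/regionservers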


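VI. HBase HA Deployment
----------------------------------------------------------
    HBase already depends on ZooKeeper, so to add another master simply run this command on any other host that has HBase installed:
    $> hbase-daemon.sh start master
    Starting the extra master process is all that is needed; ZooKeeper decides which master is active.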


VII. Command Groups
-----------------------------------------------------------
    [general]
    status, table_help, version, whoami

    [ddl]
    alter, alter_async, alter_status, create, describe,
    disable, disable_all, drop, drop_all, enable, enable_all, exists,
    get_table, is_disabled, is_enabled, list, locate_region, show_filters

    [namespace]
    alter_namespace, create_namespace, describe_namespace,
    drop_namespace, list_namespace, list_namespace_tables

    [dml]
    append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

    [tools]
    assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run,
    catalogjanitor_switch, close_region, compact, compact_rs, flush, major_compact, merge_region, move, normalize,
    normalizer_enabled, normalizer_switch, split, trace, unassign, wal_roll, zk_dump

    [replication]
    add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer,
    enable_table_replication, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs,
    set_peer_tableCFs, show_peer_tableCFs

    [snapshots]
    clone_snapshot, delete_all_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot

    [configuration]
    update_all_config, update_config

    [quotas]
    list_quotas, set_quota

    [security]
    grant, list_security_capabilities, revoke, user_permission

    [procedures]
    abort_procedure, list_procedures

    [visibility labels]
    add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility



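VIII. HBase Shell
-----------------------------------------------------------
    $> hbase shell                      //open the HBase shell

    $hbase> help                        //show help
    $hbase> help 'list_namespace'       //show help for a specific command
    $hbase> list_namespace               //list the namespaces (a namespace holds tables, like a database in MySQL)
    $hbase> list_namespace_tables 'mynamespace'               //list all tables in the given namespace
    $hbase> create_namespace 'ns1'               //create the namespace ns1
    $hbase> create 'ns1:t1','f1'                    //create table t1 with column family f1 in namespace ns1
    $hbase> desc 'ns1:t1'                       //describe the table
    $hbase> put 'ns1:t1','row1','f1:id',100     //put the value 100 into row row1, column id of column family f1; a cell is located by three coordinates [row, family:column, timestamp version]
    $hbase> get 'ns1:t1','row1'                 //get the data of the row whose row key is row1
    $hbase> scan 'ns1:t1'                       //scan the whole table
    $hbase> disable 'ns1:t1'                    //disable table t1
    $hbase> drop 'ns1:t1'                       //drop t1 (a table must be disabled before it can be dropped)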
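
    The three coordinates used by put and get (row, family:column, timestamp) can be pictured as nested sorted maps. The following is a minimal in-memory model in plain Java, not the HBase API; the class name VersionedCells and its methods are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Model of a table: row key -> "family:qualifier" -> timestamp (newest first) -> value
final class VersionedCells {
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    private TreeMap<Long, String> versions(String row, String column) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    // Without a timestamp, a read returns the newest version of the cell.
    String get(String row, String column) {
        TreeMap<Long, String> v = versions(row, column);
        return v == null ? null : v.firstEntry().getValue();
    }

    // With a timestamp, a read returns exactly that version, or null.
    String get(String row, String column, long timestamp) {
        TreeMap<Long, String> v = versions(row, column);
        return v == null ? null : v.get(timestamp);
    }

    public static void main(String[] args) {
        VersionedCells table = new VersionedCells();
        table.put("row1", "f1:id", 1L, "100");
        table.put("row1", "f1:id", 2L, "101");
        System.out.println(table.get("row1", "f1:id"));     // prints 101
        System.out.println(table.get("row1", "f1:id", 1L)); // prints 100
    }
}
```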



IX. How HBase Tables Are Laid Out on HDFS
------------------------------------------------------------------
    /hbase/data/{namespace}/{tablename}/{regionID}/{columnfamily}/{StoreFile}
    /hbase/data/ns1/t1/bdaa7c6d847b310af02cefe6a917864b/f1/a97d571e34df440d8d115bb685dbc5b0
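
    The layout can be made concrete by splitting the example path into its parts. This is a small sketch in plain Java; the class name StoreFilePath is hypothetical, and it assumes the HBase root directory is the single component /hbase, as in the example above.

```java
// Splits a store file path of the form
// /hbase/data/{namespace}/{tablename}/{regionID}/{columnfamily}/{StoreFile}
final class StoreFilePath {
    final String namespace;
    final String table;
    final String region;
    final String family;
    final String file;

    private StoreFilePath(String[] parts) {
        this.namespace = parts[3];
        this.table = parts[4];
        this.region = parts[5];
        this.family = parts[6];
        this.file = parts[7];
    }

    static StoreFilePath parse(String path) {
        // A leading "/" yields an empty first element:
        // ["", "hbase", "data", ns, table, region, family, file]
        String[] parts = path.split("/");
        if (parts.length != 8 || !parts[2].equals("data")) {
            throw new IllegalArgumentException("not a store file path: " + path);
        }
        return new StoreFilePath(parts);
    }

    public static void main(String[] args) {
        StoreFilePath p = parse(
                "/hbase/data/ns1/t1/bdaa7c6d847b310af02cefe6a917864b/f1/a97d571e34df440d8d115bb685dbc5b0");
        System.out.println(p.namespace + ":" + p.table + " family=" + p.family);
        // prints ns1:t1 family=f1
    }
}
```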


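X. How a Client Interacts with HBase
------------------------------------------------------------------
    0. When HBase starts, the HMaster assigns all regions across the HRegionServers

    1. The client contacts ZooKeeper to find the rs (RegionServer) that hosts the meta table
        zk:/hbase/meta-region-server

    2. Using the meta table, it finds the rs that serves the row key, which gives the location of the region. The result of this lookup is cached so that later requests can reuse it

    3. It contacts that rs, which serves the request through the HRegion instance of the region; inside the instance, each column family has its own Store

    4. Each Store contains one or more StoreFiles (a lightweight wrapper around an HFile, which holds the actual data)
    and also has its own memStore (data held in memory)

    5. The write path:
        a. Each HRegionServer has its own WAL. The WAL is a sequential file (of key-values) that records the sequence number and the data of every write (the data itself is also kept in the memStore)
        b. On a write request, steps 1-4 run first to locate the region that should receive the data; the server then checks whether the write-ahead log is enabled for the write
        c. If it is, the data is appended to the HRegionServer's WAL and is then stored in memory (in the memStore)
        d. If the memStore is full, its contents are flushed to disk (HDFS) for durable storage
        e. When HBase shuts down, memStore data that has not yet reached HDFS is also flushed to HDFS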
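
    The lookup in step 2 and the write path in step 5 can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not HBase code: the names WritePathSketch, MetaCache and MiniRegion are hypothetical, a region is identified only by its start key, and a flushed memStore is represented by a sorted map instead of an HFile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WritePathSketch {

    // Client-side cache of region locations: region start key -> server name.
    static final class MetaCache {
        private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

        void addRegion(String startKey, String server) {
            startKeyToServer.put(startKey, server);
        }

        // A row belongs to the region with the greatest start key <= row key.
        String locate(String rowKey) {
            Map.Entry<String, String> e = startKeyToServer.floorEntry(rowKey);
            return e == null ? null : e.getValue();
        }
    }

    // One region with a single store: WAL first, then memStore, flush when full.
    static final class MiniRegion {
        final List<String> wal = new ArrayList<>();                         // append-only log
        final TreeMap<String, String> memStore = new TreeMap<>();           // sorted in-memory buffer
        final List<TreeMap<String, String>> storeFiles = new ArrayList<>(); // flushed snapshots
        private final int flushThreshold;

        MiniRegion(int flushThreshold) {
            this.flushThreshold = flushThreshold;
        }

        void put(String rowKey, String value, boolean writeToWal) {
            if (writeToWal) {
                wal.add(rowKey + "=" + value); // step c: log before memory
            }
            memStore.put(rowKey, value);
            if (memStore.size() >= flushThreshold) {
                flush();                       // step d: memStore is full
            }
        }

        void flush() {
            if (memStore.isEmpty()) {
                return;
            }
            storeFiles.add(new TreeMap<>(memStore));
            memStore.clear();
        }

        // Reads check the memStore first, then store files from newest to oldest.
        String get(String rowKey) {
            if (memStore.containsKey(rowKey)) {
                return memStore.get(rowKey);
            }
            for (int i = storeFiles.size() - 1; i >= 0; i--) {
                String v = storeFiles.get(i).get(rowKey);
                if (v != null) {
                    return v;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        MetaCache cache = new MetaCache();
        cache.addRegion("", "s200");           // first region starts at the empty key
        cache.addRegion("row0500000", "s300");
        System.out.println(cache.locate("row0000008")); // prints s200
        System.out.println(cache.locate("row0700000")); // prints s300

        MiniRegion region = new MiniRegion(2);
        region.put("row0000008", "tom8", true);
        region.put("row0000009", "tom9", true);  // second put fills the memStore and triggers a flush
        System.out.println(region.storeFiles.size());  // prints 1
        System.out.println(region.get("row0000008"));  // prints tom8
    }
}
```

    In this model a put with writeToWal set to false exists only in the memStore until the next flush, which is why skipping the WAL risks losing data if the server crashes.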


XI. Accessing HBase through the Java API: Writing a Million Rows
------------------------------------------------------------------
    1. Create an hbase module and add the Maven dependency
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.6</version>
        </dependency>

    2. Implementation

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    import java.text.DecimalFormat;

    /**
     * Tests for the HBase client API
     */
    public class TsCRUD {

        public Connection conn;
        public Table tb;

        @Before
        public void getConn() throws Exception {
            // Load the configuration (hbase-site.xml on the classpath)
            Configuration conf = HBaseConfiguration.create();
            // Create the connection through the factory class
            conn = ConnectionFactory.createConnection(conf);
            // Get the table
            TableName tbName = TableName.valueOf("ns1:t1");
            tb = conn.getTable(tbName);
        }

        @After
        public void close() throws Exception {
            // Release the table and the connection after each test
            tb.close();
            conn.close();
        }


        @Test
        public void tsPut() throws Exception {
            // Create a Put for the row key "row3"
            Put put = new Put(Bytes.toBytes("row3"));
            // Add the cell f1:id with the value 102 (stored as a 4-byte int)
            put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("id"), Bytes.toBytes(102));
            tb.put(put);
            System.out.println("put over");

        }

        @Test
        public void tsGet() throws Exception {
            Get get = new Get(Bytes.toBytes("row3"));
            Result rs = tb.get(get);
            byte [] bs = rs.getValue(Bytes.toBytes("f1"), Bytes.toBytes("id"));
            System.out.println(Bytes.toInt(bs));
        }

        @Test
        public void tsBigPut() throws Exception {
            long time = System.currentTimeMillis();
            // Turn off auto-flush so that puts are buffered on the client
            HTable t = (HTable) tb;
            t.setAutoFlush(false);

            // Zero-pad the id to a fixed width so that row keys sort in numeric order (8 --> 0000008)
            DecimalFormat format = new DecimalFormat("0000000");

            // Start inserting
            for (int i = 5; i < 1000000; i++) {
                Put put = new Put(Bytes.toBytes("row" + format.format(i)));

                // Skip the write-ahead log (HLog). Not recommended: data is lost if the server crashes
                put.setDurability(Durability.SKIP_WAL);

                // Add the columns of this row
                put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("id"), Bytes.toBytes(i));
                put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes("tom" + i));
                put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("age"), Bytes.toBytes(i % 100));
                t.put(put);

                // Flush once every 2000 rows
                if (i % 2000 == 0) {
                    t.flushCommits();
                }
            }
            // Flush whatever is left
            t.flushCommits();

            System.out.println("Elapsed (ms): " + (System.currentTimeMillis() - time));
        }

    }
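
    The speed-up in tsBigPut comes from sending rows in batches instead of one round trip per row. The idea can be shown without a cluster. The sketch below is plain Java, not the HBase client: WriteBuffer and its sink callback are hypothetical stand-ins for the client-side write buffer and the RPC that ships a batch to the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Collects items and hands them to the sink in batches of at most batchSize.
final class WriteBuffer<T> {
    private final List<T> pending = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<T>> sink;

    WriteBuffer(int batchSize, Consumer<List<T>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; a no-op when the buffer is empty.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        WriteBuffer<String> buffer = new WriteBuffer<>(2000, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 5000; i++) {
            buffer.add(String.format("row%07d", i));
        }
        buffer.flush(); // send the remainder, like the final flushCommits()
        System.out.println(batchSizes); // prints [2000, 2000, 1000]
    }
}
```

    In the HBase 1.x client, BufferedMutator (obtained from Connection.getBufferedMutator) offers this kind of client-side buffering and may be preferable to casting the Table to HTable.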
