Summary of using HBase in Python 3

HBase Introduction and Installation

Please refer to the article: Read and understand HBase in one article

Python3 HBase API

HBase preliminary preparation

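1. Install the happybase library to work with HBase
Install the library: pip install happybase

2. Make sure Hadoop and ZooKeeper are available and running
Make sure Hadoop is running normally
Make sure ZooKeeper is running normally

3. Start the HBase Thrift service
Start it with the command:
$HBASE_HOME/bin/hbase-daemon.sh start thrift

4. Use the jps command to check that the Thrift service started correctly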
[root@Hadoop3-master bin]# jps
69760 Worker
120160 ResourceManager
81811 QuorumPeerMain
119541 DataNode
93143 Jps
56695 Worker
119387 NameNode
119802 SecondaryNameNode
92890 ThriftServer
69549 Master
69759 Worker
[root@Hadoop3-master bin]#

Introduction to HappyBase

HappyBase is a Python library that accesses HBase through the HBase Thrift service, which makes it quick and convenient to use from Python code.

HappyBase core class

CentOS operating instructions


[root@Hadoop3-master bin]# ./zkServer.sh start # start ZooKeeper
ZooKeeper JMX enabled by default
Using config: /usr/local/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@Hadoop3-master bin]# hbase-daemon.sh start thrift # start the HBase Thrift service as a daemon
running thrift, logging to /usr/local/hbase/logs/hbase-root-thrift-Hadoop3-master.out
[root@Hadoop3-master bin]# jps # list the Hadoop 3 and HBase processes
69760 Worker
120160 ResourceManager
81811 QuorumPeerMain
119541 DataNode
93143 Jps
56695 Worker
119387 NameNode
119802 SecondaryNameNode
92890 ThriftServer
69549 Master
69759 Worker
[root@Hadoop3-master bin]# 

 Summary of problems encountered in HBase pseudo-cluster/stand-alone version

 ERROR: KeeperErrorCode = NoNode for /hbase/master

This error is caused by using the ZooKeeper instance that ships with HBase. My environment is a stand-alone deployment, so I run an independent ZooKeeper service instead. The relevant settings in my hbase-env.sh and hbase-site.xml are shown below.

hbase-env.sh:

export HBASE_MANAGES_ZK=false # recommended: do not use the ZooKeeper bundled with HBase

hbase-site.xml: Configure Hadoop 3 storage address, ZooKeeper service address and port

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://Hadoop3-master:9000/hbase</value>
  </property>
  <!-- Must be set to true, otherwise HBase cannot connect to ZooKeeper -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <!-- ZooKeeper port -->
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>
    <!-- Address of the ZooKeeper service that HBase depends on -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>Hadoop3-master</value>
    </property>
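
To double-check which ZooKeeper address HBase will use, the properties can be read back from the file. The following is a minimal sketch, assuming the property blocks above are wrapped in the usual <configuration> root element; read_hbase_site and SAMPLE are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as ET


def read_hbase_site(xml_text):
    """Return the <property> entries of an hbase-site.xml document as a dict."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            settings[name.strip()] = (value or "").strip()
    return settings


# Sample document mirroring the properties shown above (illustrative only).
SAMPLE = """<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>Hadoop3-master</value>
  </property>
</configuration>"""

settings = read_hbase_site(SAMPLE)
zk_address = "{}:{}".format(
    settings["hbase.zookeeper.quorum"],
    settings["hbase.zookeeper.property.clientPort"],
)
print(zk_address)
```

For a real installation, pass the text of the actual hbase-site.xml file instead of SAMPLE.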

Zookeeper:Unable to read additional data from client sessionid 0x00, likely client has closed socket

Error message:

EndOfStreamException: Unable to read additional data from client sessionid 0x6362257b44e5068d, likely client has closed socket

Specific reason: When the client connects to Zookeeper, the configured timeout is too short.

Solution: Adjust the zoo.cfg timeout parameter value

[root@Hadoop3-master conf]# vi /usr/local/zookeeper/conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=10000

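 Change tickTime from 2000 ms (2 seconds) to 10000 ms (10 seconds). tickTime is ZooKeeper's basic time unit; by default the session timeout a client can negotiate is bounded between 2 and 20 ticks, so a larger tick allows a longer session timeout.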
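
The effect of tickTime on the session timeout can be sketched with a little arithmetic. This assumes minSessionTimeout and maxSessionTimeout are not set in zoo.cfg, so ZooKeeper falls back to 2 and 20 ticks; the function names and the requested value of 90000 ms are illustrative assumptions.

```python
def session_timeout_bounds(tick_time_ms, min_ticks=2, max_ticks=20):
    """Return (min, max) session timeout in ms for a given tickTime."""
    return tick_time_ms * min_ticks, tick_time_ms * max_ticks


def negotiated_timeout(requested_ms, tick_time_ms):
    """Clamp the timeout a client asks for into the server's allowed range."""
    low, high = session_timeout_bounds(tick_time_ms)
    return max(low, min(requested_ms, high))


# With the default tick of 2000 ms, a client asking for 90000 ms is capped.
print(negotiated_timeout(90000, 2000))
# With tickTime=10000 the same request fits inside the allowed range.
print(negotiated_timeout(90000, 10000))
```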

HRegionServer: Failed construction RegionServer

2023-08-16 11:47:22,026 ERROR [main] regionserver.HRegionServer: Failed construction RegionServer
java.lang.StackOverflowError
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:2000)

Reason: The HBase information stored in ZooKeeper is in an abnormal (stale or inconsistent) state.

Solution: Use zkCli.sh to delete /hbase node data

[root@Hadoop3-master bin]# ./zkCli.sh
Connecting to localhost:2181
******
WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] ls /
[hbase, zookeeper]
[zk: localhost:2181(CONNECTED) 5] deleteall /hbase
[zk: localhost:2181(CONNECTED) 6] ls /
[zookeeper]
[zk: localhost:2181(CONNECTED) 7] quit

WATCHER::

WatchedEvent state:Closed type:None path:null
2023-08-16 14:18:58,892 [myid:] - INFO  [main:ZooKeeper@1288] - Session: 0x10002e674ad0027 closed
2023-08-16 14:18:58,893 [myid:] - INFO  [main-EventThread:ClientCnxn$EventThread@568] - EventThread shut down for session: 0x10002e674ad0027
2023-08-16 14:18:58,895 [myid:] - INFO  [main:ServiceUtils@45] - Exiting JVM with code 0

thriftpy2.transport.base.TTransportException: TTransportException(type=1, message="Could not connect to ('*.*.*.*', 9090)")

Reason: HBase did not start the thrift daemon service.

Solution: Enable the HBase thrift daemon service.

[root@Hadoop3-master bin]# ./hbase-daemon.sh start thrift
running thrift, logging to /usr/local/hbase/logs/hbase-root-thrift-Hadoop3-master.out
[root@Hadoop3-master bin]# jps
69760 Worker
128612 QuorumPeerMain
70406 SecondaryNameNode
59081 HRegionServer
69549 Master
70190 DataNode
76078 ThriftServer
56695 Worker
70040 NameNode
70744 ResourceManager
58845 HMaster
69759 Worker
76286 Jps
[root@Hadoop3-master bin]#

Use the jps command to check whether ThriftServer appears in the process list.
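
The same check can be made from the client machine before calling HappyBase. The following is a minimal sketch using only the standard library; thrift_port_open is a hypothetical helper name, and 9090 is the Thrift port mentioned in the error message above.

```python
import socket


def thrift_port_open(host, port=9090, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False
```

For example, thrift_port_open('192.168.43.11') returning False suggests that the Thrift service is not running or that the port is blocked. A True result only shows that something is listening on the port, not that it is the HBase Thrift server.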

HBase Shell and its common commands

 The HBase shell is a command-line tool. On Linux, start it from the HBase bin directory with the command: ./hbase shell

HBase Shell

  • version: displays the version number of the current HBase
  • status: displays the status of each master node; parameters can be added after it
  • whoami: displays the current username
  • exit or quit: leaves shell mode
[test@cs010 bin]$ ./hbase shell
//version displays the current HBase version number
hbase(main):001:0> version
1.4.12, r6ae4a77408ad35d6a7a4e5cebfd401fc4b72b5ec, Sun Nov 24 13:25:41 CST 2019
//status displays the status of each master node
hbase(main):002:0> status
1 active master, 0 backup masters, 1 servers, 1 dead, 7.0000 average load
//whoami displays the current username
hbase(main):003:0> whoami
test(auth:SIMPLE)
    groups: test

Table and column family operations 

An HBase table's structure (schema) contains only two things: the table name and its column families. Column families, however, have many attributes, and both the number of column families and their attributes can be set when the table is created or altered.

HBase Shell operation table commands: 

Create table 

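//Create a table; two arguments are required: the table name and a column family name
1.  create 'table1','basic'    //creates table1 with one column family, basic
2.  create 'table1','basic','advanced' //creates table1 with two column families, basic and advanced
3.  create 'table2','basic',MAX_FILESIZE=>'134217728' //applies to every column family in the table: the maximum store file size per region is 128 MB
4.  create 'TABLE1','basic' //HBase is case-sensitive, so this is a different table from table1 in example 1
5.  create 'table1',{NAME => 'basic',VERSIONS => 5,BLOCKCACHE => true}
//The braces describe column family basic: VERSIONS => 5 keeps the 5 most recent versions of each cell, and BLOCKCACHE => true allows caching when data is read. Parameters that are not specified use their default values.
//Inside the braces, NAME and VERSIONS are parameter names and do not need to be quoted.

//Create a namespace
create_namespace 'bigdata'

//Create a table in the namespace
create 'bigdata:student','info'

//Drop the namespace; if it still contains tables, drop those tables first
drop_namespace 'bigdata'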

View table name list 

//list shows all current table names
list

//list_namespace shows the current namespaces
list_namespace

//exists checks whether a table exists
exists 'table_test1'

eg:
hbase(main):010:0> list
TABLE
table_test1
1 row(s) in 0.0060 seconds

hbase(main):043:0> list_namespace
NAMESPACE
default
1 row(s) in 0.0190 seconds

hbase(main):011:0> exists 'table_test1'
Table table_test1 does exist
0 row(s) in 0.0070 seconds

Describe table structure 

//describe shows the column families of a table: how many there are and the parameters of each
describe 'table_test1'
//Describe a table inside a namespace
describe 'bigdata:table_test1'
eg:
hbase(main):013:0> describe 'table_test1'
Table table_test1 is ENABLED
table_test1
COLUMN FAMILIES DESCRIPTION
{NAME => 'test001', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE =>'65536', REPLICATION_SCOPE => '0'}
{NAME => 'test002', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE =>'65536', REPLICATION_SCOPE => '0'}
2 row(s) in 0.0250 seconds

Modify table structure 

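//Modify the table structure with alter, for example to add a column family or change its parameters.
//eg: add column family test002 to table table_test1
1. alter 'table_test1','test001','test002' //adds column family test002
2. alter 'table_test1','test002' //adds column family test002
3. alter 'table_test1','test001',{NAME=> 'test002',IN_MEMORY =>true} //adds column family test002

//Change the attributes of an existing column family. A column family cannot simply be renamed: if it already holds data, the data itself has to be rewritten.
4. alter 'table_test1',{NAME=> 'test001',IN_MEMORY =>true}

//Delete a column family together with its data (the table must keep at least one column family)
5. alter 'table_test1','delete'=>'test001'
6. alter 'table_test1',{NAME=> 'test002',METHOD=>'delete'}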

eg:
[haishu@cs010 bin]$ ./hbase shell
hbase(main):001:0> list
TABLE
table_test1
1 row(s) in 0.1710 seconds

hbase(main):002:0> alter 'table_test1','delete'=>'test001'
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9480 seconds

hbase(main):003:0> alter 'table_test1',{NAME=>'test002',METHOD=>'delete'}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.8710 seconds

Delete table 

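//Disable the table first, then drop it
disable 'table1' //disable table1
is_disabled 'table1' //check that it was disabled
drop 'table1' //drop the table

//truncate disables the table, drops it with all its data and recreates it empty, i.e. it clears all data in the table
truncate 'table1'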

Data Update

HBase Shell addition, deletion, modification and query commands:

Data insertion 

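//Insert data. The arguments are, in order: table name, row key, column family:column name, cell value, and a timestamp or version number. A larger number means a newer time or version; if omitted, the current timestamp is used.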
put 'table_test','001','basic:test001','micheal jordan',1
put 'table_test','002','basic:test002','kobe'

Data Update 

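//Update data: a put whose row key and column already exist writes a new version, regardless of the timestamp. If the table was created with VERSIONS=>n, the newest n versions of the same cell can be queried.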
put 'table_test','001','basic:test001','air jordan',2

Data deletion

A delete in HBase does not remove the data from disk immediately. It only marks the data as deleted.

When a delete is performed, HBase inserts a new KeyValue with the same key but with keytype=Delete, which marks the data as deleted. The data is not physically removed from disk until a major compaction occurs, at which point the delete marker is also removed from the StoreFile.
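
The marker mechanism can be illustrated with a toy in-memory model. This is only a sketch of the idea described above, not HBase code: ToyCell and its methods are hypothetical, and it assumes a delete marker hides every version whose timestamp is less than or equal to the marker's.

```python
class ToyCell:
    """Toy model of one cell: stored versions plus delete markers."""

    def __init__(self):
        self.versions = {}    # timestamp -> value
        self.tombstones = []  # each marker hides versions with ts <= marker

    def put(self, timestamp, value):
        self.versions[timestamp] = value

    def delete(self, timestamp):
        # A delete only records a marker; stored versions are untouched.
        self.tombstones.append(timestamp)

    def _visible(self):
        cutoff = max(self.tombstones, default=None)
        return {ts: v for ts, v in self.versions.items()
                if cutoff is None or ts > cutoff}

    def get(self):
        visible = self._visible()
        return visible[max(visible)] if visible else None

    def major_compact(self):
        # Compaction drops the hidden versions and the markers themselves.
        self.versions = self._visible()
        self.tombstones = []


cell = ToyCell()
cell.put(1, 'micheal jordan')
cell.put(2, 'air jordan')
cell.delete(2)
print(cell.get(), len(cell.versions))  # hidden, but both versions still stored
cell.major_compact()
print(len(cell.versions))              # physically removed after compaction
```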

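//Delete data with delete; the table name, row key and column family name must be given
delete 'table_test','001','basic'
delete 'table_test','002','basic:test002'
delete 'table_test','002','basic:test002',2
//If a version is given, all data with version <= 2 is deleted by default
//The smallest granularity of delete is a cell, and it cannot delete across column families.

//To delete the data of all column families for one row key, i.e. a whole logical row, use deleteall
deleteall 'table_test','001'
deleteall 'table_test','002',1
//HBase does not delete data in real time: a delete can be seen as putting a new version of the data that carries a delete marker (tombstone)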

counter 

//incr adds the given amount to the current value of a cell
incr 'table_test','001','basic:scores',10

//get_counter shows the current value of a counter
get_counter 'table_test','001','basic:scores' 
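
For comparison, HappyBase exposes counters through the counter_inc and counter_get methods of a table object. The following is a minimal sketch; add_score is a hypothetical helper, and table is assumed to be an object obtained from a HappyBase connection with connection.table('table_test').

```python
def add_score(table, row_key, column, points):
    """Increment a counter cell and return (value after increment, value read back)."""
    new_value = table.counter_inc(row_key, column, value=points)
    return new_value, table.counter_get(row_key, column)


# Intended usage against a live cluster (not executed here):
# import happybase
# con = happybase.Connection('192.168.43.11')
# print(add_score(con.table('table_test'), '001', 'basic:scores', 10))
# con.close()
```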

data query 

HBase has two basic data query methods, get and scan, plus a count command:

  1.get: Get a piece of data by row key

  2.scan: Scan a table, you can specify the row key range or use filters to limit the range.

  3.count: Use the count instruction to calculate the number of logical rows in the table

//The required arguments of get are the table name and the row key
get 'table_test','001'
//Optional: column family name, timestamp range, number of versions, filter
get 'table_test','001',{COLUMN=>'basic'}
get 'table_test','001',{COLUMN=>'basic',TIMERANGE=>[1,21]}
get 'table_test','001',{COLUMN=>'basic',VERSIONS=>3}
get 'table_test','001',{COLUMN=>'basic',TIMERANGE=>[1,2],VERSIONS=>3}
get 'table_test','001',{FILTER=>"ValueFilter(=,'binary:Michael Jordan 1')"}
//scan: without a row key, HBase can only answer the query with a full table scan
scan 'table_test'
//Restrict to a column family
scan 'table_test' ,{COLUMN =>'basic'}
//Restrict to a column family and column name
scan 'table_test' ,{COLUMN =>'basic:name'}
//Limit the number of rows returned
scan 'table_test' ,{LIMIT => 1}
//Restrict the row key range; the start and stop keys are separated by a comma
scan 'table_test' ,{STARTROW =>'001',STOPROW => '003'}
//Restrict to a timestamp or a time range
scan 'table_test' ,{TIMESTAMP => 1}
scan 'table_test' ,{TIMERANGE => [1,3]}
//Use a filter
scan 'table_test' ,FILTER=>"RowFilter(=,'substring:0')"
//Maximum number of historical versions returned for the same key
scan 'table_test' ,{VERSIONS => 1}
//count returns the number of logical rows in the table
count 'table_test' 
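
The same three queries can be issued from Python through a HappyBase table object. The following is a minimal sketch: get_row, scan_range and count_rows are hypothetical helpers, and table is assumed to come from connection.table('table_test'). As in the shell, the stop row of a scan is exclusive.

```python
def get_row(table, row_key, family=None):
    """Like `get`: fetch one row, optionally restricted to a column family."""
    columns = [family] if family else None
    return table.row(row_key, columns=columns)


def scan_range(table, start, stop, limit=None):
    """Like `scan` with STARTROW/STOPROW: rows with start <= key < stop."""
    return list(table.scan(row_start=start, row_stop=stop, limit=limit))


def count_rows(table):
    """Like `count`: one result per logical row, using FirstKeyOnlyFilter."""
    return sum(1 for _ in table.scan(filter="FirstKeyOnlyFilter()"))
```

Counting on the client side still scans the whole table, so for large tables it can be slow.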

Filter query 

With both get and scan, a filter can be used to narrow the scan or the output.

//Filter performs filtering queries and is used with comparison operators or comparators: >, <, =, >=, <=, != 
show_filters

Comparators:

  • BinaryComparator: Complete byte comparator, such as binary:001, which means comparing all bytes of data in dictionary order.
  • BinaryPrefixComparator: Prefix byte comparator, such as: binaryprefix:001, which means comparing the first 3 bytes of data in dictionary order.
  • RegexStringComparator: Regular expression comparator, such as regexstring:^a.*c$, which matches all strings that start with 'a' and end with 'c'. Only two operators, = or !=, can be used.
  • SubstringComparator: Substring comparator, such as substring:00. Only two operators, = or !=, can be used.
  • BitComparator: bit comparator. Only two operators, = or !=, can be used.
  • NullComparator: Null value comparator.
//When using a comparator, write FILTER=> "FilterName(comparison)" to specify the filter to apply
//Syntactically, the filter expression is enclosed in double quotes and the comparison is enclosed in parentheses inside it.
scan 'table_test',FILTER=>"RowFilter(=,'substring:0')"
scan 'table_test',{FILTER=>"RowFilter(=,'substring:0')"}

 Filter purpose:

  • Row key filter
  • Column families and column filters
  • value filter
  • Other filters

Row key filter:

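//Row key filter, RowFilter: combined with a comparator and an operator, it compares and filters row key strings.
//Requirement: show the key-value pairs whose row key contains 0 (substring match). Substring filtering supports only = and !=, not greater-than or less-than.
scan 'table_test',FILTER=>"RowFilter(=,'substring:0')"
scan 'table_test',FILTER=>"RowFilter(>=,'binaryprefix:0')"
//Row key prefix filter, PrefixFilter: compares the row key prefix (equality only)
scan 'table_test',FILTER=>"PrefixFilter('0')"
//KeyOnlyFilter: filters and displays only the key of each cell, not the value, so it scans more efficiently than RowFilter
scan 'table_test',{FILTER=>"KeyOnlyFilter()"}

//FirstKeyOnlyFilter: returns only the first cell of each row key, with its key and value, and skips the remaining cells of that row. It can be used to count row keys (logical rows).
scan 'table_test',{FILTER=>"FirstKeyOnlyFilter()"}

//InclusiveStopFilter: a scan bounded by STARTROW and ENDROW includes the STARTROW row but not the ENDROW row; use this filter in place of ENDROW to include the stop row
scan 'table_test',{STARTROW=>'001',ENDROW=>'002'}
scan 'table_test',{STARTROW=>'001',FILTER=>"InclusiveStopFilter('002')"}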

 Column families and column filters:

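//Column family and column filters
//Column family filter: FamilyFilter
scan 'table_test',FILTER=>"FamilyFilter(=,'substring:test001')"
//Column name (qualifier) filter: QualifierFilter
scan 'table_test',FILTER=>"QualifierFilter(=,'substring:test001')"
//Column name prefix filter: ColumnPrefixFilter
scan 'table_test',FILTER=>"ColumnPrefixFilter('f')"
//ColumnPrefixFilter with several prefixes: MultipleColumnPrefixFilter
scan 'table_test',FILTER=>"MultipleColumnPrefixFilter('f','l')"
//Timestamp filter: TimestampsFilter
scan 'table_test',{FILTER=>"TimestampsFilter(1,2)"}
//Column name range filter: ColumnRangeFilter
scan 'table_test',{FILTER=>"ColumnRangeFilter('f',false,'lastname',true)"}
//Reference column filter, DependentColumnFilter: names a reference column; if a logical row contains that column, all key-value pairs in the row that have the same timestamp as the reference column are returned
//Filter arguments: the first is the column family to filter, the second is the reference column name, and the third, false, means the result includes "basic:firstname" itself; with true, only the other columns of the basic column family are returned.
scan 'table_test',{FILTER=>"DependentColumnFilter('basic','firstname',false)"}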

 Value filter:

//ValueFilter: value filter; get or scan returns the key-value pairs whose value meets the condition, here value = Michael Jordan
 get 'table_test','001',{FILTER=>"ValueFilter(=,'binary:Michael Jordan')"}
 scan 'table_test',{FILTER=>"ValueFilter(=,'binary:Michael Jordan')"}
//SingleColumnValueFilter: a value filter that compares within the given column family and column; when using it, preferably also restrict the scan to that column
scan 'table_test',{ COLUMN => 'basic:playername' , FILTER => "SingleColumnValueFilter('basic','playername',=,'binary:Micheal Jordan 3')"}
//SingleColumnValueExcludeFilter: similar to SingleColumnValueFilter but with the opposite effect, i.e. it excludes the values that match
scan 'table_test', FILTER => "SingleColumnValueExcludeFilter( 'basic' , 'playername' ,=,'binary:Micheal Jordan 3')"
//Difference between SingleColumnValueFilter and SingleColumnValueExcludeFilter: one returns the key-value pairs with Value = "Micheal Jordan ", the other returns all the other key-value pairs.
//Other filters
1. ColumnCountGetFilter: limits how many key-value pairs (cells) are returned per logical row; generally used with get rather than scan.
2. PageFilter: pages the results by row
3. ColumnPaginationFilter: pages the results by column
4. Custom filters: HBase allows new filters to be developed in Java
eg: scan 'table_test', FILTER => "ColumnPrefixFilter( 'first' ) AND ValueFilter(=, 'substring:kobe')"

eg:
hbase(main):012:0> get 'Test','002',{FILTER=>"ValueFilter(=,'binary:test004')"}
COLUMN CELL
zhangsan:wendy001 timestamp=1587208488702, value=test004
zhangsan:wendy002 timestamp=1587208582262, value=test004
1 row(s) in 0.0100 seconds

hbase(main):013:0> scan 'Test',{FILTER=>"ValueFilter(=,'binary:test004')"}
ROW COLUMN+CELL
001 column=zhangsan:wendy001, timestamp=1587208452109, value=test004
002 column=zhangsan:wendy001, timestamp=1587208488702, value=test004
002 column=zhangsan:wendy002, timestamp=1587208582262, value=test004
2 row(s) in 0.0100 seconds


hbase(main):018:0> scan 'Test',{ COLUMN => 'zhangsan:wendy002' , FILTER => "SingleColumnValueExcludeFilter('zhangsan','wendy002',=,'binary:test004')"}
ROW COLUMN+CELL
0 row(s) in 0.0040 seconds

hbase(main):019:0> scan 'Test',{COLUMN=>'zhangsan:wendy002',FILTER=>"SingleColumnValueFilter('zhangsan','wendy002',=,'binary:test004')"}
ROW COLUMN+CELL
002 column=zhangsan:wendy002, timestamp=1587208582262, value=test004
1 row(s) in 0.0060 seconds
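
When these filters are used from HappyBase, the filter expression is passed as a string to the filter argument of scan(). The following is a minimal sketch of helpers that build such strings; the helper names are hypothetical, and the escaping rule (doubling a single quote inside an argument) is an assumption about the filter language that should be verified against your HBase version.

```python
def quote(text):
    """Wrap a filter argument in single quotes, doubling any embedded quote."""
    return "'" + text.replace("'", "''") + "'"


def value_filter(operator, comparator, value):
    """Build e.g. ValueFilter(=,'binary:test004')."""
    return "ValueFilter({},{})".format(operator, quote(comparator + ":" + value))


def single_column_value_filter(family, qualifier, operator, comparator, value):
    """Build a SingleColumnValueFilter expression for one column."""
    return "SingleColumnValueFilter({},{},{},{})".format(
        quote(family), quote(qualifier), operator, quote(comparator + ":" + value)
    )


def all_of(*filters):
    """Combine filter expressions with AND."""
    return " AND ".join(filters)


print(value_filter("=", "binary", "test004"))
# Intended usage (not executed here):
# for key, data in table.scan(filter=value_filter("=", "binary", "test004")):
#     print(key, data)
```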

 Snapshot operation

Snapshot: a way to create a copy of a table without copying its data. Snapshots can be used for data recovery, for building daily, weekly or monthly data reports, and for testing.

Snapshot prerequisite: the hbase.snapshot.enabled property in the HBase configuration file hbase-site.xml must be set to true. Under normal circumstances, HBase sets it to true by default.

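//Create snapshot p1 of the table
snapshot 'test001','p1'
//List snapshots
list_snapshots
//Delete a snapshot
delete_snapshot 'p1'
PS: After a snapshot is deleted, the data of the original table still exists. After the original table is deleted, the data of the snapshot still exists as well.

//Create new table play_1 from the snapshot; creating a table this way copies no data and only performs metadata operations
clone_snapshot 'p1','play_1'
//Restore the original table from the snapshot, discarding all changes made after the snapshot
restore_snapshot 'p1'

//Rename a table with a snapshot: take a snapshot, clone it into a new table, then drop the old table and the snapshot that are no longer needed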
snapshot 'player','p1'
clone_snapshot 'p1','play_1'
disable 'player'
drop 'player'
delete_snapshot 'p1'

 Batch import and export

Scenario: The put method writes data one record at a time, but when a large amount of data has to be written to HBase at once, batch operations are required. Backing up data to a location such as HDFS also requires batch operations, which are implemented with Hadoop's MapReduce. The import source and the backup destination are usually on HDFS.

There are two ways to import data in batches:

  1. The first is parallel data insertion, using MapReduce and other methods to send data to multiple RegionServers.

  2. The second method is to directly convert the original data into HFile based on the table information, copy the data to the corresponding location in HDFS, and then incorporate the data in the file into management.

Method 1, use the ImportTsv class: import text files stored on HDFS into a given HBase table. The text files should have a clear column delimiter, such as TSV format separated by '\t' (the TAB key) or comma-separated CSV format.

  Principle: The whole file is scanned and the data is written record by record. MapReduce launches multiple processes on multiple nodes that read file blocks from HDFS at the same time. Each record is sent to the RegionServer that hosts its region, so distributed parallel reads and writes speed up the import.

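//Call the ImportTsv class through the hbase command on the Linux command line
//player is the table name and hdfs://namenode:8020/input/ is the directory holding the input files; no file name is needed because every file in the directory is read during the import.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,basic:playername,advance:scores -Dimporttsv.skip.bad.lines=true player hdfs://namenode:8020/input/
//-Dimporttsv.columns lists the fields in order: the keyword HBASE_ROW_KEY marks the field of the text file used as the row key, the second field is written to column playername in column family basic, and the third to column scores in column family advance. This parameter is normally required.
//-Dimporttsv.skip.bad.lines=true skips invalid lines; if set to false, an invalid line makes the import report a failure

//Optional parameters
//-Dimporttsv.separator=',' uses a comma as the separator; other separators such as '\0' can also be given, and the default separator is '\t'.
//-Dimporttsv.timestamp=1298529542218 uses the given timestamp for the import; if omitted, the current time is used.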
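
Preparing the input file and the command line can be scripted. The following is a minimal sketch, assuming the column layout used above (row key, basic:playername, advance:scores); rows_to_tsv and import_tsv_command are hypothetical helpers, and the sample row is made up for illustration.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize rows (sequences of strings) as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def import_tsv_command(table, input_dir, columns, skip_bad_lines=True):
    """Build the ImportTsv argument list; the first field is the row key."""
    return [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        "-Dimporttsv.columns=" + ",".join(["HBASE_ROW_KEY"] + list(columns)),
        "-Dimporttsv.skip.bad.lines=" + ("true" if skip_bad_lines else "false"),
        table,
        input_dir,
    ]


print(rows_to_tsv([("001", "Michael Jordan", "30")]), end="")
print(" ".join(import_tsv_command(
    "player", "hdfs://namenode:8020/input/",
    ["basic:playername", "advance:scores"])))
```

The generated text still has to be uploaded to the HDFS input directory, and the argument list can be passed to subprocess.run on a machine where the hbase command is available.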

Method two, use the bulk-load method: convert the original data directly into HFiles, copy them to the corresponding location in HDFS, and then bring the files under HBase's management. This takes two steps.

//Prerequisite: the table structure already exists and the table name is given in the command, because the files are prepared according to the table structure and its regions
//Step 1: generate the files with ImportTsv
//Step 2: copy

//Step 1: generate the files with ImportTsv
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,basic:playername,advance:scores -Dimporttsv.skip.bad.lines=true -Dimporttsv.bulk.output=hdfs://namenode:8020/bulkload/ player hdfs://namenode:8020/input/
//-Dimporttsv.bulk.output sets the HDFS path where the HFiles are stored: hdfs://namenode:8020/bulkload/. Because of how MapReduce works, this path must not exist beforehand
//Step 2: copy; the arguments are the path of the HFiles and the table name.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles hdfs://namenode:8020/bulkload player

Method three, import data from a relational database into HBase: among the Hadoop family of components, Sqoop can import and export data between big data tools such as Hadoop, Hive and HBase and relational databases such as MySQL and Oracle.

Sqoop comes in two versions, 1 and 2. Sqoop1 is relatively simple to use, while Sqoop2 integrates more functions and has a more complex architecture.

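//Taking Sqoop1 as an example, installation is essentially just unpacking the archive.
//To access a database such as MySQL, download the database connector (mysql-connector-java-x.jar) yourself and copy it into Sqoop's lib directory.
sqoop import --connect jdbc:mysql://node1:3306/database1 --table table1 --hbase-table player --column-family f1 --hbase-row-key playername --hbase-create-table --username 'root' --password '123456'
//import reads data from MySQL; the command then gives the address (node1), port (3306), database name (database1) and table name (table1) of the MySQL data source.
//The data is imported into the HBase table player and stored in column family f1; column names stay the same as in MySQL, and the row key is the MySQL column playername.
//--hbase-create-table creates the table in HBase; the last options give the username and password for MySQL.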

backup and restore 

HBase supports copying tables or snapshots to HDFS, and supports copying data to other HBase clusters to achieve data backup and recovery functions. There are four ways:

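//Export, Import, ExportSnapshot, CopyTable
//Export: exports HBase data to HDFS for backup; the files cannot be viewed directly as text.
//<tablename> is the table name and <outputdir> is an HDFS path.
hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir>

//Import: restores exported data into HBase; <inputdir> is the directory written by Export.
hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

//ExportSnapshot
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot name> -copy-to <outputdir>
//<snapshot name> is the snapshot name and <outputdir> is an HDFS path; an exported snapshot can then be turned back into a table with clone_snapshot or restore_snapshot.

//CopyTable: copies the contents of one table into a new table; the new table and the original can be in the same cluster or in different clusters. The copy is performed with MapReduce.
//Prerequisite: the new table has already been created
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=<NEW_TABLE_NAME> --peer.adr=<zookeeper_peer:2181:/hbase> <TABLE_NAME>
//--new.name=<NEW_TABLE_NAME> gives the name of the new table; if omitted, it defaults to the name of the original table.
//--peer.adr=<zookeeper_peer:2181:/hbase> points to the hbase entry in the target cluster's ZooKeeper service (which includes the address of the meta table, among other things)

//CopyTable help
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help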

HappyBase API Practice

Connect to HBase

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/14 22:56
# File : python_hbase_1.py
# IDE : PyCharm
import happybase

con = happybase.Connection('192.168.43.11')
con.open()  # open the transport
print(con.tables())  # print all table names
con.close()  # close the transport
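
If an exception is raised between open() and close(), the transport stays open. The following is a minimal sketch of a wrapper that always closes the connection; hbase_connection is a hypothetical helper, and the connect argument is expected to be happybase.Connection (which opens the transport automatically by default).

```python
from contextlib import contextmanager


@contextmanager
def hbase_connection(connect, host, port=9090):
    """Create a connection with connect(host, port=port) and always close it."""
    connection = connect(host, port=port)
    try:
        yield connection
    finally:
        connection.close()


# Intended usage against a live Thrift server (not executed here):
# import happybase
# with hbase_connection(happybase.Connection, '192.168.43.11') as con:
#     print(con.tables())
```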


table operation

Create table

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:11
# File : python_hbase_2
# IDE : PyCharm
import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the Thrift transport (TCP connection)

families = {
    'wangzherongyao': dict(max_versions=2),  # keep at most 2 versions
    'hepingjingying': dict(max_versions=1, block_cache_enabled=False),
    'xiaoxiaole': dict(),  # use the defaults; max_versions defaults to 3
}
con.create_table('games', families)  # 'games' is the table name; families maps each column family name to a dict of its options

print(con.tables())  # print the tables
con.close()  # close the transport

Configuration options:

  • max_versions (int type)
  • compression (str type)
  • in_memory (bool type)
  • bloom_filter_type (str type)
  • bloom_filter_vector_size (int type)
  • bloom_filter_nb_hashes (int type)
  • block_cache_enabled (bool type)
  • time_to_live (int type)
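
Calling create_table for a table that already exists raises an error, so it can help to check first. The following is a minimal sketch: ensure_table is a hypothetical helper, GAME_FAMILIES repeats the options used above, and the code assumes that connection.tables() returns table names as bytes on Python 3 (plain strings are handled as well).

```python
GAME_FAMILIES = {
    'wangzherongyao': dict(max_versions=2),
    'hepingjingying': dict(max_versions=1, block_cache_enabled=False),
    'xiaoxiaole': dict(),
}


def ensure_table(connection, name, families):
    """Create the table only if it is missing; return True when it was created."""
    existing = {
        t.decode('utf-8') if isinstance(t, bytes) else t
        for t in connection.tables()
    }
    if name in existing:
        return False
    connection.create_table(name, families)
    return True


# Intended usage (not executed here):
# import happybase
# con = happybase.Connection('192.168.43.11')
# print(ensure_table(con, 'games', GAME_FAMILIES))
# con.close()
```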

 enable or disable table 

Note: A table must be disabled before it can be altered or deleted. A table can only be disabled or enabled once in a row: disabling an already disabled table, or enabling an already enabled one, raises an error.

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:15
# File : python_hbase_3
# IDE : PyCharm
# Disable a table
import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the Thrift transport (TCP connection)

con.disable_table('games')  # disable the table; 'games' is the table name
print(con.is_table_enabled('games'))  # check the table state: False means disabled, True means enabled
print(con.tables())  # a disabled table still exists; only its state has changed

con.close()  # close the transport


# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:16
# File : python_hbase_4
# IDE : PyCharm

# Enable a table
import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the Thrift transport (TCP connection)

con.enable_table('games')  # enable the table
print(con.is_table_enabled('games'))  # check the table state: False means disabled, True means enabled
print(con.tables())  # the table is listed whether it is enabled or disabled; only its state changes

con.close()  # close the transport
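
Because repeating an enable or disable raises an error, the current state can be checked before changing it. The following is a minimal sketch; set_table_enabled is a hypothetical helper built on the connection methods used in the scripts above.

```python
def set_table_enabled(connection, name, enabled):
    """Bring the table to the requested state; return True if a change was made."""
    if connection.is_table_enabled(name) == enabled:
        return False  # already in the requested state, nothing to do
    if enabled:
        connection.enable_table(name)
    else:
        connection.disable_table(name)
    return True


# Intended usage (not executed here):
# import happybase
# con = happybase.Connection('192.168.43.11')
# set_table_enabled(con, 'games', False)  # disable once
# set_table_enabled(con, 'games', False)  # second call is a no-op
# con.close()
```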


Delete table

A table must be disabled before it can be deleted. HappyBase's delete_table function can do both: pass disable=True as the second parameter to disable the table and then delete it. If the table has already been disabled, the second parameter can be omitted; it defaults to False.

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:20
# File : python_hbase_5.py
# IDE : PyCharm
import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the Thrift transport (TCP connection)

con.delete_table('games', disable=True)  # first argument: table name; second: whether to disable the table first

print(con.tables())

con.close()


data manipulation 

Create data

Note: If the column does not exist yet when data is written, HBase creates the column and then writes the data.

In the HBase shell, the put command writes only one cell at a time, whereas the put function of the happybase library can write several cells in one call.

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:24
# File : python_hbase_6.py
# IDE : PyCharm

import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the transport

biao = con.table('games')  # 'games' is the table name; table('games') returns a table object

wangzhe = {
    'wangzherongyao:名字': '别出大辅助',
    'wangzherongyao:等级': '30',
    'wangzherongyao:段位': '最强王者',
}
biao.put('0001', wangzhe)  # write the data; 0001 is the row key and the data is given as a dict

# Read the row back to check the write (reading is covered in the next section)
one_row = biao.row('0001')  # fetch one row; 0001 is the row key
for value in one_row.keys():  # iterate over the dict
    print(value.decode('utf-8'), one_row[value].decode('utf-8'))  # names and values may contain Chinese, so decode the bytes as UTF-8

con.close()  # close the transport


View operations 

After connecting, get a table object and operate on it. Several read operations are shown below. The first reads one row and the second reads a single cell. Because I stored Chinese text, HBase holds its UTF-8 encoded bytes rather than the characters themselves, so the bytes returned by HBase are decoded here. The third reads multiple rows, and the fourth uses a scanner to fetch the data of the entire table.

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:34
# File : python_hbase_7.py
# IDE : PyCharm
import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the transport

biao = con.table('games')  # 'games' is the table name; table('games') returns a table object

print('-----------------------first-----------------------------')
one_row = biao.row('0001')  # fetch one row; 0001 is the row key
for value in one_row.keys():  # iterate over the dict
    print(value.decode('utf-8'), one_row[value].decode('utf-8'))  # may contain Chinese, so decode the bytes as UTF-8

print('-----------------------second-----------------------------')
print(biao.cells('0001', 'wangzherongyao:段位')[0].decode('utf-8'))  # fetch one cell; cells() returns a list, so take the first item and decode it. 0001 is the row key, wangzherongyao the column family, 段位 the column name

print('-----------------------third-----------------------------')
for key, value in biao.rows(['0001', '0002']):  # fetch several rows; the list or tuple can hold multiple row keys
    # print(key, '<=====>', value)  # no data was written for 0002, so it is not found and nothing is returned for it
    for index in value.keys():  # iterate over the dict
        print(key.decode('utf-8'), index.decode('utf-8'), value[index].decode('utf-8'))  # may contain Chinese, so decode the bytes as UTF-8

print('-----------------------fourth----------------------------')
for rowkey, liecu in biao.scan():  # scan() returns a scanner, an iterable over the rows of the table
    # print(rowkey, '<=====>', liecu)
    for index in liecu.keys():  # iterate over the dict
        print(rowkey.decode('utf-8'), index.decode('utf-8'), liecu[index].decode('utf-8'))  # may contain Chinese, so decode the bytes as UTF-8

con.close()  # close the transport
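
The repeated decode calls can be collected into one helper. The following is a minimal sketch that assumes every column name and value in the row is UTF-8 encoded text; decode_row is a hypothetical name, and the sample row only imitates the bytes that HappyBase returns.

```python
def decode_row(row, encoding='utf-8'):
    """Convert a {bytes: bytes} row dict into a {str: str} dict."""
    return {key.decode(encoding): value.decode(encoding)
            for key, value in row.items()}


# A row shaped like the result of biao.row('0001') (sample data only).
raw = {'wangzherongyao:段位'.encode('utf-8'): '最强王者'.encode('utf-8')}
print(decode_row(raw))
```

Values that are not text, such as counters, are stored as raw bytes and should not be passed through this helper.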


 delete data

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:38
# File : python_hbase_8.py
# IDE : PyCharm

import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the transport

biao = con.table('games')  # 'games' is the table name; table('games') returns a table object

biao.delete('0003', ['wangzherongyao:段位'])  # delete one cell
# biao.delete('0003', ['wangzherongyao:名字', 'wangzherongyao:等级'])  # delete several cells
# biao.delete('0003', ['wangzherongyao'])  # delete a whole column family
# biao.delete('0003')  # delete a whole row

# Read the data back to see whether it is still there
for rowkey, liecu in biao.scan():  # scan() returns a scanner, an iterable over the rows of the table
    # print(rowkey, '<=====>', liecu)
    for index in liecu.keys():  # iterate over the dict
        print(rowkey.decode('utf-8'), index.decode('utf-8'), liecu[index].decode('utf-8'))  # may contain Chinese, so decode the bytes as UTF-8
con.close()  # close the transport

As mentioned earlier, a delete is expected to remove the most recent version by timestamp, so that the next most recent version is shown when the cell is read again. Let's test whether this is the case.

# -*- coding: utf-8 -*-
# Author : zhuozhiwengang
# Created : 2023/8/16 15:45
# File : python_hbase_9
# IDE : PyCharm
import happybase

con = happybase.Connection('192.168.43.11')  # port 9090 by default
con.open()  # open the transport

biao = con.table('games')  # 'games' is the table name; table('games') returns a table object
biao.put('0001', {'wangzherongyao:段位': '最强王者'})
biao.put('0001', {'wangzherongyao:段位': '永恒钻石V'})
biao.put('0001', {'wangzherongyao:段位': '尊贵铂金I'})  # write three values to the same cell
print(biao.cells('0001', 'wangzherongyao:段位'))  # the cell shows the version with the latest timestamp, i.e. 尊贵铂金I

biao.delete('0001', ['wangzherongyao:段位'])  # delete the cell; in theory the next read should show 永恒钻石V
print(biao.cells('0001', 'wangzherongyao:段位'))  # read the cell again: the result is empty

con.close()  # close the transport

Problem description: Deleting a cell with happybase's delete removes every version of that cell, whereas in theory only the most recent version should be removed.

Reason: The delete function of the happybase library wraps the deleteall behavior of the HBase shell, so be careful when calling it.
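
To see exactly which versions exist before and after a delete, the cell history can be listed with timestamps. The following is a minimal sketch; version_history is a hypothetical helper that relies on the versions and include_timestamp arguments of Table.cells(), and how many versions come back also depends on the max_versions setting of the column family.

```python
def version_history(table, row_key, column, max_versions=3):
    """Return [(timestamp, text), ...] for the stored versions of one cell."""
    cells = table.cells(row_key, column,
                        versions=max_versions, include_timestamp=True)
    # With include_timestamp=True each item is a (value, timestamp) tuple.
    return [(timestamp, value.decode('utf-8')) for value, timestamp in cells]


# Intended usage (not executed here):
# print(version_history(biao, '0001', 'wangzherongyao:段位'))
# biao.delete('0001', ['wangzherongyao:段位'])
# print(version_history(biao, '0001', 'wangzherongyao:段位'))
```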

Batch processing

The batch() function returns a Batch object for performing batch operations. The Batch object supports the context management protocol, can queue multiple put and delete operations, and submits them to the server with its send function.
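
A batch used as a context manager might look like the following minimal sketch. apply_changes is a hypothetical helper; it relies on table.batch(), Batch.put() and Batch.delete(), and on the queued operations being sent when the with block exits without an error.

```python
def apply_changes(table, puts, deletes=(), batch_size=None):
    """Queue puts ({row_key: {column: value}}) and row deletes in one batch.

    Returns the number of operations queued.
    """
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in puts.items():
            batch.put(row_key, data)
        for row_key in deletes:
            batch.delete(row_key)
    # Leaving the with block sends whatever is still queued.
    return len(puts) + len(deletes)


# Intended usage (not executed here):
# biao = con.table('games')
# apply_changes(
#     biao,
#     {'0002': {'wangzherongyao:等级': '12'},
#      '0003': {'wangzherongyao:等级': '25'}},
#     deletes=['0001'],
# )
```

Passing a batch_size makes HappyBase send the queued operations automatically each time that many have accumulated, which keeps memory use bounded for large loads.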

Reference articles: HBase Shell and its command operations; HappyBase Official Documentation


Origin blog.csdn.net/zhouzhiwengang/article/details/132286690