Background
This article records the process of installing HBase on CentOS 7. Please install Hadoop, Kafka, and ZooKeeper in advance; for reference, see the articles Hadoop 2.5.0 installation and deployment under CentOS 7, CentOS 7 installation and use of Kafka and its monitoring components, and CentOS 7 installation of ZooKeeper.
Data structure
1. Logical structure
The relationship between columns, column families, row keys, regions, and other terms is shown in the figure below
Row keys are sorted lexicographically
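Because row keys are compared byte-by-byte, numeric suffixes sort lexicographically rather than numerically, which is why row keys with numbers are usually zero-padded. A minimal plain-Java illustration (not the HBase API):

```java
import java.util.TreeSet;

public class RowKeyOrder {
    public static void main(String[] args) {
        // Lexicographic order: "row-10" sorts before "row-2"
        TreeSet<String> keys = new TreeSet<>();
        keys.add("row-1"); keys.add("row-2"); keys.add("row-10");
        System.out.println(keys); // [row-1, row-10, row-2]

        // Zero-padding the numeric part restores the intended order
        TreeSet<String> padded = new TreeSet<>();
        padded.add("row-01"); padded.add("row-02"); padded.add("row-10");
        System.out.println(padded); // [row-01, row-02, row-10]
    }
}
```

The same comparison rule is why timestamps embedded in row keys are stored as fixed-width strings.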
2. The physical storage structure is shown in the figure below
3. The architecture diagram is as follows
4. Writing process
Memstore flushing:
5. Reading process
6. Merging storage files
7. Partitioning
Installation
Start Hadoop, Kafka, and ZooKeeper
Unzip the hbase compressed package
[root@localhost szc]# tar -zxvf hbase-1.3.1-bin.tar.gz
Configure HBase
Enter the conf directory under the HBase directory and modify three files: hbase-site.xml, hbase-env.sh, and regionservers. The IPs used below are the local IP of the CentOS machine.
[root@localhost szc]# cd hbase-1.3.1/conf/
In hbase-env.sh, set JAVA_HOME and disable the ZooKeeper instance bundled with HBase:
export JAVA_HOME=/home/szc/jdk8_64
export HBASE_MANAGES_ZK=false
In hbase-site.xml, configure the HDFS root path, distributed mode, web UI port, ZooKeeper address, and related settings:
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://192.168.57.141:8020/Hbase</value>
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>192.168.57.141</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/szc/zookeeper/data</value>
    </property>
    <property>
        <name>hbase.master.info.port</name>
        <value>16010</value>
    </property>
</configuration>
In regionservers, list the region server hosts; here it is the local IP
192.168.57.141
Start hbase
[root@localhost conf]# cd ..
[root@localhost hbase-1.3.1]# ./bin/hbase-daemon.sh start master
[root@localhost hbase-1.3.1]# ./bin/hbase-daemon.sh start regionserver
After opening port 16010, you can view the HBase web UI in a browser on Windows.
The arrow in the figure points to the region servers, which store and serve the regions of each table.
If you want to start an HBase cluster, you can run start-hbase.sh
[root@localhost hbase-1.3.1]# ./bin/start-hbase.sh
Close hbase
[root@localhost hbase-1.3.1]# ./bin/stop-hbase.sh
Enter the HBase shell
[root@localhost hbase-1.3.1]# ./bin/hbase shell
2020-05-09 09:19:41,328 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/szc/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/szc/cdh/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.3.1, r930b9a55528fe45d8edce7af42fef2d35e77677a, Thu Apr 6 19:36:54 PDT 2017
hbase(main):001:0>
Commands
Create a table, specify the column family
hbase(main):019:0> create 'testtable', 'colfaml';
Insert data, specify the row key, column family qualifier and value
hbase(main):007:0> put 'testtable', 'myrow-1', 'colfaml:ql', 'value-1'
hbase(main):008:0> put 'testtable', 'myrow-2', 'colfaml:qk', 'value-2'
hbase(main):009:0> put 'testtable', 'myrow-2', 'colfaml:qj', 'value-3'
Scan table
hbase(main):010:0> scan 'testtable'
ROW COLUMN+CELL
myrow-1 column=colfaml:ql, timestamp=1580287260033, value=value-1
myrow-2 column=colfaml:qj, timestamp=1580287323632, value=value-3
myrow-2 column=colfaml:qk, timestamp=1580287294044, value=value-2
2 row(s) in 0.0220 seconds
Get a single row of data
hbase(main):011:0> get 'testtable', 'myrow-1'
COLUMN CELL
colfaml:ql timestamp=1580287260033, value=value-1
1 row(s) in 0.0220 seconds
Delete data (cell)
hbase(main):012:0> delete 'testtable', 'myrow-2', 'colfaml:qj'
Disable and delete table
hbase(main):013:0> disable 'testtable'
hbase(main):014:0> drop 'testtable'
Let the table support multiple versions of data
hbase(main):015:0> alter 'test', { NAME => 'cf1', VERSIONS => 3 }
Then view the description information of the test table
hbase(main):016:0> describe 'test'
Table test is ENABLED
test
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_E
NCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '655
36', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0110 seconds
You can see that up to 3 versions are now kept. Next, insert two values for row1 in cf1:
hbase(main):017:0> put 'test', 'row1', 'cf1', 'val2'
hbase(main):018:0> put 'test', 'row1', 'cf1', 'val3'
Scanning with multiple versions now shows the historical data:
hbase(main):019:0> scan 'test', { VERSIONS => 3}
ROW COLUMN+CELL
row1 column=cf1:, timestamp=1580357787736, value=val3
row1 column=cf1:, timestamp=1580357380211, value=val2
1 row(s) in 0.0120 seconds
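The behavior above can be modeled in plain Java: each cell keeps a timestamp-ordered map of values, trimmed to the newest N entries. This is only a toy sketch of the VERSIONS semantics, not HBase's actual storage:

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.TreeMap;

public class VersionedCell {
    private final int maxVersions;
    // timestamp -> value, iterated newest-first
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    public VersionedCell(int maxVersions) { this.maxVersions = maxVersions; }

    public void put(long ts, String value) {
        versions.put(ts, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // evict the oldest version
        }
    }

    public String get() { return versions.firstEntry().getValue(); } // newest wins

    public Collection<String> scanAllVersions() { return versions.values(); }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell(3);
        cell.put(1L, "val1"); cell.put(2L, "val2");
        cell.put(3L, "val3"); cell.put(4L, "val4"); // val1 falls out
        System.out.println(cell.get());             // val4
        System.out.println(cell.scanAllVersions()); // [val4, val3, val2]
    }
}
```

A plain get returns only the newest value, just as in the shell; older versions only appear when all versions are requested.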
Integration with hive
First add the environment variable HBASE_HOME pointing to the HBase installation directory. Then start HBase and the Hive metastore service.
Create a table in hive
create table pokes(key string, value string) row format delimited fields terminated by ',';
Load data into it:
load data local inpath '/home/szc/data.txt' into table pokes;
Then create a table in Hive backed by HBaseStorageHandler, specifying how the row key and HBase columns map to the Hive table's columns:
create table hbase_table(key string, value string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties('hbase.columns.mapping' = ':key,cf1:val');
When this completes, a table named hbase_table appears in both Hive and HBase. Then insert data on the Hive side:
insert overwrite table hbase_table select * from pokes;
Wait for this command to complete, then run the following query to verify the contents:
select * from hbase_table;
Integration with pig
First copy PigHome/lib/spark/netty-all-***.Final.jar up one level into PigHome/lib, then start Hadoop's job history server:
mr-jobhistory-daemon.sh start historyserver
Then copy the data file to /user/username of hdfs
hadoop fs -copyFromLocal /home/szc/pig/tutorial/data/excite-small.log /user/songzeceng
After creating the excite table (with column family cf1) in HBase, start Pig. The following commands are all run in the Pig command line.
First load the data file, splitting columns on the tab separator:
raw = LOAD 'hdfs:///user/songzeceng/excite-small.log' USING PigStorage ('\t') AS (user, time, query);
Then, for each row, concatenate user, '\u0000', and time to form the row key; the remaining query field becomes the column value:
T = FOREACH raw GENERATE CONCAT(CONCAT(user, '\u0000'), time), query;
Then store T in the excite table of HBase and specify the column
store T into 'excite' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:query');
Now you can read the data back: load the table on the Pig side, storing the row key and column value in the key and query fields, both as chararrays:
R = LOAD 'excite' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:query', '-loadKey') as (key: chararray, query: chararray);
View relationship R
dump R;
The sample output is as follows
(FE785BA19AAA3CBB 970916083558,dystrophie musculaire duch?ne)
(FE785BA19AAA3CBB 970916083732,)
(FE785BA19AAA3CBB 970916083839,)
(FE785BA19AAA3CBB 970916084121,dystrophie)
(FE785BA19AAA3CBB 970916084512,dystrophie musculaire)
(FE785BA19AAA3CBB 970916084553,)
(FE785BA19AAA3CBB 970916085100,dystrophie musculaire)
(FEA681A240A74D76 970916193646,kerala)
(FEA681A240A74D76 970916194158,kerala)
(FEA681A240A74D76 970916194554,kerala)
(FEA681A240A74D76 970916195314,kerala)
(FF5C9156B2D27FBD 970916114959,fredericksburg)
(FFA4F354D3948CFB 970916055045,big cocks)
(FFA4F354D3948CFB 970916055704,big cocks)
(FFA4F354D3948CFB 970916060431,big cocks)
(FFA4F354D3948CFB 970916060454,big cocks)
(FFA4F354D3948CFB 970916060901,big cocks)
(FFA4F354D3948CFB 970916061009,big cocks)
(FFCA848089F3BA8C 970916100905,marilyn manson)
You can also split the key (row key) on '\u0000' to expand it into user and time:
S = foreach R generate FLATTEN (STRSPLIT(key, '\u0000', 2)) AS (user:chararray, time:long), query;
The sample output of dump S is as follows
(FE785BA19AAA3CBB,970916083435,dystrophie musculaire)
(FE785BA19AAA3CBB,970916083531,dystrophie musculaire duch?ne)
(FE785BA19AAA3CBB,970916083558,dystrophie musculaire duch?ne)
(FE785BA19AAA3CBB,970916083732,)
(FE785BA19AAA3CBB,970916083839,)
(FE785BA19AAA3CBB,970916084121,dystrophie)
(FE785BA19AAA3CBB,970916084512,dystrophie musculaire)
(FE785BA19AAA3CBB,970916084553,)
(FE785BA19AAA3CBB,970916085100,dystrophie musculaire)
(FEA681A240A74D76,970916193646,kerala)
(FEA681A240A74D76,970916194158,kerala)
(FEA681A240A74D76,970916194554,kerala)
(FEA681A240A74D76,970916195314,kerala)
(FF5C9156B2D27FBD,970916114959,fredericksburg)
(FFA4F354D3948CFB,970916055045,big cocks)
(FFA4F354D3948CFB,970916055704,big cocks)
(FFA4F354D3948CFB,970916060431,big cocks)
(FFA4F354D3948CFB,970916060454,big cocks)
(FFA4F354D3948CFB,970916060901,big cocks)
(FFA4F354D3948CFB,970916061009,big cocks)
(FFCA848089F3BA8C,970916100905,marilyn manson)
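The STRSPLIT step has a direct plain-Java analogue: since the row key was built as user + '\u0000' + time, splitting on that delimiter with a limit of 2 recovers both fields. A minimal sketch with a made-up key:

```java
public class KeySplit {
    public static void main(String[] args) {
        // Row key built the same way as the Pig CONCAT above
        String key = "FE785BA19AAA3CBB" + "\u0000" + "970916083558";

        // Equivalent of STRSPLIT(key, '\u0000', 2)
        String[] parts = key.split("\u0000", 2);
        System.out.println(parts[0]); // FE785BA19AAA3CBB (user)
        System.out.println(parts[1]); // 970916083558 (time)
    }
}
```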
Row key design principles
Length principle
A row key can be at most 64 KB; 10 to 100 bytes is recommended, preferably a multiple of 8. Keep it as short as possible, since long row keys hurt performance.
Uniqueness principle
The row key must be unique
Hashing principle
1) Salting: prepend a random number to the timestamp, and use the result as the row key
2) String reversal: convert the timestamp to a string and reverse it to form the row key; this is commonly used for timestamp and phone-number row keys
3) Partition number: compute a partition number with custom logic and use it as a prefix
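The three techniques can be sketched in plain Java; the bucket count, partition count, and helper names here are illustrative assumptions, not part of any HBase API:

```java
import java.util.Random;

public class RowKeyHashing {

    // 1) Salting: prepend a random bucket number so writes spread across regions
    static String salted(long timestamp, int buckets, Random rnd) {
        return rnd.nextInt(buckets) + "_" + timestamp;
    }

    // 2) String reversal: reverse a timestamp or phone-number string
    static String reversed(String value) {
        return new StringBuilder(value).reverse().toString();
    }

    // 3) Partition number: derive a stable partition from a hash of the fields
    static int partition(String tel, String date, int partitions) {
        return Math.floorMod((tel + "_" + date).hashCode(), partitions);
    }

    public static void main(String[] args) {
        System.out.println(salted(20200103174616L, 6, new Random()));
        System.out.println(reversed("13812345678")); // 87654321831
        System.out.println(partition("13812345678", "202001", 6)); // a value in [0, 6)
    }
}
```

Salting scatters monotonically increasing keys at the cost of scattering scans too; reversal and partition prefixes keep related rows reachable as long as the reader can reconstruct the prefix.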
Use of coprocessor
A coprocessor in HBase plays a role similar to a trigger in an RDBMS. Usage is as follows:
1. Define a class that extends BaseRegionObserver and override its preXX() or postXX() methods; as the names suggest, these run custom logic before or after the execution of operation XX.
public class InsertCoprocessor extends BaseRegionObserver {

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put,
                        WALEdit edit, Durability durability) throws IOException {
        Table table = e.getEnvironment().getTable(TableName.valueOf(Names.TABLE.getValue()));

        // Example row key: 5_17885275338_20200103174616_19565082510_0713_1
        String rowKey = new String(put.getRow());
        String[] values = rowKey.split("_");

        CoprocessorDao dao = new CoprocessorDao();
        String from = values[1];
        String to = values[3];
        String callTime = values[2];
        String duration = values[4];

        if (values[5].equals("0")) {
            String flag = "1";
            int regionNum = dao.getRegionNumber(to, callTime);
            rowKey = regionNum + "_" + from + "_" + callTime
                    + "_" + to + "_" + duration + "_" + flag;

            byte[] family_in = Names.CF_CALLIN.getValue().getBytes();

            Put put_in = new Put(rowKey.getBytes());
            put_in.addColumn(family_in, "from".getBytes(), from.getBytes());
            put_in.addColumn(family_in, "to".getBytes(), to.getBytes());
            put_in.addColumn(family_in, "callTime".getBytes(), callTime.getBytes());
            put_in.addColumn(family_in, "duration".getBytes(), duration.getBytes());
            put_in.addColumn(family_in, "flag".getBytes(), flag.getBytes());

            table.put(put_in);
        }
        // Close the table whether or not an extra record was inserted
        table.close();
    }

    private class CoprocessorDao extends BaseDao {
        public int getRegionNumber(String tel, String date) {
            return genRegionNum(tel, date);
        }
    }
}
The code above inserts a second Put, deciding whether to insert the extra record based on the sixth part of the row key; remember to close the table after inserting.
2. Then, when creating the table, register the coprocessor class:
public HBaseDao() throws Exception {
    ...
    createTable(Names.TABLE.getValue(), ValueConstant.REGION_COUNT,
            "com.szc.telcom.consumer.coprocessor.InsertCoprocessor", new String[] {
                    Names.CF_CALLER.getValue(), Names.CF_CALLIN.getValue()
            });
    ...
}

protected void createTable(String name, Integer regionCount, String coprocessorClass,
                           String[] colFamilies) throws Exception {
    ...
    createNewTable(name, coprocessorClass, regionCount, colFamilies);
}

private void createNewTable(String name, String coprocessorClass, Integer regionCount,
                            String[] colFamilies) throws Exception {
    ...
    if (!StringUtils.isEmpty(coprocessorClass)) {
        tableDescriptor.addCoprocessor(coprocessorClass);
    }
    ...
    admin.createTable(tableDescriptor, splitKeys);
}
3. Package the module with its dependencies; the relevant plugin configuration in the pom is:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.szc.telcom.producer.Bootstrap</mainClass> <!-- custom main class -->
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The coprocessor here is one module of a larger project, so the entire project can be packaged together.
4. Put the resulting coprocessor jar (with dependencies) into the lib directory of the HBase cluster
5. Restart the HBase cluster and re-run the project jar package
Conclusion
That's all. If you have any questions, please discuss them in the comments.