Installing and using HBase on CentOS 7

Table of contents

Background

Data structure

Installation

Start Hadoop, Kafka and ZooKeeper

Extract the HBase tarball

Configure HBase

Start HBase

Stop HBase

Enter the HBase shell

Commands

Create a table and specify a column family

Insert data, specifying the row key, column family, qualifier and value

Scan a table

Get a single row of data

Delete data (cell)

Disable and drop a table

Enable multi-version storage on a table

Integration with Hive

Integration with Pig

Row key design principles

Length principle

Uniqueness principle

Hashing principle

Using coprocessors

Conclusion

Background

This article records the process of installing HBase on CentOS 7. Please install Hadoop, Kafka and ZooKeeper in advance; you can refer to the articles "Hadoop 2.5.0 installation and deployment under CentOS 7", "CentOS 7 installation and use of Kafka and its monitoring components", and "CentOS 7 installation of ZooKeeper".

Data structure

1. Logical structure

The relationship between columns, column families, row keys, regions and other terms is shown in the figure below.

Row keys are sorted lexicographically
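Because the comparison is byte-by-byte, numeric suffixes that are not zero-padded sort in a perhaps unexpected order. Below is a minimal Java sketch (not from the original article) that uses the client's Bytes utility to show the effect:

import java.util.Arrays;

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyOrderDemo {
    public static void main(String[] args) {
        // Row keys compare byte-by-byte, so "row-10" sorts before "row-2"
        String[] keys = {"row-2", "row-10", "row-1"};
        Arrays.sort(keys, (a, b) -> Bytes.compareTo(Bytes.toBytes(a), Bytes.toBytes(b)));
        System.out.println(Arrays.toString(keys)); // prints [row-1, row-10, row-2]
    }
}

Zero-padding numeric parts of the row key (row-01, row-02, ..., row-10) keeps the scan order intuitive.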

2. The physical storage structure is shown in the figure below

3. The architecture diagram is as follows

4. Writing process

MemStore flushing:

5. Reading process

6. Merging store files (compaction)

7. Partitioning

Installation

Start Hadoop, Kafka and ZooKeeper

Extract the HBase tarball

[root@localhost szc]# tar -zxvf hbase-1.3.1-bin.tar.gz

Configure HBase

Enter the conf directory under the HBase directory and modify three files: hbase-site.xml, hbase-env.sh, and regionservers. All IP addresses below refer to the local IP of the CentOS host.

[root@localhost szc]# cd hbase-1.3.1/conf/

In hbase-env.sh, set JAVA_HOME and disable the ZooKeeper that ships with HBase:

export JAVA_HOME=/home/szc/jdk8_64

export HBASE_MANAGES_ZK=false

In hbase-site.xml, configure the HDFS root directory, distributed mode, web UI port, ZooKeeper quorum and other settings:

<configuration>

    <property>

        <name>hbase.rootdir</name>

        <value>hdfs://192.168.57.141:8020/Hbase</value>

    </property>

    <property>

        <name>hbase.cluster.distributed</name>

        <value>true</value>

    </property>

    <property>

        <name>hbase.zookeeper.quorum</name>

        <value>192.168.57.141</value>

    </property>

    <property>

        <name>hbase.zookeeper.property.dataDir</name>

        <value>/home/szc/zookeeper/data</value>

    </property>

    <property>

        <name>hbase.master.info.port</name>

        <value>16010</value>

    </property>

</configuration>

regionservers lists the region server hosts:

192.168.57.141

Start HBase

[root@localhost conf]# cd ..

[root@localhost hbase-1.3.1]# ./bin/hbase-daemon.sh start master

[root@localhost hbase-1.3.1]# ./bin/hbase-daemon.sh start regionserver

After opening port 16010 in the firewall, you can view the HBase web UI from a browser on Windows.

The arrow in the figure points to the region server, which hosts and serves the table's regions.

If you want to start an HBase cluster, you can run start-hbase.sh

[root@localhost hbase-1.3.1]# ./bin/start-hbase.sh

Stop HBase

[root@localhost hbase-1.3.1]# ./bin/stop-hbase.sh

Enter the HBase shell

[root@localhost hbase-1.3.1]# ./bin/hbase shell



2020-05-09 09:19:41,328 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in [jar:file:/home/szc/hbase-1.3.1/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in [jar:file:/home/szc/cdh/hadoop-2.5.0-cdh5.3.6/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 1.3.1, r930b9a55528fe45d8edce7af42fef2d35e77677a, Thu Apr  6 19:36:54 PDT 2017





hbase(main):001:0>

Commands

Create a table and specify a column family

hbase(main):019:0> create 'testtable', 'colfaml';
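For reference, the same table can also be created through the Java client API. The following is a minimal sketch, not part of the original article; the ZooKeeper address is assumed to be the quorum configured in hbase-site.xml above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "192.168.57.141"); // same quorum as hbase-site.xml above
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("testtable"));
            table.addFamily(new HColumnDescriptor("colfaml")); // one column family, as in the shell command
            if (!admin.tableExists(table.getTableName())) {
                admin.createTable(table);
            }
        }
    }
}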

Insert data, specifying the row key, column family, qualifier and value

hbase(main):007:0> put 'testtable', 'myrow-1', 'colfaml:ql', 'value-1'

hbase(main):008:0> put 'testtable', 'myrow-2', 'colfaml:qk', 'value-2'

hbase(main):009:0> put 'testtable', 'myrow-2', 'colfaml:qj', 'value-3'
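The equivalent insert through the Java client API might look like the sketch below; conn stands for an already opened Connection (a hypothetical helper, not from the original article):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDemo {
    // mirrors: put 'testtable', 'myrow-1', 'colfaml:ql', 'value-1'
    static void insert(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("testtable"))) {
            Put put = new Put(Bytes.toBytes("myrow-1"));      // row key
            put.addColumn(Bytes.toBytes("colfaml"),           // column family
                    Bytes.toBytes("ql"),                      // qualifier
                    Bytes.toBytes("value-1"));                // value
            table.put(put);
        }
    }
}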

Scan a table

hbase(main):010:0> scan 'testtable'



ROW                             COLUMN+CELL                                                                             

myrow-1                        column=colfaml:ql, timestamp=1580287260033, value=value-1                               

myrow-2                        column=colfaml:qj, timestamp=1580287323632, value=value-3                               

myrow-2                        column=colfaml:qk, timestamp=1580287294044, value=value-2                               

2 row(s) in 0.0220 seconds

Get a single row of data

hbase(main):011:0> get 'testtable', 'myrow-1'



COLUMN                          CELL                                                                                    

colfaml:ql                     timestamp=1580287260033, value=value-1                                                  

1 row(s) in 0.0220 seconds
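A minimal Java sketch of the same get and scan (again assuming an open Connection named conn):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadDemo {
    static void read(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("testtable"))) {
            // single row, like: get 'testtable', 'myrow-1'
            Result row = table.get(new Get(Bytes.toBytes("myrow-1")));
            for (Cell cell : row.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneQualifier(cell))
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
            // full table scan, like: scan 'testtable'
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}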

Delete data (cell)

hbase(main):012:0> delete 'testtable', 'myrow-2', 'colfaml:qj'
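A roughly equivalent delete from Java, as a sketch (same conn assumption as above):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteDemo {
    // roughly mirrors: delete 'testtable', 'myrow-2', 'colfaml:qj'
    static void deleteCell(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("testtable"))) {
            Delete delete = new Delete(Bytes.toBytes("myrow-2"));
            // addColumns() removes all versions of this column; addColumn() would remove only the newest
            delete.addColumns(Bytes.toBytes("colfaml"), Bytes.toBytes("qj"));
            table.delete(delete);
        }
    }
}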

Disable and drop a table

hbase(main):013:0> disable 'testtable'

hbase(main):014:0> drop 'testtable'

Enable multi-version storage on a table

hbase(main):015:0> alter 'test', { NAME => 'cf1', VERSIONS => 3 }

Then view the test table's description:

hbase(main):016:0> describe 'test'



Table test is ENABLED                                                                                                   

test                                                                                                                    

COLUMN FAMILIES DESCRIPTION                                                                                             

{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_E

NCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '655

36', REPLICATION_SCOPE => '0'}                                                                                          

1 row(s) in 0.0110 seconds

You can see that up to 3 versions can now be stored. Next, insert two values for row1 in cf1:

hbase(main):017:0> put 'test', 'row1', 'cf1', 'val2'

hbase(main):018:0> put 'test', 'row1', 'cf1', 'val3'

Scanning with multiple versions requested now shows the historical values as well:

hbase(main):019:0> scan 'test', { VERSIONS => 3}

ROW                             COLUMN+CELL                                                                             

row1                           column=cf1:, timestamp=1580357787736, value=val3                                        

row1                           column=cf1:, timestamp=1580357380211, value=val2                                        

1 row(s) in 0.0120 seconds
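Reading those versions back from Java is a matter of raising the version limit on the Get; a minimal sketch, with the usual assumption of an open Connection named conn:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionsDemo {
    static void readVersions(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("test"))) {
            Get get = new Get(Bytes.toBytes("row1"));
            get.setMaxVersions(3); // request up to 3 versions, matching VERSIONS => 3 on cf1
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}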

Integration with Hive

First add the environment variable HBASE_HOME pointing to the HBase installation directory, then start HBase and the Hive metastore.

Create a table in Hive:

create table pokes(key string, value string) row format delimited fields terminated by ',';

Load data into it:

load data local inpath '/home/szc/data.txt' into table pokes;

Then create another table in Hive backed by HBaseStorageHandler, specifying the mapping between the HBase row key and columns and the Hive table's columns:

create table hbase_table(key string, value string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties('hbase.columns.mapping' = ':key, cf1:val')

After this completes, a table named hbase_table appears in both Hive and HBase. Then insert data from the Hive side:

insert overwrite table hbase_table select * from pokes;

Once this command completes, run the following query to verify the contents:

select * from hbase_table;

Integration with Pig

First copy $PIG_HOME/lib/spark/netty-all-*.Final.jar up one level into $PIG_HOME/lib, then start the Hadoop job history server:

mr-jobhistory-daemon.sh start historyserver

Then copy the data file to /user/<username> on HDFS:

hadoop fs -copyFromLocal /home/szc/pig/tutorial/data/excite-small.log /user/songzeceng

After creating the excite table in HBase, start Pig. The following commands are all run in the Pig command line (Grunt shell).

First load the data file, splitting columns on the tab separator:

raw = LOAD 'hdfs:///user/songzeceng/excite-small.log' USING PigStorage ('\t') AS (user, time, query);

Then, for each row, concatenate user, \u0000 and time; the result is the row key, and the remaining query field becomes the column value:

T = FOREACH raw GENERATE CONCAT(CONCAT(user, '\u0000'), time), query;

Then store T in the excite table of HBase and specify the column

store T into 'excite' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:query');

You can then read the data back: load the table on the Pig side, binding the row key and column value to the key and query fields, both as chararrays:

R = LOAD 'excite' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf1:query', '-loadKey') as (key: chararray, query: chararray);

View relation R:

dump R;

The sample output is as follows

(FE785BA19AAA3CBB 970916083558,dystrophie musculaire duch?ne)

(FE785BA19AAA3CBB 970916083732,)

(FE785BA19AAA3CBB 970916083839,)

(FE785BA19AAA3CBB 970916084121,dystrophie)

(FE785BA19AAA3CBB 970916084512,dystrophie musculaire)

(FE785BA19AAA3CBB 970916084553,)

(FE785BA19AAA3CBB 970916085100,dystrophie musculaire)

(FEA681A240A74D76 970916193646,kerala)

(FEA681A240A74D76 970916194158,kerala)

(FEA681A240A74D76 970916194554,kerala)

(FEA681A240A74D76 970916195314,kerala)

(FF5C9156B2D27FBD 970916114959,fredericksburg)

(FFA4F354D3948CFB 970916055045,big cocks)

(FFA4F354D3948CFB 970916055704,big cocks)

(FFA4F354D3948CFB 970916060431,big cocks)

(FFA4F354D3948CFB 970916060454,big cocks)

(FFA4F354D3948CFB 970916060901,big cocks)

(FFA4F354D3948CFB 970916061009,big cocks)

(FFCA848089F3BA8C 970916100905,marilyn manson)

You can also split the key (row key) on \u0000 to expand it into user and time:

S = foreach R generate FLATTEN (STRSPLIT(key, '\u0000', 2)) AS (user:chararray, time:long), query;

The sample output of dump S is as follows

(FE785BA19AAA3CBB,970916083435,dystrophie musculaire)

(FE785BA19AAA3CBB,970916083531,dystrophie musculaire duch?ne)

(FE785BA19AAA3CBB,970916083558,dystrophie musculaire duch?ne)

(FE785BA19AAA3CBB,970916083732,)

(FE785BA19AAA3CBB,970916083839,)

(FE785BA19AAA3CBB,970916084121,dystrophie)

(FE785BA19AAA3CBB,970916084512,dystrophie musculaire)

(FE785BA19AAA3CBB,970916084553,)

(FE785BA19AAA3CBB,970916085100,dystrophie musculaire)

(FEA681A240A74D76,970916193646,kerala)

(FEA681A240A74D76,970916194158,kerala)

(FEA681A240A74D76,970916194554,kerala)

(FEA681A240A74D76,970916195314,kerala)

(FF5C9156B2D27FBD,970916114959,fredericksburg)

(FFA4F354D3948CFB,970916055045,big cocks)

(FFA4F354D3948CFB,970916055704,big cocks)

(FFA4F354D3948CFB,970916060431,big cocks)

(FFA4F354D3948CFB,970916060454,big cocks)

(FFA4F354D3948CFB,970916060901,big cocks)

(FFA4F354D3948CFB,970916061009,big cocks)

(FFCA848089F3BA8C,970916100905,marilyn manson)

Row key design principles

Length principle

The maximum length is 64 KB; the recommended length is 10~100 bytes, preferably a multiple of 8. The shorter the better: long row keys hurt performance.

Uniqueness principle

The row key must be unique

Hashing principle

1) Salting: add a random prefix in front of the timestamp and use the result as the row key.

2) String reversal: convert the timestamp to a string and reverse it to use as the row key. This is commonly used for timestamp and phone-number row keys.

3) Computed partition number: derive a partition prefix with custom logic.
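A minimal sketch of the three approaches in plain Java; the helper names and the concrete salting/partitioning formulas are illustrative assumptions, not the article's original code:

public class RowKeyDesignDemo {

    // 1) Salting: a bounded prefix spreads monotonically increasing timestamps across regions.
    // Here the salt is derived from the timestamp itself so the key can be recomputed when reading.
    static String saltedKey(long timestamp, int buckets) {
        int salt = (int) (timestamp % buckets);
        return salt + "_" + timestamp;
    }

    // 2) String reversal: reversing a phone number or timestamp puts the fast-changing digits first.
    static String reversedKey(String phoneOrTimestamp) {
        return new StringBuilder(phoneOrTimestamp).reverse().toString();
    }

    // 3) Computed partition number: custom logic maps (tel, date) to a partition prefix,
    // similar in spirit to the genRegionNum() call used by the coprocessor example below.
    static String partitionedKey(String tel, String date, int regionCount) {
        int partition = Math.floorMod(tel.hashCode() ^ date.hashCode(), regionCount);
        return partition + "_" + tel + "_" + date;
    }
}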

Using coprocessors

Coprocessors in HBase play a role similar to triggers in an RDBMS. They are used as follows.

1. Define a class that extends BaseRegionObserver and override its preXX() or postXX() methods; as the names suggest, they let you run custom logic before or after the corresponding operation.

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Names, BaseDao and genRegionNum() are helper classes/methods from the article's own project
public class InsertCoprocessor extends BaseRegionObserver {

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {

        Table table = e.getEnvironment().getTable(TableName.valueOf(Names.TABLE.getValue()));

        String rowKey = new String(put.getRow());
        String[] values = rowKey.split("_");
        // 5_17885275338_20200103174616_19565082510_0713_1

        CoprocessorDao dao = new CoprocessorDao();

        String from = values[1];
        String to = values[3];
        String callTime = values[2];
        String duration = values[4];

        if (values[5].equals("0")) {
            String flag = "1";
            int regionNum = dao.getRegionNumber(to, callTime);

            rowKey = regionNum + "_" + from + "_" + callTime
                    + "_" + to + "_" + duration + "_" + flag;


            byte[] family_in = Names.CF_CALLIN.getValue().getBytes();

            Put put_in = new Put(rowKey.getBytes());

            put_in.addColumn(family_in, "from".getBytes(), from.getBytes());
            put_in.addColumn(family_in, "to".getBytes(), to.getBytes());
            put_in.addColumn(family_in, "callTime".getBytes(), callTime.getBytes());
            put_in.addColumn(family_in, "duration".getBytes(), duration.getBytes());
            put_in.addColumn(family_in, "flag".getBytes(), flag.getBytes());

            table.put(put_in);

            table.close();
        }
    }


    private class CoprocessorDao extends BaseDao {

        public int getRegionNumber(String tel, String date) {
            return genRegionNum(tel, date);
        }
    }

}

The code above issues another Put: the sixth field of the row key decides whether an extra record (into the call-in column family) is written. Remember to close the table after inserting.

2. Then, when creating the table, register the coprocessor class:

public HBaseDao() throws Exception {
    ...
    createTable(Names.TABLE.getValue(), ValueConstant.REGION_COUNT,
            "com.szc.telcom.consumer.coprocessor.InsertCoprocessor", new String[] {
            Names.CF_CALLER.getValue(), Names.CF_CALLIN.getValue()

    });
    ...
}

protected void createTable(String name, Integer regionCount, String coprocessorClass, String[] colFamilies) throws Exception {
    ...
    createNewTable(name, coprocessorClass, regionCount, colFamilies);
}

private void createNewTable(String name, String coprocessorClass, Integer regionCount, String[] colFamilies) throws Exception {
    ...
    if (!StringUtils.isEmpty(coprocessorClass)) {
        tableDescriptor.addCoprocessor(coprocessorClass);
    }
    ....
    admin.createTable(tableDescriptor, splitKeys);
}

3. Package the module together with its dependencies; the relevant Maven plugin configuration in the pom is as follows:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>

            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>

            <configuration>

                <descriptorRefs>
                    <descriptorRef>
                        jar-with-dependencies
                    </descriptorRef>
                </descriptorRefs>

                <archive>
                    <manifest>
                        <mainClass>com.szc.telcom.producer.Bootstrap</mainClass> <!-- custom main class -->
                    </manifest>
                </archive>
            </configuration>

            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

The coprocessor here is one module of the project, so the whole project can be packaged together.

4. Put the coprocessor jar (packaged with its dependencies) into the lib directory of the HBase cluster.

5. Restart the HBase cluster and re-run the project jar package

Conclusion

That's all. If you have any questions, feel free to discuss them in the comments section.
