CDH Solr: indexing HBase data

Overview

Detailed configuration reference: https://docs.cloudera.com/documentation/enterprise/5-12-x/topics/search_use_hbase_indexer_service.html#

  • Real-time data sync path: HBase -> Solr (the indexer consumes HBase replication events and pushes documents to Solr)
  • Real-time query path: Solr -> HBase (Solr returns the matching row keys; the full rows are then read back from HBase; see the sketch after this list)
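
A minimal sketch of that two-step lookup from the command line, assuming Solr's default HTTP port 8983 on test-c6, the collection and field names used later in this post, and the default behavior in which the indexer stores the HBase row key in Solr's id field:

# 1) Ask Solr for matching documents; the returned id is the HBase row key
curl "http://test-c6:8983/solr/hbase_2solr/select?q=HBase_Indexer_Test_cf1_name:bb&fl=id&wt=json"
# 2) Read the full row back from HBase with that row key
echo "get 'HBase_Indexer_Test', '015'" | hbase shell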

hbase-indexer: indexing HBase data into Solr

Official documentation: https://github.com/NGDATA/hbase-indexer/wiki/Tutorial

1. Server side: add the Key-Value Store Indexer service in CDH (configured with 1 GB of heap)


[root@hadoop1 jms]# ps -ef | grep hbase-index
hbase    105985  22652  0 Feb12 ?        00:18:43 /usr/java/jdk1.8.0_211-amd64/bin/java -Dproc_server -XX:OnOutOfMemoryError=kill -9 %p -Djava.net.preferIPv4Stack=true \
	-Xms1048576 -Xmx1048576 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled -XX:+HeapDumpOnOutOfMemoryError \
	-XX:HeapDumpPath=/tmp/ks_indexer_ks_indexer-HBASE_INDEXER-144a0ada48aafa3485e33bb69df57f5c_pid105985.hprof -XX:OnOutOfMemoryError=/opt/cm-5.12.2/lib64/cmf/service/common/killparent.sh \
	-Dhbaseindexer.log.dir=/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hbase-solr/bin/../logs -Dhbaseindexer.log.file=hbase-indexer.log \
	-Dhbaseindexer.home.dir=/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hbase-solr/bin/.. -Dhbaseindexer.id.str= -Dhbaseindexer.root.logger=INFO,console \
	-Djava.library.path=/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop/lib/native \
	com.ngdata.hbaseindexer.Main


[zk: localhost:2181(CONNECTED) 3] ls /ngdata/hbaseindexer
[masters, indexer, indexerprocess, indexer-trash]

2. Client side: the hbase-indexer command-line tool

[root@hadoop1 ~]$ which hbase-indexer
/usr/bin/hbase-indexer
[root@hadoop1 ~]# readlink -f /usr/bin/hbase-indexer
/opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/hbase-indexer
[root@hadoop1 ~]$ cat /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/hbase-indexer
#!/bin/bash
  # Reference: http://stackoverflow.com/questions/59895/can-a-bash-script-tell-what-directory-its-stored-in
  SOURCE="${BASH_SOURCE[0]}"
  BIN_DIR="$( dirname "$SOURCE" )"
  while [ -h "$SOURCE" ]
  do
    SOURCE="$(readlink "$SOURCE")"
    [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
    BIN_DIR="$( cd -P "$( dirname "$SOURCE"  )" && pwd )"
  done
  BIN_DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  LIB_DIR=$BIN_DIR/../lib

# Autodetect JAVA_HOME if not defined
. $LIB_DIR/bigtop-utils/bigtop-detect-javahome
exec $LIB_DIR/hbase-solr/bin/hbase-indexer "$@"


[root@hadoop1 ~]$ cat  /opt/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hbase-solr/bin/hbase-indexer
...
# figure out which class to run
if [ "$COMMAND" = "server" ] ; then
  CLASS='com.ngdata.hbaseindexer.Main'
elif [ "$COMMAND" = "daemon" ] ; then
  CLASS='com.ngdata.hbaseindexer.Main daemon'
elif [ "$COMMAND" = "add-indexer" ] ; then
  CLASS='com.ngdata.hbaseindexer.cli.AddIndexerCli'
elif [ "$COMMAND" = "update-indexer" ] ; then
  CLASS='com.ngdata.hbaseindexer.cli.UpdateIndexerCli'
elif [ "$COMMAND" = "delete-indexer" ] ; then
  CLASS='com.ngdata.hbaseindexer.cli.DeleteIndexerCli'
elif [ "$COMMAND" = "list-indexers" ] ; then
  CLASS='com.ngdata.hbaseindexer.cli.ListIndexersCli'
...
# Exporting classpath since passing the classpath with -cp seems to choke daemon mode
export CLASSPATH
# Exec unless HBASE_INDEXER_NOEXEC is set.
if [ "${HBASE_INDEXER_NOEXEC}" != "" ]; then
  "$JAVA" -Dproc_$COMMAND {-XX:OnOutOfMemoryError="kill -9 %p" $JAVA_HEAP_MAX $HBASE_INDEXER_OPTS $CLASS "$@"
else
  exec "$JAVA" -Dproc_$COMMAND -XX:OnOutOfMemoryError="kill -9 %p" $JAVA_HEAP_MAX $HBASE_INDEXER_OPTS $CLASS "$@"
fi

2.1 Client configuration file

Configuration documentation: https://github.com/NGDATA/hbase-indexer/wiki/Indexer-configuration

  • Make sure replication is enabled on the HBase table: alter 'HBase_Indexer_Test' , { NAME => 'f', REPLICATION_SCOPE => '1' } (a quick check is sketched just below)
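A quick way to confirm the scope took effect, a sketch using the table and column family from this post:
echo "describe 'HBase_Indexer_Test'" | hbase shell | grep REPLICATION_SCOPE
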
# 1. Create the indexer configuration file
[root@test-c6 ~]# cat /var/lib/solr/hbase_2solr/indexerconf.xml
<indexer table="HBase_Indexer_Test" read-row="never">
  <field name="HBase_Indexer_Test_cf1_name" value="cf1:name" type="string"/>
  <field name="HBase_Indexer_Test_cf1_job" value="cf1:job" type="string"/>
</indexer>

# 1.2 Quickly generate the field mappings from an existing schema.xml
[root@test-c6 ~]# cat schema.xml
<field name="id" type="string" indexed="true" stored="true" />
<field name="name" type="string" indexed="true" stored="true" />
#
[root@test-c6 ~]# sed 's@name="\(.*\)" type=.*@ name="\1" \t value="f:\1" \t type="string" \/\>@' schema.xml
<field  name="id"        value="f:id"    type="string" />
<field  name="name"      value="f:name"          type="string" />

2.2 Create a real-time (incremental) index

2.2.1 SolrCloud

  • Make sure replication is enabled on the HBase table: alter 'HBase_Indexer_Test' , { NAME => 'f', REPLICATION_SCOPE => '1' }
  • Note: the real-time indexer only picks up new writes; historical data already in the HBase table has to be brought in with the batch indexing job (section 2.3)
# Add the indexer from the command line
# Short-form options
hbase-indexer add-indexer \
--name hbase_2solr \
-c /var/lib/solr/hbase_2solr/indexerconf.xml  \
-cp solr.zk=test-c6,hadoop01:2181/solr \
-cp solr.collection=hbase_2solr \
-z  test-c6:2181 

# Long-form options (equivalent)
#hbase-indexer add-indexer \
#--name hbase_2solr \
#--indexer-conf /var/lib/solr/hbase_2solr/indexerconf.xml   \
#--connection-param solr.zk=test-c6:2181/solr \
#--connection-param solr.collection=hbase_2solr \
#--zookeeper  test-c6:2181


# 2. Check the indexer status
[root@test-c6 ~]# hbase-indexer list-indexers --zookeeper test-c6:2181
Number of indexes: 1
hbase_2solr 
  + Lifecycle state: ACTIVE
  + Incremental indexing state: SUBSCRIBE_AND_CONSUME
  + Batch indexing state: INACTIVE
  + SEP subscription ID: Indexer_collection_hbase_2solr_indexer
  + SEP subscription timestamp: 2020-11-27T15:17:04.569+08:00
  + Connection type: solr
  + Connection params:
    + solr.zk = test-c6:2181/solr
    + solr.collection = hbase_2solr
  + Indexer config:
      795 bytes, use -dump to see content
  + Indexer component factory: com.ngdata.hbaseindexer.conf.DefaultIndexerComponentFactory
  + Additional batch index CLI arguments:
      (none)
  + Default additional batch index CLI arguments:
      (none)
  + Processes
    + 1 running processes
    + 0 failed processes
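
If the configuration file changes later, the registered indexer can be updated in place or removed with the same CLI; a sketch reusing the names from above:

# Push an updated configuration to the existing indexer
hbase-indexer update-indexer \
--name hbase_2solr \
-c /var/lib/solr/hbase_2solr/indexerconf.xml \
-z test-c6:2181

# Remove the indexer (stops its incremental indexing)
hbase-indexer delete-indexer --name hbase_2solr -z test-c6:2181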

2.2.2 Solr standalone (classic mode)

The supported connection-parameter names can be found by reading how the CLI source code parses them.
Add an indexer against a standalone (HTTP) Solr. Again, make sure replication is enabled on the HBase table: alter 'HBase_Indexer_Test' , { NAME => 'f', REPLICATION_SCOPE => '1' }

hbase-indexer add-indexer \
--name test2 \
-c /var/lib/solr/index.conf  \
-cp  solr.mode=classic \
-cp  solr.shard.1=http://192.168.56.1:8089/solr/test 
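
In classic mode the shard URL must point at an existing core; whether it is up can be checked with the standard Solr CoreAdmin API, a sketch reusing the host, port, and core name from the command above:

curl "http://192.168.56.1:8089/solr/admin/cores?action=STATUS&core=test&wt=json"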

2.3 Create a batch index: hadoop jar hbase-indexer-mr-job.jar

  • Make sure replication is enabled on the HBase table: alter 'HBase_Indexer_Test' , { NAME => 'f', REPLICATION_SCOPE => '1' }
# --reducers INTEGER    0 indicates that no reducers should be used, and documents
#                       should be sent directly from the mapper tasks to live Solr servers

# (Re)index a table with direct writes to SolrCloud
# Driven by an hbase-indexer configuration file
hadoop jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
  --hbase-indexer-file /var/lib/solr/hbase_2solr/indexerconf.xml \
  --zk-host localhost:2181/solr \
  --collection hbase_2solr \
  --reducers 0 
  
# (Re)index a table based on an indexer config stored in ZK
# Driven by the name of an indexer already registered via add-indexer
hadoop jar  /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
  --hbase-indexer-zk localhost:2181 \
  --hbase-indexer-name hbase_2solr \
  --reducers 0 
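
With a non-zero reducer count the job builds the index shards in HDFS instead of writing to live servers; adding --go-live then merges the finished shards into the running collection. A sketch, assuming the default reducer count and the same paths as above:

hadoop jar /opt/cloudera/parcels/CDH/lib/hbase-solr/tools/hbase-indexer-mr-job.jar \
  --hbase-indexer-file /var/lib/solr/hbase_2solr/indexerconf.xml \
  --zk-host localhost:2181/solr \
  --collection hbase_2solr \
  --go-live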

2.4 Verify that the HBase data is indexed by Solr

  • Insert data from the hbase shell
hbase(main):023:0> put "HBase_Indexer_Test",'015','cf1:job', 'java3'
hbase(main):023:0> put "HBase_Indexer_Test",'015','cf1:name', 'bb'                                 
  • Check the Key-Value Store Indexer and Solr logs
# Key-Value Store Indexer log
[root@test-c6 ~]# tailf  /var/log/hbase-solr/lily-hbase-indexer-cmf-ks_indexer-HBASE_INDEXER-test-c6.log.out
2020-11-27 15:31:16,890 INFO org.kitesdk.morphline.api.MorphlineContext: Importing commands
2020-11-27 15:31:17,341 INFO org.kitesdk.morphline.api.MorphlineContext: Done importing commands
  • Query in Solr; the same check can also be run from the command line (see the sketch below)
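
A sketch, assuming Solr's default HTTP port 8983 on test-c6:

# numFound should be >= 1 and the id field should equal the HBase row key '015'
curl "http://test-c6:8983/solr/hbase_2solr/select?q=HBase_Indexer_Test_cf1_job:java3&wt=json&indent=true"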

Reposted from blog.csdn.net/eyeofeagle/article/details/110235329