Elasticsearch data to Hive with es-hadoop 6.3.0

Writing logs from the same module to different files

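Requirement: logging is already configured for one submodule of the project. After adding a new feature, I want that feature's logs to go to a separate log file.
Steps:

Declare one appender per output target (here stdout, es and other) and keep only the shared one on rootLogger, for example log4j.rootLogger=info, stdout
Configure each appender, then bind a named logger to it (with log4j.additivity.<name>=false so its events do not also reach the root appenders):
log4j.appender.es=
log4j.appender.other=
log4j.logger.es=info, es
log4j.logger.other=info, other
Then use the named loggers like this:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogDemo {
    // Logger names must match the log4j.logger.<name> entries ("es", "other"), not "stdout"
    private static final Logger ES = LoggerFactory.getLogger("es");
    private static final Logger OTHER = LoggerFactory.getLogger("other");

    public static void main(String[] args) {
        ES.warn("ES");
        OTHER.warn("OTHER");
    }
}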

Full configuration

log4j.rootLogger=info, stdout
# Named loggers: each one writes only to its own appender
log4j.logger.es=info, es
log4j.logger.other=info, other
# stdout (if this appender is missing: log4j:ERROR Could not find value for key log4j.appender.stdout)
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH\:mm\:ss} %p %c %-5l %m%n

# es
log4j.appender.es=org.apache.log4j.DailyRollingFileAppender
log4j.appender.es.File=/Users/wanghai/IdeaProjects/wanghai/ipbdLab_/logs/es.log
log4j.appender.es.Encoding=UTF-8
# The date suffix is added to the old log file only after the day has passed and a new log file is created
log4j.appender.es.DatePattern='.'yyyy-MM-dd
log4j.appender.es.layout=org.apache.log4j.PatternLayout
log4j.appender.es.layout.ConversionPattern=%d - %m%n
log4j.appender.es.Append=true
log4j.appender.es.Threshold=info
# Do not inherit the parent logger's appenders
log4j.additivity.es=false

# other
log4j.appender.other=org.apache.log4j.DailyRollingFileAppender
log4j.appender.other.File=/Users/wanghai/IdeaProjects/wanghai/ipbdLab_/logs/other_restapi.log
log4j.appender.other.Encoding=UTF-8
log4j.appender.other.DatePattern='.'yyyy-MM-dd
log4j.appender.other.layout=org.apache.log4j.PatternLayout
log4j.appender.other.layout.ConversionPattern=%d - %m%n
log4j.appender.other.Append=true
log4j.appender.other.Threshold=info
log4j.additivity.other=false

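Elasticsearch data to Hive

Approach 1

Call the API to fetch all the data from ES, convert the JSON to POJOs (beans) with Jackson or a similar library, then write the data into Hive. Too cumbersome.

Approach 2

Use ES-Hadoop

Installing ES-Hadoop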

1. Download the matching version and unzip it
wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.3.0.zip
2. After unzipping, the following jars are available
[image: list of jars]

Adding the jar to Hive

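1. Perhaps with jar transfer across the cluster in mind, ES provides separate jars for each integration scenario
hdfs dfs -put /home/elasticsearch-hadoop-6.3.0/dist/elasticsearch-hadoop-6.3.0.jar /jarsput
2. Add the jar to Hive's execution environment
Note: at least this time, the later steps only ran correctly after I added the local jar to Hive.
Struck out, because the HDFS jar did not work for me: add jar hdfs:/jarsput/elasticsearch-hadoop-hive-6.3.0.jar
You can run add jar /home/elasticsearch-hadoop-6.3.0/dist/elasticsearch-hadoop-hive-6.3.0.jar (effective only in the current shell session), or

The following three ways add a jar to Hive's execution environment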

Note ⚠️
After running hive -hiveconf hive.aux.jars.path=hdfs:/jarsput/elasticsearch-hadoop-hive-6.3.0.jar
the original value
[image]
is overwritten with:

To be safe, append the previous classpath instead of replacing it.

Minimal settings for loading ES data into Hive

Note ⚠️: the uppercase field names below are a pitfall

CREATE EXTERNAL TABLE if not exists database.table (
AB string,
AD string,
ADDR string,
AGC string,
......
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'index/type',
              'es.index.auto.create' = 'false') ;

Running this statement fails with:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

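Restart the CLI with logging printed to the console, then rerun the table creation and data load
hive -hiveconf hive.root.logger=DEBUG,console
This reveals
Cannot find class 'org.elasticsearch.hadoop.hive.EsStorageHandler'
So the step of adding the jar from HDFS was done incorrectly.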

After adding the local jar, one more error appeared:
connection refused on port 9200
[image]

Final solution
Add three settings, with the node set to a data node:

'es.nodes'='ip11', 
'es.port'='9200',
'es.nodes.wan.only' = 'true'
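
Both the "Cannot detect ES version" error and the refused connection come down to whether the Hive host can reach the ES node. A minimal sketch of a TCP reachability check that could be run from the Hive machine before creating the table; the class and method names are hypothetical, and ip11 and 9200 are simply the values from the settings above. It only tests that the port accepts connections, not that ES itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class EsPortCheck {
    // Returns true if a TCP connection to host:port can be opened within the timeout.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, timeouts and unresolvable host names
            return false;
        }
    }

    public static void main(String[] args) {
        // "ip11" is the placeholder node name used in the table properties above
        String host = args.length > 0 ? args[0] : "ip11";
        System.out.println(host + ":9200 reachable = " + canConnect(host, 9200, 3000));
    }
}
```

If this prints false for the data node, the table properties are unlikely to help until the network path or the ES bind address is fixed.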

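Articles consulted while solving this:
discuss.elastic
过往记忆: Reading data from Elasticsearch with Hive
Analysis of ES client-node traffic saturation after connecting ES-Spark to ES
If the connection succeeds, the familiar output appears:
[image]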

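Query: select * from mapper.intell_property limit 10;
The result shows:
1. 45 columns in total, as expected;
2. only the last 3 columns have values, and those values are as expected
[image]
Why does this happen?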

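In my case most field names in ES are uppercase, but Hive converts the uppercase column names to lowercase when the mapping table is created, so the fields are looked up in ES by their lowercase names and nothing matches.

So,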

Loading ES data into Hive, advanced: mapping settings

'es.mapping.names' = 'ab:AB , ad:AD , addr:ADDR , agc:AGC , agt:AGT , an:AN , au:AU , cdd:CDD , cdn:CDN , clm:CLM , clmd:CLMD , cnlx:CNLX , ctd:CTD , ctn:CTN , dc:DC , ecla:ECLA , ft:FT , ian:IAN , ipc:IPC , ipn:IPN , jpc:JPC , ls:LS , lsd:LSD , lse:LSE , lso:LSO , lsr:LSR , pa:PA , pc:PC , pctf:PCTF, 
              pd:PD , pf:PF , pfd:PFD , pn:PN , pr:PR , qwft:QWFT , ti:TI , upc:UPC , uspc:USPC , wxzt:WXZT , xgd:XGD , yxzt:YXZT , zyft:ZYFT');
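
With 45 fields, typing that mapping by hand is error-prone. A minimal sketch of generating the value from the list of uppercase ES field names, pairing each lowercased Hive column with its original ES field; the class and method names are hypothetical, and only the first four fields are listed here for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class EsMappingNames {
    // Builds "<hive column>:<es field>" pairs, e.g. "ab:AB", joined with commas.
    static String build(List<String> esFields) {
        return esFields.stream()
                .map(f -> f.toLowerCase(Locale.ROOT) + ":" + f)
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<String> esFields = Arrays.asList("AB", "AD", "ADDR", "AGC");
        // prints: 'es.mapping.names' = 'ab:AB, ad:AD, addr:ADDR, agc:AGC'
        System.out.println("'es.mapping.names' = '" + build(esFields) + "'");
    }
}
```

The printed line can be pasted into TBLPROPERTIES next to 'es.resource'.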

Drop the external table (this also requires add jar)
Rerun; error:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.elasticsearch.hadoop.mr.WritableArrayWritable cannot be cast to org.apache.hadoop.io.Text

Not a good sign... a WritableArrayWritable cannot be cast to Hadoop's Text
In ES, the content of many fields really is stored as an array, for example

"AU": [
"ADAM,GISELA"
],
"CTN": [
"1377923",
"2516728",
"2552155",
"4416396",
"4781314",
"4972972",
"5244021"
]

Solution:
Use Hive's complex data types, such as
Array, Map, Struct

Loading ES data into Hive, advanced: arrays

[image]
Changing the column type to array<string> is enough
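
As a sketch of how the column list could be generated instead of edited by hand: the helper below emits one Hive column definition per ES field and uses array<string> for the fields known to hold arrays. The class and method names are hypothetical. AU and CTN are arrays according to the sample document above; treating AB as a plain string is an assumption for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class HiveColumns {
    // One "name type" definition per ES field; array-valued fields become array<string>.
    static String columns(List<String> esFields, Set<String> arrayFields) {
        return esFields.stream()
                .map(f -> f.toLowerCase(Locale.ROOT) + " "
                        + (arrayFields.contains(f) ? "array<string>" : "string"))
                .collect(Collectors.joining(",\n"));
    }

    public static void main(String[] args) {
        List<String> fields = Arrays.asList("AB", "AU", "CTN");
        Set<String> arrays = new HashSet<>(Arrays.asList("AU", "CTN"));
        // Prints the body of the column list for CREATE EXTERNAL TABLE
        System.out.println(columns(fields, arrays));
    }
}
```

The output goes between the parentheses of the CREATE EXTERNAL TABLE statement, while 'es.mapping.names' keeps linking the lowercase columns to the uppercase ES fields.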

When running the query below, the map job stayed stuck at 80% progress:
[image]

[image]

The same lookup run directly in ES takes less than 0.1 s

GET /intell_property/text_info/_search
{
  "query": {
    "match": {
      "docid": "6857984"
    }
  }
}

Solving this problem is recorded in a separate blog post

Bulk-importing JSON files without document IDs (Java/Python)
Java API: connecting to the cluster, simple search, stop words, synonyms
