0629 Liuzhou Data Extraction Log

1. Changed the Kafka topic to liuzhouPLC.

2. The write failed; found that the DataNode on bg01 was not running. Restarted the HDFS cluster.
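A restart sketch using Hadoop's bundled scripts, assuming they are run from $HADOOP_HOME/sbin as the prompt below suggests:

[bg@BG01 sbin]$ ./stop-dfs.sh
[bg@BG01 sbin]$ ./start-dfs.sh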

[bg@BG01 sbin]$ jps
178592 RunJar
178311 RunJar
50727 NodeManager
112297 Worker
50606 -- process information unavailable
99153 Master
38196 NodeManager
50103 NameNode
147994 QuorumPeerMain
50235 DataNode
50427 SecondaryNameNode
50780 Jps
38079 ResourceManager

3. Kafka on 02 was already running, but Kafka on 01 and 03 was not, so I started them one by one:
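The per-broker start command, a sketch assuming a standard Kafka layout on 01 and 03 (the working directory is an assumption):

bin/kafka-server-start.sh -daemon config/server.properties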

Kafka on 01 failed to start; Kafka on 03 started fine. Also noticed the DataNode process on 01 had died again.

Found that the DataNode clusterID on machine 01 differs from the NameNode clusterID:

NameNode:

[bg@BG01 current]$ cat VERSION 
#Fri Jun 29 10:39:49 CST 2018
namespaceID=1944008406
clusterID=CID-c9f7f4cf-612d-49cb-8cce-2c9a51da2dd9
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1311925081-192.168.7.151-1528165529710
layoutVersion=-63
[bg@BG01 current]$ pwd
/home/bg/data/nn/current

[bg@BG01 current]$ 

DataNode:

[bg@BG01 current]$ cat VERSION 
#Fri Apr 20 17:56:47 CST 2018
storageID=DS-6380f718-1086-4406-a640-175891ea8d00
clusterID=CID-61c66b7c-3fcc-4291-a69b-4a17834d517e
cTime=0
datanodeUuid=e973f775-90dc-4f97-b25a-06d34818965e
storageType=DATA_NODE
layoutVersion=-56
[bg@BG01 current]$ pwd
/home/bg/data/dn/current
[bg@BG01 current]$ 

Changed the DataNode's clusterID to match the NameNode's, then restarted.
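A one-liner sketch of that edit (the clusterID value and paths are the ones printed above; editing VERSION by hand works just as well):

# overwrite the DataNode's clusterID with the NameNode's value
sed -i 's/^clusterID=.*/clusterID=CID-c9f7f4cf-612d-49cb-8cce-2c9a51da2dd9/' /home/bg/data/dn/current/VERSION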

Kafka on 01 still failed to start:

[2018-06-29 11:14:22,760] ERROR There was an error in one of the threads during logs loading: java.lang.NumberFormatException: For input string: "derby" (kafka.log.LogManager)
[2018-06-29 11:14:22,762] FATAL [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)

java.lang.NumberFormatException: For input string: "derby"

The error occurs during log loading, i.e. while Kafka's LogManager scans its log directory; the NumberFormatException on the string "derby" suggests a stray Derby artifact (e.g. a derby.log or metastore_db left behind by Hive's embedded metastore) had ended up inside the Kafka log directory and was being parsed as a topic-partition folder. Rather than hunt it down, I simply created a new directory, kafka-logs1, pointed log.dir in server.properties at it, and started Kafka again: success.
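A sketch of the server.properties change (the /home/bg prefix is an assumption; point it wherever the new directory was created):

# server.properties: use a fresh, empty log directory
log.dir=/home/bg/kafka-logs1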

4. Ran Flume again to write to HDFS; the write succeeded.
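The launch command is not shown in the post; a sketch, assuming the config listed below is saved as conf/kafka-hdfs.conf (the file name is an assumption) and the agent is named agent1 as in that config:

bin/flume-ng agent --conf conf --conf-file conf/kafka-hdfs.conf --name agent1 -Dflume.root.logger=INFO,console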

5. Fine-tuning:

Inspecting the HDFS files showed a new file being created every minute, so I changed the configuration:

#agent1.sinks.k1.hdfs.round = false
#agent1.sinks.k1.hdfs.roundValue = 6
#agent1.sinks.k1.hdfs.roundUnit = hour
#agent1.sinks.k1.hdfs.rollInterval = 21600
agent1.sinks.k1.hdfs.useLocalTimeStamp = true
agent1.sinks.k1.hdfs.rollSize = 1024000000

agent1.sinks.k1.hdfs.rollCount = 0

The intent was to roll a new file only after roughly 1 GB (rollSize = 1024000000 bytes), but this had no effect: a new file still appeared every minute.

The cause lies in HDFS replication. The sink's hdfs.minBlockReplicas defaults to Hadoop's dfs.replication property, which in turn defaults to 3. Whenever the HDFS sink sees fewer replicas than that in the write pipeline, it closes the current file and rolls a new one, overriding the size- and time-based roll settings above.

The fix is to add this property to the Flume conf file:

agent1.sinks.k1.hdfs.minBlockReplicas = 1


The Flume configuration so far:

agent1.sources = r1
agent1.sinks = k1
agent1.channels = c1

# Describe/configure the source
# kafka source

agent1.sources.r1.kafka.bootstrap.servers = 192.168.7.151:9092,192.168.7.152:9092,192.168.7.153:9092
agent1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.r1.channels = c1

agent1.sources.r1.kafka.topics = liuzhouPLC
#agent1.sources.r1.kafka.topics = aa,bb
agent1.sources.r1.kafka.consumer.group.id = flume_ng
agent1.sources.r1.kafka.consumer.auto.offset.reset = earliest
agent1.sources.r1.batchSize=10000

# Describe the sink
#agent1.sinks.k1.type = logger

# hdfs sink
agent1.sinks.k1.type = hdfs
agent1.sinks.k1.channel = c1
#event.sinks.k1.hdfs.path = /flume/data/%Y-%m-%d/%{topic}
agent1.sinks.k1.hdfs.path = hdfs://bg01:9000/flume/data/%Y/%m/%d/liuzhouPLC
agent1.sinks.k1.hdfs.filePrefix = %Y%m%d%H
agent1.sinks.k1.hdfs.fileSuffix = .log
agent1.sinks.k1.hdfs.minBlockReplicas = 1
agent1.sinks.k1.hdfs.fileType = DataStream
agent1.sinks.k1.hdfs.writeFormat = Text
agent1.sinks.k1.hdfs.round = true
agent1.sinks.k1.hdfs.roundValue = 6
agent1.sinks.k1.hdfs.roundUnit = hour
agent1.sinks.k1.hdfs.rollInterval = 21600
agent1.sinks.k1.hdfs.useLocalTimeStamp = true
agent1.sinks.k1.hdfs.rollSize = 1024000000
agent1.sinks.k1.hdfs.rollCount = 0
agent1.sinks.k1.hdfs.batchSize = 10000

# Use a channel which buffers events in memory
#agent1.channels.c1.type = memory
#agent1.channels.c1.capacity = 100000
#agent1.channels.c1.transactionCapacity = 10000

# use file channel
#agent1.channels.c1.type = file
#agent1.channels.c1.checkpointDir = /home/bg/data/checkpoint/event
#agent1.channels.c1.dataDirs = /home/bg/data/data/event

agent1.channels.c1.type=memory
agent1.channels.c1.capacity=10000000
agent1.channels.c1.transactionCapacity=2000
agent1.channels.c1.keep-alive=3

# Bind the source and sink to the channel
agent1.sources.r1.channels = c1
agent1.sinks.k1.channel = c1
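A quick check sketch to confirm events are landing (the date portion of the path is whatever %Y/%m/%d expanded to; 2018/06/29 matches this log's date):

hdfs dfs -ls -R /flume/data/2018/06/29/liuzhouPLC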


Open issues at this point:

1. The file size stops changing at 160722 and no longer grows.

2. select count(*) from t06291; fails:

hive> select count(*) from t06291;
Query ID = bg_20180629164731_bf95c3bb-a537-44bd-8fac-3ae511f82f43
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
java.io.IOException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory < 0, or requested memory > max configured, requestedMemory=1536, maxMemory=1024
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:272)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:228)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:236)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:385)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:330)

Fix:

By default the MapReduce job asks for 1536 MB (the MR ApplicationMaster's default request, yarn.app.mapreduce.am.resource.mb, is 1536 MB), while the cluster's configured maximum is only 1024 MB, exactly the requestedMemory=1536, maxMemory=1024 pair in the error. First attempt: shrink the per-task memory requests in mapred-site.xml:

<property>
    <name>mapreduce.map.memory.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx410m</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>512</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx410m</value>
</property>


Next attempt: in yarn-site.xml, change

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>1024</value>
</property>

from 1024 to 2048, then restart Hadoop and try again.

Same error as before.

The real cause: two values in yarn-site.xml are too small to satisfy the job's request. Note that the maxMemory=1024 in the stack trace is yarn.scheduler.maximum-allocation-mb, so raising yarn.nodemanager.resource.memory-mb alone cannot help. Increase both:

container memory: yarn.nodemanager.resource.memory-mb

maximum container memory: yarn.scheduler.maximum-allocation-mb
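A sketch of the combined yarn-site.xml change, reusing the 2048 MB figure tried above; any yarn.scheduler.maximum-allocation-mb of at least 1536 MB would satisfy this job's request:

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>2048</value>
</property>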

Related parameters (see the note after this list):

  1. YARN
    (1) yarn.scheduler.minimum-allocation-mb: minimum container memory, default 1024 MB
    (2) yarn.scheduler.maximum-allocation-mb: maximum container memory, default 8192 MB
    (3) yarn.nodemanager.vmem-pmem-ratio: ratio of virtual to physical memory, default 2.1, i.e. each 1 GB of physical memory may back 2.1 GB of virtual memory; production clusters usually raise it; the actual virtual-memory allocation is handled by the OS and not covered here
    (4) yarn.nodemanager.resource.memory-mb: physical memory a node can hand out to containers, default 8192 MB
    (5) yarn.scheduler.increment-allocation-mb: container memory increment, default 1024 MB

  2. MapReduce
    (1) mapreduce.{map|reduce}.java.opts
    (2) mapreduce.{map|reduce}.memory.mb
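One relationship worth noting between the two MapReduce settings above: the JVM heap in java.opts must fit inside the container set by memory.mb. The values used earlier follow the common 80% rule of thumb: 0.8 x 512 MB = 410 MB, hence -Xmx410m. The remaining ~100 MB is headroom for non-heap JVM memory (thread stacks, metaspace/permgen, native buffers).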

Checking memory usage (the screenshot from the original post is omitted here):

Only about 200 MB of memory is free...
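A minimal way to reproduce the check from a shell:

[bg@BG01 ~]$ free -m        # the Mem row, free/available columns, in MB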


For issue 1, try changing the channel type on Monday.

Reposted from blog.csdn.net/yblbbblwsle/article/details/80853401