Cloudera server and agent lose contact with each other

#This host is not in contact with the Cloudera Manager Server

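The monitor services on the server side are running normally, but the agent cannot connect to them.
#This host is in contact with the Cloudera Manager Server. This host is not in contact with the Host Monitor.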
[20/Feb/2020 16:51:51 +0000] 22086 MonitorDaemon-Reporter firehoses INFO Creating a connection to the ACTIVITYMONITOR.
[20/Feb/2020 16:51:51 +0000] 22086 MonitorDaemon-Reporter firehoses INFO Creating a connection to the SERVICEMONITOR.
[20/Feb/2020 16:51:51 +0000] 22086 MonitorDaemon-Reporter firehoses INFO Creating a connection to the HOSTMONITOR.
[20/Feb/2020 16:51:51 +0000] 22086 MonitorDaemon-Reporter throttling_logger ERROR Error sending messages to firehose: mgmt-HOSTMONITOR-d592ed6aea0516a09027c2cf834d8979
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/firehose.py", line 121, in _send
    self._port)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 469, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.7/httplib.py", line 833, in connect
    self.timeout, self.source_address)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused
Reference:

In the server log:
2020-02-20 17:25:06,371 WARN New I/O boss #388:com.cloudera.server.cmf.log.AgentResponseAsyncHandler: (2 skipped) Exception thrown while trying to get log search results from agent on host: creative
java.net.ConnectException: Connection timed out: creative/172.19.40.203:9000
...
2020-02-20 17:35:17,209 ERROR ParcelUpdateService:com.cloudera.parcel.components.ParcelDownloaderImpl: (10 skipped) Unable to retrieve remote parcel repository manifest
java.util.concurrent.ExecutionException: java.net.UnknownHostException: archive.cloudera.com: Name or service not known

cloudera agent monitor firehose error: [Errno 111] Connection refused
#Re-adding the host
2020-02-20 20:19:57,879 ERROR scm-web-4143:com.cloudera.cmf.model.DbCommand: Command null(DeployClusterClientConfig) has completed. finalstate:FINISHED, success:false, msg:Command Deploy Client Configuration is not currently available for execution.
2020-02-20 20:19:57,894 INFO scm-web-4143:com.cloudera.enterprise.JavaMelodyFacade: Exiting HTTP Operation: Method:POST, Path:/v7/clusters/LogServerClu/commands/deployClientConfig, Status:200
2020-02-20 20:19:57,978 WARN scm-web-4105:com.cloudera.cmf.command.flow.SeqFlowCmd: Invalid command state json
com.cloudera.enterprise.JsonUtil2$JsonRuntimeException: com.fasterxml.jackson.databind.exc.MismatchedInputException: No content to map due to end-of-input
 at [Source: (String)""; line: 1, column: 0]
	at com.cloudera.enterprise.JsonUtil2.valueFromString(JsonUtil2.java:193)
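The JDK is not the cause!
After a full day of trying, the last-resort fix:
Stop the agents on all four hosts (170, 171, 172, 221) and stop the server on 170; then start the server again, followed by the four agents.
#On all four hosts
systemctl stop cloudera-scm-agent
#On 170 only
systemctl stop cloudera-scm-server
#On 170 only
systemctl start cloudera-scm-server
#On all four hosts
systemctl start cloudera-scm-agent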
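The ordering matters more than the commands themselves: agents go down first, the server is restarted, and the agents come back last. A minimal sketch of that ordering follows; `restart_plan` is a hypothetical helper that only builds the list of commands per host and does not execute anything, and the host labels are the short names used in these notes.

```python
def restart_plan(server, agents):
    """Return (host, command) pairs in the restart order described above."""
    # 1. Stop every agent first.
    plan = [(host, "systemctl stop cloudera-scm-agent") for host in agents]
    # 2. Stop, then start, the server on its own host.
    plan.append((server, "systemctl stop cloudera-scm-server"))
    plan.append((server, "systemctl start cloudera-scm-server"))
    # 3. Start the agents again once the server is back.
    plan += [(host, "systemctl start cloudera-scm-agent") for host in agents]
    return plan


for host, command in restart_plan("170", ["170", "171", "172", "221"]):
    print("%s: %s" % (host, command))
```

Running each printed command on the named host (by hand or over SSH) reproduces the sequence above.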
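That still did not fix node 221 (which was mapped by its private IP). Next attempt: remove the hosts from the cluster in Cloudera, configure all four nodes with the public-IP mapping for 221, then add them to the cluster again.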
#scm-status.log
[20/Feb/2020 21:56:44 +0000] 5440 MainThread _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Started monitor thread 'Autoreloader'.
[20/Feb/2020 21:56:44 +0000] 5440 MainThread _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Started monitor thread '_TimeoutMonitor'.
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   ERROR    [20/Feb/2020:21:56:44] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cheroot/server.py", line 1326, in start
    raise socket.error(msg)
error: No socket could be created -- (('47.103.112.221', 9000): [Errno 99] Cannot assign requested address)

[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Bus STOPPING
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('creative', 9000)) already shut down
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Stopped thread '_TimeoutMonitor'.
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Stopped thread 'Autoreloader'.
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Bus STOPPED
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Bus EXITING
[20/Feb/2020 21:56:44 +0000] 5440 HTTPServer Thread-3 _cplogging   INFO     [20/Feb/2020:21:56:44] ENGINE Bus EXITED
#scm-agent.log
[20/Feb/2020 21:56:35 +0000] 5322 MainThread _cplogging   INFO     [20/Feb/2020:21:56:35] ENGINE Serving on http://127.0.0.1:9001
[20/Feb/2020 21:56:35 +0000] 5322 MainThread _cplogging   INFO     [20/Feb/2020:21:56:35] ENGINE Bus STARTED
[20/Feb/2020 21:56:37 +0000] 5322 MainThread main         ERROR    Top-level exception: <Fault 40: 'ABNORMAL_TERMINATION: status_server'>
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/main.py", line 107, in main_impl
    ag.start(legacy_supervisor)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/agent.py", line 839, in start
    self.supervisor_client.start_process(STATUS_SERVER_PROC)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/__init__.py", line 531, in new_fn
    return fn(self, *args, **kwargs)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/supervisor.py", line 406, in start_process
    raise RetryableProcessException(fault)
RetryableProcessException: <Fault 40: 'ABNORMAL_TERMINATION: status_server'>
    
###Check how the hostname maps to an IP
[root@creative cloudera-scm-agent]# python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
creative 47.103.112.221
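The lookup above shows the hostname resolving to the public IP, and the agent's status server then fails with `[Errno 99] Cannot assign requested address` when it tries to listen on that address. A rough way to test whether a resolved address can be bound on the local machine at all is sketched below. `can_bind` and `check_own_hostname` are hypothetical helper names, and port 0 (any free port) is used so the check does not compete with the agent for port 9000.

```python
import socket


def can_bind(ip, port=0):
    """Return True if a TCP socket can be bound to ip on this machine."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))  # port 0 lets the kernel pick a free port
        return True
    except OSError:
        # An address that is not configured on a local interface typically
        # ends up here (Errno 99 on Linux).
        return False
    finally:
        sock.close()


def check_own_hostname():
    """Resolve this host's FQDN and report whether that IP is bindable."""
    fqdn = socket.getfqdn()
    ip = socket.gethostbyname(fqdn)
    return fqdn, ip, can_bind(ip)
```

Calling `check_own_hostname()` on the agent host returns the name, the resolved IP, and whether the agent could listen on it; a `False` in the last position points at the hosts-file mapping rather than at Cloudera itself.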
In the end, removed the agent and reinstalled it, using the public IP for the hosts-file mapping.
creative: IOException thrown while collecting data from host: Connection refused (Connection refused)
#agent.log
[20/Feb/2020 22:48:42 +0000] 11398 MonitorDaemon-Reporter throttling_logger ERROR (10 skipped) Error sending messages to firehose: mgmt-HOSTMONITOR-d592ed6aea0516a09027c2cf834d8979
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/monitor/firehose.py", line 121, in _send
    self._port)
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/avro/ipc.py", line 469, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.7/httplib.py", line 833, in connect
    self.timeout, self.source_address)
  File "/usr/lib64/python2.7/socket.py", line 571, in create_connection
    raise err
error: [Errno 111] Connection refused
#/var/log/cloudera-scm-firehose
#Activity Monitor log
2020-02-20 21:01:43,753 WARN com.cloudera.cmf.BasicScmProxy: Exception while getting current fragments hashes
java.net.ConnectException: Connection refused (Connection refused)
...
2020-02-20 21:02:40,203 INFO com.cloudera.cmon.firehose.Main: Starting Firehose. JVM Args: [-XX:+UseConcMarkSweepGC, -XX:+UseParNewGC, -Dmgmt.log.file=mgmt-cmf-mgmt-ACTIVITYMONITOR-hz-seeing-bg-01.log.out, -Djava.awt.headless=true, -Djava.net.preferIPv4Stack=true, -Dfirehose.schema.dir=/opt/cloudera/cm/schema, -Xms1073741824, -Xmx1073741824, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/tmp/mgmt_mgmt-ACTIVITYMONITOR-d592ed6aea0516a09027c2cf834d8979_pid43982.hprof, -XX:OnOutOfMemoryError=/opt/cloudera/cm-agent/service/common/killparent.sh], Args: [--pipeline-type, ACTIVITY_MONITORING_TREE, --mgmt-home, /opt/cloudera/cm], Version: 6.2.0 (#968826 built by jenkins on 20190314-1704 git: 16bbe6211555460a860cf22d811680b35755ea81)
...
#Host Monitor log
2020-02-20 21:02:45,838 WARN com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher: Failed to send messages to SMON.
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy23.writeStatusRecords(Unknown Source)
        at com.cloudera.cmon.firehose.BasicFirehoseClient.writeStatusRecords(BasicFirehoseClient.java:75)
        at com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher.processRecords(HMONToSMONHostSubjectRecordPublisher.java:107)
        at com.cloudera.cmon.tstore.leveldb.LDBSubjectRecordStore.write(LDBSubjectRecordStore.java:399)
        at com.cloudera.cmon.kaiser.HMONTestRunner.runHostTestsForSession(HMONTestRunner.java:86)
        at com.cloudera.cmon.kaiser.HMONTestRunner.runTestsForSession(HMONTestRunner.java:66)
        at com.cloudera.cmon.kaiser.BaseTestRunner.runTestsOnAllSubjects(BaseTestRunner.java:143)
        at com.cloudera.cmon.kaiser.KaiserService$KaiserServiceRunner.run(KaiserService.java:138)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused (Connection refused)
                                                                                            
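The SMON service uses port 9999 and the firehose uses port 9998.
Comparing the hosts shows that only the server machine listens on ports 9999 and 9998, and every agent must be able to reach both ports.
The Alibaba Cloud machine 221, however, cannot reach port 9999 on the IDC machine 170 (the server).
Only machines on the internal network can; the port is not reachable through the server's public IP, even though it is the same machine.
Fix: bind the ports related to 9999 to the wildcard address. In Cloudera Manager Server > Configuration > Activity Monitor, change the setting to bind to the wildcard address.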
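Whether an agent can actually reach those two ports is quick to verify from the agent host itself. The sketch below simply attempts a TCP connection; `port_open` is a hypothetical helper, and the server address has to be whatever name or IP the agent really uses to reach the server.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

Run from the 221 machine as `port_open(server_address, 9999)` and `port_open(server_address, 9998)`: a `False` result before the wildcard change and `True` afterwards would match the behaviour described above.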
cd /var/log/cloudera-scm-firehose
#Now only the Host Monitor reports errors; the Activity Monitor no longer does
2020-02-21 11:18:07,529 INFO com.cloudera.cmon.tstore.leveldb.LDBPartitionManager: Opening partition LDBPartitionMetadataWrapper{tableName=ts_subject, partitionName=ts_subject_2020-02-11T07:41:01.428Z, startTime=2020-02-11T07:41:01.428Z, endTime=null, version=9, state=CLOSED}
2020-02-21 11:18:07,546 WARN com.cloudera.cmon.firehose.HMONToSMONHostSubjectRecordPublisher: Failed to send messages to SMON.
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy23.writeStatusRecords(Unknown Source)
...
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused (Connection refused)
        at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:104)
        ... 9 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
Then apply the same change (bind to the wildcard address): just tick the checkbox.
MainThread main ERROR Top-level exception: <Fault 40: 'ABNORMAL_TERMINATION: status_server'>
#Check the cloudera-scm-eventserver log
2020-02-21 11:34:07,569 INFO org.apache.avro.ipc.NettyServer: [id: 0xe2bcd0eb, /192.168.20.170:51594 => /192.168.20.170:7184] OPEN
2020-02-21 11:34:07,570 INFO org.apache.avro.ipc.NettyServer: [id: 0xe2bcd0eb, /192.168.20.170:51594 => /192.168.20.170:7184] BOUND: /192.168.20.170:7184
2020-02-21 11:34:07,570 INFO org.apache.avro.ipc.NettyServer: [id: 0xe2bcd0eb, /192.168.20.170:51594 => /192.168.20.170:7184] CONNECTED: /192.168.20.170:51594
2020-02-21 11:34:07,576 ERROR com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher: Could not publish metrics to HMON:
java.lang.reflect.UndeclaredThrowableException
...
2020-02-21 11:34:07,590 ERROR com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher: Could not publish metrics to SMON:
java.lang.reflect.UndeclaredThrowableException
        at com.sun.proxy.$Proxy22.writeMetrics(Unknown Source)
        at com.cloudera.cmon.firehose.BasicFirehoseClient.writeMetrics(BasicFirehoseClient.java:87)
        at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.publishToSMON(EventMetricsPublisher.java:233)
        at com.cloudera.cmf.eventcatcher.server.EventMetricsPublisher.run(EventMetricsPublisher.java:110)
        at com.cloudera.enterprise.PeriodicEnterpriseService$UnexceptionablePeriodicRunnable.run(PeriodicEnterpriseService.java:67)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.avro.AvroRemoteException: java.net.ConnectException: Connection refused (Connection refused)
        at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:104)
        ... 6 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
#Finally enabled the wildcard binding for the Service Monitor too; the error above persisted, so check the agent's scm-status.log
[21/Feb/2020 11:57:55 +0000] 16366 MainThread _cplogging   INFO     [21/Feb/2020:11:57:55] ENGINE Started monitor thread '_TimeoutMonitor'.
[21/Feb/2020 11:57:55 +0000] 16366 HTTPServer Thread-3 _cplogging   ERROR    [21/Feb/2020:11:57:55] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cheroot/server.py", line 1326, in start
    raise socket.error(msg)
error: No socket could be created -- (('47.103.112.221', 9000): [Errno 99] Cannot assign requested address)
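Two different socket errors appear in these logs, and they point in different directions: Errno 111 is raised when connecting out to a port nobody is listening on or that is blocked, while Errno 99 is raised when trying to listen on an address the local machine does not own. A small sketch for pulling the numbers out of a log excerpt and naming them is shown below; `errnos_in` and `describe` are hypothetical helpers, and the symbolic names depend on the platform (the values quoted here are the Linux ones).

```python
import errno
import os
import re

ERRNO_RE = re.compile(r"\[Errno (\d+)\]")


def errnos_in(log_text):
    """Return every errno number mentioned in log_text, in order."""
    return [int(number) for number in ERRNO_RE.findall(log_text)]


def describe(code):
    """Return the platform's symbolic name and message for an errno value."""
    name = errno.errorcode.get(code, "unknown")
    return "%s (%s)" % (name, os.strerror(code))


excerpt = (
    "error: [Errno 111] Connection refused\n"
    "error: No socket could be created -- (('47.103.112.221', 9000): "
    "[Errno 99] Cannot assign requested address)\n"
)
for code in errnos_in(excerpt):
    print(code, describe(code))
```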
#supervisord
2020-02-21 11:42:12,122 INFO gave up: status_server entered FATAL state, too many start retries too quickly
2020-02-21 11:57:46,783 INFO spawned: 'status_server' with pid 16328
2020-02-21 11:57:47,355 INFO exited: status_server (exit status 70; not expected)
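The agent's port 9000 can only be bound on the private-network IP, but the hostname resolves to the public IP 47.103.112.221, hence the Errno 99 above. Is that the cause? => Switch the agent to the private-IP mapping.
The server's mapping is the private IP.
The server is mapped to the public IP.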

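Even so, this host's entropy check went red: the value was 27 against a warning threshold of 50. Later the cluster adjusted itself to a threshold of 100 and the host returned to normal.
Final result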

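#Addendum
On 159, starting cloudera-manager failed: the event-server failed during startup, and the three monitors that start after it then failed as well.
So check the event-server log: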
2020-02-21 23:27:04,647 INFO com.cloudera.enterprise.DebugServer: Running debug HTTP server on 0.0.0.0:8084
2020-02-21 23:27:04,766 ERROR com.cloudera.cmf.eventcatcher.server.EventCatcherService: Error starting EventServer
org.jboss.netty.channel.ChannelException: Failed to bind to: 0.0.0.0/0.0.0.0:7184
        at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:298)
        at org.apache.avro.ipc.CustomNettyServer.<init>(CustomNettyServer.java:76)
        at com.cloudera.cmf.eventcatcher.server.AvroEventStoreServer.<init>(AvroEventStoreServer.java:107)
        at com.cloudera.cmf.eventcatcher.server.EventCatcherService.main(EventCatcherService.java:179)
Caused by: java.net.BindException: Address already in use

netstat -nltpa
#Connections on the port are still waiting to close
ss -ano|grep 7184 #add -p to also see the owning process ID
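To see at a glance what is holding the port, the `ss` output can be summarised by connection state. The sketch below counts states for one port in text captured from `ss -ano`; `count_states` is a hypothetical helper, and the sample text is made up for illustration (the addresses in it are assumptions, not taken from the real host).

```python
# State names as printed by ss.
TCP_STATES = {
    "LISTEN", "ESTAB", "TIME-WAIT", "CLOSE-WAIT", "FIN-WAIT-1", "FIN-WAIT-2",
    "SYN-SENT", "SYN-RECV", "LAST-ACK", "CLOSING", "UNCONN",
}


def count_states(ss_output, port):
    """Count socket states for lines of ss output that involve the given port."""
    suffix = ":%d" % port
    counts = {}
    for line in ss_output.splitlines():
        fields = line.split()
        # Keep only lines whose local or peer address ends in :<port>.
        if not any(field.endswith(suffix) for field in fields):
            continue
        state = next((field for field in fields if field in TCP_STATES), None)
        if state is not None:
            counts[state] = counts.get(state, 0) + 1
    return counts


# Hypothetical sample, shaped like `ss -ano` output.
SAMPLE = """\
tcp LISTEN     0 128 0.0.0.0:7184        0.0.0.0:*
tcp CLOSE-WAIT 1 0   192.168.20.159:7184 192.168.20.159:51594
tcp TIME-WAIT  0 0   192.168.20.159:7184 192.168.20.159:51600 timer:(timewait,30sec,0)
tcp ESTAB      0 0   192.168.20.159:22   192.168.20.1:50000
"""
print(count_states(SAMPLE, 7184))
```

A remaining `LISTEN` entry means another process still owns the port, which would match the `Address already in use` error; only `TIME-WAIT` or `CLOSE-WAIT` entries suggest old connections that have not finished closing.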


Reposted from www.cnblogs.com/bchjazh/p/f55e1fe3630fea5ed3cb4c9fc505572e.html