A problem encountered while using a Mesos framework

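Our company has a project that implements a Mesos framework for task scheduling. In short, it receives tasks submitted by clients (image, arguments, execution time, and similar data), submits each one to Mesos as a task, lets Mesos run it, and returns the result.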

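Recently we ran into a problem: some time after the service starts, a large number of tasks fail. The task state returned in the Mesos update event is TASK_LOST, with the message: Task launched with invalid offers: Offer XXXX is no longer valid. I reviewed the code and found nothing wrong with it. After the framework receives offers, it loops over them and fetches a pending task for each one, and no offer is used twice. In theory, the Mesos master offers the resources reported by the slaves to the frameworks in turn, so under normal conditions an offer should not be taken by another framework.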

So I searched the master log by offer ID and found only entries saying the offer was no longer valid:

I1101 18:47:34.989588 26966 master.cpp:3641] Processing DECLINE call for offers: [ 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 ] for framework framework-name
I1101 18:47:34.989820 26966 http.cpp:312] HTTP POST for /master/api/v1/scheduler from 192.168.2.38:39948 with User-Agent='galaxy-schedule/3.1.3 mesos-rxjava-client/0.1.2 rxnetty/0.4.13'
W1101 18:47:34.989948 26966 master.cpp:3056] Ignoring accept of offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 since it is no longer valid
W1101 18:47:34.989971 26966 master.cpp:3067] ACCEPT call used invalid offers '[ 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 ]': Offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 is no longer valid
I1101 18:47:34.989994 26966 master.cpp:4806] Sending status update TASK_LOST for task task-name of framework framework-name 'Task launched with invalid offers: Offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 is no longer valid'

Presumably the log level is set fairly high, so entries for normally executed tasks were not printed.
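When there are many such entries, grouping the master log lines by offer ID makes it easier to see which calls touched each offer. The script below is only a sketch: the two regular expressions are modelled on the messages quoted above from Mesos 0.28.1 and assume a single offer ID per call, and the name `calls_per_offer` is made up for this example.

```python
import re
from collections import defaultdict

# Patterns modelled on the master log lines quoted above (Mesos 0.28.1).
# Assumption: each call lists a single offer ID inside "[ ... ]".
DECLINE_RE = re.compile(r"Processing DECLINE call for offers: \[ (\S+) \]")
INVALID_ACCEPT_RE = re.compile(
    r"Ignoring accept of offer (\S+) since it is no longer valid"
)


def calls_per_offer(lines):
    """Return {offer_id: [call kinds in log order]} for the given log lines."""
    seen = defaultdict(list)
    for line in lines:
        match = DECLINE_RE.search(line)
        if match:
            seen[match.group(1)].append("DECLINE")
            continue
        match = INVALID_ACCEPT_RE.search(line)
        if match:
            seen[match.group(1)].append("INVALID_ACCEPT")
    return dict(seen)


# Two lines copied from the master log excerpt above.
sample = [
    "I1101 18:47:34.989588 26966 master.cpp:3641] Processing DECLINE call for offers: "
    "[ 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 ] for framework framework-name",
    "W1101 18:47:34.989948 26966 master.cpp:3056] Ignoring accept of offer "
    "03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 since it is no longer valid",
]

for offer_id, calls in calls_per_offer(sample).items():
    print(offer_id, calls)
```

An offer ID that shows up under more than one call is worth a closer look.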

Next I searched the master log by task name, and the only error is still the same one:

Adding task task-name with resources cpus(*):0.25; mem(*):512; disk(*):64 on slave 03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0 (192.168.2.38)
I1101 18:47:58.990278 26962 master.cpp:3589] Launching task task-name of framework framework-name (galaxy-3.0) with resources cpus(*):0.25; mem(*):512; disk(*):64 on slave 03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0 at slave(1)@192.168.2.38:5051 (192.168.2.38)
I1101 18:48:00.295632 26968 master.cpp:4763] Status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name from slave 03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0 at slave(1)@192.168.2.38:5051 (192.168.2.38)
I1101 18:48:00.295724 26968 master.cpp:4811] Forwarding status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name
I1101 18:48:00.295905 26968 master.cpp:6421] Updating the state of task task-name of framework framework-name (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
Sending status update TASK_LOST for task task-name of framework framework-name 'Task launched with invalid offers: Offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 is no longer valid'

Searching the slave log by offer ID returned nothing.

Searching the slave log by task name did not turn up anything helpful either:

I1101 18:47:53.255692 25195 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 106f04ef-4a7f-0000-60f0-220100000000) for task task-name of framework framework-name
E1101 18:47:53.255821 25195 slave.cpp:2405] Failed to handle status update acknowledgement (UUID: 106f04ef-4a7f-0000-60f0-220100000000) for task task-name of framework framework-name: Cannot find the status update stream for task task-name of framework framework-name
I1101 18:47:58.985147 25203 slave.cpp:1361] Got assigned task task-name for framework framework-name
I1101 18:47:58.985749 25203 gc.cpp:83] Unscheduling '/data/mesos-slave/slaves/03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0/frameworks/framework-name' from gc
I1101 18:47:58.985919 25195 slave.cpp:1480] Launching task task-name for framework framework-name
I1101 18:47:58.986297 25195 paths.cpp:528] Trying to chown '/data/mesos-slave/slaves/03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0/frameworks/framework-name/executors/task-name/runs/b1d9da34-1172-4797-bceb-d1be90e13a81' to user 'root'
I1101 18:47:58.996266 25195 slave.cpp:5367] Launching executor task-name of framework framework-name with resources cpus(*):0.1; mem(*):32 in work directory '/data/mesos-slave/slaves/03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0/frameworks/framework-name/executors/task-name/runs/b1d9da34-1172-4797-bceb-d1be90e13a81'
I1101 18:47:58.996690 25195 slave.cpp:1698] Queuing task 'task-name' for executor 'task-name' of framework framework-name
I1101 18:47:59.004479 25203 docker.cpp:1036] Starting container 'b1d9da34-1172-4797-bceb-d1be90e13a81' for task 'task-name' (and executor 'task-name') of framework 'framework-name'
I1101 18:47:59.011270 25199 slave.cpp:4374] Current disk usage 8.82%. Max allowed age: 5.682413235856320days
I1101 18:47:59.547658 25197 systemd.cpp:95] Assigned child process '28712' to 'mesos_executors.slice'
I1101 18:47:59.577991 25194 slave.cpp:2643] Got registration for executor 'task-name' of framework framework-name from executor(1)@192.168.2.38:3449
I1101 18:47:59.578552 25202 docker.cpp:1316] Ignoring updating container 'b1d9da34-1172-4797-bceb-d1be90e13a81' with resources passed to update is identical to existing resources
I1101 18:47:59.578943 25196 slave.cpp:1863] Sending queued task 'task-name' to executor 'task-name' of framework framework-name at executor(1)@192.168.2.38:3449
I1101 18:48:00.287401 25191 slave.cpp:3002] Handling status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name from executor(1)@192.168.2.38:3449
I1101 18:48:00.287945 25198 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name
I1101 18:48:00.289232 25189 slave.cpp:3400] Forwarding the update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name to [email protected]:5050
I1101 18:48:00.289379 25189 slave.cpp:3310] Sending acknowledgement for status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name to executor(1)@192.168.2.38:3449
I1101 18:48:00.491029 25192 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name
I1101 18:48:19.142347 25204 slave.cpp:3002] Handling status update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name from executor(1)@192.168.2.38:3449
I1101 18:48:19.207042 25194 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name
I1101 18:48:19.207393 25189 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name to [email protected]:5050
I1101 18:48:19.207545 25189 slave.cpp:3310] Sending acknowledgement for status update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name to executor(1)@192.168.2.38:3449
I1101 18:48:19.240367 25194 status_update_manager.cpp:392] Received status update acknowledgement (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name
I1101 18:48:20.145004 25200 slave.cpp:3528] executor(1)@192.168.2.38:3449 exited
I1101 18:48:20.191318 25192 docker.cpp:1932] Executor for container 'b1d9da34-1172-4797-bceb-d1be90e13a81' has exited
I1101 18:48:20.191365 25192 docker.cpp:1696] Destroying container 'b1d9da34-1172-4797-bceb-d1be90e13a81'
I1101 18:48:20.191390 25192 docker.cpp:1824] Running docker stop on container 'b1d9da34-1172-4797-bceb-d1be90e13a81'
I1101 18:48:20.192230 25195 slave.cpp:3886] Executor 'task-name' of framework framework-name exited with status 0
I1101 18:48:20.192279 25195 slave.cpp:3990] Cleaning up executor 'task-name' of framework framework-name at executor(1)@192.168.2.38:3449
I1101 18:48:20.192435 25195 slave.cpp:4078] Cleaning up framework framework-name
I1101 18:48:20.192541 25194 status_update_manager.cpp:282] Closing status update streams for framework framework-name
I1101 18:52:58.997258 25195 slave.cpp:4282] Framework framework-name seems to have exited. Ignoring registration timeout for executor 'task-name'

With no other leads, I added logging to the code to print every offer received.

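To my surprise, the same offer was sometimes received twice in a row. If the first copy has already been used to submit a task to Mesos, then when a task is submitted with the second copy, Mesos naturally treats the offer as no longer valid.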
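The failure mode can be reproduced with a toy model. The sketch below is not Mesos code: `ToyMaster` is a hypothetical stand-in that captures only one rule, namely that an offer ID can be consumed once and any later accept using it is answered with TASK_LOST. The status string "LAUNCHED" is a placeholder, not a real Mesos task state.

```python
class ToyMaster:
    """Hypothetical stand-in for a master; not real Mesos code."""

    def __init__(self):
        self.outstanding = set()  # offer IDs that are still valid

    def offer(self, offer_id):
        """Record an offer as outstanding and hand its ID to the scheduler."""
        self.outstanding.add(offer_id)
        return offer_id

    def accept(self, offer_id, task_name):
        """Consume the offer if it is still valid, otherwise report TASK_LOST."""
        if offer_id not in self.outstanding:
            message = (
                "Task launched with invalid offers: "
                f"Offer {offer_id} is no longer valid"
            )
            return (task_name, "TASK_LOST", message)
        self.outstanding.discard(offer_id)
        return (task_name, "LAUNCHED", "")  # placeholder status for the sketch


master = ToyMaster()
offer_id = master.offer("O1473410")

# The scheduler sees the same offer twice and uses both copies.
received = [offer_id, offer_id]
results = [master.accept(o, f"task-{i}") for i, o in enumerate(received)]
for result in results:
    print(result)
```

The first task goes through and the second one comes back as TASK_LOST with the same message as in the master log.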

So I added a cache that stores the last 10 offers received; if an offer shows up again, it is skipped.
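A minimal sketch of such a cache is shown below. It is an illustration of the idea, not the project's actual code: the class name `RecentOfferCache` and its method are invented here, offers are represented by their ID strings, and the class is not thread-safe.

```python
from collections import OrderedDict


class RecentOfferCache:
    """Remembers the most recently seen offer IDs (hypothetical helper)."""

    def __init__(self, capacity=10):
        self.capacity = capacity
        self._seen = OrderedDict()  # insertion order doubles as age order

    def first_time_seen(self, offer_id):
        """Return True if offer_id is new, False if it is a recent duplicate."""
        if offer_id in self._seen:
            return False
        self._seen[offer_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True


cache = RecentOfferCache(capacity=10)
incoming = ["O1", "O2", "O2", "O3"]  # "O2" arrives twice in a row
usable = [o for o in incoming if cache.first_time_seen(o)]
print(usable)
```

With a capacity of 10, a duplicate is caught only if fewer than 10 other offers arrive between the two copies, so the size may need tuning for a busier cluster.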

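I have a few guesses about the cause:

1. A bug in Mesos. We are on Mesos 0.28.1, which is somewhat old, but this seems unlikely.

2. A bug in mesos-rxjava-protobuf-client. We had earlier switched the Mesos connection from the ProtocolBuffer-based approach to the HTTP-based one, so I suspect the officially provided mesos-rxjava-protobuf-client (0.1.2) has a bug. Still, it seems unlikely that an official client would ship with a bug like this.

3. The company network may be quite unstable, and something may go wrong while the Mesos master is failing over, causing duplicate offers to be sent.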

I have not dug into the root cause yet, so this is just a note for now. If anyone knows the cause, pointers are welcome.


Reposted from my.oschina.net/u/1013857/blog/1588731