A problem encountered when using the Mesos framework

The company has a project that implements a Mesos framework for task scheduling. Simply put, it receives tasks submitted by clients, including data such as the image, parameters, and execution time, submits them to Mesos as tasks, lets Mesos execute them, and returns the execution results.

I recently ran into a problem: after the service had been running for a while, a lot of tasks started failing. The task status returned in the Mesos update event was TASK_LOST, with the error message: Task launched with invalid offers: Offer XXXX is no longer valid. I checked the code and confirmed it was fine: after the framework receives an offer, it loops over it to pick up the tasks waiting to be submitted, and no offer is ever reused. In theory, the Mesos master receives the resource offers reported by the slaves and sends each offer to the frameworks in turn, so under normal circumstances an offer should not be occupied by another framework.
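For reference, the offer handling in the framework looks roughly like the sketch below. This is not the actual project code: the MesosClient interface and the buildTasksFor helper are placeholders for the real mesos-rxjava HTTP scheduler client and the real task-matching logic, and it assumes the v1 protobuf classes (org.apache.mesos.v1.Protos) are on the classpath. The point is simply that each offer is consumed at most once, by a single ACCEPT or DECLINE.

import java.util.Collections;
import java.util.List;

import org.apache.mesos.v1.Protos.Offer;
import org.apache.mesos.v1.Protos.OfferID;
import org.apache.mesos.v1.Protos.TaskInfo;

// Placeholder for the real HTTP scheduler client (mesos-rxjava in this project).
interface MesosClient {
    void accept(OfferID offerId, List<TaskInfo> tasks);
    void decline(OfferID offerId);
}

class OfferHandler {
    private final MesosClient client;

    OfferHandler(MesosClient client) {
        this.client = client;
    }

    // Each offer from an OFFERS event is consumed exactly once:
    // either accepted with the tasks built for it, or declined.
    void onOffers(List<Offer> offers) {
        for (Offer offer : offers) {
            List<TaskInfo> tasks = buildTasksFor(offer);
            if (tasks.isEmpty()) {
                client.decline(offer.getId());
            } else {
                client.accept(offer.getId(), tasks);
            }
        }
    }

    // Placeholder: match queued tasks against the offer's resources.
    private List<TaskInfo> buildTasksFor(Offer offer) {
        return Collections.emptyList();
    }
}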

So I searched the master log by offer ID, and found only the entries about the invalid offer:

I1101 18:47:34.989588 26966 master.cpp:3641] Processing DECLINE call for offers: [ 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 ] for framework framework-name
I1101 18:47:34.989820 26966 http.cpp:312] HTTP POST for /master/api/v1/scheduler from 192.168.2.38:39948 with User-Agent='galaxy-schedule/3.1.3 mesos-rxjava-client/0.1.2 rxnetty/0.4.13'
W1101 18:47:34.989948 26966 master.cpp:3056] Ignoring accept of offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 since it is no longer valid
W1101 18:47:34.989971 26966 master.cpp:3067] ACCEPT call used invalid offers '[ 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 ]': Offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 is no longer valid
I1101 18:47:34.989994 26966 master.cpp:4806] Sending status update TASK_LOST for task task-name of framework framework-name 'Task launched with invalid offers: Offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 is no longer valid'

Presumably the log level is set too high, so the logs for normally executed tasks are not printed.

Then I searched the master log by task name, which still showed nothing useful beyond the error:

Adding task task-name with resources cpus(*):0.25; mem(*):512; disk(*):64 on slave 03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0 (192.168.2.38)
I1101 18:47:58.990278 26962 master.cpp:3589] Launching task task-name of framework framework-name (galaxy-3.0) with resources cpus(*):0.25; mem(*):512; disk(*):64 on slave 03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0 at slave(1)@192.168.2.38:5051 (192.168.2.38)
I1101 18:48:00.295632 26968 master.cpp:4763] Status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name from slave 03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0 at slave(1)@192.168.2.38:5051 (192.168.2.38)
I1101 18:48:00.295724 26968 master.cpp:4811] Forwarding status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name
I1101 18:48:00.295905 26968 master.cpp:6421] Updating the state of task task-name of framework framework-name (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
Sending status update TASK_LOST for task task-name of framework framework-name 'Task launched with invalid offers: Offer 03bcc0aa-92ce-4304-b1cd-efc90b29f896-O1473410 is no longer valid'

Searching the slave log by offer ID turned up nothing relevant.

Searching the slave log by task name didn't seem to help either:

I1101 18:47:53.255692 25195 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 106f04ef-4a7f-0000-60f0-220100000000) for task task-name of framework framework-name
E1101 18:47:53.255821 25195 slave.cpp:2405] Failed to handle status update acknowledgement (UUID: 106f04ef-4a7f-0000-60f0-220100000000) for task task-name of framework framework-name: Cannot find the status update stream for task task-name of framework framework-name
I1101 18:47:58.985147 25203 slave.cpp:1361] Got assigned task task-name for framework framework-name
I1101 18:47:58.985749 25203 gc.cpp:83] Unscheduling '/data/mesos-slave/slaves/03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0/frameworks/framework-name' from gc
I1101 18:47:58.985919 25195 slave.cpp:1480] Launching task task-name for framework framework-name
I1101 18:47:58.986297 25195 paths.cpp:528] Trying to chown '/data/mesos-slave/slaves/03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0/frameworks/framework-name/executors/task-name/runs/b1d9da34-1172-4797-bceb-d1be90e13a81' to user 'root'
I1101 18:47:58.996266 25195 slave.cpp:5367] Launching executor task-name of framework framework-name with resources cpus(*):0.1; mem(*):32 in work directory '/data/mesos-slave/slaves/03bcc0aa-92ce-4304-b1cd-efc90b29f896-S0/frameworks/framework-name/executors/task-name/runs/b1d9da34-1172-4797-bceb-d1be90e13a81'
I1101 18:47:58.996690 25195 slave.cpp:1698] Queuing task 'task-name' for executor 'task-name' of framework framework-name
I1101 18:47:59.004479 25203 docker.cpp:1036] Starting container 'b1d9da34-1172-4797-bceb-d1be90e13a81' for task 'task-name' (and executor 'task-name') of framework 'framework-name'
I1101 18:47:59.011270 25199 slave.cpp:4374] Current disk usage 8.82%. Max allowed age: 5.682413235856320days
I1101 18:47:59.547658 25197 systemd.cpp:95] Assigned child process '28712' to 'mesos_executors.slice'
I1101 18:47:59.577991 25194 slave.cpp:2643] Got registration for executor 'task-name' of framework framework-name from executor(1)@192.168.2.38:3449
I1101 18:47:59.578552 25202 docker.cpp:1316] Ignoring updating container 'b1d9da34-1172-4797-bceb-d1be90e13a81' with resources passed to update is identical to existing resources
I1101 18:47:59.578943 25196 slave.cpp:1863] Sending queued task 'task-name' to executor 'task-name' of framework framework-name at executor(1)@192.168.2.38:3449
I1101 18:48:00.287401 25191 slave.cpp:3002] Handling status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name from executor(1)@192.168.2.38:3449
I1101 18:48:00.287945 25198 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name
I1101 18:48:00.289232 25189 slave.cpp:3400] Forwarding the update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name to [email protected]:5050
I1101 18:48:00.289379 25189 slave.cpp:3310] Sending acknowledgement for status update TASK_RUNNING (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name to executor(1)@192.168.2.38:3449
I1101 18:48:00.491029 25192 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 7a192af0-6702-4a27-bbf4-f79d2ecdae86) for task task-name of framework framework-name
I1101 18:48:19.142347 25204 slave.cpp:3002] Handling status update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name from executor(1)@192.168.2.38:3449
I1101 18:48:19.207042 25194 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name
I1101 18:48:19.207393 25189 slave.cpp:3400] Forwarding the update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name to [email protected]:5050
I1101 18:48:19.207545 25189 slave.cpp:3310] Sending acknowledgement for status update TASK_FAILED (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name to executor(1)@192.168.2.38:3449
I1101 18:48:19.240367 25194 status_update_manager.cpp:392] Received status update acknowledgement (UUID: d23cb37f-c332-4284-9aae-8f95142bb531) for task task-name of framework framework-name
I1101 18:48:20.145004 25200 slave.cpp:3528] executor(1)@192.168.2.38:3449 exited
I1101 18:48:20.191318 25192 docker.cpp:1932] Executor for container 'b1d9da34-1172-4797-bceb-d1be90e13a81' has exited
I1101 18:48:20.191365 25192 docker.cpp:1696] Destroying container 'b1d9da34-1172-4797-bceb-d1be90e13a81'
I1101 18:48:20.191390 25192 docker.cpp:1824] Running docker stop on container 'b1d9da34-1172-4797-bceb-d1be90e13a81'
I1101 18:48:20.192230 25195 slave.cpp:3886] Executor 'task-name' of framework framework-name exited with status 0
I1101 18:48:20.192279 25195 slave.cpp:3990] Cleaning up executor 'task-name' of framework framework-name at executor(1)@192.168.2.38:3449
I1101 18:48:20.192435 25195 slave.cpp:4078] Cleaning up framework framework-name
I1101 18:48:20.192541 25194 status_update_manager.cpp:282] Closing status update streams for framework framework-name
I1101 18:52:58.997258 25195 slave.cpp:4282] Framework framework-name seems to have exited. Ignoring registration timeout for executor 'task-name'

With nothing else to go on, I added logging to the code to print out every offer received.
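The extra logging was nothing special, roughly along these lines (a sketch, assuming an SLF4J logger and the same v1 Offer protobuf class; the real code logs from wherever the OFFERS event is handled):

import java.util.List;
import java.util.stream.Collectors;

import org.apache.mesos.v1.Protos.Offer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class OfferLogging {
    private static final Logger LOG = LoggerFactory.getLogger(OfferLogging.class);

    // Called from the OFFERS event handler: print the ID of every offer received,
    // so duplicates become visible in the framework's own log.
    static void logOffers(List<Offer> offers) {
        String ids = offers.stream()
                .map(o -> o.getId().getValue())
                .collect(Collectors.joining(", "));
        LOG.info("Received offers: [{}]", ids);
    }
}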

It turned out that the same offer can be received twice in a row. If a task has already been launched against the first copy of the offer, then when a task is launched against the second copy, Mesos of course considers the offer no longer valid.

As a workaround, a cache was added to hold the 10 most recently received offer IDs; if an offer shows up again, it is simply skipped.
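A minimal version of that cache could look like the sketch below (again, not the actual project code): a LinkedHashMap bounded to 10 entries, where markIfNew returns false for an offer ID that has already been seen.

import java.util.LinkedHashMap;
import java.util.Map;

// Remembers the IDs of the last few offers; markIfNew returns false for an
// offer ID that has already been recorded, so the caller can skip it.
class RecentOfferCache {
    private static final int MAX_ENTRIES = 10;

    private final Map<String, Boolean> seen =
            new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES;  // keep only the 10 most recent IDs
                }
            };

    // Returns true if the offer ID is new (and records it), false if it is a duplicate.
    synchronized boolean markIfNew(String offerId) {
        return seen.put(offerId, Boolean.TRUE) == null;
    }
}

In the offer loop, a duplicate is then skipped before any task is launched against it, e.g. if (!cache.markIfNew(offer.getId().getValue())) continue;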

I have a few guesses about the cause of the problem:

1. A bug in Mesos itself. The Mesos version in use is 0.28.1, which is a bit old, but this seems unlikely.

2. A bug in mesos-rxjava-protobuf-client. The Mesos connection was recently switched from the ProtocolBuffer-based approach to the HTTP-based API, so the official mesos-rxjava-protobuf-client (0.1.2) is suspect, but it still seems unlikely that the official client would ship with this kind of bug.

3. The company's network fluctuates a lot, and something goes wrong on the Mesos master during master switching, resulting in duplicate offers.

I haven't investigated the root cause in depth yet, so this is just a record for now. If anyone knows the reason, please let me know.
