In his welcome speech at MesosCon Europe 2015, Benjamin Hindman said that Mesos was designed to support building and running distributed systems. To survive in production, systems have to adapt to occasional network blips and outages. As research shows, networks do fail, and assuming that the network is reliable is one of the 8 fallacies of distributed computing.
In this post, I will cover what happens when the Mesos Master has network problems with one of its Mesos Agents and how Mesos applications (frameworks) get notified. So, let’s consider the following architecture:
This architecture is a bit simplified, because the Scheduler can reside on a different host. White boxes are Mesos components; red shaded boxes are components developed by the Mesos app (framework) developer. So, let’s consider what happens when the Mesos Master cannot connect to the Mesos Agent.
The Mesos Master detects the network blip using pings. The detection is controlled by --agent_ping_timeout (default 15s) and --max_agent_ping_timeouts (default 5): an Agent that does not answer pings for agent_ping_timeout times max_agent_ping_timeouts seconds (75s with the defaults) is considered lost. In this case, on reconnection the Mesos Agent will kill all its tasks and executors and shut down. The scheduler is informed through task status updates: it receives TASK_LOST for each task on that Agent, plus an agent failure event.
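To make the arithmetic concrete, here is a minimal sketch of the detection window. The function names are made up for illustration and are not part of Mesos; only the two default values come from the flags described above. It also simplifies things by treating the window as a single threshold, ignoring where the blip falls relative to the ping schedule.

```python
def agent_removal_window(agent_ping_timeout_s: float = 15.0,
                         max_agent_ping_timeouts: int = 5) -> float:
    """Seconds of unanswered pings after which the Master considers the Agent lost."""
    return agent_ping_timeout_s * max_agent_ping_timeouts


def survives_blip(blip_s: float, window_s: float) -> bool:
    """True if a blip of the given length is shorter than the removal window.

    Simplification (an assumption of this sketch): the blip is compared
    against the window as a whole, not against individual ping deadlines.
    """
    return blip_s < window_s


window = agent_removal_window()
print(window)                       # 75.0 with the default flag values
print(survives_blip(30.0, window))  # True: the Agent is not marked lost
print(survives_blip(90.0, window))  # False: the Agent is marked lost
```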
Here is the log of a reconnected agent shutting everything down:
I0203 14:31:00.725201 128352256 slave.cpp:809] Agent asked to shut down by [email protected]:5050 because 'Agent attempted to re-register after removal'
I0203 14:31:00.725296 128352256 slave.cpp:2218] Asked to shut down framework e1637465-8791-460c-8c66-fadaa19f8148-0000 by [email protected]:5050
I0203 14:31:00.725313 128352256 slave.cpp:2243] Shutting down framework e1637465-8791-460c-8c66-fadaa19f8148-0000
I0203 14:31:00.725334 128352256 slave.cpp:4407] Shutting down executor '59160602-24bc-4a44-9c53-26a43d32402e' of framework e1637465-8791-460c-8c66-fadaa19f8148-0000 (via HTTP)
E0203 14:31:00.725385 131571712 process.cpp:2105] Failed to shutdown socket with fd 14: Socket is not connected
I0203 14:31:00.725507 128352256 slave.cpp:4407] Shutting down executor '7eebc364-a1ab-464d-8624-0f785afccc38' of framework e1637465-8791-460c-8c66-fadaa19f8148-0000 (via HTTP)
If the network blip is shorter than agent_ping_timeout times max_agent_ping_timeouts seconds, the Agent is never marked as lost and everything should keep working.
This is not a very good approach, as it doesn’t allow the app developer to change this behavior in any way. For example, if you run Spark on Mesos, this behavior may be fine (some tasks will be killed, but they will automatically be rescheduled on different Agents). But when running Cassandra on Mesos, you probably wouldn’t want your Cassandra nodes to be killed after a network blip of just over 75s.
That’s why Mesos 1.1 added experimental support for partition-aware Mesos frameworks. If the framework developer opts in to this feature, tasks and executors won’t be killed when the agent reconnects. This allows frameworks to define their own policies for handling partitioned tasks. It also adds new task states, which allow network partitions to be handled differently.
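As a rough illustration of the opt-in, the sketch below builds the JSON body of a SUBSCRIBE call, assuming the framework uses the v1 scheduler HTTP API and registers with the PARTITION_AWARE capability. The helper name, user and framework name are hypothetical, and the code only builds the payload; it does not talk to a Master.

```python
import json


def build_subscribe_call(user: str, name: str, partition_aware: bool = True) -> dict:
    """Build the body of a SUBSCRIBE call (hypothetical helper, not a Mesos API).

    Assumption: the framework speaks the v1 scheduler HTTP API with JSON
    encoding and opts in through the PARTITION_AWARE framework capability.
    """
    framework_info = {"user": user, "name": name}
    if partition_aware:
        # Opting in: the framework wants the new task states instead of
        # TASK_LOST and takes responsibility for its own partition policy.
        framework_info["capabilities"] = [{"type": "PARTITION_AWARE"}]
    return {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}


# Placeholder user and framework name, for illustration only.
call = build_subscribe_call("root", "example-framework")
print(json.dumps(call, indent=2))
```

Frameworks built on the language bindings would set the same capability on their FrameworkInfo instead of building JSON by hand.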
Let’s take a look at the new task states:
- TASK_UNREACHABLE – is sent when the Mesos Master detects the network blip. When the Mesos Agent rejoins, the task will be set to TASK_RUNNING again (if it is still running).
- TASK_DROPPED – is sent when a task fails to launch because of a transient error. The task is not running.
- TASK_GONE – is sent during the task reconciliation process, when the Master knows that the task was running on an Agent that has been terminated. So, the task is not running.
- TASK_GONE_BY_OPERATOR – As I understand it, Mesos will provide a maintenance primitive for operators to mark tasks as gone, so this task state carries the possibility of human error. So, the task is probably gone.
- TASK_UNKNOWN – is sent during the task reconciliation process, when the Master does not know about the task’s agent. So, the task may still be running.
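One way a framework might turn these states into a policy is sketched below. This is an illustrative policy, not something Mesos prescribes: the Action enum, the decide function and the 10-minute grace period are all assumptions, chosen for a stateful service that prefers waiting over restarting.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"                # keep the task, hope the Agent comes back
    RESCHEDULE = "reschedule"    # launch a replacement elsewhere
    INVESTIGATE = "investigate"  # verify before acting (e.g. reconcile again)


def decide(state: str, unreachable_for_s: float = 0.0, grace_s: float = 600.0) -> Action:
    """Map a partition-aware task state to a framework action (hypothetical policy)."""
    if state == "TASK_UNREACHABLE":
        # The task may still be running; only give up after the grace period.
        return Action.WAIT if unreachable_for_s < grace_s else Action.RESCHEDULE
    if state in ("TASK_DROPPED", "TASK_GONE"):
        # The task is known not to be running, so replacing it is safe.
        return Action.RESCHEDULE
    if state in ("TASK_GONE_BY_OPERATOR", "TASK_UNKNOWN"):
        # The task is only probably gone, or may still be running.
        return Action.INVESTIGATE
    raise ValueError(f"not a partition-aware task state: {state}")


print(decide("TASK_UNREACHABLE", unreachable_for_s=80.0))   # Action.WAIT
print(decide("TASK_UNREACHABLE", unreachable_for_s=900.0))  # Action.RESCHEDULE
print(decide("TASK_GONE"))                                  # Action.RESCHEDULE
```

A batch framework in the style of the Spark example could set grace_s to 0 and reschedule immediately, while a Cassandra-style framework would likely want a much longer grace period.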
The remaining work on partition-aware frameworks can be tracked in this Jira ticket.