Apache Mesos in the face of network partitions

In MesosCon Europe 2015 welcome speech, Benjamin Hindman said that Mesos was designed to support building and running distributed systems. In order to survive production, systems have to learn to adapt occasional network blips and outages. As research shows network can have problems and thinking that network  is reliable is one of the 8 fallacies of distributed computing.

In this post, I will cover what happens when Mesos Master has network problems with it’s Mesos Agent and how Mesos applications (frameworks) gets notified. So, let’s consider the following architecture:

Mesos arch1 - Page 1 (1) 2.pngThis is a bit simplified architecture, because Scheduler can reside in a different host. White boxes are Mesos components, red shaded boxes are components, which are developed by the Mesos app (framework) developer. So, let’s consider what happens when Mesos Master cannot connect to the Mesos Agent.

Mesos Master detects the network blip using pings. The detection is controlled using
–agent_ping_timeout (default 15s) and –max_agent_ping_timeouts (default 5), so the Agent which does not answer to a ping after agent_ping_timeout times max_agent_ping_timeouts seconds, will be considered lost.  In this case, on reconnection Mesos agent will kill all tasks, executors and shutdown. The scheduler will be informed using task status updates, it will receive TASK_LOST for each task on that Agent and an agent failure event.
Here is log of reconnected agent shutting everything down:

I0203 14:31:00.725201 128352256 slave.cpp:809] Agent asked to shut down by master@ because ‘Agent attempted to re-register after removal’
I0203 14:31:00.725296 128352256 slave.cpp:2218] Asked to shut down framework e1637465-8791-460c-8c66-fadaa19f8148-0000 by master@
I0203 14:31:00.725313 128352256 slave.cpp:2243] Shutting down framework e1637465-8791-460c-8c66-fadaa19f8148-0000
I0203 14:31:00.725334 128352256 slave.cpp:4407] Shutting down executor ‘59160602-24bc-4a44-9c53-26a43d32402e’ of framework e1637465-8791-460c-8c66-fadaa19f8148-0000 (via HTTP)
E0203 14:31:00.725385 131571712 process.cpp:2105] Failed to shutdown socket with fd 14: Socket is not connected
I0203 14:31:00.725507 128352256 slave.cpp:4407] Shutting down executor ‘7eebc364-a1ab-464d-8624-0f785afccc38’ of framework e1637465-8791-460c-8c66-fadaa19f8148-0000 (via HTTP)

If the network blip is shorter than agent_ping_timeout times max_agent_ping_timeouts seconds, everything should still work.

This is not a very good approach as it doesn’t allow app developer to change this behavior in any way. For example, if you run Spark on Mesos, maybe this behavior is fine (some tasks will be killed, but automatically will be rescheduled on the different Agents), but when using Cassandra on Mesos,  you probably wouldn’t want your Cassandra nodes to be killed after 80s of network blip?

Thats why in Mesos 1.1. there was added experimental support for partition-aware Mesos frameworks. If the framework developer will opt in into this feature, tasks and executors wont be killed when agent reconnects. This allows frameworks to define their own policies for how to handle partitioned tasks. Also, it adds new task states, which allows handling network partitions differently.

Let’s take a look at the new task states:

  • TASK_UNREACHABLE – is sent when Mesos Master detects the network blip. On Mesos Agent rejoining the the task will be set as TASK_RUNNING again (if that was the case).
  • TASK_DROPPED – is sent when a task fails to launch because of a transient error. The task is not running.
  • TASK_GONE – is sent during task reconciliation process, when Master knows that task was running on an Agent that has been terminated. So, the task is not running. 
  • TASK_GONE_BY_OPERATOR – As I understand mesos will provide some maintenance primitive for operators to mark tasks as gone, so this task state has the possibility of human error. So, the task is probably gone.
  • TASK_UNKNOWN – will be sent during task reconciliation process, when the Master does not know about the task’s agent. So, the task may still be running.

The left work with partition-aware frameworks can be tracked in this Jira ticket.

Thanks for reading! If you have any questions or want to provide feedback, you can contact me via twitter @PofkeVe.


Cool feature of jUnit 5

One thing I love in the upcoming release of jUnit is the @DisplayName annotation. This annotation allows to use language to name a unit test. Kevlin Henney in his talk “What We Talk About When We Talk About Unit Testing” has shown power of using English language to specify what unit test is doing. For example:

@DisplayName("ISBNs with more than 13 Digits are Malformed")
public void isValid() {
    ISBNValidator isbnValidator = new ISBNValidator();

    Boolean isValid = isbnValidator.isValid("1111222233334444");

    assertEquals(false, isValid);

Which results in unit tests being a great specification for a component. Furthermore you can use this cool IntelliJ IDEA feature, which allows to see all component tests by name. So the final result looks like this:


By looking at the image, we can clearly know what component does and what it’s corner cases are, which gives us a lot of insight. Actually, we even know that it does what it say it does, because test ran fine!

Although you can do this using regular camel case method naming, but it is really hard to read:

public void isbnsWithMoreThan13DigitsAreMalformed() {
    ISBNValidator isbnValidator = new ISBNValidator();

    Boolean isValid = isbnValidator.isValid("1111222233334444");

    assertEquals(false, isValid);

and the resulting spec:



This example is taken from the talk mentioned above, I really suggest you to watch it.

Thanks for reading ! Until next time!

Unit tests

Probably you have heard many advantages of unit testing such as getting bugs out of the code, producing better code design, having overall good code quality. But for me it’s all about reducing the fear of making changes in production. It gives me confidence in my code so that my new code additions didn’t break something. Also it shortens the development-testing cycle, which allows me to always publish code which works.

It’s important to distinguish unit tests from other kind of tests. If you put all kind of tests into same place it quickly can become a mess: some tests run slowly, some require testing database, a queue or other external dependency. At least in my company people don’t have a clear understanding what a unit test is and so we ended up with ice cream cone anti-pattern, where we have many end to end tests, a lot of manual testing, zero unit tests and few integration tests. Why this pattern is an anti pattern you can read in this blog post from Google testing blog.

So, what is a unit test? There are many different definitions what are the unit tests, I tend too look at them as the low-level tests, that test a single unit: function call, method, class or a group of classes. These tests should run quickly, use test stubs or mocks for external dependencies, so that they should test only the system under test and not the external dependencies. These tests are really really fast. you can run thousands of the in a second or two.

Lastly, I wanted to share some cool articles about what is and what isn’t a unit test: