Scheduling Services on a Docker Swarm Mode Cluster

Introduction

This is the second tutorial in our series on container orchestration with Docker Swarm. The first tutorial covered how to bootstrap a Docker Swarm Mode cluster; in this one, we’ll cover how Swarm schedules workloads across the cluster’s nodes.

Scheduling is a key component of container orchestration, and helps us maximise the workload’s availability, whilst making maximum use of the resources available for those workloads. Automated scheduling removes the need for manual deployment of services, which would otherwise be an onerous task, especially when those services require scaling up and down horizontally.

Sometimes, however, an operator needs to control where workloads are placed, so we’ll look at how to influence the way Swarm’s scheduler distributes workloads across a cluster. We’ll also see what action Swarm takes on deployed services when failures are detected in the cluster.

Prerequisites

In order to follow the tutorial, the following items are required:

a four-node Swarm Mode cluster, as detailed in the first tutorial of this series,
a single manager node (node-01), with three worker nodes (node-02, node-03, node-04), and
direct, command-line access to node-01, or, access to a local Docker client configured to communicate with the Docker Engine on node-01.

The most straightforward configuration can be achieved by following the first tutorial.

Service Mode

Services in Swarm Mode are an abstraction of a workload, and comprise one or more tasks, which are implemented as individual containers.

Services in Docker Swarm have a mode, which can be set to one of two types. The default mode for a service when it is created is ‘replicated’, which means that the service comprises a configurable number of replicated tasks. This mode is useful when services need to be horizontally scaled in order to cater for load, and to provide resilience.

If a service is created in the default replicated mode without a replica count being specified, it will be created with just a single task. It is possible, however, to set the number of replicas when the service is created. For example, if a service needs to be scaled from the outset, we would create the service using the following command executed on a manager node (node-01):

$ docker service create --name nginx --replicas 3 nginx
7oic28vvku4n9pihd13gyt6nk
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
ID                  NAME                NODE                CURRENT STATE
vwgrtg0dbvhc        nginx.1             node-02             Running 8 minutes ago
kri3un3t7b7k        nginx.2             node-03             Running 8 minutes ago
kjg4ddlmxble        nginx.3             node-04             Running 8 minutes ago
$ docker service rm nginx
nginx

The service is created with three tasks running on three of the four nodes in the cluster. We could, of course, achieve the same result by creating the service with the single, default replica, and then using the docker service scale command to scale the service to the required number of replicas.
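
As a minimal sketch of that alternative, the two steps can be wrapped in a small shell function. The function name create_then_scale is ours, purely for illustration; docker service create and docker service scale are the standard CLI commands, and the function is assumed to be run against a manager node:

```shell
# Hypothetical helper: create a service with the default single replica,
# then scale it to the requested number of replicas.
# Usage: create_then_scale <service-name> <image> <replicas>
create_then_scale() {
  name="$1"; image="$2"; replicas="$3"
  docker service create --name "$name" "$image" &&
    docker service scale "$name=$replicas"
}
```

Calling create_then_scale nginx nginx 3 should leave the cluster in the same state as the --replicas 3 example above.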

Whilst a replicated service allows for any number of tasks to be created for the service, a ‘global’ service results in a single task on every node that is configured to accept tasks (including managers). Global mode is useful where it is desirable or imperative to run a service on every node — an agent for monitoring purposes, for example.

It’s necessary to use the --mode global config option when creating the service:

$ docker service create --name nginx --mode global nginx
xziwvfhasmbydgie1r16dlp2j
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
ID                  NAME                              NODE                CURRENT STATE
e4aj4x2dc85j        nginx.d1euoo53in1krtd4z8swkgwxo   node-01             Running 34 seconds ago
xgg9kn9upsi4        nginx.8eg423bamur5uj2cq2lw5803v   node-04             Running 50 seconds ago
qfckheylojz1        nginx.txznjotqie2z89le8qbuqy7ew   node-02             Running 50 seconds ago
x76ca4da1sir        nginx.gu32egf50bk20mnif25b3rh4y   node-03             Running 50 seconds ago
$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
8eg423bamur5uj2cq2lw5803v    node-04   Ready   Active        
d1euoo53in1krtd4z8swkgwxo *  node-01   Ready   Active        Leader
gu32egf50bk20mnif25b3rh4y    node-03   Ready   Active        
txznjotqie2z89le8qbuqy7ew    node-02   Ready   Active
$ docker service rm nginx
nginx

This time, each task name is suffixed with the ID of the node it is scheduled on (e.g. nginx.d1euoo53in1krtd4z8swkgwxo), rather than with the sequential number used for replicated tasks. This is because each task object is associated with a specific node in the cluster. If a new node joins the cluster, a new task is scheduled on that node for each and every service running in global mode.

Whilst we could have used --mode replicated in conjunction with --replicas 3 in the first example above, because replicated mode is the default, it wasn’t necessary to use this config option. Once the mode has been set for a service, it cannot be changed to its alternative. The service will need to be removed and re-created in order to change service mode.

Scheduling Strategy

The way that tasks or containers are scheduled on a Swarm Mode cluster is governed by a scheduling strategy. Currently, Swarm Mode has a single scheduling strategy, called ‘spread’. The spread strategy attempts to schedule a service task based on an assessment of the resources available on cluster nodes.

In its simplest form, this means that tasks are evenly spread across the nodes in a cluster. For example, if we create a service with three replicas, each replicated task will be scheduled on a different node:

$ docker service create --name nginx-01 --replicas 3 nginx
27bxqadwqa56p92suscx6256t
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-01
ID                  NAME                NODE                CURRENT STATE
r5jq9tijt9jr        nginx-01.1          node-03             Running 4 seconds ago
rwt7eq01qnvo        nginx-01.2          node-04             Running 4 seconds ago
5lvtz0csgmco        nginx-01.3          node-01             Running 4 seconds ago

If we now schedule a single replica for a second service, it will be scheduled on the node with no allocated tasks; node-02 in this case:

$ docker service create --name nginx-02 nginx
eb1h84c906bfmgovti0hmabr2
$ docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE
27bxqadwqa56        nginx-01            replicated          3/3                 nginx:latest
eb1h84c906bf        nginx-02            replicated          1/1                 nginx:latest
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-02
ID                  NAME                NODE                CURRENT STATE
d3ldybxlxky2        nginx-02.1          node-02             Running 28 seconds ago
$ docker service rm nginx-01 nginx-02
nginx-01
nginx-02

One caveat to this simple picture of spread-based scheduling arises when scaling an existing service. The scheduler first looks for a node that is not already running a task for the same service, irrespective of how many tasks that node is running for other services. If every node in the cluster is running at least one task for the service, the scheduler selects the node with the fewest tasks from that service, and only then falls back to the general assessment of all tasks running across all nodes. This is informally referred to as ‘HA scheduling’.
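
The ordering described above can be illustrated with a toy model. The sketch below is not SwarmKit’s implementation; it simply assumes each candidate node is described by a line of the form 'node, tasks for this service, total tasks', and picks the node with the fewest tasks for the service, using the total task count only to break ties. The function name pick_node and the input format are assumptions made for the illustration:

```shell
# Toy model of 'HA scheduling'. Reads lines of the form
#   <node> <tasks for this service> <total tasks>
# and prints the node that would be preferred for the next task.
pick_node() {
  # Sort by same-service task count first, then by total task count.
  sort -k2,2n -k3,3n | head -n 1 | cut -d' ' -f1
}

printf '%s\n' \
  'node-01 1 1' \
  'node-02 0 5' \
  'node-03 1 3' | pick_node
```

In this example node-02 is chosen even though it runs the most tasks overall, because it runs none for the service being scaled.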

In the real world, workloads consume resources, and when those workloads co-habit, they need to be good neighbours. Swarm Mode allows a service to be defined with a reservation of, and a limit to, CPU or memory for each of its tasks. Specifying a limit with --limit-cpu or --limit-memory ensures that a service’s tasks do not consume more of the specified resource than is defined in the limit. In contrast to limits, reserving resources for tasks has a direct bearing on where tasks are scheduled.

Let’s see how reserving resources works in practice. The four nodes in our cluster have 1 GB of memory each. If the nodes you are using to follow this tutorial have more or less memory, you will need to adjust the reserved memory values appropriately. First, we’ll create a service with three replicas, and reserve 900 MB of memory for each task:

$ docker service create --name nginx-01 --reserve-memory 900Mb --replicas 3 nginx
nqhtb10hrjtqallqpz36ickow
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-01
ID                  NAME                NODE                CURRENT STATE
y5o1yhkrgls3        nginx-01.1          node-03             Running 53 seconds ago
nhfc1lirgton        nginx-01.2          node-04             Running 53 seconds ago
uq3lqdzxwc4y        nginx-01.3          node-01             Running 53 seconds ago

The service’s tasks are scheduled on three different nodes, just as we’d expect with Swarm’s use of the spread scheduling strategy. Now, let’s deploy another service, this time with four replicas, and reserve 200 MB of memory for each task:

$ docker service create --name nginx-02 --reserve-memory 200Mb --replicas 4 nginx
dmvfxpghle49d3b448bhfj60h
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-02
ID                  NAME                NODE                CURRENT STATE
wnzyipc4n6s4        nginx-02.1          node-02             Running 23 seconds ago
ye9z4rn5qeth        nginx-02.2          node-02             Running 23 seconds ago
j6s6hxviyhrf        nginx-02.3          node-02             Running 23 seconds ago
vpngenvs6whg        nginx-02.4          node-02             Running 23 seconds ago
$ docker service rm nginx-01 nginx-02
nginx-01
nginx-02

Ordinarily, with the spread scheduling strategy, we’d expect one task to end up on node-02, and the others to end up on node-01, node-03 and node-04. However, each of node-01, node-03 and node-04 already has 900 MB of its 1 GB reserved, so none of them has enough unreserved memory left to reserve a further 200 MB. As a result, all four tasks are scheduled on node-02 instead.
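
The arithmetic behind this outcome can be sketched in a few lines of shell. The figures are assumptions: we treat each node as having exactly 1024 MB available for reservations, whereas a real node will usually report slightly less, and the helper name fits is ours:

```shell
# Hypothetical helper: how many tasks with a given reservation fit into
# a given amount of unreserved memory? Both arguments are in MB.
# Usage: fits <unreserved-mb> <reservation-mb>
fits() {
  echo $(( $1 / $2 ))
}

node_mem=1024                     # assumed memory per node, in MB
busy_free=$(( node_mem - 900 ))   # node-01, node-03, node-04 after nginx-01

echo "node with a 900 MB reservation: $(fits "$busy_free" 200) task(s) of 200 MB"
echo "node-02 (nothing reserved):     $(fits "$node_mem" 200) task(s) of 200 MB"
```

Under these assumptions, a node that already carries a 900 MB reservation has room for no further 200 MB tasks, while node-02 could accept up to five, which is consistent with all four nginx-02 tasks landing there.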

An amount of CPU can also be reserved for tasks, and is treated in exactly the same way with regard to scheduling. Note that it is possible to specify fractions of CPU (e.g. --reserve-cpu 1.5), as the reserve is based on a calculation which involves the CFS Quota and Period.

Be aware that if the scheduler is unable to allocate a service task, because insufficient resources are available on cluster nodes, the task will remain in a ‘pending’ state until sufficient resources become available for it to be scheduled.
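
To find out why a task has not been scheduled, the Error column of docker service ps is usually the first place to look. The helper below is a sketch; the function name task_errors is ours, and the exact wording of any error reported depends on the Docker version in use:

```shell
# Hypothetical helper: list each task of a service with its node, state
# and any scheduling error, without truncating the output.
# Usage: task_errors <service-name>
task_errors() {
  docker service ps --no-trunc \
    --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}\t{{.Error}}' "$1"
}
```

Running task_errors nginx-02 on a manager node would list the tasks of the nginx-02 service, including any that are still pending.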

Service Constraints

Whilst the scheduling aspect of orchestration removes the headache of manually deploying container workloads, sometimes it’s convenient (and, sometimes, imperative) to influence where workloads are scheduled. We might want manager nodes to be excluded from consideration. We may need to ensure a stateful service is scheduled on a node where the corresponding data resides. We might want a service to make use of specialist hardware associated with a particular node, and so on.

Swarm Mode uses the concept of constraints, which are applied to services, in order to influence where tasks are scheduled. A constraint is applied with the --constraint config option, which takes an expression as a value, in the form <attribute><operator><value>, where the operator is either == or !=. Swarm Mode has a number of in-built attributes, but it’s also possible to specify arbitrary attributes using labels associated with nodes.

For the purposes of demonstrating the use of constraints, we can use the in-built node.role attribute, for specifying that we only want a service to be scheduled on worker nodes:

$ docker service create --name nginx --mode global --constraint 'node.role==worker' nginx
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
ID                  NAME                              NODE                CURRENT STATE
s01wg79uxrhj        nginx.qnxgs3l6ddoau1yc23cchd9zd   node-04             Running 8 seconds ago
tain2qcu8gog        nginx.kldh65668vz3xyt1es2kapyjr   node-03             Running 49 seconds ago
vdinug90035m        nginx.4v8esk94i1bgbtzhm7xp39b52   node-02             Running 49 seconds ago
$ docker service rm nginx
nginx

We used the ‘global’ mode for the service, and would normally have expected a task to be scheduled on every node, including the manager node, node-01. The constraint, however, limited the deployment of the service to the workers, only. We could have achieved the same using the constraint expression node.role!=manager.

Now, let’s assume we want to deploy our service to a specific node. First, we need to label the node in question, preferably using the Docker object labelling recommendations:

$ docker node update --label-add 'com.acme.server=db' node-03
node-03
$ docker node inspect -f '{{index .Spec.Labels "com.acme.server"}}' node-03
db

To schedule a single replica to this node, we must specify a suitable constraint:

$ docker service create --name redis --constraint 'node.labels.com.acme.server==db' redis
iavdmioaz1omruo86b0d1xxvp
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' redis
ID                  NAME                NODE                CURRENT STATE
nktfpqouwjcb        redis.1             node-03             Running about a minute ago
$ docker service rm redis
redis

The single replica for the service has been scheduled on node-03, which carries the label referenced in the constraint. Any task belonging to a service with a constraint applied, which cannot be scheduled because of other constraints or a lack of resources, will remain in a ‘pending’ state until it becomes possible to schedule it.
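
If the label is no longer needed once the experiment is over, it can be removed again with the --label-rm option of docker node update, which takes the label key. A minimal sketch, in which the function name remove_node_label is our own:

```shell
# Hypothetical helper: remove a label (identified by its key) from a node.
# Usage: remove_node_label <label-key> <node>
remove_node_label() {
  docker node update --label-rm "$1" "$2"
}
```

For the label created above, this would be invoked as remove_node_label com.acme.server node-03.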

Scheduling Preferences

Whilst constraints provide the ability to deterministically influence the scheduling of tasks, placement preferences provide a ‘soft’ means of influencing scheduling. Placement preferences direct the scheduler to account for expressed preferences, but if they can’t be met due to resource limitations or defined constraints, then scheduling continues according to the normal spread strategy. The placement preference scheme was born from a need to schedule tasks based on topology.

Let’s schedule a service based on the cluster’s nodes, and their location in (pretend) availability zones. We’ll place node-01 and node-02 in zone ‘a’, node-03 in zone ‘b’, and node-04 in zone ‘c’. When we specify a placement preference based on a zone-related label, the tasks for the service in question will be scheduled equally across the zones. To create the labels for the nodes:

$ docker node update --label-add 'com.acme.zone=a' node-01
node-01
$ docker node update --label-add 'com.acme.zone=a' node-02
node-02
$ docker node update --label-add 'com.acme.zone=b' node-03
node-03
$ docker node update --label-add 'com.acme.zone=c' node-04
node-04

Now that the nodes have their labels, we can deploy a service with a placement preference, and observe the results:

$ docker service create --name nginx --placement-pref 'spread=node.labels.com.acme.zone' --replicas 12 nginx
uiany6ohly6h2lnn6r782g44z
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
ID                  NAME                NODE                CURRENT STATE
sd54gvigyu54        nginx.1             node-03             Running about a minute ago
mrhsbes9786f        nginx.2             node-02             Running about a minute ago
142r9nhtp8e2        nginx.3             node-03             Running about a minute ago
vd4vueih1smb        nginx.4             node-04             Running about a minute ago
qcsk4flisdnl        nginx.5             node-04             Running about a minute ago
o5nfd0tmjxdv        nginx.6             node-01             Running about a minute ago
l55q6tz86tua        nginx.7             node-02             Running about a minute ago
soq3lyh2k02o        nginx.8             node-01             Running about a minute ago
i0s6s7se8i3r        nginx.9             node-03             Running about a minute ago
zb7t9ef0lovo        nginx.10            node-03             Running about a minute ago
a44o5f437obx        nginx.11            node-04             Running about a minute ago
obkyhe6iu4r3        nginx.12            node-04             Running about a minute ago
$ docker service rm nginx
nginx

The tasks have been scheduled equally amongst the three ‘zones’, with node-01 and node-02 acquiring two tasks apiece, whilst node-03 and node-04 have been allocated four tasks each.
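
The split can be reproduced with some simple arithmetic. The sketch below is a toy model rather than the scheduler’s actual algorithm: it assumes the replicas divide evenly between the zones, and then evenly between the nodes inside each zone. The function name tasks_per_node is ours:

```shell
# Toy model of a single 'spread' placement preference.
# Usage: tasks_per_node <replicas> <zones> <nodes in this zone>
tasks_per_node() {
  echo $(( $1 / $2 / $3 ))
}

echo "zone a (node-01, node-02): $(tasks_per_node 12 3 2) tasks per node"
echo "zone b (node-03):          $(tasks_per_node 12 3 1) tasks per node"
echo "zone c (node-04):          $(tasks_per_node 12 3 1) tasks per node"
```

Twelve replicas across three zones gives four tasks per zone, which works out at two tasks for each of the two nodes in zone ‘a’, and four tasks each for the single nodes in zones ‘b’ and ‘c’.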

The outcome of the deployment of this service would have been very different if we had applied a resource reservation in conjunction with the placement preference. As each node is configured with 1 GB of memory, if we created the service with --reserve-memory 300Mb, the placement preferences could not physically be honoured by the scheduler, and each node would be scheduled with three tasks apiece, instead.

Multiple placement preferences can be expressed for a service, using --placement-pref multiple times, with the order of the preferences being significant. For example, if two placement preferences are defined, the tasks will be spread between the nodes satisfying the first expressed preference, before being further divided according to the second preference. This allows refined placement of tasks, to effect the high availability of services.
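
As a sketch of what this might look like, the helper below spreads tasks across zones first, and then across racks within each zone. The label com.acme.rack is an assumption made for this example (we have not applied it to any node in this tutorial), and the function name create_spread_service is ours:

```shell
# Hypothetical helper: create a service with two ordered placement
# preferences. 'com.acme.rack' is an assumed label, not one set earlier.
# Usage: create_spread_service <service-name> <replicas> <image>
create_spread_service() {
  docker service create --name "$1" --replicas "$2" \
    --placement-pref 'spread=node.labels.com.acme.zone' \
    --placement-pref 'spread=node.labels.com.acme.rack' \
    "$3"
}
```

The zone preference is listed first, so it is the one applied first; swapping the two --placement-pref options would spread across racks before zones.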

Rescheduling on Failure

Those who have spent time with an ops-oriented hat on can identify with the adage, “Anything that can go wrong, will go wrong”. Workloads will fail. Cluster nodes, or other infrastructure components, will fail, or become unavailable for periods of time. Ensuring the continued operation of a deployed service, and the recovery to a pre-defined status quo, is an important component of orchestration.

Swarm Mode uses a declarative approach to workloads, and employs ‘desired state reconciliation’ in order to maintain the desired state of the cluster. If components of the cluster fail, whether they be individual tasks, or a cluster node, Swarm’s reconciliation loop attempts to restore the desired state for all workloads affected.

The easiest way for us to demonstrate this is to simulate a node becoming unavailable in the cluster. We can achieve this with relative ease, by changing the ‘availability’ of a node in the cluster for scheduling purposes. When we issue the command docker node ls, one of the node attributes reported on is ‘availability’, which normally yields ‘Active’:

$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
8eg423bamur5uj2cq2lw5803v    node-04   Ready   Active        
d1euoo53in1krtd4z8swkgwxo *  node-01   Ready   Active        Leader
gu32egf50bk20mnif25b3rh4y    node-03   Ready   Active        
txznjotqie2z89le8qbuqy7ew    node-02   Ready   Active

Before we alter a node’s availability, let’s create a service which has a replica scheduled on each node in the cluster:

$ docker service create --name nginx --replicas 4 nginx
vnk7mx1liy11gfpu14z7sexg2
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
ID                  NAME                NODE                CURRENT STATE
w4mkj7645d5k        nginx.1             node-01             Running 29 seconds ago
fdre59b5ijhj        nginx.2             node-03             Running 29 seconds ago
skrmh7kievq7        nginx.3             node-02             Running 29 seconds ago
o5w3eyh5dyo0        nginx.4             node-04             Running 29 seconds ago

Now, let’s set the availability of node-02 to ‘drain’, which will take it out of the pool for scheduling purposes, and terminate the task nginx.3. It will then get rescheduled on one of the other nodes in the cluster:

$ docker node update --availability drain node-02
node-02
$ docker node ls
ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
txznjotqie2z89le8qbuqy7ew    node-02   Ready   Drain         
d1euoo53in1krtd4z8swkgwxo *  node-01   Ready   Active        Leader
gu32egf50bk20mnif25b3rh4y    node-03   Ready   Active        
8eg423bamur5uj2cq2lw5803v    node-04   Ready   Active        
$ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
ID                  NAME                NODE                CURRENT STATE
w4mkj7645d5k        nginx.1             node-01             Running 8 minutes ago
fdre59b5ijhj        nginx.2             node-03             Running 8 minutes ago
t86fdnxuze0m        nginx.3             node-03             Running 20 seconds ago
skrmh7kievq7         \_ nginx.3         node-02             Shutdown 20 seconds ago
o5w3eyh5dyo0        nginx.4             node-04             Running 8 minutes ago

The output from docker service ps shows the history for the task in ‘slot 3’; a container was shut down on node-02, and then replaced with a container running on node-03.
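
Once the simulated failure is over, node-02 can be returned to the scheduling pool. Setting its availability back to ‘active’ makes the node eligible for new tasks, but Swarm does not move running tasks back onto it of its own accord. Forcing a service update is one way to trigger a rebalance, at the cost of restarting the service’s tasks. A minimal sketch, in which the function name reactivate_and_rebalance is ours:

```shell
# Hypothetical helper: make a drained node schedulable again, then force
# a service update so that its tasks are redistributed across the nodes.
# Usage: reactivate_and_rebalance <node> <service-name>
reactivate_and_rebalance() {
  node="$1"; service="$2"
  docker node update --availability active "$node" &&
    docker service update --force "$service"
}
```

For the example above, this would be invoked as reactivate_and_rebalance node-02 nginx, after which the nginx service can be removed with docker service rm nginx.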

Conclusion

This tutorial has provided an overview of Docker Swarm Mode’s scheduling capabilities. Like most projects in the open source domain, SwarmKit, the project that Docker Swarm Mode is based on, continues to evolve with each new release, and it’s probable that its scheduling capabilities will be further enhanced over time. In the meantime, we’ve highlighted:

Swarm’s default spread scheduling strategy,
How resource reservations and constraints affect scheduling,
How it’s possible to influence the scheduler, using placement preferences, and
Swarm’s approach to rescheduling on failure.

In the next tutorial, we’ll explore how deployed services are consumed, internally and externally.

