24 May 2017 · Software Engineering

    Scheduling Services on a Docker Swarm Mode Cluster


    Introduction

    This tutorial is the second in our series of articles on container orchestration with Docker Swarm. In the first tutorial, we covered how to bootstrap a Docker Swarm Mode cluster; in this second tutorial, we’ll cover how Swarm schedules workloads across the cluster’s nodes.

    Scheduling is a key component of container orchestration: it helps us maximise workload availability whilst making the best use of the resources available to those workloads. Automated scheduling removes the need to deploy services manually, which would otherwise be an onerous task, especially when those services need to be scaled up and down horizontally.

    Sometimes, however, it’s important for an operator to be able to change where workloads are scheduled, and we’ll look into how it’s possible to change how Swarm’s scheduler places workloads across a cluster. We’ll also see what action Swarm takes with regard to deployed services, when failures are detected in the cluster.

    Prerequisites

    In order to follow the tutorial, the following items are required:

    • a four-node Swarm Mode cluster, as detailed in the first tutorial of this series,
    • a single manager node (node-01), with three worker nodes (node-02, node-03, node-04), and
    • direct, command-line access to node-01, or access to a local Docker client configured to communicate with the Docker Engine on node-01.

    The most straightforward configuration can be achieved by following the first tutorial.

    Service Mode

    Services in Swarm Mode are an abstraction of a workload, and comprise one or more tasks, which are implemented as individual containers.

    Services in Docker Swarm have a mode, which can be set to one of two types. The default mode for a newly created service is ‘replicated’, which means that the service comprises a configurable number of replicated tasks. This mode is useful when services need to be horizontally scaled in order to cater for load, and to provide resilience.

    If a service is created without a mode being specified, it defaults to replicated mode with just a single task. It is, however, possible to set the number of replicas when the service is created. For example, if a service needs to be scaled from the outset, we would create the service using the following command, executed on a manager node (node-01):

    $ docker service create --name nginx --replicas 3 nginx
    7oic28vvku4n9pihd13gyt6nk
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
    ID                  NAME                NODE                CURRENT STATE
    vwgrtg0dbvhc        nginx.1             node-02             Running 8 minutes ago
    kri3un3t7b7k        nginx.2             node-03             Running 8 minutes ago
    kjg4ddlmxble        nginx.3             node-04             Running 8 minutes ago
    $ docker service rm nginx
    nginx

    The service is created with three tasks, running on three of the four nodes in the cluster. We could, of course, achieve the same result by creating the service with the single, default replica, and then using the docker service scale command to scale the service to the required number of replicas, as sketched below.
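
    A minimal sketch of that alternative approach (the commands are real, but the exact placement of the tasks will vary from cluster to cluster):

    $ docker service create --name nginx nginx   # one replica by default
    $ docker service scale nginx=3               # scale out to three tasks
    $ docker service ps nginx                    # check where the tasks have been scheduled
    $ docker service rm nginx                    # clean up before the next example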

    Whilst a replicated service allows for any number of tasks to be created for the service, a ‘global’ service results in a single task on every node that is configured to accept tasks (including managers). Global mode is useful where it is desirable or imperative to run a service on every node — an agent for monitoring purposes, for example.

    It’s necessary to use the --mode global config option when creating the service:

    $ docker service create --name nginx --mode global nginx
    xziwvfhasmbydgie1r16dlp2j
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
    ID                  NAME                              NODE                CURRENT STATE
    e4aj4x2dc85j        nginx.d1euoo53in1krtd4z8swkgwxo   node-01             Running 34 seconds ago
    xgg9kn9upsi4        nginx.8eg423bamur5uj2cq2lw5803v   node-04             Running 50 seconds ago
    qfckheylojz1        nginx.txznjotqie2z89le8qbuqy7ew   node-02             Running 50 seconds ago
    x76ca4da1sir        nginx.gu32egf50bk20mnif25b3rh4y   node-03             Running 50 seconds ago
    $ docker node ls
    ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
    8eg423bamur5uj2cq2lw5803v    node-04   Ready   Active        
    d1euoo53in1krtd4z8swkgwxo *  node-01   Ready   Active        Leader
    gu32egf50bk20mnif25b3rh4y    node-03   Ready   Active        
    txznjotqie2z89le8qbuqy7ew    node-02   Ready   Active
    $ docker service rm nginx
    nginx

    This time, each task name gets a suffix which is the ID of the node it is scheduled on (e.g. nginx.d1euoo53in1krtd4z8swkgwxo), rather than a sequential number, as is the case with replicated tasks. This is because each task object is associated with a specific node in the cluster. If a new node joins the cluster, a new task is scheduled on it for each and every service running in global mode.

    Whilst we could have used --mode replicated in conjunction with --replicas 3 in the first example above, it wasn’t necessary, because replicated mode is the default. Once the mode has been set for a service, it cannot be changed to its alternative; the service must be removed and re-created in order to change its mode, as sketched below.
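
    For example, to move the nginx service from replicated to global mode, the only option is to remove it and create it afresh (a sketch; any running tasks are replaced from scratch):

    $ docker service rm nginx
    $ docker service create --name nginx --mode global nginx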

    Scheduling Strategy

    The way that tasks or containers are scheduled on a Swarm Mode cluster is governed by a scheduling strategy. Currently, Swarm Mode has a single scheduling strategy, called ‘spread’. The spread strategy attempts to schedule a service task based on an assessment of the resources available on cluster nodes.

    In its simplest form, this means that tasks are evenly spread across the nodes in a cluster. For example, if we create a service with three replicas, each replicated task will be scheduled on a different node:

    $ docker service create --name nginx-01 --replicas 3 nginx
    27bxqadwqa56p92suscx6256t
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-01
    ID                  NAME                NODE                CURRENT STATE
    r5jq9tijt9jr        nginx-01.1          node-03             Running 4 seconds ago
    rwt7eq01qnvo        nginx-01.2          node-04             Running 4 seconds ago
    5lvtz0csgmco        nginx-01.3          node-01             Running 4 seconds ago

    If we now schedule a single replica for a second service, it will be scheduled on the node with no allocated tasks, which is node-02 in this case:

    $ docker service create --name nginx-02 nginx
    eb1h84c906bfmgovti0hmabr2
    $ docker service ls
    ID                  NAME                MODE                REPLICAS            IMAGE
    27bxqadwqa56        nginx-01            replicated          3/3                 nginx:latest
    eb1h84c906bf        nginx-02            replicated          1/1                 nginx:latest
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-02
    ID                  NAME                NODE                CURRENT STATE
    d3ldybxlxky2        nginx-02.1          node-02             Running 28 seconds ago
    $ docker service rm nginx-01 nginx-02
    nginx-01
    nginx-02

    The one caveat to this simplistic approach to spread-based scheduling occurs when scaling an existing service. When placing a new task, the scheduler first prefers a node, if one exists, that is not already running a task for the same service, irrespective of how many tasks it is running for other services. If every cluster node is already running at least one task for the service, the scheduler selects the node running the fewest tasks for that service, before falling back to the general assessment of all tasks running across all nodes. This is informally referred to as ‘HA scheduling’.
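
    A quick way to observe this behaviour is sketched below; the service names are arbitrary, and the exact placement will depend on the state of your cluster:

    # One nginx-01 task per node, plus three nginx-02 tasks pinned to node-02
    $ docker service create --name nginx-01 --replicas 4 nginx
    $ docker service create --name nginx-02 --replicas 3 --constraint 'node.hostname==node-02' nginx
    # Per the HA scheduling behaviour described above, the four new nginx-01 tasks
    # should land one per node, despite node-02 already being the busiest node overall
    $ docker service scale nginx-01=8
    $ docker service ps --format 'table {{.Name}}\t{{.Node}}' nginx-01
    $ docker service rm nginx-01 nginx-02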

    In the real world, workloads consume resources, and when those workloads co-habit, they need to be good neighbours. Swarm Mode allows a service to be defined with a reservation of, and a limit on, CPU or memory for each of its tasks. Specifying a limit with --limit-cpu or --limit-memory ensures that a service’s tasks do not consume more of the specified resource than the limit defines. In contrast to limits, reserving resources for tasks has a direct bearing on where tasks are scheduled.
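
    For example, a service could be created with both limits and reservations for its tasks (a sketch; the flags are the real CLI options, but the values are arbitrary and should be tuned to the workload):

    $ docker service create --name nginx \
        --limit-cpu 0.5 --limit-memory 256MB \
        --reserve-cpu 0.25 --reserve-memory 128MB \
        nginx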

    Let’s see how reserving resources works in practice. The four nodes in our cluster have 1 GB of memory each. If the nodes you are using to follow this tutorial have more or less memory, you will need to adjust the reserved memory values appropriately. First, we’ll create a service with three replicas, and reserve 900 MB of memory for each task:

    $ docker service create --name nginx-01 --reserve-memory 900Mb --replicas 3 nginx
    nqhtb10hrjtqallqpz36ickow
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-01
    ID                  NAME                NODE                CURRENT STATE
    y5o1yhkrgls3        nginx-01.1          node-03             Running 53 seconds ago
    nhfc1lirgton        nginx-01.2          node-04             Running 53 seconds ago
    uq3lqdzxwc4y        nginx-01.3          node-01             Running 53 seconds ago

    The service’s tasks are scheduled on three different nodes, just as we’d expect with Swarm’s use of the spread scheduling strategy. Now, let’s deploy another service, this time with four replicas, and reserve 200 MB of memory for each task:

    $ docker service create --name nginx-02 --reserve-memory 200Mb --replicas 4 nginx
    dmvfxpghle49d3b448bhfj60h
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx-02
    ID                  NAME                NODE                CURRENT STATE
    wnzyipc4n6s4        nginx-02.1          node-02             Running 23 seconds ago
    ye9z4rn5qeth        nginx-02.2          node-02             Running 23 seconds ago
    j6s6hxviyhrf        nginx-02.3          node-02             Running 23 seconds ago
    vpngenvs6whg        nginx-02.4          node-02             Running 23 seconds ago
    $ docker service rm nginx-01 nginx-02
    nginx-01
    nginx-02

    Ordinarily, with the spread scheduling strategy, we’d expect one task to end up on node-02, and the others to end up on node-01, node-03 and node-04. However, each of node-01, node-03 and node-04 already has 900 MB of its 1 GB reserved, so none of them can satisfy a further 200 MB reservation, and the remaining tasks are scheduled on node-02 instead.

    An amount of CPU can also be reserved for tasks, and is treated in exactly the same way with regard to scheduling. Note that it is possible to specify fractions of CPU (e.g. --reserve-cpu 1.5), as the reserve is based on a calculation which involves the CFS Quota and Period.

    Be aware that if the scheduler is unable to allocate a service task, because insufficient resources are available on cluster nodes, the task will remain in a ‘pending’ state until sufficient resources become available for it to be scheduled.
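
    On our 1 GB nodes, this is easy to demonstrate by reserving more memory than any node can provide (a sketch; the service name is arbitrary, and --detach stops the CLI waiting for a convergence that can never happen):

    $ docker service create --detach --name greedy --reserve-memory 2GB nginx
    # No node can satisfy the reservation, so the task has no node assigned,
    # and its CURRENT STATE is reported as pending
    $ docker service ps --format 'table {{.Name}}\t{{.Node}}\t{{.CurrentState}}' greedy
    $ docker service rm greedy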

    Service Constraints

    Whilst the scheduling aspect of orchestration removes the headache of manually deploying container workloads, sometimes it’s convenient (and sometimes imperative) to influence where workloads are scheduled. We might want manager nodes to be excluded from consideration. We may need to ensure a stateful service is scheduled on a node where the corresponding data resides. We might want a service to make use of specialist hardware associated with a particular node, and so on.

    Swarm Mode uses the concept of constraints, which are applied to services, in order to influence where tasks are scheduled. A constraint is applied with the --constraint config option, which takes an expression as a value, in the form <attribute><operator><value>. Swarm Mode has a number of in-built attributes, but it’s also possible to specify arbitrary attributes using labels associated with nodes.

    For the purposes of demonstrating the use of constraints, we can use the in-built node.role attribute to specify that we only want a service to be scheduled on worker nodes:

    $ docker service create --name nginx --mode global --constraint 'node.role==worker' nginx
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
    ID                  NAME                              NODE                CURRENT STATE
    s01wg79uxrhj        nginx.qnxgs3l6ddoau1yc23cchd9zd   node-04             Running 8 seconds ago
    tain2qcu8gog        nginx.kldh65668vz3xyt1es2kapyjr   node-03             Running 49 seconds ago
    vdinug90035m        nginx.4v8esk94i1bgbtzhm7xp39b52   node-02             Running 49 seconds ago
    $ docker service rm nginx
    nginx

    We used the ‘global’ mode for the service, and would normally have expected a task to be scheduled on every node, including the manager node, node-01. The constraint, however, limited the deployment of the service to the workers only. We could have achieved the same using the constraint expression node.role!=manager.
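
    The equivalent command with the negated expression would be (a sketch; it yields the same placement on the workers):

    $ docker service create --name nginx --mode global --constraint 'node.role!=manager' nginx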

    Now, let’s assume we want to deploy our service to a specific node. First, we need to label the node in question, preferably using the Docker object labelling recommendations:

    $ docker node update --label-add 'com.acme.server=db' node-03
    node-03
    $ docker node inspect -f '{{index .Spec.Labels "com.acme.server"}}' node-03
    db

    To schedule a single replica to this node, we must specify a suitable constraint:

    $ docker service create --name redis --constraint 'node.labels.com.acme.server==db' redis
    iavdmioaz1omruo86b0d1xxvp
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' redis
    ID                  NAME                NODE                CURRENT STATE
    nktfpqouwjcb        redis.1             node-03             Running about a minute ago
    $ docker service rm redis
    redis

    The single replica for the service has been scheduled on node-03, the node carrying the label referenced by the constraint. Any task belonging to a service with a constraint applied, which cannot be scheduled due to other constraints or a lack of resources, will remain in a ‘pending’ state until it becomes possible to schedule it.

    Scheduling Preferences

    Whilst constraints provide the ability to deterministically influence the scheduling of tasks, placement preferences provide a ‘soft’ means of influencing scheduling. Placement preferences direct the scheduler to account for expressed preferences, but if they can’t be met due to resource limitations or defined constraints, then scheduling continues according to the normal spread strategy. The placement preference scheme was born from a need to schedule tasks based on topology.

    Let’s schedule a service based on the cluster’s nodes, and their location in (pretend) availability zones. We’ll place node-01 and node-02 in zone ‘a’, node-03 in zone ‘b’, and node-04 in zone ‘c’. When we specify a placement preference based on a zone-related label, the tasks for the service in question will be scheduled equally across the zones. To create the labels for the nodes:

    $ docker node update --label-add 'com.acme.zone=a' node-01
    node-01
    $ docker node update --label-add 'com.acme.zone=a' node-02
    node-02
    $ docker node update --label-add 'com.acme.zone=b' node-03
    node-03
    $ docker node update --label-add 'com.acme.zone=c' node-04
    node-04

    Now that the nodes have their labels, we can deploy a service with a placement preference, and observe the results:

    $ docker service create --name nginx --placement-pref 'spread=node.labels.com.acme.zone' --replicas 12 nginx
    uiany6ohly6h2lnn6r782g44z
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
    ID                  NAME                NODE                CURRENT STATE
    sd54gvigyu54        nginx.1             node-03             Running about a minute ago
    mrhsbes9786f        nginx.2             node-02             Running about a minute ago
    142r9nhtp8e2        nginx.3             node-03             Running about a minute ago
    vd4vueih1smb        nginx.4             node-04             Running about a minute ago
    qcsk4flisdnl        nginx.5             node-04             Running about a minute ago
    o5nfd0tmjxdv        nginx.6             node-01             Running about a minute ago
    l55q6tz86tua        nginx.7             node-02             Running about a minute ago
    soq3lyh2k02o        nginx.8             node-01             Running about a minute ago
    i0s6s7se8i3r        nginx.9             node-03             Running about a minute ago
    zb7t9ef0lovo        nginx.10            node-03             Running about a minute ago
    a44o5f437obx        nginx.11            node-04             Running about a minute ago
    obkyhe6iu4r3        nginx.12            node-04             Running about a minute ago
    $ docker service rm nginx
    nginx

    The tasks have been scheduled equally amongst the three ‘zones’, with node-01 and node-02 acquiring two tasks apiece, whilst node-03 and node-04 have been allocated four tasks each.

    The outcome of the deployment of this service would have been very different if we had applied a resource reservation in conjunction with the placement preference. As each node is configured with 1 GB of memory, a node can only accommodate three 300 MB reservations; if we created the service with --reserve-memory 300Mb, the placement preferences could not physically be honoured by the scheduler, and each node would end up with three tasks apiece, instead.
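
    That variation would look like this (a sketch; the reservation forces the scheduler to deviate from the zone-based preference):

    $ docker service create --name nginx \
        --placement-pref 'spread=node.labels.com.acme.zone' \
        --reserve-memory 300Mb --replicas 12 nginx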

    Multiple placement preferences can be expressed for a service, using --placement-pref multiple times, with the order of the preferences being significant. For example, if two placement preferences are defined, the tasks will be spread between the nodes satisfying the first expressed preference, before being further divided according to the second preference. This allows refined placement of tasks, to effect the high availability of services.
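
    For example, tasks could be spread first across zones, and then across racks within each zone (a sketch; the com.acme.rack label is purely illustrative, and would need to be added to the nodes with docker node update --label-add first):

    $ docker service create --name nginx \
        --placement-pref 'spread=node.labels.com.acme.zone' \
        --placement-pref 'spread=node.labels.com.acme.rack' \
        --replicas 12 nginx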

    Rescheduling on Failure

    Those who have spent time with an ops-oriented hat on can identify with the adage, “Anything that can go wrong, will go wrong”. Workloads will fail. Cluster nodes, or other infrastructure components, will fail, or become unavailable for periods of time. Ensuring the continued operation of a deployed service, and the recovery to a pre-defined status quo, is an important component of orchestration.

    Swarm Mode uses a declarative approach to workloads, and employs ‘desired state reconciliation’ in order to maintain the desired state of the cluster. If components of the cluster fail, whether they be individual tasks, or a cluster node, Swarm’s reconciliation loop attempts to restore the desired state for all workloads affected.

    The easiest way for us to demonstrate this is to simulate a node becoming unavailable in the cluster. We can achieve this with relative ease, by changing the ‘availability’ of a node in the cluster for scheduling purposes. When we issue the command docker node ls, one of the node attributes reported on is ‘availability’, which normally yields ‘Active’:

    $ docker node ls
    ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
    8eg423bamur5uj2cq2lw5803v    node-04   Ready   Active        
    d1euoo53in1krtd4z8swkgwxo *  node-01   Ready   Active        Leader
    gu32egf50bk20mnif25b3rh4y    node-03   Ready   Active        
    txznjotqie2z89le8qbuqy7ew    node-02   Ready   Active

    Before we alter a node’s availability, let’s create a service which has a replica scheduled on each node in the cluster:

    $ docker service create --name nginx --replicas 4 nginx
    vnk7mx1liy11gfpu14z7sexg2
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
    ID                  NAME                NODE                CURRENT STATE
    w4mkj7645d5k        nginx.1             node-01             Running 29 seconds ago
    fdre59b5ijhj        nginx.2             node-03             Running 29 seconds ago
    skrmh7kievq7        nginx.3             node-02             Running 29 seconds ago
    o5w3eyh5dyo0        nginx.4             node-04             Running 29 seconds ago

    Now, let’s set the availability of node-02 to ‘drain’, which will take it out of the pool for scheduling purposes, and terminate the task nginx.3. It will then get rescheduled on one of the other nodes in the cluster:

    $ docker node update --availability drain node-02
    node-02
    $ docker node ls
    ID                           HOSTNAME  STATUS  AVAILABILITY  MANAGER STATUS
    txznjotqie2z89le8qbuqy7ew    node-02   Ready   Drain         
    d1euoo53in1krtd4z8swkgwxo *  node-01   Ready   Active        Leader
    gu32egf50bk20mnif25b3rh4y    node-03   Ready   Active        
    8eg423bamur5uj2cq2lw5803v    node-04   Ready   Active        
    $ docker service ps --format 'table {{.ID}}\t{{.Name}}\t{{.Node}}\t{{.CurrentState}}' nginx
    ID                  NAME                NODE                CURRENT STATE
    w4mkj7645d5k        nginx.1             node-01             Running 8 minutes ago
    fdre59b5ijhj        nginx.2             node-03             Running 8 minutes ago
    t86fdnxuze0m        nginx.3             node-03             Running 20 seconds ago
    skrmh7kievq7         \_ nginx.3         node-02             Shutdown 20 seconds ago
    o5w3eyh5dyo0        nginx.4             node-04             Running 8 minutes ago

    The output from docker service ps shows the history for the task in ‘slot 3’; a container was shut down on node-02, and then replaced with a container running on node-03.
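
    Once the node is fit to accept workloads again, its availability can be set back to ‘active’. Note that existing tasks are not rebalanced onto the returning node automatically; it simply becomes eligible for new tasks:

    $ docker node update --availability active node-02
    $ docker node ls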

    Conclusion

    This tutorial has provided an overview of Docker Swarm Mode’s scheduling capabilities. Like most projects in the open source domain, SwarmKit, the project that Docker Swarm Mode is based on, continues to evolve with each new release, and it’s probable that its scheduling capabilities will be further enhanced over time. In the meantime, we’ve highlighted:

    • Swarm’s default spread scheduling strategy,
    • How resource reservations and constraints affect scheduling,
    • How it’s possible to influence the scheduler, using placement preferences, and
    • Swarm’s approach to rescheduling on failure.

    In the next tutorial, we’ll explore how deployed services are consumed, internally and externally.

    If you have any questions and/or comments, feel free to leave them in the section below.

    Want to continuously deliver your applications made with Docker? Check out Semaphore’s Docker platform with full layer caching for tagged Docker images.

    Written by:
    Nigel is an independent Docker specialist who writes, teaches, and consults all things Docker-related. Based in the UK, he travels regularly, and can be found at windsock.io, and on GitHub.