Docker Swarm Cluster Guide — Services, Stacks, and High Availability

What Is Docker Swarm?

Docker Swarm is a container orchestration tool built into Docker. It groups multiple Docker hosts into a single cluster to deploy and manage services. It has a simpler setup than Kubernetes and can be operated using the Docker CLI alone, making it suitable for small to medium-scale services.

As an analogy, if a single Docker host is one chef, Swarm is a kitchen with multiple chefs working together. The head chef (manager node) distributes orders (services), and the cooks (worker nodes) prepare the dishes (containers). If one cook is unavailable, another takes over.

Component       Role
Manager node    Cluster management, scheduling, storing service definitions
Worker node     Running containers
Service         Deployment unit (image + replica count + network config)
Task            Individual container instance of a service
Stack           Application unit that bundles multiple services

Cluster Initialization

Here’s the process of setting up a Swarm cluster.

# === Run on the manager node ===
# Initialize Swarm (current host becomes manager)
docker swarm init --advertise-addr 192.168.1.10
# Swarm initialized: current node (abc123) is now a manager.
#
# To add a worker to this swarm, run the following command:
#   docker swarm join --token SWMTKN-1-xxx 192.168.1.10:2377
#
# To add a manager to this swarm, run 'docker swarm join-token manager'

# Check worker join token
docker swarm join-token worker
# SWMTKN-1-xxx...

# Check manager join token (3 managers recommended for high availability)
docker swarm join-token manager

# === Run on the worker node ===
docker swarm join --token SWMTKN-1-xxx 192.168.1.10:2377
# This node joined a swarm as a worker.

# === Check cluster status on manager node ===
docker node ls
# ID           HOSTNAME    STATUS  AVAILABILITY  MANAGER STATUS  ENGINE VERSION
# abc123 *     manager-1   Ready   Active        Leader          24.0.7
# def456       worker-1    Ready   Active                        24.0.7
# ghi789       worker-2    Ready   Active                        24.0.7

For high availability, configure an odd number of manager nodes (3, 5, or 7). The Raft consensus algorithm elects a leader, and a majority of managers must be alive for the cluster to operate normally.
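The majority rule is plain integer arithmetic, so it can be tabulated with ordinary shell (no Docker required): quorum is floor(N/2) + 1 and the tolerated failure count is floor((N − 1)/2).

```shell
# Raft quorum math for N Swarm managers: a majority (floor(N/2) + 1)
# must be reachable for the cluster to schedule or change anything.
for n in 1 3 4 5 7; do
  quorum=$(( n / 2 + 1 ))        # managers needed for consensus
  tolerate=$(( (n - 1) / 2 ))    # manager failures the cluster survives
  echo "managers=$n quorum=$quorum tolerates=$tolerate"
done
# managers=3 quorum=2 tolerates=1
# managers=4 quorum=3 tolerates=1
```

Note that 4 managers still tolerate only 1 failure (quorum 3), which is why an even-sized manager set adds Raft overhead without adding resilience.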

Service Deployment

In Swarm, containers are deployed as services. Manage them with the docker service command.

# Create a service (3 Nginx web servers)
docker service create \
  --name web \
  --replicas 3 \
  --publish 80:80 \
  --update-delay 10s \
  --update-parallelism 1 \
  --restart-condition on-failure \
  nginx:alpine

# List services
docker service ls
# ID           NAME  MODE        REPLICAS  IMAGE
# xyz123       web   replicated  3/3       nginx:alpine

# Detailed service status (which node each task runs on)
docker service ps web
# ID         NAME    IMAGE          NODE       DESIRED STATE  CURRENT STATE
# aaa111     web.1   nginx:alpine   manager-1  Running        Running 2 minutes ago
# bbb222     web.2   nginx:alpine   worker-1   Running        Running 2 minutes ago
# ccc333     web.3   nginx:alpine   worker-2   Running        Running 2 minutes ago

# View service logs (aggregated from all tasks)
docker service logs -f web

# Scale service (3 → 5)
docker service scale web=5
# web scaled to 5
# overall progress: 5 out of 5 tasks

# Detailed service information
docker service inspect --pretty web

Thanks to Swarm’s routing mesh, the service is reachable through any node in the cluster. For example, even if the web tasks are running only on worker-1 and worker-2, a request to manager-1’s port 80 is automatically forwarded to one of the nodes actually running a task.
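When the routing mesh gets in the way (for example, when the service needs to see the client’s real source IP), the Compose long-form port syntax can publish in host mode instead. A sketch, pairing it with a global service so every node still answers on port 80:

```yaml
# Sketch: host-mode publishing binds the port directly on each node
# that runs a task, bypassing the ingress routing mesh.
services:
  web:
    image: nginx:alpine
    ports:
      - target: 80
        published: 80
        mode: host    # default is "ingress" (routing mesh)
    deploy:
      mode: global    # one task per node, so each node serves port 80 locally
```

The trade-off: with host mode, a node that runs no task does not answer on that port.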

Rolling Updates

A rolling update replaces a service’s image a few tasks at a time, without interrupting the service as a whole.

# Update service image (rolling update)
docker service update \
  --image nginx:1.25-alpine \
  --update-delay 15s \
  --update-parallelism 1 \
  --update-failure-action rollback \
  --update-order start-first \
  web

# Check update progress
docker service ps web
# ID         NAME      IMAGE               NODE       DESIRED STATE  CURRENT STATE
# ddd444     web.1     nginx:1.25-alpine   manager-1  Running        Running 10 seconds ago
# aaa111     \_web.1   nginx:alpine        manager-1  Shutdown       Shutdown 15 seconds ago
# bbb222     web.2     nginx:alpine        worker-1   Running        Running 5 minutes ago
# ccc333     web.3     nginx:alpine        worker-2   Running        Running 5 minutes ago

# Manual rollback on failure
docker service rollback web
# web rolled back

# Check rollback status
docker service inspect --pretty web
# UpdateStatus:
#  State: rollback_completed

Update option descriptions:

Option                             Description
--update-parallelism 1             Update 1 task at a time
--update-delay 15s                 Wait 15 seconds between task updates
--update-failure-action rollback   Automatically roll back on failure
--update-order start-first         Start the new task first, then stop the old one
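As a rough sanity check, the minimum wall-clock time a rollout spends just waiting follows from these two options (a sketch in shell arithmetic; real rollouts also include container start-up and health-check time):

```shell
# Tasks update in ceil(replicas / parallelism) batches, with `delay`
# seconds of waiting between consecutive batches.
replicas=5; parallelism=1; delay=15
batches=$(( (replicas + parallelism - 1) / parallelism ))
min_wait=$(( (batches - 1) * delay ))
echo "batches=$batches minimum_wait=${min_wait}s"   # batches=5 minimum_wait=60s
```

This is why large services often raise `--update-parallelism` once confidence in the health checks is established.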

Stack Deployment

The docker stack command deploys multiple services at once from a file in the Docker Compose format.

# stack.yml — web application stack
version: "3.8"

services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    networks:
      - frontend
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

  api:
    image: my-registry.com/my-app:latest
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgres://user:pass@db:5432/myapp
    networks:
      - frontend
      - backend
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "1.0"
          memory: 512M
        reservations:
          cpus: "0.25"
          memory: 128M
      update_config:
        parallelism: 1
        delay: 15s
        order: start-first

  db:
    image: postgres:16-alpine
    environment:
      - POSTGRES_DB=myapp
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
    volumes:
      - pgdata:/var/lib/postgresql/data
    networks:
      - backend
    deploy:
      replicas: 1
      placement:
        constraints:
          # Run DB only on manager node (volume consistency)
          - node.role == manager

  redis:
    image: redis:7-alpine
    networks:
      - backend
    deploy:
      replicas: 1

networks:
  frontend:
    driver: overlay
  backend:
    driver: overlay
    internal: true

volumes:
  pgdata:

# Deploy stack
docker stack deploy -c stack.yml myapp

# List stacks
docker stack ls
# NAME    SERVICES  ORCHESTRATOR
# myapp   4         Swarm

# Check stack services
docker stack services myapp
# ID         NAME          MODE        REPLICAS  IMAGE
# aaa        myapp_nginx   replicated  2/2       nginx:alpine
# bbb        myapp_api     replicated  3/3       my-registry.com/my-app:latest
# ccc        myapp_db      replicated  1/1       postgres:16-alpine
# ddd        myapp_redis   replicated  1/1       redis:7-alpine

# Check all tasks in the stack
docker stack ps myapp

# Update stack (run same command after modifying stack.yml)
docker stack deploy -c stack.yml myapp

# Remove stack
docker stack rm myapp

Node Management

Cluster node availability, labels, and membership are managed with the docker node command.

# Check node status
docker node ls

# Set node to maintenance mode (tasks migrate to other nodes)
docker node update --availability drain worker-1
# All tasks on worker-1 are rescheduled to other nodes

# Reactivate after maintenance
docker node update --availability active worker-1

# Add labels to nodes (used for placement constraints)
docker node update --label-add zone=ap-northeast-2a worker-1
docker node update --label-add ssd=true worker-2

# Label-based placement (run DB only on SSD nodes)
docker service create \
  --name db \
  --constraint 'node.labels.ssd == true' \
  postgres:16-alpine

# Remove a node from the Swarm
# Run on the worker node:
docker swarm leave
# Remove from manager:
docker node rm worker-1

Health Checks and Service Discovery

Swarm provides built-in health checks and load balancing.

# Add health check to Dockerfile
FROM node:22-alpine
WORKDIR /app
COPY . .
RUN npm ci --omit=dev

# Health check every 30s, 5s timeout, unhealthy after 3 failures
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

USER node
EXPOSE 3000
CMD ["node", "server.js"]

# Set health check when creating a service
docker service create \
  --name api \
  --replicas 3 \
  --health-cmd "wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1" \
  --health-interval 30s \
  --health-timeout 5s \
  --health-retries 3 \
  my-app:latest

# Check health check results in service status
docker service ps api
# Only healthy tasks receive traffic
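The `--health-retries` semantics can be sketched as a tiny state machine: a task flips to unhealthy only after that many consecutive probe failures, and any success resets the counter (an illustration of the rule, not Docker’s actual code):

```shell
# Simulate consecutive-failure counting with retries=3.
retries=3
failures=0
state=healthy
for probe in ok fail fail fail; do
  if [ "$probe" = fail ]; then
    failures=$(( failures + 1 ))   # another consecutive failure
  else
    failures=0                     # a success resets the streak
  fi
  if [ "$failures" -ge "$retries" ]; then
    state=unhealthy
  fi
done
echo "$state"   # unhealthy: three consecutive failures hit the retry limit
```

An intermittent single failure therefore never marks a task unhealthy; only a sustained outage does.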

Practical Tips

  • Number of manager nodes: In production, configure 3 or more managers (odd number). With 1, it becomes a single point of failure (SPOF). With 2, consensus cannot be reached if 1 goes down. With 3, the cluster can tolerate 1 failure.
  • Workloads on manager nodes: Manager nodes can run tasks, but in large clusters, it’s more stable to set managers to drain mode so they focus only on management operations.
  • Secrets management: Use docker secret to store encrypted secrets in the Raft log and mount them to containers at /run/secrets/. This is more secure than environment variables.
  • Rolling update strategy: Using --update-order start-first ensures the new task starts successfully before the old task is stopped, preventing downtime. Set --update-failure-action rollback for automatic rollback on failure.
  • Swarm vs Kubernetes: Swarm is suitable for clusters with fewer than 10 nodes, teams familiar with Docker Compose, and situations requiring quick setup. Choose Kubernetes when you need large-scale clusters, complex scheduling, or a rich ecosystem.
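The secrets bullet above can be sketched as a stack fragment. Here db_password is a hypothetical secret created beforehand on a manager, and the official postgres image natively reads `*_FILE` environment variables:

```yaml
# Sketch: inject a pre-created secret instead of a plaintext env var.
# Create it first on a manager:
#   printf 'a-strong-password' | docker secret create db_password -
services:
  db:
    image: postgres:16-alpine
    environment:
      # postgres reads the password from the mounted secret file
      - POSTGRES_PASSWORD_FILE=/run/secrets/db_password
    secrets:
      - db_password

secrets:
  db_password:
    external: true   # created outside the stack with `docker secret create`
```

Unlike environment variables, the secret is encrypted at rest in the Raft log and never appears in `docker inspect` output.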
