[SOLVED]: How to get Drone working in Docker (Swarm Mode) with docker stack deploy

Heya all!

After many weeks of frustration, digging, and poking around, I have finally found a configuration that works in the following scenario:

  • Docker (Swarm Mode)
  • Deployed as a service with docker stack deploy

There are some known issues with load balancers and Drone server<->agent communication,
and this apparently also includes Docker's overlay networking, though I suspect this has more to do with VIPs.

This is my working configuration:

version: "3.3"

services:
  drone-server:
    image: drone/drone:latest
    ports:
      - target: 9000
        published: 9000
        protocol: tcp
        mode: host
    environment:
      - DRONE_DEBUG=true
      - DRONE_OPEN=true
      - DRONE_HOST=https://ci.mydomain.com
      - DRONE_GOGS=true
      - DRONE_GOGS_PRIVATE_MODE=true
      - DRONE_GOGS_URL=https://git.mydomain.com
      - DRONE_SECRET=XXXX
      - DRONE_ADMIN=admin
    networks:
      - traefik
    volumes:
      - dronedata:/var/lib/drone
    deploy:
      placement:
        constraints:
          - "node.hostname == node1.mydomain.com"
      endpoint_mode: dnsrr
      labels:
        - "traefik.enable=true"
        - "traefik.port=8000"
        - "traefik.backend=ci"
        - "traefik.docker.network=traefik"
        - "traefik.frontend.rule=Host:ci.mydomain.com"
      restart_policy:
        condition: on-failure
      replicas: 1

  drone-agent:
    image: drone/agent:latest
    command: agent
    networks:
      - bridge
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - DRONE_DEBUG=true
      - DRONE_SERVER=10.0.0.10:9000
      - DRONE_SECRET=XXX
    deploy:
      placement:
        constraints:
          - "node.role != manager"
      restart_policy:
        condition: on-failure
      replicas: 3

networks:
  bridge:
    external: true
  traefik:
    external: true

volumes:
  dronedata:
    external: true

To deploy:

$ docker stack deploy -c ci.yml ci

I believe the main thing that got this working correctly and reliably was publishing the port to the host.
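
For comparison, here is a minimal sketch of the two ways of publishing the gRPC port (only the ports block matters; everything else is trimmed). The short syntax routes traffic through the swarm ingress mesh, while the long syntax with mode: host binds the port directly on the node running the server:

# default: published through the swarm ingress routing mesh
ports:
  - "9000:9000"

# what I use: bound directly on the host running drone-server
ports:
  - target: 9000
    published: 9000
    protocol: tcp
    mode: host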

Some notes:

  • I run the server on node1 (10.0.0.10)
  • I run the agents on all other nodes
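
A couple of commands I use to check that the stack came up and to follow the logs (this assumes the stack name ci from the deploy command above, which prefixes the service names):

$ docker stack ps ci
$ docker service logs -f ci_drone-server
$ docker service logs -f ci_drone-agent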

Hi prologic

I was going to dig into this topic sooner or later, so thanks for the inspiration :). I'm curious about the issue you faced: was it unstable or not working without the hardcoded node1 IP as the server address? Or without server:9000 being published? And why did you have to specify endpoint_mode?

Best regards,

It was unstable. You’d have to restart the server+agent(s) to get things
going again.

The endpoint_mode can be one of the following:

endpoint_mode: vip - Docker assigns the service a virtual IP (VIP), which
acts as the “front end” for clients to reach the service on a network.
Docker routes requests between the client and available worker nodes for
the service, without client knowledge of how many nodes are participating
in the service or their IP addresses or ports. (This is the default.)

endpoint_mode: dnsrr - DNS round-robin (DNSRR) service discovery does not
use a single virtual IP. Docker sets up DNS entries for the service such
that a DNS query for the service name returns a list of IP addresses, and
the client connects directly to one of these. DNS round-robin is useful in
cases where you want to use your own load balancer, or for Hybrid Windows
and Linux applications.
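
If you want to double-check which mode a running service actually got, something like this should work (assuming the ci stack name from my deploy command, so the service is ci_drone-server); it should print dnsrr:

$ docker service inspect --format '{{ .Spec.EndpointSpec.Mode }}' ci_drone-server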

And changing the way the port is published avoids using the overlay
network at all.

The gRPC used internally by Drone doesn't seem to like going through
overlay networks much (it's unclear why),
and I'm not sure how the VIP comes into play (probably not needed?).

cheers
James

Oh, and I think I remember why publishing the port to the host is required:
otherwise the host won't have a bound listening interface, so it simply won't work.
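
A quick way to double-check this on the node running the server (node1 in my case):

$ ss -lnt | grep ':9000'

If nothing is listed, the gRPC port isn't bound on the host and the agents won't be able to reach it.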

You do not have to expose port 9000 if you only build on agents inside the swarm.
On the agent, use the service name as the DNS name and you'll be good (see the sketch below).
And please do not use the bridge network in swarm if you need port communication between nodes.
Overlay is the way to go.
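
To make that concrete, here is a minimal sketch of what that looks like on the agent side (trimmed down; my full stack is below). The assumption is that both services share an overlay network and the agent reaches the server by its service name:

drone-agent:
  image: drone/agent:latest
  command: agent
  environment:
    - DRONE_SERVER=drone-server:9000   # the service name resolves via the overlay network's built-in DNS
    - DRONE_SECRET=XXX
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  networks:
    - ci

networks:
  ci:
    driver: overlay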

I'm using Drone in swarm and it works flawlessly.
I can share my stack if needed.

@zaggash Please share your config. You've completely missed the point here.
Please look in the forum for other references and you'll find comments from
@bradrydzewski clearly stating several issues with server<->agent comms
over load balancers or reverse proxies, which includes the overlay
networking in Docker.

Here is my stack. I'm using the overlay network without publishing anything other than my Traefik ports.
All the magic Drone does goes through swarm and the overlay network.
I had an issue with builds stuck in "pending", solved by setting endpoint_mode: dnsrr on the server; that gets around the VIP translation (which behaves like a reverse proxy in front of port 9000).

The stack is started with docker stack deploy -c docker-compose.yml ci,
which explains the ci prefix ;)

version: '3.4'

services:
  lb:
    image: traefik:1.4.5
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
    command:
      - "--graceTimeOut=5s"
      - "--logLevel=info"
      - "--defaultentrypoints=http,https"
      - "--entryPoints=Name:http Address::80 Redirect.EntryPoint:https"
      - "--entryPoints=Name:https Address::443 TLS"
      - "--web"
      - "--docker"
      - "--docker.domain=yourdomain.com"
      - "--docker.swarmmode=true"
      - "--docker.exposedbydefault=false"
      - "--acme"
      #- "--acme.caServer=https://acme-staging.api.letsencrypt.org/directory"
      - "--acme.email=webmaster@yourdomain.fr"
      - "--acme.entryPoint=https"
      - "--acme.onhostrule=true"
      - "--acme.storage=/etc/traefik/acme/acme.json"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "/dev/null:/traefik.toml"
      - "/opt/docker_data/traefik/:/etc/traefik/acme"
    networks:
      - gateway
    ports:
      - "80:80/tcp"
      - "443:443/tcp"
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: pause
        order: start-first
        monitor: 30s
      restart_policy:
        condition: on-failure
        max_attempts: 3


  drone-server:
    image: drone/drone
    logging:
      driver: json-file
      options:
        max-size: "5m"
        max-file: "3"
    environment:
      - DRONE_DEBUG=true
      - DRONE_OPEN=false
      - DRONE_GITHUB=true
      - DRONE_ADMIN=_HIDDEN_
      - DRONE_GITHUB_URL=https://github.com
      - DRONE_GITHUB_CLIENT=_HIDDEN_
      - DRONE_GITHUB_SECRET=_HIDDEN_
      - DRONE_GITHUB_CONTEXT=continuous-integration/drone
      - DRONE_GITHUB_SCOPE=repo,repo:status,user:email,read:org
      - DRONE_HOST=https://drone.yourdomain.com
      - DRONE_SECRET=_HIDDEN_
    volumes:
      - drone-sqlite:/var/lib/drone/
    networks:
      - gateway
      - ci
    deploy:
      placement:
        constraints:
          - node.role!=manager
      labels:
      - "traefik.backend=drone"
      - "traefik.frontend.rule=Host:drone.yourdomain.com"
      - "traefik.port=8000"
      - "traefik.docker.network=ci_gateway"
      - "traefik.enable=true"
      endpoint_mode: dnsrr
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: pause
        order: start-first
        monitor: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

  drone-agent:
    image: drone/agent
    command: agent
    logging:
      driver: json-file
      options:
        max-size: "5m"
        max-file: "3"
    depends_on:
      - drone-server
    environment:
      - DRONE_DEBUG=true
      - DRONE_SERVER=drone-server:9000
      - DRONE_SECRET=_HIDDEN_
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - ci
    deploy:
      placement:
        constraints:
          - node.role!=manager
        preferences:
          - spread: node.labels.ci
      mode: replicated
      replicas: 10
      update_config:
        parallelism: 2
        delay: 10s
        failure_action: pause
        order: start-first
        monitor: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

volumes:
  drone-sqlite:

networks:
  gateway:
  ci:

So it looks like you're still using the overlay networking here, but
avoiding the VIP endpoint discovery and opting instead for DNS round-robin.

Is that what makes all the difference in stability? (As I said, there are
some open issues around this and overlay networking in general upstream in moby.)

Yep; DNS-RR solves it because it avoids the load balancing via the VIP.
Instead, the container IP is seen directly by the agents.
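
You can see that from inside the network: with dnsrr, a lookup of the service name returns the task IPs directly rather than a single VIP. A quick way to check (this assumes the stack is deployed as ci, so the overlay network is named ci_ci, and that the network is marked attachable: true so a one-off container can join it):

$ docker run --rm --network ci_ci busybox nslookup drone-server

With the default vip mode you would get one virtual IP back instead of the individual container IPs.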

Yeah, makes sense! I'll update my config later to match :)

Glad we finally sorted this out; I've been struggling to get a stable Drone
CI setup going for a few months :) (in what precious little spare time I have!)

cheers
James

Hi zaggash,

I've tried your solution for running Drone in Docker Swarm. Unfortunately my traefik complains that the drone-server and agent services are badly configured ("ignored endpoint-mode not supported"). I guess the reason is this: https://github.com/containous/traefik/blob/master/provider/docker/docker.go#L379 (TL;DR: traefik does not support services with endpoint_mode dnsrr).

Do you have any idea how I could get your stack going? (BTW, prologic's hack with the hardcoded IP address does work, but it's not convenient.)
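
For what it's worth, the warning shows up in the Traefik service logs; assuming the stack is deployed as ci with the lb service name from the compose file above, you can watch for it with:

$ docker service logs -f ci_lb | grep -i endpoint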

I see.
I never saw this message in my stack.
That's weird, I cannot explain it.

How many drone-server instances do you start? More than one?
Did you check whether drone-server starts before or after traefik? (Try starting it first.)

My setup is a swarm cluster with 2 machines (manager + worker) with the following compose file (basically I removed all the SSL stuff, since I've got my own SSL termination in front of this stack, and I run a single instance of each service). I tried starting traefik as the first service and as the last service, and I've tried running all services on a single node (the manager) as well as on separate nodes, and traefik still complains. The only idea I've got is that traefik introduced this "feature" in a relatively new version (my stack was created about 2 days ago) and you are running an older version without this restriction… (I'm just wildly guessing.)

version: '3.4'
services:
  lb:
    image: traefik:1.5
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"
    command:
      - "--graceTimeOut=5s"
      - "--logLevel=info"
      - "--defaultentrypoints=http"
      - "--entryPoints=Name:http Address::80"
      - "--web"
      - "--docker"
      - "--docker.domain=__MYDOMAIN.COM"
      - "--docker.swarmmode=true"
      - "--docker.exposedbydefault=false"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
      - "/dev/null:/traefik.toml"
    networks:
      - gateway
    ports:
      - "80:80/tcp"
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: pause
        order: start-first
        monitor: 30s
      restart_policy:
        condition: on-failure
        max_attempts: 3

  drone-server:
    image: drone/drone:0.8
    volumes:
      - /var/lib/drone:/var/lib/drone/
#    ports:
#      - target: 9000
#        published: 9000
#        protocol: tcp
#        mode: host
    environment:
      - DRONE_OPEN=true
      - DRONE_HOST=https://MACHINE.__MYDOMAIN.COM
      - DRONE_BITBUCKET=true
      - DRONE_BITBUCKET_CLIENT=SECRET
      - DRONE_BITBUCKET_SECRET=SECRET
      - DRONE_SECRET=SECRET
      - HTTPS_PROXY=MY_HTTP_PROXY
      - HTTP_PROXY=MY_HTTP_PROXY
      - https_proxy=MY_HTTP_PROXY
      - http_proxy=MY_HTTP_PROXY
      - DRONE_DEBUG=true
    networks:
      - gateway
      - ci
    deploy:
      placement:
        constraints:
          - node.role!=manager
      labels:
      - "traefik.backend=drone"
      - "traefik.frontend.rule=Host:MACHINE.__MYDOMAIN.COM"
      - "traefik.port=8000"
      - "traefik.docker.network=drone_gateway"
      - "traefik.enable=true"
      endpoint_mode: dnsrr
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: pause
        order: start-first
        monitor: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
  drone-agent:
    image: drone/agent:0.8
    command: agent
    depends_on:
      - drone-server
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - ci
    environment:
      #ugly hack - see http://discuss.harness.io/t/solved-how-to-get-drone-working-in-docker-swarm-mode-with-docker-stack-deploy/1166
#      - DRONE_SERVER=10.97.23.171:9000
      - DRONE_SERVER=drone-server:9000
      - DRONE_SECRET=SECRET
      - DRONE_DEBUG=true
    deploy:
      placement:
        constraints:
          - node.role!=manager
      endpoint_mode: dnsrr
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: pause
        order: start-first
        monitor: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
networks:
  gateway:
  ci:

Oh, DAMN! That's it. It works with an older version of traefik: I've just tried it with 1.4.5 (I guess older RCs of 1.5 are fine too) and it all works. So either I use hardcoded IPs (which I don't want to) or I use an older version of the software (which is even worse :) ).

Yep, actually in my stack I'm currently using v1.4.5; I mistyped the version in the compose file, that's on me.

They may have made some changes in the 1.5-rc versions.

Good to know you solved it by yourself.

I am going to edit the compose file above right now.