After many weeks of frustration, digging, and poking around, I have finally found a configuration that works in the following scenario:
Docker (Swarm Mode)
Deployed as a service with docker stack deploy
There are some known issues with Drone server<->agent communications through load balancers,
and this apparently also includes Docker’s overlay networking, although I suspect this has more to do with VIPs.
I was going to dig into the topic sooner or later, thanks for the inspiration. I’m curious about the issue you faced: is it unstable or not working without the hardcoded node1 IP as the server address? Or without the server’s port 9000 being published? Why did you have to specify the endpoint_mode?
It was unstable. You’d have to restart the server+agent(s) to get things
going again.
The endpoint_mode can be one of the following:
endpoint_mode: vip - Docker assigns the service a virtual IP (VIP), which
acts as the “front end” for clients to reach the service on a network.
Docker routes requests between the client and available worker nodes for
the service, without client knowledge of how many nodes are participating
in the service or their IP addresses or ports. (This is the default.)
endpoint_mode: dnsrr - DNS round-robin (DNSRR) service discovery does not
use a single virtual IP. Docker sets up DNS entries for the service such
that a DNS query for the service name returns a list of IP addresses, and
the client connects directly to one of these. DNS round-robin is useful in
cases where you want to use your own load balancer, or for Hybrid Windows
and Linux applications.
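As a rough illustration of where this setting lives, here is a minimal stack-file fragment. The service name drone-server, the image tag, and the file format version are assumptions made for the example; they are not taken from anyone's actual stack in this thread.

```yaml
version: "3.4"   # assumed; endpoint_mode needs a 3.x file format recent enough to support it

services:
  drone-server:               # hypothetical service name
    image: drone/drone:0.8    # assumed image tag
    deploy:
      # vip is the default. With dnsrr the service name resolves
      # directly to the task IPs instead of to a single virtual IP.
      endpoint_mode: dnsrr
```

Note that Swarm rejects dnsrr on a service that also publishes ports through the ingress routing mesh, so this only fits services reached over an overlay network or via host-mode ports.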
And changing the way the port is published avoids using the overlay
network at all.
The gRPC used internally by Drone doesn’t seem to like going through
overlay networks much (unclear why),
and I’m not sure how the VIP comes into play (probably not needed?).
Oh, and I think I remember why publishing the port to the host is required:
otherwise the host won’t have a bound listening interface, so the agents
can’t reach the server on the node’s IP.
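A minimal sketch of what host-mode publishing with a hardcoded server address might look like. This is a guess at the shape of the workaround described above, not the original configuration: the service names, image tags, the node1 hostname constraint, and the placeholder IP 192.0.2.10 are all assumptions, and secrets, volumes, and other required settings are left out.

```yaml
version: "3.4"   # assumed; the long ports syntax needs file format 3.2 or later

services:
  drone-server:                 # hypothetical service name
    image: drone/drone:0.8      # assumed image tag
    ports:
      # Long syntax with mode: host binds port 9000 directly on the node
      # running the task, bypassing the ingress routing mesh.
      - target: 9000
        published: 9000
        protocol: tcp
        mode: host
    deploy:
      placement:
        constraints:
          - node.hostname == node1   # assumed: pin the server to a known node

  drone-agent:                  # hypothetical service name
    image: drone/agent:0.8      # assumed image tag
    environment:
      # Placeholder address of node1; replace with the real node IP.
      DRONE_SERVER: "192.0.2.10:9000"
```

Pinning the server to one node is what makes a hardcoded address workable at all, since a host-mode port is only bound on the node where the task actually runs.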
You do not have to expose port 9000 if you only build on agents in the swarm.
On the agent, use the service name as the DNS name of the server, and you’ll be good.
And please do not use the bridge network on swarm if you need port communications between nodes.
Overlay is the way to go.
I’m using drone in swarm and it works flawlessly.
I can share my stack if needed.
@zaggash Please share your config. You’ve completely missed the point here.
Please look in the forum for other references and you’ll find comments from @bradrydzewski clearly stating several issues with Drone server<->agent comms
over load balancers or reverse proxies, which includes the overlay
networking in Docker.
Here is my stack, I’m using the overlay without publishing anything other than my Traefik ports.
All the magic done with drone goes through swarm and the overlay network.
I had an issue with “pending build”, solved by setting endpoint_mode: dnsrr on the server. That is how it gets around the VIP IP translation (which acts like a reverse proxy in front of port 9000).
The stack is started with docker stack deploy -c docker-compose.yml ci
This explains the ci prefix on some of the names.
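The stack file itself is not reproduced here. Purely as a sketch of the approach described (overlay network only, dnsrr on the server, the agent addressing the server by service name, nothing published for Drone), a stripped-down version could look like the following. Every name and tag in it is an assumption, and Traefik, secrets, and the server's own settings are omitted.

```yaml
version: "3.4"   # assumed file format version

services:
  drone-server:                  # hypothetical service name
    image: drone/drone:0.8       # assumed image tag
    networks:
      - ci
    deploy:
      # Skip the VIP so the agent's connection goes straight to the server task.
      endpoint_mode: dnsrr

  drone-agent:                   # hypothetical service name
    image: drone/agent:0.8       # assumed image tag
    command: agent
    environment:
      # Service name resolved by Swarm's internal DNS on the shared overlay.
      DRONE_SERVER: "drone-server:9000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - ci

networks:
  ci:                            # hypothetical network name
    driver: overlay
```

Deployed under the stack name ci, the services would show up as ci_drone-server and ci_drone-agent, while the short service name should still resolve inside the stack's network.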
I’ve tried your solution for running drone in docker swarm. Unfortunately my traefik complains that the drone-server and agent services are badly configured: “ignored endpoint-mode not supported”. I guess the reason is this: https://github.com/containous/traefik/blob/master/provider/docker/docker.go#L379 (TL;DR: traefik does not support services with endpoint_mode dnsrr).
Do you have any idea how I could get your stack going? (BTW prologic’s hack with the hardcoded IP address does work, but it’s not convenient.)
My setup is a swarm cluster with 2 machines (manager + worker) and the following compose file (basically I removed all the SSL stuff: I’ve got my own SSL termination in front of this stack, and I run a single instance of each service). I tried starting traefik as the first service and as the last service, and I’ve tried running all services on a single node (manager) as well as on separate nodes, and traefik still complains. The only idea I’ve got is that traefik introduced this “feature” in a relatively new version (my stack was created about 2 days ago) and you are running an older version without this restriction… (I’m just wildly guessing)
Oh, DAMN! That’s it. It works with an older version of traefik: I’ve just tried it with 1.4.5 (I guess older RCs of 1.5 are fine too) and it all works. So either I use hardcoded IPs (which I don’t want to) or I use an older version of the software (which is even worse).
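For anyone wanting to try the same workaround, pinning the version comes down to fixing the image tag in the stack file. A minimal fragment, assuming the service is called traefik and leaving out its command flags, ports, and volumes:

```yaml
version: "3.4"   # assumed file format version

services:
  traefik:                  # hypothetical service name
    # Pinned to the version reported above as still accepting
    # services that use endpoint_mode: dnsrr.
    image: traefik:1.4.5
```

This trades the hardcoded IPs for an outdated proxy, so it is a stopgap rather than a fix.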