Memory Leak on Drone Server

Hi all,

So I believe I’m suffering from a memory leak in my Drone server:

[image]

Percentage of memory usage (every “cliff” is the result of a server restart):

[image]

Not sure if this helps, but here are some pprof outputs:

go tool pprof:

(pprof) top
69814.75kB of 72888.85kB total (95.78%)
Dropped 781 nodes (cum <= 364.44kB)
Showing top 10 nodes out of 98 (cum >= 520.04kB)
      flat  flat%   sum%        cum   cum%
16974.50kB 23.29% 23.29% 16974.50kB 23.29%  runtime.memhash
16379.96kB 22.47% 45.76% 16379.96kB 22.47%  unicode.init
11776.38kB 16.16% 61.92% 11776.38kB 16.16%  runtime.adjustframe
 8195.25kB 11.24% 73.16%  8195.25kB 11.24%  runtime.execute
 7356.34kB 10.09% 83.25%  7356.34kB 10.09%  github.com/golang/protobuf/proto.(*Buffer).unmarshalType
 4096.38kB  5.62% 88.87%  4096.38kB  5.62%  runtime.efaceeq
 1906.81kB  2.62% 91.49%  1906.81kB  2.62%  runtime.(*Frames).Next
 1536.15kB  2.11% 93.60%  1536.15kB  2.11%  runtime.ifaceeq
 1072.94kB  1.47% 95.07%  1072.94kB  1.47%  encoding/xml.(*Decoder).Token
  520.04kB  0.71% 95.78%   520.04kB  0.71%  github.com/drone/drone/vendor/google.golang.org/grpc/transport.decodeMetadataHeader

go tool pprof -alloc_space:

(pprof) top
3404.79MB of 6221.94MB total (54.72%)
Dropped 656 nodes (cum <= 31.11MB)
Showing top 10 nodes out of 223 (cum >= 230.76MB)
      flat  flat%   sum%        cum   cum%
  783.32MB 12.59% 12.59%   784.86MB 12.61%  unicode.init
  552.23MB  8.88% 21.47%   552.23MB  8.88%  github.com/drone/drone/vendor/google.golang.org/grpc/transport.(*decodeState).processHeaderField
  449.61MB  7.23% 28.69%   449.61MB  7.23%  runtime.adjustframe
  433.14MB  6.96% 35.65%   433.14MB  6.96%  github.com/drone/drone/vendor/github.com/mattn/go-sqlite3._cgoexpwrap_89c1d62cc849_commitHookTrampoline
  258.01MB  4.15% 39.80%   258.01MB  4.15%  runtime.evacuate
  213.77MB  3.44% 43.24%   213.77MB  3.44%  github.com/drone/drone/vendor/golang.org/x/crypto/acme.(*wireAuthz).error
  208.69MB  3.35% 46.59%   233.69MB  3.76%  github.com/drone/drone/vendor/google.golang.org/grpc.(*Server).processUnaryRPC
  169.64MB  2.73% 49.32%   196.64MB  3.16%  github.com/drone/drone/vendor/github.com/mattn/go-sqlite3.(*SQLiteConn).exec
  168.33MB  2.71% 52.02%   172.83MB  2.78%  net.(*dnsMsg).String
  168.05MB  2.70% 54.72%   230.76MB  3.71%  runtime.(*cpuProfile).getprofile
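
(For context on how these numbers were collected: they come from the standard Go heap profile. I’m not sure this is exactly how Drone exposes it, but a minimal way to get the same endpoints out of a Go binary looks something like the sketch below; go tool pprof http://localhost:6060/debug/pprof/heap then gives the first table and the -alloc_space variant gives the second.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// In-use heap:        go tool pprof http://localhost:6060/debug/pprof/heap
	// Allocation totals:  go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}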

I’m currently running version 0.8.4+build.1398 on a t2.micro instance managed by Amazon ECS. I’ve given the container a 512 MB hard / 256 MB soft memory limit.

I used to have this problem before, when a Classic/Elastic Load Balancer sat between the agents and the server. That ELB would break the gRPC connection every 30 seconds or so and cause the agents to reconnect, which I believed was causing the memory leak. That can’t be the case now, since my cluster of agents (6 instances during the day) connects directly to the server through the host machine’s IP.

There are about 50 users in our organization and we run around 1,500 build steps every day, to give some idea of the scale.

Any ideas? I can provide more pprof info if required.

Hi,

Since the problem did not go away by itself, I decided to take a closer look. From what I can see, I have an ever-increasing number of these goroutines:

goroutine profile: total 3726
3645 @ 0x432eba 0x4420f4 0x440d5c 0x7f5aa9 0x92cd25 0x45f881
#	0x7f5aa8	github.com/drone/drone/vendor/github.com/cncd/pubsub.(*publisher).Subscribe+0x208	/go/src/github.com/drone/drone/vendor/github.com/cncd/pubsub/pub.go:58
#	0x92cd24	github.com/drone/drone/server.EventStreamSSE.func2+0xe4					/go/src/github.com/drone/drone/server/stream.go:78

And they keep growing. The goroutine stack dump is full of these:

goroutine 6089779 [select, 40 minutes]:
github.com/drone/drone/vendor/github.com/cncd/pubsub.(*publisher).Subscribe(0xc420434e10, 0x15ffb80, 0xc42229a580, 0xf22db7, 0xc, 0xc4206ce750, 0xc4225e60c0, 0x4388d8)
	/go/src/github.com/drone/drone/vendor/github.com/cncd/pubsub/pub.go:58 +0x209
github.com/drone/drone/server.EventStreamSSE.func2(0xc42229a580, 0xc42058ed80, 0x15ff800, 0xc4225a7000, 0xc42207bda0, 0xc42143b1f0)
	/go/src/github.com/drone/drone/server/stream.go:78 +0xe5
created by github.com/drone/drone/server.EventStreamSSE
	/go/src/github.com/drone/drone/server/stream.go:80 +0x45e

goroutine 6223842 [select, 28 minutes]:
github.com/drone/drone/vendor/github.com/cncd/pubsub.(*publisher).Subscribe(0xc420434e10, 0x15ffb80, 0xc4264013f0, 0xf22db7, 0xc, 0xc422b6c420, 0xa8180e, 0xc422dd2e00)
	/go/src/github.com/drone/drone/vendor/github.com/cncd/pubsub/pub.go:58 +0x209
github.com/drone/drone/server.EventStreamSSE.func2(0xc4264013f0, 0xc4231f6120, 0x15ff800, 0xc420d19780, 0xc420d10780, 0xc422ba6910)
	/go/src/github.com/drone/drone/server/stream.go:78 +0xe5
created by github.com/drone/drone/server.EventStreamSSE
	/go/src/github.com/drone/drone/server/stream.go:80 +0x45e

goroutine 6123856 [select, 36 minutes]:
github.com/drone/drone/vendor/github.com/cncd/pubsub.(*publisher).Subscribe(0xc420434e10, 0x15ffb80, 0xc42229b3f0, 0xf22db7, 0xc, 0xc421c67ad0, 0xc421c93080, 0x4388d8)
	/go/src/github.com/drone/drone/vendor/github.com/cncd/pubsub/pub.go:58 +0x209
github.com/drone/drone/server.EventStreamSSE.func2(0xc42229b3f0, 0xc4223bbb60, 0x15ff800, 0xc422a79d80, 0xc420a93a40, 0xc424bd1ad0)
	/go/src/github.com/drone/drone/server/stream.go:78 +0xe5
created by github.com/drone/drone/server.EventStreamSSE
	/go/src/github.com/drone/drone/server/stream.go:80 +0x45e
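
(In case anyone wants to pull the same data: the dump above is just the verbose goroutine profile. It can be fetched from the pprof endpoint at /debug/pprof/goroutine?debug=2, or written out in-process with runtime/pprof, roughly like this:)

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 prints every goroutine with its state, how long it has been
	// blocked, and a full stack trace, which is the format shown above.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
}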

Has no one else experienced anything similar with Drone v0.8.4? :frowning:

Any ideas @bradrydzewski?

Edit: there are some interesting goroutines on the stack as well that have been open for hours…

goroutine 17 [syscall, 478 minutes, locked to thread]:
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2197 +0x1

goroutine 8 [chan receive, 478 minutes]:
database/sql.(*DB).connectionOpener(0xc4204260a0)
	/usr/local/go/src/database/sql/sql.go:837 +0x4a
created by database/sql.Open
	/usr/local/go/src/database/sql/sql.go:582 +0x212

goroutine 35 [IO wait, 478 minutes]:
net.runtime_pollWait(0x7f7814ad0210, 0x72, 0x15f8d40)
	/usr/local/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc420443bf8, 0x72, 0x15f2560, 0xc4204530a0)
	/usr/local/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc420443bf8, 0xffffffffffffffff, 0x0)
	/usr/local/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc420443b90, 0x0, 0x15f6780, 0xc4204530a0)
	/usr/local/go/src/net/fd_unix.go:430 +0x1e5
net.(*TCPListener).accept(0xc42000eb70, 0xc420039d20, 0x7cd846, 0x45c240)
	/usr/local/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc42000eb70, 0xf5c250, 0xc420395e00, 0x1602ca0, 0xc42000e0b0)
	/usr/local/go/src/net/tcpsock.go:228 +0x49
github.com/drone/drone/vendor/google.golang.org/grpc.(*Server).Serve(0xc420395e00, 0x15fe500, 0xc42000eb70, 0x0, 0x0)
	/go/src/github.com/drone/drone/vendor/google.golang.org/grpc/server.go:444 +0x196
main.server.func1(0x8, 0xf5d4d0)
	/go/src/github.com/drone/drone/extras/cmd/drone-server/server.go:540 +0x441
github.com/drone/drone/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1(0xc420439640, 0xc4204479e0)
	/go/src/github.com/drone/drone/vendor/golang.org/x/sync/errgroup/errgroup.go:58 +0x57
created by github.com/drone/drone/vendor/golang.org/x/sync/errgroup.(*Group).Go
	/go/src/github.com/drone/drone/vendor/golang.org/x/sync/errgroup/errgroup.go:66 +0x66

Nope, I’m not seeing any memory leaks or receiving any reports of memory leak issues from other users. Sorry, I wish I could offer more help.

Well, that’s a bummer :frowning:

But I think I might be onto something. To keep it short: is it possible that stream.go#L81 should pass the local context ctx instead of the Gin context c? I.e.:

Config.Services.Pubsub.Subscribe(ctx, "topic/events", func(m pubsub.Message) {

instead of

Config.Services.Pubsub.Subscribe(c, "topic/events", func(m pubsub.Message) {
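
My reading of the code (and this is just my understanding, so take it with a grain of salt) is that Subscribe blocks until the context it is given is canceled. The Gin context never seems to get canceled when the SSE client goes away, so every subscriber goroutine stays parked forever. Here’s a standalone sketch of that failure mode, with a made-up subscribe that has the same blocking shape:

package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// subscribe mimics the shape of pubsub.Subscribe as I understand it:
// it only returns once the supplied context is canceled.
func subscribe(ctx context.Context) {
	<-ctx.Done()
}

func main() {
	// Leaky variant: a context whose Done() channel never fires
	// (this is how the Gin context behaved for us).
	for i := 0; i < 100; i++ {
		go subscribe(context.Background())
	}

	// Fixed variant: a context that is canceled when the client disconnects.
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		go subscribe(ctx)
		cancel() // simulates the SSE client going away
	}

	time.Sleep(100 * time.Millisecond)
	// Prints roughly 101: main plus the 100 subscribers stuck on a context
	// that can never be canceled; the other 100 have already returned.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}

Passing the request-scoped ctx instead would mean the Done() channel closes when the client disconnects, Subscribe returns, and the goroutine exits.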

Just to test it out, I’ve added some logging to the goroutine that seems to be causing the leak:

	go func() {
		// TODO remove this from global config
		logrus.Debugf("user feed: subscribing")
		// Subscribe only returns once the context passed to it is canceled.
		Config.Services.Pubsub.Subscribe(c, "topic/events", func(m pubsub.Message) {
			name := m.Labels["repo"]
			priv := m.Labels["private"]
			if repo[name] || priv == "false" {
				select {
				case <-ctx.Done():
					return
				default:
					eventc <- m.Data
				}
			}
		})
		// With the Gin context c this line is never reached.
		logrus.Debugf("user feed: context canceled")
		cancel()
	}()

I noticed that “user feed: context canceled” was never logged (although “user feed: connection closed” was). So I changed it to pass the ctx context instead, and now pretty much every “user feed: connection closed” is soon followed by a “user feed: context canceled”.

I’ve left it running in production for half a day and the number of EventStreamSSE goroutines stayed at around twenty-something (instead of the hundreds/thousands I was getting before). Which I guess makes sense? I don’t really see why a ctx object would be instantiated if it’s never passed down anywhere.

I’ve also noticed that the LogStreamSSE function has a similar goroutine, but there it does use the ctx object instead of the Gin context.

Great analysis. Send a pull request and let’s see if this helps.

Alright, here it is! https://github.com/drone/drone/pull/2386

Well, I have some unfortunate updates…

[image]

It seems that the goroutine problem is fixed for good (before the fix I’d have had thousands of goroutines by now); however, the heap keeps growing. This shows up on our memory usage graph as well (16:00 was when I deployed the “fix” from the PR above):

So there’s something else that keeps allocating all that memory :expressionless:

EDIT: new update, actually things seem fine :thinking:

The memory was deallocated during the night. This would not happen previously, without the fix.

I still find the ever-growing heap during the day a bit weird, but as long as it’s eventually able to deallocate the memory and doesn’t keep growing without limit, that should be OK. In this chart 100% is 256 MB, by the way, so it’s not like it’s consuming a lot of memory either.
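
For what it’s worth, my working theory (general Go runtime behaviour, not something I’ve verified against the Drone code) is that the daytime growth is mostly memory the Go runtime is still holding from the OS rather than live objects, and it only gets released back lazily. Comparing HeapAlloc with HeapSys and HeapReleased in runtime.MemStats makes that distinction visible, e.g.:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc is the memory occupied by allocated heap objects, while
	// HeapSys - HeapReleased is what the process is still holding from the
	// OS, which is closer to what the container memory graph reports.
	fmt.Printf("heap in use: %d MB, held from OS: %d MB\n",
		m.HeapAlloc/1024/1024, (m.HeapSys-m.HeapReleased)/1024/1024)
}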