Hi all,
So I believe my drone server is suffering from a memory leak:
Percentage of memory usage (every “cliff” is the result of a server restart):
Not sure if this helps, but here are some pprof outputs:
go tool pprof (heap profile, inuse_space):

```
(pprof) top
69814.75kB of 72888.85kB total (95.78%)
Dropped 781 nodes (cum <= 364.44kB)
Showing top 10 nodes out of 98 (cum >= 520.04kB)
      flat  flat%   sum%        cum   cum%
16974.50kB 23.29% 23.29% 16974.50kB 23.29%  runtime.memhash
16379.96kB 22.47% 45.76% 16379.96kB 22.47%  unicode.init
11776.38kB 16.16% 61.92% 11776.38kB 16.16%  runtime.adjustframe
 8195.25kB 11.24% 73.16%  8195.25kB 11.24%  runtime.execute
 7356.34kB 10.09% 83.25%  7356.34kB 10.09%  github.com/golang/protobuf/proto.(*Buffer).unmarshalType
 4096.38kB  5.62% 88.87%  4096.38kB  5.62%  runtime.efaceeq
 1906.81kB  2.62% 91.49%  1906.81kB  2.62%  runtime.(*Frames).Next
 1536.15kB  2.11% 93.60%  1536.15kB  2.11%  runtime.ifaceeq
 1072.94kB  1.47% 95.07%  1072.94kB  1.47%  encoding/xml.(*Decoder).Token
  520.04kB  0.71% 95.78%   520.04kB  0.71%  github.com/drone/drone/vendor/google.golang.org/grpc/transport.decodeMetadataHeader
```
go tool pprof -alloc_space (heap profile, allocations since process start):

```
(pprof) top
3404.79MB of 6221.94MB total (54.72%)
Dropped 656 nodes (cum <= 31.11MB)
Showing top 10 nodes out of 223 (cum >= 230.76MB)
     flat  flat%   sum%       cum   cum%
 783.32MB 12.59% 12.59%  784.86MB 12.61%  unicode.init
 552.23MB  8.88% 21.47%  552.23MB  8.88%  github.com/drone/drone/vendor/google.golang.org/grpc/transport.(*decodeState).processHeaderField
 449.61MB  7.23% 28.69%  449.61MB  7.23%  runtime.adjustframe
 433.14MB  6.96% 35.65%  433.14MB  6.96%  github.com/drone/drone/vendor/github.com/mattn/go-sqlite3._cgoexpwrap_89c1d62cc849_commitHookTrampoline
 258.01MB  4.15% 39.80%  258.01MB  4.15%  runtime.evacuate
 213.77MB  3.44% 43.24%  213.77MB  3.44%  github.com/drone/drone/vendor/golang.org/x/crypto/acme.(*wireAuthz).error
 208.69MB  3.35% 46.59%  233.69MB  3.76%  github.com/drone/drone/vendor/google.golang.org/grpc.(*Server).processUnaryRPC
 169.64MB  2.73% 49.32%  196.64MB  3.16%  github.com/drone/drone/vendor/github.com/mattn/go-sqlite3.(*SQLiteConn).exec
 168.33MB  2.71% 52.02%  172.83MB  2.78%  net.(*dnsMsg).String
 168.05MB  2.70% 54.72%  230.76MB  3.71%  runtime.(*cpuProfile).getprofile
```
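For reference, I'm grabbing these roughly like this (the URL is a placeholder for my server, and I'm assuming the standard net/http/pprof path here):

```
# <drone-server> is a placeholder for my server's address
go tool pprof http://<drone-server>/debug/pprof/heap                # first output above (in-use memory)
go tool pprof -alloc_space http://<drone-server>/debug/pprof/heap   # second output above (total allocations)
```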
I’m currently running version 0.8.4+build.1398 on a t2.micro instance managed by Amazon ECS. I’ve given the container a 512 MB hard / 256 MB soft memory limit.
I had this problem before, when the agent/server communication went through a Classic/Elastic Load Balancer. That ELB would break the gRPC connection every 30 seconds or so and force the agents to reconnect, which is what I believed was causing the memory leak. That can’t be the case now, as my cluster of agents (6 instances during the day) connects directly to the server through the host machine’s IP.
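If it helps to rule out reconnect churn completely, I can also confirm on the server host that each agent holds a single long-lived connection, with something like:

```
# 9000 being the gRPC port in my setup; I'd expect one stable
# ESTABLISHED connection per agent rather than constant churn
netstat -tn | grep ':9000'
```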
To give some idea of scale: there are about 50 users in our organization and we run around 1,500 build steps every day.
Any ideas? I can provide more pprof info if required.
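For example, I could grab a goroutine profile or repeated heap snapshots over time if that would help narrow it down (same placeholder URL as above, assuming the standard net/http/pprof endpoints):

```
go tool pprof http://<drone-server>/debug/pprof/goroutine   # check whether goroutines are piling up
go tool pprof http://<drone-server>/debug/pprof/heap        # repeat every few hours to see which entries keep growing
```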