Build step failing with: "step.status":"failure" on log

Hi everyone again.
We are using Drone 1.0.1 to run tests in parallel.
The tests are split into ~20 steps, each running an rspec command in a different folder. All steps run in parallel on the same machine (an r5.xlarge instance) and are grouped in drone.yml.

The problem is that some steps fail randomly with a “Failure” error. The error says nothing more than that.

I set the logs to DEBUG (server and agent), then collected and analyzed them, but I did not see any useful information (at least to me) about why this is happening.

The following lines describe the error in the log:

Apr 22 19:55:36 074c5f7da9ad drone.server.production 5067 - - {"level":"debug","msg":"manager: updating step status","step.id":4548,"step.name":"test_mailers","step.status":"failure","time":"2019-04-22T19:55:36Z"}
...
Apr 22 19:55:37 074c5f7da9ad drone.server.production 5067 - - {"level":"debug","msg":"manager: updating step status","step.id":4559,"step.name":"test_subscribers","step.status":"failure","time":"2019-04-22T19:55:37Z"}

Two steps failed in this build: test_subscribers and test_mailers.

Entire log is here

Could you help me identify why this is happening?

Can you please provide the following information:

  • full copy of the YAML configuration file. required
  • data from /api/repos/{owner}/{name}/builds/{build} (replace the placeholders with actual values and navigate to the URL in the browser). required
  • screenshot of the page showing a little more detail. optional

Hello @bradrydzewski

Here is the information:

Screenshot

What do you mean by “navigate to the url in the browser”? Do you mean making an API call in the browser? I did it via curl.

Thanks!

Did you check your agent logs for more details? I traced the code and you should see error messages in the agent logs (docker logs <agent>)

	err = runner.Run(timeout)
	if err != nil && err != runtime.ErrInterrupt {
		logger = logger.WithError(err)
		logger.Infoln("runner: execution failed")
		return r.handleError(ctx, m.Stage, err)
	}
	logger = logger.WithError(err)
	logger.Infoln("runner: execution complete")

I also checked the meaning of 255 and it indicates that docker run, docker start or docker logs failed to execute. You might also be able to run docker events on the agent host machine to surface any detailed errors being returned by the Docker daemon.

Good!
I am troubleshooting and will post the results here soon.

Hello @bradrydzewski.
I was able to get the real error using docker inspect:

"State": {
            "Status": "created",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 128,
            "Error": "error setting label on mount source '/var/lib/docker/volumes/9bvrs0ahyxj4o7ttigp9tpt7lp6ofpkq/_data': no such file or directory",
            "StartedAt": "0001-01-01T00:00:00Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },

We are trying to understand it, because even when one or two steps fail with this error, the others (running in parallel) run normally.
Do you have any idea?

Thanks.

Perhaps it is a problem with Docker? https://github.com/docker/cli/issues/1234
Is it possible you need to upgrade to a newer version of Docker?

Hello @bradrydzewski.
That was a good point!
We tested the build with the agents running on another box. They were running on CoreOS before. Now they are running on Amazon Linux Optimized (we run Docker on ECS) and the build runs without errors.
So it seems to be something related to CoreOS and Docker.

Thanks for your help.