ERROR: [n1] Queued as error 60b0184e

Hi. I have been running a 3-node CockroachDB cluster since June. I just found in the logs that the following happened two days ago:

Oct 20 09:11:55 dc1-sirius cockroach[43586]: *
Oct 20 09:11:55 dc1-sirius cockroach[43586]: * ERROR: [n1] Queued as error 60b0184e7fca48c8b1154f78b616f249
Oct 20 09:11:55 dc1-sirius cockroach[43586]: *
Oct 20 09:11:55 dc1-sirius cockroach[43586]: *
Oct 20 09:11:55 dc1-sirius cockroach[43586]: * ERROR: [n1] Queued as error d22446541bdd4527abf3d9d98408cd08
Oct 20 09:11:55 dc1-sirius cockroach[43586]: *
Oct 20 09:11:55 dc1-sirius cockroach[43586]: * ERROR: [n1] Queued as error 6f17105442c248e4be8231f2a28150b7
Oct 20 09:11:55 dc1-sirius cockroach[43586]: *

Then, about a second later, systemd detected that the process had exited:

Oct 20 09:11:56 dc1-sirius systemd[1]: cockroach.service: Main process exited, code=exited, status=7/NOTRUNNING
Oct 20 09:11:56 dc1-sirius systemd[1]: cockroach.service: Failed with result 'exit-code'.
Oct 20 09:12:06 dc1-sirius systemd[1]: cockroach.service: Service RestartSec=10s expired, scheduling restart.
Oct 20 09:12:06 dc1-sirius systemd[1]: cockroach.service: Scheduled restart job, restart counter is at 1.
Oct 20 09:12:06 dc1-sirius systemd[1]: Stopped Cockroach Database cluster node.
Oct 20 09:12:06 dc1-sirius systemd[1]: Starting Cockroach Database cluster node...
Oct 20 09:12:16 dc1-sirius systemd[1]: Started Cockroach Database cluster node.

It came back up and, less than a minute later, crashed again:

Oct 20 09:13:01 dc1-sirius cockroach[182193]: * ERROR: [n1] Queued as error f224533d40dd417294588e2f56a9cb9e
Oct 20 09:13:01 dc1-sirius cockroach[182193]: *
Oct 20 09:13:01 dc1-sirius systemd[1]: cockroach.service: Main process exited, code=exited, status=7/NOTRUNNING
Oct 20 09:13:01 dc1-sirius systemd[1]: cockroach.service: Failed with result 'exit-code'.
Oct 20 09:13:12 dc1-sirius systemd[1]: cockroach.service: Service RestartSec=10s expired, scheduling restart.
Oct 20 09:13:12 dc1-sirius systemd[1]: cockroach.service: Scheduled restart job, restart counter is at 2.
Oct 20 09:13:12 dc1-sirius systemd[1]: Stopped Cockroach Database cluster node.
Oct 20 09:13:12 dc1-sirius systemd[1]: Starting Cockroach Database cluster node...
Oct 20 09:13:22 dc1-sirius systemd[1]: Started Cockroach Database cluster node.

Then again:

Oct 20 09:13:22 dc1-sirius cockroach[182207]: * ERROR: [n1] Queued as error 7daa739dd5fe4145a9eff58621a9e17b
Oct 20 09:13:22 dc1-sirius cockroach[182207]: *
Oct 20 09:13:22 dc1-sirius systemd[1]: cockroach.service: Main process exited, code=exited, status=7/NOTRUNNING
Oct 20 09:13:22 dc1-sirius systemd[1]: cockroach.service: Failed with result 'exit-code'.
Oct 20 09:13:33 dc1-sirius systemd[1]: cockroach.service: Service RestartSec=10s expired, scheduling restart.
Oct 20 09:13:33 dc1-sirius systemd[1]: cockroach.service: Scheduled restart job, restart counter is at 3.
Oct 20 09:13:33 dc1-sirius systemd[1]: Stopped Cockroach Database cluster node.
Oct 20 09:13:33 dc1-sirius systemd[1]: Starting Cockroach Database cluster node...
Oct 20 09:13:43 dc1-sirius systemd[1]: Started Cockroach Database cluster node.

This cycle repeated for another 10 minutes, and since then the node seems to run fine again. It crashed 29 times in total.

On my second node, the same thing happened at the same time, but only 7 times.

On the third node, it happened 8 times, also at the same time!

I run all nodes with CockroachDB v21.1.6 on CentOS 8.3 (x64). They are connected through a private 10.0.0.0 network for the cluster. There is only one app running on that database. It is written in Go and uses the pgx driver. The app detected the outage through its ping, which runs every minute.

I’m now really nervous because all three nodes crashed at nearly the same time. What could cause every node of a cluster to crash? I mean, I run a cluster so that it can handle the failure of one node, but here all of them crashed at once. And there is nothing more in the log…

I can imagine that the network had an issue during this time, but the three nodes should definitely not all crash at the same time, right?

I just created an issue on GitHub, so I deleted the additional log entries here.

Hi @Volker.Schmid,

Thanks for the question! I answered on the GitHub issue and will paste my response here as well.

Thanks for filing!

The logs explain why the crashes happened:

clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 2 are within the offset)

The nodes deliberately crash when this happens, to prevent data consistency issues.

Please see our documentation, which explains why this happens and how to deal with it: Operational FAQs | CockroachDB Docs