Server is not accepting clients

I started to get this error (“server is not accepting clients”) on clients trying to connect to a live cluster on GCE. The clients could not connect.

What does this error mean? Is it a problem with the database or my client code?

It sounds like an exhaustion of file descriptors. Have you increased the file descriptor limit on your server as recommended here? https://www.cockroachlabs.com/docs/stable/recommended-production-settings.html#file-descriptors-limit

Otherwise, please let us know what you can find in your CockroachDB log files.
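If it helps, here is a minimal sketch for checking the file descriptor limit that actually applies to the running process, rather than the one configured for your shell. It assumes Linux with `/proc` mounted; the function names are made up for illustration, and you would need to supply the PID of the cockroach process yourself.

```python
from pathlib import Path


def parse_max_open_files(limits_text):
    """Return the (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because a limit may read 'unlimited'.
    Returns None if the row is not present.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # Remaining columns are: soft limit, hard limit, units.
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def read_max_open_files(pid):
    """Read the effective limits of a running process (Linux only)."""
    return parse_max_open_files(Path(f"/proc/{pid}/limits").read_text())
```

Calling `read_max_open_files(pid)` with the PID of the running server shows whether the limit from the unit file was really applied to that process.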

The file descriptor limits are increased in the systemd unit file, and the system-wide limit is also set.

I didn’t see anything unusual in the log files. Just lots of runtime stats, no warnings or errors.

Are you using a load balancer? A firewall? Is there any intermediate thing on the network between your clients and your CockroachDB nodes?
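One way to narrow this down is to compare a plain TCP connection through the load balancer with one made directly to a node. The following is only a rough sketch: `can_connect` is a made-up helper name, and the addresses you pass in are whatever applies to your setup (26257 is the default CockroachDB port). A successful TCP connection only shows that the network path is open; a draining server may still accept the TCP connection and then refuse SQL clients.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, connection reset, timeouts and unreachable hosts.
        return False
```

If the direct connection to a node succeeds while the connection through the load balancer fails, that points at the load balancer or firewall rather than the database.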

Yes, the Google Cloud load balancer and firewall.

As far as I can tell, they don’t log much at all to Google Stackdriver. There are no messages in the relevant timeframe, or I’m looking in the wrong place.

This message can appear when the server is draining. Could the server be getting killed and restarted? This could perhaps happen if systemd is using misconfigured health checks. Could you post your systemd service file?

[Unit]
Description=CockroachDB
After=network.target

[Service]
User=cockroach
Group=cockroach

ExecStart=/usr/local/bin/cockroach start \
        --certs-dir=/etc/cockroach \
        --store=path=/mnt/disks/ssd0 \
        --log-dir=/var/log/cockroach \
        --host=10.156.0.2 \
        --join=10.154.0.2,10.132.0.14 \
        --cache=25%% \
        --max-sql-memory=25%% \
        --logtostderr=ERROR

ExecStop=/usr/local/bin/cockroach quit --certs-dir=/etc/cockroach
Restart=always

LimitNOFILE=64000

[Install]
WantedBy=multi-user.target

I double-checked the timestamps on the errors and the logs and you are right: those errors only started happening after I restarted the CockroachDB processes.

The errors that happened before I intervened are ECONNRESET errors, with the error message ‘This socket is closed’. Any information about this one?

Thanks

Could you summarize the complete sequence of events? It’s unclear what’s happening in which order. Once the servers were restarted, did the problems persist?

1. Clients stopped being able to query the database.
2. According to the logs, there were ECONNRESET errors during the time the clients could not query.
3. I restarted all three CockroachDB processes in the cluster one by one. The ECONNRESET errors stopped, and were replaced by ‘server is not accepting clients’.
4. Clients were still not able to query, even after the db processes had restarted successfully and rejoined the cluster. The ‘server is not accepting clients’ error was still being logged.
5. I restarted the clients.
6. Clients were able to query and there were no more errors.

Oh, I forgot to mention this, the cluster is still on v1.1.5.

OK. The ECONNRESET errors probably mean that the servers crashed. There may be useful information from the logs at that time. They should have been restarted by systemd; it’s not clear why you had to restart them manually.

It sounds like there may be some client-side issues if you needed to restart the clients to get them un-stuck. What language are your clients in? It’s unclear whether the rest of the problem is coming from the client side, the server side, or the load balancer. There were improvements to the initialization process in 2.0, so I would recommend upgrading.
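On the client side, one common pattern is to discard a connection once it fails and open a fresh one, instead of reusing a socket that has already been reset. Here is a minimal sketch of that idea, not tied to any particular driver: `connect` and `run_query` are hypothetical callables you would supply, and it assumes failures surface as `OSError` (which includes `ConnectionResetError`). Real drivers usually raise their own exception types, so the `except` clause would need adjusting.

```python
import time


def query_with_reconnect(connect, run_query, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run a query, reconnecting with exponential backoff on connection errors.

    connect:   hypothetical callable returning a new connection object
    run_query: hypothetical callable that takes a connection and returns a result
    """
    conn = None
    last_error = None
    for attempt in range(attempts):
        try:
            if conn is None:
                conn = connect()
            return run_query(conn)
        except OSError as exc:
            last_error = exc
            conn = None  # drop the broken connection so the next attempt reconnects
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # wait 0.5s, 1s, 2s, ...
    raise last_error
```

Whether something like this would have helped here depends on how your client library handles dead connections in its pool, which is why the client language matters.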