Strange health endpoint behavior

Hi everyone,

Recently we got some 503 responses from the /health?ready=1 endpoint that seem to be related to some unavailable ranges. The nodes can communicate with each other and no shutdown is in progress. Can someone point us in the right direction to understand this behavior?
We are using version 2.0.6.

Hi @alex9627,

Any chance there were some nodes that were decommissioned? A 503 response is expected at that endpoint when nodes are decommissioning.

No, after a few seconds the response is 200 again without any action.

Hey @alex9627,

Okay, in that case can you provide a little more information on what was going on with the cluster when you received the 503 response?

Were you running any jobs, or was there a particularly heavy workload on the cluster?

Any information on the status of the cluster at that moment would be useful.

Thanks,

Ron

Hi, thanks for the reply. This is our replicas dashboard. We have 18 servers (9 offline/spare), and during this window all the remaining active servers responded with 503 a few times (no available servers).
https://snapshot.raintank.io/dashboard/snapshot/jRqZZeaYJ36tM39k7CZTOef6f3ibC2Mh?orgId=2
Please feel free to ask for more information.

Hey @alex9627,

Thanks for the snapshot. The number of replicas shouldn’t be fluctuating the way it is in that snapshot, and it looks like entire nodes are going down and then coming back up, so the 503 responses would be expected. The endpoint is letting you know that the node is unhealthy because it is unable to mark itself as live in node liveness.

Could you also send over a screenshot of your live node count from
<host>:<port>/#/metrics/runtime/cluster, covering the time you were seeing the 503 responses?
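
In case it helps to line up the 503 responses with the graphs, here is a rough sketch of a poller that records each time the status returned by /health?ready=1 changes. The URL (localhost:8080 over plain HTTP), the polling interval, and the function names are all assumptions for illustration. Adjust them for your nodes, and note that a secure cluster would need HTTPS and certificate handling, which this sketch leaves out.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

# Assumption: replace with the HTTP address of one of your nodes.
HEALTH_URL = "http://localhost:8080/health?ready=1"


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the node reports that it is not ready
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def status_changes(samples):
    """Collapse (timestamp, status) samples into only the transitions."""
    changes = []
    previous = object()  # sentinel so the first sample is always recorded
    for ts, status in samples:
        if status != previous:
            changes.append((ts, status))
            previous = status
    return changes


def poll(url=HEALTH_URL, interval=1.0, count=60, fetch=fetch_status, sleep=time.sleep):
    """Sample the endpoint `count` times and return the status transitions."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(timezone.utc).isoformat(), fetch(url)))
        sleep(interval)
    return status_changes(samples)


if __name__ == "__main__":
    # Prints one line per transition, e.g. a UTC timestamp followed by 200 or 503.
    for ts, status in poll():
        print(ts, status)
```

Running this against each active node during one of the windows would give timestamps for when readiness flipped, which can then be compared with the live node count graph.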

Thanks,

Ron

Thanks for helping. I think this snapshot shows the info you asked for:

https://snapshot.raintank.io/dashboard/snapshot/EWIwt3Kzsvd224HwxJtrgR6QlGWqvfPx

Hey @alex9627,

I pinged you on Slack in our channel.

Thanks,

Ron