How to debug high service latency on one of the nodes

I have a 3-node CockroachDB (version 2.1.5) cluster running in Kubernetes. My Golang microservice does basic CRUD operations. I am doing a simple load test (200 queries per second) by calling the service's GET API, which runs a SELECT query against the table.

Occasionally, I see a spike in service latency (99th percentile) on one of the nodes in the Admin UI.

I want to understand the spike in the above graph. The Hardware and Runtime dashboards look normal. Below are screenshots of the Hardware dashboard.



I also checked the Statement Details for that query and couldn’t find any issue.

I need help to debug this issue. Thanks in advance.

Hi @Velu,

A couple questions for you that should help us find the answer faster:

  • How are you balancing load between nodes? What does the output of the sql connections graph on the sql dash look like? (you may have to click between nodes on the dropdown on the top left to see if there’s an imbalance). It seems odd that n2 is not serving any queries.
  • What’s the schema for conversion_currency_rates?
  • How many rows / how much data is contained in the table?
  • Where are the nodes located? What is the latency between them?
  • Did you detect any 180ms response times from the db in your application?
  • Were there any other queries on the statements page?

Thanks!

Hi @tim-o - Please find the answers below.

How are you balancing load between nodes?

I am not doing anything explicit for load balancing. In my Go microservice, I am using GORM, which opens a “postgresql” connection.
The connection string is postgresql://root@cockroach-cockroachdb-public.datastore:26257/mydb?sslmode=disable, where the host is the Kubernetes Service name.

What does the output of the sql connections graph on the sql dash look like?

I believe there is some imbalance in serving the query.

I ran a test. This time n1 shows very low latency, whereas n2 has very high latency, though there are connections to all 3 nodes.

  • Node 1

  • Node 2

  • Node 3

What’s the schema for conversion_currency_rates?

How many rows / how much data is contained in the table?

The table has only one row.

Where are the nodes located?

The CRDB nodes run in Kubernetes across 3 availability zones.
AWS Region - ap-southeast-1
n1 AZ - ap-southeast-1c
n2 AZ - ap-southeast-1a
n3 AZ - ap-southeast-1b

What is the latency between them?

admin@ip-10-0-86-66:~$ ping ip-10-0-68-213.ap-southeast-1.compute.internal
64 bytes from ip-10-0-68-213.ap-southeast-1.compute.internal (10.0.68.213): icmp_seq=18 ttl=64 time=0.562 ms
--- ip-10-0-68-213.ap-southeast-1.compute.internal ping statistics ---
18 packets transmitted, 18 received, 0% packet loss, time 17006ms
rtt min/avg/max/mdev = 0.540/0.593/0.656/0.036 ms

admin@ip-10-0-86-66:~$ ping ip-10-0-56-60.ap-southeast-1.compute.internal
15 packets transmitted, 15 received, 0% packet loss, time 14015ms
rtt min/avg/max/mdev = 1.236/1.335/1.425/0.060 ms

admin@ip-10-0-68-213:~$ ping ip-10-0-56-60.ap-southeast-1.compute.internal
--- ip-10-0-56-60.ap-southeast-1.compute.internal ping statistics ---
18 packets transmitted, 18 received, 0% packet loss, time 17017ms
rtt min/avg/max/mdev = 1.229/1.392/3.095/0.415 ms

Did you detect any 180ms response times from the db in your application?

I didn’t capture that in the logs. I will add that and update. However, I can correlate the query latency with my API response time: whenever there is a spike in CRDB service latency, my average response time increases drastically.

Were there any other queries on the statements page?
No, there is only one query on the statements page.

Hi @Velu,

With the test you described, you’re not likely to see all of the benefits of CRDB. When your table has a single row, and you have no load balancer, your performance is going to be limited by how quickly the node you’re connecting to can contact the leaseholder for your single row, and how quickly that leaseholder can return the data to the host. If you’re making concurrent requests, sooner or later you’ll run into transaction contention. Also, if the lease moves away from the node you’re connecting to, you’ll see a jump in latency, though that shouldn’t happen if we’re following the workload, and 180ms is more than we’d expect given all three nodes are in the same region.

A couple follow up questions:

  • How many threads are attempting to connect to the database? (I.e.: how much concurrency?) Are you reusing connections or closing them when the query succeeds?
  • Does this test represent what happens in production? Does this table usually have a single row, and are you always looking it up individually? Or are there more rows in the table and is the query pattern distributed over those rows?
  • What are the machine and disk types for the nodes in your cluster?
  • Can you send over the logs and time stamps for a test where you experience high latency? You can easily collect logs from all three nodes using cockroach debug zip. This will contain some sensitive information, so you can submit it privately by opening an issue at https://support.cockroachlabs.com. I’ll see it come through.

Let me know the answer to those questions while I do some more research about what else might contribute to latency. If any of the terminology I used above is unfamiliar, I’d recommend dedicating time to read the architecture docs. Understanding how we’re distributing and accessing data under the hood will help your testing.