Slowly degrading throughput over time

We have an insert- and update-heavy use case: we need to perform 5,000 inserts and 5,000 updates per second. An experiment has been running on a 5-node CRDB cluster for over 2 days.

The data is loaded by 8 Cockroach Loader Programs (CLPs) written in Go. Each works on batches of 200 records, inserting around 100 rows and updating an equal number. Both inserts and updates are done via bulk INSERT and UPSERT statements, along the lines sketched below.
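To give a rough idea of the statement shape (the column lists and literal values are simplified placeholders that follow the schema shown further down, not our exact statements; f9 stands in for the end-time column here):

-- Session start: one multi-row INSERT per batch (~100 rows per statement).
INSERT INTO db.schema.input (f1, f3, f6, f8, f11)
VALUES ('sess-0001', '10.1.2.3', 1, 1625850000000000, 'started'),
       ('sess-0002', '10.1.2.4', 2, 1625850000000123, 'started');

-- Session end: one multi-row UPSERT per batch (~100 rows per statement),
-- keyed on the primary key (f1, f3, f6, f8) and setting the end time (f9).
UPSERT INTO db.schema.input (f1, f3, f6, f8, f9)
VALUES ('sess-0001', '10.1.2.3', 1, 1625850000000000, 1625850360000000),
       ('sess-0002', '10.1.2.4', 2, 1625850000000123, 1625850420000000);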

Here are 10 minutes of resource-utilization numbers from each of the nodes. N1:

----system---- ----total-cpu-usage---- ---load-avg--- ------memory-usage----- -dsk/total- -net/total-
     time     |usr sys idl wai hiq siq| 1m   5m  15m | used  buff  cach  free| read  writ| recv  send
12-07 06:05:00|  6   2  92   0   0   0|30.8 30.6 30.4|51.6G 12.0k 40.3G 2396M| 129k 2554k|   0     0 
12-07 06:06:00| 66   8  18   4   0   4|29.4 30.2 30.3|51.8G 12.0k 40.6G 1750M|4099k   32M|  19M   44M
12-07 06:07:00| 67   8  18   3   0   4|27.7 29.4 30.0|52.6G 12.0k 41.3G  257M|5837k   35M|  21M   48M
12-07 06:08:00| 67   8  18   4   0   4|29.1 29.4 30.0|52.1G 12.0k 41.4G  738M|6270k   33M|  21M   45M
12-07 06:09:00| 67   8  18   3   0   4|33.0 30.4 30.2|52.2G 12.0k 41.3G  737M|2773k   33M|  17M   45M
12-07 06:10:00| 66   8  18   4   0   4|28.8 29.9 30.1|52.5G 12.0k 41.4G  325M|8445k   37M|  23M   49M
12-07 06:11:00| 66   8  18   4   0   4|27.0 29.3 29.9|52.8G 12.0k 41.1G  254M|8561k   33M|  23M   45M
12-07 06:12:00| 66   8  18   4   0   4|28.7 29.4 29.9|52.1G 12.0k 40.3G 1897M|6892k   34M|  21M   45M
12-07 06:13:00| 66   8  18   4   0   4|30.2 29.8 30.0|51.7G 12.0k 40.0G 2488M|3431k   32M|  18M   44M
12-07 06:14:00| 67   8  18   4   0   4|31.0 30.0 30.1|52.4G 12.0k 40.6G 1137M|3985k   32M|  18M   44M
12-07 06:15:00| 66   8  18   3   0   4|32.4 30.5 30.2|52.0G 12.0k 40.8G 1454M|4890k   33M|  20M   45M

N2:

----system---- ----total-cpu-usage---- ---load-avg--- ------memory-usage----- -dsk/total- -net/total-
     time     |usr sys idl wai hiq siq| 1m   5m  15m | used  buff  cach  free| read  writ| recv  send
12-07 06:05:00|  6   1  92   0   0   0|27.3 27.3 27.0|48.1G 3976k 45.8G  349M| 127k 2624k|   0     0 
12-07 06:06:00| 67   8  18   3   0   4|26.3 27.0 26.9|47.2G 3976k 45.1G 1993M|2341k   39M|  17M   50M
12-07 06:07:00| 68   7  18   3   0   4|26.4 26.9 26.8|47.6G 3976k 45.6G  959M|3268k   44M|  18M   56M
12-07 06:08:00| 67   8  18   3   0   4|27.6 26.9 26.8|47.7G 3976k 45.6G  888M|4119k   42M|  19M   54M
12-07 06:09:00| 67   8  19   3   0   4|26.9 26.9 26.8|47.9G 3976k 45.9G  417M|3450k   42M|  18M   53M
12-07 06:10:00| 66   8  18   3   0   4|28.6 27.4 27.0|48.0G 3976k 46.0G  262M|4097k   40M|  18M   51M
12-07 06:11:00| 67   8  18   3   0   4|29.7 28.0 27.2|47.9G 3976k 45.4G  904M|4076k   41M|  18M   52M
12-07 06:12:00| 66   8  20   3   0   4|26.1 27.1 27.0|48.0G 3976k 45.5G  714M|2201k   37M|  16M   48M
12-07 06:13:00| 66   8  19   3   0   4|29.7 27.9 27.3|48.2G 3976k 45.7G  283M|2202k   42M|  17M   53M
12-07 06:14:00| 66   8  19   3   0   4|25.9 27.0 27.0|47.9G 3976k 45.8G  567M|4405k   45M|  19M   57M
12-07 06:15:00| 67   8  18   3   0   4|26.6 26.9 26.9|47.9G 3976k 45.8G  505M|2078k   43M|  17M   55M

N3:

----system---- ----total-cpu-usage---- ---load-avg--- ------memory-usage----- -dsk/total- -net/total-
     time     |usr sys idl wai hiq siq| 1m   5m  15m | used  buff  cach  free| read  writ| recv  send
12-07 06:05:00|  6   1  92   0   0   0|30.1 28.6 28.0|46.2G 4132k 47.7G  281M| 142k 2640k|   0     0 
12-07 06:06:00| 67   7  18   3   0   4|29.9 28.9 28.2|45.6G 4132k 46.9G 1642M|2859k   40M|  19M   53M
12-07 06:07:00| 68   7  18   3   0   4|29.3 28.9 28.2|45.8G 4132k 47.1G 1269M|1411k   43M|  18M   56M
12-07 06:08:00| 68   7  19   3   0   4|28.9 28.8 28.2|46.1G 4132k 47.4G  685M|1506k   42M|  17M   55M
12-07 06:09:00| 69   7  18   3   0   3|26.5 28.4 28.1|44.8G 4132k 47.2G 2131M|1635k   42M|  17M   55M
12-07 06:10:00| 68   7  19   3   0   4|27.4 28.3 28.1|45.3G 4132k 47.4G 1568M|1386k   39M|  17M   52M
12-07 06:11:00| 68   7  19   2   0   4|26.6 27.9 28.0|45.7G 4132k 47.6G  892M|1795k   39M|  17M   52M
12-07 06:12:00| 67   7  20   2   0   4|28.9 28.2 28.1|45.0G 4132k 47.8G 1435M|2961k   42M|  18M   54M
12-07 06:13:00| 67   7  19   3   0   4|27.7 28.0 28.0|45.7G 4132k 47.8G  717M|2231k   42M|  18M   55M
12-07 06:14:00| 69   7  18   3   0   4|29.5 28.4 28.1|46.1G 4132k 47.6G  396M|5861k   43M|  22M   56M
12-07 06:15:00| 68   7  18   3   0   4|28.1 28.1 28.1|45.8G 4132k 47.0G 1394M|3918k   41M|  20M   54M

N4:

----system---- ----total-cpu-usage---- ---load-avg--- ------memory-usage----- -dsk/total- -net/total-
     time     |usr sys idl wai hiq siq| 1m   5m  15m | used  buff  cach  free| read  writ| recv  send
12-07 06:21:01|  3   1  96   0   0   0|10.8 12.0 12.1|44.0G 6276k 49.8G  444M|3959B 1271k|   0     0 
12-07 06:22:01| 37   5  55   2   0   1|10.9 11.9 12.0|43.6G 6276k 49.6G 1054M|1205k   39M|8815k   48M
12-07 06:23:01| 36   5  55   2   0   1|13.0 12.3 12.1|43.8G 6276k 49.9G  527M|2402k   37M|9765k   45M
12-07 06:24:01| 36   5  56   2   0   1|11.6 11.8 12.0|43.7G 6276k 49.7G  746M|  19k   38M|7363k   46M
12-07 06:25:01| 35   5  56   2   0   1|16.0 13.0 12.4|43.8G 6276k 49.8G  569M|1240k   39M|8679k   47M
12-07 06:26:01| 34   5  57   2   0   1|12.8 12.6 12.3|44.1G 6276k 49.9G  280M|2517k   37M|9697k   45M
12-07 06:27:01| 34   5  57   2   0   1|11.6 12.2 12.2|44.0G 6276k 49.7G  501M| 474k   37M|7558k   45M
12-07 06:28:01| 35   5  57   2   0   1|11.3 12.2 12.1|43.9G 6276k 49.6G  680M|1425k   38M|8795k   46M
12-07 06:29:01| 39   5  52   2   0   2|11.6 12.0 12.1|44.2G 6276k 49.8G  259M|1442k   38M|9421k   47M
12-07 06:30:01| 37   5  54   2   0   1|13.7 12.6 12.3|44.2G 6276k 49.4G  619M| 538k   39M|8256k   48M
12-07 06:31:01| 35   5  56   2   0   1|13.0 12.6 12.3|44.2G 6276k 49.6G  467M|1663k   38M|9000k   46M

N5:

----system---- ----total-cpu-usage---- ---load-avg--- ------memory-usage----- -dsk/total- -net/total-
     time     |usr sys idl wai hiq siq| 1m   5m  15m | used  buff  cach  free| read  writ| recv  send
12-07 06:21:00|  1   1  98   0   0   0|9.35 11.2 11.6|41.6G 4244k 51.8G  798M|2754B  932k|   0     0 
12-07 06:22:00| 40   4  52   2   0   2|16.3 12.7 12.0|41.7G 4244k 51.7G  859M| 758k   35M|9032k   44M
12-07 06:23:00| 39   4  53   2   0   1|13.0 12.3 11.9|41.8G 4244k 51.8G  663M|1565k   34M|9636k   42M
12-07 06:24:00| 39   4  53   2   0   1|14.4 12.8 12.1|41.4G 4244k 51.8G 1030M| 191k   35M|8273k   43M
12-07 06:25:00| 40   4  52   2   0   1|12.8 12.6 12.1|41.6G 4244k 51.8G  799M|1205k   35M|9322k   43M
12-07 06:26:00| 38   4  54   2   0   1|14.4 12.9 12.2|41.6G 4244k 51.8G  869M| 212k   36M|8041k   44M
12-07 06:27:00| 39   4  54   2   0   1|15.6 13.7 12.5|41.7G 4244k 51.8G  760M|  74k   35M|8046k   43M
12-07 06:28:00| 40   4  52   2   0   1|13.3 13.4 12.5|41.2G 4244k 51.8G 1213M|2818k   35M|  11M   44M
12-07 06:29:00| 43   5  48   2   0   2|15.7 14.1 12.8|41.4G 4244k 52.1G  705M|3089k   35M|  12M   45M
12-07 06:30:00| 41   5  50   2   0   2|12.6 13.6 12.7|41.5G 4244k 51.9G  812M| 792k   35M|9254k   44M
12-07 06:31:00| 40   4  52   2   0   1|13.1 13.5 12.7|41.3G 4244k 52.0G  951M| 338k   36M|8160k   44M

Each node has the following specs:

  • 24 vCPUs (Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz)
  • 100 GB RAM
  • 1 TB XFS filesystem for Cockroach DB (iSCSI SAN)

Device saturation (reported by sar -d) is higher on N1 than on the rest: a little over 70% on the other nodes, but over 90% on N1. This was not the case when we started the experiment; the skew has appeared only in the last few hours.

We have just one table, input:

CREATE TABLE IF NOT EXISTS db.schema.input (
    f1 STRING(16),
    f2 STRING(16),
    f3 INET,
    f4 INET,
    f5 INET,
    f6 INT4,
    f7 INT4,
    f8 INT,
    f9 INT DEFAULT 0,
    f10 INT2,
    f11 STRING,
    f12 STRING,
    f13 STRING,
    f14 STRING,  -- column type assumed; it was missing from the listing
    CONSTRAINT "primary" PRIMARY KEY (f1, f3, f6, f8),
    INDEX index_1 (f3, f6, f7, f8),
    INDEX index_2 (f8, f9) USING HASH WITH BUCKET_COUNT = 8
);

One crucial fact might be that f8 and f9 are timestamps representing microseconds since the Epoch.

Any advice will be highly appreciated.

Hi @pdp welcome to the forum!

As you’ve already realized, hash-sharded indexes are our recommended way to index time-series.

You mentioned that there are UPSERTs in the workload as well. How frequently are you updating existing rows? If there are one or a few rows that keep getting updated, that will cause a hot range and will also affect garbage collection of stale data. You could experiment with a lower gc.ttlseconds for this table; see the docs on it here.
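For example, something along these lines against your input table (the value here is only a starting point to experiment with; the default is 25 hours):

ALTER TABLE db.schema.input CONFIGURE ZONE USING gc.ttlseconds = 600;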

I’d recommend taking a look at the /_status/hotranges endpoint of the DB console and checking whether there are any hot ranges on node 1.

Hello @rafiss, thank you for the welcome and the response. In our use case we should be updating a row exactly once (unless there is a bug in our loader code), because our data flow has just two parts:

  • Insert row with several columns when a session starts
  • Update end time when the session ends

Both operations are done on the single input table. To check whether we were erring during updates, we temporarily maintained a second table and replaced each update with an insert into that table. This experiment ran for over 6 hours:


See the region marked by the red circle in the plot: there was still constant degradation.

Thanks for trying those changes out. That’s helpful info.

I’d still recommend looking at the /_status/hotranges endpoint to see if it tells you about any hot ranges.

Also, just to confirm, how are the columns in the other indexes generated? If f1 or f3 are inserted in increasing order, that would also cause contention during INSERTs.

Hello Rafi! Yes, there are many hot ranges, but we are not able to correlate that output with the SHOW RANGES output. For instance, the following image shows that the range with ID 510 is a hot range:


But there is no such ID in the SHOW RANGES output. That command’s output contains range start and end information, which would have given us a hint. Maybe we are missing something simple; please suggest.
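For reference, we are listing the ranges with something like:

SHOW RANGES FROM TABLE db.schema.input;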

Hm, that’s unfortunate about SHOW RANGES not making it easy to find that range. Luckily you can try another query:

SELECT * FROM crdb_internal.ranges WHERE range_id = 510;

This will give you the table name and hopefully more information.

Though I am a bit confused about how startKey and endKey are both null. That means the range is empty, so I don’t understand why it would be a hot range. This isn’t my area of expertise, so I’ll investigate a bit more and let you know what this could mean.

Hello @rafiss, on one setup we have 178 hot ranges, but all of them have null as startKey and endKey. We were able to find a few of them that relate to the input table; the SHOW RANGES output was not null for those. Example:

start_key | end_key | range_id | range_size_mb
/"1202203988"/"\x01\x80\xabͫ\xcd3\xbc\x81\\\x00\x00\x00\x00\x00\x00\x00\x00"/0/1625852827142735 | /"1202252971"/"\x00 \x0f\r\xe7\xc3"/1024/1625851243356855 | 1474 | 373.519671

This data is from our data generator and not actual traffic. Responding to:

Also, just to confirm, how are the columns in the other indexes generated? If f1 or f3 are inserted in increasing order, that would also cause contention during INSERTs.

Columns f1 and f3 are random values from the data generator.
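Following your crdb_internal.ranges suggestion, a lookup of this shape can also map several hot-range IDs at once (the range IDs below are just examples, and the column names assume the crdb_internal.ranges schema in our CockroachDB version):

SELECT range_id, table_name, start_pretty, end_pretty
FROM crdb_internal.ranges
WHERE range_id IN (510, 1474);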

There is some exciting news. We added three new columns (day, hour, and minute) to the primary key. These three are derived from f8 (microseconds since the Epoch), and we have not yet seen a throughput degradation:

Ignore the transient drop, which was due to overloading the Kafka cluster by running too many consumers. We have processed over a billion records in 27 hours! This does appear counter-intuitive, because we thought improving the distribution would be better. Perhaps, with the earlier schema, the number of affected ranges would increase with time and reduce the throughput. But our data rates are so high (10,000 records/s) that it is better to contain the number of active ranges. Adding a time-based prefix ensures that a similar number of ranges are affected at any point in time. Your insight will be helpful.
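For illustration, the new key is shaped roughly like this (the table name and the derived-column expressions are simplified, not our exact DDL; f8 holds microseconds since the Epoch):

CREATE TABLE IF NOT EXISTS db.schema.input_v2 (
    f1 STRING(16),
    f3 INET,
    f6 INT4,
    f8 INT,
    f9 INT DEFAULT 0,
    -- ...remaining columns as in the original table...
    day INT AS (f8 // 86400000000) STORED,          -- whole days since the Epoch
    hour INT AS ((f8 // 3600000000) % 24) STORED,   -- hour of the day
    minute INT AS ((f8 // 60000000) % 60) STORED,   -- minute of the hour
    CONSTRAINT "primary" PRIMARY KEY (day, hour, minute, f1, f3, f6, f8)
);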

I’m glad you’re getting better results! I’m surprised as well that adding a time-based prefix improved the throughput. If you are still planning to investigate, maybe you could try different numbers of workers. I’d expect that as the number of workers increases, the time-based key might start performing worse.

Unfortunately, the slow-degradation problem has not gone away. We carried on the experiment that I previously mentioned for over 100 hours and were able to see it clearly. By this time we had also updated the indexes (to improve searches) and started a new run. The following covers over 36 hours, and the reduction in throughput can still be seen:

Our use case is simple: every row is updated at most once after insertion. Do you see any scenario in which such a degradation should occur?

That’s unfortunate. I don’t know what specifically would be causing this behavior. My main recommendation is to tweak different things about your workload: experiment with changing one variable at a time, and record the results to see if there’s any difference. For example, changing the number of workers could be a good experiment to try. I see that each node has 24 vCPUs and it’s a 5-node cluster. A good ballpark for the number of workers this can support is 24 * 4 * 5 = 480, based on these docs.

Some other questions you should try to answer:

  • Do you see any useful information on the Statements page? Statements Page | CockroachDB Docs
  • How does the latency (average, median, p95, p99) of the query change over time?
  • How many active queries do you see in the DB console? (A quick way to sample this from SQL is sketched below.)
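For that last question, a rough way to sample the count of active queries from SQL (using the bracketed SHOW syntax; depending on your version, SHOW CLUSTER STATEMENTS may be the preferred form) is:

SELECT count(*) FROM [SHOW CLUSTER QUERIES];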