Trying to split same key range


#1

In my 3 node cluster I am seeing lot of logs like below in one of the node.

I180430 12:20:40.574427 62167 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33584]

...

I180430 12:20:46.464471 64104 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33585]

...

I180430 12:21:09.918581 69380 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33589]

...

I180430 12:22:36.947989 116221 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33604]

...

I180430 12:24:30.329431 169358 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33625]

...

I180430 12:32:00.686331 384299 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33706]
I180430 12:47:28.006963 851360 storage/replica_command.go:862  [split,n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/"_activity"/114510941/1524994500000 [r33864]
W180430 12:47:34.281894 855708 storage/replica.go:3022  [n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] context canceled before proposing: 1 HeartbeatTxn
I180430 12:47:34.883270 113 server/status/runtime.go:219  [n2] runtime stats: 1.2 GiB RSS, 176 goroutines, 522 MiB/344 MiB/926 MiB GO alloc/idle/total, 161 MiB/201 MiB CGO alloc/total, 123688.04cgo/sec, 1.89/0.23 %(u/s)time, 0.01 %gc (5x)
  1. It is trying to split the same key range it seems. Is this expected?
  2. Is there a way I can figure out more details about the split like what index it is trying to split? I tried executing the experimental range command. But it is throwing me the below error
root@:26257/vnera> SHOW EXPERIMENTAL_RANGES FROM TABLE config_str;
invalid syntax: statement ignored: syntax error at or near "from"

(Rebecca Taft) #2

Hi tuk - can you please tell me what server version you’re running? I believe this experimental_ranges feature was only enabled recently.


#3
root@platform1:/home/ubuntu/build-target/infra/cockroach# cockroach version
Build Tag:    v2.0.0
Build Time:   2018/04/03 20:56:09
Distribution: CCL
Platform:     linux amd64 (x86_64-unknown-linux-gnu)
Go Version:   go1.10
C Compiler:   gcc 6.3.0
Build SHA-1:  a6b498b7aff14234bcde23107b9e7fa14e6a34a8
Build Type:   release

(Raphael 'kena' Poss) #4

The name “experimental_ranges” was only introduced in CockroachDB 2.0.1 and 2.1-alpha - it is not available in 2.0.0. Perhaps you can upgrade?


(Rebecca Taft) #5

Thanks! I’m checking with some folks internally - will get back to you shortly.


#6

Thanks @becca and @knz. I will try to upgrade.

Any thoughts on my first query?


(Nikhil Benesch) #7

Alternatively, you can use the old syntax on your current version: SHOW TESTING_RANGES FROM TABLE t

It’s trying to split index ID “4” on that table. (The index ID is the number after the table ID in the key, so /Table/54/4 indicates table ID 54, index ID 4.) You can map index names to index IDs with a query like the following:

SELECT * FROM crdb_internal.table_indexes WHERE descriptor_id = 54;

We’ve seen behavior like this when a range gets to be excessively large. Would you be willing to share a screenshot of the debug page for that range? http://HOSTNAME:8080/#/reports/range/186

And to answer your query: nope, it’s definitely not expected.


(nathan) #8

Another thing to look for is other logs with [split,n2,s2,r186 in them between the initiating a split logs. Something should be telling us why the split is failing.


#9

@benesch - Please find the screenshot attached.


#10

@nathan - The cockroach logs are attached. I am not seeing any error.

cockroach.ttf (751.2 KB)


(Nikhil Benesch) #11

FWIW I’ve been surprised in the past by the lack of error logging in the split code path.


(Nikhil Benesch) #12

Thanks for posting that screenshot, @tuk. It does indeed look like this range is oversize. It’s approximately 215MiB by my count. You haven’t upped the default range size (64MiB), have you?


(Nikhil Benesch) #13

Also, @nathan, the log is full of lines like this:

W180430 14:17:51.288997 3652761 storage/replica.go:3022  [n2,s2,r186/3:/Table/54/{3/10107…-4/1921-…}] context canceled before proposing: 1 HeartbeatTxn

Does that ring a bell? I remember you mentioning a bug in which transactions could take so long to get started that they’d be canceled before they could send out a heartbeat.


#14

@benesch - I did not change the default range size. Just curious how to change the default range size?

initiating a split of this range at key /Table/54/4/1921-09-04T06:01:02.831999999Z/1010711/605/

Also can you let me know what does 1921-09-04T06:01:02.831999999Z denote in the above log?


(nathan) #15

I was hoping that we would hit this log, but I’m guessing we’re re-queuing the split before we hit that line. I need to re-work a bit of this for a different issue, so I’ll make sure to improve logging in this scenario.

Does that ring a bell?

That might indicate some slowdown, but it’s not what we saw with https://github.com/cockroachdb/cockroach/issues/23945.

what does 1921-09-04T06:01:02.831999999Z denote in the above log

That would be a value in the first column of the fourth index of table 54. It looks like a timestamp to me.

@tuk is there any load on this table?


(nathan) #16

@tuk can you take a look at the page http://HOSTNAME:8080/debug/logspy?count=100&duration=30s&grep=\[split,n2,s2,r186 on node 2 and let it sit for 30 seconds before grabbing the output? This page temporarily captures all the verbose logs on that node that match the grep pattern, i.e. all the logs related to splits of r186. Note that this may contain actual data stored in the database, so be careful about sharing it verbatim if you’re storing real data.


#17

@nathan - I ran this in a browser

http://HOSTNAME:8091/debug/logspy?count=100&duration=30s&grep=\[split,n2,s2,r186\]

It is showing me the below output

I180430 16:19:44.876184 6538527 server/debug/logspy.go:183  intercepting logs with options {Count:100 Duration:30s Grep:\[split,n2,s2,r186\]}

#18

@nathan Find the output attached. You can rename the file to .txt output3.ttf (44.2 KB)


(nathan) #19

Thanks @tuk! It looks like \] was accidentally appended to the grep pattern. Could you run it again without that suffix? In fact, could you remove the \[ at the beginning of the pattern as well?


(Nikhil Benesch) #20

@nathan I filed an issue for this so it doesn’t get lost.