Very high CPU usage on one of 4 nodes in cluster

(yeriahn) #1

I ran a performance test against CockroachDB.
I sent requests to CockroachDB using an INSERT statement like 'insert into tablea values (pk…, 20KBstringvalue)'.
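
For reference, the writes looked roughly like this. This is only a sketch; the table schema, key format, and the --insecure flag below are made up for illustration, not my real setup:

# roughly the kind of write used in the test (hypothetical schema and key)
cockroach sql --insecure -e "CREATE TABLE IF NOT EXISTS tablea (pk STRING PRIMARY KEY, val STRING)"
# repeat('x', 20480) generates a string value of roughly 20 KB
cockroach sql --insecure -e "INSERT INTO tablea VALUES ('key-00000001', repeat('x', 20480))"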

However, I ran into an issue.
I set up a cluster of 4 nodes.
During the test, just one node's CPU usage went up to 70%, while the other nodes used only about 25%.
So I checked the logs and found the following…

There are many log lines like this:

W190515 07:16:44.624074 688 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.624297 974 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.624442 943 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.624503 937 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.624605 938 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.624835 957 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.624925 940 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.625124 950 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.625258 934 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.702715 948 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.703013 955 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.703310 967 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.703436 933 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.703779 959 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]
W190515 07:16:44.703945 972 storage/store.go:3931 [n2,s2] handle raft ready: 0.8s [processed=0]

Sometimes this log appeared:
W190515 07:20:00.406901 938 storage/engine/rocksdb.go:1911 batch [28/242954/3] commit took 594.865795ms (>500ms):
goroutine 938 [running]:
runtime/debug.Stack(0x30648b25, 0xed46db49f, 0x0)
/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc434ba06c0, 0xc446712000, 0xecde, 0x10000)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1912 +0x1b3
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc434ba06c0, 0xed46db400, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1828 +0x7ed
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applyRaftCommand(0xc42ae60c00, 0x305ad40, 0xc42c79e540, 0xc4261ce520, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0, …)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5920 +0x417
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc42ae60c00, 0x305ad40, 0xc42c79e540, 0xc4261ce520, 0x8, 0x14, 0x1da7a, 0x200000002, 0x2, 0x12, …)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5498 +0x576
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc42ae60c00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, …)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:4362 +0x1118
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x305ad40, 0xc4317210b0, 0xc42ae60c00, 0x305ad40)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3873 +0x109
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc4202b5b80, 0x305ad40, 0xc4317210b0, 0xc43abf29c0, 0xc440255ed0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3519 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc4202b5b80, 0x305ad40, 0xc42a443980, 0x74a8)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3861 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420cc6200, 0x305ad40, 0xc42a443980)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x305ad40, 0xc42a443980)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc4258fd5f0, 0xc42002cf30, 0xc4258fd5e0)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:199 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xad

Near the end of the test, this log appeared:
E190515 07:33:29.471638 274051 storage/queue.go:845 [n2,raftsnapshot,s2,r14876/2:/Table/54/1/"1490{2512…-3071…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.472708 274052 storage/queue.go:845 [n2,raftsnapshot,s2,r13751/2:/Table/54/1/"0443{2465…-3025…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.496444 274053 storage/queue.go:845 [n2,raftsnapshot,s2,r32922/2:/Table/54/1/"1375{0552…-1055…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.505946 274054 storage/queue.go:845 [n2,raftsnapshot,s2,r33412/2:/Table/54/1/"256{7811…-8221…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.526201 274055 storage/queue.go:845 [n2,raftsnapshot,s2,r1736/2:/Table/54/1/"04413{156…-737…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.535250 274056 storage/queue.go:845 [n2,raftsnapshot,s2,r32891/2:/Table/54/1/"13736{293…-834…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.549482 274057 storage/queue.go:845 [n2,raftsnapshot,s2,r29/2:/Table/54/1/"05655{412…-980…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.555808 274058 storage/queue.go:845 [n2,raftsnapshot,s2,r29564/2:/Table/54/1/"255{2371…-3101…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.557196 274059 storage/queue.go:845 [n2,raftsnapshot,s2,r33256/2:/Table/54/1/"25{49951…-50431…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.567968 274060 storage/queue.go:845 [n2,raftsnapshot,s2,r862/2:/Table/54/1/"04422{355…-874…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.573348 274061 storage/queue.go:845 [n2,raftsnapshot,s2,r33296/2:/Table/54/1/"2548{3811…-7711…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.612512 274062 storage/queue.go:845 [n2,raftsnapshot,s2,r14376/2:/Table/54/1/"1489{5709…-6256…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.629354 274063 storage/queue.go:845 [n2,raftsnapshot,s2,r28666/2:/Table/54/1/"0441{2616…-3156…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.645404 274064 storage/queue.go:845 [n2,raftsnapshot,s2,r28420/2:/Table/54/1/"0442{1792…-2355…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.650800 274065 storage/queue.go:845 [n2,raftsnapshot,s2,r4147/2:/Table/54/1/"1439{8498…-9024…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.661331 274066 storage/queue.go:845 [n2,raftsnapshot,s2,r14654/2:/Table/54/1/"0442{7897…-8452…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.670442 274067 storage/queue.go:845 [n2,raftsnapshot,s2,r7115/2:/Table/54/1/"0565{7752…-8337…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.678821 274068 storage/queue.go:845 [n2,raftsnapshot,s2,r8458/2:/Table/54/1/"13669{229…-767…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.690590 274069 storage/queue.go:845 [n2,raftsnapshot,s2,r17387/2:/Table/54/1/"1440{3925…-4482…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.691704 274070 storage/queue.go:845 [n2,raftsnapshot,s2,r12868/2:/Table/54/1/"1456{7522…-8064…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.692506 274071 storage/queue.go:845 [n2,raftsnapshot,s2,r13426/2:/Table/54/1/"0565{6529…-7139…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.719141 274072 storage/queue.go:845 [n2,raftsnapshot,s2,r6414/2:/Table/54/1/"14557{033…-580…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.731474 274073 storage/queue.go:845 [n2,raftsnapshot,s2,r7075/2:/Table/54/1/"05833{378…-956…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.733287 274074 storage/queue.go:845 [n2,raftsnapshot,s2,r33274/2:/Table/54/1/"2549{1011…-5411…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.740651 274075 storage/queue.go:845 [n2,raftsnapshot,s2,r27210/2:/Table/54/1/"04417{270…-839…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.742026 274076 storage/queue.go:845 [n2,raftsnapshot,s2,r14432/2:/Table/54/1/"05668{125…-713…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.765054 274077 storage/queue.go:845 [n2,raftsnapshot,s2,r26996/2:/Table/54/1/"0532{3676…-4223…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.769685 274078 storage/queue.go:845 [n2,raftsnapshot,s2,r14422/2:/Table/54/1/"04421{209…-792…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.779079 274079 storage/queue.go:845 [n2,raftsnapshot,s2,r6694/2:/Table/54/1/"0443{3591…-4171…}] snapshot failed: unable to dial n1: breaker open
E190515 07:33:29.780977 274080 storage/queue.go:845 [n2,raftsnapshot,s2,r32614/2:/Table/54/1/"2569{0411…-5311…}] snapshot failed: unable to dial n1: breaker open

I found this issue on GitHub (https://github.com/cockroachdb/cockroach/issues/25346),
but that issue was reported against version 1.x.
I am using CockroachDB version 2.1.6.
So, how can I fix this issue?

(Ron Arévalo) #2

Hey @yeri,

Could you provide us with a CPU profile? You can get one by going to the URL <HOST>/debug/pprof/profile?debug=1.
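
For example, something along these lines should save the profile to a file you can share (adjust the host, and add the HTTP port if needed):

# capture the CPU profile from the node's HTTP endpoint and save it to a file
curl -o cpu.pprof 'http://<HOST>/debug/pprof/profile?debug=1'
# the saved profile can then be inspected locally, e.g. with Go's pprof tool
go tool pprof cpu.pprof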

You may be running out of CPU (which seems likely) or disk I/O. Could you let me know what kind of machine you're using and how it is provisioned?

Thanks,

Ron

(yeriahn) #3

Most of the CPU usage was iowait, around 50%.
The machines use internal HDDs,
with 8 cores and 32 GB of memory each.
The cluster has about 30,000 ranges.
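
(I was watching iowait and disk utilization during the test with something like the following:)

# CPU stats (including %iowait) plus extended per-device stats, refreshed every 2 seconds
iostat -cx 2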

How can I resolve the issue?

(Ron Arévalo) #4

Hey @yeri,

I just want to confirm: when you say the machine has an internal HDD, are you referring to volumes backed by a spinning hard disk? If so, that could be what is causing your issue. For the best performance, we suggest using SSDs or NVMe devices. The recommended volume size is 300-500 GB.

You may want to read through our production checklist, specifically the section on hardware.

Thanks,

Ron

(yeriahn) #5

All nodes in the cluster have the same spec (internal HDD, 8 cores, 32 GB of memory)
and the same storage usage, 700 GB out of 5 TB.
One node has about 20,000 SSTable files (like 0000.sst),
while the others have about 15,000.
There are about 30,000 ranges,
and they are spread across the nodes like this:
1: 20000, 2: 20010, 3: 19000, 4: 19000, 5: 19000

But just one node's Raft log commit was very slow.
The Raft command commit latency was high on that one node only.
And when I took that node down, another node's CPU went up to 70%.
So… I don't understand why just one node's CPU is so high.
Why?
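
One thing I still want to check is whether the leaseholders are spread as evenly as the replicas, since a leaseholder imbalance could make a single node run hot. Something like this should show the per-node leaseholder counts (I'm assuming the crdb_internal.ranges table and its lease_holder column are available in 2.1.6):

# count how many ranges each node currently holds the lease for
# (crdb_internal.ranges and its lease_holder column are assumed to be available)
cockroach sql --insecure -e "SELECT lease_holder, count(*) AS leases FROM crdb_internal.ranges GROUP BY lease_holder ORDER BY lease_holder"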

I also have a CPU profile,
but because of company policy I can't post it on this forum.
Could you let me know an email address?

(Ron Arévalo) #6

Hey @yeri,

Sure, you can email me the CPU profile here. Could you also provide some metrics for reads/sec and writes/sec? You can use iostat -dx and send over the results in that same email.
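
For example, something like this would capture a minute's worth of samples into a file you can attach (the interval and sample count are just suggestions):

# extended per-device stats, one sample per second for 60 seconds, written to a file
iostat -dx 1 60 > iostat.txt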

Thanks,

Ron

(yeriahn) #7

I sent you an email. Could you check it?

Thanks,
Yeri

(Ron Arévalo) #8

Hi Yeri,

Thank you, I received the email.
For the iostat portion I was hoping to see something like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0             0.00     0.00    0.00    0.00     0.00     0.00     5.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               0.00     1.69    1.90    0.98    43.77    28.68    50.35     0.00    1.49    0.86    2.72   0.12   0.03
sdb               0.00   145.16   27.81  406.88  1928.97 107091.34   501.60     0.54    1.53    0.56    1.60   0.22   9.68

That output came from running iostat -dx.

Thanks,

Ron

(yeriahn) #9

Hi Ron,
OK… I sent the iostat output from the CockroachDB servers during the test period.
(iostat -dx 1)
Please tell me why this situation occurred.

Thanks,
Yeri

(Ron Arévalo) #10

Hey @yeri,

I was looking over the stats you provided. I am still fairly confident that the issue lies in the fact that you’re using HDDs for storage.

From the iostat logs you sent over, all servers show a significant amount of CPU utilization, with 206 and 207 being the worst.

Is there any way you can try testing the same workload using SSD drives instead?

Thanks,

Ron