Understanding traces and tuning performance

Hi there,

I’ve been tinkering with CockroachDB for a while now, and I seem to be having performance issues when deploying multi-region clusters on AWS (single-region, multi-zone clusters work fine).

I am unsure what extra changes would be needed when spanning multiple regions.
Currently we are spanning ap-southeast-2 (Australia) and ap-southeast-1 (Singapore).

We seem to have the odd query that performs quite slowly; I would say maybe 1 in 50. (Almost all our queries are read-only; we currently have very few writes [fewer than 1 per day].)
There doesn’t seem to be any correlation between which query runs and how slowly it performs; it can happen to any query, with no indication that only specific queries are affected.

I’ve run some basic traces (SELECT * FROM table) and noticed a few odd entries.
If someone could take a look at these traces and point me in the right direction on where to start tinkering, that would be much appreciated.

Traces (in CSV format)
2018-03-20 22:54:56.472476,0s,=== SPAN START: sql txn implicit ===,,sql txn implicit,"(0,0)"
2018-03-20 22:54:58.110439,1s637ms962µs893ns,=== SPAN START: starting plan ===,,starting plan,"(0,1)"
2018-03-20 22:54:58.110475,1s637ms998µs662ns,=== SPAN START: consuming rows ===,,consuming rows,"(0,2)"
2018-03-20 22:54:58.110536,1s638ms59µs848ns,querying next range at /Table/69/1,"[client=*REDACTED*:51136,user=techadmin,n6]",sql txn implicit,"(0,0)"
2018-03-20 22:54:58.110863,1s638ms387µs214ns,"r50: sending batch 1 Scan to (n5,s5):8","[client=*REDACTED*:51136,user=techadmin,n6]",sql txn implicit,"(0,0)"
2018-03-20 22:54:58.110890,1s638ms414µs291ns,sending request to db04.db.ap-southeast-2.*REDACTED*:26257,"[client=*REDACTED*:51136,user=techadmin,n6]",sql txn implicit,"(0,0)"
2018-03-20 22:54:58.110912,1s638ms435µs696ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,3)"
2018-03-20 22:54:58.111649,1s639ms173µs147ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:58.111690,1s639ms213µs807ns,1 Scan,[n5],/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:58.111704,1s639ms228µs57ns,executing 1 requests,"[n5,s5]",/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:58.111716,1s639ms239µs297ns,read-only path,"[n5,s5,r50/8:/Table/{69-70}]",/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:58.111726,1s639ms249µs995ns,command queue,"[n5,s5,r50/8:/Table/{69-70}]",/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:58.111736,1s639ms260µs175ns,waiting for read lock,"[n5,s5,r50/8:/Table/{69-70}]",/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:58.114563,1s642ms87µs104ns,read completed,"[n5,s5,r50/8:/Table/{69-70}]",/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:54:59.398526,2s926ms49µs546ns,plan completed execution,"[client=*REDACTED*:51136,user=techadmin,n6]",consuming rows,"(0,2)"
2018-03-20 22:54:59.398529,2s926ms53µs228ns,"resources released, stopping trace","[client=*REDACTED*:51136,user=techadmin,n6]",consuming rows,"(0,2)"


2018-03-20 22:29:13.918242,0s,=== SPAN START: sql txn implicit ===,,sql txn implicit,"(0,0)"
2018-03-20 22:29:13.918743,501µs98ns,=== SPAN START: starting plan ===,,starting plan,"(0,1)"
2018-03-20 22:29:13.918753,511µs250ns,=== SPAN START: consuming rows ===,,consuming rows,"(0,2)"
2018-03-20 22:29:13.918809,567µs790ns,querying next range at /Table/57/1,"[client=*REDACTED*:50370,user=techadmin,n5]",sql txn implicit,"(0,0)"
2018-03-20 22:29:13.918854,612µs811ns,"r38: sending batch 1 Scan to (n6,s6):8","[client=*REDACTED*:50370,user=techadmin,n5]",sql txn implicit,"(0,0)"
2018-03-20 22:29:13.918873,631µs112ns,sending request to db02.db.ap-southeast-2.*REDACTED*:26257,"[client=*REDACTED*:50370,user=techadmin,n5]",sql txn implicit,"(0,0)"
2018-03-20 22:29:13.918891,649µs109ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,3)"
2018-03-20 22:29:31.667061,17s748ms818µs877ns,"application error: [NotLeaseHolderError] r38: replica (n6,s6):8 not lease holder; current lease is repl=(n6,s6):8 start=1521584793.583634078,0 epo=331 pro=1521584953.919388607,0","[client=*REDACTED*:50370,user=techadmin,n5]",sql txn implicit,"(0,0)"
2018-03-20 22:29:31.667590,17s749ms348µs152ns,"error: {(err: [NotLeaseHolderError] r38: replica (n6,s6):8 not lease holder; current lease is repl=(n6,s6):8 start=1521584793.583634078,0 epo=331 pro=1521584953.919388607,0) <nil>}; trying next peer (n6,s6):8","[client=*REDACTED*:50370,user=techadmin,n5]",sql txn implicit,"(0,0)"
2018-03-20 22:29:31.667607,17s749ms365µs831ns,sending request to db02.db.ap-southeast-2.*REDACTED*:26257,"[client=*REDACTED*:50370,user=techadmin,n5]",sql txn implicit,"(0,0)"
2018-03-20 22:29:31.667625,17s749ms382µs883ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,4)"
2018-03-20 22:29:31.754763,17s836ms521µs517ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:31.754801,17s836ms559µs67ns,1 Scan,[n6],/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:31.754815,17s836ms573µs342ns,executing 1 requests,"[n6,s6]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:31.754844,17s836ms602µs710ns,read-only path,"[n6,s6,r38/8:/Table/5{7-8}]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:31.754851,17s836ms609µs392ns,request range lease (attempt #1),"[n6,s6,r38/8:/Table/5{7-8}]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:31.754860,17s836ms618µs17ns,=== SPAN START: [async] storage.pendingLeaseRequest: requesting lease ===,,[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:31.755020,17s836ms778µs94ns,querying next range at /System/NodeLiveness/6,"[s6,r38/8:/Table/5{7-8},n6]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:31.755062,17s836ms820µs413ns,"r3: sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to (n5,s5):13","[s6,r38/8:/Table/5{7-8},n6]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:31.755088,17s836ms846µs342ns,sending request to db04.db.ap-southeast-2.*REDACTED*:26257,"[s6,r38/8:/Table/5{7-8},n6]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:31.755102,17s836ms860µs203ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,7)"
2018-03-20 22:29:31.878508,17s960ms266µs386ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878612,17s960ms370µs548ns,"1 CPut, 1 BeginTxn, 1 EndTxn",[n5],/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878634,17s960ms392µs807ns,executing 3 requests,"[n5,s5]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878650,17s960ms407µs946ns,read-write path,"[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878672,17s960ms430µs701ns,command queue,"[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878818,17s960ms576µs521ns,applied timestamp cache,"[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878970,17s960ms727µs973ns,evaluated request,"[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:31.878998,17s960ms756µs503ns,"acquired {raft,replica}mu","[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:32.462073,18s543ms831µs420ns,applying command,"[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:32.462186,18s543ms944µs544ns,gossiping 1 node liveness record(s) from span /System/NodeLiveness/6{-/NULL},"[n5,s5,r3/13:/System/NodeLiveness{-Max}]",/cockroach.roachpb.Internal/Batch,"(0,8)"
2018-03-20 22:29:32.621433,18s703ms191µs23ns,coordinator stops,"[s6,r38/8:/Table/5{7-8},n6]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:32.650276,18s732ms34µs455ns,"heartbeat 1521584981.254882203,0","[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:32.650290,18s732ms48µs830ns,read-write path,"[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:32.650329,18s732ms86µs911ns,applied timestamp cache,"[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:32.650465,18s732ms223µs218ns,evaluated request,"[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:32.650478,18s732ms236µs65ns,"acquired {raft,replica}mu","[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:33.363785,19s445ms543µs93ns,applying command,"[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:33.363849,19s445ms606µs886ns,"storage/replica_proposal.go:453 [n6,s6,r38/8:/Table/5{7-8}] new range lease repl=(n6,s6):8 start=1521584793.583634078,0 epo=331 pro=1521584971.754854298,0 following repl=(n6,s6):8 start=1521584793.583634078,0 epo=331 pro=1521584953.919388607,0",,[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:33.363859,19s445ms617µs58ns,not gossiping system config because the replica doesn't contain the system config's start key,"[n6,s6,r38/8:/Table/5{7-8}]",[async] storage.pendingLeaseRequest: requesting lease,"(0,6)"
2018-03-20 22:29:33.364415,19s446ms173µs856ns,"lease acquisition succeeded: repl=(n6,s6):8 start=1521584793.583634078,0 epo=331 pro=1521584953.919388607,0","[n6,s6,r38/8:/Table/5{7-8}]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:33.364428,19s446ms186µs471ns,command queue,"[n6,s6,r38/8:/Table/5{7-8}]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:33.364438,19s446ms195µs970ns,waiting for read lock,"[n6,s6,r38/8:/Table/5{7-8}]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:33.385684,19s467ms442µs376ns,read completed,"[n6,s6,r38/8:/Table/5{7-8}]",/cockroach.roachpb.Internal/Batch,"(0,5)"
2018-03-20 22:29:33.386635,19s468ms393µs755ns,plan completed execution,"[client=*REDACTED*:50370,user=techadmin,n5]",consuming rows,"(0,2)"
2018-03-20 22:29:33.386640,19s468ms398µs863ns,"resources released, stopping trace","[client=*REDACTED*:50370,user=techadmin,n5]",consuming rows,"(0,2)"
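As a side note for anyone reading along: the second column in these dumps is CockroachDB’s compound duration format (e.g. 1s637ms962µs893ns). Here’s a rough little helper I hacked together to convert it to plain seconds while reading them; it assumes the units are limited to s/ms/µs/ns, as in the dumps above:

```python
import re

# Seconds per unit appearing in the trace's elapsed column.
UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "µs": 1e-6, "ns": 1e-9}

def parse_elapsed(elapsed: str) -> float:
    """Convert a duration like '1s637ms962µs893ns' to seconds."""
    total = 0.0
    # Match number+unit pairs; multi-character units listed first.
    for value, unit in re.findall(r"(\d+)(ms|µs|ns|s)", elapsed):
        total += int(value) * UNIT_SECONDS[unit]
    return total

print(parse_elapsed("17s748ms818µs877ns"))  # ~17.748818877
```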

Cheers,
Stefan

Hi Stefan,

I’m looking at the second trace, which has an 18-second gap in it. There are multiple things around that gap that make no sense to me, so you’re not alone in not understanding what’s going on.
Am I right to assume that you’re running a CRDB version from before our 2.0 betas? I think we’re missing important parts of that trace that might shed some light on what’s going on, likely because of this old issue: https://github.com/cockroachdb/cockroach/issues/20032, which we’ve since fixed, but only for our upcoming 2.0. Is it possible for you to upgrade your cluster to the latest 2.0 beta?
https://www.cockroachlabs.com/docs/releases/#testing-releases
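By the way, a quick way to find where the time goes in a dump like this is to diff the wall-clock timestamps of consecutive entries. A throwaway sketch (assuming the first CSV column is the timestamp in the format shown in your paste, and the third is the message):

```python
import csv
from datetime import datetime
from io import StringIO

def find_gaps(trace_csv: str, threshold_s: float = 1.0):
    """Return (seconds, message) pairs for consecutive-entry gaps >= threshold."""
    rows = list(csv.reader(StringIO(trace_csv)))
    gaps = []
    for prev, cur in zip(rows, rows[1:]):
        t0 = datetime.strptime(prev[0], "%Y-%m-%d %H:%M:%S.%f")
        t1 = datetime.strptime(cur[0], "%Y-%m-%d %H:%M:%S.%f")
        dt = (t1 - t0).total_seconds()
        if dt >= threshold_s:
            gaps.append((dt, cur[2]))  # third column is the message
    return gaps

# Two adjacent lines, abridged from the second trace above:
sample = (
    '2018-03-20 22:29:13.918891,649µs109ns,=== SPAN START ===,,span,"(0,3)"\n'
    '2018-03-20 22:29:31.667061,17s748ms818µs877ns,application error,,span,"(0,0)"\n'
)
for seconds, message in find_gaps(sample):
    print(f"{seconds:.3f}s gap before: {message}")
```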

Hi Andrei,

Thanks for getting back so quick.

The environment is technically the production environment, though we are at the pre-live stage, so it’s not being used by any actual users.

Is it enough to just upgrade the binaries to v2.0-beta (and not perform the finalisation step, which would prevent us from rolling back to v1.1.6)?

If so I can upgrade to the beta.

Cheers,
Stefan

Yes, it’d be sufficient to upgrade the binaries without setting the cluster version.

Hi Andrei,

Here are the traces, run again using the latest beta.

Traces (in CSV format)
2018-03-21 21:35:48.639188,0s,=== SPAN START: sql txn ===,,,sql txn,0
2018-03-21 21:35:48.639569,381µs680ns,=== SPAN START: session recording ===,,,session recording,11
2018-03-21 21:35:48.639573,385µs17ns,=== SPAN START: starting plan ===,,,starting plan,1
2018-03-21 21:35:48.639589,401µs125ns,=== SPAN START: consuming rows ===,,,consuming rows,2
2018-03-21 21:35:48.639611,422µs991ns,Scan /Table/57/{1-2},"[n8,client=*REDACTED*:61016,user=techadmin]",,sql txn,0
2018-03-21 21:35:48.639627,439µs444ns,=== SPAN START: dist sender ===,,,dist sender,3
2018-03-21 21:35:48.639644,456µs563ns,querying next range at /Table/57/1,"[client=*REDACTED*:61016,user=techadmin,txn=c0736aa2,n8]",,dist sender,3
2018-03-21 21:35:48.639705,517µs667ns,"r38: sending batch 1 Scan to (n8,s8):4","[client=*REDACTED*:61016,user=techadmin,txn=c0736aa2,n8]",,dist sender,3
2018-03-21 21:35:48.639725,537µs469ns,sending request to local server,"[client=*REDACTED*:61016,user=techadmin,txn=c0736aa2,n8]",,dist sender,3
2018-03-21 21:35:48.639741,553µs194ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:48.639746,558µs149ns,1 Scan,[n8],,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:48.639750,562µs713ns,read has no clock uncertainty,[n8],,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:48.639757,568µs738ns,executing 1 requests,"[n8,s8]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:48.639765,577µs536ns,read-only path,"[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:48.639773,585µs623ns,request range lease (attempt #1),"[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:48.639781,593µs653ns,=== SPAN START: request range lease ===,,,request range lease,5
2018-03-21 21:35:48.639789,601µs257ns,=== SPAN START: [async] storage.pendingLeaseRequest: requesting lease ===,,,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:48.639806,617µs981ns,=== SPAN START: liveness heartbeat ===,,,liveness heartbeat,7
2018-03-21 21:35:52.239368,3s600ms179µs736ns,slow heartbeat took 3.6s,"[n8,s8,r38/4:/Table/5{7-8}]",storage/node_liveness.go:501,liveness heartbeat,7
2018-03-21 21:35:52.239380,3s600ms191µs793ns,read-write path,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:52.239388,3s600ms199µs917ns,applied timestamp cache,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:52.239515,3s600ms327µs100ns,evaluated request,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:52.239525,3s600ms337µs610ns,"acquired {raft,replica}mu","[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:53.566827,4s927ms639µs210ns,applying command,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:53.566889,4s927ms700µs795ns,"new range lease repl=(n8,s8):4 seq=0 start=1521645122.521862991,0 epo=284 pro=1521668148.639777248,0 following repl=(n8,s8):4 seq=0 start=1521645122.521862991,0 epo=284 pro=1521668140.028950576,0","[n8,s8,r38/4:/Table/5{7-8}]",storage/replica_proposal.go:201,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:53.566895,4s927ms707µs555ns,not gossiping system config because the replica doesn't contain the system config's start key,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,6
2018-03-21 21:35:53.942564,5s303ms376µs59ns,"lease acquisition succeeded: repl=(n8,s8):4 seq=0 start=1521645122.521862991,0 epo=284 pro=1521668140.028950576,0","[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:53.942569,5s303ms381µs304ns,request range lease (attempt #2),"[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:53.942577,5s303ms388µs907ns,=== SPAN START: request range lease ===,,,request range lease,8
2018-03-21 21:35:53.942585,5s303ms396µs952ns,=== SPAN START: [async] storage.pendingLeaseRequest: requesting lease ===,,,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:53.942595,5s303ms406µs775ns,=== SPAN START: liveness heartbeat ===,,,liveness heartbeat,10
2018-03-21 21:35:55.149995,6s510ms806µs769ns,slow heartbeat took 1.2s,"[n8,s8,r38/4:/Table/5{7-8}]",storage/node_liveness.go:501,liveness heartbeat,10
2018-03-21 21:35:55.150024,6s510ms835µs990ns,read-write path,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:55.150032,6s510ms844µs233ns,applied timestamp cache,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:55.150129,6s510ms941µs620ns,evaluated request,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:55.150141,6s510ms952µs730ns,"acquired {raft,replica}mu","[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:56.683034,8s43ms846µs228ns,applying command,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:56.683129,8s43ms941µs107ns,"new range lease repl=(n8,s8):4 seq=0 start=1521645122.521862991,0 epo=284 pro=1521668153.942571319,0 following repl=(n8,s8):4 seq=0 start=1521645122.521862991,0 epo=284 pro=1521668148.639777248,0","[n8,s8,r38/4:/Table/5{7-8}]",storage/replica_proposal.go:201,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:56.683136,8s43ms948µs375ns,not gossiping system config because the replica doesn't contain the system config's start key,"[n8,s8,r38/4:/Table/5{7-8}]",,[async] storage.pendingLeaseRequest: requesting lease,9
2018-03-21 21:35:56.683366,8s44ms178µs66ns,"lease acquisition succeeded: repl=(n8,s8):4 seq=0 start=1521645122.521862991,0 epo=284 pro=1521668148.639777248,0","[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:56.683379,8s44ms191µs418ns,command queue,"[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:56.683389,8s44ms201µs718ns,waiting for read lock,"[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:56.683469,8s44ms280µs862ns,read completed,"[n8,s8,r38/4:/Table/5{7-8}]",,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:35:56.683655,8s44ms467µs462ns,plan completed execution,"[n8,client=*REDACTED*:61016,user=techadmin]",,consuming rows,2
2018-03-21 21:35:56.683659,8s44ms471µs642ns,"resources released, stopping trace","[n8,client=*REDACTED*:61016,user=techadmin]",,consuming rows,2

2018-03-21 21:36:33.045079,0s,=== SPAN START: sql txn ===,,,sql txn,0
2018-03-21 21:36:33.045451,371µs313ns,=== SPAN START: session recording ===,,,session recording,14
2018-03-21 21:36:33.045454,374µs523ns,=== SPAN START: starting plan ===,,,starting plan,1
2018-03-21 21:36:33.045468,388µs635ns,=== SPAN START: consuming rows ===,,,consuming rows,2
2018-03-21 21:36:33.045486,406µs463ns,Scan /Table/69/{1-2},"[n8,client=*REDACTED*:61016,user=techadmin]",,sql txn,0
2018-03-21 21:36:33.045501,421µs940ns,=== SPAN START: dist sender ===,,,dist sender,3
2018-03-21 21:36:33.045513,433µs900ns,querying next range at /Table/69/1,"[client=*REDACTED*:61016,user=techadmin,txn=16a7b27e,n8]",,dist sender,3
2018-03-21 21:36:33.045562,483µs28ns,"r50: sending batch 1 Scan to (n5,s5):8","[client=*REDACTED*:61016,user=techadmin,txn=16a7b27e,n8]",,dist sender,3
2018-03-21 21:36:33.045581,501µs858ns,sending request to db04.db.ap-southeast-2.*REDACTED*:26257,"[client=*REDACTED*:61016,user=techadmin,txn=16a7b27e,n8]",,dist sender,3
2018-03-21 21:36:33.045600,520µs475ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,,/cockroach.roachpb.Internal/Batch,4
2018-03-21 21:36:33.046469,1ms389µs303ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:33.046534,1ms454µs932ns,1 Scan,[n5],,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:33.046551,1ms471µs771ns,executing 1 requests,"[n5,s5]",,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:33.046562,1ms483µs71ns,read-only path,"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:33.047422,2ms343µs221ns,request range lease (attempt #1),"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:33.047432,2ms352µs586ns,=== SPAN START: request range lease ===,,,request range lease,7
2018-03-21 21:36:33.047441,2ms362µs226ns,=== SPAN START: [async] storage.pendingLeaseRequest: requesting lease ===,,,[async] storage.pendingLeaseRequest: requesting lease,8
2018-03-21 21:36:33.047466,2ms386µs693ns,=== SPAN START: liveness heartbeat ===,,,liveness heartbeat,9
2018-03-21 21:36:41.333944,8s288ms865µs152ns,slow heartbeat took 8.3s,"[n5,s5,r50/8:/Table/{69-70}]",storage/node_liveness.go:501,liveness heartbeat,9
2018-03-21 21:36:41.337934,8s292ms854µs949ns,heartbeat failed on epoch increment,"[n5,s5,r50/8:/Table/{69-70}]",storage/replica_range_lease.go:282,[async] storage.pendingLeaseRequest: requesting lease,8
2018-03-21 21:36:41.351282,8s306ms202µs840ns,"replica.Send got error: [NotLeaseHolderError] r50: replica (n5,s5):8 not lease holder; current lease is repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521667895.156184927,0","[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:41.352500,8s307ms420µs957ns,*roachpb.NotLeaseHolderError,[n5],,/cockroach.roachpb.Internal/Batch,6
2018-03-21 21:36:41.370503,8s325ms423µs338ns,"application error: [NotLeaseHolderError] r50: replica (n5,s5):8 not lease holder; current lease is repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521667895.156184927,0","[client=*REDACTED*:61016,user=techadmin,txn=16a7b27e,n8]",,dist sender,3
2018-03-21 21:36:41.371306,8s326ms227µs155ns,"error: {(err: [NotLeaseHolderError] r50: replica (n5,s5):8 not lease holder; current lease is repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521667895.156184927,0) <nil>}; trying next peer (n5,s5):8","[client=*REDACTED*:61016,user=techadmin,txn=16a7b27e,n8]",,dist sender,3
2018-03-21 21:36:41.371329,8s326ms249µs594ns,sending request to db04.db.ap-southeast-2.*REDACTED*:26257,"[client=*REDACTED*:61016,user=techadmin,txn=16a7b27e,n8]",,dist sender,3
2018-03-21 21:36:41.371352,8s326ms272µs702ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,,/cockroach.roachpb.Internal/Batch,5
2018-03-21 21:36:41.374171,8s329ms91µs603ns,=== SPAN START: /cockroach.roachpb.Internal/Batch ===,,,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.374199,8s329ms119µs406ns,1 Scan,[n5],,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.374209,8s329ms129µs320ns,executing 1 requests,"[n5,s5]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.374217,8s329ms137µs328ns,read-only path,"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.374224,8s329ms144µs955ns,request range lease (attempt #1),"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.374231,8s329ms151µs582ns,=== SPAN START: request range lease ===,,,request range lease,11
2018-03-21 21:36:41.374237,8s329ms157µs665ns,=== SPAN START: [async] storage.pendingLeaseRequest: requesting lease ===,,,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.374250,8s329ms170µs426ns,=== SPAN START: liveness heartbeat ===,,,liveness heartbeat,13
2018-03-21 21:36:41.636799,8s591ms719µs916ns,read-write path,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.636814,8s591ms734µs920ns,applied timestamp cache,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.636921,8s591ms842µs270ns,evaluated request,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.636933,8s591ms853µs362ns,"acquired {raft,replica}mu","[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.763410,8s718ms330µs605ns,retry proposal 46116d4b2a7c5897: reasonTicks,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.765345,8s720ms265µs281ns,applied timestamp cache,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.765431,8s720ms351µs609ns,evaluated request,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.765441,8s720ms361µs452ns,"acquired {raft,replica}mu","[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.909924,8s864ms844µs631ns,applying command,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.909986,8s864ms906µs283ns,"new range lease repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521668201.374227141,0 following repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521668201.374227141,0","[n5,s5,r50/8:/Table/{69-70}]",storage/replica_proposal.go:201,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.909996,8s864ms916µs448ns,not gossiping system config because the replica doesn't contain the system config's start key,"[n5,s5,r50/8:/Table/{69-70}]",,[async] storage.pendingLeaseRequest: requesting lease,12
2018-03-21 21:36:41.910830,8s865ms750µs423ns,"lease acquisition succeeded: repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521667895.156184927,0","[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.910844,8s865ms764µs922ns,command queue,"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.910870,8s865ms790µs743ns,waiting for read lock,"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:41.911886,8s866ms806µs596ns,read completed,"[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,10
2018-03-21 21:36:42.150355,9s105ms276µs10ns,plan completed execution,"[n8,client=*REDACTED*:61016,user=techadmin]",,consuming rows,2
2018-03-21 21:36:42.150360,9s105ms281µs121ns,"resources released, stopping trace","[n8,client=*REDACTED*:61016,user=techadmin]",,consuming rows,2

Cheers,
Stefan

Thanks, Stefan.
The good news is that we got some more information, like this line:

2018-03-21 21:36:41.333944,8s288ms865µs152ns,slow heartbeat took 8.3s,"[n5,s5,r50/8:/Table/{69-70}]",storage/node_liveness.go:501,liveness heartbeat,9

It looks like at least one node is having trouble pinging its “liveness record”, possibly because the range that holds the liveness information is itself having trouble. There’s also a bizarre line in there that I don’t know how to make sense of:

2018-03-21 21:36:41.351282,8s306ms202µs840ns,"replica.Send got error: [NotLeaseHolderError] r50: replica (n5,s5):8 not lease holder; current lease is repl=(n5,s5):8 seq=0 start=1521667895.156177244,0 epo=112 pro=1521667895.156184927,0","[n5,s5,r50/8:/Table/{69-70}]",,/cockroach.roachpb.Internal/Batch,6

@tschottdorf, what does this mean to you?

I think the cluster overall is in some kind of unhappy state. Stefan, would you mind sending us a “debug zip” (https://www.cockroachlabs.com/docs/stable/debug-zip.html)? It contains all sorts of logs and debug dumps that we can comb through. One caveat: some of the logs might leak small amounts of data, so hold off if you have particularly sensitive data in there.

And just to check - you’re not putting any extreme load on the cluster or its machines while collecting these traces, are you?

Hi Andrei,

I have sent you a direct message with the debug zip.

The cluster isn’t under any extreme load; it’s pretty light actually, roughly 1 simple query per second.

It may be due to underpowered servers, though the staging environment with the same configuration (except limited to one region) works just fine.

Cheers,
Stefan

Thanks Stefan. I didn’t get a chance to look yet, but I haven’t forgotten about this.

Hi Stefan,

I took a look at the debug zip you sent. This cluster is barely limping along! It’s hard to determine the root cause from looking at the logs.

One common root cause is overloaded disks. Especially if you’re running on a public cloud or using a SAN, you can get throttled or bottlenecked on disk write throughput. If that happens, things get pretty bad.

What kind of storage are you using here? In the UI, what are you seeing as the “log commit latency” and “command commit latency” (/#/metrics/storage/cluster)?

Hi Tobias,

Thanks for taking a look.

Unfortunately I had to make the call to scale down to one region (and upgrade to v2-stable) for now, as we were having too many performance issues that I couldn’t solve.

Doing so has pretty much fixed all performance issues (although we’re seeing some odd log entries about a replica being left in purgatory and 2 replicas being under-replicated).

As for some insight as to what we are running for our servers:
Cloud: AWS
Instance Type: t2.micro
Storage per Instance: 1 x 100GB General Purpose SSD
Instances per Availability Zones: 2
Availability Zones per Region: 2
Regions: 1 (previously 2)

I am aware that T2 instances are not recommended, but from my testing they seem to handle our load just fine in a single-region deployment. As demand on the servers is quite low at the moment, the T2 instances are enough, and we have yet to deplete their CPU credits. We plan to upgrade them to more powerful instances once their utilisation goes up.

With the current single-region deployment of the DB infrastructure (the app code is still deployed in multiple regions), the “log commit latency” hovers around 4ms - 15ms for all nodes and the “command commit latency” hovers around 0.2ms - 5ms for all nodes.

Just to add some more detail: when we had the multi-region setup, at no point did our disks become throttled (or overloaded) according to the AWS monitoring. Latency was low (<1ms/op), time spent idle was high (>80%), and IOPS never rose high enough to require bursting.

Cheers,
Stefan