SELECT query sometimes takes a long time


#1

I have a 3-node CockroachDB cluster in a dev setup, with a table like the one below:

root@:26257/vnera> show create table config_str;
+------------+--------------------------------------------------------------------------+
|   Table    |                               CreateTable                                |
+------------+--------------------------------------------------------------------------+
| config_str | CREATE TABLE config_str (                                                |
|            |                                                                          |
|            |     customerid INTEGER NOT NULL,                                         |
|            |                                                                          |
|            |     objecttype INTEGER NOT NULL,                                         |
|            |                                                                          |
|            |     objectid BIGINT NOT NULL,                                            |
|            |                                                                          |
|            |     propertyname STRING NOT NULL,                                        |
|            |                                                                          |
|            |     timeinstant BIGINT NOT NULL,                                         |
|            |                                                                          |
|            |     updatetimestamp TIMESTAMP WITH TIME ZONE NULL DEFAULT now(),         |
|            |                                                                          |
|            |     value STRING NULL,                                                   |
|            |                                                                          |
|            |     CONSTRAINT "primary" PRIMARY KEY (customerid ASC, objecttype ASC,    |
|            | propertyname ASC, objectid ASC, timeinstant DESC),                       |
|            |                                                                          |
|            |     INDEX cid_ot_prop_oid_time_uts (customerid ASC, objecttype ASC,      |
|            | propertyname ASC, objectid ASC, timeinstant ASC, updatetimestamp ASC),   |
|            |                                                                          |
|            |     INDEX cid_uts (customerid ASC, updatetimestamp ASC),                 |
|            |                                                                          |
|            |     INDEX uts_desc_cid_ot (updatetimestamp DESC, customerid ASC,         |
|            | objecttype ASC),                                                         |
|            |                                                                          |
|            |     INDEX uts_cid_ot (updatetimestamp ASC, customerid ASC, objecttype    |
|            | ASC),                                                                    |
|            |                                                                          |
|            |     INDEX ot_uts_cid (objecttype ASC, updatetimestamp ASC, customerid    |
|            | ASC),                                                                    |
|            |                                                                          |
|            |     FAMILY "primary" (customerid, objecttype, objectid, propertyname,    |
|            | timeinstant, updatetimestamp, value)                                     |
|            |                                                                          |
|            | )                                                                        |
+------------+--------------------------------------------------------------------------+
(1 row)

Time: 97.108731ms

Sometimes a query like the one below takes more than 20 seconds:

SELECT customerid, objecttype, objectid, propertyname, timeinstant, updatetimestamp FROM config_str@uts_desc_cid_ot WHERE updatetimestamp >= 1524899796::TIMESTAMPTZ;

(output truncated; only the last few of the 58877 rows are shown)
|    1010735 |        515 |          1989444141 | _name        | 1524881400000 | 2018-04-28 07:16:40.312+00:00 |
|      10686 |        546 |          1520650600 | _name        | 1524880111147 | 2018-04-28 07:16:40.284+00:00 |
|    1010714 |        515 |          1128009733 | _name        | 1524876900000 | 2018-04-28 07:16:39.297+00:00 |
|    1010714 |        515 |          1128009733 | _denorm      | 1524876900000 | 2018-04-28 07:16:38.876+00:00 |
|      10686 |        546 |          1520650600 | _canonical   | 1524880111147 | 2018-04-28 07:16:36.336+00:00 |
+------------+------------+---------------------+--------------+---------------+-------------------------------+
(58877 rows)

Time: 20.65102417s
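
For reference, the plan that the index hint produces can be double-checked with EXPLAIN (I have not included its output here):

EXPLAIN SELECT customerid, objecttype, objectid, propertyname, timeinstant, updatetimestamp FROM config_str@uts_desc_cid_ot WHERE updatetimestamp >= 1524899796::TIMESTAMPTZ;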

The pprof output and cockroach logs from each node are attached. The profile was captured with:

curl -X GET "http://localhost:8091/debug/pprof/profile?seconds=30" > pprof.ttf

cockroach_logs.ttf (2.8 MB)

Rename the file to .zip.

The load on the cluster is about 600 reads per second and about 150 inserts per second.

Number of rows in the table:

root@:26257/vnera> select count(*) from config_str;
+---------+
|  count  |
+---------+
| 2028627 |
+---------+
(1 row)

Time: 51.387208194s
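
In case the range distribution of the table matters here, it can be listed like below (assuming SHOW EXPERIMENTAL_RANGES is available in this version):

SHOW EXPERIMENTAL_RANGES FROM TABLE config_str;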

In my logs I sometimes see warnings with the following trace:

W180427 22:23:11.568057 137 storage/store.go:3857  [n2,s2] handle raft ready: 0.6s [processed=0]
W180427 22:23:11.569376 144 storage/store.go:3857  [n2,s2] handle raft ready: 0.6s [processed=0]
W180427 22:23:11.583835 131 storage/engine/rocksdb.go:1755  batch [5/412/0] commit took 549.832341ms (>500ms):
goroutine 131 [running]:
runtime/debug.Stack(0xf2a33f, 0xed27599cf, 0x0)
        /usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc42a344000, 0x0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1756 +0x128
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc42a344000, 0xed2759900, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1679 +0x6fe
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applyRaftCommand(0xc428e5fc00, 0x2612180, 0xc42c299380, 0xc425a00e38, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:5037 +0x4d5
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc428e5fc00, 0x2612180, 0xc42c299380, 0xc425a00e38, 0x8, 0x7, 0x9558d, 0x100000001, 0x1, 0xd, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:4742 +0x55a
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc428e5fc00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3658 +0x12c8
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x2612180, 0xc434acd320, 0xc428e5fc00, 0x2612180)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3799 +0x109
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc4204b2c00, 0x2612180, 0xc434acd320, 0xc427cab040, 0xc430cc9ed0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3121 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc4204b2c00, 0x2612180, 0xc42090a030, 0x94)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3787 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc4201c0e10, 0x2612180, 0xc42090a030)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:226 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x2612180, 0xc42090a030)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:166 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc42031a2e0, 0xc420920000, 0xc42031a2d0)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:185 +0xad

Can someone let me know what is going wrong?

Environment

  • Cockroach version: 2.0.0
  • OS: Ubuntu 14
  • nproc: 4

The Cockroach nodes were started as follows:

cockroach start --insecure --advertise-host=platform2 --port=26257 --http-port=8091 --http-host 0.0.0.0 --store=path=/var/lib/cockroach/data --log-dir=/var/log/cockroach --join=platform1:26257,platform2:26257,platform3:26257 --max-offset=5000ms --attrs=ram:2gb,cpu:1

Some sample data

root@:26257/vnera> select * from config_str limit 2;
+------------+------------+-----------+--------------+---------------+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| customerid | objecttype | objectid  | propertyname |  timeinstant  |        updatetimestamp        |                                                                                                                                                                                                              value                                                                                                                                                                                                               |
+------------+------------+-----------+--------------+---------------+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|          0 |        832 |  86272338 | _denorm      | 1524770211742 | 2018-04-26 19:16:51.812+00:00 | CmZbRDdDWFNEWiwgVlJOSS1ORVctREVWLCB2cm5pLXBsYXRmb3JtLjEwLjE1My4xOTAuMjA1LCBJVkI0SFZQLCB2cm5pLCBQTEFURk9STSwgZGVicmFqY29ja3JvYWNoY2x1c3Rlcl0SDjA6ODMyOjg2MjcyMzM4IgtkaWQ6RDdDWFNEWiIQZW52OlZSTkktTkVXLURFViIhaG9zdDp2cm5pLXBsYXRmb3JtLjEwLjE1My4xOTAuMjA1IgtpaWQ6SVZCNEhWUCIIaW5mOnZybmkiDXJvbGU6UExBVEZPUk0iHHNldHVwOmRlYnJhamNvY2tyb2FjaGNsdXN0ZXIiDHNrdTpwbGF0Zm9ybSIPc291cmNlOkxhdW5jaGVyIgt0eXBlOm9ucHJlbQ==                 |
|          0 |        832 | 128703658 | _denorm      | 1524805856498 | 2018-04-27 05:12:21.75+00:00  | CmxbRDdDWFNEWiwgVlJOSS1ORVctREVWLCBwbGF0Zm9ybTIuMTAuMTUzLjE4OS4xNDYsIElQN1RTR1osIHZybmksIFBMQVRGT1JNLCBkZWJyYWpfY29ja3JvYWNoX2NsdXN0ZXItY2x1c3Rlcl0SDzA6ODMyOjEyODcwMzY1OCILZGlkOkQ3Q1hTRFoiEGVudjpWUk5JLU5FVy1ERVYiHWhvc3Q6cGxhdGZvcm0yLjEwLjE1My4xODkuMTQ2IgtpaWQ6SVA3VFNHWiIIaW5mOnZybmkiDXJvbGU6UExBVEZPUk0iJnNldHVwOmRlYnJhal9jb2Nrcm9hY2hfY2x1c3Rlci1jbHVzdGVyIgxza3U6cGxhdGZvcm0iD3NvdXJjZTpMYXVuY2hlciILdHlwZTpvbnByZW0= |
+------------+------------+-----------+--------------+---------------+-------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(2 rows)

Time: 350.907612ms

(Rebecca Taft) #2

Hi @tuk, for the query that is slow, have you tried running SHOW TRACE FOR <query>? That may help us understand where the bottleneck is.
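
For example, adapting your query (the trace output is verbose; the rows with large latencies are the interesting ones):

SHOW TRACE FOR SELECT customerid, objecttype, objectid, propertyname, timeinstant, updatetimestamp FROM config_str@uts_desc_cid_ot WHERE updatetimestamp >= 1524899796::TIMESTAMPTZ;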

If the query is not consistently slow, you can set the cluster setting sql.trace.txn.enable_threshold = '10s' to trace all queries that take longer than 10 seconds (or adjust that threshold if 10 seconds is too long or too short). The output of that tracing will show up in the logs.
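
For example:

SET CLUSTER SETTING sql.trace.txn.enable_threshold = '10s';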


#3

@becca - I have filed an issue for this at https://github.com/cockroachdb/cockroach/issues/25151 and posted a lot more details there. If you need anything more from my side, let me know.


(Rebecca Taft) #4

Thank you! If we need anything else we’ll post on the issue.