Select count from a table have no response ( even though it may relate to stress test that cause high cpu usage, but why? )

Hi, I’m not sure if this issue is same as https://github.com/cockroachdb/cockroach/issues/9540
stability: run insert&delete-heavy workload with intent/GC queue pressure #9540

If so, it may help to provide some information, and myself may want to know some of of why ( the cause ) ( if there’s any )

And why that table is special

I do the following
To have that table ( bag_index ) have many rows ( about 123612 rows )

Then doing delete will cause no response ( after Control + C, the select doesn’t work too ) that table can’t do any operation ( long time no response, say one hour, I don’t wait that long ) ( the problem exists that long, say one day )

For now it only happen for that table ( bag_index )

beta-20161027 ( done restart that node once, I tried delete again, restart won’t fix)
beta-20161103 ( doing restart that node, make it can select again )

Select have no response
root@192.168.100.241:8000> SELECT count(*) FROM bag_index;

Other tables are fine

root@192.168.100.241:8000> SELECT count(*) FROM bag_index_compute;
+----------+
| count(*) |
+----------+
|    71314 |
+----------+
(1 row)

After restart the table is okay now

root@192.168.100.241:8000> select count(*) from bag_index;
+----------+
| count(*) |
+----------+
|   123612 |
+----------+
(1 row)

Here are some logs

[wen@hadoop241 logs]$ less cockroach.hadoop241.wen.log.INFO.2016-11-15T09_45_50+08_00.3327
I161115 09:45:50.648868 1 util/log/clog.go:988  [config] file created at: 2016/11/15 09:45:50
I161115 09:45:50.648868 1 util/log/clog.go:988  [config] running on machine: hadoop241
I161115 09:45:50.648868 1 util/log/clog.go:988  [config] binary: CockroachDB beta-20161103 (linux amd64, built 2016/11/03 15:03:54, go1.7.3)
I161115 09:45:50.648868 1 util/log/clog.go:988  [config] arguments: [/home/wen/clusterthree/cockroach start --insecure --host= --store=node1 --port=8000 --http-port=7000]
I161115 09:45:50.648868 1 util/log/clog.go:988  line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg
I161115 09:45:50.648867 1 cli/start.go:299  CockroachDB beta-20161103 (linux amd64, built 2016/11/03 15:03:54, go1.7.3)
I161115 09:45:50.756303 1 cli/backtrace.go:40  backtrace disabled: stat /opt/backtrace/bin/ptrace: no such file or directory
I161115 09:45:50.756366 1 cli/start.go:315  starting cockroach node
W161115 09:45:50.756421 1 server/server.go:154  [n?] running in insecure mode, this is strongly discouraged. See --insecure.
I161115 09:45:50.756657 1 gossip/gossip.go:237  [n?] initial resolvers: []
W161115 09:45:50.756678 1 gossip/gossip.go:1055  [n?] no resolvers found; use --join to specify a connected node
I161115 09:45:50.766629 1 storage/engine/rocksdb.go:326  opening rocksdb instance at "node1"
I161115 09:45:51.360771 1 server/config.go:443  1 storage engine initialized
I161115 09:45:51.364954 1 base/node_id.go:62  [n1] NodeID set to 1
W161115 09:45:51.630882 1 storage/store.go:766  [n1] storeMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start for 265.797643ms (>100ms):
goroutine 1 [running]:
runtime/debug.Stack(0x1d25710, 0x1d25750, 0x3b)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0xfd7c00b)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc4203977b8)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start(0xc420397500, 0x7f6a4b6b9dc0, 0xc420213c80, 0xc420465cb0, 0xc4205ecc70, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1052 +0x520
github.com/cockroachdb/cockroach/pkg/server.(*Node).initStores(0xc4201a0780, 0x7f6a4b6b9dc0, 0xc420213c80, 0xc4201461f0, 0x1, 0x1, 0xc420465cb0, 0x0, 0x0, 0x7f6a4b6b9dc0)
        /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:418 +0x213
github.com/cockroachdb/cockroach/pkg/server.(*Node).start(0xc4201a0780, 0x7f6a4b6b9dc0, 0xc420213c80, 0x232c420, 0xc4203c5f60, 0xc4201461f0, 0x1, 0x1, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:340 +0x11d
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc420202000, 0x7f6a4b6b9dc0, 0xc420213c80, 0xc42046b7c0, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:575 +0xccc
github.com/cockroachdb/cockroach/pkg/cli.runStart(0x2316b40, 0xc42044f680, 0x0, 0x5, 0x7fffb067c83f, 0x4)
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:324 +0x920
github.com/cockroachdb/cockroach/pkg/cli.maybeDecorateGRPCError.func1(0x2316b40, 0xc42044f680, 0x0, 0x5, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/error.go:30 +0x5a
github.com/spf13/cobra.(*Command).execute(0x2316b40, 0xc42044f630, 0x5, 0x5, 0x2316b40, 0xc42044f630)
        /go/src/github.com/spf13/cobra/command.go:632 +0x23e
github.com/spf13/cobra.(*Command).ExecuteC(0x2312740, 0xc42000e220, 0xebcdac937f56b809, 0xc42000e258)
        /go/src/github.com/spf13/cobra/command.go:722 +0x367
github.com/spf13/cobra.(*Command).Execute(0x2312740, 0xc42000e240, 0xebcdac937f56b809)
        /go/src/github.com/spf13/cobra/command.go:681 +0x2b
github.com/cockroachdb/cockroach/pkg/cli.Run(0xc42000c0f0, 0x6, 0x6, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:95 +0x6d
main.main()
        /go/src/github.com/cockroachdb/cockroach/main.go:37 +0xe1
I161115 09:45:51.631284 1 storage/store.go:1173  [n1] [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161115 09:45:51.631321 1 server/node.go:433  [n1] initialized store [n1,s1]: {Capacity:105689415680 Available:57766645760 RangeCount:0}
I161115 09:45:51.631455 1 server/node.go:318  [n1] node ID 1 initialized
I161115 09:45:51.631566 1 gossip/gossip.go:280  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"hadoop241:8000" > attrs:<> locality:<> 
I161115 09:45:51.631713 1 storage/stores.go:296  [n1] read 2 node addresses from persistent storage
I161115 09:45:51.632068 1 server/node.go:563  [n1] connecting to gossip network to verify cluster ID...
I161115 09:45:51.632098 1 server/node.go:584  [n1] node connected via gossip and verified as part of cluster {"104a7b1e-070a-4472-94fd-6c8b26850864"}
I161115 09:45:51.632131 1 server/node.go:368  [n1] node=1: started with [[]=node1] engine(s) and attributes []
I161115 09:45:51.632172 1 server/server.go:605  [n1] starting http server at hadoop241:7000
I161115 09:45:51.632188 1 server/server.go:606  [n1] starting grpc/postgres server at hadoop241:8000
I161115 09:45:51.632382 1 server/server.go:607  [n1] advertising CockroachDB node at hadoop241:8000
I161115 09:45:51.774141 168 gossip/client.go:126  [n1] node 1: started gossip client to hadoop243:8001
I161115 09:45:54.579276 201 storage/raft_transport.go:423  [n1] raft transport stream to node 2 established
I161115 09:45:54.789133 161 storage/raft_transport.go:423  [n1] raft transport stream to node 3 established
I161115 09:45:54.802167 145 sql/event_log.go:95  [n1] Event: "node_restart", target: 1, info: {Descriptor:{NodeID:1 Address:{NetworkField:tcp AddressField:hadoop241:8000} Attrs: Locality:} ClusterID:{UUID:104a7b1e-070a-4472-94fd-6c8b26850864} StartedAt:1479174351632104074}
I161115 09:45:56.082623 231 storage/replica_raftstorage.go:574  [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}] [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}]: with replicaID 1, applying Raft snapshot at index 24327 (id=989003f9, encoded size=16, 25 rocksdb batches, 4048 log entries)
I161115 09:45:57.265915 326 storage/replica_raftstorage.go:574  [n1,s1,r1/1:/{Min-System/tsd/c…}] [n1,s1,r1/1:/{Min-System/tsd/c…}]: with replicaID 1, applying Raft snapshot at index 159280 (id=9d6b22b8, encoded size=16, 41 rocksdb batches, 3124 log entries)
I161115 09:45:57.550411 231 storage/replica_raftstorage.go:577  [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}] [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}]: with replicaID 1, applied Raft snapshot in 1.468s
W161115 09:45:57.551112 231 storage/replica.go:528  [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest for 1.468760541s (>500ms):
goroutine 231 [running]:
runtime/debug.Stack(0x1d28bf8, 0x1d28c38, 0x48)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0x578b81dd)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc42026d1f8)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc420397500, 0x7f6a4b6b9dc0, 0xc420212db0, 0xc4203b4d78, 0x3f459234f9039098, 0x3e23945eaf8aba90, 0x74, 0xc420212a20, 0x2e, 0x30, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2923 +0xc88
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc420397500, 0xc4203b4d20, 0x233e340, 0xc4207621f0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2567 +0x6ac
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x233e340, 0xc4207621f0, 0xc420079500, 0xc4203caf38, 0xc4203caf40)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:352 +0x14d
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x7f6a4b6b9dc0, 0xc42026ff50)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:353 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc420465cb0, 0x1ffde26, 0x19, 0x162, 0x0, 0x0, 0xc420212930, 0x7f6a4b6b9dc0, 0xc42026ff50)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:264 +0xdf
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:265 +0x234
I161115 09:45:58.131184 326 storage/replica_raftstorage.go:577  [n1,s1,r1/1:/{Min-System/tsd/c…}] [n1,s1,r1/1:/{Min-System/tsd/c…}]: with replicaID 1, applied Raft snapshot in 0.865s
W161115 09:45:58.131592 326 storage/replica.go:528  [n1,s1,r1/1:/{Min-System/tsd/c…}] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest for 865.770425ms (>500ms):
goroutine 326 [running]:
runtime/debug.Stack(0x1d28bf8, 0x1d28c38, 0x48)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0x339a9bb9)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc42026c778)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc420397500, 0x7f6a4b6b9dc0, 0xc4207343c0, 0xc420e5c058, 0xb0416caeb8226b9d, 0xccb6d6981cf7f5bd, 0x1, 0x26d30f8, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2923 +0xc88
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc420397500, 0xc420e5c000, 0x233e340, 0xc420e961d0, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2567 +0x6ac
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x233e340, 0xc420e961d0, 0xc420079500, 0xc4207d7f38, 0xc4207d7f40)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:352 +0x14d
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x7f6a4b6b9dc0, 0xc420785c50)
        /go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:353 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc420465cb0, 0x1ffde26, 0x19, 0x162, 0x0, 0x0, 0xc420785e90, 0x7f6a4b6b9dc0, 0xc420785c50)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:264 +0xdf
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:265 +0x234
I161115 09:45:58.660700 346 storage/replica_raftstorage.go:574  [n1,s1,r132/1:{-}] [n1,s1,r132/1:{-}]: with replicaID 1, applying Raft snapshot at index 7329 (id=d2ed84e5, encoded size=16, 13 rocksdb batches, 2494 log entries)
I161115 09:45:58.848294 346 storage/replica_raftstorage.go:577  [n1,s1,r132/1:/System/tsd/cr.node.{sq…-tx…}] [n1,s1,r132/1:/System/tsd/cr.node.{sq…-tx…}]: with replicaID 1, applied Raft snapshot in 0.188s
I161115 09:45:59.559487 297 storage/replica_raftstorage.go:574  [n1,s1,r131/1:{-}] [n1,s1,r131/1:{-}]: with replicaID 1, applying Raft snapshot at index 16085 (id=5d7ff75c, encoded size=16, 18 rocksdb batches, 4282 log entries)
I161115 09:45:59.823922 297 storage/replica_raftstorage.go:577  [n1,s1,r131/1:/System/tsd/cr.{node.…-store…}] [n1,s1,r131/1:/System/tsd/cr.{node.…-store…}]: with replicaID 1, applied Raft snapshot in 0.264s
I161115 09:46:00.526882 333 http2_server.go:236  transport: http2Server.HandleStreams received bogus greeting from client: "GET / HTTP/1.1\r\nHost: 19"
I161115 09:46:01.633268 163 server/status/runtime.go:228  runtime stats: 611 MiB RSS, 130 goroutines, 103 MiB/237 MiB/362 MiB GO alloc/idle/total, 77 MiB/111 MiB CGO alloc/total, 0.00cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (25x)
I161115 09:46:01.635334 143 storage/store.go:3732  [n1,s1] sstables (read amplification = 4):

bag_index

CREATE TABLE "crebusiness.bag_index" (
	sharding_code STRING(4) NULL,
	ticket_all_no STRING(32) NOT NULL,
	ticket_sign STRING(1) NULL,
	ticket_no STRING(12) NULL,
	form_code STRING(3) NULL,
	ticket_type STRING(1) NULL,
	trans_type STRING(2) NULL,
	print_date STRING(8) NULL,
	print_time STRING(6) NULL,
	send_man_name STRING(64) NULL,
	send_province_code STRING(2) NULL,
	send_city_code STRING(4) NULL,
	send_area_code STRING(6) NULL,
	send_man_address STRING(128) NULL,
	send_man_phone STRING(20) NULL,
	send_man_fax STRING(20) NULL,
	send_man_mobile STRING(20) NULL,
	send_man_id_card STRING(18) NULL,
	send_man_email STRING(64) NULL,
	receive_man_name STRING(64) NULL,
	receive_province_code STRING(2) NULL,
	receive_city_code STRING(4) NULL,
	receive_area_code STRING(6) NULL,
	receive_man_address STRING(128) NULL,
	receive_man_phone STRING(20) NULL,
	receive_man_fax STRING(20) NULL,
	receive_man_mobile STRING(20) NULL,
	receive_man_id_card STRING(18) NULL,
	receive_man_email STRING(64) NULL,
	goods_name STRING(20) NULL,
	num STRING(11) NULL,
	weight STRING(11) NULL,
	declare_price STRING(11) NULL,
	passenger_num STRING(4) NULL,
	train_code STRING(8) NULL,
	train_date STRING(8) NULL,
	train_ticket_no STRING(20) NULL,
	ticket_station_code STRING(5) NULL,
	create_datetime STRING(14) NULL,
	update_datetime STRING(14) NULL,
	cre_sheet_no STRING(32) NULL,
	row_id INT NULL,
	CONSTRAINT "primary" PRIMARY KEY (ticket_all_no),
	UNIQUE INDEX idx_row_id (row_id),
	INDEX idx_update_datetime (update_datetime),
	INDEX idx_create_datetime (create_datetime),
	INDEX idx_print_date (print_date, print_time),
	FAMILY "primary" (sharding_code, ticket_all_no, ticket_sign, ticket_no, form_code, ticket_type, trans_type, print_date, print_time, send_man_name, send_province_code, send_city_code, send_area_code, receive_province_code),
	FAMILY fam_1_send_man_address_send_man_phone_send_man_fax_send_man_mobile_receive_city_code (send_man_address, send_man_phone, send_man_fax, send_man_mobile, receive_city_code),
	FAMILY fam_2_send_man_id_card_send_man_email_receive_man_name_receive_area_code_receive_man_phone_receive_man_id_card (send_man_id_card, send_man_email, receive_man_name, receive_area_code, receive_man_phone, receive_man_id_card),
	FAMILY fam_3_receive_man_address_receive_man_fax_receive_man_mobile_goods_name_num (receive_man_address, receive_man_fax, receive_man_mobile, goods_name, num),
	FAMILY fam_4_receive_man_email_weight_declare_price_passenger_num_train_code_train_date_train_ticket_no_ticket_station_code_create_datetime_row_id (receive_man_email, weight, declare_price, passenger_num, train_code, train_date, train_ticket_no, ticket_station_code, create_datetime, row_id),
	FAMILY fam_5_update_datetime_cre_sheet_no (update_datetime, cre_sheet_no)
)

Yesterday it has this issue ( the select count for bag_index isn’t response at this time 17:34 )

E161114 17:34:05.899731 214 storage/queue.go:568  [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
E161114 17:37:40.921581 214 storage/queue.go:568  [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
E161114 17:43:53.169373 629354 server/admin.go:278  namespace information_schema with ParentID 0 not found
E161114 17:43:53.170946 629669 server/admin.go:278  namespace pg_catalog with ParentID 0 not found
E161114 17:43:54.184960 621246 server/admin.go:278  namespace information_schema with ParentID 0 not found
E161114 17:43:54.192395 621247 server/admin.go:278  namespace pg_catalog with ParentID 0 not found
E161114 17:50:20.995330 214 storage/queue.go:568  [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
E161114 17:55:11.022894 214 storage/queue.go:568  [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded

Yesterday’s graph (for version beta-20161103, till this morning select count for bag_index still no response, restart that node makes it okay now )

Provided sql dump may help to re-produce the problem

Table dump and all logs files
http://www.files.com/shared/582a7fe44d75b/2016-11-15_dump_and_logs.zip

In addition to the above ( to re-produce the issue may not by doing delete from bag_index, but by stress testing that cause high cpu usage, which is my guess )

Of course I think delete is a issue, But this topic may more about stress test that cause high cpu usage, that cause this bag_index no response of select

This happen again, that ( no delete operation involved )

Restart the node, makes it okay again ( can do select count, it has 880861 rows now )

I think you are hitting this bug: https://github.com/cockroachdb/cockroach/issues/9493

Thanks for your reply.

Is it you mean it cause by drop ( or delete) , not by high cpu usage?

And why only for that table, Others are fine

As above noted after restart, it works ( so at this point everythings okay) , Except doing a stress test again causes it can’t do select operation on table bag_index again
( During restart to next stress testing, I don’t think it involved drop or delete operation )

I once thought that there’s a issue handling high load ( that high cpu, cause it send rpc failed ) ( I think the stress test shouldn’t make table bag_index no response ) , What do you think about this?

I think there’s no drop or delete or restart before this issue happen
And we don’t have this error

(error: pq: missing fk back reference to quotes.quotes_episode_idx from episodes.primary)

It just have no response for select on table bag_index

Thanks for reporting this problem. Can you file an issue for this? Thanks

Got it, see https://github.com/cockroachdb/cockroach/issues/10752