Starting, stopping, and restarting node

Hello (another day, another newbie issue…)

After following these excellent instructions on starting and stopping a node, I would think I should be able to run through the procedure again immediately, rather than rebooting, deleting files, etc.

I did run ps -ef | grep cockroach to make sure there was nothing else running.
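As an aside, if your system has pgrep (it's standard on most Linux distributions), it gives a cleaner check than grepping ps output, since it never matches the grep process itself. A small sketch, nothing CockroachDB-specific:

```shell
# Print the PIDs of any running cockroach processes; -x matches the
# process name exactly. The if/else reports the result either way.
if pgrep -x cockroach; then
  echo "cockroach is still running"
else
  echo "no cockroach process found"
fi
```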

There is probably something exceedingly obvious I am missing here.

cockroach start --host=localhost --background --logtostderr;

I170412 16:24:30.818749 1 cli/start.go:330 CockroachDB CCL beta-20170330 (linux amd64, built 2017/04/07 14:45:36, go1.8)
I170412 16:24:30.920037 14 cli/start.go:367 starting cockroach node
W170412 16:24:30.920405 14 server/server.go:158 [n?] running in insecure mode, this is strongly discouraged. See --insecure.
W170412 16:24:30.924412 14 server/config.go:352 soft open file descriptor limit 4096 is under the recommended limit 15000; this may decrease performance
please see https://www.cockroachlabs.com/docs/recommended-production-settings.html for more details
I170412 16:24:30.924607 14 storage/engine/rocksdb.go:374 opening rocksdb instance at "cockroach-data"
I170412 16:24:30.967364 14 server/config.go:486 1 storage engine initialized
I170412 16:24:30.967653 14 server/server.go:641 [n?] sleeping for 456.105194ms to guarantee HLC monotonicity
I170412 16:24:31.427757 14 storage/store.go:1318 [n1] [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I170412 16:24:31.427827 14 server/node.go:450 [n1] initialized store [n1,s1]: {Capacity:10725883904 Available:8568995840 RangeCount:9 LeaseCount:0}
I170412 16:24:31.427869 14 server/node.go:341 [n1] node ID 1 initialized
I170412 16:24:31.428007 14 gossip/gossip.go:293 [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"localhost:26257" > attrs:<> locality:<>
I170412 16:24:31.428171 14 storage/stores.go:296 [n1] read 1 node addresses from persistent storage
I170412 16:24:31.428362 14 server/node.go:599 [n1] connecting to gossip network to verify cluster ID…
W170412 16:24:31.429181 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 16:24:31.429353 14 server/node.go:623 [n1] node connected via gossip and verified as part of cluster "0ae37bf7-3476-4707-8cd2-0914f5743c63"
I170412 16:24:31.429395 14 server/node.go:388 [n1] node=1: started with [[]=cockroach-data] engine(s) and attributes []
I170412 16:24:31.429443 14 sql/executor.go:342 [n1] creating distSQLPlanner with address {tcp localhost:26257}
I170412 16:24:31.436106 14 server/server.go:695 [n1] starting http server at localhost:8080
I170412 16:24:31.436141 14 server/server.go:696 [n1] starting grpc/postgres server at localhost:26257
I170412 16:24:31.436161 14 server/server.go:697 [n1] advertising CockroachDB node at localhost:26257
W170412 16:24:31.482240 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
W170412 16:24:31.586839 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
W170412 16:24:31.764790 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 16:24:31.938822 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 16:24:31.938899 115 gossip/client.go:125 [n1] failed to start gossip client to cockroachdb3:26258: rpc error: code = Unavailable desc = grpc: the connection is unavailable
W170412 16:24:32.173466 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 16:24:32.939103 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 16:24:33.019991 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 16:24:33.753150 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 16:24:34.465763 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 16:24:34.860763 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 16:24:35.428941 162 storage/raft_transport.go:442 [n1] raft transport stream to node 2 failed: unable to look up descriptor for node 2
I170412 16:24:36.018077 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 16:24:37.055935 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 16:24:37.865674 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 16:24:37.985173 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
^C
[root@cockroachdb3 ~]# I170412 16:24:38.421131 1 cli/start.go:457 received signal 'interrupt'

@edwardsmarkf, when you first ran through the instructions, did you start 3 nodes? If so, I suspect you’re seeing those messages after restarting just 1 node. If that’s the case, you need to restart at least one other node in order for a majority of replicas to be available, which is required for the cluster to be operational.

Hi Jesse, and thank you very much for answering me. I only started two nodes, not three.

Here are the steps I am doing thus far to ensure I can run through the start/stop procedure repeatedly (below).

[root@cockroachdb3 ~]# cockroach start --host=localhost --background ; ## primary initial node
CockroachDB node starting at 2017-04-12 17:31:55.851198524 +0000 UTC
build: CCL beta-20170330 @ 2017/04/07 14:45:36 (go1.8)
admin: http://localhost:8080
sql: postgresql://root@localhost:26257?sslmode=disable
logs: cockroach-data/logs
store[0]: path=cockroach-data
status: initialized new cluster
clusterID: c318627d-e456-46f6-8d6e-58751d5d8c05
nodeID: 1
[root@cockroachdb3 ~]# cockroach start --background --insecure --port=26258 --http-port=8081 --store=node2 --join=localhost:26257 ; ## secondary node
CockroachDB node starting at 2017-04-12 17:31:59.529771114 +0000 UTC
build: CCL beta-20170330 @ 2017/04/07 14:45:36 (go1.8)
admin: http://cockroachdb3:8081
sql: postgresql://root@cockroachdb3:26258?sslmode=disable
logs: node2/logs
store[0]: path=node2
status: initialized new node, joined pre-existing cluster
clusterID: c318627d-e456-46f6-8d6e-58751d5d8c05
nodeID: 2
[root@cockroachdb3 ~]# cockroach quit;
initiating graceful shutdown of server
ok
server drained and shutdown completed

[root@cockroachdb3 ~]# ps -ef | grep cock;
avahi 305 1 0 16:44 ? 00:00:00 avahi-daemon: running [cockroachdb3.local]
root 1747 1 2 17:31 pts/0 00:00:00 cockroach start --insecure --port=26258 --http-port=8081 --store=node2 --join=localhost:26257
root 1764 905 0 17:32 pts/0 00:00:00 grep --color=auto cock
[root@cockroachdb3 ~]# kill -9 1747; ## obviously a different number every time
[root@cockroachdb3 ~]# rm -Rf ./node2/ ./cockroach-data/CURRENT ./cockroach-data/OPTIONS- ./cockroach-data/MANIFEST- ./cockroach-data/000*;

I can repeat this procedure without any issues, but of course I would like to be able to skip the "delete" step!

Hmm, I just ran through your steps on my Mac and didn't have any trouble restarting the cluster, and I didn't have to delete the store data. Can you share your steps for restarting the nodes?

Also, just FYI, when running locally, you don't need to specify --insecure or --host. The cluster will be insecure by default, and without the --insecure flag, it'll listen only on localhost.

Here are my steps without doing the delete (and thank you again, Jesse):

[root@cockroachdb3 ~]# cockroach start --host=localhost --background ; ## primary initial node
CockroachDB node starting at 2017-04-12 19:22:35.292345577 +0000 UTC
build: CCL beta-20170330 @ 2017/04/07 14:45:36 (go1.8)
admin: http://localhost:8080
sql: postgresql://root@localhost:26257?sslmode=disable
logs: cockroach-data/logs
store[0]: path=cockroach-data
status: initialized new cluster
clusterID: 841ed988-6ead-49de-940d-396d5bcf84c4
nodeID: 1
[root@cockroachdb3 ~]# cockroach start --background --insecure --port=26258 --http-port=8081 --store=node2 --join=localhost:26257 ; ## secondary node
CockroachDB node starting at 2017-04-12 19:22:39.567997245 +0000 UTC
build: CCL beta-20170330 @ 2017/04/07 14:45:36 (go1.8)
admin: http://cockroachdb3:8081
sql: postgresql://root@cockroachdb3:26258?sslmode=disable
logs: node2/logs
store[0]: path=node2
status: initialized new node, joined pre-existing cluster
clusterID: 841ed988-6ead-49de-940d-396d5bcf84c4
nodeID: 2
[root@cockroachdb3 ~]#
[root@cockroachdb3 ~]# cockroach quit;
initiating graceful shutdown of server
ok
server drained and shutdown completed

[root@cockroachdb3 ~]#
[root@cockroachdb3 ~]# ps -ef | grep cock ; ## delete the last cockroach process
avahi 305 1 0 16:44 ? 00:00:00 avahi-daemon: running [cockroachdb3.local]
root 2289 1 2 19:22 pts/0 00:00:00 cockroach start --insecure --port=26258 --http-port=8081 --store=node2 --join=localhost:26257
root 2304 905 0 19:22 pts/0 00:00:00 grep --color=auto cock
[root@cockroachdb3 ~]# kill -9 2289 ;
[root@cockroachdb3 ~]# ps -ef | grep cock; ## make sure no cockroach process is running
avahi 305 1 0 16:44 ? 00:00:00 avahi-daemon: running [cockroachdb3.local]
root 2306 905 0 19:23 pts/0 00:00:00 grep --color=auto cock
[root@cockroachdb3 ~]# cockroach start --host=localhost --background --logtostderr ; ## primary initial node
I170412 19:23:21.828390 1 cli/start.go:330 CockroachDB CCL beta-20170330 (linux amd64, built 2017/04/07 14:45:36, go1.8)
I170412 19:23:21.930405 14 cli/start.go:367 starting cockroach node
W170412 19:23:21.931034 14 server/server.go:158 [n?] running in insecure mode, this is strongly discouraged. See --insecure.
W170412 19:23:21.935674 14 server/config.go:352 soft open file descriptor limit 4096 is under the recommended limit 15000; this may decrease performance
please see https://www.cockroachlabs.com/docs/recommended-production-settings.html for more details
I170412 19:23:21.935911 14 storage/engine/rocksdb.go:374 opening rocksdb instance at "cockroach-data"
I170412 19:23:22.031241 14 server/config.go:486 1 storage engine initialized
I170412 19:23:22.031555 14 server/server.go:641 [n?] sleeping for 403.568768ms to guarantee HLC monotonicity
I170412 19:23:22.440105 14 storage/store.go:1318 [n1] [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I170412 19:23:22.440189 14 server/node.go:450 [n1] initialized store [n1,s1]: {Capacity:10725883904 Available:8571211776 RangeCount:7 LeaseCount:0}
I170412 19:23:22.440230 14 server/node.go:341 [n1] node ID 1 initialized
I170412 19:23:22.440364 14 gossip/gossip.go:293 [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"localhost:26257" > attrs:<> locality:<>
I170412 19:23:22.440578 14 storage/stores.go:296 [n1] read 1 node addresses from persistent storage
I170412 19:23:22.440822 14 server/node.go:599 [n1] connecting to gossip network to verify cluster ID…
W170412 19:23:22.441551 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 19:23:22.441680 14 server/node.go:623 [n1] node connected via gossip and verified as part of cluster "841ed988-6ead-49de-940d-396d5bcf84c4"
I170412 19:23:22.441728 14 server/node.go:388 [n1] node=1: started with [[]=cockroach-data] engine(s) and attributes []
I170412 19:23:22.441791 14 sql/executor.go:342 [n1] creating distSQLPlanner with address {tcp localhost:26257}
I170412 19:23:22.447341 14 server/server.go:695 [n1] starting http server at localhost:8080
I170412 19:23:22.447396 14 server/server.go:696 [n1] starting grpc/postgres server at localhost:26257
I170412 19:23:22.447436 14 server/server.go:697 [n1] advertising CockroachDB node at localhost:26257
W170412 19:23:22.492249 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
W170412 19:23:22.596954 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
W170412 19:23:22.815026 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 19:23:22.949738 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 19:23:22.949859 115 gossip/client.go:125 [n1] failed to start gossip client to cockroachdb3:26258: rpc error: code = Unavailable desc = grpc: the connection is unavailable
W170412 19:23:23.157908 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 19:23:23.950404 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 19:23:23.990684 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 19:23:24.906673 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 19:23:25.596892 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
W170412 19:23:25.841626 157 storage/raft_transport.go:442 [n1] raft transport stream to node 2 failed: unable to look up descriptor for node 2
I170412 19:23:25.931528 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 19:23:26.801527 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 19:23:27.844657 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
W170412 19:23:28.326105 42 storage/store.go:1405 [n1,s1] could not gossip system config: [NotLeaseHolderError] range 2: replica {1 1 1} not lease holder; lease holder unknown
I170412 19:23:28.646040 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 19:23:29.775502 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 19:23:30.594308 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }
I170412 19:23:31.656599 117 vendor/google.golang.org/grpc/clientconn.go:806 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.128.0.3:26258: getsockopt: connection refused"; Reconnecting to {cockroachdb3:26258 }

@edwardsmarkf, if your steps are complete, you're only restarting 1 of your 2 nodes. As explained in my first response (probably not clearly), once you have a cluster of multiple nodes, the cluster remains operational only as long as a majority of replicas are available. In a cluster of 3 nodes, where each piece of data is replicated 3 times (the default), the cluster remains operational with 2 nodes but not with just 1.

In your case, you’re running just 2 nodes. When you restart 1 node, a majority of replicas are NOT available. That’s why you’re not seeing the cluster come back online. Try this:

  1. Restart node 1. Disregard the log messages momentarily.
  2. Restart node 2.

Hope that works for you. Also, you're clearly just doing local testing, but it's important to note that a 2-node cluster is not resilient: as we've seen here, if any 1 node fails, the entire cluster is unavailable. To survive the failure of any single node, you need at least 3 nodes.
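The arithmetic behind this is simple enough to sketch in shell: a range stays available only while a strict majority of its replication factor is up, i.e. floor(rf / 2) + 1 replicas. (The quorum helper below is purely illustrative; it isn't part of the cockroach CLI.)

```shell
# Minimum replicas that must be up for a range to stay available:
# a strict majority of the replication factor rf.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

echo "rf=3 needs $(quorum 3) replicas up"   # 2 of 3: survives 1 node failure
echo "rf=5 needs $(quorum 5) replicas up"   # 3 of 5: survives 2 node failures
# With only 2 nodes and the default rf=3, quorum is still 2,
# so BOTH nodes must be up for the cluster to operate.
```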

Hi Jesse - once again, thank you for your patience.

Yes, I am just testing. I am trying to figure out how to start/stop CockroachDB, and how to restart it after a reboot.

Could I possibly trouble you one more time to cut/paste what I gave you into a little bash shell script to demonstrate all this? Maybe using this as your starting point:

cockroach start --host=localhost --background ; ## primary initial node
cockroach start --background --insecure --port=26258 --http-port=8081 --store=node2 --join=localhost:26257 ; ## secondary node
cockroach quit;
ps -ef | grep cock ; ## delete the last cockroach process
cockroach start --host=localhost --background --logtostderr ; ## primary initial node 

or even possibly, include the restarted node instructions at the bottom of this page.

@edwardsmarkf, I ran through those commands earlier and can confirm what you're seeing. But what you're seeing is expected. Again, if you have a 2-node cluster and restart only 1 node, the cluster will not be available. You need to restart both nodes.

I’ve updated your snippet so you can copy and paste directly into your terminal to see what I mean:

cockroach start --background

cockroach start --background --port=26258 --http-port=8081 --store=node2 --join=localhost:26257

cockroach quit

ps -ef | grep cock ; ## delete the last cockroach process

cockroach start --background

cockroach start --background --port=26258 --http-port=8081 --store=node2 --join=localhost:26257
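And since you asked for a script: here's one way to wrap that sequence in bash. This is just a sketch - it assumes cockroach is on your PATH and uses the same ports and node2 store directory as above, and it drains the second node with cockroach quit --port=26258 instead of the ps / kill -9 step:

```shell
#!/usr/bin/env bash
# Sketch: start, stop, and restart a 2-node local cluster.

start_cluster() {
  cockroach start --background                  ## node 1 (port 26257)
  cockroach start --background --port=26258 \
      --http-port=8081 --store=node2 \
      --join=localhost:26257                    ## node 2
}

stop_cluster() {
  cockroach quit                                ## drain node 1
  cockroach quit --port=26258                   ## drain node 2 (no kill -9)
}

## Uncomment to run the full cycle:
# start_cluster    ## first run initializes the cluster
# stop_cluster
# start_cluster    ## on restart, BOTH nodes must come back for quorum
```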

Hello Jesse - step five, cockroach start --background, just seems to hang indefinitely.

First I rebooted and deleted all the files, so the first three steps were fine.

NOTE: you have the patience of a saint, and certainly more than i have.

Yes, it will hang indefinitely until you restart the second node. Try restarting node 1 in one terminal window and node 2 in a separate terminal window. Soon after restarting node 2, node 1 won't hang anymore.
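If you'd rather stay in one terminal, a variant that should work is to background node 1 yourself rather than passing --background, since --background waits for the node to be ready and that can't happen until node 2 rejoins. A sketch, with the same ports and stores as before:

```shell
restart_both() {
  # Restart node 1 without blocking; its output goes to node1.log.
  cockroach start --host=localhost > node1.log 2>&1 &
  # Restart node 2; once it joins, node 1 finishes initializing too.
  cockroach start --background --port=26258 --http-port=8081 \
      --store=node2 --join=localhost:26257
}
## Call restart_both to run both restarts back to back.
```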

Hmmm, OK, that works fine - thank you so much.

Some day (not today, anyway) you may have to explain this one to me. It's probably best explained with a couple of alcoholic beverages between us.

In the meantime, once again I thank you for your extraordinary patience with me.
