Upgrade question


(Jason shi) #1

I have a cluster of 9 nodes running version 1.1.4.
I am planning to upgrade it to 2.0.3.
When I upgraded the first node, everything was normal. Then, when I upgraded the second node, it could not join the cluster. The error log is further below:
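
For reference, the per-node step of a rolling upgrade looks roughly like this (a sketch, not the exact commands I ran; the certs directory, join list, and binary locations are placeholders, while the port and store path are from my setup; for an insecure cluster, --insecure replaces --certs-dir):

  # gracefully stop the node being upgraded
  cockroach quit --certs-dir=certs --host=<this-node> --port=3359

  # swap in the new binary
  cp ./cockroach-v2.0.3 /usr/local/bin/cockroach

  # restart the node with the same flags it used before the upgrade
  cockroach start --certs-dir=certs \
    --host=<this-node> --port=3359 \
    --store=/export/cockroachdb/data \
    --join=172.18.151.38:3359,<other-node>:3359 \
    --background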

[root@host123 logs]# tail -f cockroach.log
I180817 06:59:30.005815 1 cli/start.go:796 process identity: uid 0 euid 0 gid 0 egid 0
I180817 06:59:30.005824 1 cli/start.go:461 starting cockroach node
I180817 06:59:30.006514 27 storage/engine/rocksdb.go:552 opening rocksdb instance at “/export/cockroachdb/data/cockroach-temp234291194”
I180817 06:59:30.052271 27 server/server.go:734 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I180817 06:59:30.052735 27 storage/engine/rocksdb.go:552 opening rocksdb instance at “/export/cockroachdb/data”
I180817 06:59:30.540784 27 server/config.go:538 [n?] 1 storage engine initialized
I180817 06:59:30.540800 27 server/config.go:541 [n?] RocksDB cache size: 31 GiB
I180817 06:59:30.540812 27 server/config.go:541 [n?] store 0: RocksDB, max size 0 B, max open file limit 97400
W180817 06:59:30.553727 27 gossip/gossip.go:1293 [n?] no incoming or outgoing connections
I180817 06:59:30.555855 83 gossip/client.go:129 [n?] started gossip client to 172.18.151.38:3359
W180817 06:59:37.406171 302 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 06:59:43.401988 647 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 06:59:44.740855 741 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 06:59:48.990065 884 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 06:59:51.129178 877 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 06:59:58.169765 1171 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
I180817 07:00:00.338898 140 gossip/gossip.go:1307 [n3] node has connected to cluster via gossip
W180817 07:00:00.553986 1150 server/server.go:1267 The server appears to be unable to contact the other nodes in the cluster. Please try

  • starting the other nodes, if you haven't already
  • double-checking that the '--join' and '--host' flags are set up correctly
  • running the 'cockroach init' command if you are trying to initialize a new cluster

If problems persist, please see https://www.cockroachlabs.com/docs/v2.0/cluster-setup-troubleshooting.html.
W180817 07:00:02.922922 535 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:04.170776 456 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:07.409378 1355 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:15.258407 1586 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:22.173672 1414 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:23.739068 1801 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:24.503301 1826 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:25.001562 1828 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:37.424708 2143 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:41.835039 849 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:44.930754 2234 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:45.146773 2339 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:46.686379 2316 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:50.408449 2445 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:00:53.227656 2595 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
W180817 07:01:01.067939 2730 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”

Please help me, thanks!


(Tim O'Brien) #2

Hey @Jason, can you give us a bit more detail about the cluster setup? Are you using SSL? If so, have certificates been issued for the new node? Were there any changes to the hostname or address after restart that would require a change in configuration elsewhere in your environment?
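
If the cluster is secure, a quick check on the node that won't join is whether its node certificate still matches its current hostname and address. Something like the following lists the certs and, if needed, re-issues the node cert (a sketch assuming the standard --certs-dir layout and that the CA key is still available; names and paths are placeholders):

  # on the node that fails to join
  cockroach cert list --certs-dir=certs

  # re-issue the node cert if the hostname or address changed
  cockroach cert create-node <hostname> <ip-address> \
    --certs-dir=certs --ca-key=ca.key --overwrite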


(Jason shi) #3

Thanks for your reply!
I’ve done the upgrade from 1.1.4 to 2.0.3, but the performance of the cluster has dropped a lot, and there are a lot of warnings in the log.

I180820 08:35:24.124613 135418633 storage/replica_raftstorage.go:520 [replicate,n3,s3,r113665/4:/Table/3623/1/26{4965…-5506…}] generated preemptive snapshot 36207626 at index 55180
I180820 08:35:24.125170 135418633 storage/replicate_queue.go:226 [replicate,n3,s3,r113665/4:/Table/3623/1/26{4965…-5506…}] snapshot failed: (n1,s1):?: remote declined snapshot: store busy applying snapshots and/or removing replicas
I180820 08:35:24.260017 135419106 storage/replica_raftstorage.go:520 [replicate,n3,s3,r52798/3:/Table/6506/1/532{0972…-1004…}] generated preemptive snapshot 0e8bc7b0 at index 84096
I180820 08:35:24.260491 135419106 storage/replicate_queue.go:226 [replicate,n3,s3,r52798/3:/Table/6506/1/532{0972…-1004…}] snapshot failed: (n7,s6):?: remote declined snapshot: store busy applying snapshots and/or removing replicas
W180820 08:35:24.618494 135418916 vendor/google.golang.org/grpc/server.go:625 grpc: Server.Serve failed to create ServerTransport: connection error: desc = “transport: http2Server.HandleStreams failed to receive the preface from client: EOF”
I180820 08:35:25.154949 135419084 storage/replica_raftstorage.go:520 [replicate,n3,s3,r140204/3:/Table/6423/1/5450{157…-209…}] generated preemptive snapshot d12595b7 at index 346001
I180820 08:35:25.155389 135419084 storage/replicate_queue.go:226 [replicate,n3,s3,r140204/3:/Table/6423/1/5450{157…-209…}] snapshot failed: (n1,s1):?: remote declined snapshot: store busy applying snapshots and/or removing replicas

The cluster info is below:
+----+--------------------+--------+---------------------+---------------------+---------+------------------+-----------------------+--------+--------------------+------------------------+---------------+--------------+---------------+--------------+--------------+-------------------+--------------------+-------------+
| id | address | build | updated_at | started_at | is_live | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated | live_bytes | key_bytes | value_bytes | intent_bytes | system_bytes | gossiped_replicas | is_decommissioning | is_draining |
+----+--------------------+--------+---------------------+---------------------+---------+------------------+-----------------------+--------+--------------------+------------------------+---------------+--------------+---------------+--------------+--------------+-------------------+--------------------+-------------+
| 1 | 10.18.151.38:3359 | v2.0.3 | 2018-08-20 09:24:36 | 2018-08-17 10:20:46 | true | 18729 | 18728 | 24524 | 0 | 0 | 737754815948 | 90968762210 | 646806563447 | 100149 | 169102949 | 24524 | false | false |
| 2 | 10.18.150.200:3359 | v2.0.3 | 2018-08-20 09:24:43 | 2018-08-17 05:39:02 | true | 19611 | 19611 | 68640 | 0 | 0 | 2468052935634 | 308298008477 | 2159814538983 | 2899076 | 375336791 | 68640 | false | false |
| 3 | 10.18.150.238:3359 | v2.0.3 | 2018-08-20 09:24:39 | 2018-08-17 10:43:18 | true | 20591 | 20591 | 68599 | 0 | 0 | 2761109243963 | 335108553492 | 2426044128749 | 11896722 | 700620856 | 68599 | false | false |
| 4 | 10.18.150.201:3359 | v2.0.3 | 2018-08-20 09:24:40 | 2018-08-17 10:10:29 | true | 19626 | 19626 | 68640 | 0 | 0 | 2543388580280 | 309286329794 | 2234155642437 | 5993132 | 703693490 | 68640 | false | false |
| 5 | 10.18.150.207:3359 | v2.0.3 | 2018-08-20 09:24:44 | 2018-08-17 10:02:43 | true | 19665 | 19665 | 68638 | 0 | 0 | 2820316127275 | 338004004553 | 2482350233669 | 11738202 | 940116087 | 68638 | false | false |
| 6 | 10.18.151.13:3359 | v2.0.3 | 2018-08-20 09:24:40 | 2018-08-17 09:58:02 | true | 19617 | 19617 | 68640 | 0 | 0 | 2451373090665 | 309501830445 | 2141914437787 | 3364251 | 341572174 | 68640 | false | false |
| 7 | 10.18.150.173:3359 | v2.0.3 | 2018-08-20 09:24:43 | 2018-08-17 09:52:03 | true | 19224 | 19224 | 24764 | 0 | 0 | 743485896773 | 92308723952 | 651192497501 | 1598616 | 161799682 | 24764 | false | false |
| 8 | 10.18.150.197:3359 | v2.0.3 | 2018-08-20 09:24:34 | 2018-08-17 09:47:54 | true | 19629 | 19629 | 68638 | 0 | 0 | 2469301402801 | 303718144843 | 2165655795314 | 3078033 | 346599087 | 68638 | false | false |
| 9 | 10.18.150.167:3359 | v2.0.3 | 2018-08-20 09:24:39 | 2018-08-17 09:43:16 | true | 19861 | 19861 | 68638 | 0 | 0 | 2573276217995 | 328117438295 | 2245210994159 | 11060369 | 304826577 | 68638 | false | false |
+----+--------------------+--------+---------------------+---------------------+---------+------------------+-----------------------+--------+--------------------+------------------------+---------------+--------------+---------------+--------------+--------------+-------------------+--------------------+-------------+


(Tim O'Brien) #4

Hey @Jason,

Most of those log lines are normal chatter: if a line is prefaced with an "I", it's informational and doesn't require any action. The chatter here is mostly about other nodes trying to up-replicate onto nodes 1 and 7, while those two nodes are busy applying snapshots and therefore declining new ones. That makes sense, since they're currently holding ~24k ranges, compared to ~68k on the rest of the nodes.
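
If the rebalancing onto those two nodes looks stuck rather than just slow, you can also check the snapshot rate limits from a SQL shell. This is only a sketch (the address is a placeholder, and the SET line is a commented-out example value, not a recommendation):

  # inspect the current snapshot rate limits
  cockroach sql --certs-dir=certs --host=<any-node> --port=3359 \
    -e "SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;" \
    -e "SHOW CLUSTER SETTING kv.snapshot_recovery.max_rate;"

  # optionally raise the rebalance rate, cautiously (example value only)
  # cockroach sql --certs-dir=certs --host=<any-node> --port=3359 \
  #   -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '8MiB';"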

What do you see on the “Queue Processing Failures”, “Queue Processing Times”, and “Replication Queue” graphs in the Queues dashboard (http://<adminurl>/#/metrics/queues/cluster)?

Regarding performance, what specifically changed performance-wise? Were there specific queries that slowed down? What do you see in the Service Latency graphs before and after the upgrade?
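
If it's easier than digging through the UI, you can also spot long-running statements directly from a SQL shell, for example (a sketch; the address is a placeholder):

  # list queries currently executing across the cluster
  cockroach sql --certs-dir=certs --host=<any-node> --port=3359 \
    -e "SHOW CLUSTER QUERIES;"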


(Tim O'Brien) #5

Hey @Jason,

Just following up here to see how things are going - did the reply above help?


(Jason shi) #6

Thanks for your help.

The cluster ran normally for two days, but yesterday it was not normal. Screenshots are below.

There are a lot of errors in the log:

E180823 17:06:40.749284 446819381 server/server.go:1537 [n5,client=10.175.46.83:38016] write tcp 10.18.150.207:3359->10.175.46.83:38016: write: broken pipe
E180823 17:06:43.009302 446845735 server/server.go:1537 [n5,client=10.175.46.83:47326] write tcp 10.18.150.207:3359->10.175.46.83:47326: write: broken pipe
E180823 17:06:46.166461 446841944 server/server.go:1537 [n5,client=10.175.46.82:45694] write tcp 10.18.150.207:3359->10.175.46.82:45694: write: broken pipe
E180823 17:06:46.490916 446835803 server/server.go:1537 [n5,client=10.175.46.82:36450] write tcp 10.18.150.207:3359->10.175.46.82:36450: write: broken pipe
E180823 17:06:48.491327 446852938 storage/replica_range_lease.go:282 [n5,s9,r16401/5:/Table/3257/1/1{69587…-70351…}] heartbeat failed on epoch increment
E180823 17:06:48.494067 446852827 storage/replica_range_lease.go:282 [n5,s9,r82557/1:/Table/6506/1/56360{504…-676…}] heartbeat failed on epoch increment
E180823 17:06:48.496596 446852576 storage/replica_range_lease.go:282 [n5,s9,r1264/2:/System/tsd/cr.node.sql.mem.c…] heartbeat failed on epoch increment
E180823 17:07:16.175730 446858668 storage/queue.go:778 [replicate,n5,s9,r150551/1:/Table/6506/1/56109{552…-716…}] range requires a replication change, but lacks a quorum of live replicas (1/2)
E180823 17:07:16.183385 446858867 storage/queue.go:778 [replicate,n5,s9,r162943/5:/Table/436/1/{84025/…-166943…}] range requires a replication change, but lacks a quorum of live replicas (1/2)
E180823 17:07:16.186694 446858986 storage/queue.go:778 [replicate,n5,s9,r155064/3:/Table/5137/1/3{00627…-12575…}] range requires a replication change, but lacks a quorum of live replicas (1/2)
E180823 17:07:44.355476 446859756 storage/replica_range_lease.go:282 [n5,s9,r162134/4:/System/tsd/cr.node.sql.mem.a…] heartbeat failed on epoch increment
E180823 17:07:44.355626 446859809 storage/replica_range_lease.go:282 [n5,s9,r161161/3:/System/tsd/cr.node.sql.mem.a…] heartbeat failed on epoch increment
E180823 17:07:44.355750 446859923 storage/replica_range_lease.go:282 [n5,s9,r154036/2:/System/tsd/cr.node.sql.mem.a…] heartbeat failed on epoch increment
E180823 17:07:44.355832 446859971 storage/replica_range_lease.go:282 [n5,s9,r158040/2:/System/tsd/cr.node.sql.mem.c…] heartbeat failed on epoch increment

The node status is OK, and this server (10.18.150.207) has had no downtime or network outage. What is the reason?


(Tim O'Brien) #7

Hey @Jason, it seems that nodes were not able to communicate to upreplicate a range. What do the logs on 10.175.46.83 and 10.175.46.82 say at the same time? That’s where the connection seems to have broken.
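
One low-tech way to line things up is to pull the same time window out of each machine's logs and compare. For example (a sketch; the log path is a placeholder, and the pattern matches the warning/error lines from 17:05-17:08 on Aug 23):

  # on each CockroachDB node
  grep -E '^[WE]180823 17:0[5-8]' /path/to/cockroach-logs/cockroach.log

  # on 10.175.46.82 and 10.175.46.83, check the application or
  # connection-pool logs for the same window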


(Tim O'Brien) #8

@Jason - any further help needed here?


(Jason shi) #9

The problem has been solved.
Thanks!