Behavior during Asymmetric network partition between 2 nodes

Hi,

I have a 3-node cluster (A, B, C). I used the comcast tool to create one-way 100% packet loss between two CockroachDB nodes, say from A -> B.
I don't see any nodes being marked as suspect in the admin UI.
I am also not able to connect to node A from my Java app using the PostgreSQL driver (the connection attempt just seems to hang after this log statement):

2019-10-24 22:36:33 FINEST org.postgresql.core.v3.ConnectionFactoryImpl sendStartupPacket  FE=> StartupPacket(user=berserker, database=accounts, client_encoding=UTF8, DateStyle=ISO, TimeZone=America/Los_Angeles, extra_float_digits=2)

Connections and writes to nodes B and C seem to work just fine. Why does this happen? Is this expected behavior?

Thanks!

Hi @iamhari,

To summarize: only Node A is unable to communicate with Node B?

I believe this is expected behavior. The cluster is able to recognize that there is a connectivity problem from A->B. However, Node C can communicate with A and B bidirectionally, and therefore node A is not considered suspect.

Thanks for your reply @mattvardi
I think I get the part about A not being considered a suspect node, but can you share some details on why the Java app is unable to connect to Node A in the above scenario?

So after some poking around: we are aware of this limitation, and I think in the future we may raise a flag when there is a one-direction network partition. Earlier, I mentioned the cluster is smart enough, and I would like to reword my response. The cluster is still able to achieve Raft consensus and therefore nothing looks suspicious on the front-end side.

With that said, you can always view the network page on the admin UI to view the partition.

In order to troubleshoot why your Java app isn’t able to connect to Node A, I would need more details on how you are creating this packet loss scenario and what is being blocked.

Thanks,
Matt

Technically the cluster is still able to perceive connectivity using its gossip sub-system, which is what shows up in the node status page in the UI.

However certain forms of Raft consensus will be impossible (in particular any range forming a raft group over nodes A and B will be impacted and may become unable to process KV requests). So it’s definitely possible for this situation to cause partial range unavailability.
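To make the scenario concrete, here is a toy model of the one-way loss from the original post. It is only a sketch and not CockroachDB's Raft implementation: the class and method names are hypothetical, and it assumes that any request/response exchange needs packets to flow in both directions. It counts which replicas could acknowledge each possible leader of a range replicated on A, B and C.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of directed packet loss (hypothetical, for illustration only). */
final class AsymmetricPartitionSketch {
    /** Directed links that drop every packet, encoded as "from->to". */
    private final Set<String> dropped = new HashSet<>();

    void drop(String from, String to) {
        dropped.add(from + "->" + to);
    }

    /** Assumption: a request/response exchange needs both directions to work. */
    boolean roundTrip(String a, String b) {
        return !dropped.contains(a + "->" + b) && !dropped.contains(b + "->" + a);
    }

    /** Replicas (including the leader itself) that could acknowledge this leader. */
    List<String> ackers(String leader, List<String> replicas) {
        List<String> result = new ArrayList<>();
        for (String r : replicas) {
            if (r.equals(leader) || roundTrip(leader, r)) {
                result.add(r);
            }
        }
        return result;
    }

    /** A strict majority of the replica count. */
    static boolean isMajority(int acks, int replicas) {
        return acks > replicas / 2;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("A", "B", "C");
        AsymmetricPartitionSketch net = new AsymmetricPartitionSketch();
        net.drop("A", "B"); // the one-way loss from the original post
        for (String leader : nodes) {
            List<String> acks = net.ackers(leader, nodes);
            System.out.println("leader " + leader + " can be acked by " + acks
                    + ", majority=" + isMajority(acks.size(), nodes.size()));
        }
    }
}
```

In this model the A-B pair loses its round trip no matter which of the two initiates, so a leader on A or B is left with exactly two acknowledgers including itself and no margin for any further failure. The model does not cover leader elections, leaseholder placement, gossip, or the SQL layer, so it cannot by itself reproduce the partial unavailability described above.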

The main limitation is in the UI: the node availability section of the main screen does not show partial range/network unavailability, which is visible only in the range and network debug pages. When in doubt, use the range and network/latency debug pages, which are more precise for assessing cluster health.

I am unable to replicate the same issue now, but when I created a similar asymmetric partition between nodes A and B, this is what I observed.

Preparation: a single thread/connection from the Java app connects to one node in the cluster and tries to insert 1000 rows with a randomly generated UUID as the primary key.
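For anyone trying to reproduce this, a minimal sketch of such a workload might look like the code below. It is not the actual test app: the table name `test_rows`, the class name and the timeout value are assumptions, and TLS/authentication settings are left out. The `connectTimeout`, `loginTimeout` and `socketTimeout` URL parameters are pgjdbc connection properties (all in seconds); setting them should make a stalled node surface as an exception instead of an indefinite hang.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

/** Hypothetical single-connection insert workload; requires pgjdbc on the classpath. */
final class InsertWorkloadSketch {

    /** Builds a pgjdbc URL with connect, login and socket-read timeouts in seconds. */
    static String buildUrl(String host, int port, String database, int timeoutSeconds) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database
                + "?connectTimeout=" + timeoutSeconds
                + "&loginTimeout=" + timeoutSeconds
                + "&socketTimeout=" + timeoutSeconds;
    }

    /** Inserts rows one at a time and returns how many succeeded before any failure. */
    static int insertRows(String url, String user, int rows) throws SQLException {
        // Assumed schema: CREATE TABLE test_rows (id UUID PRIMARY KEY)
        String sql = "INSERT INTO test_rows (id) VALUES (?)";
        int inserted = 0;
        try (Connection conn = DriverManager.getConnection(url, user, null);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < rows; i++) {
                ps.setObject(1, UUID.randomUUID()); // random UUID primary key
                inserted += ps.executeUpdate();
            }
        }
        return inserted;
    }

    public static void main(String[] args) {
        // args[0] = node hostname, args[1] = SQL user
        String url = buildUrl(args[0], 26257, "accounts", 15);
        try {
            System.out.println("inserted " + insertRows(url, args[1], 1000) + " rows");
        } catch (SQLException e) {
            // With the timeouts set, an unresponsive node should end up here.
            System.out.println("failed: " + e.getMessage()
                    + " (SQLState " + e.getSQLState() + ")");
        }
    }
}
```

Running it once per node (A, B, C) while the packet loss is active gives one result per scenario, with the SQLState printed for any failure.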

Scenario 1: Connecting to node A
Connection successful but unable to insert records - no errors observed

Scenario 2: Connection to node C
Connection successful and all records inserted successfully

Scenario 3: Connection to node B
Connection unsuccessful with the error below:

DriverManager.getConnection("jdbc:postgresql://ec2-3-86-42-222.compute-1.amazonaws.com:26257/accounts?loggerLevel=TRACE")
    trying org.postgresql.Driver
2019-11-04 10:53:02 INFO main waiting for workers to spool up
2019-11-04 10:53:02 FINE org.postgresql.Driver connect Connecting with URL: jdbc:postgresql://ec2-3-86-42-222.compute-1.amazonaws.com:26257/accounts?loggerLevel=TRACE
2019-11-04 10:53:02 FINE org.postgresql.jdbc.PgConnection <init> PostgreSQL JDBC Driver 42.2.5
2019-11-04 10:53:02 FINE org.postgresql.jdbc.PgConnection setDefaultFetchSize   setDefaultFetchSize = 0
2019-11-04 10:53:02 FINE org.postgresql.jdbc.PgConnection setPrepareThreshold   setPrepareThreshold = 5
2019-11-04 10:53:02 FINE org.postgresql.core.v3.ConnectionFactoryImpl openConnectionImpl Trying to establish a protocol version 3 connection to ec2-3-86-42-222.compute-1.amazonaws.com:26257
2019-11-04 10:53:02 FINEST org.postgresql.core.Encoding <init> Creating new Encoding UTF-8 with fastASCIINumbers true
2019-11-04 10:53:02 FINEST org.postgresql.core.Encoding <init> Creating new Encoding UTF-8 with fastASCIINumbers true
2019-11-04 10:53:02 FINEST org.postgresql.core.Encoding <init> Creating new Encoding UTF-8 with fastASCIINumbers true
2019-11-04 10:53:02 FINEST org.postgresql.core.v3.ConnectionFactoryImpl enableSSL  FE=> SSLRequest
2019-11-04 10:53:02 FINEST org.postgresql.core.v3.ConnectionFactoryImpl enableSSL  <=BE SSLOk
2019-11-04 10:53:02 FINE org.postgresql.ssl.MakeSSL convert converting regular socket connection to ssl
2019-11-04 10:53:02 FINE org.postgresql.core.v3.ConnectionFactoryImpl tryConnect Receive Buffer Size is 178,560
2019-11-04 10:53:02 FINE org.postgresql.core.v3.ConnectionFactoryImpl tryConnect Send Buffer Size is 23,040
2019-11-04 10:53:02 FINEST org.postgresql.core.v3.ConnectionFactoryImpl sendStartupPacket  FE=> StartupPacket(user=berserker, database=accounts, client_encoding=UTF8, DateStyle=ISO, TimeZone=America/Los_Angeles, extra_float_digits=2)
2019-11-04 10:53:12 FINEST org.postgresql.core.v3.ConnectionFactoryImpl doAuthentication  <=BE ErrorMessage(ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000)
org.postgresql.util.PSQLException: ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000
	at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:514)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:141)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
	at org.postgresql.Driver.makeConnection(Driver.java:454)
	at org.postgresql.Driver.connect(Driver.java:256)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:678)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
	at org.apache.commons.dbcp2.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:92)
	at org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:291)
	at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:883)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:436)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
	at org.apache.commons.dbcp2.PoolingDriver.connect(PoolingDriver.java:152)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:678)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:252)
	at com.twilio.blaberus.Worker.run(Worker.java:74)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)
SQLException: SQLState(XX000)
2019-11-04 10:53:12 FINE org.postgresql.Driver connect Connection error:
org.postgresql.util.PSQLException: ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000
	at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:514)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:141)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
	at org.postgresql.Driver.makeConnection(Driver.java:454)
	at org.postgresql.Driver.connect(Driver.java:256)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:678)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
	at org.apache.commons.dbcp2.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:92)
	at org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:291)
	at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:883)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:436)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
	at org.apache.commons.dbcp2.PoolingDriver.connect(PoolingDriver.java:152)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:678)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:252)
	at com.test.Worker.run(Worker.java:74)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

    trying org.apache.commons.dbcp2.PoolingDriver
getConnection failed: org.postgresql.util.PSQLException: ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000
getConnection failed: org.postgresql.util.PSQLException: ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000
2019-11-04 10:53:12 SEVERE Worker run pool-1-thread-1 ex: org.postgresql.util.PSQLException: ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000
org.postgresql.util.PSQLException: ERROR: error looking up user berserker: get-hashed-pwd: no inbound stream connection
  Server SQLState: XX000
	at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:514)
	at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:141)
	at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192)
	at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
	at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
	at org.postgresql.Driver.makeConnection(Driver.java:454)
	at org.postgresql.Driver.connect(Driver.java:256)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:678)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:190)
	at org.apache.commons.dbcp2.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:92)
	at org.apache.commons.dbcp2.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:291)
	at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:883)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:436)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:365)
	at org.apache.commons.dbcp2.PoolingDriver.connect(PoolingDriver.java:152)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:678)
	at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:252)
	at com.test.Worker.run(Worker.java:74)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:514)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1167)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:641)
	at java.base/java.lang.Thread.run(Thread.java:844)

@knz can you elaborate on why Raft consensus will be impossible, given that 2 nodes are always available from any node's view of the cluster?

Any insights on the above, @knz/@mattvardi?

CockroachDB does not support functioning with asymmetric network topologies. If node A can connect to node B but not the other way around, this will cause multiple kinds of problems, including a possible loss of quorum.