If you’re using enterprise CRDB, we have the ability to do geo-partitioning. There is a good doc that outlines a migration using those features: https://www.cockroachlabs.com/docs/stable/demo-automatic-cloud-migration.html
If you’re not using enterprise edition, then I would suggest the following:
- Avoid changing num_replicas (i.e., the replication factor, RF) to an even number. An even RF buys you no extra failure tolerance over the odd number below it (RF=4 still only survives one node loss, since quorum needs 3 of 4), so it puts you in a bad spot during your migration. In fact, I think we can avoid changing the RF altogether.
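If you want to double-check the RF you're running today (your_db_name below is a placeholder for your database), this should show it:
SHOW ZONE CONFIGURATION FOR DATABASE your_db_name;
If you've never set a zone config on the database, it inherits from the default, which you can inspect with SHOW ZONE CONFIGURATION FOR RANGE default; either way, you should see num_replicas = 3.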
- Monitor where your ranges are located by running this in a SQL shell:
SELECT replicas, replica_localities, COUNT(*) FROM [SHOW RANGES FROM DATABASE your_db_name] GROUP BY replicas, replica_localities;
- Start your new nodes with the --locality flag set (https://www.cockroachlabs.com/docs/v20.2/cockroach-start#locality). Verify that they've joined the cluster and that your VPN peering is working correctly (firewall, DNS, both directions, etc.) by running:
cockroach node status
Make sure you see both the old nodes and the new ones, and make a note of which node IDs are old and which are new.
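For reference, the start command on each new node would look roughly like this; the addresses, certs directory, and locality tiers are placeholders you'd swap for your own setup:
cockroach start --certs-dir=certs --advertise-addr=<new node address> --join=<old node 1>,<old node 2>,<old node 3> --locality=region=<new region>,zone=<new zone>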
- Check your ranges query again - you should see some of the ranges start to move around to give better diversity across the old and new nodes. Wait for it to “settle.” Your app latencies (read and write) may increase during this time since network hops are in the mix now.
- Stop one of the old nodes by draining it:
cockroach node drain
Monitor the draining status with cockroach node status, and when it's done, kill the node with a SIGKILL.
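If I remember the v20.2 behavior right, drain acts on whichever node you connect to, so point it at the node you're shutting down; the address and certs directory here are placeholders:
cockroach node drain --host=<address of the old node> --certs-dir=certs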
- All ranges should still be available at this point because every range will still have 2 of its 3 replicas. An easy way to check is to run:
SELECT COUNT(*) FROM your_db.your_table;
If it's not responding, bring the node back up.
- You should see under-replicated ranges show up in the DB Console and then start to drop in number; and if you re-run your replicas query, you should see the ranges shuffle around and start to move away from the old nodes.
- Wait for the under-replicated ranges metric to go to 0.
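If you'd rather watch this from the command line than the DB Console, each node also exposes this as the ranges_underreplicated metric on its Prometheus endpoint. Something like the following should work, assuming the default HTTP port of 8080 (use https and your certs on a secure cluster):
curl -s http://<any node address>:8080/_status/vars | grep ranges_underreplicated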
- If the node is still showing up in the cluster, you can get rid of it by running:
cockroach node decommission <node id>
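While a decommission is in flight, you can watch its progress with:
cockroach node status --decommission --certs-dir=certs --host=<address of any live node>
The gossiped_replicas count for the decommissioning node should drain down toward 0 before it drops out of the cluster.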
- Continue to take the old nodes out of the cluster, one at a time, repeating the drain/decommission process above.
- When you’re done, you should be back to a state where you have 3 nodes in the cluster and all the ranges are located on these three nodes. And, you should have been able to do this whole process without losing DB availability.
One thing to note: while this process is going on and you're in a state where only 2 of 3 replicas are up and available, you don't want to lose or take down a second node, because you'll probably put some ranges out of quorum (i.e., only 1 of 3 replicas available). You can mitigate this risk by increasing the RF to 5 (in the step right after you add the three new nodes). But you will have to drop it back down to RF=3 at the point where you go from 5 nodes to 4. You'll also have to make sure you have enough disk capacity to handle the extra copies of the replicas, and there will be some additional overhead (creating the extra replicas, etc.). I'm not sure the extra hassle is worth the risk mitigation, but it's an option.
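If you do go the RF=5 route, the zone config changes would look something like this (your_db_name is a placeholder, and note that system ranges have their own zone configs you may want to bump as well):
ALTER DATABASE your_db_name CONFIGURE ZONE USING num_replicas = 5;
and then, before you drop from 5 nodes to 4:
ALTER DATABASE your_db_name CONFIGURE ZONE USING num_replicas = 3;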
One thing I'm not addressing here is how your apps are connecting to Cockroach. You'll notice that the article I referenced at the top of my note adjusts load balancer settings; assuming you're running through an LB, you'll need to do the same. You may also need to take down your apps and re-point them in a rolling fashion. As long as the apps can talk to at least one node in the cluster, they can read/write, but you don't want to overload any given node by having it take all the read/write requests. It's a good idea to monitor the connections in the DB Console using the SQL Dashboard and the Connections graph, and try to keep it balanced as you go.
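If you happen to be using HAProxy as that LB, one approach is to have cockroach regenerate its config once the new nodes have joined; this writes out an haproxy.cfg listing every live node, which you can then roll out (host and certs directory are placeholders):
cockroach gen haproxy --certs-dir=certs --host=<address of any node>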
Hopefully that gets you going in the right direction…