Locality of data within ranges

We have 4 datacenters that we wish to run CockroachDB in, us-east, us-west, us-central, europe-west. I’m curious if there are any tips on how to avoid the europe-west latency in writes. We are running all of the nodes with “–locality continent=${CONTINENT},region=${REGION}”. We are testing out CockroachDB for our sessions storage. We have around 1-2k sessions updates/inserts per second right now with around 40% coming from Europe and 60% from US. We see around 30-40 sessions migrate between US<->Europe every second but for the most part sessions on one continent stay on that continent.

Specifically, a problem I envision is that since ranges are made up randomly with data a range can’t effectively be assigned a leaseholder in the most likely area of writes/reads. I haven’t found a way for us to tell CockroachDB about the relative locality of rows so it can try to keep alike rows (based on region of the session) in the same range. I read the “Follow-the-workload” but I’m concerned that since roughly half of requests would come from different regions, it won’t be able to effectively maintain a leaseholder in the best region.

If there isn’t a way to specify anything that affects the hash for ranges, would a table per region/dc make more sense then? Could we store the table name in the sessionID and then use that to do lookups in the correct table, which would then mostly contain rows from users in that region, and allow leaseholders to be more likely in the right region?

I also might be misunderstanding something about ranges or CockroachDB in general, so any clarity is greatly appreciated.

Hey @fastest963,

This really depends on your table schema. If you set your primary key to include the locality (ideally as the first part of a composite key), cockroach will group similar localities together, as we don’t use a hash. Note that this isn’t a guarantee, but it’s a good start.

We also have a new feature coming out in our 2.0 release which I think is available in our current alpha that is called row level partitioning, which will do exactly what you’re thinking of. Take a look at the RFC.

@Bram thanks for the reply! So if I do PRIMARY KEY (region, id) then it will group by “region” in v1.1? Or do I need to use the ALTER TABLE ... SPLIT AT VALUES ("region")?

Will the above achieve what I want or do I need to use the alpha and partitioning? I read the RFC and it seems to allow for much more finely grained partitioning but I might not need that currently and it seems like as long as I set my primary key to be region, id then I could switch to it in the future?

Yes. Exactly that. And you will require the alter table split at to ensure that the tables’ ranges are split in the right places.

However, what you won’t be able to do without the new feature is be able to set the zone configs on a per region basis.

@Bram thanks for all your help so far! I realized the SPLIT AT won’t work for us unless I can specify multiple values, which I couldn’t get to work. Ideally, I’d like to split on our 4 DC names. But even without doing any manual split, things seem to have split by themselves pretty nicely so I might not need that.

I set up a table like:

  region STRING,
  uuid   STRING,
  lp     STRING,
  lfv    INT,
  PRIMARY KEY (region, uuid)

Looking at the SHOW TESTING_RANGES for the table seems to show that it split across region and most of the europe-based ranges are owned by a node in Europe. The US regions (us-east, us-west, us-central) are less correct, however, with multiple west ranges owned by east/central and vice-versa. Is the relatively low latency between US regions causing this?

I’ve updated the locality of every node to now be: --locality continent=${CONTINENT},region=${REGION},zone=${ZONE} and within every region we have a box per zone (minimum 3 per region). I thought this might encourage CockroachDB to put multiple replicas for a range within the same region, but it doesn’t seem like that’s the case. Reading through various docs, it seems like this is a feature to spread out the failure domain from a single zone/region? I assume I should have min_replicas: 5 so that I could get a quorum with only 3 nodes potentially all in the US to minimize the write latency?

So what you’re trying to do, force some ranges based on their specified region to be in specific continents or even regions, will require the new partition feature. To lock a specific table / database to locality, you need to use zone configs. With the new partitioning feature, you can do that with ranges within a table.

By default, cockroach will try to spread out replicas for a range as much as possible for fault tolerance. And having a second level of locality will push cockroach to spread out extra replicas within the first level to be spread in the 2nd. Unless there is a zone config to specify locality constraints. And even then, it will try to spread out amongst the localities that it is constrained to.

As far as ownership goes, what I think you mean by lease, this is going to depend on the load. So if there is no load, then the lease may move around. When there is load, the lease should move towards where the most load is coming from.

By setting the replica factor to 5 nodes instead of 3, it does not guarantee that those will be in the locality you want.

Thanks again for the insight. This helps a lot. We tested the new partitioning feature today, but after talking it over with the team, we decided the availability tradeoff isn’t worth the speed benefit. If all of us-east1 goes down and users get sent to another DC, their session cannot be totally unavailable.

One last idea we had would be to run each node with just --locality zone=${ZONE} (absent of any region/continent info) and then run with num_replicas: 5 and since we have 3 zones per region, we could potentially have 3 of the 5 replicas for a range within the same region, then writes would have a quorum within the same DC (which would make them fast). The issue I think this poses is that if a whole region goes down and those 3 zones contained the range, the last 2 replicas won’t have a quorum to repair/rebuild the replicas, correct?

If that doesn’t work then the next best option for us would be to get 3 other Europe regions so we have 3 US regions and 3 Europe regions with 3 zones each and then use partitioning at the continent level so all of the replicas would be low latency but we could tolerate a whole region going down.

Once again, we appreciate all your help so far!

@fastest963, sorry for the delay on this.

How many regions do you have? And you’re going to split those into sub regions?

If you have 2 regions: 1, 2
And you split those into a,b,c each: 1a, 1b, 1c, 2a, 2b, 2c

And you set the replication to 5. Then indeed, all but one of those regions will have a replica. But I’m not sure how you would be able to specify which region (1 or 2) would get the 3 replicas and and which would get 2 replicas.

However, we have a new zone config format coming that might address that a bit more, see https://github.com/cockroachdb/cockroach/issues/19985 for more on the ongoing discussion.

And yes, if lose quorum, then that data is indeed unavailable.

I think your option of going with 3 regions per continent sounds like a significantly better way to deal with the failure of a single DC.

Let’s continue under the assumption that we’ll have 3 regions in each of 2 continents. Then with the new config (which I saw just got merged into 2.0 branch) ideally we’d have the replicas set up like:
1a, 1b, 2a, 2b, 3a

So that if 1 went down there’d still be a quorum. Rather than:
1a, 1b, 1c, 2a, 2b

Which means that if 1 goes down, there wouldn’t be a quorum and the data would be unavailable.

Is there a way to get that without the new config format, or is that what CockroachDB does by default? We currently have locality set like: --locality continent=$CONTINENT,region=$REGION,zone=$ZONE where there are 2 continents, 3 regions in each continent and 3 zones in each region.

With 2 regions instead of 3, if I’m understanding correctly, we wouldn’t be able to support a region going down since the split would be:
1a, 1b, 1c, 2a, 2b

and so when 1 goes down, then there’s only 2 left. But if 2 went down, then we’d still be okay.

Thanks again!

Your assessment is absolutely correct. The new format will be available in our first beta release for 2.0, which is happening, hopefully, next week.

And no, there is no way to simulate this with the current zone configs, which is exactly why we needed the improved format.

@fastest963 You were not alone in being confused by this, we’ve seen other customers go through the same questions. In response, Andy and I wrote up a high-level description of the tradeoffs you can make (and how to make them) in a recent blog post. Check it out: https://www.cockroachlabs.com/blog/geo-partitioning-one/

Thanks for the blog post! The new constraint format should be particularly useful in our use case.