I would like to create a table that receives a stream of timestamped data and is queried by clients for the records from the previous hour, filtered by the “cluster” and “type” columns. I know CRDB uses the primary key to partition the data, so I was wondering what the most effective way to distribute the writes and reads across the whole cluster would be for this data pattern?
You mean you want to target your application to specific cluster nodes? I suggest putting an instance of HAProxy (http://www.haproxy.org/) between your application and the cluster. You can ask CockroachDB to create a configuration for you, see documentation at:
No sorry, I was not very clear. I am trying to make sure that I am not creating any hot spots in the cluster as data is inserted. Coming from a Cassandra background, I am aware that the underlying key-value store will partition the data by the primary key, and I wanted to make sure that I was evenly distributing primary keys across the cluster so as not to overwhelm any particular node. After some research it appears that using a UUID column for the primary key should distribute the keys across the cluster nodes in a random fashion, which I was hoping someone could confirm.
As for distributing client connections, we are already using HAProxy, which has worked very well.
Hi @somecallmemike! You’re correct that data is chunked into ranges and distributed by primary key. If you’d like to avoid hotspots on inserts, a UUID primary key will do that. Here are some more docs on how to set up schemas if you’re interested: https://www.cockroachlabs.com/docs/stable/performance-best-practices-overview.html#use-uuid-to-generate-unique-ids
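To make that concrete, here’s a minimal sketch of what such a schema could look like. All table and column names here are hypothetical (the original question only mentions “cluster” and “type”), and the secondary index is one possible way to serve the previous-hour reads:

```sql
-- Hypothetical table for the workload described above.
-- The random UUID primary key spreads inserts evenly across ranges,
-- avoiding a write hotspot on the most recent timestamp.
CREATE TABLE events (
    id      UUID        NOT NULL DEFAULT gen_random_uuid(),
    cluster STRING      NOT NULL,
    type    STRING      NOT NULL,
    ts      TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload JSONB,
    PRIMARY KEY (id),
    -- Secondary index to serve queries filtered by cluster and type
    -- over a time window.
    INDEX events_cluster_type_ts_idx (cluster, type, ts)
);

-- Clients can then fetch the previous hour of records:
SELECT *
FROM events
WHERE cluster = 'us-east'
  AND type = 'metric'
  AND ts > now() - INTERVAL '1 hour';
```

One trade-off to be aware of: the secondary index is ordered by timestamp within each (cluster, type) prefix, so its right-most ranges will absorb most index writes even though the primary key distributes the row data randomly. Whether that matters depends on your insert rate per cluster/type pair.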