How do we create partitioned tables in CockroachDB? We are planning to migrate Oracle to cockroachDB. We have partitioned tables in Oracle. Thank you
Cockroachdb does partitioning automatically on it’s own over nodes. There is some support to tune it by locality settings, and I think plans to manually configure it in the future: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/sql_partitioning.md
@jlauro is absolutely correct here.
I spent some time reading over oracle’s partitioned table page, https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm and from what I can tell, CockroachDB provides most of this functionality automatically. All tables are split (partitioned) into 64MiB chunks and are spread around the full cluster. So when SQL statements are being run against the table, the whole cluster instead of a single node is used to speed up the operation.
And as was pointed out, being able to partition your data on a per row basis is a feature that will be coming soon, but this feature is designed so that specific rows of a table can be kept in specific locations, for faster access or comply with data sovereignty laws.
We do already provide functionality right now to ensure that an entire database or table is housed on specific servers via zone configs. https://www.cockroachlabs.com/docs/configure-replication-zones.html
Please let me know if there was a specific aspect of partitioned tables that you are looking for that you think we might need to address.
Does CRDB offer subpartitioning ?
We’re working on it at this very moment! Specifically, an RFC about partitioning is out for review, if you’d like to provide feedback: https://github.com/cockroachdb/cockroach/pull/18683
- Cockroach partitioning is about spiting table-spaces into ranges, which are replicated across servers.
- Cockroach ‘Geo Partitioning’ will allow us to control where ranges are stored.
- Oracle partitioning is an optimization technique that effectively creates a separate table-space for each partition. Say you partitioned account information by date is was received, and had seven years data. If a query was looking for data from last year, It would only look at the table-space with last years data. Making the query use 1/7 the resources.
I’d hate to go back to the early days when we manually created a bunch of tables-spaces and then put a view over it:
- create table x_2001 …
- create table x_2002 …
- create table x_2018 …
- create view x as Select * from x_2001 where cdate between ‘2001/01/01’ and ‘2002/01/01’ … Select * from x_2018 where cdate between ‘2018/01/01’ and 2019/01/01’
with the new partitioning feature you would make each year a separate partition.
Then any query that selects a particular value for that year with WHERE will only use the nodes / ranges that contain data for that year. That will indeed use the fraction of the resources corresponding to how much data is available for that year.
I do not see why you would need to create separate tables. What makes you think it is necessary?
I had to re-read the RFC twice.
Partitioning in SQL Server, Oracle, UDB and DB2 is mostly about sub-dividing larger tables into smaller chunks to limit data processed in a select, statistics gathering and backups. While it is possible to designate partitions to faster or slower drives, this is NOT what most legacy DBA’s think about.
I suspect ‘partition elemination’ is what the original poster was looking for.
Range splitting across servers( as mentioned by John A Lauro) does not address this concern.
Bram, you are correct. The RFC does provide what they are looking for.
However is focuses on the benifits of partition localization (Geo and storage speed)
You have to VERY carefully read the examples to see that select queries do the sort of partition elimination other DBMS do.
Might want to add a bullet for that.
It looks like it’s been discussed above, but to reiterate the importance of the it - one feature I would be looking for is the ability query a single partition. For example if I have data partitioned by a date column into daily partitions, and would like to query only the data in a given partition.
Another feature would be the ability to drop a partition. For example, with data in daily partitions I would like to drop one older partition per day, after all of the data in in it has been processed.
We’re in the middle of working on our official documentation for partition (to coincide with our upcoming 2.0 release) , and I’ll pass on this feedback directly to our documents team.
As you can tell, for us, this partitioning feature is more about data placement and sovereignty over querying speed. Such as ensuring that Australian data is located in Australia for faster lookups. In the classic case (non-geo-distributed), It can help with querying speed, but so will a basic index. All of our tables are automatically partitioned and distributed already.
All our tables are already automatically partitioned and as long as you have an index or use the date as part of the primary key, we would already only send the query to the specific partition. The use of a classic WHERE clause would cover this already.
Now, if there was a need to specifically move the data to a slower node (or datacenter) or something like that, then our new partition system would help. And partition by day would work for that case.
As far as dropping a partition, using a DELETE with a WHERE clause should be sufficient.
I started some work on a row based TTL feature a while ago, but it hasn’t been a priority so it’s still in the backlog. But I do think it would be very helpful. If there is enough interest, I’d love to build it.
But perhaps I’m misunderstanding your use case. Would a basic index based on date not be sufficient?
Thanks for the reply.
I believe, it is confusing when you use the same word, ‘Partitioning’
- For range ‘splitting’ (as controlled by the DBMS)
- For data ‘partitioning’ (as controlled by a DBA)
‘This partitioning feature is more about data placement and sovereignty’
- I love this feature!!! You are correct to highlight it!
However, ‘Partitioning’ tables by field also allows query partition eliminate.
- DBA’s who work with legacy DBMS, will find this is more important
- Not talking about it, will leave many thinking it is not provided.
Hopefully the documentation will clear up the confusion.
Yep, an index would be sufficient. When I was writing about drop partition
I was assuming that this operation is a lot faster than a delete (I know it
is in Oracle).