Does read/write perf scale linearly with node count like Cassandra? Are there declared perf goals like this?
Read/Write performance should scale linearly with the node count assuming that the data is laid out properly. Cassandra essentially hashes the partition key to spread data evenly through the cluster, so for Cockroach you would want to do something like this:
CREATE TABLE data (id BYTES PRIMARY KEY, email STRING);
INSERT INTO data VALUES (sha_256('firstname.lastname@example.org'), 'email@example.com');
SELECT email FROM data WHERE id = sha_256('firstname.lastname@example.org');
Of course, if your primary key is already fairly randomly distributed, you don’t need to hash it. In the future we may have built-in support for this sort of thing (which may also enable the same semantics in the presence of secondary indexes, for which the above currently doesn’t work well).
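To make the effect of hashing concrete, here is a minimal Python sketch. It is a toy model, not how CockroachDB actually places data: it assumes a hypothetical fixed set of 8 equal-width key ranges chosen by the first byte of the key (real ranges split by size), and the names `NUM_RANGES` and `range_for_key` are made up for illustration. It only shows that keys sharing a common prefix pile into one range, while their SHA-256 digests spread out.

```python
import hashlib
from collections import Counter

NUM_RANGES = 8  # hypothetical number of equal-width, ordered key ranges


def range_for_key(key: bytes, num_ranges: int = NUM_RANGES) -> int:
    """Map a key to a range using its first byte (toy stand-in for ordered ranges)."""
    return key[0] * num_ranges // 256


# Keys that share a prefix, as natural identifiers often do.
emails = [f"user{i:06d}@example.org" for i in range(10_000)]

# Unhashed keys: every key starts with "u", so they all fall in the same range.
raw = Counter(range_for_key(e.encode()) for e in emails)

# Hashed keys: the digest's first byte is effectively uniform, so keys spread out.
hashed = Counter(range_for_key(hashlib.sha256(e.encode()).digest()) for e in emails)

print("raw keys:   ", sorted(raw.items()))
print("hashed keys:", sorted(hashed.items()))
```

In this model the unhashed keys occupy a single range and the hashed keys occupy all eight in roughly equal shares, which is the property the hashed primary key above is meant to buy.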
See https://github.com/cockroachdb/cockroach/issues/7186#issuecomment-225646736 for the original thread.
@tobias Given you guys recently wrapped up the code yellow stability work (congrats) and are presumably moving on to more feature/tuning related work - are there any performance numbers against recent betas you can share, if they exist (no matter how naive)? Even something as simple as the insert/select statement above across different cluster sizes.
Hi @n00b, thanks for your interest in CockroachDB. We currently do not have any performance numbers to share, although we are actively looking into this. As a lot of inputs go into measuring performance (some of which you mentioned), we want to find the most effective way to share metrics. Was there a specific use case / latency requirement you had in mind?
Hi @dianasaur323, I don’t have a specific answer to the use case / latency requirement question. Mainly curious about the performance when simply adding more nodes. Given a baseline of a 3 node cluster, what are the performance characteristics with the same workload against 5, 10, or 16 nodes? Close(ish-ly-kind of) to linear? Not looking for anything in-depth, repeatable, or independently verifiable - just any kind of idea what you have seen even with the simplest of workloads.
Disclaimer: I understand that if you release some numbers, no matter how synthetic, CockroachLabs will get called out somewhere if it’s not positive, or questioned if they’re really positive. So if the answer to the question is “we ain’t sharing yet”, that’s a totally acceptable response in my book.
Ha! I’m glad we are on the same page re metrics. It’s really easy to poke holes in any sort of benchmarking exercise. Regarding performance of adding new nodes, we are aiming for “near” linear scalability. If you take a look at our product roadmap for Q4 https://github.com/cockroachdb/cockroach/issues/10528, we outlined some metrics we hope to hit in regards to scalability. Full disclosure is that this will likely leak into Q1 2017, when we will have some more to share with you on this topic.
Hi @dianasaur323, any updates on the metrics, benchmarks and per node count performance?
I’m personally specifically curious about performance on a globally sorted indexed field. See this problem with Cassandra: https://stackoverflow.com/a/44743936/177498
Hi @ubershmekel - thanks for providing that specific example. We are actually currently knee deep in setting up the infrastructure we need to release performance benchmarks. We are currently testing with specific transactional workloads including YCSB, although as you know, it’s hard to compare different databases on an apples-to-apples basis.
The one thing I can share with you is that we’ve seen a linear increase in throughput as you increase nodes, but in terms of query performance, it’s still an ongoing area of research for us. Please stay tuned, and apologies for the delay!
Based on the SO link you give, your comments at the bottom regarding CockroachDB, and the linked article at http://www.datastax.com/dev/blog/we-shall-have-order, I sense there is a difference between the question you wrote on our forum above and the question you intend to ask:
your question here seems to be about a “globally sorted indexed field”. In CockroachDB all field indices are sorted. Moreover, all indices are also distributed and sharded automatically. As a result, all queries that use the index to access the table are automatically distributed too. For this specific use case we already know that performance scales reliably with the number of nodes.
however, the link you provide is about time series (TS) data and how to store and query this data efficiently. The problem there has nothing to do with indexes and sorting; it is instead about the locality of data in the primary key. As explained in the linked article http://www.datastax.com/dev/blog/we-shall-have-order “One of the more typical early mistakes made in time series data modeling is designing a table that is dependent on a time as its primary key”. This will typically incur performance penalties in CockroachDB for the reasons you explain in your SO comment. (More specifically, max inserts/sec will not increase with cluster size.) However, note:
- Although we do not currently have any concrete plans to accelerate insert performance and data balancing on TS tables organized with a timestamp as PK, there is a lot of demand for this and we will likely have a more serious look at this in the future.
- Meanwhile, for this type of use case, you can tune write performance effectively with CockroachDB by using some random value as PK and use a secondary index on the time column. In this way, inserts are properly scattered through the cluster, resulting in better throughput scalability of writes (more inserts/sec when you increase the number of nodes), at the cost of slightly higher latency for individual read queries.
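The difference between a timestamp PK and a random PK can be sketched with a small Python simulation. This is an illustration under simplifying assumptions, not a model of CockroachDB internals: ranges are represented as a sorted list of split points derived from existing keys, integers stand in for timestamps, `uuid.uuid4()` stands in for "some random value", and the helper names `split_points` and `range_hits` are hypothetical.

```python
import bisect
import uuid
from collections import Counter


def split_points(keys, num_ranges):
    """Pick boundaries that divide the existing keys evenly into ordered ranges."""
    ordered = sorted(keys)
    step = len(ordered) // num_ranges
    return [ordered[i * step] for i in range(1, num_ranges)]


def range_hits(boundaries, new_keys):
    """Count how many new keys land in each ordered range."""
    return Counter(bisect.bisect_right(boundaries, k) for k in new_keys)


NUM_RANGES = 5  # hypothetical number of ranges

# Timestamp PK: every new key is larger than all existing keys.
existing_ts = list(range(1_000))
new_ts = list(range(1_000, 1_500))
ts_hits = range_hits(split_points(existing_ts, NUM_RANGES), new_ts)

# Random PK: new keys fall anywhere in the key space.
existing_ids = [uuid.uuid4().bytes for _ in range(1_000)]
new_ids = [uuid.uuid4().bytes for _ in range(500)]
id_hits = range_hits(split_points(existing_ids, NUM_RANGES), new_ids)

print("timestamp PK:", sorted(ts_hits.items()))
print("random PK:   ", sorted(id_hits.items()))
```

With time-ordered keys, all 500 new inserts land in the last range, so adding nodes does not add insert capacity; with random keys, the inserts are spread over all ranges. The same append-at-the-end pattern applies to any structure ordered by the time column, including an index on it.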
Does this clarify?
A timestamp as a Primary Key … is that sensible at all? Hardware gets faster and faster … once, a timestamp with a resolution of seconds was acceptable since hardware was slow. After hardware upgrades, problems came up. Now we have timestamps with millisecond or even microsecond resolution … hardware will still get faster.
Just my 2c.
Metrics would be cool. And that does clarify.
@knz why does making the time series column a secondary index make a difference? According to https://www.cockroachlabs.com/blog/sql-in-cockroachdb-mapping-table-data-to-key-value-storage/
The secondary index is effectively equivalent to the primary key: /tableID/indexID/indexColumns[/columnID]
Is the secondary index key-value-pair creation deferred or cheaper than the PK key-value-pair?
@ik_zelf - Cassandra has a problem with a timestamp as a secondary index because an index is the same thing as a table there. I think that might be the case with Cockroach too. Essentially you can’t sort the entire table - you can only sort each partition.
You are right, value locality in inserts to a column with a secondary index will cause write contention on that index. I hadn’t considered that.