What is the recommended max size of one node

What is the recommended max size of one node?

What is the recommended max size of one table on one node?

If I have a table with 100TB of data, how many nodes are best practice?

It mostly depends on the kind of workloads.

If, say, hypothetically you only had 1 client app that only runs 1 SQL query every week and uses only 1 row out of your 100T of data, then it’s likely you can run your entire cluster on a single node (with a large hard disk and a couple GB of RAM).

The interesting question is to determine what throughput and latency you are expecting, across how many client connections, and how much data locality there is between the SQL queries.

The evaluated case is as follows:

  • 1 big table which stores one year of data: about 100TB, several tens of billions of rows.
  • data is synced in real time from other databases, about 10000 TPS
  • data is queried with simple conditions, by primary key or indexes, about 100 QPS

For the above case, with 3 replicas, how many nodes should I build the cluster with?

How is the data imported from other databases?

What is the locality of the imported data? Is this a timeseries or are these random primary keys?

The import method is to read the MySQL binlog, translate it to SQL statements, and apply them to crdb. So the TPS for crdb is the same as the upstream DB (10000 TPS).

The data has random primary keys.

First, a disclaimer: the information presented here does not constitute a guarantee that the parameters will be suitable for any particular purpose. This information is given as-is without any warranty. For contractual engagements, contact our sales department.

So your workload here is constrained by the write activity (the import).

Back of the envelope:

  • “several tens of billions of rows for one year” is several hundred new rows per second.
  • “100TB of data over several tens of billions of rows” means about 10KB of data per row.
  • however, you also mention about 10000 TPS of import activity; that means that a large majority of these import TPS modify existing rows (i.e. not just add new rows).
  • 100TB of data means 1.5M ranges (4.6M replicas). So with truly random access, each import TPS is likely to touch a completely different range and not be able to reuse any recently accessed range.
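For reference, here is the same arithmetic as a small Python sketch; the row count and the ~64MB target range size are assumptions taken from this thread and from CockroachDB's default, not measurements:

```python
# Back-of-the-envelope check of the estimates above (inputs are assumptions).
rows_per_year = 10e9            # "several tens of billions of rows" (low end assumed)
total_bytes   = 100e12          # ~100TB of logical data
range_bytes   = 64 * 2**20      # default target range size, ~64MB
replication   = 3

seconds_per_year = 365 * 24 * 3600
ranges = total_bytes / range_bytes
print(f"new rows/sec : {rows_per_year / seconds_per_year:,.0f}")        # ~300
print(f"bytes/row    : {total_bytes / rows_per_year / 1024:,.0f} KB")   # ~10 KB
print(f"ranges       : {ranges / 1e6:.1f}M")                            # ~1.5M
print(f"replicas     : {ranges * replication / 1e6:.1f}M")              # ~4.5M
```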

Pessimistically, this is a random access workload with extremely poor locality. This will be expensive. It would be important to determine the locality of the imported data: is the same data updated frequently? Are related data (in index order) also modified close in time? Is the updated data evenly spread or are there hot spots?

Assuming:

  • the import TPS are indeed randomly spread,
  • there are no foreign key relations

each write statement will need to touch 1 range for the PK, plus 1 range per secondary index. Because the transactions are truly random, all the internal grouping of writes usually performed by CockroachDB cannot occur.

In that case, for a 10000 TPS write workload I would say at least 10 nodes (16 cores/node, 8+GB RAM per node), and increase that with the number of secondary indexes. You will need at least 1-10GBit LAN bandwidth to sustain the inter-node synchronization.
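To make the dependency on secondary indexes explicit, here is a very rough sizing sketch; the per-node write capacity below is an assumed figure for illustration only, not a benchmarked number:

```python
# Rough sizing sketch. The per-node write throughput is an assumption for
# illustration; benchmark your own hardware before relying on it.
import math

write_tps            = 10_000                  # peak import rate from this thread
secondary_indexes    = 0                       # increase as you add indexes
ranges_per_statement = 1 + secondary_indexes   # 1 for the PK + 1 per secondary index

assumed_range_writes_per_node_per_sec = 1_000  # hypothetical sustained random-write capacity

range_writes_per_sec = write_tps * ranges_per_statement
nodes = math.ceil(range_writes_per_sec / assumed_range_writes_per_node_per_sec)
print(f"range writes/sec: {range_writes_per_sec}, estimated nodes: {nodes}")
# With no secondary indexes this lands at ~10 nodes; each additional index
# pushes the estimate up proportionally.
```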

Your import task will need to connect to multiple nodes simultaneously, create multiple connections per node (100+ conn / node), and divide the SQL statements over all the available connections.
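As an illustration, a minimal sketch of that fan-out, assuming a standard PostgreSQL driver (CockroachDB speaks the PostgreSQL wire protocol); the node addresses, database name and credentials are placeholders:

```python
# Spread the import statements round-robin over many connections to many nodes.
import itertools
import psycopg2

NODES = ["crdb-node1", "crdb-node2", "crdb-node3"]  # hypothetical addresses
CONNS_PER_NODE = 100                                # "100+ conn / node"

pool = [
    psycopg2.connect(host=host, port=26257, dbname="imports",
                     user="importer", sslmode="require")
    for host in NODES
    for _ in range(CONNS_PER_NODE)
]
rr = itertools.cycle(pool)  # round-robin iterator over all connections

def apply_statement(sql, params=()):
    conn = next(rr)
    with conn, conn.cursor() as cur:  # commits on success, rolls back on error
        cur.execute(sql, params)
```

In practice each connection would be driven by its own worker thread or async task so that statements actually execute in parallel; the point is simply that the load should not be funneled through a single node or a single connection.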

Again, if you have additional information about data locality, it may be possible to optimize this (i.e. reduce the number of required nodes).

Finally, a special word about storage and networking.

With just 10 nodes and 100TB of data we’re talking about 10TB of data per node or 450000+ replicas per node. This is an incredibly large number. If your imported TPS is really random and can touch any range randomly at any time, you will need arrays of SSD drives over fast parallel controllers to sustain the disk latency needed for the writes. This will be expensive hardware.

You can lower the price of the hardware by instead increasing the number of nodes: as the number of nodes increases, the required amount of storage per node decreases, and there is a point where you don’t need expensive fast storage controllers any more. You’ll need to consult with your hardware provider to determine the right price point.
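A small sketch of how the per-node numbers shrink as the cluster grows, using the same assumptions as the back-of-the-envelope above:

```python
# Per-node data and replica counts for various cluster sizes.
# Assumptions: 100TB logical data, replication factor 3, ~4.5M replicas total.
logical_tb, replication, replicas_total = 100, 3, 4.5e6
for nodes in (10, 20, 40, 80):
    print(f"{nodes:3d} nodes: "
          f"{logical_tb / nodes:5.1f} TB logical/node, "
          f"{logical_tb * replication / nodes:5.1f} TB stored/node, "
          f"{replicas_total / nodes / 1e3:4.0f}K replicas/node")
# 10 nodes -> 10TB logical (30TB stored) and ~450K replicas per node;
# 40 nodes -> 2.5TB logical (7.5TB stored) and ~112K replicas per node.
```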

Again, if you have more information about data locality, it may be possible to split the data across “slower” (cheaper) drives that are accessed rarely, and “faster” (more expensive) drives that are accessed often.

If you have more secondary indexes, the required capacity will increase too.

If your cluster is sufficiently provisioned to support a 10KTPS write workload, it will be able to support 100 read TPS easily, so I am not even considering this here.

Does this help?

To give you an idea of the benefits of locality: if you know that on average each second only 100 rows are modified by the 10000 TPS imports, and then the same rows are modified over and over multiple times so that e.g. over a period of 1 minute only 3000 rows are touched in total, then probably 3-5 nodes will be sufficient.
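Put in numbers, under that hypothetical locality (figures from the example above, not measurements):

```python
# Illustration of the locality example: 10000 writes/sec but only ~100 distinct
# rows touched per second and ~3000 distinct rows touched per minute.
import_tps          = 10_000
distinct_rows_per_s = 100
rows_per_minute     = 3_000

repeat_factor = import_tps / distinct_rows_per_s     # ~100 writes per hot row per second
hot_fraction  = rows_per_minute / (import_tps * 60)  # fraction of writes hitting a "new" row
print(f"each hot row rewritten ~{repeat_factor:.0f}x per second; "
      f"only {hot_fraction:.1%} of writes per minute touch a distinct row")
# 3000 / 600000 = 0.5%: nearly all writes land on ranges that are already hot,
# which is why a much smaller cluster can keep up.
```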

The hard drives will still be expensive, though. You’ll need 300TB of storage with a replication factor of 3.

Thanks for such a detailed reply.

The data locality of my application is:

  • inserts and updates are nearly equal
  • most updates happen within one minute after the insert
  • after one minute, there are few updates for the same key

The queries have following requirements:

  • most queries select rows from the most recent 10 days
  • the result set is small
  • the columns in the WHERE clause often change, so we will have quite a few indexes (>6)

10000 TPS is the max value; 2000 TPS is the average.

According to your explanation, the number of nodes is mainly determined by data locality and TPS, is that right?

According to our experience with MySQL, there are some other things to consider about node capacity:

  • the time to add a column: we need this to be < 5 hours (crdb is very quick)
  • the time to add an index: < 5 hours
  • the time for defragmentation: < 5 hours
  • the time for a backup: < 5 hours

Does crdb meet the above requirements when the capacity of one node is very large?

This is very good news. This means the locality is very good after all. In that case my estimates above may have been pessimistic (but see below).

OK, so the number of secondary indexes is directly related to the cost of write transactions (more indexes = more writes). This negatively impacts overall TPS. You can often compensate for this with more nodes as long as there is very little contention.

That, but also the desired maximum latency and the desired maximum number of clients/connections.

So there are a few things you need to know:

  • in CockroachDB, schema changes and backups are asynchronous: you can continue to use the database / table while the schema change is ongoing. Maybe this changes the “5 hours” requirement, because you do not need to stop your client applications to perform these operations (see the sketch after this list).
  • in CockroachDB, adding a column is a relatively cheap operation, and a backup can occur fully independently of query traffic, but in contrast adding an index can cause some amount of performance degradation.
  • we do not optimize schema change operations for maximum speed, but instead so that they have minimum impact on the performance of clients (because we assume they will be run in production environments concurrently with client traffic). So it may be that these operations in CockroachDB are slower overall.
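To illustrate what “asynchronous” means in practice, here is a minimal sketch; the table name, column name and connection details are hypothetical, and the driver is a standard PostgreSQL client since CockroachDB speaks the PostgreSQL wire protocol:

```python
# A schema change running while the same table keeps serving queries.
import threading
import psycopg2

def connect():
    return psycopg2.connect(host="crdb-node1", port=26257, dbname="imports",
                            user="importer", sslmode="require")

def add_column():
    conn = connect()
    conn.autocommit = True
    with conn.cursor() as cur:
        # Executes as a background job in the cluster; the table is not locked.
        cur.execute("ALTER TABLE events ADD COLUMN processed_at TIMESTAMPTZ")

threading.Thread(target=add_column).start()

# Meanwhile, normal client traffic against the same table keeps working:
conn = connect()
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM events "
                "WHERE event_ts > now() - '10 days'::interval")
    print(cur.fetchone())
```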

Also, do these “5 hours” requirements apply to all the 100TB of data (or 300TB with replication)? If so, you’re asking about a database that can perform column adds, index adds, and backups at 5-15GB/second! This is a very ambitious requirement. This requirement alone may require 100 nodes or more, and even then I’m not sure this is doable.
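For reference, that throughput figure is just the total data size divided by the 5-hour window, with no allowance for overhead:

```python
# Where the "5-15 GB/second" figure comes from (straight division, no overhead).
hours = 5
for label, terabytes in (("100TB logical", 100), ("300TB replicated", 300)):
    gb_per_sec = terabytes * 1e12 / (hours * 3600) / 1e9
    print(f"{label}: {gb_per_sec:.1f} GB/s to finish within {hours} hours")
# 100TB -> ~5.6 GB/s; 300TB -> ~16.7 GB/s
```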

Finally, there is no “defragmentation” in CockroachDB that you need to plan for. RocksDB compactions happen continuously in the background.