Time Series usage

replication
sql

(Q Pike) #1

I’m curious how cockroachdb’s performance would be for time series data? We are currently looking at Elastic and InfluxDB but rather use cockroachdb if possible.

We need to be able to store billions of datapoints down to the millisecond accuracy ( Finance/Stock Data ). How would it be able to handle this amount of data and running queries on it, and what is the throughput of writes. We would be writing about 10-30k points/second.

Thanks!


(Diana Hsieh) #2

Thanks for your question, @qrpike. As you might already suspect, we aren’t built to optimize for time series use cases. We are still working on testing our performance, although you should be able to achieve 10-30k points / second depending on how many nodes you have and what deployment you decide on. The issue with time series right now is that you will need to use something other than just the timestamp as the primary key so that you can shard your inserts across multiple nodes and thereby increase performance. For queries, that depends on what you’re thinking of doing - simple selects / single row writes / transactions / joins?

Re ms accuracy - could you provide more detail on what you mean by that?


(Drasko DRASKOVIC) #3

@dianasaur323 and @qrpike - do you have any updates on this subject? We from Mainflux project (https://github.com/Mainflux/mainflux) are strongly considering switching completely to CockroachDB.

While we have no doubts for resource storage, we are wondering if the messaging DB (currently InfluxDB) can be replaced and with which consequences. Actually, in Mainflux architecture messages are sent to NATS broker, and a worker behind the broker is getting them as the events and writing them in the DB. Messages come from sensors, and in IoT deployments of the future we are talking about millions and millions of sensors, so messaging might be intensive.

So, do you have the latest performance benchmarks to share, especially ones that can be interesting from this time-series IoT aspect?


(Diana Hsieh) #4

@drasko We don’t have any performance benchmarks yet - we are currently building out the infrastructure internally, and this is a core focus for us before our next release. We aren’t optimized for time-series / heavy write workloads, but I would suggest testing things out to see if they meet your needs? You can check out our loadgen repo for the load generators we have written to test CockroachDB performance / stability.


(Drasko DRASKOVIC) #5

@dianasaur323 thanks. We will be switching Mainflux to CockroachDB in the next few days and then give it a test.


#6

As a first experiment in using cockroach in a production but not critical role I am also interested in how well it might suite time series data. So I found this thread and wondered what results people may have come up with since January and what techniques they used.

Firstly I must say I am a database/SQL novice so even the most basic advice would be welcome.

My idea of time-series is simply series of records containing a series name/ID, a timestamp and a value. There may be a lot of inserts going on but never any updates. Reads would query a range of records for one or more series between given times.

Naively one might simply create a table of such rows and start inserting records. Before I even do that I’m sure it would be terribly slow. What with all those round trips to the three nodes in my cluster.

Looking around the net I find quite a lot of discussion of using SQL as a time-series database. Ways to optimize it. Most of them seem to use specific features of particular databases, MySQL, postgresql, etc. More importantly they seem to be tuned to the ways those db’s work under the hood. I imagine such tuning would be somewhat different for Cockroach.

My simple mind has understood a few simple optimizations:

  1. Batch up writes and do bulk inserts of many rows at a time. One row per value.

  2. Batch up writes and put many values into arrays. One row can then hold many values.

  3. Batch up writes and put them into vectors. As above. I don’t think Cockroach has vectors.

  4. People do odd things like use a different table for each series. Or generally partition things in other ways.

I did not understand dianasaur323 statement above: “The issue with time series right now is that you will need to use something other than just the timestamp as the primary key so that you can shard your inserts across multiple nodes and thereby increase performance.”

Surely cockroach is doing it’s own sharding and replication under the hood? What would be a good primary key? I was just going to use the series ID and timestamp.

Any advice welcome. Something simple to get me started for now.


(dan) #7

@heater As mentioned above, CockroachDB is not designed for timeseries
data, which generally requires a very specific set of tradeoffs that aren’t
relevant to other workloads. This is why timeseries dbs tend to be very
specifically just for that one thing. That said you can certainly try to
use CockroachDB for this and report back, it’s obviously a use many people
consider us for. There’s a benefit to using one DB for everything.

I’d expect that a primary key of (series_id, timestamp) would be best.
CockroachDB splits the data it contains into ranges (which are the smallest
thing we use to redistribute load) sorted by primary key. Which means that
if you have series_id before timestamp (and once the dataset gets big
enough), each series will be writing to its own range.

Additionally, each row in a table corresponds to one key-value pair in our
underlying kv layer (which our SQL impl is built on top of). There’s some
overhead to each batch of inserts and to each kv, so your points (1) and
(2) are likely to help in CockroachDB as well. Our support for arrays is
new, so I’d try it love to hear how it compares with and without arrays.
(4) isn’t needed as it’s handled automatically for you if the primary key
is as I described above.

Good luck and please do follow up with any more questions/your results.


#8

Thanks dan,

I quite appreciate that cockroach is not designed for time-series. I would never have suspected that any SQL database was. But I was inspired by examples of people having good luck with SQL and time-series. Even if it meant bending things and losing all the generality of SQL along the way. I think storing in arrays/vectors is such a “bending”.

Anyway. I have decided to start with the most naive approach to this. One time series sample, one insert.

I’d like to see how it goes. As an SQL novice doing the most simple thing I managed to find a rather serious bug in cockroach already :slight_smile:

And it’s a baseline from which to measure any optimizations like using arrays or whatever.

Thanks.


(Charl) #9

@heater would you mind sharing your findings with using CockroachDB as a time series store?


#10

@heater Did you find success with your time series implementation? We are also considering using CRDB and implementing a time series data set.


#11

I did not get as far along the road of time series experiments as I would have liked.

I do have a 3 roach cluster logging about 4 records per minute, only an ID, a time stamp and a value which has been running since November and is now up to 1.2 million rows. Not much of an exercise but it works well.

Soon I might have much bigger demands, we will see.