Database snapshot

My use case is probably not that unusual in that an ability to take snapshots at the database level would be useful (especially with some degree of management tools to view, drop, and perhaps quota those snapshots). Our production database (presently not CockroachDB) is large enough that dumping and restoring it is a time consuming effort, yet small enough that having an entire copy of it is very useful in troubleshooting customer problems and performing pre-production QA as new code is evaluated. Presently I snapshot the relational database at a file system level and clone it (ZFS), mounting said clone with a containerized instance of the database application. Since we are becoming more engaged with a container approach in general, and using ZFS in the container world comes with its own challenges of being outside normal workflow, then being able to snapshot the entire database and present that specific view to applications that need that data for relatively shorter time spans (a week for testing, etc), then dropping that snapshot along with container tear down, etc., would be a real value added.

Hi @thrill, that’s interesting. How would you expect a snapshot to be taken and deployed? And how would the same thing work in a containerized deployment?

Hi Tobias, while it’s been awhile since I’ve used it, I’d look for something similar to Microsoft’s capability in SQL server https://docs.microsoft.com/en-us/sql/relational-databases/databases/database-snapshots-sql-server

Deployment would simply be on the same servers hosting the present CockroachDB instances. The snapshot would be a copy (on write?) set of the same records. Some “SNAPSHOT” command that required a specific user permission. The snapshot would then be accessed with the same (but changeable) connection string as the copied database, but under some name extension that made it unique (a different approach would work too - perhaps force a new database name and a change of password when demanding the snapshot in order to reduce the chance of later connecting to the wrong database). In a container world the requesting container would just use the new connection string, which could be built from docker secrets or environment variables at container deployment.

CockroachDB provides highly available and scalable database by running on at least 3 servers. As the database size increases, dumping and restoring time is controlled by parallelizing. Parallel backup and restore is a part of the Enterprise CockroachDB feature set. For example, consistent full weekly and incremental daily backups across all of the servers is a typical workflow. More details are at https://www.cockroachlabs.com/docs/stable/backup.html

Hi Robert, the BACKUP command sounds like under the hood that it’s almost what I’m looking for except for the requirement that the destination of the backup be an external location and that the backup is not interacted with unless it’s restored somewhere. What I’m looking for is an ability to create a point-in-time copy of the database (the AS OF SYSTEM TIME parameter implies to me that the modification time of each record is available), but I want to then be able to read and-or write to that backup as if it’s a new database without affecting the “original”.

Hi Bill, since you want to be able to write into the snapshot, as your correctly noted, you won’t be able to use the AS OF SYSTEM TIME parameter. Are you against taking a backup (which gets you the snapshot) and then restoring it into the existing cluster under a different DB name? I know this introduces an extra step, but it would get you to where you want to be?

The thing is that our backup / restore should be pretty fast based on how you’ve described your dataset. Are you asking for this improvement as a usability vs functionality win?

I’ve not tested the backup/restore process you describe, but I think it might take significant time, especially as the database grows, but I do think it’d be a usable workaround. I’m asking for this functionality as I think it’d be a good win for the testing and troubleshooting process because of the rapidity of the process (being able to do a similar thing via ZFS snapshots has turned out to be very useful to us). In our company we like to stand up an isolated copy of the production database for any branch that we are testing (and at multiple points during that branch’s life as code is pushed) before we merge it into our pre-production branch. We also do the same for our customer ticketing system by providing our customer support team a daily snapshot of the database to evaluate issues against (our customers sometimes delete stuff they didn’t mean to - part of the troubleshooting process is to see if our logic [mis]leads them into this action).

I’m trying to automate this as much as possible via webhook calls for automated container builds, etc. The backup/restore process you describe will probably work but the snapshot that simply stays in place, and is addressable under a defined name, until dropped, would be most convenient.

Yes, I can see how having a more automated solution here would make things better for you. Unfortunately, we’ve got a couple other big features in the pipeline, so we won’t be able to get to this in the next release. That being said, we have a record of it now, and can consider how to prioritize it in the next next release.

The backup / restore functionality is an enterprise feature, but it comes with a free 30-day trial. Let us know if you would like us to set you up with a license?

@thrill Engineer here. Just to add a little more color to Diana’s answer…

Because of the fundamentally distributed nature of CockroachDB, it’s
unfortunately not possible to use something like zfs snapshots to replicate
a cluster if it is simultaneously serving queries. BACKUP and RESTORE were
very carefully designed to be the fastest possible way to get data out of
and into a cluster without compromising correctness. It’s possible that we
could fast-path a few things in the specific case of initializing a new
cluster from a backup, but it would be a HUGE project and my back of the
envelop math says it would only be ~2x as fast (RESTORE is already quite
fast). I talked to Diana a bit about how this would work and we’ve made an
entry to be prioritized in our roadmap. In the meantime using BACKUP and
RESTORE is definitely the fastest way to do what you’re describing.

Thanks so much to the team for the effort on this.

Here’s a github issue I opened for you https://github.com/cockroachdb/cockroach/issues/19881

1 Like