So how do people generally manage/configure cockroachdb on production servers? Are there some documented best practices?
Does everyone write their own case-specific shell scripts to invoke/monitor/restart cockroachdb?
I was looking at https://github.com/tomogoma/cockroach-installer/ which has a systemd unit file that will watch/restart the cockroach process.
What I’m puzzled by is that there seems to be no easy way to both start and monitor cockroach with systemd. I can make a unit file with fixed command line options which will get me certs, a data directory, etc., but whether to start with or without a --join= option will vary from invocation to invocation.
We’re looking at a 3 or 4 server deployment, all at one site (on CentOS 7). I can write a simple script that will manually start a node or start and join a node… but I can’t see an easy way to automate monitoring/restarting because I won’t know which of the modes I should start in (e.g. initial node, or join an existing cluster). That pretty much seems to leave systemd monitoring out. I guess I could write a complicated script that will try to detect the situation and then invoke/exec cockroach with different arguments, and then have systemd monitor that…
I’d drop a list of potential peers into a variable and use nc to try to connect to each of them. If none accept a connection, then start up as the first node in a cluster; otherwise I’d --join any of the ones I found alive. Even then there are race conditions (e.g. two nodes booting at the same time, each seeing no peers and each starting its own cluster), which I’m not sure how I could easily handle.
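To make the idea concrete, here is a rough sketch of the kind of wrapper I have in mind, written in Python rather than shell+nc so the logic is easier to read. The peer addresses, the certs and data paths, and the function names are all made up for illustration. It also assumes that starting without --join really does bootstrap a new cluster, which is exactly the part I’m unsure about, and it does nothing about the race condition.

```python
import os
import socket

# Hypothetical list of the OTHER nodes (host:port); the local node is left out.
PEERS = ["10.0.0.1:26257", "10.0.0.2:26257", "10.0.0.3:26257"]

# Hypothetical paths; adjust to wherever the certs and data actually live.
BASE_CMD = [
    "cockroach", "start",
    "--certs-dir=/var/lib/cockroach/certs",
    "--store=/var/lib/cockroach/data",
]


def is_alive(peer, timeout=1.0):
    """Return True if a TCP connection to 'host:port' succeeds (like `nc -z`)."""
    host, port = peer.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False


def build_command(peers, probe=is_alive):
    """Build the cockroach command line, adding --join only if a peer answers."""
    alive = [p for p in peers if probe(p)]
    cmd = list(BASE_CMD)
    if alive:
        cmd.append("--join=" + ",".join(alive))
    # No live peers: start without --join, i.e. assume we are the first node.
    return cmd


def main():
    """Replace this process with cockroach so systemd supervises it directly."""
    cmd = build_command(PEERS)
    os.execvp(cmd[0], cmd)
```

The real script would call main() at the bottom and the unit file’s ExecStart would point at it. Because of the exec, systemd ends up watching the cockroach process itself rather than the wrapper.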
Also I don’t really find any “recover from a totally crashed cluster” documentation. Is it really as simple as ‘start any node, join a second node, join the rest’? All the docs I see assume that you still have a live node somewhere else. What if I lose power, UPS, and generator at my one site with all 4 nodes? There doesn’t seem to be a good way of bootstrapping a cluster.
MariaDB/Galera-cluster has a similar challenge - they use a config file to store potential peers, a ‘standard join-a-cluster’ service file (or systemd unit file) and a special case manual script to start the first node on the cluster.
Is there a better, standard way?