Panic from too many open files, but /proc/sys/fs/file-max is adequate

On a Debian machine, I am running a single node that had been running fine for some time but now fails to start.

In cockroach-pebble.log I am finding errors that appear to happen during a compaction that starts automatically as part of server startup. The lines look like this (I’ve elided the start of the file paths here):

I211110 10:53:24.904256 198 3@vendor/github.com/cockroachdb/pebble/table_stats.go:235 ⋮ [n?,pebble,s?] 9 background error: open ‹[...]store/105064.sst›: too many open files

Prior to those lines, the pebble log contains:

I211110 10:53:24.014043 50 3@vendor/github.com/cockroachdb/pebble/compaction.go:1791 ⋮ [n?,pebble,s?] 8 [JOB 3] compacting(default) L0 [075934 075940 075945 075952 …] (16 M) + L5 [075929 …] (14 M)

The log entry itself is split across multiple very long lines, but I’m treating it as a whole here. The first (L0) list contains 14978 six-digit numbers and the second (L5) contains only 5. I’m guessing these numbers correspond to .sst files in the data store, and the counts line up well enough: there are 15008 .sst files.

It concludes with the following two lines:

I211110 10:53:24.925827 50 3@vendor/github.com/cockroachdb/pebble/compaction.go:1832 ⋮ [n?,pebble,s?] 17 [JOB 3] compaction(default) to L5 error: pebble: could not open table 105064: open ‹[...]store/105064.sst›: too many open files

I211110 10:53:24.926035 50 3@vendor/github.com/cockroachdb/pebble/compaction.go:1767 ⋮ [n?,pebble,s?] 18 background error: pebble: could not open table 105064: open ‹[...]store/105064.sst›: too many open files

After seeing an error mentioning too many open files, I checked that /proc/sys/fs/file-max was adequate, and it appears to be: its value is currently 9220000000000000000. (I also tried much smaller but still likely-adequate values, just in case the huge number itself was causing the issue.)
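
For reference, this is how I checked (and experimented with) the system-wide limit:

```sh
# System-wide cap on open file handles
cat /proc/sys/fs/file-max
# Same value via sysctl; `sysctl -w fs.file-max=<N>` changes it
sysctl fs.file-max
```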

Any help or pointers would be appreciated!

Try increasing the open file limit as described here: Production Checklist | CockroachDB Docs
Let me know if that fixes your problem.
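
For example, if the node runs under systemd, the per-service limit comes from the unit’s LimitNOFILE. A minimal sketch, assuming the unit is named cockroach.service (adjust the name and the value to your setup):

```sh
# Add a drop-in that raises the open-file limit for the service
sudo mkdir -p /etc/systemd/system/cockroach.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/cockroach.service.d/limits.conf
[Service]
LimitNOFILE=65536
EOF
# Reload unit files and restart the node so the new limit applies
sudo systemctl daemon-reload
sudo systemctl restart cockroach
```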

I’d already checked the kernel itself, but I’ve now done all the other steps – thanks for the reference, I’d missed that!

Unfortunately I get the same result and the server doesn’t start.

I should also mention that the server itself is not overloaded in any way: it has plenty of free memory, free disk, and unused CPU. (And, less relevantly, network capacity.)

How are you running the server? Through systemd, as a service? It would help if you could grep the cockroach.log file for “max open file limit”; it should be close to the beginning of the file.
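
Something like this, for example (run it in whatever directory your node writes its logs to):

```sh
# The startup messages record the limit the process actually received
grep "max open file limit" cockroach.log
```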

BTW, I did notice that when cockroach tries to start, it (re)creates a file named OPTIONS-nnnnn (where nnnnn seems to be a random number) in the store directory. That file appears to be a configuration file, and it contains a setting called “max_open_files” with the value 10000.

However, editing this file and then (trying to) start cockroach doesn’t change the behavior: a new OPTIONS-nnnnn file is created with 10000 as the value for that setting, even when the OPTIONS-nnnnn file left by a previous invocation had been edited.
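
For completeness, this is roughly how I checked the regenerated files (the store path here is just a placeholder for my actual store directory):

```sh
# Show the max_open_files setting in the regenerated OPTIONS-nnnnn files
grep max_open_files /path/to/store/OPTIONS-*
```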

Here are a few things you can try. You can get the current file limit with ulimit -n in a shell, and you can set it to a higher value with, for example, ulimit -n 70000. If that works and ulimit -n shows the new value, try running cockroach start-single-node --insecure in an empty folder. Assuming no other instance of cockroach is running, this will start a new server and create a cockroach-data subdirectory with the store data, including the log file. If you can see in the log that the file limit has increased, then do the same for your regular instance. The link I sent has some info about increasing the limit when using systemd or via the user’s limit. Here is one more page that could be helpful: How to Increase Open Files Limit in Ubuntu & Debian – TecAdmin
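
Putting that together, a rough sketch (the limit value is just an example, and the log path assumes the default cockroach-data layout):

```sh
# In a fresh shell: check, raise, and re-check the soft limit
ulimit -n
ulimit -n 70000
ulimit -n

# Then, from an empty directory, start a throwaway single node
# (leave it running; inspect the log from another terminal)
cockroach start-single-node --insecure

# Check which limit the new server reports in its log
grep "max open file limit" cockroach-data/logs/cockroach.log
```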

I had been blithely assuming that because I’d changed the limit in all the places recommended in the production checklist, the shell itself wasn’t blocking me, but… I was wrong. ulimit -n was still showing a very low number, and raising it manually before starting the server did work. Thank you!

Unfortunately I haven’t been able to figure out how to get the shell to stop defaulting to that low limit, but (a) I have a workaround, and (b) it’s clearly not a CockroachDB problem.
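
For anyone who runs into the same thing, the workaround is just to raise the limit in the launching shell before starting the node, roughly like this (substitute your usual start command and flags):

```sh
# Raise the soft limit for this shell session, then start the node
ulimit -n 70000
cockroach start-single-node --insecure
```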

Thanks again for pointing me in the right direction!