Strange failures on CIFS mounts - Solution

Sonarr version (exact version):
Mono version (if Sonarr is not running on Windows):
OS: Ubuntu 16.04
Debug logs:

Description of issue:

I had this problem and I have the solution. If it's already a known thing, then I apologize in advance. I really did search and didn't find any solutions, but I did find a lot of people whose issues would be fixed by this (corrupted files on CIFS, web interface not loading, etc.).

In the process of setting up Sonarr for the first time, I mapped several different CIFS shares to my server from my Synology DiskStation. Plex has been serving files from them for ages, and other similar applications have used them as well. However, Sonarr would run sometimes for a day, sometimes for less than a minute, and then stop responding. The debug logs had nothing of interest in them at all: no errors, nothing. Just a dead stop.

When Sonarr became unresponsive, there was absolutely no way to recover. Killing the processes always left behind a consumed port, which would stop it from coming up again. The only fix other than changing ports was a reboot.

Here’s what I found and did. Sorry for the length. First there were messages in the syslog like this:

 [ 1799.788259] INFO: task kworker/3:0:8256 blocked for more than 120 seconds.
 [ 1799.788262]       Tainted: P           OE   4.4.0-98-generic #121-Ubuntu
 [ 1799.788262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [ 1799.788263] kworker/3:0     D ffff880652babbe8     0  8256      2 0x00000000
 [ 1799.788279] Workqueue: cifsiod cifs_writev_complete [cifs]

This is a somewhat well-known problem with CIFS under high I/O, combined with the way the OS handles caching (by letting dirty pages build up in the page cache). This gave me my first real clue as to what the problem was.
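If you suspect the same kind of pileup, one quick way to check (a sketch, assuming a Linux box) is to watch the kernel's dirty-page counters in /proc/meminfo while Sonarr is doing heavy I/O against the CIFS share:

```shell
# Dirty     = data sitting in the page cache, waiting for writeback
# Writeback = data currently being flushed to disk/network
# Watch these climb (and stall) during imports to the CIFS mount.
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Or refresh every second while Sonarr hammers the share:
# watch -n1 "grep -E '^(Dirty|Writeback):' /proc/meminfo"
```

If Dirty grows into the hundreds of megabytes and then everything pauses at once, you're most likely looking at this exact problem.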

I did these things to eliminate the problem:

I modified the CIFS mounts to turn off caching. On modern flavors of Linux this is done by simply adding "cache=none" to the mount options; in addition, I set rsize and wsize both to 32768 (generally recommended for Linux). Older flavors of Linux will want to use "forcedirectio" instead of "cache=none"; they do the same thing, but forcedirectio is deprecated. DBAs will likely recognize the setting, as it's typically only used with databases, and this application was close enough. A typical /etc/fstab entry looks like this now:

/host/home/Home_Movies /movies/Movies cifs credentials=/home/blah/.yeahsure,rsize=32768,wsize=32768,cache=none,iocharset=utf8,file_mode=0744,dir_mode=0744 0 0
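After editing fstab, you can apply and verify the new options without rebooting. A sketch (the mount point here is just my example path; substitute your own):

```shell
# Unmount and remount so the new options in /etc/fstab are picked up
sudo umount /movies/Movies
sudo mount /movies/Movies

# Confirm cache=none, rsize and wsize actually took effect
findmnt -o TARGET,FSTYPE,OPTIONS /movies/Movies
```

If the share is busy, `umount` will refuse; stop Sonarr (and anything else using the mount) first rather than reaching for `umount -l`.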

Secondly... please be careful with this last bit of advice. I changed vm.dirty_ratio and vm.dirty_background_ratio in order to reduce the amount of data the system could actually fall behind on. If you frequently get delays when accessing a network share, this may help. This is the issue that directly caused the hung-task errors above. These are both limits on how much system memory can be used to hold dirty pages (pages waiting to be written to disk). When those limits are reached, all following I/O becomes synchronous; those pauses in I/O happen because the caching system has decided there's too much left to write to disk. In this case, my assumption was that there was too much caching (not too little!), and that a smaller cache would force more frequent writes and shorter pauses, so it wouldn't hit the 120-second breaking point that triggers the kernel's hung-task warnings.

The Ubuntu defaults are vm.dirty_ratio=20 and vm.dirty_background_ratio=10; I cut each of those in half. Again, do this at your own risk, and I would suggest reading more about it in case I don't know what the hell I'm talking about. Changing these SHOULDN'T be needed, since we're disabling the cache altogether on the CIFS mounts. If for some reason you can't set cache=none, then this is something to consider, but it will have system-wide implications.
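For reference, this is how I'd apply the halved values; 10 and 5 are simply half the Ubuntu defaults, not magic numbers, and the sysctl.d file name is arbitrary:

```shell
# Apply immediately (these do not survive a reboot)
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_background_ratio=5

# Persist across reboots
printf 'vm.dirty_ratio=10\nvm.dirty_background_ratio=5\n' \
  | sudo tee /etc/sysctl.d/99-dirty.conf

# Check the currently active values
sysctl vm.dirty_ratio vm.dirty_background_ratio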

Anyway… hope this all helps!

I’ll just leave this here (I have a habit of digging this up when someone says CIFS):

Those file truncation problems are absolutely a result of this caching issue.

I was actually wondering yesterday why this problem would appear on CIFS and not on NFS as much. I did a little poking around and what would appear to make the difference in THIS case is that NFS writes cached data a LOT more aggressively, which surely helps to keep pileups like this one from happening.

In defense of CIFS (four words I never thought I'd say), it has always been shipped with awful defaults: it defaults to vers=1.0 instead of vers=3.0, and even if you change vers= to 3.0, you still get 1.0-era read and write sizes.
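You can see which dialect and read/write sizes a mount actually negotiated from the kernel's point of view. A sketch, assuming a Linux client with the cifs module loaded (the DebugData file only exists while it is):

```shell
# Mount options as the kernel sees them, including vers=, rsize=, wsize=
grep ' cifs ' /proc/mounts

# Per-session detail about negotiated dialects (format is kernel-dependent)
cat /proc/fs/cifs/DebugData
```

If vers= or the rsize/wsize you asked for don't show up here, the server negotiated you down, and no amount of fstab editing on the client will matter until that's sorted.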

However, I will say “don’t use CIFS” is not a reasonable response to a problem that can absolutely be fixed with proper configuration.
