Stale sshfs mounts lead to failures

Problem Description
After about 10 days of FreedomBox uptime, subprocesses related to the backup application (at least df and mountpoint) block indefinitely, leading to malfunctions in other applications (such as tt-rss and samba) as well as in the backup application itself.

Steps to Reproduce
I’m not sure this can be easily reproduced.

  1. Set up a remote SSH server.
  2. Log in to FreedomBox.
  3. Go to the backup application.
  4. Configure the remote scheduled backup.
  5. After about 10 days of uptime, the symptoms start to manifest (a quick check for this is sketched below).
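
In case it helps, this is roughly how I check whether the symptoms have started, without the check itself risking a hang (just a sketch; the mount point path is a placeholder, not the real FreedomBox path):

import subprocess

# Placeholder path; substitute the actual sshfs mount point used by the
# backup application on your system.
MOUNTPOINT = '/path/to/backup/mountpoint'

proc = subprocess.Popen(['mountpoint', '-q', MOUNTPOINT])
try:
    proc.wait(timeout=10)
    print('mountpoint answered; the mount looks healthy')
except subprocess.TimeoutExpired:
    # Deliberately do not kill and wait for the child here: a process
    # stuck in uninterruptible sleep would make the cleanup block too.
    print('mountpoint blocked for 10s; the sshfs mount is probably stale')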

Expected Results
I expected the backups to work indefinitely, without affecting other applications.

Actual Results
I see many suspended df and mountpoint processes. If I invoke those programs from the command line, they block as well.
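
Processes blocked on a dead network mount typically sit in uninterruptible sleep (state ‘D’ in ps). A rough diagnostic sketch of how to spot them (it only scans /proc and is not part of the patch):

import os

for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open(f'/proc/{pid}/stat') as stat_file:
            stat = stat_file.read()
    except OSError:
        continue  # the process exited while we were scanning

    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last closing parenthesis.
    comm = stat[stat.index('(') + 1:stat.rindex(')')]
    state = stat[stat.rindex(')') + 2]
    if state == 'D':
        print(pid, comm)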

Another side effect that seems to be caused by this situation (though unconfirmed): the tt-rss application stops working because it is unable to synchronize feeds, apparently due to a timeout when accessing resources shared with the backup application.

Another side effect that seems to be caused by this situation (though unconfirmed): several days later, other applications, such as cockpit and samba, stop working.

Rebooting the FreedomBox solves the problem for ~10 days.
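
A less drastic workaround than rebooting might be to detach the stale sshfs mount by hand (a sketch, untested on my side; the path is a placeholder). This keeps new accesses from blocking, though processes that are already stuck may only recover once the corresponding sshfs process exits:

import subprocess

MOUNTPOINT = '/path/to/backup/mountpoint'  # placeholder path

# 'fusermount -u' is the regular way to unmount an sshfs/FUSE mount; if
# the mount is busy or wedged, fall back to a lazy unmount.
if subprocess.run(['fusermount', '-u', MOUNTPOINT]).returncode != 0:
    subprocess.run(['umount', '-l', MOUNTPOINT], check=False)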

Information

  • FreedomBox version: Debian GNU/Linux 11 (bullseye); FreedomBox version 22.8
  • Hardware: Virtual machine.
  • How did you install FreedomBox?: Fresh Debian install.

Proposed Patch

I have experienced similar issues with custom scripts, and solved them by tuning the configuration of sshfs.

This problem might be related to the SSH configuration, but I think that FreedomBox should be robust enough to deal with it.
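
For reference, with the options added in the patch below, the mount set up by the sshfs action is roughly equivalent to the following manual invocation (host, remote path, mount point and known-hosts file are placeholders). ServerAliveInterval=15 together with ServerAliveCountMax=3 makes the SSH transport give up after roughly 45 seconds without an answer, and reconnect tells sshfs to re-establish the session instead of leaving every pending request hanging:

import subprocess

# The host and all paths below are placeholders for illustration only.
cmd = [
    'sshfs', 'backup-user@backup.example.org:backups', '/path/to/mountpoint',
    '-o', 'UserKnownHostsFile=/path/to/known_hosts',
    '-o', 'StrictHostKeyChecking=yes',
    '-o', 'reconnect',
    '-o', 'ServerAliveInterval=15',
    '-o', 'ServerAliveCountMax=3',
]
subprocess.run(cmd, check=True)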

I’ll report back if this patch solves the issue.

diff --git a/actions/sshfs b/actions/sshfs
index b2fb6b5c7..b47a2439d 100755
--- a/actions/sshfs
+++ b/actions/sshfs
@@ -55,10 +55,24 @@ def subcommand_mount(arguments):
     kwargs = {}
     # the shell would expand ~/ to the local home directory
     remote_path = remote_path.replace('~/', '').replace('~', '')
+    # 20220415glalejos: added reconnect, ServerAliveInterval and
+    # ServerAliveCountMax. After ~11 days of uptime, backups scheduled via
+    # the backup application stop working. I can see a lot of processes
+    # '/usr/share/plinth/actions/storage usage-info' suspended in the
+    # invocation of 'df' program (this blocks too if I manually invoke it from
+    # the command line). Also there are a lot of
+    # '/usr/share/plinth/actions/sshfs is-mounted' suspended in the invocation
+    # of 'mountpoint' program (this, too, blocks if manually invoked from the
+    # command line).
+    # Apparently, this situation has some lateral effects, such as tt-rss
+    # failing to update feeds after several uptime days.
+    # Other custom scripts that I used in the past had similar issues, and
+    # the options included in the following 'cmd' helped solve them.
     cmd = [
         'sshfs', remote_path, arguments.mountpoint, '-o',
         f'UserKnownHostsFile={arguments.user_known_hosts_file}', '-o',
-        'StrictHostKeyChecking=yes'
+        'StrictHostKeyChecking=yes', '-o', 'reconnect', '-o',
+        'ServerAliveInterval=15', '-o', 'ServerAliveCountMax=3'
     ]
     if arguments.ssh_keyfile:
         cmd += ['-o', 'IdentityFile=' + arguments.ssh_keyfile]

Hello,
Well, my patch was only partial, since the mount point doesn’t seem to be properly unmounted (I couldn’t find where it is actually unmounted…). So I have added some lines to backups/schedule.py. The full patch follows:

diff --git a/actions/sshfs b/actions/sshfs
index b2fb6b5c7..b47a2439d 100755
--- a/actions/sshfs
+++ b/actions/sshfs
@@ -55,10 +55,24 @@ def subcommand_mount(arguments):
     kwargs = {}
     # the shell would expand ~/ to the local home directory
     remote_path = remote_path.replace('~/', '').replace('~', '')
+    # 20220415glalejos: added reconnect, ServerAliveInterval and
+    # ServerAliveCountMax. After ~11 days of uptime, backups scheduled via
+    # the backup application stop working. I can see a lot of processes
+    # '/usr/share/plinth/actions/storage usage-info' suspended in the
+    # invocation of 'df' program (this blocks too if I manually invoke it from
+    # the command line). Also there are a lot of
+    # '/usr/share/plinth/actions/sshfs is-mounted' suspended in the invocation
+    # of 'mountpoint' program (this, too, blocks if manually invoked from the
+    # command line).
+    # Apparently, this situation has some lateral effects, such as tt-rss
+    # failing to update feeds after several uptime days.
+    # Other custom scripts that I used in the past had similar issues, and
+    # the options included in the following 'cmd' helped solve them.
     cmd = [
         'sshfs', remote_path, arguments.mountpoint, '-o',
         f'UserKnownHostsFile={arguments.user_known_hosts_file}', '-o',
-        'StrictHostKeyChecking=yes'
+        'StrictHostKeyChecking=yes', '-o', 'reconnect', '-o',
+        'ServerAliveInterval=15', '-o', 'ServerAliveCountMax=3'
     ]
     if arguments.ssh_keyfile:
         cmd += ['-o', 'IdentityFile=' + arguments.ssh_keyfile]
diff --git a/plinth/modules/backups/schedule.py b/plinth/modules/backups/schedule.py
index ee3729719..462a20972 100644
--- a/plinth/modules/backups/schedule.py
+++ b/plinth/modules/backups/schedule.py
@@ -187,6 +187,13 @@ class Schedule:
             return False
 
         repository = self._get_repository()
+
+        try:
+            # Make sure that the repository is not mounted.
+            repository.umount()
+        except Exception as e:
+            logger.warning("Unable to umount repository before running backup.", exc_info=e)
+
         repository.prepare()
 
         recent_backup_times = self._get_recent_backup_times(repository)
@@ -321,3 +328,8 @@ class Schedule:
                 logger.info('Cleaning up in repository %s backup archive %s',
                             self.repository_uuid, archive['name'])
                 repository.delete_archive(archive['name'])
+
+        try:
+            repository.umount()
+        except Exception as e:
+            logger.warning("Unable to umount repository when cleaning up after backup.", exc_info=e)

Is there a more appropriate place to umount the repository?
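
One option I considered (just a sketch, untested) is to do the cleanup with try/finally instead, so the repository is unmounted however the backup run ends. prepare() and umount() are the repository calls already used in the patch above, while run_backup is a hypothetical stand-in for the real backup step:

import logging

logger = logging.getLogger(__name__)


def run_and_cleanup(repository, run_backup):
    # 'run_backup' is a hypothetical callable standing in for the real
    # backup step; prepare() mounts the sshfs repository.
    repository.prepare()
    try:
        run_backup(repository)
    finally:
        try:
            repository.umount()
        except Exception as exception:
            logger.warning('Unable to umount repository after backup.',
                           exc_info=exception)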

Again, I’ll report back in a few days if this solves the issue.

Kind regards,

Guillermo

Hello,

After a couple of months of monitoring my system, the problem seems to be fixed by my proposed patch. I’ve created merge request [1] to get the patch into upstream.

This is my very first code contribution to an open source project; I hope everything is in place (I’ve also checked [2]).

Thank you and kind regards,

Guillermo

[1] https://salsa.debian.org/freedombox-team/freedombox/-/merge_requests/2249
[2] https://salsa.debian.org/freedombox-team/freedombox/-/blob/master/CONTRIBUTING.md
