Matrix-Synapse service stopped for no apparent reason

jonny · October 29, 2023, 7:14am

Last night my matrix-synapse service stopped with status inactive (dead). At the same time the coturn service stopped working.

Since I set up a Monit rule to check the coturn status a while ago, the coturn service restarted automatically. Matrix Synapse didn’t, so I created a new Monit check rule for it. If somebody is looking for something similar, this is the rule:

# MATRIX-SYNAPSE
check program matrix-synapse with path "/usr/bin/systemctl --quiet is-active matrix-synapse.service"
start program "/usr/bin/systemctl start matrix-synapse.service"
stop program "/usr/bin/systemctl stop matrix-synapse.service"
if status != 0 for 2 cycles then restart

Again I thought maybe adding Monit as a system app with some basic presets would be beneficial to increase the reliability of the Freedombox services?

@sunil: I would be willing to work on this but wouldn’t know where to start at the moment.

Can somebody point me in the right direction?

Avron · October 29, 2023, 10:05am

If you still have the journal available (it is written to disk, or there has been no reboot since), search it for “Error”: run sudo journalctl --since=today, then type /Error and press n to jump to the next match. Check whether there was a failed backup.

All services are stopped before backup and, in my personal experience, when a backup fails, many (if not all) services won’t be restarted.

jonny · October 29, 2023, 5:42pm

@Avron: Thanks, you were right. There was a backup failure “returned non-zero exit status 2”. This was the first time I had any trouble with a backup.

Still… I think it would be a useful second security layer to implement simple service monitoring.

Avron · October 29, 2023, 6:02pm

I also had a backup failure last night. I do a remote backup to a machine that is running fine, hasn’t had any change and has plenty of free disk space (last time I had a problem was disk full on that machine). Since you also had a failed backup last night, I wonder whether there isn’t a general problem affecting more people.

I posted the error that I have in the journal, I don’t know whether you have the same or not.

About the monitoring: I agree. Also, in the notifications, I don’t have any notification of failed backup, this would need some improvement.

sunil · October 30, 2023, 3:14pm

@jonny glad to know you wish to work on the issue. I think we can improve the automatic restart mechanism for many of the daemons we have in FreedomBox. Write to me if you have any questions about the development setup or the issue you are working on.

Let us discuss Monit. From my experience with it long ago, it can be used for reporting and correcting. Correcting is mainly in the form of restarting the service that failed. This functionality can be performed by systemd very effectively: it knows the return code of the process, it does not need to poll, it keeps an accurate counter of how many times the process has restarted, etc. For this, we only have to ensure that the .service file for each daemon has the right Restart= and related directives. To add these directives, we only have to drop files into the /usr/lib/systemd/system/<daemon>.service.d directory, making the whole process quite simple and easy to manage. We have to be careful not to restart services that have been intentionally stopped, not to make too many restart attempts too quickly, not to start services that were killed because of out-of-memory conditions, etc.
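
As a rough illustration of the drop-in approach described above, such a file might look like the sketch below. The file name restart.conf, the choice of matrix-synapse as the example daemon, and all of the values are assumptions made for illustration; they are not settings taken from FreedomBox.

```ini
# Hypothetical drop-in:
#   /usr/lib/systemd/system/matrix-synapse.service.d/restart.conf
# The file name and every value below are illustrative assumptions.

[Unit]
# Rate-limit restarts: give up if the unit is started more than
# 5 times within 10 minutes.
StartLimitIntervalSec=10min
StartLimitBurst=5

[Service]
# Restart only after an unclean exit (non-zero exit code, fatal
# signal, timeout), not after a clean exit.
Restart=on-failure
# Wait before each restart attempt so a crashing daemon does not loop tightly.
RestartSec=30s
```

After adding or changing a drop-in, systemctl daemon-reload makes systemd pick it up. Two caveats apply to this sketch. A unit stopped with systemctl stop is not restarted by Restart=, so this alone would not bring back services that were stopped for a backup and never started again. And a process killed with SIGKILL by the out-of-memory killer counts as a failure under Restart=on-failure, so the out-of-memory case mentioned above would need additional handling.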

Monit also allows reporting problems when daemons stop. We sort of want to go a step further and report other failures (such as a daemon not providing an expected service). For this, we have newly added a mechanism to run diagnostic tests regularly and report problems. We can extend this to send email notifications.

Finally, we also want to work on a feature to back up apps. We can take a snapshot of the btrfs filesystem and perform the backup from the snapshot, so that stopping the service is no longer required (which avoids the above problem altogether).