BTRFS errors: transid, tree level mismatch,

I’m experiencing a number of btrfs problems. Feel free to chime in if you have solved any of these problems.

Problem Description

  • btrfs errors in logs
  • errors concentrated on /dev/sdc (disk may be failing)
  • can’t remove /dev/sdc cleanly using “btrfs device remove”

Solutions
Background is below. Here in Solutions I list anything I’ve tried that seems to be working well.

Clean Removal of Disk
I was unable to remove /dev/sdc from a btrfs raid1. Sadly I didn’t copy the error message, but the complaint was about filesystem “structure”. After running btrfs scrub about three times and taking a break, I was able to remove the problem device using ‘btrfs device remove /dev/sdc /’.
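A minimal sketch of that sequence, for anyone in the same spot. The function name and the `DRY_RUN` switch are made up for illustration; the btrfs subcommands are the standard ones. It runs a single scrub, whereas it took about three passes here, so treat it as a starting point and not a guaranteed fix.

```shell
#!/bin/sh
# Sketch: scrub a btrfs filesystem, then remove one device from it.
# DRY_RUN defaults to 1, which only prints the commands (hypothetical safety switch).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

scrub_then_remove() {
    device="$1"       # e.g. /dev/sdc
    mountpoint="$2"   # e.g. /

    # -B keeps the scrub in the foreground, so the next step waits for it.
    run btrfs scrub start -B "$mountpoint" || return 1

    # Print the scrub summary, including any uncorrectable errors.
    run btrfs scrub status "$mountpoint" || return 1

    # Moves data off the device onto the remaining ones, then drops it.
    run btrfs device remove "$device" "$mountpoint"
}
```

With the function loaded, `scrub_then_remove /dev/sdc /` prints the three commands; `DRY_RUN=0 scrub_then_remove /dev/sdc /` as root actually runs them.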

Filesystem Maintenance Automation
According to the btrfs wiki, btrfs needs to be scrubbed from time to time. This and other filesystem maintenance tasks can be automated using the btrfsmaintenance package from apt. The package authors claim that the defaults are sensible and include a scheduled scrub and balance.
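For reference, a sketch of the relevant settings. On Debian the package reads a shell-style config file, which I believe is /etc/default/btrfsmaintenance; the values below are illustrative assumptions, not necessarily the shipped defaults, so check the comments in the file on your own system.

```shell
# /etc/default/btrfsmaintenance (excerpt, illustrative values)

# Which mounted btrfs filesystems to scrub, and how often.
BTRFS_SCRUB_MOUNTPOINTS="/"
BTRFS_SCRUB_PERIOD="monthly"

# Which filesystems to balance, and how often.
BTRFS_BALANCE_MOUNTPOINTS="/"
BTRFS_BALANCE_PERIOD="weekly"
```

After editing the file, the schedule has to be regenerated; as far as I know the package ships a btrfsmaintenance-refresh service for that.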

Hardware Monitoring
I have installed smartmontools from apt which includes command line tools and a data collection daemon. I’ve run the offline and short tests on the problem disk and it appeared to me to pass these checks. The long test is in progress and is taking something like 13 hours to complete on a 4TB drive.
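The self-tests mentioned above can be started and read back with smartctl. This is a sketch only: the helper names and the `DRY_RUN` switch are hypothetical, and /dev/sdc is just the disk from this thread.

```shell
#!/bin/sh
# Sketch: start a SMART self-test and read the results later.
# DRY_RUN defaults to 1, which only prints the commands (hypothetical safety switch).

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Start one self-test; kind is "short" or "long".
# The test runs inside the drive, so this returns immediately.
smart_start_test() {
    kind="$1"
    disk="$2"
    run smartctl -t "$kind" "$disk"
}

# Show the overall health verdict and the self-test log.
smart_show_results() {
    disk="$1"
    run smartctl -H "$disk"
    run smartctl -l selftest "$disk"
}
```

Start one test at a time (for example `smart_start_test long /dev/sdc`), since starting a new test can interrupt one that is still running, and call `smart_show_results /dev/sdc` once it has finished.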

Steps to Reproduce

  1. Install FreedomBox on a btrfs root
  2. Enable storage snapshots using the FreedomBox app
  3. Have a hard disk that is not well
  4. Observe btrfs errors logged: transid, tree level mismatch, checksum error, rd (read), wr (write), flush, corrupt, gen (don’t know what this is…)
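The rd, wr, flush, corrupt and gen figures in the kernel log are per-device error counters. `btrfs device stats` reports the same counters under longer names (read_io_errs, write_io_errs, flush_io_errs, corruption_errs, generation_errs), so gen is the generation error count. A small sketch for picking out the non-zero ones; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: read `btrfs device stats` output on stdin and print only the
# counters that are above zero. Lines look like:
#   [/dev/sdc].read_io_errs    0

nonzero_stats() {
    # $1 is "[device].counter_name", $2 is the count.
    awk '$2 + 0 > 0 { print $1, $2 }'
}
```

Usage would be `btrfs device stats / | nonzero_stats`. No output means every counter on every device is still zero.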

Information

  • FreedomBox version: You are running Debian GNU/Linux 11 (bullseye) and FreedomBox version 22.15. FreedomBox is up to date.
  • Hardware: Intel Atom. All chips are Intel except the ASPEED PCI bridge and VGA
  • How did you install FreedomBox?: (‘bought pre-installed hardware’ or ‘apt install freedombox’ or ‘downloading testing images from https://freedombox.org’ or ‘using a cloud image’)

Storage Summary
Storage Devices

  • sda1: btrfs
  • sdb1: btrfs
  • sdc: btrfs #appears to be sick
  • sdd: btrfs
  • sde: boot, swap, btrfs

Filesystems
/ btrfs raid1 including all btrfs slices

Landscape
ATA Controller 0: sde
ATA Controller 1: sda, sdb, sdc, sdd
No hardware RAID
No LVM

Goals

  • Migrate data off of /dev/sdc (raid1 should have a copy of anything on another device)
  • Remove /dev/sdc in orderly fashion with ‘btrfs device remove /dev/sdc /’
  • Be a better steward of btrfs filesystem in the future

What I’ve Learned
I’ll share here what I’ve learned. I’m not sourcing everything, sorry, but I am trying to be picky about what I do here - I won’t blindly type any command I read on the internet.

Accumulated Snapshots
The quantity of btrfs snapshots may catch up with you eventually. I have cleaned out old snapshots, and reduced the number of timeline copies created. Reboot snapshots add up fast on account of automatic updating. I believe that some of my problems were related to files and folders inside snapshots, and reducing the quantity of these seems to have slowed down some of my problems.
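A sketch of doing this cleanup from the command line, assuming the snapshots are managed by snapper under a config named root. That config name, the helper names and the `DRY_RUN` switch are all assumptions for illustration; the FreedomBox app may expose the same settings in its own interface.

```shell
#!/bin/sh
# Sketch: inspect and prune snapper snapshots, and lower a retention limit.
# DRY_RUN defaults to 1, which only prints the commands (hypothetical safety switch).

SNAPPER_CONFIG="${SNAPPER_CONFIG:-root}"   # assumed config name

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# List all snapshots with their numbers, types and dates.
list_snapshots() {
    run snapper -c "$SNAPPER_CONFIG" list
}

# Delete one snapshot ("12") or a range of snapshot numbers ("12-40").
delete_snapshots() {
    run snapper -c "$SNAPPER_CONFIG" delete "$1"
}

# Change one retention setting, e.g. TIMELINE_LIMIT_DAILY or NUMBER_LIMIT.
set_limit() {
    run snapper -c "$SNAPPER_CONFIG" set-config "$1=$2"
}
```

For example, `delete_snapshots 12-40` followed by `set_limit TIMELINE_LIMIT_DAILY 3` would print the commands for removing a range and keeping fewer daily timeline copies; the numbers are placeholders.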

Scrub is Required
btrfs scrub should be run from time to time. If you read up on this there are some discussions of scrub intervals. Debian has a btrfsmaintenance package which will automate this. It is not currently included in FreedomBox, and I’ll make a request for this.

… more to come as I learn

Well, it turns out that /dev/sdc was broken. :rofl:
It failed the long test in SMART tools.
