Kernel fragile? [Archive] - Screaming Electron Forums

bumbler

May 1st, 2005, 09:50

This concerns major crashes due to file system errors.

In running FreeBSD over the past couple of years, I've had some serious crashes that really puzzled me. It's the sort of thing that makes me wonder what I've done wrong. The following events all look related to me, and each happened on a different machine:

Running 4.8, I hooked an old FAT32 drive up and booted. I mounted it manually, and while trying to read files, the system crashed and I was unable to reboot. Running fsck did no good at all, and I lost several months' work. The old hard drive was marginal.
Running 4.9, I had a CD burner going out on me. Instead of complaining, or simply refusing to do it, the system decided it couldn't run and crashed out to reboot.
Running 4.11, I was mounting a marginal floppy. During dismount, the system wiped my entire home directory, then crashed. Fsck brought it all back, but I had open files that had to be fixed.
Running 5.4, another floppy was getting marginal, and it crashed the machine during unmount.

In each case, I was not aware the drive/disk was going bad until it was too late. Is there something about FreeBSD that it can't handle file system errors? I don't even know how to Google for this.

bmw

May 1st, 2005, 12:26

Bumbler, the filesystem interface in UNIX has always made assumptions about the integrity of the data on the media. You could call that a weakness, especially since I think that Windows tries harder to verify that what it's about to mount is valid. That said, I have had some monumental hangs and crashes mounting stuff (floppies, CDs) under Windows!

If you check the mount manpage, near the bottom you'll see this warning:
BUGS
It is possible for a corrupted file system to cause a crash.
I much prefer the older, unsanitized version of that text. E.g., BSD 2.10 says ...
BUGS
Mounting file systems full of garbage will crash the system.

frisco

May 1st, 2005, 19:22

Some filesystems on Solaris provide different functionality on error - see onerror= in mount_ufs(1M) or ioerror= in mount_vxfs(1M).
Linux provides other recourse on errors for some filesystems as well - see errors= for ext2fs and jfs, and the default behaviour is different for msdos (all according to mount(8)).
I've yet to thoroughly test either OS in this regard, or see a difference in practice (i've always left it to the default).
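
To see which error policy a Linux filesystem is actually mounted with, one option is to read the options column of /proc/mounts. The following is only a rough sketch: the function name and the sample lines are made up for illustration, and a filesystem using its default policy may not list errors= at all.

```python
from typing import Optional


def error_policy(mounts_line: str) -> Optional[str]:
    """Return the errors= value from one /proc/mounts-style line, or None."""
    fields = mounts_line.split()
    if len(fields) < 4:
        # Not a well-formed line: device, mount point, type, options are required.
        return None
    for option in fields[3].split(","):
        key, _, value = option.partition("=")
        if key == "errors":
            return value
    return None


# Hypothetical sample lines, not taken from a real machine.
print(error_policy("/dev/sda1 / ext3 rw,errors=remount-ro 0 0"))
print(error_policy("/dev/fd0 /mnt/floppy msdos rw 0 0"))
```

The first call finds the errors= option and returns its value; the second has no such option and returns None. On a real system you would feed it each line of /proc/mounts instead of the sample strings.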

On *BSD there is only onerror=pray that i am aware of.

I'm surprised by the first situation you describe - to have a system go that bad is usually reserved for "act of god" days - but the other scenarios sound like things i've experienced.

In order to check yourself before you wreck yourself, for floppies make sure to use fdformat to initialize them, and discard if any errors are reported. Unfortunately this won't help in the case of a floppy that goes bad in the following years.
If you're ever working with potentially bad media, either use a separate, throwaway machine, or at least mount ro your important filesystems.
For modern hard drives, try using SMART features - i don't know how to enable this in FreeBSD but in OpenBSD it's via atactl(8). In theory SMART functionality should warn you of disk hardware problems in advance, but in practice i've never had it work for me (but i'm willing to say YMMV).
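
If you do mount your important filesystems ro before touching suspect media, it's worth confirming that the mount really is read-only. A minimal sketch in Python, assuming a Unix system where os.statvfs is available (the function name is mine, not from any tool):

```python
import os


def is_read_only(path: str) -> bool:
    """Return True if the filesystem containing `path` is mounted read-only."""
    flags = os.statvfs(path).f_flag
    # ST_RDONLY is set in f_flag when the filesystem is mounted read-only.
    return bool(flags & os.ST_RDONLY)


# "/" is only an example path; substitute the mount point you care about.
print(is_read_only("/"))
```

This only reports the mount state. It doesn't protect anything by itself, so the remount still has to be done with mount(8) beforehand.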

But above all else, keep good backups of important work.

bumbler

May 2nd, 2005, 12:22

Well, this is the sort of stuff you run into when working in charitable organizations. You use the hardware someone donates and pray it stays up. Same with removable file systems. I don't have to run Windows, I just have to be compatible. Backups of my work are abundant.

Right now I'm down to my ancient laptop, as the desktop system died (CPU flaking out). Because of a recent floppy-related crash on the laptop, I've gone back to a Linux release I found tolerable on this old laptop (RH 7.3). I won't be running any BSDs until I get another desktop machine. I'm constrained by the necessities of my work.

frisco

May 5th, 2005, 14:37

Coincidentally enough, the day after i posted to this thread, a linux web server of mine had its sole disk (a hardware raid5) go dead (oops, should have been paying closer attention to those syslog messages). As the partitions were set "Errors behavior: Continue", the machine stayed up even though no partitions were readable or writeable. Most of the web content was being served up correctly anyway since it was cached in RAM (and thus nagios showed the site as being ok too), but no php session data could be written and no less-frequently-accessed pages could be viewed. So i got woken up earlier than i wanted to be. If i'd set the machine to panic on error instead, it would have gone down and the secondary would have taken over and i would have continued sleeping. Instead i logged in but couldn't run ifconfig/route/shutdown since these commands weren't in FS cache. I had someone onsite unplug the ethernet cables from the server so the secondary would take over, then tried to go back to sleep.
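
A check that only fetches cached pages can be fooled exactly as described above. One possible complement is a probe that forces a small write through to the disk. This is a sketch under my own assumptions (the function name and the choice of directory are hypothetical), not what was running on that server:

```python
import os
import tempfile


def disk_write_ok(directory: str) -> bool:
    """Return True if a small file can be written and synced in `directory`."""
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
            probe.flush()             # push Python's buffer to the OS
            os.fsync(probe.fileno())  # ask the OS to push it to the device
        return True
    except OSError:
        # Covers I/O errors, read-only filesystems, and missing directories.
        return False


# Probe the system temp directory as an example target.
print(disk_write_ok(tempfile.gettempdir()))
```

Wrapped in a script that exits 0 on True and 2 on False, this would fit the usual Nagios plugin convention for OK and CRITICAL. The fsync call is the important part, since a plain write can succeed against the cache even when the disk underneath is gone.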

Even more coincidentally, later that day when i queried my workplace via that what is website (http://www.screamingelectron.org/forum/showthread.php?t=2316), i got:
Wcc is in hot standby.

bmw

May 5th, 2005, 14:43

Good cautionary tale, Frisco!

You know, the stock UNIX "panic" response is still an extremely useful one--perhaps the best. Nothing like a nice quick reboot to clear up errors. :-)

bumbler

May 5th, 2005, 16:24

My own post-mortem: There is no version of Linux with which I'm familiar that will work well enough on this old laptop without undue risk. Turns out I was reminded why I quit RedHat a couple of years ago: the stupid font server is hard-coded and X won't run without it -- X 4.2.1, even. I took some thought and reinstalled FreeBSD 4.11, and it's plenty quick, connects better via dialup, and I'll just have to take my chances on file system crashes.