A little over a week ago, avestriel, my gateway box, suddenly began losing its mind.
The first symptom was the apparently-spontaneous corruption of a Cyrus database, resulting in an inability to receive new mail and the spewing of approximately 1 GB of data into /var/log/messages in the space of a few hours. (Oops. Suddenly, putting /var/log on a separate partition, or quota-limiting it, doesn’t seem as ragingly paranoid as it once did.)
This was concomitant with corruption of the filesystem itself, requiring me to boot from an installation CD and give myself a crash course in the use of reiserfsprogs. Two hours of nailbiting later, the filesystem was apparently recovered fully. Score a few more points for Hans Reiser. I rebooted back into the OS, and all was seemingly well.
I built smartmontools, which reported a single SMART error, but one which could credibly have resulted from host-side problems, and nothing that indicated imminent mechanical drive death. The system’s overall problems proved persistent, however: random and irreproducible segmentation faults while building packages, further corruption of Cyrus data, and the occasional outright lockup.
This was distressing, to say the least, on two fronts. First, avestriel had been performing flawlessly since I built it a year and a half ago. Second, partly because of that, I’d come to rely on it for a great deal of my communication with the outside world: e-mail and telephony, in addition to mundane web browsing and SSH’ing to work.
On Sunday, having grown tired of crossing my fingers and hoping for the best every time I rebooted, and thinking that the random segfaults hinted at memory problems, I pulled all three 256-MB DIMMs, replacing them with a single 256-MB unit.
Today’s Thursday, and I’ve observed no problems with avestriel since. (This could simply mean that they’ve become crafty and are hiding, but I suspect not.) The conclusion, then, would appear obvious: bad memory, no?
Well, not quite. Because I promptly shoved the memory I’d just yanked from avestriel into another box, behemoth, and booted Memtest86+. That RAM has spent the last four days enduring a mind-numbingly repetitive succession of all the pattern-writes and -reads that could be thrown at it, so far without a single bit error.
Tentative conclusion: whack-ass electrical-contact problems. I’ll see what happens when I clean the pads and re-seat the original memory, although I may not do that until avestriel is relieved from its current, crucial post as intermediary between me and the outside world.
Things I learned from this experience:
- GNU ls, when passed the --si flag, knows how to display a file size in petabytes. I really didn’t expect to learn that for a few more years. (Filesystem corruption is an amazing thing, I tell you.)
- A serviceable way of recovering a corrupted Cyrus <user>.seen file is to use cvt_cyrusdb to convert it from skiplist to flat format and back again.
- It’s very important to make sure that said files are owned by user cyrus and group mail when you’re done with them.
- I really need to back up /home, /etc, and parts of /var on avestriel, like, yesterday.
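For the curious, the .seen round trip above can be sketched roughly like this. It's a minimal sketch, not a record of the exact commands I typed: the mailbox path, the ".flat" and ".new" intermediate file names, and the helper function itself are all made up for illustration, and I'm assuming the usual cvt_cyrusdb argument order (old file, old format, new file, new format) with full paths. The script only prints the commands; it doesn't run anything.

```python
import shlex
from pathlib import Path


def seen_recovery_commands(seen_file, owner="cyrus", group="mail"):
    """Build the commands for a skiplist -> flat -> skiplist round trip.

    Hypothetical helper: it returns argument lists and executes nothing.
    """
    seen = Path(seen_file)
    if not seen.is_absolute():
        # Assumption: cvt_cyrusdb wants full paths for both files.
        raise ValueError("cvt_cyrusdb expects full paths")

    # Intermediate file names are illustrative, not anything Cyrus requires.
    flat = seen.with_name(seen.name + ".flat")
    rebuilt = seen.with_name(seen.name + ".new")

    return [
        # Dump the damaged skiplist database to flat format.
        ["cvt_cyrusdb", str(seen), "skiplist", str(flat), "flat"],
        # Rebuild a fresh skiplist database from the flat dump.
        ["cvt_cyrusdb", str(flat), "flat", str(rebuilt), "skiplist"],
        # Swap the rebuilt file into place.
        ["mv", str(rebuilt), str(seen)],
        # Restore ownership: user cyrus, group mail.
        ["chown", f"{owner}:{group}", str(seen)],
    ]


if __name__ == "__main__":
    # Hypothetical mailbox path; yours will live wherever your Cyrus config puts it.
    for cmd in seen_recovery_commands("/var/imap/user/j/jdoe.seen"):
        print(shlex.join(cmd))
```

Printing instead of executing is deliberate: on a box that's already corrupting things, I'd rather read each command, copy the original file somewhere safe, and then run the steps by hand.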
A few people have wondered about avestriel’s name. All I can say is that the case I built it into is a bright, fire-engine red. (And its successor, ambriel, will be bright red, too, even though bright-yellow-and-powder-blue might be a more appropriate combination, given its namesake.)