Downtime Followup - eviltoast

This is a continuation of the FIRST POST.

As you have likely noticed, there are still issues.

To summarize the first post: a catastrophic software/hardware failure meant I needed to restore from backups.

I decided to take the opportunity to rebuild on something newer and better. As such, I decided to give Proxmox a try, with a Ceph storage backend.

After getting a simple k8s environment back up and running on the cluster and restoring the backups, LemmyOnline was mostly back in business using the existing manifests.

Well, the problem is that when heavy backend I/O occurs (during backups, big operations, installing large software, and so on), the longhorn.io storage used in the k8s environment kind of... “dies”.

And, as I have seen today, this is not an infrequent issue. I have had to bounce the VM multiple times today to restore operations.

I am currently working on building out a new VM specifically for LemmyOnline, to separate it from the temporary k8s environment. Once this is up and running, things should return to stable and normal.