Downtime Explanation - Updated 7/24 9pm CST

My apologies for the past day or so of downtime.

I had a work conference all of last week. On the last morning, around 4am, before I headed back to my timezone, “something” inside my Kubernetes cluster took a dump.

While I can remotely reboot nodes, and even access them… the scope of what went wrong was far beyond what I could accomplish remotely from my phone.

After returning home yesterday evening, I started plugging away a bit and quickly realized… something was seriously wrong with the cluster. From previous experience, I knew it would be quicker to just tear it down, rebuild it, and restore from backups. So, I started that process.

However, since I had not seen my wife in a week, I felt spending some time with her was slightly more important at the time. But I was able to finish getting everything restored today.

Due to the issues above, I will be rebuilding some areas of my infrastructure to be slightly more redundant.

Whereas before I had bare-metal machines running Ubuntu, going forward I will be leveraging Proxmox for compute clustering and HA, along with Ceph for storage HA.
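
For the curious, the rough shape of that setup looks something like this. This is just a sketch; the cluster name, storage network, and disk paths below are placeholders, not my actual config.

```sh
# On the first node: create the cluster.
pvecm create homelab

# On each additional node: join it to the cluster
# (IP of the first node shown as an example).
pvecm add 192.168.1.10

# On every node: install the Ceph packages.
pveceph install

# On one node only: initialize the Ceph config,
# pointing it at a dedicated storage network.
pveceph init --network 10.10.10.0/24

# On each node: create a monitor, plus an OSD on a spare disk.
pveceph mon create
pveceph osd create /dev/sdX

# Once the OSDs are up: create a replicated pool for VM disks.
pveceph pool create vm-storage
```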

That being said, sometime soon I will have Ansible playbooks set up to get everything pushed out and running.
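
Roughly, running those will look something like this. The inventory and playbook names here are hypothetical, since the playbooks don’t exist yet.

```sh
# Base OS config and Proxmox prep across all nodes:
ansible-playbook -i inventory/homelab.ini playbooks/base.yml

# Then the services themselves, limited to the cluster nodes:
ansible-playbook -i inventory/homelab.ini playbooks/services.yml --limit proxmox_nodes
```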

Again, my apologies for the downtime. It was completely unexpected and came out of the blue. I honestly still have no idea what happened.

The best suspicion I have is disk failure… and after rebooting the machine, it came back to life?

Regardless, I will work to improve this moving forward. Also, I don’t plan on being out of town anytime soon… so that will help too.

There may be some slight downtime later on as I work on and move things around. If that is the case, it will be short. But for now, the goal is just restoring my other services and getting back up and running.

Update 2023-07-23 CST

There are still a few kinks being worked out. I have noticed things are still occasionally disconnecting.

Still working on ironing out the issues. Please bear with me.

(This issue appears to be due to a single Realtek NIC in the cluster… Realtek = bad.)

Update 9:30pm CST

Well, it has been a “fun” evening. I have been finding issues left and right.

  1. A bad piece of fiber cable.
  2. The aforementioned server with a Realtek NIC, which was bringing down the entire cluster.
  3. STP/RSTP problems, likely caused by the two issues above.

Still working and improving…

Update 2023-07-24

Update 9am CST

Still working out a few minor kinks. The finish line is in sight.

Update 5pm CST

Happened to find an SFP+ module that was in the process of dying. Swapped it out for a new one, and… magically, many of the spotty network issues went away.
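
For anyone wondering how you spot a dying module from the host side, something like this works, assuming the NIC driver exposes the module’s diagnostics. The interface name is a placeholder; substitute your own.

```sh
# Dump the module's digital diagnostics (DOM) - watch for RX/TX
# optical power drifting toward the warning/alarm thresholds.
ethtool -m enp1s0f0

# Climbing error counters are another giveaway of a flaky module or cable.
ip -s link show enp1s0f0
ethtool -S enp1s0f0 | grep -iE 'err|crc|drop'
```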

Have new fiber ordered, will install later this week.

Update 9pm CST

  1. Broken/intermittent SFP+ module replaced.
  2. Server with the crappy Realtek NIC removed; re-added with 10G SFP+ connectivity.
  3. Clustered servers moved to a dedicated switch.
  4. New fiber ordered to replace the longer-distance (50ft) 10G copper runs.

I am aware of the current performance issues. These will start going away as I expand the cluster. For now, I am still focusing on rebuilding everything to a working state.