Service availability monitoring/flapping services - eviltoast

So I’ve got a home server that’s having issues with services flapping, and I’m trying to figure out what toolchain would actually be useful for telling me why it’s happening, and not just when it happened.

I’m using Uptime Kuma, and it’s happy enough to tell me that it couldn’t connect or that a 503 happened or whatever, but that’s kinda useless because the service is essentially working again by the time I get the notice.

What tooling would give a little more detail into the why, so I can determine the fault and fix it?

I’m not sure if it’s the ISP, something in my networking configuration, something on the home server, a bad cable, or whatever because I see nothing in logs related to the application or the underlying host that would indicate anything even happened.

It’s also not EVERY service on the server at once, but rather just one or two while the other pile doesn’t alert.

In short: it’s annoying and I’m not really making headway finding something that can do a better job of root-causing what’s going on.

  • schizo@forum.uncomfortable.business (OP) · 4 months ago

    It’s hilariously annoying, but to address your points:

    1. There’s nothing in any of the service logs
    2. The notifications come from services that have external monitoring, but it’s not always the same service
    3. The local monitoring (which uses the same DNS records for resolution, and uses the same reverse proxy to connect) doesn’t flap at all.
    4. It’s sites behind Cloudflare, ones not behind Cloudflare, and one via one of their Argo tunnels, so it doesn’t seem specific to CF.

    The 503s are coming from cloudflare indicating it can’t connect to the back end, which makes me think network issue again. Non-CF sites just show timeout errors.

    I don’t think it’s resource related; it’s a 10850K with 64 GB of RAM, and it’s currently using uh, 3% CPU and about 15 GB of RAM, so there’s more than sufficient idle resources to handle even a substantial spike in traffic (which I don’t see any indications of in the logs, but).

    It’s gotta be some incredibly transient network issue but it’s so transient I’m not sure how to actually make a determination as to what happens when it breaks, since it’s “fixed itself” by the time I can get near enough to something to take a look.

    • myliltoehurts@lemm.ee · 4 months ago

      Maybe set up a script that runs locally and pings an external service like 1.1.1.1 or 8.8.8.8 every second to see if it survives in a window when your services alert? Perhaps it’s your modem refreshing some config which causes a blip for a few seconds or something similar. If this doesn’t alert at least you can rule out that your internet fully goes out.
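Something like this rough Python sketch could do the pinging and logging. It assumes a Linux box with the iputils `ping` (the `-c`/`-W` flags behave differently on macOS and Windows); the log file name and the helper names (`ping_once`, `find_outages`, `main`) are made up for illustration.

```python
import subprocess
import time
from datetime import datetime

TARGETS = ["1.1.1.1", "8.8.8.8"]  # the external hosts mentioned above


def ping_once(host, timeout_s=1):
    """Return True if one ICMP echo to host gets a reply.

    Assumes Linux iputils ping: -c is the packet count, -W the reply timeout in seconds.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) windows."""
    outages = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_fail = ts
        elif start is not None:
            outages.append((start, last_fail))
            start = None
    if start is not None:
        # The samples ended while still failing.
        outages.append((start, last_fail))
    return outages


def main(logfile="ping_log.txt", interval_s=1):
    """Ping every target roughly once per interval and append the result to logfile."""
    with open(logfile, "a") as log:
        while True:
            now = datetime.now().isoformat(timespec="seconds")
            for host in TARGETS:
                status = "ok" if ping_once(host) else "FAIL"
                log.write(f"{now} {host} {status}\n")
            log.flush()
            time.sleep(interval_s)

# To start logging, call main() (for example from a systemd unit or a tmux session).
```

Once it has been running for a while, compare the FAIL timestamps in the log against the times Uptime Kuma alerted. `find_outages` is only there to turn a list of samples into failure windows if you want to summarise the log afterwards.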

      The other side of this would also be useful: run similar checks against different levels of your home network to see how far down the failure goes. For example, ping your router, expose a simple TCP echo service on the server running all this and nc it, curl the status page of the reverse proxy (or set up a static page in it), and curl the app behind the reverse proxy. Just make sure to use firewall rules for this rather than putting everything on the internet. Where it fails should hopefully give you some idea to go on.
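A minimal sketch of that layered check in Python, standing in for the nc and curl calls. Every address, port and URL in `LAYERS` is a placeholder I made up, so swap in your own router IP, echo port, proxy status page and app URL.

```python
import socket
import urllib.error
import urllib.request


def tcp_check(host, port, timeout_s=2):
    """Return True if a TCP connection to host:port opens (roughly what nc -z does)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def http_check(url, timeout_s=3):
    """Return True if url answers with a 2xx/3xx status (roughly what curl -f does)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx (HTTPError), refused connections and timeouts.
        return False


# Ordered from closest to the network edge to deepest in the stack.
# All hosts, ports and URLs below are hypothetical placeholders.
LAYERS = [
    ("router", lambda: tcp_check("192.168.1.1", 80)),
    ("server tcp echo", lambda: tcp_check("192.168.1.10", 7)),
    ("reverse proxy", lambda: http_check("http://192.168.1.10/status")),
    ("app behind proxy", lambda: http_check("https://app.example.com/")),
]


def run_layers(layers):
    """Run every check and return a list of (name, ok) in the same order."""
    return [(name, check()) for name, check in layers]


def first_failure(results):
    """Return the name of the first layer that failed, or None if all passed."""
    for name, ok in results:
        if not ok:
            return name
    return None
```

Calling `first_failure(run_layers(LAYERS))` every few seconds from a machine outside your network (with the firewall only allowing that machine in) and logging the result with a timestamp should show which layer drops first when the external monitor alerts.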

      Maybe set up https://www.thinkbroadband.com/broadband/monitoring/quality/ to see if it registers any packet loss in those times or increased latency (although I’d still do the above as well)