Company brought to its knees by a cable - eviltoast

Yesterday around noon, the internet at my company started acting up. No matter, slowdowns happen and there’s roadwork going on outside: maybe they hit the fiber or something. So we waited.

Then our Samba servers started getting flaky. And the database too. Uh oh… That’s different.

We started investigating. Some machines were dropping ICMP packets like crazy, then recovered, then other machines started to become unpingable too. I fired up Wireshark and discovered an absolute flood of IGMP packets on all the trunks, mostly broadcast from Windows machines. It was so bad that two Linux machines on the same switch couldn’t ping each other reliably if the switch was connected to the intranet.

So we suspected a DDoS attack initiated from within the intranet by an outside attacker. We cut off the internet, but the storm of packets kept on coming. Physically disconnecting machines from the intranet one by one didn’t do a thing either.

Eventually, we started disconnecting each trunk one by one from the main router until we disconnected one and all the activity lights immediately stopped on all the ports. We reconnected it and the crazy traffic resumed.

So we went to that trunk’s subrouter and did the same thing. When we found the cable that stopped all the traffic, we followed it and finally found one lonely $10 ethernet switch with… a cable with both ends plugged into the switch. We disconnected the cable and everything instantly returned to normal.

One measly cable brought the entire company to a standstill for hours! Because half of the software we have to use is cloud crap or needs to call its particular mothership to activate its license, many people couldn’t work anymore for no good technical reason at all while we investigated the networking issue.

Anyway, I thought switches had protections against that sort of loopback connection, and routers prevented circular routes. But there’s theory and there’s reality. Crazy!

  • Gobo@lemmy.world
    ↑ 53 · 4 months ago

    Yea. This is what spanning tree and bpduguard are for. Don’t disable them on your edge.

  • mlg@lemmy.world
    ↑ 28 · 4 months ago

    Lol imagine the poor dude in his office who was just bored and thought “what if I plug this cable back into the hub, probably won’t do anything”

    • ExtremeDullard@lemmy.sdf.orgOP
      ↑ 37 · edited · 4 months ago

      Actually this happened in the lab. I know exactly who did this because he told me: we were discussing what had happened and he said “Oh yeah, Daniel and I needed to connect this Windows machine to the intranet quick because we had something urgent to do, and we connected all the ends of the nest of ethernet cables at random until the machine connected. And then we left everything as it was.” But bad luck for us, their machine was connected, but so was that fatal cable on both ends. It just happened that their machine kept working well enough for them to finish what they were doing without noticing the problems right away.

      And in case you wonder, there’s no penalty in our company for owning up to honest mistakes, so that’s why he readily admitted to it. Only people who never do anything never do anything wrong.

      • GreyEyedGhost@lemmy.ca
        ↑ 4 · 4 months ago

        I do hope you taught him the many better ways of doing this. I absolutely agree with making an environment where mistakes are easily owned up to (I made a mistake that ended up costing my employer over $10k in the last year), but if it isn’t coupled with turning those into learning experiences (here’s why you don’t do that, here’s why this is a better solution) then you just have a lot of mistakes happening over and over again.

    • Socsa@sh.itjust.works
      ↑ 6 · 4 months ago

      In my experience it’s either someone doing it on purpose, or someone accidentally pulling the wrong cable out of a rat’s nest.

  • oleorun@real.lemmy.fan
    ↑ 25 · 4 months ago

    This got me too once. I was in the server room replacing old 110 punch panels/blocks with 8P8C connections. I lost track of cable connections, a mistake I have learned from, and I looped a patch cable into the same switch. Within moments the entire network went down.

    Forty-five minutes later and we figured out the loop.

    Another lesson learned: HP Procurve switches did not have Spanning Tree enabled by default.

    Anyway, mistakes happen, especially in IT. It’s all part of the learning experience. My boss was the coolest, chillest guy in the world so I learned and moved on.

  • ramble81@lemm.ee
    ↑ 16 · 4 months ago

    I really hope you meant “switch” when saying “hub”. I haven’t seen a hub used in decades. Also your switch should have some level of STP protection enabled to prevent that. Even if someone had a hub with a routing loop, STP would have disabled the ports.

    • dan@upvote.au
      ↑ 13 · 4 months ago

      Basic unmanaged switches often don’t have any sort of protection, and on some fancier managed switches it’s disabled by default (no idea why).

      • Jajcus@sh.itjust.works
        ↑ 12 · edited · 4 months ago

        no idea why

        Because it makes initial connection much slower. Dumb switch - you insert a cable and it works. STP-enabled switch: you insert a cable and it takes a while until the port is enabled (unless you do extra configuration, appropriate for your network topology). This is annoying and for inexperienced users it could seem like the switch ‘does not work’. It is easier to sell a switch without such a feature enabled by default.
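
        For a rough sense of how long "a while" is: under classic 802.1D spanning tree, a newly connected port spends one forward-delay interval in the listening state and another in the learning state before it forwards traffic. The sketch below just does that arithmetic, assuming the standard default forward delay of 15 seconds. The function name is made up for illustration, and rapid spanning tree or edge-port settings shorten the wait considerably.

```python
# Classic 802.1D spanning tree: time before a freshly plugged-in port forwards.
# Assumption: default timers (forward delay = 15 s); real switches let you tune this.

FORWARD_DELAY_S = 15  # 802.1D default forward delay, in seconds


def seconds_until_forwarding(forward_delay_s: int = FORWARD_DELAY_S) -> int:
    """Listening and learning each last one forward-delay interval."""
    listening = forward_delay_s
    learning = forward_delay_s
    return listening + learning


if __name__ == "__main__":
    print(seconds_until_forwarding())  # 30
```

        Half a minute of a dead-looking port is long enough for an impatient user to decide the switch is broken, which fits the explanation above.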

  • AstridWipenaugh@lemmy.world
    ↑ 15 · 4 months ago

    I was diagnosing a network bottleneck at a customer site that didn’t make any sense. Literally everything had gigabit connections except one block of cubicles, but all the devices were connected to the same subnet router for that part of the building. Started tracing wires like you did and found that someone didn’t have a long enough cable when building the office and installed a 10 megabit Linksys switch in the drop ceiling to connect two short cables. Rather than fix the cable, the customer just went to Best Buy and bought a gigabit Linksys switch to replace it… A multi-million dollar operation is being held together by a $10 switch…

  • °˖✧ ipha ✧˖°@lemm.ee
    ↑ 14 · 4 months ago

    But there’s theory and there’s reality.

    Mood. I can’t count the times I’ve found issues that shouldn’t be possible, but are clearly happening.

    • oleorun@real.lemmy.fan
      ↑ 13 · 4 months ago

      We used to use Malwarebytes Corporate Edition at work.

      One afternoon all of our web servers stopped responding to traffic on port 443. I could RDC into the servers, and I could ping them, but most traffic wasn’t being passed properly.

      Despite not having made any changes, I did everything I could think of to get them to work. I tried moving them to different switches and different static IPs. Wireshark showed packets flowing, but no web traffic.

      I left the office. It was around 8 PM and I had been banging my head on my desk trying to figure out what the hell was going on.

      I came back around 10 PM, mind clear and stomach topped off. I worked a few more minutes, then heard the Outlook ding.

      Mass email from Malwarebytes CEO. Bad update. Blocked all class B IP addresses by mistake (guess which class we used). Mea culpa. So sorry. New update fixes things.

      I immediately uninstalled MWB CE and boom. Services restored.

      The next week we got our licenses refunded by our VAR and we never used that product again.
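
      For readers unsure what "class B" covers: in the old classful scheme it is every IPv4 address whose first octet is between 128 and 191, which is the single block 128.0.0.0/2. A minimal sketch using Python's standard `ipaddress` module; the helper name `is_class_b` is invented here, and nothing below reflects how the product in the story actually matched addresses.

```python
import ipaddress

# Classful "class B" space: first octet 128-191, i.e. 128.0.0.0/2.
CLASS_B = ipaddress.ip_network("128.0.0.0/2")


def is_class_b(address: str) -> bool:
    """Return True if the IPv4 address falls in the classful class B range."""
    return ipaddress.ip_address(address) in CLASS_B


if __name__ == "__main__":
    for addr in ("172.16.5.10", "10.0.0.1", "192.168.1.1"):
        print(addr, is_class_b(addr))
```

      Note that the private 172.16.0.0/12 range sits inside this block, so a blanket class B block can hit internal networks as well as public ones.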

  • Orbituary@lemmy.world
    ↑ 10 · 4 months ago

    Just reading the title of the post I knew what happened. I read through the whole thing because your story was good and I was in suspense to figure out if it was a router or voip phone that was the culprit.

    Had this happen at work about a decade ago.

  • dan@upvote.au
    ↑ 9 · 4 months ago

    By “hub”, do you mean switch? I haven’t seen a hub in a very long time. I don’t think I’ve ever seen a 1Gbps one.

    • Possibly linux@lemmy.zip
      ↑ 1 · edited · 4 months ago

      There are such things as small 1Gbps hubs that are designed to just handle a small network. They scare me as they are cheap on Amazon and could theoretically bring a network to its knees if a random user finds a port that isn’t authenticated.

      • Nougat@fedia.io
        ↑ 4 · 4 months ago

        For the passers-by, in very simple terms:

        A switch maintains a table of the MAC addresses it has seen and the port each one was seen on (the MAC address table, sometimes called the CAM table; this is not the same thing as an ARP table, which maps IPs to MAC addresses on hosts and routers). When a packet (strictly, an Ethernet frame) comes into the switch for a specific destination MAC address, the switch looks up which port that address was learned on, and only sends the packet out on the port the destination device (or next hop towards that device) is connected to. Broadcasts, and packets for addresses the switch hasn’t learned yet, still go out of every port.

        A hub doesn’t do any of that. Every packet that comes into the hub gets sent out of every port on the hub, to every device connected to the hub. It’s on the connected devices to discard packets that aren’t addressed to them. On anything but a very small and relatively slow network, this would create an unnecessarily large amount of traffic, not to mention the security issue around sending packets to devices they’re not addressed to.
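
        A minimal sketch of the difference, assuming a switch that learns source MAC addresses per port (the usual layer-2 behaviour). The class names, the `forward` method and the shortened MAC strings in the usage note are all invented for illustration; real switches also age out table entries and handle VLANs, which is left out here.

```python
from typing import Dict, List

BROADCAST = "ff:ff:ff:ff:ff:ff"


class Hub:
    """Repeats every frame out of every port except the one it came in on."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports

    def forward(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        return [p for p in range(self.num_ports) if p != in_port]


class LearningSwitch:
    """Learns which port each source MAC lives on, then forwards selectively."""

    def __init__(self, num_ports: int) -> None:
        self.num_ports = num_ports
        self.mac_table: Dict[str, int] = {}

    def forward(self, in_port: int, src_mac: str, dst_mac: str) -> List[int]:
        self.mac_table[src_mac] = in_port  # learn (or refresh) the sender's port
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            # Broadcast or unknown destination: flood, just like a hub would.
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            return []  # destination sits behind the ingress port; nothing to send
        return [out_port]
```

        With a four-port `LearningSwitch`, the first frame from "aa" (port 0) to "bb" is flooded to ports 1, 2 and 3; once "bb" replies from port 1, later frames between the two only touch ports 0 and 1. Broadcasts are always flooded, which is why a loop hurts even on a switch.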

  • Socsa@sh.itjust.works
    ↑ 7 · 4 months ago

    Yup, the good old “loopback FU.”

    Routers do have some protections which can mitigate this, but the entire problem is broadcast flooding which can’t really be dealt with at layer 2, or even at layer 3 within the same segment. Most places will have no broadcast forwarding between segments, but even if you detect unusual broadcast activity and ban that class of traffic, you break other things. A lot of times it is ARP floods, so it doesn’t happen when the network is static and converged until someone plugs a new laptop in, and then everyone assumes it’s that laptop.

  • deadbeef@lemmy.nz
    ↑ 6 · 4 months ago

    Most hubs didn’t protect you from anything in particular.

    Most of them would forward everything to every port; some really insane ones would strip out the spanning tree that could have prevented a loop.

    It’s been a long time since I did anything that goes as far into a network as the desktop, but 15+ years ago we had a customer ring up with the same sort of complaint. After we followed the breadcrumbs on site we found a little 8 port hub (that we hadn’t supplied) plugged into two wall ports that went to two different Cisco edge switches in the server room, two Cisco phones also with their passthrough ports both patched into the same switch, and then two desktop PCs.

    Amazing.

  • pastermil@sh.itjust.works
    ↑ 6 · 4 months ago

    Does that kind of loop really mess with things? ELI5 please!

    Also, what do you mean a lonely switch? Does it have that loop and a port connected to another switch in the network?

    • stoy@lemmy.zip
      ↑ 13 · 4 months ago

      IT tech here, yes, yes it can.

      Network infrastructure is incredibly smart while also being dumb in other ways.

      To do an ELI5 answer:

      Imagine you have a container of pearls that you need to sort, red, green and blue pearls all need to be dropped into a red, green or blue hole.

      The container is being refilled, but slow enough that it only gets a new pearl once you have sorted the previous.

      The holes are connected to pipes going to separate buckets.

      Everything is fine, but then someone adds a new hole that is multicolored and tells you that all pearls should go there.

      You tell your friends that you have a faster way to deal with the pearls and to send you their pearls.

      The new hole also has a pipe, but that is connected to the container that receives pearls, so every time you drop a pearl into the new hole, it appears in the container again.

      So now you have a situation where you not only get your normal amount of pearls, but everyone else’s pearls, and you also get every pearl you send back again.

      You are smart and quickly realize that something is wrong and call your teacher for help. Networking gear doesn’t have the capability to understand that something is wrong; it just looks at each pearl and not the big picture.

      If we go back to the real world, we have developed tools to deal with this situation: protocols like spanning tree, which let switches talk to each other and figure out if there is a physical loop before sending traffic through it.

      There are other tools as well, but they all need to be configured and to be honest, it is easily forgotten or made a low priority since it happens rarely.

      It is something that is often implemented after a big outage.
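
      To make the spanning tree idea a bit more concrete, here is a heavily simplified sketch. It assumes a connected network, picks the alphabetically first switch name as a stand-in for the root election, and keeps only the links a breadth-first walk from that root needs; every other link, including a cable looped back into the same switch, ends up blocked. Real spanning tree reaches a similar result by exchanging BPDUs and comparing path costs, none of which is modelled here, and the function and switch names are invented for the example.

```python
from collections import deque
from typing import List, Set, Tuple

# (switch_a, switch_b); a == b models a cable with both ends in one switch.
Link = Tuple[str, str]


def blocked_links(links: List[Link]) -> List[int]:
    """Return indexes of links to block so the remaining links form a loop-free tree.

    Assumes the network is connected; links in unreachable islands are also reported.
    """
    switches = sorted({s for link in links for s in link})
    if not switches:
        return []
    root = switches[0]  # stand-in for "lowest bridge ID wins the root election"
    visited: Set[str] = {root}
    forwarding: Set[int] = set()
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for idx, (a, b) in enumerate(links):
            if a == b or current not in (a, b):
                continue  # self-loops never reach a new switch; skip unrelated links
            neighbour = b if current == a else a
            if neighbour not in visited:
                visited.add(neighbour)
                forwarding.add(idx)  # first link that reaches a switch stays active
                queue.append(neighbour)
    return [i for i in range(len(links)) if i not in forwarding]


if __name__ == "__main__":
    cabling = [
        ("core", "edge1"),
        ("core", "edge2"),
        ("edge1", "edge2"),  # redundant path: creates a loop
        ("edge2", "edge2"),  # the "lonely switch" cable from the story
    ]
    print(blocked_links(cabling))  # [2, 3]
```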

    • Socsa@sh.itjust.works
      ↑ 1 · 4 months ago

      Certain types of broadcast traffic always get re-broadcast out of every port on a switch. So if you directly connect two ports, and you get some broadcast coming into the switch, that broadcast will loop forever across that loopback, and then get propagated repeatedly until it hits a broadcast boundary. It’s surprisingly difficult to prevent even with managed switches unless you are willing to hand manage every port and significantly restrict the kind of network services which can flow through it.

      Some devices can detect these loops and break them, but that can have other unintended impacts if your network is designed (some would argue poorly) around using dumb switches to multiply limited Ethernet drops at the edge.
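
      The looping described above can be sketched as a toy simulation of a single switch with no loop protection. It follows one broadcast frame and counts, round by round, how many copies reach non-looped ports. The function name and its parameters are invented for this example, and timing, link speed and buffers are ignored entirely.

```python
from typing import Dict, List


def simulate_loop(num_ports: int, loop: Dict[int, int], start_port: int, rounds: int) -> List[int]:
    """Return, per round, how many copies of ONE broadcast reach non-looped ports.

    `loop` maps a port to the port at the other end of a looped cable,
    in both directions, e.g. {2: 3, 3: 2}.
    """
    arriving = [start_port]  # ports on which a copy of the frame is arriving
    delivered_per_round = []
    for _ in range(rounds):
        next_arriving = []
        delivered = 0
        for in_port in arriving:
            for out_port in range(num_ports):
                if out_port == in_port:
                    continue  # a switch never floods a frame back out its ingress port
                if out_port in loop:
                    next_arriving.append(loop[out_port])  # comes straight back in
                else:
                    delivered += 1  # reaches a host or an uplink and leaves the switch
        delivered_per_round.append(delivered)
        arriving = next_arriving
    return delivered_per_round


if __name__ == "__main__":
    print(simulate_loop(4, {}, 0, 4))            # [3, 0, 0, 0]
    print(simulate_loop(4, {2: 3, 3: 2}, 0, 4))  # [1, 4, 4, 4]
```

      Without a loop the broadcast is delivered once and dies. With one looped cable it never dies: two copies circle forever and every lap sprays more copies at the other ports, so each new broadcast adds to the pile. With a second looped cable on the same switch, for example `simulate_loop(6, {2: 3, 3: 2, 4: 5, 5: 4}, 0, 3)`, the count grows every round (1, 8, 24).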

  • dukatos@lemm.ee
    ↑ 6 · 4 months ago

    Managed switches are not expensive and have death loop protection.

  • Possibly linux@lemmy.zip
    ↑ 5 · 4 months ago

    If you are using a hub then that’s expected as they tend to be one of the main sources of floods on a network.

    If you have managed switches make sure you turn on loop protection and alerting. Ideally you should immediately know when something like that happens.

    Also bonus if you set up VLANs with different subnets. From there practice least privilege and block all forward traffic by default.

  • Randelung@lemmy.world
    ↑ 4 · 4 months ago

    Our Unifi network collapsed and I have no clue why. One theory was the automatic WiFi bridges that might have acted as loops.

    • Brickhead92@lemmy.world
      ↑ 1 · edited · 4 months ago

      Yeah, I’ve had a wireless uplink between two UniFi APs on the same switch (the only non-UniFi switch) come up by itself and cause a loop. Luckily that switch only had a 1G uplink to the next switch while all the rest were 10G links, so it mostly only affected devices on that switch.

      Edit: thought I’d just say that since then, I always disable wireless uplink on all APs, and the global system setting, unless it’s actually used, and only on the APs that need it.