Network Stability Issues with tg3 Driver - Hardware Replacement or Further Testing?

Hi Lemmy!

I'm encountering network stability issues on my Mac mini (“Core i7” 2.3, Late 2012) running Proxmox VE (Linux 6.5.13-1-pve). The enp1s0f0 interface (tg3 driver) frequently drops, logging “Link is down” messages before recovering a few seconds later. This affects my Plex media server and has persisted across different OS installations.

(related posts here and here)

Here are some key excerpts from the logs showing the problem:

  • NETDEV WATCHDOG: enp1s0f0 (tg3): transmit queue 0 timed out
  • tg3 0000:01:00.0 enp1s0f0: transmit timed out, resetting
  • tg3 0000:01:00.0 enp1s0f0: Link is down
  • kernel: vmbr0: port 1(enp1s0f0) entered disabled state
  • kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex

Here is when it has happened:

  • journalctl | grep -E "Link is down|Link is up" | grep 'enp1s0f0'
  • Mar 03 00:03:05 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
  • Mar 03 00:03:09 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
  • Mar 03 15:35:30 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
  • Mar 03 15:35:34 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
  • Mar 04 12:43:45 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
  • Mar 04 12:43:49 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
  • Mar 07 09:14:48 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
  • Mar 07 09:14:52 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
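To make the pattern in these timestamps easier to see, here is a minimal Python sketch that pairs each "Link is down" line with the next "Link is up" line and reports how long each outage lasted. The function name `parse_flaps` and the `year` argument are my own for illustration; syslog-style timestamps carry no year, so 2024 is an assumption.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
# Mar 03 00:03:05 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
LINK_RE = re.compile(
    r"(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \S+ kernel: tg3 \S+ "
    r"(?P<iface>\S+): Link is (?P<state>down|up)"
)


def parse_flaps(lines, year=2024):
    """Return (iface, down_time, seconds_down) for each down/up pair.

    `year` is an assumption: the log timestamps do not include one.
    """
    flaps = []
    down_at = {}  # interface name -> time of the pending "Link is down"
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        iface = m["iface"]
        if m["state"] == "down":
            down_at[iface] = ts
        elif iface in down_at:
            start = down_at.pop(iface)
            flaps.append((iface, start, (ts - start).total_seconds()))
    return flaps


# The eight log lines quoted in the post.
SAMPLE = """\
Mar 03 00:03:05 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
Mar 03 00:03:09 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
Mar 03 15:35:30 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
Mar 03 15:35:34 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
Mar 04 12:43:45 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
Mar 04 12:43:49 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
Mar 07 09:14:48 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is down
Mar 07 09:14:52 macminiserver kernel: tg3 0000:01:00.0 enp1s0f0: Link is up at 1000 Mbps, full duplex
""".splitlines()

flaps = parse_flaps(SAMPLE)
for iface, start, seconds in flaps:
    print(f"{iface} down at {start:%b %d %H:%M:%S} for {seconds:.0f}s")
```

Run against the lines above, every outage comes out at 4 seconds, which looks more like a driver reset and renegotiation than a cable being physically disturbed, though that is only a guess from four samples.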

The issue is very intermittent. It appears to happen more often when I am watching something on Plex (Direct Play to Apple TV), but it has also happened overnight with limited activity. On the other hand, I’ve successfully streamed all day on multiple devices, run ping to and from multiple devices, run mtr between it and my other Mac, and run 16 hours of iperf3 with zero issues!

  • Does anyone have any guidance on how I can determine whether this is a hardware issue or driver/kernel related?
  • I’ve ordered a Plugable USB to Ethernet Adapter so I can bypass the onboard NIC and test whether something else is the cause. I also needed a good USB to Ethernet adapter anyway, so it was time.
  • Would an external solution suffice, or is it time for a new system?
  • Should I focus on further diagnostics in a different environment, or is it time to retire this box and get something new?

Happy to share more of my syslog and my network setup. I’m in a NYC apartment, so my options for changing the environment are limited. I also haven’t encountered (or noticed) the issue with any of my other devices on the same switch, which is hooked up to the router in the same manner. I have tried a different port and cable so far, but have not physically moved the Mac mini to another switch yet.

I’m getting closer and closer to just buying a Dell Optiplex (probably 11th Gen i7), cannibalizing the SSD, and trying to play with more services. My original goal was to learn and host Plex cheaply and easily using this older hardware, but my sanity is running out!

Thanks

  • Sorcaeden@lemmy.world
    8 months ago

    I seem to recall a VMware complaint similar to this too, and there was a ring buffer tuning to do to fix it… But that error message doesn’t seem quite right to match it.

    TX queue timeouts can be caused by several things, but I wonder if you’re not seeing an end result of a spammy Ethernet flow control implementation where the device can’t transmit because the link is continuously paused.

    If so, there may be rx_xoff counters viewable from within Proxmox, or “ethtool -S enp1s0f0” (capital S, which prints the NIC statistics) would tell you whether the device is seeing pause frames from the switch on a regular Linux host.
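As a rough sketch of that check, the Python below parses the `name: value` lines that `ethtool -S <iface>` prints and keeps only the counters whose names suggest flow control. The helper names and the keyword list are my own assumptions, counter names vary by driver, and the sample text is illustrative, not real output from this machine.

```python
import subprocess

# Substrings that usually appear in flow-control counter names.
# This list is an assumption; check your driver's actual counter names.
KEYWORDS = ("pause", "xoff", "xon", "flow_control")


def parse_ethtool_stats(text):
    """Turn 'name: value' lines from `ethtool -S` into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the "NIC statistics:" header
            stats[name.strip()] = int(value)
    return stats


def flow_control_counters(stats):
    """Keep only the counters whose names look flow-control related."""
    return {k: v for k, v in stats.items() if any(w in k for w in KEYWORDS)}


def read_stats(iface):
    """Run `ethtool -S <iface>` (needs ethtool installed) and parse it."""
    out = subprocess.run(
        ["ethtool", "-S", iface], capture_output=True, text=True, check=True
    ).stdout
    return parse_ethtool_stats(out)


# Illustrative text only: made-up values, not output from the poster's NIC.
SAMPLE_OUTPUT = """\
NIC statistics:
     rx_octets: 123456
     rx_xoff_pause_rcvd: 42
     tx_xoff_sent: 0
     tx_flow_control: 0
     rx_errors: 0
"""

fc = flow_control_counters(parse_ethtool_stats(SAMPLE_OUTPUT))
for name, value in sorted(fc.items()):
    print(f"{name}: {value}")
```

On the real host you would call `read_stats("enp1s0f0")` twice, some minutes apart, and compare the two results. Pause counters that keep climbing around the time of a drop would support the flow-control theory; counters that stay flat would point elsewhere.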

    The link down tends to be a reaction by the driver to recover from a hung queue, so if it’s not flow control, there could be a driver/firmware upgrade possible, or a series of tunables if there’s a bug somewhere in packet handling land resulting in the NIC itself hanging.

  • dkc@lemmy.world
    8 months ago

    Anything is possible, but I’d be surprised if it’s a driver issue. Many server vendors use tg3 based NICs as the onboard NIC integrated into the motherboard. The install base for that driver is huge.

  • robalees@lemmy.worldOP
    8 months ago

    Memmy doesn’t like Markdown, it seems 🤷🏻‍♂️ Happy to post plain-text logs if that helps.