Federation troubleshooting - eviltoast

So I’ve been troubleshooting the federation issues with some other admins:

(Thanks for the help)

So what we see is that when many federation workers run at the same time, they get too slow, which causes them to time out and fail.

I had federation workers set to 200000. I’ve now lowered that to 8192, and set the activitypub logging to debug to get queue stats:

RUST_LOG="warn,lemmy_server=warn,lemmy_api=warn,lemmy_api_common=warn,lemmy_api_crud=warn,lemmy_apub=warn,lemmy_db_schema=warn,lemmy_db_views=warn,lemmy_db_views_actor=warn,lemmy_db_views_moderator=warn,lemmy_routes=warn,lemmy_utils=warn,lemmy_websocket=warn,activitypub_federation=debug"

Also, I saw that many workers were retrying deliveries to servers that are unreachable. So, I’ve blocked some of these servers:

commallama.social,mayheminc.win,lemmy.name,lm.runnerd.net,frostbyrne.io,be-lemmy.org,lemmonade.marbledfennec.net,lemmy.sarcasticdeveloper.com,lemmy.kosapps.com,pawb.social,kbin.wageoffsite.com,lemmy.iswhereits.at,lemmy.easfrq.live,lemmy.friheter.com,lmy.rndmm.us,kbin.korgen.xyz

This gave good results: far fewer active workers, so fewer timeouts. (I see that timeouts start above 3000 active workers.)

(If you own one of these servers, let me know once it’s back up, so I can un-block it)
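For anyone who wants to check whether a blocked server has come back, here is a rough Python sketch. It is not part of Lemmy; the function names are made up for illustration, and it assumes that a successful TCP connection on port 443 is a good enough sign of life (it says nothing about whether federation actually works on that host).

```python
import socket


def parse_hosts(csv: str) -> list[str]:
    """Split a comma-separated host list, dropping blanks and stray whitespace."""
    return [h.strip() for h in csv.split(",") if h.strip()]


def is_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Assumption: an open HTTPS port means the server is back up. DNS failures,
    refused connections and timeouts all raise OSError subclasses.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(hosts, probe=is_reachable) -> dict[str, bool]:
    """Map each host to the result of probing it (probe is swappable for testing)."""
    return {host: probe(host) for host in hosts}
```

Calling `report(parse_hosts("commallama.social,mayheminc.win"))` returns a dict of host to True/False, so anything that maps to True is a candidate for un-blocking.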

Now it’s after midnight so I’m going to bed. Surely more troubleshooting will follow tomorrow and over the weekend.

Please let me know if you see improvements, or still have many issues.

  • hawkwind@lemmy.management
    1 year ago

    Admins: FYI on lemmy logs:

    Example of a federation message success (HTTP Response 200):

    INFO actix_web::middleware::logger: 12.34.56.78 'POST /inbox HTTP/1.1' 200 0 '-' 'Lemmy/0.17.4; +https://remote.lemmy' 0.145673

    and failure (HTTP Response 400):

    INFO actix_web::middleware::logger: 12.34.56.78 'POST /inbox HTTP/1.1' 400 65 '-' 'Lemmy/0.17.4; +https://remote.lemmy' 0.145673

    These are usually followed shortly by a more verbose reason, such as Http Signature is expired.
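To get a quick overview of which remote instances are being accepted or rejected, the access-log lines above can be tallied with a small script. This is a minimal sketch that assumes the log lines look exactly like the two examples (quoted request, status, size, referrer, user agent, duration); the regex and function name are illustrative, not anything Lemmy ships.

```python
import re
from collections import Counter

# Matches the access-log shape shown in the examples above:
#   <ip> 'POST /inbox HTTP/1.1' <status> <bytes> '<referrer>' '<user agent>' <seconds>
INBOX_LINE = re.compile(
    r"(?P<ip>\S+) 'POST /inbox HTTP/[\d.]+' (?P<status>\d{3}) \d+ "
    r"'[^']*' '(?P<agent>[^']*)' (?P<secs>[\d.]+)"
)


def inbox_status_counts(lines):
    """Count inbox POST responses by (sending instance, HTTP status)."""
    counts = Counter()
    for line in lines:
        m = INBOX_LINE.search(line)
        if m is None:
            continue  # not an inbox access-log line
        # The user agent looks like 'Lemmy/0.17.4; +https://remote.lemmy';
        # take the part after '+' as the sending instance when present.
        agent = m.group("agent")
        instance = agent.split("+", 1)[1] if "+" in agent else agent
        counts[(instance, int(m.group("status")))] += 1
    return counts
```

Feeding it the lines of a log file (for example `inbox_status_counts(open("lemmy.log"))`, where the file name is whatever your setup uses) gives a counter keyed by instance and status, so a remote with many 400s stands out.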

    Lemmy is far better at logging INCOMING stuff than OUTGOING. You can grep for activity_queue to get a sense of whether there are issues. This is not good:

    Target server https://lemmy.lemmy/inbox rejected https://lemmy.local/activities/announce/9852ff01-c768-484b-a38da-da021cd1333, aborting

    Also there are stats indicating “pile-ups”:

    Activity queue stats: pending: 1, running: 1, retries: 706, dead: 0, complete: 12
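The stats line above is easy to parse if you want to watch for pile-ups automatically. The sketch below assumes the line format is exactly as shown. The 3000 running-workers limit comes from the observation in the post; the retries-to-complete ratio is an arbitrary threshold picked for illustration, not a known Lemmy value.

```python
import re

STATS = re.compile(
    r"Activity queue stats: pending: (\d+), running: (\d+), "
    r"retries: (\d+), dead: (\d+), complete: (\d+)"
)
FIELDS = ("pending", "running", "retries", "dead", "complete")


def parse_queue_stats(line):
    """Return the queue counters as a dict of ints, or None if the line does not match."""
    m = STATS.search(line)
    if m is None:
        return None
    return dict(zip(FIELDS, map(int, m.groups())))


def looks_piled_up(stats, max_running=3000, max_retry_ratio=1.0):
    """Heuristic pile-up check (thresholds are assumptions, tune them for your instance)."""
    if stats["running"] > max_running:
        return True
    # Avoid dividing by zero when nothing has completed yet.
    done = max(stats["complete"], 1)
    return stats["retries"] / done > max_retry_ratio
```

With the example line, retries (706) far outnumber completed deliveries (12), so `looks_piled_up` flags it even though only one worker is running.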