How to store a rolling archive of an RSS feed? - eviltoast

Is anyone aware of an existing project that can do something like this:

  • Access an RSS feed.
  • Parse the contents of the items in the feed, and fetch linked images.
  • Take the new feed elements and add them to previously fetched elements.
  • Store all of the content in a merged RSS/XML file, or something like a SQLite DB.
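
For reference, the steps above could be sketched roughly as follows in Python, using only the standard library. This is a minimal sketch, not an existing project: it assumes an RSS 2.0 feed, the function names and the `items` table layout are made up for illustration, the feed URL is a placeholder, and fetching linked images is left out.

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url: str) -> str:
    """Download the feed and return its XML text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def parse_items(feed_xml: str) -> list[dict]:
    """Extract guid, link, date and description from each RSS 2.0 <item>."""
    items = []
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        items.append({
            # Fall back to the link when an item has no usable <guid>.
            "guid": item.findtext("guid") or link,
            "link": link,
            "published": item.findtext("pubDate", default=""),
            "content": item.findtext("description", default=""),
        })
    return items


def archive_items(db: sqlite3.Connection, items: list[dict]) -> int:
    """Insert items not seen before; return how many were new."""
    # Hypothetical schema: one row per feed item, keyed by guid.
    db.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "guid TEXT PRIMARY KEY, link TEXT, published TEXT, content TEXT)"
    )
    before = db.total_changes
    # The primary key on guid makes already-archived items a no-op.
    db.executemany(
        "INSERT OR IGNORE INTO items VALUES (:guid, :link, :published, :content)",
        items,
    )
    db.commit()
    return db.total_changes - before


if __name__ == "__main__":
    # Placeholder URL; Mastodon serves a feed at https://<instance>/@<user>.rss
    feed = fetch_feed("https://example.social/@someone.rss")
    with sqlite3.connect("archive.db") as db:
        print(archive_items(db, parse_items(feed)), "new items archived")
```

Run on a schedule, each invocation only adds items whose guid is not yet in the database, so the SQLite file grows into a rolling archive even after old posts drop out of the feed.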

Context: I’d like to archive the Mastodon posts of an account automatically. I’d prefer a script/binary I can run on Linux, as I’d likely throw it in a GitHub Action and save the resulting output in the git repo.

I could probably whip something together but I’m lazy and I’d prefer to use something that already exists.

    • bogo@sh.itjust.works (OP) · 1 year ago

      Thanks. This has potential and would force me to finally learn Ruby if I want to tweak it.

      • Paradox@lemdro.id · 1 year ago

        The best way to learn is to dive in and try to accomplish something you want to do.

  • Lodra@programming.dev · 1 year ago

    No, but I have an indirect answer (a method?) for you. There are many open source projects that do this type of work, for example NewsBlur. Maybe you can find a few of these projects in the language you want to use and see how they’re handling it. I would expect there to be some common libraries used between them.

  • abhibeckert@lemmy.world · 1 year ago (edited)

    I don’t know of a project that does this, but if I were to tackle it I would convert the RSS to the Activity Streams standard: https://www.w3.org/TR/activitystreams-core/.

    Activity Streams is basically the new RSS, and it’s a lot better than RSS.

    Mastodon is built on ActivityPub, which is built on Activity Streams, so you shouldn’t even need to touch RSS at all. The Activity Streams data already exists, and you can access it via the API.
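
To make that concrete, here is a rough Python sketch of pulling posts out of one page of an account’s Activity Streams outbox. It is a sketch under assumptions, not a tested client: the outbox URL is a placeholder, the function names are made up, it assumes the instance serves the outbox to unauthenticated requests (some instances require signed requests), and it does not follow the `next` link to older pages.

```python
import json
import urllib.request


def fetch_json(url: str) -> dict:
    """GET a URL, asking for the ActivityStreams JSON representation."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/activity+json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def notes_from_page(page: dict) -> list[dict]:
    """Pull the posted objects out of one outbox page."""
    posts = []
    for activity in page.get("orderedItems", []):
        if not isinstance(activity, dict):
            continue  # items may also appear as bare URL strings
        obj = activity.get("object")
        # Boosts ("Announce") reference the boosted post by URL only,
        # so keep just "Create" activities that embed the object.
        if activity.get("type") != "Create" or not isinstance(obj, dict):
            continue
        posts.append({
            "id": obj.get("id"),
            "published": obj.get("published"),
            "content": obj.get("content"),
        })
    return posts


if __name__ == "__main__":
    # Placeholder URL for an account's first outbox page.
    page = fetch_json("https://example.social/users/someone/outbox?page=true")
    for post in notes_from_page(page):
        print(post["published"], post["id"])
```

Each post carries a stable `id`, which makes it a natural key for deduplicating against whatever was archived on earlier runs.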

    Under European law (the GDPR), services are required to give you a copy of the data associated with your account if you ask for it. And Mastodon, being a European product, is of course fully compliant. Just go to your account settings (under “Import and export”) and hit the “Request your Archive” button. You could do that once a month or something.

    • bogo@sh.itjust.works (OP) · 1 year ago

      Yes, the “Request Archive” method may be the “don’t over-engineer this, stupid” option I go with.

  • atheken@programming.dev · 1 year ago

    I use Miniflux, and you can configure it to modify feed items. As far as I know, it does not purge anything by default.

    Really, pulling an RSS feed, parsing it, and storing the items is probably 50 lines of bash, and fewer in a general-purpose scripting language.
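
As a rough illustration of how small the merging step can be, here is a Python sketch that folds a freshly fetched feed into a previously saved RSS file, keeping one copy of each item. It assumes both documents are RSS 2.0 with a `<channel>` element; `item_key` and `merge_feeds` are made-up names, and downloading the feed and writing the result to disk are left to the caller.

```python
import xml.etree.ElementTree as ET


def item_key(item: ET.Element) -> str:
    """Identify an item by its <guid>, falling back to its <link>."""
    return item.findtext("guid") or item.findtext("link") or ""


def merge_feeds(archive_xml: str, fresh_xml: str) -> str:
    """Append items from fresh_xml that the archive does not contain yet."""
    archive = ET.fromstring(archive_xml)
    channel = archive.find("channel")  # assumes an RSS 2.0 layout
    seen = {item_key(item) for item in channel.iter("item")}
    for item in ET.fromstring(fresh_xml).iter("item"):
        key = item_key(item)
        if key not in seen:
            channel.append(item)
            seen.add(key)
    return ET.tostring(archive, encoding="unicode")
```

One caveat: ElementTree rewrites namespace prefixes it does not know (for example `media:` becomes `ns0:`) unless they are registered first with `ET.register_namespace`. The output is still valid XML, but the diff in a git repo gets noisier.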