Has anyone here built a Beowulf Cluster? - eviltoast

A university near me must be going through a hardware refresh, because they’ve recently been auctioning off a bunch of ~5 year old desktops at extremely low prices. The only problem is that you can’t buy just one or two. All the auction lots are batches of 10-30 units.

It got me wondering if I could buy a bunch of machines and set them up as a distributed computing cluster, sort of a poor man’s version of the way modern supercomputers are built. A little research revealed that this is far from a new idea. The first ever really successful distributed computing cluster (called Beowulf) was built by a team at NASA in 1994 using off the shelf PCs instead of the expensive custom hardware being used by other super computing projects at the time. It was also a watershed moment for Linux, then only a few yeas old, which was used to run Beowulf.

Unfortunately, a cluster like this seems less practical for a homelab than I had hoped. I initially imagined that there would be some kind of abstraction layer allowing any application to run across all computers on the cluster in the same way that it might scale to consume as many threads and cores as are available on a CPU. After some more research I’ve concluded that this is not the case. The only programs that can really take advantage of distributed computing seem to be ones specifically designed for it. Most of these fall broadly into two categories: expensive enterprise software licensed to large companies, and bespoke programs written by academics for their own research.

So I’m curious what everyone else thinks about this. Have any of you built or admind a Beowulf cluster? Are there any useful applications that would make it worth building for the average user?

  • Kangie@lemmy.srcfiles.zip
    link
    fedilink
    arrow-up
    4
    ·
    edit-2
    9 months ago

    Overall you’re not too far off, but what you’ll tend to find is that it’s a lot of doing similar calculations over and over.

    For example, climate scientists may, for certain experiments, read a ton of data from storage for say different locations and date/times across a bunch of jobs, but each job is doing basically the same thing - you might submit 100000 permutations, or have an updated model that you want to crunch the existing dataset out with.

    The data from each job is then output, and analysed (often with followup batch jobs).

    Edit: here’s an example of a model that I have some real-world experience building to run on one of my clusters: https://www.nrel.colostate.edu/projects/century/

    Swin have some decent, public docs. I think mine are pretty good, but they’re not public so…

    https://supercomputing.swin.edu.au/docs/2-ozstar/oz-partition.html

    There will typically be some interactive nodes in a cluster as well that enable users to log in and perform interactive tasks, like validating that the software will run or, more commonly, to submit jobs to the queue manager.