Multiple processes or threads? And why? - eviltoast

(For context, I’m basically referring to Python 3.12 “multiprocessing.Pool Vs. concurrent.futures.ThreadPoolExecutor”…)

Today I read that multiple cores (parallelism) help with CPU-bound operations, while multiple threads (concurrency) are the better fit when the tasks are I/O-bound.

Is this correct? Would anyone care to elaborate?

At least from a theoretical standpoint. Of course, much real-world work is a mix of both, and I’d better start by profiling to see where the bottlenecks really are.

In case a concrete “algorithm” helps: let’s say I have a function that applies a map-reduce strategy, reading data chunks from a file on disk, computing some averages from the data, and saving the results to a new file.
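
A minimal sketch of that map-reduce shape, under a few assumptions: the data is already in memory as a list of floats rather than read from a file, and the names `partial_sum`, `chunked`, `parallel_mean` and `make_executor` are made up for illustration. It uses `concurrent.futures.ProcessPoolExecutor` as a stand-in for `multiprocessing.Pool`, because it shares the `.map()` interface with `ThreadPoolExecutor` and makes the two easy to swap.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def partial_sum(chunk):
    """Map step: reduce one chunk to a (sum, count) pair."""
    return sum(chunk), len(chunk)


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_mean(values, executor, chunk_size=1000):
    """Reduce step: combine the per-chunk pairs into one overall mean."""
    total, count = 0.0, 0
    for s, n in executor.map(partial_sum, chunked(values, chunk_size)):
        total += s
        count += n
    return total / count if count else float("nan")


def make_executor(kind, workers=4):
    """Return a thread or process pool; both expose the same .map() API."""
    if kind == "process":
        return ProcessPoolExecutor(max_workers=workers)
    return ThreadPoolExecutor(max_workers=workers)


if __name__ == "__main__":
    data = [float(i) for i in range(10_000)]
    # Threads are cheap to start, but in CPython 3.12 the GIL serializes
    # the pure-Python arithmetic inside partial_sum.
    with make_executor("thread") as pool:
        print(parallel_mean(data, pool))
```

Passing `make_executor("process")` instead runs the same map step in separate processes, which allows real parallelism for the CPU-bound part at the cost of pickling each chunk. Which one wins for a given file still depends on where profiling says the time goes.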

  • Fred@programming.dev
    1 month ago

    I can’t remember if threads are core bound or not.

    On Linux, by default they’re not. getcpu(2) says:

       The getcpu() system call identifies the processor and node on which the
       calling thread or process is currently running and writes them into the
       integers pointed to by the cpu and node arguments.  ...
    
       The  information  placed in cpu is guaranteed to be current only at the
       time of the  call:  unless  the  CPU  affinity  has  been  fixed  using
       sched_setaffinity(2),  the  kernel  might  change  the CPU at any time.
       (Normally this does not happen because the scheduler tries to  minimize
       movements  between  CPUs  to keep caches hot, but it is possible.)  The
       caller must allow for the possibility that the information returned  in
       cpu and node is no longer current by the time the call returns.
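
    The same thing can be checked from Python. This is only a sketch: `os.sched_getaffinity` exists on Linux but not on every platform, so the code guards for it, and the helper names `current_affinity` and `affinity_seen_by_worker` are made up for illustration. It prints the set of CPUs the main thread and a freshly started thread are allowed to run on; unless something has restricted the affinity, both are normally the full set.

```python
import os
import threading


def current_affinity():
    """Return the set of CPUs the caller may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)  # 0 means "the caller"
    return None


def affinity_seen_by_worker():
    """Run current_affinity() in a new thread and return what it sees."""
    result = []
    t = threading.Thread(target=lambda: result.append(current_affinity()))
    t.start()
    t.join()
    return result[0]


if __name__ == "__main__":
    print("CPUs visible to Python:", os.cpu_count())
    print("main thread may run on:", current_affinity())
    print("worker thread may run on:", affinity_seen_by_worker())
```

    A new thread inherits the affinity mask of the thread that created it, so the two sets should match unless `os.sched_setaffinity` is called in between.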