Thanks for the additional tests - but I’m still a bit confused by the results on your system… Although I think things are starting to make sense now.
For your smaller test with a 250MB file, I’m surprised to see such small improvements in performance - I was expecting 5× to 10× speedups… Is the file stored on a local (not networked) filesystem? Ideally it would be on a RAM-backed `tmpfs` filesystem, to rule out any issues with the filesystem itself…
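If you’re not sure what kind of filesystem the file sits on, `df -T <path>` will tell you, or you can query it programmatically. Here’s a minimal Linux-only sketch (purely illustrative, not part of MRtrix3) using `statfs()` to distinguish `tmpfs` from, say, NFS:

```cpp
// Report the filesystem type a given path lives on (Linux-only sketch).
// TMPFS_MAGIC and NFS_SUPER_MAGIC come from <linux/magic.h>.
#include <sys/vfs.h>
#include <linux/magic.h>
#include <cstdio>

int main (int argc, char* argv[]) {
  if (argc != 2) {
    std::fprintf (stderr, "usage: %s <path>\n", argv[0]);
    return 1;
  }
  struct statfs s;
  if (statfs (argv[1], &s) != 0) {
    std::perror ("statfs");
    return 1;
  }
  switch (s.f_type) {
    case TMPFS_MAGIC:     std::printf ("%s: tmpfs (RAM-backed)\n", argv[1]); break;
    case NFS_SUPER_MAGIC: std::printf ("%s: NFS (network filesystem)\n", argv[1]); break;
    default:              std::printf ("%s: filesystem type 0x%lx\n", argv[1], (unsigned long) s.f_type); break;
  }
  return 0;
}
```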
For your larger tests, I hadn’t realised you were trying to copy 128GB of data in parallel… That will indeed stretch your system’s scheduling and buffering mechanisms. But it all depends on exactly what you did. If you did copy these test data over to `/tmp` first, then your `/tmp` folder can’t possibly be a RAM-backed `tmpfs` filesystem: 128GB would already have filled your RAM (and as I understand it, these filesystems are typically constrained to take up less than half of the available RAM, so it would have maxed out at 64GB). So I assume you placed these files on a standard filesystem - hopefully at least a local one (?). If these files were left on the GPFS filesystem, it’s going to be very difficult to tease apart issues specific to that setup from issues related to your system’s ability to handle all the concurrency and task-switching that running 32 instances of `mrconvert` will generate.
Assuming these files were on a local filesystem, the fact that the RAM requirements of these jobs exceed the RAM on your system will create issues all of their own. To run at full speed, you’d ideally need 2×4GB of available RAM per instance of `mrconvert` (i.e. space for both the input and output files), and it runs fastest when the input file has been used recently so that it remains buffered in RAM (e.g. if you run the same command twice in a row, the second invocation is often orders of magnitude faster). So for your test to run quickly, you’d need more than 2×4×32 = 256GB of RAM - anything less implies some disk I/O to swap data pages in & out of RAM. So what you’re really testing here is the ability of your system to schedule data fetches and writes between disk (or wherever the files reside) and system RAM - a hard problem when all these jobs want to access different parts of the input files and write to different parts of the output files at the same time, and it’s not possible to fit it all in RAM concurrently. This is actually a great example of why I recommend that users not run multiple MRtrix3 jobs concurrently: in this case, the combined RAM requirements exceed the system RAM, and all the swapping needed to satisfy these concurrent requests ends up slowing everyone down…
Note that this RAM usage is not a hard requirement as such: since we’re using memory-mapping, the system is free to manage all of this in whichever way it sees fit, without this formally counting against the application’s RAM usage. This is because technically the data for that memory-mapped region resides on disk, and what’s on disk is the proper version of the data. The system merely provides a convenient interface to it by mapping that region into your process’s address space. The RAM is required to buffer the data while it’s being used, but the system can decide to allocate that RAM to some other process if it needs to. So that RAM will show up in monitoring utilities as buffered rather than actually used. But it is still the case that RAM will be needed during operation, and that in your case ideal performance would require more than you have.
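To make that a bit more concrete, here’s a rough sketch of what read-only memory-mapped input boils down to on a POSIX system - this illustrates the mechanism only, it is not MRtrix3’s actual code:

```cpp
// Rough sketch of read-only memory-mapped input (POSIX; illustrative only).
// The mapping gives us a pointer into the file's contents; the kernel pages
// data into RAM on demand, and is free to drop those pages again under memory
// pressure, since the file on disk remains the authoritative copy.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main (int argc, char* argv[]) {
  if (argc != 2) return 1;
  int fd = open (argv[1], O_RDONLY);
  if (fd < 0) { std::perror ("open"); return 1; }
  struct stat st;
  fstat (fd, &st);
  void* addr = mmap (nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED) { std::perror ("mmap"); return 1; }
  // Touching bytes here triggers page faults that pull data into the page cache;
  // that memory shows up as "buffered/cached", not as this process's own usage.
  const unsigned char* data = static_cast<const unsigned char*> (addr);
  unsigned long sum = 0;
  for (off_t n = 0; n < st.st_size; ++n)
    sum += data[n];
  std::printf ("checksum: %lu\n", sum);
  munmap (addr, st.st_size);
  close (fd);
  return 0;
}
```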
Another note: the situation would be worse still when operating on a network filesystem (or more precisely, any filesystem MRtrix3 determines to be inappropriate for random-access writes to a memory-mapped region). In this case, MRtrix3 will allocate a RAM buffer for the output file, to be written back as one block when the image is closed. So in your case, your 32 `mrconvert` jobs would immediately request 32×4 = 128GB of RAM, taking up all of your RAM. This immediately implies the system will have to swap to disk to allow processing, which will slow things down; how much it slows things down will depend, again, on how well your system can handle all this pressure. This would be compounded by the need to also buffer the incoming data (which is accessed via memory-mapping), which will take up some additional RAM - although hopefully not the full 128GB. But you can see that if the system is trying to free some RAM so one instance of `mrconvert` can write some output, it might decide to drop some of the RAM used to hold the incoming data (it’s read-only memory-mapped, so it can always be retrieved from storage); but if it does that just as another instance of `mrconvert` was about to read from that region, the data will need to be fetched again, and so on. So there’s plenty of scope for all of these jobs to get in each other’s way when resources are limited like this. I think this alone would explain the issues you’ve reported here.
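For illustration, here is roughly what that delayed write-back strategy looks like - again a simplified sketch rather than MRtrix3’s actual code. The crucial difference from the memory-mapped case is that the buffer is ordinary process memory, allocated up front for the full output size, and it can’t simply be dropped and re-fetched the way memory-mapped pages can:

```cpp
// Sketch of the delayed write-back strategy described above (illustrative only):
// the full output image is assembled in a RAM buffer, then written out in one go
// when the image is "closed". The buffer counts as real, non-reclaimable process
// memory for the whole run - hence 32 jobs x 4GB = 128GB requested up front.
#include <cstdio>
#include <cstdint>
#include <cstddef>
#include <vector>

void write_back (const char* path, const std::vector<uint8_t>& buffer) {
  std::FILE* f = std::fopen (path, "wb");
  if (!f) { std::perror ("fopen"); return; }
  std::fwrite (buffer.data(), 1, buffer.size(), f);   // single sequential write at the end
  std::fclose (f);
}

int main () {
  const size_t output_size = size_t(4) * 1024 * 1024 * 1024;   // ~4GB, as per one job here
  std::vector<uint8_t> output (output_size);   // allocated in RAM immediately
  // ... processing would fill 'output' here; scattered writes all land in RAM ...
  write_back ("result.dat", output);
  return 0;
}
```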
Note also that if MRtrix3 were designed to hold both input and output data in RAM during processing, this would definitely exceed your system’s RAM, resulting in either an immediate abort (if your swap space isn’t sufficient to hold the additional data), or the same (or worse) problem with slow performance, as the system would need to constantly swap data pages in & out of RAM from a slow spinning disk. If you’re interested, you can test this by using compressed input & output images: in this case, the input file can’t be memory-mapped, so it has to be uncompressed into RAM (or compressed from RAM for the output). So in this case your RAM usage would also include the full allocation for the uncompressed images, i.e. the full 256GB.
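For what it’s worth, the reason compressed images can’t be memory-mapped is that the bytes on disk aren’t the image data itself. A rough sketch of what loading e.g. a `.nii.gz` involves (using zlib directly here purely for illustration; the helper and file handling are made up, this is not MRtrix3’s code):

```cpp
// Illustrative sketch: a compressed image has to be inflated into an ordinary
// RAM buffer before it can be used, so the full uncompressed size counts against
// the process's RAM usage (unlike a memory-mapped, uncompressed file).
#include <zlib.h>
#include <cstdio>
#include <cstdint>
#include <vector>

std::vector<uint8_t> load_compressed (const char* path) {
  std::vector<uint8_t> data;
  gzFile f = gzopen (path, "rb");
  if (!f) { std::fprintf (stderr, "failed to open %s\n", path); return data; }
  std::vector<uint8_t> chunk (1 << 20);
  int n;
  while ((n = gzread (f, chunk.data(), chunk.size())) > 0)
    data.insert (data.end(), chunk.begin(), chunk.begin() + n);   // uncompressed data grows in RAM
  gzclose (f);
  return data;   // full uncompressed size now held in process memory
}

int main (int argc, char* argv[]) {
  if (argc != 2) return 1;
  std::vector<uint8_t> image = load_compressed (argv[1]);
  std::printf ("uncompressed size: %zu bytes\n", image.size());
  return 0;
}
```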
OK, so I think this just about covers all I had to say… One last thing though: you mention one of the `mrconvert` jobs terminating with signal 7. This is a bus error, which we find typically happens when there is not enough space on your storage device (particularly the one holding temporary files). I have to admit that I’ve never fully understood what a bus error actually is, or under what circumstances to expect one… But I wouldn’t be surprised to find that the storage device holding these files while you’re running these jobs is simply too small to fit all of the input and output files, so that one of the jobs fails when it tries to allocate space for its output. Given that all of these instances are launched in parallel, which one ends up failing will be more or less random. Hopefully that explains that one… (?)
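For the record, one well-documented way to get a SIGBUS with memory-mapped output - and it fits the out-of-space explanation above - is writing into a mapping whose backing file can’t actually be allocated on disk. A hypothetical sketch of the mechanism (not what `mrconvert` does verbatim):

```cpp
// Illustrative sketch of how a bus error (SIGBUS) can arise with memory-mapped
// output: ftruncate() extends the file without reserving disk blocks (a sparse
// file), so the failure only surfaces later, when a written page needs real
// storage behind it. If the filesystem is full at that point - or the access
// lands beyond the real end of the file - the kernel delivers SIGBUS rather
// than returning a "no space left on device" error from a write call.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstddef>

int main () {
  const size_t size = size_t(1) << 30;                  // pretend 1GB output image
  int fd = open ("output.dat", O_RDWR | O_CREAT, 0644);
  if (fd < 0) { std::perror ("open"); return 1; }
  ftruncate (fd, size);                                  // sparse: no blocks reserved yet
  void* addr = mmap (nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (addr == MAP_FAILED) { std::perror ("mmap"); return 1; }
  unsigned char* out = static_cast<unsigned char*> (addr);
  for (size_t n = 0; n < size; ++n)
    out[n] = 0;    // if the filesystem can't allocate the blocks, this raises SIGBUS
  munmap (addr, size);
  close (fd);
  return 0;
}
```

Because the signal is raised on the memory access itself rather than on an explicit write call, there’s no error code for the application to catch and report, which is why it just shows up as a terminated job.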