MRtrix mmap() / GPFS issue

Dear MRtrix developers,

we recently encountered an issue with several MRtrix3 programs (mrconvert, dwiextract, …) running on CentOS 7 using a GPFS filesystem on a supercomputer. When multiple instances of a program are executed on one compute node, or when a program is executed with multiple threads (i.e. in any case except when the ‘-nthreads 0’ flag is set), execution often becomes extremely slow or halts altogether. Furthermore, when the job hits the scheduler’s walltime limit and gets terminated, zombie processes remain on the compute nodes and cause them to crash.

We have verified that the problem does not occur when the MAP_PRIVATE flag is used for mmap(). This appears to be a bug in GPFS; we have contacted IBM about it. As it could take several months until there is a fix, we considered two other options:

  • One could change the mmap() flag from MAP_SHARED to MAP_PRIVATE. Based on our (very limited) understanding of the code, this should not be a problem, since e.g. mrconvert uses threads and no inter-process communication is required.

  • mrconvert has a mechanism to detect network filesystems, and falls back to using standard I/O rather than file-mapping on such systems. GPFS is not detected as such, so this special path is not triggered. We could try to change the code so that it uses this “delayed writeback” mode. However, we do not know whether that would have an impact on performance.

Can you give us some feedback on whether either of these options is viable, or suggest alternatives on how to proceed? At the moment we are running mrconvert with only one instance per compute node, which is not an efficient use of the supercomputing resources. In the case of dwipreproc, multiple instances on a single node can be executed normally as long as multithreading is turned off.

Thanks!

Michael

Yes, writing to memory-mapped files over network filesystems is a Bad Idea. I think it would be best all round to detect GPFS as a network filesystem - we just need to add its type ID to this line of the code. According to this (line 40), you’d just need to add:

      || fsbuf.f_type == 0x47504653 /* GPFS */

at the end of that particular line in file/mmap.cpp mentioned above.
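For context, that check simply compares the filesystem’s magic number (as reported by fstatfs()) against a list of known network filesystems. Conceptually it amounts to something like this - a simplified sketch with made-up names, not the literal MRtrix3 code:

      #include <sys/vfs.h>    // fstatfs() and struct statfs

      // illustrative constants only - the actual list in mmap.cpp differs:
      constexpr long NFS_MAGIC  = 0x6969;
      constexpr long SMB_MAGIC  = 0x517B;
      constexpr long GPFS_MAGIC = 0x47504653;   // the ASCII bytes 'G' 'P' 'F' 'S'

      bool is_network_filesystem (int fd)
      {
        struct statfs fsbuf;
        if (fstatfs (fd, &fsbuf))
          return false;    // can't tell - assume local
        return fsbuf.f_type == NFS_MAGIC
            || fsbuf.f_type == SMB_MAGIC
            || fsbuf.f_type == GPFS_MAGIC;
      }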

Maybe you can try this modification, ./build again, and see if that sorts out the problem…?

It works, thanks a lot!
Byte-by-byte comparison showed that the results of the two mrconvert versions (patched and unpatched) are identical.

One strange observation, though: when multithreading is allowed, it runs much slower than when “-nthreads 0” is set (tested with number of instances = 2/3 of the number of processors). That is not directly a problem if one knows about it, but it might point to some underlying issue.

Good to hear. In which case, I’ll commit that change to the main repo.

So that is very unexpected. We try quite hard to avoid anything that would cause this kind of problem, and I really struggle to see how that could happen. For mrconvert in particular, the copy is performed according to the strides of the input image, so the read should be more or less linear through the input data file (assuming it’s memory-mapped, it’ll definitely be linear and single-threaded if it needs to be loaded to RAM first, e.g. for compressed NIfTI, or certain types of DICOM images). So the filesystem backend handling the memory mapping should have no trouble optimising for this (e.g. with read-ahead). In my experience, mrconvert always runs faster with multi-threading - with benefits even when the number of threads exceeds the core count.
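To illustrate what I mean by copying according to the strides of the input image, the copy loop conceptually looks like this (heavily simplified - the real code is generic over dimensions and datatypes):

      #include <cstddef>

      // axes are ordered so that the axis with the smallest input stride
      // varies fastest, so reads through the memory-mapped input proceed
      // near-sequentially - ideal for filesystem read-ahead:
      void copy_by_input_strides (const float* in, float* out,
                                  const std::size_t dim[3],
                                  const std::ptrdiff_t in_stride[3],
                                  const std::ptrdiff_t out_stride[3])
      {
        for (std::size_t k = 0; k < dim[2]; ++k)
          for (std::size_t j = 0; j < dim[1]; ++j)
            for (std::size_t i = 0; i < dim[0]; ++i)
              out[i*out_stride[0] + j*out_stride[1] + k*out_stride[2]] =
                in[i*in_stride[0] + j*in_stride[1] + k*in_stride[2]];
      }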

The only reason I could see for multi-threading causing issues is if the filesystem-handling code needs to somehow serialise data requests in a blocking, synchronous fashion, so that the multiple threads get in each other’s way. But that would be a very odd implementation: the kernel should manage this pretty well, allowing all threads to access memory pages that have already been loaded, and only invoking the GPFS filesystem backend for memory pages not already loaded/cached in RAM - unless you’re mounting this filesystem in synchronous mode…?

On that last point: if you were to use the filesystem in synchronous mode, that means every write would need to be committed before control is handed back to the application, and every read needs a round-trip to the fileserver(s) to ensure the contents haven’t been modified in the meantime. That would indeed be horrendously inefficient, and involve a lot of network round-trips with plenty of opportunity for bottlenecks (a general issue with synchronous filesystems). If that is the issue, I would recommend you reassess the need for such synchronous access - you would only need it for absolutely mission-critical applications, where you expect concurrent reads/writes to the same file from multiple locations in normal operation. This is rarely a concern for the vast majority of applications - the concern is typically to ensure exclusive file creation (so that two applications don’t end up creating temporary files with the same name, for example). Using asynchronous access (the default in most cases) allows the kernel to use aggressive local caching to minimise latency on reads, and also to minimise the number of writes (by only flushing to storage when necessary). I’m not sure whether that might be the issue here, but I think it’s worth looking into, if only to get to the bottom of the problem…
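To make the cost concrete: a synchronous mount essentially imposes on every file what the O_SYNC flag requests for a single one - each write() blocks until the data has been committed to storage. Illustrative only:

      #include <fcntl.h>
      #include <unistd.h>

      int main ()
      {
        // with O_SYNC, write() only returns once the data has been
        // physically committed - on a network filesystem, that means a
        // full round-trip to the fileserver for every single write:
        int fd = open ("output.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
        const char buf[] = "data";
        write (fd, buf, sizeof (buf));
        close (fd);
        return 0;
      }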

Looking through your post again, I was trying to figure out how your suggestion of using MAP_PRIVATE might work. The problem is that a private mapping does not alter the file contents, so it can’t be used for read-write access. But we could use it for read-only access, which by the sounds of things is the most likely thing hindering multi-threaded performance in your case. I’ve just pushed a commit on this branch to use MAP_PRIVATE for the read-only case - can you give it a try and see if that works for you? The read-write case is no longer a problem anyway, since GPFS is now correctly identified as a network filesystem and so doesn’t use memory-mapping at all. Hopefully that’ll resolve your remaining performance issues.

To try it out:

      git fetch
      git checkout detect_gpfs_as_network_fs
      ./build
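For reference, the read-only path on that branch boils down to something like this (a simplified sketch, not the exact code in file/mmap.cpp):

      #include <fcntl.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <unistd.h>

      // a private mapping never writes back to the file, which is fine
      // here since the mapping is read-only anyway - read-write access
      // would still require MAP_SHARED:
      void* map_readonly (const char* path, size_t& size)
      {
        int fd = open (path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat (fd, &st)) { close (fd); return nullptr; }
        size = st.st_size;
        void* addr = mmap (nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        close (fd);    // the mapping remains valid after closing the fd
        return addr == MAP_FAILED ? nullptr : addr;
      }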

Thanks a lot for the update and the thought you are putting into this. We tried the patched version with the same result.

This seems to be the problem: when the number of threads is not explicitly specified, each mrconvert process spawns 48 threads. For example, running 16 instances of mrconvert creates 768 threads on a node with 24 CPUs (48 hardware threads).
This is because MRtrix uses std::thread::hardware_concurrency() to determine the default number of threads, which always returns 48 on this supercomputer: each mrconvert instance is started independently, so each process assumes it has exclusive use of the node. The gurus at the supercomputing facility assume that the contention and task-switching overhead lead to the decreased performance. We didn’t see a way on the user side to prevent this other than explicitly specifying the number of threads. On the other hand, running only a single instance of mrconvert per node wouldn’t make optimal use of the available resources, as execution time for a single instance is only about twice as fast as when 16 or more instances with multithreading disabled are run.
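For reference, what the default resolves to on a given node is trivial to check (a throwaway snippet, nothing MRtrix-specific):

      #include <iostream>
      #include <thread>

      int main ()
      {
        // on our nodes this prints 48 regardless of how many CPUs the
        // job scheduler has actually allocated to the job:
        std::cout << std::thread::hardware_concurrency() << "\n";
        return 0;
      }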

Some more remarks from the supercomputing gurus:

  • You are of course right regarding the use of MAP_SHARED for read/write access. That incorrect suggestion was based on a misunderstanding of the flag on our side.
  • The filesystems are not mounted with sync flags, and they are unsure whether that would even apply to GPFS, since GPFS largely bypasses the Linux page-caching mechanisms.

Not sure if it is related to this, but I still get frequent hangs with the tckgen command (with a large number of fibers) writing to glusterfs. The tckgen process becomes an unkillable zombie process. As far as I can tell, glusterfs should be detected as a network filesystem in mmap.cpp, as it uses FUSE, right?

No, that one seems unrelated. tckgen writes from a single thread using standard C++ std::fstream - I can’t see that there would be anything weird there…?

@bjeurissen: just thinking about your issue with tckgen hanging on glusterfs - there’s a chance it might be due to tckgen's current implementation of buffered write-back. If you look at this point in the code, you’ll note that it opens the file in append mode at every write-back (every 16MB by default) and closes it again after the write. The reason for that was to keep the number of concurrently open files low for certain applications like tcknodeextract that might hold a lot of output files open (see the commit message for details). I’m not sure it necessarily has anything to do with it, but it might be worth investigating. If it turns out to be the problem, that would almost certainly be a bug in glusterfs - I can’t see anything odd about what’s going on in our code…
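The pattern in question amounts to something like this (schematic, with made-up names - the real code is a bit more involved):

      #include <fstream>
      #include <string>
      #include <vector>

      // the output file is re-opened in append mode at every write-back
      // (every 16MB by default), then closed again straight away:
      void flush_buffer (const std::string& filename,
                         const std::vector<char>& buffer)
      {
        std::ofstream out (filename, std::ios::binary | std::ios::app);
        out.write (buffer.data(), buffer.size());
        // the file is closed as 'out' goes out of scope - repeated
        // open/append/close cycles may be what trips up glusterfs
      }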

@Mike: sorry about the delay in responding (I’m away on leave currently). There’s a lot to discuss here, but maybe this isn’t the best forum (might be an idea to file an issue on GitHub about it instead). Briefly though:

MRtrix3 is indeed designed to use all available CPU resources - this is a design choice I made many years ago, for reasons I still think are right in most cases. I think it’s generally a better use of resources to run one application on all cores rather than many instances of a single-threaded application concurrently, since a single application will have a smaller overall RAM footprint, and will also typically have lower CPU cache requirements than all the other processes combined. So for example, on a 16-core system, I’d expect 16 sequential, fully multi-threaded jobs to complete faster than 16 concurrent single-threaded jobs. I would always recommend that users avoid running concurrent MRtrix3 jobs - they’ll just end up slowing themselves down (and often everyone else too).

This I think applies to most CPU-intensive workloads. However, there are commands that are mostly limited by IO bandwidth, and mrconvert is definitely one of them. That said, on a standard desktop (mine, for example), I definitely observe large benefits from multi-threading: on my 6-core i7, I get a 7× speedup on a straight mrconvert using multi-threading (once the file has been buffered in RAM by the system) - so I’m very surprised by your report of only a 2× speed-up (?). So even though it ought to be IO-bound, there are clear benefits to multi-threading, presumably due to the overhead of the datatype conversions and other operations that happen under the hood in mrconvert. But even if it were IO-bound, I’d still recommend against launching concurrent mrconvert jobs, since it wouldn’t make sense to try to parallelise a set of jobs that is limited by the IO bandwidth of the system - something that inherently can’t be parallelised (although I note GPFS does try to do something like that…).

I’m also dubious about the suggestion that the slowdown is due to thread contention or task switching. To check this, I tried this little experiment:

  • make 32 copies of a large-ish (190M) input file, named image-N.mif, into a new folder (/tmp/testing/) located in /tmp, which on my system is mounted on a tmpfs filesystem (entirely RAM-backed, so no filesystem overhead whatsoever):

      for n in {1..32}; do cp dwi.mif /tmp/testing/image-$n.mif; done
    
  • try running 32 parallel instances of mrconvert image-N.mif image-N-out.mif, and time the result. The command is a little obscure, but this is one of those things bash & xargs can do for you when you ask them nicely…

      time ( echo image-{1..32}{,-out}.mif | xargs -P 32 -n 2 mrconvert )
    

I tried the above as-is, then adding -nthreads 0 after mrconvert, then adding -nthreads 32 after mrconvert. This is running on my 6-core i7-4930K CPU (with hyper-threading), which by default will run 12 threads. There was no detectable difference in runtime between any of these, despite them running with 32×12=384, 32×1=32, and 32×32=1024 threads respectively, on a node with 12 hardware threads. I appreciate my CPU doesn’t have the same level of concurrency that yours do (especially if your nodes are dual-socket?), but I’d nonetheless find it surprising that a job that should be strongly IO-bound in your case (parallel mrconvert from/to network storage) would run slowly because of this issue.

For the record, the results are similar-ish on a live RAID1 ext4 filesystem with dm-crypt (my home folder), but runtimes are generally longer (as expected) and much more variable (seems to be due to temporary locks while the kernel flushes pending writes - presumably the amount of data being written exceeds the kernel’s limit on dirty pages).

So all this to say, I really do think there is something specific about GPFS that is causing this problem, particularly its handling of memory-mapped files. If what you say about GPFS bypassing the Linux page cache mechanisms is true, then I wouldn’t be surprised to learn that the surgery required to graft this filesystem so deep into the guts of the Linux kernel has introduced a few unexpected side effects. I’ve generally been pretty impressed with how well Linux manages this stuff; I’m not sure why IBM would feel the need to bypass it altogether - although I suspect the reasons are way above my pay grade…

Otherwise, one last thing: you’re obviously aware that you can use the -nthreads option to control the number of threads for any given job, but you can also set the NumberOfThreads entry in the config file, or use the MRTRIX_NTHREADS environment variable if needed. Just in case that helps…
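Conceptually, the default is resolved along these lines (an illustrative sketch only - the exact precedence between these sources may differ in the actual code):

      #include <cstdlib>
      #include <string>
      #include <thread>

      std::size_t default_nthreads ()
      {
        // the -nthreads command-line option would take precedence;
        // failing that, check the MRTRIX_NTHREADS environment variable:
        if (const char* env = std::getenv ("MRTRIX_NTHREADS"))
          return std::stoul (env);
        // ... then the NumberOfThreads config file entry, and finally
        // fall back to whatever the hardware reports:
        return std::thread::hardware_concurrency();
      }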


Hi @jdtournier,

thanks a lot for your informative text!

Yes, it’s a dual-socket Intel Xeon E5-2680 v3 (Haswell) node: 24 cores @ 2.5 GHz.

Here are the results from the little experiment (using a ~250 MB DWI):

Variant 1 (w/o nthreads flag):

      real 0m2.713s
      user 1m43.363s
      sys  0m9.085s

Variant 2 (-nthreads 0):

      real 0m4.103s
      user 1m24.375s
      sys  0m6.817s

Variant 3 (-nthreads 32):

      real 0m3.144s
      user 1m42.608s
      sys  0m9.811s

This perfectly confirms your thoughts! However, when I used 32 full DWI input files (each 4 GB, from the HCP data set) things looked different. Here I had to time the runs manually because, strangely, every single time I executed your command with the large DWIs, exactly one mrconvert instance crashed with the error message:

      mrconvert: [ 0%] copying from “image-13.mif” to “image-13-out.mif”… xargs: mrconvert: terminated by signal 7

(It was never the same image-*.mif, so a particular file wasn’t the cause.)

Anyway, these were the approximate runtimes:

  • variant 1: 17 minutes,
  • variant 2: 5 minutes,
  • variant 3: 10 minutes

Hmm, maybe main memory is the problem here? Main memory is 128 GB, so it would be quite full with 32 instances (32×4 = 128)… But the problem also existed when I used 16 or fewer instances. Anyway, I found a way to work with it, so from my side I’m pretty happy with how things work right now 🙂

Thanks again for your help, and sorry if the problem is actually much more trivial than previously thought (if it really is the main memory).

Best,
Mike

Edit: Ignore my speculations about main memory - that makes no sense, since the additional threads shouldn’t consume that much more memory… Let me know if you need me to do more tests on this machine.

Thanks for the additional tests - but I’m still a bit confused by the results on your system… Although I think things are starting to make sense now.

For your smaller test with a 250MB file, I’m surprised to see such small improvements in performance. I was expecting 5× to 10× speedups… Is the file stored on a local (not networked) filesystem? Ideally it would be on a RAM-backed tmpfs filesystem to rule out any issues with the filesystem itself…

For your larger tests, I hadn’t realised you were trying to copy 128GB of data in parallel… That will indeed stretch your system’s scheduling and buffering mechanisms. But it all depends on exactly what you did. If you did copy these test data over to /tmp first, then your /tmp folder can’t possibly be a RAM-backed tmpfs filesystem - that would already have filled your RAM (and as I understand it, these filesystems are typically constrained to take up less than half your available RAM, so it would have maxed out at 64GB). So I assume you placed these files on a standard filesystem, but hopefully at least a local one (?). If these files were left on the GPFS filesystem, it’s going to be very difficult to tease apart issues specific to that setup from issues related to your system’s ability to handle all the concurrency and task-switching that running 32 instances of mrconvert will generate.

Assuming these files were on a local filesystem, the fact that the RAM requirements of these jobs exceed the RAM on your system will create issues all of their own. To run at full speed, you’d ideally need 2×4GB of available RAM per instance of mrconvert (i.e. space for both the input and output files), and it would run fastest when the input file has been used recently so that it remains buffered in RAM (e.g. if you run the same command twice in a row, the second invocation is often orders of magnitude faster). So for your test to run quickly, you’d need more than 2×4×32=256GB of RAM - anything less implies some disk IO to swap data pages in and out of RAM. So what you’re really testing here is the ability of your system to schedule data fetches and writes between disk (or wherever the files reside) and system RAM - a hard problem to solve when all these threads want access to different parts of the input files and want to write to different parts of the output files, all at the same time, and it’s not possible to fit it all in RAM concurrently. This is actually a great example of why I’d recommend users not run multiple MRtrix3 jobs concurrently: in this case, the combined RAM requirements exceed the system RAM, and all the swapping needed to satisfy all these concurrent requests ends up slowing everyone down…

Note that this RAM usage is not a hard requirement as such: since we’re using memory-mapping, the system is free to manage all of this in whichever way it sees fit, without this formally counting against the application’s RAM usage. This is because technically the data for that memory-mapped region resides on disk, and what’s on disk is the proper version of the data. The system merely provides a convenient interface to it by mapping that region into your process’s address space. The RAM is required to buffer the data while it’s being used, but the system can decide to allocate that RAM to some other process if it needs to. So that RAM will show up in monitoring utilities as buffered rather than actually used. But it is still the case that RAM will be needed during operation, and that in your case ideal performance would require more than you have.

Another note: the situation would be worse again when operating on a network filesystem (or more precisely any filesystem MRtrix3 determines to be inappropriate for random-access writes to a memory-mapped region). In this case, MRtrix3 will allocate a RAM buffer for the output file, to be written back as one block when the image is closed. So in your case, your 32 mrconvert jobs would immediately request 32×4=128GB of RAM, taking up all of your RAM. This immediately implies the system will have to swap to disk to allow processing, which will slow things down. How much it slows things down will depend, again, on how well your system can handle all this pressure. This would be compounded by the need to also buffer up the incoming data (which is accessed via memory-mapping), which will take up some additional RAM - although hopefully not the full 128GB. But you can see that if the system is trying to free some RAM so one instance of mrconvert can write some output, it might decide to drop some of the RAM used to hold the incoming data (it’s read-only memory-mapped, so it can always be retrieved from storage), but if it does that just as another instance of mrconvert was about to read from that region, it’ll need to be retrieved again, etc. So there’s plenty of scope for all of these threads to get in each others’ way when you have limited resources like this. I think this alone would explain the issues you’ve reported here.

Note also that if MRtrix3 was designed to hold both input and output data in RAM during processing, this would definitely exceed your system’s RAM, resulting in either an immediate abort (if your swap space isn’t sufficient to hold the additional RAM), or the same (or worse) problem with slow performance as the system would need to constantly swap data pages in & out of RAM from a slow spinning disk. You can test this if you’re interested by using compressed input & output images: in this case, the input file can’t be memory-mapped, so has to be uncompressed into RAM (or compressed from RAM for the output). So in this case your RAM usage would also include the full allocation for the uncompressed images, i.e. the full 256GB.

OK, so I think this just about covers all I had to say… One last thing though: you mention one of the mrconvert jobs being terminated by signal 7. This is a bus error, which we find typically happens when there is not enough space on your storage device (particularly the one holding temporary files). I have to admit that I’ve never really understood what a bus error actually is and under what circumstances to expect one… But I wouldn’t be surprised to find that the storage device you’re using to hold these files while you’re running these jobs is just too small to fit all of the input and output files, so that one of them fails to allocate space for its output when writing to it. Given that all of these instances are launched in parallel, which one ends up failing will be more or less random. Hopefully that’ll explain that one… (?)