CUDA processing in MRtrix?


#1

Hello developers,

I have heard from some of my colleagues that CUDA processing can speed up DWI processing considerably, and tractography in particular. This experience comes from some custom code someone wrote for FSL tools a couple of years ago. Word is that whole-brain tractography went down from hours to 15 minutes.

I thought CUDA processing would have taken over by now, but surprisingly there is almost no information around. It looks like only eddy makes use of CUDA within the MRtrix workflow.

Do you have thoughts on this? Do you think that running tckgen with 2000+ CUDA cores would bring a performance boost compared to modern CPUs (e.g., a Xeon Gold 6142 with 16 cores)? Do you plan to add CUDA tools to the MRtrix suite?

I am asking because I have a couple of good Tesla cards that I could integrate into the system, but I am not sure whether it's worth it.

Thank you.
Dorian


#2

Yes, there are applications that can really benefit from the massively parallel processing capabilities of modern GPUs. In the FSL case, I think the primary speedup was in bedpostx rather than the tracking itself. I can see that an application like bedpostx would benefit from this, given the nature of the processing (massive MCMC sampling over the data for each voxel). In MRtrix3, the FOD estimation is typically not the bottleneck though.
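To illustrate why a workload like bedpostx maps well onto a GPU, here's a toy sketch (plain Python, with made-up names like `mcmc_sample_voxel` — this is not actual bedpostx code): every voxel runs the same sampling loop on its own data, with identical control flow and no dependencies between voxels, which is exactly the fine-grained, uniform parallelism GPUs are built for.

```python
import random

def mcmc_sample_voxel(signal, n_iter=100, seed=0):
    """Toy stand-in for per-voxel MCMC sampling: estimate a voxel
    parameter (here just the signal mean) by a random walk that only
    accepts proposals that improve the fit. Each voxel is fully
    independent, so all voxels could be sampled concurrently."""
    rng = random.Random(seed)
    estimate = 0.0
    target = sum(signal) / len(signal)
    for _ in range(n_iter):
        # identical control flow for every voxel: no thread divergence
        proposal = estimate + rng.gauss(0.0, 0.1)
        if abs(proposal - target) < abs(estimate - target):
            estimate = proposal
    return estimate

# every voxel runs the same loop on its own data -- trivially parallel
voxels = [[1.0, 1.2, 0.8], [2.0, 2.1, 1.9]]
estimates = [mcmc_sample_voxel(v, seed=i) for i, v in enumerate(voxels)]
```

The key property is the map over `voxels`: each iteration touches only its own contiguous slice of the data, so on a GPU the memory accesses can be coalesced and all threads stay busy.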

The tracking part might benefit from CUDA, but it's not a clear-cut winner – not clear enough for me to consider investing what would be an enormous amount of effort. There are so many options in tckgen now that would be very complex to convert into a GPU implementation, I can't see this being a trivial job… But more to the point, the nature of data processing during tractography is not a great match to the way GPUs like to operate. Yes, it's massively parallelisable (and already is), but not at the fine granularity that GPUs expect: there's a lot going on for each step during tracking.

Also, the stochastic nature of the algorithms makes it almost impossible to optimise data access patterns, which become really critical on these systems: it's difficult to predict which data the algorithm will need next (i.e. which voxels the streamline will venture into at the next step). This means most threads will sit idle waiting for data to be fetched from RAM. And the typical workaround for this (increasing the number of concurrent threads to improve the chances of at least some being able to run) also won't work well here, given the complexity of each thread and the number of registers required: swapping threads also means swapping out registers, which causes more fetch/store operations to RAM.
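The data-dependence problem can be seen in the shape of the tracking loop itself. Here's a minimal sketch (plain Python, with a hypothetical `fod_lookup` callable standing in for FOD interpolation — not MRtrix code): the voxel fetched at each step is only known once the previous step has landed, so the access pattern cannot be predicted or coalesced ahead of time.

```python
def track_streamline(fod_lookup, seed_pos, seed_dir, step=0.5, max_steps=1000):
    """Toy deterministic tracking loop. `fod_lookup(pos, dir)` returns
    the next direction, or None to terminate (e.g. low FOD amplitude).
    Which part of the image it reads depends entirely on where the
    streamline has wandered -- the unpredictable access pattern that
    leaves GPU threads stalled on memory fetches."""
    pos, direction = seed_pos, seed_dir
    points = [pos]
    for _ in range(max_steps):
        # data-dependent fetch: the voxel we need is only known now
        new_dir = fod_lookup(pos, direction)
        if new_dir is None:  # termination criterion reached
            break
        direction = new_dir
        pos = tuple(p + step * d for p, d in zip(pos, direction))
        points.append(pos)
    return points

def toy_fod(pos, direction):
    # toy field: track straight along x, terminate after 3 mm
    return (1.0, 0.0, 0.0) if pos[0] < 3.0 else None

streamline = track_streamline(toy_fod, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

Note also that streamlines terminate after different numbers of steps, so threads in a GPU warp would finish at different times and sit idle — a divergence problem on top of the memory one.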

If you’re interested, there was an ISMRM 2012 abstract that did implement the tractography from the old MRtrix 0.2 version (streamtrack) on the GPU, but if you look at the details, the speedup factor was not as impressive as I would have liked. The headline figure was a 25× speedup, but it’s less impressive when you consider the comparison was between a single thread on an 8-core CPU and a full-blown GPU implementation – so more like a factor of 3…

Based on the above, no – at least not at this time. If we do come across situations where a GPU implementation really might offer significant benefits, then we’ll obviously revisit. But my primary concern with this technology is that it’s still a moving target. It would also complicate installation, and restrict the number of systems where this was available. Also, it increases the maintenance burden significantly, if we now have to maintain two versions of the code (automated testing would also be interesting…). Finally, if I were to go down that route, I would also seriously consider using OpenCL over CUDA for portability reasons – at least it’s an industry standard, and doesn’t tie users down to NVIDIA systems.

So in short, yes, it’s something we’re thinking about, but I’d need to see a serious case before going for it. At present, I would expect tckgen to be the bottleneck in the overwhelming majority of use cases, but it’s also the one application least likely to perform well on the GPU. If other applications (current or future) become serious bottlenecks in users’ workflows, then if we can convince ourselves that they would perform well on the GPU, I’m sure we’d give it serious consideration…


#3

Thanks for the quick and thorough response.

Good luck with it all – MRtrix remains one of the best documented and best developed imaging software packages around.