Yes, there are applications that can really benefit from the massively parallel processing capabilities of modern GPUs. In the FSL case, I think the primary speedup was in
bedpostx rather than the tracking itself. I can see that an application like
bedpostx would benefit from this, given the nature of the processing (massive MCMC sampling over the data for each voxel). In MRtrix3, the FOD estimation is typically not the bottleneck though.
The tracking part might benefit from CUDA, but it’s not a clear-cut winner – not clear enough for me to consider investing what would be an enormous amount of effort. There are so many options in
tckgen now that would be very complex to convert into a GPU implementation – I can’t see this being a trivial job… But more to the point, the nature of the data processing during tractography is not a great match for the way GPUs like to operate. Yes, it’s massively parallelisable (and already parallelised), but not at the fine granularity that GPUs expect: there’s a lot going on for each step during tracking. Also, the stochastic nature of the algorithms makes it almost impossible to optimise data access patterns, which are really critical on these systems: it’s difficult to predict which data the algorithm will need next (i.e. which voxels the streamline will venture into at the next step). This means most threads will sit idle waiting for data to be fetched from RAM. And the typical workaround for this (increasing the number of concurrent threads to improve the chances of at least some being able to run) also won’t work well here, given the complexity of each thread and the number of registers required (swapping threads also means swapping out registers, which causes yet more fetch/store operations to RAM).
If you’re interested, there was an ISMRM 2012 abstract that did implement the tractography from the old MRtrix 0.2 version (
streamtrack) on the GPU, but if you look at the details, the speedup factor was not as impressive as I would have liked. The headline figure was a 25× speedup, but it’s less impressive when you consider the comparison was between a single thread on an 8-core CPU vs. a full-blown GPU implementation – so more like a factor of 3…
Based on the above, no – at least not at this time. If we do come across situations where a GPU implementation really might offer significant benefits, then we’ll obviously revisit. But my primary concern with this technology is that it’s still a moving target. It would also complicate installation, and restrict the number of systems on which it would be available. It also increases the maintenance burden significantly, if we then have to maintain two versions of the code (automated testing would also be interesting…). Finally, if I were to go down that route, I would also seriously consider using OpenCL over CUDA for portability reasons – at least it’s an industry standard, and doesn’t tie users down to NVIDIA systems.
So in short, yes, it’s something we’re thinking about, but I’d need to see a serious case before going for it. At present, I would expect
tckgen to be the bottleneck in the overwhelming majority of use cases, but it’s also the one application least likely to perform well on the GPU. If other applications (current or future) become serious bottlenecks in users’ workflows, then if we can convince ourselves that they would perform well on the GPU, I’m sure we’d give it serious consideration…