tckgen on HCP dataset takes excessively long

I'm trying to build an MRtrix-based diffusion processing pipeline for the HCP dataset. However, I'm running into the problem that tckgen is taking prohibitively long.

With the command below, setting $nTracts to "1M" took 9.5 hours for a single subject on 46 threads. Assuming that processing time increases linearly with the requested number of tracts, that would mean 95 hours per subject at the recommended 10M tracts. Scaling this up to a large number of subjects would be computationally infeasible.

tckgen -nthreads 46 "$tmp"/DWI_FOD_WM.mif "$tmp"/DWI_hollander_tractogram100.tck -act "$tmp"/5TT.mif -backtrack -crop_at_gmwmi -seed_dynamic "$tmp"/DWI_FOD_WM.mif -maxlength 250 -number "$nTracts" -step 0.8 -cutoff 0.06

Am I doing something wrong, or is this the expected processing time for a single subject? If so, how could I change this command to make it feasible for a dataset as large as the HCP?

Reinder - it probably shouldn't take that long, but it depends on your hardware and the image data you're using.

You could try removing the optional arguments and running tckgen in its simplest form, then add the options back one-by-one until you find what's causing the slowdown (I suspect the -nthreads or -backtrack options?).
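For example, something along these lines (a sketch reusing your file and variable names; note that tckgen always requires a seeding option, so keep one even in the stripped-down run):

tckgen "$tmp"/DWI_FOD_WM.mif "$tmp"/test.tck -seed_dynamic "$tmp"/DWI_FOD_WM.mif -number 100K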

You also appear to be running an old version of MRtrix3 (the -number option to tckgen has since been renamed to -select), so try upgrading to the latest version.
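If you installed from the GitHub sources, updating usually just means pulling and rebuilding:

git pull
./build

(run ./configure again first if the build complains about a missing or stale configuration)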

Indeed, it shouldn't take that long… although HCP data is more demanding than most. Can you provide more details as to what hardware you're running this on? In particular:

  • what OS?
  • is this a desktop computer or on a cluster?
  • if a cluster, what is the job allocation policy? Are there restrictions on how many threads your job can actually run?
  • how much RAM is available on the system / node?
  • how many cores on the CPU / node?
  • how many other jobs were running concurrently on that system / node at the time?
  • where is the data being stored? Locally or over a network filesystem?
  • can you verify that the CPU is actually running at 100%? (see the note after this list)
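On that last point: to check, run top and press 1 for a per-core view (htop shows one meter per core by default); with 46 threads fully busy, every core should be pinned near 100%.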

This should help us narrow down the issue…

Thanks for all the comments. I have updated MRtrix to the newest version - this did not make a substantial difference in processing speed. The system specs of the computer that I was testing it on are below.

  • OS: Ubuntu 14.04 (Mate)
  • Desktop Computer
  • RAM: 256GB
  • 24 cores (48 threads with hyper-threading)
  • Nothing else was running.
  • The data is stored on a network filesystem. Note that the desktop has a 1 Gbit network connection; I'm not certain of the network filesystem's connection speed, but I know it exceeds 1 Gbit by a large margin.
  • I've observed individual tckgen jobs using between 60% and 97% CPU, so there may be an issue here.

I feel like when I ran this on the institute’s SGE cluster with 36 threads, it ran a lot faster. I’m currently working on confirming this, but I’m having issues with the cluster which have to be resolved first.

EDIT: I got it running on the institute's SGE cluster (30 threads). After 70 minutes it had generated >6 million tracks, so it's definitely a lot faster. I would still prefer to have it running on the lab's workstations as well, though.

OK, that is weird… Can I check that when you say it’s using 60-97% of the CPU, that’s as a percentage of the full CPU capacity? Utilities like top often report the usage as a proportion of a single core, which in your case would mean that tckgen is not even using up a whole core…

Also, can you try running the command with the -debug option, and posting the full output, including the command itself? That might give us more information.

The other thing to look into is whether you might inadvertently be using a debug or assert build. What does your config file say on the matter? You might want to run ./configure && ./build again just to make sure you're using a full release build - the assert version is definitely slower, and the debug version is much, much slower again…
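If in doubt: the top-level config file written by ./configure records the build settings. The exact contents vary between versions, but something along these lines should reveal whether a debug or assert build was configured:

grep -iE 'debug|assert' config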

For a lot of MRtrix3 commands we've seen effectively 100% CPU utilisation on 16-core / 32-thread systems, but I've never had access to anything bigger than that. It's entirely feasible that a particular section of code could bottleneck at around that point. For tckgen, I can see a few points that could conceivably lead to less than 100% usage:

  • Although tracks are generated in parallel, they still need to be written to file sequentially. This is done by pushing the track data from the tractography threads onto a queue structure, with a single thread responsible for writing that data to file. This requires mutual exclusion locking, which may prevent the tracking threads from running at full speed once a certain number of threads is reached - you may simply have found that number (see the sketch after this list).

  • If you are writing to a network-based filesystem, it's conceivable that the filesystem I/O is at its limit, but I find this fairly unlikely. You could try setting the config file entry TrackWriterBufferSize to a larger number and see what happens (see the example after this list).

  • With dynamic seeding specifically, determining streamline seed points is not entirely independent between threads: all threads both read from the fixel data to determine seed probabilities, and write to that data to dynamically update those probabilities, and these data are shared across all threads. This is done using the C++11 atomics library rather than explicit mutexes, and I've deliberately used the most relaxed memory synchronisation rules I could, but it could conceivably hit a multi-threading limit. Running tckgen with some other seeding mechanism should tell you whether or not it's the dynamic seeding that's preventing 100% utilisation.
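To make the first point concrete, here is a minimal, self-contained sketch of the single-writer pattern described above (illustrative only - this is not the actual MRtrix3 source): many tracking threads generate streamlines in parallel, but all results funnel through one mutex-protected queue to a single thread that owns the output file, so that lock becomes a potential contention point as the thread count grows. The relaxed atomic counter stands in for the shared per-fixel seeding statistics mentioned in the last point.

  // Sketch of the producer-consumer pattern described above (not MRtrix3 code).
  // Build with: g++ -std=c++11 -pthread sketch.cpp
  #include <atomic>
  #include <condition_variable>
  #include <cstdint>
  #include <fstream>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <utility>
  #include <vector>

  std::queue<std::vector<float>> track_queue;  // finished streamlines awaiting write
  std::mutex queue_mutex;                      // serialises all access to the queue
  std::condition_variable queue_cv;
  bool done = false;                           // set (under the mutex) once producers finish

  // stand-in for the shared seeding statistics, updated with relaxed ordering:
  std::atomic<std::uint32_t> seed_updates { 0 };

  void tracker() {  // one of many producer threads
    for (int n = 0; n < 1000; ++n) {
      std::vector<float> track (300, 0.0f);  // pretend streamline data
      seed_updates.fetch_add (1, std::memory_order_relaxed);
      {
        std::lock_guard<std::mutex> lock (queue_mutex);  // the contention point
        track_queue.push (std::move (track));
      }
      queue_cv.notify_one();
    }
  }

  void writer() {  // the single thread that owns the output file
    std::ofstream out ("tracks.bin", std::ios::binary);
    while (true) {
      std::unique_lock<std::mutex> lock (queue_mutex);
      queue_cv.wait (lock, [] { return !track_queue.empty() || done; });
      if (track_queue.empty())  // done, and queue fully drained
        return;
      std::vector<float> track = std::move (track_queue.front());
      track_queue.pop();
      lock.unlock();  // write to disk outside the lock
      out.write (reinterpret_cast<const char*> (track.data()),
                 track.size() * sizeof (float));
    }
  }

  int main() {
    std::vector<std::thread> trackers;
    for (int i = 0; i < 46; ++i)
      trackers.emplace_back (tracker);
    std::thread write_thread (writer);
    for (auto& t : trackers)
      t.join();
    {
      std::lock_guard<std::mutex> lock (queue_mutex);
      done = true;
    }
    queue_cv.notify_all();
    write_thread.join();
  }

As for the second point: the TrackWriterBufferSize entry goes in your MRtrix config file (e.g. ~/.mrtrix.conf), one key per line. The value below (in bytes) is just a guess at something comfortably larger than the default:

  TrackWriterBufferSize: 67108864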

When I said 60-97%, I meant for each individual process, not the total across all processes.

It took a while for me to get back to working on this issue - and all of a sudden it seems to have resolved itself somehow… I'm not quite sure what, if anything, changed, but if it works, it works, I guess. Thanks for your advice though; if this issue reoccurs, I'll see whether this information can help me fix it.