Limited RAM for 'tckmap'

Hi ^^

I want to generate a 0.18~0.2 mm resolution TDI or TODI from 30M~40M tracks (step size: 0.01 mm, iFOD1), which takes a very long time.

My workstation has 128 GB of memory (16 GB × 8) and 88 threads (2 × Intel E5-2696 v4).

The problem with this is:
When I run ‘tckmap’, it only uses 9.9% of RAM.
This does not seem to be limited by CPU, since it only uses 180~200% CPU out of a total of 8800% (88 threads).
I confirmed that parallelisation works perfectly when using ‘tckgen’ (it reaches around 8800%).

Hence, it seems reasonable to conclude that the memory available to a single task is limited.

For example,
if I run two TDI tasks simultaneously, each task also uses 9.9% of RAM individually (so the total is 19.8%).
Accordingly, I infer that RAM usage is capped at a certain rate, 9.9%, for a single TDI task.

If I’m not wrong about this, is there a configuration or source code file I could look at that contains the relevant setting? Or is this a problem with my own system?

Thank you very much.

Wow, that is a serious rig! Very impressed with your system… :heart_eyes:

The issue is that tckmap is difficult to parallelise efficiently. In short, it works by using MRtrix3’s ThreadedQueue classes to run a pipeline consisting of:

  • single thread to load streamlines
  • multiple threads to map streamlines to a set of voxels
  • single thread to write these voxels back to the image

The problem is that the final write-back stage is single-threaded (to avoid the potential for concurrent updates to the same location), and that’s the bottleneck for the whole pipeline. In practice, we found we got close to full CPU utilisation on a regular 8-core system - but this will be dependent on the specifics of the hardware and the data being processed. In your case, you’re getting very poor CPU utilisation, but unfortunately I don’t think it’s abnormal.
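
In case it helps to picture the structure, here is a minimal, self-contained C++ sketch of that kind of single-loader / multi-worker / single-writer pipeline. To be clear, this is not MRtrix3’s actual ThreadedQueue implementation: it uses plain std::thread with a toy queue, and the streamline and voxel data are just placeholder integers.

```cpp
// Toy single-loader / multi-worker / single-writer pipeline
// (NOT the actual MRtrix3 ThreadedQueue code; placeholder data throughout).
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread-safe queue with a "closed" flag to signal end of input.
template <typename T>
struct Channel {
  std::queue<T> items;
  std::mutex m;
  std::condition_variable cv;
  bool closed = false;

  void push (T v) {
    { std::lock_guard<std::mutex> lock (m); items.push (std::move (v)); }
    cv.notify_one();
  }
  bool pop (T& out) {
    std::unique_lock<std::mutex> lock (m);
    cv.wait (lock, [&] { return !items.empty() || closed; });
    if (items.empty())
      return false;                              // closed and fully drained
    out = std::move (items.front());
    items.pop();
    return true;
  }
  void close () {
    { std::lock_guard<std::mutex> lock (m); closed = true; }
    cv.notify_all();
  }
};

int main () {
  const int n_workers = 4;
  const int n_streamlines = 1000;
  const int n_voxels = 256;                      // stand-in for the output TDI

  Channel<int> streamlines;                      // loader  -> workers
  Channel<std::vector<int>> visitations;         // workers -> writer
  std::vector<float> tdi (n_voxels, 0.0f);       // only the writer touches this

  // 1. single thread loads streamlines (here, just indices) in order
  std::thread loader ([&] {
    for (int i = 0; i < n_streamlines; ++i)
      streamlines.push (i);
    streamlines.close();
  });

  // 2. multiple threads map each streamline to the set of voxels it visits
  std::vector<std::thread> workers;
  for (int w = 0; w < n_workers; ++w)
    workers.emplace_back ([&] {
      int s;
      while (streamlines.pop (s))
        visitations.push ({ s % n_voxels, (s * 7) % n_voxels });   // dummy mapping
    });

  // 3. single writer thread: the only thread that updates the image,
  //    so no two threads can ever increment the same voxel concurrently
  std::thread writer ([&] {
    std::vector<int> v;
    while (visitations.pop (v))
      for (int voxel : v)
        tdi[voxel] += 1.0f;
  });

  loader.join();
  for (auto& w : workers)
    w.join();
  visitations.close();                           // no more work for the writer
  writer.join();

  float total = 0.0f;
  for (float v : tdi)
    total += v;
  std::printf ("total voxel visitations: %.0f\n", total);   // expect 2000
  return 0;
}
```

However many worker threads you add to step 2, everything they produce still has to funnel through that single writer in step 3, which is why throwing 88 threads at the problem doesn’t help much here.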

We’ve had extensive discussions in the past about ways of improving this (see this old discussion if you’re really interested), but there’s no easy solution for this. Maybe we’ll need to revisit this at some point…


Also: in case you’re wondering, the RAM consumption is primarily determined by the size of the output TDI/TWI image - not the input tractogram. The streamlines are loaded in order, and the image is updated as processing proceeds, so only the relatively small set of streamlines currently being processed needs to be stored in RAM at any point in time.

From those figures, it looks like your output TDI takes up around 10GB? This is huge… And likely to contribute to the poor performance: with such a large output image, the CPU won’t be able to efficiently cache the data in L2/L3 cache, meaning it’ll be constantly fetching data from the main RAM to update the values. Data fetches from main system RAM involve much higher latency than on-chip L2/L3 cache (several hundred clock cycles), so the writer thread is likely stalling on memory access a very large proportion of the time.
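
(As a rough back-of-the-envelope check, assuming a field of view of around 250 mm per axis and one 32-bit floating-point value per voxel: (250 / 0.18)³ ≈ 2.7 billion voxels × 4 bytes ≈ 10 GB. The exact figure will of course depend on your actual FoV and datatype.)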

Thank you for your kind answer.

If I understand your excellent responses correctly:

  1. one thread loads a small set of streamlines into RAM,
  2. multiple threads calculate the track density at each voxel using those loaded streamlines (whether they sit in RAM or in L2/L3 cache),
  3. one thread updates the TDI continuously. In this case, since the TDI is already fully allocated in RAM, this thread only updates the value of each voxel that needs to change based on the calculations in step 2.

So, do you mean that step 1 above (the limitation of using a single thread) is the main factor that makes ‘tckmap’ difficult to parallelise efficiently?

Also, does the huge RAM allocation needed for the output TDI template cause the CPU to fetch the data (streamlines) from RAM rather than from L2/L3 cache?

I hope this question is not an unreasonable one.

Thank you!
-Gangwon Jeong

It’s point 3, not point 1, that is preventing your system from fully exploiting all available threads.

You can’t have multiple threads trying to update the same TDI image, as two threads may attempt to modify the same location at the same time. Hence the work of determining the set of voxels intersected by each streamline (step 2) can be done in a multi-threaded fashion, but the resulting voxel visitation data from each streamline must be serialised for a single thread to then contribute to the TDI.
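
To make the “same location at the same time” point concrete, here is a tiny stand-alone C++ illustration (again, nothing to do with the actual MRtrix3 code): two threads incrementing one shared value with no synchronisation will typically lose updates, which is exactly the situation the single-writer design avoids.

```cpp
// Why two threads must not update the same voxel concurrently:
// unsynchronised read-modify-write increments are a data race.
#include <cstdio>
#include <thread>

int main () {
  float voxel = 0.0f;                       // shared "voxel" value
  auto bump = [&voxel] {                    // each thread adds 1.0 a million times
    for (int i = 0; i < 1000000; ++i)
      voxel += 1.0f;                        // read-modify-write, NOT atomic
  };

  std::thread a (bump), b (bump);
  a.join();
  b.join();

  // Expected 2000000 if the updates were safe; with the race, the result is
  // usually smaller (and this is formally undefined behaviour in C++).
  std::printf ("voxel = %.0f (expected 2000000)\n", voxel);
  return 0;
}
```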

The issue for you is that because your target image is so large, cache performance is exceptionally poor, and hence the thread responsible for step 3 is very slow. So while you have a huge number of threads available to work on step 2, they can produce voxel visitation data much faster than the step 3 thread can consume it, and therefore those threads spend most of their time idle, waiting for the step 3 thread to signal that it is ready to accept more voxel visitation data.

I think your answer is very clear.
My curiosity has been resolved.

Thank you for your kindness!