Parallel tracking

I’ve been trying to bring down the run time on the streamline generation. Based on benchmarking with 4, 8 and 16 core Xeons, other things being equal, tckgen and tcksift appear to be main-memory bandwidth limited: the number of tracks selected per second is roughly maxed out after a few threads, irrespective of core count (though much slower clock speeds need another thread or so to saturate memory bandwidth).

I’ve had some success in further reducing runtime by partitioning the tckgen workload across multiple machines and concatenating the resulting files with tckedit before running tcksift, and apart from seeding the RNGs correctly, I can’t think of why this would be incorrect.

However, tcksift is also a big hog in terms of memory requirements, and I was wondering if it’s reasonable to tcksift the partitioned tractograms before concatenating them (or do 10x10M, SIFT each to 2M, concat to single 20M then SIFT to 10M?), or if the promises of SIFT are only realized when running on the full tractogram?

Thanks for the detailed benchmarks. However, I’m not sure this has anything to do with main memory bandwidth. On any modern system, memory throughput is at least several GB/s, whereas tckgen will only produce a few MB/s at most; I can’t see that being a problem. What might be a problem is cache misses and the latency of data fetches from main memory, if the process is constantly requesting data that isn’t already in the on-board L2 or L3 cache (which I’m guessing will be at least 12MB on a Xeon) and which the CPU hasn’t already pre-fetched through its prediction algorithms. In my experience, it’s pretty hard to come up with situations where a modern CPU really stalls on that front (and I’ve tried…), and in this situation running more threads typically improves matters somewhat (due to having more threads available to run when one of them stalls waiting for data). It’s much more common to find performance hits due to L1 cache misses (the L1 cache being much closer to the CPU cores themselves), but I’d expect that to impact per-thread performance, not overall performance across threads.
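
To put a very rough number on that output rate (the figures here are purely illustrative assumptions): at ~1,000 streamlines per second, with a few hundred points per streamline stored as 3 × 4-byte floats, that works out at something like

1000 tracks/s × 300 points/track × 12 bytes/point ≈ 3.6 MB/s

which is orders of magnitude below what main memory can sustain.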

What I think is more likely is that whatever commands you’re running entail very high thread contention: too many threads locking on shared resources at a very fast rate, so they all end up waiting for each other. This was definitely a problem for tcksift IIRC - not sure whether that’s still the case. I’m not sure how it would be a problem for tckgen though: as far as I can reason, there’s nothing there that I wouldn’t expect to scale more or less linearly. The only thing I could think of was dynamic seeding, but that now uses atomic operations, so it shouldn’t have any issues dealing with high thread counts.

Maybe if you can share the exact commands that you were running, we might be able to look into it. Also, I’m assuming you’ve already checked that storage isn’t the bottleneck here either: if this is running on an already saturated networked filesystem, for example, it wouldn’t be surprising to observe the kind of problem you’ve reported. Although the read/write rate for these applications really shouldn’t ever be so high that even a slow storage device couldn’t keep up (unless it was accessed in synchronous mode, which would entail quite a bit more latency, I’m guessing…).

The RNG in MRtrix3 is always seeded from /dev/random, so there’s no need to worry about it: it’ll always be different across runs.

I’ll leave that one for @rsmith

For reference, I used the following sequence to benchmark:

cmd="tckgen wm-csd.mif 100K.tck -force -act 5tt.mif -backtrack -crop_at_gmwmi -seed_dynamic wm-csd.mif -maxlength 250 -cutoff 0.06"
time $cmd -number 1
time $cmd -number 100K

I was quite surprised to read that tckgen would generate only MB/s of throughput: the above datasets (wm-csd.mif and 5tt.mif) are about 250 MB (so not fitting in any L3 cache I’m aware of), and at hundreds or thousands of streamlines generated per second, probabilistically traversing image space, I would have assumed that random access to the FOD and 5TT images (and near-constant L1 cache misses) would be the limiting factor. But maybe my understanding of the in-memory data structures is way off?

Edit: I just ran that for 1, 2, 4, 6 and 8 threads on an 8-core Xeon and saw linear scaling, so that confirms what you said. I guess I don’t understand what’s happening in tckgen… thanks for all the details.
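
(For reference, the per-thread-count runs just used the standard -nthreads option, something along these lines:)

for n in 1 2 4 6 8; do
    echo "== $n threads =="
    time $cmd -number 100K -nthreads $n
done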

OK, not too sure what you meant with that last edit: I thought you said tckgen throughput was maxing out after a few threads. Did that last statement of yours refer to a different command line…?

Otherwise, your thinking is spot on, but in practice the effect of L3 cache misses is nothing like the effect of L1 cache misses. What I’ve found is that for a standard run of tckgen, performance is not L2 or L3 cache limited: it increases linearly with the number of threads, and even keeps improving slightly when the number of threads exceeds the core count, the latter because the CPU can run threads that might have data available when the currently running ones stall on IO.

For a standard run of tckgen, the amount of work performed for each step is relatively large, so latency of data fetches doesn’t have an enormous impact. We optimise the data layout to have contiguous per-voxel SH coefficients in RAM prior to starting processing, which really helps with cache utilisation, and improved performance by 2-3× when I was testing it (quite a few years ago now). Tracks get written out at a relatively leisurely pace compared to the bandwidth that the system will be capable of, and besides this doesn’t have any latency implications - the data are being written out and are no longer needed by the CPU.
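
(In terms of image strides, that layout amounts to making the SH coefficient axis the contiguous one, i.e. stride 1 on axis 3. If you want to see what that looks like, something like the following should show it, though note the option is spelled -stride on older versions, and tckgen performs this re-arrangement in RAM regardless of how the image is stored on disk:)

mrinfo wm-csd.mif -strides                                   # report the current memory layout
mrconvert wm-csd.mif -strides 2,3,4,1 wm-csd-volcontig.mif   # re-write with the SH axis (axis 3) contiguous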

So in my experience, what really makes a difference is thread collisions. And I think this is what might be happening here, since I note you run with the -seed_dynamic option. That definitely requires a lot of synchronisation between threads, as the programme will update a TDI every time a streamline is accepted, and that TDI is used for seeding. I had a quick look at the code yesterday, and noted that it seemed to use atomic operations to avoid locking issues, so that might not be the issue here. But it might very well be the case that the part of the code responsible for seeding from that TDI is itself the bottleneck. Looking at src/dwi/tractography/tracking/exec.h, it looks like the seeder runs as the final stage in a 4-stage threaded queue, and that a single thread is responsible for managing that part, which could cause bottlenecks in the pipeline as a whole if it’s not processing items fast enough. I’m guessing @rsmith will be better placed to comment on this, but it might be that we need to run more threads for this bit…? In fact, I’m not sure what the seeder is doing exactly, but do we even need to run it as a separate thread? Can we not use lock-free atomic operations to seed within the tracker threads…?

Based on benchmarking with 4, 8 and 16 core Xeons, other things being equal, tckgen and tcksift appear to be main-memory bandwidth limited

I’d definitely keep tckgen and tcksift apart in discussions relating to performance, rather than either grouping them or considering the overall performance of the two combined. The fundamental operations used to gain performance are very different; for instance, tcksift has to spawn and then tear down N threads twice per iteration, so it would very much suffer on a system with a high thread-spawn overhead.

I’ve had some success in further reducing runtime by partitioning the tckgen workload across multiple machines and concatenating the resulting files with tckedit before running tcksift, and apart from seeding the RNGs correctly, I can’t think of why this would be incorrect.

If the seeding mechanism is entirely random and every seed / track is independent of every other seed / track, then yes, this is a perfectly valid thing to do. However, the entire design rationale of dynamic seeding is that these things are not independent. So whether or not this is a valid thing to do depends on whether the independent executions of tckgen each sample the connectivity field sufficiently to learn the seeding weights; the initial ‘burn-in’ period, where the reconstruction biases are greater because the appropriate seeding weights have not yet been learned, will also be additive across executions.
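
As a purely illustrative example (the actual numbers will depend on the data): if the seeding weights only stabilise after the first ~1M streamlines of a run, then ten independent 10M-track jobs will between them generate ~10M streamlines under poorly-estimated seeding weights, whereas a single 100M-track run would only generate ~1M such streamlines.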

However, tcksift is also a big hog in terms of memory requirements …

I got it as small as possible, I promise… Yes, the requirements can be high, but given you’re playing around with 16-core Xeons, I wouldn’t expect this to be too much of an issue? Alternatively, if you’re dealing with very high-resolution data (e.g. HCP), you might want to consider using a down-sampled FOD field for SIFT: this should have minimal influence on the results, but reduces RAM usage by a factor of ~3.
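
(Something along these lines should do it; the file names are placeholders, the target voxel size is just an example, and depending on your version the resize command is mrresize or mrgrid regrid:)

mrresize wm-csd.mif -voxel 2.5 wm-csd-lowres.mif
tcksift tracks_100M.tck wm-csd-lowres.mif tracks_sift_10M.tck -act 5tt.mif -term_number 10000000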

and I was wondering if it’s reasonable to tcksift the partitioned tractograms before concatenating them (or do 10x10M, SIFT each to 2M, concat to single 20M then SIFT to 10M?), or if the promises of SIFT are only realized when running on the full tractogram?

You can; it’s just difficult to know how far you can push the limits of such partitioning, or indeed precisely what type of effect it’s going to have. It’s not something that I’ve looked into in great detail; I just made a different method instead…

We optimise the data layout to have contiguous per-voxel SH coefficients in RAM prior to starting processing, which really helps with cache utilisation, and improved performance by 2-3× when I was testing it (quite a few years ago now).

Lately I’ve been thinking about arranging voxel data on a Hilbert or Z-order curve… :nerd:

But it might very well be the case that the part of the code responsible for seeding from that TDI is itself the bottleneck. Looking at src/dwi/tractography/tracking/exec.h, it looks like the seeder runs as the final stage in a 4-stage threaded queue, and that a single thread is responsible for managing that part, which could cause bottlenecks in the pipeline as a whole if it’s not processing items fast enough.

It might be that we need to run more threads for this bit…? In fact, I’m not sure what the seeder is doing exactly, but do we even need to run it as a separate thread? Can we not use lock-free atomic operations to seed within the tracker threads…?

There’s a thread calling a functor within the seeder, which is responsible for updating the streamline density in each fixel after generated tracks are written to file and then mapped to voxels; but this isn’t responsible for actually drawing seeds. That is managed by the same threads as are used for tracking, and is atomic-flag spin-locked per fixel within this range.

This old comment has me slightly worried though.


@maedoc You could try un-commenting this line and see what it gives you. It would also be useful to know whether or not the CPU usage is changing during the course of the tckgen run (since the seeding probabilities evolve in different ways during this time).

OK, so I’m starting to be afraid that my previous observation

number of tracks selected per second is roughly maxed out after a few threads

was both with a slightly older version of MRtrix and with different seeding options (-seed_gmwmi, IIRC), so it isn’t valid for the commands I wrote previously. OTOH, I did run on a system with nearly double the main-memory bandwidth and saw more than double the performance (related to a GitHub issue on the ppc64 arch). From what you wrote, that is likely due to better hardware support for atomic ops, not just memory bandwidth.

We optimise the data layout to have contiguous per-voxel SH coefficients in RAM prior to starting processing

This is something I don’t quite understand, since it’s a multidimensional image… or do you mean that the spatial axes have strides greater than one, and the SH coefficients are stride one?

I got it as small as possible, I promise… Yes, the requirements can be high, but given you’re playing around with 16-core Xeons, I wouldn’t expect this to be too much of an issue?

No, I haven’t run into memory issues, but if I did split these steps up across our cluster, tractograms of 100M or 200M streamlines would be easily generated, but not subsequently SIFTable on a single machine.

I just made a different method instead…

I saw this mentioned in the docs, but the link is broken, and I just found the tcksift2 command.
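
(For reference, basic usage looks to be something like the following, with the file names just placeholders; rather than removing streamlines, it writes a weight per streamline to a text file:)

tcksift2 tracks_100M.tck wm-csd.mif sift2_weights.txt -act 5tt.mif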

Thanks to you both for the copious details; I’ll take some more time to digest some of the finer points…

Just checking: the system with double the memory bandwidth was a different CPU architecture altogether…? If so, it’s going to be very difficult to tie this up with memory bandwidth per se - there will be loads of other differences, including how atomic operations and thread locking are implemented, along with differences in the sizes of the various caches, cache management algorithms, the memory access prediction engine, etc. Not sure how much we can read into that…

Yes, that’s exactly what I mean.

Not sure I get the finer details of what’s going on here, but I have to admit I can’t see how the TDI update could be the bottleneck; it should be much faster than the tracking itself.

@maedoc, any chance you could run the tracking with a standard -seed_image mask.mif using the whole-brain mask, just to check whether the process scales as expected in this case? If it doesn’t, there’s definitely something to look into… Otherwise, some of the other seeding mechanisms might introduce issues of their own; it would be good to identify which part is slowing things down.
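
(i.e. essentially the same timing runs as before, just swapping the seeding option, something like:)

cmd="tckgen wm-csd.mif 100K.tck -force -act 5tt.mif -backtrack -crop_at_gmwmi -seed_image mask.mif -maxlength 250 -cutoff 0.06"
for n in 1 2 4 8; do time $cmd -number 100K -nthreads $n; done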

:vulcan:

We’ve discussed this a few times over the years, and I did give something like this a go a while back, and found it made no difference whatsoever. Having thought about this, I think the reason is that the CPU accesses RAM via cache lines, which are 64 bytes on Core i7 processors, apparently. That’s enough to store 16 floats, or about a third of the 45 SH parameters we’d need to load per voxel during tracking (actually, we need 8 voxels’ worth of data given the use of trilinear interpolation). There is no latency associated with where in RAM other required cache lines might be located, and no further cache utilisation limitations, given that the cache line is the fundamental unit used in the CPU’s cache management system. So apart from a few potentially unused bytes padding out the remainder of the cache lines either side of the data needed, there is no benefit in further optimising for data locality. And the CPU is really good at predicting which cache lines you might need next, so it may very well pre-fetch a lot of the data you need even though the tracking itself is stochastic.
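
To put some rough numbers on that: 45 SH coefficients × 4 bytes ≈ 180 bytes, i.e. around 3 cache lines per voxel, and with the 8 voxels needed for trilinear interpolation that’s roughly 24 cache lines (~1.4 kB) touched per FOD evaluation; where those lines happen to sit in RAM relative to one another makes essentially no difference to the cost of fetching them.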

All this to say, I don’t think you’ll get any benefits whatsoever… Where data locality really matters is when reading from spinning disks (since the seek latency is so massive), or when iterating over data that might not fit in L1 cache in a very tight loop (hence cache-blocking techniques for things like matrix operations). It’s also problematic when iterating over data in such a way that only a single data point is used per cache line (i.e. iterating over a data set with badly matched strides), since that makes very inefficient use of the cache. But in the case of tckgen, I don’t think there’s much more that can be done…

Yes, a POWER8 system. I agree, it’s another zoo altogether. Among other things, it has a 128-byte cache line.

run the tracking with a standard -seed_image mask.mif using the whole-brain mask, just to check whether the process scales as expected in this case?

Yep, I see linear scaling there as well.

Good to hear, thanks for reporting back. So that would indicate that the dynamic seeding might have something to do with it. Although didn’t you also mention that GM/WM interface seeding suffered from the same problem?