Parallel tracking

rsmith · November 2, 2016, 1:06am

Based on benchmarking with 4-, 8- and 16-core Xeons, other things being equal, tckgen and tcksift appear to be main-memory bandwidth limited.

I’d definitely keep tckgen and tcksift apart in discussions relating to performance, rather than either grouping them or considering the overall performance of the two combined. The fundamental operations used to gain performance are very different; for instance, tcksift has to spawn and then delete N threads twice per iteration, so it would suffer badly on a system with a high thread spawn overhead.

I’ve had some success in further reducing runtime by partitioning the tckgen workload across multiple machines and concatenating the resulting files with tckedit before running tcksift, and apart from seeding the RNGs correctly, I can’t think of why this would be incorrect.

If the seeding mechanism is entirely random and every seed / track is independent of every other seed / track, then yes, this is a perfectly valid thing to do. However, the entire design rationale of dynamic seeding is that these things are not independent. So whether or not this is a valid thing to do depends on whether the independent executions of tckgen give a sufficient sampling of the connectivity field to learn the seeding weights. The initial ‘burn-in’ period, during which the reconstruction biases are greater because the appropriate seeding weights have not yet been learned, will also be additive across executions.

However, tcksift is also a big hog in terms of memory requirements …

I got it as small as possible, I promise… Yes, the requirements can be high, but given you’re playing around with 16-core Xeons, I wouldn’t expect this to be too much of an issue? Alternatively, if you’re dealing with very high-resolution data, e.g. HCP, you might want to consider using a down-sampled FOD field for SIFT: this should have minimal influence on results, but reduce RAM usage by a factor of ~3.

and I was wondering if it’s reasonable to tcksift the partitioned tractograms before concatenating them (or do 10x10M, SIFT each to 2M, concat to single 20M then SIFT to 10M?), or if the promises of SIFT are only realized when running on the full tractogram?

You can, it’s just difficult to know how far you can push the limits of such partitioning, or indeed precisely what type of effect it’s going to have. It’s not something that I’ve looked into in great detail; I just made a different method instead…

We optimise the data layout to have contiguous per-voxel SH coefficients in RAM prior to starting processing, which really helps with cache utilisation, and improved performance by 2-3× when I was testing it (quite a few years ago now).
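To illustrate the layout change described above, here is a minimal sketch, not the actual MRtrix3 code. It assumes the input is stored volume-major (one full 3D volume per SH coefficient) and re-packs it so that all coefficients of a voxel sit next to each other. The names `VoxelMajorSH` and `repack` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container (not an MRtrix3 class): SH coefficients stored
// voxel-major, i.e. all coefficients of one voxel are contiguous in memory.
struct VoxelMajorSH {
  std::size_t nvoxels;
  std::size_t ncoefs;
  std::vector<float> data;  // index = voxel * ncoefs + coef

  VoxelMajorSH(std::size_t voxel_count, std::size_t coef_count)
      : nvoxels(voxel_count), ncoefs(coef_count), data(voxel_count * coef_count) {}

  // Pointer to the ncoefs contiguous coefficients of one voxel.
  const float* coefs(std::size_t voxel) const {
    assert(voxel < nvoxels);
    return data.data() + voxel * ncoefs;
  }
};

// Re-pack from an assumed volume-major layout, where coefficient c of
// voxel v lives at index c * nvoxels + v, into the voxel-major layout.
VoxelMajorSH repack(const std::vector<float>& volume_major,
                    std::size_t nvoxels, std::size_t ncoefs) {
  assert(volume_major.size() == nvoxels * ncoefs);
  VoxelMajorSH out(nvoxels, ncoefs);
  for (std::size_t v = 0; v < nvoxels; ++v)
    for (std::size_t c = 0; c < ncoefs; ++c)
      out.data[v * ncoefs + c] = volume_major[c * nvoxels + v];
  return out;
}
```

With the volume-major layout, reading every coefficient of one voxel touches `ncoefs` widely separated addresses; after re-packing, the same read is one contiguous run, which is the cache-friendly property referred to above.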

Lately I’ve been thinking about arranging voxel data on a Hilbert or Z-order curve…
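For reference, a Z-order (Morton) index can be sketched as below. This is only an illustration of the idea, not anything that exists in MRtrix3: the function names are hypothetical, it assumes each voxel coordinate fits in 10 bits (image dimensions up to 1024), and it covers Z-order only, not the Hilbert curve.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 10 bits of v so that bit i moves to bit 3*i,
// leaving two zero bits between consecutive input bits.
inline std::uint32_t spread_bits(std::uint32_t v) {
  std::uint32_t out = 0;
  for (unsigned bit = 0; bit < 10; ++bit)
    out |= ((v >> bit) & 1u) << (3 * bit);
  return out;
}

// Interleave the bits of (x, y, z) into a single Z-order index.
// Voxels that are close in 3D tend to receive nearby indices.
inline std::uint32_t morton3d(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
  assert(x < 1024 && y < 1024 && z < 1024);  // assumed 10 bits per axis
  return spread_bits(x) | (spread_bits(y) << 1) | (spread_bits(z) << 2);
}
```

Sorting voxel data by `morton3d(x, y, z)` would place the eight voxels of each 2×2×2 block in consecutive slots, which is the locality property that makes this kind of ordering attractive for interpolation-heavy access patterns.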

But it might very well be the case that the part of the code responsible for seeding from that TDI is itself the bottleneck. Looking at src/dwi/tractography/tracking/exec.h, it looks like the seeder runs as the final stage in a 4-stage threaded queue, and that a single thread is responsible for managing that part, which could cause bottlenecks in the pipeline as a whole if it’s not processing items fast enough.
…
It might be that we need to run more threads for this bit…? In fact, I’m not sure what the seeder is doing exactly, but do we even need to run it as a separate thread? Can we not use lock-free atomic operations to seed within the tracker threads…?

There’s a thread calling a functor within the seeder, which is responsible for updating the streamline density in each fixel after generated tracks are written to file and then mapped to voxels; but this isn’t responsible for actually drawing seeds. That is managed by the same threads as used for tracking, and is atomic-flag spin-locked per fixel within this range.

This old comment has me slightly worried though.

@maedoc You could try un-commenting this line and see what it gives you. It would also be useful to know whether or not the CPU usage is changing during the course of the tckgen run (since the seeding probabilities evolve in different ways during this time).
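For anyone unfamiliar with the per-fixel atomic-flag spin-locking mentioned above, the general pattern looks roughly like this. It is a generic sketch and not the MRtrix3 implementation: `Fixel`, `FixelGuard` and `add_density` are hypothetical names, and the field being protected is an assumption chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-fixel record: one lock flag plus the data it protects.
struct Fixel {
  std::atomic_flag busy = ATOMIC_FLAG_INIT;
  double track_density = 0.0;
};

// RAII guard: spins until the flag is acquired, releases it on destruction.
class FixelGuard {
 public:
  explicit FixelGuard(std::atomic_flag& f) : flag(f) {
    while (flag.test_and_set(std::memory_order_acquire)) {
      // Busy-wait; contention is expected to be brief and per-fixel.
    }
  }
  ~FixelGuard() { flag.clear(std::memory_order_release); }
  FixelGuard(const FixelGuard&) = delete;
  FixelGuard& operator=(const FixelGuard&) = delete;

 private:
  std::atomic_flag& flag;
};

// Update one fixel while holding only that fixel's lock, so threads
// working on different fixels never wait on each other.
void add_density(std::vector<Fixel>& fixels, std::size_t index, double length) {
  assert(index < fixels.size());
  FixelGuard guard(fixels[index].busy);
  fixels[index].track_density += length;
}
```

The point of locking per fixel rather than globally is that threads only contend when they hit the same fixel at the same moment; whether that contention is significant in practice is exactly what the CPU usage check suggested above would help establish.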