OK, not too sure what you meant with that last edit: I thought you said `tckgen` throughput was maxing out after a few threads. Did that last statement of yours refer to a different command line…?
Otherwise, your thinking is spot on, but in practice the effect of L3 cache misses is nothing like the effect of L1 cache misses. What I've found is that for a standard run of `tckgen`, performance is not L2 or L3 cache limited: it increases linearly with the number of threads, and even keeps improving slightly when the number of threads exceeds the core count - the latter because it allows the CPU to run threads that might have data available while the currently running ones stall on IO.
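Just to make that last point concrete - and to be clear, this is a standalone toy, nothing to do with the MRtrix code, with a made-up workload and timings - the idea is that when each work item involves a blocking wait on top of its compute, oversubscribing the cores lets the OS schedule a runnable thread onto a core whose current thread is stalled:

```cpp
// Standalone toy (nothing to do with the MRtrix code): each "work item" does a
// bit of compute and then blocks, much like a thread stalling on IO. With the
// total amount of work fixed, running more threads than cores lets the OS keep
// the cores busy while some threads are blocked.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
  const int total_items = 4000;

  auto run = [&] (unsigned num_threads) {
    std::atomic<int> next { 0 };
    auto worker = [&] () {
      volatile double x = 0.0;
      for (int item = next.fetch_add (1); item < total_items; item = next.fetch_add (1)) {
        for (int i = 0; i < 100000; ++i)                                // "compute"
          x += i * 1e-9;
        std::this_thread::sleep_for (std::chrono::microseconds (500));  // simulated IO stall
      }
    };
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t)
      pool.emplace_back (worker);
    for (auto& thread : pool)
      thread.join();
    return std::chrono::duration<double> (std::chrono::steady_clock::now() - start).count();
  };

  const unsigned cores = std::max (1u, std::thread::hardware_concurrency());
  for (unsigned num_threads : { cores, 2 * cores })
    std::printf ("%u threads: %.3f s\n", num_threads, run (num_threads));
  return 0;
}
```

On a typical system this should report a shorter wall time for the 2× oversubscribed run, simply because the blocked time of one thread overlaps with the compute of another on the same core.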
For a standard run of `tckgen`, the amount of work performed for each step is relatively large, so the latency of data fetches doesn't have an enormous impact. We optimise the data layout so that each voxel's SH coefficients are contiguous in RAM prior to starting processing, which really helps with cache utilisation - it improved performance by 2-3× when I tested it (quite a few years ago now). Tracks get written out at a relatively leisurely pace compared to the bandwidth the system is capable of, and besides, this doesn't have any latency implications: the data are being written out and are no longer needed by the CPU.
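For anyone curious what that layout change actually amounts to, here's a stripped-down sketch - not the actual MRtrix3 code, the names and the coefficient count are just for illustration - contrasting a volume-major layout (each coefficient stored as its own volume) with the voxel-major layout, where all coefficients for a voxel sit next to each other:

```cpp
// Simplified illustration (not the actual MRtrix3 code) of why keeping each
// voxel's SH coefficients contiguous in RAM helps: the tracker repeatedly
// fetches all coefficients of the voxels around the current position, so a
// voxel-major layout means each fetch touches a few consecutive cache lines
// instead of num_coeff widely separated ones.
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t num_coeff = 45;   // e.g. lmax = 8

// Volume-major ("planar") layout: coefficient c of voxel v lives at
// data[c * num_voxels + v] - every coefficient fetch for one voxel is a
// separate, widely strided access.
double sample_planar (const std::vector<double>& data, std::size_t num_voxels,
                      std::size_t voxel, const std::vector<double>& basis)
{
  double value = 0.0;
  for (std::size_t c = 0; c < num_coeff; ++c)
    value += data[c * num_voxels + voxel] * basis[c];
  return value;
}

// Voxel-major ("contiguous") layout: all coefficients of voxel v live in
// data[v * num_coeff .. (v+1) * num_coeff) - one fetch streams through
// consecutive memory, so a handful of cache lines cover it.
double sample_contiguous (const std::vector<double>& data,
                          std::size_t voxel, const std::vector<double>& basis)
{
  double value = 0.0;
  const double* coeff = data.data() + voxel * num_coeff;
  for (std::size_t c = 0; c < num_coeff; ++c)
    value += coeff[c] * basis[c];
  return value;
}

int main()
{
  const std::size_t num_voxels = 4;
  std::vector<double> basis (num_coeff, 1.0);
  std::vector<double> planar (num_coeff * num_voxels), contiguous (num_coeff * num_voxels);
  for (std::size_t v = 0; v < num_voxels; ++v)
    for (std::size_t c = 0; c < num_coeff; ++c) {
      const double value = v + 0.01 * c;
      planar[c * num_voxels + v] = value;
      contiguous[v * num_coeff + c] = value;
    }
  // same result either way - only the memory access pattern differs:
  for (std::size_t v = 0; v < num_voxels; ++v)
    assert (sample_planar (planar, num_voxels, v, basis) ==
            sample_contiguous (contiguous, v, basis));
  return 0;
}
```

Both functions do the same arithmetic; the difference is purely in how far apart the loads are in memory, which is what the caches and hardware prefetchers care about.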
So in my experience, what really makes a difference is thread collisions. And I think this is what might be happening here, since I note you run with the `-seed_dynamic` option. That definitely requires a lot of synchronisation between threads, as the program will update a TDI every time a streamline is accepted, and that TDI is then used for seeding. I had a quick look at the code yesterday, and noted that it seems to use atomic operations to avoid locking issues, so that might not be the problem here. But it might very well be that the part of the code responsible for seeding from that TDI is itself the bottleneck. Looking at `src/dwi/tractography/tracking/exec.h`, it looks like the seeder runs as the final stage in a 4-stage threaded queue, with a single thread responsible for managing that part - which could bottleneck the pipeline as a whole if it's not processing items fast enough. I'm guessing @rsmith will be better placed to comment on this, but it might be that we need to run more threads for this bit…? In fact, I'm not sure what the seeder is doing exactly, but do we even need to run it as a separate thread? Can we not use lock-free atomic operations to seed within the tracker threads…?
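To spell out what I have in mind with that last question - and this is purely hypothetical, the names here are invented and it glosses over everything the real seeder has to deal with - the TDI could be held as an array of atomic counters that every tracker thread updates and samples from directly, with no dedicated seeder stage at all:

```cpp
// Rough sketch of the "seed within the tracker threads" idea, using only
// lock-free atomics. This is hypothetical: it is not the existing seeder
// code, and the class and function names are made up for illustration.
#include <atomic>
#include <cstdint>
#include <random>
#include <vector>

class SharedTDI {
  public:
    SharedTDI (std::size_t num_voxels) : counts (num_voxels) {
      for (auto& c : counts)
        c.store (0, std::memory_order_relaxed);
    }

    // Called by any tracker thread when a streamline is accepted:
    // a single atomic increment per visited voxel, no locking required.
    void add_visit (std::size_t voxel) {
      counts[voxel].fetch_add (1, std::memory_order_relaxed);
    }

    // Called by any tracker thread to pick its own next seed voxel,
    // biased towards under-tracked voxels - here just a toy heuristic:
    // sample a few voxels at random and keep the one with the lowest count.
    std::size_t pick_seed (std::mt19937& rng, int num_candidates = 8) const {
      std::uniform_int_distribution<std::size_t> dist (0, counts.size() - 1);
      std::size_t best = dist (rng);
      uint32_t best_count = counts[best].load (std::memory_order_relaxed);
      for (int n = 1; n < num_candidates; ++n) {
        const std::size_t v = dist (rng);
        const uint32_t c = counts[v].load (std::memory_order_relaxed);
        if (c < best_count) { best = v; best_count = c; }
      }
      return best;
    }

  private:
    std::vector<std::atomic<uint32_t>> counts;
};

// tiny single-threaded smoke test:
int main()
{
  SharedTDI tdi (1000);
  std::mt19937 rng (42);
  for (std::size_t v = 0; v < 1000; ++v)
    tdi.add_visit (v % 100);        // leave voxels 100..999 unvisited
  return tdi.pick_seed (rng) < 1000 ? 0 : 1;
}
```

The point being that both `add_visit()` and `pick_seed()` are lock-free, so nothing serialises on a single thread - whether that's actually feasible given what the seeder currently needs to do is exactly what I'd like @rsmith's opinion on.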