Computing SIFT2 weights with multiple .tck files

Is it currently possible to run tcksift2 on a collection of .tck files (as opposed to a single one)? If not, would that be something worth considering for future versions of MRtrix?

The multiple .tck files are obtained by running tckgen with the same parameters and on the same dataset; it is just faster and less risky to run tractography in smaller parallel batches, rather than a single run with 10M streamlines. I could of course (as I currently do) merge all these .tck files before computing SIFT2 weights, but that requires twice as much disk space. As far as I can see though (admittedly, I wear glasses…), there is no reason why this should be required, is there?

Please correct me if I am wrong, but it would also be incorrect to compute SIFT2 weights on each individual .tck file independently, because the weight obtained for any particular tract depends on all the other tracts considered (i.e. those weights depend on the global tract density), is that right? That is what basic tests on my side seem to indicate anyway.

The currently recommended approach is what you’re already doing: merge them before feeding the result to tcksift2. That does indeed entail duplicating the data, so I can certainly appreciate the motivation. It wouldn’t be difficult to support multiple input files in tcksift2, but it would complicate its interface (and we’d likely rapidly be expected to replicate this for other commands too). My preference here would be to (finally) implement piping for streamlines data (previously discussed here), which would immediately allow this kind of operation, and many others besides…

Thanks a lot for the quick reply; I am always impressed with how fast one gets an answer from you or Rob on this forum, kudos really!

After reading through the GitHub issue, I understand that piping streamlines would solve several problems in one go – it would certainly solve the one I raised. And I appreciate that providing this functionality for one tool would soon lead to requests to extend others too.

I wish I had a better mind-map of the underlying code in order to make a concrete suggestion / contribution, but my posting of the memory-map code last week was not completely innocent. If the need to handle very large numbers of streamlines (possibly across multiple files) without blowing up the host memory is common to several desirable enhancements, then perhaps memory mapping combined with a dynamic cache could result in a data structure that would completely abstract this problem within the existing code. Does that sound like a viable option?

I would probably need a few pointers on what sort of interface this data structure should implement, but I would be happy to build a prototype if that could help determine whether or not this could work.
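For concreteness, here is the sort of interface I am imagining – just a rough sketch on my side, with all names hypothetical and no claim that it matches the existing MRtrix classes:

```cpp
// Hypothetical interface only: a container presenting streamlines from one or
// more .tck files as if they were all in memory, while internally relying on
// memory-mapping and/or a dynamic cache to keep actual RAM usage bounded.
#include <cstddef>
#include <string>
#include <vector>

struct Point { float x, y, z; };
using Streamline = std::vector<Point>;

class StreamlineStore {
  public:
    // register one or more files; nothing is loaded eagerly
    explicit StreamlineStore (const std::vector<std::string>& tck_files);

    // total number of streamlines across all registered files
    std::size_t size() const;

    // random access: fetch streamline 'i', possibly paging data in behind the scenes
    Streamline operator[] (std::size_t i) const;

    // sequential access, so existing "read the file as a stream" code keeps working
    bool next (Streamline& out);
};
```

The point being that the caching / memory-mapping machinery would stay hidden behind such an interface, so calling code would not need to care whether the data came from one file or twenty.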

Thanks! It is a labour of love…

I knew there must have been an ulterior motive… :japanese_ogre:

It most certainly does – as @rsmith mentioned, it’s certainly an idea that’s come up before. I’d personally implement this slightly differently, using actual memory-mapping (which we use to access images). It means the OS is responsible for managing the caching, and makes the implementation relatively trivial. But that doesn’t mean it’s necessarily something we should do:

  • Memory-mapping (whether using our own caching or preferably relying on the OS’s virtual memory management system – i.e. using mmap()) allows you to access a much larger file than can fit in physical memory (there’s a minimal sketch of what I mean after this list). That’s fine, but on a constrained-memory system, it does mean that relatively random access to streamlines will require a lot of seeks and reads from the drive or over the network (depending on where the files are stored). This can drastically slow down processing. In contrast, on a system with enough RAM, even if you explicitly manage your cache and drop streamlines constantly, the file data will nonetheless remain in the OS’s page cache, making subsequent access to that streamline’s data near-instantaneous. The difference in performance between these two situations is enormous: the memory-constrained system will be completely IO-bound, making it orders of magnitude slower than a system with enough RAM; a process that might take hours on the larger system would then be completely unfeasible on the memory-constrained one – it’ll run, but it would take days/weeks to complete.

    These are the kinds of issues we grapple with in applications like fixelcfestats, which require lots of RAM to run: we’ve thought about somehow breaking up the processing to reduce memory requirements, but it’s almost impossible to do so without a massive impact on performance (and it already takes long enough…). Same with SIFT / SIFT2.

  • The second reason is more mundane: what applications do we have that might benefit from the ability to read streamlines in a random-access fashion? All our applications are designed to read a stream of data from the file, and to process streamlines in the order they appear on file. This allows the OS to prefetch, minimises disk seeks, generally maximises throughput, and keeps memory consumption to a minimum. Before we start looking into an approach like this, we’d need to come across an application where this way of doing things was a problem, and where random access would be better. The closest to this is probably SIFT/SIFT2, but there the streamlines are converted internally, as they’re being read, to a different representation more suitable for processing – so there’s no need to access the raw streamlines data after that anyway (although it does two passes over the data if I recall correctly). If you have specific applications in mind where this is important, by all means let us know and we can discuss it.
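Just to illustrate what I mean by relying on the OS’s virtual memory management: a minimal, generic sketch (not MRtrix code – the details are purely illustrative) of mapping a large file and letting the kernel handle all the caching:

```cpp
// Minimal sketch: map a (potentially huge) file read-only and access its bytes
// as if it were an in-memory array. The kernel pages data in on first access
// and keeps it in the page cache if there is enough RAM; on a memory-constrained
// system, scattered accesses turn into repeated reads from the drive instead.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main (int argc, char* argv[]) {
  if (argc != 2) { std::fprintf (stderr, "usage: %s file\n", argv[0]); return 1; }

  int fd = open (argv[1], O_RDONLY);
  if (fd < 0) { std::perror ("open"); return 1; }

  struct stat st;
  if (fstat (fd, &st) < 0) { std::perror ("fstat"); return 1; }

  void* addr = mmap (nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  if (addr == MAP_FAILED) { std::perror ("mmap"); return 1; }

  // touch every byte: the kernel handles paging and eviction transparently
  const unsigned char* data = static_cast<const unsigned char*> (addr);
  unsigned long sum = 0;
  for (off_t i = 0; i < st.st_size; ++i)
    sum += data[i];
  std::printf ("byte sum: %lu\n", sum);

  munmap (addr, st.st_size);
  close (fd);
  return 0;
}
```

The performance caveats above apply regardless of who manages the cache: it’s the access pattern, not the mapping mechanism, that determines whether the drive becomes the bottleneck.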

Just my 2 cents on the matter…

By the way, if you’re interested in contributing to the development, I encourage you to head over to our GitHub site, take a look through the list of issues, etc. – it’s probably a better place to discuss these ideas than this forum. Also, feel free to rummage through our API docs if you’re thinking of diving into the code…


I could of course (as I currently do) merge all these .tck files before computing SIFT2 weights, but that requires twice as much disk space.

It’s only disk space :man_shrugging: Plus if you don’t need to use the original track files individually, you can just delete them upon performing the concatenation.

Please correct me if I am wrong, but it would also be incorrect to compute SIFT2 weights on each individual .tck file independently,

It’s not strictly “incorrect”, just not ideal. By doing so you are assuming that each subset is sufficiently dense both to reasonably sample all possible reconstruction trajectories, and to provide enough granularity for SIFT2 to correct the gross reconstruction density biases in each subset individually; you’re then simply concatenating multiple results together in order to increase the total streamline count. The fit is just about guaranteed to be inferior to using all track data at once, though. I also can’t rule out the possibility of some systematic bias (or lack of correction thereof) that becomes additive across subsets, such that the quality of the correction ends up inferior, in ways that are hard to predict, to what processing the entire tractogram as a whole would give; but I don’t really have a sense of the likely nature or possible magnitude of such effects.

After reading through the GitHub issue, I understand that piping streamlines would solve several problems in one go – it would certainly solve the one I raised. And I appreciate that providing this functionality for one tool would soon lead to requests to extend others too.

It’s worth clarifying that my intention for streamlines data piping was not to use memory-mapping of any sort, but to literally pass the vertex data in binary form from the stdout of one process to the stdin of another, hence obviating the need to store the intermediate data on disk. An intrinsic limitation of such an approach is that random access is not just slow but impossible. I’m certainly not against alternatives; that’s just what I have in mind right now.

I am always impressed with how fast one gets an answer from you or Rob on this forum, kudos really!

Thanks! It is a labour of love…

Speak for yourself; @alan-connelly chained me to my desk years ago and nobody is responding to my cries for help…

The closest to this is probably SIFT/SIFT2, but there the streamlines are converted internally, as they’re being read, to a different representation more suitable for processing – so there’s no need to access the raw streamlines data after that anyway (although it does two passes over the data if I recall correctly)

No, it should only be the one; the exception is tcksift producing the output tractogram, but that could theoretically instead write a vector of ones and zeroes that tckedit could then use to retain / remove streamlines.

Thanks a lot for these amazing answers :slight_smile:

It’s only disk space :man_shrugging:

Sure – and it is fine as long as we are talking about a single subject. But if we are talking about the HCP, or similar large studies, I don’t think the additional space required deserves a shrug; we are talking about dozens of extra TB, which should logically be entirely unnecessary.

When considering large datasets:

  • Merging all .tck files into a larger one can only be done serially (parallel execution would blow up both disk space and I/O bandwidth).
  • Additional tractographies, e.g. to increase the “density”, require the duplication of very large files (typically 10GB+) for each subject, again only serially.
  • And with such large files, the risk associated with bit-corruption becomes a real problem (i.e. having to recompute all streamlines rather than only a small subset). So deleting the smaller .tck files is not without risk.

There are of course many potential issues when dealing with large datasets, but I think that being able to handle lists of .tck files for tcksift2 and tck2connectome would be particularly useful in that case, because “It’s only disk space” does not apply then IMO.

After looking into the source code, I would like to write an extension of MR::DWI::Tractography::Reader, similar to the current implementation of MR::DWI::Tractography::Editing::Loader (a rough sketch follows below).

The difficult parts that I can foresee are:

  • Consensus of properties across .tck files. This is pretty much already implemented in the command tckedit.
  • Handling of the associated weights file. This could in theory be handled either with one weights file associated with each .tck file, or with a single weights file for all .tck files. I think the first option is the easiest to implement, but the second might be more practical, because commands called with a list of .tck files (e.g. tcksift2) would still produce a single output file.

With regards to adapting the existing commands, I think this could be done with minimal disruption by accepting files with a “new” extension .lst, which would be parsed as a vector of filenames.
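To make this a bit more concrete, here is a rough sketch of the wrapper I have in mind, which would be fed the filenames parsed from such a .lst file. Note that the Reader / Streamline / Properties signatures below are assumptions on my part, and would need to be adapted to the actual classes:

```cpp
// Rough sketch only: present several .tck files as a single sequential stream
// of streamlines, in the spirit of Editing::Loader. The template parameters
// stand in for the real MRtrix types; the assumed interface is a constructor
// taking (filename, properties) and a bool operator() filling a streamline.
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

template <class ReaderType, class StreamlineType, class PropertiesType>
class MultiFileReader {
  public:
    MultiFileReader (const std::vector<std::string>& filenames) :
        filenames (filenames), current (0) {
      open_next();
    }

    // mirror the single-file Reader: fill 'tck', return false once all files are exhausted
    bool operator() (StreamlineType& tck) {
      while (reader) {
        if ((*reader) (tck))
          return true;
        open_next();   // current file finished; move on to the next one
      }
      return false;
    }

    const PropertiesType& get_properties () const { return properties; }

  private:
    std::vector<std::string> filenames;
    std::size_t current;
    PropertiesType properties;   // consensus of per-file properties would be checked / merged here
    std::unique_ptr<ReaderType> reader;

    void open_next () {
      if (current < filenames.size())
        reader.reset (new ReaderType (filenames[current++], properties));
      else
        reader.reset();
    }
};
```

Anything that currently consumes a single Reader could then consume this wrapper unchanged; the property-consensus and weights-file questions above would be handled inside it.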

Would you have any concerns about this approach? I just forked the repository and am planning the implementation; should I create a separate branch for this, or can this be done at the time of the PR?

Hi @jhadida,

Great to hear that you want to get stuck in. However, in this instance, my preferred way to handle this issue remains what I’d stated earlier:

While contributions are very welcome, it’s best to ensure the solution proposed is in line with expectations. I recommend you start an issue on GitHub (this isn’t the best forum for issues related to development) to discuss the best approach with the other developers, and make sure everyone is aware of and supportive of whatever solution is proposed.

In terms of the particular problem we’re discussing, my preferred option would be to support streamline piping and use tckedit (or a dedicated tckcat if that’s felt desirable) to feed the data into whatever application is desired.

In terms of the specifics of handling the PR: forking is the right approach, and there’s no need for a separate branch – we can merge your master as-is in the PR. However, I’d strongly recommend using dev as your base branch: this is where all non-bug-fix changes go, and it’ll hopefully become master soon enough anyway…

As suggested, I opened a new issue on GitHub about this.

I have also implemented the solution that I suggested above in this fork. This is a fork from the master branch, not the dev branch; sorry about this, but I had started implementing before I saw your message.

As summarised in the issue on GitHub, there are 3 main commits (and one very minor one) which implement the behaviour that I mentioned above. I would really appreciate your feedback on this, and perhaps we can go ahead with a PR on a separate branch if you think this is ready.