Suggestion for tckedit

wamigy · February 25, 2017, 2:20pm

Hello MRtrix devs,

First of all thanks again for your wonderful work and your efforts for an efficient implementation of your tools. Everything works like a charm.

However, if I may, I have a suggestion to reduce processing time, especially when dealing with very large sets of fibers. I generate big .tck files (>100M tracks) and, of course, I have to split each file for further processing because of RAM constraints.

I’m using tckedit to do so, but it takes a while to generate smaller packets. Let’s say I have 100M fibers and would like to generate 100 packets of 1M fibers. I have to call tckedit 100 times, varying the -skip option each time, so the big .tck file is scanned repeatedly (in total 99M+98M+97M+96M+… = 4950M fibers scanned without writing).
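The scan count quoted above can be checked with a few lines of Python. This is only the arithmetic, under the assumption that the call producing chunk k has to read past the first k million streamlines before it reaches the ones it writes:

```python
# Cost of extracting 100 chunks of 1M streamlines with repeated -skip calls.
# Assumption: the call for chunk k reads past k * chunk_size streamlines
# before it reaches the streamlines it actually writes.
n_chunks = 100
chunk_size = 1_000_000

skipped = sum(k * chunk_size for k in range(n_chunks))
written = n_chunks * chunk_size

print(f"scanned without writing: {skipped // 1_000_000}M")
print(f"written: {written // 1_000_000}M")
```

Under that assumption the skipped streamlines outnumber the written ones by roughly 50 to 1, which is why a single-pass split is attractive.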

Is there a way to split the 100M .tck file into 100 smaller 1M .tck files without scanning the original file numerous times? If not, do you think an option like “-divide N” could be added to tckedit to split a .tck file into N smaller files?

Thank you again!

Best regards,
W.

rsmith · February 27, 2017, 6:15am

Hi @wamigy,

Yes, that’s an entirely reasonable suggestion. If you’d like, you can post a new issue on GitHub (or I can add it myself if you don’t have an account). Indeed if anybody has ever wanted to try their hand at contributing to MRtrix3, this would be a nice easy entry point!

I’m not sure whether such functionality would go into tckedit, since it would introduce further ambiguity in processing order: e.g. if you want to split into chunks of 1M, but you also specify an -include ROI, should the files be split for each 1M input tracks, or for each 1M output tracks? It might be more at home in tckconvert, or maybe even in its own command, tcksplit.
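To make the ordering ambiguity concrete, here is a toy sketch in Python. Nothing in it is MRtrix3 code: the "tractogram" is just a range of integer IDs, `keep()` is a hypothetical stand-in for an -include ROI test, and `chunks()` is an illustrative single-pass splitter.

```python
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items from any iterable, in one pass."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def keep(track):
    # Hypothetical stand-in for an -include ROI test: keep even-numbered tracks.
    return track % 2 == 0


tracks = range(10)  # toy "tractogram": track IDs 0..9

# Option A: split every 4 *input* tracks, then filter each chunk.
split_then_filter = [[t for t in c if keep(t)] for c in chunks(tracks, 4)]

# Option B: filter first, then split every 4 *output* tracks.
filter_then_split = list(chunks((t for t in tracks if keep(t)), 4))

print(split_then_filter)  # [[0, 2], [4, 6], [8]]
print(filter_then_split)  # [[0, 2, 4, 6], [8]]
```

Both options read the input only once, but they produce different numbers of files with different sizes, so a combined command would have to pick one behaviour and document it.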

Though I will question this bit first:

> However, if I may, I have a suggestion to reduce processing time, especially when dealing with very large sets of fibers. I generate big .tck files (>100M tracks) and, of course, I have to split each file for further processing because of RAM constraints.

Apart from certain commands (fixelcfestats, tcksift, tcksift2, any others?), the size of the input track file should not significantly influence the RAM usage of the command: tracks are processed as a stream of data, and don’t need to be stored in memory in their entirety. Moreover, many commands use multi-threading, so on a single system, splitting the tracks across multiple files and running the same command multiple times won’t actually be any faster than running one instance with multi-threading. So splitting / processing / merging track files may even be slower… Is there a specific use case that you’re having performance issues with?

Cheers
Rob

jdtournier · February 27, 2017, 9:05am

I fully agree with @rsmith here, and I’d add that the commands referred to as needing to load the whole tracks file (e.g. fixelcfestats, tcksift, tcksift2) do so out of necessity - they need access to all the streamlines (or data derived from them) during processing for correct execution. If the work could be parallelised or split in some way to reduce RAM usage, I think we would already be doing it (unless it really massively deteriorated performance). So if you’re thinking of splitting the tracks files for use with any of these commands, you will get suboptimal results from merging the results, compared to processing the full tracks file as one (with all the RAM requirements that entails).

Cheers,
Donald.

wamigy · February 27, 2017, 9:54am

Thanks for your answers.

To be more specific, small packets are easier to handle with MATLAB for further processing. In my pipeline, this step would be the very last of all the MRtrix commands (and yes, I fully admit I might be the only one interested in such a feature). I also understand that introducing a command called tcksplit may be misleading for users.

jdtournier · February 27, 2017, 10:38am

Maybe it would be better to modify or implement a different Matlab function to allow loading streamlines one at a time…? That should be relatively simple, we’d just need to split the current read_mrtrix_tracks.m function into 3 functions: open() / read_next() / close() - maybe a candidate for Matlab’s OOP functionality?
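As a rough illustration of the open() / read_next() / close() idea, here is a minimal sketch in Python rather than MATLAB. The class name `TckReader` and its methods are hypothetical, not part of MRtrix3. It assumes the .tck layout as I understand it: a text header starting with "mrtrix tracks" and ending with "END", a "file: . <offset>" entry giving the byte offset of the data, Float32 vertex triplets, a NaN triplet closing each streamline, and an Inf triplet closing the file.

```python
import math
import struct


class TckReader:
    """Hypothetical streaming reader: yields one streamline at a time from a .tck file."""

    def __init__(self, path):
        self._f = open(path, "rb")
        try:
            self.header = self._read_header()
            dtype = self.header.get("datatype", "")
            if dtype not in ("Float32LE", "Float32BE"):
                raise ValueError(f"unsupported datatype: {dtype!r}")
            order = "<" if dtype.endswith("LE") else ">"
            self._triplet = struct.Struct(order + "3f")
            # Assumed header entry "file: . <offset>": data start at <offset> bytes.
            self._f.seek(int(self.header["file"].split()[-1]))
        except Exception:
            self._f.close()
            raise

    def _read_header(self):
        if self._f.readline().strip() != b"mrtrix tracks":
            raise ValueError("missing 'mrtrix tracks' magic line")
        header = {}
        while True:
            line = self._f.readline()
            if not line:
                raise ValueError("header has no END line")
            text = line.decode("ascii", "replace").strip()
            if text == "END":
                return header
            key, _, value = text.partition(":")
            header[key.strip()] = value.strip()

    def read_next(self):
        """Return the next streamline as a list of (x, y, z) tuples, or None at the end."""
        points = []
        while True:
            raw = self._f.read(self._triplet.size)
            if len(raw) < self._triplet.size:
                return None  # ran out of data; any unterminated streamline is dropped
            x, y, z = self._triplet.unpack(raw)
            if math.isinf(x):  # assumed end-of-file marker
                return None
            if math.isnan(x):  # assumed end-of-streamline marker
                return points
            points.append((x, y, z))

    def close(self):
        self._f.close()

    def __iter__(self):
        # Call read_next() repeatedly until it returns None.
        return iter(self.read_next, None)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


# Usage (the path is hypothetical):
# with TckReader("tracks.tck") as reader:
#     for streamline in reader:
#         print(len(streamline))
```

Only one streamline is held in memory at a time, so RAM usage does not depend on the size of the file. Reading 12 bytes per call keeps the sketch short; a practical version would read larger blocks.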

wamigy · February 27, 2017, 1:24pm

OK, yes, you’re right. Actually, I’ve already written a MATLAB function for this task; the suggestion was just “in case” you saw some interest in it or someone else needed it.

Thank you again!

jdtournier · February 28, 2017, 8:19am

Understood. Great to have suggestions like this, please keep them coming. In this case, however, I think your own needs would be better addressed by modifying the MATLAB functions to allow the data to be streamed (much like most MRtrix3 commands already operate). That said, there may be other legitimate use cases where splitting a large tractogram as you suggest would come in handy - if anyone has such a need, please make yourselves heard. It’s not difficult to implement, and we’d be happy to do so if we felt there was sufficient demand.