Hi, I wanted to test out the newest version of fixelcfestats, which seemed like it would be sleeker given that fixel-fixel connectivity doesn’t need to be computed each time the command is executed. When I run my model in the older version, it takes about 1 hr/contrast (to calculate fixel-fixel connectivity and run the permutations for that contrast). I reran the same analysis in the new version of MRtrix3, allocating the same 40G of memory that I always do for fixelcfestats (with all the contrasts represented in a single matrix), and the process has been crawling through the first set of permutation calculations for over four days. The process isn’t “hung”: the progress bar has been incrementally creeping forward. Any ideas why this might be? Full info below…
fixelcfestats -force -nshuffles 1000 $DIR/fd_smooth/ $DIR/scripts_fixel/fixel_1208_fd.txt $DIR/scripts_fixel/design_1208_ones_sex_age_raceAA_raceOT_qc_wrat.txt $DIR/scripts_fixel/contrast_matrix_0000000b.txt $DIR/population_template/matrix $DIR/fixelcfe_output_smooth/fixelcfestats_1208_ones_sex_age_raceAA_raceOT_qc_wrat_fdB_2 -nthreads 2
0 0 0 0 0 0 1
0 0 0 0 0 0 -1
0 0 0 0 0 1 0
0 0 0 0 0 -1 0
0 0 0 0 1 0 0
0 0 0 0 -1 0 0
0 0 0 1 0 0 0
0 0 0 -1 0 0 0
0 0 1 0 0 0 0
0 0 -1 0 0 0 0
0 1 0 0 0 0 0
0 -1 0 0 0 0 0
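For anyone wanting to sanity-check the contrast file: the 12 × 7 matrix above pairs a positive and a negative one-tailed contrast for each of the six non-intercept design columns, taken in reverse column order. A minimal NumPy sketch that reproduces it:

```python
import numpy as np

# Reconstruct the 12 x 7 contrast matrix listed above: for each of the
# six non-intercept design columns (in reverse order), one +1 row and
# one -1 row, i.e. a positive and a negative one-tailed test.
p = 7
rows = []
for col in range(p - 1, 0, -1):      # design columns 7 down to 2 (0-indexed 6..1)
    for sign in (1, -1):
        row = np.zeros(p, dtype=int)
        row[col] = sign
        rows.append(row)
C = np.vstack(rows)                  # shape (12, 7), matching the listing
```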
## Output after running for over four days:
fixelcfestats: [WARNING] existing output files will be overwritten
fixelcfestats: Number of fixels in template: 173239
fixelcfestats: Importing data from files listed in “fixel_1208_fd.txt” as found relative to directory “/gpfs/ysm/scratch60/pittenger/rgg27/PNC/fixel//fd_smooth/”… …done
fixelcfestats: Number of inputs: 1208
fixelcfestats: Number of factors: 7
fixelcfestats: Design matrix condition number: 13.7053
fixelcfestats: Number of hypotheses: 12
fixelcfestats: [WARNING] A total of 5180 fixels do not possess any streamlines-based connectivity; these will not be enhanced by CFE, and hence cannot be tested for statistical significance
fixelcfestats: Loading fixel data (no smoothing)… [==================================================]
fixelcfestats: Calculating basic properties of default permutation… [========================================]
fixelcfestats: Outputting beta coefficients, effect size and standard deviation… [=================================================]
fixelcfestats: Running GLM and enhancement algorithm for default permutation… [======================================]
fixelcfestats: Running permutations… [========================================
I’ve not observed this kind of behaviour, and there’s no particular reason why the command should be running that much slower than the previous code. I’d have expected the execution time to be less than 12 hours, given that the total amount of processing should be less than your prior 12 separate executions, each of which individually required building the fixel-fixel connectivity matrix. The internals of the GLM have changed quite a lot, but I didn’t observe any major slowdowns in my own testing.
I can only generate a couple of hypotheses consistent with the information provided (there are new capabilities that can slow down execution, but they’re not relevant here, given their absence from your terminal output):
1. Your number of inputs is much larger than anything I’ve tested on. The change in empirical null distribution generation from Manly to Freedman-Lane involves an additional 1208x1208 matrix multiplication per shuffle compared to the old code, which won’t be cheap.
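To illustrate where that extra cost comes from: in one common formulation of Freedman-Lane shuffling, the permutation is applied to the nuisance residuals rather than the raw data, which means applying an n × n residual-forming matrix under every shuffle. A rough NumPy sketch (hypothetical design values and a placeholder fixel count; only the matrix shapes matter, and this is not MRtrix3’s actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1208          # number of inputs, as in the post
nfix = 200        # placeholder fixel count (real data: ~173k fixels)

# Hypothetical nuisance regressors for one particular contrast
Z = rng.standard_normal((n, 6))
Y = rng.standard_normal((n, nfix))

# Residual-forming matrix of the nuisance-only model: Rz = I - Z Z^+
Rz = np.eye(n) - Z @ np.linalg.pinv(Z)

# Freedman-Lane shuffles the nuisance residuals: row-permuting Rz
# implements P @ Rz, so each shuffle costs an (n x n) @ (n x nfix)
# product -- the extra work relative to Manly-style permutation,
# which only reorders the rows of Y.
perm = rng.permutation(n)
Y_shuffled = Rz[perm] @ Y
```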
2. The old code generated t-values in batches, whereas the new code does the whole matrix multiplication for all fixels in one go. It’s possible that this was fine with the data I tested on, but that with your very large number of inputs the matrix data become too large, and cache misses slow down execution. I can probably re-introduce some manual buffering to the execution, which might help (and if I do, I now know who to ask to test it). But I’d have hoped that Eigen would make the appropriate decisions here…
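For what it’s worth, the kind of manual buffering being described could look something like this hypothetical NumPy sketch (placeholder sizes, not the actual C++/Eigen code): the blocked computation produces identical betas but only touches a cache-sized slab of the data matrix at a time.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, nfix = 1208, 7, 5000        # placeholder sizes
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, nfix))

pinvX = np.linalg.pinv(X)

# All fixels in one go: one big (p x n) @ (n x nfix) product
beta_all = pinvX @ Y

# Hypothetical manual buffering: same result, computed block by block,
# so each product works on a small, cache-friendly slab of Y
block = 512
beta_blocked = np.empty_like(beta_all)
for start in range(0, nfix, block):
    sl = slice(start, min(start + block, nfix))
    beta_blocked[:, sl] = pinvX @ Y[:, sl]
```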
3. The fixel-fixel connectivity matrix should either be memory-mapped or, if that is not possible, explicitly loaded into RAM. There’s some chance that in your situation these data are not being explicitly loaded into RAM at the commencement of execution, and access to them on disk is slow: this would make the CFE portion take a long time to execute, due to delays in acquiring the fixel connectivity information for each fixel to be enhanced.
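As a hypothetical illustration of that scenario (a NumPy stand-in, not MRtrix3’s actual implementation): a memory-mapped array only pulls pages from the filesystem on first access, so if the backing storage is slow (e.g. networked scratch space on an HPC system), every such access during enhancement stalls, even though the process still creeps forward.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a fixel-fixel connectivity matrix: write a
# large array to disk, then memory-map it instead of loading it into RAM.
path = os.path.join(tempfile.mkdtemp(), "connectivity.dat")
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(2000, 2000))
writer[:] = 1.0
writer.flush()

# Pages of a read-mode memmap are faulted in from the filesystem on
# first access; on a slow filesystem each fault delays the consumer,
# here simulated by reading one fixel's connectivity row.
conn = np.memmap(path, dtype=np.float32, mode="r", shape=(2000, 2000))
row_total = float(conn[123].sum())
```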
Given the length of the permutation progress bar I’m hoping that execution has completed in the time since you made the post. But thanks for flagging, and I’ll have to think for a little while about where effort needs to be invested here.
Hi Rob, thank you for the thoughtful answer. I do wonder if maybe #3 is occurring on my HPC system. There are ~180k fixels in my input data, so nothing too crazy, right? But yes, happy to stress-test things if you make any changes.
I had to cancel the original processes because they were going to time out on the cluster, but I was able to speed things up considerably by reconfiguring the memory/CPU allocation (I bumped CPUs up to “-nthreads 10” and allocated less memory per CPU). Now it’s taking a couple of days instead of more than a week. There is probably still something odd going on; I’ll continue to tweak, and I should probably also chat with our IT folks to see if they have ideas.
In my own recent FBA experimentation I have myself been observing that some specific invocations take longer than I would otherwise expect, and there’s some resemblance to your report here. Specifically, a run that performs N t-tests in a single command invocation seems to take more than N times the duration of a run that performs one t-test. But I’m going to have to do a lot of testing to quantify this and figure out what, if anything, might be going wrong.
I’m hoping you might be able to offer more insight based on your own experience, so that I can minimise wastage of both my own time and CPU cycles; but it’s really important to try to isolate individual variables:
1. When you increased the number of threads, did the speed of execution increase in essentially direct proportion to the number of threads?
2. When you allocated “less memory per CPU”, was the total amount of memory allocated across all CPUs definitely equivalent to that of your earlier attempts?
3. Rather than comparing pre-3.0.0 code with one or two t-tests to 3.0.1 with 12 tests, did you ever attempt using 3.0.x but only performing one or two t-tests per invocation?
4. When you performed your tests pre-3.0.0, did you nevertheless include the same number of factors in the design matrix? I.e. rather than a design matrix with 7 columns being used for all hypothesis tests, with each invocation just extracting the beta coefficient for the column of interest, were you perhaps instead using design matrices that had just two columns: that first column (presumably a global intercept) and the column of interest for just that hypothesis?
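To illustrate the point about the hypothesis itself changing (purely synthetic data, not your design): when the column of interest is correlated with a nuisance regressor, the beta estimated from a 2-column design differs from the beta for that same column in the full design, because the reduced model does not adjust for the nuisance terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500  # synthetic subjects, purely illustrative

# Make the column of interest correlated with one nuisance regressor
nuisance = rng.standard_normal((n, 5))
interest = 0.5 * nuisance[:, 0] + rng.standard_normal(n)
y = 0.3 * interest + nuisance @ np.arange(1.0, 6.0) + rng.standard_normal(n)

X_full = np.column_stack([np.ones(n), nuisance, interest])   # 7 columns
X_small = np.column_stack([np.ones(n), interest])            # 2 columns

beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0][-1]
beta_small = np.linalg.lstsq(X_small, y, rcond=None)[0][-1]
# beta_small absorbs the effect of the correlated nuisance term, so the
# two designs are estimating (and testing) different quantities.
```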
In a test I just did, tripling the number of design matrix columns increased the execution time for a single hypothesis by a factor of 12. So it might not be the number of hypothesis tests that inflated the execution time at a rate greater than expected, but the increased complexity of the model.
(Note that if this happens to be the case, the actual hypothesis being tested is different between the two usages).
On a more technical note, this long execution time for permutation inference using more complex models, and a solution to it, is discussed in this manuscript; I’d like to adopt that approach at some point, but unfortunately it’s not very high on the priority list…
Just replying to let you know this is on my radar. I am just returning to this project after some time away, so I hope to be able to help you answer these questions by the end of the month!