Response function for group analysis

Dear Mrtrix experts,

I would like to create an atlas from a group of N controls. All the acquisitions have the same parameters (number of directions, b-0 value, etc…).
Do I need to estimate the response function for each subject with the dwi2response command and then run CSD with the N different response functions? Or is it better to use the same average response function for everyone (and is there a command to average the response functions)?

Thanks in advance!
Best regards

taken from the docs:

…using the same response function when estimating FOD images for all subjects enables differences in the intra-axonal volume (and therefore DW signal) across subjects to be detected as differences in the FOD amplitude (the AFD). To ensure the response function is representative of your study population, a group average response function can be computed by first estimating a response function per subject, then averaging with the script:

foreach * : dwi2response tournier IN/dwi_denoised_preproc_bias_norm.mif IN/response.txt
average_response */response.txt ../group_average_response.txt


I second that. The manual doesn’t elaborate much on why, and “to ensure the response function is representative of your study population” is slightly vague. I reckon you’d also be fine with, for instance, the average response of only your controls. Or even only your patients. Or even just one subject. But the more important point to emphasise is that you want to use just 1 response (or set of tissue responses if doing multi-tissue CSD) for all subjects when doing CSD. In a way, the response function is the unit of the FOD that results from CSD: amplitudes of the FOD are expressed in a unit called something like “times your response function”. When doing any subsequent quantitative analysis across your subjects, it’s important that their FODs are expressed using the same units. You can’t compare apples and oranges!
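To make the “unit” idea concrete, here is a toy sketch (not MRtrix code, and not how CSD is actually implemented): a single-fibre signal is modelled as amplitude × response, so the fitted amplitude comes out in multiples of whichever response was used. All names and numbers are hypothetical.

```python
# Toy illustration only: real CSD operates on spherical harmonic coefficients.

def fit_amplitude(signal, response):
    """Least-squares amplitude a such that signal ≈ a * response."""
    num = sum(s * r for s, r in zip(signal, response))
    den = sum(r * r for r in response)
    return num / den

# Two hypothetical responses with the same shape but different size.
response_a = [1.0, 0.6, 0.3]
response_b = [2.0, 1.2, 0.6]

# One and the same measured signal.
signal = [3.0, 1.8, 0.9]

amp_a = fit_amplitude(signal, response_a)  # expressed in units of response_a
amp_b = fit_amplitude(signal, response_b)  # expressed in units of response_b

print(amp_a, amp_b)
```

The same signal yields two different amplitudes purely because the “unit” changed, which is why amplitudes fitted with different responses cannot be compared directly.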

Dear Thijs and Max,

Thanks for this perfect explanation of why to average the response function. My turn to second Felix (friend and French colleague) on problems that we encountered when using the same response in a group of subjects.
In our experience (mainly focused on diffusion gradient schemes and TWI adaptation for computing cranial and peripheral nerve atlases), using an average response text file instead of the one obtained from each individual dataset gave the tractogram a “denoised” appearance (i.e. on visual analysis of the TWI map, whatever the contrast type (FOD amplitude, length…) or use of super-resolution properties).

In other words, small distal tracts or nerves could “disappear” when averaging the response function, which could be problematic at the group level, depending on the disease model.
I assume this could be less important for brain white matter fascicles…but any help would be appreciated :slight_smile:

Best, Arnaud

A bit late to chip in with my 2 cents, but better than never, I guess…

OK, what this sounds like to me, is that the FODs might have ended up scaled differently in different subjects, so that the effective threshold on tracking varies - this would indeed ‘hide’ smaller amplitude, more minor tracts in those subjects where the FODs are smaller than they would be when using the subject-specific response. Conversely, I’d expect messier tracking in those subjects where the FODs ended up larger than expected.

In my experience, the response is remarkably stable across subjects (assuming the same acquisition protocol, particularly the b-value, is used). What does change is the data scaling: that is determined by the coil loading, the scanner’s calibration, internal FFT scaling, etc. When using the subject-specific response, these global differences in scaling are inherently accounted for, since the response is derived from the same data, and ends up scaled to the same extent. If however this response is then averaged and used to process the same subjects without any attempt at adjusting the scaling in each subject’s raw data, then this will introduce differences in the scaling of the output FODs. This is the same issue that needs to be accounted for in fixel-based analyses, and is a sufficiently important topic that it has its own page in the documentation.
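The scaling argument above can be sketched with the same kind of toy model (again not MRtrix code). Each subject’s data and subject-specific response share a global gain; the group response is a plain element-wise mean. The gains, the tract amplitude and the cutoff are all hypothetical, and the “normalised” case assumes the gain is known, whereas in practice it has to be estimated by an intensity normalisation step.

```python
def fit_amplitude(signal, response):
    """Least-squares amplitude a such that signal ≈ a * response."""
    num = sum(s * r for s, r in zip(signal, response))
    den = sum(r * r for r in response)
    return num / den

# Hypothetical numbers throughout.
true_response = [1.0, 0.6, 0.3]   # identical underlying response in every subject
minor_tract = 0.12                # true amplitude of a small tract
cutoff = 0.10                     # fixed tracking threshold
gains = {"sub-01": 1.0, "sub-02": 0.7, "sub-03": 1.4}  # global data scaling

signals, own_responses = {}, {}
for sub, g in gains.items():
    own_responses[sub] = [g * r for r in true_response]
    signals[sub] = [g * minor_tract * r for r in true_response]

# Group response: element-wise mean of the subject-specific responses.
group_response = [
    sum(resp[i] for resp in own_responses.values()) / len(own_responses)
    for i in range(len(true_response))
]

own_fit, group_fit, normalised_fit = {}, {}, {}
for sub, g in gains.items():
    own_fit[sub] = fit_amplitude(signals[sub], own_responses[sub])
    group_fit[sub] = fit_amplitude(signals[sub], group_response)
    rescaled = [v / g for v in signals[sub]]  # assumes the gain is known
    normalised_fit[sub] = fit_amplitude(rescaled, group_response)
    print(sub,
          round(own_fit[sub], 3),
          round(group_fit[sub], 3),
          round(normalised_fit[sub], 3),
          "kept" if group_fit[sub] >= cutoff else "lost at cutoff")
```

With the subject-specific response every subject recovers the same amplitude; with the group response and no rescaling, the low-gain subject drops below the cutoff while the high-gain subject is inflated; with rescaled data the amplitudes are consistent again.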

If you were already performing some kind of subject-wise global intensity normalisation, then your experience is unexpected, and I’d like to figure out what the problem might be. But first we’d need to rule out the much simpler explanation above…

That sounds very interesting! I look forward to the results - will this be made available at some point…?

Hi everyone,

Following on from this conversation, I am interested in modelling subject-specific intrapair differences in monozygotic twins. Would the group average response function be recommended for such an approach as well?


Hi @emmanuelpua,

Yep, it certainly would be equally recommended. As I mentioned somewhere above in a reply, the important thing is mostly that you use just a single response function (or a single one per tissue, if performing multi-tissue CSD) for all subjects. Since all your twins are humans, you’re essentially just after a “human single-fibre white matter response for your particular scanner and acquisition protocol”. So as I mentioned above, that could in principle even just come from 1 (random) subject. But because it’d be weird to pick one at random, or for any particular reason, the easy thing is to just take the average one. If you’re talking about a severely diverse group, or e.g. a comparison between populations where one population is severely affected by neurodegeneration, then it may be wiser to use the average response of the healthy population only, rather than the average over all subjects. But even then, it’s probably not going to differ a whole lot from the overall average one… so no worries in practice either way.

So well, in summary: as long as you use a single response (or single set of tissue responses) for all subjects, you’re fine. That’s what allows you to compare those subjects’ CSD outcomes. Using different responses for different subjects renders any comparison problematic.


Thanks Thijs!


Hi everyone,

regarding this convo, would you suggest using one single RF even if only tractography and no quantitative analysis is to be done? Do you have experience on how much the RF changes across subjects and if this variation differs across b-val? Also, does FOD scaling affect tractography reconstruction?



I have some doubts regarding this topic as well. I always assumed that you only need an average response function or some kind of normalisation if you are interested in some measure derived from SIFT or SIFT2, am I right? For example, if I would like to use FA-weighted matrices (or matrices weighted by any metric), then I believe it should be fine to calculate each matrix independently and compare them; or if I’m interested in some graph metrics, they should be quite robust regardless of the response function used or the absence of normalisation, am I right? Thanks in advance!

Best regards,


Hi @Chiara_Maffei1,

It does indeed matter less if you’re not after quantitative analysis. However, if you’re still working with “a group of subjects”, in the sense that the goal is to compare or even just “do” something across them, and as long as they’ve of course been acquired using the exact same protocol, I’d still recommend just using one single unique (set of, in case of multi-tissue) response function. In my experience, the response functions vary very little across subjects in shape/contrast (not size, see below!); and if they do, it’s also due to data quality, the amount of certain tissues present, and in the end, the performance of the response function selection algorithm, which is not per se something that is uniquely valuable to a single subject. Also, the kinds of variations across subjects I’ve seen (which are very small indeed) seem to barely affect the CSD outcome at all. So there are no real worries, I’d say.

It sure does, since all our FOD-based tractography algorithms have an amplitude threshold to cut off streamlines ("-cutoff"). However, using a single (set of, in case of multi-tissue) response function is only half of the requirements to make sure that this doesn’t affect any consistency across subjects. The other half is mtnormalise, which accounts for the intensity differences that directly affect the size (amplitude) of the FODs.

That said, there is in a strange way something to be said for indeed using the response function(s) of the subject itself in a scenario where you’re really only after single-subject tractography; since the size of the response functions will also scale with the data, performing a CSD technique will actually normalise the data to its own scale up to a certain extent. So let’s say, e.g., in a clinical scenario where you perform tractography on individual subjects for their own sake (e.g. delineating a bundle for surgical workup), you can probably stick to just dwi2response on the individual subject and CSD using its own response function(s), and then run tractography directly with “known” good values for -cutoff that suit your scenario (e.g. a specific bundle of interest). But, really strictly speaking, if you want to establish a well controlled processing protocol for a given fixed acquisition protocol, I’d argue to compute (“average”) response function(s) once, based on (a group of) healthy subjects, and always use these for similar subjects (e.g. responses of healthy adult humans, to be used for adult humans in general, both healthy and with a condition / pathology), but also always follow up with mtnormalise, so your -cutoff thresholds can generalise in a consistent manner.

So well, in conclusion, if you’re going to do stuff “across subjects”, or even want to set up a well-controlled “standardised” processing pipeline to be used consistently, I’d recommend a fixed set of responses + mtnormalise. However, in practice, in some tractography scenarios, you could be fine with dwi2response per subject and CSD using their own response(s); and then the need for mtnormalise is in practice much less. The latter strategy is in another way also potentially a bit “safer”, since it’s “robust against unexpected changes in the acquisition protocol”. But of course, if you’re doing anything across a group of subjects, that should never, ever (ever, ever) happen. In a clinical scenario though, I’ve seen cases of that happening, for diverse reasons…

See my answer above; it depends on where you get your tractograms from, and how. Due to -cutoff, the amplitude of the FODs (and hence, normalisation issues of all kinds) does affect the outcome of your tractogram itself.

The FA metric itself is of course not affected; but that’s due to it being derived from the tensor model, which happens to model the ADC values, which happen to be derived from a “normalised” (by the b=0 image(s)) value.

Well yes, but then again no if that metric is apparent fibre density (from the FODs) itself.

So in conclusion, always check your entire pipeline for any uses of the FOD amplitude (or even shape, to be honest); if you rely on them being consistent (i.e. “normalised”), then it’d be recommended to use a fixed (set of) response function across your subjects, combined with mtnormalise.

Also, to emphasise this once more (can’t be emphasised enough :sweat_smile:), this doesn’t mean it has to be an “average” (set of) response function(s); just a single unique (set of) response function(s). In more and more scenarios I’m witnessing myself, I’m seeing value in deriving the response functions e.g. from healthy subjects only. But in practice, the difference is often very (very) little compared to using an average across “all” subjects (in a study that contains non-healthy subjects).

On a completely separate note, in studies related to development, I’d also get the response function(s) only from the most developed subject(s) in the spectrum/range that’s being studied; for other reasons. And also within those subjects, you’re after the response functions representing the most developed tissue, I’d argue. There are different ways of looking at this latter scenario though; several of which are “ok”, but they all mean something different…

But well, in conclusion here: a single (set of) response function(s) doesn’t per se strictly mean an “average” set of response functions. Just a single, unique one; so there’s a fixed point of reference to express the results of CSD techniques relative to.

Thanks so much everyone for this great thread. @ThijsDhollander you mentioned in your most recent reply here that

Just wondering if it would be possible to elaborate on this point?

More specifically, in our study we have a longitudinal sample of 8-14 year-olds with and without ADHD. Some researchers in our team are using the entire dataset, while others a subset (e.g., controls only, or one timepoint only). So this really raises two main questions:

  1. Based on this thread and our goals, would it be valid to create a single unique response function (or set of response functions for multi-tissue), and have everyone use this same response function regardless of the subset of data being studied?

  2. If so, would you suggest using controls at the most developed/final timepoint to create this response function (even if some studies may only investigate timepoint 1 or ADHD individuals, for example)? Or do you think averaging over all subjects and timepoints would be most appropriate here?

Really appreciate your assistance with this.



Hey Phoebe,

Hope you’re doing well. :relaxed:

Sure, up to some extent! So this is actually what I did in practice for example in this work. Some of the relevant insight into why you might want to calibrate the response functions, in particular the single-fibre white matter one, from the older / more developed subjects is also hidden in plain sight in this talk on the most recent version of the response function estimation algorithm, in particular the bit starting from 1:26. The diagram / sketch that I often used to talk about 3-tissue CSD (the one with the triangle) is useful to explain this. When you analyse your whole cohort and want to make the 3-tissue output (or derived metrics, e.g. FD) comparable, you want 2 things wrt response functions:

  1. Use the same unique triplet of WM-GM-CSF response functions for all subjects, as the response functions essentially represent the “units” of your resulting metrics later on. So to compare apples with apples, it’s all got to be expressed in the same units, aka the same triplet of WM-GM-CSF response functions.

  2. To make sure those metrics aren’t biased, you want the model to fit the data. Even more so, fit it equally well across your whole population, young and old. Here’s where the diagram in the talk is useful: 3-tissue CSD with non-negative compartments will be able to fit what is inside the triangle. The response function calibration, on the other hand, determines the corners of the triangle. So essentially, the triangle needs to be large enough to capture all signals later on.

Now, for very young subjects, i.e. when things still change a lot due to development, the most developed bits of WM will be to the far top left corner of that triangle. If you were to e.g. calibrate the response functions (for example’s sake) based on the youngest subjects specifically, the top left corner of the triangle might not extend far enough to capture the most developed bits of white matter in the older subjects. That would introduce a bias in the fit for the most developed bits of WM in the older subjects. In specific cases, this can lead to an inversion in the kind of pattern you’d expect in those bits. Older subjects might even appear “less” developed than younger ones due to these particular kinds of biases. That’s obviously not desired.

So calibrating the response functions for all subjects and using the average over all ages lands you somewhere halfway. The problem is then less severe compared to calibrating based on the youngest subjects only, but it’s still present for the same reasons. The solution then is to use the older / more developed end of the spectrum. In this way, most WM of most subjects will fit well, without substantial biases in the 3-tissue signal representation.
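The triangle argument can be illustrated with a purely geometric toy, under strong simplifying assumptions: tissue signals are reduced to points in a hypothetical 2D feature space, the three responses are the corners, and a voxel can be represented with non-negative fractions only if it lies inside the triangle. The coordinates below are invented for illustration and do not come from any real data or from the 3-tissue CSD implementation.

```python
def barycentric(p, a, b, c):
    """Weights (wa, wb, wc) with wa + wb + wc = 1 and wa*a + wb*b + wc*c = p."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb

def fits_without_bias(p, wm, gm, csf):
    """True if p is reachable with non-negative WM/GM/CSF fractions."""
    return all(w >= 0.0 for w in barycentric(p, wm, gm, csf))

# Hypothetical corners: WM sits towards the top left of the diagram.
gm, csf = (0.0, 0.0), (1.0, 0.0)
wm_oldest = (0.0, 1.0)      # WM response calibrated on the most developed subjects
wm_youngest = (0.1, 0.7)    # WM response calibrated on the youngest subjects

# A voxel of well-developed WM from an older subject.
mature_wm_voxel = (0.05, 0.9)

ok_with_oldest = fits_without_bias(mature_wm_voxel, wm_oldest, gm, csf)
ok_with_youngest = fits_without_bias(mature_wm_voxel, wm_youngest, gm, csf)
print(ok_with_oldest, ok_with_youngest)
```

With the WM corner taken from the most developed subjects the voxel lies inside the triangle; with the corner taken from the youngest subjects it would need negative fractions, which a non-negative fit cannot provide, so the fit is biased.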

So how much does this matter in practice? Well, I know very well from experience it does matter for neonatal age ranges. Be in touch privately if you need more details on that. However, for your specific scenario:

So that’s already quite a bit older than neonates or even babies for that matter. I also know from experience that at that point, it’s far far (far far) less of a worry, in terms of the points I described above. So that should hopefully already put you a bit at ease there. :slightly_smiling_face:

Yes, so that’s point 1 above. I can reassure you it’s perfectly possible and should work very well. As mentioned in point 1 above, it’s also a requirement if you want to be able to compare the FODs or compartments in a, well, comparable way. Not using the same unique (set of) response functions will make this very… different. I would strongly advise against it. So the gist: no worries, it will work well.

So likely the choice here makes little difference, because of the “older” age range. However, exactly because of that, it should be entirely safe to use a subset of subjects. You only need 1 technically, so a small batch of them is far more than robust enough. So if you want to play it on the safest side possible, simply pick the controls at the most developed/final time point. So the gist: I anticipate it won’t make much of a difference, but if you want to be safe “regardless”, pick the most developed controls.

I hope that helps, but feel free to be in touch if you need more insights or proper wording to explain stuff, etc… :slightly_smiling_face:

Cheers & take care,

Hi Phoebe,

8-14 year-olds have signal characteristics that resemble adult brain much more than they do neonatal brain tissue where response functions are a function of age and location. In 2018, we published a paper describing this but as Thijs said, feel free to ignore this.

You don’t need to accept an epistemic justification: if you want to convince yourself, load all subjects’ response functions into matlab or numpy or similar, normalise them by their first entry (b=0, l=0) and plot them in a similar way as shown in fig. 5. If you see no age-trend or clear difference between groups, it won’t matter what subset of the cohort you use. If there is a discernible age-trend or outliers in the data, I’d check the response function voxel selection masks (dwi2response -voxels) and the dMRI data for these subjects. If the response function voxel selection masks and data are sensible and you genuinely observe an age trend or group difference, it is generally best to use the oldest and/or control subjects for the simple reason that your ODFs and derived measures relate to the response functions used, which facilitates their interpretability and comparability. Again, for your cohort, the differences between WM, GM, and CSF should by far outweigh inter-subject differences within any component you might or might not observe. Hence intra-component variations will make very little difference to the overall decomposition but should affect the component ODF in these areas which your analysis pipeline is likely supposed to tease out.
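A minimal sketch of the check described above, in plain Python so it runs without extra packages (numpy.loadtxt would do the same job). It assumes each response file is whitespace-separated text with one row per b-value shell, one column per even harmonic degree (l = 0, 2, 4, …) and optional `#` comment lines; verify that against your own files. The two in-memory “files” and their values are hypothetical stand-ins for real response.txt files.

```python
import io

def read_response(handle):
    """Parse a text response file into a list of rows of floats."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        rows.append([float(x) for x in line.split()])
    return rows

def normalise(rows):
    """Divide every coefficient by the first entry (first shell, l=0)."""
    ref = rows[0][0]
    return [[v / ref for v in row] for row in rows]

# Hypothetical responses: same shape, different global scaling.
file_a = io.StringIO("# hypothetical subject A\n600.0 0 0\n300.0 -150.0 30.0\n")
file_b = io.StringIO("# hypothetical subject B\n900.0 0 0\n450.0 -225.0 45.0\n")

norm_a = normalise(read_response(file_a))
norm_b = normalise(read_response(file_b))

for row_a, row_b in zip(norm_a, norm_b):
    print(row_a, row_b)
```

For real data, replace the `io.StringIO` objects with `open(path)` for each subject, then plot the normalised coefficients against age or group with whichever plotting tool you prefer.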


Thank you both @ThijsDhollander and @maxpietsch for your clear and thorough responses! That’s incredibly helpful and I really appreciate you taking the time.