dwi2response issue with qsub on SGE

Hi All,

I am parallelizing MRtrix processing on an SGE cluster. I have read about the potential SGE issues, but none of them matches my current problem.
When I run this command directly in the server terminal:

dwi2response dhollander dwi.mif wm.txt gm.txt csf.txt -voxels voxels.mif -mask mask.mif -scratch temp/ -force

it works, but when I place it in a bash file:

#!/bin/bash
#$ -l h_vmem=50000M,vf=50000M
#$ -M am983@duke.edu
#$ -m ea
#$ -o slurm-$JOB_ID.out
#$ -e slurm-$JOB_ID.out
#$ -N mrtrix
dwi2response dhollander dwi.mif wm.txt gm.txt csf.txt -voxels voxels.mif -mask mask.mif -scratch temp/ -force

and submit the bash file with qsub from the server terminal:

qsub mrtrix.bash

I get the following errors:

dwi2response: 
dwi2response: Note that this script makes use of commands / algorithms that have relevant articles for citation. Please consult the help page (-help option) for more information.
dwi2response: 
dwi2response: [WARNING] Output file 'voxels.mif' already exists; will be overwritten at script completion
dwi2response: [WARNING] Output file 'wm.txt' already exists; will be overwritten at script completion
dwi2response: [WARNING] Output file 'gm.txt' already exists; will be overwritten at script completion
dwi2response: [WARNING] Output file 'csf.txt' already exists; will be overwritten at script completion

dwi2response: [ERROR] Unhandled Python exception:
dwi2response: [ERROR]   ValueError: No JSON object could be decoded
dwi2response: [ERROR] Traceback:
dwi2response: [ERROR]   /usr/local/packages/mrtrix3/3.0.3/bin/dwi2response:83 (in execute())
dwi2response: [ERROR]     if not grad_import_option and 'dw_scheme' not in image.Header(path.from_user(app.ARGS.input, False)).keyval():
dwi2response: [ERROR]   /usr/local/packages/mrtrix3/3.0.3/lib/mrtrix3/image.py:46 (in __init__())
dwi2response: [ERROR]     data = json.load(json_file)
dwi2response: [ERROR]   /usr/lib64/python2.7/json/__init__.py:290 (in load())
dwi2response: [ERROR]     **kw)
dwi2response: [ERROR]   /usr/lib64/python2.7/json/__init__.py:338 (in loads())
dwi2response: [ERROR]     return _default_decoder.decode(s)
dwi2response: [ERROR]   /usr/lib64/python2.7/json/decoder.py:366 (in decode())
dwi2response: [ERROR]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
dwi2response: [ERROR]   /usr/lib64/python2.7/json/decoder.py:384 (in raw_decode())
dwi2response: [ERROR]     raise ValueError("No JSON object could be decoded")

This cluster wrapper bash file has worked with other MRtrix commands so far, such as mrconvert, dwi2tensor and tensor2metric.

Any thoughts?
Thanks,

Resolution: dwi2response writes a temporary JSON file to the home directory despite us having specified the scratch directory, and we had to supply the bvecs and bvals via “-fslgrad bvecs bvals” even though they are already present in the .mif header.
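For reference, the working command ended up along these lines (assuming the gradient table sits in files named bvecs and bvals alongside the data; adjust the paths to suit):

dwi2response dhollander dwi.mif wm.txt gm.txt csf.txt -voxels voxels.mif -mask mask.mif -scratch temp/ -fslgrad bvecs bvals -force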

OK, thanks for reporting back. I’ve had a look through the code to check what might have happened, and I think the real fix lies elsewhere. I’ll start by going through some of the logic behind our use of temporary files, which will hopefully provide some useful context.


MRtrix makes a distinction between temporary files and intermediate outputs:

Temporary files are expected to have very short lifetimes, and should be used purely to pass data from one command to the next. They are mainly (but not exclusively) used in the context of Unix pipes. The primary criterion for selecting a good location for them is I/O speed (along with sufficient capacity); ideally, this would be a RAM-based local filesystem (such as tmpfs, which is commonly available on Linux). The location of these temporary files is determined primarily by the TmpFileDir config file entry (an example is given after this comparison).

Intermediate outputs, on the other hand, typically have longer lifetimes, can be inspected if necessary, and can take up a much larger amount of storage. These are what we store in the scratch folders created by the Python scripts. The primary criterion for selecting a good location for them is large storage capacity (along with I/O speed). The location of these intermediate outputs is determined by the script’s scratch folder location, e.g. as set explicitly via the -scratch option.
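For illustration, both of these locations can be set via entries in the MRtrix config file (typically ~/.mrtrix.conf per user, or /etc/mrtrix.conf system-wide). The snippet below is only a sketch: the directory paths are made up, and the ScriptScratchDir key is quoted from memory, so do check the config file documentation for your MRtrix3 version:

# ~/.mrtrix.conf
# Short-lived temporary files (e.g. for Unix pipes); a fast, RAM-backed location is ideal:
TmpFileDir: /dev/shm
# Default location for the scratch folders created by the Python scripts
# (only consulted if -scratch is not given on the command line):
ScriptScratchDir: /large/scratch/volume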


Coming back to the issue here: when probing for the DW scheme, the script requests that mrinfo write all of the header information to a temporary JSON file, which is subsequently read back into Python. Being a temporary file, its location is determined separately from the script’s scratch folder (as you’ve observed), but for some reason (presumably just the way the code evolved) a slightly different set of rules is used to decide where this file gets written.
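To make that concrete, the probe boils down to something like the following (a hand-written sketch rather than the actual script code; the temporary file path is hypothetical, and in practice it is derived from TmpFileDir rather than from the -scratch folder):

# mrinfo dumps the full header (including any dw_scheme entry) to a temporary JSON file...
mrinfo dwi.mif -json_all /path/chosen/via/TmpFileDir/dwi_header.json
# ...which the Python library then reads back in:
python -c "import json; json.load(open('/path/chosen/via/TmpFileDir/dwi_header.json'))"
# If that temporary file cannot be written to (or read back from) that location, the
# JSON parse is what raises the "No JSON object could be decoded" error shown above.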

I’m guessing the issue you’re seeing comes down to the TmpFileDir entry having been set to your home folder in your config file (unless I’ve overlooked something in the code). If the job doesn’t have write access to that location on the SGE cluster, this would cause the script to fail to find the DW sampling scheme¹. You’ve managed to sidestep that issue by providing the sampling scheme on the command line, which avoids the mrinfo → JSON → Python round-trip entirely.

My guess is this would also cause trouble if you were to try using Unix pipes in your SGE script. The cleanest fix here (without changing the code) is to remove the TmpFileDir entry from your config file (if it was indeed set). If it wasn’t set, we’ll need to investigate further…
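If it helps, something along these lines should make the check quick on the cluster. Note that the MRTRIX_TMPFILE_DIR environment variable is, as far as I recall, the documented way to override TmpFileDir on a per-job basis, and SGE normally exports a job-local $TMPDIR; treat both as assumptions to verify on your system:

# Check whether TmpFileDir is set in either config file:
grep -i TmpFileDir ~/.mrtrix.conf /etc/mrtrix.conf 2>/dev/null

# Or override it for this job only, pointing it at the job-local scratch
# directory that SGE provides, before invoking the script:
export MRTRIX_TMPFILE_DIR=$TMPDIR
dwi2response dhollander dwi.mif wm.txt gm.txt csf.txt -voxels voxels.mif -mask mask.mif -scratch temp/ -force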

As a final note, we’ll need to think about the strategy used to determine the location of this file, and in particular whether an explicitly specified scratch folder should take precedence over the TmpFileDir config file entry. That isn’t obvious, given that the two technically refer to subtly different concepts (as described above). We’re open to thoughts & suggestions on the issue…


¹ I would have expected the error message to be “Could not access header information for image”, given that the failure should have occurred at this earlier point…?!?