SF-Matching: identification of metabolites from tandem mass spectra
SF-Matching (SubFragment-Matching) is a machine-learning-based approach for predicting compounds from tandem mass spectra. When using SF-Matching, please consult (and cite) the following reference:
Identification of metabolites from tandem mass spectra with a machine learning approach utilizing structural features.
SF-Matching can be run in different modes:
- Search against pre-calculated library (recommended) if your compound of interest is present in one of the chemical databases KEGG, HMDB, ChEBI, and ChEMBL, or is a short peptide with a length of less than 4.
- Use trained models to search other compounds if your compound is not contained in these chemical databases. You can download the trained models and predict whether a given spectrum (in .mgf format) corresponds to a given molecule.
- Create your own pre-calculated library if you have a fixed library of chemical compounds that you want to search against. Once you have created your own pre-calculated library, you can discard the trained models to save disk space.
- Train your own models if you have access to a corpus of spectra to be used as training data.
The source code is available via the EMBL GitLab repository.
Search against pre-calculated library (recommended)
Requirements
- Python 2 or Python 3 (Python 3 is required for searching mzML files); Python >= 3.6 is recommended.
- pandas >= 0.24.2
- numpy >= 1.16.2
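The minimum versions above can be checked before running a search. The following is a minimal sketch and not part of SF-Matching: the helper names `parse_version`, `meets_minimum`, and `check_environment` are illustrative, and the parsing only looks at the leading numeric components of a version string.

```python
import re
import sys

# Minimum versions taken from the requirements list above.
MINIMUM_VERSIONS = {"pandas": "0.24.2", "numpy": "1.16.2"}


def parse_version(text):
    """Turn '1.16.2' (or '1.16.2rc1') into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def check_environment():
    """Report, per requirement, whether the current environment satisfies it."""
    # Python 3 is only mandatory for mzML input, so it is reported separately.
    report = {"python3_for_mzml": sys.version_info[0] >= 3}
    for name, required in MINIMUM_VERSIONS.items():
        try:
            module = __import__(name)
        except ImportError:
            report[name] = False
            continue
        report[name] = meets_minimum(module.__version__, required)
    return report
```

Comparing integer tuples rather than raw strings avoids treating a version such as 0.9.0 as newer than 0.24.2.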
Download link
Download the SF-Matching program and pre-calculated library using the link below:
sf-matching_with_precalculated_library.zip (19 GB, md5sum: 0eb3e32df2c03b3de484c1d49612086c)
The pre-calculated library will be in the folder sf-matching/Data/library.
You can also download the program and library from Code Ocean and Zenodo.
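Because the archive is large, it can be worth comparing its MD5 checksum with the value listed above before unzipping. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `archive_is_intact` are illustrative and not part of SF-Matching.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected_md5):
    """Return True if the file's MD5 digest matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `archive_is_intact("sf-matching_with_precalculated_library.zip", "0eb3e32df2c03b3de484c1d49612086c")` should return True for a complete download. Reading in chunks keeps memory use constant even for a multi-gigabyte file.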
Usage
- Download the SF-Matching program with pre-calculated library and unzip it.
- Convert your MS raw file into mgf / mzML / mzXML format.
- Run the command:
python [path_to_the_program]/SearchWithPreCalculatedLibrary/DatabaseSearching.py \
-db_path [path_to_library] \
-spectra [.mgf/.mzML/.mzXML filename] \
-out [output_filename] \
-ion [ion_type 0 for [M-H]-, 1 for [M+H]+]
Example
If you put the sf-matching program in /data/sf-matching/, the library data is in /data/sf-matching/Data/library/.
To search the spectrum file /data/sf-matching/Data/examples/test-001.mgf with [M-H]-, run the following command, which generates the output file /data/result.tsv.
python /data/sf-matching/SearchWithPreCalculatedLibrary/DatabaseSearching.py \
-db_path /data/sf-matching/Data/library/ \
-spectra /data/sf-matching/Data/examples/test-001.mgf \
-out /data/result.tsv \
-ion 0
An example is also contained in the file example_for_score_with_precalculated_library.sh, and on Code Ocean.
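When several spectrum files need to be searched, the command above can be generated once per file. The following is a minimal sketch under stated assumptions: the function name `build_search_commands` and the output naming scheme (input filename plus `.tsv`) are illustrative choices, while the script path and the `-db_path`, `-spectra`, `-out`, and `-ion` flags are taken from the usage above.

```python
import os

# File types accepted by the search, compared case-insensitively.
SUPPORTED_EXTENSIONS = (".mgf", ".mzml", ".mzxml")


def build_search_commands(program_dir, library_dir, spectra_dir, out_dir, ion):
    """Return one DatabaseSearching.py argument list per spectrum file."""
    if ion not in (0, 1):
        raise ValueError("ion must be 0 for [M-H]- or 1 for [M+H]+")
    script = os.path.join(
        program_dir, "SearchWithPreCalculatedLibrary", "DatabaseSearching.py"
    )
    commands = []
    for filename in sorted(os.listdir(spectra_dir)):
        extension = os.path.splitext(filename)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            continue  # skip anything that is not a spectrum file
        commands.append([
            "python", script,
            "-db_path", library_dir,
            "-spectra", os.path.join(spectra_dir, filename),
            # Keep the full input name so a.mgf and a.mzML do not collide.
            "-out", os.path.join(out_dir, filename + ".tsv"),
            "-ion", str(ion),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)`. The function only builds the argument lists, so they can be inspected before anything is executed.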
Search with pre-built models
Requirements
- Python == 3.6
- numpy == 1.16.2
- pandas == 0.24.2
- rdkit == 2017.09.2
- joblib == 0.11
- scipy == 1.2.1
- scikit-learn == 0.20.3
- Cython == 0.29.6
The program has been tested with the versions shown above; other versions may work but have not been tested.
Download link
Download the SF-Matching program and pre-built models using the link below:
sf-matching_with_prebuild_model.zip (106 GB, md5sum: 2bd3ee20eda31fe3dbb5efb2a809c8f1)
The pre-built models will be in the folders sf-matching/Data/model/neg and sf-matching/Data/model/pos.
You can also download the program and the [M-H]- model from Code Ocean.
Usage
- Download the SF-Matching program with pre-built model and unzip it.
- Compile the Cython module with the following command:
cd [path_to_sf-matching]
cd SearchWithModel
python setup.py build_ext --inplace
- Convert your spectrum file into mgf format, and make sure that each file contains only one spectrum. An example is provided in sf-matching/Data/examples/test-001.mgf
- Prepare the candidate molecules in txt format, one molecule per line; the molecules can be in SMILES or InChI format. An example is provided in sf-matching/Data/examples/test-001.txt
- Run the command:
python [path_to_the_program]/SearchWithModel/ScoreWithPreBuildModel.py \
-spectrum [mgf_file] \
-mol [candidate_molecule_file] \
-model [pre_calculated_model] \
-out_file [output_file] \
-ion [ion_type 0 for [M-H]-, 1 for [M+H]+] \
-threads [the_threads_you_want_to_use]
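The usage steps above require one spectrum per mgf file. If your converter writes several spectra into one file, a small script can split it. The following is a minimal sketch and not part of SF-Matching: it assumes spectra are delimited by `BEGIN IONS` and `END IONS` lines, as in the usual MGF layout, and it drops any lines outside those blocks (such as file-level header parameters). The function name `split_mgf` and the numbered output names are illustrative.

```python
import os


def split_mgf(mgf_path, out_dir):
    """Write each BEGIN IONS ... END IONS block to its own .mgf file."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(mgf_path))[0]
    written = []
    block = None  # None while we are outside a spectrum block
    with open(mgf_path) as handle:
        for line in handle:
            marker = line.strip().upper()
            if marker == "BEGIN IONS":
                block = [line]  # start collecting a new spectrum
            elif block is not None:
                block.append(line)
                if marker == "END IONS":
                    name = "%s-%03d.mgf" % (stem, len(written) + 1)
                    out_path = os.path.join(out_dir, name)
                    with open(out_path, "w") as out:
                        out.writelines(block)
                    written.append(out_path)
                    block = None
    return written
```

The function returns the paths of the files it wrote, in the order the spectra appeared, so each one can be scored separately.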
Example
If you put the sf-matching program in /data/sf-matching/, the model data is in /data/sf-matching/Data/model/neg/.
Suppose you want to search the spectrum file /data/sf-matching/Data/examples/test-001.mgf with [M-H]-, and the candidate molecules are in /data/sf-matching/Data/examples/test-001.txt.
First, run the commands below to compile the Cython module:
cd /data/sf-matching/SearchWithModel/
python setup.py build_ext --inplace
Then, the following command will generate the output file /data/result.tsv.
python /data/sf-matching/SearchWithModel/ScoreWithPreBuildModel.py \
-spectrum /data/sf-matching/Data/examples/test-001.mgf \
-mol /data/sf-matching/Data/examples/test-001.txt \
-model /data/sf-matching/Data/model/neg/ \
-out_file /data/result.tsv \
-ion 0 \
-threads 8
If you want to search a spectrum with [M+H]+, change -ion to 1 and point -model to the positive model.
Suppose this model is stored in /data/sf-matching/Data/model/pos/; you can then use the following command:
python /data/sf-matching/SearchWithModel/ScoreWithPreBuildModel.py \
-spectrum /data/sf-matching/Data/examples/test-001.mgf \
-mol /data/sf-matching/Data/examples/test-001.txt \
-model /data/sf-matching/Data/model/pos/ \
-out_file /data/result.tsv \
-ion 1 \
-threads 8
You can also find an example in the file example_for_score_with_prebuild_model.sh, and on Code Ocean.
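The two example commands differ only in the model folder and the `-ion` value, which must be changed together. The following is a minimal sketch that derives both from a single polarity argument; the function name `build_score_command` and the `"neg"`/`"pos"` keys are illustrative, while the script path, flags, and model folders follow the examples above.

```python
import os

# Ion type values as documented above: 0 for [M-H]-, 1 for [M+H]+.
ION_MODES = {"neg": 0, "pos": 1}


def build_score_command(program_dir, spectrum, candidates, out_file,
                        polarity, threads=1):
    """Return the ScoreWithPreBuildModel.py argument list for one spectrum."""
    if polarity not in ION_MODES:
        raise ValueError("polarity must be 'neg' or 'pos'")
    script = os.path.join(program_dir, "SearchWithModel",
                          "ScoreWithPreBuildModel.py")
    # The trailing "" keeps the trailing separator used in the examples.
    model_dir = os.path.join(program_dir, "Data", "model", polarity, "")
    return [
        "python", script,
        "-spectrum", spectrum,
        "-mol", candidates,
        "-model", model_dir,
        "-out_file", out_file,
        "-ion", str(ION_MODES[polarity]),
        "-threads", str(threads),
    ]
```

Deriving the model folder and the ion type from one argument makes it harder to pair the negative model with `-ion 1` by mistake. The resulting list can be passed to `subprocess.run(command, check=True)`.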
Calculate your own library with pre-built model
Requirements
- Python == 3.6
- numpy == 1.16.2
- pandas == 0.24.2
- rdkit == 2017.09.2
- joblib == 0.11
- scipy == 1.2.1
- scikit-learn == 0.20.3
- Cython == 0.29.6
- sqlite3 == 3.27.2
The program has been tested with the versions shown above; other versions may work but have not been tested.
Download link
Download the SF-Matching program and pre-built models using the link below:
sf-matching_with_prebuild_model.zip (106 GB, md5sum: 2bd3ee20eda31fe3dbb5efb2a809c8f1)
The pre-built models will be in the folders sf-matching/Data/model/neg and sf-matching/Data/model/pos.
You can also download the program and the [M-H]- model from Code Ocean.
Usage
- Download the SF-Matching program with pre-built model and unzip it.
- Compile the Cython module with the following command:
cd [path_to_sf-matching]
cd SearchWithModel
python setup.py build_ext --inplace
- Generate the spectral library with the following command:
# Change neg to pos if you are generating an [M+H]+ library
sqlite3 [path_to_spectral_library]/molecular_spectra_neg.db < [path_to_sf-matching]/SearchWithModel/sql.txt
- Prepare a txt file containing the molecules you want to include in the library, then run the following command to add those molecules to the spectral database:
# Decrease -batch_size if you want to separate the work into more pieces in the next step.
python [path_to_sf-matching]/SearchWithModel/30_PreprocessInputForSpectrumPrediction.py \
-in_file [file_to_molecular_inchi] \
-db [path_to_spectral_library]/molecular_spectra_neg.db \
-threads [processes_you_have] \
-batch_size 10000 \
-ion 0 # 1 for [M+H]+
- The output of step 4 gives you a number for the batch_num parameter. Set CAL_PART to that number and run the command shown below. The command can be run on different computers at the same time to shorten the waiting time.
CAL_PART=[the_number_you_got_from_step_4's_result]
for NUM in $(seq 0 1 $((${CAL_PART} - 1))); do
python [path_to_sf-matching]/SearchWithModel/31_GenerateInSilicoSpectrum.py \
-db [path_to_spectral_library]/molecular_spectra_neg.db \
-spectra [path_to_temporary_folder] \
-model [path_to_precalculated_model] \
-batch_num ${NUM} \
-threads [processes_you_have] \
-ion 0 # 1 for [M+H]+
done
- Run the following command to get the final pre-calculated library:
python [path_to_sf-matching]/SearchWithModel/32_SaveSpectrum.py \
-db [path_to_spectral_library]/molecular_spectra_neg.db \
-spectra [path_to_temporary_folder] \
-model [path_to_precalculated_model] \
-out_file [path_to_spectral_library]/spectra_data_neg.bin \
-threads [processes_you_have] \
-ion 0 # 1 for [M+H]+
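The spectrum-generation step above loops over batch numbers 0 to CAL_PART - 1, and the batches can be spread over several computers. The following is a minimal sketch of one way to decide which batch numbers each machine runs, using a round-robin split. The function name `batches_for_machine` is illustrative, and the sketch does not address how the machines share the database and the temporary folder.

```python
def batches_for_machine(total_batches, machine_count, machine_index):
    """Return the 0-based batch numbers assigned to one machine.

    Batches are dealt out round-robin, so machine 0 of 3 gets 0, 3, 6, ...
    """
    if machine_count < 1:
        raise ValueError("machine_count must be at least 1")
    if not 0 <= machine_index < machine_count:
        raise ValueError("machine_index must be in [0, machine_count)")
    return list(range(machine_index, total_batches, machine_count))
```

Each machine would then run 31_GenerateInSilicoSpectrum.py once for every value in its list, passing it as `-batch_num`. Every batch number is assigned to exactly one machine, so no batch is computed twice or skipped.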
Example
You can find an example in the file example_for_precalculate_library.sh.
Build model from scratch
This mode is intended only for advanced users who have a database of in-house spectra that can be used for training the model. You may need some SQL knowledge. Calculating the whole model from scratch takes around 10,000 to 200,000 CPU hours, depending on the size of the spectral database.
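To put the CPU-hour range in perspective, it can be converted to wall-clock time for a given number of cores. The following is a rough sketch that assumes the work parallelizes perfectly across cores, which real training runs will not fully achieve; the function name `wall_clock_days` and the 512-core figure are illustrative.

```python
def wall_clock_days(cpu_hours, cores):
    """Estimate wall-clock days, assuming ideal scaling across all cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cpu_hours / float(cores) / 24.0


# Range quoted above, evaluated for a hypothetical 512-core cluster.
low = wall_clock_days(10000, 512)
high = wall_clock_days(200000, 512)
print("%.1f to %.1f days on 512 cores" % (low, high))
```

Under this idealized assumption, the quoted range corresponds to roughly 0.8 to 16.3 days on 512 cores.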
Requirements
- Python == 3.6
- numpy == 1.16.2
- pandas == 0.24.2
- rdkit == 2017.09.2
- joblib == 0.11
- scipy == 1.2.1
- scikit-learn == 0.20.3
- Cython == 0.29.6
The program has been tested with the versions shown above; other versions may work but have not been tested.
Download link
Download the SF-Matching program:
sf-matching_with_example_spectra_library.zip (26 MB, md5sum: 219453bca296bf781724082db86485de)
The spectral database will be in the folder sf-matching/Data/model/database.
Usage
- Download the SF-Matching program with spectral database and unzip it.
- Add your own spectra to the spectral database, which is located in sf-matching/Data/database/spectral.db.
- Compile the Cython module with the following command:
cd [path_to_sf-matching]
cd SearchWithModel
python setup.py build_ext --inplace
- Follow the example in example_for_build_model.sh.