SF-Matching: identification of metabolites from tandem mass spectra

SF-Matching (SubFragment-Matching) is a machine-learning-based approach to identifying compounds from tandem mass spectra. When using SF-Matching, please consult and cite the following reference:

Identification of metabolites from tandem mass spectra with a machine learning approach utilizing structural features.

SF-Matching can be run in different modes:

Search against pre-calculated library (recommended) if your compound of interest is present in one of the chemical databases KEGG, HMDB, ChEBI, or ChEMBL, or is a short peptide with a length of less than 4.
Use trained models to search other compounds if your compound is not contained in these chemical databases. You can download the trained models and predict whether a given spectrum (in .mgf format) corresponds to a given molecule.
Create your own pre-calculated library if you have a fixed library of chemical compounds that you want to search against. Once you have created your own pre-calculated library, you can discard the trained models to save disk space.
Train your own models if you have access to a corpus of spectra to be used as training data.

The source code is available via the EMBL GitLab repository.

Search against pre-calculated library (recommended)

Requirements

Python 2 or Python 3. Python >= 3.6 is recommended, and Python 3 is required for searching mzML files.
pandas >= 0.24.2
numpy >= 1.16.2

Download link

Download the SF-Matching program and pre-calculated library using the link below:

sf-matching_with_precalculated_library.zip (19 GB, md5sum: 0eb3e32df2c03b3de484c1d49612086c)

The pre-calculated library will be in the folder sf-matching/Data/library.

You can also download the program and library from Code Ocean and Zenodo.
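
Because the archive is large, it can be worth checking the download against the md5sum listed above before unzipping. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `matches_checksum` are illustrative and not part of SF-Matching.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with the published value,
    ignoring letter case and surrounding whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_checksum("sf-matching_with_precalculated_library.zip", "0eb3e32df2c03b3de484c1d49612086c")` should return `True` for an intact download, assuming the archive was saved under that name in the current directory.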

Usage

Download the SF-Matching program with pre-calculated library and unzip it.
Convert your MS raw file into mgf / mzML / mzXML format.

Run the command:

 python [path_to_the_program]/SearchWithPreCalculatedLibrary/DatabaseSearching.py \
     -db_path [path_to_library] \
     -spectra [.mgf/.mzML/.mzXML filename] \
     -out [output_filename] \
     -ion [ion_type 0 for [M-H]-, 1 for [M+H]+]

Example

Suppose the sf-matching program is in /data/sf-matching/ and the library data is in /data/sf-matching/Data/library/. To search the spectrum file /data/sf-matching/Data/examples/test-001.mgf in [M-H]- mode, run the following command, which writes its output to /data/result.tsv.

    python /data/sf-matching/SearchWithPreCalculatedLibrary/DatabaseSearching.py \
        -db_path /data/sf-matching/Data/library/ \
        -spectra /data/sf-matching/Data/examples/test-001.mgf \
        -out /data/result.tsv \
        -ion 0

An example is also provided in the file example_for_score_with_precalculated_library.sh and on Code Ocean.
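
If you have many spectrum files, the same command can be assembled and run from Python. The sketch below only uses the flags shown above (-db_path, -spectra, -out, -ion); the helper names `build_search_command` and `search_all`, the one-output-file-per-input layout, and the restriction to .mgf files are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_search_command(program_dir, library_dir, spectra_file, out_file, ion):
    """Assemble the DatabaseSearching.py call as an argument list."""
    if ion not in (0, 1):
        raise ValueError("ion must be 0 for [M-H]- or 1 for [M+H]+")
    script = (Path(program_dir) / "SearchWithPreCalculatedLibrary"
              / "DatabaseSearching.py")
    return [
        sys.executable, str(script),
        "-db_path", str(library_dir),
        "-spectra", str(spectra_file),
        "-out", str(out_file),
        "-ion", str(ion),
    ]


def search_all(program_dir, library_dir, spectra_dir, out_dir, ion,
               run=subprocess.run):
    """Search every .mgf file in spectra_dir, writing one .tsv per input.

    The `run` argument exists so the function can be dry-run or tested
    without launching the real program.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for spectra_file in sorted(Path(spectra_dir).glob("*.mgf")):
        out_file = out_dir / (spectra_file.stem + ".tsv")
        command = build_search_command(
            program_dir, library_dir, spectra_file, out_file, ion)
        run(command, check=True)  # raises if the search exits with an error
        results.append(out_file)
    return results
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.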

Search with pre-built models

Requirements

Python == 3.6
numpy == 1.16.2
pandas == 0.24.2
rdkit == 2017.09.2
joblib == 0.11
scipy == 1.2.1
scikit-learn == 0.20.3
Cython == 0.29.6

The program has been tested with the versions shown above; other versions may work but have not been tested.

Download link

Download the SF-Matching program and pre-built models using the link below:

sf-matching_with_prebuild_model.zip (106 GB, md5sum: 2bd3ee20eda31fe3dbb5efb2a809c8f1)

The pre-built models will be in the folder sf-matching/Data/model/neg and sf-matching/Data/model/pos.

You can also download the program and the [M-H]- model from Code Ocean.

Usage

Download the SF-Matching program with pre-built models and unzip it.

Compile the Cython module with the following commands:

 cd [path_to_sf-matching]
 cd SearchWithModel
 python setup.py build_ext --inplace

Convert your spectrum file into mgf format, and make sure that each file contains only one spectrum. An example is provided in sf-matching/Data/examples/test-001.mgf.
Prepare the candidate molecules in a txt file with one molecule per line, in SMILES or InChI format. An example is provided in sf-matching/Data/examples/test-001.txt.

Run the command:

 python [path_to_the_program]/SearchWithModel/ScoreWithPreBuildModel.py \
     -spectrum [mgf_file] \
     -mol [candidate_molecule_file] \
     -model [pre_calculated_model] \
     -out_file [output_file] \
     -ion [ion_type 0 for [M-H]-, 1 for [M+H]+] \
     -threads [the_threads_you_want_to_use]
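
The steps above require one spectrum per mgf file. If your converter produced a single mgf file with several spectra, a small script can split it. This is a minimal sketch, not part of SF-Matching: it relies only on the standard `BEGIN IONS` / `END IONS` delimiters of the mgf format, the function name `split_mgf` and the output naming scheme are illustrative, and any lines outside a spectrum block (such as global parameters) are dropped.

```python
from pathlib import Path


def split_mgf(mgf_path, out_dir):
    """Write each BEGIN IONS ... END IONS block of mgf_path to its own file.

    Returns the list of files written, named <stem>-001.mgf, <stem>-002.mgf, ...
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(mgf_path).stem
    written = []
    block = None  # lines of the spectrum currently being read, if any
    with open(mgf_path) as handle:
        for line in handle:
            marker = line.strip().upper()
            if marker == "BEGIN IONS":
                block = [line]
            elif block is not None:
                block.append(line)
                if marker == "END IONS":
                    target = out_dir / "{}-{:03d}.mgf".format(
                        stem, len(written) + 1)
                    target.write_text("".join(block))
                    written.append(target)
                    block = None
    return written
```

Each resulting file can then be passed to -spectrum on its own.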

Example

Suppose the sf-matching program is in /data/sf-matching/ and the model data is in /data/sf-matching/Data/model/neg/. You want to search the spectrum file /data/sf-matching/Data/examples/test-001.mgf in [M-H]- mode against the candidate molecules in /data/sf-matching/Data/examples/test-001.txt.

First, run the commands below to compile the Cython module:

    cd /data/sf-matching/SearchWithModel/
    python setup.py build_ext --inplace

Then run the following command, which writes its output to /data/result.tsv.

    python /data/sf-matching/SearchWithModel/ScoreWithPreBuildModel.py \
        -spectrum /data/sf-matching/Data/examples/test-001.mgf \
        -mol /data/sf-matching/Data/examples/test-001.txt \
        -model /data/sf-matching/Data/model/neg/ \
        -out_file /data/result.tsv \
        -ion 0 \
        -threads 8

To search a spectrum in [M+H]+ mode, change the ion to 1 and switch to the positive model. If this model is stored in /data/sf-matching/Data/model/pos/, use the following command:

    python /data/sf-matching/SearchWithModel/ScoreWithPreBuildModel.py \
        -spectrum /data/sf-matching/Data/examples/test-001.mgf \
        -mol /data/sf-matching/Data/examples/test-001.txt \
        -model /data/sf-matching/Data/model/pos/ \
        -out_file /data/result.tsv \
        -ion 1 \
        -threads 8

Alternatively, you can find an example in the file example_for_score_with_prebuild_model.sh and on Code Ocean.
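
The two commands above differ only in the ion value and the model folder, and the two must agree. One way to avoid a mismatch is to derive the model folder from the ion type. The sketch below assumes the models sit in the default Data/model/neg and Data/model/pos folders of the unzipped download; the helper name `build_score_command` is illustrative and not part of SF-Matching.

```python
import sys
from pathlib import Path

# Ion type -> model subfolder, following the layout of the download.
ION_TO_MODEL_SUBDIR = {0: "neg", 1: "pos"}  # 0: [M-H]-, 1: [M+H]+


def build_score_command(program_dir, spectrum, candidates, out_file,
                        ion, threads=1):
    """Assemble the ScoreWithPreBuildModel.py call as an argument list,
    choosing the model folder that matches the ion type."""
    if ion not in ION_TO_MODEL_SUBDIR:
        raise ValueError("ion must be 0 for [M-H]- or 1 for [M+H]+")
    root = Path(program_dir)
    model_dir = root / "Data" / "model" / ION_TO_MODEL_SUBDIR[ion]
    script = root / "SearchWithModel" / "ScoreWithPreBuildModel.py"
    return [
        sys.executable, str(script),
        "-spectrum", str(spectrum),
        "-mol", str(candidates),
        "-model", str(model_dir),
        "-out_file", str(out_file),
        "-ion", str(ion),
        "-threads", str(threads),
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`.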

Calculate your own library with pre-built model

Requirements

Python == 3.6
numpy == 1.16.2
pandas == 0.24.2
rdkit == 2017.09.2
joblib == 0.11
scipy == 1.2.1
scikit-learn == 0.20.3
Cython == 0.29.6
sqlite3 == 3.27.2

The program has been tested with the versions shown above; other versions may work but have not been tested.

Download link

Download the SF-Matching program and pre-built models using the link below:

sf-matching_with_prebuild_model.zip (106 GB, md5sum: 2bd3ee20eda31fe3dbb5efb2a809c8f1)

The pre-built models will be in the folder sf-matching/Data/model/neg and sf-matching/Data/model/pos.

You can also download the program and the [M-H]- model from Code Ocean.

Usage

Download the SF-Matching program with pre-built models and unzip it.

Compile the Cython module with the following commands:

 cd [path_to_sf-matching]
 cd SearchWithModel
 python setup.py build_ext --inplace

Generate the spectral library with the following command:

 # Change neg to pos if you are generating [M+H]+ library
 sqlite3 [path_to_spectral_library]/molecular_spectra_neg.db < [path_to_sf-matching]/SearchWithModel/sql.txt

Prepare a txt file containing the molecules you want to include in the library, then run the following command to add those molecules to the spectral database:

 # Decrease -batch_size to split the work into more pieces in the next step.
 # Use -ion 1 for [M+H]+.
 python [path_to_sf-matching]/SearchWithModel/30_PreprocessInputForSpectrumPrediction.py \
     -in_file [file_to_molecular_inchi] \
     -db [path_to_spectral_library]/molecular_spectra_neg.db \
     -threads [processes_you_have] \
     -batch_size 10000 \
     -ion 0

Read the output of step 4 to get the number of batches, then run the command shown below once for each batch_num value, from 0 to that number minus 1. The command can be run on different computers at the same time to shorten the waiting time.

 CAL_PART=[the_number_you_got_from_step_4]
 for NUM in $(seq 0 1 $((${CAL_PART} - 1))); do
     python [path_to_sf-matching]/SearchWithModel/31_GenerateInSilicoSpectrum.py \
         -db [path_to_spectral_library]/molecular_spectra_neg.db \
         -spectra [path_to_temporary_folder] \
         -model [path_to_precalculated_model] \
         -batch_num ${NUM} \
         -threads [processes_you_have] \
         -ion 0 # 1 for [M+H]+
 done

Run the following command to get the final pre-calculated library:

 python [path_to_sf-matching]/SearchWithModel/32_SaveSpectrum.py \
     -db [path_to_spectral_library]/molecular_spectra_neg.db \
     -spectra [path_to_temporary_folder] \
     -model [path_to_precalculated_model] \
     -out_file [path_to_spectral_library]/spectra_data_neg.bin \
     -threads [processes_you_have] \
     -ion 0 # 1 for [M+H]+
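
Since the spectrum generation step (31_GenerateInSilicoSpectrum.py) can run on several computers at once, each machine needs its own share of the batch numbers. The following is a minimal sketch of one way to divide them, assuming the database and the temporary folder are on storage that every machine can reach; the helper names and the round-robin scheme are illustrative, not part of SF-Matching.

```python
import sys
from pathlib import Path


def batches_for_worker(total_batches, worker_index, worker_count):
    """Return the batch numbers one machine should process.

    Batches are dealt out round-robin, so worker 0 of 3 gets 0, 3, 6, ...
    """
    if not 0 <= worker_index < worker_count:
        raise ValueError("worker_index must be in [0, worker_count)")
    return list(range(worker_index, total_batches, worker_count))


def build_generate_command(program_dir, db_path, spectra_dir, model_dir,
                           batch_num, threads, ion):
    """Assemble one 31_GenerateInSilicoSpectrum.py call as an argument list."""
    script = (Path(program_dir) / "SearchWithModel"
              / "31_GenerateInSilicoSpectrum.py")
    return [
        sys.executable, str(script),
        "-db", str(db_path),
        "-spectra", str(spectra_dir),
        "-model", str(model_dir),
        "-batch_num", str(batch_num),
        "-threads", str(threads),
        "-ion", str(ion),  # 0 for [M-H]-, 1 for [M+H]+
    ]
```

On each machine, loop over `batches_for_worker(total, index, count)` and run the command built for each batch number, for example with `subprocess.run(command, check=True)`. Run 32_SaveSpectrum.py only after every batch has finished.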

Example

You can find an example in the file example_for_precalculate_library.sh.
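
The usage steps above create the empty spectral database by piping sql.txt into the sqlite3 command-line tool. If that tool is not installed, Python's built-in sqlite3 module may serve as a substitute. This is a sketch under one assumption: that sql.txt contains only plain SQL statements and no sqlite3 shell dot-commands. The function name `create_library_db` is illustrative.

```python
import sqlite3


def create_library_db(db_path, schema_path):
    """Create (or open) the SQLite database at db_path and run every
    SQL statement found in schema_path against it."""
    with open(schema_path) as handle:
        schema = handle.read()
    connection = sqlite3.connect(db_path)
    try:
        connection.executescript(schema)  # runs all statements in order
        connection.commit()
    finally:
        connection.close()
```

For the [M-H]- library this would be called as `create_library_db("[path_to_spectral_library]/molecular_spectra_neg.db", "[path_to_sf-matching]/SearchWithModel/sql.txt")`, with the placeholders replaced by real paths.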

Build model from scratch

This mode is only for advanced users who have a database of in-house spectra that can be used for training the model. You may need some SQL knowledge. Calculating the whole model from scratch takes around 10,000 to 200,000 CPU hours, depending on the size of the spectral database.

Requirements

Python == 3.6
numpy == 1.16.2
pandas == 0.24.2
rdkit == 2017.09.2
joblib == 0.11
scipy == 1.2.1
scikit-learn == 0.20.3
Cython == 0.29.6

The program has been tested with the versions shown above; other versions may work but have not been tested.

Download link

Download the SF-Matching program:

sf-matching_with_example_spectra_library.zip (26 MB, md5sum: 219453bca296bf781724082db86485de)

The spectral database will be in the folder sf-matching/Data/model/database.

Usage

Download the SF-Matching program with spectral database and unzip it.
Add your own spectra to the spectral database, which is located at sf-matching/Data/database/spectral.db.

Compile the Cython module with the following commands:

 cd [path_to_sf-matching]
 cd SearchWithModel
 python setup.py build_ext --inplace

Follow the example in example_for_build_model.sh.