Pipeline to generate peptide-spectrum matches using different search engines against target and decoy databases. Input may be local or remote (see Data sources).
Allows for the easy addition of datasets and search engines.
Index
- Usage
- Troubleshooting
  - 2.1. download_sample
  - 2.2. raw_to_mzml
  - 2.3. comet
  - 2.4. percolator
- Data sources
- Dependencies
Before running this pipeline, some preliminary steps can be performed to prepare the input files, and after a run, some further steps can be used to process the output files.
Although not required, it is recommended to use the provided conda environment to launch the protter pipeline, as well as the external pre- and post-processing scripts. This environment contains Snakemake, as well as some of the necessary dependencies that are not contained in the pipeline's own conda microenvironments. (See Dependencies.)
You need to have a version of conda installed. We recommend Mambaforge, which has Mamba installed in the base environment. To create the environment, run the following from the base folder of protter:
mamba env create -f res/conf/protter-env.yaml

Once all the dependencies have been installed, activate the environment with:
mamba activate protter

Sometimes it can be useful to delete the sequences tagged as readthrough transcripts and the chromosome Y pseudoautosomal regions (PAR) from the translations file. It is also useful to use a non-redundant FASTA file.
Such a file can be generated with the following script, which needs the translations file and the GTF annotation of the desired GENCODE version. It can be run from the workflow directory as follows:
python scripts/clean_pc_translations.py path/to/gencode.v{version}.pc_translations.fa.gz path/to/gencode.v{version}.annotation.gtf

A translations file will be created in the same folder as the original file, with the
extension .NoRT.u.fa.gz. This indicates that the readthrough transcripts and the PAR_Y
sequences have been removed (NoRT) and that the remaining sequences are unique (u).
This output file, gencode.v{version}.pc_translations.NoRT.u.fa.gz, is the one that
should be indicated in the protter config.yaml file.
A TSV sample sheet is necessary to run the workflow. At a minimum, this should
contain dataset and sample columns to identify each input sample, and a
file column to indicate the location of the sample input file, where the file
location may be given as a URL or as a local system path. A checksum column is
also needed for remote input files, to verify that they have been downloaded
correctly.
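As an illustration, a minimal sample sheet with these columns could be built with pandas as in the sketch below; the dataset name, sample identifiers, file locations and checksum value are purely hypothetical.

```python
import pandas as pd

# Minimal sample sheet sketch. The column names (dataset, sample, file,
# checksum) are those described above; all values are illustrative only.
samples = pd.DataFrame(
    {
        "dataset": ["example_dataset", "example_dataset"],
        "sample": ["sample_01", "sample_02"],
        # 'file' may be a URL (remote input) or a local system path.
        "file": [
            "https://example.org/raw/sample_01.raw",   # remote: checksum required
            "/data/ms/example_dataset/sample_02.raw",  # local: checksum optional
        ],
        "checksum": ["0123456789abcdef0123456789abcdef01234567", ""],
    }
)

# Write a tab-separated sample sheet.
samples.to_csv("samples.tsv", sep="\t", index=False)
```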
A workflow run may contain several datasets, each with hundreds or thousands of input files, so it would be tedious and error-prone to generate the sample sheet manually. Once the workflow config YAML file has been prepared, a script can generate the sample sheet automatically from it; run it from the workflow directory as follows:
python scripts/prep_sample_sheet.py config.yaml samples.tsv

For each dataset marked as enabled in the config file, this obtains the sample
metadata from PRIDE or from local input files, depending on the source
configured for the given dataset. Having obtained the sample metadata of all
enabled datasets, it creates an initial sample sheet. For known datasets, the
sample sheet is further enhanced, filtering irrelevant samples, modifying
existing metadata, and adding new metadata columns.
A dataset is treated as known if there is a module of the same name in the
scripts/ds directory, and if that module contains a function
enhance_sample_metadata that accepts the initial sample sheet as a pandas
DataFrame and returns an enhanced sample sheet DataFrame. (For example,
see scripts/ds/kim_2014.py.) Dataset modules can be added as needed for any
known dataset, and will be recognised automatically by protter provided that
each module has the same name as its dataset and contains a working
enhance_sample_metadata function.
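As an illustration of the expected shape of such a module, here is a minimal sketch: only the module location (scripts/ds/<dataset>.py) and the enhance_sample_metadata signature follow the convention described above, while the dataset name, column names and filtering rules are hypothetical.

```python
# scripts/ds/example_dataset.py -- hypothetical dataset module.
import pandas as pd


def enhance_sample_metadata(samples: pd.DataFrame) -> pd.DataFrame:
    """Return an enhanced copy of the initial sample sheet."""
    enhanced = samples.copy()

    # Drop samples that are irrelevant for this (hypothetical) dataset,
    # e.g. quality-control runs identifiable from the sample name.
    enhanced = enhanced[~enhanced["sample"].str.contains("QC", na=False)]

    # Adjust existing metadata, e.g. normalise sample names.
    enhanced["sample"] = enhanced["sample"].str.lower()

    # Add a new metadata column used later in the workflow
    # (a hypothetical tissue label parsed from the sample name).
    enhanced["tissue"] = enhanced["sample"].str.extract(r"^([a-z]+)_", expand=False)

    return enhanced
```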
Whether the datasets in a sample sheet are known or unknown, it is recommended to review the sample sheet before running the workflow.
It may be necessary to split a sample sheet into multiple parts if, for example, it would take too long to run the workflow for all the samples in a single run. In such cases, the sample sheet can be split as follows:
python scripts/split_sample_sheet.py --split-by dataset "samples.tsv" "samples_"

Depending on the configuration, this script can be used to generate several smaller sample sheets, which can then be processed in separate workflow runs.
The protter workflow can be run just like any other Snakemake workflow,
whether by specifying the number of cores and other parameters as needed…
snakemake --use-conda --cores 1

…or, if using an HPC cluster such as CNIO's, by specifying a Slurm profile:
snakemake --use-conda --profile $SMK_PROFILE_SLURM -j 10

After completion of a workflow run, the output PSM files can be gathered as follows:
python scripts/gather_psm_files.py config.yaml protter_psm_output.zip

The input config YAML file is used to determine the datasets for which PSM files should be gathered, and the output ZIP archive contains all the output PSM files for those datasets.
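If desired, the contents of the archive can be inspected with Python's standard zipfile module; only the archive name from the command above is assumed here, not the layout of the entries inside it.

```python
import zipfile

# List the PSM files gathered into the output archive.
with zipfile.ZipFile("protter_psm_output.zip") as archive:
    for name in archive.namelist():
        print(name)
```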
If the protter output is going to be used as part of APPRIS, it is necessary to
perform post-processing to obtain the final CSV files. To do this, a script has
been prepared that takes the protter_psm_output.zip file and the config.yaml
file as input, and can be run from the workflow directory as follows:
python scripts/postprocessing_for_proteo.py config.yaml protter_psm_output.zip

This script creates a CSV file for each enabled database in config.yaml and saves
them in the same path as the zipped PSM file. The files are named proteo_{database}.csv.
There are several steps of the workflow during which issues may arise.
When a sample download fails, it should be possible to identify the cause of failure from the Snakemake log files. However, sometimes a download fails simply because of adverse network conditions, in which case the workflow can be rerun when network conditions are more favourable.
In general, if there is sufficient local storage space, and especially if the same dataset will be processed multiple times, it is recommended to download sample files separately and configure the sample sheet with their local file paths.
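One way to do this is with a small script based on requests (already a workflow dependency), as in the sketch below; the URL, local path and the choice of SHA-1 as the checksum algorithm are illustrative assumptions, so use whichever algorithm your sample sheets actually record.

```python
import hashlib
import requests

# Illustrative values only: substitute the real download URL and target path.
url = "https://example.org/raw/sample_01.raw"
local_path = "/data/ms/example_dataset/sample_01.raw"

# Stream the download to disk and compute a checksum on the fly.
# SHA-1 is assumed here purely for illustration.
digest = hashlib.sha1()
with requests.get(url, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open(local_path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=1 << 20):
            handle.write(chunk)
            digest.update(chunk)

# The printed checksum can then be recorded in the sample sheet,
# alongside the local file path.
print(local_path, digest.hexdigest())
```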
In a small number of cases, an error may occur while converting a RAW file to mzML format.
In such cases, it may be possible to convert the RAW file to an mzML file outside of the workflow, then use that mzML file as input for the given sample.
If no way can be found to convert the RAW file to mzML, the corresponding
sample can be dropped from the workflow, either by deleting the sample from
the sample sheet, or by marking the sample as NA in the subset column,
which effectively excludes it.
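For example, a sample could be marked as NA in an existing sample sheet with a few lines of pandas; the sample name below is hypothetical, and it is assumed here that a literal NA value in the subset column is what the workflow expects.

```python
import pandas as pd

# Load the existing sample sheet.
samples = pd.read_csv("samples.tsv", sep="\t")

# Mark a sample that cannot be converted to mzML as NA in the subset
# column ("sample_02" is a hypothetical sample name); the column is
# created if it does not already exist.
samples.loc[samples["sample"] == "sample_02", "subset"] = pd.NA

# Write the literal string "NA" for missing values when saving.
samples.to_csv("samples.tsv", sep="\t", index=False, na_rep="NA")
```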
In a small number of cases, Comet does not create an output file. Sometimes this is because Comet has not searched any spectra. If this occurs and no other error is found, protter will generate a placeholder PIN file so that the workflow will continue uninterrupted.
However, it can save time and resources to simply exclude such samples from
the workflow, whether by deleting them from the sample sheet or marking them
as NA in the subset column.
In a small number of cases, Percolator may fail during the training phase. One
way to ameliorate this issue is to group samples by biosample, experiment,
or some other appropriate grouping.
Input mass spectrometry data may be obtained from files stored locally, or may be downloaded directly from the PRIDE database (Perez-Riverol et al. 2019) by specifying one or more PRIDE project accessions.
This workflow requires Python 3.6 or greater.
For conversion of mass spectrometry data from Thermo RAW formats to the standard mzML format, this workflow depends on ThermoRawFileParser (Hulstaert et al. 2020). ThermoRawFileParser is currently available under an Apache 2.0 License, subject to further restrictions imposed by the license of the Thermo Finnigan LLC vendor library on which it depends. These licenses can be viewed in the ThermoRawFileParser GitHub repository.
The protter workflow also depends on the following software:
- Biopython (Cock et al. 2009) — Biopython License
- Crux toolkit (McIlwain et al. 2014) — Apache 2.0 License
- curl — MIT/X derivative license
- DecoyPYrat (Wright & Choudhary 2016) — MIT License
- pandas (McKinney 2010) — BSD 3-Clause License
- PyYAML — MIT License
- ratelimiter — Apache 2.0 License
- requests — Apache 2.0 License
- rhash — BSD Zero-Clause License
- zlib — zlib License
For specific datasets whose metadata is obtained from an Excel file, packages such as xlrd or openpyxl may also be required to open XLS or XLSX files, respectively.
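For instance, when dataset metadata is loaded from an Excel workbook with pandas, the engine can be chosen to match the file format; the file names below are hypothetical.

```python
import pandas as pd

# Legacy .xls workbooks need the xlrd engine; .xlsx workbooks need openpyxl.
legacy_metadata = pd.read_excel("dataset_metadata.xls", engine="xlrd")
modern_metadata = pd.read_excel("dataset_metadata.xlsx", engine="openpyxl")
```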