1. The Galaxy platform features a web-based user interface that can be accessed through a URL. The URL directs users to a local instance or to an instance running on a remote server. The Galaxy user interface has three components:
the Tool Pane (left column), which lists all software tools installed in Galaxy; the Main Viewing Pane (center), which is used to set operating parameters for software tools, edit and view workflows, and view results; and the History Pane (right column), which displays the currently active history record of analysis steps and data inputs and outputs. For hands-on training on the workflows described here, we have built a training instance accessible through this URL: z.umn.edu/proteogenomicsgateway. To access the Galaxy instance, users have to register and create login/password credentials.
NOTE: The accounts created on the training site are temporary. They will be erased periodically, requiring users to re-register to access training workflows and materials. The email addresses are used solely for registration and are not saved or used for any other purpose.
2. Galaxy stores the record of each analysis as a History (for further information, see https://galaxyproject.org/tutorials/histories/).
The History is a sequential record of all the software tools used during the analysis, together with their operating parameters. It also holds information on all inputs, outputs and metadata generated. The length of a History depends on the nature of the data analysis. Histories can be saved and shared with other users registered on the same Galaxy instance.
The History Pane always displays the active history.
3. A Galaxy workflow comprises all the software tools used in an analysis, but not the input and output data (for further details on workflows, see https://galaxyproject.org/learn/advanced-workflow/). Importantly, workflows contain the parameter settings required for running the analysis on specific input data.
Like Histories, workflows can be shared with other users of Galaxy. The ability to save and share validated workflows makes for an efficient way to carry out a complex data analysis, avoiding the need for step-by-step optimization of the parameters required for optimal results.
4. A tool within Galaxy named Protein Database Downloader (PDD) downloads FASTA files of specified protein sequences, which can be used for sequence matching to experimental MS/MS data.
Sequences from publicly available repositories such as UniProt, cRAP, EBI Metagenomics, HOMD (Human Oral Microbiome Database) and the Human Microbiome Project can be downloaded through the PDD. It can also download any specific database of protein sequences for which a URL is available.
5. SearchGUI/PeptideShaker
SearchGUI: Protein identification is performed by matching MS/MS peak lists (formatted as Mascot generic format (MGF) files) against a sequence database in FASTA format. SearchGUI (Vaudel et al. Proteomics 2011;11(5):996-9) contains 9 popular open-source sequence database search programs, which can be used in parallel to provide more confident and comprehensive matches to MS/MS data.
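As an illustration of the two file formats involved, the following minimal Python sketch reads protein sequences from FASTA text and peak lists from MGF text. It is not part of Galaxy, the PDD or SearchGUI; the function names (read_fasta, read_mgf) and the example records are hypothetical and only show how the two formats are laid out.

```python
from typing import Dict, List

# Hypothetical example records, for illustration only.
EXAMPLE_FASTA = """>sp|P00001|TEST_MOUSE Hypothetical protein
MKWVTFISLLK
LFSSAYSR
"""

EXAMPLE_MGF = """BEGIN IONS
TITLE=scan=1
PEPMASS=500.25 12000
CHARGE=2+
100.1 10.0
200.2 20.5
END IONS
"""


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header] += line
    return records


def read_mgf(text: str) -> List[dict]:
    """Parse MGF text into spectra with KEY=VALUE params and (m/z, intensity) peaks."""
    spectra: List[dict] = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line:
                # Metadata such as TITLE, PEPMASS and CHARGE.
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z and intensity (any further columns are ignored).
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


proteins = read_fasta(EXAMPLE_FASTA)
spectra = read_mgf(EXAMPLE_MGF)
```

In a real analysis these files are handled by the search programs themselves; the sketch is only meant to make the input formats concrete.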
Relevant parameters for this particular training workflow are:
- Sequence database search program used: X!Tandem
- Protein digestion parameters: Trypsin, with a maximum of 2 missed cleavages
- Precursor ion tolerance: 10 ppm; fragment ion tolerance: 0.05 Da
- Minimum/maximum charge of ions: 2/6
- Fragment ions searched: y and b
- Fixed protein modifications: carbamidomethylation of C, iTRAQ 4-plex of K, iTRAQ 4-plex of peptide N-term
- Variable protein modifications: oxidation of M, iTRAQ 4-plex of Y
PeptideShaker: The output from SearchGUI is processed with PeptideShaker (Vaudel et al. Nature Biotechnol. 2015 Jan;33(1):22–24). It infers protein identities from the peptide sequences matched to the MS/MS spectra and assigns statistical confidence to the identified peptides and proteins.
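The SearchGUI digestion and tolerance settings listed earlier in this step can be made concrete with a short Python sketch. It is a simplified illustration, not the algorithm used by X!Tandem or SearchGUI: the function names are hypothetical, the example sequence is invented, and the digestion rule (cleave after K or R unless the next residue is P) is an assumed, commonly used trypsin convention.

```python
from typing import List


def tryptic_peptides(protein: str, max_missed: int = 2) -> List[str]:
    """In-silico trypsin digest allowing up to max_missed missed cleavages."""
    # Step 1: cut after K or R, except when the next residue is P (assumed rule).
    fragments: List[str] = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])

    # Step 2: join up to max_missed + 1 consecutive fragments per peptide.
    peptides: List[str] = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_precursor_tolerance(observed_mz: float, theoretical_mz: float,
                               tolerance_ppm: float = 10.0) -> bool:
    """Precursor match test using a relative (ppm) tolerance."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tolerance_ppm


def within_fragment_tolerance(observed_mz: float, theoretical_mz: float,
                              tolerance_da: float = 0.05) -> bool:
    """Fragment match test using an absolute (Da) tolerance."""
    return abs(observed_mz - theoretical_mz) <= tolerance_da


# Hypothetical example sequence, for illustration only.
peptides = tryptic_peptides("MAAKGGRPLLKDDR", max_missed=2)
```

The precursor tolerance is relative, so the permitted absolute error grows with m/z, whereas the fragment tolerance is a fixed window in Da.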
Within this workflow, PeptideShaker is run with the “Default” processing options and the “Advanced” filtering option, with relevant parameters as follows:
- Minimum and maximum peptide length: 6 and 50, respectively
- Maximum precursor error: 10.0 ppm
- Outputs selected: PSM report (tabular), Protein report (tabular) and Certificate of Analysis (text)
For this workflow, the dataset collection option (https://galaxyproject.org/tutorials/collections/) is used to create a list of all the MGF files to be analyzed by SearchGUI/PeptideShaker, generating a single output that is then used as input for identifying novel proteoforms. This option is very useful for MS-based proteomics data, where a single complex peptide mixture is often fractionated into simpler mixtures that are each analyzed by LC-MS/MS. The raw data from each fraction are then subjected to sequence database searching, and the results are aggregated downstream.
6. BLAST-P (or Protein BLAST) compares the peptide sequences identified through PeptideShaker with the non-redundant (nr) protein database from NCBI.
This step is critical to ensure that putatively novel peptide sequences identified in the workflow have not already been characterized in past studies. For this paper, the mouse nr protein database was used. Short-Blast is selected for this workflow because the databases and datasets are trimmed for training purposes. Parameters for BLAST:
- Expectation value cutoff: 200000.0
- Output format: Tabular (extended 25 columns)
- Advanced options:
  - Scoring matrix: PAM30
  - Gap costs: 9:1
  - Maximum hits to show: 1
  - Maximum HSPs: 1
  - Word size: 2
  - Multiple hits window size: 40
  - Minimum score (threshold): 11
  - Composition-based statistics: 0
PSME/MVP
7. TBLASTN is used to compare the peptide sequences obtained with the six-frame translated nucleotide sequence database.
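To clarify what "six-frame translated" means, the following Python sketch translates a nucleotide sequence in the three forward and three reverse-complement reading frames using the standard genetic code. It is only an illustration of the concept and not the TBLASTN implementation; the function names and the example sequence are hypothetical.

```python
from typing import Dict

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> Dict[int, str]:
    """Return translations keyed by frame: +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames: Dict[int, str] = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


# Hypothetical example sequence, for illustration only.
frames = six_frame_translation("ATGGCCAAGTAA")
```

Each peptide is compared against all six translations, which is why a hit can be reported on either strand of the nucleotide sequence.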
For this training workflow, the Mus_musculus.GRCm38 database was used. The traditional TBLASTN option was used for the comparison. Parameters for TBLASTN:
- Expectation value cutoff: 10000.0
- Output format: Tabular (extended 25 columns)
- Advanced options:
  - Database/subject genetic code: Standard
  - Gap costs: Use defaults
  - Maximum hits to show: 5
  - Maximum HSPs: 5
  - Word size: Use default
  - Multiple hits window size: Use default
  - Minimum score (threshold): Use default
  - Composition-based statistics: Use default
8. IGV (Integrative Genomics Viewer. Nature Biotechnology 29, 24–26 (2011); Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics 14, 178-192 (2013)) is an interactive visualization tool used for exploring large genomic datasets. In this case, IGV helps in visualizing the genomic coordinates and the localization of the peptide on the chromosome.
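Peptide locations can be loaded into IGV as a BED track. The sketch below converts the subject start/end of a TBLASTN tabular hit into a BED line. It assumes that the subject sequences are whole chromosomes, so that sstart/send are 1-based chromosome coordinates, with sstart greater than send for reverse-strand hits. The function name and example values are hypothetical, and this is not necessarily how the tools in the workflow generate their tracks.

```python
def hit_to_bed(chrom: str, sstart: int, send: int, name: str) -> str:
    """Convert 1-based, inclusive TBLASTN subject coordinates to a 6-column BED line."""
    # Reverse-strand hits are reported with sstart > send.
    strand = "+" if sstart <= send else "-"
    # BED uses 0-based, half-open intervals: subtract 1 from the start only.
    start = min(sstart, send) - 1
    end = max(sstart, send)
    return f"{chrom}\t{start}\t{end}\t{name}\t0\t{strand}"


# Hypothetical hits, for illustration only.
forward_line = hit_to_bed("chr1", 101, 130, "peptide_1")
reverse_line = hit_to_bed("chr2", 250, 221, "peptide_2")
```

The resulting lines can be written to a .bed file and opened in IGV alongside the matching genome assembly.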