ProTN is an integrative pipeline for flexible analysis and informative visualization of proteomics data from Mass Spectrometry. ProTN currently works on peptide abundances obtained with Proteome Discoverer (PD) or MaxQuant (MQ), two of the most widely used platforms analyzing raw MS spectra.
-
1. Workflow of ProTN
The ProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and supporting complex designs. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis, detection of communities in protein-protein interaction networks, and kinome tree perturbation analysis.
01. Define analysis settings and load input data files
ProTN analyses the results of Proteome Discoverer and MaxQuant. The essential parameters and files to run ProTN are summarized here (additional details on the input can be found in section 2. Details on the input parameters and files).
- Analysis title: title of the experiment. It will be the title of the web page report.
- Identification software: determines which software was used to identify peptides and proteins. PD for Proteome Discoverer, MQ for MaxQuant.
- Sample Annotation file: Excel file with the information about samples and the association between replicates and conditions. (WARNING: Condition name MUST contain at least 1 character!)
- Peptides file: raw file of peptides obtained from PD or MQ (file peptides.txt).
- Proteins file: raw file of protein groups obtained from PD or MQ (file proteinGroups.txt).
- Design and comparison file: Excel file containing the formulas of the contrast comparison you want to analyse.
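To make the idea of a contrast formula more concrete, the following Python sketch shows how a Limma-style formula such as "Treated - Control" translates into one weight per condition. It is only an illustration of the arithmetic, not how ProTN or Limma parse formulas: the function name, the condition names and the use of `eval` are assumptions, and condition names are assumed to be valid identifiers.

```python
def contrast_coefficients(formula, conditions):
    """Return the weight each condition receives in a linear contrast formula.

    Hypothetical helper for illustration; ProTN delegates this to Limma in R.
    """
    coefficients = {}
    for target in conditions:
        # Evaluate the formula with one condition set to 1 and all others to 0.
        # For a linear formula the result is that condition's coefficient.
        namespace = {name: (1.0 if name == target else 0.0) for name in conditions}
        coefficients[target] = eval(formula, {"__builtins__": {}}, namespace)
    return coefficients


conditions = ["Control", "Treated", "Rescue"]  # hypothetical condition names
print(contrast_coefficients("Treated - Control", conditions))
# {'Control': -1.0, 'Treated': 1.0, 'Rescue': 0.0}
print(contrast_coefficients("(Treated + Rescue)/2 - Control", conditions))
# {'Control': -1.0, 'Treated': 0.5, 'Rescue': 0.5}
```

A condition that does not appear in the formula gets weight 0, which is why every condition name used in a formula has to match a name in the Sample Annotation file.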
02. Normalization and imputation of raw intensities
Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.
At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides, but also summarizes peptide intensities into protein relative abundances by the median sweeping method.
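The two normalization steps can be sketched in plain Python as follows. This is a simplified illustration of the idea, not the DEqMS code: the function names, the input layout and the example values are assumptions, and missing values are assumed to have been handled beforehand.

```python
from statistics import median


def equal_median_normalization(columns):
    """Shift each sample (a list of log2 intensities) so that its median is 0."""
    return {sample: [v - median(values) for v in values]
            for sample, values in columns.items()}


def median_sweeping(peptide_rows, n_samples):
    """Summarize peptide rows into protein relative abundances.

    peptide_rows: list of (protein_id, [log2 intensity per sample]).
    """
    # 1. Centre every peptide row on its own median across samples.
    centered = [(prot, [v - median(vals) for v in vals]) for prot, vals in peptide_rows]
    # 2. Per protein and sample, take the median over that protein's peptides.
    by_protein = {}
    for prot, vals in centered:
        by_protein.setdefault(prot, []).append(vals)
    protein = {prot: [median(row[j] for row in rows) for j in range(n_samples)]
               for prot, rows in by_protein.items()}
    # 3. Centre every sample column on its median across proteins.
    col_medians = [median(vals[j] for vals in protein.values()) for j in range(n_samples)]
    return {prot: [v - m for v, m in zip(vals, col_medians)]
            for prot, vals in protein.items()}


print(equal_median_normalization({"s1": [1.0, 2.0, 3.0]}))
# {'s1': [-1.0, 0.0, 1.0]}

rows = [("P1", [10.0, 12.0]), ("P1", [11.0, 15.0]), ("P2", [8.0, 8.0])]  # made-up values
print(median_sweeping(rows, n_samples=2))
# {'P1': [-0.75, 0.75], 'P2': [0.75, -0.75]}
```

After either step, each sample has median 0, so the values are relative abundances and are comparable across samples.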
Missing values are then imputed. The principal method is based on the PhosR package (Kim et al. 2021), which performs a complex and well-balanced imputation of the data based on the association between replicates and conditions. As a backup method, ProTN uses a Gaussian round imputation for conditions with only 1 replicate.
In this step, two MDSs and two PCAs (proteins and peptides) are generated for data exploration.
03. Differential analysis
Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. The DEqMS package (Zhu 2022) is used for proteins. DEqMS is developed on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides per protein, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).
Limma and DEqMS calculate differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on different parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:
- the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr),
- the log2 expression is higher than the threshold (log2 expression > Signal log2 expr thr).
It is Down-regulated if:
- the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr),
- the log2 expression is higher than the threshold (log2 expression > Signal log2 expr thr).
In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
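The classification rules above can be summarized in a few lines of Python. This is a minimal sketch, not ProTN's code: the function and argument names are hypothetical, the default values are the documented defaults (Log2 FC thr 0.75, P.Value thr 0.05, no limit on the log2 expression), and `p_value` stands for whichever of P.Value or Adj.P.Value the user has chosen.

```python
import math


def classify(log2_fc, p_value, log2_expr,
             fc_thr=0.75, pval_thr=0.05, expr_thr=-math.inf):
    """Return '+', '-' or '=' for one protein/peptide in one comparison."""
    # Both directions share the significance and expression conditions.
    if p_value < pval_thr and log2_expr > expr_thr:
        if log2_fc > fc_thr:
            return "+"   # up-regulated
        if log2_fc < -fc_thr:
            return "-"   # down-regulated
    return "="           # not significant


print(classify(1.2, 0.01, 20.0))   # +
print(classify(-1.0, 0.01, 20.0))  # -
print(classify(1.2, 0.20, 20.0))   # =
print(classify(0.5, 0.01, 20.0))   # =
```

Note that a large fold change alone is not enough: an entry with a p-value above the threshold is reported as “=” regardless of its log2 FC.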
Various figures are generated: first a bar plot that graphically represents the DEPs identified, followed by comparison-specific volcano plots.
04. Report creation and download of the results
Results are summarized in a web-page HTML report. In addition, ProTN generates a large number of useful files: a description of each output file can be found in section 4. Details on the output files. All the files are grouped in a zip file and downloaded.
ADDITIONAL STEPS:
B1. Batch Effect correction
If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). Batches need to be defined in the Sample_Annotation file, where the column MS_batch is required.
E1. Enrichment analysis of the Differentially Expressed Proteins
The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches on a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.
Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the merge of the two (called all). ProTN provides for each term statistical parameters like P.Value, fdr, odds ratio, overlap size.
ProTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. ProTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.
For a term to be significant, it needs to have:
- an FDR or P.Value lower than P.Value thr for enrichment (P.Value < P.Value thr for enrichment),
- an Overlap Size higher than Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
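As an illustration, the two conditions can be applied to a list of enrichment results as in the Python sketch below. The field names (`term`, `p_value`, `overlap_size`) and the example terms are hypothetical and do not reflect the actual column names of the EnrichR output; the default thresholds follow the documented defaults (0.05 and 5).

```python
def significant_terms(terms, pval_thr=0.05, overlap_thr=5):
    """Keep the terms that pass both the p-value and the overlap size threshold."""
    return [t for t in terms
            if t["p_value"] < pval_thr and t["overlap_size"] > overlap_thr]


# Made-up example records, for illustration only.
results = [
    {"term": "term A", "p_value": 0.001, "overlap_size": 12},
    {"term": "term B", "p_value": 0.001, "overlap_size": 3},   # too few overlapping DEPs
    {"term": "term C", "p_value": 0.300, "overlap_size": 20},  # not significant
]
print([t["term"] for t in significant_terms(results)])
# ['term A']
```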
ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.
E1.1. Enrichment analysis of the whole set of proteins discovered by the experiment
In some cases it can be useful to have the enrichment of the whole proteome discovered by the experiment, for example as a negative control for the differentially expressed proteins. The entire proteome is therefore analysed with EnrichR, and the results are saved in an RData and in an Excel file. As before, 4 plots can be generated, in this case adding as the last dot column the negative control provided by the whole proteome.
Additional details on the output can be found in section 4. Details on the output files.
N1. Protein-Protein Interaction network analysis of Differentially Expressed Proteins
ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).
The species-specific database is retrieved from the STRING server, and all the interactions above a threshold are used to generate a network with iGraph (Csardi and Nepusz 2006). Proteins are also clustered in communities via iGraph, which identifies dense subgraphs by optimizing the modularity score.
Two force-directed layouts are used to display networks: Fruchterman-Reingold algorithm and the Kamada-Kawai algorithm.
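To give an intuition of the modularity score that community detection optimizes, the following Python sketch computes it for a given partition of a small unweighted network. It is an illustration only: ProTN relies on iGraph for this step, the specific iGraph clustering function is not stated here, and the node names and edges below are made up.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities of (internal_edges / m) - (degree_sum / (2 * m)) ** 2,
    where m is the total number of edges.
    """
    m = len(edges)
    community_of = {node: i for i, members in enumerate(communities) for node in members}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Two triangles joined by a single edge (hypothetical proteins P1..P6).
edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
         ("P4", "P5"), ("P5", "P6"), ("P4", "P6"),
         ("P3", "P4")]
split = [{"P1", "P2", "P3"}, {"P4", "P5", "P6"}]
merged = [{"P1", "P2", "P3", "P4", "P5", "P6"}]
print(round(modularity(edges, split), 3))   # 0.357
print(modularity(edges, merged))            # 0.0
```

The partition that separates the two triangles scores higher than putting every protein in one community, which is the behaviour a modularity-optimizing algorithm exploits to find dense subgraphs.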
-
-
2. Details on the input parameters and files
- Title of Analysis: title of the experiment. It will be the title of the web page report.
- Brief Description: description of the current experiment. It is the first paragraph of the report.
- Software Analyzer: determines which software was used to identify peptides and proteins. PD for Proteome Discoverer, MQ for MaxQuant.
- Sample Annotation file: file with the information about the samples and the association between replicate ID and condition (WARNING: Condition name MUST contain at least 1 character!). The Sample_Annotation file is an Excel file with the following columns:
  - Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions need to be the same as in the Contrast Design. (WARNING: Condition name MUST contain at least 1 character!)
  - Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
  - MS_batch column: defines the batch groups of the samples. REQUIRED FOR BATCH EFFECT CORRECTION.
  - Sample column: defines the names of the samples.
    - For PD files, use the Sample_Annotation file obtained from PD. The Sample column is optional: if it is not present, the software extracts the names from the File Name column.
    - For MQ analysis, this column is REQUIRED. ATTENTION: SAMPLE NAME MUST BE EQUAL TO THE NAME INSERTED IN MAXQUANT (name of the column in peptide file).
- Peptides file: raw file of peptides obtained from PD or MQ (file peptides.txt).
- Proteins file: raw file of protein groups obtained from PD or MQ (file proteinGroups.txt).
- Design for the comparison file: Excel file containing the formulas of the contrast comparison you want to analyse. The table can have 3 columns:
  - Formule column (REQUIRED): the formulas need to follow the syntax of Limma. AT LEAST 1 FORMULA IS REQUIRED.
  - Name column (OPTIONAL): personalized name assigned to the comparison.
  - Color column (OPTIONAL): defines a color for the condition in the graphs. If not present, the default palette is used.
- OPTIONAL:
  - Signal log2 expr thr: signal log2 expression threshold for the differential analysis. DEFAULT: DEPs if Signal log2 expression threshold > -∞ (no limit, represented by the value “inf” in the cell).
  - Log2 FC thr: Fold Change threshold for the differential analysis. DEFAULT: DEPs if log2 Fold Change threshold = 0.75 (Up-regulated > 0.75, Down-regulated < -0.75).
  - P.Value thr: p-value threshold for the differential analysis. DEFAULT: DEPs if P value threshold < 0.05.
  - Batch Correction: execution of the batch effect correction performed by proBatch (Cuklina et al. 2018) (TRUE or FALSE). If TRUE, the column MS_batch is required in the Sample_Annotation file.
  - Control Boxplot proteins: list of proteins used as control of the intensities. For each protein a boxplot is generated comparing the mean of the intensities grouped by condition.
  - Execute PPI network STRINGdb: boolean value for the execution of the network analysis (TRUE or FALSE).
  - Execute Enrichment: boolean value for the execution of the enrichment step (TRUE or FALSE).
  - Execute of Whole universe Enrichment: boolean value for the execution of the enrichment step on the whole set of proteins of the experiment (TRUE or FALSE).
  - P.Value thr for enrichment: p-value threshold of the enriched terms. DEFAULT: Term is significant if P Value threshold for Enrichment < 0.05.
  - Overlap size thr for enrichment: overlap size threshold. The overlap size is the number of DEPs found in the enriched term. DEFAULT: Term is significant if Minimum number of overlap genes with enriched terms > 5.
  - Terms to search: keywords that you want to search in the results of EnrichR and visualize in a plot (EX: MYC, C-MYC, Senescence,…).
  - DB to analyse: datasets that you want to see in your plots (EX: ChEA_2016, KEGG_2021_Human, BioPlanet_2019, GO_Biological_Process_2021,…), using the same names that you can find in EnrichR.
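The requirements on the Sample Annotation file described in this section can be checked before uploading. The Python sketch below is a hypothetical pre-flight check, not part of ProTN: it assumes the Excel sheet has already been read into a list of dictionaries (one per sample, keyed by column name), and the function name and messages are invented for illustration.

```python
def _is_blank(value):
    """True if a cell is missing or contains only whitespace."""
    return value is None or str(value).strip() == ""


def check_sample_annotation(rows, software, batch_correction=False):
    """Return a list of problems found in the annotation rows.

    rows: list of dicts, one per sample; software: "PD" or "MQ".
    """
    problems = []
    for i, row in enumerate(rows, start=1):
        # Condition is always required and must contain at least 1 character.
        if _is_blank(row.get("Condition")):
            problems.append(f"row {i}: Condition is missing or empty")
        # Sample is optional for PD but required for MaxQuant input.
        if software == "MQ" and _is_blank(row.get("Sample")):
            problems.append(f"row {i}: Sample is required for MaxQuant input")
        # MS_batch is only needed when batch correction is requested.
        if batch_correction and _is_blank(row.get("MS_batch")):
            problems.append(f"row {i}: MS_batch is required for batch correction")
    return problems


rows = [
    {"Condition": "WT", "Sample": "WT_1"},   # hypothetical sample names
    {"Condition": "", "Sample": "KO_1"},
]
print(check_sample_annotation(rows, software="MQ"))
# ['row 2: Condition is missing or empty']
```

This check does not verify that the Sample names match the column names in the MaxQuant peptide file, which still has to be confirmed against the actual file.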
-
-
-
-
3. Example case study - Proteomics from Steger et al. (2016)
Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases
PRIDE: PXD003071.
Reference: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.
The case study can be run by clicking the button Case Study Example.
-
4. Details on the output files
report.html: complete report of the analysis with all figures and results of the enrichment.
- Data folder
  - protn_env.RData: RData object containing the essential variables and data for additional analyses in R. The variables are:
    - c_anno: dataframe with the description of the samples;
    - dat_gene: dataframe with the protein normalized abundances;
    - dat_pep: dataframe with the peptide normalized abundances;
    - psm_peptide_table: descriptive dataframe of the peptides;
    - deps_l_df: dataframe of differentially expressed proteins;
    - deps_pep_l_df: dataframe of differentially expressed peptides;
    - expr_avgse_df: dataframe with average, standard error and covariance for each condition based on protein abundances;
    - expr_avgse_pep_df: dataframe with average, standard error and covariance for each condition based on peptide abundances;
    - formule_contrast: design for the differential analysis;
    - Settings: colour_vec: color list for the plot, bf: family text, bs: base text size;
    - Thresholds: fc_thr: Fold Change threshold, pval_thr: P.value threshold, pval_enrich_thr: P.value threshold for Enrichment, overlap_size_enrich_thr: Overlap size threshold for enrichment.
  - enrichment_DE.RData: RData object containing enrichment results based on differentially expressed proteins.
- Tables folder
  - normalised_abundances.xlsx: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
    - protein_per_sample: protein abundances per sample.
    - peptide_per_sample: peptide abundances per sample.
    - protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
    - peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
  - differential_expression.xlsx: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
    - protein_DE: protein differential expression results.
    - peptide_DE: peptide differential expression results.
    Annotation columns:
    - Accession: protein UniprotID
    - Description: protein description
    - GeneName: Gene Symbol
    - Peptide_Sequence: peptide sequence
    - Peptide_Modifications: peptide modifications
    - Peptide_Position: start and end position of the peptide within the protein sequence defined by the UniprotID
    - Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
    Columns for each contrast:
    - class: defined according to the fold change, p-value and abundance thresholds specified in the input
      - “+”: up-regulated protein/peptide
      - “-”: down-regulated protein/peptide
      - “=”: invariant protein/peptide
    - log2_FC: protein/peptide log2 transformed fold change
    - p_val: protein/peptide contrast p-value
    - p_adj: protein/peptide adjusted p-value (FDR after BH correction)
    - log2_expr: protein/peptide log2 average abundance
  - enrichment_DE.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
- Figures folder
  - PCA_MDS directory: directory containing MDS (Multidimensional scaling) and PCA (Principal component analysis) plots of samples.
    - mds_proteins.pdf: MDS of the samples based on protein abundances.
    - mds_peptides.pdf: MDS of the samples based on peptide abundances.
    - pca_protein.pdf: PCA of the samples based on protein abundances.
    - pca_peptides.pdf: PCA of the samples based on peptide abundances.
    - DE_mds_protein.pdf: MDS based on differentially expressed proteins.
    - DE_mds_peptide.pdf: MDS based on differentially expressed peptides.
    - DE_pca_protein.pdf: PCA based on differentially expressed proteins.
    - DE_pca_peptide.pdf: PCA based on differentially expressed peptides.
  - Expression directory: directory containing figures related to expression analyses.
    - selected_protein_plot.pdf: plot displaying the abundances of selected proteins.
    - DE_protein_barplot.pdf: number of differentially expressed proteins found in each contrast.
    - DE_peptide_barplot.pdf: number of differentially expressed peptides found in each contrast.
  - Enrichment directory: directory containing figures from functional annotation enrichment analysis.
    - enr_updown_keysources.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided in up- and down-regulated. Terms are filtered for key source datasets selected in the advanced options.
    - enr_updown_keywords.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided in up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
    - enr_DE_keysources.pdf: dot plot of top enriched terms based on differentially expressed proteins. Terms are filtered for key source datasets selected in the advanced options.
    - enr_DE_keywords.pdf: dot plot of top enriched terms based on differentially expressed proteins. Terms are filtered for keywords defined in the advanced options.
  - Network directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
    - communities_sizes.pdf: histogram representing the number of proteins in each network community.
    - ppi_network.pdf: network representation in two layouts: Fruchterman Reingold (fr) and Kamada Kawai (kk).
PhosProTN is an integrative pipeline for flexible analysis and informative visualization of phosphoproteomic data from Mass Spectrometry. It performs a complete analysis of the raw files from Proteome Discoverer (PD) or MaxQuant (MQ), two of the most widely used platforms analyzing raw MS spectra.
-
1. Workflow of PhosProTN
The PhosProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and supporting complex designs. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis, detection of communities in protein-protein interaction networks, and kinome tree perturbation analysis.
01. Define analysis settings and load input data files
PhosProTN analyses the results of Proteome Discoverer and MaxQuant. The parameters and files required to run PhosProTN are described in section 2. Details on the input parameters and files.
- Analysis title: title of the experiment. It will be the title of the web page report.
- Identification software: determines which software was used to identify peptides and proteins. PD for Proteome Discoverer, MQ for MaxQuant.
- Design for the comparison file: Excel file containing the formulas of the contrast comparison you want to analyse. The table can have 3 columns:
  - Formule column (REQUIRED): the formulas need to follow the syntax of Limma. AT LEAST 1 FORMULA IS REQUIRED.
  - Name column (OPTIONAL): personalized name assigned to the comparison.
  - Color column (OPTIONAL): defines a color for the condition in the graphs. If not present, the default palette is used.
- Required for Proteome Discoverer:
  - PROTEOMIC files:
    - Sample Annotation file: file with the information about the samples and the association between replicate ID and condition of the proteomics (WARNING: Condition name MUST contain at least 1 character!). The Sample_Annotation file is an Excel file with the following columns:
      - Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions need to be the same as in the Contrast Design and the same as in the phospho-proteomics. (WARNING: Condition name MUST contain at least 1 character!)
      - Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
      - MS_batch column: defines the batch groups of the samples. REQUIRED FOR BATCH EFFECT CORRECTION.
      - Sample column: defines the names of the samples. For PD files, use the Sample_Annotation file obtained from PD. The Sample column is optional: if it is not present, the software extracts the names from the File Name column.
    - Peptides file: raw file of peptides.
    - Proteins file: raw file of protein groups.
  - PHOSPHO-PROTEOMIC files:
    - Sample Annotation file: file with the information about the samples and the association between replicate ID and condition of the phospho-proteomics (WARNING: Condition name MUST contain at least 1 character!). The Sample_Annotation file is an Excel file with the following columns:
      - Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions need to be the same as in the Contrast Design and the same as in the proteomics. (WARNING: Condition name MUST contain at least 1 character!)
      - Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
      - MS_batch column: defines the batch groups of the samples. REQUIRED FOR BATCH EFFECT CORRECTION.
      - Sample column: defines the names of the samples. For PD files, use the Sample_Annotation file obtained from PD. The Sample column is optional: if it is not present, the software extracts the names from the File Name column.
    - Peptides file: raw file of peptides.
    - Proteins file: raw file of protein groups.
    - PSM file: raw file of the PSMs obtained from PD. It is required to resolve uncertain phosphorylation site identifications.
- Required for MaxQuant:
  - PROTEOMIC files:
    - Sample Annotation file: file with the information about the samples and the association between replicate ID and condition of the proteomics (WARNING: Condition name MUST contain at least 1 character!). The Sample_Annotation file is an Excel file with the following columns:
      - Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions need to be the same as in the Contrast Design and the same as in the phospho-proteomics. (WARNING: Condition name MUST contain at least 1 character!)
      - Sample column (REQUIRED): defines the names of the samples. ATTENTION: SAMPLE NAME MUST BE EQUAL TO THE NAME INSERTED IN MAXQUANT (name of the column in peptide file).
      - Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
      - MS_batch column: defines the batch groups of the samples. REQUIRED FOR BATCH EFFECT CORRECTION.
    - Evidence file: raw file of peptides. The file required is the evidence.txt file.
  - PHOSPHO-PROTEOMIC files:
    - Sample Annotation file: file with the information about the samples and the association between replicate ID and condition of the phospho-proteomics (WARNING: Condition name MUST contain at least 1 character!). The Sample_Annotation file is an Excel file with the following columns:
      - Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions need to be the same as in the Contrast Design and the same as in the proteomics. (WARNING: Condition name MUST contain at least 1 character!)
      - Sample column (REQUIRED): defines the names of the samples. ATTENTION: SAMPLE NAME MUST BE EQUAL TO THE NAME INSERTED IN MAXQUANT (name of the column in peptide file).
      - Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
      - MS_batch column: defines the batch groups of the samples. REQUIRED FOR BATCH EFFECT CORRECTION.
    - Evidence file: raw file of peptides. The file required is the evidence.txt file.
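Since the conditions in the proteomic and phospho-proteomic Sample Annotation files must agree with each other, a quick comparison before uploading can catch naming mismatches. The Python sketch below is a hypothetical check, not part of PhosProTN; it assumes the Condition columns have already been read into two lists, and the condition names are made up.

```python
def condition_mismatches(proteome_conditions, phospho_conditions):
    """Report condition names that appear in only one of the two annotations."""
    prot, phos = set(proteome_conditions), set(phospho_conditions)
    return {
        "only_in_proteome": sorted(prot - phos),
        "only_in_phospho": sorted(phos - prot),
    }


proteome = ["WT", "WT", "KO", "KO"]    # hypothetical Condition column (proteomics)
phospho = ["WT", "WT", "ko", "ko"]     # same design, but with a different spelling
print(condition_mismatches(proteome, phospho))
# {'only_in_proteome': ['KO'], 'only_in_phospho': ['ko']}
```

Both lists being empty means the two annotations use the same set of condition names; the comparison is case-sensitive on purpose, since a spelling difference is treated as a different condition here.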
02. Normalization and imputation of the intensities
Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.
At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides, but also summarizes peptide intensities into protein relative abundances by the median sweeping method.
Missing values are then imputed. The principal method is based on the PhosR package (Kim et al. 2021), which performs a complex and well-balanced imputation of the data based on the association between replicates and conditions. As a backup method, ProTN uses a Gaussian round imputation for conditions with only 1 replicate.
03. Differential analysis
Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. The DEqMS package (Zhu 2022) is used for proteins. DEqMS is developed on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides per protein, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).
Limma and DEqMS calculate differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on different parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:
- the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr),
- the log2 expression is higher than the threshold (log2 expression > Signal log2 expr thr).
It is Down-regulated if:
- the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr),
- the log2 expression is higher than the threshold (log2 expression > Signal log2 expr thr).
In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
Various figures are generated: first a bar plot that graphically represents the DEPs identified, followed by comparison-specific volcano plots.
04. Report creation and download of the results
The results are summarized in a web-page HTML report. In addition, the experiment is described by a large number of files: a description of each file generated can be found in section 3. Details on the output files. All the files are grouped in a zip file and downloaded.
ADDITIONAL STEPS:
B1. Batch Effect correction
If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the Sample_Annotation file, where the column MS_batch is required.
E1. Enrichment analysis of the Differentially Expressed Proteins
The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches on a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.
Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the merge of the two (called all). ProTN provides for each term statistical parameters like P.Value, fdr, odds ratio, overlap size.
ProTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. ProTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.
For a term to be significant, it needs to have:
- an FDR or P.Value lower than P.Value thr for enrichment (P.Value < P.Value thr for enrichment),
- an Overlap Size higher than Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.
E1.1. Enrichment analysis of the whole set of proteins discovered by the experiment
In some cases it can be useful to have the enrichment of the whole proteome discovered by the experiment, for example as a negative control for the differentially expressed proteins. The entire proteome is therefore analysed with EnrichR, and the results are saved in an RData and in an Excel file. As before, 4 plots can be generated, in this case adding as the last dot column the negative control provided by the whole proteome.
Additional details on the output can be found in section 4. Details on the output files.
K1. Activity kinase tree analysis of the Differentially Expressed Phosphosite
In phosphoproteomics it is extremely useful to study the activation status of kinases based on the differentially expressed substrates identified by the differential analysis. For each comparison, PhosProTN predicts the activation state of the kinases using PhosR (Kim et al. 2021). PhosR provides a kinase-substrate relationship score, and on that basis it prioritises potential kinases that could be responsible for the phosphorylation change of a phosphosite, using kinase recognition motifs and phosphoproteomic dynamics.
The activity score provided by PhosR is used to generate a graphical version of the human kinome tree using CORAL (Metz K.S. et al. 2018), a Shiny web app for visualizing both quantitative and qualitative data. It generates high-resolution scalable vector graphic files suitable for publication without the need for refinement in graphic editing software.
N1. Protein-Protein Interaction network analysis of Differentially Expressed Phosphosite
ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).
The species-specific database is retrieved from the STRING server, the interactions among the DEPs are identified, and an iGraph (Csardi and Nepusz 2006) network is generated. The proteins are then clustered via an iGraph function which identifies dense subgraphs by optimizing the modularity score.
Since networks can vary a lot in composition, two ggplot layouts are used: the Fruchterman-Reingold algorithm and the Kamada-Kawai algorithm.
-
-
2. Details on the input parameters and files
-
Title of Analysis
: title of the experiment. It will be the title of the web page report. -
Brief Description
: description of the current experiment. It is the first paragraph of the report. -
Software Analyzer
: determines which software was used to identify peptides and proteins. PD for Proteome Discoverer, MQ for MaxQuant. -
Files required for Proteome Discoverer:
-
PROTEOMIC files:
-
Sample Annotation file
: Excel file with the information about the samples and the association between replicate ID and condition for the proteomics data. The Sample_Annotation file has the following columns:
- Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions must be the same as in the Contrast Design and the same as in the phospho-proteomics. (WARNING: Condition name MUST contain at least 1 character!)
- Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
- MS_batch column (REQUIRED FOR BATCH EFFECT CORRECTION): defines the batch groups of the samples.
- Sample column: defines the names of the samples. For PD files, use the Sample_Annotation file obtained from PD; the Sample column is optional and, if it is not present, the software extracts the names from the File Name column.
-
Peptides file
: raw file of peptides. -
Proteins file
: raw file of protein groups.
-
-
PHOSPHO-PROTEOMIC files:
-
Sample Annotation file
: Excel file with the information about the samples and the association between replicate ID and condition for the phospho-proteomics data. The Sample_Annotation file has the following columns:
- Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions must be the same as in the Contrast Design and the same as in the proteomics. (WARNING: Condition name MUST contain at least 1 character!)
- Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
- MS_batch column (REQUIRED FOR BATCH EFFECT CORRECTION): defines the batch groups of the samples.
- Sample column: defines the names of the samples. For PD files, use the Sample_Annotation file obtained from PD; the Sample column is optional and, if it is not present, the software extracts the names from the File Name column.
-
Peptides file
: raw file of peptides. -
Proteins file
: raw file of protein groups. -
PSM file
: raw file of the PSMs obtained from PD. It is required to resolve uncertain phosphorylation site identifications.
-
-
-
Files required for MaxQuant:
-
PROTEOMIC files:
-
Sample Annotation file
: Excel file with the information about the samples and the association between replicate ID and condition for the proteomics data. The Sample_Annotation file has the following columns:
- Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions must be the same as in the Contrast Design and the same as in the phospho-proteomics. (WARNING: Condition name MUST contain at least 1 character!)
- Sample column (REQUIRED): defines the names of the samples. ATTENTION: SAMPLE NAME MUST BE EQUAL TO THE NAME INSERTED IN MAXQUANT (name of the column in the peptide file).
- Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
- MS_batch column (REQUIRED FOR BATCH EFFECT CORRECTION): defines the batch groups of the samples.
-
Evidence file
: raw file of peptides. The file required is the evidence.txt file.
-
-
PHOSPHO-PROTEOMIC files:
-
Sample Annotation file
: Excel file with the information about the samples and the association between replicate ID and condition for the phospho-proteomics data. The Sample_Annotation file has the following columns:
- Condition column (REQUIRED): defines the condition of each sample, which divides the samples into groups. The conditions must be the same as in the Contrast Design and the same as in the proteomics. (WARNING: Condition name MUST contain at least 1 character!)
- Sample column (REQUIRED): defines the names of the samples. ATTENTION: SAMPLE NAME MUST BE EQUAL TO THE NAME INSERTED IN MAXQUANT (name of the column in the peptide file).
- Color column (OPTIONAL): defines a color for the samples in the graphs. If not present, a default palette is used.
- MS_batch column (REQUIRED FOR BATCH EFFECT CORRECTION): defines the batch groups of the samples.
-
Evidence file
: raw file of peptides. The file required is the evidence.txt file.
-
-
-
Design for the comparison file
: Excel file containing the formulas of the contrasts you want to analyse. The table can have 3 columns:
- Formule column (REQUIRED): the formulas must follow the syntax of Limma. AT LEAST 1 FORMULA IS REQUIRED.
- Name column (OPTIONAL): custom name assigned to the comparison.
- Color column (OPTIONAL): defines a color for the condition in the graphs. If not present, the default palette is used.
-
OPTIONAL
:-
Signal log2 expr thr: signal log2 expression threshold for the differential analysis. DEFAULT: DEPs if
Signal log2 expression threshold > -∞
(No Limit, represented by the value “inf” in the cell) -
Log2 FC thr: Fold Change threshold for the differential analysis. DEFAULT: DEPs if
log2 Fold Change threshold = 0.75
(Up-regulated > 0.75, Down-regulated < -0.75) -
P.Value thr: p.value threshold for the differential analysis. DEFAULT: DEPs if
P value threshold < 0.05
-
Batch Correction: whether to run the batch effect correction performed by proBatch (Cuklina et al. 2018) (TRUE or FALSE). If TRUE, the MS_batch column is required in the Sample_Annotation file.
-
Control Boxplot proteins: list of proteins used as controls for the intensities. For each protein, a boxplot is generated comparing the mean intensities grouped by condition.
-
Execute PPI network STRINGdb: boolean value for the execution of the network analysis. (TRUE or FALSE)
-
Execute Enrichment: boolean value for the execution of the enrichment step. (TRUE or FALSE)
-
Execute of Whole universe Enrichment: boolean value for the execution of the enrichment step on the whole set of proteins of the experiment. (TRUE or FALSE)
-
Draw the kinase trees: boolean value for the execution of the kinase tree analysis for each comparison. (TRUE or FALSE)
-
P.Value thr for enrichment: p.value threshold of the enriched terms. DEFAULT: Term is significant if
P Value threshold for Enrichment < 0.05
-
Overlap size thr for enrichment: Overlap size threshold. The overlap size is the number of DEPs discovered in the enriched terms. DEFAULT: Term is significant if
Minimum number of overlap genes with enriched terms > 5
-
Terms to search: keywords that you want to search for in the EnrichR results and visualize in a plot (EX: MYC, C-MYC, Senescence,…).
-
DB to analyse: datasets that you want to see in your plots (EX: ChEA_2016, KEGG_2021_Human, BioPlanet_2019, GO_Biological_Process_2021,…), written with the same names used in EnrichR.
-
-
-
-
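The consistency rules stated in this section (non-empty condition names, MS_batch required when batch correction is enabled, at least one formula, and formulas that use the same condition names as the Sample Annotation) can be sketched as a small pre-flight check. This is a hedged illustration, not ProTN's own validation: the function name `validate_inputs`, the representation of the Excel rows as dictionaries, and the simple tokenisation of Limma-style formulas are all assumptions.

```python
import re


def _cell(row, key):
    """Return a table cell as a stripped string ('' when missing)."""
    value = row.get(key)
    return "" if value is None else str(value).strip()


def validate_inputs(samples, formulas, batch_correction=False):
    """Return a list of problems found; an empty list means the inputs look consistent.

    samples: rows of the Sample Annotation table, as dicts keyed by column name.
    formulas: contrast formulas, e.g. "Treated-Control".
    """
    problems = []
    conditions = set()
    for i, row in enumerate(samples, start=1):
        condition = _cell(row, "Condition")
        if condition:
            conditions.add(condition)
        else:
            problems.append(f"row {i}: Condition is empty")
        if batch_correction and not _cell(row, "MS_batch"):
            problems.append(f"row {i}: MS_batch is required for batch correction")
    if not formulas:
        problems.append("at least 1 formula is required")
    for formula in formulas:
        # Assumption: condition names start with a letter; numeric coefficients are skipped.
        for name in re.findall(r"(?<![\w.])[A-Za-z][\w.]*", formula):
            if name not in conditions:
                problems.append(f"formula '{formula}': unknown condition '{name}'")
    return problems


# Hypothetical annotation rows and contrasts.
rows = [{"Sample": "S1", "Condition": "Treated"},
        {"Sample": "S2", "Condition": "Control"}]
ok = validate_inputs(rows, ["Treated-Control"])
missing_batch = validate_inputs(rows, ["Treated-Control"], batch_correction=True)
bad_name = validate_inputs(rows, ["Treated-Ctrl"])
```

Here `ok` is empty, `missing_batch` reports both rows, and `bad_name` flags `Ctrl` as a condition that does not appear in the annotation.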
3. Example case study - Phosphoproteomics from Steger et al. (2016)
Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases
PRIDE: PXD003071.
Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.
To run it, click the Case Study Example button.
-
4. Details on the output files
report.html
: complete report of the analysis with all figures and enrichment results.
- Data folder
protn_env.RData
: RData object containing the essential variables and data for additional analyses in R. The variables are:
- c_anno: dataframe with the description of the samples;
- dat_gene: dataframe with the protein normalized abundances;
- dat_pep: dataframe with the peptide normalized abundances;
- psm_peptide_table: descriptive dataframe of the peptides;
- deps_l_df: dataframe of differentially expressed proteins;
- deps_pep_l_df: dataframe of differentially expressed peptides;
- expr_avgse_df: dataframe with average, standard error and covariance for each condition based on protein abundances;
- expr_avgse_pep_df: dataframe with average, standard error and covariance for each condition based on peptide abundances;
- formule_contrast: design for the differential analysis;
- Settings: colour_vec: color list for the plot, bf: family text, bs: base text size;
- Thresholds: fc_thr: Fold Change threshold, pval_thr: P.value threshold, pval_enrich_thr: P.value threshold for Enrichment, overlap_size_enrich_thr: Overlap size threshold for enrichment.
enrichment_DE.RData
: RData object containing enrichment results based on differentially expressed proteins
- Tables folder
normalised_abundances.xlsx
: Excel file containing the abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
- protein_per_sample: protein abundances per sample.
- peptide_per_sample: peptide abundances per sample.
- protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
- peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
differential_expression.xlsx
: Excel file containing the results of the differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
- protein_DE: protein differential expression results.
- peptide_DE: peptide differential expression results. Annotation columns:
- Accession: protein UniprotID
- Description: protein description
- GeneName: Gene Symbol
- Peptide_Sequence: peptide sequence
- Peptide_Modifications: peptide modifications
- Peptide_Position: start and end position of the peptide within the sequence of the protein identified by the UniprotID
- Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
- Columns for each contrast:
- class: defined according to the fold change, p-value and abundance thresholds specified in the input
- + up-regulated protein/peptide
- - down-regulated protein/peptide
- = invariant protein/peptide
- log2_FC: protein/peptide log2 transformed fold change
- p_val: protein/peptide contrast p-value
- p_adj: protein/peptide adjusted p-value (FDR after BH correction)
- log2_expr: protein/peptide log2 average abundance
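The class column can be pictured with the following sketch, which combines the three thresholds using the defaults listed in the input section (log2 fold change 0.75, p-value 0.05, no limit on the signal). It is an assumption that the raw p-value rather than the adjusted one is compared against the threshold, and the function name `classify` is hypothetical; ProTN's actual implementation may differ.

```python
import math


def classify(log2_fc, p_val, log2_expr,
             fc_thr=0.75, pval_thr=0.05, expr_thr=-math.inf):
    """Return '+', '-' or '=' for one protein/peptide in one contrast."""
    # Both the significance and the abundance conditions must hold (assumed logic).
    if p_val < pval_thr and log2_expr > expr_thr:
        if log2_fc > fc_thr:
            return "+"  # up-regulated
        if log2_fc < -fc_thr:
            return "-"  # down-regulated
    return "="  # invariant


up = classify(log2_fc=1.2, p_val=0.01, log2_expr=20.0)
down = classify(log2_fc=-0.9, p_val=0.01, log2_expr=20.0)
not_significant = classify(log2_fc=1.2, p_val=0.20, log2_expr=20.0)
```

With these hypothetical values, `up` is `"+"`, `down` is `"-"`, and `not_significant` is `"="` because its p-value is above the threshold.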
enrichment_DE.xlsx
: excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
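As a rough sketch of this selection, the function below keeps the terms with adjusted p-value below 0.05 and overlap size of at least 5, and optionally restricts them to chosen datasets or keywords, as done for the enrichment figures. The record fields (`db`, `term`, `adj_p`, `overlap`), the function name `select_terms`, and the example values are hypothetical and made up for illustration; they are not EnrichR output.

```python
def select_terms(terms, pval_thr=0.05, overlap_thr=5, databases=None, keywords=None):
    """Keep the terms that pass the significance thresholds and the optional filters."""
    selected = []
    for term in terms:
        if term["adj_p"] >= pval_thr or term["overlap"] < overlap_thr:
            continue  # not significant or too few overlapping DEPs
        if databases and term["db"] not in databases:
            continue  # dataset not among the requested ones
        name = term["term"].lower()
        if keywords and not any(k.lower() in name for k in keywords):
            continue  # none of the keywords appears in the term name
        selected.append(term)
    return selected


# Made-up records, for illustration only.
example_terms = [
    {"db": "KEGG_2021_Human", "term": "Cellular senescence", "adj_p": 0.01, "overlap": 8},
    {"db": "KEGG_2021_Human", "term": "Cell cycle", "adj_p": 0.20, "overlap": 12},
    {"db": "ChEA_2016", "term": "MYC targets", "adj_p": 0.001, "overlap": 3},
    {"db": "ChEA_2016", "term": "MYC ChIP-Seq", "adj_p": 0.03, "overlap": 5},
]
significant = [t["term"] for t in select_terms(example_terms)]
myc_only = [t["term"] for t in select_terms(example_terms, keywords=["myc"])]
```

The keyword match is case-insensitive, so a search term such as `myc` also finds `MYC ChIP-Seq`.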
- Figures folder
PCA_MDS
directory: Directory containing MDS (multidimensional scaling) and PCA (principal component analysis) plots of the samples.
- mds_proteins.pdf: MDS, based on protein abundances.
- mds_peptides.pdf: MDS of the samples by peptides.
- pca_protein.pdf: PCA of the samples by proteins.
- pca_peptides.pdf: PCA of the samples by peptides.
- DE_mds_protein.pdf: MDS based on differentially expressed proteins.
- DE_mds_peptide.pdf: MDS based on differentially expressed peptides.
- DE_pca_protein.pdf: PCA based on differentially expressed proteins.
- DE_pca_peptide.pdf: PCA based on differentially expressed peptides.
Expression
directory: Directory containing figures related to expression analyses.
- selected_protein_plot.pdf: plot displaying the abundances of selected proteins.
- DE_protein_barplot.pdf: number of differentially expressed proteins found in each contrast.
- DE_peptide_barplot.pdf: number of differentially expressed peptides found in each contrast.
Enrichment
directory: Directory containing figures from the functional annotation enrichment analysis.
- enr_updown_keysources.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided in up- and down-regulated. Terms are filtered for key source datasets selected in the advanced options.
- enr_updown_keywords.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided in up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
- enr_DE_keysources.pdf: dot plot of top enriched terms based on differentially expressed proteins. Terms are filtered for key source datasets selected in the advanced options.
- enr_DE_keywords.pdf: dot plot of top enriched terms based on differentially expressed proteins. Terms are filtered for keywords defined in the advanced options.
Network
directory: Directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
- communities_sizes.pdf: histogram representing the number of proteins in each network community.
- ppi_network.pdf: network representation in two layouts: Fruchterman Reingold (fr) and Kamada Kawai (kk).
-
KinaseTree
directory: figures and files with the kinase activity trees generated with PhosR and CORAL. For each design there are two files:-
TXT file
: text table with the identified activity of each kinase. -
SVG file
: vector image of the kinase tree generated with CORAL.
-
Mission of RNA and Disease Data Science laboratory
Human diseases such as cancer are intrinsically entangled with complexity. The discovery of effective cures requires dealing with this complexity and greatly benefits from the development of high-resolution methods of investigation: genome-wide, multi-modal, single cell, spatially resolved. Understanding human diseases at single-cell resolution within their architectural context is a scientific challenge requiring dedicated computational analysis dealing with the volume and heterogeneity of the data. The research mission of the lab is to understand the RNA molecular mechanisms underlying dysregulation in human diseases, by combining high-throughput and high-resolution analyses, with a pan-disciplinary approach.