ProTN is an integrative pipeline that analyzes DDA proteomics data obtained from mass spectrometry (MS). It performs a complete analysis of the output files of different software tools, followed by biological interpretation through enrichment and network analysis. ProTN runs a dual-level analysis, at protein and peptide level.
1. Workflow of ProTN
The ProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and supporting complex designs. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis and detection of communities in protein-protein interaction networks.
01. Define analysis settings and load input data files
ProTN analyses the results of
- Proteome Discoverer
- MaxQuant
- Spectronaut
- FragPipe
Additional details on the input can be found in section "2. Details on the input parameters and files".
02. Normalization and imputation of raw intensities
Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, normalization is performed with the function equalMedianNormalization, which centers the intensity distribution of each sample so that its median equals 0.
At the protein level, this operation is performed by the function medianSweeping, which applies the same median normalization used for peptides and also summarizes peptide intensities into relative protein abundances with the median sweeping method.
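The two normalization steps can be illustrated with a small Python sketch. It is not the DEqMS code: it assumes complete log2 data (no missing values) and assumes that median sweeping consists of centering each peptide on its own median, taking the per-protein median in each sample, and then equalizing sample medians. The function names are hypothetical.

```python
from statistics import median


def equal_median_normalize(log2_rows):
    """Subtract each sample (column) median so every sample has median 0."""
    n_samples = len(log2_rows[0])
    col_medians = [median(row[j] for row in log2_rows) for j in range(n_samples)]
    return [[value - col_medians[j] for j, value in enumerate(row)]
            for row in log2_rows]


def median_sweep(peptide_rows, protein_ids):
    """Summarize peptide rows into relative protein abundances (assumed steps)."""
    # 1) center each peptide on its own median across samples
    centred = [[value - median(row) for value in row] for row in peptide_rows]
    # 2) per protein and sample, take the median of the protein's peptides
    proteins = sorted(set(protein_ids))
    n_samples = len(peptide_rows[0])
    summary = [
        [median(c[j] for c, p in zip(centred, protein_ids) if p == prot)
         for j in range(n_samples)]
        for prot in proteins
    ]
    # 3) equalize the sample medians of the protein table
    return proteins, equal_median_normalize(summary)


# Toy data: 3 peptides x 2 samples, belonging to 2 proteins.
peptides = [[10, 12], [11, 13], [20, 21]]
names, relative = median_sweep(peptides, ["P1", "P1", "P2"])
```

After either function, the median of every sample column is 0, which is the property described above.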
Imputation
- PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round-robin imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, and the other applies the tail-based imputation approach implemented in Perseus.
- Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
- missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
- pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
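As an illustration of the Gaussian estimation option, the following Python sketch replaces missing values with draws from a normal distribution derived from the observed intensities. The `shift` and `scale` parameters are assumptions, not ProTN settings: their defaults follow the common Perseus-style down-shift (mean lowered by 1.8 SD, width 0.3 SD), while `shift=0, scale=1` samples from the observed mean and standard deviation directly.

```python
import random
from statistics import mean, stdev


def gaussian_impute(values, shift=1.8, scale=0.3, seed=0):
    """Fill None entries with draws from N(mu - shift*sd, (scale*sd)^2).

    mu and sd are estimated from the observed (non-missing) values.
    shift, scale and seed are illustrative parameters, not ProTN options.
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [rng.gauss(mu - shift * sd, scale * sd) if v is None else v
            for v in values]


# log2 intensities of one sample, with two missing peptides
sample = [20.1, 22.4, None, 19.8, 21.0, None]
imputed = gaussian_impute(sample)
```

Observed values are returned unchanged; only the missing entries are replaced.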
| Method | Package | Main Idea | Typical Use |
| --- | --- | --- | --- |
| PhosR imputation | PhosR (Bioconductor) | Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure). | Phosphoproteomics with sparse coverage or missingness tied to peptide abundance. |
| Gaussian estimation imputation | | Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD). | When MNAR (Missing Not At Random) is likely due to detection limit. |
| missForest | missForest (R) | Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features. | General proteomics where missingness relates to multiple covariates, or when structure is complex. |
| pcaMethods svdImpute | pcaMethods (Bioconductor) | Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data. | Well-replicated datasets with high correlation between samples. |
In this step, several figures can be generated that summarize preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances.
03. Differential analysis
Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. For proteins, the DEqMS package (Zhu 2022) is used. DEqMS is built on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides, thereby achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).
Compile the comparison table. The table has 2 columns:
- Formula column (REQUIRED): the formulas must follow the Limma syntax (e.g. "cancer-normal").
- Name column (OPTIONAL): personalized name assigned to the comparison (e.g. "cancer_vs_normal").
Limma and DEqMS identify differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on several parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the thresholds on these parameters, set by the user. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:
- the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
It is Down-regulated if:
- the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
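The rule for the “class” column can be written as a short Python sketch. The function name and the default threshold values are hypothetical; only the two conditions listed above (fold change and p-value) are modeled, and the caller decides whether to pass the P.Value or the Adj.P.Value.

```python
def classify(log2_fc, p_value, log2_fc_thr=0.75, p_value_thr=0.05):
    """Return "+", "-" or "=" for one protein/peptide in one comparison.

    log2_fc_thr and p_value_thr stand in for the user-defined thresholds;
    the defaults used here are illustrative only.
    """
    if p_value < p_value_thr and log2_fc > log2_fc_thr:
        return "+"  # up-regulated
    if p_value < p_value_thr and log2_fc < -log2_fc_thr:
        return "-"  # down-regulated
    return "="      # not significant


classes = [classify(fc, p) for fc, p in [(1.2, 0.01), (-1.2, 0.01), (1.2, 0.20)]]
```

Both conditions must hold at the same time: a large fold change with a p-value above the threshold is still reported as “=”.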
Various figures are generated: first a bar plot that summarizes the number of DEPs identified, followed by comparison-specific volcano plots.
04. Report creation and download of the results
Results are summarized in an HTML report. In addition, ProTN generates a large number of useful files; a description of each output file can be found in section "4. Details on the output files". All the files are grouped in a zip file for download.
ADDITIONAL STEPS:
B1. Batch Effect correction
If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches must be defined in the sample annotation file, in an additional column that describes the batch of each sample. This column is required when batch correction is enabled.
E1. Enrichment analysis of the Differentially Expressed Proteins
The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. For this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that queries a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources into 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.
Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the union of the two (called all). For each term, ProTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.
ProTN saves the complete enrichment data frame as an RData file, which the user can easily import in R for further analysis. ProTN also generates an Excel file containing only the significantly enriched terms, as defined by the user settings.
To be significant, a term must have:
- an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment),
- an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
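The two conditions above amount to a simple filter over the enrichment table. A minimal Python sketch follows, assuming each term is a dictionary with hypothetical keys `term`, `p_value` and `overlap_size`; the threshold defaults are illustrative, and the strict inequalities mirror the rule as written in this section.

```python
def significant_terms(terms, p_value_thr=0.05, overlap_size_thr=5):
    """Keep the terms that pass both enrichment thresholds."""
    return [
        t for t in terms
        if t["p_value"] < p_value_thr and t["overlap_size"] > overlap_size_thr
    ]


# Toy enrichment results (term names are placeholders)
results = [
    {"term": "term_A", "p_value": 0.001, "overlap_size": 12},
    {"term": "term_B", "p_value": 0.001, "overlap_size": 3},
    {"term": "term_C", "p_value": 0.300, "overlap_size": 20},
]
kept = [t["term"] for t in significant_terms(results)]
```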
ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.
N1. Protein-Protein Interaction network analysis of Differentially Expressed Proteins
ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).
The species-specific database is retrieved from the STRING server, and all the interactions above a user-defined threshold are used to generate a network with
2. Details on the input parameters and files
- Title of Analysis: title of the experiment. It will be the title of the web page report.
- Brief Description: description of the current experiment. It is the first paragraph of the report.
- Software Analyzer: determines which software was used to identify peptides and proteins.
  - Proteome Discoverer
  - MaxQuant using evidence.txt
  - MaxQuant using peptides.txt and proteinGroups.txt
  - Spectronaut
  - FragPipe
File required for Proteome Discoverer
Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
| Column Name | Description |
| --- | --- |
| File ID | Identifier used in column headers of the peptide file. |
| Condition | Experimental group name. Used for comparisons. |
| Sample | Clean sample name used downstream. |
| color | (Optional) Plot color. Defaults are applied if missing. |
| batch | (Optional) Batch ID for batch effect correction. |
Peptides file: Excel table with annotated peptides and abundance values.
| Column Name | Description |
| --- | --- |
| Master Protein Accessions | Maps peptide to protein; only first ID is kept. |
| Annotated Sequence | Amino acid sequence including PTM annotations. |
| Modifications | Post-translational modifications. |
| Positions in Master Proteins | Position of peptide in the protein sequence. |
| Abundance: <File ID> | Intensity/abundance for each sample. One column per sample. |
Proteins file: Excel table containing descriptive and accession information for proteins.
| Column Name | Description |
| --- | --- |
| Accession | Unique protein identifier, used to join with peptide file. |
| Description | Descriptive string, e.g., from UniProt. |
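Because the File ID values of the annotation file must correspond to the abundance columns of the peptides file, the match can be checked before running the analysis. A minimal Python sketch, assuming the headers follow the "Abundance: <File ID>" pattern literally (real exports may carry additional text in the header); the function name is hypothetical and not part of ProTN.

```python
def missing_abundance_columns(file_ids, peptide_columns):
    """Return the expected 'Abundance: <File ID>' headers that are absent."""
    expected = [f"Abundance: {file_id}" for file_id in file_ids]
    present = set(peptide_columns)
    return [column for column in expected if column not in present]


# File IDs from the annotation file and headers from the peptides file
annotation_file_ids = ["F1", "F2", "F3"]
peptide_headers = ["Master Protein Accessions", "Annotated Sequence",
                   "Abundance: F1", "Abundance: F2"]
missing = missing_abundance_columns(annotation_file_ids, peptide_headers)
```

An empty result means every annotated sample has an abundance column.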
File required for MaxQuant
Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
| Column Name | Description |
| --- | --- |
| Condition | Experimental condition (e.g. Control, Treated). Used for group comparison. |
| Sample | Sample identifier. Must match sample names in the peptide file. |
| color | (Optional) Color associated with the condition. If not present, default colors are assigned. |
| batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled. |
- Evidence pipeline:
evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:
| Column Name | Description |
| --- | --- |
| Sequence | Amino acid sequence of the peptide. |
| Modifications | PTMs of the peptide. |
| Gene names | Gene symbol associated with the peptide. |
| Protein names | Protein description. If missing, will be merged from annotation file. |
| Leading razor protein | UniProt accession. Used for annotation enrichment. |
| Raw file | File/sample ID. Must match entries in the annotation file. |
| Intensity | Peptide intensity value. Used for quantification. |
| Leading proteins | Used for filtering out contaminants (e.g. "CON_"). |
- Peptide and ProteinGroups pipeline:
peptides.txt: Tab-delimited file with peptide-level quantification. Required columns:
| Column Name | Description |
| --- | --- |
| Sequence | Amino acid sequence of the peptide. |
| Gene names | If missing, inferred from Leading razor protein. |
| Protein names | Protein description. If missing, will be merged from annotation file. |
| Leading razor protein | UniProt accession. Used for annotation enrichment. |
| Intensity <Sample> | Intensity values for each sample (e.g. Intensity Sample1). |
proteinGroups.txt: Tab-delimited file providing protein-level information. Required columns:
| Column Name | Description |
| --- | --- |
| Majority protein IDs | Used to extract the Leading razor protein. |
| Fasta headers | Used for protein description. |
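The contaminant filter mentioned for the Leading proteins column of evidence.txt can be sketched as follows. This is an illustrative Python example, not ProTN code: rows are assumed to be dictionaries keyed by column name, and a row is dropped when the "CON_" tag appears anywhere in its Leading proteins value.

```python
def drop_contaminants(rows, tag="CON_"):
    """Remove rows whose 'Leading proteins' value contains the contaminant tag."""
    return [row for row in rows if tag not in row["Leading proteins"]]


# Toy rows; accessions and sequences are placeholders
evidence_rows = [
    {"Sequence": "PEPTIDEK", "Leading proteins": "P12345", "Intensity": 1.5e6},
    {"Sequence": "SAMPLER", "Leading proteins": "CON__P00761", "Intensity": 9.0e7},
    {"Sequence": "ANOTHERK", "Leading proteins": "Q67890;CON__P02768", "Intensity": 2.0e5},
]
clean_rows = drop_contaminants(evidence_rows)
```

Whether a group that mixes contaminant and non-contaminant accessions should be dropped, as done here, is a design choice of this sketch.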
File required for Spectronaut
Annotation file: (Optional) If provided, this file should contain metadata for each sample. If not provided, the pipeline will extract sample annotations directly from the peptide file.
| Column Name | Description |
| --- | --- |
| Condition | Experimental group label. Used for comparison between conditions. |
| Sample | Sample identifier. Must match entries in the peptide file. |
| color | (Optional) Color for visualization. Default colors will be assigned if missing. |
| batch | (Optional) Batch ID for batch correction. Required if batch correction is enabled. |
Spectronaut report: This is a TSV/CSV file containing peptide-level data. Required columns:
| Column Name | Description |
| --- | --- |
| PG.ProteinAccessions | Protein group accessions. |
| PEP.StrippedSequence | Peptide sequence without modifications. |
| EG.ModifiedSequence | Peptide sequence with modifications. |
| PEP.Quantity | Peptide quantification value. |
| R.FileName | Sample identifier (column used is defined by sample_col). |
| R.Condition | Condition identifier (column used if annotation file not provided). |
File required for FragPipe
Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
| Column Name | Description |
| --- | --- |
| Condition | Experimental condition (e.g. Control, Treated). Used for group comparison. |
| Sample | Sample identifier. Must match sample names in the peptide file. |
| color | (Optional) Color associated with the condition. If not present, default colors are assigned. |
| batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled. |
combined_modified_peptide.tsv: The peptide quantification file must contain raw or normalized intensity values for each sample and peptide. Required columns:
| Column Name | Description |
| --- | --- |
| Protein ID | Protein accession or identifier. |
| Protein Description | Descriptive name of the protein. |
| Gene | Gene symbol. |
| Peptide Sequence | Amino acid sequence of the peptide. |
| Assigned Modifications | Modifications assigned to the peptide. |
| Prev AA | Used to determine tryptic condition. |
| Next AA | Used to determine tryptic condition. |
| <Sample> Intensity | One column per sample named like <Sample> Intensity. |
3. Example case study - Proteomics from Steger et al. (2016)
Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases
PRIDE: PXD003071.
Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.
The case study can be downloaded from the Run tab.
4. Details on the output files
- report.html: complete report of the analysis with all figures and enrichment results.
- db_results_proTN.RData: RData object containing all the data produced during the execution, ready for additional analyses in R.
- log_filter_read_function.txt: results of the filters applied during the preprocessing step.
- input_protn folder
  - Input files provided
- rdata folder
  - enrichment.RData: RData object containing enrichment results based on differentially expressed proteins
- Tables folder
  - normalised_abundances.xlsx: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
    - protein_per_sample: protein abundances per sample.
    - peptide_per_sample: peptide abundances per sample.
    - protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
    - peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
  - differential_expression.xlsx: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
    - protein_DE: protein differential expression results.
    - peptide_DE: peptide differential expression results.
    - Annotation columns:
      - Accession: protein UniProt ID
      - Description: protein description
      - GeneName: Gene Symbol
      - Peptide_Sequence: peptide sequence
      - Peptide_Modifications: peptide modifications
      - Peptide_Position: start and end position of the peptide within the sequence of the protein given by the UniProt ID
      - Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
    - Columns for each contrast:
      - class: defined according to the fold change, p-value and abundance thresholds specified in the input
        - "+": up-regulated protein/peptide
        - "-": down-regulated protein/peptide
        - "=": invariant protein/peptide
      - log2_FC: protein/peptide log2 transformed fold change
      - p_val: protein/peptide contrast p-value
      - p_adj: protein/peptide adjusted p-value (FDR after BH correction)
      - log2_expr: protein/peptide log2 average abundance
  - enrichment.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
- Figures folder
  - PDF version of all selected figures
  - enrichment_plot.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided in up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
  - protein_vulcano directory: directory with all the volcano plots based on the differential proteins.
  - peptide_vulcano directory: directory with all the volcano plots based on the differential peptides.
  - STRINGdb directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
    - *_connection.txt: text file with all edges of the network.
    - *_network.pdf: PDF version of the STRINGdb network.
PhosProTN is an integrative pipeline for phosphoproteomic analysis of DDA experimental data obtained from MS. It performs a complete analysis of the output files of Proteome Discoverer (PD) or MaxQuant (MQ), followed by biological interpretation through enrichment and network analysis. PhosProTN analyses the phosphoproteomic data at peptide level.
1. Workflow of PhosProTN
The PhosProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and supporting complex designs. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis, detection of communities in protein-protein interaction networks, and kinome tree perturbation analysis.
01. Define analysis settings and load input data files
PhosProTN analyses the results of
- Proteome Discoverer
- MaxQuant
Additional details on the input can be found in section "2. Details on the input parameters and files".
02. Normalization and imputation of raw intensities
Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, normalization is performed with the function equalMedianNormalization, which centers the intensity distribution of each sample so that its median equals 0.
At the protein level, this operation is performed by the function medianSweeping, which applies the same median normalization used for peptides and also summarizes peptide intensities into relative protein abundances with the median sweeping method.
Imputation
- PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round-robin imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, and the other applies the tail-based imputation approach implemented in Perseus.
- Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
- missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
- pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
| Method | Package | Main Idea | Typical Use |
| --- | --- | --- | --- |
| PhosR imputation | PhosR (Bioconductor) | Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure). | Phosphoproteomics with sparse coverage or missingness tied to peptide abundance. |
| Gaussian estimation imputation | | Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD). | When MNAR (Missing Not At Random) is likely due to detection limit. |
| missForest | missForest (R) | Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features. | General proteomics where missingness relates to multiple covariates, or when structure is complex. |
| pcaMethods svdImpute | pcaMethods (Bioconductor) | Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data. | Well-replicated datasets with high correlation between samples. |
In this step, several figures can be generated that summarize preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances.
03. Differential analysis
Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. For proteins, the DEqMS package (Zhu 2022) is used. DEqMS is built on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides, thereby achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).
Compile the comparison table. The table has 2 columns:
- Formula column (REQUIRED): the formulas must follow the Limma syntax (e.g. "cancer-normal").
- Name column (OPTIONAL): personalized name assigned to the comparison (e.g. "cancer_vs_normal").
Limma and DEqMS identify differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on several parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the thresholds on these parameters, set by the user. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:
- the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
It is Down-regulated if:
- the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr),
- the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
Various figures are generated: first a bar plot that summarizes the number of DEPs identified, followed by comparison-specific volcano plots.
04. Report creation and download of the results
The results are summarized in an HTML report. In addition, the experiment is described by a large number of files; a description of each generated file can be found in section "4. Details on the output files". All the files are grouped in a zip file for download.
ADDITIONAL STEPS:
B1. Batch Effect correction
If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches must be defined in the sample annotation file, in an additional column that describes the batch of each sample. This column is required when batch correction is enabled.
E1. Enrichment analysis of the Differentially Expressed Proteins
The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. For this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that queries a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources into 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.
Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the union of the two (called all). For each term, ProTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.
ProTN saves the complete enrichment data frame as an RData file, which the user can easily import in R for further analysis. ProTN also generates an Excel file containing only the significantly enriched terms, as defined by the user settings.
To be significant, a term must have:
- an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment),
- an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.
K1. Kinase activity tree analysis of the Differentially Expressed Phosphosites
In phosphoproteomics, it is extremely useful to study the activation status of kinases based on the differentially expressed substrates identified by the differential analysis. For each comparison, PhosProTN predicts the activation state of the kinases using PhosR (Kim et al. 2021). PhosR provides a kinase-substrate relationship score and uses it to prioritise the kinases that could be responsible for the phosphorylation change of each phosphosite, on the basis of kinase recognition motifs and phosphoproteomic dynamics.
The activity score provided by PhosR is used to generate a graphical version of the human kinome tree using CORAL (Metz K.S. et al. 2018), a Shiny web app for visualizing both quantitative and qualitative data. It generates high-resolution scalable vector graphic files suitable for publication without the need for refinement in graphic editing software.
N1. Protein-Protein Interaction network analysis of Differentially Expressed Phosphosite
ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).
The species-specific database is retrieved from the STRING server, the interactions among the DEPs are extracted, and an igraph (Csardi and Nepusz 2006) network is generated. The proteins are then clustered with an igraph function that identifies dense subgraphs by optimizing a modularity score.
Since networks can vary a lot in composition, two ggplot layouts are used: the Fruchterman-Reingold algorithm and the Kamada-Kawai algorithm.
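To make the clustering criterion concrete, the Python sketch below computes the modularity score of a given partition of an unweighted, undirected network, using Q = Σ_c [L_c/m − (d_c/2m)²], where m is the number of edges, L_c the number of edges inside community c and d_c the total degree of its nodes. It only evaluates a partition; it does not search for the best one and is not the igraph routine used by the pipeline. Node names are placeholders.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    # map every node to the index of its community
    member = {node: idx for idx, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)  # edges with both ends in the community
    degree = [0] * len(communities)    # summed degree of the community's nodes
    for u, v in edges:
        degree[member[u]] += 1
        degree[member[v]] += 1
        if member[u] == member[v]:
            internal[member[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Two triangles joined by a single edge (C-D)
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"),
         ("D", "E"), ("E", "F"), ("D", "F")]
q_two_clusters = modularity(edges, [{"A", "B", "C"}, {"D", "E", "F"}])
q_one_cluster = modularity(edges, [{"A", "B", "C", "D", "E", "F"}])
```

Splitting the toy network into its two triangles gives Q = 5/14 (about 0.357), while putting all nodes in one community gives Q = 0, so a modularity-optimizing method prefers the split.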
2. Details on the input parameters and files
- Title of Analysis: title of the experiment. It will be the title of the web page report.
- Brief Description: description of the current experiment. It is the first paragraph of the report.
- Software Analyzer: determines which software was used to identify peptides and proteins.
  - Proteome Discoverer
  - MaxQuant using evidence.txt
  - MaxQuant using peptides.txt and proteinGroups.txt
  - Spectronaut
  - FragPipe
File required for Proteome Discoverer
Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
| Column Name | Description |
| --- | --- |
| File ID | Identifier used in column headers of the peptide file. |
| Condition | Experimental group name. Used for comparisons. |
| Sample | Clean sample name used downstream. |
| color | (Optional) Plot color. Defaults are applied if missing. |
| batch | (Optional) Batch ID for batch effect correction. |
Peptides file: Excel table with annotated peptides and abundance values.
| Column Name | Description |
| --- | --- |
| Master Protein Accessions | Maps peptide to protein; only first ID is kept. |
| Annotated Sequence | Amino acid sequence including PTM annotations. |
| Modifications | Post-translational modifications. |
| Positions in Master Proteins | Position of peptide in the protein sequence. |
| Abundance: <File ID> | Intensity/abundance for each sample. One column per sample. |
Proteins file: Excel table containing descriptive and accession information for proteins.
| Column Name | Description |
| --- | --- |
| Accession | Unique protein identifier, used to join with peptide file. |
| Description | Descriptive string, e.g., from UniProt. |
PSM file:
| Column Name | Description |
| --- | --- |
| ptmRS: Best Site Probabilities | Used to resolve phosphosite ambiguity. |
| Precursor Abundance | Abundance value used to filter invalid entries. |
| Master Protein Accessions | Matches protein IDs for mapping. |
| Annotated Sequence | Used to resolve conflicting PTM assignments. |
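As an illustration of how site probabilities can help resolve phosphosite ambiguity, the Python sketch below parses a "ptmRS: Best Site Probabilities" value and keeps the confidently localized sites. It assumes the field looks like "S3(Phospho): 99.5; T5(Phospho): 0.5" (residue, position within the peptide, modification, probability in percent); the function names and the 75% cut-off are hypothetical and not taken from PhosProTN.

```python
import re

# residue letter, position, modification name, probability (percent)
SITE_PATTERN = re.compile(r"([A-Z])(\d+)\(([^)]+)\):\s*([0-9.]+)")


def parse_site_probabilities(field):
    """Return (residue, position, modification, probability) tuples."""
    return [(aa, int(pos), mod, float(prob))
            for aa, pos, mod, prob in SITE_PATTERN.findall(field)]


def confident_sites(field, min_probability=75.0):
    """Keep only sites whose localization probability reaches the cut-off."""
    return [site for site in parse_site_probabilities(field)
            if site[3] >= min_probability]


sites = parse_site_probabilities("S3(Phospho): 99.5; T5(Phospho): 0.5")
localized = confident_sites("S3(Phospho): 99.5; T5(Phospho): 0.5")
```

Values that do not match the assumed pattern (for example an empty cell) simply yield an empty list.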
File required for MaxQuant
Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
| Column Name | Description |
| --- | --- |
| Condition | Experimental condition (e.g. Control, Treated). Used for group comparison. |
| Sample | Sample identifier. Must match sample names in the peptide file. |
| color | (Optional) Color associated with the condition. If not present, default colors are assigned. |
| batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled. |
- Evidence pipeline:
evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:
| Column Name | Description |
| --- | --- |
| Sequence | Amino acid sequence of the peptide. |
| Modifications | PTMs of the peptide. |
| Gene names | Gene symbol associated with the peptide. |
| Protein names | Protein description. If missing, will be merged from annotation file. |
| Leading razor protein | UniProt accession. Used for annotation enrichment. |
| Raw file | File/sample ID. Must match entries in the annotation file. |
| Intensity | Peptide intensity value. Used for quantification. |
| Leading proteins | Used for filtering out contaminants (e.g. "CON_"). |
3. Example case study - Phosphoproteomics from Steger et al. (2016)
Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases
PRIDE: PXD003071.
Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.
The case study can be downloaded from the Run tab.
-
4. Details on the output files
report.html
: complete report of the analysis with all figures and results of the enrichment.
db_results_proTN.RData
: RData object containing all the data produced during the execution, ready for additional analyses in R.
log_filter_read_function.txt
: results of the filters applied during the preprocessing step.
- input_protn folder
Input files provided
- rdata folder
enrichment.RData
: RData object containing enrichment results based on differentially expressed proteins
- Tables folder
normalised_abundances.xlsx
: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
- protein_per_sample: protein abundances per sample.
- peptide_per_sample: peptide abundances per sample.
- protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
- peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
differential_expression.xlsx
: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
- protein_DE: protein differential expression results.
- peptide_DE: peptide differential expression results.
Annotation columns:
- Accession: protein UniProt ID
- Description: protein description
- GeneName: Gene Symbol
- Peptide_Sequence: peptide sequence
- Peptide_Modifications: peptide modifications
- Peptide_Position: start and end position of the peptide within the protein sequence identified by the UniProt ID
- Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
Columns for each contrast:
- class: defined according to the fold change, p-value and abundance thresholds specified in the input
  - "+": up-regulated protein/peptide
  - "-": down-regulated protein/peptide
  - "=": invariant protein/peptide
- log2_FC: protein/peptide log2 transformed fold change
- p_val: protein/peptide contrast p-value
- p_adj: protein/peptide adjusted p-value (FDR after BH correction)
- log2_expr: protein/peptide log2 average abundance
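The p_adj column is described as the FDR after BH (Benjamini-Hochberg) correction. As a reminder of what that adjustment does, here is a minimal standard-library Python sketch of the BH procedure; it is an illustration, not the code ProTN runs (the pipeline relies on its R packages), and the function name `bh_adjust` is hypothetical.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


example_p = [0.01, 0.04, 0.03, 0.005]
example_adj = bh_adjust(example_p)
print(example_adj)
```

Each raw p-value is scaled by n/rank, and the running minimum guarantees that a smaller raw p-value never ends up with a larger adjusted value.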
enrichment.xlsx
: excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
- Figures folder
- PDF version of all selected figures
- enrichment_plot.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
protein_vulcano
directory: directory with all the volcano plots based on the differential proteins.
peptide_vulcano
directory: directory with all the volcano plots based on the differential peptides.
STRINGdb
directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
- *_connection.txt: text file with all edges of the network.
- *_network.pdf: PDF version of the STRINGdb network.
-
KinaseTree
directory: figures and files with the kinase activity trees generated with PhosR and CORAL. For each design there are two files:
-
TXT file
: text table with the identified activity of each kinase.
-
SVG file
: vector image of the kinase tree generated with CORAL.
-
PhosProTN with proteome background is an integrative pipeline for phosphoproteomic analysis of DDA experimental data obtained from MS. It performs a complete analysis of the raw files from Proteome Discoverer (PD) or MaxQuant (MQ), including their biological interpretation through enrichment and network analysis. PhosProTN analyses the phosphoproteomic data at the peptide level, using the proteome analysis of the same conditions as background.
-
1. Workflow of PhosProTN with proteome background
The ProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and taking complex designs into account. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis, detection of communities in protein-protein interaction networks, and kinome tree perturbation analysis.
01. Define analysis settings and load input data files
PhosProTN analyses the results of
- Proteome Discoverer
- MaxQuant
Additional details on the input can be found in section 2. Details on the input parameters and files
02. Normalization and imputation of raw intensities
Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.
At the protein level, this operation is executed by the function medianSweeping, that applies the same median normalization used for peptides, but also summarizes peptide intensities into protein relative abundances by the median sweeping method.
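The two normalization steps described above can be sketched conceptually in plain Python. This is only an illustration of the idea (median centring per sample, and median sweeping as peptide row-centring, per-protein median summarization, then median centring), not the DEqMS implementation; the function names, the assumption of a complete matrix without missing values, and the example numbers are all hypothetical.

```python
from statistics import median


def equal_median_normalize(matrix):
    """Subtract each sample's (column's) median so every sample has median 0."""
    n_samples = len(matrix[0])
    col_medians = [median(row[j] for row in matrix) for j in range(n_samples)]
    return [[value - col_medians[j] for j, value in enumerate(row)] for row in matrix]


def median_sweep(peptide_matrix, protein_ids):
    """Summarize log2 peptide intensities into relative protein abundances."""
    # 1) Centre each peptide (row) on its own median across samples.
    centred = []
    for row in peptide_matrix:
        row_median = median(row)
        centred.append([value - row_median for value in row])
    # 2) Per protein and sample, take the median of the centred peptides.
    grouped = {}
    for protein, row in zip(protein_ids, centred):
        grouped.setdefault(protein, []).append(row)
    proteins = sorted(grouped)
    summary = [[median(col) for col in zip(*grouped[p])] for p in proteins]
    # 3) Apply the same per-sample median normalization used for peptides.
    return proteins, equal_median_normalize(summary)


# Hypothetical log2 intensities: 3 peptides x 3 samples, 2 proteins.
peptides = [[10, 12, 11], [8, 9, 10], [5, 7, 6]]
proteins, abundances = median_sweep(peptides, ["P1", "P1", "P2"])
print(proteins, abundances)
```

The result is a relative abundance per protein and sample, centred so that each sample has median 0.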
Imputation
- PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round imputation is performed in the absence of replicates. ProTN uses two PhosR imputation strategies: imputation of the missing values of a peptide across replicates within a single condition, and a tail-based imputation approach as implemented in Perseus.
- Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
- missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
- pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
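Among the options listed above, Gaussian estimation is the simplest to illustrate. The following Python sketch draws missing values from a normal distribution built from the observed intensities; it is not the pipeline's code. The function name and the `shift` and `scale` parameters are assumptions: with the defaults the distribution uses the observed mean and standard deviation, while a positive `shift` and a `scale` below 1 give a down-shifted, narrower distribution of the kind often used for low-abundance missingness.

```python
import random
from statistics import mean, stdev


def gaussian_impute(values, shift=0.0, scale=1.0, seed=0):
    """Replace None entries with draws from N(mu - shift * sd, (scale * sd)^2).

    mu and sd are the mean and standard deviation of the observed values;
    at least two observed values are needed to estimate sd.
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu - shift * sd, scale * sd) if v is None else v for v in values]


# Hypothetical log2 intensities of one peptide within one condition.
raw = [20.0, None, 22.0, 21.0, None]
imputed = gaussian_impute(raw)
print(imputed)
```

Observed values are returned unchanged; only the missing positions receive sampled values.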
Summary of the imputation methods:
- PhosR imputation
  - Package: PhosR (Bioconductor)
  - Main idea: Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure).
  - Typical use: Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
- Gaussian estimation imputation
  - Main idea: Draws values from a Gaussian distribution defined by the low-intensity tail of observed data (shifted mean, reduced SD).
  - Typical use: When MNAR (Missing Not At Random) is likely due to detection limit.
- missForest
  - Package: missForest (R)
  - Main idea: Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features.
  - Typical use: General proteomics where missingness relates to multiple covariates, or when structure is complex.
- pcaMethods svdImpute
  - Package: pcaMethods (Bioconductor)
  - Main idea: Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data.
  - Typical use: Well-replicated datasets with high correlation between samples.
In this step, several figures can be generated to summarize preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances.
03. Differential analysis
Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. The DEqMS package (Zhu 2022) is used for proteins: DEqMS is developed on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).
-
Compile the comparison table. The table has 2 columns:
- Formule column (REQUIRED): the formulas need to follow the syntax of Limma (Ex: "cancer-normal").
- Name column (OPTIONAL): personalized name assigned to the comparison (Ex: "cancer_vs_normal").
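To make the meaning of a formula such as "cancer-normal" concrete: on log2 data, a two-group contrast is the difference between the condition means. The Python sketch below computes that difference for a single protein or peptide. It handles only the simple "A-B" form, assumes condition names contain no hyphen, and is not a replacement for Limma's contrast handling; the function name and example values are hypothetical.

```python
from statistics import mean


def contrast_log2fc(formula, log2_values, conditions):
    """Return mean(left group) - mean(right group) for a simple 'A-B' formula."""
    parts = [part.strip() for part in formula.split("-")]
    if len(parts) != 2 or not all(parts):
        raise ValueError("only simple two-group formulas like 'A-B' are supported")
    left, right = parts

    def group_mean(condition):
        values = [v for v, c in zip(log2_values, conditions) if c == condition]
        if not values:
            raise ValueError(f"no samples found for condition {condition!r}")
        return mean(values)

    return group_mean(left) - group_mean(right)


# Hypothetical log2 abundances of one protein in four samples.
conditions = ["cancer", "cancer", "normal", "normal"]
abundances = [12.0, 13.0, 10.0, 11.0]
fc = contrast_log2fc("cancer-normal", abundances, conditions)
print(fc)
```

A positive value means the feature is more abundant in the first condition of the formula, which is why the order of the two names matters.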
Limma and DEqMS calculate differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on several parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:
-
the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),
-
the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
It is Down-regulated if:
-
the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr),
-
the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
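The rules above translate into a small decision function. The Python sketch below covers only the fold change and p-value criteria listed in the two rule sets (the log2 expression threshold is left out); the function name and the default thresholds are placeholders, not ProTN defaults.

```python
def assign_class(log2_fc, p_value, fc_thr=1.0, p_thr=0.05):
    """Return '+', '-' or '=' following the up/down-regulation rules.

    p_value may be either the raw or the adjusted p-value, depending on
    which one the user chose to threshold on.
    """
    if p_value < p_thr and log2_fc > fc_thr:
        return "+"
    if p_value < p_thr and log2_fc < -fc_thr:
        return "-"
    return "="


# Hypothetical (log2 FC, p-value) pairs.
examples = [(2.3, 0.001), (-1.8, 0.01), (0.4, 0.001), (3.0, 0.2)]
classes = [assign_class(fc, p) for fc, p in examples]
print(classes)
```

The last two examples are classed as "=": one has a significant p-value but a small fold change, the other a large fold change without statistical support.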
Various figures are generated: first a bar plot that graphically represents the DEPs identified, followed by comparison-specific volcano plots.
04. Report creation and download of the results
The results are summarized in an HTML report. In addition, a large number of files are generated; a description of each file can be found in section 4. Details on the output files. All the files are grouped in a zip file for download.
ADDITIONAL STEPS:
B1. Batch Effect correction
If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, where an additional column describing the batches is required.
E1. Enrichment analysis of the Differentially Expressed Proteins
The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.
Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the union of the two (called all). For each term, ProTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.
ProTN creates an RData file of the complete enrichment data frame, allowing easy import in R for further analysis. ProTN also generates an Excel file containing only the significantly enriched terms, as defined by user settings.
To be significant, a term needs to have:
-
an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment),
-
an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.
K1. Kinase activity tree analysis of the Differentially Expressed Phosphosites
In phosphoproteomics it is extremely useful to study the activation status of kinases based on the differentially expressed substrates identified by the differential analysis. For each comparison, PhosProTN predicts the activation state of the kinases using PhosR (Kim et al. 2021). PhosR provides a kinase-substrate relationship score, and on that basis it prioritises potential kinases that could be responsible for the phosphorylation change of a phosphosite, using kinase recognition motifs and phosphoproteomic dynamics.
The activity score provided by PhosR is used to generate a graphical version of the human kinome tree using CORAL (Metz K.S. et al. 2018), a Shiny web app for visualizing both quantitative and qualitative data. It generates high-resolution scalable vector graphic files suitable for publication without the need for refinement in graphic editing software.
N1. Protein-Protein Interaction network analysis of Differentially Expressed Phosphosite
ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interactions between the DEPs using STRING (Szklarczyk et al. 2021).
The species-specific database is retrieved from the STRING server, all the interactions are identified, and an igraph (Csardi and Nepusz 2006) network is generated. The proteins are then clustered with an igraph function that identifies dense subgraphs by optimizing a modularity score.
Since networks can vary considerably in composition, two ggplot layouts are used: the Fruchterman-Reingold algorithm and the Kamada-Kawai algorithm.
-
2. Details on the input parameters and files
-
Title of Analysis
: title of the experiment. It will be the title of the web page report.
-
Brief Description
: description of the current experiment. It is the first paragraph of the report.
-
Software Analyzer
: determines which software was used to identify peptides and proteins.
- Proteome Discoverer
- MaxQuant using evidence.txt
- MaxQuant using peptides.txt and proteinGroups.txt
- Spectronaut
- FragPipe
-
File required for Proteome Discoverer
Annotation file
: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
- File ID: Identifier used in column headers of the peptide file.
- Condition: Experimental group name. Used for comparisons.
- Sample: Clean sample name used downstream.
- color: (Optional) Plot color. Defaults are applied if missing.
- batch: (Optional) Batch ID for batch effect correction.
Peptides file
: Excel table with annotated peptides and abundance values. Columns:
- Master Protein Accessions: Maps peptide to protein; only first ID is kept.
- Annotated Sequence: Amino acid sequence including PTM annotations.
- Modifications: Post-translational modifications.
- Positions in Master Proteins: Position of peptide in the protein sequence.
- "Abundance: <File ID>": Intensity/abundance for each sample. One column per sample.
Proteins file
: Excel table containing descriptive and accession information for proteins. Columns:
- Accession: Unique protein identifier, used to join with peptide file.
- Description: Descriptive string, e.g., from UniProt.
PSM file
: required only for the phospho dataset; it is not required for the proteome background. Columns:
- "ptmRS: Best Site Probabilities": Used to resolve phosphosite ambiguity.
- Precursor Abundance: Abundance value used to filter invalid entries.
- Master Protein Accessions: Matches protein IDs for mapping.
- Annotated Sequence: Used to resolve conflicting PTM assignments.
-
File required for MaxQuant
Annotation file
: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
- Condition: Experimental condition (e.g. Control, Treated). Used for group comparison.
- Sample: Sample identifier. Must match sample names in the peptide file.
- color: (Optional) Color associated with the condition. If not present, default colors are assigned.
- batch: (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
- Evidence pipeline:
  evidence.txt
  : This is a TSV/CSV file containing peptide-level quantification data. Required columns:
  - Sequence: Amino acid sequence of the peptide.
  - Modifications: PTMs of the peptide.
  - Gene names: Gene symbol associated with the peptide.
  - Protein names: Protein description. If missing, will be merged from annotation file.
  - Leading razor protein: UniProt accession. Used for annotation enrichment.
  - Raw file: File/sample ID. Must match entries in the annotation file.
  - Intensity: Peptide intensity value. Used for quantification.
  - Leading proteins: Used for filtering out contaminants (e.g. "CON_").
-
-
3. Example case study - Phosphoproteomics from Steger et al. (2016)
Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases
PRIDE: PXD003071.
Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.
The case study can be downloaded from the Run tab.
-
4. Details on the output files
report.html
: complete report of the analysis with all figures and results of the enrichment.
db_results_proTN.RData
: RData object containing all the data produced during the execution, ready for additional analyses in R.
log_filter_read_function.txt
: results of the filters applied during the preprocessing step.
- input_protn folder
Input files provided
- rdata folder
enrichment.RData
: RData object containing enrichment results based on differentially expressed proteins
- Tables folder
normalised_abundances.xlsx
: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
- protein_per_sample: protein abundances per sample.
- peptide_per_sample: peptide abundances per sample.
- protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
- peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
differential_expression.xlsx
: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
- protein_DE: protein differential expression results.
- peptide_DE: peptide differential expression results.
Annotation columns:
- Accession: protein UniProt ID
- Description: protein description
- GeneName: Gene Symbol
- Peptide_Sequence: peptide sequence
- Peptide_Modifications: peptide modifications
- Peptide_Position: start and end position of the peptide within the protein sequence identified by the UniProt ID
- Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
Columns for each contrast:
- class: defined according to the fold change, p-value and abundance thresholds specified in the input
  - "+": up-regulated protein/peptide
  - "-": down-regulated protein/peptide
  - "=": invariant protein/peptide
- log2_FC: protein/peptide log2 transformed fold change
- p_val: protein/peptide contrast p-value
- p_adj: protein/peptide adjusted p-value (FDR after BH correction)
- log2_expr: protein/peptide log2 average abundance
enrichment.xlsx
: excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
- Figures folder
- PDF version of all selected figures
- enrichment_plot.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
protein_vulcano
directory: directory with all the volcano plots based on the differential proteins.
peptide_vulcano
directory: directory with all the volcano plots based on the differential peptides.
STRINGdb
directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
- *_connection.txt: text file with all edges of the network.
- *_network.pdf: PDF version of the STRINGdb network.
-
KinaseTree
directory: figures and files with the kinase activity trees generated with PhosR and CORAL. For each design there are two files:
-
TXT file
: text table with the identified activity of each kinase.
-
SVG file
: vector image of the kinase tree generated with CORAL.
-
InteracTN is an integrated pipeline for the analysis of interactomics data derived from DDA mass spectrometry experiments. The pipeline also includes network reconstruction and visualization, helping users interpret protein–protein interactions in the context of biological pathways and complexes. Designed to be robust and user-friendly, InteracTN enables researchers to efficiently explore and interpret interactome data for a deeper understanding of cellular mechanisms and protein functions.
-
1. Workflow of InteracTN
The InteracTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and taking complex designs into account. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis and detection of communities in protein-protein interaction networks.
01. Define analysis settings and load input data files
InteracTN analyses the results of
- Proteome Discoverer
- MaxQuant
- Spectronaut
- FragPipe
Additional details on the input can be found in section 2. Details on the input parameters and files
02. Normalization and imputation of raw intensities
Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.
At the protein level, this operation is executed by the function medianSweeping, that applies the same median normalization used for peptides, but also summarizes peptide intensities into protein relative abundances by the median sweeping method.
Imputation
- PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round imputation is performed in the absence of replicates. ProTN uses two PhosR imputation strategies: imputation of the missing values of a peptide across replicates within a single condition, and a tail-based imputation approach as implemented in Perseus.
- Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
- missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
- pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
Summary of the imputation methods:
- PhosR imputation
  - Package: PhosR (Bioconductor)
  - Main idea: Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure).
  - Typical use: Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
- Gaussian estimation imputation
  - Main idea: Draws values from a Gaussian distribution defined by the low-intensity tail of observed data (shifted mean, reduced SD).
  - Typical use: When MNAR (Missing Not At Random) is likely due to detection limit.
- missForest
  - Package: missForest (R)
  - Main idea: Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features.
  - Typical use: General proteomics where missingness relates to multiple covariates, or when structure is complex.
- pcaMethods svdImpute
  - Package: pcaMethods (Bioconductor)
  - Main idea: Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data.
  - Typical use: Well-replicated datasets with high correlation between samples.
In this step, several figures can be generated to summarize preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances.
03. Differential analysis
Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. The DEqMS package (Zhu 2022) is used for proteins: DEqMS is developed on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).
-
Compile the comparison table. The table has 2 columns:
- Formule column (REQUIRED): the formulas need to follow the syntax of Limma (Ex: "cancer-normal").
- Name column (OPTIONAL): personalized name assigned to the comparison (Ex: "cancer_vs_normal").
Limma and DEqMS calculate differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on several parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In InteracTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a protein/peptide can be Up-regulated. It is Up-regulated if:
-
the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),
-
the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).
In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated and “=” if it is not significant.
Various figures are generated, first a bar plot that graphically represents the DEPs identified. Followed by comparison-specific volcano plots.
04. Report creation and download of the results
Results are summarized in an HTML report. In addition, InteracTN generates a large number of useful files: a description of each output file can be found in section 4. Details on the output files. All the files are grouped in a zip file for download.
ADDITIONAL STEPS:
B1. Batch Effect correction
If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, where an additional column describing the batches is required.
E1. Enrichment analysis of the Differentially Expressed Proteins
The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, InteracTN uses EnrichR (Jawaid 2022), a popular tool that searches a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.
Each comparison defined in the differential analysis stage results in 1 set of proteins: the Up-regulated (called Up). For each term, InteracTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.
InteracTN creates an RData file of the complete enrichment data frame, allowing easy import in R for further analysis. InteracTN also generates an Excel file containing only the significantly enriched terms, as defined by user settings.
To be significant, a term needs to have:
-
an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment),
-
an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
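The two criteria above amount to a simple filter over the enrichment table. The Python sketch below shows the idea on a list of dictionaries; it is not InteracTN's code, and the function name, the dictionary keys, the threshold values and the example terms are all hypothetical.

```python
def significant_terms(terms, p_thr=0.05, overlap_thr=4):
    """Keep terms with p-value below p_thr and overlap size above overlap_thr."""
    return [
        term
        for term in terms
        if term["p_value"] < p_thr and term["overlap"] > overlap_thr
    ]


# Hypothetical enrichment results for one comparison.
enrichment = [
    {"term": "term A", "p_value": 0.001, "overlap": 12},
    {"term": "term B", "p_value": 0.200, "overlap": 30},
    {"term": "term C", "p_value": 0.010, "overlap": 3},
    {"term": "term D", "p_value": 0.049, "overlap": 5},
]

selected = [t["term"] for t in significant_terms(enrichment)]
print(selected)
```

Term B fails the p-value criterion and term C fails the overlap criterion, so only terms A and D are kept. The overlap comparison is strict here, matching the "Overlap Size > thr" wording above.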
InteracTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.
N1. Protein-Protein Interaction network analysis of Differentially Expressed Proteins
InteracTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, InteracTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).
The species-specific database is retrieved from the STRING server, and all the interactions above a user-defined threshold are used to generate a network with
-
2. Details on the input parameters and files
-
Title of Analysis
: title of the experiment. It will be the title of the web page report.
-
Brief Description
: description of the current experiment. It is the first paragraph of the report.
-
Software Analyzer
: determines which software was used to identify peptides and proteins.
- Proteome Discoverer
- MaxQuant using evidence.txt
- MaxQuant using peptides.txt and proteinGroups.txt
- Spectronaut
- FragPipe
-
File required for Proteome Discoverer
Annotation file
: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
- File ID: Identifier used in column headers of the peptide file.
- Condition: Experimental group name. Used for comparisons.
- Sample: Clean sample name used downstream.
- color: (Optional) Plot color. Defaults are applied if missing.
- batch: (Optional) Batch ID for batch effect correction.
Peptides file
: Excel table with annotated peptides and abundance values. Columns:
- Master Protein Accessions: Maps peptide to protein; only first ID is kept.
- Annotated Sequence: Amino acid sequence including PTM annotations.
- Modifications: Post-translational modifications.
- Positions in Master Proteins: Position of peptide in the protein sequence.
- "Abundance: <File ID>": Intensity/abundance for each sample. One column per sample.
Proteins file
: Excel table containing descriptive and accession information for proteins. Columns:
- Accession: Unique protein identifier, used to join with peptide file.
- Description: Descriptive string, e.g., from UniProt.
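Because the File ID values of the annotation file must correspond to the abundance column headers of the peptides file, a quick consistency check before uploading can save a failed run. The Python sketch below assumes the header pattern is literally "Abundance: <File ID>" as written above; real exports may carry extra text in the header, so treat both the pattern and the function name `missing_abundance_columns` as assumptions.

```python
def missing_abundance_columns(file_ids, peptide_columns):
    """Return the File IDs that have no 'Abundance: <File ID>' column."""
    available = set(peptide_columns)
    return [fid for fid in file_ids if f"Abundance: {fid}" not in available]


# Hypothetical annotation File IDs and peptides-file headers.
annotation_file_ids = ["F1", "F2", "F3"]
peptide_headers = [
    "Master Protein Accessions",
    "Annotated Sequence",
    "Modifications",
    "Abundance: F1",
    "Abundance: F2",
]

missing = missing_abundance_columns(annotation_file_ids, peptide_headers)
print(missing)
```

An empty result means every annotated sample has a matching abundance column; here F3 is reported as missing.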
-
File required for MaxQuant
Annotation file
: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:Column Name Description Condition
Experimental condition (e.g. Control, Treated). Used for group comparison. Sample
Sample identifier. Must match sample names in the peptide file. color
(Optional) Color associated with the condition. If not present, default colors are assigned. batch
(Optional) Batch ID for batch effect correction. Required if batch correction is enabled. - Evidence pipeline:
evidence.txt
: This is a TSV/CSV file containing peptide-level quantification data. Required columns:Column Name Description Sequence
Amino acid sequence of the peptide. Modifications
PTMs of the peptide. Gene names
Gene symbol associated with the peptide. Protein names
Protein description. If missing, will be merged from annotation file. Leading razor protein
UniProt accession. Used for annotation enrichment. Raw file
File/sample ID. Must match entries in the annotation file. Intensity
Peptide intensity value. Used for quantification. Leading proteins
Used for filtering out contaminants (e.g. "CON_").
- Peptide and ProteinGroups pipeline:
  peptides.txt
  : Tab-delimited file with peptide-level quantification. Required columns:
  - Sequence: Amino acid sequence of the peptide.
  - Gene names: If missing, inferred from Leading razor protein.
  - Protein names: Protein description. If missing, will be merged from annotation file.
  - Leading razor protein: UniProt accession. Used for annotation enrichment.
  - Intensity <Sample>: Intensity values for each sample (e.g. Intensity Sample1).
  proteinGroups.txt
  : Tab-delimited file providing protein-level information. Required columns:
  - Majority protein IDs: Used to extract the Leading razor protein.
  - Fasta headers: Used for protein description.
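Two of the MaxQuant conventions described above can be sketched in a few lines: dropping contaminant rows of evidence.txt through the Leading proteins column, and recovering sample names from the Intensity <Sample> headers of peptides.txt. This is only an illustration under stated assumptions: the in-memory file excerpt, the function names and the exact matching rule (substring match on "CON_") are hypothetical and may differ from what ProTN does internally.

```python
import csv
import io

# Hypothetical excerpt of an evidence.txt file (tab-delimited).
EVIDENCE_TSV = (
    "Sequence\tRaw file\tIntensity\tLeading proteins\n"
    "AAAGEFADDPCSSVK\trun_01\t1200000\tP12345\n"
    "LLDEVFK\trun_01\t530000\tCON__P00761\n"
    "LLDEVFK\trun_02\t610000\tQ00001\n"
)


def read_evidence(handle, contaminant_tag="CON_"):
    """Read evidence rows and drop those flagged as contaminants."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: a contaminant is any row whose 'Leading proteins'
        # value contains the tag.
        if contaminant_tag in row["Leading proteins"]:
            continue
        kept.append(row)
    return kept


def sample_names(header, prefix="Intensity "):
    """Recover sample names from 'Intensity <Sample>' column headers."""
    # The bare 'Intensity' column (no sample suffix) is skipped because it
    # does not start with the prefix including the trailing space.
    return [col[len(prefix):] for col in header if col.startswith(prefix)]


evidence_rows = read_evidence(io.StringIO(EVIDENCE_TSV))
samples = sample_names(
    ["Sequence", "Intensity", "Intensity Sample1", "Intensity Sample2"]
)
```

The recovered sample names are the values that have to match the Sample column of the annotation file.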
-
File required for Spectronaut
Annotation file
: (Optional) If provided, this file should contain metadata for each sample. If not provided, the pipeline extracts sample annotations directly from the peptide file. Columns:
- Condition: Experimental group label. Used for comparison between conditions.
- Sample: Sample identifier. Must match entries in the peptide file.
- color: (Optional) Color for visualization. Default colors will be assigned if missing.
- batch: (Optional) Batch ID for batch correction. Required if batch correction is enabled.
Spectronaut report
: TSV/CSV file containing peptide-level data. Required columns:
- PG.ProteinAccessions: Protein group accessions.
- PEP.StrippedSequence: Peptide sequence without modifications.
- EG.ModifiedSequence: Peptide sequence with modifications.
- PEP.Quantity: Peptide quantification value.
- R.FileName: Sample identifier (the column used is defined by sample_col).
- R.Condition: Condition identifier (column used if the annotation file is not provided).
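Because the Spectronaut report is in long format (one row per peptide and run), it helps to see how a sample annotation and a peptide-by-sample table could be derived from it. The sketch below is a minimal illustration, assuming the rows are already parsed into dictionaries and that EG.ModifiedSequence is used as the peptide key; the function names and example values are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a Spectronaut long-format report.
report = [
    {"R.FileName": "run_01", "R.Condition": "Control",
     "EG.ModifiedSequence": "_AAAGEFADDPCSSVK_", "PEP.Quantity": 1000.0},
    {"R.FileName": "run_02", "R.Condition": "Treated",
     "EG.ModifiedSequence": "_AAAGEFADDPCSSVK_", "PEP.Quantity": 2500.0},
    {"R.FileName": "run_01", "R.Condition": "Control",
     "EG.ModifiedSequence": "_LLDEVFK_", "PEP.Quantity": 500.0},
]


def annotation_from_report(rows, sample_col="R.FileName"):
    """Build {sample: condition} from the report when no annotation file exists."""
    return {row[sample_col]: row["R.Condition"] for row in rows}


def to_wide(rows, sample_col="R.FileName"):
    """Pivot the long report into {peptide: {sample: quantity}}."""
    wide = defaultdict(dict)
    for row in rows:
        # Repeated rows for the same peptide and sample overwrite each other;
        # this assumes PEP.Quantity is identical across such rows.
        wide[row["EG.ModifiedSequence"]][row[sample_col]] = row["PEP.Quantity"]
    return dict(wide)


annotation = annotation_from_report(report)
wide = to_wide(report)
```

Peptides absent from a run simply have no entry for that sample, which corresponds to the missing values handled later by imputation.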
-
File required for FragPipe
Annotation file
: This file provides metadata for the samples analyzed. It must be an Excel file with the following columns:
- Condition: Experimental condition (e.g. Control, Treated). Used for group comparison.
- Sample: Sample identifier. Must match sample names in the peptide file.
- color: (Optional) Color associated with the condition. If not present, default colors are assigned.
- batch: (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
combined_modified_peptide.tsv
: The peptide quantification file must contain raw or normalized intensity values for each sample and peptide. Required columns:
- Protein ID: Protein accession or identifier.
- Protein Description: Descriptive name of the protein.
- Gene: Gene symbol.
- Peptide Sequence: Amino acid sequence of the peptide.
- Assigned Modifications: Post-translational modifications assigned to the peptide.
- Prev AA: Used to determine tryptic condition.
- Next AA: Used to determine tryptic condition.
- <Sample> Intensity: One column per sample, named like <Sample> Intensity.
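How Prev AA and Next AA can be used to determine the tryptic condition of a peptide is sketched below. This is a simplified illustration, not the rule implemented in ProTN: it assumes that trypsin cleaves after K or R, that "-" in Prev AA or Next AA marks a protein terminus, and it ignores the proline exception. The function names are hypothetical, and semi-tryptic peptides are reported with a single label because the exact N-semi/C-semi naming convention is not specified here.

```python
TRYPSIN_SITES = {"K", "R"}


def tryptic_termini(peptide, prev_aa, next_aa):
    """Return (n_term_ok, c_term_ok) for a peptide and its flanking residues."""
    # N-terminus: the residue before the peptide is K/R, or the peptide
    # starts the protein (assumed to be marked with '-').
    n_term_ok = prev_aa in TRYPSIN_SITES or prev_aa == "-"
    # C-terminus: the peptide itself ends with K/R, or it ends the protein.
    c_term_ok = peptide[-1] in TRYPSIN_SITES or next_aa == "-"
    return n_term_ok, c_term_ok


def tryptic_status(peptide, prev_aa, next_aa):
    """Classify a peptide as fully, semi or non tryptic."""
    n_ok, c_ok = tryptic_termini(peptide, prev_aa, next_aa)
    if n_ok and c_ok:
        return "fully tryptic"
    if n_ok or c_ok:
        return "semi tryptic"
    return "non tryptic"


status = tryptic_status("LLDEVFK", "R", "A")
```

The pair returned by `tryptic_termini` tells which end of a semi-tryptic peptide is consistent with trypsin cleavage, which is the information needed to separate the two semi-tryptic classes.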
-
-
3. Example case study
-
4. Details on the output files
report.html
: complete report of the analysis, with all figures and the results of the enrichment.
db_results_InteracTN.RData
: RData object containing all the data produced during the execution, ready for additional analyses in R.
log_filter_read_function.txt
: results of the filters applied during the preprocessing step.
- input_InteracTN folder
Input files provided
- rdata folder
enrichment.RData
: RData object containing enrichment results based on differentially expressed proteins
- Tables folder
normalised_abundances.xlsx
: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed and, if enabled, batch corrected. The file is organized in the following sheets:
- protein_per_sample: protein abundances per sample.
- peptide_per_sample: peptide abundances per sample.
- protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
- peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
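The relation between the per-sample and per-condition sheets can be illustrated with a short sketch that groups the abundances of one protein (or peptide) by condition and computes the average and standard deviation. It assumes the sample standard deviation (n - 1 denominator) and returns no deviation for conditions with a single sample; the function name and the values are hypothetical.

```python
from statistics import mean, stdev


def summarise_by_condition(abundances, condition_of):
    """Return {condition: (average, standard deviation)} for one feature.

    abundances:   {sample: log2 abundance}
    condition_of: {sample: condition}, as in the sample annotation
    """
    grouped = {}
    for sample, value in abundances.items():
        grouped.setdefault(condition_of[sample], []).append(value)
    summary = {}
    for condition, values in grouped.items():
        # The sample standard deviation needs at least two values.
        sd = stdev(values) if len(values) > 1 else None
        summary[condition] = (mean(values), sd)
    return summary


# Hypothetical log2 abundances of one protein across three samples.
abundances = {"S1": 10.0, "S2": 12.0, "S3": 14.0}
condition_of = {"S1": "Control", "S2": "Control", "S3": "Treated"}

summary = summarise_by_condition(abundances, condition_of)
```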
differential_expression.xlsx
: Excel file containing the results of the differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
- protein_DE: protein differential expression results.
- peptide_DE: peptide differential expression results.
- Annotation columns:
- Accession: protein UniprotID
- Description: protein description
- GeneName: Gene Symbol
- Peptide_Sequence: peptide sequence
- Peptide_Modifications: peptide modifications
- Peptide_Position: start and end position of the peptide within the sequence of the protein identified by Accession
- Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
- Columns for each contrast:
- class: defined according to the fold change, p-value and abundance thresholds specified in the input
- + up-regulated protein/peptide
- - down-regulated protein/peptide
- = invariant protein/peptide
- log2_FC: protein/peptide log2 transformed fold change
- p_val: protein/peptide contrast p-value
- p_adj: protein/peptide adjusted p-value (FDR after BH correction)
- log2_expr: protein/peptide log2 average abundance
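The per-contrast columns above can be related to each other with a small sketch: a Benjamini-Hochberg adjustment that turns p-values into p_adj, and a rule that assigns the class from fold change, p-value and abundance. The threshold values, the choice of testing the raw p-value rather than the adjusted one, and the function names are assumptions made for illustration only; in ProTN the thresholds come from the input settings.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify(log2_fc, p_val, log2_expr, fc_thr=0.75, p_thr=0.05, expr_thr=0.0):
    """Return '+', '-' or '=' for one protein/peptide in one contrast."""
    # Hypothetical thresholds; the real ones are user-defined inputs.
    significant = p_val < p_thr and log2_expr >= expr_thr
    if significant and log2_fc >= fc_thr:
        return "+"
    if significant and log2_fc <= -fc_thr:
        return "-"
    return "="


p_adj = bh_adjust([0.01, 0.04, 0.03, 0.5])
classes = [
    classify(1.2, 0.01, 8.0),   # large positive change, significant
    classify(-1.0, 0.03, 8.0),  # large negative change, significant
    classify(1.5, 0.20, 8.0),   # large change but not significant
]
```

Anything that fails a threshold falls into the invariant class, so "=" covers both small changes and changes that are not statistically supported.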
enrichment.xlsx
: Excel file containing a selection of enrichment results obtained from the differentially expressed proteins. Terms are selected according to the significance thresholds specified in the input (default: adj.P.Value < 0.05, Overlap Size >= 5).
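The default selection rule for enriched terms can be written as a simple filter. The sketch below assumes each term is a dictionary whose keys reuse the names quoted in the defaults (adj.P.Value and Overlap Size); the term names and values are made up for illustration.

```python
def select_terms(terms, max_adj_p=0.05, min_overlap=5):
    """Keep terms with adj.P.Value < max_adj_p and Overlap Size >= min_overlap."""
    return [
        term for term in terms
        if term["adj.P.Value"] < max_adj_p and term["Overlap Size"] >= min_overlap
    ]


# Hypothetical enrichment results.
terms = [
    {"Term": "term A", "adj.P.Value": 0.001, "Overlap Size": 12},
    {"Term": "term B", "adj.P.Value": 0.010, "Overlap Size": 3},
    {"Term": "term C", "adj.P.Value": 0.200, "Overlap Size": 20},
]

selected = select_terms(terms)
```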
- Figures folder
- PDF versions of all selected figures
- enrichment_plot.pdf: dot plot of the top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
protein_vulcano directory
: directory with all the volcano plots of the differential proteins.
peptide_vulcano directory
: directory with all the volcano plots of the differential peptides.
STRINGdb directory
: directory with figures from the network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
- *_connection.txt: text file listing all edges of the network.
- *_network.pdf: PDF version of the STRINGdb network.
Mission of RNA and Disease Data Science laboratory
Human diseases such as cancer are intrinsically entangled with complexity. The discovery of effective cures requires dealing with this complexity and greatly benefits from the development of high-resolution methods of investigation: genome-wide, multi-modal, single cell, spatially resolved. Understanding human diseases at single-cell resolution within their architectural context is a scientific challenge requiring dedicated computational analysis dealing with the volume and heterogeneity of the data. The research mission of the lab is to understand the RNA molecular mechanisms underlying dysregulation in human diseases, by combining high-throughput and high-resolution analyses, with a pan-disciplinary approach.