ProTN is an integrative pipeline that analyzes DDA proteomics data obtained from MS. It performs a complete analysis of the raw files produced by different software, together with their biological interpretation through enrichment and network analysis. ProTN executes a dual-level analysis, at the protein and peptide level.

  • 1. Workflow of ProTN


    The ProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and taking complex designs into account. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis and detection of communities in protein-protein interaction networks.

    01. Define analysis settings and load input data files

    ProTN analyses the results of the following software:

    • Proteome Discoverer
    • MaxQuant
    • Spectronaut
    • FragPipe

    Additional details on the input can be found in section 2 (Details on the input parameters and files).

    02. Normalization and imputation of raw intensities

    Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.

    At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides, but also summarizes peptide intensities into protein relative abundances by the median sweeping method.
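As a rough illustration of the two operations described above, the following Python sketch re-implements the idea with the standard library only. It is not the DEqMS code (DEqMS is an R package): the function names `equal_median_normalization` and `median_sweeping`, the toy intensities, and the absence of missing values are all assumptions made for this example.

```python
import math
from statistics import median

def equal_median_normalization(matrix):
    """Subtract each sample's (column's) median so every column has median 0.
    `matrix` is a list of rows (features) holding log2 intensities."""
    n_cols = len(matrix[0])
    col_medians = [median(row[j] for row in matrix) for j in range(n_cols)]
    return [[row[j] - col_medians[j] for j in range(n_cols)] for row in matrix]

def median_sweeping(peptides):
    """peptides: list of (protein_id, [log2 intensity per sample])."""
    # 1) express each peptide relative to its own median across samples
    ratios = []
    for protein, values in peptides:
        m = median(values)
        ratios.append((protein, [v - m for v in values]))
    # 2) summarise peptides into proteins: per-sample median of peptide ratios
    by_protein = {}
    for protein, values in ratios:
        by_protein.setdefault(protein, []).append(values)
    proteins = sorted(by_protein)
    summary = [[median(col) for col in zip(*by_protein[p])] for p in proteins]
    # 3) centre every sample at median 0
    return proteins, equal_median_normalization(summary)

# Toy raw intensities (invented): two peptides for P1, one for P2, three samples
raw = [("P1", [1024.0, 2048.0, 4096.0]),
       ("P1", [512.0, 1024.0, 4096.0]),
       ("P2", [256.0, 256.0, 512.0])]
log2_peptides = [(p, [math.log2(v) for v in values]) for p, values in raw]

peptide_norm = equal_median_normalization([values for _, values in log2_peptides])
proteins, protein_norm = median_sweeping(log2_peptides)
```

After both calls, every sample column in `peptide_norm` and `protein_norm` has a median of 0, which is the property the text describes.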

    Imputation

    • PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, the other applies a tail-based imputation approach as implemented in Perseus.
    • Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
    • missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
    • pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
    Method | Package | Main Idea | Typical Use
    PhosR imputation | PhosR (Bioconductor) | Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure). | Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
    Gaussian estimation imputation | - | Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD). | When MNAR (Missing Not At Random) is likely due to detection limit.
    missForest | missForest (R) | Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features. | General proteomics where missingness relates to multiple covariates, or when structure is complex.
    pcaMethods svdImpute | pcaMethods (Bioconductor) | Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data. | Well-replicated datasets with high correlation between samples.
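To make the "shifted mean, reduced SD" idea of the Gaussian estimation row concrete, here is a minimal Python sketch. It is not ProTN's implementation: the function name `gaussian_tail_impute` is hypothetical, and the `shift=1.8` and `scale=0.3` values are assumed, Perseus-style defaults rather than documented ProTN settings.

```python
import random
from statistics import mean, stdev

def gaussian_tail_impute(values, shift=1.8, scale=0.3, seed=0):
    """Replace None entries with draws from N(mean - shift*sd, (scale*sd)^2).

    mean and sd are computed on the observed values of one sample.
    shift and scale are assumptions for this sketch, not ProTN parameters.
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [v if v is not None else rng.gauss(mu - shift * sd, scale * sd)
            for v in values]

# Toy log2 intensities for one sample, with two missing values
sample = [20.1, 21.4, None, 19.8, 22.0, None, 20.7]
imputed = gaussian_tail_impute(sample)
```

Observed values are left untouched; only the missing entries are filled, with values drawn from the low-intensity side of the distribution, which matches the assumption that missingness is caused by the detection limit.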

    In this step, several figures can be generated to describe preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances below.

    03. Differential analysis

    Differential analysis is applied to both proteins and peptides, to identify significant differences. Two slightly different methodologies are applied. For proteins, the DEqMS package (Zhu 2022) is used. DEqMS is built on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides per protein, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).

    • Compile the comparison table: The table has 2 columns:

      • Formule column (REQUIRED): The formulas need to follow the syntax of Limma (Ex: "cancer-normal").
      • Name column (OPTIONAL): custom name assigned to the comparison (Ex: "cancer_vs_normal").
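For a plain two-group design, a formula such as "cancer-normal" corresponds to the difference between the mean log2 abundances of the two conditions. The Python sketch below illustrates only that simple case; the helper `contrast_log2fc` and the toy abundances are assumptions, and the real analysis is performed by Limma/DEqMS, which accept richer formulas and fit a full linear model.

```python
from statistics import mean

def contrast_log2fc(formula, log2_by_condition):
    """Evaluate a simple two-group contrast written as "A-B".

    Only the "A-B" form is handled in this sketch, so condition names
    must not contain a hyphen themselves.
    """
    left, right = (part.strip() for part in formula.split("-"))
    return mean(log2_by_condition[left]) - mean(log2_by_condition[right])

# Comparison table with the two columns described above
comparisons = [{"Formule": "cancer-normal", "Name": "cancer_vs_normal"}]

# Toy log2 abundances of one protein, grouped by condition (invented values)
abundances = {"cancer": [12.0, 12.5, 13.0], "normal": [10.0, 10.5, 11.0]}

for comparison in comparisons:
    # The Name column is optional: fall back to the formula itself
    label = comparison.get("Name") or comparison["Formule"]
    print(label, contrast_log2fc(comparison["Formule"], abundances))
```

A positive value means the protein is more abundant in the first condition of the formula.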

    Limma and DEqMS identify differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on four parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a significant protein/peptide is either Up-regulated or Down-regulated. It is Up-regulated if:

    • the log2 FC is higher than the Fold Change threshold (log2 FC > Log2 FC thr), and

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    It is Down-regulated if:

    • the log2 FC is lower than the negative Fold Change threshold (log2 FC < -Log2 FC thr), and

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
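The classification rules above can be sketched in a few lines of Python. The function name `classify` and the threshold values are placeholders chosen for the example, not ProTN defaults, and the optional threshold on the log2 expression is left out for brevity.

```python
def classify(log2_fc, p_value, fc_thr=0.75, p_thr=0.05):
    """Return '+', '-' or '=' for one protein/peptide in one comparison.

    p_value can be either the raw or the adjusted p-value, depending on
    which one the user chose to threshold. fc_thr and p_thr are
    placeholder values for this sketch.
    """
    if p_value < p_thr:
        if log2_fc > fc_thr:
            return "+"   # up-regulated
        if log2_fc < -fc_thr:
            return "-"   # down-regulated
    return "="           # not significant

# (identifier, log2 FC, p-value) triples with invented values
results = [("P1", 1.9, 0.001), ("P2", -1.2, 0.02),
           ("P3", 2.5, 0.30), ("P4", 0.1, 0.001)]
classes = {pid: classify(fc, p) for pid, fc, p in results}
```

Here P3 has a large fold change but fails the p-value threshold, and P4 passes the p-value threshold but has a negligible fold change, so both are reported as “=”.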

    Various figures are generated: first a bar plot that summarizes the DEPs identified, followed by comparison-specific volcano plots.

    04. Report creation and download of the results

    Results are summarized in an HTML web-page report. In addition, ProTN generates a large number of useful files: a description of each output file can be found in section 4 (Details on the output files). All the files are grouped in a zip file and downloaded.

    ADDITIONAL STEPS:

    B1. Batch Effect correction

    If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, in an additional column that assigns each sample to its batch. This column is required when batch correction is enabled.
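The text does not state which correction method ProTN requests from proBatch, so the following Python sketch only illustrates one simple strategy, per-feature batch median centering, under that stated assumption. The function name `center_batches` and the toy values are hypothetical.

```python
from statistics import median

def center_batches(values, batches):
    """Per-feature batch median centering.

    values:  log2 abundances of ONE protein/peptide, one value per sample
    batches: batch label of each sample (the 'batch' annotation column)
    Each batch is shifted so that its median equals the overall median.
    """
    overall = median(values)
    batch_median = {
        b: median(v for v, label in zip(values, batches) if label == b)
        for b in set(batches)
    }
    return [v - batch_median[b] + overall for v, b in zip(values, batches)]

# Toy example: batch B is systematically ~2 log2 units higher than batch A
log2_abundance = [10.0, 10.2, 10.4, 12.0, 12.2, 12.4]
batch = ["A", "A", "A", "B", "B", "B"]
corrected = center_batches(log2_abundance, batch)
```

Centering of this kind is only safe when conditions are balanced across batches; if a condition is confined to a single batch, the biological difference is removed together with the batch effect.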

    E1. Enrichment analysis of the Differentially Expressed Proteins

    The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches on a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.

    Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the union of the two (called all). For each term, ProTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.

    ProTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. ProTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.

    To be considered significant, a term needs to have:

    • an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment),

    • an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
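The two conditions above amount to a simple filter on the enrichment table. The Python sketch below assumes that each term carries an `Overlap` string in the "k/n" form and the p-value fields named as in the dictionaries shown; these field names, the function `significant_terms` and the threshold values are illustrative assumptions, not a description of ProTN's internal code.

```python
def significant_terms(terms, p_thr=0.05, overlap_thr=5, use_adjusted=True):
    """Return the names of the terms passing both thresholds."""
    kept = []
    for term in terms:
        p = term["Adjusted.P.value"] if use_adjusted else term["P.value"]
        # "8/120" means 8 input proteins out of the 120 annotated to the term
        overlap_size = int(term["Overlap"].split("/")[0])
        if p < p_thr and overlap_size > overlap_thr:
            kept.append(term["Term"])
    return kept

# Invented enrichment results for one protein set
terms = [
    {"Term": "term A", "Overlap": "8/120", "P.value": 0.0004, "Adjusted.P.value": 0.01},
    {"Term": "term B", "Overlap": "3/80", "P.value": 0.0010, "Adjusted.P.value": 0.02},
    {"Term": "term C", "Overlap": "12/300", "P.value": 0.0400, "Adjusted.P.value": 0.20},
]
by_fdr = significant_terms(terms)
by_raw_p = significant_terms(terms, use_adjusted=False)
```

The strict `>` on the overlap size follows the rule as written above; replace it with `>=` if the threshold is meant to be inclusive.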

    ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.

    N1. Protein-Protein Interaction network analysis of Differentially Expressed Proteins

    ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).

    The species-specific database is retrieved from the STRING server, and all the interactions above a user-defined threshold are used to generate a network with

  • 2. Details on the input parameters and files


    • Title of Analysis: title of the experiment. It will be the title of the web page report.

    • Brief Description: description of the current experiment. It is the first paragraph of the report.

    • Software Analyzer: the software that was used to identify peptides and proteins.
      • Proteome Discoverer
      • MaxQuant using evidence.txt
      • MaxQuant using peptides.txt and proteinGroups.txt
      • Spectronaut
      • FragPipe
    • Files required for Proteome Discoverer

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following columns (optional ones are marked):
        Column Name | Description
        File ID | Identifier used in column headers of the peptide file.
        Condition | Experimental group name. Used for comparisons.
        Sample | Clean sample name used downstream.
        color | (Optional) Plot color. Defaults are applied if missing.
        batch | (Optional) Batch ID for batch effect correction.
      • Peptides file: Excel table with annotated peptides and abundance values.
        Column Name | Description
        Master Protein Accessions | Maps peptide to protein; only first ID is kept.
        Annotated Sequence | Amino acid sequence including PTM annotations.
        Modifications | Post-translational modifications.
        Positions in Master Proteins | Position of peptide in the protein sequence.
        Abundance: <File ID> | Intensity/abundance for each sample. One column per sample.
      • Proteins file: Excel table containing descriptive and accession information for proteins.
        Column Name | Description
        Accession | Unique protein identifier, used to join with peptide file.
        Description | Descriptive string, e.g., from UniProt.
    • Files required for MaxQuant

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following columns (optional ones are marked):
        Column Name | Description
        Condition | Experimental condition (e.g. Control, Treated). Used for group comparison.
        Sample | Sample identifier. Must match sample names in the peptide file.
        color | (Optional) Color associated with the condition. If not present, default colors are assigned.
        batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
      • Evidence pipeline:
        • evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:
          Column Name | Description
          Sequence | Amino acid sequence of the peptide.
          Modifications | PTMs of the peptide.
          Gene names | Gene symbol associated with the peptide.
          Protein names | Protein description. If missing, will be merged from annotation file.
          Leading razor protein | UniProt accession. Used for annotation enrichment.
          Raw file | File/sample ID. Must match entries in the annotation file.
          Intensity | Peptide intensity value. Used for quantification.
          Leading proteins | Used for filtering out contaminants (e.g. "CON_").
      • Peptide and ProteinGroups pipeline:
        • peptides.txt: Tab-delimited file with peptide-level quantification. Required columns:
          Column Name | Description
          Sequence | Amino acid sequence of the peptide.
          Gene names | If missing, inferred from Leading razor protein.
          Protein names | Protein description. If missing, will be merged from annotation file.
          Leading razor protein | UniProt accession. Used for annotation enrichment.
          Intensity <Sample> | Intensity values for each sample (e.g. Intensity Sample1).
        • proteinGroups.txt: Tab-delimited file providing protein-level information. Required columns:
          Column Name | Description
          Majority protein IDs | Used to extract the Leading razor protein.
          Fasta headers | Used for protein description.
    • Files required for Spectronaut

      • Annotation file: (Optional) If provided, this file should contain metadata for each sample. If not provided, the pipeline will extract sample annotations directly from the peptide file.
        Column Name | Description
        Condition | Experimental group label. Used for comparison between conditions.
        Sample | Sample identifier. Must match entries in the peptide file.
        color | (Optional) Color for visualization. Default colors will be assigned if missing.
        batch | (Optional) Batch ID for batch correction. Required if batch correction is enabled.
      • Spectronaut report: This is a TSV/CSV file containing peptide-level data. Required columns:
        Column Name | Description
        PG.ProteinAccessions | Protein group accessions.
        PEP.StrippedSequence | Peptide sequence without modifications.
        EG.ModifiedSequence | Peptide sequence with modifications.
        PEP.Quantity | Peptide quantification value.
        R.FileName | Sample identifier (column used is defined by sample_col).
        R.Condition | Condition identifier (column used if annotation file not provided).
    • Files required for FragPipe

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following columns (optional ones are marked):
        Column Name | Description
        Condition | Experimental condition (e.g. Control, Treated). Used for group comparison.
        Sample | Sample identifier. Must match sample names in the peptide file.
        color | (Optional) Color associated with the condition. If not present, default colors are assigned.
        batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
      • combined_modified_peptide.tsv: The peptide quantification file must contain raw or normalized intensity values for each sample and peptide. Required columns:
        Column Name | Description
        Protein ID | Protein accession or identifier.
        Protein Description | Descriptive name of the protein.
        Gene | Gene symbol.
        Peptide Sequence | Amino acid sequence of the peptide.
        Assigned Modifications | Modifications assigned to the peptide sequence.
        Prev AA | Used to determine tryptic condition.
        Next AA | Used to determine tryptic condition.
        <Sample> Intensity | One column per sample named like <Sample> Intensity.
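Several of the requirements listed in this section (mandatory Condition and Sample columns, a batch column when batch correction is enabled, sample names matching the peptide file) can be checked before launching an analysis. The Python sketch below is a hypothetical pre-flight check written for illustration; it is not part of ProTN, and the function name `check_annotation` and the toy rows are assumptions.

```python
def check_annotation(rows, quant_samples, batch_correction=False):
    """Check an annotation table against the rules listed above.

    rows:          list of dicts, one per sample, keyed by column name
    quant_samples: sample names found in the peptide quantification file
    Returns a list of human-readable problems (empty when all checks pass).
    """
    problems = []
    required = ["Condition", "Sample"]
    if batch_correction:
        required.append("batch")  # only mandatory when correction is enabled
    for column in required:
        if any(not row.get(column) for row in rows):
            problems.append(f"missing values in required column '{column}'")
    annotated = {row["Sample"] for row in rows if row.get("Sample")}
    unknown = annotated - set(quant_samples)
    if unknown:
        problems.append(f"samples not found in peptide file: {sorted(unknown)}")
    return problems

# Toy annotation: S2 is annotated but absent from the peptide file
annotation = [{"Condition": "Control", "Sample": "S1"},
              {"Condition": "Treated", "Sample": "S2"}]
issues = check_annotation(annotation, quant_samples=["S1", "S3"])
issues_with_batch = check_annotation(annotation, ["S1", "S2"], batch_correction=True)
```

For Proteome Discoverer input, the same idea would apply to the File ID column, which has to match the `Abundance: <File ID>` headers of the peptide file.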
  • 3. Example case study - Proteomics from Steger et al. (2016)


    Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases

    PRIDE: PXD003071.

    Reference: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.

    The case study can be downloaded from the Run tab.

  • 4. Details on the output files


    • report.html: complete report of the analysis with all the figures and the results of the enrichment.
    • db_results_proTN.RData: RData object containing all the data produced during the execution, ready for additional analyses in R.
    • log_filter_read_function.txt: log of the filters applied during the preprocessing step.
    • input_protn folder
      • Input files provided
    • rdata folder
      • enrichment.RData: RData object containing enrichment results based on differentially expressed proteins
    • Tables folder
      • normalised_abundances.xlsx: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed and, if enabled, batch corrected. The file is organized in the following sheets:
        • protein_per_sample: protein abundances per sample.
        • peptide_per_sample: peptide abundances per sample.
        • protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
        • peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
      • differential_expression.xlsx: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
        • protein_DE: protein differential expression results.
        • peptide_DE: peptide differential expression results.
        Annotation columns:
        • Accession: protein UniprotID
        • Description: protein description
        • GeneName: Gene Symbol
        • Peptide_Sequence: peptide sequence
        • Peptide_Modifications: peptide modifications
        • Peptide_Position: start and end position of the peptide within the sequence of the protein identified by the UniprotID
        • Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
        Columns for each contrast:
        • class: defined according to the fold change, p-value and abundance thresholds specified in the input
          • + up-regulated protein/peptide
          • - down-regulated protein/peptide
          • = invariant protein/peptide
        • log2_FC: protein/peptide log2 transformed fold change
        • p_val: protein/peptide contrast p-value
        • p_adj: protein/peptide adjusted p-value (FDR after BH correction)
        • log2_expr: protein/peptide log2 average abundance
      • enrichment.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
    • Figures folder
      • PDF version of all the selected figures
      • enrichment_plot.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
      • protein_vulcano directory: directory with all the volcano plots based on the differential proteins.
      • peptide_vulcano directory: directory with all the volcano plots based on the differential peptides.
      • STRINGdb directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
        • *_connection.txt: text file with all the edges of the network.
        • *_network.pdf: PDF version of the STRINGdb network.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” Complex Systems: 1695. https://igraph.org.

Cuklina, Jelena, Chloe H. Lee, Evan G. Williams, Ben Collins, Tatjana Sajic, Patrick Pedrioli, Maria Rodriguez-Martinez, and Ruedi Aebersold. 2018. “Computational Challenges in Biomarker Discovery from High-Throughput Proteomic Data.” https://doi.org/10.3929/ethz-b-000307772.

Jawaid, Wajid. 2022. “enrichR: Provides an R Interface to ’Enrichr’.” https://CRAN.R-project.org/package=enrichR.

Kim, Hani Jieun, Taiyun Kim, Nolan J Hoffman, Di Xiao, David E James, Sean J Humphrey, and Pengyi Yang. 2021. “PhosR Enables Processing and Functional Analysis of Phosphoproteomic Data.” Cell Reports 34. https://doi.org/10.1016/j.celrep.2021.108771.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007.

Szklarczyk, Damian, Annika L Gable, Katerina C Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T Doncheva, et al. 2021. “The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/Measurement Sets.” Nucleic Acids Research 49.

Zhu, Yafeng. 2022. “DEqMS: A Tool to Perform Statistical Analysis of Differential Protein Expression for Quantitative Proteomics Data.”


PhosProTN is an integrative pipeline for phosphoproteomic analysis of DDA experimental data obtained from MS. It performs a complete analysis of the raw files from Proteome Discoverer (PD) or MaxQuant (MQ), together with their biological interpretation through enrichment and network analysis. PhosProTN analyses the phosphoproteomic data at the peptide level.

  • 1. Workflow of PhosProTN


    The PhosProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and taking complex designs into account. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis, detection of communities in protein-protein interaction networks, and kinome tree perturbation analysis.

    01. Define analysis settings and load input data files

    PhosProTN analyses the results of the following software:

    • Proteome Discoverer
    • MaxQuant

    Additional details on the input can be found in section 2 (Details on the input parameters and files).

    02. Normalization and imputation of raw intensities

    Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.

    At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides, but also summarizes peptide intensities into protein relative abundances by the median sweeping method.

    Imputation

    • PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, the other applies a tail-based imputation approach as implemented in Perseus.
    • Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
    • missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
    • pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
    Method | Package | Main Idea | Typical Use
    PhosR imputation | PhosR (Bioconductor) | Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure). | Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
    Gaussian estimation imputation | - | Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD). | When MNAR (Missing Not At Random) is likely due to detection limit.
    missForest | missForest (R) | Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features. | General proteomics where missingness relates to multiple covariates, or when structure is complex.
    pcaMethods svdImpute | pcaMethods (Bioconductor) | Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data. | Well-replicated datasets with high correlation between samples.

    In this step, several figures can be generated to describe preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances below.

    03. Differential analysis

    Differential analysis is applied to both proteins and peptides, to identify significant differences. Two slightly different methodologies are applied. For proteins, the DEqMS package (Zhu 2022) is used. DEqMS is built on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides per protein, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).

    • Compile the comparison table: The table has 2 columns:

      • Formule column (REQUIRED): The formulas need to follow the syntax of Limma (Ex: "cancer-normal").
      • Name column (OPTIONAL): custom name assigned to the comparison (Ex: "cancer_vs_normal").

    Limma and DEqMS identify differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on four parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a significant protein/peptide is either Up-regulated or Down-regulated. It is Up-regulated if:

    • the log2 FC is higher than the Fold Change threshold (log2 FC > Log2 FC thr), and

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    It is Down-regulated if:

    • the log2 FC is lower than the negative Fold Change threshold (log2 FC < -Log2 FC thr), and

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.

    Various figures are generated: first a bar plot that summarizes the DEPs identified, followed by comparison-specific volcano plots.

    04. Report creation and download of the results

    The results are summarized in an HTML web-page report. In addition, the experiment is described by a large number of files: a description of each generated file can be found in section 3 (Details on the output files). All the files are grouped in a zip file and downloaded.

    ADDITIONAL STEPS:

    B1. Batch Effect correction

    If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, in an additional column that assigns each sample to its batch. This column is required when batch correction is enabled.

    E1. Enrichment analysis of the Differentially Expressed Proteins

    The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches on a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.

    Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the union of the two (called all). For each term, ProTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.

    ProTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. ProTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.

    To be considered significant, a term needs to have:

    • an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment),

    • an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).

    ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.

    K1. Kinase activity tree analysis of the Differentially Expressed Phosphosites

    In phosphoproteomics, it is extremely useful to study the activation status of kinases based on the differentially expressed substrates identified by the differential analysis. For each comparison, PhosProTN predicts the activation state of the kinases using PhosR (Kim et al. 2021). PhosR provides a kinase-substrate relationship score and uses it to prioritise the kinases that could be responsible for the phosphorylation change of each phosphosite, on the basis of kinase recognition motifs and phosphoproteomic dynamics.

    The activity score provided by PhosR is used to generate a graphical version of the human kinome tree using CORAL (Metz K.S. et al. 2018), a Shiny web app for visualizing both quantitative and qualitative data. CORAL generates high-resolution scalable vector graphics files suitable for publication without the need for refinement in graphic editing software.

    N1. Protein-Protein Interaction network analysis of Differentially Expressed Phosphosites

    ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interactions between the DEPs using STRING (Szklarczyk et al. 2021).

    The species-specific database is retrieved from the STRING server, all the interactions among the DEPs are identified, and an igraph (Csardi and Nepusz 2006) network is generated. The proteins are then clustered with an igraph function that identifies dense subgraphs by optimizing a modularity score.

    Since the composition of the network can vary considerably, two ggplot layouts are used: the Fruchterman-Reingold algorithm and the Kamada-Kawai algorithm.
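The network construction can be pictured with a small Python sketch. It is a simplified stand-in, not the igraph code used by the pipeline: edges are kept when both proteins are differential and the interaction score passes an assumed cut-off (`min_score`, on STRING's 0-1000 combined-score scale), and proteins are then grouped by connected components instead of by modularity optimization. All identifiers and scores below are invented.

```python
def build_network(edges, proteins, min_score=700):
    """Keep edges between differential proteins whose score passes min_score."""
    keep = set(proteins)
    adjacency = {p: set() for p in proteins}
    for a, b, score in edges:
        if a in keep and b in keep and score >= min_score:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Group nodes that can reach one another (iterative depth-first search)."""
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.append(node)
            stack.extend(adjacency[node] - seen)
        groups.append(sorted(group))
    return groups

# Invented differential proteins and interaction scores
deps = ["P1", "P2", "P3", "P4", "P5"]
edges = [("P1", "P2", 900), ("P1", "P3", 850), ("P4", "P5", 720),
         ("P3", "P4", 300),   # dropped: score below the cut-off
         ("P5", "P6", 950)]   # dropped: P6 is not a differential protein
network = build_network(edges, deps)
groups = connected_components(network)
```

Connected components are coarser than modularity-based communities: a single weak bridge between two dense groups merges them, which is exactly the situation modularity optimization is designed to split.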

  • 2. Details on the input parameters and files


    • Title of Analysis: title of the experiment. It will be the title of the web page report.

    • Brief Description: description of the current experiment. It is the first paragraph of the report.

    • Software Analyzer: the software that was used to identify peptides and proteins.
      • Proteome Discoverer
      • MaxQuant using evidence.txt
      • MaxQuant using peptides.txt and proteinGroups.txt
      • Spectronaut
      • FragPipe
    • File required for Proteome Discoverer

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        File ID | Identifier used in column headers of the peptide file.
        Condition | Experimental group name. Used for comparisons.
        Sample | Clean sample name used downstream.
        color | (Optional) Plot color. Defaults are applied if missing.
        batch | (Optional) Batch ID for batch effect correction.
      • Peptides file: Excel table with annotated peptides and abundance values.
        Column Name | Description
        Master Protein Accessions | Maps peptide to protein; only first ID is kept.
        Annotated Sequence | Amino acid sequence including PTM annotations.
        Modifications | Post-translational modifications.
        Positions in Master Proteins | Position of peptide in the protein sequence.
        Abundance: <File ID> | Intensity/abundance for each sample. One column per sample.
      • Proteins file: Excel table containing descriptive and accession information for proteins.
        Column Name | Description
        Accession | Unique protein identifier, used to join with peptide file.
        Description | Descriptive string, e.g., from UniProt.
      • PSM file:
        Column Name | Description
        ptmRS: Best Site Probabilities | Used to resolve phosphosite ambiguity.
        Precursor Abundance | Abundance value used to filter invalid entries.
        Master Protein Accessions | Matches protein IDs for mapping.
        Annotated Sequence | Used to resolve conflicting PTM assignments.
    • Files required for MaxQuant

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        Condition | Experimental condition (e.g. Control, Treated). Used for group comparison.
        Sample | Sample identifier. Must match sample names in the peptide file.
        color | (Optional) Color associated with the condition. If not present, default colors are assigned.
        batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
      • Evidence pipeline:
        • evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:
          Column Name | Description
          Sequence | Amino acid sequence of the peptide.
          Modifications | PTMs of the peptide.
          Gene names | Gene symbol associated with the peptide.
          Protein names | Protein description. If missing, will be merged from annotation file.
          Leading razor protein | UniProt accession. Used for annotation enrichment.
          Raw file | File/sample ID. Must match entries in the annotation file.
          Intensity | Peptide intensity value. Used for quantification.
          Leading proteins | Used for filtering out contaminants (e.g. "CON_").
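As an illustration of how the evidence.txt columns listed above fit together, here is a minimal sketch of reading such a table, dropping contaminants and keeping only samples described in the annotation file. It is not ProTN's reader: the function name, the rule of discarding a row when any leading protein carries the contaminant prefix, and the example rows are all assumptions.

```python
import csv
import io


def read_evidence(handle, annotated_samples, contaminant_prefix="CON_"):
    """Return the usable rows of a tab-separated evidence table."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        leading = row["Leading proteins"].split(";")
        if any(acc.startswith(contaminant_prefix) for acc in leading):
            continue  # contaminant entry
        if row["Raw file"] not in annotated_samples:
            continue  # sample not described in the annotation file
        if not row["Intensity"]:
            continue  # no quantification available
        kept.append({
            "Sequence": row["Sequence"],
            "Raw file": row["Raw file"],
            "Intensity": float(row["Intensity"]),
        })
    return kept


# Hypothetical four-row evidence table
example = io.StringIO(
    "Sequence\tLeading proteins\tRaw file\tIntensity\n"
    "PEPTIDEK\tP12345\trun1\t1000\n"
    "AAAK\tCON__P00761\trun1\t500\n"
    "LLLR\tQ99999\trun9\t700\n"
    "GGGK\tP67890\trun2\t\n"
)
rows = read_evidence(example, {"run1", "run2"})
```

Only the first row survives here: the second is a contaminant, the third comes from a raw file missing from the annotation, and the fourth has no intensity.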
  • 3. Example case study - Phosphoproteomics from Steger et al. (2016)


    Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases

    PRIDE: PXD003071.

    Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.

    The case study can be downloaded from the Run tab.

  • 4. Details on the output files


    • report.html: complete report of the analysis with all figures and the results of the enrichment.
    • db_results_proTN.RData: RData object containing all the data produced during the execution, ready for additional analyses in R.
    • log_filter_read_function.txt: results of the filters applied during the preprocessing step.
    • input_protn folder
      • Input files provided
    • rdata folder
      • enrichment.RData: RData object containing enrichment results based on differentially expressed proteins
    • Tables folder
      • normalised_abundances.xlsx: Excel file containing the abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
        • protein_per_sample: protein abundances per sample.
        • peptide_per_sample: peptide abundances per sample.
        • protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
        • peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
      • differential_expression.xlsx: Excel file containing the results of the differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
        • protein_DE: protein differential expression results.
        • peptide_DE: peptide differential expression results. Annotation columns:
          • Accession: protein UniProt ID
          • Description: protein description
          • GeneName: gene symbol
          • Peptide_Sequence: peptide sequence
          • Peptide_Modifications: peptide modifications
          • Peptide_Position: start and end position of the peptide within the sequence of the protein identified by the UniProt ID
          • Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
        • Columns for each contrast:
          • class: defined according to the fold change, p-value and abundance thresholds specified in the input
            • +: up-regulated protein/peptide
            • -: down-regulated protein/peptide
            • =: invariant protein/peptide
          • log2_FC: protein/peptide log2 transformed fold change
          • p_val: protein/peptide contrast p-value
          • p_adj: protein/peptide adjusted p-value (FDR after BH correction)
          • log2_expr: protein/peptide log2 average abundance
      • enrichment.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to the significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
    • Figures folder
      • PDF versions of all selected figures
      • enrichment_plot.pdf: dot plot of the top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for the keywords defined in the advanced options.
      • protein_vulcano directory: directory with all the volcano plots based on the differential proteins.
      • peptide_vulcano directory: directory with all the volcano plots based on the differential peptides.
      • STRINGdb directory: directory with figures from the network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
        • *_connection.txt: text file with all the edges of the network.
        • *_network.pdf: PDF version of the STRINGdb network.
      • KinaseTree directory: figures and files with the kinase activity trees generated with PhosR and CORAL. For each design there are two files:
        • TXT file: text table with the identified activity of each kinase.
        • SVG file: vector image of the kinase tree generated with CORAL.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Cuklina, Jelena, Chloe H. Lee, Evan G. Williams, Ben Collins, Tatjana Sajic, Patrick Pedrioli, Maria Rodriguez-Martinez, and Ruedi Aebersold. 2018. “Computational Challenges in Biomarker Discovery from High-Throughput Proteomic Data.” https://doi.org/10.3929/ethz-b-000307772.

Jawaid, Wajid. 2022. “enrichR: Provides an R Interface to ’Enrichr’.” https://CRAN.R-project.org/package=enrichR.

Kim, Hani Jieun, Taiyun Kim, Nolan J Hoffman, Di Xiao, David E James, Sean J Humphrey, and Pengyi Yang. 2021. “PhosR Enables Processing and Functional Analysis of Phosphoproteomic Data.” Cell Reports 34. https://doi.org/10.1016/j.celrep.2021.108771.

Metz, Kathleen S., Erika M. Deoudes, Matthew E. Berginski, Ivan Jimenez-Ruiz, Bulent Arman Aksoy, Jeff Hammerbacher, Shawn M. Gomez, and Douglas H. Phanstiel. 2018. “Coral: Clear and Customizable Visualization of Human Kinome Data.” Cell Systems 7 (3): 347-350. https://doi.org/10.1016/j.cels.2018.07.001.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007.

Szklarczyk, Damian, Annika L Gable, Katerina C Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T Doncheva, et al. 2021. “The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/Measurement Sets.” Nucleic Acids Research 49.

Zhu, Yafeng. 2022. “DEqMS: A Tool to Perform Statistical Analysis of Differential Protein Expression for Quantitative Proteomics Data.”


PhosProTN with proteome background is an integrative pipeline for the phosphoproteomic analysis of DDA experimental data obtained from MS. It performs a complete analysis of the raw files from Proteome Discoverer (PD) or MaxQuant (MQ), together with their biological interpretation through enrichment and network analysis. PhosProTN analyses the phosphoproteomic data at the peptide level, using the proteome analysis of the same conditions as background.

  • 1. Workflow of PhosProTN with proteome background


    The ProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and supporting complex designs. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis, detection of communities in protein-protein interaction networks, and kinome tree perturbation analysis.

    01. Define analysis settings and load input data files

    PhosProTN analyses the results of

    • Proteome Discoverer
    • MaxQuant

    Additional details on the input can be found in section 2. Details on the input parameters and files

    02. Normalization and imputation of raw intensities

    Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.

    At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides and also summarizes peptide intensities into relative protein abundances by the median sweeping method.
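The two normalization steps can be sketched as follows. This is a plain-Python re-statement of the idea under stated assumptions, not the DEqMS code: the function names mirror the R ones only for readability, the median-sweeping sketch assumes no missing values, and the three proteins are hypothetical.

```python
from statistics import median


def equal_median_normalization(columns):
    """Subtract each sample's median so that every sample has median 0.

    columns: dict mapping sample -> list of log2 intensities (None = missing).
    """
    out = {}
    for sample, values in columns.items():
        med = median(v for v in values if v is not None)
        out[sample] = [None if v is None else v - med for v in values]
    return out


def median_sweeping(rows, n_samples):
    """Summarize peptide rows into relative protein abundances.

    rows: list of (protein_id, [log2 intensity per sample]), no missing values.
    """
    by_protein = {}
    for protein, values in rows:
        row_med = median(values)  # step 1: centre each peptide across samples
        by_protein.setdefault(protein, []).append([v - row_med for v in values])
    # step 2: per protein and sample, take the median over its peptides
    summary = {
        p: [median(r[j] for r in peptides) for j in range(n_samples)]
        for p, peptides in by_protein.items()
    }
    # step 3: centre every sample (column) on median 0
    for j in range(n_samples):
        col_med = median(summary[p][j] for p in summary)
        for p in summary:
            summary[p][j] -= col_med
    return summary


normalized = equal_median_normalization({"s1": [1.0, 2.0, None, 6.0]})

peptides = [
    ("P1", [10, 11, 12]),
    ("P1", [20, 22, 21]),
    ("P2", [5, 5, 8]),
    ("P3", [7, 9, 8]),
]
proteins = median_sweeping(peptides, n_samples=3)
```

After the last step each sample column of `proteins` has median 0, which is the property the text describes for the peptide-level normalization as well.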

    Imputation

    • PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round-robin imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, and the other applies a tail-based imputation approach as implemented in Perseus.
    • Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
    • missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
    • pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
    Method | Package | Main Idea | Typical Use
    PhosR imputation | PhosR (Bioconductor) | Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure). | Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
    Gaussian estimation imputation | | Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD). | When MNAR (Missing Not At Random) is likely due to detection limit.
    missForest | missForest (R) | Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features. | General proteomics where missingness relates to multiple covariates, or when structure is complex.
    pcaMethods svdImpute | pcaMethods (Bioconductor) | Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data. | Well-replicated datasets with high correlation between samples.
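To make the "shifted mean, reduced SD" idea of the Gaussian estimation row concrete, here is a minimal sketch of a down-shifted Gaussian imputation for one sample. The shift of 1.8 standard deviations and the width factor of 0.3 are the values commonly associated with Perseus and are assumptions here, not the parameters used by the pipeline; the function name is hypothetical.

```python
import random
from statistics import mean, stdev


def gaussian_impute(values, shift=1.8, width=0.3, seed=0):
    """Replace None entries with draws from N(mean - shift*sd, (width*sd)^2).

    The mean and sd are estimated from the observed values, so at least two
    observed values are needed.
    """
    observed = [v for v in values if v is not None]
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the imputation reproducible
    return [
        rng.gauss(mu - shift * sd, width * sd) if v is None else v
        for v in values
    ]


# Hypothetical log2 intensities of one sample, with two missing values
sample = [20.0, 21.0, None, 22.0, 23.0, None]
filled = gaussian_impute(sample, seed=1)
```

Observed values are left untouched, while imputed values land in the low-intensity tail, which matches the assumption that values are missing because they fall below the detection limit.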

    In this step, several figures can be generated to summarize preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances below.

    03. Differential analysis

    Differential analysis is applied to both proteins and peptides to identify significant differences. Two slightly different methodologies are applied. The DEqMS package (Zhu 2022) is used for proteins: DEqMS is built on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides, thereby achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).

    • Compile the comparison table: the table has 2 columns:

      • Formule column (REQUIRED): the formulas need to follow the syntax of Limma (Ex: "cancer-normal").
      • Name column (OPTIONAL): custom name assigned to the comparison (Ex: "cancer_vs_normal").

    Limma and DEqMS identify differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on several parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:

    • the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr), and

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    It is Down-regulated if:

    • the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr), and

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
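The rules for the “class” column can be written as a small function. This is a sketch of the logic described above; the function name and the default thresholds are hypothetical, and the p-value passed in can be either the raw or the adjusted one, depending on the user's choice.

```python
def classify(log2_fc, p_value, fc_thr=0.75, p_thr=0.05):
    """Return "+", "-" or "=" for one protein/peptide in one comparison.

    fc_thr and p_thr stand in for the user-defined thresholds; the default
    values here are placeholders, not the pipeline's defaults.
    """
    if p_value < p_thr and log2_fc > fc_thr:
        return "+"  # up-regulated
    if p_value < p_thr and log2_fc < -fc_thr:
        return "-"  # down-regulated
    return "="      # not significant


classes = [
    classify(1.2, 0.01),    # large positive change, significant
    classify(-1.2, 0.01),   # large negative change, significant
    classify(1.2, 0.20),    # large change, but p-value too high
    classify(0.1, 0.001),   # significant p-value, but change too small
]
```

Both conditions must hold at once, so a strong fold change with a weak p-value (or the reverse) is still reported as “=”.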

    Various figures are generated: first a bar plot that summarizes the DEPs identified, followed by comparison-specific volcano plots.

    04. Report creation and download of the results

    The results are summarized in an HTML report. In addition, the analysis produces a large number of files; a description of each generated file can be found in section 4. Details on the output files. All the files are grouped in a zip file for download.

    ADDITIONAL STEPS:

    B1. Batch Effect correction

    If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, where an additional column describing the batches is required.

    E1. Enrichment analysis of the Differentially Expressed Proteins

    The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches on a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.

    Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the merge of the two (called all). ProTN provides for each term statistical parameters like P.Value, fdr, odds ratio, overlap size.

    ProTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. ProTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.

    To be considered significant, a term needs to have:

    • an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment), and

    • an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
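The two conditions above amount to a simple filter over the enrichment table. The sketch below assumes each term is a record with hypothetical keys `fdr` and `overlap_size`, uses placeholder thresholds, and follows the strict inequalities as written in this section.

```python
def significant_terms(terms, p_thr=0.05, overlap_thr=5):
    """Keep the terms passing both enrichment thresholds.

    terms: list of dicts with the (hypothetical) keys "term", "fdr" and
    "overlap_size"; thresholds are placeholders for the user settings.
    """
    return [
        t for t in terms
        if t["fdr"] < p_thr and t["overlap_size"] > overlap_thr
    ]


# Hypothetical enrichment results for one comparison
enrichment = [
    {"term": "term A", "fdr": 0.001, "overlap_size": 12},
    {"term": "term B", "fdr": 0.200, "overlap_size": 30},  # not significant
    {"term": "term C", "fdr": 0.010, "overlap_size": 3},   # overlap too small
]
selected = [t["term"] for t in significant_terms(enrichment)]
```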

    ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.

    K1. Kinase activity tree analysis of the Differentially Expressed Phosphosites

    In phosphoproteomics it is extremely useful to study the activation status of kinases based on the differentially expressed substrates identified by the differential analysis. For each comparison, PhosProTN predicts the activation state of the kinases using PhosR (Kim et al. 2021). PhosR provides a kinase-substrate relationship score, and uses it to prioritise potential kinases that could be responsible for the phosphorylation changes of phosphosites on the basis of kinase recognition motifs and phosphoproteomic dynamics.

    The activity score provided by PhosR is used to generate a graphical version of the human kinome tree using CORAL (Metz et al. 2018), a Shiny web app for visualizing both quantitative and qualitative data. CORAL generates high-resolution scalable vector graphics files suitable for publication without the need for refinement in graphic editing software.

    N1. Protein-Protein Interaction network analysis of Differentially Expressed Phosphosites

    ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).

    The species-specific database is retrieved from the STRING server, the interactions among the DEPs are identified, and an igraph (Csardi and Nepusz 2006) network is generated. The proteins are then clustered with an igraph function that identifies dense subgraphs by optimizing a modularity score.

    Since networks can vary widely in composition, two ggplot layouts are used: the Fruchterman-Reingold algorithm and the Kamada-Kawai algorithm.

  • 2. Details on the input parameters and files


    • Title of Analysis: title of the experiment. It will be the title of the web page report.

    • Brief Description: description of the current experiment. It is the first paragraph of the report.

    • Software Analyzer: the software that was used to identify peptides and proteins.
      • Proteome Discoverer
      • MaxQuant using evidence.txt
      • MaxQuant using peptides.txt and proteinGroups.txt
      • Spectronaut
      • FragPipe
    • Files required for Proteome Discoverer

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        File ID | Identifier used in column headers of the peptide file.
        Condition | Experimental group name. Used for comparisons.
        Sample | Clean sample name used downstream.
        color | (Optional) Plot color. Defaults are applied if missing.
        batch | (Optional) Batch ID for batch effect correction.
      • Peptides file: Excel table with annotated peptides and abundance values.
        Column Name | Description
        Master Protein Accessions | Maps peptide to protein; only first ID is kept.
        Annotated Sequence | Amino acid sequence including PTM annotations.
        Modifications | Post-translational modifications.
        Positions in Master Proteins | Position of peptide in the protein sequence.
        Abundance: <File ID> | Intensity/abundance for each sample. One column per sample.
      • Proteins file: Excel table containing descriptive and accession information for proteins.
        Column Name | Description
        Accession | Unique protein identifier, used to join with peptide file.
        Description | Descriptive string, e.g., from UniProt.
      • PSM file: required only for the phospho dataset; it is not required for the proteome background.
        Column Name | Description
        ptmRS: Best Site Probabilities | Used to resolve phosphosite ambiguity.
        Precursor Abundance | Abundance value used to filter invalid entries.
        Master Protein Accessions | Matches protein IDs for mapping.
        Annotated Sequence | Used to resolve conflicting PTM assignments.
    • Files required for MaxQuant

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        Condition | Experimental condition (e.g. Control, Treated). Used for group comparison.
        Sample | Sample identifier. Must match sample names in the peptide file.
        color | (Optional) Color associated with the condition. If not present, default colors are assigned.
        batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
      • Evidence pipeline:
        • evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:
          Column Name | Description
          Sequence | Amino acid sequence of the peptide.
          Modifications | PTMs of the peptide.
          Gene names | Gene symbol associated with the peptide.
          Protein names | Protein description. If missing, will be merged from annotation file.
          Leading razor protein | UniProt accession. Used for annotation enrichment.
          Raw file | File/sample ID. Must match entries in the annotation file.
          Intensity | Peptide intensity value. Used for quantification.
          Leading proteins | Used for filtering out contaminants (e.g. "CON_").
  • 3. Example case study - Phosphoproteomics from Steger et al. (2016)


    Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases

    PRIDE: PXD003071.

    Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.

    The case study can be downloaded from the Run tab.

  • 4. Details on the output files


    • report.html: complete report of the analysis with all figures and the results of the enrichment.
    • db_results_proTN.RData: RData object containing all the data produced during the execution, ready for additional analyses in R.
    • log_filter_read_function.txt: results of the filters applied during the preprocessing step.
    • input_protn folder
      • Input files provided
    • rdata folder
      • enrichment.RData: RData object containing enrichment results based on differentially expressed proteins
    • Tables folder
      • normalised_abundances.xlsx: Excel file containing the abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
        • protein_per_sample: protein abundances per sample.
        • peptide_per_sample: peptide abundances per sample.
        • protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
        • peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
      • differential_expression.xlsx: Excel file containing the results of the differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
        • protein_DE: protein differential expression results.
        • peptide_DE: peptide differential expression results. Annotation columns:
          • Accession: protein UniProt ID
          • Description: protein description
          • GeneName: gene symbol
          • Peptide_Sequence: peptide sequence
          • Peptide_Modifications: peptide modifications
          • Peptide_Position: start and end position of the peptide within the sequence of the protein identified by the UniProt ID
          • Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
        • Columns for each contrast:
          • class: defined according to the fold change, p-value and abundance thresholds specified in the input
            • +: up-regulated protein/peptide
            • -: down-regulated protein/peptide
            • =: invariant protein/peptide
          • log2_FC: protein/peptide log2 transformed fold change
          • p_val: protein/peptide contrast p-value
          • p_adj: protein/peptide adjusted p-value (FDR after BH correction)
          • log2_expr: protein/peptide log2 average abundance
      • enrichment.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to the significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
    • Figures folder
      • PDF versions of all selected figures
      • enrichment_plot.pdf: dot plot of the top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for the keywords defined in the advanced options.
      • protein_vulcano directory: directory with all the volcano plots based on the differential proteins.
      • peptide_vulcano directory: directory with all the volcano plots based on the differential peptides.
      • STRINGdb directory: directory with figures from the network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
        • *_connection.txt: text file with all the edges of the network.
        • *_network.pdf: PDF version of the STRINGdb network.
      • KinaseTree directory: figures and files with the kinase activity trees generated with PhosR and CORAL. For each design there are two files:
        • TXT file: text table with the identified activity of each kinase.
        • SVG file: vector image of the kinase tree generated with CORAL.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Cuklina, Jelena, Chloe H. Lee, Evan G. Williams, Ben Collins, Tatjana Sajic, Patrick Pedrioli, Maria Rodriguez-Martinez, and Ruedi Aebersold. 2018. “Computational Challenges in Biomarker Discovery from High-Throughput Proteomic Data.” https://doi.org/10.3929/ethz-b-000307772.

Jawaid, Wajid. 2022. “enrichR: Provides an R Interface to ’Enrichr’.” https://CRAN.R-project.org/package=enrichR.

Kim, Hani Jieun, Taiyun Kim, Nolan J Hoffman, Di Xiao, David E James, Sean J Humphrey, and Pengyi Yang. 2021. “PhosR Enables Processing and Functional Analysis of Phosphoproteomic Data.” Cell Reports 34. https://doi.org/10.1016/j.celrep.2021.108771.

Metz, Kathleen S., Erika M. Deoudes, Matthew E. Berginski, Ivan Jimenez-Ruiz, Bulent Arman Aksoy, Jeff Hammerbacher, Shawn M. Gomez, and Douglas H. Phanstiel. 2018. “Coral: Clear and Customizable Visualization of Human Kinome Data.” Cell Systems 7 (3): 347-350. https://doi.org/10.1016/j.cels.2018.07.001.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007.

Szklarczyk, Damian, Annika L Gable, Katerina C Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T Doncheva, et al. 2021. “The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/Measurement Sets.” Nucleic Acids Research 49.

Zhu, Yafeng. 2022. “DEqMS: A Tool to Perform Statistical Analysis of Differential Protein Expression for Quantitative Proteomics Data.”


InteracTN is an integrated pipeline for the analysis of interactomics data derived from DDA mass spectrometry experiments. The pipeline also includes network reconstruction and visualization, helping users interpret protein–protein interactions in the context of biological pathways and complexes. Designed to be robust and user-friendly, InteracTN enables researchers to efficiently explore and interpret interactome data for a deeper understanding of cellular mechanisms and protein functions.

  • 1. Workflow of InteracTN


    The InteracTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and supporting complex designs. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis and detection of communities in protein-protein interaction networks.

    01. Define analysis settings and load input data files

    InteracTN analyses the results of

    • Proteome Discoverer
    • MaxQuant
    • Spectronaut
    • FragPipe

    Additional details on the input can be found in section 2. Details on the input parameters and files

    02. Normalization and imputation of raw intensities

    Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.

    At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides and also summarizes peptide intensities into relative protein abundances by the median sweeping method.

    Imputation

    • PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round-robin imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, and the other applies a tail-based imputation approach as implemented in Perseus.
    • Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
    • missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
    • pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.
    Method | Package | Main Idea | Typical Use
    PhosR imputation | PhosR (Bioconductor) | Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure). | Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
    Gaussian estimation imputation | | Draws values from a Gaussian distribution defined by the low-intensity tail of observed data (shifted mean, reduced SD). | When MNAR (Missing Not At Random) is likely due to detection limit.
    missForest | missForest (R) | Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features. | General proteomics where missingness relates to multiple covariates, or when structure is complex.
    pcaMethods svdImpute | pcaMethods (Bioconductor) | Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data. | Well-replicated datasets with high correlation between samples.
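As an illustration of the Gaussian estimation row above (shifted mean, reduced SD), the following minimal sketch fills missing values in one sample with draws from a down-shifted normal distribution. The function name and the `shift`/`width` values (1.8 and 0.3, the defaults commonly used in Perseus) are assumptions for the example, not settings confirmed for this tool.

```python
import random
from statistics import mean, stdev


def gaussian_tail_impute(values, shift=1.8, width=0.3, seed=0):
    """Replace None entries with draws from N(mu - shift*sd, (width*sd)^2),
    where mu and sd are estimated from the observed log2 intensities.
    shift and width are assumed, Perseus-style values."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to estimate the SD")
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the imputation reproducible
    return [
        v if v is not None else rng.gauss(mu - shift * sd, width * sd)
        for v in values
    ]


# Hypothetical log2 intensities of one sample, with two missing values.
sample = [20.1, 22.4, None, 19.8, 21.0, None, 23.3]
imputed = gaussian_tail_impute(sample)
```

Observed values are left untouched; only the missing entries are replaced, which mimics signal below the detection limit.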

    In this step, several figures can be generated to document preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances below.

    03. Differential analysis

    Differential analysis is applied to both proteins and peptides, to identify significant differences. Two slightly different methodologies are applied. For proteins, the DEqMS package (Zhu 2022) is used. DEqMS is developed on top of Limma, but it estimates different prior variances for proteins quantified by different numbers of PSMs/peptides per protein, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).

    • Compile the comparison table: The table has 2 columns:

      • Formula column (REQUIRED): The formulas need to follow the syntax of Limma (Ex: "cancer-normal").
      • Name column (OPTIONAL): custom name assigned to the comparison (Ex: "cancer_vs_normal").
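To make the meaning of a formula such as "cancer-normal" concrete, here is a minimal sketch that reads it as "mean log2 abundance in cancer minus mean log2 abundance in normal". It handles only this simplest two-group form, whereas Limma accepts richer contrast expressions and fits a linear model; the function names and the toy values are hypothetical.

```python
from statistics import mean


def simple_contrast(formula):
    """Split a two-group formula like 'cancer-normal' into its two conditions.
    Only this simplest form is supported in this sketch."""
    left, sep, right = formula.partition("-")
    left, right = left.strip(), right.strip()
    if not sep or not left or not right or "-" in right:
        raise ValueError(f"unsupported formula: {formula!r}")
    return left, right


def log2_fold_change(log2_values, conditions, formula):
    """Difference between the mean log2 abundances of the two conditions."""
    numerator, denominator = simple_contrast(formula)
    mean_num = mean(v for v, c in zip(log2_values, conditions) if c == numerator)
    mean_den = mean(v for v, c in zip(log2_values, conditions) if c == denominator)
    return mean_num - mean_den


# One protein measured in four samples (hypothetical log2 abundances).
values = [10.0, 11.0, 8.0, 9.0]
conditions = ["cancer", "cancer", "normal", "normal"]
fc = log2_fold_change(values, conditions, "cancer-normal")
```

With these values the result is 2.0, i.e. the protein is four times more abundant in cancer than in normal on the linear scale.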

    Limma and DEqMS calculate differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on several parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In InteracTN, a protein/peptide is significant if it passes the user-defined thresholds on these parameters. For each comparison, a protein/peptide is classified as Up-regulated if:

    • the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr),

    • the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

    In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated and “=” if it is not significant.
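The classification rule can be written as a few lines of Python. This is only a sketch of the two conditions listed above; the function name and the default threshold values are illustrative assumptions, since the actual thresholds are set by the user.

```python
def classify(log2_fc, p_value, fc_thr=0.75, p_thr=0.05):
    """Return '+' for an up-regulated protein/peptide, '=' otherwise.
    p_value may be the raw or the adjusted p-value, depending on the setting.
    The default thresholds are illustrative only."""
    if log2_fc > fc_thr and p_value < p_thr:
        return "+"
    return "="


# Hypothetical results for three proteins in one comparison.
results = {
    "P1": {"log2_FC": 2.1, "p_adj": 0.001},
    "P2": {"log2_FC": 2.1, "p_adj": 0.20},
    "P3": {"log2_FC": 0.3, "p_adj": 0.001},
}
classes = {
    name: classify(r["log2_FC"], r["p_adj"]) for name, r in results.items()
}
```

Only P1 passes both thresholds, so it is the only one labelled "+".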

    Various figures are generated: first a bar plot that graphically represents the DEPs identified, followed by comparison-specific volcano plots.

    04. Report creation and download of the results

    Results are summarized in an HTML report. In addition, InteracTN generates a large number of useful files: a description of each output file can be found in section 4. Details on the output files. All the files are grouped in a zip file, which is then downloaded.

    ADDITIONAL STEPS:

    B1. Batch Effect correction

    If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, where an additional column describes the batches.

    E1. Enrichment analysis of the Differentially Expressed Proteins

    The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, InteracTN uses EnrichR (Jawaid 2022), a popular tool that searches a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.

    Each comparison defined in the differential analysis stage results in one set of proteins: the Up-regulated ones (called Up). For each term, InteracTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.

    InteracTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. InteracTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.

    To be considered significant, a term needs to have:

    • an FDR or P.Value lower than P.Value thr for enrichment (P.Value < P.Value thr for enrichment),

    • an Overlap Size higher than Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
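The two conditions above amount to a simple filter over the enrichment table. The sketch below assumes each term is a dictionary with a p-value and an Enrichr-style overlap string of the form "hits/term size"; the field names, threshold defaults and example terms are hypothetical.

```python
def overlap_size(overlap):
    """Number of input proteins found in the term, from a string like '7/120'."""
    return int(overlap.split("/")[0])


def significant_terms(terms, p_thr=0.05, overlap_thr=4):
    """Keep terms with p-value below p_thr and overlap size above overlap_thr.
    Thresholds are illustrative; in the tool they are set by the user."""
    return [
        t for t in terms
        if t["p_value"] < p_thr and overlap_size(t["overlap"]) > overlap_thr
    ]


# Hypothetical enrichment results for one comparison.
terms = [
    {"term": "mRNA splicing", "p_value": 0.001, "overlap": "7/120"},
    {"term": "translation", "p_value": 0.200, "overlap": "9/300"},
    {"term": "cell cycle", "p_value": 0.010, "overlap": "2/80"},
]
kept = [t["term"] for t in significant_terms(terms)]
```

Here only "mRNA splicing" survives: "translation" fails the p-value threshold and "cell cycle" fails the overlap threshold.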

    InteracTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.

    N1. Protein-Protein Interaction network analysis of Differentially Expressed Proteins

    InteracTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, InteracTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).

    The species-specific database is retrieved from the STRING server, and all the interactions above a user-defined threshold are used to generate a network with

  • 2. Details on the input parameters and files


    • Title of Analysis: title of the experiment. It will be the title of the web page report.

    • Brief Description: description of the current experiment. It is the first paragraph of the report.

    • Software Analyzer: determines which software was used to identify peptides and proteins.
      • Proteome Discoverer
      • MaxQuant using evidence.txt
      • MaxQuant using peptides.txt and proteinGroups.txt
      • Spectronaut
      • FragPipe
    • File required for Proteome Discoverer

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        File ID | Identifier used in column headers of the peptide file.
        Condition | Experimental group name. Used for comparisons.
        Sample | Clean sample name used downstream.
        color | (Optional) Plot color. Defaults are applied if missing.
        batch | (Optional) Batch ID for batch effect correction.
      • Peptides file: Excel table with annotated peptides and abundance values.
        Column Name | Description
        Master Protein Accessions | Maps peptide to protein; only first ID is kept.
        Annotated Sequence | Amino acid sequence including PTM annotations.
        Modifications | Post-translational modifications.
        Positions in Master Proteins | Position of peptide in the protein sequence.
        Abundance: <File ID> | Intensity/abundance for each sample. One column per sample.
      • Proteins file: Excel table containing descriptive and accession information for proteins.
        Column Name | Description
        Accession | Unique protein identifier, used to join with peptide file.
        Description | Descriptive string, e.g., from UniProt.
    • File required for MaxQuant

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        Condition | Experimental condition (e.g. Control, Treated). Used for group comparison.
        Sample | Sample identifier. Must match sample names in the peptide file.
        color | (Optional) Color associated with the condition. If not present, default colors are assigned.
        batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
      • Evidence pipeline:
        • evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:
          Column Name | Description
          Sequence | Amino acid sequence of the peptide.
          Modifications | PTMs of the peptide.
          Gene names | Gene symbol associated with the peptide.
          Protein names | Protein description. If missing, will be merged from annotation file.
          Leading razor protein | UniProt accession. Used for annotation enrichment.
          Raw file | File/sample ID. Must match entries in the annotation file.
          Intensity | Peptide intensity value. Used for quantification.
          Leading proteins | Used for filtering out contaminants (e.g. "CON_").
      • Peptide and ProteinGroups pipeline:
        • peptides.txt: Tab-delimited file with peptide-level quantification. Required columns:
          Column Name | Description
          Sequence | Amino acid sequence of the peptide.
          Gene names | If missing, inferred from Leading razor protein.
          Protein names | Protein description. If missing, will be merged from annotation file.
          Leading razor protein | UniProt accession. Used for annotation enrichment.
          Intensity <Sample> | Intensity values for each sample (e.g. Intensity Sample1).
        • proteinGroups.txt: Tab-delimited file providing protein-level information. Required columns:
          Column Name | Description
          Majority protein IDs | Used to extract the Leading razor protein.
          Fasta headers | Used for protein description.
    • File required for Spectronaut

      • Annotation file: (Optional) If provided, this file should contain metadata for each sample. If not provided, the pipeline will extract sample annotations directly from the peptide file.
        Column Name | Description
        Condition | Experimental group label. Used for comparison between conditions.
        Sample | Sample identifier. Must match entries in the peptide file.
        color | (Optional) Color for visualization. Default colors will be assigned if missing.
        batch | (Optional) Batch ID for batch correction. Required if batch correction is enabled.
      • Spectronaut report: This is a TSV/CSV file containing peptide-level data. Required columns:
        Column Name | Description
        PG.ProteinAccessions | Protein group accessions.
        PEP.StrippedSequence | Peptide sequence without modifications.
        EG.ModifiedSequence | Peptide sequence with modifications.
        PEP.Quantity | Peptide quantification value.
        R.FileName | Sample identifier (column used is defined by sample_col).
        R.Condition | Condition identifier (column used if annotation file not provided).
    • File required for FragPipe

      • Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:
        Column Name | Description
        Condition | Experimental condition (e.g. Control, Treated). Used for group comparison.
        Sample | Sample identifier. Must match sample names in the peptide file.
        color | (Optional) Color associated with the condition. If not present, default colors are assigned.
        batch | (Optional) Batch ID for batch effect correction. Required if batch correction is enabled.
      • combined_modified_peptide.tsv: The peptide quantification file must contain raw or normalized intensity values for each sample and peptide. Required columns:
        Column Name | Description
        Protein ID | Protein accession or identifier.
        Protein Description | Descriptive name of the protein.
        Gene | Gene symbol.
        Peptide Sequence | Amino acid sequence of the peptide.
        Assigned Modifications | Modifications assigned to the peptide.
        Prev AA | Used to determine tryptic condition.
        Next AA | Used to determine tryptic condition.
        <Sample> Intensity | One column per sample named like <Sample> Intensity.
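The Prev AA and Next AA columns are described as being used to determine the tryptic condition. A minimal sketch of how such a status could be derived is given below. It is not the tool's implementation and rests on several stated assumptions: trypsin cleaves after K or R (the proline exception is ignored), "-" or an empty string marks a protein terminus, and "N-semi tryptic" is taken to mean that the non-tryptic end is the peptide N-terminus.

```python
TRYPSIN_SITES = {"K", "R"}   # residues after which trypsin cleaves
TERMINUS = {"-", ""}         # assumed markers for a protein terminus


def tryptic_status(peptide, prev_aa, next_aa):
    """Classify a peptide from its sequence and flanking residues."""
    # N-terminal cut is tryptic if the preceding residue is K/R or a terminus.
    n_ok = prev_aa in TRYPSIN_SITES or prev_aa in TERMINUS
    # C-terminal cut is tryptic if the peptide ends in K/R or at a terminus.
    c_ok = peptide[-1] in TRYPSIN_SITES or next_aa in TERMINUS
    if n_ok and c_ok:
        return "fully tryptic"
    if c_ok:
        return "N-semi tryptic"  # assumed naming: N-terminus is non-tryptic
    if n_ok:
        return "C-semi tryptic"  # assumed naming: C-terminus is non-tryptic
    return "non tryptic"


# Hypothetical peptides with their flanking residues.
examples = [
    ("LVNELTEFAK", "K", "T"),
    ("LVNELTEFAK", "A", "T"),
    ("LVNELTEFA", "R", "G"),
    ("LVNELTEFA", "A", "G"),
]
statuses = [tryptic_status(*e) for e in examples]
```

The four labels match the categories listed for the Peptide_Tryptic column of the output tables.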
  • 3. Example case study


    The case study can be downloaded from the Run tab.

  • 4. Details on the output files


    • report.html: complete report of the analysis with all figures and results of the enrichment.
    • db_results_InteracTN.RData: RData object containing all the data produced during the execution, ready for additional analyses in R.
    • log_filter_read_function.txt: log of the filters applied during the preprocessing step.
    • input_InteracTN folder
      • Input files provided
    • rdata folder
      • enrichment.RData: RData object containing enrichment results based on differentially expressed proteins
    • Tables folder
      • normalised_abundances.xlsx: Excel file containing abundance values generated by InteracTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
        • protein_per_sample: protein abundances per sample.
        • peptide_per_sample: peptide abundances per sample.
        • protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
        • peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
      • differential_expression.xlsx: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
        • protein_DE: protein differential expression results.
        • peptide_DE: peptide differential expression results.
        • Annotation columns:
          • Accession: protein UniprotID
          • Description: protein description
          • GeneName: Gene Symbol
          • Peptide_Sequence: peptide sequence
          • Peptide_Modifications: peptide modifications
          • Peptide_Position: start and end position of the peptide within the protein sequence identified by the UniprotID
          • Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
        • Columns for each contrast:
          • class: defined according to the fold change, p-value and abundance thresholds specified in the input
            • +: up-regulated protein/peptide
            • -: down-regulated protein/peptide
            • =: invariant protein/peptide
          • log2_FC: protein/peptide log2 transformed fold change
          • p_val: protein/peptide contrast p-value
          • p_adj: protein/peptide adjusted p-value (FDR after BH correction)
          • log2_expr: protein/peptide log2 average abundance
      • enrichment.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
    • Figures folder
      • PDF version of all selected figures
      • enrichment_plot.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided into up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
      • protein_vulcano directory: directory with all the volcano plots based on the differential proteins.
      • peptide_vulcano directory: directory with all the volcano plots based on the differential peptides.
      • STRINGdb directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
        • *_connection.txt: txt file with all edges of the network.
        • *_network.pdf: PDF version of the STRINGdb network.
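The protein_per_condition and peptide_per_condition sheets listed above report the average and standard deviation of the abundances within each condition. A minimal sketch of that summary, starting from per-sample values, is shown here; the function name, data layout and numbers are assumptions made for illustration.

```python
from statistics import mean, stdev


def per_condition_summary(abundances, conditions):
    """abundances maps a protein/peptide to its log2 abundances per sample;
    conditions gives the condition label of each sample, in the same order.
    Returns {feature: {condition: (average, standard deviation)}}."""
    sample_idx = {}
    for idx, cond in enumerate(conditions):
        sample_idx.setdefault(cond, []).append(idx)
    summary = {}
    for feature, vals in abundances.items():
        summary[feature] = {}
        for cond, idxs in sample_idx.items():
            group = [vals[i] for i in idxs]
            # The standard deviation is undefined for a single replicate.
            sd = stdev(group) if len(group) > 1 else None
            summary[feature][cond] = (mean(group), sd)
    return summary


# Hypothetical per-sample abundances for one protein, two conditions.
abundances = {"P1": [10.0, 12.0, 8.0, 8.0]}
conditions = ["A", "A", "B", "B"]
summary = per_condition_summary(abundances, conditions)
```

For P1 this gives an average of 11.0 in condition A and 8.0 in condition B, with a standard deviation of 0.0 in B because both replicates are identical.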

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Cuklina, Jelena, Chloe H. Lee, Evan G. Williams, Ben Collins, Tatjana Sajic, Patrick Pedrioli, Maria Rodriguez-Martinez, and Ruedi Aebersold. 2018. “Computational Challenges in Biomarker Discovery from High-Throughput Proteomic Data.” https://doi.org/10.3929/ethz-b-000307772.

Jawaid, Wajid. 2022. “enrichR: Provides an r Interface to ’Enrichr’.” https://CRAN.R-project.org/package=enrichR.

Kim, Hani Jieun, Taiyun Kim, Nolan J Hoffman, Di Xiao, David E James, Sean J Humphrey, and Pengyi Yang. 2021. “PhosR Enables Processing and Functional Analysis of Phosphoproteomic Data.” Cell Reports 34. https://doi.org/10.1016/j.celrep.2021.108771.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007.

Szklarczyk, Damian, Annika L Gable, Katerina C Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T Doncheva, et al. 2021. “The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/Measurement Sets.” Nucleic Acids Research 49.

Zhu, Yafeng. 2022. “DEqMS: A Tool to Perform Statistical Analysis of Differential Protein Expression for Quantitative Proteomics Data.”

Mission of RNA and Disease Data Science laboratory

Human diseases such as cancer are intrinsically entangled with complexity. The discovery of effective cures requires dealing with this complexity and greatly benefits from the development of high-resolution methods of investigation: genome-wide, multi-modal, single cell, spatially resolved. Understanding human diseases at single-cell resolution within their architectural context is a scientific challenge requiring dedicated computational analysis dealing with the volume and heterogeneity of the data. The research mission of the lab is to understand the RNA molecular mechanisms underlying dysregulation in human diseases, by combining high-throughput and high-resolution analyses, with a pan-disciplinary approach.

Contacts

• Gabriele Tomè: Developer, University of Trento
• Toma Tebaldi: PI & Supervisor, University of Trento & Yale University
• Romina Belli: Contributor, MS CIBIO Facility - Trento
• Daniele Peroni: Contributor, MS CIBIO Facility - Trento