ProTN is an integrative pipeline that analyzes DDA proteomics data obtained from MS. It performs a complete analysis of the raw files from different software, together with their biological interpretation through enrichment and network analysis. ProTN executes a dual-level analysis, at the protein and peptide level.

1. Workflow of ProTN

The ProTN workflow is divided into preprocessing, differential analysis and biological interpretation. During preprocessing, input data are filtered, imputed, normalized, and optionally batch-corrected. Next, differential analysis is performed in parallel at peptide and protein resolution, based on user-defined comparisons and taking complex designs into account. Finally, biological interpretation of differentially expressed proteins is performed, including functional enrichment analysis and detection of communities in protein-protein interaction networks.

01. Define analysis settings and load input data files

ProTN analyses the results of:

Proteome Discoverer
MaxQuant
Spectronaut
FragPipe

Additional details on the input can be found in section "2. Details on the input parameters and files".

02. Normalization and imputation of raw intensities

Intensities are log2 transformed and normalized with DEqMS (Zhu 2022). At the peptide level, the normalization is performed with the function equalMedianNormalization, which normalizes intensity distributions in samples so that they have median equal to 0.

At the protein level, this operation is executed by the function medianSweeping, which applies the same median normalization used for peptides but also summarizes peptide intensities into protein relative abundances by the median sweeping method.
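The two normalization steps can be illustrated with a small, dependency-free Python sketch. It is not the DEqMS code (DEqMS is an R package): the function names `equal_median_normalize` and `median_sweep`, the dictionary layout and the toy intensities are assumptions made for illustration, and missing values are not handled.

```python
import math
from statistics import median


def equal_median_normalize(table):
    """Shift every sample (column) so that its median becomes 0.

    `table` maps a row id to a list of log2 intensities, one per sample.
    Missing values are not handled in this sketch.
    """
    col_medians = [median(col) for col in zip(*table.values())]
    return {
        row_id: [v - m for v, m in zip(row, col_medians)]
        for row_id, row in table.items()
    }


def median_sweep(peptides, peptide_to_protein):
    """Summarize peptide rows into protein relative abundances."""
    # 1) Centre each peptide on its own median across samples.
    centred = {}
    for pep, row in peptides.items():
        row_median = median(row)
        centred[pep] = [v - row_median for v in row]
    # 2) For each protein and sample, take the median of its peptides.
    grouped = {}
    for pep, row in centred.items():
        grouped.setdefault(peptide_to_protein[pep], []).append(row)
    proteins = {
        prot: [median(col) for col in zip(*rows)]
        for prot, rows in grouped.items()
    }
    # 3) Equalize the sample medians of the protein table.
    return equal_median_normalize(proteins)


# Toy raw intensities for three peptides in three samples (hypothetical).
raw = {
    "pepA": [1024.0, 2048.0, 4096.0],
    "pepB": [512.0, 1024.0, 1024.0],
    "pepC": [256.0, 256.0, 1024.0],
}
log2_peptides = {p: [math.log2(v) for v in row] for p, row in raw.items()}

norm_peptides = equal_median_normalize(log2_peptides)
proteins = median_sweep(
    log2_peptides, {"pepA": "P1", "pepB": "P1", "pepC": "P2"}
)
print(norm_peptides)
print(proteins)
```

After normalization every sample column has median 0, both in the peptide table and in the summarized protein table.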

Imputation

PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round imputation is performed in the absence of replicates. ProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, and the other applies the tail-based imputation approach as implemented in Perseus.
Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.

Method	Package	Main Idea	Typical Use
PhosR imputation	PhosR (Bioconductor)	Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure).	Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
Gaussian estimation imputation		Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD).	When MNAR (Missing Not At Random) is likely due to detection limit.
missForest	missForest (R)	Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features.	General proteomics where missingness relates to multiple covariates, or when structure is complex.
pcaMethods `svdImpute`	pcaMethods (Bioconductor)	Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data.	Well-replicated datasets with high correlation between samples.
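As a rough illustration of the Gaussian estimation idea, the sketch below replaces missing values with draws from a normal distribution estimated from the observed intensities. It is not ProTN's implementation: the function name `gaussian_impute` and the `shift`, `scale` and `seed` parameters are hypothetical. With the defaults the distribution uses the plain mean and standard deviation of the observed values; a positive `shift` and a `scale` below 1 give the low-intensity-tail variant mentioned in the table (the values 1.8 and 0.3 are illustrative only).

```python
import random
from statistics import mean, stdev


def gaussian_impute(values, shift=0.0, scale=1.0, seed=0):
    """Replace None entries with draws from N(mu - shift*sd, (scale*sd)^2).

    mu and sd are estimated from the observed (non-missing) values.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        raise ValueError("need at least two observed values to estimate the SD")
    mu, sd = mean(observed), stdev(observed)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        v if v is not None else rng.gauss(mu - shift * sd, scale * sd)
        for v in values
    ]


# Toy log2 intensities with two missing values (hypothetical data).
intensities = [20.1, None, 19.7, 20.4, None, 19.9]
imputed = gaussian_impute(intensities)
low_tail = gaussian_impute(intensities, shift=1.8, scale=0.3)
print(imputed)
print(low_tail)
```

Observed values are returned unchanged; only the missing entries are filled in.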

In this step, many figures can be generated with information about preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances.

03. Differential analysis

Differential analysis is applied to both proteins and peptides, to identify significant differences. Two slightly different methodologies are applied. The DEqMS package (Zhu 2022) is used for proteins. DEqMS is developed on top of Limma, but the method estimates different prior variances for proteins quantified by different numbers of PSMs/peptides per protein, therefore achieving better accuracy. For single peptides, the Limma package is used (Ritchie et al. 2015).

Compile the comparison table: The table has 2 columns:
- Formula column (REQUIRED): The formulas need to follow the syntax of Limma (Ex: "cancer-normal").
- Name column (OPTIONAL): personalized name assigned to the comparison (Ex: "cancer_vs_normal").
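The comparison table can be pictured as a list of (formula, name) rows. The sketch below checks the simplest kind of formula, a two-condition difference such as "cancer-normal", against the known conditions. The function `parse_comparisons`, the output keys and the `_vs_` default name are hypothetical; Limma contrasts can be more complex than this sketch supports, and condition names containing a hyphen would not be handled.

```python
def parse_comparisons(rows, conditions):
    """Validate simple 'A-B' formulas and fill in missing names.

    rows: list of (formula, name_or_None) tuples.
    conditions: set of condition labels from the sample annotation.
    """
    parsed = []
    for formula, name in rows:
        parts = [p.strip() for p in formula.split("-")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"unsupported formula in this sketch: {formula!r}")
        unknown = [p for p in parts if p not in conditions]
        if unknown:
            raise ValueError(f"unknown condition(s) {unknown} in {formula!r}")
        parsed.append(
            {
                "formula": formula,
                # Hypothetical default when the Name column is left empty.
                "name": name or "_vs_".join(parts),
                "test": parts[0],
                "reference": parts[1],
            }
        )
    return parsed


comparisons = parse_comparisons(
    [("cancer-normal", None), ("treated-normal", "drug_effect")],
    {"cancer", "normal", "treated"},
)
for comparison in comparisons:
    print(comparison)
```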

Limma and DEqMS calculate differentially expressed peptides and proteins (DEPs) for each comparison specified in the design file parameter. Each peptide or protein can be selected as differential based on different parameters: the log2 Fold Change, the P.Value, the adjusted P.Value and the log2 expression. In ProTN, a protein/peptide is significant if it passes the thresholds on these parameters, which are set by the user. For each comparison, a protein/peptide can be Up-regulated or Down-regulated. It is Up-regulated if:

the log2 FC is higher than the Fold Change threshold (FC > Log2 FC thr), and
the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

It is Down-regulated if:

the log2 FC is lower than the negative Fold Change threshold (FC < -Log2 FC thr), and
the Adj.P.Value or P.Value is lower than the threshold (P.Value < P.Value thr).

In the output, for each comparison, this distinction is reported in the “class” column, which takes the value “+” if the protein/peptide is up-regulated, “-” if it is down-regulated and “=” if it is not significant.
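The classification rule above can be written as a short function. This is a sketch under stated assumptions, not ProTN's code: the name `classify`, the `use_adjusted` switch and the default threshold values are hypothetical, and the log2 expression threshold is left out because the text does not state how it is applied.

```python
def classify(log2_fc, p_value, adj_p_value,
             fc_thr=0.75, p_thr=0.05, use_adjusted=True):
    """Return '+', '-' or '=' for one protein/peptide in one comparison.

    fc_thr and p_thr are illustrative defaults, not ProTN's settings.
    """
    p = adj_p_value if use_adjusted else p_value
    if p < p_thr and log2_fc > fc_thr:
        return "+"  # up-regulated
    if p < p_thr and log2_fc < -fc_thr:
        return "-"  # down-regulated
    return "="  # not significant


print(classify(1.2, 0.001, 0.01))
print(classify(-1.2, 0.001, 0.01))
print(classify(1.2, 0.001, 0.20))
```

Switching `use_adjusted` to `False` applies the same threshold to the raw P.Value instead of the adjusted one.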

Various figures are generated: first a bar plot that graphically represents the DEPs identified, followed by comparison-specific volcano plots.

04. Report creation and download of the results

Results are summarized in an HTML web-page report. In addition, ProTN generates a large number of useful files; a description of each output file can be found in section "4. Details on the output files". All the files are grouped in a zip file and downloaded.

ADDITIONAL STEPS:

B1. Batch Effect correction

If required by the experiment, a batch correction step can be applied using proBatch (Cuklina et al. 2018). The batches need to be defined in the sample annotation file, where an additional column describing the batches is required.

E1. Enrichment analysis of the Differentially Expressed Proteins

The biological interpretation of the Differentially Expressed Proteins starts with the enrichment step. To execute this analysis, ProTN uses EnrichR (Jawaid 2022), a popular tool that searches a large number of data sets to obtain information about many functional categories. EnrichR organises its hundreds of data sources in 8 sections: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Misc, Legacy, and Crowd.

Each comparison defined in the differential analysis stage can result in 3 sets of proteins: the Up-regulated (called Up), the Down-regulated (called Down), and the union of the two (called all). For each term, ProTN provides statistical parameters such as P.Value, FDR, odds ratio and overlap size.

ProTN creates an RData of the complete enrichment data frame, allowing the user an easy import in R to perform further analysis. ProTN also generates an Excel file, containing only the significantly enriched terms, as defined by user settings.

To be significant, a term needs to have:

an FDR or P.Value lower than the P.Value thr for enrichment (P.Value < P.Value thr for enrichment), and
an Overlap Size higher than the Overlap size thr for enrichment (Overlap Size > Overlap size thr for enrichment).
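The two conditions above amount to a simple filter over the enrichment table. The sketch below assumes each term is a dictionary with hypothetical keys `fdr`, `p_value` and `overlap_size`; the function name, the toy terms and the threshold values are illustrative, and the comparison on the overlap size is strict, as stated in this section.

```python
def significant_terms(terms, p_thr=0.05, overlap_thr=4, use_fdr=True):
    """Keep the terms that pass both enrichment thresholds."""
    key = "fdr" if use_fdr else "p_value"
    return [
        t for t in terms
        if t[key] < p_thr and t["overlap_size"] > overlap_thr
    ]


# Toy enrichment results (hypothetical terms and numbers).
terms = [
    {"term": "term A", "p_value": 0.0001, "fdr": 0.002, "overlap_size": 12},
    {"term": "term B", "p_value": 0.0300, "fdr": 0.200, "overlap_size": 9},
    {"term": "term C", "p_value": 0.0002, "fdr": 0.004, "overlap_size": 3},
]
kept = significant_terms(terms)
print([t["term"] for t in kept])
```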

ProTN displays top significant enrichments based on specific annotation datasets or keywords selected by the user.

N1. Protein-Protein Interaction network analysis of Differentially Expressed Proteins

ProTN performs Protein-Protein Interaction (PPI) network analysis on differentially expressed proteins. PPIs are essential in almost all processes of the cell, and crucial for understanding cell physiology in different states. For each comparison, ProTN analyses the interaction between the DEPs using STRING (Szklarczyk et al. 2021).

The species-specific database is retrieved from the STRING server, and all the interactions above a user-defined threshold are used to generate a network with

2. Details on the input parameters and files

Title of Analysis: title of the experiment. It will be the title of the web page report.
Brief Description: description of the current experiment. It is the first paragraph of the report.
Software Analyzer: the software that was used to identify peptides and proteins.
- Proteome Discoverer
- MaxQuant using evidence.txt
- MaxQuant using peptides.txt and proteinGroups.txt
- Spectronaut
- FragPipe

Files required for Proteome Discoverer

Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:

Column Name	Description
`File ID`	Identifier used in column headers of the peptide file.
`Condition`	Experimental group name. Used for comparisons.
`Sample`	Clean sample name used downstream.
`color`	(Optional) Plot color. Defaults are applied if missing.
`batch`	(Optional) Batch ID for batch effect correction.

Peptides file: Excel table with annotated peptides and abundance values.

Column Name	Description
`Master Protein Accessions`	Maps peptide to protein; only first ID is kept.
`Annotated Sequence`	Amino acid sequence including PTM annotations.
`Modifications`	Post-translational modifications.
`Positions in Master Proteins`	Position of peptide in the protein sequence.
`Abundance: <File ID>`	Intensity/abundance for each sample. One column per sample.
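To make the link between the two files concrete, the sketch below pairs each `File ID` from the annotation file with its abundance column in the peptide file. It assumes the header is exactly `Abundance: <File ID>`, as written in the table; real exports may carry extra text in the header, and the function name `abundance_columns` and the toy values are hypothetical.

```python
def abundance_columns(file_ids, peptide_columns):
    """Map each File ID to its 'Abundance: <File ID>' column name."""
    wanted = {fid: f"Abundance: {fid}" for fid in file_ids}
    missing = [col for col in wanted.values() if col not in peptide_columns]
    if missing:
        raise KeyError(f"columns not found in the peptide file: {missing}")
    return wanted


# Toy header of a peptide table and File IDs from the annotation file.
peptide_header = [
    "Master Protein Accessions",
    "Annotated Sequence",
    "Modifications",
    "Positions in Master Proteins",
    "Abundance: F1",
    "Abundance: F2",
]
mapping = abundance_columns(["F1", "F2"], peptide_header)
print(mapping)
```

Failing early on a missing column makes a mismatch between the annotation and peptide files visible before any analysis starts.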

Proteins file: Excel table containing descriptive and accession information for proteins.

Column Name	Description
`Accession`	Unique protein identifier, used to join with peptide file.
`Description`	Descriptive string, e.g., from UniProt.

Files required for MaxQuant

Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:

Column Name	Description
`Condition`	Experimental condition (e.g. Control, Treated). Used for group comparison.
`Sample`	Sample identifier. Must match sample names in the peptide file.
`color`	(Optional) Color associated with the condition. If not present, default colors are assigned.
`batch`	(Optional) Batch ID for batch effect correction. Required if batch correction is enabled.

Evidence pipeline:

evidence.txt: This is a TSV/CSV file containing peptide-level quantification data. Required columns:

Column Name	Description
`Sequence`	Amino acid sequence of the peptide.
`Modifications`	PTMs of the peptide.
`Gene names`	Gene symbol associated with the peptide.
`Protein names`	Protein description. If missing, will be merged from annotation file.
`Leading razor protein`	UniProt accession. Used for annotation enrichment.
`Raw file`	File/sample ID. Must match entries in the annotation file.
`Intensity`	Peptide intensity value. Used for quantification.
`Leading proteins`	Used for filtering out contaminants (e.g. "CON_").
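The contaminant filter mentioned for `Leading proteins` can be sketched as follows. This is an illustration, not ProTN's code: it assumes a row is dropped when the "CON_" marker appears anywhere in the `Leading proteins` field, and the function name and the toy rows (including the accessions) are hypothetical.

```python
import csv
import io


def drop_contaminants(rows, marker="CON_"):
    """Remove rows whose 'Leading proteins' field contains the marker."""
    return [r for r in rows if marker not in r["Leading proteins"]]


# Toy tab-delimited excerpt with only a few of the required columns.
evidence_tsv = (
    "Sequence\tLeading proteins\tRaw file\tIntensity\n"
    "AAAK\tP12345\trun1\t1000\n"
    "CCCR\tCON__P00761\trun1\t5000\n"
    "DDDK\tQ9XYZ1;CON__P02769\trun2\t700\n"
    "EEEK\tQ67890\trun2\t300\n"
)
rows = list(csv.DictReader(io.StringIO(evidence_tsv), delimiter="\t"))
kept = drop_contaminants(rows)
print([r["Sequence"] for r in kept])
```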

Peptide and ProteinGroups pipeline:

peptides.txt: Tab-delimited file with peptide-level quantification. Required columns:

Column Name	Description
`Sequence`	Amino acid sequence of the peptide.
`Gene names`	If missing, inferred from `Leading razor protein`.
`Protein names`	Protein description. If missing, will be merged from annotation file.
`Leading razor protein`	UniProt accession. Used for annotation enrichment.
`Intensity <Sample>`	Intensity values for each sample (e.g. `Intensity Sample1`).

proteinGroups.txt: Tab-delimited file providing protein-level information. Required columns:

Column Name	Description
`Majority protein IDs`	Used to extract the `Leading razor protein`.
`Fasta headers`	Used for protein description.

Files required for Spectronaut

Annotation file: (Optional) If provided, this file should contain metadata for each sample. If not provided, the pipeline will extract sample annotations directly from the peptide file.

Column Name	Description
`Condition`	Experimental group label. Used for comparison between conditions.
`Sample`	Sample identifier. Must match entries in the peptide file.
`color`	(Optional) Color for visualization. Default colors will be assigned if missing.
`batch`	(Optional) Batch ID for batch correction. Required if batch correction is enabled.

Spectronaut report: This is a TSV/CSV file containing peptide-level data. Required columns:

Column Name	Description
`PG.ProteinAccessions`	Protein group accessions.
`PEP.StrippedSequence`	Peptide sequence without modifications.
`EG.ModifiedSequence`	Peptide sequence with modifications.
`PEP.Quantity`	Peptide quantification value.
`R.FileName`	Sample identifier (column used is defined by `sample_col`).
`R.Condition`	Condition identifier (column used if `annotation file` not provided).
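The fallback described for Spectronaut, where sample annotations are taken from the report when no annotation file is given, might look like the sketch below. The function name `sample_annotation` and the row layout are assumptions; `sample_col` mirrors the parameter named in the table, with `R.FileName` assumed as its default.

```python
def sample_annotation(report_rows, annotation=None, sample_col="R.FileName"):
    """Return the annotation, deriving it from the report when absent."""
    if annotation is not None:
        return annotation
    seen = {}
    for row in report_rows:
        # Keep the first condition seen for each sample.
        seen.setdefault(row[sample_col], row["R.Condition"])
    return [{"Sample": s, "Condition": c} for s, c in seen.items()]


# Toy report rows with only the columns needed here (hypothetical values).
report = [
    {"R.FileName": "run1", "R.Condition": "Control", "PEP.Quantity": 10.0},
    {"R.FileName": "run1", "R.Condition": "Control", "PEP.Quantity": 12.0},
    {"R.FileName": "run2", "R.Condition": "Treated", "PEP.Quantity": 20.0},
]
derived = sample_annotation(report)
print(derived)
```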

Files required for FragPipe

Annotation file: This file provides metadata for the samples analyzed. It must be an Excel file with the following required columns:

Column Name	Description
`Condition`	Experimental condition (e.g. Control, Treated). Used for group comparison.
`Sample`	Sample identifier. Must match sample names in the peptide file.
`color`	(Optional) Color associated with the condition. If not present, default colors are assigned.
`batch`	(Optional) Batch ID for batch effect correction. Required if batch correction is enabled.

combined_modified_peptide.tsv: The peptide quantification file must contain raw or normalized intensity values for each sample and peptide. Required columns:

Column Name	Description
`Protein ID`	Protein accession or identifier.
`Protein Description`	Descriptive name of the protein.
`Gene`	Gene symbol.
`Peptide Sequence`	Amino acid sequence of the peptide.
`Assigned Modifications`	Modifications assigned to the peptide.
`Prev AA`	Used to determine tryptic condition.
`Next AA`	Used to determine tryptic condition.
`<Sample> Intensity`	One column per sample named like `<Sample> Intensity`.

3. Example case study - Proteomics from Steger et al. (2016)

Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases

PRIDE: PXD003071.
Des: Steger M, Tonelli F, Ito G, Davies P, Trost M, Vetter M, Wachter S, Lorentzen E, Duddy G, Wilson S, Baptista MA, Fiske BK, Fell MJ, Morrow JA, Reith AD, Alessi DR, Mann M. Phosphoproteomics reveals that Parkinson's disease kinase LRRK2 regulates a subset of Rab GTPases. Elife. 2016 Jan 29;5.

The case study can be downloaded from the Run tab.

4. Details on the output files
- report.html: complete report of the analysis with all figures and the results of the enrichment.
- db_results_proTN.RData: RData object containing all the data produced during the execution, ready for additional analyses in R.
- log_filter_read_function.txt: results of the filters applied during the preprocessing step.
- input_protn folder
  - Input files provided
- rdata folder
  - enrichment.RData: RData object containing enrichment results based on differentially expressed proteins
- Tables folder
  - normalised_abundances.xlsx: Excel file containing abundance values generated by ProTN. Abundances are log2 transformed, normalized, imputed (and batch corrected). The file is organized in the following sheets:
    - protein_per_sample: protein abundances per sample.
    - peptide_per_sample: peptide abundances per sample.
    - protein_per_condition: protein abundances per condition (average & standard deviation), as defined in the Sample Annotation.
    - peptide_per_condition: peptide abundances per condition (average & standard deviation), as defined in the Sample Annotation.
  - differential_expression.xlsx: Excel file containing the results of differential analysis, according to the contrasts defined in the Design file. The file is organized in the following sheets:
    - protein_DE: protein differential expression results.
    - peptide_DE: peptide differential expression results.
    - Annotation columns:
      - Accession: protein UniprotID
      - Description: protein description
      - GeneName: Gene Symbol
      - Peptide_Sequence: peptide sequence
      - Peptide_Modifications: peptide modifications
      - Peptide_Position: start and end position of the peptide within the sequence of the protein defined by the UniprotID
      - Peptide_Tryptic: peptide tryptic digestion status (fully tryptic, N-semi tryptic, C-semi tryptic, non tryptic)
    - Columns for each contrast:
      - class: defined according to the fold change, p-value and abundance thresholds specified in the input
        - +: up-regulated protein/peptide
        - -: down-regulated protein/peptide
        - =: invariant protein/peptide
      - log2_FC: protein/peptide log2 transformed fold change
      - p_val: protein/peptide contrast p-value
      - p_adj: protein/peptide adjusted p-value (FDR after BH correction)
      - log2_expr: protein/peptide log2 average abundance
  - enrichment.xlsx: Excel file containing a selection of enrichment results starting from differentially expressed proteins. Terms are selected according to significance thresholds specified in the input (Default: adj.P.Value < 0.05, Overlap Size >= 5)
- Figures folder
  - PDF version of all selected figures
  - enrichment_plot.pdf: dot plot of top enriched terms based on differentially expressed proteins, divided in up- and down-regulated. Terms are filtered for keywords defined in the advanced options.
  - protein_vulcano directory: directory with all the volcano plots based on the differential proteins.
  - peptide_vulcano directory: directory with all the volcano plots based on the differential peptides.
  - STRINGdb directory: directory with figures from network analysis of differentially expressed proteins, based on STRINGdb protein-protein interactions. For each contrast, two files are generated:
    - *_connection.txt: txt file with all the edges of the network.
    - *_network.pdf: PDF version of the STRINGdb network.
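The protein_per_condition and peptide_per_condition sheets listed above report an average and a standard deviation per condition. A minimal sketch of that summary is given below; the function name `per_condition`, the nested-dictionary layout and the use of the sample standard deviation (n - 1 in the denominator) are assumptions, not details confirmed by ProTN.

```python
import math
from statistics import mean, stdev


def per_condition(abundances, sample_to_condition):
    """Summarize {feature: {sample: value}} as (average, SD) per condition."""
    summary = {}
    for feature, by_sample in abundances.items():
        groups = {}
        for sample, value in by_sample.items():
            groups.setdefault(sample_to_condition[sample], []).append(value)
        summary[feature] = {
            # SD is undefined for a single sample, so NaN is reported.
            cond: (mean(vals), stdev(vals) if len(vals) > 1 else math.nan)
            for cond, vals in groups.items()
        }
    return summary


# Toy log2 abundances for one protein in four samples (hypothetical).
abundances = {"P1": {"s1": 10.0, "s2": 12.0, "s3": 8.0, "s4": 8.0}}
conditions = {"s1": "cancer", "s2": "cancer", "s3": "normal", "s4": "normal"}
summary = per_condition(abundances, conditions)
print(summary)
```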

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Cuklina, Jelena, Chloe H. Lee, Evan G. Williams, Ben Collins, Tatjana Sajic, Patrick Pedrioli, Maria Rodriguez-Martinez, and Ruedi Aebersold. 2018. “Computational Challenges in Biomarker Discovery from High-Throughput Proteomic Data.” https://doi.org/10.3929/ethz-b-000307772.

Jawaid, Wajid. 2022. “enrichR: Provides an R Interface to ’Enrichr’.” https://CRAN.R-project.org/package=enrichR.

Kim, Hani Jieun, Taiyun Kim, Nolan J Hoffman, Di Xiao, David E James, Sean J Humphrey, and Pengyi Yang. 2021. “PhosR Enables Processing and Functional Analysis of Phosphoproteomic Data.” Cell Reports 34. https://doi.org/10.1016/j.celrep.2021.108771.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43: e47. https://doi.org/10.1093/nar/gkv007.

Szklarczyk, Damian, Annika L Gable, Katerina C Nastou, David Lyon, Rebecca Kirsch, Sampo Pyysalo, Nadezhda T Doncheva, et al. 2021. “The STRING Database in 2021: Customizable Protein-Protein Networks, and Functional Characterization of User-Uploaded Gene/Measurement Sets.” Nucleic Acids Research 49.

Zhu, Yafeng. 2022. “DEqMS: A Tool to Perform Statistical Analysis of Differential Protein Expression for Quantitative Proteomics Data.”

PhosProTN is an integrative pipeline for phosphoproteomic analysis of DDA experimental data obtained from MS. It performs a complete analysis of the raw files from Proteome Discoverer (PD) or MaxQuant (MQ), together with their biological interpretation through enrichment and network analysis. PhosProTN analyses the phosphoproteomic data at the peptide level.

1. Workflow of PhosProTN

01. Define analysis settings and load input data files

PhosProTN analyses the results of:

Proteome Discoverer
MaxQuant

Additional details on the input can be found in section "2. Details on the input parameters and files".

02. Normalization and imputation of raw intensities

Imputation

PhosR: Imputation is performed on peptide and protein abundances with the Bioconductor package PhosR. Round imputation is performed in the absence of replicates. PhosProTN uses two PhosR functions for the imputation: one imputes the missing values for a peptide across replicates within a single condition, and the other applies the tail-based imputation approach as implemented in Perseus.
Gaussian Estimation: Imputation is performed on peptide and protein abundances using Gaussian estimation, where missing values are sampled from a normal distribution defined by the mean and standard deviation of observed intensities. This preserves data variance and reduces bias from missingness within conditions.
missForest: Imputation is performed on peptide and protein abundances using the missForest R package, which applies a non-parametric random forest algorithm to predict missing values. This approach captures nonlinear relationships between features, preserving complex data structures.
pcaMethods: Imputation is performed on peptide and protein abundances using the pcaMethods R package with the svdImpute function, which estimates missing values by reconstructing the data matrix from its leading singular vectors. This approach leverages global correlation structures to provide consistent estimates.

Method	Package	Main Idea	Typical Use
PhosR imputation	PhosR (Bioconductor)	Designed for phosphoproteomics; implements round-robin (iterative imputation without replicates) and paired tail-based imputation (using replicate structure).	Phosphoproteomics with sparse coverage or missingness tied to peptide abundance.
Gaussian estimation imputation		Draws values from a Gaussian distribution defined by low-intensity tail of observed data (shifted mean, reduced SD).	When MNAR (Missing Not At Random) is likely due to detection limit.
missForest	missForest (R)	Non-parametric iterative imputation using random forests. Predicts missing entries using nonlinear relationships between features.	General proteomics where missingness relates to multiple covariates, or when structure is complex.
pcaMethods `svdImpute`	pcaMethods (Bioconductor)	Uses Singular Value Decomposition to reconstruct missing values from lower-rank structure in the data.	Well-replicated datasets with high correlation between samples.

In this step, many figures can be generated with information about preprocessing, normalization and imputation. An example from the case study is the PCA based on the protein abundances.