Impact on Proteotype: Pipeline STEP2

STEP 2

Normalization/batch-effect check-up and Module Normalization

In this step, datasets are checked for possible normalization issues and batch effects on a one-to-one basis. Batch effects are assessed by testing each sample's distribution for normality with the Shapiro-Wilk test and by checking whether the center of each distribution is similar to that of the other samples. Most datasets did not require any additional normalization (the pre-processing steps used in the respective publications are described in Supplementary Table S1); only Geiger et al. (2012) and Guo et al. (2012) required additional quantile normalization to account for slight sample deviations from a normal distribution.
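The checks described above can be sketched roughly as follows. This is a minimal illustration, not the code in wp_step2_code.py: the function names, the proteins-by-samples matrix layout, the significance level `alpha`, and the tolerance `center_tol` are all assumptions made for the example. It uses `scipy.stats.shapiro` for the per-sample normality test, compares each sample's median to the median of all sample medians, and applies a basic quantile normalization that assumes a complete matrix (no missing values) and breaks ties arbitrarily.

```python
import numpy as np
from scipy import stats


def check_samples(X, alpha=0.05, center_tol=0.5):
    """Per-sample normality and center check.

    X is assumed to be a proteins x samples array of log-scale abundances,
    with NaN marking missing values. alpha and center_tol are illustrative
    thresholds, not values taken from the pipeline.
    """
    medians = np.nanmedian(X, axis=0)
    global_center = np.median(medians)
    report = []
    for j in range(X.shape[1]):
        col = X[:, j]
        col = col[~np.isnan(col)]
        _, p_value = stats.shapiro(col)  # Shapiro-Wilk test for normality
        shift = float(medians[j] - global_center)
        report.append({
            "sample": j,
            "shapiro_p": float(p_value),
            "non_normal": bool(p_value < alpha),
            "center_shift": shift,
            "off_center": bool(abs(shift) > center_tol),
        })
    return report


def quantile_normalize(X):
    """Basic quantile normalization of a complete matrix (no NaN)."""
    order = np.argsort(X, axis=0)
    ranks = np.argsort(order, axis=0)                # rank of each value within its column
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)  # average value at each rank
    return mean_quantiles[ranks]
```

After `quantile_normalize`, every sample shares the same set of values, so a sample flagged as `off_center` by `check_samples` is no longer shifted relative to the others.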

This step additionally involves normalizing the abundances of complex-associated subunits to the trimmed mean of the complexes (as previously described in Ori et al., 2013; Ori et al., 2016). Briefly, proteins belonging to the same complex were normalized by the respective trimmed mean (or interquartile mean) of the complex subunits across all individuals/samples. For proteins involved in multiple complexes, the average value from all the corresponding complexes was taken into account. Given the complex-normalized abundances, the variance of each subunit in a given complex was calculated. The output directory of this step summarizes these results.
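As a rough illustration of the complex-based normalization, the sketch below makes several assumptions that are not stated on this page: abundances are on a log scale (so normalization is a subtraction), the interquartile mean is computed per sample across the subunits of a complex, and the function and variable names are hypothetical. It is not the code in wp_step2_code.py.

```python
import numpy as np


def interquartile_mean(values):
    """Mean of the middle 50% of the non-missing values (25% trimmed per tail)."""
    v = np.sort(values[~np.isnan(values)])
    if v.size == 0:
        return np.nan
    k = int(0.25 * v.size)
    return float(v[k:v.size - k].mean())


def complex_normalize(abund, complexes):
    """Normalize subunits to the interquartile mean of their complex.

    abund:     dict protein -> per-sample log abundances (assumed layout)
    complexes: dict complex name -> list of member proteins (assumed layout)
    """
    normalized = {}
    for cname, members in complexes.items():
        members = [p for p in members if p in abund]
        if len(members) < 2:
            continue  # a single quantified subunit cannot be normalized to a complex
        M = np.vstack([np.asarray(abund[p], dtype=float) for p in members])
        # Reference level of the complex in each sample
        ref = np.array([interquartile_mean(M[:, j]) for j in range(M.shape[1])])
        for p, row in zip(members, M):
            normalized[(cname, p)] = row - ref
    # Variance of each subunit within a given complex, across samples
    variance = {key: float(np.nanvar(v, ddof=1)) for key, v in normalized.items()}
    # Proteins in several complexes: average over all their complexes
    by_protein = {}
    for (cname, p), v in normalized.items():
        by_protein.setdefault(p, []).append(v)
    averaged = {p: np.nanmean(np.vstack(vs), axis=0) for p, vs in by_protein.items()}
    return normalized, variance, averaged
```

Under these assumptions, a subunit whose abundance tracks its complex perfectly across samples gets a variance of zero, while subunits that vary independently of the rest of the complex get a larger variance.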

Data/Code Requirements for downloading

dataset_battle_protein_remapped.tsv.gz (9MB)

proteomics data from Battle et al. (2015), Science (Human Individuals)

dataset_battle_ribo_remapped.tsv.gz (27MB)

ribosome profiling data from Battle et al. (2015), Science (Human Individuals)

dataset_battle_rna_remapped.tsv.gz (28MB)

RNAseq data from Battle et al. (2015), Science (Human Individuals)

dataset_gygi1_remapped.tsv.gz (9MB)

proteomics data from Chick et al. (2016), Nature (Founder Mouse strains, MS-proteomics)

dataset_gygi2_remapped.tsv.gz (25MB)

RNAseq data from Chick et al. (2016), Nature (DO Mouse strains, RNAseq)

dataset_gygi3_remapped.tsv.gz (12.5MB)

proteomics data from Chick et al. (2016), Nature (DO Mouse strains, MS-proteomics)

dataset_mann_all_log2_remapped.tsv.gz (15MB)

proteomics data from Geiger et al. (2012), Mol Cell Proteomics (Human Cell Types)

dataset_tiannan_remapped.tsv.gz (4.8MB)

proteomics data from Guo et al. (2012), Nature Medicine (Human Kidney Cells)

dataset_tcga_breast_remapped.tsv.gz (9.6MB)

proteomics data from Mertins et al. (2016), Nature (TCGA Breast Cancer)

dataset_coloCa_remapped.tsv.gz (4.3MB)

proteomics data from Roumeliotis et al. (2017), Cell (TCGA Colorectal Cancer)

dataset_tcga_ovarian_remapped.tsv.gz (11MB)

proteomics data from Zhang et al. (2016), Cell (TCGA Ovarian Cancer)

dataset_bxdMouse_remapped.tsv.gz (2MB)

proteomics data from Williams et al. (2016), Science (BXD Mouse Strains)

Download all input data for this step here (186MB)

wp_step2_code.py

Python code required for checking the normalization of datasets and for complex-based normalization.

complex_filtered.zip (41MB)

Result files after filtering for complex-related proteins only.

complex_stoichiometry.zip (52.6MB)

Result files after filtering for complex-related proteins normalized to complex.
