Introduction

dbSAP is a database of human protein variations inferred from single nucleotide polymorphisms (SNPs) and genomic mutations. Millions of human genetic variations have been identified using high-throughput sequencing technologies. Genetic variations can be strongly correlated with phenotypic variations, including diseases. Variations located in coding regions can alter the encoded amino acids, producing non-synonymous substitutions in the protein sequence known as single amino acid polymorphisms (SAPs). Although several studies have investigated human SAPs, each has detected only a small fraction of them, owing to incomplete inferred protein variation databases and the low coverage of mass spectrometry experiments.
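As an illustration of how a coding-region SNP can give rise to a SAP, the following minimal sketch translates a reference codon and its mutated form with the standard genetic code and reports whether the amino acid changes. The function name classify_snp and its return labels are hypothetical and chosen for this example only; they are not part of dbSAP.

```python
# Standard genetic code, built from the canonical TCAG base ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(codon, position, alt_base):
    """Apply a single-base change to a codon and classify its effect.

    codon: three-letter DNA codon on the coding strand (e.g. "GAG")
    position: 0-based index of the changed base within the codon
    alt_base: the substituted base
    Returns (reference amino acid, altered amino acid, label).
    """
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        label = "synonymous"
    elif alt_aa == "*":
        label = "nonsense"  # premature stop codon, not a SAP
    else:
        label = "SAP"  # non-synonymous single amino acid change
    return ref_aa, alt_aa, label
```

For example, changing the middle base of GAG to T yields GTG, which replaces glutamate (E) with valine (V) and is therefore labeled as a SAP, whereas changing the last base of GAG to A leaves the amino acid unchanged.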

To increase the spectral usage of mass spectrometry (MS) data and facilitate the detection of protein variations, we first built a comprehensive variation-containing database by integrating human SNPs and mutations from eight distinct databases: UniProt (04/16/2014), PMD (05/26/2007), HPMD (2012, the latest version), MS-CanProVar (corresponding to Ensembl V54, the latest version), MSIPI (v3.67), COSMIC (v68), dbSNP (dbsnp_138.hg19) and Ensembl (1000 Genomes and HapMap, version 74). We then constructed a workflow to identify variant peptides and their associated proteins from a large collection of proteomic mass spectrometry data (11,865 experiments) covering various cancers, normal tissues and cell lines, all collected from public databases. After a series of strict quality control steps (global FDR ≤ 0.01, group FDR ≤ 0.01), a total of 16,854 unique variant peptides supported by 439,537 unique spectra were identified. By integrating the information and relationships among peptides, proteins, genes, diseases and drug targets, we provide a convenient and comprehensive database resource, dbSAP, to facilitate diverse studies of human proteins.
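The two-level FDR control can be sketched as follows. This is only an illustration under stated assumptions: the text does not describe how FDR was estimated, so the sketch assumes a target-decoy strategy (FDR approximated as decoys/targets above a score cutoff) and assumes that "group" means a class of peptide-spectrum matches such as variant versus reference peptides. The field names score, is_decoy and group, and both function names, are hypothetical.

```python
from collections import defaultdict


def score_cutoff(psms, max_fdr):
    """Lowest score at which decoys/targets among higher-ranked PSMs stays <= max_fdr.

    Returns infinity when no cutoff satisfies the threshold.
    """
    ranked = sorted(psms, key=lambda p: p["score"], reverse=True)
    targets = decoys = 0
    cutoff = float("inf")
    for psm in ranked:
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = psm["score"]
    return cutoff


def filter_psms(psms, max_fdr=0.01):
    """Keep target PSMs that pass both the global and the per-group cutoff."""
    global_cut = score_cutoff(psms, max_fdr)
    by_group = defaultdict(list)
    for psm in psms:
        by_group[psm["group"]].append(psm)
    kept = []
    for members in by_group.values():
        # A PSM must satisfy the stricter of the two cutoffs.
        cut = max(global_cut, score_cutoff(members, max_fdr))
        kept.extend(
            p for p in members if not p["is_decoy"] and p["score"] >= cut
        )
    return kept
```

Estimating the cutoff separately for each group matters because variant peptides are typically far fewer than reference peptides, so a cutoff that controls the global FDR can still admit a much higher error rate within the variant group.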

Analysis Pipeline
All MS data were searched against the theoretical SAP database using the Mascot database search algorithm. Trypsin was specified as the proteolytic enzyme, and up to two missed cleavages were allowed. Charge states of +2, +3 and +4 were enabled for parent ions. The error window was set to ± 20 ppm for experimental peptide mass values and ± 0.5 Da for MS/MS fragment ions.
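To make these search parameters concrete, the sketch below shows what they mean in practice: in silico tryptic digestion with up to two missed cleavages, the m/z of a peptide at charge states +2 to +4, and a ± 20 ppm mass window check. It is not Mascot's implementation. The cleavage rule (after K or R, but not before P) is a common convention assumed here, the ppm error is assumed to be computed relative to the theoretical mass, and all function names are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in Da


def tryptic_peptides(sequence, max_missed=2):
    """Digest a protein sequence, allowing up to max_missed missed cleavages."""
    # Cleavage sites: after K or R, unless the next residue is P (assumed rule).
    sites = [
        i + 1
        for i, aa in enumerate(sequence[:-1])
        if aa in "KR" and sequence[i + 1] != "P"
    ]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        # Joining k + 1 adjacent fragments corresponds to k missed cleavages.
        last = min(start + max_missed + 2, len(bounds))
        for end in range(start + 1, last):
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


def precursor_mz(neutral_mass, charge):
    """m/z of a peptide with the given neutral mass and positive charge."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def within_ppm(observed_mass, theoretical_mass, tol_ppm=20.0):
    """True if the mass error is within tol_ppm parts per million."""
    error_ppm = abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return error_ppm <= tol_ppm


def within_da(observed_mz, theoretical_mz, tol_da=0.5):
    """True if a fragment ion m/z is within an absolute tolerance in Da."""
    return abs(observed_mz - theoretical_mz) <= tol_da
```

For instance, tryptic_peptides("AKCRDKE") returns the fully cleaved fragments AK, CR, DK and E together with the longer peptides that span one or two missed cleavage sites, and a peptide of 1000 Da observed at 1000.015 Da (15 ppm error) falls inside the 20 ppm window.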