Ever since I moved into proteomics and started my bioinformatics career at the PRoteomics IDEntifications database at the European Bioinformatics Institute in 2010, I’ve been thinking about what personal proteomics (pp from now on) would mean and what it will look like in the future. By future I meant a couple of years at the beginning.
But then in early June 2011 I read about the $1000 Proteome service offered by the company Bioproximity (found via Twitter), and suddenly the future meant just a couple of months, because $1000 was close enough to overcome a big psychological limit for an early adopter like me.
Indeed, within a month I was able to check the mass spectra, the matched peptides and the (many times incorrectly) inferred proteins from my saliva, stored on an Amazon cluster somewhere and displayed on my smartphone, on the shuttle on my way to the Genome Campus. That’s what I call a proper 2011 early adopter experience.
For this to happen I needed a shortcut, and it was the fact that Bioproximity (and its founder Brian Balgley) had similar ideas on pp; more importantly, they were doing this in practice with a high-throughput, state-of-the-art LTQ Velos instrument. Thanks to these similar ideas, and because they already knew of me via my earlier blog, they said yes to my offer to venture into pp. We agreed on doing a 6-month longitudinal study to catch the dynamics too.
After some thinking we decided that the best way to deliver my saliva sample to the US was for me to spit into 10 ml Falcon tubes and FedEx them on dry ice. This is not perfectly user friendly, but it is more feasible than prepping the sample yourself in a rented lab before shipping it. Still, I needed a lab where I could recycle a freshly arrived styrofoam box with the dry ice in it. Since I worked as a wet lab guy at the Cambridge Vet School in the Franklin Lab, my friend Jeff Huang alerted me when they had some boxes available for recycling. (Many thanks, Jeff and Robin.)
Bioproximity also set up Proteome Cluster on the Amazon Cloud for searching tandem mass spec data with different search engines. In current mass spectrometry proteomics, what is really detected is peptides that are matched to mass spectra, not proteins. Proteins are only inferred later, with much less confidence. The searches are performed against a protein ‘database’, which is just a fancy name for assembled FASTA files containing particular lists of proteins (proteomes).
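To make the "a database is just a FASTA file" point concrete, here is a minimal sketch of reading such a file into a dictionary. The two records, their accessions and their sequences are invented for illustration, and taking the first token of the header as the accession is my simplifying assumption; real databases use richer header conventions.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into {accession: sequence}."""
    proteins = {}
    accession = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                proteins[accession] = "".join(chunks)
            tokens = line[1:].split()
            # Assumption: the first whitespace-delimited token is the accession.
            accession = tokens[0] if tokens else ""
            chunks = []
        else:
            chunks.append(line)
    if accession is not None:
        proteins[accession] = "".join(chunks)
    return proteins


# Hypothetical two-protein "database"; accessions and sequences are invented.
EXAMPLE_FASTA = """>PROT_A example human protein
MKTAYIAKQR
QISFVKSHFS
>PROT_B example bacterial protein
MSLLTEVETY
"""

proteins = read_fasta(io.StringIO(EXAMPLE_FASTA))
for accession, sequence in proteins.items():
    print(accession, len(sequence))
```

The same function should work on a real file opened with `open(path)`, since it only needs something that yields lines of text.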
Scored peptide identifications (peptide-to-spectrum matches, PSMs) with expectation values and additional scores: that’s what a search engine usually delivers, and the more search engines used, the better the coverage and filtering, so the quality of the identifications can improve. (More on this later.) Bioproximity currently supports 3 different search algorithms: native and k-score X!Tandem, and OMSSA, all based on open source code.
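As a rough illustration of what a list of PSMs looks like and why several engines can help, here is a small sketch. The record layout, the 0.01 expectation value cutoff and the idea of counting which engines agree on a match are my assumptions for the example, not a description of how Proteome Cluster combines results; the spectrum ids and peptides are made up.

```python
from collections import defaultdict
from typing import NamedTuple


class PSM(NamedTuple):
    engine: str        # which search engine produced the match
    spectrum_id: str   # identifier of the MS/MS spectrum
    peptide: str       # matched peptide sequence
    e_value: float     # expectation value; lower means more confident


def confident(psms, max_e_value=0.01):
    """Keep only PSMs at or below the (assumed) expectation value cutoff."""
    return [p for p in psms if p.e_value <= max_e_value]


def engine_support(psms):
    """Map (spectrum_id, peptide) to the set of engines reporting that match."""
    support = defaultdict(set)
    for p in psms:
        support[(p.spectrum_id, p.peptide)].add(p.engine)
    return dict(support)


# Invented example data: two engines, three spectra.
psms = [
    PSM("xtandem", "scan_1", "PEPTIDEK", 1e-4),
    PSM("omssa", "scan_1", "PEPTIDEK", 5e-3),
    PSM("xtandem", "scan_2", "SAMPLER", 0.5),
    PSM("omssa", "scan_3", "ANQTHERK", 2e-3),
]

support = engine_support(confident(psms))
for (spectrum_id, peptide), engines in sorted(support.items()):
    print(spectrum_id, peptide, sorted(engines))
```

A match reported by both engines (scan_1 here) is a candidate for higher trust, while a match seen by only one engine (scan_3) still adds coverage.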
Our first question was about the basic bioinformatics of personal proteomics: what is the pipeline, how manageable are the resulting spectra files and searches, and how can the data be presented?
Hereby I’d like to announce the release of 3 mass spectrometry proteomics datasets/runs and the associated searches, which are publicly available at https://www.proteomecluster.com/, public login as email@example.com, password as guest2012
Once logged in go to https://www.proteomecluster.com/searches
From here you can download the actual spectra files by clicking on runs (for instance 11_407.mgf.gz; MGF is the Mascot Generic Format for storing MS/MS spectra) and start searching them yourself, but my recommendation is to click the search IDs first (the 259 link works once you are logged in as firstname.lastname@example.org) so you will see:
Here you can choose between 2 views, View: MSAugury – Proteomics Data Viewer and GPM View. The difference is that the former works for displaying both OMSSA and X!Tandem results, while GPM View is only active for X!Tandem results. If you wish to download the actual search result files, go grab output.zip. Otherwise let’s head towards GPM View, and that’s where the juicy stuff begins:
The MS/MS runs were searched against a giant oral microbiome, a human proteome and a common contamination library. So far I have sent 3 samples to Bioproximity: 11_296.mgf: 2011/07/05, 11_320.mgf: 2011/07/28, 11_407.mgf: 2011/10/20.
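If you download one of the runs, the MGF files are plain text: each spectrum sits between BEGIN IONS and END IONS, with KEY=VALUE header lines followed by m/z and intensity pairs. Here is a minimal reader sketch; the two spectra below are invented, and the parser ignores anything beyond the first two columns of a peak line, which is a simplification on my side.

```python
import gzip
import io


def read_mgf(handle):
    """Yield one dict per spectrum: header params plus a list of (m/z, intensity)."""
    spectrum = None
    for line in handle:
        line = line.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                # Peak line: m/z, intensity (extra columns are ignored)
                fields = line.split()
                spectrum["peaks"].append((float(fields[0]), float(fields[1])))


# Invented example content, compressed in memory to mimic a .mgf.gz download.
EXAMPLE_MGF = """BEGIN IONS
TITLE=example spectrum 1
PEPMASS=503.27 12000
CHARGE=2+
175.119 340.0
288.203 1250.5
END IONS
BEGIN IONS
TITLE=example spectrum 2
PEPMASS=611.80
CHARGE=3+
147.113 90.0
END IONS
"""

compressed = gzip.compress(EXAMPLE_MGF.encode("ascii"))
with gzip.open(io.BytesIO(compressed), "rt") as handle:
    spectra = list(read_mgf(handle))

for s in spectra:
    print(s["params"]["TITLE"], len(s["peaks"]), "peaks")
```

For a downloaded run, replacing `io.BytesIO(compressed)` with the file name (for instance `"11_407.mgf.gz"`) should be enough, since `gzip.open` accepts a path as well.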
When you start checking the actual protein hits you will soon encounter one of the biggest problems of current mass spec based proteomics, called protein inference. Briefly: the same peptide can be shared between different proteins, but since peptides are what get matched to mass spectra, in the case of shared peptides we cannot be sure that the inferred proteins were really present in the original sample. Or as formulated by Nesvizhskii and Aebersold:
The difficulty of assembling peptide identifications back to the protein level results from the same factors that made shotgun proteomics approach so successful in the first place, i.e., protein digestion at an early stage of the process and elimination of extensive separation at the protein level. Protein digestion makes peptides, and not the proteins, the currency of the method, and the connectivity between peptides and proteins is lost at the digestion stage. This loss of connectivity complicates computational analysis and biological interpretation of the data. The same peptide sequence can be present in multiple different proteins. Therefore, the identification of such shared peptides can lead to ambiguities in the determination of the identities of the sample protein.
So, for instance, the YTLAGTEVSALLGR peptide has been detected many, many times by the search engine and it maps to > 50 protein accessions. Among those protein hits the following species are represented, among others: Eikenella corrodens, Escherichia coli and Yersinia pestis. Guess which one of them lives in my mouth? One candidate to start with might be the primary accession chosen by X!Tandem, and that is Eikenella corrodens, which is a mostly innocent commensal of the human mouth. You can check the primary and secondary protein accessions by clicking on ‘group’ in GPM View.
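The shared peptide problem can be reproduced in a few lines. The sketch below digests three toy proteins with a simplified trypsin rule (cut after K or R unless the next residue is P, no missed cleavages) and then lists which peptides map to more than one protein. The toy accessions and sequences are invented; I only embedded the YTLAGTEVSALLGR peptide in two of them to mimic the situation above, so they are not real database entries.

```python
from collections import defaultdict


def tryptic_peptides(sequence):
    """Simplified trypsin digestion: cleave after K/R unless followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing peptide that does not end in K or R
        peptides.append(sequence[start:])
    return peptides


def peptide_to_proteins(proteins):
    """Map each peptide to the set of protein accessions that contain it."""
    mapping = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            mapping[peptide].add(accession)
    return dict(mapping)


# Invented toy proteins; only the shared peptide comes from the post.
toy_proteins = {
    "toy_protein_A": "MSTKYTLAGTEVSALLGRAAELK",
    "toy_protein_B": "MGGRYTLAGTEVSALLGRVVDPK",
    "toy_protein_C": "MQQKLLNEWR",
}

mapping = peptide_to_proteins(toy_proteins)
shared = {pep: accs for pep, accs in mapping.items() if len(accs) > 1}
unique = {pep: accs for pep, accs in mapping.items() if len(accs) == 1}

for peptide, accessions in shared.items():
    print(peptide, "is shared by", sorted(accessions))
```

If only the shared peptide is observed, the data cannot distinguish toy_protein_A from toy_protein_B; a unique peptide such as AAELK would be needed to support toy_protein_A specifically.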
In order to make the data more transparent, all the spectra information along with the peptide assignments and protein identifications made by the X!Tandem search engine were submitted to the PRoteomics IDEntifications (PRIDE) public repository, and right now my colleague @ruiwang_cn and I are in the process of making the data publicly available. I’ll update the post once the PRIDE representation of the data is publicly available.
Public experiment accessions: 22142, 22143 and 22144.
Update: you can grab the data via PRIDE public ftp: ftp://ftp.pride.ebi.ac.uk/2012/03/PXD000002/
I think what we now have on Proteome Cluster does answer our first question, namely figuring out the basic bioinformatics of personal proteomics, n=1.
Our second question is really where the biological journey begins: what does all this data mean, and how can the results be interpreted in the context of one individual? I’m not even starting to raise the different interesting and important questions here. The little side note on protein inference was already venturing into the realm of question 2.
There’s also a heavy educational component here: even the general bioinformatics and biology audience knows little about mass spectrometry based proteomics. I have only 2 years’ work experience as a service (but not research) proteomics bioinformatician; the folks at Bioproximity have substantially more.
The plan is to go into details in subsequent blog posts. I disabled the comments section here because we prefer to answer questions on Twitter. Look for @bioproximity and @attilacsordas.
Let the transparent journey of personal proteomics begin!