GenomeWeb/ProteoMonitor story on the Personal Proteomics project

Hereby I post the full ProteoMonitor story on our project with the permission of Adam Bonislawski in case you missed it (I will also publish the email interview between Adam and me that provided raw material with more details):

EBI Bioinformatician Launches Personal Proteomics Project Tracking His Salivary Proteome Over Time

April 20, 2012

Last month Stanford University researcher Mike Snyder caused something of a stir by publishing in Cell initial results from an ongoing study tracking his “Personal Omics Profile” – a collection of data including Snyder’s genome sequence as well as multiple transcriptomic, proteomic, metabolomic, and autoantibody profiles taken over a period of 14 months (GWDN 3/15/2012).

Snyder, though, isn’t the only scientist serving as a subject in a personal ‘omics project.

Last July, Attila Csordas, a bioinformatician at the European Bioinformatics Institute’s Proteomics Identifications Database – PRIDE – launched what he has termed his Personal Proteome project, an effort to measure and track changes in his salivary proteome over time.

Collaborating with Bioproximity, a Springfield, Va.-based contract research organization that specializes in proteomics, Csordas has thus far collected salivary samples at four time points, three of which he and the company have analyzed via mass spec, confidently identifying roughly 1,000 proteins in each. He has made the data from these runs publicly available on Bioproximity’s Proteome Cluster and on the PRIDE database and is following the effort’s progress on his Personal Proteomics blog.

While the notion of “personal proteomics” is obviously interesting for its potential clinical implications, at this point the project is primarily an exploratory effort, Csordas said.

“At this stage this is strictly a research project,” he told ProteoMonitor in an e-mail this week. “We are trying to figure out the basic bioinformatics of personal proteomics and the different ways analysis can go with this type of data, used [either] by itself or mapped to other types of ‘omics data.”

“We are pushing the early-adopter limits to see what can be achieved with mass spectrometry in the context of individual human proteomes and whether the technology is [able to] finally [provide] valuable information at a reasonably good cost/benefit ratio,” he said.

The project was initially sparked, Csordas said, by a promotion he came across on Bioproximity’s website in which the company offered global shotgun proteomic analysis of any sample type for a flat rate of $1,000.

Interested, Csordas emailed Bioproximity CEO Brian Balgley to see if he might want to collaborate on an open-ended, longitudinal analysis of his personal proteome. Balgley agreed that the project sounded interesting, and several weeks later Csordas packed a saliva sample on dry ice and mailed it via FedEx from his home in Cambridge, UK, to Bioproximity’s Virginia laboratory.

“We’ve done three time points so far, and we’ll continue doing it as long as we’re both willing to do so,” Balgley told ProteoMonitor. “[Bioproximity is] certainly willing to keep doing it. It’s an interesting project, and there’s a lot more to be learned.”

The company, Balgley said, provides the analyses free of charge as part of the collaboration.

The researchers are using a MudPIT set-up run on a Thermo Scientific LTQ Velos – a workflow that, Csordas said “is fairly established and [which Bioproximity’s] lab has been using for years.” This, he noted, has meant that data analysis – rather than data acquisition – has proven the most challenging aspect of the work.

“The challenges really have been presented at the bioinformatics end,” he said. “What search engines can be used with what kind of search parameters? How do you balance between specificity and sensitivity, coverage, and repeatability over time?”

In addition, because Csordas is using saliva samples, the proteomics of his mouth’s microbial population also comes into play. This offers potentially interesting insights into the oral microbiome, but it also significantly increases the size of the peptide library the researchers must search against, he said, noting that the microbiome FASTA file they’ve used contains more than 3.7 million proteins – and that’s before including the human library and a library of common contaminants.

Right now, Balgley said, the researchers are primarily working on building a baseline against which they can gauge observed protein expression differences.

“We’re basically trying to figure out what [proteins are] observable and what are the best ways of observing those proteins,” he said. “We’re looking at variations we’re getting in our measurements and trying to see how much of that is due to differences in sample collection and how much is due to other factors we might not have considered.”

“There are some significant variations in the amount of the proteins we’ve seen,” Balgley said, noting that this could depend on a variety of factors, including “when [a sample] is collected during the day; how long it’s been since [Csordas] has eaten; whether he’s brushed his teeth or not.”

“I know one of the time points we got was after he had visited the dentist. [For] one of the time points he was recovering from a cold. So we do see a lot of variation,” he said.

The researchers are also looking into various sample-prep and enrichment strategies to see if changes to these might help them better explore different parts of Csordas’ proteome.

“We’re looking, [for instance,] at enriching for exosomes, glycoproteins, and what kind of differences we see in protein populations based on that,” Balgley said. “We’re looking at what the differences are between a 30-minute assay, a two-hour assay, a 12-hour assay. We’re looking at how isoelectric focusing can help us improve the confidence of the proteins identified, especially for the microbial proteins.”

The researchers are sharing all data from the project on Bioproximity’s Proteome Cluster tool and the PRIDE database (available here) and, Csordas said, aim to “make the data as transparent as is possible.”

“It’s one thing to have some data available for download from a study on some servers with minimal metadata,” he said. Ideally, though, researchers would offer “the raw files, search engine output files, peak list files, search parameter/configuration files, FASTA files, and metadata information all mapped together and made accessible for re-analyzing and further data mining.

“With the combined Proteome Cluster-PRIDE release of the data, we are pretty close to this latter scenario,” he said.

Moving forward, the researchers intend to continue adding new time points in hopes of establishing “a solid bioinformatic foundation” for their analysis, Csordas said.

They are also working to present the project’s findings in papers and at conferences and recently had a poster on the work accepted to the Exploring Human Host-Microbiome Interactions in Health and Disease conference taking place in Cambridge in May.

The project has also picked up support, Csordas said, from some of his UK colleagues, including EBI bioinformaticians Henning Hermjakob, head of the Proteomics Services Team, and Johannes Griss, both of whom have helped with some of the bioinformatics work; and Cambridge Veterinary School researchers Jeff Huang and Robin Franklin, who have contributed supplies to the effort.


Proteome Cluster at AWS Big Data event

Join us for a presentation on Proteome Cluster (account:, password: guest2012) at Amazon’s Big Data and HPC in the Cloud event next Friday, April 27 in Boston. We will discuss the challenge of sharing and reproducibly analyzing large shotgun proteomic datasets and how we use high performance cloud computing to address these issues.

The agenda:

8:30am to 9:30 Registration, Expo and Continental Breakfast (John Hancock Hall)
9:30am to 10:00 Welcome and Keynote – Matt Wood, PhD, AWS Product Manager for Big Data & HPC and John Rauser, Amazon Data Scientist
10:00am to 10:50 Using Elastic MapReduce (EMR) for Big Data – Peter Sirota, General Manager, Amazon Elastic MapReduce with AWS customers:
A data scientist from Airbnb and
Jeff Sternberg from CapitalIQ
10:50am to 11:40 Using Amazon EC2 for HPC – Deepak Singh, PhD, Principal Product Manager, EC2 with AWS customers:
Justin Riley from MIT,
Brian Balgley from Bioproximity and
11:40am to 11:45 Closing remarks – Matt Wood, PhD, AWS Product Manager for Big Data & HPC

Register here:

Personal Proteomics poster accepted for Exploring Human Host-Microbiome Interactions conference

First steps on a road: Our poster got accepted for the Exploring Human Host-Microbiome Interactions in Health and Disease Conference, 8-10 May 2012 taking place at Wellcome Trust Conference Centre in the Genome Campus, roughly 2 min walk from my workplace at the EBI. Draft schedule is here. An earlier version of the abstract I dug up:

Analyzing the personal oral microbioproteome and human saliva proteome via mass spectrometry proteomics

Attila Csordas1, Johannes Griss1, Henning Hermjakob1, Brian Balgley2

1 European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK; 2Bioproximity, LLC, Springfield, Virginia, USA

Collecting and interpreting the human saliva microbioproteome of diseased and healthy human individuals over different time points via high-throughput methods might be a routine practice in the near future. Here we present a pipeline specifically designed for personal proteomics. Human saliva samples from a 36-year-old individual were analyzed using shotgun tandem mass spectrometry (MS) over 3 time points in 2011. MS data were searched against an oral microbiome, a human proteome and a contamination library on Proteome Cluster running on the Amazon Cloud. The complete MS data as well as the search engine results can be viewed and downloaded from Proteome Cluster. In order to make the data more transparent, all the spectra information along with peptide assignments and protein identifications made by the X!Tandem search engine were submitted and made publicly available via the PRoteomics IDEntifications (PRIDE) public repository. Stringent score thresholds were applied. At a 0.001 peptide expectation score cutoff and requiring at least 2 peptides per protein the global false discovery rate was less than 1 percent, yielding ~4000 protein accessions per MS/MS run. Human proteins accounted for ~40% while oral microbiome proteins covered ~60% of the protein accessions. Concerning the oral microbiome, all 6 major Bacteria phyla containing ~96% of the oral taxa were represented.
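The two thresholds named in the abstract (a peptide expectation cutoff and a minimum number of peptides per protein) can be illustrated with a small sketch. This is not the pipeline's actual code: the function name, the tuple layout and the example values are made up, and treating the cutoff as "less than or equal to" and counting distinct peptide sequences are assumptions.

```python
from collections import defaultdict

def filter_proteins(psms, max_expect=0.001, min_peptides=2):
    """Keep proteins supported by at least min_peptides distinct peptides
    whose expectation value passes the cutoff (assumed: expect <= cutoff)."""
    peptides_by_protein = defaultdict(set)
    for protein, peptide, expect in psms:
        if expect <= max_expect:
            peptides_by_protein[protein].add(peptide)
    return {
        protein: sorted(peptides)
        for protein, peptides in peptides_by_protein.items()
        if len(peptides) >= min_peptides
    }

# Hypothetical PSMs: (protein accession, peptide sequence, expectation value)
example_psms = [
    ("PROT_A", "PEPTIDEK", 1e-5),
    ("PROT_A", "SAMPLER", 5e-4),
    ("PROT_A", "WEAKHITK", 0.05),   # fails the expectation cutoff
    ("PROT_B", "LONELYK", 1e-6),
    ("PROT_B", "LONELYK", 1e-4),    # same peptide seen twice counts once
]

accepted = filter_proteins(example_psms)
print(accepted)
```

In this toy input only PROT_A survives, because PROT_B is backed by a single distinct peptide. Estimating the false discovery rate itself needs decoy or reversed sequences and is not covered here.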

Accelerating scanning speed of mass spectrometers over time: looking good for personal proteomics!

The main driver of biological research today is genome sequencing and consequently the mainstream type of bioinformatics deals with genomics/sequencing. Personal genomics is already moving away from microarray genotyping platforms to sequencing methods. The possibility and potential growth of personal proteomics depends on many things, and today I will talk only about the high-throughput sequencing capacity of mass spec machines, in the form of a semi-log plot. I won’t talk about things like sensitivity, mass range, isolation/trapping efficiency, and cost. The idea of the plot comes from the famous …

Comparing the personal proteomics project to the ‘Snyderome’

Last week was coming-out time for personal proteomics: not only have my combined and partial saliva proteome and microbioproteome results been released and the data made available, but an interesting paper including individual proteomics data from Michael Snyder has also been published and announced as part of the much bigger Molecular Omics Profiling Project (iPOP approach).

I thought it would make sense to briefly compare our methods/data/context with those of the Snyder study, partially focusing on its proteomics component.

The basic similarity is that in both projects …

Toward Personal Proteomics

Last week Attila Csordas of EBI announced the initial results of our collaboration focused on developing a catalog of and measuring changes in his salivary proteome. We co-released our initial results publicly on PRIDE, EBI’s proteomic data repository, and Proteome Cluster, our pipeline platform for proteomic data analysis. Here we lay out our motivations and goals for undertaking this collaboration.

Personal proteomics – reality check on Bioproximity’s Proteome Cluster, early 2012

Ever since I moved into proteomics and started my bioinformatics career at the PRoteomics IDEntifications database at the European Bioinformatics Institute in 2010, I’ve been thinking about what personal proteomics (pp from now on) would mean and what it will look like in the future. By future I meant a couple of years at the beginning.

But then in early June 2011 I read about the $1000 Proteome service offered by the company Bioproximity (found via Twitter), and suddenly the future meant just a couple of months, because $1000 was close enough to overcome a big psychological limit for an early adopter like me.

Indeed, within a month I was able to check the mass spectra, the matched peptides and the (many times incorrectly) inferred proteins from my saliva, stored on an Amazon cluster somewhere and displayed on my smartphone, on the shuttle on my way to the Genome Campus. That’s what I call a proper 2011 early adopter experience.

For this to happen I needed a shortcut, and it was the fact that Bioproximity (and its founder Brian Balgley) had similar ideas on pp and, more importantly, were doing this in practice with a high-throughput, state-of-the-art LTQ Velos instrument. Thanks to these similar ideas, and because they already knew of me via my earlier blog, they said yes to my offer to venture into pp. We agreed on doing a 6-month longitudinal study to catch the dynamics too.

After some thinking we decided that the best way to deliver my saliva sample to the US was for me to spit into 10 ml Falcon tubes and FedEx them on dry ice. This is not perfectly user friendly, but it is more feasible than prepping the sample yourself in a rented lab before shipping it. Still, I needed a lab where I could recycle a freshly arrived styrofoam box with the dry ice in it. Since I had worked as a wet lab guy in the Franklin Lab at the Cambridge Vet School, my friend Jeff Huang alerted me when they had some boxes available for recycling purposes. (Many thanks, Jeff and Robin.)

Bioproximity also set up Proteome Cluster on the Amazon Cloud for searching tandem mass spec data with different search engines. In current mass spectrometry proteomics, what is really detected is peptides matched to mass spectra, not proteins. Proteins are only inferred later, with much less confidence. The searches are performed against a protein ‘database’, which is just a fancy name for assembled FASTA files containing particular lists of proteins (proteomes).

Scored peptide identifications (peptide-to-spectrum matches, PSMs) with expectation values and additional scores: that’s what a search engine usually delivers, and the more search engines are used, the better the coverage and filtering, so the quality of the identifications can improve. (More on this later.) Bioproximity currently supports 3 different search algorithms: native and k-score X! Tandem and OMSSA, all based on open source code.
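To make the ‘database is just FASTA files’ point concrete, here is a minimal sketch of a FASTA reader. The function name and the two tiny records are made up for illustration; a real search database would be read from files on disk rather than from an in-memory string.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Two made-up records standing in for a human and a microbial entry.
example = io.StringIO(">human_1 made-up\nMKT\nAYI\n>microbe_1 made-up\nMSN\n")
records = list(read_fasta(example))
print(len(records), records[0])
```

Counting the records of each FASTA file this way is one simple check on how large the combined search space has become.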

Our first question was about the basic bioinformatics of personal proteomics: what is the pipeline, how manageable are the resulting spectra files and searches and how the data can be presented.

Hereby I’d like to announce the release of 3 mass spectrometry proteomics datasets/runs and the associated searches that are publicly available at, public login as, password as guest2012

Once logged in go to

From here you can download the actual spectra files by clicking on runs (for instance 11_407.mgf.gz ~ Mascot Generic Format for storing MS/MS spectra) to start searching them yourself but my recommendation is to click the search ID-s first (259 link is working once you are logged in as so you will see:
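If you want to peek into a downloaded spectra file before searching it, a small reader is enough. The sketch below assumes the usual MGF layout (BEGIN IONS / END IONS blocks, KEY=value parameter lines, and peak lines of "m/z intensity"); the function names and the example spectrum are made up.

```python
import gzip
import io

def read_mgf(handle):
    """Yield one dict per spectrum from an MGF text handle."""
    spectrum = None
    for line in handle:
        line = line.strip()
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is not None and line:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                fields = line.split()
                if len(fields) >= 2:
                    spectrum["peaks"].append((float(fields[0]), float(fields[1])))

def open_mgf(path):
    """Open a plain or gzip-compressed MGF file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")

# A made-up spectrum, only to show the structure.
example_text = "\n".join([
    "BEGIN IONS",
    "TITLE=made-up spectrum 1",
    "PEPMASS=500.25",
    "CHARGE=2+",
    "100.1 10.0",
    "200.2 20.5",
    "END IONS",
])
spectra = list(read_mgf(io.StringIO(example_text)))
print(len(spectra), spectra[0]["params"]["PEPMASS"], spectra[0]["peaks"])
```

For a downloaded run such as 11_407.mgf.gz, passing the handle returned by `open_mgf` to `read_mgf` should work the same way, assuming the file follows this layout.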

Here you can choose between 2 views, View: MSAugury – Proteomics Data Viewer and GPM View, the difference is that the former works for displaying both OMSSA and X!Tandem while GPM View is only active for X!Tandem Results. If you wish to download the actual search result files go grab Otherwise let’s head towards GPM View and that’s where the juicy stuff begins:

The MS/MS runs were searched against a giant oral microbiome, a human proteome and a common contamination library. So far I have sent 3 samples to Bioproximity: 11_296.mgf: 2011/07/05, 11_320.mgf: 2011/07/28, 11_407.mgf : 2011/10/20.

When you start checking the actual protein hits you will soon encounter one of the biggest problems of current mass spec based proteomics, called protein inference. Briefly: the same peptide can be shared between different proteins, but since peptides are what is matched to mass spectra, in the case of shared peptides we cannot be sure that the inferred proteins were really present in the original sample. Or as formulated by Nesvizhskii and Aebersold:

The difficulty of assembling peptide identifications back to the protein level results from the same factors that made shotgun proteomics approach so successful in the first place, i.e., protein digestion at an early stage of the process and elimination of extensive separation at the protein level. Protein digestion makes peptides, and not the proteins, the currency of the method, and the connectivity between peptides and proteins is lost at the digestion stage. This loss of connectivity complicates computational analysis and biological interpretation of the data. The same peptide sequence can be present in multiple different proteins. Therefore, the identification of such shared peptides can lead to ambiguities in the determination of the identities of the sample protein.

So for instance the YTLAGTEVSALLGR peptide has been detected many, many times by the search engine and it maps to > 50 protein accessions. Among those protein hits the following species are represented, among others: Eikenella corrodens, Escherichia coli and Yersinia pestis. Guess which one of them lives in my mouth? One candidate to start with might be the primary accession chosen by X!Tandem, and that is Eikenella corrodens, which is a mostly innocent commensal of the human mouth. You can check for primary and secondary protein accessions by clicking on ‘group’ in GPM View.
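The shared-peptide problem is easy to see in a toy example. The sketch below groups peptide hits by peptide and separates those pointing to a single accession from those shared between several. The accessions and the second peptide are made up, and this is not how X!Tandem picks its primary accession; it only shows where the ambiguity comes from.

```python
from collections import defaultdict

def group_peptides(peptide_hits):
    """Map each peptide to the sorted list of protein accessions it matches."""
    proteins_by_peptide = defaultdict(set)
    for peptide, accession in peptide_hits:
        proteins_by_peptide[peptide].add(accession)
    return {pep: sorted(accs) for pep, accs in proteins_by_peptide.items()}

def split_shared(proteins_by_peptide):
    """Separate peptides matching exactly one accession from shared ones."""
    unique = {pep: accs[0] for pep, accs in proteins_by_peptide.items() if len(accs) == 1}
    shared = {pep: accs for pep, accs in proteins_by_peptide.items() if len(accs) > 1}
    return unique, shared

# Hypothetical hits: (peptide, protein accession). Accessions are invented.
hits = [
    ("YTLAGTEVSALLGR", "eikenella_acc_1"),
    ("YTLAGTEVSALLGR", "ecoli_acc_1"),
    ("YTLAGTEVSALLGR", "yersinia_acc_1"),
    ("MADEUPPEPTIDEK", "eikenella_acc_1"),
]

unique, shared = split_shared(group_peptides(hits))
print("unique:", unique)
print("shared:", shared)
```

In this toy case the made-up Eikenella accession also has a peptide of its own, which makes it a better supported candidate than the other two. With real data that extra evidence is often missing.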

In order to make the data more transparent, all the spectra information along with the peptide assignments and protein identifications made by the X!Tandem search engine were submitted to the PRoteomics IDEntifications (PRIDE) public repository, and right now my colleague @ruiwang_cn and I are in the process of making the data publicly available. I’ll update the post once the PRIDE representation of the data is publicly available.

Public experiment accessions: 22142, 22143 and 22144.

Update: you can grab the data via PRIDE public ftp:

I think what we now have on Proteome Cluster does answer our first question, namely figuring out the basic bioinformatics of personal proteomics, n=1.

Our second question is really where the biological journey begins: what does all this data mean, and how can the results be interpreted in the context of one individual? I’m not even starting to raise the different interesting and important questions here. The little side note on protein inference was already venturing into the realm of question 2.

There’s also a heavy educational component here: even the general bioinformatics and biology audience knows little about mass spectrometry based proteomics. I have only two years of work experience as a service (but not a research) proteomics bioinformatician; the folks at Bioproximity have substantially more.

The plan is to go into details in subsequent blog posts. I disabled the comments section here because we prefer to answer the questions on Twitter. Look for @bioproximity and @attilacsordas

Let the transparent journey of personal proteomics begin!